-
On Discrete Prompt Optimization for Diffusion Models
Authors:
Ruochen Wang,
Ting Liu,
Cho-Jui Hsieh,
Boqing Gong
Abstract:
This paper introduces the first gradient-based framework for prompt optimization in text-to-image diffusion models. We formulate prompt engineering as a discrete optimization problem over the language space. Two major challenges arise in efficiently finding a solution to this problem: (1) Enormous Domain Space: Setting the domain to the entire language space poses significant difficulty to the optimization process. (2) Text Gradient: Efficiently computing the text gradient is challenging, as it requires backpropagating through the inference steps of the diffusion model and a non-differentiable embedding lookup table. Beyond the problem formulation, our main technical contributions lie in solving the above challenges. First, we design a family of dynamically generated compact subspaces comprised of only the most relevant words to user input, substantially restricting the domain space. Second, we introduce "Shortcut Text Gradient" -- an effective replacement for the text gradient that can be obtained with constant memory and runtime. Empirical evaluation on prompts collected from diverse sources (DiffusionDB, ChatGPT, COCO) suggests that our method can discover prompts that substantially improve (prompt enhancement) or destroy (adversarial attack) the faithfulness of images generated by the text-to-image diffusion model.
Submitted 26 June, 2024;
originally announced July 2024.
-
ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance
Authors:
Shuwei Shi,
Wenbo Li,
Yuechen Zhang,
Jingwen He,
Biao Gong,
Yinqiang Zheng
Abstract:
Diffusion models excel at producing high-quality images; however, scaling to higher resolutions, such as 4K, often results in over-smoothed content, structural distortions, and repetitive patterns. To this end, we introduce ResMaster, a novel, training-free method that empowers resolution-limited diffusion models to generate high-quality images beyond resolution restrictions. Specifically, ResMaster leverages a low-resolution reference image created by a pre-trained diffusion model to provide structural and fine-grained guidance for crafting high-resolution images on a patch-by-patch basis. To ensure a coherent global structure, ResMaster meticulously aligns the low-frequency components of high-resolution patches with the low-resolution reference at each denoising step. For fine-grained guidance, tailored image prompts based on the low-resolution reference and enriched textual prompts produced by a vision-language model are incorporated. This approach could significantly mitigate local pattern distortions and improve detail refinement. Extensive experiments validate that ResMaster sets a new benchmark for high-resolution image generation and demonstrates promising efficiency. The project page is https://shuweis.github.io/ResMaster .
Submitted 24 June, 2024;
originally announced June 2024.
-
Understanding the Impact of Negative Prompts: When and How Do They Take Effect?
Authors:
Yuanhao Ban,
Ruochen Wang,
Tianyi Zhou,
Minhao Cheng,
Boqing Gong,
Cho-Jui Hsieh
Abstract:
The concept of negative prompts, emerging from conditional generation models like Stable Diffusion, allows users to specify what to exclude from the generated images. Despite the widespread use of negative prompts, their intrinsic mechanisms remain largely unexplored. This paper presents the first comprehensive study to uncover how and when negative prompts take effect. Our extensive empirical analysis identifies two primary behaviors of negative prompts. Delayed Effect: The impact of negative prompts is observed after positive prompts render corresponding content. Deletion Through Neutralization: Negative prompts delete concepts from the generated image through a mutual cancellation effect in latent space with positive prompts. These insights reveal significant potential real-world applications; for example, we demonstrate that negative prompts can facilitate object inpainting with minimal alterations to the background via a simple adaptive algorithm. We believe our findings will offer valuable insights for the community in capitalizing on the potential of negative prompts.
Submitted 5 June, 2024;
originally announced June 2024.
-
The Crystal Ball Hypothesis in diffusion models: Anticipating object positions from initial noise
Authors:
Yuanhao Ban,
Ruochen Wang,
Tianyi Zhou,
Boqing Gong,
Cho-Jui Hsieh,
Minhao Cheng
Abstract:
Diffusion models have achieved remarkable success in text-to-image generation tasks; however, the role of initial noise has rarely been explored. In this study, we identify specific regions within the initial noise image, termed trigger patches, that play a key role for object generation in the resulting images. Notably, these patches are "universal" and can be generalized across various positions, seeds, and prompts. To be specific, extracting these patches from one noise and injecting them into another noise leads to object generation in targeted areas. We identify these patches by analyzing the dispersion of object bounding boxes across generated images, leading to the development of a posterior analysis technique. Furthermore, we create a dataset consisting of Gaussian noises labeled with bounding boxes corresponding to the objects appearing in the generated images and train a detector that identifies these patches from the initial noise. To explain the formation of these patches, we reveal that they are outliers in Gaussian noise, and follow distinct distributions through two-sample tests. Finally, we find that misalignment between prompts and the trigger patch patterns can result in unsuccessful image generations. The study proposes a reject-sampling strategy to obtain optimal noise, aiming to improve prompt adherence and positional diversity in image generation.
Submitted 4 June, 2024;
originally announced June 2024.
-
Bilateral Guided Radiance Field Processing
Authors:
Yuehao Wang,
Chaoyi Wang,
Bingchen Gong,
Tianfan Xue
Abstract:
Neural Radiance Fields (NeRF) achieves unprecedented performance in novel view synthesis, utilizing multi-view consistency. When capturing multiple inputs, image signal processing (ISP) in modern cameras will independently enhance them, including exposure adjustment, color correction, local tone mapping, etc. While these processing steps greatly improve image quality, they often break the multi-view consistency assumption, leading to "floaters" in the reconstructed radiance fields. To address this concern without compromising visual aesthetics, we aim to first disentangle the enhancement by ISP at the NeRF training stage and re-apply user-desired enhancements to the reconstructed radiance fields at the finishing stage. Furthermore, to make the re-applied enhancements consistent between novel views, we need to perform image signal processing in 3D space (i.e., "3D ISP"). For this goal, we adopt the bilateral grid, a locally-affine model, as a generalized representation of ISP processing. Specifically, we optimize per-view 3D bilateral grids with radiance fields to approximate the effects of camera pipelines for each input view. To achieve user-adjustable 3D finishing, we propose to learn a low-rank 4D bilateral grid from a given single view edit, lifting photo enhancements to the whole 3D scene. We demonstrate our approach can boost the visual quality of novel view synthesis by effectively removing floaters and performing enhancements from user retouching. The source code and our data are available at: https://bilarfpro.github.io.
Submitted 1 June, 2024;
originally announced June 2024.
-
Deform3DGS: Flexible Deformation for Fast Surgical Scene Reconstruction with Gaussian Splatting
Authors:
Shuojue Yang,
Qian Li,
Daiyun Shen,
Bingchen Gong,
Qi Dou,
Yueming Jin
Abstract:
Tissue deformation poses a key challenge for accurate surgical scene reconstruction. Despite yielding high reconstruction quality, existing methods suffer from slow rendering speeds and long training times, limiting their intraoperative applicability. Motivated by recent progress in 3D Gaussian Splatting, an emerging technology in real-time 3D rendering, this work presents a novel fast reconstruction framework, termed Deform3DGS, for deformable tissues during endoscopic surgery. Specifically, we introduce 3D GS into surgical scenes by integrating a point cloud initialization to improve reconstruction. Furthermore, we propose a novel flexible deformation modeling scheme (FDM) to learn tissue deformation dynamics at the level of individual Gaussians. Our FDM can model the surface deformation with efficient representations, allowing for real-time rendering performance. More importantly, FDM significantly accelerates surgical scene reconstruction, demonstrating considerable clinical value, particularly in intraoperative settings where time efficiency is crucial. Experiments on DaVinci robotic surgery videos indicate the efficacy of our approach, showcasing superior reconstruction fidelity (PSNR: 37.90) and rendering speed (338.8 FPS) while substantially reducing training time to only 1 minute/scene. Our code is available at https://github.com/**lab-imvr/Deform3DGS.
Submitted 30 May, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
-
Automatic Jailbreaking of the Text-to-Image Generative AI Systems
Authors:
Minseon Kim,
Hyomin Lee,
Boqing Gong,
Huishuai Zhang,
Sung Ju Hwang
Abstract:
Recent AI systems have shown extremely powerful performance, even surpassing human performance, on various tasks such as information retrieval, language generation, and image generation based on large language models (LLMs). At the same time, there are diverse safety risks that can cause the generation of malicious content by circumventing the alignment in LLMs, which is often referred to as jailbreaking. However, most previous works focused only on text-based jailbreaking in LLMs, and the jailbreaking of text-to-image (T2I) generation systems has been relatively overlooked. In this paper, we first evaluate the safety of commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, on copyright infringement with naive prompts. From this empirical study, we find that Copilot and Gemini block only 12% and 17% of the attacks with naive prompts, respectively, while ChatGPT blocks 84% of them. Then, we further propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our automated jailbreaking framework leverages an LLM optimizer to generate prompts that maximize the degree of violation from the generated images without any weight updates or gradient computation. Surprisingly, our simple yet effective approach successfully jailbreaks ChatGPT with an 11.0% block rate, making it generate copyrighted content 76% of the time. Finally, we explore various defense strategies, such as post-generation filtering and machine unlearning techniques, but find that they are inadequate, which suggests the necessity of stronger defense mechanisms.
Submitted 28 May, 2024; v1 submitted 26 May, 2024;
originally announced May 2024.
-
Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning
Authors:
Zheyuan Zhang,
Elif Keles,
Gorkem Durak,
Yavuz Taktak,
Onkar Susladkar,
Vandan Gorade,
Debesh Jha,
Asli C. Ormeci,
Alpay Medetalibeyoglu,
Lanhong Yao,
Bin Wang,
Ilkin Sevgi Isler,
Linkai Peng,
Hongyi Pan,
Camila Lopes Vendrami,
Amir Bourhani,
Yury Velichko,
Boqing Gong,
Concetto Spampinato,
Ayis Pyrros,
Pallavi Tiwari,
Derk C. F. Klatte,
Megan Engels,
Sanne Hoogenboom,
Candice W. Bolan
, et al. (13 additional authors not shown)
Abstract:
Automated volumetric segmentation of the pancreas on cross-sectional imaging is needed for diagnosis and follow-up of pancreatic diseases. While CT-based pancreatic segmentation is more established, MRI-based segmentation methods are understudied, largely due to a lack of publicly available datasets, benchmarking research efforts, and domain-specific deep learning methods. In this retrospective study, we collected a large dataset (767 scans from 499 participants) of T1-weighted (T1W) and T2-weighted (T2W) abdominal MRI series from five centers between March 2004 and November 2022. We also collected CT scans of 1,350 patients from publicly available sources for benchmarking purposes. We developed a new pancreas segmentation method, called PanSegNet, combining the strengths of nnUNet and a Transformer network with a new linear attention module enabling volumetric computation. We tested PanSegNet's accuracy in cross-modality (a total of 2,117 scans) and cross-center settings with Dice and Hausdorff distance (HD95) evaluation metrics. We used Cohen's kappa statistics for intra and inter-rater agreement evaluation and paired t-tests for volume and Dice comparisons, respectively. For segmentation accuracy, we achieved Dice coefficients of 88.3% (std: 7.2%, at case level) with CT, 85.0% (std: 7.9%) with T1W MRI, and 86.3% (std: 6.4%) with T2W MRI. There was a high correlation for pancreas volume prediction with R^2 of 0.91, 0.84, and 0.85 for CT, T1W, and T2W, respectively. We found moderate inter-observer (0.624 and 0.638 for T1W and T2W MRI, respectively) and high intra-observer agreement scores. All MRI data is made available at https://osf.io/kysnj/. Our source code is available at https://github.com/NUBagciLab/PaNSegNet.
Submitted 25 May, 2024; v1 submitted 20 May, 2024;
originally announced May 2024.
-
Efficient EndoNeRF Reconstruction and Its Application for Data-driven Surgical Simulation
Authors:
Yuehao Wang,
Bingchen Gong,
Yonghao Long,
Siu Hin Fan,
Qi Dou
Abstract:
The healthcare industry has a growing need for realistic modeling and efficient simulation of surgical scenes. With effective models of deformable surgical scenes, clinicians are able to conduct surgical planning and surgery training on scenarios close to real-world cases. However, a significant challenge in achieving such a goal is the scarcity of high-quality soft tissue models with accurate shapes and textures. To address this gap, we present a data-driven framework that leverages emerging neural radiance field technology to enable high-quality surgical reconstruction and explore its application for surgical simulations. We first focus on developing a fast NeRF-based surgical scene 3D reconstruction approach that achieves state-of-the-art performance. This method can significantly outperform traditional 3D reconstruction methods, which have failed to capture large deformations and produce fine-grained shapes and textures. We then propose an automated creation pipeline of interactive surgical simulation environments through a closed mesh extraction algorithm. Our experiments have validated the superior performance and efficiency of our proposed approach in surgical scene 3D reconstruction. We further utilize our reconstructed soft tissues to conduct FEM and MPM simulations, showcasing the practical application of our method in data-driven surgical simulations.
Submitted 10 April, 2024;
originally announced April 2024.
-
Analysis of a finite element DtN method for scattering resonances of sound hard obstacles
Authors:
Yingxia Xi,
Bo Gong,
Jiguang Sun
Abstract:
Scattering resonances have important applications in many areas of science and engineering. They are the replacement of discrete spectral data for problems on non-compact domains. In this paper, we consider the computation of scattering resonances defined on the exterior to a compact sound hard obstacle. The resonances are the eigenvalues of a holomorphic Fredholm operator function. We truncate the unbounded domain and impose the Dirichlet-to-Neumann (DtN) mapping. The problem is then discretized using the linear Lagrange element. Convergence of the resonances is proved using the abstract approximation theory for holomorphic Fredholm operator functions. The discretization leads to nonlinear algebraic eigenvalue problems, which are solved by the recently developed parallel spectral indicator methods. Numerical examples are presented for validation.
Submitted 14 April, 2024;
originally announced April 2024.
-
Development and Assessment of a Miniaturized Thermocouple for Precise Temperature Measurement in Biological Tissues and Cells
Authors:
Onnop Srivannavit,
Rakesh Joshi,
Weibin Zhu,
Bin Gong,
Stuart C. Sealfon,
Theodorian Borca-Tasciuc,
Angelo Gaitas
Abstract:
This study presents a novel thermocouple instrument designed for precise temperature monitoring within biological tissues and cells, addressing a significant gap in biological research. Constructed on a Silicon-On-Insulator (SOI) substrate, the instrument employs doped silicon and chromium/gold junctions, achieving a Seebeck coefficient of up to 447 µV/K, rapid response times, high temperature accuracy, and the necessary durability for tissue measurements. The cleanroom fabrication process yields a device featuring a triangular sensing tip. Using Finite Element Analysis (FEA) with COMSOL Multiphysics, the research delves into the device's thermal time constant within tissue environments. The device's efficacy in biological settings was validated by measuring temperatures inside ex-vivo tissue samples. Our findings, bolstered by FEA COMSOL simulations, confirm the device's robustness and applicability in biological studies. This advancement in thermocouple microneedle technology provides biologists with an instrument for accurately tracking temperature fluctuations in tissues.
Submitted 25 March, 2024;
originally announced March 2024.
-
VideoPrism: A Foundational Visual Encoder for Video Understanding
Authors:
Long Zhao,
Nitesh B. Gundavarapu,
Liangzhe Yuan,
Hao Zhou,
Shen Yan,
Jennifer J. Sun,
Luke Friedman,
Rui Qian,
Tobias Weyand,
Yue Zhao,
Rachel Hornung,
Florian Schroff,
Ming-Hsuan Yang,
David A. Ross,
Huisheng Wang,
Hartwig Adam,
Mikhail Sirotenko,
Ting Liu,
Boqing Gong
Abstract:
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
Submitted 15 June, 2024; v1 submitted 20 February, 2024;
originally announced February 2024.
-
Distilling Vision-Language Models on Millions of Videos
Authors:
Yue Zhao,
Long Zhao,
Xingyi Zhou,
Jialin Wu,
Chun-Te Chu,
Hui Miao,
Florian Schroff,
Hartwig Adam,
Ting Liu,
Boqing Gong,
Philipp Krähenbühl,
Liangzhe Yuan
Abstract:
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a side product, we generate the largest video caption dataset to date.
Submitted 15 April, 2024; v1 submitted 11 January, 2024;
originally announced January 2024.
-
Instruct-Imagen: Image Generation with Multi-modal Instruction
Authors:
Hexiang Hu,
Kelvin C. K. Chan,
Yu-Chuan Su,
Wenhu Chen,
Yandong Li,
Kihyuk Sohn,
Yang Zhao,
Xue Ben,
Boqing Gong,
William Cohen,
Ming-Wei Chang,
Xuhui Jia
Abstract:
This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), such that abundant generation intents can be standardized in a uniform format.
We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training, to enhance the model's capabilities to ground its generation on external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation, etc.), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
Submitted 3 January, 2024;
originally announced January 2024.
-
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Authors:
Xiang Wang,
Shiwei Zhang,
Hangjie Yuan,
Zhiwu Qing,
Biao Gong,
Yingya Zhang,
Yujun Shen,
Changxin Gao,
Nong Sang
Abstract:
Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation. One of the key reasons is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost of video captioning. Instead, it could be far easier to collect unlabeled clips from video platforms like YouTube. Motivated by this, we come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn with text-free videos. The rationale behind it is to separate the process of text decoding from that of temporal modeling. To this end, we employ a content branch and a motion branch, which are jointly optimized with weights shared. Following such a pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with some randomly collected text-free videos and are encouraged to observe the performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach. We also find that our model could enjoy sustainable performance gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some text labels for training. Finally, we validate the effectiveness and generalizability of our approach on both native text-to-video generation and compositional video synthesis paradigms. Code and models will be publicly available at https://tf-t2v.github.io/.
Submitted 25 December, 2023;
originally announced December 2023.
-
Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
Authors:
Yutong Feng,
Biao Gong,
Di Chen,
Yujun Shen,
Yu Liu,
Jingren Zhou
Abstract:
Existing text-to-image (T2I) diffusion models usually struggle in interpreting complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, supporting the generator to better follow instructions. The panel is obtained by arranging the visual concepts parsed from the input text with the aid of large language models, and is then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we come up with a carefully designed semantic formatting protocol, accompanied by a fully-automatic data preparation pipeline. Thanks to such a design, our approach, which we call Ranni, manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly, the introduction of the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation, based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing. Our project page is at https://ranni-t2i.github.io/Ranni.
Submitted 9 April, 2024; v1 submitted 28 November, 2023;
originally announced November 2023.
-
SeamlessNeRF: Stitching Part NeRFs with Gradient Propagation
Authors:
Bingchen Gong,
Yuehao Wang,
Xiaoguang Han,
Qi Dou
Abstract:
Neural Radiance Fields (NeRFs) have emerged as promising digital mediums of 3D objects and scenes, sparking a surge in research to extend the editing capabilities in this domain. The task of seamless editing and merging of multiple NeRFs, resembling "Poisson blending" in 2D image editing, remains a critical operation that is under-explored by existing work. To fill this gap, we propose SeamlessNeRF, a novel approach for seamless appearance blending of multiple NeRFs. Specifically, we aim to optimize the appearance of a target radiance field in order to harmonize its merge with a source field. We propose a well-tailored optimization procedure for blending, which is constrained by 1) pinning the radiance color in the intersecting boundary area between the source and target fields and 2) maintaining the original gradient of the target. Extensive experiments validate that our approach can effectively propagate the source appearance from the boundary area to the entire target field through the gradients. To the best of our knowledge, SeamlessNeRF is the first work that introduces gradient-guided appearance editing to radiance fields, offering solutions for seamless stitching of 3D objects represented in NeRFs.
Submitted 30 October, 2023;
originally announced November 2023.
-
Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
Authors:
Siteng Huang,
Biao Gong,
Yutong Feng,
Xi Chen,
Yuqian Fu,
Yu Liu,
Donglin Wang
Abstract:
This study focuses on a novel task in text-to-image (T2I) generation, namely action customization. The objective of this task is to learn the co-existing action from limited data and generalize it to unseen humans or even animals. Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle in decoupling actions from context features, including appearance. To overcome the preference for low-level features and the entanglement of high-level features, we propose an inversion-based method Action-Disentangled Identifier (ADI) to learn action-specific identifiers from the exemplar images. ADI first expands the semantic conditioning space by introducing layer-wise identifier tokens, thereby increasing the representational richness while distributing the inversion across different features. Then, to block the inversion of action-agnostic features, ADI extracts the gradient invariance from the constructed sample triples and masks the updates of irrelevant channels. To comprehensively evaluate the task, we present an ActionBench that includes a variety of actions, each accompanied by meticulously selected samples. Both quantitative and qualitative results show that our ADI outperforms existing baselines in action-customized T2I generation. Our project page is at https://adi-t2i.github.io/ADI.
Submitted 10 May, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
-
Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation
Authors:
Biao Gong,
Siteng Huang,
Yutong Feng,
Shiwei Zhang,
Yuyuan Li,
Yu Liu
Abstract:
Diffusion models have recently achieved remarkable progress in generating realistic images. However, challenges remain in accurately understanding and synthesizing the layout requirements in the textual prompts. To align the generated image with layout instructions, we present a training-free layout calibration system SimM that intervenes in the generative process on the fly during inference time. Specifically, following a "check-locate-rectify" pipeline, the system first analyses the prompt to generate the target layout and compares it with the intermediate outputs to automatically detect errors. Then, by moving the located activations and making intra- and inter-map adjustments, the rectification process can be performed with negligible computational overhead. To evaluate SimM over a range of layout requirements, we present a benchmark SimMBench that compensates for the lack of superlative spatial relations in existing datasets. And both quantitative and qualitative results demonstrate the effectiveness of the proposed SimM in calibrating the layout inconsistencies. Our project page is at https://simm-t2i.github.io/SimM.
Submitted 25 March, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
-
Towards A Unified Neural Architecture for Visual Recognition and Reasoning
Authors:
Calvin Luo,
Boqing Gong,
Ting Chen,
Chen Sun
Abstract:
Recognition and reasoning are two pillars of visual understanding. However, these tasks have an imbalance in focus; whereas recent advances in neural networks have shown strong empirical performance in visual recognition, there has been comparably much less success in solving visual reasoning. Intuitively, unifying these two tasks under a singular framework is desirable, as they are mutually dependent and beneficial. Motivated by the recent success of multi-task transformers for visual recognition and language understanding, we propose a unified neural architecture for visual recognition and reasoning with a generic interface (e.g., tokens) for both. Our framework enables the principled investigation of how different visual recognition tasks, datasets, and inductive biases can help enable spatiotemporal reasoning capabilities. Noticeably, we find that object detection, which requires spatial localization of individual objects, is the most beneficial recognition task for reasoning. We further demonstrate via probing that implicit object-centric representations emerge automatically inside our framework. Intriguingly, we discover that certain architectural choices such as the backbone model of the visual encoder have a significant impact on visual reasoning, but little on object detection. Given the results of our experiments, we believe that visual reasoning should be considered as a first-class citizen alongside visual recognition, as they are strongly correlated but benefit from potentially different design choices.
Submitted 10 November, 2023;
originally announced November 2023.
-
Next-to-next-to-leading-order QCD corrections to double $J/ψ$ production at the $B$ factories
Authors:
Xu-Dong Huang,
Bin Gong,
Rui-Chang Niu,
Huai-Min Yu,
Jian-Xiong Wang
Abstract:
In this paper, we study the next-to-next-to-leading-order (NNLO) QCD corrections for the process $e^+e^- \to J/ψ+J/ψ$ at the $B$ factories. By including the NNLO corrections, the cross section turns negative due to the poor convergence of the perturbative expansion. Consequently, to obtain a reasonable estimation for the cross section, the square of the amplitude up to NNLO is used. In addition, the contributions from the bottom quark and the light-by-light part, which are usually neglected, are also included. The final cross section is obtained as $1.76^{+2.42}_{-1.66} ~{\rm fb}$ at a center-of-mass energy of $\sqrt{s}=10.58$ GeV. Our results for the total and differential cross sections could be compared with precise experimental measurements at the $B$ factories in the future.
Submitted 11 February, 2024; v1 submitted 8 November, 2023;
originally announced November 2023.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Authors:
Lijun Yu,
José Lezama,
Nitesh B. Gundavarapu,
Luca Versari,
Kihyuk Sohn,
David Minnen,
Yong Cheng,
Vighnesh Birodkar,
Agrim Gupta,
Xiuye Gu,
Alexander G. Hauptmann,
Boqing Gong,
Ming-Hsuan Yang,
Irfan Essa,
David A. Ross,
Lu Jiang
Abstract:
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
Submitted 29 March, 2024; v1 submitted 9 October, 2023;
originally announced October 2023.
-
Module-wise Adaptive Distillation for Multimodality Foundation Models
Authors:
Chen Liang,
Jiahui Yu,
Ming-Hsuan Yang,
Matthew Brown,
Yin Cui,
Tuo Zhao,
Boqing Gong,
Tianyi Zhou
Abstract:
Pre-trained multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes. One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer. Motivated by our observation that certain architecture components, referred to as modules, contribute more significantly to the student's performance than others, we propose to track the contributions of individual modules by recording the loss decrement after distilling each module, and to choose the modules with greater contributions to distill more frequently. Such an approach can be naturally formulated as a multi-armed bandit (MAB) problem, where modules and loss decrements are considered as arms and rewards, respectively. We then develop a modified-Thompson sampling algorithm named OPTIMA to address the nonstationarity of module contributions resulting from model updating. Specifically, we leverage the observed contributions in recent history to estimate the changing contribution of each module and select modules based on these estimations to maximize the cumulative contribution. We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model (Yu et al., 2022) as the teacher model.
Submitted 6 October, 2023;
originally announced October 2023.
-
Video Timeline Modeling For News Story Understanding
Authors:
Meng Liu,
Mingda Zhang,
Jialu Liu,
Hanjun Dai,
Ming-Hsuan Yang,
Shuiwang Ji,
Zheyun Feng,
Boqing Gong
Abstract:
In this paper, we present a novel problem, namely video timeline modeling. Our objective is to create a video-associated timeline from a set of videos related to a specific topic, thereby facilitating the content and structure understanding of the story being told. This problem has significant potential in various real-world applications, for instance, news story summarization. To bootstrap research in this area, we curate a realistic benchmark dataset, YouTube-News-Timeline, consisting of over $12$k timelines and $300$k YouTube news videos. Additionally, we propose a set of quantitative metrics to comprehensively evaluate and compare methodologies. With such a testbed, we further develop and benchmark several deep learning approaches to tackling this problem. We anticipate that this exploratory work will pave the way for further research in video timeline modeling. The assets are available via https://github.com/google-research/google-research/tree/master/video_timeline_modeling.
Submitted 27 October, 2023; v1 submitted 23 September, 2023;
originally announced September 2023.
-
Multi-modal Domain Adaptation for REG via Relation Transfer
Authors:
Yifan Ding,
Liqiang Wang,
Boqing Gong
Abstract:
Domain adaptation, which aims to transfer knowledge between domains, has been well studied in many areas such as image classification and object detection. However, for multi-modal tasks, conventional approaches rely on large-scale pre-training. But due to the difficulty of acquiring multi-modal data, large-scale pre-training is often impractical. Therefore, domain adaptation, which can efficiently utilize the knowledge from different datasets (domains), is crucial for multi-modal tasks. In this paper, we focus on the Referring Expression Grounding (REG) task, which is to localize an image region described by a natural language expression. Specifically, we propose a novel approach to effectively transfer multi-modal knowledge through a specially relation-tailored approach for the REG problem. Our approach tackles the multi-modal domain adaptation problem by simultaneously enriching inter-domain relations and transferring relations between domains. Experiments show that our proposed approach significantly improves the transferability of multi-modal domains and enhances adaptation performance in the REG problem.
Submitted 23 September, 2023;
originally announced September 2023.
-
A Covariance Adaptive Student's t Based Kalman Filter
Authors:
Benyang Gong,
Jiacheng He,
Gang Wang,
Bei Peng
Abstract:
In the classical Kalman filter (KF), the estimated state is a linear combination of the one-step predicted state and the measurement state; their confidence levels change when the prediction mean square error matrix and the covariance matrix of the measurement noise vary. The existing Student's t based Kalman filter (TKF) works similarly to the KF. Both work well with impulse noise, but under Gaussian noise the TKF encounters an adjustment limit of the confidence level, which can lead to inaccuracies in such situations. This brief optimizes the TKF by using the Gaussian mixture model (GMM), which generates a reasonable covariance matrix from the measurement noise to replace the one used in the existing algorithm and breaks the adjustment limit of the confidence level. At the end of the brief, the performance of the covariance adaptive Student's t based Kalman filter (TGKF) is verified.
Submitted 18 September, 2023;
originally announced September 2023.
-
AtmoRep: A stochastic model of atmosphere dynamics using large scale representation learning
Authors:
Christian Lessig,
Ilaria Luise,
Bing Gong,
Michael Langguth,
Scarlet Stadtler,
Martin Schultz
Abstract:
The atmosphere affects humans in a multitude of ways, from loss of life due to adverse weather effects to long-term social and economic impacts on societies. Computer simulations of atmospheric dynamics are, therefore, of great importance for the well-being of our and future generations. Here, we propose AtmoRep, a novel, task-independent stochastic computer model of atmospheric dynamics that can provide skillful results for a wide range of applications. AtmoRep uses large-scale representation learning from artificial intelligence to determine a general description of the highly complex, stochastic dynamics of the atmosphere from the best available estimate of the system's historical trajectory as constrained by observations. This is enabled by a novel self-supervised learning objective and a unique ensemble that samples from the stochastic model with a variability informed by the one in the historical record. The task-independent nature of AtmoRep enables skillful results for a diverse set of applications without specifically training for them and we demonstrate this for nowcasting, temporal interpolation, model correction, and counterfactuals. We also show that AtmoRep can be improved with additional data, for example radar observations, and that it can be extended to tasks such as downscaling. Our work establishes that large-scale neural networks can provide skillful, task-independent models of atmospheric dynamics. With this, they provide a novel means to make the large record of atmospheric observations accessible for applications and for scientific inquiry, complementing existing simulations based on first principles.
Submitted 7 September, 2023; v1 submitted 25 August, 2023;
originally announced August 2023.
-
The innermost jet in the hidden ultra-luminous X-ray source Cygnus X-3
Authors:
Jun Yang,
Federico García,
Santiago del Palacio,
Ralph Spencer,
Zsolt Paragi,
Noel Castro Segura,
Biping Gong,
Hongmin Cao,
Wen Chen
Abstract:
Cygnus X-3 is a high-mass X-ray binary with a compact object accreting matter from a Wolf-Rayet donor star. Recently, it has been revealed by the Imaging X-ray Polarimetry Explorer (IXPE) as a hidden Galactic ultra-luminous X-ray (ULX) source with a luminosity above the Eddington limit along the direction of a narrow (opening angle <~32 degrees) funnel. In between the IXPE observations, we observed Cyg X-3 with the European VLBI (very long baseline interferometry) Network at 22 GHz and the NICER X-ray instrument. To probe possible relations between the X-ray funnel and the potential radio jet from the ULX, we analyzed the simultaneous multi-wavelength data. Our high-resolution VLBI image reveals an elongated structure with a position angle of -3.2+/-0.4 degrees, accurately perpendicular to the direction of the linear X-ray polarization. Because Cyg X-3 was in the radio quiescent state on 2022 November 10, we identify the mas-scale structure as the innermost radio jet. The finding indicates that the radio jet propagates along and within the funnel. Moreover, the jet is marginally resolved in the transverse direction. This possibly results from the strong stellar winds and the rapid orbital motion of the binary system.
Submitted 2 August, 2023;
originally announced August 2023.
-
On the Formation of Eccentric Millisecond Pulsars by Accretion-induced Collapse of Massive White Dwarfs
Authors:
D. Wang,
B. P. Gong
Abstract:
The millisecond pulsar (MSP) is believed to be an old neutron star (NS) that has been spun up by accreting material from its donor. However, the discovery of eccentric millisecond pulsars (eMSPs) in the Galactic field challenges this scenario, which produces MSP-white dwarf (WD) binaries only in circular orbits. As the orbital periods and companion masses of these eMSPs lie in a narrow range, a reasonable postulation is that they have the same origin. Although many models have been proposed to interpret their origin, the origin of the narrow range of the orbital period is still an open question. The accretion-induced collapse (AIC) of an ONe WD is considered an important pathway to form MSPs, and was expected to produce MSPs in circular orbits due to tidal circularization. Here we revisit this scenario with binary population synthesis that includes a specific circularization calculation. Our results indicate that binaries with insufficient circularization in this scenario can evolve into eMSPs. The narrow range of initial binary parameters required by insufficient circularization can naturally account for the narrow range of the orbital period. Although the evolution of the WD's AIC process is not well understood, the narrow range in the orbital period of eMSPs can still set constraints on the physics of their evolution.
Submitted 20 October, 2023; v1 submitted 27 July, 2023;
originally announced July 2023.
-
VideoGLUE: Video General Understanding Evaluation of Foundation Models
Authors:
Liangzhe Yuan,
Nitesh Bharadwaj Gundavarapu,
Long Zhao,
Hao Zhou,
Yin Cui,
Lu Jiang,
Xuan Yang,
Menglin Jia,
Tobias Weyand,
Luke Friedman,
Mikhail Sirotenko,
Huisheng Wang,
Florian Schroff,
Hartwig Adam,
Ming-Hsuan Yang,
Ting Liu,
Boqing Gong
Abstract:
We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data contains the video modality, are generally better than image-native FMs in classifying motion-rich videos, localizing actions in time, and understanding a video of more than one action. Third, the video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs. Our code is released under: https://github.com/tensorflow/models/tree/master/official/projects/videoglue.
Submitted 1 December, 2023; v1 submitted 6 July, 2023;
originally announced July 2023.
-
Exploring charge and spin fluctuations in infinite-layer cuprate SrCuO$_{2}$ from a phonon perspective
Authors:
Xin Du,
Pei-Han Sun,
Ben-Chao Gong,
Jian-Feng Zhang,
Zhong-Yi Lu,
Kai Liu
Abstract:
The infinite-layer cuprate $A$CuO$_2$ ($A=$ Ca, Sr, Ba) has the simplest crystal structure among numerous cuprate superconductors and can serve as a prototypical system to explore the unconventional superconductivity. Based on the first-principles electronic structure calculations, we have studied the electronic and magnetic properties of the infinite-layer cuprate SrCuO$_{2}$ from a phonon perspective. We find that interesting fluctuations of charges, electrical dipoles, and local magnetic moments can be induced by the zero-point vibrations of phonon modes in SrCuO$_{2}$ upon the hole doping. Among all optical phonon modes of SrCuO$_{2}$ in the antiferromagnetic Néel state, only the $A_{1g}$ mode that involves the full-breathing O vibrations along the Cu-O bonds can cause significant fluctuations of local magnetic moments on O atoms and dramatic charge redistributions between Cu and O atoms. Notably, due to the zero-point vibration of the $A_{1g}$ mode, both the charge fluctuations on Cu and the electrical dipoles on O show a dome-like evolution with increasing hole doping, quite similar to the experimentally observed behavior of the superconducting $T_c$; in comparison, the fluctuations of local magnetic moments on O display a monotonic enhancement along with the hole doping. Further analyses indicate that around the optimal doping, there exist a large softening in the frequency of the $A_{1g}$ phonon mode and a van Hove singularity in the electronic structure close to the Fermi level, suggesting potential electron-phonon coupling. Our work reveals the important role that the full-breathing O phonon mode plays in the infinite-layer SrCuO$_{2}$, which may provide new insights into understanding the cuprate superconductivity.
Submitted 16 June, 2023;
originally announced June 2023.
-
Logic Diffusion for Knowledge Graph Reasoning
Authors:
Xiaoying Xie,
Biao Gong,
Yiliang Lv,
Zhen Han,
Guoshuai Zhao,
Xueming Qian
Abstract:
Most recent works focus on answering first order logical queries to explore the knowledge graph reasoning via multi-hop logic predictions. However, existing reasoning models are limited by the circumscribed logical paradigms of training samples, which leads to a weak generalization of unseen logic. To address these issues, we propose a plug-in module called Logic Diffusion (LoD) to discover unseen queries from surroundings and achieve dynamical equilibrium between different kinds of patterns. The basic idea of LoD is relation diffusion and sampling sub-logic by random walking as well as a special training mechanism called gradient adaption. Besides, LoD is accompanied by a novel loss function to further achieve the robust logical diffusion when facing noisy data in training or testing sets. Extensive experiments on four public datasets demonstrate the superiority of mainstream knowledge graph reasoning models with LoD over the state of the art. Moreover, our ablation study proves the general effectiveness of LoD on the noise-rich knowledge graph.
Submitted 6 June, 2023;
originally announced June 2023.
-
Selective and Collaborative Influence Function for Efficient Recommendation Unlearning
Authors:
Yuyuan Li,
Chaochao Chen,
Xiaolin Zheng,
Yizhao Zhang,
Biao Gong,
Jun Wang
Abstract:
Recent regulations on the Right to be Forgotten have greatly influenced the way of running a recommender system, because users now have the right to withdraw their private data. Besides simply deleting the target data in the database, unlearning the associated data lineage e.g., the learned personal features and preferences in the model, is also necessary for data withdrawal. Existing unlearning methods are mainly devised for generalized machine learning models in classification tasks. In this paper, we first identify two main disadvantages of directly applying existing unlearning methods in the context of recommendation, i.e., (i) unsatisfactory efficiency for large-scale recommendation models and (ii) destruction of collaboration across users and items. To tackle the above issues, we propose an extra-efficient recommendation unlearning method based on Selective and Collaborative Influence Function (SCIF). Our proposed method can (i) avoid any kind of retraining which is computationally prohibitive for large-scale systems, (ii) further enhance efficiency by selectively updating user embedding and (iii) preserve the collaboration across the remaining users and items. Furthermore, in order to evaluate the unlearning completeness, we define a Membership Inference Oracle (MIO), which can justify whether the unlearned data points were in the training set of the model, i.e., whether a data point was completely unlearned. Extensive experiments on two benchmark datasets demonstrate that our proposed method can not only greatly enhance unlearning efficiency, but also achieve adequate unlearning completeness. More importantly, our proposed method outperforms the state-of-the-art unlearning method regarding comprehensive recommendation metrics.
Submitted 20 April, 2023;
originally announced April 2023.
-
Federated Learning of Shareable Bases for Personalization-Friendly Image Classification
Authors:
Hong-You Chen,
Jike Zhong,
Mingda Zhang,
Xuhui Jia,
Hang Qi,
Boqing Gong,
Wei-Lun Chao,
Li Zhang
Abstract:
Personalized federated learning (PFL) aims to harness the collective wisdom of clients' data while building personalized models tailored to individual clients' data distributions. Existing works offer personalization primarily to clients who participate in the FL process, making it hard to encompass new clients who were absent or newly show up. In this paper, we propose FedBasis, a novel PFL framework to tackle such a deficiency. FedBasis learns a small set of shareable "basis" models, which can be linearly combined to form personalized models for clients. Specifically for a new client, only a small set of combination coefficients, not the model weights, needs to be learned. This notion makes FedBasis more parameter-efficient, robust, and accurate than competitive PFL baselines, especially in the low data regime, without increasing the inference cost. To demonstrate the effectiveness and applicability of FedBasis, we also present a more practical PFL testbed for image classification, featuring larger data discrepancies across clients in both the image and label spaces as well as more faithful training and test splits.
Submitted 31 October, 2023; v1 submitted 16 April, 2023;
originally announced April 2023.
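The FedBasis abstract describes personalized models formed as linear combinations of shared basis models, with only the combination coefficients learned for a new client. A minimal sketch of that idea, assuming (for illustration only) that each basis is a flat weight vector of a linear regressor and that coefficients are fit by least squares; the function names `combine_bases` and `fit_coefficients` are hypothetical and not taken from the paper:

```python
import numpy as np

def combine_bases(bases, coeffs):
    """Personalized weights as a linear combination: sum_k coeffs[k] * bases[k]."""
    bases = np.asarray(bases, dtype=float)    # shape (K, D): K frozen basis models
    coeffs = np.asarray(coeffs, dtype=float)  # shape (K,): per-client coefficients
    return coeffs @ bases                     # shape (D,)

def fit_coefficients(bases, features, targets):
    """Fit only the K coefficients on a new client's data (assumed linear model).

    Because the model is linear, predictions are features @ bases.T @ coeffs,
    so the coefficients solve an ordinary least-squares problem while the
    basis weights stay untouched.
    """
    bases = np.asarray(bases, dtype=float)
    basis_preds = np.asarray(features, dtype=float) @ bases.T  # shape (N, K)
    coeffs, *_ = np.linalg.lstsq(basis_preds, np.asarray(targets, dtype=float), rcond=None)
    return coeffs
```

In this toy setting a new client learns K numbers instead of D weights, which is the parameter-efficiency argument the abstract makes; the actual method operates on neural network weights and may fit coefficients differently.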
-
Identity Encoder for Personalized Diffusion
Authors:
Yu-Chuan Su,
Kelvin C. K. Chan,
Yandong Li,
Yang Zhao,
Han Zhang,
Boqing Gong,
Huisheng Wang,
Xuhui Jia
Abstract:
Many applications can benefit from personalized image generation models, including image enhancement and video conferencing, to name just a few. Existing works achieve personalization by fine-tuning one model for each person. While successful, this approach incurs additional computation and storage overhead for each new identity. Furthermore, it usually expects tens or hundreds of examples per identity to achieve the best performance. To overcome these challenges, we propose an encoder-based approach for personalization. We learn an identity encoder which can extract an identity representation from a set of reference images of a subject, together with a diffusion generator that can generate new images of the subject conditioned on the identity representation. Once trained, the model can be used to generate images of arbitrary identities given a few examples, even if the model has not been trained on the identity. Our approach greatly reduces the overhead for personalized image generation and is more applicable in many potential applications. Empirical results show that our approach consistently outperforms existing fine-tuning-based approaches in both image generation and reconstruction, and its outputs are preferred by users more than 95% of the time compared with the best-performing baseline.
Submitted 14 April, 2023;
originally announced April 2023.
-
Domain Generalization with Adversarial Intensity Attack for Medical Image Segmentation
Authors:
Zheyuan Zhang,
Bin Wang,
Lanhong Yao,
Ugur Demir,
Debesh Jha,
Ismail Baris Turkbey,
Boqing Gong,
Ulas Bagci
Abstract:
Most statistical learning algorithms rely on an over-simplified assumption, that is, the train and test data are independent and identically distributed. In real-world scenarios, however, it is common for models to encounter data from new and different domains to which they were not exposed during training. This is often the case in medical imaging applications due to differences in acquisition devices, imaging protocols, and patient characteristics. To address this problem, domain generalization (DG) is a promising direction as it enables models to handle data from previously unseen domains by learning domain-invariant features robust to variations across different domains. To this end, we introduce a novel DG method called Adversarial Intensity Attack (AdverIN), which leverages adversarial training to generate training data with an infinite number of styles and increase data diversity while preserving essential content information. We conduct extensive evaluation experiments on various multi-domain segmentation datasets, including 2D retinal fundus optic disc/cup and 3D prostate MRI. Our results demonstrate that AdverIN significantly improves the generalization ability of the segmentation models on these challenging datasets. Code is available upon publication.
Submitted 5 April, 2023;
originally announced April 2023.
-
Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
Authors:
Xuhui Jia,
Yang Zhao,
Kelvin C. K. Chan,
Yandong Li,
Han Zhang,
Boqing Gong,
Tingbo Hou,
Huisheng Wang,
Yu-Chuan Su
Abstract:
This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often employ a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding with only a single feed-forward pass. The acquired object embedding is then passed to a text-to-image synthesis model for subsequent generation. To effectively blend an object-aware embedding space into a well-developed text-to-image model under the same generation context, we investigate different network designs and training strategies, and propose a simple yet effective regularized joint training scheme with an object identity preservation loss. Additionally, we propose a caption generation scheme that becomes a critical piece in fostering object-specific embeddings that are faithfully reflected in the generation process, while keeping control and editing abilities. Once trained, the network is able to produce diverse content and styles, conditioned on both texts and objects. We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity, without the need for test-time optimization. Systematic studies are also conducted to analyze our models, providing insights for future work.
Submitted 5 April, 2023;
originally announced April 2023.
-
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
Authors:
Yuanhao Xiong,
Long Zhao,
Boqing Gong,
Ming-Hsuan Yang,
Florian Schroff,
Ting Liu,
Cho-Jui Hsieh,
Liangzhe Yuan
Abstract:
Existing video-language pre-training methods primarily focus on instance-level alignment between video clips and captions via global contrastive learning but neglect rich fine-grained local information in both videos and text, which is of importance to downstream tasks requiring temporal localization and semantic reasoning. A powerful model is expected to be capable of capturing region-object correspondences and recognizing scene changes in a video clip, reflecting spatial and temporal granularity, respectively. To strengthen the model's understanding of such fine-grained details, we propose a simple yet effective video-language modeling framework, S-ViLM, by exploiting the intrinsic structures of these two modalities. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features simultaneously. Comprehensive evaluations demonstrate that S-ViLM performs favorably against existing approaches in learning more expressive representations. Specifically, S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks, covering text-video retrieval, video question answering, video action recognition, and temporal action localization.
Submitted 8 March, 2024; v1 submitted 28 March, 2023;
originally announced March 2023.
-
Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning
Authors:
Siteng Huang,
Biao Gong,
Yutong Feng,
Min Zhang,
Yiliang Lv,
Donglin Wang
Abstract:
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs. Relying on learning the joint representation of seen compositions, these methods ignore the explicit modeling of the state and object, thus limiting the exploitation of pre-trained knowledge and generalization to unseen compositions. With a particular focus on the universality of the solution, in this work, we propose a novel paradigm for CZSL models that establishes three identification branches (i.e., Multi-Path) to jointly model the state, object, and composition. The presented Troika is our implementation that aligns the branch-specific prompt representations with decomposed visual features. To calibrate the bias between semantically similar multi-modal representations, we further devise a Cross-Modal Traction module into Troika that shifts the prompt representation towards the current visual content. We conduct extensive experiments on three popular benchmarks, where our method significantly outperforms existing methods in both closed-world and open-world settings. The code will be available at https://github.com/bighuang624/Troika.
Submitted 25 March, 2024; v1 submitted 27 March, 2023;
originally announced March 2023.
-
Unified Visual Relationship Detection with Vision and Language Models
Authors:
Long Zhao,
Liangzhe Yuan,
Boqing Gong,
Yin Cui,
Florian Schroff,
Ming-Hsuan Yang,
Hartwig Adam,
Ting Liu
Abstract:
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, where similar relationships are optimized to be close to each other for semantic unification. Our bottom-up design enables the model to enjoy the benefit of training with both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model. UniVRD achieves 38.07 mAP on HICO-DET, outperforming the current best bottom-up HOI detector by 14.26 mAP. More importantly, we show that our unified detector performs as well as dataset-specific models in mAP, and achieves further improvements when we scale up the model. Our code will be made publicly available on GitHub.
Submitted 20 August, 2023; v1 submitted 15 March, 2023;
originally announced March 2023.
-
Enhancing Unsupervised Audio Representation Learning via Adversarial Sample Generation
Authors:
Yulin Pan,
Xiangteng He,
Biao Gong,
Yuxin Peng,
Yiliang Lv
Abstract:
Existing audio analysis methods generally first transform the audio stream to spectrogram, and then feed it into CNN for further analysis. A standard CNN recognizes specific visual patterns over feature map, then pools for high-level representation, which overlooks the positional information of recognized patterns. However, unlike natural image, the semantic of an audio spectrogram is sensitive to positional change, as its vertical and horizontal axes indicate the frequency and temporal information of the audio, instead of naive rectangular coordinates. Thus, the insensitivity of CNN to positional change plays a negative role on audio spectrogram encoding. To address this issue, this paper proposes a new self-supervised learning mechanism, which enhances the audio representation by first generating adversarial samples (i.e., negative samples), then driving CNN to distinguish the embeddings of negative pairs in the latent space. Extensive experiments show that the proposed approach achieves best or competitive results on 9 downstream datasets compared with previous methods, which verifies its effectiveness on audio representation learning.
Submitted 15 March, 2023;
originally announced March 2023.
-
Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos
Authors:
Yulin Pan,
Xiangteng He,
Biao Gong,
Yiliang Lv,
Yujun Shen,
Yuxin Peng,
Deli Zhao
Abstract:
Video temporal grounding aims to pinpoint a video segment that matches the query description. Despite the recent advance in short-form videos (e.g., in minutes), temporal grounding in long videos (e.g., in hours) is still at its early stage. To address this challenge, a common practice is to employ a sliding window, yet this can be inefficient and inflexible due to the limited number of frames within the window. In this work, we propose an end-to-end framework for fast temporal grounding, which is able to model an hours-long video with one-time network execution. Our pipeline is formulated in a coarse-to-fine manner, where we first extract context knowledge from non-overlapping video clips (i.e., anchors), and then supplement the anchors that respond strongly to the query with detailed content knowledge. Besides the remarkably high pipeline efficiency, another advantage of our approach is the capability of capturing long-range temporal correlation, thanks to modeling the entire video as a whole, which facilitates more accurate grounding. Experimental results suggest that, on the long-form video datasets MAD and Ego4d, our method significantly outperforms state-of-the-art methods and achieves 14.6× / 102.8× higher efficiency, respectively. The project can be found at https://github.com/afcedf/SOONet.git.
Submitted 22 March, 2023; v1 submitted 14 March, 2023;
originally announced March 2023.
-
ViM: Vision Middleware for Unified Downstream Transferring
Authors:
Yutong Feng,
Biao Gong,
Jianwen Jiang,
Yiliang Lv,
Yujun Shen,
Deli Zhao,
Jingren Zhou
Abstract:
Foundation models are pre-trained on massive data and transferred to downstream tasks via fine-tuning. This work presents Vision Middleware (ViM), a new learning paradigm that targets unified transferring from a single foundation model to a variety of downstream tasks. ViM consists of a zoo of lightweight plug-in modules, each of which is independently learned on a midstream dataset with a shared frozen backbone. Downstream tasks can then benefit from an adequate aggregation of the module zoo thanks to the rich knowledge inherited from midstream tasks. There are three major advantages of such a design. From the efficiency aspect, the upstream backbone can be trained only once and reused for all downstream tasks without tuning. From the scalability aspect, we can easily append additional modules to ViM with no influence on existing modules. From the performance aspect, ViM can include as many midstream tasks as possible, narrowing the task gap between upstream and downstream. Considering these benefits, we believe that ViM, which the community could maintain and develop together, would serve as a powerful tool to assist foundation models.
Submitted 13 March, 2023;
originally announced March 2023.
-
UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training
Authors:
Biao Gong,
Xiaoying Xie,
Yutong Feng,
Yiliang Lv,
Yujun Shen,
Deli Zhao
Abstract:
This work presents a unified knowledge protocol, called UKnow, which facilitates knowledge-based studies from the perspective of data. Particularly focusing on visual and linguistic modalities, we categorize data knowledge into five unit types, namely, in-image, in-text, cross-image, cross-text, and image-text, and set up an efficient pipeline to help construct the multimodal knowledge graph from any data collection. Thanks to the logical information naturally contained in knowledge graph, organizing datasets under UKnow format opens up more possibilities of data usage compared to the commonly used image-text pairs. Following UKnow protocol, we collect, from public international news, a large-scale multimodal knowledge graph dataset that consists of 1,388,568 nodes (with 571,791 vision-related ones) and 3,673,817 triplets. The dataset is also annotated with rich event tags, including 11 coarse labels and 9,185 fine labels. Experiments on four benchmarks demonstrate the potential of UKnow in supporting common-sense reasoning and boosting vision-language pre-training with a single dataset, benefiting from its unified form of knowledge organization. Code, dataset, and models will be made publicly available.
Submitted 21 March, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
RecolorNeRF: Layer Decomposed Radiance Fields for Efficient Color Editing of 3D Scenes
Authors:
Bingchen Gong,
Yuehao Wang,
Xiaoguang Han,
Qi Dou
Abstract:
Radiance fields have gradually become a main representation of media. Although its appearance editing has been studied, how to achieve view-consistent recoloring in an efficient manner is still under explored. We present RecolorNeRF, a novel user-friendly color editing approach for the neural radiance fields. Our key idea is to decompose the scene into a set of pure-colored layers, forming a palette. By this means, color manipulation can be conducted by altering the color components of the palette directly. To support efficient palette-based editing, the color of each layer needs to be as representative as possible. In the end, the problem is formulated as an optimization problem, where the layers and their blending weights are jointly optimized with the NeRF itself. Extensive experiments show that our jointly-optimized layer decomposition can be used against multiple backbones and produce photo-realistic recolored novel-view renderings. We demonstrate that RecolorNeRF outperforms baseline methods both quantitatively and qualitatively for color editing even in complex real-world scenes.
Submitted 18 September, 2023; v1 submitted 19 January, 2023;
originally announced January 2023.
-
On Calibrating Semantic Segmentation Models: Analyses and An Algorithm
Authors:
Dongdong Wang,
Boqing Gong,
Liqiang Wang
Abstract:
We study the problem of semantic segmentation calibration. Many solutions have been proposed to address miscalibrated model confidence in image classification. However, to date, confidence calibration research on semantic segmentation is still limited. We provide a systematic study on the calibration of semantic segmentation models and propose a simple yet effective approach. First, we find that model capacity, crop size, multi-scale testing, and prediction correctness have an impact on calibration. Among them, prediction correctness, especially misprediction, is more important to miscalibration due to over-confidence. Next, we propose a simple, unifying, and effective approach, namely selective scaling, which separates correct and incorrect predictions for scaling and focuses more on smoothing the logits of mispredictions. Then, we study popular existing calibration methods and compare them with selective scaling on semantic segmentation calibration. We conduct extensive experiments with a variety of benchmarks on both in-domain and domain-shift calibration and show that selective scaling consistently outperforms other methods.
Submitted 25 March, 2023; v1 submitted 22 December, 2022;
originally announced December 2022.
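The abstract above describes selective scaling as treating correct and incorrect predictions separately and smoothing the logits of likely mispredictions. A rough sketch of that idea using plain temperature scaling, under assumptions not stated in the abstract: the `likely_wrong` mask is taken as given (for example from some auxiliary misprediction detector), and the temperatures `t_wrong` and `t_right` are illustrative placeholders rather than values from the paper:

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def selective_scaling(logits, likely_wrong, t_wrong=2.0, t_right=1.0):
    """Apply a larger temperature only to predictions flagged as likely wrong.

    logits:       array of shape (N, C), one row per pixel or sample
    likely_wrong: boolean array of shape (N,), assumed to come from elsewhere
    """
    logits = np.asarray(logits, dtype=float)
    likely_wrong = np.asarray(likely_wrong, dtype=bool)
    # Per-row temperature: t_wrong smooths flagged rows, t_right leaves the rest.
    temps = np.where(likely_wrong, t_wrong, t_right)[:, None]
    return softmax(logits / temps)
```

Dividing by a temperature above 1 lowers the maximum probability without changing the predicted class, so only the confidence of flagged predictions is reduced.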
-
Next-to-next-to-leading-order QCD corrections to $J/ψ$ plus $η_c$ production at the $B$ factories
Authors:
Xu-Dong Huang,
Bin Gong,
Jian-Xiong Wang
Abstract:
In this paper, we calculate the next-to-next-to-leading-order (NNLO) QCD corrections to $e^+e^- \to J/ψ+η_c$ at the $B$ factories. After including the NNLO corrections, the cross section of $e^+e^- \to J/ψ+η_c$ is enhanced by about $17\%$, and the perturbative series of the prediction shows the convergent behavior. It is also found that the contribution from bottom quark starts at the $α_s^3$-order, which is about $2.4\%$ of the total prediction. The renormalization scale $μ_R$ dependence of the cross section is reduced at the NNLO level, but the prediction is sensitive to the charm quark mass $m_c$. By considering the uncertainties caused by renormalization scale $μ_R$, charm quark mass $m_c$ and the NRQCD factorization scale $μ_Λ$, our prediction shows agreement with the BABAR and BELLE measurements within errors.
Submitted 7 February, 2023; v1 submitted 7 December, 2022;
originally announced December 2022.
-
VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval
Authors:
Siteng Huang,
Biao Gong,
Yulin Pan,
Jianwen Jiang,
Yiliang Lv,
Yuyuan Li,
Donglin Wang
Abstract:
Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only brings huge computational burdens with many more parameters, but also leads to forgetting of knowledge from upstream models. In this work, we propose VoP: Text-Video Co-operative Prompt Tuning for efficient tuning on the text-video retrieval task. The proposed VoP is an end-to-end framework that introduces both video and text prompts, and it can be regarded as a powerful baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, we develop three novel video prompt mechanisms to improve the performance with different scales of trainable parameters. The basic idea of the VoP enhancement is to model the frame position, frame context, and layer function with specific trainable prompts, respectively. Extensive experiments show that compared to full fine-tuning, the enhanced VoP achieves a 1.4% average R@1 gain across five text-video retrieval benchmarks with 6x less parameter overhead. The code will be available at https://github.com/bighuang624/VoP.
Submitted 21 March, 2023; v1 submitted 23 November, 2022;
originally announced November 2022.
-
Intrinsic ferromagnetic axion states and a single pair of Weyl fermions in the stable-state Mn\emph{X}$_{2}$\emph{B}$_{2}$\emph{T}$_{6}$-family materials
Authors:
Yan Gao,
Weikang Wu,
Ben-Chao Gong,
Huan-Cheng Yang,
Xiang-Feng Zhou,
Yong Liu,
Shengyuan A. Yang,
Kai Liu,
Zhong-Yi Lu
Abstract:
The intrinsic ferromagnetic (FM) axion insulators and Weyl semimetals (WSMs) with only single pair of Weyl points have drawn intensive attention but so far remain rare and elusive in real materials. Here, we propose a new class of Mn\emph{X}$_{2}$\emph{B}$_{2}$\emph{T}$_{6}$-B (\emph{X}=Ge, Sn, or Pb; \emph{B}=Sb or Bi; \emph{T}=Se or Te) family that is the stable structural form of this system. We find that the Mn\emph{X}$_{2}$\emph{B}$_{2}$\emph{T}$_{6}$-B family has not only the intrinsic FM axion insulators MnGe$_{2}$Bi$_{2}$Te$_{6}$-B, MnSn$_{2}$Bi$_{2}$Te$_{6}$-B, and MnPb$_{2}$Bi$_{2}$Te$_{6}$-B, but also the intrinsic WSM MnSn$_{2}$Sb$_{2}$Te$_{6}$-B with only a single pair of Weyl points. Thus, the Mn\emph{X}$_{2}$\emph{B}$_{2}$\emph{T}$_{6}$-B family can provide an ideal platform to explore the exotic topological magnetoelectric effect and the intrinsic properties related to Weyl points.
Submitted 17 October, 2022;
originally announced October 2022.
-
LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds
Authors:
Minghua Liu,
Yin Zhou,
Charles R. Qi,
Boqing Gong,
Hao Su,
Dragomir Anguelov
Abstract:
Semantic segmentation of LiDAR point clouds is an important task in autonomous driving. However, training deep models via conventional supervised methods requires large datasets which are costly to label. It is critical to have label-efficient segmentation approaches to scale up the model to new operational domains or to improve performance on rare cases. While most prior works focus on indoor scenes, we are one of the first to propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds. Our method co-designs an efficient labeling process with semi/weakly supervised learning and is applicable to nearly any 3D semantic segmentation backbones. Specifically, we leverage geometry patterns in outdoor scenes to have a heuristic pre-segmentation to reduce the manual labeling and jointly design the learning targets with the labeling process. In the learning step, we leverage prototype learning to get more descriptive point embeddings and use multi-scan distillation to exploit richer semantics from temporally aggregated point clouds to boost the performance of single-scan models. Evaluated on the SemanticKITTI and the nuScenes datasets, we show that our proposed method outperforms existing label-efficient methods. With extremely limited human annotations (e.g., 0.1% point labels), our proposed method is even highly competitive compared to the fully supervised counterpart with 100% labels.
Submitted 14 October, 2022;
originally announced October 2022.