-
PVUW 2024 Challenge on Complex Video Understanding: Methods and Results
Authors:
Henghui Ding,
Chang Liu,
Yunchao Wei,
Nikhila Ravi,
Shuting He,
Song Bai,
Philip Torr,
Deshui Miao,
Xin Li,
Zhenyu He,
Yaowei Wang,
Ming-Hsuan Yang,
Zhensong Xu,
Jiangtao Yao,
Chengjing Wu,
Ting Liu,
Luoqi Liu,
Xinyu Liu,
Jing Zhang,
Kexin Zhang,
Yuting Yang,
Licheng Jiao,
Shuyuan Yang,
Mingqi Gao,
Jingnan Luo
, et al. (12 additional authors not shown)
Abstract:
Pixel-level Video Understanding in the Wild Challenge (PVUW) focuses on complex video understanding. In this CVPR 2024 workshop, we add two new tracks: the Complex Video Object Segmentation track based on the MOSE dataset and the Motion Expression guided Video Segmentation track based on the MeViS dataset. In the two new tracks, we provide additional videos and annotations that feature challenging elements, such as the disappearance and reappearance of objects, inconspicuous small objects, heavy occlusions, and crowded environments in MOSE. Moreover, we provide a new motion expression guided video segmentation dataset, MeViS, to study natural language-guided video understanding in complex environments. These new videos, sentences, and annotations enable us to foster the development of a more comprehensive and robust pixel-level understanding of video scenes in complex environments and realistic scenarios. The MOSE challenge had 140 registered teams in total; 65 teams participated in the validation phase and 12 teams made valid submissions in the final challenge phase. The MeViS challenge had 225 registered teams in total; 50 teams participated in the validation phase and 5 teams made valid submissions in the final challenge phase.
Submitted 24 June, 2024;
originally announced June 2024.
-
On estimation and order selection for multivariate extremes via clustering
Authors:
Shiyuan Deng,
He Tang,
Shuyang Bai
Abstract:
We investigate the estimation of multivariate extreme models with a discrete spectral measure using spherical clustering techniques. The primary contribution involves devising a method for selecting the order, that is, the number of clusters. The method consistently identifies the true order, i.e., the number of spectral atoms, and enjoys intuitive implementation in practice. Specifically, we introduce an extra penalty term to the well-known simplified average silhouette width, which penalizes small cluster sizes and small dissimilarities between cluster centers. Consequently, we provide a consistent method for determining the order of a max-linear factor model, where a typical information-based approach is not viable. Our second contribution is a large-deviation-type analysis for estimating the discrete spectral measure through clustering methods, which serves as an assessment of the convergence quality of clustering-based estimation for multivariate extremes. Additionally, as a third contribution, we discuss how estimating the discrete measure can lead to parameter estimations of heavy-tailed factor models. We also present simulations and real-data studies that demonstrate order selection and factor model estimation.
Submitted 20 June, 2024;
originally announced June 2024.
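As a rough illustration of the quantity the abstract above builds on, the following Python sketch computes the plain simplified average silhouette width, in which each point is compared with its own cluster center and with the nearest other center. The extra penalty term proposed by the authors is not specified in the abstract and is not reproduced here; the function name, the Euclidean dissimilarity, and the argument layout are assumptions made for illustration only.

```python
import math


def simplified_silhouette(points, centers, labels):
    """Average simplified silhouette width (no penalty term).

    points  : list of coordinate tuples
    centers : list of cluster-center tuples (at least two)
    labels  : labels[i] is the index of the center assigned to points[i]
    Euclidean distance is an assumption; other dissimilarities could be used.
    """
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    total = 0.0
    for p, k in zip(points, labels):
        a = dist(p, centers[k])  # dissimilarity to own center
        # dissimilarity to the closest center of any other cluster
        b = min(dist(p, c) for j, c in enumerate(centers) if j != k)
        denom = max(a, b)
        total += 0.0 if denom == 0 else (b - a) / denom
    return total / len(points)
```

Values close to 1 indicate points that sit near their own center and far from all others. A common (unpenalized) heuristic evaluates this score for several candidate numbers of clusters and keeps the largest; the abstract's contribution is a penalized variant of that idea.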
-
Long-Range Quantum Tunneling via Matter Wave
Authors:
Yuan-Xing Yang,
Si-Yuan Bai,
Jun-Hong An
Abstract:
Quantum tunneling refers to the phenomenon in which a microscopic object passes through a potential barrier even if it does not have enough energy to overcome the barrier. It has led to many modern applications and nanotechnologies. A general belief is that quantum tunneling, as a manifestation of the wave-particle duality, occurs only when the width of the barrier is comparable to or smaller than the de Broglie wavelength of the object. Here, by studying the tunneling of an ultracold atom among $N$ far-separated trapping potentials in a state-selective optical lattice, we discover a mechanism to realize long-range quantum tunneling. It is found that, mediated by the propagating matter wave emitted from the excited-state atom, a coherent tunneling of the tightly confined atom to the remote trapping potentials can occur as long as bound states are present in the energy spectrum of the total system formed by the atom and its matter wave. Breaking through the generally believed distance constraint of quantum tunneling, our result opens another avenue to realize quantum tunneling and gives a guideline to develop tunneling devices.
Submitted 10 June, 2024;
originally announced June 2024.
-
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data
Authors:
Qihao Liu,
Yi Zhang,
Song Bai,
Adam Kortylewski,
Alan Yuille
Abstract:
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data, limiting them to single or few-class generation, our model is directly trained on extensive noisy and unaligned `in-the-wild' 3D assets, mitigating the key challenge (i.e., data scarcity) in large-scale 3D generation. In particular, DIRECT-3D is a tri-plane diffusion model that integrates two innovations: 1) A novel learning framework where noisy data are filtered and aligned automatically during the training process. Specifically, after an initial warm-up phase using a small set of clean data, an iterative optimization is introduced in the diffusion process to explicitly estimate the 3D pose of objects and select beneficial data based on conditional density. 2) An efficient 3D representation that is achieved by disentangling object geometry and color features with two separate conditional diffusion models that are optimized hierarchically. Given a prompt input, our model generates high-quality, high-resolution, realistic, and complex 3D objects with accurate geometric details in seconds. We achieve state-of-the-art performance in both single-class generation and text-to-3D generation. We also demonstrate that DIRECT-3D can serve as a useful 3D geometric prior of objects, for example to alleviate the well-known Janus problem in 2D-lifting methods such as DreamFusion. The code and models are available for research purposes at: https://github.com/qihao067/direct3d.
Submitted 6 June, 2024; v1 submitted 6 June, 2024;
originally announced June 2024.
-
Cohomological splitting over rationally connected bases
Authors:
Shaoyun Bai,
Daniel Pomerleano,
Guangbo Xu
Abstract:
We prove a cohomological splitting result for Hamiltonian fibrations over enumeratively rationally connected symplectic manifolds. As a key application, we prove that the cohomology of a smooth, projective family over a smooth stably rational projective variety splits additively over any field. The main ingredient in our arguments is the theory of Fukaya-Ono-Parker (FOP) perturbations developed by the first and third authors, which allows one to define integer-valued Gromov-Witten type invariants.
Submitted 2 June, 2024;
originally announced June 2024.
-
Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques
Authors:
Samita Bai,
Sidra Nasir,
Rizwan Ahmed Khan,
Sheeraz Arif,
Alexandre Meyer,
Hubert Konik
Abstract:
Breast cancer (BC) stands as one of the most common malignancies affecting women worldwide, necessitating advancements in diagnostic methodologies for better clinical outcomes. This article provides a comprehensive exploration of the application of Explainable Artificial Intelligence (XAI) techniques in the detection and diagnosis of breast cancer. As Artificial Intelligence (AI) technologies continue to permeate the healthcare sector, particularly in oncology, the need for transparent and interpretable models becomes imperative to enhance clinical decision-making and patient care. This review discusses the integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and others, with machine learning and deep learning models utilized in breast cancer detection and classification. By investigating the modalities of breast cancer datasets, including mammograms, ultrasounds and their processing with AI, the paper highlights how XAI can lead to more accurate diagnoses and personalized treatment plans. It also examines the challenges in implementing these techniques and the importance of developing standardized metrics for evaluating XAI's effectiveness in clinical settings. Through detailed analysis and discussion, this article aims to highlight the potential of XAI in bridging the gap between complex AI models and practical healthcare applications, thereby fostering trust and understanding among medical professionals and improving patient outcomes.
Submitted 1 June, 2024;
originally announced June 2024.
-
Jacobian Regularizer-based Neural Granger Causality
Authors:
Wanqi Zhou,
Shuanghao Bai,
Shujian Yu,
Qibin Zhao,
Badong Chen
Abstract:
With the advancement of neural networks, diverse methods for neural Granger causality have emerged, which demonstrate proficiency in handling complex data and nonlinear relationships. However, the existing framework of neural Granger causality has several limitations. It requires the construction of separate predictive models for each target variable, and the relationship depends on the sparsity of the weights of the first layer, resulting in challenges in effectively modeling complex relationships between variables as well as unsatisfactory estimation accuracy of Granger causality. Moreover, most existing methods cannot capture full-time Granger causality. To address these drawbacks, we propose a Jacobian Regularizer-based Neural Granger Causality (JRNGC) approach, a straightforward yet highly effective method for learning multivariate summary Granger causality and full-time Granger causality by constructing a single model for all target variables. Specifically, our method eliminates the sparsity constraints of weights by leveraging an input-output Jacobian matrix regularizer, which can be subsequently represented as the weighted causal matrix in the post-hoc analysis. Extensive experiments show that our proposed approach achieves competitive performance with the state-of-the-art methods for learning summary Granger causality and full-time Granger causality while maintaining lower model complexity and high scalability.
Submitted 14 May, 2024;
originally announced May 2024.
-
Universal replication of chaotic characteristics by classical and quantum machine learning
Authors:
Sheng-Chen Bai,
Shi-Ju Ran
Abstract:
Replicating chaotic characteristics of non-linear dynamics by machine learning (ML) has recently drawn wide attention. In this work, we propose that an ML model, trained to predict the state one step ahead from the several latest historic states, can accurately replicate the bifurcation diagram and the Lyapunov exponents of discrete dynamic systems. The characteristics for different values of the hyper-parameters are captured universally by a single ML model, while the previous works considered training the ML model independently by fixing the hyper-parameters to be specific values. Our benchmarks on the one- and two-dimensional Logistic maps show that a variational quantum circuit can reproduce the long-term characteristics with higher accuracy than the long short-term memory (a well-recognized classical ML model). Our work reveals an essential difference between the ML for the chaotic characteristics and that for standard tasks, from the perspective of the relation between performance and model complexity. Our results suggest that the quantum circuit model exhibits potential advantages in mitigating over-fitting and achieving higher accuracy and stability.
Submitted 14 May, 2024;
originally announced May 2024.
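For readers unfamiliar with the target quantities in the abstract above, the sketch below estimates the Lyapunov exponent of the one-dimensional logistic map x -> r x (1 - x) directly from the map itself, by averaging log|f'(x)| along a trajectory. It is a textbook reference computation, not the authors' ML or quantum-circuit model; the function name, initial condition, and iteration counts are assumptions chosen for illustration.

```python
import math


def logistic_lyapunov(r, x0=0.2, n_burn=1000, n_iter=20000):
    """Estimate the Lyapunov exponent of x -> r*x*(1-x) for parameter r."""
    x = x0
    # Discard a transient so the average is taken on the attractor.
    for _ in range(n_burn):
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(n_iter):
        # The derivative of the map is f'(x) = r*(1 - 2x); the floor
        # avoids log(0) if the orbit lands exactly on x = 0.5.
        total += math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-300))
        x = r * x * (1.0 - x)
    return total / n_iter
```

A negative value indicates a stable periodic orbit and a positive value indicates chaos. Sweeping r over a grid and recording the sign change gives a reference curve against which a learned model's replicated exponents could be compared.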
-
Equivariant formality in complex-oriented theories
Authors:
Shaoyun Bai,
Daniel Pomerleano
Abstract:
Let $G$ be a product of unitary groups and let $(M,ω)$ be a compact symplectic manifold with Hamiltonian $G$-action. We prove an equivariant formality result for any complex-oriented cohomology theory $\mathbb{E}^*$ (in particular, integral cohomology). This generalizes the celebrated result of Atiyah-Bott-Kirwan for rational cohomology from the 1980s. The proof does not use classical ideas but instead relies on a recent cohomological splitting result of Abouzaid-McLean-Smith for Hamiltonian fibrations over $\mathbb{CP}^1.$ Moreover, we establish analogues of the "localization" and "injectivity to fixed points" theorems for certain cohomology theories studied by Hopkins-Kuhn-Ravenel. As an application of these results, we establish a Goresky-Kottwitz-MacPherson theorem with Morava $K$-theory coefficients for Hamiltonian $T$-manifolds.
Submitted 22 May, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective
Authors:
Wanqi Zhou,
Shuanghao Bai,
Qibin Zhao,
Badong Chen
Abstract:
Pretrained vision-language models (VLMs) like CLIP have shown impressive generalization performance across various downstream tasks, yet they remain vulnerable to adversarial attacks. While prior research has primarily concentrated on improving the adversarial robustness of image encoders to guard against attacks on images, the exploration of text-based and multimodal attacks has largely been overlooked. In this work, we initiate the first known and comprehensive effort to study adapting vision-language models for adversarial robustness under the multimodal attack. Firstly, we introduce a multimodal attack strategy and investigate the impact of different attacks. We then propose a multimodal contrastive adversarial training loss, aligning the clean and adversarial text embeddings with the adversarial and clean visual features, to enhance the adversarial robustness of both image and text encoders of CLIP. Extensive experiments on 15 datasets across two tasks demonstrate that our method significantly improves the adversarial robustness of CLIP. Interestingly, we find that the model fine-tuned against multimodal adversarial attacks exhibits greater robustness than its counterpart fine-tuned solely against image-based attacks, even in the context of image attacks, which may open up new possibilities for enhancing the security of VLMs.
Submitted 30 April, 2024;
originally announced April 2024.
-
Soft Prompt Generation for Domain Generalization
Authors:
Shuanghao Bai,
Yuedi Zhang,
Wanqi Zhou,
Zhirong Luan,
Badong Chen
Abstract:
Large pre-trained vision language models (VLMs) have shown impressive zero-shot ability on downstream tasks with manually designed prompts, which are not optimal for specific domains. To further adapt VLMs to downstream tasks, soft prompts have been proposed to replace manually designed prompts; a soft prompt acts as a learnable vector that is fine-tuned on domain-specific data. Prior prompt learning methods primarily learn a fixed prompt and a residual prompt from training samples. However, the learned prompts lack diversity and ignore information about unseen domains, potentially compromising the transferability of the prompts. In this paper, we reframe the prompt learning framework from a generative perspective and propose a simple yet efficient method for the Domain Generalization (DG) task, namely Soft Prompt Generation (SPG). To the best of our knowledge, we are the first to introduce the generative model into prompt learning in VLMs and explore its potential for producing soft prompts by relying solely on the generative model, ensuring the diversity of prompts. Specifically, SPG consists of a two-stage training phase and an inference phase. During the training phase, we introduce soft prompt labels for each domain, aiming to incorporate domain knowledge into the generative model. During the inference phase, the generator of the generative model is employed to obtain instance-specific soft prompts for the unseen target domain. Extensive experiments on five domain generalization benchmarks of three DG tasks demonstrate that our proposed SPG achieves state-of-the-art performance. The code will be available soon.
Submitted 30 April, 2024;
originally announced April 2024.
-
Stimulated Emission Depletion (STED) Magnetic Particle Imaging
Authors:
Guang Jia,
Zhongwei Bian,
Tianshu Li,
Shi Bai,
Chenxing Hu,
Lixuan Zhao,
Peng Gao,
Tanping Li,
Hui Hui,
Jie Tian
Abstract:
Magnetic particle imaging (MPI) is an in-vivo imaging method to detect magnetic nanoparticles for blood vessel imaging and molecular target imaging. Compared with conventional molecular imaging devices (such as nuclear medicine imaging PET and SPECT), magnetic nanoparticles have longer storage periods than radionuclides without ionizing radiation. MPI has higher detection sensitivity compared with MRI. To accurately locate molecular probes in living organisms, high-resolution images are needed to meet the requirements of precision medicine. The spatial resolution of the latest domestic and international MPI equipment is 1-6 mm and has not yet met the requirements of medical imaging detection. We previously studied the spatial encoding technology based on pulsed square wave stimulation, which significantly improved the image resolution along the field free line (FFL) direction. This study proposes an innovative idea of high-resolution MPI based on stimulated emission depletion (STED) of magnetic nanoparticle signals. The stimulated emission was implemented by using cosine stimulation on FFL-based MPI scanner systems. The STED signal was generated by adding an offset magnetic field parallel to the FFL, which may form a donut-shaped focal spot or a regular Gaussian focal spot depending on the offset field strength. Focal spot modulation techniques and deconvolution algorithms were developed to improve image resolution.
Submitted 25 April, 2024;
originally announced April 2024.
-
Tightly Joined Positioning and Control Model for Unmanned Aerial Vehicles Based on Factor Graph Optimization
Authors:
Peiwen Yang,
Weisong Wen,
Shiyu Bai,
Li-Ta Hsu
Abstract:
The execution of flight missions by unmanned aerial vehicles (UAVs) primarily relies on navigation. In particular, the navigation pipeline has traditionally been divided into positioning and control, operating in a sequential loop. However, the existing navigation pipeline, where the positioning and control are decoupled, struggles to adapt to ubiquitous uncertainties arising from measurement noise, abrupt disturbances, and nonlinear dynamics. As a result, the navigation reliability of the UAV is significantly challenged in complex dynamic areas. For example, the ubiquitous global navigation satellite system (GNSS) positioning can be degraded by signal reflections from surrounding high-rise buildings in complex urban areas, leading to significantly increased positioning uncertainty. An additional challenge is introduced to the control algorithm by the complex wind disturbances in urban canyons. Given that the system positioning and control are highly correlated with each other, this research proposes a tightly joined positioning and control model (JPCM) based on factor graph optimization (FGO). In particular, the proposed JPCM combines sensor measurements from positioning and control constraints into a unified probabilistic factor graph. Specifically, the positioning measurements are formulated as factors in the factor graph. In addition, the model predictive control (MPC) is formulated as additional factors in the factor graph. By solving the factor graph formed by both the positioning-related factors and the MPC-based factors, the complementarity of positioning and control can be deeply exploited. Finally, we validate the effectiveness and resilience of the proposed method using a simulated quadrotor system, which shows significantly improved trajectory-following performance.
Submitted 22 April, 2024;
originally announced April 2024.
-
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction
Authors:
Shiyi Zhang,
Sule Bai,
Guangyi Chen,
Lei Chen,
Jiwen Lu,
Junle Wang,
Yansong Tang
Abstract:
In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning involving superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One existing possible solution is to use multi-task learning, where narrative language and evaluative information are predicted separately. However, this approach results in reduced performance for individual tasks because of variations between tasks and differences in modality between language information and evaluation information. To address this, we propose a prompt-guided multimodal interaction framework. This framework utilizes a pair of transformers to facilitate the interaction between different modalities of information. It also uses prompts to transform the score regression task into a video-text matching task, thus enabling task interactivity. To support further research in this field, we re-annotate the MTL-AQA and FineGym datasets with high-quality and comprehensive action narration. Additionally, we establish benchmarks for NAE. Extensive experiment results prove that our method outperforms separate learning methods and naive multi-task learning methods. Data and code are released at https://github.com/shiyi-zh0408/NAE_CVPR2024.
Submitted 26 April, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
Robust Noisy Label Learning via Two-Stream Sample Distillation
Authors:
Sihan Bai,
Sanping Zhou,
Zheng Qin,
Le Wang,
Nanning Zheng
Abstract:
Noisy label learning aims to learn robust networks under the supervision of noisy labels, which plays a critical role in deep learning. Existing work either conducts sample selection or label correction to deal with noisy labels during the model training process. In this paper, we design a simple yet effective sample selection framework, termed Two-Stream Sample Distillation (TSSD), for noisy label learning, which can extract more high-quality samples with clean labels to improve the robustness of network training. Firstly, a novel Parallel Sample Division (PSD) module is designed to generate a certain training set with sufficient reliable positive and negative samples by jointly considering the sample structure in feature space and the human prior in loss space. Secondly, a novel Meta Sample Purification (MSP) module is further designed to mine adequate semi-hard samples from the remaining uncertain training set by learning a strong meta classifier with extra golden data. As a result, more and more high-quality samples will be distilled from the noisy training set to train networks robustly in every iteration. Extensive experiments on four benchmark datasets, including CIFAR-10, CIFAR-100, Tiny-ImageNet, and Clothing-1M, show that our method has achieved state-of-the-art results over its competitors.
Submitted 16 April, 2024;
originally announced April 2024.
-
Cepstral Analysis Based Artifact Detection, Recognition and Removal for Prefrontal EEG
Authors:
Siqi Han,
Chao Zhang,
Jiaxin Lei,
Qingquan Han,
Yuhui Du,
Anhe Wang,
Shuo Bai,
Milin Zhang
Abstract:
This paper proposes to use the cepstrum for artifact detection, recognition and removal in prefrontal EEG. This work focuses on the artifact caused by eye movement. A database containing artifact-free EEG and eye movement contaminated EEG from different subjects is established. A cepstral analysis-based feature extraction with a support vector machine (SVM) based classifier is designed to identify the artifacts from the target EEG signals. The proposed method achieves an accuracy of 99.62% on the artifact detection task and an accuracy of 82.79% on the 6-category eye movement classification task. A statistical value-based artifact removal method is proposed and evaluated on a public EEG database, where an accuracy improvement of 3.46% is obtained on the 3-category emotion classification task. To make a confident decision on each 5 s EEG segment, the algorithm requires only 0.66M multiplication operations. Compared to the state-of-the-art approaches in artifact detection and removal, the proposed method features higher detection accuracy and lower computational cost, which makes it a more suitable solution to be integrated into a real-time and artifact-robust Brain-Machine Interface (BMI).
Submitted 11 April, 2024;
originally announced April 2024.
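The cepstrum named in the abstract above is, in its common "real cepstrum" form, the inverse Fourier transform of the log-magnitude spectrum. The sketch below shows that generic definition with a small pure-Python DFT so that it runs without third-party packages. It is not the authors' feature-extraction pipeline: the function names, the `eps` stabilizer, and the use of a naive O(N^2) transform are assumptions for illustration, and a practical implementation would use an FFT library.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def real_cepstrum(signal, eps=1e-12):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum."""
    spectrum = dft(signal)
    # eps keeps the logarithm finite if a spectral bin is exactly zero.
    log_mag = [math.log(abs(v) + eps) for v in spectrum]
    return [v.real for v in dft(log_mag, inverse=True)]
```

One property that makes cepstral coefficients useful as features is that a delayed copy of a signal component shows up as a peak at the corresponding delay index. A vector of low-order coefficients from each EEG segment could then be passed to a classifier such as an SVM, which is the general kind of design the abstract describes.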
-
Nuclear charge radii of germanium isotopes around $N$ = 40
Authors:
S. J. Wang,
A. Kanellakopoulos,
X. F. Yang,
S. W. Bai,
J. Billowes,
M. L. Bissell,
K. Blaum,
B. Cheal,
C. S. Devlin,
R. F. Garcia Ruiz,
J. Z. Han,
H. Heylen,
S. Kaufmann,
K. Konig,
A. Koszorus,
S. Lechner,
S. Malbrunot-Ettenauer,
W. Nazarewicz,
R. Neugart,
G. Neyens,
W. Nortershauser,
T. Ratajczyk,
P. -G. Reinhard,
L. V. Rodríguez,
S. Sels
, et al. (4 additional authors not shown)
Abstract:
Collinear laser spectroscopy measurements were performed on $^{68-74}$Ge isotopes ($Z = 32$) at ISOLDE-CERN, by probing the $4s^2 4p^2 \, ^3\!P_1 \rightarrow 4s^2 4p 5s \, ^3\!P_1^o$ atomic transition (269~nm) of germanium. Nuclear charge radii are determined via the measured isotope shifts, revealing a larger local variation than the neighboring isotopic chains. Nuclear density functional theory with the Fayans functionals Fy($Δr$,HFB) and Fy(IVP), and the SV-min Skyrme describes the experimental data for the differential charge radii $δ\langle r^{2} \rangle$ and charge radii $R_{\rm c}$ within the theoretical uncertainties. The observed large variation in the charge radii of germanium isotopes is better accounted for by theoretical models incorporating ground state quadrupole correlations. This suggests that the polarization effects due to pairing and deformation contribute to the observed large odd-even staggering in the charge radii of the Ge isotopic chain.
Submitted 9 April, 2024;
originally announced April 2024.
-
Self-supervised 6-DoF Robot Grasping by Demonstration via Augmented Reality Teleoperation System
Authors:
Xiwen Dengxiong,
Xueting Wang,
Shi Bai,
Yunbo Zhang
Abstract:
Most existing 6-DoF robot grasping solutions depend on strong supervision on grasp pose to ensure satisfactory performance, which could be laborious and impractical when the robot works in some restricted area. To this end, we propose a self-supervised 6-DoF grasp pose detection framework via an Augmented Reality (AR) teleoperation system that can efficiently learn human demonstrations and provide 6-DoF grasp poses without grasp pose annotations. Specifically, the system collects the human demonstration from the AR environment and contrastively learns the grasping strategy from the demonstration. For the real-world experiment, the proposed system leads to satisfactory grasping abilities and learns to grasp unknown objects within three demonstrations.
Submitted 3 April, 2024;
originally announced April 2024.
-
High-precision measurement of the atomic mass of $^{84}$Sr and implications to isotope shift studies
Authors:
Zhuang Ge,
Shiwei Bai,
Tommi Eronen,
Ari Jokinen,
Anu Kankainen,
Sonja Kujanpää,
Iain Moore,
Dmitrii Nesterenko,
Mikael Reponen
Abstract:
The absolute mass of $^{84}$Sr was determined using the phase-imaging ion-cyclotron-resonance technique with the JYFLTRAP double Penning trap mass spectrometer. A more precise value for the mass of $^{84}$Sr is essential for providing potential indications of physics beyond the Standard Model through high-precision isotope shift measurements of Sr atomic transition frequencies. The mass excess of $^{84}$Sr was refined to be -80649.229(37) keV/c$^2$ from high-precision cyclotron-frequency-ratio measurements with a relative precision of 4.8$\times$10$^{-10}$. The obtained mass-excess value is in agreement with the adopted value in the Atomic Mass Evaluation 2020, but is 30 times more precise. With this new value, we confirm the previously observed nonlinearity in the study of the isotope shift of strontium. Moreover, the double-beta ($2β^{+}$) decay $Q$ value of $^{84}$Sr was directly determined to be 1790.115(37) keV, and the precision was improved by a factor of 30.
Submitted 22 June, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Pairwise Similarity Distribution Clustering for Noisy Label Learning
Authors:
Sihan Bai
Abstract:
Noisy label learning aims to train deep neural networks using a large number of samples with noisy labels, whose main challenge comes from how to deal with the inaccurate supervision caused by wrong labels. Existing works either take the label correction or sample selection paradigm to involve more samples with accurate labels into the training process. In this paper, we propose a simple yet effective sample selection algorithm, termed Pairwise Similarity Distribution Clustering (PSDC), to divide the training samples into one clean set and another noisy set, which can power any of the off-the-shelf semi-supervised learning regimes to further train networks for different downstream tasks. Specifically, we take the pairwise similarity between sample pairs to represent the sample structure, and the Gaussian Mixture Model (GMM) to model the similarity distribution between sample pairs belonging to the same noisy cluster, so that each sample can be confidently divided into the clean set or noisy set. Even under a severe label noise rate, the resulting data partition mechanism has been proved to be more robust in judging the label confidence in both theory and practice. Experimental results on various benchmark datasets, such as CIFAR-10, CIFAR-100 and Clothing1M, demonstrate significant improvements over state-of-the-art methods.
Submitted 2 April, 2024;
originally announced April 2024.
-
Radiative lifetime of the A 2Π1/2 state in RaF with relevance to laser cooling
Authors:
M. Athanasakis-Kaklamanakis,
S. G. Wilkins,
P. Lassègues,
L. Lalanne,
J. R. Reilly,
O. Ahmad,
M. Au,
S. W. Bai,
J. Berbalk,
C. Bernerd,
A. Borschevsky,
A. A. Breier,
K. Chrysalidis,
T. E. Cocolios,
R. P. de Groote,
C. M. Fajardo-Zambrano,
K. T. Flanagan,
S. Franchoo,
R. F. Garcia Ruiz,
D. Hanstorp,
R. Heinke,
P. Imgram,
A. Koszorús,
A. A. Kyuberis,
J. Lim
, et al. (16 additional authors not shown)
Abstract:
The radiative lifetime of the $A$ $^2 Π_{1/2}$ (v=0) state in radium monofluoride (RaF) is measured to be 35(1) ns. The lifetime of this state and the related decay rate $Γ= 2.86(8) \times 10^7$ $s^{-1}$ are of relevance to the laser cooling of RaF via the optically closed $A$ $^2 Π_{1/2} \leftarrow X$ $^2Σ_{1/2}$ transition, which makes the molecule a promising probe to search for new physics. RaF is found to have a comparable photon-scattering rate to homoelectronic laser-coolable molecules. Thanks to its highly diagonal Franck-Condon matrix, it is expected to scatter an order of magnitude more photons than other molecules when using just 3 cooling lasers, before it decays to a dark state. The lifetime measurement in RaF is benchmarked by measuring the lifetime of the $8P_{3/2}$ state in Fr to be 83(3) ns, in agreement with literature.
Submitted 6 June, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
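As a quick consistency check on the RaF entry above, the quoted decay rate follows from the measured lifetime through $Γ = 1/τ$. The first-order uncertainty propagation shown here is an illustrative assumption and is not taken from the paper.

```latex
\begin{align*}
\Gamma &= \frac{1}{\tau} = \frac{1}{35 \times 10^{-9}\,\mathrm{s}} \approx 2.86 \times 10^{7}\,\mathrm{s^{-1}}, \\
\delta\Gamma &\approx \frac{\delta\tau}{\tau^{2}} = \frac{1 \times 10^{-9}\,\mathrm{s}}{\left(35 \times 10^{-9}\,\mathrm{s}\right)^{2}} \approx 0.08 \times 10^{7}\,\mathrm{s^{-1}}.
\end{align*}
```

Both values agree with the $2.86(8) \times 10^7$ $s^{-1}$ stated in the abstract.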
-
DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning
Authors:
Sikai Bai,
Jie Zhang,
Shuaicheng Li,
Song Guo,
Jingcai Guo,
Jun Hou,
Tao Han,
Xiaocheng Lu
Abstract:
Federated learning (FL) has emerged as a powerful paradigm for learning from decentralized data, and federated domain generalization further considers the setting in which the test dataset (target domain) is absent from the decentralized training data (source domains). However, most existing FL methods assume that domain labels are provided during training, and their evaluation imposes explicit constraints on the number of domains, which must strictly match the number of clients. Because of the underutilization of numerous edge devices and additional cross-client domain annotations in the real world, such restrictions may be impractical and involve potential privacy leaks. In this paper, we propose an efficient and novel approach, called Disentangled Prompt Tuning (DiPrompT), a method that tackles the above restrictions by learning adaptive prompts for domain generalization in a distributed manner. Specifically, we first design two types of prompts, i.e., global prompt to capture general knowledge across all clients and domain prompts to capture domain-specific knowledge. They eliminate the restriction on the one-to-one mapping between source domains and local clients. Furthermore, a dynamic query metric is introduced to automatically search the suitable domain label for each sample, which includes two-substep text-image alignments based on prompt tuning without labor-intensive annotation. Extensive experiments on multiple datasets demonstrate that our DiPrompT achieves superior domain generalization performance over state-of-the-art FL methods when domain labels are not provided, and even outperforms many centralized learning methods using domain labels.
Submitted 11 March, 2024;
originally announced March 2024.
-
MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension
Authors:
Xingyu Lu,
He Cao,
Zijing Liu,
Shengyuan Bai,
Leqing Chen,
Yuan Yao,
Hai-Tao Zheng,
Yu Li
Abstract:
Large language models are playing an increasingly significant role in molecular research, yet existing models often generate erroneous information, posing challenges to accurate molecular comprehension. Traditional evaluation metrics for generated content fail to assess a model's accuracy in molecular understanding. To rectify the absence of factual evaluation, we present MoleculeQA, a novel question answering (QA) dataset which possesses 62K QA pairs over 23K molecules. Each QA pair, composed of a manual question, a positive option and three negative options, has consistent semantics with a molecular description from an authoritative molecular corpus. MoleculeQA is not only the first benchmark for molecular factual bias evaluation but also the largest QA dataset for molecular research. A comprehensive evaluation on MoleculeQA for existing molecular LLMs exposes their deficiencies in specific areas and pinpoints several particularly crucial factors for molecular understanding.
Submitted 12 March, 2024;
originally announced March 2024.
-
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Authors:
Liang Chen,
Haozhe Zhao,
Tianyu Liu,
Shuai Bai,
Junyang Lin,
Chang Zhou,
Baobao Chang
Abstract:
In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV are highly customizable and Pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical value for deployment of LVLMs in edge devices and commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
Submitted 25 March, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
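The FastV entry above describes pruning visual tokens after the early layers. The following is a minimal sketch of that idea, not the released implementation: the function name `prune_visual_tokens`, the `keep_ratio` parameter, and the choice to rank visual tokens by the attention they receive are all assumptions made for illustration.

```python
from typing import List, Sequence


def prune_visual_tokens(
    attention_received: Sequence[float],
    is_visual: Sequence[bool],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return indices of tokens to keep, in their original order.

    Hypothetical helper: every non-visual (text) token is kept, while
    visual tokens are ranked by the attention they receive and only the
    top `keep_ratio` fraction survives.
    """
    if len(attention_received) != len(is_visual):
        raise ValueError("attention_received and is_visual must have equal length")
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in [0, 1]")

    # Positions of the visual tokens in the sequence.
    visual = [i for i, flag in enumerate(is_visual) if flag]
    n_keep = int(len(visual) * keep_ratio)

    # Rank visual tokens from most to least attended and keep the top ones.
    ranked = sorted(visual, key=lambda i: attention_received[i], reverse=True)
    kept_visual = set(ranked[:n_keep])

    # Preserve original token order in the pruned sequence.
    return [i for i, flag in enumerate(is_visual) if not flag or i in kept_visual]


# Toy example: two text tokens surrounding four visual tokens.
scores = [0.90, 0.05, 0.30, 0.10, 0.20, 0.80]
visual_mask = [False, True, True, True, True, False]
kept = prune_visual_tokens(scores, visual_mask, keep_ratio=0.5)
```

With `keep_ratio=0.5`, half of the visual tokens are dropped while the text tokens are untouched, which mirrors the "1/2 tokens" phrasing in the title.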
-
Debiasing Text-to-Image Diffusion Models
Authors:
Ruifei He,
Chuhui Xue,
Haoru Tan,
Wenqing Zhang,
Yingchen Yu,
Song Bai,
Xiaojuan Qi
Abstract:
Learning-based Text-to-Image (TTI) models like Stable Diffusion have revolutionized the way visual content is generated in various domains. However, recent research has shown that nonnegligible social bias exists in current state-of-the-art TTI systems, which raises important concerns. In this work, we target resolving the social bias in TTI diffusion models. We begin by formalizing the problem setting and use the text descriptions of bias groups to establish an unsafe direction for guiding the diffusion process. Next, we simplify the problem into a weight optimization problem and attempt a Reinforcement solver, Policy Gradient, which shows sub-optimal performance with slow convergence. Further, to overcome limitations, we propose an iterative distribution alignment (IDA) method. Despite its simplicity, we show that IDA shows efficiency and fast convergence in resolving the social bias in TTI diffusion models. Our code will be released.
Submitted 22 February, 2024;
originally announced February 2024.
-
LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection
Authors:
Sifan Zhou,
Liang Li,
Xinyu Zhang,
Bo Zhang,
Shipeng Bai,
Miao Sun,
Ziyu Zhao,
Xiaobo Lu,
Xiangxiang Chu
Abstract:
Due to highly constrained computing power and memory, deploying 3D lidar-based detectors on edge devices equipped in autonomous vehicles and robots poses a crucial challenge. Being a convenient and straightforward model compression approach, Post-Training Quantization (PTQ) has been widely adopted in 2D vision tasks. However, applying it directly to 3D lidar-based tasks inevitably leads to performance degradation. As a remedy, we propose an effective PTQ method called LiDAR-PTQ, which is particularly curated for 3D lidar detection (both SPConv-based and SPConv-free). Our LiDAR-PTQ features three main components: (1) a sparsity-based calibration method to determine the initialization of quantization parameters, (2) a Task-guided Global Positive Loss (TGPL) to reduce the disparity between the final predictions before and after quantization, (3) an adaptive rounding-to-nearest operation to minimize the layerwise reconstruction error. Extensive experiments demonstrate that our LiDAR-PTQ can achieve state-of-the-art quantization performance when applied to CenterPoint (both Pillar-based and Voxel-based). To our knowledge, for the very first time in lidar-based 3D detection tasks, the PTQ INT8 model's accuracy is almost the same as the FP32 model while enjoying $3\times$ inference speedup. Moreover, our LiDAR-PTQ is cost-effective being $30\times$ faster than the quantization-aware training method. Code will be released at https://github.com/StiphyJay/LiDAR-PTQ.
Submitted 28 January, 2024;
originally announced January 2024.
-
Fast Registration of Photorealistic Avatars for VR Facial Animation
Authors:
Chaitanya Patel,
Shaojie Bai,
Te-Li Wang,
Jason Saragih,
Shih-En Wei
Abstract:
Virtual Reality (VR) bears the promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a photorealistic avatar of one's likeness while wearing a VR headset. Although high quality registration of person-specific avatars to headset-mounted camera (HMC) images is possible in an offline setting, the performance of generic realtime models is significantly degraded. Online registration is also challenging due to oblique camera views and differences in modality. In this work, we first show that the domain gap between the avatar and headset-camera images is one of the primary sources of difficulty, where a transformer-based architecture achieves high accuracy on domain-consistent data, but degrades when the domain-gap is re-introduced. Building on this finding, we develop a system design that decouples the problem into two parts: 1) an iterative refinement module that takes in-domain inputs, and 2) a generic avatar-guided image-to-image style transfer module that is conditioned on current estimation of expression and head pose. These two modules reinforce each other, as image style transfer becomes easier when close-to-ground-truth examples are shown, and better domain-gap removal helps registration. Our system produces high-quality results efficiently, obviating the need for costly offline registration to generate personalized labels. We validate the accuracy and efficiency of our approach through extensive experiments on a commodity headset, demonstrating significant improvements over direct regression methods as well as offline registration.
Submitted 19 January, 2024;
originally announced January 2024.
-
Existence and multiplicity of solutions for critical Kirchhoff-Choquard equations involving the fractional $p$-Laplacian on the Heisenberg group
Authors:
S. Bai,
Y. Song,
D. D. Repovš
Abstract:
In this paper, we study existence and multiplicity of solutions for the following Kirchhoff-Choquard type equation involving the fractional $p$-Laplacian on the Heisenberg group: \begin{equation*} \begin{array}{lll} M(\|u\|_μ^{p})(μ(-Δ)^{s}_{p}u+V(ξ)|u|^{p-2}u)= f(ξ,u)+\int_{\mathbb{H}^N}\frac{|u(η)|^{Q_λ^{\ast}}}{|η^{-1}ξ|^λ}dη|u|^{Q_λ^{\ast}-2}u &\mbox{in}\ \mathbb{H}^N, \\ \end{array} \end{equation*} where $(-Δ)^{s}_{p}$ is the fractional $p$-Laplacian on the Heisenberg group $\mathbb{H}^N$, $M$ is the Kirchhoff function, $V(ξ)$ is the potential function, $0<s<1$, $1<p<\frac{N}{s}$, $μ>0$, $f(ξ,u)$ is the nonlinear function, $0<λ<Q$, $Q=2N+2$, and $Q_λ^{\ast}=\frac{2Q-λ}{Q-2}$ is the Sobolev critical exponent. Using the Krasnoselskii genus theorem, the existence of infinitely many solutions is obtained if $μ$ is sufficiently large. In addition, using the fractional version of the concentration-compactness principle, we prove that the problem has $m$ pairs of solutions if $μ$ is sufficiently small. As far as we know, the results of our study are new even in the Euclidean case.
Submitted 18 January, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
Progress and Prospects in 3D Generative AI: A Technical Overview including 3D human
Authors:
Song Bai,
Jie Li
Abstract:
While AI-generated text and 2D images continue to expand their territory, 3D generation has gradually emerged as a trend that cannot be ignored. Since 2023, a large number of research papers have emerged in the domain of 3D generation. This growth encompasses not just the creation of 3D objects, but also the rapid development of 3D character and motion generation. Several key factors contribute to this progress. The enhanced fidelity in stable diffusion, coupled with control methods that ensure multi-view consistency, and realistic human models like SMPL-X, contribute synergistically to the production of 3D models with remarkable consistency and near-realistic appearances. The advancements in neural network-based 3D storing and rendering models, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have accelerated the efficiency and realism of neural rendered models. Furthermore, the multimodality capabilities of large language models have enabled language inputs to transcend into human motion outputs. This paper aims to provide a comprehensive overview and summary of the relevant papers published mostly during the latter half of 2023. It will begin by discussing the AI generated object models in 3D, followed by the generated 3D human models, and finally, the generated 3D human motions, culminating in a conclusive summary and a vision for the future.
Submitted 4 January, 2024;
originally announced January 2024.
-
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Authors:
Evonne Ng,
Javier Romero,
Timur Bagautdinov,
Shaojie Bai,
Trevor Darrell,
Angjoo Kanazawa,
Alexander Richard
Abstract:
We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset available online.
Submitted 3 January, 2024;
originally announced January 2024.
-
GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields
Authors:
Xiao Pan,
Zongxin Yang,
Shuai Bai,
Yi Yang
Abstract:
In this paper, we focus on the One-shot Novel View Synthesis (O-NVS) task which targets synthesizing photo-realistic novel views given only one reference image per scene. Previous One-shot Generalizable Neural Radiance Fields (OG-NeRF) methods solve this task in an inference-time finetuning-free manner, yet suffer from blurriness due to the encoder-only architecture that highly relies on the limited reference image. On the other hand, recent diffusion-based image-to-3d methods show vivid plausible results via distilling pre-trained 2D diffusion models into a 3D representation, yet require tedious per-scene optimization. Targeting these issues, we propose the GD$^2$-NeRF, a Generative Detail compensation framework via GAN and Diffusion that is both inference-time finetuning-free and with vivid plausible details. In detail, following a coarse-to-fine strategy, GD$^2$-NeRF is mainly composed of a One-stage Parallel Pipeline (OPP) and a 3D-consistent Detail Enhancer (Diff3DE). At the coarse stage, OPP first efficiently inserts the GAN model into the existing OG-NeRF pipeline for primarily relieving the blurriness with in-distribution priors captured from the training dataset, achieving a good balance between sharpness (LPIPS, FID) and fidelity (PSNR, SSIM). Then, at the fine stage, Diff3DE further leverages the pre-trained image diffusion models to complement rich out-distribution details while maintaining decent 3D consistency. Extensive experiments on both the synthetic and real-world datasets show that GD$^2$-NeRF noticeably improves the details without per-scene finetuning.
Submitted 29 March, 2024; v1 submitted 31 December, 2023;
originally announced January 2024.
-
Improving Cross-domain Few-shot Classification with Multilayer Perceptron
Authors:
Shuanghao Bai,
Wanqi Zhou,
Zhirong Luan,
Donglin Wang,
Badong Chen
Abstract:
Cross-domain few-shot classification (CDFSC) is a challenging task due to the significant distribution discrepancies across different domains. To address this challenge, many approaches aim to learn transferable representations. Multilayer perceptron (MLP) has shown its capability to learn transferable representations in various downstream tasks, such as unsupervised image classification and supervised concept generalization. However, its potential in the few-shot settings has yet to be comprehensively explored. In this study, we investigate the potential of MLP to assist in addressing the challenges of CDFSC. Specifically, we introduce three distinct frameworks incorporating MLP in accordance with three types of few-shot classification methods to verify the effectiveness of MLP. We reveal that MLP can significantly enhance discriminative capabilities and alleviate distribution shifts, which is supported by our extensive experiments involving 10 baseline models and 12 benchmark datasets. Furthermore, our method even compares favorably against other state-of-the-art CDFSC algorithms.
Submitted 15 December, 2023;
originally announced December 2023.
-
Prompt-based Distribution Alignment for Unsupervised Domain Adaptation
Authors:
Shuanghao Bai,
Min Zhang,
Wanqi Zhou,
Siteng Huang,
Zhirong Luan,
Donglin Wang,
Badong Chen
Abstract:
Recently, despite the unprecedented success of large pre-trained visual-language models (VLMs) on a wide range of downstream tasks, the real-world unsupervised domain adaptation (UDA) problem is still not well explored. Therefore, in this paper, we first experimentally demonstrate that the unsupervised-trained VLMs can significantly reduce the distribution discrepancy between source and target domains, thereby improving the performance of UDA. However, a major challenge for directly deploying such models on downstream UDA tasks is prompt engineering, which requires aligning the domain knowledge of source and target domains, since the performance of UDA is severely influenced by a good domain-invariant representation. We further propose a Prompt-based Distribution Alignment (PDA) method to incorporate the domain knowledge into prompt learning. Specifically, PDA employs a two-branch prompt-tuning paradigm, namely base branch and alignment branch. The base branch focuses on integrating class-related representation into prompts, ensuring discrimination among different classes. To further minimize domain discrepancy, for the alignment branch, we construct feature banks for both the source and target domains and propose image-guided feature tuning (IFT) to make the input attend to feature banks, which effectively integrates self-enhanced and cross-domain features into the model. In this way, these two branches can be mutually promoted to enhance the adaptation of VLMs for UDA. We conduct extensive experiments on three benchmarks to demonstrate that our proposed PDA achieves state-of-the-art performance. The code is available at https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment.
Submitted 26 January, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
General Object Foundation Model for Images and Videos at Scale
Authors:
Junfeng Wu,
Yi Jiang,
Qihao Liu,
Zehuan Yuan,
Xiang Bai,
Song Bai
Abstract:
We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, enabling the model to solve various object-centric downstream tasks simultaneously while maintaining state-of-the-art performance. Through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE can be integrated into Large Language Models, serving as a foundational model that provides universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io.
Submitted 14 December, 2023;
originally announced December 2023.
-
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Authors:
Yong Liu,
Sule Bai,
Guanbin Li,
Yitong Wang,
Yansong Tang
Abstract:
This paper studies open-vocabulary segmentation (OVS) by calibrating the in-vocabulary and domain-biased embedding space with the generalized contextual prior of CLIP. As the core of open-vocabulary understanding, alignment of visual content with the semantics of unbounded text has become the bottleneck of this field. To address this challenge, recent works propose to utilize CLIP as an additional classifier and aggregate model predictions with CLIP classification results. Despite their remarkable progress, the performance of OVS methods in relevant scenarios is still unsatisfactory compared with supervised counterparts. We attribute this to the in-vocabulary embedding and domain-biased CLIP prediction. To this end, we present a Semantic-assisted CAlibration Network (SCAN). In SCAN, we incorporate the generalized semantic prior of CLIP into the proposal embedding to avoid collapsing on known categories. In addition, a contextual shift strategy is applied to mitigate the lack of global context and unnatural background noise. With the above designs, SCAN achieves state-of-the-art performance on all popular open-vocabulary segmentation benchmarks. Furthermore, we address the problem that the existing evaluation system ignores semantic duplication across categories, and propose a new metric called Semantic-Guided IoU (SG-IoU).
Submitted 7 December, 2023;
originally announced December 2023.
-
Learning to Holistically Detect Bridges from Large-Size VHR Remote Sensing Imagery
Authors:
Yansheng Li,
Junwei Luo,
Yongjun Zhang,
Yihua Tan,
Jin-Gang Yu,
Song Bai
Abstract:
Bridge detection in remote sensing images (RSIs) plays a crucial role in various applications, but it poses unique challenges compared to the detection of other objects. In RSIs, bridges exhibit considerable variations in terms of their spatial scales and aspect ratios. Therefore, to ensure the visibility and integrity of bridges, it is essential to perform holistic bridge detection in large-size very-high-resolution (VHR) RSIs. However, the lack of datasets with large-size VHR RSIs limits the performance of deep learning algorithms on bridge detection. Due to the limitation of GPU memory in tackling large-size images, deep learning-based object detection methods commonly adopt a cropping strategy, which inevitably results in label fragmentation and discontinuous prediction. To ameliorate the scarcity of datasets, this paper proposes a large-scale dataset named GLH-Bridge comprising 6,000 VHR RSIs sampled from diverse geographic locations across the globe. These images encompass a wide range of sizes, varying from 2,048 × 2,048 to 16,384 × 16,384 pixels, and collectively feature 59,737 bridges. Furthermore, we present an efficient network for holistic bridge detection (HBD-Net) in large-size RSIs. The HBD-Net presents a separate detector-based feature fusion (SDFF) architecture and is optimized via a shape-sensitive sample re-weighting (SSRW) strategy. Based on the proposed GLH-Bridge dataset, we establish a bridge detection benchmark including the OBB and HBB tasks, and validate the effectiveness of the proposed HBD-Net. Additionally, cross-dataset generalization experiments on two publicly available datasets illustrate the strong generalization capability of the GLH-Bridge dataset.
Submitted 4 December, 2023;
originally announced December 2023.
-
Empirical limit theorems for Wiener chaos
Authors:
Shuyang Bai,
Jiemiao Chen
Abstract:
We consider empirical measures in a triangular array setup with underlying distributions varying as sample size grows. We study asymptotic properties of multiple integrals with respect to normalized empirical measures. Limit theorems involving series of multiple Wiener-Itô integrals are established.
Submitted 19 December, 2023; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Temperature-heat uncertainty relation for quantum thermometry
Authors:
Ning Zhang,
Si-Yuan Bai,
Chong Chen
Abstract:
We investigate the resource theory for temperature estimation. We demonstrate that it is the fluctuation of heat that fundamentally determines temperature precision through the temperature-heat uncertainty relation. Specifically, we find that heat is divided into trajectory heat and correlation heat, which are associated with the heat exchange along the thermometer's evolution path and the correlation between the thermometer and the sample, respectively. Based on two types of thermometers, we show that both of these heat terms are resources for enhancing temperature precision. Additionally, we demonstrate that the temperature-heat uncertainty relation is consistent with the well-known temperature-energy uncertainty relation in thermodynamics. By clearly distinguishing the resources for enhancing estimation precision, our findings not only explain why various quantum features are crucial for accurate temperature sensing but also provide valuable insights for designing ultrahigh-sensitive quantum thermometers.
Submitted 23 October, 2023;
originally announced October 2023.
-
High and low perturbations of the critical Choquard equation on the Heisenberg group
Authors:
Shujie Bai,
Yueqiang Song,
Dušan D. Repovš
Abstract:
We study the following critical Choquard equation on the Heisenberg group: \begin{equation*}
\begin{cases}
\displaystyle {-Δ_H u }=μ |u|^{q-2}u+\int_Ω \frac{|u(η)|^{Q_λ^{\ast}}} {|η^{-1}ξ|^λ} dη|u|^{Q_λ^{\ast}-2}u &\mbox{in }\ Ω,
u=0 &\mbox{on }\ \partialΩ, \end{cases} \end{equation*} where $Ω\subset \mathbb{H}^N$ is a smooth bounded domain, $Δ_H$ is the Kohn-Laplacian on the Heisenberg group $\mathbb{H}^N$, $1<q<2$ or $2<q<Q_λ^\ast$, $μ>0$, $0<λ<Q=2N+2$, and $Q_λ^{\ast}=\frac{2Q-λ}{Q-2}$ is the critical exponent. Using the concentration compactness principle and critical point theory, we prove that the above problem has at least two positive solutions for $1<q<2$ in the case of low perturbations (small values of $μ$), and has a nontrivial solution for $2<q<Q_λ^\ast$ in the case of high perturbations (large values of $μ$). Moreover, for $1<q<2$, we also show that there is a positive ground state solution, and for $2<q<Q_λ^\ast$, there are at least $n$ pairs of nontrivial weak solutions.
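To make the exponents concrete, here is a small worked instance. The parameter choice $N=1$, $λ=2$ is an arbitrary illustration that satisfies $0<λ<Q$; it is not taken from the paper.

```latex
\begin{align*}
Q &= 2N + 2 = 4, \\
Q_\lambda^{\ast} &= \frac{2Q-\lambda}{Q-2} = \frac{8-2}{4-2} = 3 .
\end{align*}
```

Under this choice the low-perturbation regime is $1<q<2$ and the high-perturbation regime is $2<q<3$.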
Submitted 20 October, 2023;
originally announced October 2023.
-
SUBP: Soft Uniform Block Pruning for 1xN Sparse CNNs Multithreading Acceleration
Authors:
Jingyang Xiang,
Siqi Li,
Jun Chen,
Shipeng Bai,
Yukai Ma,
Guang Dai,
Yong Liu
Abstract:
The study of sparsity in Convolutional Neural Networks (CNNs) has become widespread to compress and accelerate models in environments with limited resources. By constraining N consecutive weights along the output channel to be group-wise non-zero, the recent network with 1$\times$N sparsity has received tremendous popularity for its three outstanding advantages: 1) A large amount of storage space saved by a Block Sparse Row matrix. 2) Excellent performance at high sparsity. 3) Significant speedups on CPUs with Advanced Vector Extensions. Recent work requires selecting and fine-tuning 1$\times$N sparse weights based on dense pre-trained weights, leading to problems such as expensive training cost and memory access, sub-optimal model quality, and unbalanced workload across threads (different sparsity across output channels). To overcome them, this paper proposes a novel Soft Uniform Block Pruning (SUBP) approach to train a uniform 1$\times$N sparse structured network from scratch. Specifically, our approach repeatedly allows pruned blocks to regrow into the network, based on block angular redundancy and importance sampling, in a uniform manner throughout the training process. It not only makes the model less dependent on pre-training and reduces both the model redundancy and the risk of permanently pruning important blocks, but also achieves a balanced workload. Empirically, on ImageNet, comprehensive experiments across various CNN architectures show that our SUBP consistently outperforms existing 1$\times$N and structured sparsity methods based on pre-trained models or training from scratch. Source codes and models are available at https://github.com/JingyangXiang/SUBP.
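The 1$\times$N layout with a uniform number of blocks per group can be illustrated with a minimal sketch. This is not the SUBP algorithm: the function name `uniform_1xn_mask`, the `keep_per_group` parameter, and the L1-norm block score are assumptions made for illustration, standing in for the paper's angular-redundancy and importance-sampling criteria. It only shows what "N consecutive output-channel weights form one block, and every group keeps the same number of blocks" means for a 2-D weight matrix.

```python
def uniform_1xn_mask(weights, n, keep_per_group):
    """Build a 0/1 mask with 1xN block sparsity (illustrative sketch).

    weights: list of rows with shape (c_out, c_in); c_out must be divisible by n.
    A block is the n consecutive output-channel weights sharing one
    input-channel index. Every group of n rows keeps the same number of
    blocks, so each group carries the same amount of work.
    """
    c_out, c_in = len(weights), len(weights[0])
    if c_out % n != 0:
        raise ValueError("c_out must be divisible by n")
    mask = [[0] * c_in for _ in range(c_out)]
    for start in range(0, c_out, n):
        rows = range(start, start + n)
        # Hypothetical criterion: score each block by its L1 norm.
        scores = [sum(abs(weights[r][c]) for r in rows) for c in range(c_in)]
        keep = sorted(range(c_in), key=lambda c: scores[c], reverse=True)
        for c in keep[:keep_per_group]:
            for r in rows:
                mask[r][c] = 1
    return mask


weights = [
    [0.9, 0.1, -0.2, 0.0],
    [0.8, 0.0, 0.3, 0.1],
    [0.0, 0.7, 0.1, -0.6],
    [0.1, -0.5, 0.0, 0.9],
]
mask = uniform_1xn_mask(weights, n=2, keep_per_group=2)
for row in mask:
    print(row)
```

Because each group of N rows keeps exactly `keep_per_group` blocks, threads that each process one group would see the same number of non-zero blocks, which is the balanced-workload property the abstract refers to.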
Submitted 9 October, 2023;
originally announced October 2023.
-
Qwen Technical Report
Authors:
Jinze Bai,
Shuai Bai,
Yunfei Chu,
Zeyu Cui,
Kai Dang,
Xiaodong Deng,
Yang Fan,
Wenbin Ge,
Yu Han,
Fei Huang,
Binyuan Hui,
Luo Ji,
Mei Li,
Junyang Lin,
Runji Lin,
Dayiheng Liu,
Gao Liu,
Chengqiang Lu,
Keming Lu,
Jianxin Ma,
Rui Men,
Xingzhang Ren,
Xuancheng Ren,
Chuanqi Tan,
Sinan Tan
, et al. (23 additional authors not shown)
Abstract:
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.
Submitted 28 September, 2023;
originally announced September 2023.
-
Quantitative Analysis of Sodium Metal Deposition and Interphase in Na Metal Batteries
Authors:
Baharak Sayahpour,
Weikang Li,
Shuang Bai,
Bingyu Lu,
Bing Han,
Yu-Ting Chen,
Grayson Deysher,
Saurabh Parab,
Phillip Ridley,
Ganesh Raghavendran,
Long Hoang Bao Nguyen,
Minghao Zhang,
Ying Shirley Meng
Abstract:
Sodium-ion batteries exhibit significant promise as a viable alternative to current lithium-ion technologies owing to their sustainability, low cost per energy density, reliability, and safety. Despite recent advancements in cathode materials for this category of energy storage systems, the primary challenge in realizing practical applications of sodium-ion systems is the absence of an anode system with high energy density and durability. Although Na metal is the ultimate anode that can facilitate high-energy sodium-ion batteries, its use remains limited due to safety concerns and the high-capacity loss associated with the high reactivity of Na metal. In this study, titration gas chromatography is employed to accurately quantify the sodium inventory loss in ether- and carbonate-based electrolytes. Uniaxial pressure is developed as a powerful tool to control the deposition of sodium metal with dense morphology, thereby enabling high initial coulombic efficiencies. In ether-based electrolytes, the Na metal surface exhibits the presence of a uniform solid electrolyte interphase layer, primarily characterized by favorable inorganic chemical components with close-packed structures. The full cell, utilizing a controlled electroplated sodium metal in ether-based electrolyte, provides capacity retention of 91.84% after 500 cycles at 2C current rate and delivers 86 mAh/g discharge capacity at 45C current rate, suggesting the potential to enable Na metal in the next generation of sodium-ion technologies with specifications close to practical requirements.
Submitted 27 September, 2023;
originally announced September 2023.
-
Franks' dichotomy for toric manifolds, Hofer-Zehnder conjecture, and gauged linear sigma model
Authors:
Shaoyun Bai,
Guangbo Xu
Abstract:
We prove that for any compact toric symplectic manifold, if a Hamiltonian diffeomorphism admits more fixed points, counted homologically, than the total Betti number, then it has infinitely many simple periodic points. This provides a vast generalization of Franks' famous two or infinity dichotomy for periodic orbits of area-preserving diffeomorphisms on the two-sphere, and establishes a conjecture attributed to Hofer-Zehnder in the case of toric manifolds. The key novelty is the application of gauged linear sigma model and its bulk deformations to the study of Hamiltonian dynamics of symplectic quotients.
Submitted 10 January, 2024; v1 submitted 14 September, 2023;
originally announced September 2023.
-
Dataset Condensation via Generative Model
Authors:
David Junhao Zhang,
Heng Wang,
Chuhui Xue,
Rui Yan,
Wenqing Zhang,
Song Bai,
Mike Zheng Shou
Abstract:
Dataset condensation aims to condense a large dataset with a lot of training samples into a small set. Previous methods usually condense the dataset into the pixels format. However, it suffers from slow optimization speed and large number of parameters to be optimized. When increasing image resolutions and classes, the number of learnable parameters grows accordingly, prohibiting condensation methods from scaling up to large datasets with diverse classes. Moreover, the relations among condensed samples have been neglected and hence the feature distribution of condensed samples is often not diverse. To solve these problems, we propose to condense the dataset into another format, a generative model. Such a novel format allows for the condensation of large datasets because the size of the generative model remains relatively stable as the number of classes or image resolution increases. Furthermore, an intra-class and an inter-class loss are proposed to model the relation of condensed samples. Intra-class loss aims to create more diverse samples for each class by pushing each sample away from the others of the same class. Meanwhile, inter-class loss increases the discriminability of samples by widening the gap between the centers of different classes. Extensive comparisons with state-of-the-art methods and our ablation studies confirm the effectiveness of our method and its individual component. To our best knowledge, we are the first to successfully conduct condensation on ImageNet-1k.
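The two relational losses described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the use of Euclidean distance, and the "negative mean distance" form are hypothetical choices, not the formulation used in the paper. The sketch assumes at least two classes and at least two samples per class.

```python
import math
from itertools import combinations


def _dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _center(points):
    """Per-dimension mean of a list of feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def intra_class_loss(features_by_class):
    """Negative mean pairwise distance within each class.

    Minimizing it pushes samples of the same class apart (more diversity).
    """
    dists = [
        _dist(a, b)
        for points in features_by_class.values()
        for a, b in combinations(points, 2)
    ]
    return -sum(dists) / len(dists)


def inter_class_loss(features_by_class):
    """Negative mean distance between class centers.

    Minimizing it widens the gap between the centers of different classes.
    """
    centers = [_center(points) for points in features_by_class.values()]
    dists = [_dist(a, b) for a, b in combinations(centers, 2)]
    return -sum(dists) / len(dists)


tight = {"cat": [[0.0, 0.0], [0.0, 1.0]], "dog": [[1.0, 0.0], [1.0, 1.0]]}
spread = {"cat": [[0.0, 0.0], [0.0, 3.0]], "dog": [[4.0, 0.0], [4.0, 3.0]]}

print(intra_class_loss(tight), intra_class_loss(spread))
print(inter_class_loss(tight), inter_class_loss(spread))
```

In this toy data, `spread` has both more diverse samples within each class and better separated class centers than `tight`, so both losses are lower for it.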
Submitted 14 September, 2023;
originally announced September 2023.
-
Ethical Framework for Harnessing the Power of AI in Healthcare and Beyond
Authors:
Sidra Nasir,
Rizwan Ahmed Khan,
Samita Bai
Abstract:
In the past decade, the deployment of deep learning (Artificial Intelligence (AI)) methods has become pervasive across a spectrum of real-world applications, often in safety-critical contexts. This comprehensive research article rigorously investigates the ethical dimensions intricately linked to the rapid evolution of AI technologies, with a particular focus on the healthcare domain. Delving deeply, it explores a multitude of facets including transparency, adept data management, human oversight, educational imperatives, and international collaboration within the realm of AI advancement. Central to this article is the proposition of a conscientious AI framework, meticulously crafted to accentuate values of transparency, equity, answerability, and a human-centric orientation. The second contribution of the article is the in-depth and thorough discussion of the limitations inherent to AI systems. It astutely identifies potential biases and the intricate challenges of navigating multifaceted contexts. Lastly, the article unequivocally accentuates the pressing need for globally standardized AI ethics principles and frameworks. Simultaneously, it aptly illustrates the adaptability of the ethical framework proposed herein, positioned skillfully to surmount emergent challenges.
Submitted 31 August, 2023;
originally announced September 2023.
-
TouchStone: Evaluating Vision-Language Models by Language Models
Authors:
Shuai Bai,
Shusheng Yang,
Jinze Bai,
Peng Wang,
Xingxuan Zhang,
Junyang Lin,
Xinggang Wang,
Chang Zhou,
Jingren Zhou
Abstract:
Large vision-language models (LVLMs) have recently witnessed rapid advancements, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptor with large language models (LLMs). However, current assessments mainly focus on recognizing and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual storytelling abilities. In this paper, we propose an evaluation method that uses strong LLMs as judges to comprehensively evaluate the various abilities of LVLMs. Firstly, we construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks. This dataset not only covers fundamental recognition and comprehension but also extends to literary creation. Secondly, by integrating detailed image annotations we effectively transform the multimodal input content into a form understandable by LLMs. This enables us to employ advanced LLMs for directly evaluating the quality of the multimodal dialogue without requiring human intervention. Through validation, we demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone, aligning with human preferences. We hope our work can serve as a touchstone for LVLMs' evaluation and pave the way for building stronger LVLMs. The evaluation code is available at https://github.com/OFA-Sys/TouchStone.
Submitted 4 September, 2023; v1 submitted 31 August, 2023;
originally announced August 2023.
-
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Authors:
Jinze Bai,
Shuai Bai,
Shusheng Yang,
Shijie Wang,
Sinan Tan,
Peng Wang,
Junyang Lin,
Chang Zhou,
Jingren Zhou
Abstract:
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.
Submitted 12 October, 2023; v1 submitted 24 August, 2023;
originally announced August 2023.
-
On $p$-Laplacian Kirchhoff-Schrödinger-Poisson type systems with critical growth on the Heisenberg group
Authors:
Shujie Bai,
Yueqiang Song,
Dušan D. Repovš
Abstract:
In this article, we investigate the Kirchhoff-Schrödinger-Poisson type systems on the Heisenberg group of the following form: \begin{equation*} \left\{ \begin{array}{lll} {-(a+b\int_Ω|\nabla_{H} u|^{p}dξ)Δ_{H,p}u-μφ|u|^{p-2}u}=λ|u|^{q-2}u+|u|^{Q^{\ast}-2}u &\mbox{in}\ Ω, \\ -Δ_{H}φ=|u|^{p} &\mbox{in}\ Ω, \\ u=φ=0 &\mbox{on}\ \partialΩ, \end{array} \right. \end{equation*} where $a,b$ are positive real numbers, $Ω\subset \mathbb{H}^N$ is a bounded region with smooth boundary, $1<p<Q$, $Q = 2N + 2$ is the homogeneous dimension of the Heisenberg group $\mathbb{H}^N$, $Q^{\ast}=\frac{pQ}{Q-p}$, $q\in(2p, Q^{\ast})$, and $Δ_{H,p}u=\mbox{div}(|\nabla_{H} u|^{p-2}\nabla_{H} u)$ is the $p$-horizontal Laplacian. Under some appropriate conditions for the parameters $μ$ and $λ$, we establish existence and multiplicity results for the system above. To some extent, we generalize the results of An and Liu (Israel J. Math., 2020) and Liu et al. (Adv. Nonlinear Anal., 2022).
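The interval $q\in(2p, Q^{\ast})$ is non-empty only for a restricted range of $p$. The short check below is a reader's arithmetic from the definitions above (using $p>0$ and $Q-p>0$), not a statement taken from the paper; the numerical instance $N=1$, $p=3$ is an arbitrary illustration.

```latex
\begin{align*}
2p < Q^{\ast} = \frac{pQ}{Q-p}
\;\iff\; 2(Q-p) < Q
\;\iff\; p > \frac{Q}{2}, \\
N = 1,\ p = 3:\quad Q = 4,\quad Q^{\ast} = \frac{3\cdot 4}{4-3} = 12,\quad q \in (6, 12).
\end{align*}
```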
Submitted 19 August, 2023;
originally announced August 2023.
-
Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning
Authors:
Shipeng Bai,
Jun Chen,
Xintian Shen,
Yixuan Qian,
Yong Liu
Abstract:
Structured pruning and quantization are promising approaches for reducing the inference time and memory footprint of neural networks. However, most existing methods require the original training dataset to fine-tune the model. This not only brings heavy resource consumption but is also impossible for applications with sensitive or proprietary data due to privacy and security concerns. Therefore, a few data-free methods have been proposed to address this problem, but they perform data-free pruning and quantization separately, which does not explore the complementarity of pruning and quantization. In this paper, we propose a novel framework named Unified Data-Free Compression (UDFC), which performs pruning and quantization simultaneously without any data or fine-tuning process. Specifically, UDFC starts with the assumption that the partial information of a damaged (e.g., pruned or quantized) channel can be preserved by a linear combination of other channels, and then derives the reconstruction form from this assumption to restore the information loss due to compression. Finally, we formulate the reconstruction error between the original network and its compressed network, and theoretically deduce the closed-form solution. We evaluate UDFC on the large-scale image classification task and obtain significant improvements over various network architectures and compression methods. For example, we achieve a 20.54% accuracy improvement on the ImageNet dataset compared to the SOTA method with a 30% pruning ratio and 6-bit quantization on ResNet-34.
Submitted 14 August, 2023;
originally announced August 2023.
-
Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks
Authors:
David Junhao Zhang,
Mutian Xu,
Chuhui Xue,
Wenqing Zhang,
Xiaoguang Han,
Song Bai,
Mike Zheng Shou
Abstract:
Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection, and poses additional challenges due to concerns regarding data privacy. Recently, synthetic images generated by text-to-image diffusion models have shown great potential for benefiting image recognition. Although promising, there has been inadequate exploration dedicated to unsupervised learning on diffusion-generated images. To address this, we start by uncovering that diffusion models' cross-attention layers inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent unsupervised learning techniques (i.e., contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions by fully exploiting the aforementioned free attention masks. Our approach is validated through extensive experiments that show consistent improvements in baseline models across various downstream tasks, including image classification, detection, segmentation, and image-text retrieval. By utilizing our method, it is possible to close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios.
Submitted 13 August, 2023;
originally announced August 2023.