Search | arXiv e-print repository

JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models

Authors: Haibo **, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang

Abstract: The rapid evolution of artificial intelligence (AI) through developments in Large Language Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements across various technological domains. While these models enhance capabilities in natural language processing and visual interactive tasks, their growing adoption raises critical concerns regarding security and ethical alignm… ▽ More The rapid evolution of artificial intelligence (AI) through developments in Large Language Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements across various technological domains. While these models enhance capabilities in natural language processing and visual interactive tasks, their growing adoption raises critical concerns regarding security and ethical alignment. This survey provides an extensive review of the emerging field of jailbreaking--deliberately circumventing the ethical and operational boundaries of LLMs and VLMs--and the consequent development of defense mechanisms. Our study categorizes jailbreaks into seven distinct types and elaborates on defense strategies that address these vulnerabilities. Through this comprehensive examination, we identify research gaps and propose directions for future studies to enhance the security frameworks of LLMs and VLMs. Our findings underscore the necessity for a unified perspective that integrates both jailbreak strategies and defensive solutions to foster a robust, secure, and reliable environment for the next generation of language models. More details can be found on our website: \url{https://chonghan-chen.com/llm-jailbreak-zoo-survey/}. △ Less

Submitted 25 June, 2024; originally announced July 2024.

Comments: 44 pages

arXiv:2406.19844 [pdf, other]

StreamMOTP: Streaming and Unified Framework for Joint 3D Multi-Object Tracking and Trajectory Prediction

Authors: Jiaheng Zhuang, Guoan Wang, Siyu Zhang, Xiyang Wang, Hangning Zhou, Ziyao Xu, Chi Zhang, Zhiheng Li

Abstract: 3D multi-object tracking and trajectory prediction are two crucial modules in autonomous driving systems. Generally, the two tasks are handled separately in traditional paradigms and a few methods have started to explore modeling these two tasks in a joint manner recently. However, these approaches suffer from the limitations of single-frame training and inconsistent coordinate representations bet… ▽ More 3D multi-object tracking and trajectory prediction are two crucial modules in autonomous driving systems. Generally, the two tasks are handled separately in traditional paradigms and a few methods have started to explore modeling these two tasks in a joint manner recently. However, these approaches suffer from the limitations of single-frame training and inconsistent coordinate representations between tracking and prediction tasks. In this paper, we propose a streaming and unified framework for joint 3D Multi-Object Tracking and trajectory Prediction (StreamMOTP) to address the above challenges. Firstly, we construct the model in a streaming manner and exploit a memory bank to preserve and leverage the long-term latent features for tracked objects more effectively. Secondly, a relative spatio-temporal positional encoding strategy is introduced to bridge the gap of coordinate representations between the two tasks and maintain the pose-invariance for trajectory prediction. Thirdly, we further improve the quality and consistency of predicted trajectories with a dual-stream predictor. We conduct extensive experiments on popular nuSences dataset and the experimental results demonstrate the effectiveness and superiority of StreamMOTP, which outperforms previous methods significantly on both tasks. Furthermore, we also prove that the proposed framework has great potential and advantages in actual applications of autonomous driving. △ Less

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.06959 [pdf, other]

Unleashing the Denoising Capability of Diffusion Prior for Solving Inverse Problems

Authors: Jiawei Zhang, Jiaxin Zhuang, Cheng **, Gen Li, Yuantao Gu

Abstract: The recent emergence of diffusion models has significantly advanced the precision of learnable priors, presenting innovative avenues for addressing inverse problems. Since inverse problems inherently entail maximum a posteriori estimation, previous works have endeavored to integrate diffusion priors into the optimization frameworks. However, prevailing optimization-based inverse algorithms primari… ▽ More The recent emergence of diffusion models has significantly advanced the precision of learnable priors, presenting innovative avenues for addressing inverse problems. Since inverse problems inherently entail maximum a posteriori estimation, previous works have endeavored to integrate diffusion priors into the optimization frameworks. However, prevailing optimization-based inverse algorithms primarily exploit the prior information within the diffusion models while neglecting their denoising capability. To bridge this gap, this work leverages the diffusion process to reframe noisy inverse problems as a two-variable constrained optimization task by introducing an auxiliary optimization variable. By employing gradient truncation, the projection gradient descent method is efficiently utilized to solve the corresponding optimization problem. The proposed algorithm, termed ProjDiff, effectively harnesses the prior information and the denoising capability of a pre-trained diffusion model within the optimization framework. Extensive experiments on the image restoration tasks and source separation and partial generation tasks demonstrate that ProjDiff exhibits superior performance across various linear and nonlinear inverse problems, highlighting its potential for practical applications. Code is available at https://github.com/weigerzan/ProjDiff/. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.05392 [pdf, other]

Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas

Authors: Chengyuan Deng, Yiqun Duan, Xin **, Heng Chang, Yijun Tian, Han Liu, Henry Peng Zou, Yiqiao **, Yijia Xiao, Yichen Wang, Shenghao Wu, Zongxing Xie, Kuofeng Gao, Sihong He, Jun Zhuang, Lu Cheng, Haohan Wang

Abstract: Large Language Models (LLMs) have achieved unparalleled success across diverse language modeling tasks in recent years. However, this progress has also intensified ethical concerns, impacting the deployment of LLMs in everyday contexts. This paper provides a comprehensive survey of ethical challenges associated with LLMs, from longstanding issues such as copyright infringement, systematic bias, an… ▽ More Large Language Models (LLMs) have achieved unparalleled success across diverse language modeling tasks in recent years. However, this progress has also intensified ethical concerns, impacting the deployment of LLMs in everyday contexts. This paper provides a comprehensive survey of ethical challenges associated with LLMs, from longstanding issues such as copyright infringement, systematic bias, and data privacy, to emerging problems like truthfulness and social norms. We critically analyze existing research aimed at understanding, examining, and mitigating these ethical risks. Our survey underscores integrating ethical standards and societal values into the development of LLMs, thereby guiding the development of responsible and ethically aligned language models. △ Less

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.03368 [pdf, other]

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

Authors: David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba O. Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Chukwuneke, Happy Buzaaba, Blessing Sibanda, Godson Kalipe, Jonathan Mukiibi, Salomon Kabongo, Foutse Yuehgoh, Mmasibidi Setaka, Lolwethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei, Sokhar Samb, Tadesse Kebede Guge , et al. (1 additional authors not shown)

Abstract: Despite the widespread adoption of Large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoB… ▽ More Despite the widespread adoption of Large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages covering three tasks: natural language inference~(AfriXNLI), mathematical reasoning~(AfriMGSM), and multi-choice knowledge-based QA~(AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings~(where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages~(such as English and French) and low-resource African languages. We observe a significant performance gap between open and proprietary models, with the highest performing open model, Aya-101 only at 58\% of the best-performing proprietary model GPT-4o performance. Machine translating the test set to English before evaluation helped to close the gap for larger models that are English-centric, like LLaMa 3 70B. These findings suggest that more efforts are needed to develop and adapt LLMs for African languages. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: Under review

arXiv:2406.03097 [pdf, other]

Enhancing the Resilience of Graph Neural Networks to Topological Perturbations in Sparse Graphs

Authors: Shuqi He, Jun Zhuang, Ding Wang, Luyao Peng, Jun Song

Abstract: Graph neural networks (GNNs) have been extensively employed in node classification. Nevertheless, recent studies indicate that GNNs are vulnerable to topological perturbations, such as adversarial attacks and edge disruptions. Considerable efforts have been devoted to mitigating these challenges. For example, pioneering Bayesian methodologies, including GraphSS and LlnDT, incorporate Bayesian labe… ▽ More Graph neural networks (GNNs) have been extensively employed in node classification. Nevertheless, recent studies indicate that GNNs are vulnerable to topological perturbations, such as adversarial attacks and edge disruptions. Considerable efforts have been devoted to mitigating these challenges. For example, pioneering Bayesian methodologies, including GraphSS and LlnDT, incorporate Bayesian label transitions and topology-based label sampling to strengthen the robustness of GNNs. However, GraphSS is hindered by slow convergence, while LlnDT faces challenges in sparse graphs. To overcome these limitations, we propose a novel label inference framework, TraTopo, which combines topology-driven label propagation, Bayesian label transitions, and link analysis via random walks. TraTopo significantly surpasses its predecessors on sparse graphs by utilizing random walk sampling, specifically targeting isolated nodes for link prediction, thus enhancing its effectiveness in topological sampling contexts. Additionally, TraTopo employs a shortest-path strategy to refine link prediction, thereby reducing predictive overhead and improving label inference accuracy. Empirical evaluations highlight TraTopo's superiority in node classification, significantly exceeding contemporary GCN models in accuracy. △ Less

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.01264 [pdf, other]

FreeTumor: Advance Tumor Segmentation via Large-Scale Tumor Synthesis

Authors: Linshan Wu, Jiaxin Zhuang, Xuefeng Ni, Hao Chen

Abstract: AI-driven tumor analysis has garnered increasing attention in healthcare. However, its progress is significantly hindered by the lack of annotated tumor cases, which requires radiologists to invest a lot of effort in collecting and annotation. In this paper, we introduce a highly practical solution for robust tumor synthesis and segmentation, termed FreeTumor, which refers to annotation-free synth… ▽ More AI-driven tumor analysis has garnered increasing attention in healthcare. However, its progress is significantly hindered by the lack of annotated tumor cases, which requires radiologists to invest a lot of effort in collecting and annotation. In this paper, we introduce a highly practical solution for robust tumor synthesis and segmentation, termed FreeTumor, which refers to annotation-free synthetic tumors and our desire to free patients that suffering from tumors. Instead of pursuing sophisticated technical synthesis modules, we aim to design a simple yet effective tumor synthesis paradigm to unleash the power of large-scale data. Specifically, FreeTumor advances existing methods mainly from three aspects: (1) Existing methods only leverage small-scale labeled data for synthesis training, which limits their ability to generalize well on unseen data from different sources. To this end, we introduce the adversarial training strategy to leverage large-scale and diversified unlabeled data in synthesis training, significantly improving tumor synthesis. (2) Existing methods largely ignored the negative impact of low-quality synthetic tumors in segmentation training. Thus, we employ an adversarial-based discriminator to automatically filter out the low-quality synthetic tumors, which effectively alleviates their negative impact. (3) Existing methods only used hundreds of cases in tumor segmentation. In FreeTumor, we investigate the data scaling law in tumor segmentation by scaling up the dataset to 11k cases. Extensive experiments demonstrate the superiority of FreeTumor, e.g., on three tumor segmentation benchmarks, average $+8.9\%$ DSC over the baseline that only using real tumors and $+6.6\%$ DSC over the state-of-the-art tumor synthesis method. Code will be available. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: Preprint

arXiv:2405.19590 [pdf, other]

Weights Augmentation: it has never ever ever ever let her model down

Authors: Junbin Zhuang, Guiguang Din, Yunyi Yan

Abstract: Weight play an essential role in deep learning network models. Unlike network structure design, this article proposes the concept of weight augmentation, focusing on weight exploration. The core of Weight Augmentation Strategy (WAS) is to adopt random transformed weight coefficients training and transformed coefficients, named Shadow Weight(SW), for networks that can be used to calculate loss func… ▽ More Weight play an essential role in deep learning network models. Unlike network structure design, this article proposes the concept of weight augmentation, focusing on weight exploration. The core of Weight Augmentation Strategy (WAS) is to adopt random transformed weight coefficients training and transformed coefficients, named Shadow Weight(SW), for networks that can be used to calculate loss function to affect parameter updates. However, stochastic gradient descent is applied to Plain Weight(PW), which is referred to as the original weight of the network before the random transformation. During training, numerous SW collectively form high-dimensional space, while PW is directly learned from the distribution of SW instead of the data. The weight of the accuracy-oriented mode(AOM) relies on PW, which guarantees the network is highly robust and accurate. The desire-oriented mode(DOM) weight uses SW, which is determined by the network model's unique functions based on WAT's performance desires, such as lower computational complexity, lower sensitivity to particular data, etc. The dual mode be switched at anytime if needed. WAT extends the augmentation technique from data augmentation to weight, and it is easy to understand and implement, but it can improve almost all networks amazingly. Our experimental results show that convolutional neural networks, such as VGG-16, ResNet-18, ResNet-34, GoogleNet, MobilementV2, and Efficientment-Lite, can benefit much at little or no cost. The accuracy of models is on the CIFAR100 and CIFAR10 datasets, which can be evaluated to increase by 7.32\% and 9.28\%, respectively, with the highest values being 13.42\% and 18.93\%, respectively. In addition, DOM can reduce floating point operations (FLOPs) by up to 36.33\%. The code is available at https://github.com/zlearh/Weight-Augmentation-Technology. △ Less

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.01606 [pdf, other]

Improving Trainability of Variational Quantum Circuits via Regularization Strategies

Authors: Jun Zhuang, Jack Cunningham, Chaowen Guan

Abstract: In the era of noisy intermediate-scale quantum (NISQ), variational quantum circuits (VQCs) have been widely applied in various domains, advancing the superiority of quantum circuits against classic models. Similar to classic models, regular VQCs can be optimized by various gradient-based methods. However, the optimization may be initially trapped in barren plateaus or eventually entangled in saddl… ▽ More In the era of noisy intermediate-scale quantum (NISQ), variational quantum circuits (VQCs) have been widely applied in various domains, advancing the superiority of quantum circuits against classic models. Similar to classic models, regular VQCs can be optimized by various gradient-based methods. However, the optimization may be initially trapped in barren plateaus or eventually entangled in saddle points during training. These gradient issues can significantly undermine the trainability of VQC. In this work, we propose a strategy that regularizes model parameters with prior knowledge of the train data and Gaussian noise diffusion. We conduct ablation studies to verify the effectiveness of our strategy across four public datasets and demonstrate that our method can improve the trainability of VQCs against the above-mentioned gradient issues. △ Less

Submitted 1 May, 2024; originally announced May 2024.

Comments: preprint, under review. TL;DR: we propose a regularization strategy to improve the trainability of VQCs

arXiv:2404.15760 [pdf, other]

Debiasing Machine Unlearning with Counterfactual Examples

Authors: Ziheng Chen, Jia Wang, Jun Zhuang, Abbavaram Gowtham Reddy, Fabrizio Silvestri, ** Huang, Kaushiki Nag, Kun Kuang, Xin Ning, Gabriele Tolomei

Abstract: The right to be forgotten (RTBF) seeks to safeguard individuals from the enduring effects of their historical actions by implementing machine-learning techniques. These techniques facilitate the deletion of previously acquired knowledge without requiring extensive model retraining. However, they often overlook a critical issue: unlearning processes bias. This bias emerges from two main sources: (1… ▽ More The right to be forgotten (RTBF) seeks to safeguard individuals from the enduring effects of their historical actions by implementing machine-learning techniques. These techniques facilitate the deletion of previously acquired knowledge without requiring extensive model retraining. However, they often overlook a critical issue: unlearning processes bias. This bias emerges from two main sources: (1) data-level bias, characterized by uneven data removal, and (2) algorithm-level bias, which leads to the contamination of the remaining dataset, thereby degrading model accuracy. In this work, we analyze the causal factors behind the unlearning process and mitigate biases at both data and algorithmic levels. Typically, we introduce an intervention-based approach, where knowledge to forget is erased with a debiased dataset. Besides, we guide the forgetting procedure by leveraging counterfactual examples, as they maintain semantic data consistency without hurting performance on the remaining dataset. Experimental results demonstrate that our method outperforms existing machine unlearning baselines on evaluation metrics. △ Less

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.15580 [pdf, other]

MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis

Authors: Jiaxin Zhuang, Linshan Wu, Qiong Wang, Varut Vardhanabhuti, Lin Luo, Hao Chen

Abstract: The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Mask AutoEncoder (MAE) for feature pre-training can further unleash the potential of ViT on various medical vision tasks. However, due to large spatial sizes with much higher dimensions of 3D medical images, the lack of hierarchical design for MAE may hinder the per… ▽ More The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Mask AutoEncoder (MAE) for feature pre-training can further unleash the potential of ViT on various medical vision tasks. However, due to large spatial sizes with much higher dimensions of 3D medical images, the lack of hierarchical design for MAE may hinder the performance of downstream tasks. In this paper, we propose a novel \textit{Mask in Mask (MiM)} pre-training framework for 3D medical images, which aims to advance MAE by learning discriminative representation from hierarchical visual tokens across varying scales. We introduce multiple levels of granularity for masked inputs from the volume, which are then reconstructed simultaneously ranging at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to adjacent level volumes to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to enhance the hierarchical representation learning efficiently during the pre-training. MiM was pre-trained on a large scale of available 3D volumetric images, \textit{i.e.,} Computed Tomography (CT) images containing various body parts. Extensive experiments on thirteen public datasets demonstrate the superiority of MiM over other SSL methods in organ/lesion/tumor segmentation and disease classification. We further scale up the MiM to large pre-training datasets with more than 10k volumes, showing that large-scale pre-training can further enhance the performance of downstream tasks. The improvement also concluded that the research community should pay more attention to the scale of the pre-training dataset towards the healthcare foundation model for 3D medical images. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: submitted to journal

arXiv:2404.06039 [pdf, other]

Breathing New Life into Existing Visualizations: A Natural Language-Driven Manipulation Framework

Authors: Can Liu, Jiacheng Yu, Yuhan Guo, Jiayi Zhuang, Yuchu Luo, Xiaoru Yuan

Abstract: We propose an approach to manipulate existing interactive visualizations to answer users' natural language queries. We analyze the natural language tasks and propose a design space of a hierarchical task structure, which allows for a systematic decomposition of complex queries. We introduce a four-level visualization manipulation space to facilitate in-situ manipulations for visualizations, enabli… ▽ More We propose an approach to manipulate existing interactive visualizations to answer users' natural language queries. We analyze the natural language tasks and propose a design space of a hierarchical task structure, which allows for a systematic decomposition of complex queries. We introduce a four-level visualization manipulation space to facilitate in-situ manipulations for visualizations, enabling a fine-grained control over the visualization elements. Our methods comprise two essential components: the natural language-to-task translator and the visualization manipulation parser. The natural language-to-task translator employs advanced NLP techniques to extract structured, hierarchical tasks from natural language queries, even those with varying degrees of ambiguity. The visualization manipulation parser leverages the hierarchical task structure to streamline these tasks into a sequence of atomic visualization manipulations. To illustrate the effectiveness of our approach, we provide real-world examples and experimental results. The evaluation highlights the precision of our natural language parsing capabilities and underscores the smooth transformation of visualization manipulations. △ Less

Submitted 9 April, 2024; originally announced April 2024.

Comments: 21 pages

arXiv:2404.02065 [pdf, other]

doi 10.1109/TMM.2024.3374594

Multi-Level Label Correction by Distilling Proximate Patterns for Semi-supervised Semantic Segmentation

Authors: Hui Xiao, Yuting Hong, Li Dong, Diqun Yan, Jiayan Zhuang, Junjie Xiong, Dongtai Liang, Chengbin Peng

Abstract: Semi-supervised semantic segmentation relieves the reliance on large-scale labeled data by leveraging unlabeled data. Recent semi-supervised semantic segmentation approaches mainly resort to pseudo-labeling methods to exploit unlabeled data. However, unreliable pseudo-labeling can undermine the semi-supervision processes. In this paper, we propose an algorithm called Multi-Level Label Correction (… ▽ More Semi-supervised semantic segmentation relieves the reliance on large-scale labeled data by leveraging unlabeled data. Recent semi-supervised semantic segmentation approaches mainly resort to pseudo-labeling methods to exploit unlabeled data. However, unreliable pseudo-labeling can undermine the semi-supervision processes. In this paper, we propose an algorithm called Multi-Level Label Correction (MLLC), which aims to use graph neural networks to capture structural relationships in Semantic-Level Graphs (SLGs) and Class-Level Graphs (CLGs) to rectify erroneous pseudo-labels. Specifically, SLGs represent semantic affinities between pairs of pixel features, and CLGs describe classification consistencies between pairs of pixel labels. With the support of proximate pattern information from graphs, MLLC can rectify incorrectly predicted pseudo-labels and can facilitate discriminative feature representations. We design an end-to-end network to train and perform this effective label corrections mechanism. Experiments demonstrate that MLLC can significantly improve supervised baselines and outperforms state-of-the-art approaches in different scenarios on Cityscapes and PASCAL VOC 2012 datasets. Specifically, MLLC improves the supervised baseline by at least 5% and 2% with DeepLabV2 and DeepLabV3+ respectively under different partition protocols. △ Less

Submitted 9 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 12 pages, 8 figures. IEEE Transactions on Multimedia, 2024

arXiv:2403.01976 [pdf, other]

SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis

Authors: Hengxing Cai, Xiaochen Cai, Junhan Chang, Sihang Li, Lin Yao, Changxin Wang, Zhifeng Gao, Hongshuai Wang, Yongge Li, Mujie Lin, Shuwen Yang, Jiankun Wang, Mingjun Xu, ** Huang, Fang Xi, Jiaxi Zhuang, Yuqi Yin, Yaqi Li, Changhong Chen, Zheng Cheng, Zifeng Zhao, Linfeng Zhang, Guolin Ke

Abstract: Recent breakthroughs in Large Language Models (LLMs) have revolutionized natural language understanding and generation, sparking significant interest in applying them to scientific literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency of LLMs in this domain, particularly in scenarios requiring higher-level abilities beyond mere memorization and the handling… ▽ More Recent breakthroughs in Large Language Models (LLMs) have revolutionized natural language understanding and generation, sparking significant interest in applying them to scientific literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency of LLMs in this domain, particularly in scenarios requiring higher-level abilities beyond mere memorization and the handling of multimodal data. In response to this gap, we introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis. SciAssess aims to thoroughly assess the efficacy of LLMs by focusing on their capabilities in Memorization (L1), Comprehension (L2), and Analysis \& Reasoning (L3). It encompasses a variety of tasks drawn from diverse scientific fields, including fundamental science, alloy materials, biomedicine, drug discovery, and organic materials. To ensure the reliability of SciAssess, rigorous quality control measures have been implemented, ensuring accuracy, anonymization, and compliance with copyright standards. SciAssess evaluates 11 LLMs, including GPT, Claude, and Gemini, highlighting their strengths and areas for improvement. This evaluation supports the ongoing development of LLM applications in the analysis of scientific literature. SciAssess and its resources are available at \url{https://sci-assess.github.io/}. △ Less

Submitted 18 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.10409 [pdf, other]

Understanding Survey Paper Taxonomy about Large Language Models via Graph Representation Learning

Authors: Jun Zhuang, Casey Kennington

Abstract: As new research on Large Language Models (LLMs) continues, it is difficult to keep up with new research and models. To help researchers synthesize the new research many have written survey papers, but even those have become numerous. In this paper, we develop a method to automatically assign survey papers to a taxonomy. We collect the metadata of 144 LLM survey papers and explore three paradigms t… ▽ More As new research on Large Language Models (LLMs) continues, it is difficult to keep up with new research and models. To help researchers synthesize the new research many have written survey papers, but even those have become numerous. In this paper, we develop a method to automatically assign survey papers to a taxonomy. We collect the metadata of 144 LLM survey papers and explore three paradigms to classify papers within the taxonomy. Our work indicates that leveraging graph structure information on co-category graphs can significantly outperform the language models in two paradigms; pre-trained language models' fine-tuning and zero-shot/few-shot classifications using LLMs. We find that our model surpasses an average human recognition level and that fine-tuning LLMs using weak labels generated by a smaller model, such as the GCN in this study, can be more effective than using ground-truth labels, revealing the potential of weak-to-strong generalization in the taxonomy classification task. △ Less

Submitted 15 February, 2024; originally announced February 2024.

Comments: TL;DR: We collected metadata about LLM surveys and developed a method for categorizing them into a taxonomy, indicating the superiority of graph representation learning over language models and revealing the efficacy of fine-tuning using weak labels

arXiv:2401.14828 [pdf, other]

TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts

Authors: **gyu Zhuang, Di Kang, Yan-Pei Cao, Guanbin Li, Liang Lin, Ying Shan

Abstract: Text-driven 3D scene editing has gained significant attention owing to its convenience and user-friendliness. However, existing methods still lack accurate control of the specified appearance and location of the editing result due to the inherent limitations of the text description. To this end, we propose a 3D scene editing framework, TIPEditor, that accepts both text and image prompts and a 3D b… ▽ More Text-driven 3D scene editing has gained significant attention owing to its convenience and user-friendliness. However, existing methods still lack accurate control of the specified appearance and location of the editing result due to the inherent limitations of the text description. To this end, we propose a 3D scene editing framework, TIPEditor, that accepts both text and image prompts and a 3D bounding box to specify the editing region. With the image prompt, users can conveniently specify the detailed appearance/style of the target content in complement to the text description, enabling accurate control of the appearance. Specifically, TIP-Editor employs a stepwise 2D personalization strategy to better learn the representation of the existing scene and the reference image, in which a localization loss is proposed to encourage correct object placement as specified by the bounding box. Additionally, TIPEditor utilizes explicit and flexible 3D Gaussian splatting as the 3D representation to facilitate local editing while kee** the background unchanged. Extensive experiments have demonstrated that TIP-Editor conducts accurate editing following the text and image prompts in the specified bounding box region, consistently outperforming the baselines in editing quality, and the alignment to the prompts, qualitatively and quantitatively. △ Less

Submitted 25 April, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

Comments: Accpeted by Siggraph 2024 & ACM Transactions on Graphics

arXiv:2401.11261 [pdf, other]

Diffusion Model Conditioning on Gaussian Mixture Model and Negative Gaussian Mixture Gradient

Authors: Weiguo Lu, Xuan Wu, Deng Ding, **qiao Duan, Jirong Zhuang, Gangnan Yuan

Abstract: Diffusion models (DMs) are a type of generative model that has a huge impact on image synthesis and beyond. They achieve state-of-the-art generation results in various generative tasks. A great diversity of conditioning inputs, such as text or bounding boxes, are accessible to control the generation. In this work, we propose a conditioning mechanism utilizing Gaussian mixture models (GMMs) as feat… ▽ More Diffusion models (DMs) are a type of generative model that has a huge impact on image synthesis and beyond. They achieve state-of-the-art generation results in various generative tasks. A great diversity of conditioning inputs, such as text or bounding boxes, are accessible to control the generation. In this work, we propose a conditioning mechanism utilizing Gaussian mixture models (GMMs) as feature conditioning to guide the denoising process. Based on set theory, we provide a comprehensive theoretical analysis that shows that conditional latent distribution based on features and classes is significantly different, so that conditional latent distribution on features produces fewer defect generations than conditioning on classes. Two diffusion models conditioned on the Gaussian mixture model are trained separately for comparison. Experiments support our findings. A novel gradient function called the negative Gaussian mixture gradient (NGMG) is proposed and applied in diffusion model training with an additional classifier. Training stability has improved. We also theoretically prove that NGMG shares the same benefit as the Earth Mover distance (Wasserstein) as a more sensible cost function when learning distributions supported by low-dimensional manifolds. △ Less

Submitted 1 February, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

arXiv:2401.10417 [pdf, other]

doi 10.1145/3626202.3637569

SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration

Authors: **ming Zhuang, Zhuo** Yang, Shixin Ji, Heng Huang, Alex K. Jones, **gtong Hu, Yiyu Shi, Peipei Zhou

Abstract: With the increase in the computation intensity of the chip, the mismatch between computation layer shapes and the available computation resource significantly limits the utilization of the chip. Driven by this observation, prior works discuss spatial accelerators or dataflow architecture to maximize the throughput. However, using spatial accelerators could potentially increase the execution latenc… ▽ More With the increase in the computation intensity of the chip, the mismatch between computation layer shapes and the available computation resource significantly limits the utilization of the chip. Driven by this observation, prior works discuss spatial accelerators or dataflow architecture to maximize the throughput. However, using spatial accelerators could potentially increase the execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launch one monolithic accelerator, and (2) spatially launch multiple accelerators. From the observations, we find that there is a latency throughput tradeoff between these two execution models, and combining these two strategies together can give us a more efficient latency throughput Pareto front. To achieve this, we propose spatial sequential architecture (SSR) and SSR design automation framework to explore both strategies together when deploying deep learning inference. We use the 7nm AMD Versal ACAP VCK190 board to implement SSR accelerators for four end-to-end transformer-based deep learning models. SSR achieves average throughput gains of 2.53x, 35.71x, and 14.20x under different batch sizes compared to the 8nm Nvidia GPU A10G, 16nm AMD FPGAs ZCU102, and U250. The average energy efficiency gains are 8.51x, 6.75x, and 21.22x, respectively. Compared with the sequential-only solution and spatial-only solution on VCK190, our spatial-sequential-hybrid solutions achieve higher throughput under the same latency requirement and lower latency under the same throughput requirement. We also use SSR analytical models to demonstrate how to use SSR to optimize solutions on other computing platforms, e.g., 14nm Intel Stratix 10 NX. △ Less

Submitted 18 February, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Journal ref: 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '24)

arXiv:2401.00695 [pdf, other]

Credible Teacher for Semi-Supervised Object Detection in Open Scene

Authors: **gyu Zhuang, Kuo Wang, Liang Lin, Guanbin Li

Abstract: Semi-Supervised Object Detection (SSOD) has achieved resounding success by leveraging unlabeled data to improve detection performance. However, in Open Scene Semi-Supervised Object Detection (O-SSOD), unlabeled data may contains unknown objects not observed in the labeled data, which will increase uncertainty in the model's predictions for known objects. It is detrimental to the current methods th… ▽ More Semi-Supervised Object Detection (SSOD) has achieved resounding success by leveraging unlabeled data to improve detection performance. However, in Open Scene Semi-Supervised Object Detection (O-SSOD), unlabeled data may contains unknown objects not observed in the labeled data, which will increase uncertainty in the model's predictions for known objects. It is detrimental to the current methods that mainly rely on self-training, as more uncertainty leads to the lower localization and classification precision of pseudo labels. To this end, we propose Credible Teacher, an end-to-end framework. Credible Teacher adopts an interactive teaching mechanism using flexible labels to prevent uncertain pseudo labels from misleading the model and gradually reduces its uncertainty through the guidance of other credible pseudo labels. Empirical results have demonstrated our method effectively restrains the adverse effect caused by O-SSOD and significantly outperforms existing counterparts. △ Less

Submitted 2 January, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

Comments: Accpet by ICASSP 2024

arXiv:2312.10903 [pdf, other]

Robust Node Representation Learning via Graph Variational Diffusion Networks

Authors: Jun Zhuang, Mohammad Al Hasan

Abstract: Node representation learning by using Graph Neural Networks (GNNs) has been widely explored. However, in recent years, compelling evidence has revealed that GNN-based node representation learning can be substantially deteriorated by delicately-crafted perturbations in a graph structure. To learn robust node representation in the presence of perturbations, various works have been proposed to safegu… ▽ More Node representation learning by using Graph Neural Networks (GNNs) has been widely explored. However, in recent years, compelling evidence has revealed that GNN-based node representation learning can be substantially deteriorated by delicately-crafted perturbations in a graph structure. To learn robust node representation in the presence of perturbations, various works have been proposed to safeguard GNNs. Within these existing works, Bayesian label transition has been proven to be more effective, but this method is extensively reliant on a well-built prior distribution. The variational inference could address this limitation by sampling the latent node embedding from a Gaussian prior distribution. Besides, leveraging the Gaussian distribution (noise) in hidden layers is an appealing strategy to strengthen the robustness of GNNs. However, our experiments indicate that such a strategy can cause over-smoothing issues during node aggregation. In this work, we propose the Graph Variational Diffusion Network (GVDN), a new node encoder that effectively manipulates Gaussian noise to safeguard robustness on perturbed graphs while alleviating over-smoothing issues through two mechanisms: Gaussian diffusion and node embedding propagation. Thanks to these two mechanisms, our model can generate robust node embeddings for recovery. Specifically, we design a retraining mechanism using the generated node embedding to recover the performance of node classifications in the presence of perturbations. The experiments verify the effectiveness of our proposed model across six public datasets. △ Less

Submitted 17 December, 2023; originally announced December 2023.

Comments: preprint, under review

arXiv:2312.03594 [pdf, other]

A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

Authors: Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, Kai Chen

Abstract: Achieving high-quality versatile image inpainting, where user-specified regions are filled with plausible content according to user intent, presents a significant challenge. Existing methods face difficulties in simultaneously addressing context-aware image inpainting and text-guided object inpainting due to the distinct optimal training strategies required. To overcome this challenge, we introduc… ▽ More Achieving high-quality versatile image inpainting, where user-specified regions are filled with plausible content according to user intent, presents a significant challenge. Existing methods face difficulties in simultaneously addressing context-aware image inpainting and text-guided object inpainting due to the distinct optimal training strategies required. To overcome this challenge, we introduce PowerPaint, the first high-quality and versatile inpainting model that excels in both tasks. First, we introduce learnable task prompts along with tailored fine-tuning strategies to guide the model's focus on different inpainting targets explicitly. This enables PowerPaint to accomplish various inpainting tasks by utilizing different task prompts, resulting in state-of-the-art performance. Second, we demonstrate the versatility of the task prompt in PowerPaint by showcasing its effectiveness as a negative prompt for object removal. Additionally, we leverage prompt interpolation techniques to enable controllable shape-guided object inpainting. Finally, we extensively evaluate PowerPaint on various inpainting benchmarks to demonstrate its superior performance for versatile image inpainting. We release our codes and models on our project page: https://powerpaint.github.io/. △ Less

Submitted 11 December, 2023; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: Project page with code: https://powerpaint.github.io/

arXiv:2312.02991 [pdf, other]

REFRESH FPGAs: Sustainable FPGA Chiplet Architectures

Authors: Peipei Zhou, **ming Zhuang, Stephen Cahoon, Yue Tang, Zhuo** Yang, Xingzhen Chen, Yiyu Shi, **gtong Hu, Alex K. Jones

Abstract: There is a growing call for greater amounts of increasingly agile computational power for edge and cloud infrastructure to serve the computationally complex needs of ubiquitous computing devices. Thus, an important challenge is addressing the holistic environmental impacts of these next-generation computing systems. To accomplish this, a life-cycle view of sustainability for computing advancements… ▽ More There is a growing call for greater amounts of increasingly agile computational power for edge and cloud infrastructure to serve the computationally complex needs of ubiquitous computing devices. Thus, an important challenge is addressing the holistic environmental impacts of these next-generation computing systems. To accomplish this, a life-cycle view of sustainability for computing advancements is necessary to reduce environmental impacts such as greenhouse warming gas emissions from these computing choices. Unfortunately, decadal efforts to address operational energy efficiency in computing devices have ignored and in some cases exacerbated embodied impacts from manufacturing these edge and cloud systems, particularly their integrated circuits. During this time FPGA architectures have not changed dramatically except to increase in size. Given this context, we propose REFRESH FPGAs to build new FPGA devices and architectures from recently retired FPGA dies using 2.5D integration. To build REFRESH FPGAs requires creative architectures that leverage existing chiplet pins with an inexpensive to-manufacture interposer coupled with creative design automation. In this paper, we discuss how REFRESH FPGAs can leverage industry trends for renewable energy integration into data centers while providing an overall improvement for sustainability and amortizing their significant embodied cost investment over a much longer ``first'' lifetime. △ Less

Submitted 27 November, 2023; originally announced December 2023.

arXiv:2311.18420 [pdf, other]

TeG-DG: Textually Guided Domain Generalization for Face Anti-Spoofing

Authors: Lianrui Mu, Jianhong Bai, Xiaoxuan He, Jiangnan Ye, Xiaoyu Liang, Yuchen Yang, Jiedong Zhuang, Haoji Hu

Abstract: Enhancing the domain generalization performance of Face Anti-Spoofing (FAS) techniques has emerged as a research focus. Existing methods are dedicated to extracting domain-invariant features from various training domains. Despite the promising performance, the extracted features inevitably contain residual style feature bias (e.g., illumination, capture device), resulting in inferior generalizatio… ▽ More Enhancing the domain generalization performance of Face Anti-Spoofing (FAS) techniques has emerged as a research focus. Existing methods are dedicated to extracting domain-invariant features from various training domains. Despite the promising performance, the extracted features inevitably contain residual style feature bias (e.g., illumination, capture device), resulting in inferior generalization performance. In this paper, we propose an alternative and effective solution, the Textually Guided Domain Generalization (TeG-DG) framework, which can effectively leverage text information for cross-domain alignment. Our core insight is that text, as a more abstract and universal form of expression, can capture the commonalities and essential characteristics across various attacks, bridging the gap between different image domains. Contrary to existing vision-language models, the proposed framework is elaborately designed to enhance the domain generalization ability of the FAS task. Concretely, we first design a Hierarchical Attention Fusion (HAF) module to enable adaptive aggregation of visual features at different levels; Then, a Textual-Enhanced Visual Discriminator (TEVD) is proposed for not only better alignment between the two modalities but also to regularize the classifier with unbiased text features. TeG-DG significantly outperforms previous approaches, especially in situations with extremely limited source domain data (~14% and ~12% improvements on HTER and AUC respectively), showcasing impressive few-shot performance. △ Less

Submitted 30 January, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

arXiv:2311.16417 [pdf, other]

Challenges and Opportunities to Enable Large-Scale Computing via Heterogeneous Chiplets

Authors: Zhuo** Yang, Shixin Ji, Xingzhen Chen, **ming Zhuang, Weifeng Zhang, Dharmesh Jani, Peipei Zhou

Abstract: Fast-evolving artificial intelligence (AI) algorithms such as large language models have been driving the ever-increasing computing demands in today's data centers. Heterogeneous computing with domain-specific architectures (DSAs) brings many opportunities when scaling up and scaling out the computing system. In particular, heterogeneous chiplet architecture is favored to keep scaling up and scali… ▽ More Fast-evolving artificial intelligence (AI) algorithms such as large language models have been driving the ever-increasing computing demands in today's data centers. Heterogeneous computing with domain-specific architectures (DSAs) brings many opportunities when scaling up and scaling out the computing system. In particular, heterogeneous chiplet architecture is favored to keep scaling up and scaling out the system as well as to reduce the design complexity and the cost stemming from the traditional monolithic chip design. However, how to interconnect computing resources and orchestrate heterogeneous chiplets is the key to success. In this paper, we first discuss the diversity and evolving demands of different AI workloads. We discuss how chiplet brings better cost efficiency and shorter time to market. Then we discuss the challenges in establishing chiplet interface standards, packaging, and security issues. We further discuss the software programming challenges in chiplet systems. △ Less

Submitted 4 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

arXiv:2310.01159 [pdf, other]

Iterative Semi-Supervised Learning for Abdominal Organs and Tumor Segmentation

Authors: Jiaxin Zhuang, Luyang Luo, Zhixuan Chen, Linshan Wu

Abstract: Deep-learning (DL) based methods are playing an important role in the task of abdominal organs and tumors segmentation in CT scans. However, the large requirements of annotated datasets heavily limit its development. The FLARE23 challenge provides a large-scale dataset with both partially and fully annotated data, which also focuses on both segmentation accuracy and computational efficiency. In th… ▽ More Deep-learning (DL) based methods are playing an important role in the task of abdominal organs and tumors segmentation in CT scans. However, the large requirements of annotated datasets heavily limit its development. The FLARE23 challenge provides a large-scale dataset with both partially and fully annotated data, which also focuses on both segmentation accuracy and computational efficiency. In this study, we propose to use the strategy of Semi-Supervised Learning (SSL) and iterative pseudo labeling to address FLARE23. Initially, a deep model (nn-UNet) trained on datasets with complete organ annotations (about 220 scans) generates pseudo labels for the whole dataset. These pseudo labels are then employed to train a more powerful segmentation model. Employing the FLARE23 dataset, our approach achieves an average DSC score of 89.63% for organs and 46.07% for tumors on online validation leaderboard. For organ segmentation, We obtain 0.9007\% DSC and 0.9493\% NSD. For tumor segmentation, we obtain 0.3785% DSC and 0.2842% NSD. Our code is available at https://github.com/USTguy/Flare23. △ Less

Submitted 2 October, 2023; originally announced October 2023.

Comments: arXiv admin note: text overlap with arXiv:2309.05405

arXiv:2309.16137 [pdf, other]

Context-I2W: Map** Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval

Authors: Yuanmin Tang, **g Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Yue Hu, Qi Wu

Abstract: Different from Composed Image Retrieval task that requires expensive labels for training task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent that could be related to domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to learn a more accurate image representation that has adaptive… ▽ More Different from Composed Image Retrieval task that requires expensive labels for training task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent that could be related to domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to learn a more accurate image representation that has adaptive attention to the reference image for various manipulation descriptions. In this paper, we propose a novel context-dependent map** network, named Context-I2W, for adaptively converting description-relevant Image information into a pseudo-word token composed of the description for accurate ZS-CIR. Specifically, an Intent View Selector first dynamically learns a rotation rule to map the identical image to a task-specific manipulation view. Then a Visual Target Extractor further captures local information covering the main targets in ZS-CIR tasks under the guidance of multiple learnable queries. The two complementary modules work together to map an image to a context-dependent pseudo-word token without extra supervision. Our model shows strong generalization ability on four ZS-CIR tasks, including domain conversion, object composition, object manipulation, and attribute manipulation. It obtains consistent and significant performance boosts ranging from 1.88% to 3.60% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at https://github.com/Pter61/context-i2w. △ Less

Submitted 15 December, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

Journal ref: AAAI 2024

arXiv:2309.12275 [pdf, other]

AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP

Authors: Zhuo** Yang, **ming Zhuang, Jiaqi Yin, Cunxi Yu, Alex K. Jones, Peipei Zhou

Abstract: Arbitrary-precision integer multiplication is the core kernel of many applications in simulation, cryptography, etc. Existing acceleration of arbitrary-precision integer multiplication includes CPUs, GPUs, FPGAs, and ASICs. Among these accelerators, FPGAs are promised to provide both good energy efficiency and flexibility. Surprisingly, in our implementations, FPGA has the lowest energy efficiency… ▽ More Arbitrary-precision integer multiplication is the core kernel of many applications in simulation, cryptography, etc. Existing acceleration of arbitrary-precision integer multiplication includes CPUs, GPUs, FPGAs, and ASICs. Among these accelerators, FPGAs are promised to provide both good energy efficiency and flexibility. Surprisingly, in our implementations, FPGA has the lowest energy efficiency, i.e., 0.29x of the CPU and 0.17x of the GPU with the same generation fabrication. Therefore, key questions arise: Where do the energy efficiency gains of CPUs and GPUs come from? Can reconfigurable computing do better? If can, how to achieve that? We identify that the biggest energy efficiency gains of the CPUs and GPUs come from the dedicated vector units. FPGA uses DSPs and lookup tables to compose the needed computation, which incurs overhead when compared to using vector units directly. New reconfigurable computing, e.g., 'FPGA+vector units' is a novel and feasible solution to improve energy efficiency. In this paper, we propose to map arbitrary-precision integer multiplication onto such a heterogeneous platform, i.e., AMD/Xilinx Versal ACAP architecture. Designing on Versal ACAP incurs several challenges and we propose AIM: Arbitrary-precision Integer Multiplication on Versal ACAP to automate and optimize the design. AIM framework includes design space exploration and AIM automatic code generation to facilitate the system design and verification. We deploy the AIM framework on three different applications, including large integer multiplication (LIM), RSA, and Mandelbrot, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experimental results show that AIM achieves up to 12.6x, and 2.1x energy efficiency gains over the Intel Xeon Ice Lake 6346 CPU, and NVidia A5000 GPU respectively, which brings reconfigurable computing the most energy-efficient platform among CPUs and GPUs. △ Less

Submitted 21 September, 2023; originally announced September 2023.

arXiv:2308.14009 [pdf, other]

doi 10.1109/TMM.2023.3280734

Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment

Authors: Jiamin Zhuang, **g Yu, Yang Ding, Xiangyan Qu, Yue Hu

Abstract: Image-text retrieval requires the system to bridge the heterogenous gap between vision and language for accurate retrieval while kee** the network lightweight-enough for efficient retrieval. Existing trade-off solutions mainly study from the view of incorporating cross-modal interactions with the independent-embedding framework or leveraging stronger pretrained encoders, which still demand time-… ▽ More Image-text retrieval requires the system to bridge the heterogenous gap between vision and language for accurate retrieval while kee** the network lightweight-enough for efficient retrieval. Existing trade-off solutions mainly study from the view of incorporating cross-modal interactions with the independent-embedding framework or leveraging stronger pretrained encoders, which still demand time-consuming similarity measurement or heavyweight model structure in the retrieval stage. In this work, we propose an image-text alignment module SelfAlign on top of the independent-embedding framework, which improves the retrieval accuracy while maintains the retrieval efficiency without extra supervision. SelfAlign contains two collaborative sub-modules that force image-text alignment at both concept level and context level by self-supervised contrastive learning. It does not require cross-modal embedding interactions during training while maintaining independent image and text encoders during retrieval. With comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art non-pretraining independent-embedding models respectively by 9.1%, 4.2% and 6.6% in terms of R@sum score on Flickr30K, MSCOCO 1K and MS-COCO 5K datasets. The retrieval accuracy also outperforms most existing interactive-embedding models with orders of magnitude decrease in retrieval time. The source code is available at: https://github.com/Zjamie813/SelfAlign. △ Less

Submitted 27 August, 2023; originally announced August 2023.

Comments: Accepted in IEEE Transactions on Multimedia (TMM)

Journal ref: IEEE Transactions on Multimedia ( Early Access ), 29 May 2023

arXiv:2307.10036 [pdf, other]

Class Attention to Regions of Lesion for Imbalanced Medical Image Recognition

Authors: Jia-Xin Zhuang, Jiabin Cai, Jianguo Zhang, Wei-shi Zheng, Ruixuan Wang

Abstract: Automated medical image classification is the key component in intelligent diagnosis systems. However, most medical image datasets contain plenty of samples of common diseases and just a handful of rare ones, leading to major class imbalances. Currently, it is an open problem in intelligent diagnosis to effectively learn from imbalanced training data. In this paper, we propose a simple yet effecti… ▽ More Automated medical image classification is the key component in intelligent diagnosis systems. However, most medical image datasets contain plenty of samples of common diseases and just a handful of rare ones, leading to major class imbalances. Currently, it is an open problem in intelligent diagnosis to effectively learn from imbalanced training data. In this paper, we propose a simple yet effective framework, named \textbf{C}lass \textbf{A}ttention to \textbf{RE}gions of the lesion (CARE), to handle data imbalance issues by embedding attention into the training process of \textbf{C}onvolutional \textbf{N}eural \textbf{N}etworks (CNNs). The proposed attention module helps CNNs attend to lesion regions of rare diseases, therefore hel** CNNs to learn their characteristics more effectively. In addition, this attention module works only during the training phase and does not change the architecture of the original network, so it can be directly combined with any existing CNN architecture. The CARE framework needs bounding boxes to represent the lesion regions of rare diseases. To alleviate the need for manual annotation, we further developed variants of CARE by leveraging the traditional saliency methods or a pretrained segmentation model for bounding box generation. Results show that the CARE variants with automated bounding box generation are comparable to the original CARE framework with \textit{manual} bounding box annotations. A series of experiments on an imbalanced skin image dataset and a pneumonia dataset indicates that our method can effectively help the network focus on the lesion regions of rare diseases and remarkably improves the classification performance of rare diseases. △ Less

Submitted 20 July, 2023; v1 submitted 19 July, 2023; originally announced July 2023.

Comments: Accepted by Neurocomputing on July 2023. 37 pages

arXiv:2306.13455 [pdf, other]

DreamEditor: Text-Driven 3D Scene Editing with Neural Fields

Authors: **gyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, Guanbin Li

Abstract: Neural fields have achieved impressive advancements in view synthesis and scene reconstruction. However, editing these neural fields remains challenging due to the implicit encoding of geometry and texture information. In this paper, we propose DreamEditor, a novel framework that enables users to perform controlled editing of neural fields using text prompts. By representing scenes as mesh-based n… ▽ More Neural fields have achieved impressive advancements in view synthesis and scene reconstruction. However, editing these neural fields remains challenging due to the implicit encoding of geometry and texture information. In this paper, we propose DreamEditor, a novel framework that enables users to perform controlled editing of neural fields using text prompts. By representing scenes as mesh-based neural fields, DreamEditor allows localized editing within specific regions. DreamEditor utilizes the text encoder of a pretrained text-to-Image diffusion model to automatically identify the regions to be edited based on the semantics of the text prompts. Subsequently, DreamEditor optimizes the editing region and aligns its geometry and texture with the text prompts through score distillation sampling [29]. Extensive experiments have demonstrated that DreamEditor can accurately edit neural fields of real-world scenes according to the given text prompts while ensuring consistency in irrelevant areas. DreamEditor generates highly realistic textures and geometry, significantly surpassing previous works in both quantitative and qualitative evaluations. △ Less

Submitted 7 September, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

Comments: Accepted by SIGGRAPH Asia 2023

arXiv:2306.08939 [pdf, other]

Revisiting Stereo Triangulation in UAV Distance Estimation

Authors: Jiafan Zhuang, Duan Yuan, Rihong Yan, Weixin Huang, Wenji Li, Zhun Fan

Abstract: Distance estimation plays an important role for path planning and collision avoidance of swarm UAVs. However, the lack of annotated data seriously hinders the related studies. In this work, we build and present a UAVDE dataset for UAV distance estimation, in which distance between two UAVs is obtained by UWB sensors. During experiments, we surprisingly observe that the stereo triangulation cannot… ▽ More Distance estimation plays an important role for path planning and collision avoidance of swarm UAVs. However, the lack of annotated data seriously hinders the related studies. In this work, we build and present a UAVDE dataset for UAV distance estimation, in which distance between two UAVs is obtained by UWB sensors. During experiments, we surprisingly observe that the stereo triangulation cannot stand for UAV scenes. The core reason is the position deviation issue due to long shooting distance and camera vibration, which is common in UAV scenes. To tackle this issue, we propose a novel position correction module, which can directly predict the offset between the observed positions and the actual ones and then perform compensation in stereo triangulation calculation. Besides, to further boost performance on hard samples, we propose a dynamic iterative correction mechanism, which is composed of multiple stacked PCMs and a gating mechanism to adaptively determine whether further correction is required according to the difficulty of data samples. We conduct extensive experiments on UAVDE, and our method can achieve a significant performance improvement over a strong baseline (by reducing the relative difference from 49.4% to 9.8%), which demonstrates its effectiveness and superiority. The code and dataset are available at https://github.com/duanyuan13/PCM. △ Less

Submitted 2 December, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2306.08913 [pdf, other]

Advancing Volumetric Medical Image Segmentation via Global-Local Masked Autoencoder

Authors: Jia-Xin Zhuang, Luyang Luo, Hao Chen

Abstract: Masked autoencoder (MAE) is a promising self-supervised pre-training technique that can improve the representation learning of a neural network without human intervention. However, applying MAE directly to volumetric medical images poses two challenges: (i) a lack of global information that is crucial for understanding the clinical context of the holistic data, (ii) no guarantee of stabilizing the… ▽ More Masked autoencoder (MAE) is a promising self-supervised pre-training technique that can improve the representation learning of a neural network without human intervention. However, applying MAE directly to volumetric medical images poses two challenges: (i) a lack of global information that is crucial for understanding the clinical context of the holistic data, (ii) no guarantee of stabilizing the representations learned from randomly masked inputs. To address these limitations, we propose the \textbf{G}lobal-\textbf{L}ocal \textbf{M}asked \textbf{A}uto\textbf{E}ncoder (GL-MAE), a simple yet effective self-supervised pre-training strategy. In addition to reconstructing masked local views, as in previous methods, GL-MAE incorporates global context learning by reconstructing masked global views. Furthermore, a complete global view is integrated as an anchor to guide the reconstruction and stabilize the learning process through global-to-global consistency learning and global-to-local consistency learning. Finetuning results on multiple datasets demonstrate the superiority of our method over other state-of-the-art self-supervised algorithms, highlighting its effectiveness on versatile volumetric medical image segmentation tasks, even when annotations are scarce. Our codes and models will be released upon acceptance. △ Less

Submitted 23 August, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

arXiv:2305.18698 [pdf, other]

AutoMM: Energy-Efficient Multi-Data-Type Matrix Multiply Design on Heterogeneous Programmable System-on-Chip

Authors: **ming Zhuang, Zhuo** Yang, Peipei Zhou

Abstract: As the increasing complexity of Neural Network(NN) models leads to high demands for computation, AMD introduces a heterogeneous programmable system-on-chip (SoC), i.e., Versal ACAP architectures featured with programmable logic (PL), CPUs, and dedicated AI engines (AIE) ASICs which has a theoretical throughput up to 6.4 TFLOPs for FP32, 25.6 TOPs for INT16 and 102.4 TOPs for INT8. However, the hig… ▽ More As the increasing complexity of Neural Network(NN) models leads to high demands for computation, AMD introduces a heterogeneous programmable system-on-chip (SoC), i.e., Versal ACAP architectures featured with programmable logic (PL), CPUs, and dedicated AI engines (AIE) ASICs which has a theoretical throughput up to 6.4 TFLOPs for FP32, 25.6 TOPs for INT16 and 102.4 TOPs for INT8. However, the higher level of complexity makes it non-trivial to achieve the theoretical performance even for well-studied applications like matrix-matrix multiply. In this paper, we provide AutoMM, an automatic white-box framework that can systematically generate the design for MM accelerators on Versal which achieves 3.7 TFLOPs, 7.5 TOPs, and 28.2 TOPs for FP32, INT16, and INT8 data type respectively. Our designs are tested on board and achieve gains of 7.20x (FP32), 3.26x (INT16), 6.23x (INT8) energy efficiency than AMD U250 FPGA, 2.32x (FP32) than Nvidia Jetson TX2 GPU, 1.06x (FP32), 1.70x (INT8) than Nvidia A100 GPU. △ Less

Submitted 29 May, 2023; originally announced May 2023.

arXiv:2304.09322 [pdf, other]

doi 10.1016/j.eswa.2023.119965

Multi-Modality Multi-Scale Cardiovascular Disease Subtypes Classification Using Raman Image and Medical History

Authors: Bo Yu, Hechang Chen, Chengyou Jia, Hongren Zhou, Lele Cong, Xiankai Li, Jianhui Zhuang, Xianling Cong

Abstract: Raman spectroscopy (RS) has been widely used for disease diagnosis, e.g., cardiovascular disease (CVD), owing to its efficiency and component-specific testing capabilities. A series of popular deep learning methods have recently been introduced to learn nuance features from RS for binary classifications and achieved outstanding performance than conventional machine learning methods. However, these… ▽ More Raman spectroscopy (RS) has been widely used for disease diagnosis, e.g., cardiovascular disease (CVD), owing to its efficiency and component-specific testing capabilities. A series of popular deep learning methods have recently been introduced to learn nuance features from RS for binary classifications and achieved outstanding performance than conventional machine learning methods. However, these existing deep learning methods still confront some challenges in classifying subtypes of CVD. For example, the nuance between subtypes is quite hard to capture and represent by intelligent models due to the chillingly similar shape of RS sequences. Moreover, medical history information is an essential resource for distinguishing subtypes, but they are underutilized. In light of this, we propose a multi-modality multi-scale model called M3S, which is a novel deep learning method with two core modules to address these issues. First, we convert RS data to various resolution images by the Gramian angular field (GAF) to enlarge nuance, and a two-branch structure is leveraged to get embeddings for distinction in the multi-scale feature extraction module. Second, a probability matrix and a weight matrix are used to enhance the classification capacity by combining the RS and medical history data in the multi-modality data fusion module. We perform extensive evaluations of M3S and found its outstanding performance on our in-house dataset, with accuracy, precision, recall, specificity, and F1 score of 0.9330, 0.9379, 0.9291, 0.9752, and 0.9334, respectively. These results demonstrate that the M3S has high performance and robustness compared with popular methods in diagnosing CVD subtypes. △ Less

Submitted 18 April, 2023; originally announced April 2023.

Journal ref: [J]. Expert Systems with Applications, 2023: 119965

arXiv:2303.08774 [pdf, other]

GPT-4 Technical Report

Authors: OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko , et al. (256 additional authors not shown)

Abstract: We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based mo… ▽ More We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was develo** infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4. △ Less

Submitted 4 March, 2024; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: 100 pages; updated authors list; fixed author names and added citation

arXiv:2303.01047 [pdf, other]

Task-Specific Context Decoupling for Object Detection

Authors: Jiayuan Zhuang, Zheng Qin, Hao Yu, Xucan Chen

Abstract: Classification and localization are two main sub-tasks in object detection. Nonetheless, these two tasks have inconsistent preferences for feature context, i.e., localization expects more boundary-aware features to accurately regress the bounding box, while more semantic context is preferred for object classification. Exsiting methods usually leverage disentangled heads to learn different feature… ▽ More Classification and localization are two main sub-tasks in object detection. Nonetheless, these two tasks have inconsistent preferences for feature context, i.e., localization expects more boundary-aware features to accurately regress the bounding box, while more semantic context is preferred for object classification. Exsiting methods usually leverage disentangled heads to learn different feature context for each task. However, the heads are still applied on the same input features, which leads to an imperfect balance between classifcation and localization. In this work, we propose a novel Task-Specific COntext DEcoupling (TSCODE) head which further disentangles the feature encoding for two tasks. For classification, we generate spatially-coarse but semantically-strong feature encoding. For localization, we provide high-resolution feature map containing more edge information to better regress object boundaries. TSCODE is plug-and-play and can be easily incorperated into existing detection pipelines. Extensive experiments demonstrate that our method stably improves different detectors by over 1.0 AP with less computational cost. Our code and models will be publicly released. △ Less

Submitted 2 March, 2023; originally announced March 2023.

arXiv:2301.03832 [pdf, other]

Video Semantic Segmentation with Inter-Frame Feature Fusion and Inner-Frame Feature Refinement

Authors: Jiafan Zhuang, Zilei Wang, Junjie Li

Abstract: Video semantic segmentation aims to generate accurate semantic maps for each video frame. To this end, many works dedicate to integrate diverse information from consecutive frames to enhance the features for prediction, where a feature alignment procedure via estimated optical flow is usually required. However, the optical flow would inevitably suffer from inaccuracy, and then introduce noises in… ▽ More Video semantic segmentation aims to generate accurate semantic maps for each video frame. To this end, many works dedicate to integrate diverse information from consecutive frames to enhance the features for prediction, where a feature alignment procedure via estimated optical flow is usually required. However, the optical flow would inevitably suffer from inaccuracy, and then introduce noises in feature fusion and further result in unsatisfactory segmentation results. In this paper, to tackle the misalignment issue, we propose a spatial-temporal fusion (STF) module to model dense pairwise relationships among multi-frame features. Different from previous methods, STF uniformly and adaptively fuses features at different spatial and temporal positions, and avoids error-prone optical flow estimation. Besides, we further exploit feature refinement within a single frame and propose a novel memory-augmented refinement (MAR) module to tackle difficult predictions among semantic boundaries. Specifically, MAR can store the boundary features and prototypes extracted from the training samples, which together form the task-specific memory, and then use them to refine the features during inference. Essentially, MAR can move the hard features closer to the most likely category and thus make them more discriminative. We conduct extensive experiments on Cityscapes and CamVid, and the results show that our proposed methods significantly outperform previous methods and achieves the state-of-the-art performance. Code and pretrained models are available at https://github.com/jfzhuang/ST_Memory. △ Less

Submitted 10 January, 2023; originally announced January 2023.

arXiv:2301.02359 [pdf, other]

CHARM: Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture

Authors: **ming Zhuang, Jason Lau, Hanchen Ye, Zhuo** Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, **gtong Hu, Deming Chen, Jason Cong, Peipei Zhou

Abstract: Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic wi… ▽ More Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic with AI Engine processors optimized for AI/ML. With 400 AIEs, it provides up to 6.4 TFLOPs performance for 32-bit floating-point data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. We observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck resulting from the mismatch of massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently towards different layers in one application. We deploy the CHARM framework for four different applications, including BERT, ViT, NCF, MLP, on the AMD Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF and MLP, which obtain 5.40x, 32.51x, 1.00x and 1.00x throughput gains compared to one monolithic accelerator. △ Less

Submitted 5 January, 2023; originally announced January 2023.

arXiv:2211.01607 [pdf, other]

ImageCAS: A Large-Scale Dataset and Benchmark for Coronary Artery Segmentation based on Computed Tomography Angiography Images

Authors: An Zeng, Chunbiao Wu, Mei** Huang, Jian Zhuang, Shanshan Bi, Dan Pan, Najeeb Ullah, Kaleem Nawaz Khan, Tianchen Wang, Yiyu Shi, Xiaomeng Li, Guisen Lin, Xiaowei Xu

Abstract: Cardiovascular disease (CVD) accounts for about half of non-communicable diseases. Vessel stenosis in the coronary artery is considered to be the major risk of CVD. Computed tomography angiography (CTA) is one of the widely used noninvasive imaging modalities in coronary artery diagnosis due to its superior image resolution. Clinically, segmentation of coronary arteries is essential for the diagno… ▽ More Cardiovascular disease (CVD) accounts for about half of non-communicable diseases. Vessel stenosis in the coronary artery is considered to be the major risk of CVD. Computed tomography angiography (CTA) is one of the widely used noninvasive imaging modalities in coronary artery diagnosis due to its superior image resolution. Clinically, segmentation of coronary arteries is essential for the diagnosis and quantification of coronary artery disease. Recently, a variety of works have been proposed to address this problem. However, on one hand, most works rely on in-house datasets, and only a few works published their datasets to the public which only contain tens of images. On the other hand, their source code have not been published, and most follow-up works have not made comparison with existing works, which makes it difficult to judge the effectiveness of the methods and hinders the further exploration of this challenging yet critical problem in the community. In this paper, we propose a large-scale dataset for coronary artery segmentation on CTA images. In addition, we have implemented a benchmark in which we have tried our best to implement several typical existing methods. Furthermore, we propose a strong baseline method which combines multi-scale patch fusion and two-stage processing to extract the details of vessels. Comprehensive experiments show that the proposed method achieves better performance than existing works on the proposed large-scale dataset. The benchmark and the dataset are published at https://github.com/XiaoweiXu/ImageCAS-A-Large-Scale-Dataset-and-Benchmark-for-Coronary-Artery-Segmentation-based-on-CT. △ Less

Submitted 17 October, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: 17 pages, 12 figures, 4 tables

Journal ref: Computerized Medical Imaging and Graphics, 2023

arXiv:2209.06656 [pdf, ps, other]

Syndrome decoding meets multiple instances

Authors: Haoxuan Wu, **cheng Zhuang

Abstract: The NP-hard problem of decoding random linear codes is crucial to both coding theory and cryptography. In particular, this problem underpins the security of many code based post-quantum cryptographic schemes. The state-of-art algorithms for solving this problem are the information syndrome decoding algorithm and its advanced variants. In this work, we consider syndrome decoding in the multiple ins… ▽ More The NP-hard problem of decoding random linear codes is crucial to both coding theory and cryptography. In particular, this problem underpins the security of many code based post-quantum cryptographic schemes. The state-of-art algorithms for solving this problem are the information syndrome decoding algorithm and its advanced variants. In this work, we consider syndrome decoding in the multiple instances setting. Two strategies are applied for different scenarios. The first strategy is to solve all instances with the aid of the precomputation technique. We adjust the current framework and distinguish the offline phase and online phase to reduce the amortized complexity. Further, we discuss the impact on the concrete security of some post-quantum schemes. The second strategy is to solve one out of many instances. Adapting the analysis for some earlier algorithm, we discuss the effectiveness of using advanced variants and confirm a related folklore conjecture. △ Less

Submitted 14 September, 2022; originally announced September 2022.

arXiv:2209.00726 [pdf, other]

Learning correspondences of cardiac motion from images using biomechanics-informed modeling

Authors: Xiaoran Zhang, Chenyu You, Shawn Ahn, Juntang Zhuang, Lawrence Staib, James Duncan

Abstract: Learning spatial-temporal correspondences in cardiac motion from images is important for understanding the underlying dynamics of cardiac anatomical structures. Many methods explicitly impose smoothness constraints such as the $\mathcal{L}_2$ norm on the displacement vector field (DVF), while usually ignoring biomechanical feasibility in the transformation. Other geometric constraints either regul… ▽ More Learning spatial-temporal correspondences in cardiac motion from images is important for understanding the underlying dynamics of cardiac anatomical structures. Many methods explicitly impose smoothness constraints such as the $\mathcal{L}_2$ norm on the displacement vector field (DVF), while usually ignoring biomechanical feasibility in the transformation. Other geometric constraints either regularize specific regions of interest such as imposing incompressibility on the myocardium or introduce additional steps such as training a separate network-based regularizer on physically simulated datasets. In this work, we propose an explicit biomechanics-informed prior as regularization on the predicted DVF in modeling a more generic biomechanically plausible transformation within all cardiac structures without introducing additional training complexity. We validate our methods on two publicly available datasets in the context of 2D MRI data and perform extensive experiments to illustrate the effectiveness and robustness of our proposed methods compared to other competing regularization schemes. Our proposed methods better preserve biomechanical properties by visual assessment and show advantages in segmentation performance using quantitative evaluation metrics. The code is publicly available at \url{https://github.com/Voldemort108X/bioinformed_reg}. △ Less

Submitted 1 September, 2022; originally announced September 2022.

Comments: Accepted by MICCAI-STACOM 2022 as an oral presentation

arXiv:2208.09779 [pdf, other]

doi 10.1145/3511808.3557437

Robust Node Classification on Graphs: Jointly from Bayesian Label Transition and Topology-based Label Propagation

Authors: Jun Zhuang, Mohammad Al Hasan

Abstract: Node classification using Graph Neural Networks (GNNs) has been widely applied in various real-world scenarios. However, in recent years, compelling evidence emerges that the performance of GNN-based node classification may deteriorate substantially by topological perturbation, such as random connections or adversarial attacks. Various solutions, such as topological denoising methods and mechanism… ▽ More Node classification using Graph Neural Networks (GNNs) has been widely applied in various real-world scenarios. However, in recent years, compelling evidence emerges that the performance of GNN-based node classification may deteriorate substantially by topological perturbation, such as random connections or adversarial attacks. Various solutions, such as topological denoising methods and mechanism design methods, have been proposed to develop robust GNN-based node classifiers but none of these works can fully address the problems related to topological perturbations. Recently, the Bayesian label transition model is proposed to tackle this issue but its slow convergence may lead to inferior performance. In this work, we propose a new label inference model, namely LInDT, which integrates both Bayesian label transition and topology-based label propagation for improving the robustness of GNNs against topological perturbations. LInDT is superior to existing label transition methods as it improves the label prediction of uncertain nodes by utilizing neighborhood-based label propagation leading to better convergence of label inference. Besides, LIndT adopts asymmetric Dirichlet distribution as a prior, which also helps it to improve label inference. Extensive experiments on five graph datasets demonstrate the superiority of LInDT for GNN-based node classification under three scenarios of topological perturbations. △ Less

Submitted 20 August, 2022; originally announced August 2022.

Comments: The paper is accepted for CIKM 2022

arXiv:2208.05616 [pdf, other]

OpenMedIA: Open-Source Medical Image Analysis Toolbox and Benchmark under Heterogeneous AI Computing Platforms

Authors: Jia-Xin Zhuang, Xiansong Huang, Yang Yang, Jiancong Chen, Yue Yu, Wei Gao, Ge Li, Jie Chen, Tong Zhang

Abstract: In this paper, we present OpenMedIA, an open-source toolbox library containing a rich set of deep learning methods for medical image analysis under heterogeneous Artificial Intelligence (AI) computing platforms. Various medical image analysis methods, including 2D/3D medical image classification, segmentation, localisation, and detection, have been included in the toolbox with PyTorch and/or MindS… ▽ More In this paper, we present OpenMedIA, an open-source toolbox library containing a rich set of deep learning methods for medical image analysis under heterogeneous Artificial Intelligence (AI) computing platforms. Various medical image analysis methods, including 2D/3D medical image classification, segmentation, localisation, and detection, have been included in the toolbox with PyTorch and/or MindSpore implementations under heterogeneous NVIDIA and Huawei Ascend computing systems. To our best knowledge, OpenMedIA is the first open-source algorithm library providing compared PyTorch and MindSpore implementations and results on several benchmark datasets. The source codes and models are available at https://git.openi.org.cn/OpenMedIA. △ Less

Submitted 7 September, 2022; v1 submitted 10 August, 2022; originally announced August 2022.

Comments: 12 pages, 1 figure

arXiv:2206.12558 [pdf, other]

FastBVP-Net: a lightweight pulse extraction network for measuring heart rhythm via facial videos

Authors: Jialiang Zhuang, Yuheng Chen, Yun Zhang, Xiujuan Zheng

Abstract: Remote photoplethysmography (rPPG) is an attractive camera-based health monitoring method that can measure the heart rhythm from facial videos. Many well-established deep-learning models have been reported to measure heart rate (HR) and heart rate variability (HRV). However, most of these models usually require a 30-second facial video and enormous computational resources to obtain accurate and ro… ▽ More Remote photoplethysmography (rPPG) is an attractive camera-based health monitoring method that can measure the heart rhythm from facial videos. Many well-established deep-learning models have been reported to measure heart rate (HR) and heart rate variability (HRV). However, most of these models usually require a 30-second facial video and enormous computational resources to obtain accurate and robust results, which significantly limits their applications in real-world scenarios. Hence, we propose a lightweight pulse extraction network, FastBVP-Net, to quickly measure heart rhythm via facial videos. The proposed FastBVP-Net uses a multi-frequency mode signal fusion (MMSF) mechanism to characterize the different modes of the raw signals in a decompose module and reconstruct the blood volume pulse (BVP) signal under a complex noise environment in a compose module. Meanwhile, an oversampling training scheme is used to solve the over-fitting problem caused by the limitations of the datasets. Then, the HR and HRV can be estimated based on the extracted BVP signals. Comprehensive experiments are conducted on the benchmark datasets to validate the proposed FastBVP-Net. For intra-dataset and cross-dataset testing, the proposed approach achieves better performance for HR and HRV estimation from 30-second facial videos with fewer computational burdens than the current well-established methods. Moreover, the proposed approach also achieves competitive results from 15-second facial videos. Therefore, the proposed FastBVP-Net has the potential to be applied in many real-world scenarios with shorter videos. △ Less

Submitted 21 December, 2022; v1 submitted 25 June, 2022; originally announced June 2022.

Comments: 9 pages, 2figures

arXiv:2206.02295 [pdf, other]

HIFI-Net: A Novel Network for Enhancement to Underwater Images

Authors: Jiajia Zhou, Junbin Zhuang, Yan Zheng, Di Wu

Abstract: A novel network for enhancement to underwater images is proposed in this paper. It contains a Reinforcement Fusion Module for Haar wavelet images (RFM-Haar) based on Reinforcement Fusion Unit (RFU), which is used to fuse an original image and some important information within it. Fusion is achieved for better enhancement. As this network make "Haar Images into Fusion Images", it is called HIFI-Net… ▽ More A novel network for enhancement to underwater images is proposed in this paper. It contains a Reinforcement Fusion Module for Haar wavelet images (RFM-Haar) based on Reinforcement Fusion Unit (RFU), which is used to fuse an original image and some important information within it. Fusion is achieved for better enhancement. As this network make "Haar Images into Fusion Images", it is called HIFI-Net. The experimental results show the proposed HIFI-Net performs best among many state-of-the-art methods on three datasets at three normal metrics and a new metric. △ Less

Submitted 5 June, 2022; originally announced June 2022.

Comments: 7 pages, 4 figures

arXiv:2203.08065 [pdf, other]

Surrogate Gap Minimization Improves Sharpness-Aware Training

Authors: Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha Dvornek, Sekhar Tatikonda, James Duncan, Ting Liu

Abstract: The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by minimizing a \textit{perturbed loss} defined as the maximum loss within a neighborhood in the parameter space. However, we show that both sharp and flat minima can have a low perturbed loss, implying that SAM does not always prefer flat minima. Instead, we define a \textit{surrogate gap}, a measure equivalent to th… ▽ More The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by minimizing a \textit{perturbed loss} defined as the maximum loss within a neighborhood in the parameter space. However, we show that both sharp and flat minima can have a low perturbed loss, implying that SAM does not always prefer flat minima. Instead, we define a \textit{surrogate gap}, a measure equivalent to the dominant eigenvalue of Hessian at a local minimum when the radius of the neighborhood (to derive the perturbed loss) is small. The surrogate gap is easy to compute and feasible for direct minimization during training. Based on the above observations, we propose Surrogate \textbf{G}ap Guided \textbf{S}harpness-\textbf{A}ware \textbf{M}inimization (GSAM), a novel improvement over SAM with negligible computation overhead. Conceptually, GSAM consists of two steps: 1) a gradient descent like SAM to minimize the perturbed loss, and 2) an \textit{ascent} step in the \textit{orthogonal} direction (after gradient decomposition) to minimize the surrogate gap and yet not affect the perturbed loss. GSAM seeks a region with both small loss (by step 1) and low sharpness (by step 2), giving rise to a model with high generalization capabilities. Theoretically, we show the convergence of GSAM and provably better generalization than SAM. Empirically, GSAM consistently improves generalization (e.g., +3.2\% over SAM and +5.4\% over AdamW on ImageNet top-1 accuracy for ViT-B/32). Code is released at \url{ https://sites.google.com/view/gsam-iclr22/home}. △ Less

Submitted 19 March, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

Comments: Paper accepted by ICLR22, https://openreview.net/forum?id=edONMAnhLu-

arXiv:2203.03762 [pdf, other]

Defending Graph Convolutional Networks against Dynamic Graph Perturbations via Bayesian Self-supervision

Authors: Jun Zhuang, Mohammad Al Hasan

Abstract: In recent years, plentiful evidence illustrates that Graph Convolutional Networks (GCNs) achieve extraordinary accomplishments on the node classification task. However, GCNs may be vulnerable to adversarial attacks on label-scarce dynamic graphs. Many existing works aim to strengthen the robustness of GCNs; for instance, adversarial training is used to shield GCNs against malicious perturbations.… ▽ More In recent years, plentiful evidence illustrates that Graph Convolutional Networks (GCNs) achieve extraordinary accomplishments on the node classification task. However, GCNs may be vulnerable to adversarial attacks on label-scarce dynamic graphs. Many existing works aim to strengthen the robustness of GCNs; for instance, adversarial training is used to shield GCNs against malicious perturbations. However, these works fail on dynamic graphs for which label scarcity is a pressing issue. To overcome label scarcity, self-training attempts to iteratively assign pseudo-labels to highly confident unlabeled nodes but such attempts may suffer serious degradation under dynamic graph perturbations. In this paper, we generalize noisy supervision as a kind of self-supervised learning method and then propose a novel Bayesian self-supervision model, namely GraphSS, to address the issue. Extensive experiments demonstrate that GraphSS can not only affirmatively alert the perturbations on dynamic graphs but also effectively recover the prediction of a node classifier when the graph is under such perturbations. These two advantages prove to be generalized over three classic GCNs across five public graph datasets. △ Less

Submitted 7 March, 2022; originally announced March 2022.

Comments: The paper is accepted by AAAI 2022

arXiv:2203.03634 [pdf, other]

Remote blood pressure measurement via spatiotemporal map** of a short-time facial video

Authors: Jialiang Zhuang, Bin Li, Yun Zhang, Yuheng Chen, Xiujuan Zheng

Abstract: Blood pressure (BP) monitoring is vital in daily healthcare, especially for cardiovascular diseases. However, BP values are mainly acquired through the contact sensing method, which is inconvenient and unfriendly to continuous BP measurement. Hence, we propose an efficient end-to-end network to estimate the BP values from a facial video to achieve remote BP measurement in daily life. In this study… ▽ More Blood pressure (BP) monitoring is vital in daily healthcare, especially for cardiovascular diseases. However, BP values are mainly acquired through the contact sensing method, which is inconvenient and unfriendly to continuous BP measurement. Hence, we propose an efficient end-to-end network to estimate the BP values from a facial video to achieve remote BP measurement in daily life. In this study, we first derived a Spatial-temporal map of a short-time (~15s) facial video. According to the Spatial-temporal map, we then regressed the BP ranges by a designed blood pressure classifier and simultaneously calculated the specific value by a blood pressure calculator in each BP range. In addition, we also developed an innovative oversampling training strategy to handle the unbalanced data distribution problem. Finally, we trained the proposed network on a private dataset ASPD and tested it on the popular dataset MMSE-HR. As a result, the proposed network achieved a state-of-the-art MAE of 12.35 mmHg and 9.5 mmHg on systolic and diastolic BP measurements, which is better than the recent works. It concludes that the proposed method has excellent potential for camera-based BP monitoring in real-world scenarios. △ Less

Submitted 23 June, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

Comments: 7 pages, 7 figures

arXiv:2203.03329 [pdf, other]

Open Set Domain Adaptation By Novel Class Discovery

Authors: **gyu Zhuang, Ziliang Chen, Pengxu Wei, Guanbin Li, Liang Lin

Abstract: In Open Set Domain Adaptation (OSDA), large amounts of target samples are drawn from the implicit categories that never appear in the source domain. Due to the lack of their specific belonging, existing methods indiscriminately regard them as a single class unknown. We challenge this broadly-adopted practice that may arouse unexpected detrimental effects because the decision boundaries between the… ▽ More In Open Set Domain Adaptation (OSDA), large amounts of target samples are drawn from the implicit categories that never appear in the source domain. Due to the lack of their specific belonging, existing methods indiscriminately regard them as a single class unknown. We challenge this broadly-adopted practice that may arouse unexpected detrimental effects because the decision boundaries between the implicit categories have been fully ignored. Instead, we propose Self-supervised Class-Discovering Adapter (SCDA) that attempts to achieve OSDA by gradually discovering those implicit classes, then incorporating them to restructure the classifier and update the domain-adaptive features iteratively. SCDA performs two alternate steps to achieve implicit class discovery and self-supervised OSDA, respectively. By jointly optimizing for two tasks, SCDA achieves the state-of-the-art in OSDA and shows a competitive performance to unearth the implicit target classes. △ Less

Submitted 7 March, 2022; originally announced March 2022.

arXiv:2202.08199 [pdf, other]

Less is More: Surgical Phase Recognition from Timestamp Supervision

Authors: Xinpeng Ding, Xinjian Yan, Zixun Wang, Wei Zhao, Jian Zhuang, Xiaowei Xu, Xiaomeng Li

Abstract: Surgical phase recognition is a fundamental task in computer-assisted surgery systems. Most existing works are under the supervision of expensive and time-consuming full annotations, which require the surgeons to repeat watching videos to find the precise start and end time for a surgical phase. In this paper, we introduce timestamp supervision for surgical phase recognition to train the models wi… ▽ More Surgical phase recognition is a fundamental task in computer-assisted surgery systems. Most existing works are under the supervision of expensive and time-consuming full annotations, which require the surgeons to repeat watching videos to find the precise start and end time for a surgical phase. In this paper, we introduce timestamp supervision for surgical phase recognition to train the models with timestamp annotations, where the surgeons are asked to identify only a single timestamp within the temporal boundary of a phase. This annotation can significantly reduce the manual annotation cost compared to the full annotations. To make full use of such timestamp supervisions, we propose a novel method called uncertainty-aware temporal diffusion (UATD) to generate trustworthy pseudo labels for training. Our proposed UATD is motivated by the property of surgical videos, i.e., the phases are long events consisting of consecutive frames. To be specific, UATD diffuses the single labelled timestamp to its corresponding high confident ( i.e., low uncertainty) neighbour frames in an iterative way. Our study uncovers unique insights of surgical phase recognition with timestamp supervisions: 1) timestamp annotation can reduce 74% annotation time compared with the full annotation, and surgeons tend to annotate those timestamps near the middle of phases; 2) extensive experiments demonstrate that our method can achieve competitive results compared with full supervision methods, while reducing manual annotation cost; 3) less is more in surgical phase recognition, i.e., less but discriminative pseudo labels outperform full but containing ambiguous frames; 4) the proposed UATD can be used as a plug and play method to clean ambiguous labels near boundaries between phases, and improve the performance of the current surgical phase recognition methods. △ Less

Submitted 30 November, 2022; v1 submitted 16 February, 2022; originally announced February 2022.

Showing 1–50 of 112 results for author: Zhuang, J