Search | arXiv e-print repository

arXiv:2401.08661 [pdf]

Risk-anticipatory autonomous driving strategies considering vehicles' weights, based on hierarchical deep reinforcement learning

Authors: Di Chen, Hao Li, Zhicheng Jin, Huizhao Tu, Meixin Zhu

Abstract: Autonomous vehicles (AVs) have the potential to prevent accidents caused by driver errors and reduce road traffic risks. Due to the nature of heavy vehicles, whose collisions cause more serious crashes, the weights of vehicles need to be considered when making driving strategies aimed at reducing the potential risks and their consequences in the context of autonomous driving. This study develops an autonomous driving strategy based on risk anticipation, considering the weights of surrounding vehicles and using hierarchical deep reinforcement learning. A risk indicator integrating surrounding vehicles' weights, based on the risk field theory, is proposed and incorporated into autonomous driving decisions. A hybrid action space is designed to allow for left lane changes, right lane changes and car-following, which enables AVs to act more freely and realistically whenever possible. To solve the above hybrid decision-making problem, a hierarchical proximal policy optimization (HPPO) algorithm with an attention mechanism (AT-HPPO) is developed, providing great advantages in maintaining stable performance with high robustness and generalization. An indicator, potential collision energy in conflicts (PCEC), is newly proposed to evaluate the performance of the developed AV driving strategy from the perspective of the consequences of potential accidents. The performance evaluation results in simulation and dataset demonstrate that our model provides driving strategies that reduce both the likelihood and consequences of potential accidents, at the same time maintaining driving efficiency. The developed method is especially meaningful for AVs driving on highways, where heavy vehicles make up a high proportion of the traffic.

Submitted 7 May, 2024; v1 submitted 27 December, 2023; originally announced January 2024.

Comments: 14 pages, 5 figures, 6 tables

arXiv:2401.07208 [pdf, other]

Enhanced Few-Shot Class-Incremental Learning via Ensemble Models

Authors: Mingli Zhu, Zihao Zhu, Sihong Chen, Chen Chen, Baoyuan Wu

Abstract: Few-shot class-incremental learning (FSCIL) aims to continually fit new classes with limited training data, while maintaining the performance of previously learned classes. The main challenges are overfitting the rare new training samples and forgetting old classes. While catastrophic forgetting has been extensively studied, the overfitting problem has attracted less attention in FSCIL. To tackle the overfitting challenge, we design a new ensemble model framework combined with data augmentation to boost generalization. In this way, the enhanced model works as a library storing abundant features to guarantee fast adaptation to downstream tasks. Specifically, the multi-input multi-output ensemble structure is applied with a spatial-aware data augmentation strategy, aiming at diversifying the feature extractor and alleviating overfitting in incremental sessions. Moreover, self-supervised learning is also integrated to further improve the model generalization. Comprehensive experimental results show that the proposed method can indeed mitigate the overfitting problem in FSCIL, and outperform the state-of-the-art methods.

Submitted 21 March, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

arXiv:2401.05507 [pdf, other]

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

Authors: Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, Fei Wu

Abstract: In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. These tasks require agents to solve complex tasks end-to-end by interacting with an execution environment. This benchmark contains DAEval, a dataset consisting of 257 data analysis questions derived from 52 CSV files, and an agent framework which incorporates LLMs to serve as data analysis agents for both serving and evaluation. Since data analysis questions are often open-ended and hard to evaluate without human supervision, we adopt a format-prompting technique to convert each question into a closed-form format so that they can be automatically evaluated. Our extensive benchmarking of 34 LLMs uncovers the current challenges encountered in data analysis tasks. In addition, building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench. Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent

Submitted 11 March, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: 27 pages, 7 figures, work in progress

arXiv:2401.04181 [pdf, other]

Language-Conditioned Robotic Manipulation with Fast and Slow Thinking

Authors: Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Zhengping Che, Chaomin Shen, Yaxin Peng, Dong Liu, Feifei Feng, Jian Tang

Abstract: Language-conditioned robotic manipulation aims to transfer natural language instructions into executable actions, from simple pick-and-place to tasks requiring intent recognition and visual reasoning. Inspired by the dual process theory in cognitive science, which suggests two parallel systems of fast and slow thinking in human decision-making, we introduce Robotics with Fast and Slow Thinking (RFST), a framework that mimics human cognitive architecture to classify tasks and makes decisions on two systems based on instruction types. Our RFST consists of two key components: 1) an instruction discriminator to determine which system should be activated based on the current user instruction, and 2) a slow-thinking system that is comprised of a fine-tuned vision language model aligned with the policy networks, which allows the robot to recognize user intention or perform reasoning tasks. To assess our methodology, we built a dataset featuring real-world trajectories, capturing actions ranging from spontaneous impulses to tasks requiring deliberate contemplation. Our results, both in simulation and real-world scenarios, confirm that our approach adeptly manages intricate tasks that demand intent recognition and reasoning. The project is available at https://jlm-z.github.io/RSFT/

Submitted 1 February, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

Comments: accepted to ICRA2024

arXiv:2401.03128 [pdf, other]

Manifold-based Shapley for SAR Recognization Network Explanation

Authors: Xuran Hu, Mingzhe Zhu, Yuanjing Liu, Zhenpeng Feng, LJubisa Stankovic

Abstract: Explainable artificial intelligence (XAI) holds immense significance in enhancing the deep neural network's transparency and credibility, particularly in some risky and high-cost scenarios, like synthetic aperture radar (SAR). Shapley is a game-based explanation technique with robust mathematical foundations. However, Shapley assumes that the model's features are independent, rendering Shapley explanation invalid for high-dimensional models. This study introduces a manifold-based Shapley method by projecting high-dimensional features into low-dimensional manifold features and subsequently obtaining Fusion-Shap, which aims at (1) addressing the issue of erroneous explanations encountered by traditional Shap; (2) resolving the challenge of interpretability that traditional Shap faces in complex scenarios.

Submitted 6 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures

ACM Class: H.1.m
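
For readers unfamiliar with the Shapley value that the abstract of arXiv:2401.03128 builds on, the following is a minimal sketch of the classical exact computation on a toy cooperative game. It is not the manifold-based method or Fusion-Shap from the paper; the function names and the toy game (`toy_value`, players "a", "b", "c") are assumptions made purely for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def toy_value(coalition):
    """Hypothetical game: additive payoffs plus a bonus when "a" and "b" cooperate."""
    base = {"a": 1.0, "b": 2.0, "c": 3.0}
    total = sum(base[p] for p in coalition)
    if {"a", "b"} <= coalition:
        total += 1.0  # interaction term shared between "a" and "b"
    return total


phi = shapley_values(["a", "b", "c"], toy_value)
```

In this toy game the interaction bonus is split evenly between "a" and "b", and the values sum to the payoff of the full coalition. The enumeration grows exponentially with the number of players, which is one reason practical explainers rely on approximations.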

arXiv:2401.03122 [pdf, other]

SAR Despeckling via Regional Denoising Diffusion Probabilistic Model

Authors: Xuran Hu, Ziqiang Xu, Zhihan Chen, Zhengpeng Feng, Mingzhe Zhu, LJubisa Stankovic

Abstract: Speckle noise poses a significant challenge in maintaining the quality of synthetic aperture radar (SAR) images, so SAR despeckling techniques have drawn increasing attention. Despite the tremendous advancements of deep learning in fixed-scale SAR image despeckling, these methods still struggle to deal with large-scale SAR images. To address this problem, this paper introduces a novel despeckling approach termed Region Denoising Diffusion Probabilistic Model (R-DDPM) based on generative models. R-DDPM enables versatile despeckling of SAR images across various scales, accomplished within a single training session. Moreover, the artifacts in the fused SAR images can be avoided effectively with the utilization of region-guided inverse sampling. Experiments with our proposed R-DDPM on Sentinel-1 data demonstrate superior performance to existing methods.

Submitted 5 January, 2024; originally announced January 2024.

Comments: 5 pages, 5 figures

ACM Class: I.4.4

arXiv:2401.02883 [pdf, other]

iPolicy: Incremental Policy Algorithms for Feedback Motion Planning

Authors: Guoxiang Zhao, Devesh K. Jha, Yebin Wang, Minghui Zhu

Abstract: This paper presents policy-based motion planning for robotic systems. The motion planning literature has been mostly focused on open-loop trajectory planning which is followed by tracking online. In contrast, we solve the problem of path planning and controller synthesis simultaneously by solving the related feedback control problem. We present a novel incremental policy (iPolicy) algorithm for motion planning, which integrates sampling-based methods and set-valued optimal control methods to compute feedback controllers for the robotic system. In particular, we use sampling to incrementally construct the state space of the system. Asynchronous value iterations are performed on the sampled state space to synthesize the incremental policy feedback controller. We show the convergence of the estimates to the optimal value function in continuous state space. Numerical results with various dynamical systems (including nonholonomic systems) verify the optimality and effectiveness of iPolicy.

Submitted 5 January, 2024; originally announced January 2024.
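
The abstract of arXiv:2401.02883 mentions asynchronous value iterations on a sampled state space. As a rough illustration of that general idea only, the sketch below runs in-place (Gauss-Seidel style) value iteration on a tiny hand-written graph and then reads off a greedy feedback policy. It is not the iPolicy algorithm: there is no incremental sampling or set-valued control here, and the graph, costs, and function names are all hypothetical.

```python
def async_value_iteration(edges, goal, passes):
    """In-place value iteration for minimum cost-to-go on a finite graph.

    edges maps each state to a dict {next_state: transition_cost}.
    """
    values = {state: float("inf") for state in edges}
    values[goal] = 0.0
    for _ in range(passes):
        for state, successors in edges.items():
            if state == goal or not successors:
                continue
            # Asynchronous update: uses successor values as they currently
            # stand, including any already refreshed earlier in this pass.
            values[state] = min(cost + values[nxt] for nxt, cost in successors.items())
    return values


def greedy_policy(edges, values, goal):
    """Feedback policy: from each state, move to the successor with least cost-to-go."""
    policy = {}
    for state, successors in edges.items():
        if state == goal or not successors:
            continue
        policy[state] = min(successors, key=lambda nxt: successors[nxt] + values[nxt])
    return policy


# Hypothetical four-state graph; state 3 is the goal.
edges = {
    0: {1: 1.0, 2: 4.0},
    1: {2: 1.0, 3: 5.0},
    2: {3: 1.0},
    3: {},
}
values = async_value_iteration(edges, goal=3, passes=len(edges))
policy = greedy_policy(edges, values, goal=3)
```

Because the policy is defined for every state rather than along a single trajectory, it acts as a feedback controller: the robot can recover the best next move from wherever it ends up.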

arXiv:2401.02814 [pdf, other]

Object-Centric Instruction Augmentation for Robotic Manipulation

Authors: Junjie Wen, Yichen Zhu, Minjie Zhu, Jinming Li, Zhiyuan Xu, Zhengping Che, Chaomin Shen, Yaxin Peng, Dong Liu, Feifei Feng, Jian Tang

Abstract: Humans interpret scenes by recognizing both the identities and positions of objects in their observations. For a robot to perform tasks such as "pick and place", understanding both what the objects are and where they are located is crucial. While the former has been extensively discussed in the literature that uses the large language model to enrich the text descriptions, the latter remains underexplored. In this work, we introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic and information-dense language instruction with position cues. We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instruction, thus aiding the policy network in mastering actions for versatile manipulation. Additionally, we present a feature reuse mechanism to integrate the vision-language features from off-the-shelf pre-trained MLLM into policy networks. Through a series of simulated and real-world robotic tasks, we demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.

Submitted 1 February, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

Comments: accepted to ICRA2024

arXiv:2401.02330 [pdf, other]

LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model

Authors: Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang

Abstract: In this paper, we introduce LLaVA-$φ$ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller language models, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its remarkable performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction, while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi

Submitted 22 February, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

Comments: The datasets were incomplete as they did not include all the necessary copyrights

arXiv:2401.00625 [pdf, ps, other]

Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models

Authors: Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao

Abstract: The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated models like OpenAI's ChatGPT, represents a significant advancement in artificial intelligence. These models, however, bring forth substantial challenges in the high consumption of computational, memory, energy, and financial resources, especially in environments with limited resource capabilities. This survey aims to systematically address these challenges by reviewing a broad spectrum of techniques designed to enhance the resource efficiency of LLMs. We categorize methods based on their optimization focus: computational, memory, energy, financial, and network resources and their applicability across various stages of an LLM's lifecycle, including architecture design, pretraining, finetuning, and system design. Additionally, the survey introduces a nuanced categorization of resource efficiency techniques by their specific resource types, which uncovers the intricate relationships and mappings between various resources and corresponding optimization techniques. A standardized set of evaluation metrics and datasets is also presented to facilitate consistent and fair comparisons across different models and techniques. By offering a comprehensive overview of the current state of the art and identifying open research avenues, this survey serves as a foundational reference for researchers and practitioners, aiding them in developing more sustainable and efficient LLMs in a rapidly evolving landscape.

Submitted 3 January, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

Comments: Preprint. GitHub repo: https://github.com/tiingweii-shii/Awesome-Resource-Efficient-LLM-Papers

arXiv:2312.15254 [pdf, other]

Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code

Authors: Mingzheng Zhu, Hao Fu, Jun Wu, Chi Zhang, Wei Xie, Xiang-Yang Li

Abstract: As the leading candidate of quantum error correction codes, surface code suffers from significant overhead, such as execution time. Reducing the circuit's execution time not only enhances its execution efficiency but also improves fidelity. However, finding the shortest execution time is NP-hard. In this work, we study the surface code mapping and scheduling problem. To reduce the execution time of a quantum circuit, we first introduce two novel metrics: Circuit Parallelism Degree and Chip Communication Capacity to quantitatively characterize quantum circuits and chips. Then, we propose a resource-adaptive mapping and scheduling method, named Ecmas, with customized initialization of chip resources for each circuit. Ecmas can dramatically reduce the execution time in both double defect and lattice surgery models. Furthermore, we provide an additional version Ecmas-ReSu for sufficient qubits, which is performance-guaranteed and more efficient. Extensive numerical tests on practical datasets show that Ecmas outperforms the state-of-the-art methods by reducing the execution time by 51.5% on average for double defect model. Ecmas can reach the optimal result in most benchmarks, reducing the execution time by up to 13.9% for lattice surgery model.

Submitted 23 December, 2023; originally announced December 2023.

Comments: 12 pages, Accepted to IEEE/ACM International Symposium on Code Generation and Optimization

arXiv:2312.15043 [pdf, other]

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Authors: Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, Jianwei Yin

Abstract: Visual grounding, a crucial vision-language task involving the understanding of the visual context based on the query expression, necessitates the model to capture the interactions between objects, as well as various spatial and attribute information. However, the annotation data of visual grounding task is limited due to its time-consuming and labor-intensive annotation process, resulting in the trained models being constrained from generalizing their capability to a broader domain. To address this challenge, we propose GroundVLP, a simple yet effective zero-shot method that harnesses visual grounding ability from the existing models trained from image-text pairs and pure object detection data, both of which are more conveniently obtainable and offer a broader domain compared to visual grounding annotation data. GroundVLP proposes a fusion mechanism that combines the heatmap from GradCAM and the object proposals of open-vocabulary detectors. We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets, surpassing prior zero-shot state-of-the-art by approximately 28% on the test split of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs comparably to or even better than some non-VLP-based supervised models on the Flickr30k entities dataset. Our code is available at https://github.com/om-ai-lab/GroundVLP.

Submitted 22 December, 2023; originally announced December 2023.

arXiv:2312.13530 [pdf, other]

HW-V2W-Map: Hardware Vulnerability to Weakness Mapping Framework for Root Cause Analysis with GPT-assisted Mitigation Suggestion

Authors: Yu-Zheng Lin, Muntasir Mamun, Muhtasim Alam Chowdhury, Shuyu Cai, Mingyu Zhu, Banafsheh Saber Latibari, Kevin Immanuel Gubbi, Najmeh Nazari Bavarsad, Arjun Caputo, Avesta Sasan, Houman Homayoun, Setareh Rafatirad, Pratik Satam, Soheil Salehi

Abstract: The escalating complexity of modern computing frameworks has resulted in a surge in the cybersecurity vulnerabilities reported to the National Vulnerability Database (NVD) by practitioners. Although the NVD is one of the most significant databases for the latest insights into vulnerabilities, extracting meaningful trends from such a large amount of unstructured data is still challenging without the application of suitable technological methodologies. Previous efforts have mostly concentrated on software vulnerabilities; however, a holistic strategy that incorporates approaches for mitigating vulnerabilities, score prediction, and a knowledge-generating system that may extract relevant insights from the Common Weakness Enumeration (CWE) and Common Vulnerabilities and Exposures (CVE) databases is notably absent. As the number of hardware attacks on Internet of Things (IoT) devices continues to rapidly increase, we present the Hardware Vulnerability to Weakness Mapping (HW-V2W-Map) Framework, which is a Machine Learning (ML) framework focusing on hardware vulnerabilities and IoT security. The architecture that we have proposed incorporates an Ontology-driven Storytelling framework, which automates the process of updating the ontology in order to recognize patterns and evolution of vulnerabilities over time and provides approaches for mitigating the vulnerabilities. The repercussions of vulnerabilities can be mitigated as a result of this, and conversely, future exposures can be predicted and prevented. Furthermore, our proposed framework utilized Generative Pre-trained Transformer (GPT) Large Language Models (LLMs) to provide mitigation suggestions.

Submitted 20 December, 2023; originally announced December 2023.

Comments: 22 pages, 10 pages appendix, 10 figures, Submitted to ACM TODAES

arXiv:2312.13023 [pdf, other]

Class Information Guided Reconstruction for Automatic Modulation Open-Set Recognition

Authors: Ziwei Zhang, Mengtao Zhu, Jiabin Liu, Yunjie Li, Shafei Wang

Abstract: Automatic Modulation Recognition (AMR) is a crucial technology in the domains of radar and communications. Traditional AMR approaches assume a closed-set scenario, where unknown samples are forcibly misclassified into known classes, leading to serious consequences for situation awareness and threat assessment. To address this issue, Automatic Modulation Open-set Recognition (AMOSR) defines two tasks as Known Class Classification (KCC) and Unknown Class Identification (UCI). However, AMOSR faces core challenges in terms of inappropriate decision boundaries and sparse feature distributions. To overcome the aforementioned challenges, we propose a Class Information guided Reconstruction (CIR) framework, which leverages reconstruction losses to distinguish known and unknown classes. To enhance distinguishability, we design Class Conditional Vectors (CCVs) to match the latent representations extracted from input samples, achieving perfect reconstruction for known samples while yielding poor results for unknown ones. We also propose a Mutual Information (MI) loss function to ensure reliable matching, with upper and lower bounds of MI derived for tractable optimization and mathematical proofs provided. The mutually beneficial CCVs and MI facilitate the CIR attaining optimal UCI performance without compromising KCC accuracy, especially in scenarios with a higher proportion of unknown classes. Additionally, a denoising module is introduced before reconstruction, enabling the CIR to achieve a significant performance improvement at low SNRs. Experimental results on simulated and measured signals validate the effectiveness and the robustness of the proposed method.

Submitted 14 April, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: 14 pages, 11 figures

arXiv:2312.10685 [pdf, other]

Null energy condition violation during inflation and pulsar timing array observations

Authors: Gen Ye, Mian Zhu, Yong Cai

Abstract: Recently, evidence of stochastic gravitational wave background (SGWB) signals observed by pulsar timing array (PTA) collaborations has prompted investigations into their origins. We explore the compatibility of a proposed inflationary scenario, incorporating an intermediate null energy condition (NEC)-violating phase, with the PTA observations. The NEC violation potentially amplifies the primordial tensor power spectrum, offering a promising explanation for PTA observations. Numerical analyses, primarily focused on NANOGrav's 15-year results, reveal the model's compatibility with PTA data. Notably, the model predicts a nearly scale-invariant GW spectrum in the mHz frequency range, which sets our scenario apart from other interpretations predicting a red primordial GW spectrum on smaller scales.

Submitted 6 February, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

Comments: 14 pages plus one appendix, 4 figures; match published version; reference added

arXiv:2312.10379 [pdf, other]

Multi-parameter quantum metrology with stabilized multi-mode squeezed state

Authors: Yue Li, Xu Cheng, Lingna Wang, Xingyu Zhao, Waner Hou, Yi Li, Kamran Rehan, Mingdong Zhu, Lin Yan, Xi Qin, Xinhua Peng, Haidong Yuan, Yiheng Lin, Jiangfeng Du

Abstract: Squeezing a quantum state along a specific direction has long been recognized as a crucial technique for enhancing the precision of quantum metrology by reducing parameter uncertainty. However, practical quantum metrology often involves the simultaneous estimation of multiple parameters, necessitating the use of high-quality squeezed states along multiple orthogonal axes to surpass the standard quantum limit for all relevant parameters. In addition, a temporally stabilized squeezed state can provide an event-ready probe for parameters, regardless of the initial state, and robust to the timing of the state preparation process once stabilized. In this work, we generate and stabilize a two-mode squeezed state along two secular motional modes in a vibrating trapped ion with reservoir engineering, despite starting from a thermal state of the motion. Leveraging this resource, we demonstrate an estimation of two simultaneous collective displacements along the squeezed axes, achieving improvements surpassing the classical limit by up to 6.9(3) and 7.0(3) decibels (dB), respectively. Our demonstration can be readily scaled to squeezed states with even more modes. The practical implications of our findings span a wide range of applications, including quantum sensing, quantum imaging, and various fields that demand precise measurements of multiple parameters.

Submitted 16 December, 2023; originally announced December 2023.

arXiv:2312.10343 [pdf, other]

In-Sensor Radio Frequency Computing for Energy-Efficient Intelligent Radar

Authors: Yang Sui, Minning Zhu, Lingyi Huang, Chung-Tse Michael Wu, Bo Yuan

Abstract: Radio Frequency Neural Networks (RFNNs) have demonstrated advantages in realizing intelligent applications across various domains. However, as the model size of deep neural networks rapidly increases, implementing large-scale RFNN in practice requires an extensive number of RF interferometers and consumes a substantial amount of energy. To address this challenge, we propose to utilize low-rank decomposition to transform a large-scale RFNN into a compact RFNN while almost preserving its accuracy. Specifically, we develop a Tensor-Train RFNN (TT-RFNN) where each layer comprises a sequence of low-rank third-order tensors, leading to a notable reduction in parameter count, thereby optimizing RF interferometer utilization in comparison to the original large-scale RFNN. Additionally, considering the inherent physical errors when mapping TT-RFNN to RF device parameters in real-world deployment, from a general perspective, we construct the Robust TT-RFNN (RTT-RFNN) by incorporating a robustness solver on TT-RFNN to enhance its robustness. To adapt the RTT-RFNN to varying requirements of reshaping operations, we further provide a reconfigurable reshaping solution employing RF switch matrices. Empirical evaluations conducted on MNIST and CIFAR-10 datasets show the effectiveness of our proposed method.

Submitted 16 December, 2023; originally announced December 2023.

arXiv:2312.09999 [pdf, ps, other]

On graphs without cycles of length 0 modulo 4

Authors: Ervin Győri, Binlong Li, Nika Salia, Casey Tompkins, Kitti Varga, Manran Zhu

Abstract: Bollobás proved that for every $k$ and $\ell$ such that $k\mathbb{Z}+\ell$ contains an even number, an $n$-vertex graph containing no cycle of length $\ell \bmod k$ can contain at most a linear number of edges. The precise (or asymptotic) value of the maximum number of edges in such a graph is known for very few pairs $\ell$ and $k$. In this work we precisely determine the maximum number of edges in a graph containing no cycle of length $0 \bmod 4$.

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2312.08890 [pdf, other]

Defenses in Adversarial Machine Learning: A Survey

Authors: Baoyuan Wu, Shaokui Wei, Mingli Zhu, Meixi Zheng, Zihao Zhu, Mingda Zhang, Hongrui Chen, Danni Yuan, Li Liu, Qingshan Liu

Abstract: Adversarial phenomena have been widely observed in machine learning (ML) systems, especially in those using deep neural networks: in some particular cases, ML systems may produce predictions that are inconsistent with, and incomprehensible to, humans. This phenomenon poses a serious security threat to the practical application of ML systems, and several advanced attack paradigms have been developed to explore it, mainly including backdoor attacks, weight attacks, and adversarial examples. For each individual attack paradigm, various defense paradigms have been developed to improve the model robustness against the corresponding attack paradigm. However, due to the independence and diversity of these defense paradigms, it is difficult to examine the overall robustness of an ML system against different kinds of attacks. This survey aims to build a systematic review of all existing defense paradigms from a unified perspective. Specifically, from the life-cycle perspective, we factorize a complete machine learning system into five stages: pre-training, training, post-training, deployment, and inference. Then, we present a clear taxonomy to categorize and review representative defense methods at each individual stage. The unified perspective and presented taxonomies not only facilitate the analysis of the mechanism of each defense paradigm but also help us to understand connections and differences among different defense paradigms, which may inspire future research to develop more advanced, comprehensive defenses.

Submitted 13 December, 2023; originally announced December 2023.

Comments: 21 pages, 5 figures, 2 tables, 237 reference papers

arXiv:2312.08880 [pdf, other]

GenDet: Towards Good Generalizations for AI-Generated Image Detection

Authors: Mingjian Zhu, Hanting Chen, Mouxiao Huang, Wei Li, Hailin Hu, Jie Hu, Yunhe Wang

Abstract: The misuse of AI imagery can have harmful societal effects, prompting the creation of detectors to combat issues like the spread of fake news. Existing methods can effectively detect images generated by seen generators, but it is challenging to detect those generated by unseen generators. They do not concentrate on amplifying the output discrepancy when detectors process real versus fake images. This results in a close output distribution of real and fake samples, increasing classification difficulty in detecting unseen generators. This paper addresses the unseen-generator detection problem by considering this task from the perspective of anomaly detection and proposes an adversarial teacher-student discrepancy-aware framework. Our method encourages smaller output discrepancies between the student and the teacher models for real images while aiming for larger discrepancies for fake images. We employ adversarial learning to train a feature augmenter, which promotes smaller discrepancies between teacher and student networks when the inputs are fake images. Our method has achieved state-of-the-art on public benchmarks, and the visualization results show that a large output discrepancy is maintained when faced with various types of generators.

Submitted 12 December, 2023; originally announced December 2023.

arXiv:2312.06097 [pdf, other]

doi 10.1007/s11433-023-2219-7

The FAST all sky HI survey (FASHI): The first release of catalog

Authors: Chuan-Peng Zhang, M. Zhu, P. Jiang, C. Cheng, J. Wang, J. Wang, J. -L. Xu, X. -L. Liu, N. -P. Yu, L. Qian, H. Yu, M. Ai, Y. Jing, C. Xu, Z. Liu, X. Guan, C. Sun, Q. Yang, M. Huang, Q. Hao, FAST Collaboration

Abstract: The FAST All Sky HI survey (FASHI) was designed to cover the entire sky observable by the Five-hundred-meter Aperture Spherical radio Telescope (FAST), spanning approximately 22000 square degrees of declination between -14 deg and +66 deg, and in the frequency range of 1050-1450 MHz, with the expectation of eventually detecting more than 100000 HI sources. Between August 2020 and June 2023, FASHI had covered more than 7600 square degrees, which is approximately 35% of the total sky observable by FAST. It has a median detection sensitivity of around 0.76 mJy/beam and a spectral line velocity resolution of ~6.4 km/s at a frequency of ~1.4 GHz. As of now, a total of 41741 extragalactic HI sources have been detected in the frequency range 1305.5-1419.5 MHz, corresponding to a redshift limit of z<0.09. By cross-matching FASHI sources with the Siena Galaxy Atlas (SGA) and the Sloan Digital Sky Survey (SDSS) catalogs, we found that 16972 (40.7%) sources have spectroscopic redshifts and 10975 (26.3%) sources have only photometric redshifts. Most of the remaining 13794 (33.0%) HI sources are located in the direction of the Galactic plane, making their optical counterparts difficult to identify due to high extinction or high contamination of Galactic stellar sources. Based on current survey results, the FASHI survey is an unprecedented blind extragalactic HI survey. It has higher spectral and spatial resolution and broader coverage than the Arecibo Legacy Fast ALFA Survey (ALFALFA). When completed, FASHI will provide the largest extragalactic HI catalog and an objective view of HI content and large-scale structure in the local universe.

Submitted 10 December, 2023; originally announced December 2023.

Comments: 22 pages, 12 figures, published in SCPMA. All catalogs are available at https://zcp521.github.io/fashi and https://fast.bao.ac.cn/cms/article/271/

Journal ref: Sci. China-Phys. Mech. Astron. 67, 219511 (2024)

arXiv:2312.04815 [pdf, other]

Not All Negatives Are Worth Attending to: Meta-Bootstrapping Negative Sampling Framework for Link Prediction

Authors: Yakun Wang, Binbin Hu, Shuo Yang, Meiqi Zhu, Zhiqiang Zhang, Qiyang Zhang, Jun Zhou, Guo Ye, Huimei He

Abstract: The rapid development of graph neural networks (GNNs) encourages the rising of link prediction, achieving promising performance with various applications. Unfortunately, through a comprehensive analysis, we surprisingly find that current link predictors with dynamic negative samplers (DNSs) suffer from the migration phenomenon between "easy" and "hard" samples, which goes against the preference of DNS of choosing "hard" negatives, thus severely hindering capability. Towards this end, we propose the MeBNS framework, serving as a general plugin that can potentially improve current negative sampling based link predictors. In particular, we elaborately devise a Meta-learning Supported Teacher-student GNN (MST-GNN) that is not only built upon teacher-student architecture for alleviating the migration between "easy" and "hard" samples but also equipped with a meta learning based sample re-weighting module for helping the student GNN distinguish "hard" samples in a fine-grained manner. To effectively guide the learning of MST-GNN, we prepare a Structure enhanced Training Data Generator (STD-Generator) and an Uncertainty based Meta Data Collector (UMD-Collector) for supporting the teacher and student GNN, respectively. Extensive experiments show that the MeBNS achieves remarkable performance across six link prediction benchmark datasets.

Submitted 11 December, 2023; v1 submitted 7 December, 2023; originally announced December 2023.

arXiv:2312.04584 [pdf, other]

Towards Sample-specific Backdoor Attack with Clean Labels via Attribute Trigger

Authors: Yiming Li, Mingyan Zhu, Junfeng Guo, Tao Wei, Shu-Tao Xia, Zhan Qin

Abstract: Currently, sample-specific backdoor attacks (SSBAs) are the most advanced and malicious methods since they can easily circumvent most of the current backdoor defenses. In this paper, we reveal that SSBAs are not sufficiently stealthy due to their poisoned-label nature, where users can discover anomalies if they check the image-label relationship. In particular, we demonstrate that it is ineffective to directly generalize existing SSBAs to their clean-label variants by poisoning samples solely from the target class. We reveal that it is primarily due to two reasons, including (1) the 'antagonistic effects' of ground-truth features and (2) the learning difficulty of sample-specific features. Accordingly, trigger-related features of existing SSBAs cannot be effectively learned under the clean-label setting due to their mild trigger intensity required for ensuring stealthiness. We argue that the intensity constraint of existing SSBAs is mostly because their trigger patterns are 'content-irrelevant' and therefore act as 'noises' for both humans and DNNs. Motivated by this understanding, we propose to exploit content-relevant features, a.k.a. (human-relied) attributes, as the trigger patterns to design clean-label SSBAs. This new attack paradigm is dubbed backdoor attack with attribute trigger (BAAT). Extensive experiments are conducted on benchmark datasets, which verify the effectiveness of our BAAT and its resistance to existing defenses.

Submitted 10 December, 2023; v1 submitted 3 December, 2023; originally announced December 2023.

Comments: 14 pages

arXiv:2311.18268 [pdf]

An Explorative Study on Document Type Assignment of Review Articles in Web of Science, Scopus and Journals' Website

Authors: Manman Zhu, Xinyue Lu, Fuyou Chen, Liying Yang, Zhesi Shen

Abstract: Accurately assigning the document type of review articles in citation index databases like Web of Science (WoS) and Scopus is important. This study investigates, on a large scale, the document type assignment of review articles in Web of Science, Scopus and journals' websites. 27,616 papers from 160 journals from 10 review journal series indexed in SCI are analyzed. The document types of these papers, as labeled on the journals' websites and as assigned by WoS and Scopus, are retrieved and compared to determine the assignment accuracy and identify the possible reasons for wrong assignments. For the document types labeled on the website, we further differentiate them into explicit reviews and implicit reviews based on whether the website directly indicates that the paper is a review. We find that WoS and Scopus performed similarly, with an average precision of about 99% and recall of about 80%. However, there were some differences between WoS and Scopus across different journal series and within the same journal series. The assignment accuracy of WoS and Scopus for implicit reviews dropped significantly. This study provides a reference for the accuracy of document type assignment of review articles in WoS and Scopus, and the identified pattern for assigning implicit reviews may be helpful for better labeling on websites, WoS and Scopus.

Submitted 30 November, 2023; originally announced November 2023.

arXiv:2311.17940 [pdf, other]

Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames

Authors: Chao Chen, Mingzhi Zhu, Ankush Pratap Singh, Yu Yan, Felix Juefei Xu, Chen Feng

Abstract: We propose scene summarization as a new video-based scene understanding task. It aims to summarize a long video walkthrough of a scene into a small set of frames that are spatially diverse in the scene, which has many important applications, such as in surveillance, real estate, and robotics. It stems from video summarization but focuses on long and continuous videos from moving cameras, instead of user-edited fragmented video clips that are more commonly studied in existing video summarization works. Our solution to this task is a two-stage self-supervised pipeline named SceneSum. Its first stage uses clustering to segment the video sequence. Our key idea is to combine visual place recognition (VPR) into this clustering process to promote spatial diversity. Its second stage needs to select a representative keyframe from each cluster as the summary while respecting resource constraints such as memory and disk space limits. Additionally, if the ground truth image trajectory is available, our method can be easily augmented with a supervised loss to enhance the clustering and keyframe selection. Extensive experiments on both real-world and simulated datasets show our method outperforms common video summarization baselines by 50%.

Submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.17425

SpeechAct: Towards Generating Whole-body Motion from Speech

Authors: Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Yebin Liu, Kun Li

Abstract: This paper addresses the problem of generating whole-body motion from speech. Despite great successes, prior methods still struggle to produce reasonable and diverse whole-body motions from speech. This is due to their reliance on suboptimal representations and a lack of strategies for generating diverse results. To address these challenges, we present a novel hybrid point representation to achieve accurate and continuous motion generation, e.g., avoiding foot skating, and this representation can be transformed into an easy-to-use representation, i.e., SMPL-X body mesh, for many applications. To generate whole-body motion from speech, for facial motion, closely tied to the audio signal, we introduce an encoder-decoder architecture to achieve deterministic outcomes. However, for the body and hands, which have weaker connections to the audio signal, we aim to generate diverse yet reasonable motions. To boost diversity in motion generation, we propose a contrastive motion learning method to encourage the model to produce more distinctive representations. Specifically, we design a robust VQ-VAE to learn a quantized motion codebook using our hybrid representation. Then, we regress the motion representation from the audio signal by a translation model employing our contrastive motion learning method. Experimental results validate the superior performance and the correctness of our model. The project page is available for research purposes at http://cic.tju.edu.cn/faculty/likun/projects/SpeechAct.

Submitted 13 June, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: The paper has been archived without permission from the newly added author

arXiv:2311.14631 [pdf, other]

CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization

Authors: Ruoyu Zhao, Mingrui Zhu, Shiyin Dong, Nannan Wang, Xinbo Gao

Abstract: We propose CatVersion, an inversion-based method that learns the personalized concept through a handful of examples. Subsequently, users can utilize text prompts to generate images that embody the personalized concept, thereby achieving text-to-image personalization. In contrast to existing approaches that emphasize word embedding learning or parameter fine-tuning for the diffusion model, which potentially causes concept dilution or overfitting, our method concatenates embeddings on the feature-dense space of the text encoder in the diffusion model to learn the gap between the personalized concept and its base class, aiming to maximize the preservation of prior knowledge in diffusion models while restoring the personalized concepts. To this end, we first dissect the text encoder's integration in the image generation process to identify the feature-dense space of the encoder. Afterward, we concatenate embeddings on the Keys and Values in this space to learn the gap between the personalized concept and its base class. In this way, the concatenated embeddings ultimately manifest as a residual on the original attention output. To more accurately and unbiasedly quantify the results of personalized image generation, we improve the CLIP image alignment score based on masks. Qualitatively and quantitatively, CatVersion helps to restore personalization concepts more faithfully and enables more robust editing.

Submitted 30 November, 2023; v1 submitted 24 November, 2023; originally announced November 2023.

Comments: For the project page, please visit https://royzhao926.github.io/CatVersion-page/

arXiv:2311.13246 [pdf, other]

CoachLM: Automatic Instruction Revisions Improve the Data Quality in LLM Instruction Tuning

Authors: Yilun Liu, Shimin Tao, Xiaofeng Zhao, Ming Zhu, Wenbing Ma, Junhao Zhu, Chang Su, Yutai Hou, Miao Zhang, Min Zhang, Hongxia Ma, Li Zhang, Hao Yang, Yanfei Jiang

Abstract: Instruction tuning is crucial for enabling Language Learning Models (LLMs) in responding to human instructions. The quality of instruction pairs used for tuning greatly affects the performance of LLMs. However, the manual creation of high-quality instruction datasets is costly, leading to the adoption of automatic generation of instruction pairs by LLMs as a popular alternative. To ensure the high quality of LLM-generated instruction datasets, several approaches have been proposed. Nevertheless, existing methods either compromise dataset integrity by filtering a large proportion of samples, or are unsuitable for industrial applications. In this paper, instead of discarding low-quality samples, we propose CoachLM, a novel approach to enhance the quality of instruction datasets through automatic revisions on samples in the dataset. CoachLM is trained from the samples revised by human experts and significantly increases the proportion of high-quality samples in the dataset from 17.7% to 78.9%. The effectiveness of CoachLM is further assessed on various real-world instruction test sets. The results show that CoachLM improves the instruction-following capabilities of the instruction-tuned LLM by an average of 29.9%, which even surpasses larger LLMs with nearly twice the number of parameters. Furthermore, CoachLM is successfully deployed in a data management system for LLMs at Huawei, resulting in an efficiency improvement of up to 20% in the cleaning of 40k real-world instruction pairs. We release various assets of CoachLM, including the training data, code and test set (https://github.com/lunyiliu/CoachLM).

Submitted 20 March, 2024; v1 submitted 22 November, 2023; originally announced November 2023.

Comments: Accepted by ICDE 2024

arXiv:2311.12665 [pdf, ps, other]

Boundedness of stable minimal models with klt singularities

Authors: Minzhe Zhu

Abstract: We investigate the singularities and boundedness of a special kind of algebraic variety, the so-called stable minimal models, which were constructed and studied by Birkar. Given a klt stable minimal model with bounded relative volume, if we fix the dimension, Iitaka volume, and a DCC set controlling coefficients, then we show that the singularities of the klt stable minimal model can be controlled uniformly. Furthermore, we prove that with certain bounded data, stable minimal models with klt singularities form a bounded family.

Submitted 28 June, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

Comments: Version 2: 31 pages, Section 3 simplified, exposition improved

arXiv:2311.12200 [pdf]

Hydrogen-induced tunable remanent polarization in a perovskite nickelate

Authors: Yifan Yuan, Michele Kotiuga, Tae Joon Park, Yuanyuan Ni, Arnob Saha, Hua Zhou, Jerzy T. Sadowski, Abdullah Al-Mahboob, Haoming Yu, Kai Du, Minning Zhu, Sunbin Deng, Ravindra S. Bisht, Xiao Lyu, Chung-Tse Michael Wu, Peide D. Ye, Abhronil Sengupta, Sang-Wook Cheong, Xiaoshan Xu, Karin M. Rabe, Shriram Ramanathan

Abstract: Materials with field-tunable polarization are of broad interest to condensed matter sciences and solid-state device technologies. Here, using hydrogen (H) donor doping, we modify the room temperature metallic phase of a perovskite nickelate NdNiO3 into an insulating phase with both metastable dipolar polarization and space-charge polarization. We then demonstrate transient negative differential capacitance in thin film capacitors. The space-charge polarization caused by long-range movement and trapping of protons dominates when the electric field exceeds the threshold value. First-principles calculations suggest the polarization originates from the polar structure created by H doping. We find that polarization decays within ~1 second which is an interesting temporal regime for neuromorphic computing hardware design, and we implement the transient characteristics in a neural network to demonstrate unsupervised learning. These discoveries open new avenues for designing novel ferroelectric materials and electrets using light-ion doping.

Submitted 20 November, 2023; originally announced November 2023.

Comments: 13 pages, 5 figures

arXiv:2311.12075 [pdf, other]

BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning

Authors: Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, Ee-Chien Chang

Abstract: Studying backdoor attacks is valuable for model copyright protection and enhancing defenses. While existing backdoor attacks have successfully infected multimodal contrastive learning models such as CLIP, they can be easily countered by specialized backdoor defenses for MCL models. This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses and introduces the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses. To achieve this, we draw motivations from the perspective of the Bayesian rule and propose a dual-embedding guided framework for backdoor attacks. Specifically, we ensure that visual trigger patterns approximate the textual target semantics in the embedding space, making it challenging to detect the subtle parameter variations induced by backdoor learning on such natural trigger patterns. Additionally, we optimize the visual trigger patterns to align the poisoned samples with target vision features in order to hinder the backdoor unlearning through clean fine-tuning. Extensive experiments demonstrate that our attack significantly outperforms state-of-the-art baselines (+45.3% ASR) in the presence of SoTA backdoor defenses, rendering these mitigation and detection strategies virtually ineffective. Furthermore, our approach effectively attacks some more rigorous scenarios like downstream tasks. We believe that this paper raises awareness regarding the potential threats associated with the practical application of multimodal contrastive learning and encourages the development of more robust defense mechanisms.

Submitted 4 March, 2024; v1 submitted 19 November, 2023; originally announced November 2023.

Comments: The paper lacks some work that needs to be cited

Journal ref: CVPR 2024

arXiv:2311.11969 [pdf, other]

SA-Med2D-20M Dataset: Segment Anything in 2D Medical Imaging with 20 Million masks

Authors: Jin Ye, Junlong Cheng, Jianpin Chen, Zhongying Deng, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Min Zhu, Shaoting Zhang, Junjun He, Yu Qiao

Abstract: Segment Anything Model (SAM) has achieved impressive results for natural image segmentation with input prompts such as points and bounding boxes. Its success largely owes to massive labeled training data. However, directly applying SAM to medical image segmentation cannot perform well because SAM lacks medical knowledge -- it does not use medical images for training. To incorporate medical knowledge into SAM, we introduce SA-Med2D-20M, a large-scale segmentation dataset of 2D medical images built upon numerous public and private datasets. It consists of 4.6 million 2D medical images and 19.7 million corresponding masks, covering almost the whole body and showing significant diversity. This paper describes all the datasets collected in SA-Med2D-20M and details how to process these datasets. Furthermore, comprehensive statistics of SA-Med2D-20M are presented to facilitate the better use of our dataset, which can help the researchers build medical vision foundation models or apply their models to downstream medical applications. We hope that the large scale and diversity of SA-Med2D-20M can be leveraged to develop medical artificial intelligence for enhancing diagnosis, medical image analysis, knowledge sharing, and education. The data with the redistribution license is publicly available at https://github.com/OpenGVLab/SAM-Med2D.

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2311.11628 [pdf, other]

Incorporating LLM Priors into Tabular Learners

Authors: Max Zhu, Siniša Stanivuk, Andrija Petrovic, Mladen Nikolic, Pietro Lio

Abstract: We present a method to integrate Large Language Models (LLMs) and traditional tabular data classification techniques, addressing LLMs challenges like data serialization sensitivity and biases. We introduce two strategies utilizing LLMs for ranking categorical variables and generating priors on correlations between continuous variables and targets, enhancing performance in few-shot scenarios. We focus on Logistic Regression, introducing MonotonicLR that employs a non-linear monotonic function for mapping ordinals to cardinals while preserving LLM-determined orders. Validation against baseline models reveals the superior performance of our approach, especially in low-data scenarios, while remaining interpretable.

Submitted 20 November, 2023; originally announced November 2023.

Comments: Table Representation Learning Workshop at NeurIPS 2023

arXiv:2311.11354 [pdf, other]

Scale-aware competition network for palmprint recognition

Authors: Chengrui Gao, Ziyuan Yang, Min Zhu, Andrew Beng ** Teoh

Abstract: Palmprint biometrics garner heightened attention in palm-scanning payment and social security due to their distinctive attributes. However, prevailing methodologies singularly prioritize texture orientation, neglecting the significant texture scale dimension. We design an innovative network for concurrently extracting intra-scale and inter-scale features to redress this limitation. This paper proposes a scale-aware competitive network (SAC-Net), which includes the Inner-Scale Competition Module (ISCM) and the Across-Scale Competition Module (ASCM) to capture texture characteristics related to orientation and scale. ISCM efficiently integrates learnable Gabor filters and a self-attention mechanism to extract rich orientation data and discern textures with long-range discriminative properties. Subsequently, ASCM leverages a competitive strategy across various scales to effectively encapsulate the competitive texture scale elements. By synergizing ISCM and ASCM, our method adeptly characterizes palmprint features. Rigorous experimentation across three benchmark datasets unequivocally demonstrates our proposed approach's exceptional recognition performance and resilience relative to state-of-the-art alternatives.

Submitted 20 November, 2023; v1 submitted 19 November, 2023; originally announced November 2023.

arXiv:2311.10051 [pdf, other]

Tabular Few-Shot Generalization Across Heterogeneous Feature Spaces

Authors: Max Zhu, Katarzyna Kobalczyk, Andrija Petrovic, Mladen Nikolic, Mihaela van der Schaar, Boris Delibasic, Petro Lio

Abstract: Despite the prevalence of tabular datasets, few-shot learning remains under-explored within this domain. Existing few-shot methods are not directly applicable to tabular datasets due to varying column relationships, meanings, and permutational invariance. To address these challenges, we propose FLAT-a novel approach to tabular few-shot learning, encompassing knowledge sharing between datasets with heterogeneous feature spaces. Utilizing an encoder inspired by Dataset2Vec, FLAT learns low-dimensional embeddings of datasets and their individual columns, which facilitate knowledge transfer and generalization to previously unseen datasets. A decoder network parametrizes the predictive target network, implemented as a Graph Attention Network, to accommodate the heterogeneous nature of tabular datasets. Experiments on a diverse collection of 118 UCI datasets demonstrate FLAT's successful generalization to new tabular datasets and a considerable improvement over the baselines.

Submitted 16 November, 2023; originally announced November 2023.

Comments: Tabular learning, Deep learning, Few shot learning

arXiv:2311.09925 [pdf, other]

Formation of a massive lenticular galaxy under the tidal interaction with a group of dwarf galaxies

Authors: **-Long Xu, Ming Zhu, Kelley M. Hess, Nai** Yu, Chuan-Peng Zhang, Xiao-Lan Liu, Mei Ai, Peng Jiang, Jie Wang

Abstract: Based on the atomic-hydrogen (HI) observations using the Five-hundred-meter Aperture Spherical radio Telescope (FAST), we present a detailed study of the gas-rich massive S0 galaxy NGC 1023 in a nearby galaxy group. The presence of an HI extended warped disk in NGC 1023 indicates that this S0 galaxy originated from a spiral galaxy. The data also suggest that NGC 1023 is interacting with four dwarf galaxies. In particular, one of the largest dwarf galaxies has fallen into the gas disk of NGC 1023, forming a rare bright-dark galaxy pair with a large gas clump. This clump shows the signature of a galaxy but has no optical counterpart, implying that it is a newly formed starless galaxy. Our results firstly suggest that a massive S0 galaxy in a galaxy group can form via the morphological transformation from a spiral under the joint action of multiple tidal interactions.

Submitted 16 November, 2023; originally announced November 2023.

Comments: 13 pages, 8 figures, Accepted for publication in the ApJ Letters

arXiv:2311.08786 [pdf, other]

HFORD: High-Fidelity and Occlusion-Robust De-identification for Face Privacy Protection

Authors: Dongxin Chen, Mingrui Zhu, Nannan Wang, Xinbo Gao

Abstract: With the popularity of smart devices and the development of computer vision technology, concerns about face privacy protection are growing. The face de-identification technique is a practical way to solve the identity protection problem. The existing facial de-identification methods have revealed several problems, including the impact on the realism of anonymized results when faced with occlusions and the inability to maintain identity-irrelevant details in anonymized results. We present a High-Fidelity and Occlusion-Robust De-identification (HFORD) method to deal with these issues. This approach can disentangle identities and attributes while preserving image-specific details such as background, facial features (e.g., wrinkles), and lighting, even in occluded scenes. To disentangle the latent codes in the GAN inversion space, we introduce an Identity Disentanglement Module (IDM). This module selects the latent codes that are closely related to the identity. It further separates the latent codes into identity-related codes and attribute-related codes, enabling the network to preserve attributes while only modifying the identity. To ensure the preservation of image details and enhance the network's robustness to occlusions, we propose an Attribute Retention Module (ARM). This module adaptively preserves identity-irrelevant details and facial occlusions and blends them into the generated results in a modulated manner. Extensive experiments show that our method has higher quality, better detail fidelity, and stronger occlusion robustness than other face de-identification methods.

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.05448 [pdf]

doi 10.1038/s41586-023-06650-z

An evolutionary continuum from nucleated dwarf galaxies to star clusters

Authors: Kaixiang Wang, Eric W. Peng, Chengze Liu, J. Christopher Mihos, Patrick Côté, Laura Ferrarese, Matthew A. Taylor, John P. Blakeslee, Jean-Charles Cuillandre, Pierre-Alain Duc, Puragra Guhathakurta, Stephen Gwyn, Youkyung Ko, Ariane Lançon, Sungsoon Lim, Lauren A. MacArthur, Thomas Puzia, Joel Roediger, Laura V. Sales, Rubén Sánchez-Janssen, Chelsea Spengler, Elisa Toloba, Hongxin Zhang, Mingcheng Zhu

Abstract: Systematic studies have revealed hundreds of ultra-compact dwarf galaxies (UCDs) in the nearby Universe. With half-light radii $r_h$ of approximately 10-100 parsecs and stellar masses $M_*$ $\approx$ $10^6-10^8$ solar masses, UCDs are among the densest known stellar systems. Although similar in appearance to massive globular clusters, the detection of extended stellar envelopes, complex star formation histories, elevated mass-to-light ratio, and supermassive black holes suggest that some UCDs are remnant nuclear star clusters of tidally-stripped dwarf galaxies, or even ancient compact galaxies. However, only a few objects have been found in the transient stage of tidal stripping, and this assumed evolutionary path has never been fully traced by observations. Here we show that 106 galaxies in the Virgo cluster have morphologies that are intermediate between normal, nucleated dwarf galaxies and single-component UCDs, revealing a continuum that fully maps this morphological transition, and fills the `size gap' between star clusters and galaxies. Their spatial distribution and redder color are also consistent with stripped satellite galaxies on their first few pericentric passages around massive galaxies. The `ultra-diffuse' tidal features around several of these galaxies directly show how UCDs are forming through tidal stripping, and that this evolutionary path can include an early phase as a nucleated ultra-diffuse galaxy (UDG). These UCDs represent substantial visible fossil remnants of ancient dwarf galaxies in galaxy clusters, and more low-mass remnants probably remain to be found.

Submitted 9 November, 2023; originally announced November 2023.

Comments: Published in Nature. Accepted on September 15

Journal ref: Nature 623 (2023) 296-300

arXiv:2311.05075 [pdf]

Mental Health Diagnosis in the Digital Age: Harnessing Sentiment Analysis on Social Media Platforms upon Ultra-Sparse Feature Content

Authors: Haijian Shao, Ming Zhu, Shengjie Zhai

Abstract: Amid growing global mental health concerns, particularly among vulnerable groups, natural language processing offers a tremendous potential for early detection and intervention of people's mental disorders via analyzing their postings and discussions on social media platforms. However, ultra-sparse training data, often due to vast vocabularies and low-frequency words, hinders the analysis accuracy. Multi-labeling and Co-occurrences of symptoms may also blur the boundaries in distinguishing similar/co-related disorders. To address these issues, we propose a novel semantic feature preprocessing technique with a three-folded structure: 1) mitigating the feature sparsity with a weak classifier, 2) adaptive feature dimension with modulus loops, and 3) deep-mining and extending features among the contexts. With enhanced semantic features, we train a machine learning model to predict and classify mental disorders. We utilize the Reddit Mental Health Dataset 2022 to examine conditions such as Anxiety, Borderline Personality Disorder (BPD), and Bipolar-Disorder (BD) and present solutions to the data sparsity challenge, highlighted by 99.81% non-zero elements. After applying our preprocessing technique, the feature sparsity decreases to 85.4%. Overall, our methods, when compared to seven benchmark models, demonstrate significant performance improvements: 8.0% in accuracy, 0.069 in precision, 0.093 in recall, 0.102 in F1 score, and 0.059 in AUC. This research provides foundational insights for mental health prediction and monitoring, providing innovative solutions to navigate challenges associated with ultra-sparse data feature and intricate multi-label classification in the domain of mental health analysis.

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2311.04653 [pdf, other]

Hybrid Focal and Full-Range Attention Based Graph Transformers

Authors: Minhong Zhu, Zhenhao Zhao, Weiran Cai

Abstract: The paradigm of Transformers using the self-attention mechanism has manifested its advantage in learning graph-structured data. Yet, Graph Transformers are capable of modeling full range dependencies but are often deficient in extracting information from locality. A common practice is to utilize Message Passing Neural Networks (MPNNs) as an auxiliary to capture local information, which however are still inadequate for comprehending substructures. In this paper, we present a purely attention-based architecture, namely Focal and Full-Range Graph Transformer (FFGT), which can mitigate the loss of local information in learning global correlations. The core component of FFGT is a new mechanism of compound attention, which combines the conventional full-range attention with K-hop focal attention on ego-nets to aggregate both global and local information. Beyond the scope of canonical Transformers, the FFGT has the merit of being more substructure-aware. Our approach enhances the performance of existing Graph Transformers on various open datasets, while achieves compatible SOTA performance on several Long-Range Graph Benchmark (LRGB) datasets even with a vanilla transformer. We further examine influential factors on the optimal focal length of attention via introducing a novel synthetic dataset based on SBM-PATTERN.

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2310.20369 [pdf, other]

Stability and Generalization of the Decentralized Stochastic Gradient Descent Ascent Algorithm

Authors: Miaoxi Zhu, Li Shen, Bo Du, Dacheng Tao

Abstract: The growing size of available data has attracted increasing interest in solving minimax problems in a decentralized manner for various machine learning tasks. Previous theoretical research has primarily focused on the convergence rate and communication complexity of decentralized minimax algorithms, with little attention given to their generalization. In this paper, we investigate the primal-dual generalization bound of the decentralized stochastic gradient descent ascent (D-SGDA) algorithm using the approach of algorithmic stability under both convex-concave and nonconvex-nonconcave settings. Our theory refines the algorithmic stability in a decentralized manner and demonstrates that the decentralized structure does not destroy the stability and generalization of D-SGDA, implying that it can generalize as well as the vanilla SGDA in certain situations. Our results analyze the impact of different topologies on the generalization bound of the D-SGDA algorithm beyond trivial factors such as sample sizes, learning rates, and iterations. We also evaluate the optimization error and balance it with the generalization gap to obtain the optimal population risk of D-SGDA in the convex-concave setting. Additionally, we perform several numerical experiments which validate our theoretical findings.

Submitted 31 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023

arXiv:2310.18848 [pdf, ps, other]

Optimal concentration level of anisotropic Trudinger-Moser functionals on any bounded domain

Authors: Lu Chen, Rou Jiang, Maochun Zhu

Abstract: Let $F$ be convex and homogeneous of degree $1$, its polar $F^{o}$ represent a Finsler metric on $\mathbb{R}^{n}$, and $\Omega$ be any bounded open set in $\mathbb{R}^{n}$. In this paper, we first construct the theoretical structure of anisotropic harmonic transplantation. Using the anisotropic harmonic transplantation, co-area formula, limiting Sobolev approximation method, delicate estimate of level set of Green function, we investigate the optimal concentration level of the Trudinger-Moser functional \[ \int_{\Omega} e^{\lambda_{n}|u|^{\frac{n}{n-1}}}dx \] under the anisotropic Dirichlet norm constraint $\int_{\Omega} F^{n}\left( \nabla{u}\right) dx\leq1$, where $\lambda_{n}=n^{\frac{n}{n-1}}\kappa_{n}^{\frac{1}{n-1}}$ denotes the sharp constant of anisotropic Trudinger-Moser inequality in bounded domain and $\kappa_{n}$ is the Lebesgue measure of the unit Wulff ball. As an application, we can immediately deduce the existence of extremals for anisotropic Trudinger-Moser inequality on bounded domain. Finally, we also consider the optimal concentration level of the anisotropic singular Trudinger-Moser functional. The method is based on the limiting Hardy-Sobolev approximation method and constructing a suitable normalized anisotropic concentrating sequence.

Submitted 28 October, 2023; originally announced October 2023.

arXiv:2310.16917 [pdf, other]

MimicTouch: Learning Human's Control Strategy with Multi-Modal Tactile Feedback

Authors: Kelin Yu, Yunhai Han, Matthew Zhu, Ye Zhao

Abstract: In robotics and artificial intelligence, the integration of tactile processing is becoming increasingly pivotal, especially in learning to execute intricate tasks like alignment and insertion. However, existing works focusing on tactile methods for insertion tasks predominantly rely on robot teleoperation data and reinforcement learning, which do not utilize the rich insights provided by human's control strategy guided by tactile feedback. For utilizing human sensations, methodologies related to learning from humans predominantly leverage visual feedback, often overlooking the invaluable tactile feedback that humans inherently employ to finish complex manipulations. Addressing this gap, we introduce "MimicTouch", a novel framework that mimics human's tactile-guided control strategy. In this framework, we initially collect multi-modal tactile datasets from human demonstrators, incorporating human tactile-guided control strategies for task completion. The subsequent step involves instructing robots through imitation learning using multi-modal sensor data and retargeted human motions. To further mitigate the embodiment gap between humans and robots, we employ online residual reinforcement learning on the physical robot. Through comprehensive experiments, we validate the safety of MimicTouch in transferring a latent policy learned through imitation learning from human to robot. This ongoing work will pave the way for a broader spectrum of tactile-guided robotic applications.

Submitted 1 November, 2023; v1 submitted 25 October, 2023; originally announced October 2023.

Comments: Presented at CoRL 2023 Deployable Workshop and NIPS 2023 Touch Processing Workshop

arXiv:2310.15975 [pdf]

Data-driven Traffic Simulation: A Comprehensive Review

Authors: Di Chen, Meixin Zhu, Hao Yang, Xuesong Wang, Yinhai Wang

Abstract: Autonomous vehicles (AVs) have the potential to significantly revolutionize society by providing a secure and efficient mode of transportation. Recent years have witnessed notable advancements in autonomous driving perception and prediction, but the challenge of validating the performance of AVs remains largely unresolved. Data-driven microscopic traffic simulation has become an important tool for autonomous driving testing due to 1) availability of high-fidelity traffic data; 2) its advantages of enabling large-scale testing and scenario reproducibility; and 3) its potential in reactive and realistic traffic simulation. However, a comprehensive review of this topic is currently lacking. This paper aims to fill this gap by summarizing relevant studies. The primary objective of this paper is to review current research efforts and provide a futuristic perspective that will benefit future developments in the field. It introduces the general issues of data-driven traffic simulation and outlines key concepts and terms. After overviewing traffic simulation, various datasets and evaluation metrics commonly used are reviewed. The paper then offers a comprehensive evaluation of imitation learning, reinforcement learning, deep generative and deep learning methods, summarizing each and analyzing their advantages and disadvantages in detail. Moreover, it evaluates the state-of-the-art, existing challenges, and future research directions.

Submitted 23 November, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: 21 pages, 7 figures, 6 tables

arXiv:2310.13473 [pdf, other]

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

Authors: Mingwei Zhu, Leigang Sha, Yu Shu, Kangjia Zhao, Tiancheng Zhao, Jianwei Yin

Abstract: Multimodal large language models (MLLMs) have shown great potential in perception and interpretation tasks, but their capabilities in predictive reasoning remain under-explored. To address this gap, we introduce a novel benchmark that assesses the predictive reasoning capabilities of MLLMs across diverse scenarios. Our benchmark targets three important domains: abstract pattern reasoning, human activity prediction, and physical interaction prediction. We further develop three evaluation methods powered by large language model to robustly quantify a model's performance in predicting and reasoning the future based on multi-visual context. Empirical experiments confirm the soundness of the proposed benchmark and evaluation methods via rigorous testing and reveal pros and cons of current popular MLLMs in the task of predictive reasoning. Lastly, our proposed benchmark provides a standardized evaluation framework for MLLMs and can facilitate the development of more advanced models that can reason and predict over complex long sequence of multimodal input.

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.13320 [pdf, other]

doi 10.1109/TVCG.2024.3350901

CylinderTag: An Accurate and Flexible Marker for Cylinder-Shape Objects Pose Estimation Based on Projective Invariants

Authors: Shaoan Wang, Mingzhu Zhu, Yaoqing Hu, Dongyue Li, Fusong Yuan, Junzhi Yu

Abstract: High-precision pose estimation based on visual markers has been a thriving research topic in the field of computer vision. However, the suitability of traditional flat markers on curved objects is limited due to the diverse shapes of curved surfaces, which hinders the development of high-precision pose estimation for curved objects. Therefore, this paper proposes a novel visual marker called CylinderTag, which is designed for developable curved surfaces such as cylindrical surfaces. CylinderTag is a cyclic marker that can be firmly attached to objects with a cylindrical shape. Leveraging the manifold assumption, the cross-ratio in projective invariance is utilized for encoding in the direction of zero curvature on the surface. Additionally, to facilitate the usage of CylinderTag, we propose a heuristic search-based marker generator and a high-performance recognizer as well. Moreover, an all-encompassing evaluation of CylinderTag properties is conducted by means of extensive experimentation, covering detection rate, detection speed, dictionary size, localization jitter, and pose estimation accuracy. CylinderTag showcases superior detection performance from varying view angles in comparison to traditional visual markers, accompanied by higher localization accuracy. Furthermore, CylinderTag boasts real-time detection capability and an extensive marker dictionary, offering enhanced versatility and practicality in a wide range of applications. Experimental results demonstrate that the CylinderTag is a highly promising visual marker for use on cylindrical-like surfaces, thus offering important guidance for future research on high-precision visual localization of cylinder-shaped objects. The code is available at: https://github.com/wsakobe/CylinderTag.

Submitted 20 October, 2023; originally announced October 2023.

Comments: 15 pages, 22 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Journal ref: IEEE Transactions on Visualization and Computer Graphics, 2024

arXiv:2310.13315 [pdf, other]

Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models

Authors: Miaoxi Zhu, Qihuang Zhong, Li Shen, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao

Abstract: Quantization is a promising approach for reducing memory overhead and accelerating inference, especially in large pre-trained language model (PLM) scenarios. While having no access to original training data due to security and privacy concerns has emerged the demand for zero-shot quantization. Most of the cutting-edge zero-shot quantization methods primarily 1) apply to computer vision tasks, and 2) neglect of overfitting problem in the generative adversarial learning process, leading to sub-optimal performance. Motivated by this, we propose a novel zero-shot sharpness-aware quantization (ZSAQ) framework for the zero-shot quantization of various PLMs. The key algorithm in solving ZSAQ is the SAM-SGA optimization, which aims to improve the quantization accuracy and model generalization via optimizing a minimax problem. We theoretically prove the convergence rate for the minimax optimization problem and this result can be applied to other nonconvex-PL minimax optimization frameworks. Extensive experiments on 11 tasks demonstrate that our method brings consistent and significant performance gains on both discriminative and generative PLMs, i.e., up to +6.98 average score. Furthermore, we empirically validate that our method can effectively improve the model generalization.

Submitted 20 October, 2023; originally announced October 2023.

Comments: Accepted to EMNLP2023 (Main). Miaoxi Zhu and Qihuang Zhong contribute equally to this work

arXiv:2310.11802 [pdf, other]

De novo protein design using geometric vector field networks

Authors: Weian Mao, Muzhi Zhu, Zheng Sun, Shuaike Shen, Lin Yuanbo Wu, Hao Chen, Chunhua Shen

Abstract: Innovations like protein diffusion have enabled significant progress in de novo protein design, which is a vital topic in life science. These methods typically depend on protein structure encoders to model residue backbone frames, where atoms do not exist. Most prior encoders rely on atom-wise features, such as angles and distances between atoms, which are not available in this context. Thus far, only several simple encoders, such as IPA, have been proposed for this scenario, exposing the frame modeling as a bottleneck. In this work, we proffer the Vector Field Network (VFN), which enables network layers to perform learnable vector computations between coordinates of frame-anchored virtual atoms, thus achieving a higher capability for modeling frames. The vector computation operates in a manner similar to a linear layer, with each input channel receiving 3D virtual atom coordinates instead of scalar values. The multiple feature vectors output by the vector computation are then used to update the residue representations and virtual atom coordinates via attention aggregation. Remarkably, VFN also excels in modeling both frames and atoms, as the real atoms can be treated as the virtual atoms for modeling, positioning VFN as a potential universal encoder. In protein diffusion (frame modeling), VFN exhibits an impressive performance advantage over IPA, excelling in terms of both designability (67.04% vs. 53.58%) and diversity (66.54% vs. 51.98%). In inverse folding (frame and atom modeling), VFN outperforms the previous SoTA model, PiFold (54.7% vs. 51.66%), on sequence recovery rate. We also propose a method of equipping VFN with the ESM model, which significantly surpasses the previous ESM-based SoTA (62.67% vs. 55.65%), LM-Design, by a substantial margin.

Submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.10952 [pdf, other]

Restricted Tweedie Stochastic Block Models

Authors: Jie Jian, Mu Zhu, Peijun Sang

Abstract: The stochastic block model (SBM) is a widely used framework for community detection in networks, where the network structure is typically represented by an adjacency matrix. However, conventional SBMs are not directly applicable to an adjacency matrix that consists of non-negative zero-inflated continuous edge weights. To model the international trading network, where edge weights represent trading values between countries, we propose an innovative SBM based on a restricted Tweedie distribution. Additionally, we incorporate nodal information, such as the geographical distance between countries, and account for its dynamic effect on edge weights. Notably, we show that given a sufficiently large number of nodes, estimating this covariate effect becomes independent of community labels of each node when computing the maximum likelihood estimator of parameters in our model. This result enables the development of an efficient two-step algorithm that separates the estimation of covariate effects from other parameters. We demonstrate the effectiveness of our proposed method through extensive simulation studies and an application to real-world international trading data.

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.10689 [pdf]

doi 10.1109/isbi53787.2023.10230816

Contrastive Self-Supervised Learning for Spatio-Temporal Analysis of Lung Ultrasound Videos

Authors: Li Chen, Jonathan Rubin, Jiahong Ouyang, Naveen Balaraju, Shubham Patil, Courosh Mehanian, Sourabh Kulhare, Rachel Millin, Kenton W Gregory, Cynthia R Gregory, Meihua Zhu, David O Kessler, Laurie Malia, Almaz Dessie, Joni Rabiner, Di Coneybeare, Bo Shopsin, Andrew Hersh, Cristian Madar, Jeffrey Shupp, Laura S Johnson, Jacob Avila, Kristin Dwyer, Peter Weimersheimer, Balasundar Raju, et al. (2 additional authors not shown)

Abstract: Self-supervised learning (SSL) methods have shown promise for medical imaging applications by learning meaningful visual representations, even when the amount of labeled data is limited. Here, we extend state-of-the-art contrastive learning SSL methods to 2D+time medical ultrasound video data by introducing a modified encoder and augmentation method capable of learning meaningful spatio-temporal representations, without requiring constraints on the input data. We evaluate our method on the challenging clinical task of identifying lung consolidations (an important pathological feature) in ultrasound videos. Using a multi-center dataset of over 27k lung ultrasound videos acquired from over 500 patients, we show that our method can significantly improve performance on downstream localization and classification of lung consolidation. Comparisons against baseline models trained without SSL show that the proposed methods are particularly advantageous when the size of labeled training data is limited (e.g., as little as 5% of the training set).

Submitted 14 October, 2023; originally announced October 2023.

Comments: ISBI 2023, 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI)

Showing 101–150 of 860 results for author: Zhu, M