Search | arXiv e-print repository

Open (Clinical) LLMs are Sensitive to Instruction Phrasings

Authors: Alberto Mario Ceballos Arroyo, Monica Munnangi, Jiuding Sun, Karen Y. C. Zhang, Denis Jered McInerney, Byron C. Wallace, Silvio Amir

Abstract: Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a… ▽ More Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data are especially brittle, compared to their general domain counterparts. Further, arbitrary phrasing differences can affect fairness, e.g., valid but distinct instructions for mortality prediction yield a range both in overall performance, and in terms of differences between demographic groups. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: To appear at BioNLP, ACL 2024

arXiv:2407.09357 [pdf, other]

Any-Property-Conditional Molecule Generation with Self-Criticism using Spanning Trees

Authors: Alexia Jolicoeur-Martineau, Aristide Baratin, Kisoo Kwon, Boris Knyazev, Yan Zhang

Abstract: Generating novel molecules is challenging, with most representations leading to generative models producing many invalid molecules. Spanning Tree-based Graph Generation (STGG) is a promising approach to ensure the generation of valid molecules, outperforming state-of-the-art SMILES and graph diffusion models for unconditional generation. In the real world, we want to be able to generate molecules… ▽ More Generating novel molecules is challenging, with most representations leading to generative models producing many invalid molecules. Spanning Tree-based Graph Generation (STGG) is a promising approach to ensure the generation of valid molecules, outperforming state-of-the-art SMILES and graph diffusion models for unconditional generation. In the real world, we want to be able to generate molecules conditional on one or multiple desired properties rather than unconditionally. Thus, in this work, we extend STGG to multi-property-conditional generation. Our approach, STGG+, incorporates a modern Transformer architecture, random masking of properties during training (enabling conditioning on any subset of properties and classifier-free guidance), an auxiliary property-prediction loss (allowing the model to self-criticize molecules and select the best ones), and other improvements. We show that STGG+ achieves state-of-the-art performance on in-distribution and out-of-distribution conditional generation, and reward maximization. △ Less

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.09292 [pdf, other]

CEIPA: Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models

Authors: Dong Shu, Mingyu **, Tianle Chen, Chong Zhang, Yongfeng Zhang

Abstract: This study sheds light on the imperative need to bolster safety and privacy measures in large language models (LLMs), such as GPT-4 and LLaMA-2, by identifying and mitigating their vulnerabilities through explainable analysis of prompt attacks. We propose Counterfactual Explainable Incremental Prompt Attack (CEIPA), a novel technique where we guide prompts in a specific manner to quantitatively me… ▽ More This study sheds light on the imperative need to bolster safety and privacy measures in large language models (LLMs), such as GPT-4 and LLaMA-2, by identifying and mitigating their vulnerabilities through explainable analysis of prompt attacks. We propose Counterfactual Explainable Incremental Prompt Attack (CEIPA), a novel technique where we guide prompts in a specific manner to quantitatively measure attack effectiveness and explore the embedded defense mechanisms in these models. Our approach is distinctive for its capacity to elucidate the reasons behind the generation of harmful responses by LLMs through an incremental counterfactual methodology. By organizing the prompt modification process into four incremental levels: (word, sentence, character, and a combination of character and word) we facilitate a thorough examination of the susceptibilities inherent to LLMs. The findings from our study not only provide counterfactual explanation insight but also demonstrate that our framework significantly enhances the effectiveness of attack prompts. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: 23 pages, 6 figures

arXiv:2407.09139 [pdf, other]

Measurement of $CP$ asymmetries in $B^0 \to K^0_S π^0 γ$ decays at Belle II

Authors: Belle II Collaboration, I. Adachi, L. Aggarwal, H. Ahmed, H. Aihara, N. Akopov, A. Aloisio, N. Anh Ky, D. M. Asner, H. Atmacan, T. Aushev, V. Aushev, M. Aversano, R. Ayad, V. Babu, H. Bae, S. Bahinipati, P. Bambade, Sw. Banerjee, S. Bansal, M. Barrett, J. Baudot, A. Baur, A. Beaubien, F. Becherer , et al. (414 additional authors not shown)

Abstract: We report measurements of time-dependent $CP$ asymmetries in $B^0 \to K^0_S π^0 γ$ decays based on a data sample of $(388\pm6)\times10^6$ $B\bar{B}$ events collected at the $Υ(4S)$ resonance with the Belle II detector. The Belle II experiment operates at the SuperKEKB asymmetric-energy $e^+e^-$ collider. We measure decay-time distributions to determine $CP$-violating parameters $S$ and $C$. We det… ▽ More We report measurements of time-dependent $CP$ asymmetries in $B^0 \to K^0_S π^0 γ$ decays based on a data sample of $(388\pm6)\times10^6$ $B\bar{B}$ events collected at the $Υ(4S)$ resonance with the Belle II detector. The Belle II experiment operates at the SuperKEKB asymmetric-energy $e^+e^-$ collider. We measure decay-time distributions to determine $CP$-violating parameters $S$ and $C$. We determine these parameters for two ranges of $K^0_S π^0$ invariant mass: $m(K^0_S π^0)\in (0.8, 1.0)$ $GeV/c^2$, which is dominated by $B^0 \to K^{*0} (\to K^0_S π^0) γ$ decays, and a complementary region $m(K^0_S π^0)\in (0.6, 0.8)\cup(1.0, 1.8)$ $GeV/c^2$. Our results have improved precision as compared to previous measurements and are consistent with theory predictions. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: 10 pages, 4 figures

Report number: Belle II Preprint 2024-009, KEK Preprint 2024-1

arXiv:2407.09026 [pdf, other]

HPC: Hierarchical Progressive Coding Framework for Volumetric Video

Authors: Zihan Zheng, Houqiang Zhong, Qiang Hu, Xiaoyun Zhang, Li Song, Ya Zhang, Yanfeng Wang

Abstract: Volumetric video based on Neural Radiance Field (NeRF) holds vast potential for various 3D applications, but its substantial data volume poses significant challenges for compression and transmission. Current NeRF compression lacks the flexibility to adjust video quality and bitrate within a single model for various network and device capacities. To address these issues, we propose HPC, a novel hie… ▽ More Volumetric video based on Neural Radiance Field (NeRF) holds vast potential for various 3D applications, but its substantial data volume poses significant challenges for compression and transmission. Current NeRF compression lacks the flexibility to adjust video quality and bitrate within a single model for various network and device capacities. To address these issues, we propose HPC, a novel hierarchical progressive volumetric video coding framework achieving variable bitrate using a single model. Specifically, HPC introduces a hierarchical representation with a multi-resolution residual radiance field to reduce temporal redundancy in long-duration sequences while simultaneously generating various levels of detail. Then, we propose an end-to-end progressive learning approach with a multi-rate-distortion loss function to jointly optimize both hierarchical representation and compression. Our HPC trained only once can realize multiple compression levels, while the current methods need to train multiple fixed-bitrate models for different rate-distortion (RD) tradeoffs. Extensive experiments demonstrate that HPC achieves flexible quality levels with variable bitrate by a single model and exhibits competitive RD performance, even outperforming fixed-bitrate models across various datasets. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: 11 pages, 7 figures

arXiv:2407.09018 [pdf, other]

AUITestAgent: Automatic Requirements Oriented GUI Function Testing

Authors: Yongxiang Hu, Xuan Wang, Yingchuan Wang, Yu Zhang, Shiyu Guo, Chaoyi Chen, Xin Wang, Yangfan Zhou

Abstract: The Graphical User Interface (GUI) is how users interact with mobile apps. To ensure it functions properly, testing engineers have to make sure it functions as intended, based on test requirements that are typically written in natural language. While widely adopted manual testing and script-based methods are effective, they demand substantial effort due to the vast number of GUI pages and rapid it… ▽ More The Graphical User Interface (GUI) is how users interact with mobile apps. To ensure it functions properly, testing engineers have to make sure it functions as intended, based on test requirements that are typically written in natural language. While widely adopted manual testing and script-based methods are effective, they demand substantial effort due to the vast number of GUI pages and rapid iterations in modern mobile apps. This paper introduces AUITestAgent, the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification. Since test requirements typically contain interaction commands and verification oracles. AUITestAgent can extract GUI interactions from test requirements via dynamically organized agents. Then, AUITestAgent employs a multi-dimensional data extraction strategy to retrieve data relevant to the test requirements from the interaction trace and perform verification. Experiments on customized benchmarks demonstrate that AUITestAgent outperforms existing tools in the quality of generated GUI interactions and achieved the accuracy of verifications of 94%. Moreover, field deployment in Meituan has shown AUITestAgent's practical usability, with it detecting 4 new functional bugs during 10 regression tests in two months. △ Less

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.08990 [pdf, other]

Dynamic neural network with memristive CIM and CAM for 2D and 3D vision

Authors: Yue Zhang, Woyu Zhang, Shaocong Wang, Ning Lin, Yifei Yu, Yangu He, Bo Wang, Hao Jiang, Peng Lin, Xiaoxin Xu, Xiaojuan Qi, Zhongrui Wang, Xumeng Zhang, Dashan Shang, Qi Liu, Kwang-Ting Cheng, Ming Liu

Abstract: The brain is dynamic, associative and efficient. It reconfigures by associating the inputs with past experiences, with fused memory and processing. In contrast, AI models are static, unable to associate inputs with past experiences, and run on digital computers with physically separated memory and processing. We propose a hardware-software co-design, a semantic memory-based dynamic neural network… ▽ More The brain is dynamic, associative and efficient. It reconfigures by associating the inputs with past experiences, with fused memory and processing. In contrast, AI models are static, unable to associate inputs with past experiences, and run on digital computers with physically separated memory and processing. We propose a hardware-software co-design, a semantic memory-based dynamic neural network (DNN) using memristor. The network associates incoming data with the past experience stored as semantic vectors. The network and the semantic memory are physically implemented on noise-robust ternary memristor-based Computing-In-Memory (CIM) and Content-Addressable Memory (CAM) circuits, respectively. We validate our co-designs, using a 40nm memristor macro, on ResNet and PointNet++ for classifying images and 3D points from the MNIST and ModelNet datasets, which not only achieves accuracy on par with software but also a 48.1% and 15.9% reduction in computational budget. Moreover, it delivers a 77.6% and 93.3% reduction in energy consumption. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: In press

arXiv:2407.08984 [pdf, ps, other]

Measurement of branching fractions, CP asymmetry, and isospin asymmetry for $\boldsymbol{B\rightarrowργ}$ decays using Belle and Belle II data

Authors: Belle II Collaboration, I. Adachi, K. Adamczyk, L. Aggarwal, H. Aihara, N. Akopov, A. Aloisio, N. Anh Ky, D. M. Asner, H. Atmacan, T. Aushev, V. Aushev, M. Aversano, R. Ayad, V. Babu, H. Bae, S. Bahinipati, P. Bambade, Sw. Banerjee, S. Bansal, M. Barrett, J. Baudot, A. Baur, A. Beaubien, F. Becherer , et al. (385 additional authors not shown)

Abstract: We present measurements of $B^{+}\rightarrowρ^{+}γ$ and $B^{0}\rightarrowρ^{0}γ$ decays using a combined data sample of $772 \times 10^6$ $B\overline{B}$ pairs collected by the Belle experiment and $387\times 10^6$ $B\overline{B}$ pairs collected by the Belle II experiment in $e^{+}e^{-}$ collisions at the $Υ(4S)$ resonance. After an optimized selection, a simultaneous fit to the Belle and Belle I… ▽ More We present measurements of $B^{+}\rightarrowρ^{+}γ$ and $B^{0}\rightarrowρ^{0}γ$ decays using a combined data sample of $772 \times 10^6$ $B\overline{B}$ pairs collected by the Belle experiment and $387\times 10^6$ $B\overline{B}$ pairs collected by the Belle II experiment in $e^{+}e^{-}$ collisions at the $Υ(4S)$ resonance. After an optimized selection, a simultaneous fit to the Belle and Belle II data sets yields $114\pm 12$ $B^{+}\rightarrowρ^{+}γ$ and $99\pm 12$ $B^{0}\rightarrowρ^{0}γ$ decays. The measured branching fractions are $(13.1^{+2.0 +1.3}_{-1.9 -1.2})\times 10^{-7}$ and $(7.5\pm 1.3^{+1.0}_{-0.8})\times 10^{-7}$ for $B^{+}\rightarrowρ^{+}γ$ and $B^{0}\rightarrowρ^{0}γ$ decays, respectively, where the first uncertainty is statistical and the second is systematic. We also measure the isospin asymmetry $A_{\rm I}(B\rightarrowργ)=(10.9^{+11.2 +7.8}_{-11.7 -7.3})\%$ and the direct CP asymmetry $A_{CP}(B^{+}\rightarrowρ^{+}γ)=(-8.2\pm 15.2^{+1.6}_{-1.2})\%$. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: 12 pages, 4 figures

Report number: Belle II Preprint 2023-019; KEK Preprint 2023-37

arXiv:2407.08972 [pdf, other]

Revealing the Dark Secrets of Extremely Large Kernel ConvNets on Robustness

Authors: Honghao Chen, Yurong Zhang, Xiaokun Feng, Xiangxiang Chu, Kaiqi Huang

Abstract: Robustness is a vital aspect to consider when deploying deep learning models into the wild. Numerous studies have been dedicated to the study of the robustness of vision transformers (ViTs), which have dominated as the mainstream backbone choice for vision tasks since the dawn of 2020s. Recently, some large kernel convnets make a comeback with impressive performance and efficiency. However, it sti… ▽ More Robustness is a vital aspect to consider when deploying deep learning models into the wild. Numerous studies have been dedicated to the study of the robustness of vision transformers (ViTs), which have dominated as the mainstream backbone choice for vision tasks since the dawn of 2020s. Recently, some large kernel convnets make a comeback with impressive performance and efficiency. However, it still remains unclear whether large kernel networks are robust and the attribution of their robustness. In this paper, we first conduct a comprehensive evaluation of large kernel convnets' robustness and their differences from typical small kernel counterparts and ViTs on six diverse robustness benchmark datasets. Then to analyze the underlying factors behind their strong robustness, we design experiments from both quantitative and qualitative perspectives to reveal large kernel convnets' intriguing properties that are completely different from typical convnets. Our experiments demonstrate for the first time that pure CNNs can achieve exceptional robustness comparable or even superior to that of ViTs. Our analysis on occlusion invariance, kernel attention patterns and frequency characteristics provide novel insights into the source of robustness. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08966 [pdf, other]

LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models

Authors: Yabin Zhang, Wenjie Zhu, Chenhang He, Lei Zhang

Abstract: Out-of-distribution (OOD) detection is crucial for model reliability, as it identifies samples from unknown classes and reduces errors due to unexpected inputs. Vision-Language Models (VLMs) such as CLIP are emerging as powerful tools for OOD detection by integrating multi-modal information. However, the practical application of such systems is challenged by manual prompt engineering, which demand… ▽ More Out-of-distribution (OOD) detection is crucial for model reliability, as it identifies samples from unknown classes and reduces errors due to unexpected inputs. Vision-Language Models (VLMs) such as CLIP are emerging as powerful tools for OOD detection by integrating multi-modal information. However, the practical application of such systems is challenged by manual prompt engineering, which demands domain expertise and is sensitive to linguistic nuances. In this paper, we introduce Label-driven Automated Prompt Tuning (LAPT), a novel approach to OOD detection that reduces the need for manual prompt engineering. We develop distribution-aware prompts with in-distribution (ID) class names and negative labels mined automatically. Training samples linked to these class labels are collected autonomously via image synthesis and retrieval methods, allowing for prompt learning without manual effort. We utilize a simple cross-entropy loss for prompt optimization, with cross-modal and cross-distribution mixing strategies to reduce image noise and explore the intermediate space between distributions, respectively. The LAPT framework operates autonomously, requiring only ID class names as input and eliminating the need for manual intervention. With extensive experiments, LAPT consistently outperforms manually crafted prompts, setting a new standard for OOD detection. Moreover, LAPT not only enhances the distinction between ID and OOD samples, but also improves the ID classification accuracy and strengthens the generalization robustness to covariate shifts, resulting in outstanding performance in challenging full-spectrum OOD detection tasks. Codes are available at \url{https://github.com/YBZh/LAPT}. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: ECCV2024; Codes and Supp. are available at: https://github.com/YBZh/LAPT

arXiv:2407.08965 [pdf, other]

Lite-SAM Is Actually What You Need for Segment Everything

Authors: Jianhai Fu, Yuanjie Yu, Ningchuan Li, Yi Zhang, Qichao Chen, Jun Yin, Zhiyu Xiang

Abstract: This paper introduces Lite-SAM, an efficient end-to-end solution for the SegEvery task designed to reduce computational costs and redundancy. Lite-SAM is composed of four main components: a streamlined CNN-Transformer hybrid encoder (LiteViT), an automated prompt proposal network (AutoPPN), a traditional prompt encoder, and a mask decoder. All these components are integrated within the SAM framewo… ▽ More This paper introduces Lite-SAM, an efficient end-to-end solution for the SegEvery task designed to reduce computational costs and redundancy. Lite-SAM is composed of four main components: a streamlined CNN-Transformer hybrid encoder (LiteViT), an automated prompt proposal network (AutoPPN), a traditional prompt encoder, and a mask decoder. All these components are integrated within the SAM framework. Our LiteViT, a high-performance lightweight backbone network, has only 1.16M parameters, which is a 23% reduction compared to the lightest existing backbone network Shufflenet. We also introduce AutoPPN, an innovative end-to-end method for prompt boxes and points generation. This is an improvement over traditional grid search sampling methods, and its unique design allows for easy integration into any SAM series algorithm, extending its usability. we have thoroughly benchmarked Lite-SAM across a plethora of both public and private datasets. The evaluation encompassed a broad spectrum of universal metrics, including the number of parameters, SegEvery execution time, and accuracy. The findings reveal that Lite-SAM, operating with a lean 4.2M parameters, significantly outpaces its counterparts, demonstrating performance improvements of 43x, 31x, 20x, 21x, and 1.6x over SAM, MobileSAM, Edge-SAM, EfficientViT-SAM, and MobileSAM-v2 respectively, all the while maintaining competitive accuracy. This underscores Lite-SAM's prowess in achieving an optimal equilibrium between performance and precision, thereby setting a new state-of-the-art(SOTA) benchmark in the domain. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: ECCV 2024 Accepted

arXiv:2407.08952 [pdf, other]

Detect, Investigate, Judge and Determine: A Novel LLM-based Framework for Few-shot Fake News Detection

Authors: Ye Liu, Jiajun Zhu, Kai Zhang, Haoyu Tang, Yanghai Zhang, Xukai Liu, Qi Liu, Enhong Chen

Abstract: Few-Shot Fake News Detection (FS-FND) aims to distinguish inaccurate news from real ones in extremely low-resource scenarios. This task has garnered increased attention due to the widespread dissemination and harmful impact of fake news on social media. Large Language Models (LLMs) have demonstrated competitive performance with the help of their rich prior knowledge and excellent in-context learni… ▽ More Few-Shot Fake News Detection (FS-FND) aims to distinguish inaccurate news from real ones in extremely low-resource scenarios. This task has garnered increased attention due to the widespread dissemination and harmful impact of fake news on social media. Large Language Models (LLMs) have demonstrated competitive performance with the help of their rich prior knowledge and excellent in-context learning abilities. However, existing methods face significant limitations, such as the Understanding Ambiguity and Information Scarcity, which significantly undermine the potential of LLMs. To address these shortcomings, we propose a Dual-perspective Augmented Fake News Detection (DAFND) model, designed to enhance LLMs from both inside and outside perspectives. Specifically, DAFND first identifies the keywords of each news article through a Detection Module. Subsequently, DAFND creatively designs an Investigation Module to retrieve inside and outside valuable information concerning to the current news, followed by another Judge Module to derive its respective two prediction results. Finally, a Determination Module further integrates these two predictions and derives the final result. Extensive experiments on two publicly available datasets show the efficacy of our proposed method, particularly in low-resource settings. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08739 [pdf, other]

MAVIS: Mathematical Visual Instruction Tuning

Authors: Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Hongsheng Li

Abstract: Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, the mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need to be improved: visual encoding of math diagrams, diagram-language alignment, a… ▽ More Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, the mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need to be improved: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills. This draws forth an urgent demand for large-scale, high-quality data and training pipelines in visual mathematics. In this paper, we propose MAVIS, the first MAthematical VISual instruction tuning paradigm for MLLMs, involving a series of mathematical visual datasets and specialized MLLMs. Targeting the three issues, MAVIS contains three progressive training stages from scratch. First, we curate MAVIS-Caption, consisting of 558K diagram-caption pairs, to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved diagram visual encoding. Second, we utilize MAVIS-Caption to align the CLIP-Math with a large language model (LLM) by a projection layer, enhancing vision-language alignment in mathematical domains. Third, we introduce MAVIS-Instruct, including 900K meticulously collected and annotated visual math problems, which is adopted to finally instruct-tune the MLLM for robust mathematical reasoning skills. In MAVIS-Instruct, we incorporate complete chain-of-thought (CoT) rationales for each problem, and minimize textual redundancy, thereby concentrating the model towards the visual elements. Data and Models are released at https://github.com/ZrrSkywalker/MAVIS △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: Work in progress. Data and Models are released at https://github.com/ZrrSkywalker/MAVIS

arXiv:2407.08672 [pdf, other]

NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning

Authors: Yi Zhang, Chun-Wun Cheng, Ke Yu, Zhihai He, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero

Abstract: In this paper, we consider the problem of prototype-based vision-language reasoning problem. We observe that existing methods encounter three major challenges: 1) escalating resource demands and prolonging training times, 2) contending with excessive learnable parameters, and 3) fine-tuning based only on a single modality. These challenges will hinder their capability to adapt Vision-Language Mode… ▽ More In this paper, we consider the problem of prototype-based vision-language reasoning problem. We observe that existing methods encounter three major challenges: 1) escalating resource demands and prolonging training times, 2) contending with excessive learnable parameters, and 3) fine-tuning based only on a single modality. These challenges will hinder their capability to adapt Vision-Language Models (VLMs) to downstream tasks. Motivated by this critical observation, we propose a novel method called NODE-Adapter, which utilizes Neural Ordinary Differential Equations for better vision-language reasoning. To fully leverage both visual and textual modalities and estimate class prototypes more effectively and accurately, we divide our method into two stages: cross-modal prototype construction and cross-modal prototype optimization using neural ordinary differential equations. Specifically, we exploit VLM to encode hand-crafted prompts into textual features and few-shot support images into visual features. Then, we estimate the textual prototype and visual prototype by averaging the textual features and visual features, respectively, and adaptively combine the textual prototype and visual prototype to construct the cross-modal prototype. To alleviate the prototype bias, we then model the prototype optimization process as an initial value problem with Neural ODEs to estimate the continuous gradient flow. Our extensive experimental results, which cover few-shot classification, domain generalization, and visual reasoning on human-object interaction, demonstrate that the proposed method significantly outperforms existing state-of-the-art approaches. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08641 [pdf, other]

How more data can hurt: Instability and regularization in next-generation reservoir computing

Authors: Yuanzhao Zhang, Sean P. Cornelius

Abstract: It has been found recently that more data can, counter-intuitively, hurt the performance of deep neural networks. Here, we show that a more extreme version of the phenomenon occurs in data-driven models of dynamical systems. To elucidate the underlying mechanism, we focus on next-generation reservoir computing (NGRC) -- a popular framework for learning dynamics from data. We find that, despite lea… ▽ More It has been found recently that more data can, counter-intuitively, hurt the performance of deep neural networks. Here, we show that a more extreme version of the phenomenon occurs in data-driven models of dynamical systems. To elucidate the underlying mechanism, we focus on next-generation reservoir computing (NGRC) -- a popular framework for learning dynamics from data. We find that, despite learning a better representation of the flow map with more training data, NGRC can adopt an ill-conditioned ``integrator'' and lose stability. We link this data-induced instability to the auxiliary dimensions created by the delayed states in NGRC. Based on these findings, we propose simple strategies to mitigate the instability, either by increasing regularization strength in tandem with data size, or by carefully introducing noise during training. Our results highlight the importance of proper regularization in data-driven modeling of dynamical systems. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: 10 pages, 10 figures. Comments welcome

arXiv:2407.08608 [pdf, other]

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Authors: Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao

Abstract: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the… ▽ More Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0$\times$ with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6$\times$ lower numerical error than a baseline FP8 attention. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08560 [pdf, other]

Causal inference through multi-stage learning and doubly robust deep neural networks

Authors: Yuqian Zhang, Jelena Bradic

Abstract: Deep neural networks (DNNs) have demonstrated remarkable empirical performance in large-scale supervised learning problems, particularly in scenarios where both the sample size $n$ and the dimension of covariates $p$ are large. This study delves into the application of DNNs across a wide spectrum of intricate causal inference tasks, where direct estimation falls short and necessitates multi-stage… ▽ More Deep neural networks (DNNs) have demonstrated remarkable empirical performance in large-scale supervised learning problems, particularly in scenarios where both the sample size $n$ and the dimension of covariates $p$ are large. This study delves into the application of DNNs across a wide spectrum of intricate causal inference tasks, where direct estimation falls short and necessitates multi-stage learning. Examples include estimating the conditional average treatment effect and dynamic treatment effect. In this framework, DNNs are constructed sequentially, with subsequent stages building upon preceding ones. To mitigate the impact of estimation errors from early stages on subsequent ones, we integrate DNNs in a doubly robust manner. In contrast to previous research, our study offers theoretical assurances regarding the effectiveness of DNNs in settings where the dimensionality $p$ expands with the sample size. These findings are significant independently and extend to degenerate single-stage learning problems. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08559 [pdf]

Study of a Novel Capacitive Pressure Sensor Using Spiral Comb Electrodes

Authors: Wenjie Chen, Qi Yang, Qi Liu, Yiqun Zhang, Liang He, Yuanlin Xia, Zhuqing Wang, Yubo Huang, Jianfeng Chen, Cao Xia

Abstract: For traditional capacitive pressure sensors, high nonlinearity and poor sensitivity greatly limited their sensing applications. Hence, an innovative design of capacitors based on spiral comb electrodes is proposed for high-sensitivity pressure detection in this work. Compared to traditional capacitive pressure sensors with straight plate electrodes, the proposed sensor with the spiral electrodes i… ▽ More For traditional capacitive pressure sensors, high nonlinearity and poor sensitivity greatly limited their sensing applications. Hence, an innovative design of capacitors based on spiral comb electrodes is proposed for high-sensitivity pressure detection in this work. Compared to traditional capacitive pressure sensors with straight plate electrodes, the proposed sensor with the spiral electrodes increases the overlap areas of electrodes sufficiently, the pressure sensitivity can thus be greatly improved. Moreover, the capacitance variation of the proposed sensor is dominated by the change of the overlap area of the electrodes rather than the electrode's distance, the linearity can also thus be improved to higher than 0.99. Theoretical analysis and COMSOL-based finite element simulation have been implemented for principle verification and performance optimization. Simulation results show that the proposed design has a mechanical sensitivity of 1.5x10-4 m/Pa, capacitive sensitivity of 1.10 aF/Pa, and nonlinear error of 3.63%, respectively, at the pressure range from 0 to 30 kPa. An equivalent experiment has been further carried out for verification. Experimental results also show that both the sensitivity and linearity of capacitive pressure sensors with spiral electrodes are higher than those with straight electrodes. This work not only provides a new avenue for capacitor design, but also can be applied to high-sensitivity pressure detection. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: 20 pages, 14 figures

MSC Class: -

arXiv:2407.08546 [pdf, other]

Quantitative Evaluation of the Saliency Map for Alzheimer's Disease Classifier with Anatomical Segmentation

Authors: Yihan Zhang, Xuanshuo Zhang, Wei Wu, Haohan Wang

Abstract: Saliency maps have been widely used to interpret deep learning classifiers for Alzheimer's disease (AD). However, since AD is heterogeneous and has multiple subtypes, the pathological mechanism of AD remains not fully understood and may vary from patient to patient. Due to the lack of such understanding, it is difficult to comprehensively and effectively assess the saliency map of AD classifier. I… ▽ More Saliency maps have been widely used to interpret deep learning classifiers for Alzheimer's disease (AD). However, since AD is heterogeneous and has multiple subtypes, the pathological mechanism of AD remains not fully understood and may vary from patient to patient. Due to the lack of such understanding, it is difficult to comprehensively and effectively assess the saliency map of AD classifier. In this paper, we utilize the anatomical segmentation to allocate saliency values into different brain regions. By plotting the distributions of saliency maps corresponding to AD and NC (Normal Control), we can gain a comprehensive view of the model's decisions process. In order to leverage the fact that the brain volume shrinkage happens in AD patients during disease progression, we define a new evaluation metric, brain volume change score (VCS), by computing the average Pearson correlation of the brain volume changes and the saliency values of a model in different brain regions for each patient. Thus, the VCS metric can help us gain some knowledge of how saliency maps resulting from different models relate to the changes of the volumes across different regions in the whole brain. We trained candidate models on the ADNI dataset and tested on three different datasets. Our results indicate: (i) models with higher VCSs tend to demonstrate saliency maps with more details relevant to the AD pathology, (ii) using gradient-based adversarial training strategies such as FGSM and stochastic masking can improve the VCSs of the models. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08532 [pdf, other]

Tactics, Techniques, and Procedures (TTPs) in Interpreted Malware: A Zero-Shot Generation with Large Language Models

Authors: Ying Zhang, Xiaoyan Zhou, Hui Wen, Wenjia Niu, Jiqiang Liu, Haining Wang, Qiang Li

Abstract: Nowadays, the open-source software (OSS) ecosystem suffers from security threats of software supply chain (SSC) attacks. Interpreted OSS malware plays a vital role in SSC attacks, as criminals have an arsenal of attack vectors to deceive users into installing malware and executing malicious activities. In this paper, we introduce tactics, techniques, and procedures (TTPs) proposed by MITRE ATT\&CK… ▽ More Nowadays, the open-source software (OSS) ecosystem suffers from security threats of software supply chain (SSC) attacks. Interpreted OSS malware plays a vital role in SSC attacks, as criminals have an arsenal of attack vectors to deceive users into installing malware and executing malicious activities. In this paper, we introduce tactics, techniques, and procedures (TTPs) proposed by MITRE ATT\&CK into the interpreted malware analysis to characterize different phases of an attack lifecycle. Specifically, we propose GENTTP, a zero-shot approach to extracting a TTP of an interpreted malware package. GENTTP leverages large language models (LLMs) to automatically generate a TTP, where the input is a malicious package, and the output is a deceptive tactic and an execution tactic of attack vectors. To validate the effectiveness of GENTTP, we collect two datasets for evaluation: a dataset with ground truth labels and a large dataset in the wild. Experimental results show that GENTTP can generate TTPs with high accuracy and efficiency. To demonstrate GENTTP's benefits, we build an LLM-based Chatbot from 3,700+ PyPI malware's TTPs. We further conduct a quantitative analysis of malware's TTPs at a large scale. Our main findings include: (1) many OSS malicious packages share a relatively stable TTP, even with the increasing emergence of malware and attack campaigns, (2) a TTP reflects characteristics of a malware-based attack, and (3) an attacker's intent behind the malware is linked to a TTP. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: 19 pages, 11 figures

arXiv:2407.08267 [pdf, other]

Chiral bulk solitons in photonic graphene with decorated boundaries

Authors: Shuang Shen, Ce Shang, Yongdong Li, Yiqi Zhang

Abstract: We propose a chiral bulk soliton in a nonlinear photonic lattice with decorated boundaries, presenting a novel approach to manipulate photonic transport without extensive bulk modifications. Unlike traditional methods that rely on topological edge and corner modes, our strategy leverages the robust chiral propagation of bulk modes. By introducing nonlinearity into the system, we find a stable bulk… ▽ More We propose a chiral bulk soliton in a nonlinear photonic lattice with decorated boundaries, presenting a novel approach to manipulate photonic transport without extensive bulk modifications. Unlike traditional methods that rely on topological edge and corner modes, our strategy leverages the robust chiral propagation of bulk modes. By introducing nonlinearity into the system, we find a stable bulk soliton, akin to the topological valley Hall effects. The chiral bulk soliton exhibits remarkable stability; the energy does not decay even after a long-distance propagation; and the corresponding Fourier spectrum confirms the absence of inter-valley scattering indicating a valley-locking property. Our findings not only contribute to the fundamental understanding of nonlinear photonic systems but also hold significant practical implications for the design and optimization of photonic devices. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08223 [pdf, other]

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Authors: Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, **gbo Shang, Chen-Yu Lee, Tomas Pfister

Abstract: Retrieval augmented generation (RAG) combines the generative abilities of large language models (LLMs) with external knowledge sources to provide more accurate and up-to-date responses. Recent RAG advancements focus on improving retrieval outcomes through iterative LLM refinement or self-critique capabilities acquired through additional instruction tuning of LLMs. In this work, we introduce Specul… ▽ More Retrieval augmented generation (RAG) combines the generative abilities of large language models (LLMs) with external knowledge sources to provide more accurate and up-to-date responses. Recent RAG advancements focus on improving retrieval outcomes through iterative LLM refinement or self-critique capabilities acquired through additional instruction tuning of LLMs. In this work, we introduce Speculative RAG - a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM. Each draft is generated from a distinct subset of retrieved documents, offering diverse perspectives on the evidence while reducing input token counts per draft. This approach enhances comprehension of each subset and mitigates potential position bias over long context. Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts. Extensive experiments demonstrate that Speculative RAG achieves state-of-the-art performance with reduced latency on TriviaQA, MuSiQue, PubHealth, and ARC-Challenge benchmarks. It notably enhances accuracy by up to 12.97% while reducing latency by 51% compared to conventional RAG systems on PubHealth. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: Preprint

arXiv:2407.08202 [pdf]

A rotational ellipsoid model for solid Earth tide with high precision

Authors: Yongfeng Yang, Yunfei Zhang, Qiang Liu, Xianqing Lv, Pu Huang

Abstract: Solid Earth tide represents the elastic response of solid Earth to the lunar (solar) gravitational force. The yielding solid Earth due to the force has been thought to be a prolate ellipsoid since the time of Lord Kelvin, yet the ellipsoid's geometry such as semi-major axis's length, semi-minor axis's length, and oblateness remains unresolved. Additionally, the tidal displacement of solid Earth is… ▽ More Solid Earth tide represents the elastic response of solid Earth to the lunar (solar) gravitational force. The yielding solid Earth due to the force has been thought to be a prolate ellipsoid since the time of Lord Kelvin, yet the ellipsoid's geometry such as semi-major axis's length, semi-minor axis's length, and oblateness remains unresolved. Additionally, the tidal displacement of solid Earth is conventionally resolved through a combination of expanded potential equations and given Earth model. Here we present a geometric model in which both the ellipsoid's geometry and the tidal displacement of solid Earth can be resolved through a rotating ellipse with respect to the Moon (Sun). We test the geometric model using 23-year gravity data from 22 superconducting gravimeter (SG) stations and compare it with the current model recommended by the IERS (International Earth Rotation System) conventions (2010), the average Root Mean Square (RMS) deviation of the gravity change yielded by the geometric model against observation is 6.47 μGal (equivalent to 2.07 cm), while that yielded by the current model is 30.77μGal (equivalent to 9.85 cm). The geometric model represents a significant advance in understanding and predicting solid Earth tide, and will greatly contribute to many application fields such as geodesy, geophysics, astronomy, and oceanography. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: 20 pages, 4 figures, 1 table

arXiv:2407.08199 [pdf, other]

SRPose: Two-view Relative Pose Estimation with Sparse Keypoints

Authors: Rui Yin, Yulun Zhang, Zherong Pan, Jianjun Zhu, Cheng Wang, Biao Jia

Abstract: Two-view pose estimation is essential for map-free visual relocalization and object pose tracking tasks. However, traditional matching methods suffer from time-consuming robust estimators, while deep learning-based pose regressors only cater to camera-to-world pose estimation, lacking generalizability to different image sizes and camera intrinsics. In this paper, we propose SRPose, a sparse keypoi… ▽ More Two-view pose estimation is essential for map-free visual relocalization and object pose tracking tasks. However, traditional matching methods suffer from time-consuming robust estimators, while deep learning-based pose regressors only cater to camera-to-world pose estimation, lacking generalizability to different image sizes and camera intrinsics. In this paper, we propose SRPose, a sparse keypoint-based framework for two-view relative pose estimation in camera-to-world and object-to-camera scenarios. SRPose consists of a sparse keypoint detector, an intrinsic-calibration position encoder, and promptable prior knowledge-guided attention layers. Given two RGB images of a fixed scene or a moving object, SRPose estimates the relative camera or 6D object pose transformation. Extensive experiments demonstrate that SRPose achieves competitive or superior performance compared to state-of-the-art methods in terms of accuracy and speed, showing generalizability to both scenarios. It is robust to different image sizes and camera intrinsics, and can be deployed with low computing resources. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: 30 pages, 11 figures, to be published in ECCV 2024

arXiv:2407.08187 [pdf, other]

ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation

Authors: Ruijie Zhu, Chuxin Wang, Ziyang Song, Li Liu, Tianzhu Zhang, Yongdong Zhang

Abstract: Estimating depth from a single image is a challenging visual task. Compared to relative depth estimation, metric depth estimation attracts more attention due to its practical physical significance and critical applications in real-life scenarios. However, existing metric depth estimation methods are typically trained on specific datasets with similar scenes, facing challenges in generalizing acros… ▽ More Estimating depth from a single image is a challenging visual task. Compared to relative depth estimation, metric depth estimation attracts more attention due to its practical physical significance and critical applications in real-life scenarios. However, existing metric depth estimation methods are typically trained on specific datasets with similar scenes, facing challenges in generalizing across scenes with significant scale variations. To address this challenge, we propose a novel monocular depth estimation method called ScaleDepth. Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction (SASP) module and an adaptive relative depth estimation (ARDE) module, respectively. The proposed ScaleDepth enjoys several merits. First, the SASP module can implicitly combine structural and semantic features of the images to predict precise scene scales. Second, the ARDE module can adaptively estimate the relative depth distribution of each image within a normalized depth space. Third, our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework, without the need for setting the depth range or fine-tuning model. Extensive experiments demonstrate that our method attains state-of-the-art performance across indoor, outdoor, unconstrained, and unseen scenes. Project page: https://ruijiezhu94.github.io/ScaleDepth △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: 14 pages, 11 figure, 13 tables

arXiv:2407.08167 [pdf, other]

DSCENet: Dynamic Screening and Clinical-Enhanced Multimodal Fusion for MPNs Subtype Classification

Authors: Yuan Zhang, Yaolei Qi, Xiaoming Qi, Yongyue Wei, Guanyu Yang

Abstract: The precise subtype classification of myeloproliferative neoplasms (MPNs) based on multimodal information, which assists clinicians in diagnosis and long-term treatment plans, is of great clinical significance. However, it remains a great challenging task due to the lack of diagnostic representativeness for local patches and the absence of diagnostic-relevant features from a single modality. In th… ▽ More The precise subtype classification of myeloproliferative neoplasms (MPNs) based on multimodal information, which assists clinicians in diagnosis and long-term treatment plans, is of great clinical significance. However, it remains a great challenging task due to the lack of diagnostic representativeness for local patches and the absence of diagnostic-relevant features from a single modality. In this paper, we propose a Dynamic Screening and Clinical-Enhanced Network (DSCENet) for the subtype classification of MPNs on the multimodal fusion of whole slide images (WSIs) and clinical information. (1) A dynamic screening module is proposed to flexibly adapt the feature learning of local patches, reducing the interference of irrelevant features and enhancing their diagnostic representativeness. (2) A clinical-enhanced fusion module is proposed to integrate clinical indicators to explore complementary features across modalities, providing comprehensive diagnostic information. Our approach has been validated on the real clinical data, achieving an increase of 7.91% AUC and 16.89% accuracy compared with the previous state-of-the-art (SOTA) methods. The code is available at https://github.com/yuanzhang7/DSCENet. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: Accepted by MICCAI2024

arXiv:2407.08043 [pdf, other]

Spin/Phonon Dynamics in Single Molecular Magnets: I. quantum embedding

Authors: Nosheen Younas, Yu Zhang, Andrei Piryatinski, Eric R Bittner

Abstract: Single molecular magnets (SMMs) and Metal-Organic Frameworks (MOFs) attract significant interest due to their potential in quantum information processing, scalable quantum computing, and extended lifetimes and coherence times. The limiting factor in these systems is often the spin dephasing caused by interactions and couplings with the vibrational motions of the molecular framework. This work intr… ▽ More Single molecular magnets (SMMs) and Metal-Organic Frameworks (MOFs) attract significant interest due to their potential in quantum information processing, scalable quantum computing, and extended lifetimes and coherence times. The limiting factor in these systems is often the spin dephasing caused by interactions and couplings with the vibrational motions of the molecular framework. This work introduces a systematic projection/embedding scheme to analyze spin-phonon dynamics in molecular magnets. This scheme consolidates all spin/phonon couplings into a few collective degrees of freedom. quantum mechanically. Using parameters obtained from ab initio methods for spin/phonon coupling via Zeeman interaction, we apply this approach to compute the electronic spin relaxation times for a single-molecule qubit \ce{VOPc(OH)8}, which features a single unpaired electron localized on the central vanadium atom. However, our general embedding scheme can be applied to any single-molecule magnet or qubit MOF with any coupling/interaction Hamiltonian. This development offers a crucial tool for simulating spin relaxation in complex environments with significantly reduced computational complexity. △ Less

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.08039 [pdf, other]

Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models

Authors: Yuji Zhang, Sha Li, Jiateng Liu, Pengfei Yu, Yi R. Fung, **g Li, Manling Li, Heng Ji

Abstract: Hallucination is often regarded as a major impediment for using large language models (LLMs), especially for knowledge-intensive tasks. Even when the training corpus consists solely of true statements, language models still generate hallucinations in the form of amalgamations of multiple facts. We coin this phenomenon as ``knowledge overshadowing'': when we query knowledge from a language model wi… ▽ More Hallucination is often regarded as a major impediment for using large language models (LLMs), especially for knowledge-intensive tasks. Even when the training corpus consists solely of true statements, language models still generate hallucinations in the form of amalgamations of multiple facts. We coin this phenomenon as ``knowledge overshadowing'': when we query knowledge from a language model with multiple conditions, some conditions overshadow others, leading to hallucinated outputs. This phenomenon partially stems from training data imbalance, which we verify on both pretrained models and fine-tuned models, over a wide range of LM model families and sizes.From a theoretical point of view, knowledge overshadowing can be interpreted as over-generalization of the dominant conditions (patterns). We show that the hallucination rate grows with both the imbalance ratio (between the popular and unpopular condition) and the length of dominant condition description, consistent with our derived generalization bound. Finally, we propose to utilize overshadowing conditions as a signal to catch hallucination before it is produced, along with a training-free self-contrastive decoding method to alleviate hallucination during inference. Our proposed approach showcases up to 82% F1 for hallucination anticipation and 11.2% to 39.4% hallucination control, with different models and datasets. △ Less

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.08021 [pdf, other]

Field Deployment of Multi-Agent Reinforcement Learning Based Variable Speed Limit Controllers

Authors: Yuhang Zhang, Zhiyao Zhang, Marcos Quiñones-Grueiro, William Barbour, Clay Weston, Gautam Biswas, Daniel Work

Abstract: This article presents the first field deployment of a multi-agent reinforcement-learning (MARL) based variable speed limit (VSL) control system on the I-24 freeway near Nashville, Tennessee. We describe how we train MARL agents in a traffic simulator and directly deploy the simulation-based policy on a 17-mile stretch of Interstate 24 with 67 VSL controllers. We use invalid action masking and seve… ▽ More This article presents the first field deployment of a multi-agent reinforcement-learning (MARL) based variable speed limit (VSL) control system on the I-24 freeway near Nashville, Tennessee. We describe how we train MARL agents in a traffic simulator and directly deploy the simulation-based policy on a 17-mile stretch of Interstate 24 with 67 VSL controllers. We use invalid action masking and several safety guards to ensure the posted speed limits satisfy the real-world constraints from the traffic management center and the Tennessee Department of Transportation. Since the time of launch of the system through April, 2024, the system has made approximately 10,000,000 decisions on 8,000,000 trips. The analysis of the controller shows that the MARL policy takes control for up to 98% of the time without intervention from safety guards. The time-space diagrams of traffic speed and control commands illustrate how the algorithm behaves during rush hour. Finally, we quantify the domain mismatch between the simulation and real-world data and demonstrate the robustness of the MARL policy to this mismatch. △ Less

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.07930 [pdf]

Token-Mol 1.0: Tokenized drug design with large language model

Authors: Jike Wang, Rui Qin, Mingyang Wang, Mei**g Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Xiaozhe Wan, Zhourui Wu, Liwei Liu, Yu Kang, Chang-Yu Hsieh, Tingjun Hou

Abstract: Significant interests have recently risen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduced Token-Mol, a token-only 3D drug… ▽ More Significant interests have recently risen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduced Token-Mol, a token-only 3D drug design model. This model encodes all molecular information, including 2D and 3D structures, as well as molecular property data, into tokens, which transforms classification and regression tasks in drug discovery into probabilistic prediction problems, thereby enabling learning through a unified paradigm. Token-Mol is built on the transformer decoder architecture and trained using random causal masking techniques. Additionally, we proposed the Gaussian cross-entropy (GCE) loss function to overcome the challenges in regression tasks, significantly enhancing the capacity of LLMs to learn continuous numerical values. Through a combination of fine-tuning and reinforcement learning (RL), Token-Mol achieves performance comparable to or surpassing existing task-specific methods across various downstream tasks, including pocket-based molecular generation, conformation generation, and molecular property prediction. Compared to existing molecular pre-trained models, Token-Mol exhibits superior proficiency in handling a wider range of downstream tasks essential for drug design. Notably, our approach improves regression task accuracy by approximately 30% compared to similar token-only methods. Token-Mol overcomes the precision limitations of token-only models and has the potential to integrate seamlessly with general models such as ChatGPT, paving the way for the development of a universal artificial intelligence drug design model that facilitates rapid and high-quality drug design by experts. △ Less

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.07895 [pdf, other]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Authors: Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li

Abstract: Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, their applications to multi-image scenarios remains less explored. Additionally, prior LMM research separately tackles different scenarios, leaving it impossible to generalize cross scenarios with new emerging capa… ▽ More Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, their applications to multi-image scenarios remains less explored. Additionally, prior LMM research separately tackles different scenarios, leaving it impossible to generalize cross scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: Project Page: https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/

arXiv:2407.07843 [pdf, other]

Spin/Phonon Dynamics in Single Molecular Magnets: II. spin/phonon entanglemen

Authors: Nosheen Younas, Yu Zhang, Andrei Piryatinski, Eric R Bittner

Abstract: We introduce a new quantum embedding method to explore spin-phonon interactions in molecular magnets. This technique consolidates various spin/phonon couplings into a limited number of collective degrees of freedom, allowing for a fully quantum mechanical treatment. By precisely factorizing the entire system into "system" and "bath" sub-ensembles, our approach simplifies a previously intractable p… ▽ More We introduce a new quantum embedding method to explore spin-phonon interactions in molecular magnets. This technique consolidates various spin/phonon couplings into a limited number of collective degrees of freedom, allowing for a fully quantum mechanical treatment. By precisely factorizing the entire system into "system" and "bath" sub-ensembles, our approach simplifies a previously intractable problem, making it solvable on modest-scale computers. We demonstrate the effectiveness of this method by studying the spin relaxation and dephasing times of the single-molecule qubit \ce{VOPc(OH)8}, which features a lone unpaired electron on the central vanadium atom. By using this mode projection method, we are able to perform numerical exact quantum dynamical calculation on this system which allows us to follow the flow of quantum information from the single spin qubit into the projected phonon degrees of freedom. Our results demonstrate both the utility of the method and suggest how one can engineer the environment as to further optimize the quantum properties of a qubit system. △ Less

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.07678 [pdf, other]

The ESO SupJup Survey II: The $^{12}$C/$^{13}$C ratios of three young brown dwarfs with CRIRES$^+$

Authors: D. González Picos, I. A. G. Snellen, S. de Regt, R. Landman, Y. Zhang, S. Gandhi, C. Ginski, A. Y. Kesseli, P. Mollière, T. Stolker

Abstract: Young brown dwarfs exhibit atmospheric characteristics similar to those of super-Jupiters, providing a unique opportunity to study planetary atmospheres. The ESO SupJup Survey, utilizing CRIRES$^+$ on the Very Large Telescope, aims to assess the role of $^{12}$C/$^{13}$C as a formation tracer. We present observations of three young brown dwarfs: 2MASS J12003792-7845082, TWA 28, and 2MASS J08561384… ▽ More Young brown dwarfs exhibit atmospheric characteristics similar to those of super-Jupiters, providing a unique opportunity to study planetary atmospheres. The ESO SupJup Survey, utilizing CRIRES$^+$ on the Very Large Telescope, aims to assess the role of $^{12}$C/$^{13}$C as a formation tracer. We present observations of three young brown dwarfs: 2MASS J12003792-7845082, TWA 28, and 2MASS J08561384-1342242, with the goal of constraining their chemical compositions, thermal profiles, surface gravities, spin rotations, and $^{12}$C/$^{13}$C. Atmospheric retrievals of CRIRES$^+$ K-band spectra were conducted using the radiative transfer code petitRADTRANS coupled with the Bayesian inference algorithm MultiNest, resulting in a detailed characterization of the atmospheres of these objects. We report the volume mixing ratios of main molecular and atomic species, including the novel detection of hydrogen fluoride (HF) in a brown dwarf's atmosphere, and determine $^{12}$C/$^{13}$C values of $81^{+28}_{-19}$ and $79^{+20}_{-14}$ in the atmospheres of TWA 28 and J0856, respectively, with strong significance ($>3σ$). Tentative evidence ($\sim 2σ$) of $^{13}$C in J1200 was found, with $^{12}$C/$^{13}$C = $114^{+69}_{-33}$, along with $^{18}$O detected at moderate significance in J0856 (3.3$σ$) and TWA 28 (2.1$σ$). The retrieved thermal profiles indicate hot atmospheres (2300-2600 K) with low surface gravities and slow spins, consistent with young objects. The consistent carbon isotope ratios among the three objects, showing no significant deviation from the local ISM, suggest a fragmentation-based formation mechanism similar to star formation. The tentative detection of $^{18}$O in two objects highlights the potential of high-resolution spectroscopy to probe additional isotope ratios, such as $^{16}$O/$^{18}$O, in the atmospheres of brown dwarfs and super-Jupiters. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: Accepted for publication in A&A

arXiv:2407.07660 [pdf, ps, other]

Boosting Medical Image Synthesis via Registration-guided Consistency and Disentanglement Learning

Authors: Chuanpu Li, Zeli Chen, Yiwen Zhang, Liming Zhong, Wei Yang

Abstract: Medical image synthesis remains challenging due to misalignment noise during training. Existing methods have attempted to address this challenge by incorporating a registration-guided module. However, these methods tend to overlook the task-specific constraints on the synthetic and registration modules, which may cause the synthetic module to still generate spatially aligned images with misaligned… ▽ More Medical image synthesis remains challenging due to misalignment noise during training. Existing methods have attempted to address this challenge by incorporating a registration-guided module. However, these methods tend to overlook the task-specific constraints on the synthetic and registration modules, which may cause the synthetic module to still generate spatially aligned images with misaligned target images during training, regardless of the registration module's function. Therefore, this paper proposes registration-guided consistency and incorporates disentanglement learning for medical image synthesis. The proposed registration-guided consistency architecture fosters task-specificity within the synthetic and registration modules by applying identical deformation fields before and after synthesis, while enforcing output consistency through an alignment loss. Moreover, the synthetic module is designed to possess the capability of disentangling anatomical structures and specific styles across various modalities. An anatomy consistency loss is introduced to further compel the synthetic module to preserve geometrical integrity within latent spaces. Experiments conducted on both an in-house abdominal CECT-CT dataset and a publicly available pelvic MR-CT dataset have demonstrated the superiority of the proposed method. △ Less

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.07651 [pdf, other]

Study of the decay and production properties of $D_{s1}(2536)$ and $D_{s2}^*(2573)$

Authors: M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann , et al. (645 additional authors not shown)

Abstract: The $e^+e^-\rightarrow D_s^+D_{s1}(2536)^-$ and $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ processes are studied using data samples collected with the BESIII detector at center-of-mass energies from 4.530 to 4.946~GeV. The absolute branching fractions of $D_{s1}(2536)^- \rightarrow \bar{D}^{*0}K^-$ and $D_{s2}^*(2573)^- \rightarrow \bar{D}^0K^-$ are measured for the first time to be… ▽ More The $e^+e^-\rightarrow D_s^+D_{s1}(2536)^-$ and $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ processes are studied using data samples collected with the BESIII detector at center-of-mass energies from 4.530 to 4.946~GeV. The absolute branching fractions of $D_{s1}(2536)^- \rightarrow \bar{D}^{*0}K^-$ and $D_{s2}^*(2573)^- \rightarrow \bar{D}^0K^-$ are measured for the first time to be $(35.9\pm 4.8\pm 3.5)\%$ and $(37.4\pm 3.1\pm 4.6)\%$, respectively. The measurements are in tension with predictions based on the assumption that the $D_{s1}(2536)$ and $D_{s2}^*(2573)$ are dominated by a bare $c\bar{s}$ component. The $e^+e^-\rightarrow D_s^+D_{s1}(2536)^-$ and $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ cross sections are measured, and a resonant structure at around 4.6~GeV with a width of 50~MeV is observed for the first time with a statistical significance of $15σ$ in the $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ process. It could be the $Y(4626)$ found by the Belle collaboration in the $D_s^+D_{s1}(2536)^{-}$ final state, since they have similar masses and widths. There is also evidence for a structure at around 4.75~GeV in both processes. △ Less

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.07523 [pdf, other]

SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning

Authors: Haiwen Diao, Bo Wan, Xu Jia, Yunzhi Zhuge, Ying Zhang, Huchuan Lu, Long Chen

Abstract: Parameter-efficient transfer learning (PETL) has emerged as a flourishing research field for adapting large pre-trained models to downstream tasks, greatly reducing trainable parameters while grappling with memory challenges during fine-tuning. To address it, memory-efficient series (METL) avoid backpropagating gradients through the large backbone. However, they compromise by exclusively relying o… ▽ More Parameter-efficient transfer learning (PETL) has emerged as a flourishing research field for adapting large pre-trained models to downstream tasks, greatly reducing trainable parameters while grappling with memory challenges during fine-tuning. To address it, memory-efficient series (METL) avoid backpropagating gradients through the large backbone. However, they compromise by exclusively relying on frozen intermediate outputs and limiting the exhaustive exploration of prior knowledge from pre-trained models. Moreover, the dependency and redundancy between cross-layer features are frequently overlooked, thereby submerging more discriminative representations and causing an inherent performance gap (vs. conventional PETL methods). Hence, we propose an innovative METL strategy called SHERL for resource-limited scenarios to decouple the entire adaptation into two successive and complementary processes. In the early route, intermediate outputs are consolidated via an anti-redundancy operation, enhancing their compatibility for subsequent interactions; thereby in the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead and regulate these fairly flexible features into more adaptive and powerful representations for new domains. Extensive ablations on vision-and-language and language-only tasks show that SHERL combines the strengths of both parameter and memory-efficient techniques, performing on-par or better across diverse architectures with lower memory during fine-tuning. Our code is publicly available at: https://github.com/Paranioar/SHERL. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: 23 pages, 11 figures, Accepted by ECCV2024

arXiv:2407.07332 [pdf, ps, other]

Several new classes of optimal ternary cyclic codes with two or three zeros

Authors: Gaofei Wu, Zhuohui You, Zhengbang Zha, Yuqing Zhang

Abstract: Cyclic codes are a subclass of linear codes and have wide applications in data storage systems, communication systems and consumer electronics due to their efficient encoding and decoding algorithms. Let $α$ be a generator of $\mathbb{F}_{3^m}^*$, where $m$ is a positive integer. Denote by $\mathcal{C}_{(i_1,i_2,\cdots, i_t)}$ the cyclic code with generator polynomial… ▽ More Cyclic codes are a subclass of linear codes and have wide applications in data storage systems, communication systems and consumer electronics due to their efficient encoding and decoding algorithms. Let $α$ be a generator of $\mathbb{F}_{3^m}^*$, where $m$ is a positive integer. Denote by $\mathcal{C}_{(i_1,i_2,\cdots, i_t)}$ the cyclic code with generator polynomial $m_{α^{i_1}}(x)m_{α^{i_2}}(x)\cdots m_{α^{i_t}}(x)$, where ${{m}_{α^{i}}}(x)$ is the minimal polynomial of ${{α}^{i}}$ over ${\mathbb{F}_{3}}$. In this paper, by analyzing the solutions of certain equations over finite fields, we present four classes of optimal ternary cyclic codes $\mathcal{C}_{(0,1,e)}$ and $\mathcal{C}_{(1,e,s)}$ with parameters $[3^m-1,3^m-\frac{3m}{2}-2,4]$, where $s=\frac{3^m-1}{2}$. In addition, by determining the solutions of certain equations and analyzing the irreducible factors of certain polynomials over $\mathbb{F}_{3^m}$, we present four classes of optimal ternary cyclic codes $\mathcal{C}_{(2,e)}$ and $\mathcal{C}_{(1,e)}$ with parameters $[3^m-1,3^m-2m-1,4]$. We show that our new optimal cyclic codes are inequivalent to the known ones. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: 16 pages

arXiv:2407.07302 [pdf, other]

Pairwise Distance Distillation for Unsupervised Real-World Image Super-Resolution

Authors: Yuehan Zhang, Seungjun Lee, Angela Yao

Abstract: Standard single-image super-resolution creates paired training data from high-resolution images through fixed downsampling kernels. However, real-world super-resolution (RWSR) faces unknown degradations in the low-resolution inputs, all the while lacking paired training data. Existing methods approach this problem by learning blind general models through complex synthetic augmentations on training… ▽ More Standard single-image super-resolution creates paired training data from high-resolution images through fixed downsampling kernels. However, real-world super-resolution (RWSR) faces unknown degradations in the low-resolution inputs, all the while lacking paired training data. Existing methods approach this problem by learning blind general models through complex synthetic augmentations on training inputs; they sacrifice the performance on specific degradation for broader generalization to many possible ones. We address the unsupervised RWSR for a targeted real-world degradation. We study from a distillation perspective and introduce a novel pairwise distance distillation framework. Through our framework, a model specialized in synthetic degradation adapts to target real-world degradations by distilling intra- and inter-model distances across the specialized model and an auxiliary generalized model. Experiments on diverse datasets demonstrate that our method significantly enhances fidelity and perceptual quality, surpassing state-of-the-art approaches in RWSR. The source code is available at https://github.com/Yuehan717/PDD. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: ECCV 2024

arXiv:2407.07233 [pdf, other]

A kinematical study of the launching region of the blueshifted HH 46/47 outflow with SINFONI K-band observations

Authors: M. Birney, C. Dougados, E. T. Whelan, B. Nisini, S. Cabrit, Y. Zhang

Abstract: Studying outflows is important as they may significantly contribute to angular momentum removal from the star/disk system, affecting disk evolution and planet formation. To investigate the different outflow components; the collimated jet, wide-angled molecular outflow, and outflow cavity, of the Class I HH 46/47 outflow system. We focus on their kinematics. We present Near Infrared (NIR) K-band in… ▽ More Studying outflows is important as they may significantly contribute to angular momentum removal from the star/disk system, affecting disk evolution and planet formation. To investigate the different outflow components; the collimated jet, wide-angled molecular outflow, and outflow cavity, of the Class I HH 46/47 outflow system. We focus on their kinematics. We present Near Infrared (NIR) K-band integral field observations of the blue-shifted HH 46/47 outflow base obtained using VLT/SINFONI with an angular resolution of 0".81. Our analysis focuses on [Fe II], H2 1-0 S(1), and, Br-gamma emission. We employ a wavelength recalibration technique based on OH telluric lines to probe the kinematics of the wide-angled flow with an accuracy of 1 km/s - 3 km/s. A velocity gradient of 10 km/s transverse to the outflow direction is confirmed in the wide-angled H2 outflow cavity. The H2 cavity peaks at radial velocities of -15 km/s to -30 km/s, and the atomic jet at v = -210 km/s. The outflow exhibits a layered structure; the high-velocity [Fe II] and Br-gamma jet is surrounded by a wide-angled H2 outflow cavity, which is in turn nested within the continuum emission and CO molecular outflow. The continuum emission and H2 outflow cavity are asymmetric with respect to the jet axis. We propose that the origin of the asymmetries and the velocity gradient detected in the wide-angled H2 cavity, is due to a wide-angled outflow or successive jet bowshocks expanding into an inhomogeneous ambient medium, or the presence of a secondary outflow. We eliminate outflow rotation as an exclusive origin of this velocity gradient due to large specific angular momenta values, J(r)= 3000 - 4000 km/s au calculated from 1" to 2" along the outflow. The observations reveal the complexities inherent in outflow systems, and the risk of attributing transverse velocity gradients solely to rotation. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: 20 pages, 12 figures, 2 tables, submitted to A&A

arXiv:2407.07099 [pdf, other]

Nash CoT: Multi-Path Inference with Preference Equilibrium

Authors: Ziqi Zhang, Cunxiang Wang, Xiong Xiao, Yue Zhang, Donglin Wang

Abstract: Chain-of-thought (CoT) prompting has emerged as a powerful technique for enhancing the reasoning capabilities of Large Language Models (LLMs) on complex problems. Among CoT-related studies, self-consistency (Multi-path inference with answer filtering through voting) involves generating multiple reasoning paths using the CoT framework and then selecting the most frequently produced outputs standing… ▽ More Chain-of-thought (CoT) prompting has emerged as a powerful technique for enhancing the reasoning capabilities of Large Language Models (LLMs) on complex problems. Among CoT-related studies, self-consistency (Multi-path inference with answer filtering through voting) involves generating multiple reasoning paths using the CoT framework and then selecting the most frequently produced outputs standing out as a concise yet competitive approach. While self-consistency has indeed led to the improvements in LLM inference, the use of multi-path inference also escalates deployment costs. Therefore, maintaining the performance benefits of self-consistency inherited from multi-path inference while reducing the inference costs holds significant value. In this research, we conceptualize language decoding as a preference consensus game, constructing a bi-player gaming system within each local path, and introduce Nash Chain-of-Thought (Nash CoT). Specifically, for a given question, we leverage LLM to autonomously select the contextually relevant template and generate outputs guided by this template, aiming to reach Nash Equilibrium alongside normal generation in each path. This approach allows us to achieve comparable or improved performance compared to self-consistency while using fewer inference paths on various inference tasks, including Arabic reasoning, Commonsense Question answering, and Symbolic inference. △ Less

Submitted 18 June, 2024; originally announced July 2024.

arXiv:2407.07035 [pdf, other]

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

Authors: Yue Zhang, Ziqiao Ma, Jialu Li, Yanyuan Qiao, Zun Wang, Joyce Chai, Qi Wu, Mohit Bansal, Parisa Kordjamshidi

Abstract: Vision-and-Language Navigation (VLN) has gained increasing attention over recent years and many approaches have emerged to advance their development. The remarkable achievements of foundation models have shaped the challenges and proposed methods for VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and emphasizes the… ▽ More Vision-and-Language Navigation (VLN) has gained increasing attention over recent years and many approaches have emerged to advance their development. The remarkable achievements of foundation models have shaped the challenges and proposed methods for VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and emphasizes the current methods and future opportunities leveraging foundation models to address VLN challenges. We hope our in-depth discussions could provide valuable resources and insights: on one hand, to milestone the progress and explore opportunities and potential roles for foundation models in this field, and on the other, to organize different challenges and solutions in VLN to foundation model researchers. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: Authors contributed equally to this work, and supervisors contributed equal advising to this work

arXiv:2407.07017 [pdf, other]

Shadows, greybody factors, emission rate, topological charge, and phase transitions for a charged black hole with a Kalb-Ramond field background

Authors: F. Hosseinifar, A. A. Araújo Filho, M. Y. Zhang, H. Chen, H. Hassanabadi

Abstract: In this work, we investigate a spherically symmetric charged black hole in the presence of a Kalb--Ramond field background. We calculate the photon sphere and shadow radii and, corroborating our results, we constrain them from observational data from the Event Horizon Telescope (EHT), particularly focusing on the shadow images of Sagittarius $A^{*}$. Additionally, we analyze the greybody factors,… ▽ More In this work, we investigate a spherically symmetric charged black hole in the presence of a Kalb--Ramond field background. We calculate the photon sphere and shadow radii and, corroborating our results, we constrain them from observational data from the Event Horizon Telescope (EHT), particularly focusing on the shadow images of Sagittarius $A^{*}$. Additionally, we analyze the greybody factors, emission rate, and partial absorption cross section. We also examine the topological charge and its application to the deflection angle. Finally, we conduct the analysis of the heat capacity and phase transitions. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: 7 pages in two column, 10 figures and 2 tables

arXiv:2407.06975 [pdf]

Optimization of noncollinear magnetic ordering temperature in Y-type hexaferrite by machine learning

Authors: Yonghong Li, **g Zhang, Linfeng Jiang, Long Zhang, Yugang Zhang, Xueliang Wu, Yisheng Chai, Xiaoyuan Zhou, Zizhen Zhou

Abstract: Searching the optimal do** compositions of the Y-type hexaferrite Ba2Mg2Fe12O22 remains a long-standing challenge for enhanced non-collinear magnetic transition temperature (TNC). Instead of the conventional trial-and-error approach, the composition-property descriptor is established via a data driven machine learning method named SISSO (sure independence screening and sparsifying operator). Bas… ▽ More Searching the optimal do** compositions of the Y-type hexaferrite Ba2Mg2Fe12O22 remains a long-standing challenge for enhanced non-collinear magnetic transition temperature (TNC). Instead of the conventional trial-and-error approach, the composition-property descriptor is established via a data driven machine learning method named SISSO (sure independence screening and sparsifying operator). Based on the chosen efficient and physically interpretable descriptor, a series of Y-type hexaferrite compositions are predicted to hold high TNC, among which the BaSrMg0.28Co1.72Fe10Al2O22 is then experimentally validated. Test results indicate that, under appropriate external magnetic field conditions, the TNC of this composition reaches up to reaches up to 568 K, and its magnetic transition temperature is also elevated to 735 K. This work offers a machine learning-based route to develop room temperature single phase multiferroics for device applications. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: accepted by Applied Physics Letters in 2024

arXiv:2407.06955 [pdf, other]

ICLGuard: Controlling In-Context Learning Behavior for Applicability Authorization

Authors: Wai Man Si, Michael Backes, Yang Zhang

Abstract: In-context learning (ICL) is a recent advancement in the capabilities of large language models (LLMs). This feature allows users to perform a new task without updating the model. Concretely, users can address tasks during the inference time by conditioning on a few input-label pair demonstrations along with the test input. It is different than the conventional fine-tuning paradigm and offers more… ▽ More In-context learning (ICL) is a recent advancement in the capabilities of large language models (LLMs). This feature allows users to perform a new task without updating the model. Concretely, users can address tasks during the inference time by conditioning on a few input-label pair demonstrations along with the test input. It is different than the conventional fine-tuning paradigm and offers more flexibility. However, this capability also introduces potential issues. For example, users may use the model on any data without restriction, such as performing tasks with improper or sensitive content, which might violate the model policy or conflict with the model owner's interests. As a model owner, it is crucial to establish a mechanism to control the model's behavior under ICL, depending on the model owner's requirements for various content. To this end, we introduce the concept of "applicability authorization" tailored for LLMs, particularly for ICL behavior, and propose a simple approach, ICLGuard. It is a fine-tuning framework designed to allow the model owner to regulate ICL behavior on different data. ICLGuard preserves the original LLM and fine-tunes only a minimal set of additional trainable parameters to "guard" the LLM. Empirical results show that the guarded LLM can deactivate its ICL ability on target data without affecting its ICL ability on other data and its general functionality across all data. △ Less

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.06794 [pdf, other]

ERQ: Error Reduction for Post-Training Quantization of Vision Transformers

Authors: Yunshan Zhong, Jiawei Hu, You Huang, Yuxin Zhang, Rongrong Ji

Abstract: Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models. However, existing methods typically overlook the intricate interdependence between quantized weight and activation, leading to considerable quantization error. In this paper, we propose ERQ, a two-step PTQ approach meticulously crafted to sequentially redu… ▽ More Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models. However, existing methods typically overlook the intricate interdependence between quantized weight and activation, leading to considerable quantization error. In this paper, we propose ERQ, a two-step PTQ approach meticulously crafted to sequentially reduce the quantization error arising from activation and weight quantization. ERQ first introduces Activation quantization error reduction (Aqer) that strategically formulates the minimization of activation quantization error as a Ridge Regression problem, tackling it by updating weights with full-precision. Subsequently, ERQ introduces Weight quantization error reduction (Wqer) that adopts an iterative approach to mitigate the quantization error induced by weight quantization. In each iteration, an empirically derived, efficient proxy is employed to refine the rounding directions of quantized weights, coupled with a Ridge Regression solver to curtail weight quantization error. Experimental results attest to the effectiveness of our approach. Notably, ERQ surpasses the state-of-the-art GPTQ by 22.36% in accuracy for W3A4 ViT-S. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: ICML2024 (Spotlight)

arXiv:2407.06691 [pdf, other]

OFDM Achieves the Lowest Ranging Sidelobe Under Random ISAC Signaling

Authors: Fan Liu, Ying Zhang, Yifeng Xiong, Shuangyang Li, Weijie Yuan, Feifei Gao, Shi **, Giuseppe Caire

Abstract: This paper aims to answer a fundamental question in the area of Integrated Sensing and Communications (ISAC): What is the optimal communication-centric ISAC waveform for ranging? Towards that end, we first established a generic framework to analyze the sensing performance of communication-centric ISAC waveforms built upon orthonormal signaling bases and random data symbols. Then, we evaluated thei… ▽ More This paper aims to answer a fundamental question in the area of Integrated Sensing and Communications (ISAC): What is the optimal communication-centric ISAC waveform for ranging? Towards that end, we first established a generic framework to analyze the sensing performance of communication-centric ISAC waveforms built upon orthonormal signaling bases and random data symbols. Then, we evaluated their ranging performance by adopting both the periodic and aperiodic auto-correlation functions (P-ACF and A-ACF), and defined the expectation of the integrated sidelobe level (EISL) as a sensing performance metric. On top of that, we proved that among all communication waveforms with cyclic prefix (CP), the orthogonal frequency division multiplexing (OFDM) modulation is the only globally optimal waveform that achieves the lowest ranging sidelobe for quadrature amplitude modulation (QAM) and phase shift keying (PSK) constellations, in terms of both the EISL and the sidelobe level at each individual lag of the P-ACF. As a step forward, we proved that among all communication waveforms without CP, OFDM is a locally optimal waveform for QAM/PSK in the sense that it achieves a local minimum of the EISL of the A-ACF. Finally, we demonstrated by numerical results that under QAM/PSK constellations, there is no other orthogonal communication-centric waveform that achieves a lower ranging sidelobe level than that of the OFDM, in terms of both P-ACF and A-ACF cases. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: 14 pages, 12 figures, submitted to IEEE for possible publication

arXiv:2407.06611 [pdf, other]

CEIA: CLIP-Based Event-Image Alignment for Open-World Event-Based Understanding

Authors: Wenhao Xu, Wenming Weng, Yueyi Zhang, Zhiwei Xiong

Abstract: We present CEIA, an effective framework for open-world event-based understanding. Currently training a large event-text model still poses a huge challenge due to the shortage of paired event-text data. In response to this challenge, CEIA learns to align event and image data as an alternative instead of directly aligning event and text data. Specifically, we leverage the rich event-image datasets t… ▽ More We present CEIA, an effective framework for open-world event-based understanding. Currently training a large event-text model still poses a huge challenge due to the shortage of paired event-text data. In response to this challenge, CEIA learns to align event and image data as an alternative instead of directly aligning event and text data. Specifically, we leverage the rich event-image datasets to learn an event embedding space aligned with the image space of CLIP through contrastive learning. In this way, event and text data are naturally aligned via using image data as a bridge. Particularly, CEIA offers two distinct advantages. First, it allows us to take full advantage of the existing event-image datasets to make up the shortage of large-scale event-text datasets. Second, leveraging more training data, it also exhibits the flexibility to boost performance, ensuring scalable capability. In highlighting the versatility of our framework, we make extensive evaluations through a diverse range of event-based multi-modal applications, such as object recognition, event-image retrieval, event-text retrieval, and domain adaptation. The outcomes demonstrate CEIA's distinct zero-shot superiority over existing methods on these applications. △ Less

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.06590 [pdf, other]

Revolutionizing Battery Disassembly: The Design and Implementation of a Battery Disassembly Autonomous Mobile Manipulator Robot(BEAM-1)

Authors: Yanlong Peng, Zhigang Wang, Yisheng Zhang, Shengmin Zhang, Nan Cai, Fan Wu, Ming Chen

Abstract: The efficient disassembly of end-of-life electric vehicle batteries(EOL-EVBs) is crucial for green manufacturing and sustainable development. The current pre-programmed disassembly conducted by the Autonomous Mobile Manipulator Robot(AMMR) struggles to meet the disassembly requirements in dynamic environments, complex scenarios, and unstructured processes. In this paper, we propose a Battery Disas… ▽ More The efficient disassembly of end-of-life electric vehicle batteries(EOL-EVBs) is crucial for green manufacturing and sustainable development. The current pre-programmed disassembly conducted by the Autonomous Mobile Manipulator Robot(AMMR) struggles to meet the disassembly requirements in dynamic environments, complex scenarios, and unstructured processes. In this paper, we propose a Battery Disassembly AMMR(BEAM-1) system based on NeuralSymbolic AI. It detects the environmental state by leveraging a combination of multi-sensors and neural predicates and then translates this information into a quasi-symbolic space. In real-time, it identifies the optimal sequence of action primitives through LLM-heuristic tree search, ensuring high-precision execution of these primitives. Additionally, it employs positional speculative sampling using intuitive networks and achieves the disassembly of various bolt types with a meticulously designed end-effector. Importantly, BEAM-1 is a continuously learning embodied intelligence system capable of subjective reasoning like a human, and possessing intuition. A large number of real scene experiments have proved that it can autonomously perceive, decide, and execute to complete the continuous disassembly of bolts in multiple, multi-category, and complex situations, with a success rate of 98.78%. This research attempts to use NeuroSymbolic AI to give robots real autonomous reasoning, planning, and learning capabilities. BEAM-1 realizes the revolution of battery disassembly. Its framework can be easily ported to any robotic system to realize different application scenarios, which provides a ground-breaking idea for the design and implementation of future embodied intelligent robotic systems. △ Less

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.06565 [pdf, ps, other]

Non-uniqueness of Leray weak solutions of the forced MHD equations

Authors: Jun Wang, Fei Xu, Yong Zhang

Abstract: In this paper, we exhibit non-uniqueness of Leray weak solutions of the forced magnetohydrodynamic (MHD for short) equations. Similar to the solutions constructed in \cite{ABC2}, we first find a special steady solution of ideal MHD equations whose linear unstability was proved in \cite{Lin}. It is possible to perturb the unstable scenario of ideal MHD to 3D viscous and resistive MHD equations, whi… ▽ More In this paper, we exhibit non-uniqueness of Leray weak solutions of the forced magnetohydrodynamic (MHD for short) equations. Similar to the solutions constructed in \cite{ABC2}, we first find a special steady solution of ideal MHD equations whose linear unstability was proved in \cite{Lin}. It is possible to perturb the unstable scenario of ideal MHD to 3D viscous and resistive MHD equations, which can be regarded as the first unstable "background" solution. Our perturbation argument is based on the spectral theoretic approach \cite{Kato}. The second solution we would construct is a trajectory on the unstable manifold associated to the unstable steady solution. It is worth noting that these solutions live precisely on the borderline of the known well-posedness theory. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: 23pp. arXiv admin note: text overlap with arXiv:2310.10075, arXiv:2112.03116 by other authors

arXiv:2407.06504 [pdf, other]

Reprogramming Distillation for Medical Foundation Models

Authors: Yuhang Zhou, Siyuan Du, Haolin Li, Jiangchao Yao, Ya Zhang, Yanfeng Wang

Abstract: Medical foundation models pre-trained on large-scale datasets have demonstrated powerful versatile capabilities for various tasks. However, due to the gap between pre-training tasks (or modalities) and downstream tasks (or modalities), the real-world computation and speed constraints, it might not be straightforward to apply medical foundation models in the downstream scenarios. Previous methods,… ▽ More Medical foundation models pre-trained on large-scale datasets have demonstrated powerful versatile capabilities for various tasks. However, due to the gap between pre-training tasks (or modalities) and downstream tasks (or modalities), the real-world computation and speed constraints, it might not be straightforward to apply medical foundation models in the downstream scenarios. Previous methods, such as parameter efficient fine-tuning (PEFT) methods and knowledge distillation (KD) methods, are unable to simultaneously address the task (or modality) inconsistency and achieve personalized lightweight deployment under diverse real-world demands. To address the above issues, we propose a novel framework called Reprogramming Distillation (RD). On one hand, RD reprograms the original feature space of the foundation model so that it is more relevant to downstream scenarios, aligning tasks and modalities. On the other hand, through a co-training mechanism and a shared classifier, connections are established between the reprogrammed knowledge and the knowledge of student models, ensuring that the reprogrammed feature space can be smoothly mimic by the student model of different structures. Further, to reduce the randomness under different training conditions, we design a Centered Kernel Alignment (CKA) distillation to promote robust knowledge transfer. Empirically, we show that on extensive datasets, RD consistently achieve superior performance compared with previous PEFT and KD methods. △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: MICCAI 2024

Showing 1–50 of 18,986 results for author: Zhang., Y