
ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance

Authors: Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, Yinqiang Zheng

Abstract: Diffusion models excel at producing high-quality images; however, scaling to higher resolutions, such as 4K, often results in over-smoothed content, structural distortions, and repetitive patterns. To this end, we introduce ResMaster, a novel, training-free method that empowers resolution-limited diffusion models to generate high-quality images beyond resolution restrictions. Specifically, ResMaster leverages a low-resolution reference image created by a pre-trained diffusion model to provide structural and fine-grained guidance for crafting high-resolution images on a patch-by-patch basis. To ensure a coherent global structure, ResMaster meticulously aligns the low-frequency components of high-resolution patches with the low-resolution reference at each denoising step. For fine-grained guidance, tailored image prompts based on the low-resolution reference and enriched textual prompts produced by a vision-language model are incorporated. This approach could significantly mitigate local pattern distortions and improve detail refinement. Extensive experiments validate that ResMaster sets a new benchmark for high-resolution image generation and demonstrates promising efficiency. The project page is https://shuweis.github.io/ResMaster .

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.16457 [pdf, other]

A hybrid FEM-NN optimization method to learn the physics-constrained constitutive relations from full-field data

Authors: Xinxin Wu, Kaiqiang Sun, Shaohua Yang, Huan Wang, Ye Xu, Yin Zhang, Sheng Mao

Abstract: Neural networks (NNs) have demonstrated strong capabilities of representing high-dimensional, complex functional relations, and hence have been widely used to characterize complex constitutive relations for various types of materials, such as polycrystals, polymers, etc. However, to construct a reliable NN-based constitutive model, a considerable amount of data, i.e., stress-strain states along different loading paths, is needed, which can be expensive to collect. To address this challenge, we develop a hybrid finite element method (FEM) - NN optimization framework to learn complex hyperelastic constitutive relations from full-field data. The key advantage of this framework is that it can make use of the non-uniform displacement field due to the geometric inhomogeneities for training NN-based constitutive models. Since such data can provide many different stress-strain states in a single test, it can greatly reduce the number of experiments needed for the training of NNs. Besides, we adopt a mechanics-informed neural network (MINN) as our architecture to ensure that our NN-based models satisfy all necessary physical constraints by construction, such as objectivity, material symmetry, polyconvexity, etc. Such architecture is also key to the convergence of our optimization framework. We then use both synthetic and experimental data to test the performance of our proposed framework on various isotropic hyperelastic materials. Results show that our optimization framework can be used to train NN-based constitutive models for hyperelastic materials with high accuracy and efficiency using data generated from simple tests, which can also be easily adapted to characterize complex constitutive models for a broader range of materials.

Submitted 24 June, 2024; originally announced June 2024.

Comments: 14 pages, 7 figures

arXiv:2406.16431 [pdf, other]

A ROOT based detector geometry and event visualization system for JUNO-TAO

Authors: Minghua Liao, Kaixuan Huang, Yumei Zhang, Jiayang Xu, Guofu Cao, Zhengyun You

Abstract: The Taishan Antineutrino Observatory (TAO or JUNO-TAO) is a satellite experiment of the Jiangmen Underground Neutrino Observatory (JUNO) and is located near the Taishan nuclear power plant (NPP). TAO will measure the energy spectrum of reactor antineutrinos with unprecedented precision, which will benefit both reactor neutrino physics and the nuclear database. A detector geometry and event visualization system has been developed for TAO. The software is based on ROOT packages and embedded in the TAO offline software framework. It provides an intuitive tool to visualize the detector geometry, tune the reconstruction algorithm, understand the neutrino physics, and monitor the operation of reactors at NPP. The further applications of the visualization system in the experimental operation of TAO and its future development are also discussed.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.16427 [pdf, other]

Dynamic Pseudo Label Optimization in Point-Supervised Nuclei Segmentation

Authors: Ziyue Wang, Ye Zhang, Yifeng Wang, Linghan Cai, Yongbing Zhang

Abstract: Deep learning has achieved impressive results in nuclei segmentation, but the massive requirement for pixel-wise labels remains a significant challenge. To alleviate the annotation burden, existing methods generate pseudo masks for model training using point labels. However, the generated masks are inevitably different from the ground truth, and these dissimilarities are not handled reasonably during the network training, resulting in the subpar performance of the segmentation model. To tackle this issue, we propose a framework named DoNuSeg, enabling Dynamic pseudo label Optimization in point-supervised Nuclei Segmentation. Specifically, DoNuSeg takes advantage of class activation maps (CAMs) to adaptively capture regions with semantics similar to annotated points. To leverage semantic diversity in the hierarchical feature levels, we design a dynamic selection module to choose the optimal one among CAMs from different encoder blocks as pseudo masks. Meanwhile, a CAM-guided contrastive module is proposed to further enhance the accuracy of pseudo masks. In addition to exploiting the semantic information provided by CAMs, we consider location priors inherent to point labels, developing a task-decoupled structure for effectively differentiating nuclei. Extensive experiments demonstrate that DoNuSeg outperforms state-of-the-art point-supervised methods. The code is available at https://github.com/shinning0821/MICCAI24-DoNuSeg.

Submitted 24 June, 2024; originally announced June 2024.

Comments: early accepted by MICCAI2024

arXiv:2406.16148 [pdf, other]

Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking

Authors: Yuwei Zhang, Tong Xia, Jing Han, Yu Wu, Georgios Rizos, Yang Liu, Mohammed Mosuily, Jagmohan Chauhan, Cecilia Mascolo

Abstract: Respiratory audio, such as coughing and breathing sounds, has predictive power for a wide range of healthcare applications, yet is currently under-explored. The main problem for those applications arises from the difficulty in collecting large labeled task-specific data for model development. Generalizable respiratory acoustic foundation models pretrained with unlabeled data would offer appealing advantages and possibly unlock this impasse. However, given the safety-critical nature of healthcare applications, it is pivotal to also ensure openness and replicability for any proposed foundation model solution. To this end, we introduce OPERA, an OPEn Respiratory Acoustic foundation model pretraining and benchmarking system, as the first approach answering this need. We curate large-scale respiratory audio datasets (~136K samples, 440 hours), pretrain three pioneering foundation models, and build a benchmark consisting of 19 downstream respiratory health tasks for evaluation. Our pretrained models demonstrate superior performance (against existing acoustic models pretrained with general audio on 16 out of 19 tasks) and generalizability (to unseen datasets and new respiratory audio modalities). This highlights the great promise of respiratory acoustic foundation models and encourages more studies using OPERA as an open resource to accelerate research on respiratory audio for health. The system is accessible from https://github.com/evelyn0414/OPERA.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.16129 [pdf]

UDHF2-Net: An Uncertainty-diffusion-model-based High-Frequency TransFormer Network for High-accuracy Interpretation of Remotely Sensed Imagery

Authors: Pengfei Zhang, Chang Li, Yongjun Zhang, Rongjun Qin

Abstract: Remotely sensed image high-accuracy interpretation (RSIHI), including tasks such as semantic segmentation and change detection, faces three major problems: (1) the complementarity problem of spatially stationary and non-stationary frequency; (2) the edge uncertainty problem caused by down-sampling in the encoder step and intrinsic edge noises; and (3) the false detection problem caused by imagery registration error in change detection. To solve these problems, an uncertainty-diffusion-model-based high-Frequency TransFormer network (UDHF2-Net) is proposed for RSIHI, with the following advantages: (1) a spatially-stationary-and-non-stationary high-frequency connection paradigm (SHCP) is proposed to enhance the interaction of spatially stationary and non-stationary frequency features to yield high-fidelity edge extraction results. Inspired by HRFormer, SHCP retains the high-frequency stream through the whole encoder-decoder process with parallel high-to-low frequency streams and reduces the edge loss caused by downsampling operations; (2) a mask-and-geo-knowledge-based uncertainty diffusion module (MUDM) is proposed to improve the robustness and edge noise resistance. MUDM could further optimize the uncertain region to improve edge extraction results by gradually removing the multiple geo-knowledge-based noises; (3) a semi-pseudo-Siamese UDHF2-Net for the change detection task is proposed to reduce the pseudo change caused by registration error. It adopts a semi-pseudo-Siamese architecture to extract the above complementary frequency features for adaptively reducing registration differencing, and MUDM to recover the uncertain region by gradually reducing the registration error in addition to the above edge noises. Comprehensive experiments were performed to demonstrate the superiority of UDHF2-Net. In particular, ablation experiments indicate the effectiveness of UDHF2-Net.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.16011 [pdf, ps, other]

The derived dimensions and representation distances of artin algebras

Authors: Junling Zheng, Yingying Zhang, Jinbi Zhang

Abstract: There is a well-known class of algebras called Igusa-Todorov algebras which were introduced in relation to the finitistic dimension conjecture. As a generalization of Igusa-Todorov algebras, the new notion of $(m,n)$-Igusa-Todorov algebras provides a wider framework for studying derived dimensions. In this paper, we give a method for constructing $(m,n)$-Igusa-Todorov algebras. As an application, we present for general artin algebras a relationship between the derived dimension and the representation distance. Moreover, we conclude by showing that the main result can be used to give a better upper bound for the derived dimension for some classes of algebras.

Submitted 23 June, 2024; originally announced June 2024.

Comments: accepted for publication in Archiv der Mathematik

MSC Class: 18G20; 16E10; 18E10

arXiv:2406.15992 [pdf, other]

Can LLM Graph Reasoning Generalize beyond Pattern Memorization?

Authors: Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xiaochuang Han, Tianxing He, Yulia Tsvetkov

Abstract: Large language models (LLMs) demonstrate great potential for problems with implicit graphical structures, while recent works seek to enhance the graph reasoning capabilities of LLMs through specialized instruction tuning. The resulting 'graph LLMs' are evaluated with in-distribution settings only, thus it remains underexplored whether LLMs are learning generalizable graph reasoning skills or merely memorizing patterns in the synthetic training data. To this end, we propose the NLGift benchmark, an evaluation suite of LLM graph reasoning generalization: whether LLMs could go beyond semantic, numeric, structural, and reasoning patterns in the synthetic training data and improve utility on real-world graph-based tasks. Extensive experiments with two LLMs across four graph reasoning tasks demonstrate that while generalization on simple patterns (semantic, numeric) is somewhat satisfactory, LLMs struggle to generalize across reasoning and real-world patterns, casting doubt on the benefit of synthetic graph tuning for real-world tasks with underlying network structures. We explore three strategies to improve LLM graph reasoning generalization, and we find that while post-training alignment is most promising for real-world tasks, empowering LLM graph reasoning to go beyond pattern memorization remains an open research question.

Submitted 22 June, 2024; originally announced June 2024.

Comments: 16 pages, 6 figures, Code and data will be publicly available at https://github.com/MatthewYZhang/NLGift

ACM Class: I.2.7

arXiv:2406.15956 [pdf]

Decoupling Many-Body Interactions in CeO2 (111) Oxygen Vacancy Structure: Insights from Machine-Learning and Cluster Expansion

Authors: Yujing Zhang, Zhong-Kang Han, Beien Zhu, Xiaojuan Hu, Maria Troppenz, Santiago Rigamonti, Hui Li, Claudia Draxl, M. Verónica Ganduglia-Pirovano, Yi Gao

Abstract: Oxygen vacancies (VO's) are of paramount importance in influencing the properties and applications of ceria (CeO2). Yet, comprehending the distribution and nature of the VO's poses a significant challenge due to the vast number of electronic configurations and intricate many-body interactions among VO's and polarons (Ce3+'s). In this study, we employed LASSO regression from machine learning, in conjunction with a cluster expansion model and first-principles calculations, to decouple the interactions among the Ce3+'s and VO's, thereby circumventing the limitations associated with sampling electronic configurations. By separating these interactions, we identified specific electronic configurations characterized by the most favorable VO-Ce3+ attractions and the least Ce3+-Ce3+/VO-VO repulsions, which are crucial in determining the stability of vacancy structures. Through more than 10^8 Metropolis Monte Carlo samplings of VO's and Ce3+'s in the near-surface of CeO2(111), we explored potential configurations within an 8x8 supercell. Our findings revealed that oxygen vacancies tend to aggregate and are most abundant in the third oxygen layer, primarily due to extensive geometric relaxation, an aspect previously overlooked. This behavior is notably dependent on the concentration of VO's. This work introduces a novel theoretical framework for unraveling the complex vacancy structures in metal oxides, with potential applications in redox and catalytic chemistry.

Submitted 22 June, 2024; originally announced June 2024.

Comments: 22 pages, 1 scheme, 5 figures

arXiv:2406.15945 [pdf, other]

Full-Space Wireless Sensing Enabled by Multi-Sector Intelligent Surfaces

Authors: Yumeng Zhang, Xiaodan Shao, Hongyu Li, Bruno Clerckx, Rui Zhang

Abstract: The multi-sector intelligent surface (IS), benefiting from a smarter wave manipulation capability, has been shown to enhance channel gain and offer full-space coverage in communications. However, the benefits of multi-sector IS in wireless sensing remain unexplored. This paper introduces the application of multi-sector IS for wireless sensing/localization. Specifically, we propose a new self-sensing system, where an active source controller uses the multi-sector IS geometry to reflect/scatter the emitted signals towards the entire space, thereby achieving full-space coverage for wireless sensing. Additionally, dedicated sensors are installed aligned with the IS elements at each sector, which collect echo signals from the target and cooperate to sense the target angle. In this context, we develop a maximum likelihood estimator of the target angle for the proposed multi-sector IS self-sensing system, along with the corresponding theoretical limits defined by the Cramér-Rao Bound. The analysis reveals that the advantages of the multi-sector IS self-sensing system stem from two aspects: enhancing the probing power on targets (thereby improving power efficiency) and increasing the rate of target angle (thereby enhancing the transceiver's sensitivity to target angles). Finally, our analysis and simulations confirm that the multi-sector IS self-sensing system, particularly the 4-sector architecture, achieves full-space sensing capability beyond the single-sector IS configuration. Furthermore, similarly to communications, employing directive antenna patterns on each sector's IS elements and sensors significantly enhances sensing capabilities. This enhancement originates from both improved power efficiency and improved target angle sensitivity; the former is also observed in communications, while the latter is unique to sensing.

Submitted 25 June, 2024; v1 submitted 22 June, 2024; originally announced June 2024.

Comments: 13 pages, 9 figures

arXiv:2406.15836 [pdf, other]

Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models

Authors: Yang Zhang, Chenjia Bai, Bin Zhao, Junchi Yan, Xiu Li, Xuelong Li

Abstract: Learning a world model for model-free Reinforcement Learning (RL) agents can significantly improve the sample efficiency by learning policies in imagination. However, building a world model for Multi-Agent RL (MARL) can be particularly challenging due to the scalability issue in a centralized architecture arising from a large number of agents, and also the non-stationarity issue in a decentralized architecture stemming from the inter-dependency among agents. To address both challenges, we propose a novel world model for MARL that learns decentralized local dynamics for scalability, combined with a centralized representation aggregation from all agents. We cast the dynamics learning as an auto-regressive sequence modeling problem over discrete tokens by leveraging the expressive Transformer architecture, in order to model complex local dynamics across different agents and provide accurate and consistent long-term imaginations. As the first pioneering Transformer-based world model for multi-agent systems, we introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation within this context. Results on Starcraft Multi-Agent Challenge (SMAC) show that it outperforms strong model-free approaches and existing model-based methods in both sample efficiency and overall performance.

Submitted 22 June, 2024; originally announced June 2024.

arXiv:2406.15829 [pdf, other]

MVOC: a training-free multiple video object composition method with diffusion models

Authors: Wei Wang, Yaosen Chen, Yuegen Liu, Qi Yuan, Shubin Yang, Yanru Zhang

Abstract: Video composition is the core task of video editing. Although image composition based on diffusion models has been highly successful, it is not straightforward to extend the achievement to video object composition tasks, which must not only exhibit corresponding interaction effects but also ensure that the objects in the composited video maintain motion and identity consistency, which is necessary to composite a physically harmonious video. To address this challenge, we propose a Multiple Video Object Composition (MVOC) method based on diffusion models. Specifically, we first perform DDIM inversion on each video object to obtain the corresponding noise features. Secondly, we combine and edit each object by image editing methods to obtain the first frame of the composited video. Finally, we use the image-to-video generation model to composite the video with feature and attention injections in the Video Object Dependence Module, which is a training-free conditional guidance operation for video generation, and enables the coordination of features and attention maps between various objects that can be non-independent in the composited video. The final generative model not only constrains the objects in the generated video to be consistent with the original object motion and identity, but also introduces interaction effects between objects. Extensive experiments have demonstrated that the proposed method outperforms existing state-of-the-art approaches. Project page: https://sobeymil.github.io/mvoc.com.

Submitted 22 June, 2024; originally announced June 2024.

arXiv:2406.15768 [pdf, other]

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Authors: Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, Shanghang Zhang

Abstract: In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception tasks, such as detection and segmentation. However, MLLMs mainly focus on high-level image-text interpretations and struggle with fine-grained visual understanding, and vision perception models usually suffer from open-world distribution shifts due to their limited model capacity. To overcome these challenges, we propose the Mutually Reinforced Multimodal Large Language Model (MR-MLLM), a novel framework that synergistically enhances visual perception and multimodal comprehension. First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models, enhancing multimodal comprehension and vision perception synergistically. Second, we propose the perception-enhanced cross-modal integration method, incorporating novel modalities from vision perception outputs, like object detection bounding boxes, to capture subtle visual elements, thus enriching the understanding of both visual and textual data. In addition, an innovative perception-embedded prompt generation mechanism is proposed to embed perceptual information into the language model's prompts, aligning the responses contextually and perceptually for a more accurate multimodal interpretation. Extensive experiments demonstrate MR-MLLM's superior performance in various multimodal comprehension and vision perception tasks, particularly those requiring corner case vision perception and fine-grained language comprehension.

Submitted 22 June, 2024; originally announced June 2024.

Comments: 14 pages, 8 figures

arXiv:2406.15741 [pdf, other]

Ladder: A Model-Agnostic Framework Boosting LLM-based Machine Translation to the Next Level

Authors: Zhaopeng Feng, Ruizhe Chen, Yan Zhang, Zijie Meng, Zuozhu Liu

Abstract: General-purpose Large Language Models (LLMs) like GPT-4 have achieved remarkable advancements in machine translation (MT) by leveraging extensive web content. On the other hand, translation-specific LLMs are built by pre-training on domain-specific monolingual corpora and fine-tuning with human-annotated translation data. Despite the superior performance, these methods either demand an unprecedented scale of computing and data or substantial human editing and annotation efforts. In this paper, we develop Ladder, a novel model-agnostic and cost-effective tool to refine the performance of general LLMs for MT. Ladder is trained on pseudo-refinement triplets which can be easily obtained from existing LLMs without additional human cost. During training, we propose a hierarchical fine-tuning strategy with an easy-to-hard schema, improving Ladder's refining performance progressively. The trained Ladder can be seamlessly integrated with any general-purpose LLMs to boost their translation performance. By utilizing Gemma-2B/7B as the backbone, Ladder-2B can elevate raw translations to the level of top-tier open-source models (e.g., refining BigTranslate-13B with +6.91 BLEU and +3.52 COMET for XX-En), and Ladder-7B can further enhance model performance to be on par with the state-of-the-art GPT-4. Extensive ablation and analysis corroborate the effectiveness of Ladder in diverse settings. Our code is available at https://github.com/fzp0424/Ladder

Submitted 22 June, 2024; originally announced June 2024.

Comments: Our code is available at https://github.com/fzp0424/Ladder

arXiv:2406.15696 [pdf]

Functional photoacoustic noninvasive Doppler angiography in humans

Authors: Yang Zhang, Joshua Olick-Gibson, Karteekeya Sastry, Lihong V. Wang

Abstract: Optical imaging of blood flow yields critical functional insights into the circulatory system, but its clinical implementation has typically been limited to shallow depths (~1 millimeter) due to light scattering in biological tissue. Here, we present photoacoustic noninvasive Doppler angiography (PANDA) for deep blood flow imaging. PANDA synergizes the photoacoustic and Doppler effects to generate color Doppler velocity and power Doppler blood flow maps of the vascular lumen. Our results demonstrate PANDA's ability to measure blood flow in vivo up to one centimeter in depth, marking approximately an order of magnitude improvement over existing high-resolution pure optical modalities. PANDA enhances photoacoustic flow imaging by increasing depth and enabling cross-sectional blood vessel imaging. We also showcase PANDA's clinical feasibility through three-dimensional imaging of blood flow in healthy subjects and a patient with varicose veins. By integrating the imaging system onto a mobile platform, we have designed PANDA to be a portable modality that is primed for expedient clinical translation. PANDA offers noninvasive, single modality imaging of hemoglobin and blood flow with three-dimensional capability, facilitating comprehensive assessment of deep vascular dynamics in humans.

Submitted 21 June, 2024; originally announced June 2024.

Comments: 38 pages, 7 main figures, 10 supplementary figures

arXiv:2406.15557 [pdf, other]

Observation of a non-Hermitian supersonic mode

Authors: Yuxuan Zhang, Juan Carrasquilla, Yong Baek Kim

Abstract: Quantum computers have long been anticipated to excel in simulating quantum many-body physics. While most previous work has focused on Hermitian physics, we demonstrate the power of variational quantum circuits for resource-efficient simulations of dynamical and equilibrium physics in non-Hermitian systems, revealing new phenomena beyond standard Hermitian quantum machines. Using a variational quantum compilation scheme for fermionic systems, we reduce gate count, save qubits, and eliminate the need for postselection, a major challenge in simulating non-Hermitian dynamics via standard Trotterization. Experimentally, we observed a supersonic mode in the connected density-density correlation function on an $ n = 18 $ fermionic chain after a non-Hermitian, locally interacting quench, which would otherwise be forbidden by the Lieb-Robinson bound in a Hermitian system. Additionally, we investigate sequential quantum circuits generated by tensor networks for ground state preparation, here defined as the eigenstate with the lowest real part eigenvalue, using a variance minimization scheme. Through a trapped-ion implementation on the Quantinuum H1 quantum processor, we accurately capture correlation functions and energies across an exceptional point on a dissipative spin chain up to length $ n = 20 $ using only 3 qubits. Motivated by these advancements, we provide an analytical example demonstrating that simulating single-qubit non-Hermitian dynamics for $Θ(\log(n))$ time from certain initial states is exponentially hard on a quantum computer, offering insights into the opportunities and limitations of using quantum computation for simulating non-Hermitian physics.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.15501 [pdf]

Secure Combination of Untrusted Time information Based on Optimized Dempster-Shafer Theory

Authors: Yang Li, Yujie Luo, Yichen Zhang, Ao Sun, Wei Huang, Shuai Zhang, Tao Zhang, Chuang Zhou, Li Ma, Jie Yang, Mei Wu, Heng Wang, Yan Pan, Yun Shao, Xing Chen, Ziyang Chen, Song Yu, Hong Guo, Bingjie Xu

Abstract: Secure precision time synchronization is important for applications of Cyber-Physical Systems. However, several attacks, especially the Time Delay Attack (TDA), seriously deteriorate the performance of time synchronization systems. A multi-path scheme is regarded as an effective security countermeasure to decrease the influence of TDA. However, an effective secure combination algorithm is still missing for precision time synchronization. In this paper, a secure combination algorithm based on Dempster-Shafer theory is proposed for the multi-path method. Special optimizations are made to the combination algorithm to solve the potential problems caused by untrusted evidence. Theoretical simulation shows that the proposed algorithm works much better than the Fault Tolerant Algorithm (FTA) and the attack detection method based on a single path. Experimental demonstration proves the feasibility and superiority of the proposed algorithm, where time stability of 27.97 ps, 1.57 ps, and 1.12 ps at averaging times of 1 s, 10 s, and 100 s is achieved under TDA and local clock jump. The proposed algorithm can be used to improve the security and resilience of many important synchronization protocols, such as NTP, PTP, and TWFTT.

Submitted 19 June, 2024; originally announced June 2024.
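
For readers unfamiliar with the underlying theory, the sketch below shows the classical Dempster rule of combination for two bodies of evidence. It is not the optimized algorithm proposed in the paper, whose details the abstract does not give. The two-element frame ("trusted", "attacked"), the mass values, and the function name dempster_combine are illustrative assumptions.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Classical Dempster rule for two mass functions.

    Each mass function maps a frozenset (a subset of the frame of
    discernment) to its belief mass; the masses sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        intersection = b & c
        if intersection:
            # Agreeing evidence supports the intersection of the two sets.
            combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
        else:
            # Disjoint sets contribute to the conflict mass K.
            conflict += mass_b * mass_c
    if 1.0 - conflict < 1e-12:
        raise ValueError("evidence is in total conflict; the rule is undefined")
    # Renormalize by 1 - K so the combined masses sum to 1.
    return {subset: mass / (1.0 - conflict) for subset, mass in combined.items()}

# Hypothetical evidence from two synchronization paths (values are made up).
TRUSTED = frozenset({"trusted"})
ATTACKED = frozenset({"attacked"})
UNKNOWN = TRUSTED | ATTACKED  # mass assigned to "cannot tell"

path_1 = {TRUSTED: 0.6, ATTACKED: 0.1, UNKNOWN: 0.3}
path_2 = {TRUSTED: 0.7, ATTACKED: 0.1, UNKNOWN: 0.2}
fused = dempster_combine(path_1, path_2)
```

With these made-up inputs the conflict mass is 0.13, and the fused belief in "trusted" is 0.75 / 0.87, higher than either path reports on its own.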

arXiv:2406.15474 [pdf, other]

WundtGPT: Shaping Large Language Models To Be An Empathetic, Proactive Psychologist

Authors: Chenyu Ren, Yazhou Zhang, Daihai He, Jing Qin

Abstract: Large language models (LLMs) are sweeping through the medical domain, and their momentum has carried over into the mental health domain, leading to the emergence of a few mental health LLMs. Although such mental health LLMs could provide reasonable suggestions for psychological counseling, how to develop an authentic and effective doctor-patient relationship (DPR) through LLMs is still an important problem. To fill this gap, we dissect DPR into two key attributes, i.e., the psychologist's empathy and proactive guidance. We thus present WundtGPT, an empathetic and proactive mental health large language model that is obtained by fine-tuning with instructions and real conversations between psychologists and patients. It is designed to assist psychologists in diagnosis and help patients who are reluctant to communicate face-to-face understand their psychological conditions. Its uniqueness lies in that it could not only pose purposeful questions to guide patients in detailing their symptoms but also offer warm emotional reassurance. In particular, WundtGPT incorporates Collection of Questions, Chain of Psychodiagnosis, and Empathy Constraints into a comprehensive prompt for eliciting LLMs' questions and diagnoses. Additionally, WundtGPT proposes a reward model to promote alignment with empathetic mental health professionals, which encompasses two key factors: cognitive empathy and emotional empathy. We offer a comprehensive evaluation of our proposed model. Based on these outcomes, we further conduct a manual evaluation of proactivity, effectiveness, professionalism, and coherence. We find that WundtGPT can offer professional and effective consultation. The model is available at Hugging Face.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.15420 [pdf]

A comprehensive overview of diffuse correlation spectroscopy: theoretical framework, recent advances in hardware, analysis, and applications

Authors: Quan Wang, Mingliang Pan, Lucas Kreiss, Saeed Samaei, Stefan A. Carp, Johannes D. Johansson, Yuanzhe Zhang, Melissa Wu, Roarke Horstmeyer, Mamadou Diop, David Day-Uei Li

Abstract: Diffuse correlation spectroscopy (DCS) is a powerful tool for assessing microvascular hemodynamics in deep tissues. Recent advances in sensors, lasers, and deep learning have further boosted the development of new DCS methods. However, newcomers might feel overwhelmed, not only by the already complex DCS theoretical framework but also by the broad range of component options and system architectures. To facilitate new entry into this exciting field, we present a comprehensive review of DCS hardware architectures (continuous-wave, frequency-domain, and time-domain) and summarize corresponding theoretical models. Further, we discuss new applications of highly integrated silicon single-photon avalanche diode (SPAD) sensors in DCS, compare SPADs with existing sensors, and review other components (lasers, fibers, and correlators), as well as new data analysis tools, including deep learning. Potential applications in medical diagnosis are discussed, and an outlook on future directions is provided, to offer effective guidance for embarking on DCS research.

Submitted 18 May, 2024; originally announced June 2024.

arXiv:2406.15303 [pdf, other]

ADR: Attention Diversification Regularization for Mitigating Overfitting in Multiple Instance Learning based Whole Slide Image Classification

Authors: Yunlong Zhang, Zhongyi Shui, Yunxuan Sun, Honglin Li, Jingxiong Li, Chenglu Zhu, Sunyi Zheng, Lin Yang

Abstract: Multiple Instance Learning (MIL) has demonstrated effectiveness in analyzing whole slide images (WSIs), yet it often encounters overfitting challenges in real-world applications. This paper reveals the correlation between MIL's performance and the entropy of attention values. Based on this observation, we propose Attention Diversity Regularization (ADR), a simple but effective technique aimed at promoting high entropy in attention values. Specifically, ADR introduces a negative Shannon entropy loss for attention values into the regular MIL framework. Compared to existing methods aimed at alleviating overfitting, which often necessitate additional modules or processing steps, our ADR approach requires no such extras, demonstrating simplicity and efficiency. We evaluate our ADR on three WSI classification tasks. ADR achieves superior performance over the state-of-the-art on most of them. We also show that ADR can enhance heatmaps, aligning them better with pathologists' diagnostic criteria. The source code is available at https://github.com/dazhangyu123/ADR.

Submitted 17 June, 2024; originally announced June 2024.
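
As a rough illustration of the regularizer described in the ADR abstract above, the sketch below adds a negative Shannon entropy term to a task loss. It is a minimal sketch, not the authors' implementation (see their repository for that). The weight lam, the epsilon guard, and the function names are assumptions made here.

```python
import math

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of attention weights that sum to 1."""
    # eps guards against log(0) for instances with zero attention.
    return -sum(a * math.log(a + eps) for a in attn)

def adr_regularized_loss(task_loss, attn, lam=0.1):
    """Task loss plus lam times the NEGATIVE entropy of the attention.

    Minimizing this total rewards high-entropy (more spread out)
    attention. The weight lam is a hypothetical hyperparameter.
    """
    return task_loss - lam * attention_entropy(attn)

# Toy bags with four instances each (values are illustrative only).
uniform_attn = [0.25, 0.25, 0.25, 0.25]
peaked_attn = [0.97, 0.01, 0.01, 0.01]
```

For the same task loss, the uniform attention vector yields the lower total loss, which is the direction in which the regularizer pushes the model.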

arXiv:2406.15283 [pdf, other]

FT-AED: Benchmark Dataset for Early Freeway Traffic Anomalous Event Detection

Authors: Austin Coursey, Junyi Ji, Marcos Quinones-Grueiro, William Barbour, Yuhang Zhang, Tyler Derr, Gautam Biswas, Daniel B. Work

Abstract: Early and accurate detection of anomalous events on the freeway, such as accidents, can improve emergency response and clearance. However, existing delays and errors in event identification and reporting make it a difficult problem to solve. Current large-scale freeway traffic datasets are not designed for anomaly detection and ignore these challenges. In this paper, we introduce the first large-scale lane-level freeway traffic dataset for anomaly detection. Our dataset consists of a month of weekday radar detection sensor data collected in 4 lanes along an 18-mile stretch of Interstate 24 heading toward Nashville, TN, comprising over 3.7 million sensor measurements. We also collect official crash reports from the Nashville Traffic Management Center and manually label all other potential anomalies in the dataset. To show the potential for our dataset to be used in future machine learning and traffic research, we benchmark numerous deep learning anomaly detection models on our dataset. We find that unsupervised graph neural network autoencoders are a promising solution for this problem and that ignoring spatial relationships leads to decreased performance. We demonstrate that our methods can reduce reporting delays by over 10 minutes on average while detecting 75% of crashes. Our dataset and all preprocessing code needed to get started are publicly released at https://vu.edu/ft-aed/ to facilitate future research.

Submitted 24 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.15093 [pdf, other]

ECLIPSE: Expunging Clean-label Indiscriminate Poisons via Sparse Diffusion Purification

Authors: Xianlong Wang, Shengshan Hu, Yechao Zhang, Ziqi Zhou, Leo Yu Zhang, Peng Xu, Wei Wan, Hai Jin

Abstract: Clean-label indiscriminate poisoning attacks add invisible perturbations to correctly labeled training images, thus dramatically reducing the generalization capability of the victim models. Recently, some defense mechanisms have been proposed such as adversarial training, image transformation techniques, and image purification. However, these schemes are either susceptible to adaptive attacks, built on unrealistic assumptions, or only effective against specific poison types, limiting their universal applicability. In this research, we propose a more universally effective, practical, and robust defense scheme called ECLIPSE. We first investigate the impact of Gaussian noise on the poisons and theoretically prove that any kind of poison will be largely assimilated when imposing sufficient random noise. In light of this, we assume the victim has access to an extremely limited number of clean images (a more practical scene) and subsequently enlarge this sparse set for training a denoising probabilistic model (a universal denoising tool). We then begin by introducing Gaussian noise to absorb the poisons and then apply the model for denoising, resulting in a roughly purified dataset. Finally, to address the trade-off of the inconsistency in the assimilation sensitivity of different poisons by Gaussian noise, we propose a lightweight corruption compensation module to effectively eliminate residual poisons, providing a more universal defense approach. Extensive experiments demonstrate that our defense approach outperforms 10 state-of-the-art defenses. We also propose an adaptive attack against ECLIPSE and verify the robustness of our defense scheme. Our code is available at https://github.com/CGCL-codes/ECLIPSE.

Submitted 24 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted by ESORICS 2024

arXiv:2406.15068 [pdf, other]

Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET

Authors: Gianna Paulin, Paul Scheffler, Thomas Benz, Matheus Cavalcante, Tim Fischer, Manuel Eggimann, Yichao Zhang, Nils Wistoff, Luca Bertaccini, Luca Colagrande, Gianmarco Ottavi, Frank K. Gürkaynak, Davide Rossi, Luca Benini

Abstract: We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stencils (83 %), sparse-dense (42 %), and sparse-sparse (49 %) matrix multiply.

Submitted 21 June, 2024; originally announced June 2024.

Comments: 2 pages, 7 figures. Accepted at the 2024 IEEE Symposium on VLSI Technology & Circuits

arXiv:2406.15030 [pdf, ps, other]

Search for the $e^+e^- \to φχ_{c1}(3872)$ process at BESIII

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere , et al. (639 additional authors not shown)

Abstract: Based on 368.5 pb$^{-1}$ of $e^+e^-$ collision data collected at center-of-mass energies 4.914 and 4.946 GeV by the BESIII detector, the $e^+e^- \to φχ_{c1}(3872)$ process is searched for the first time. No significant signal is observed and the upper limits at the 90% confidence level on the product of the Born cross section $σ(e^+e^- \to φχ_{c1}(3872))$ and the branching fraction $\mathcal{B}[χ_{c1}(3872)\toπ^+π^- J/ψ]$ at 4.914 and 4.946 GeV are set to be 0.85 and 0.96 pb, respectively. These measurements provide useful information for the production of the $χ_{c1}(3872)$ at $e^+e^-$ collider and deepen our understanding about the nature of this particle.

Submitted 21 June, 2024; originally announced June 2024.

Comments: 11 pages, 3 figures

arXiv:2406.15028 [pdf, other]

The high-contrast performance of the Keck Planet Imager and Characterizer

Authors: Jason J. Wang, Dimitri Mawet, Jerry W. Xuan, Chih-Chun Hsu, Jean-Baptiste Ruffio, Katelyn Horstman, Yinzi Xin, Jacques-Robert Delorme, Nemanja Jovanovic, Yapeng Zhang, Luke Finnerty, Ashley Baker, Randall Bartos, Geoffrey A. Blake, Benjamin Calvin, Sylvain Cetre, Gregory W. Doppmann, Daniel Echeverri, Michael P. Fitzgerald, Joshua Liberman, Ronald Lopez, Evan Morris, Jacklyn Pezzato-Rovner, Ben Sappey, Tobias Schofield , et al. (3 additional authors not shown)

Abstract: The Keck Planet Imager and Characterizer (KPIC), a series of upgrades to the Keck II Adaptive Optics System and Instrument Suite, aims to demonstrate high-resolution spectroscopy of faint exoplanets that are spatially resolved from their host stars. In this paper, we measure KPIC's sensitivity to companions as a function of separation (i.e., the contrast curve) using on-sky data collected over four years of operation. We show that KPIC is able to reach contrasts of $1.3 \times 10^{-4}$ at 90 mas and $9.2 \times 10^{-6}$ at 420 mas separation from the star, and that KPIC can reach planet-level sensitivities at angular separations within the inner working angle of coronagraphic instruments such as GPI and SPHERE. KPIC is also able to achieve more extreme contrasts than other medium-/high-resolution spectrographs that are not as optimized for high-contrast performance. We decompose the KPIC performance budget into individual noise terms and discuss limiting factors. The fringing that results from combining a high-contrast imaging system with a high-resolution spectrograph is identified as an important source of systematic noise. After mitigation and correction, KPIC is able to reach within a factor of 2 of the photon noise limit at separations < 200 mas. At large separations, KPIC is limited by the background noise performance of NIRSPEC.

Submitted 21 June, 2024; originally announced June 2024.

Comments: 16 pages, 6 figures, submitted to the proceedings of SPIE Astronomical Telescopes + Instrumentation 2024, 13096-69

arXiv:2406.14977 [pdf, other]

Trustworthy Enhanced Multi-view Multi-modal Alzheimer's Disease Prediction with Brain-wide Imaging Transcriptomics Data

Authors: Shan Cong, Zhoujie Fan, Hongwei Liu, Yinghan Zhang, Xin Wang, Haoran Luo, Xiaohui Yao

Abstract: Brain transcriptomics provides insights into the molecular mechanisms by which the brain coordinates its functions and processes. However, existing multimodal methods for predicting Alzheimer's disease (AD) primarily rely on imaging and sometimes genetic data, often neglecting the transcriptomic basis of the brain. Furthermore, while striving to integrate complementary information between modalities, most studies overlook the informativeness disparities between modalities. Here, we propose TMM, a trusted multiview multimodal graph attention framework for AD diagnosis, using extensive brain-wide transcriptomics and imaging data. First, we construct view-specific brain regional co-function networks (RRIs) from transcriptomics and multimodal radiomics data to incorporate interaction information from both biomolecular and imaging perspectives. Next, we apply graph attention (GAT) processing to each RRI network to produce graph embeddings and employ cross-modal attention to fuse the transcriptomics-derived embedding with each imaging-derived embedding. Finally, a novel true-false-harmonized class probability (TFCP) strategy is designed to assess and adaptively adjust the prediction confidence of each modality for AD diagnosis. We evaluate TMM using the AHBA database with brain-wide transcriptomics data and the ADNI database with three imaging modalities (AV45-PET, FDG-PET, and VBM-MRI). The results demonstrate the superiority of our method in identifying AD, EMCI, and LMCI compared to state-of-the-art methods. Code and data are available at https://github.com/Yaolab-fantastic/TMM.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14966 [pdf, other]

AIGC-Chain: A Blockchain-Enabled Full Lifecycle Recording System for AIGC Product Copyright Management

Authors: Jiajia Jiang, Moting Su, Xiangli Xiao, Yushu Zhang, Yuming Fang

Abstract: As artificial intelligence technology becomes increasingly prevalent, Artificial Intelligence Generated Content (AIGC) is being adopted across various sectors. Although AIGC is playing an increasingly significant role in business and culture, questions surrounding its copyright have sparked widespread debate. The current legal framework for copyright and intellectual property is grounded in the concept of human authorship, but in the creation of AIGC, human creators primarily provide conceptual ideas, with AI independently responsible for the expressive elements. This disconnect creates complexity and difficulty in determining copyright ownership under existing laws. Consequently, it is imperative to reassess the intellectual contributions of all parties involved in the creation of AIGC to ensure a fair allocation of copyright ownership. To address this challenge, we introduce AIGC-Chain, a blockchain-enabled full lifecycle recording system designed to manage the copyright of AIGC products. It is engineered to meticulously document the entire lifecycle of AIGC products, providing a transparent and dependable platform for copyright management. Furthermore, we propose a copyright tracing method based on an Indistinguishable Bloom Filter, named IBFT, which enhances the efficiency of blockchain transaction queries and significantly reduces the risk of fraudulent copyright claims for AIGC products. In this way, auditors can analyze the copyright of AIGC products by reviewing all relevant information retrieved from the blockchain.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14900 [pdf, other]

Decoding Matters: Addressing Amplification Bias and Homogeneity Issue for LLM-based Recommendation

Authors: Keqin Bao, Jizhi Zhang, Yang Zhang, Xinyue Huo, Chong Chen, Fuli Feng

Abstract: Adapting Large Language Models (LLMs) for recommendation requires careful consideration of the decoding process, given the inherent differences between generating items and natural language. Existing approaches often directly apply LLMs' original decoding methods. However, we find these methods encounter significant challenges: 1) amplification bias -- where standard length normalization inflates scores for items containing tokens with generation probabilities close to 1 (termed ghost tokens), and 2) homogeneity issue -- generating multiple similar or repetitive items for a user. To tackle these challenges, we introduce a new decoding approach named Debiasing-Diversifying Decoding (D3). D3 disables length normalization for ghost tokens to alleviate amplification bias, and it incorporates a text-free assistant model to encourage tokens less frequently generated by LLMs for counteracting recommendation homogeneity. Extensive experiments on real-world datasets demonstrate the method's effectiveness in enhancing accuracy and diversity.

Submitted 21 June, 2024; originally announced June 2024.
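
The amplification bias described in the abstract above can be seen with a few lines of arithmetic. The sketch below only demonstrates the bias under a common length-normalization form (summed log-probability divided by length raised to alpha); it is not the D3 decoding procedure itself. The token probabilities, the alpha parameter, and the function names are illustrative assumptions.

```python
import math

def sequence_logprob(token_probs):
    """Sum of per-token log-probabilities for one generated item."""
    return sum(math.log(p) for p in token_probs)

def length_normalized_score(token_probs, alpha=1.0):
    """Length-normalized score: log-probability divided by length**alpha."""
    return sequence_logprob(token_probs) / (len(token_probs) ** alpha)

# Two hypothetical items sharing the same uncertain first token (p = 0.2).
# The second item is followed by two near-certain "ghost" tokens.
short_item = [0.2]
padded_item = [0.2, 0.999, 0.999]

raw_short = sequence_logprob(short_item)
raw_padded = sequence_logprob(padded_item)
norm_short = length_normalized_score(short_item)
norm_padded = length_normalized_score(padded_item)
```

The raw log-probabilities of the two items are nearly identical, with the padded item slightly lower, yet after normalization the padded item scores about three times better because the ghost tokens add length without adding uncertainty.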

arXiv:2406.14863 [pdf, other]

Older and Wiser: The Marriage of Device Aging and Intellectual Property Protection of Deep Neural Networks

Authors: Ning Lin, Shaocong Wang, Yue Zhang, Yangu He, Kwunhang Wong, Arindam Basu, Dashan Shang, Xiaoming Chen, Zhongrui Wang

Abstract: Deep neural networks (DNNs), such as the widely-used GPT-3 with billions of parameters, are often kept secret due to high training costs and privacy concerns surrounding the data used to train them. Previous approaches to securing DNNs typically require expensive circuit redesign, resulting in additional overheads such as increased area, energy consumption, and latency. To address these issues, we propose a novel hardware-software co-design approach for DNN intellectual property (IP) protection that capitalizes on the inherent aging characteristics of circuits and a novel differential orientation fine-tuning (DOFT) to ensure effective protection. Hardware-wise, we employ random aging to produce authorized chips. This process circumvents the need for chip redesign, thereby eliminating any additional hardware overhead during the inference procedure of DNNs. Moreover, the authorized chips demonstrate a considerable disparity in DNN inference performance when compared to unauthorized chips. Software-wise, we propose a novel DOFT, which allows pre-trained DNNs to maintain their original accuracy on authorized chips with minimal fine-tuning, while the model's performance on unauthorized chips is reduced to random guessing. Extensive experiments on various models, including MLP, VGG, ResNet, Mixer, and SwinTransformer, with lightweight binary and practical multi-bit weights demonstrate that the proposed method achieves effective IP protection, with only 10% accuracy on unauthorized chips, while preserving nearly the original accuracy on authorized ones.

Submitted 21 June, 2024; originally announced June 2024.

Comments: Design Automation Conference 2024

arXiv:2406.14604 [pdf, other]

Two-Loop Spacelike Splitting Amplitude for N=4 Super-Yang-Mills Theory

Authors: Johannes Henn, Rourou Ma, Yongqun Xu, Kai Yan, Yang Zhang, Hua Xing Zhu

Abstract: The study of collinear behavior for gauge theories in the spacelike region is of great phenomenological and theoretical importance. We analytically calculate the two-loop spacelike splitting amplitude for the full color N=4 Super-Yang-Mills theory. The result is derived by two complementary methods starting from the known amplitude: one is based on a discontinuity analysis, while the other one is based on analytic continuation. Our result explicitly shows terms that violate naive factorization. However, we show that factorization is restored at the level of color-summed unpolarized squared amplitudes at next-to-next-to-next-to-leading order. We conjecture that the two-loop tripole terms in the generalized splitting amplitudes in QCD are identical to what we obtain in N=4 super Yang-Mills theory.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 6 pages, 3 figures

Report number: USTC-ICTS/PCFT-24-18

arXiv:2406.14377 [pdf, other]

Computation-Efficient Semi-Supervised Learning for ECG-based Cardiovascular Diseases Detection

Authors: Rushuang Zhou, Zijun Liu, Lei Clifton, David A. Clifton, Kannie W. Y. Chan, Yuan-Ting Zhang, Yining Dong

Abstract: Label scarcity is the main challenge that hinders the wide application of deep learning systems in automatic cardiovascular disease (CVD) detection using electrocardiography (ECG). Tuning pre-trained models alleviates this problem by transferring knowledge learned from large datasets to small downstream datasets. However, bottlenecks in computational efficiency and CVD detection performance limit its clinical applications. It is difficult to improve the detection performance without significantly sacrificing model computational efficiency. Here, we propose a computation-efficient semi-supervised learning paradigm (FastECG) for robust and computation-efficient CVD detection using ECG. It enables a robust adaptation of pre-trained models on downstream datasets with limited supervision and high computational efficiency. First, a random-deactivation technique is developed to achieve robust and fast low-rank adaptation of pre-trained weights. Subsequently, we propose a one-shot rank allocation module to determine the optimal ranks for the update matrices of the pre-trained weights. Finally, a lightweight semi-supervised learning pipeline is introduced to enhance model performance by leveraging labeled and unlabeled data with high computational efficiency. Extensive experiments on four downstream ECG datasets demonstrate that FastECG not only outperforms the state-of-the-art methods in multi-label CVD detection but also requires a smaller GPU footprint, less training time, and less parameter storage space. As such, this paradigm provides an effective solution for achieving high computational efficiency and robust detection performance in the clinical applications of pre-trained models under limited supervision.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.14264 [pdf, other]

Zero-Shot Image Denoising for High-Resolution Electron Microscopy

Authors: Xuanyu Tian, Zhuoya Dong, Xiyue Lin, Yue Gao, Hongjiang Wei, Yanhang Ma, **gyi Yu, Yuyao Zhang

Abstract: High-resolution electron microscopy (HREM) imaging technique is a powerful tool for directly visualizing a broad range of materials in real-space. However, it faces challenges in denoising due to ultra-low signal-to-noise ratio (SNR) and scarce data availability. In this work, we propose Noise2SR, a zero-shot self-supervised learning (ZS-SSL) denoising framework for HREM. Within our framework, we propose a super-resolution (SR) based self-supervised training strategy, incorporating the Random Sub-sampler module. The Random Sub-sampler is designed to generate approximate infinite noisy pairs from a single noisy image, serving as an effective data augmentation in zero-shot denoising. Noise2SR trains the network with paired noisy images of different resolutions, which is conducted via SR strategy. The SR-based training facilitates the network adopting more pixels for supervision, and the random sub-sampling helps compel the network to learn continuous signals enhancing the robustness. Meanwhile, we mitigate the uncertainty caused by random-sampling by adopting minimum mean squared error (MMSE) estimation for the denoised results. With the distinctive integration of training strategy and proposed designs, Noise2SR can achieve superior denoising performance using a single noisy HREM image. We evaluate the performance of Noise2SR in both simulated and real HREM denoising tasks. It outperforms state-of-the-art ZS-SSL methods and achieves comparable denoising performance with supervised methods. The success of Noise2SR suggests its potential for improving the SNR of images in material imaging domains.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 12 pages, 12 figures

arXiv:2406.14176 [pdf, other]

A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection

Authors: Kyungbok Lee, You Zhang, Zhiyao Duan

Abstract: This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to interpret which cues from the video indicate it is fake. Motivated by these considerations, we then propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark by extending and re-splitting the existing FakeAVCeleb dataset. The benchmark contains four categories of fake video (Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and unsynchronized video). The experimental results show that our approach improves the model's detection of unseen attacks by an average of 7.31% across four test sets, compared to the baseline model. Additionally, our proposed framework offers interpretability, indicating which modality the model identifies as fake.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.14096 [pdf, other]

Graph Neural Networks for Job Shop Scheduling Problems: A Survey

Authors: Igor G. Smit, Jianan Zhou, Robbert Reijnen, Yaoxin Wu, Jian Chen, Cong Zhang, Zaharah Bukhsh, Wim Nuijten, Yingqian Zhang

Abstract: Job shop scheduling problems (JSSPs) represent a critical and challenging class of combinatorial optimization problems. Recent years have witnessed a rapid increase in the application of graph neural networks (GNNs) to solve JSSPs, albeit lacking a systematic survey of the relevant literature. This paper aims to thoroughly review prevailing GNN methods for different types of JSSPs and the closely related flow-shop scheduling problems (FSPs), especially those leveraging deep reinforcement learning (DRL). We begin by presenting the graph representations of various JSSPs, followed by an introduction to the most commonly used GNN architectures. We then review current GNN-based methods for each problem type, highlighting key technical elements such as graph representations, GNN architectures, GNN tasks, and training algorithms. Finally, we summarize and analyze the advantages and limitations of GNNs in solving JSSPs and provide potential future research opportunities. We hope this survey can motivate and inspire innovative approaches for more powerful GNN-based approaches in tackling JSSPs and other scheduling problems.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.14095 [pdf, other]

Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization

Authors: Qianli Shen, Yezhen Wang, Zhouhao Yang, Xiang Li, Haonan Wang, Yang Zhang, Jonathan Scarlett, Zhanxing Zhu, Kenji Kawaguchi

Abstract: Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization solutions has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce $\textbf{F}$orward $\textbf{G}$radient $\textbf{U}$nrolling with $\textbf{F}$orward $\textbf{G}$radient, abbreviated as $(\textbf{FG})^2\textbf{U}$, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. $(\text{FG})^2\text{U}$ circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, $(\text{FG})^2\text{U}$ is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, $(\text{FG})^2\text{U}$ and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, $(\text{FG})^2\text{U}$ is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for $(\text{FG})^2\text{U}$, complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks.

Submitted 20 June, 2024; originally announced June 2024.
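The abstract above relies on forward gradients as an unbiased stochastic approximation of a gradient. As a rough illustration of that general idea only (not the paper's $(\text{FG})^2\text{U}$ algorithm, and with no bi-level structure), the sketch below averages samples of $(\nabla f(x) \cdot v)\, v$ with $v \sim N(0, I)$ on a toy quadratic. The objective, the function names, and the use of an analytic gradient as a stand-in for a forward-mode Jacobian-vector product are all assumptions made for this example.

```python
import random


def grad_f(x):
    # Analytic gradient of the toy objective f(x) = sum(x_i ** 2).
    # Used here only to obtain the directional derivative; a real
    # forward-mode JVP would give that scalar without the full gradient.
    return [2.0 * xi for xi in x]


def forward_gradient_sample(x, rng):
    # Draw a random tangent direction v ~ N(0, I).
    v = [rng.gauss(0.0, 1.0) for _ in x]
    # Directional derivative <grad f(x), v>.
    directional = sum(g * vi for g, vi in zip(grad_f(x), v))
    # One sample of the estimator (grad f . v) v, whose expectation is grad f.
    return [directional * vi for vi in v]


def averaged_forward_gradient(x, num_samples, seed=0):
    # Average many independent samples to reduce the estimator's variance.
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(num_samples):
        sample = forward_gradient_sample(x, rng)
        total = [t + s for t, s in zip(total, sample)]
    return [t / num_samples for t in total]


if __name__ == "__main__":
    point = [1.0, -2.0, 0.5]
    print("estimate:", averaged_forward_gradient(point, 20000))
    print("exact:   ", grad_f(point))
```

Each sample needs only a scalar directional derivative, which is why estimators of this kind avoid storing a backward pass; the price is variance that grows with dimension, so many samples are averaged here.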

arXiv:2406.14054 [pdf, other]

Urban-Focused Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing

Authors: Xinbo Zhao, Yingxue Zhang, Xin Zhang, Yu Yang, Yiqun Xie, Yanhua Li, Jun Luo

Abstract: Enhancing diverse human decision-making processes in an urban environment is a critical issue across various applications, including ride-sharing vehicle dispatching, public transportation management, and autonomous driving. Offline reinforcement learning (RL) is a promising approach to learn and optimize human urban strategies (or policies) from pre-collected human-generated spatial-temporal urban data. However, standard offline RL faces two significant challenges: (1) data scarcity and data heterogeneity, and (2) distributional shift. In this paper, we introduce MODA -- a Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing approach. MODA addresses the challenges of data scarcity and heterogeneity in a multi-task urban setting through Contrastive Data Sharing among tasks. This technique involves extracting latent representations of human behaviors by contrasting positive and negative data pairs. It then shares data presenting similar representations with the target task, facilitating data augmentation for each task. Moreover, MODA develops a novel model-based multi-task offline RL algorithm. This algorithm constructs a robust Markov Decision Process (MDP) by integrating a dynamics model with a Generative Adversarial Network (GAN). Once the robust MDP is established, any online RL or planning algorithm can be applied. Extensive experiments conducted in a real-world multi-task urban setting validate the effectiveness of MODA. The results demonstrate that MODA exhibits significant improvements compared to state-of-the-art baselines, showcasing its capability in advancing urban decision-making processes. We also made our code available to the research community.

Submitted 20 June, 2024; originally announced June 2024.

Comments: KDD 2024

arXiv:2406.14039 [pdf]

CryptoGPT: a 7B model rivaling GPT-4 in the task of analyzing and classifying real-time financial news

Authors: Ying Zhang, Matthieu Petit Guillaume, Aurélien Krauth, Manel Labidi

Abstract: CryptoGPT: a 7B model competing with GPT-4 in a specific task -- The Impact of Automatic Annotation and Strategic Fine-Tuning via QLoRA. In this article, we present a method aimed at refining a dedicated LLM of reasonable quality with limited resources in an industrial setting via CryptoGPT. It is an LLM designed for financial news analysis for the cryptocurrency market in real-time. This project was launched in an industrial context. This model allows not only for the classification of financial information but also for providing comprehensive analysis. We refined different LLMs of the same size such as Mistral-7B and LLama-7B using semi-automatic annotation and compared them with various LLMs such as GPT-3.5 and GPT-4. Our goal is to find a balance among several needs: 1. Protecting data (by avoiding their transfer to external servers), 2. Limiting annotation cost and time, 3. Controlling the model's size (to manage deployment costs), and 4. Maintaining better analysis quality.

Submitted 20 June, 2024; originally announced June 2024.

Comments: Journée Nationale sur la Fouille de Textes, Pascal CUXAC; Adrien GUILLE; Cédric LOPEZ, Jun 2024, Lyon (Université Lumière Lyon 2), France

arXiv:2406.13979 [pdf, other]

Knowledge-driven Subspace Fusion and Gradient Coordination for Multi-modal Learning

Authors: Yupei Zhang, Xiaofei Wang, Fangliangzi Meng, ** Tang, Chao Li

Abstract: Multi-modal learning plays a crucial role in cancer diagnosis and prognosis. Current deep learning based multi-modal approaches are often limited by their abilities to model the complex correlations between genomics and histology data, addressing the intrinsic complexity of tumour ecosystem where both tumour and microenvironment contribute to malignancy. We propose a biologically interpretative and robust multi-modal learning framework to efficiently integrate histology images and genomics by decomposing the feature subspace of histology images and genomics, reflecting distinct tumour and microenvironment features. To enhance cross-modal interactions, we design a knowledge-driven subspace fusion scheme, consisting of a cross-modal deformable attention module and a gene-guided consistency strategy. Additionally, in pursuit of dynamically optimizing the subspace knowledge, we further propose a novel gradient coordination learning strategy. Extensive experiments demonstrate the effectiveness of the proposed method, outperforming state-of-the-art techniques in three downstream tasks of glioma diagnosis, tumour grading, and survival analysis. Our code is available at https://github.com/helenypzhang/Subspace-Multimodal-Learning.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.13974 [pdf, other]

A Combinatorial Decomposition of Knapsack Cones

Authors: Guoce Xin, Yingrui Zhang, Zihao Zhang

Abstract: In this paper, we focus on knapsack cones, a specific type of simplicial cones that arise naturally in the context of the knapsack problem $x_1 a_1 + \cdots + x_n a_n = a_0$. We present a novel combinatorial decomposition for these cones, named \texttt{DecDenu}, which aligns with Barvinok's unimodular cone decomposition within the broader framework of Algebraic Combinatorics. Computer experiments support us to conjecture that our \texttt{DecDenu} algorithm is polynomial when the number of variables $n$ is fixed. If true, \texttt{DecDenu} will provide the first alternative polynomial algorithm for Barvinok's unimodular cone decomposition, at least for denumerant cones. The \texttt{CTEuclid} algorithm is designed for MacMahon's partition analysis, and is notable for being the first algorithm to solve the counting problem for Magic squares of order 6. We have enhanced the \texttt{CTEuclid} algorithm by incorporating \texttt{DecDenu}, resulting in the \texttt{LLLCTEuclid} algorithm. This enhanced algorithm makes significant use of LLL's algorithm and stands out as an effective elimination-based approach.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 22 pages

MSC Class: Primary 52C07; Secondary 05-04; 05-08
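For readers unfamiliar with the counting problem behind the knapsack equation $x_1 a_1 + \cdots + x_n a_n = a_0$ in the entry above, the sketch below counts its nonnegative integer solutions (the denumerant) with a textbook dynamic program. This is explicitly not \texttt{DecDenu}, \texttt{CTEuclid}, or any cone decomposition; it is a baseline whose cost grows with $a_0$. The function name and the restriction to positive integer coefficients are assumptions of this example.

```python
def denumerant(coeffs, target):
    """Count nonnegative integer solutions x of sum(x_i * a_i) == target.

    Assumes every coefficient a_i is a positive integer and target >= 0.
    Runs in O(len(coeffs) * target) time, so it is pseudo-polynomial only.
    """
    if target < 0 or any(a <= 0 for a in coeffs):
        raise ValueError("expected positive coefficients and target >= 0")
    # ways[t] = number of solutions using the coefficients seen so far.
    ways = [0] * (target + 1)
    ways[0] = 1
    for a in coeffs:
        # Sweeping totals upward lets each coefficient be used any number
        # of times, while the outer loop fixes the order of coefficients
        # so that each solution vector is counted exactly once.
        for total in range(a, target + 1):
            ways[total] += ways[total - a]
    return ways[target]


if __name__ == "__main__":
    # Solutions of x1 + 2*x2 + 5*x3 = 5: (5,0,0), (3,1,0), (1,2,0), (0,0,1).
    print(denumerant([1, 2, 5], 5))
```

The table has one entry per value up to $a_0$, which is the kind of dependence on the right-hand side that decomposition-based methods aim to avoid.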

arXiv:2406.13940 [pdf, other]

AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought

Authors: Yongheng Zhang, Qiguang Chen, Min Li, Wanxiang Che, Libo Qin

Abstract: Cross-lingual chain-of-thought can effectively complete reasoning tasks across languages, which gains increasing attention. Recently, dominant approaches in the literature improve cross-lingual alignment capabilities by integrating reasoning knowledge from different languages. Despite achieving excellent performance, current methods still have two main challenges: (1) Manual language specification: They still highly rely on manually selecting the languages to integrate, severely affecting their generalizability; (2) Static weight allocation: Current methods simply integrate all languages equally. In fact, different language reasoning paths should have different weights to achieve better complementation and integration. Motivated by this, we introduce an Automatic Cross-lingual Alignment Planning (AutoCAP) for zero-shot chain-of-thought to address the above challenges. The core of AutoCAP consists of two components: (1) Automatic Language Selection Prompting to guide LLMs to select appropriate languages and (2) Automatic Weight Allocation Prompting to automatically allocate alignment weight scores to each reasoning path. Extensive experiments on several benchmarks reveal that AutoCAP achieves state-of-the-art performance, surpassing previous methods that required manual effort.

Submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted by ACL2024 Findings

arXiv:2406.13939 [pdf, other]

2nd Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation

Authors: Bin Cao, Yisi Zhang, Xuanxu Lin, Xingjian He, Bo Zhao, **g Liu

Abstract: Motion Expression guided Video Segmentation is a challenging task that aims at segmenting objects in the video based on natural language expressions with motion descriptions. Unlike the previous referring video object segmentation (RVOS), this task focuses more on the motion in video content for language-guided video object segmentation, requiring an enhanced ability to model longer temporal, motion-oriented vision-language data. In this report, based on the RVOS methods, we successfully introduce mask information obtained from the video instance segmentation model as preliminary information for temporal enhancement and employ SAM for spatial refinement. Finally, our method achieved a score of 49.92 J&F in the validation phase and 54.20 J&F in the test phase, securing the final ranking of 2nd in the MeViS Track at the CVPR 2024 PVUW Challenge.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13923 [pdf, other]

PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

Authors: Junjie Wang, Yin Zhang, Yatai Ji, Yuxiang Zhang, Chunyang Jiang, Yubo Wang, Kang Zhu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Bei Chen, Qunshu Lin, Minghao Liu, Ge Zhang, Wenhu Chen

Abstract: Recent advancements in Large Multimodal Models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. Addressing these issues, we introduce a novel dataset format, PIN (Paired and INterleaved multimodal documents), designed to significantly improve both the depth and breadth of multimodal training. The PIN format is built on three foundational principles: knowledge intensity, scalability, and support for diverse training modalities. This innovative format combines markdown files and comprehensive images to enrich training data with a dense knowledge structure and versatile training strategies. We present PIN-14M, an open-source dataset comprising 14 million samples derived from a diverse range of Chinese and English sources, tailored to include complex web and scientific content. This dataset is constructed meticulously to ensure data quality and ethical integrity, aiming to facilitate advanced training strategies and improve model robustness against common multimodal training pitfalls. Our initial results, forming the basis of this technical report, suggest significant potential for the PIN format in refining LMM performance, with plans for future expansions and detailed evaluations of its impact on model capabilities.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13890 [pdf, other]

ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World

Authors: Weixiang Yan, Haitian Liu, Tengxiao Wu, Qian Chen, Wen Wang, Haoyuan Chai, Jiayi Wang, Weishan Zhao, Yixin Zhang, Renjun Zhang, Li Zhu

Abstract: LLMs have achieved significant performance progress in various NLP applications. However, LLMs still struggle to meet the strict requirements for accuracy and reliability in the medical field and face many challenges in clinical applications. Existing clinical diagnostic evaluation benchmarks for evaluating medical agents powered by LLMs have severe limitations. Firstly, most existing medical evaluation benchmarks face the risk of data leakage or contamination. Secondly, existing benchmarks often neglect the characteristics of multiple departments and specializations in modern medical practice. Thirdly, existing evaluation methods are limited to multiple-choice questions, which do not align with the real-world diagnostic scenarios. Lastly, existing evaluation methods lack comprehensive evaluations of end-to-end real clinical scenarios. These limitations in benchmarks in turn obstruct advancements of LLMs and agents for medicine. To address these limitations, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 LLMs and find that their performance varies significantly across different departments. Based on these findings, in ClinicalLab, we propose ClinicalAgent, an end-to-end clinical agent that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of variants of ClinicalAgent on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices in designing medical agents.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13779 [pdf, other]

doi 10.1145/3637528.3672065

FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering

Authors: Tianchi Cai, Zhiwen Tan, Xierui Song, Tao Sun, Jiyan Jiang, Yunqi Xu, Yinger Zhang, **jie Gu

Abstract: Retrieval Augmented Generation (RAG) has become prevalent in question-answering (QA) tasks due to its ability of utilizing search engine to enhance the quality of long-form question-answering (LFQA). Despite the emergence of various open source methods and web-enhanced commercial systems such as Bing Chat, two critical problems remain unsolved, i.e., the lack of factuality and clear logic in the generated long-form answers. In this paper, we remedy these issues via a systematic study on answer generation in web-enhanced LFQA. Specifically, we first propose a novel outline-enhanced generator to achieve clear logic in the generation of multifaceted answers and construct two datasets accordingly. Then we propose a factuality optimization method based on a carefully designed doubly fine-grained RLHF framework, which contains automatic evaluation and reward modeling in different levels of granularity. Our generic framework comprises conventional fine-grained RLHF methods as special cases. Extensive experiments verify the superiority of our proposed \textit{Factuality-optimized RAG (FoRAG)} method on both English and Chinese benchmarks. In particular, when applying our method to Llama2-7B-chat, the derived model FoRAG-L-7B outperforms WebGPT-175B in terms of three commonly used metrics (i.e., coherence, helpfulness, and factuality), while the number of parameters is much smaller (only 1/24 of that of WebGPT-175B). Our datasets and models are made publicly available for better reproducibility: https://huggingface.co/forag.

Submitted 19 June, 2024; originally announced June 2024.

Report number: 30th

Journal ref: KDD 2024

arXiv:2406.13611 [pdf, other]

Solving k-SAT problems with generalized quantum measurement

Authors: Yipei Zhang, Philippe Lewalle, K. Birgitta Whaley

Abstract: We generalize the projection-based quantum measurement-driven $k$-SAT algorithm of Benjamin, Zhao, and Fitzsimons (BZF, arXiv:1711.02687) to arbitrary strength quantum measurements, including the limit of continuous monitoring. In doing so, we clarify that this algorithm is a particular case of the measurement-driven quantum control strategy elsewhere referred to as "Zeno dragging". We argue that the algorithm is most efficient with finite time and measurement resources in the continuum limit, where measurements have an infinitesimal strength and duration. Moreover, for solvable $k$-SAT problems, the dynamics generated by the algorithm converge deterministically towards target dynamics in the long-time (Zeno) limit, implying that the algorithm can successfully operate autonomously via Lindblad dissipation, without detection. We subsequently study both the conditional and unconditional dynamics of the algorithm implemented via generalized measurements, quantifying the advantages of detection for heralding errors. These strategies are investigated first in a computationally-trivial $2$-qubit $2$-SAT problem to build intuition, and then we consider the scaling of the algorithm on $3$-SAT problems encoded with $4 - 10$ qubits. The average number of shots needed to obtain a solution scales with qubit number as $\lambda^n$. For vanishing dragging time (with final readout only), we find $\lambda = 2$ (corresponding to a brute-force search over possible solutions). However, the deterministic (autonomous) property of the algorithm in the adiabatic (Zeno) limit implies that we can drive $\lambda$ arbitrarily close to $1$, at the cost of a growing pre-factor. We numerically investigate the tradeoffs in these scalings with respect to algorithmic runtime and assess their implications for using this analog measurement-driven approach to quantum computing in practice.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 23 + 8 pages, 15 figures

arXiv:2406.13538 [pdf, other]

Farey tree locking of terahertz semiconductor laser frequency combs

Authors: Guibin Liu, Xuhong Ma, Kang Zhou, Binbin Liu, Lulu Zheng, Xianglong Bi, Shumin Wu, Yanming Lu, Zi** Li, Wenjian Wan, Zhenzhen Zhang, Junsong Peng, Ya Zhang, He** Zeng, Hua Li

Abstract: Frequency combs show various applications in molecular fingerprinting, imaging, communications, and so on. In the terahertz frequency range, semiconductor-based quantum cascade lasers (QCLs) are ideal platforms for realizing the frequency comb operation. Although self-started frequency comb operation can be obtained in free-running terahertz QCLs due to the four-wave mixing locking effects, resonant/off-resonant microwave injection, phase locking, and femtosecond laser based locking techniques have been widely used to broaden and stabilize terahertz QCL combs. These active locking methods indeed show significant effects on the frequency stabilization of terahertz QCL combs, but they simultaneously have drawbacks, such as introducing large phase noise and requiring complex optical coupling and/or electrical circuits. Here, we demonstrate Farey tree locking of terahertz QCL frequency combs under microwave injection. The frequency competition between the Farey fraction frequency and the cavity round-trip frequency results in the frequency locking of terahertz QCL combs, and the Farey fraction frequencies can be accurately anticipated based on the downward trend of the Farey tree hierarchy. Furthermore, dual-comb experimental results show that the phase noise of the dual-comb spectral lines is significantly reduced by employing the Farey tree locking method. These results pave the way to deploying compact and low phase noise terahertz frequency comb sources.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 22 pages, 7 figures

arXiv:2406.13478 [pdf, other]

Semiparametric Localized Principal Stratification Analysis with Continuous Strata

Authors: Yichi Zhang, Shu Yang

Abstract: Principal stratification is essential for revealing causal mechanisms involving post-treatment intermediate variables. Principal stratification analysis with continuous intermediate variables is increasingly common but challenging due to the infinite principal strata and the nonidentifiability and nonregularity of principal causal effects. Inspired by recent research, we resolve these challenges by first using a flexible copula-based principal score model to identify principal causal effect under weak principal ignorability. We then target the local functional substitute of principal causal effect, which is statistically regular and can accurately approximate principal causal effect with vanishing bandwidth. We simplify the full efficient influence function of the local functional substitute by considering its oracle-scenario alternative. This leads to a computationally efficient and straightforward estimator for the local functional substitute and principal causal effect with vanishing bandwidth. We prove the double robustness and statistical optimality of our proposed estimator, and derive its asymptotic normality for inferential purposes. We illustrate the appealing statistical performance of our proposed estimator in simulations, and apply it to two real datasets with intriguing scientific discoveries.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13457 [pdf, other]

EvTexture: Event-driven Texture Enhancement for Video Super-Resolution

Authors: Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun

Abstract: Event-based vision has drawn increasing attention due to its unique characteristics, such as high temporal resolution and high dynamic range. It has been used in video super-resolution (VSR) recently to enhance the flow estimation and temporal alignment. Rather than for motion learning, we propose in this paper the first VSR method that utilizes event signals for texture enhancement. Our method, called EvTexture, leverages high-frequency details of events to better recover texture regions in VSR. In our EvTexture, a new texture enhancement branch is presented. We further introduce an iterative texture enhancement module to progressively explore the high-temporal-resolution event information for texture restoration. This allows for gradual refinement of texture regions across multiple iterations, leading to more accurate and rich high-resolution details. Experimental results show that our EvTexture achieves state-of-the-art performance on four datasets. For the Vid4 dataset with rich textures, our method can get up to 4.67dB gain compared with recent event-based methods. Code: https://github.com/DachunKai/EvTexture.

Submitted 19 June, 2024; originally announced June 2024.

Comments: ICML 2024. Project page: https://dachunkai.github.io/evtexture.github.io/

arXiv:2406.13413 [pdf, other]

Recurrent Inference Machine for Medical Image Registration

Authors: Yi Zhang, Yidong Zhao, Hui Xue, Peter Kellman, Stefan Klein, Qian Tao

Abstract: Image registration is essential for medical image applications where alignment of voxels across multiple images is needed for qualitative or quantitative analysis. With recent advancements in deep neural networks and parallel computing, deep learning-based medical image registration methods become competitive with their flexible modelling and fast inference capabilities. However, compared to traditional optimization-based registration methods, the speed advantage may come at the cost of registration performance at inference time. Besides, deep neural networks ideally demand large training datasets while optimization-based methods are training-free. To improve registration accuracy and data efficiency, we propose a novel image registration method, termed Recurrent Inference Image Registration (RIIR) network. RIIR is formulated as a meta-learning solver to the registration problem in an iterative manner. RIIR addresses the accuracy and data efficiency issues, by learning the update rule of optimization, with implicit regularization combined with explicit gradient input. We evaluated RIIR extensively on brain MRI and quantitative cardiac MRI datasets, in terms of both registration accuracy and training data efficiency. Our experiments showed that RIIR outperformed a range of deep learning-based methods, even with only $5\%$ of the training data, demonstrating high data efficiency. Key findings from our ablation studies highlighted the important added value of the hidden states introduced in the recurrent inference framework for meta-learning. Our proposed RIIR offers a highly data-efficient framework for deep learning-based medical image registration.

Submitted 19 June, 2024; originally announced June 2024.

Comments: Preprint

arXiv:2406.13409 [pdf, other]

doi 10.1145/3581783.3612007

PetalView: Fine-grained Location and Orientation Extraction of Street-view Images via Cross-view Local Search with Supplementary Materials

Authors: Wenmiao Hu, Yichen Zhang, Yuxuan Liang, Xian**g Han, Yifang Yin, Hannes Kruppa, See-Kiong Ng, Roger Zimmermann

Abstract: Satellite-based street-view information extraction by cross-view matching refers to a task that extracts the location and orientation information of a given street-view image query by using one or multiple geo-referenced satellite images. Recent work has initiated a new research direction to find accurate information within a local area covered by one satellite image centered at a location prior (e.g., from GPS). It can be used as a standalone solution or complementary step following a large-scale search with multiple satellite candidates. However, these existing works require an accurate initial orientation (angle) prior (e.g., from IMU) and/or do not efficiently search through all possible poses. To allow efficient search and to give accurate prediction regardless of the existence or the accuracy of the angle prior, we present PetalView extractors with multi-scale search. The PetalView extractors give semantically meaningful features that are equivalent across two drastically different views, and the multi-scale search strategy efficiently inspects the satellite image from coarse to fine granularity to provide sub-meter and sub-degree precision extraction. Moreover, when an angle prior is given, we propose a learnable prior angle mixer to utilize this information. Our method obtains the best performance on the VIGOR dataset and successfully improves the performance on KITTI dataset test 1 set with the recall within 1 meter (r@1m) for location estimation to 68.88% and recall within 1 degree (r@1d) 21.10% when no angle prior is available, and with angle prior achieves stable estimations at r@1m and r@1d above 70% and 21%, up to a 40-degree noise level.

Submitted 19 June, 2024; originally announced June 2024.

Comments: This paper has been accepted by ACM Multimedia 2023. This version contains additional supplementary materials

Journal ref: Proceedings of the 31st ACM International Conference on Multimedia (2023) 56-66

Showing 101–150 of 18,850 results for author: Zhang, Y