doi 10.1145/3637528.3672020

CAT: Interpretable Concept-based Taylor Additive Models

Authors: Viet Duong, Qiong Wu, Zhengyi Zhou, Hongjue Zhao, Chenxiang Luo, Eric Zavesky, Huaxiu Yao, Huajie Shao

Abstract: As an emerging interpretable technique, Generalized Additive Models (GAMs) adopt neural networks to individually learn non-linear functions for each feature, which are then combined through a linear model for final predictions. Although GAMs can explain deep neural networks (DNNs) at the feature level, they require large numbers of model parameters and are prone to overfitting, making them hard to train and scale. Additionally, in real-world datasets with many features, the interpretability of feature-based explanations diminishes for humans. To tackle these issues, recent research has shifted towards concept-based interpretable methods. These approaches try to integrate concept learning as an intermediate step before making predictions, explaining the predictions in terms of human-understandable concepts. However, these methods require domain experts to extensively label concepts with relevant names and their ground-truth values. In response, we propose CAT, a novel interpretable Concept-bAsed Taylor additive model to simplify this process. CAT does not require domain experts to annotate concepts and their ground-truth values. Instead, it only requires users to categorize input features into broad groups, which can be easily accomplished through a quick metadata review. Specifically, CAT first embeds each group of input features into a one-dimensional high-level concept representation, and then feeds the concept representations into a new white-box Taylor Neural Network (TaylorNet). The TaylorNet aims to learn the non-linear relationship between the inputs and outputs using polynomials. Evaluation results across multiple benchmarks demonstrate that CAT can outperform or compete with the baselines while reducing the need for extensive model parameters. Importantly, it can explain model predictions through high-level concepts that humans can understand.
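
The two-stage idea in this abstract (one scalar concept per feature group, then a polynomial over the concepts) can be sketched as follows. This is only an illustration under stated assumptions: the group mean stands in for CAT's learned concept encoders, the second-order truncation and all coefficients are hypothetical, and none of the names come from the paper.

```python
def concept_values(features, groups):
    """Collapse each feature group into one scalar concept.

    Assumption: the mean is a placeholder for a learned concept encoder.
    """
    return [sum(features[i] for i in idx) / len(idx) for idx in groups]


def taylor_order2(concepts, bias, linear, pairwise):
    """Second-order polynomial: bias + sum_i a_i*c_i + sum_(i,j) b_ij*c_i*c_j."""
    out = bias + sum(a * c for a, c in zip(linear, concepts))
    for (i, j), b in pairwise.items():
        out += b * concepts[i] * concepts[j]
    return out


features = [1.0, 3.0, 2.0, 4.0]
groups = [[0, 1], [2, 3]]  # hypothetical grouping taken from metadata
concepts = concept_values(features, groups)
# Hypothetical coefficients; in a trained model these would be learned.
prediction = taylor_order2(concepts, 0.5, [1.0, -1.0], {(0, 1): 0.25, (0, 0): 0.1})
```

Because every term is an explicit coefficient times a product of concepts, each term's contribution to the prediction can be read off directly, which is the sense in which such a polynomial is "white-box".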

Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.14869 [pdf, other]

Cost-Effective RF Fingerprinting Based on Hybrid CVNN-RF Classifier with Automated Multi-Dimensional Early-Exit Strategy

Authors: Jiayan Gan, Zhixing Du, Qiang Li, Huaizong Shao, Jingran Lin, Ye Pan, Zhongyi Wen, Shafei Wang

Abstract: While the Internet of Things (IoT) technology is booming and offers huge opportunities for information exchange, it also faces unprecedented security challenges. As an important complement to the physical layer security technologies for IoT, radio frequency fingerprinting (RFF) is of great interest due to its difficulty in counterfeiting. Recently, many machine learning (ML)-based RFF algorithms have emerged. In particular, deep learning (DL) has shown great benefits in automatically extracting complex and subtle features from raw data with high classification accuracy. However, DL algorithms face the computational cost problem as the difficulty of the RFF task and the size of the DNN have increased dramatically. To address the above challenge, this paper proposes a novel cost-effective early-exit neural network consisting of a complex-valued neural network (CVNN) backbone with multiple random forest branches, called hybrid CVNN-RF. Unlike conventional studies that use a single fixed DL model to process all RF samples, our hybrid CVNN-RF considers differences in the recognition difficulty of RF samples and introduces an early-exit mechanism to dynamically process the samples. When processing "easy" samples that can be well classified with high confidence, the hybrid CVNN-RF can end early at the random forest branch to reduce computational cost. Conversely, subsequent network layers will be activated to ensure accuracy. To further improve the early-exit rate, an automated multi-dimensional early-exit strategy is proposed to achieve scheduling control from multiple dimensions within the network depth and classification category. Finally, our experiments on the public ADS-B dataset show that the proposed algorithm can reduce the computational cost by 83% while improving the accuracy by 1.6% under a classification task with 100 categories.
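
The early-exit mechanism described above can be sketched as a loop over stages that stops at the first branch whose confidence clears a threshold. This is a minimal sketch, not the paper's method: the toy `shallow` and `deep` stages replace the CVNN backbone and random forest branches, and the per-depth thresholds are a simplification (the abstract also mentions control per classification category, which is not modeled here).

```python
def early_exit_predict(sample, stages, thresholds):
    """Run stages in order and stop at the first sufficiently confident branch.

    Each stage is a callable returning (new_state, class_probabilities).
    Returns (predicted_class, index_of_exit_branch).
    """
    state, probs, depth = sample, None, 0
    for depth, (stage, tau) in enumerate(zip(stages, thresholds)):
        state, probs = stage(state)
        if max(probs) >= tau:
            break  # confident enough: skip the remaining, costlier stages
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, depth


# Hypothetical stand-ins for a cheap branch and the full-depth classifier.
def shallow(x):
    p = 0.95 if abs(x) > 1.0 else 0.55
    return x, ([p, 1 - p] if x > 0 else [1 - p, p])


def deep(x):
    return x, ([0.99, 0.01] if x > 0 else [0.01, 0.99])


easy = early_exit_predict(2.0, [shallow, deep], [0.9, 0.0])
hard = early_exit_predict(-0.3, [shallow, deep], [0.9, 0.0])
```

Setting the last threshold to 0.0 guarantees that the final stage always produces an answer.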

Submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted by IEEE Internet of Things Journal

arXiv:2406.13867 [pdf, other]

Error-Correcting Graph Codes

Authors: Swastik Kopparty, Aditya Potukuchi, Harry Sha

Abstract: In this paper, we define, study, and construct Error-Correcting Graph Codes. An error-correcting graph code of distance $δ$ is a family $C$ of graphs, on a common vertex set of size $n$, such that if we start with any graph in $C$, we would have to modify the neighborhoods of at least $δn$ vertices in order to reach some other graph in $C$. This is a natural graph generalization of the standard Hamming distance error-correcting codes for binary strings. We show: 1. Combinatorial results determining the optimal rate vs distance tradeoff nonconstructively. 2. A connection to rank-metric codes, enabling some simple and some involved constructions achieving certain positive rates and distances. 3. Graph code analogues of Reed-Solomon codes and code concatenation, leading to positive distance codes for all rates and positive rate codes for all distances. 4. Graph code analogues of dual-BCH codes, yielding large codes with distance $δ = 1-o(1)$. This gives an explicit "graph code of Ramsey graphs". Several recent works, starting with the paper of Alon, Gujgiczer, Körner, Milojević, and Simonyi, have studied more general graph codes, where the symmetric difference between any two graphs in the code is required to have a desired property. Error-correcting graph codes are a particularly interesting instantiation of this concept.
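
To make the distance notion concrete, here is a small brute-force sketch. It relies on one reading of the definition above: every edge on which two graphs differ must touch a modified vertex, so the fewest vertices whose neighborhoods must change equals the minimum vertex cover of the symmetric difference. The function name and the example graphs are illustrative, and the search is exponential, so it only suits tiny graphs.

```python
from itertools import combinations


def graph_distance(edges_g, edges_h, n):
    """Fewest vertices whose neighborhoods must change to turn G into H.

    Computed as the minimum vertex cover of the symmetric difference of the
    two edge sets, by trying vertex subsets in order of increasing size.
    """
    diff = {frozenset(e) for e in edges_g} ^ {frozenset(e) for e in edges_h}
    for k in range(n + 1):
        for cover in combinations(range(n), k):
            chosen = set(cover)
            if all(edge & chosen for edge in diff):
                return k
    return n


path = [(0, 1), (1, 2), (2, 3)]
star = [(0, 1), (0, 2), (0, 3)]
d = graph_distance(path, star, 4)
relative_distance = d / 4  # the quantity compared against delta
```

Here the path and the star on four vertices differ in four edges, yet rewriting the neighborhoods of vertices 0 and 2 is enough to turn one into the other.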

Submitted 19 June, 2024; originally announced June 2024.

Comments: 27 pages, 3 figures, 1 table

ACM Class: G.2.1; E.4

arXiv:2405.18164 [pdf]

Imaging, counting, and positioning single interstitial atoms in solids

Authors: Jizhe Cui, Haozhi Sha, Liangze Mao, Kang Sun, Wenfeng Yang, Rong Yu

Abstract: Interstitial atoms are ubiquitous in solids and they are widely incorporated into materials to tune their lattice structure, electronic transport, and mechanical properties. Because the distribution of interstitial atoms in matrix materials is usually disordered and most of them are light atoms with weak scattering ability, it remains a challenge to directly image single interstitial atoms and measure their geometrical positions. In this work, direct imaging and measuring of single interstitial atoms have been realized with adaptive-propagator ptychography. The measurement of their three-dimensional coordinates enables quantitative analysis of the pair distribution function of the interstitial atoms and reveals the anisotropic occupation of oxygen in the interstitial sites in titanium. The current work paves the way for the determination of interstitial atoms in materials, and for the correlation between the atomic-scale behavior of interstitial atoms and the physical properties of materials.

Submitted 28 May, 2024; originally announced May 2024.

Comments: 20 pages and 8 figures; Jizhe Cui and Haozhi Sha contributed equally to this work. Rong Yu, corresponding author: [email protected]

arXiv:2405.17529 [pdf, other]

Clip Body and Tail Separately: High Probability Guarantees for DPSGD with Heavy Tails

Authors: Haichao Sha, Yang Cao, Yong Liu, Yuncheng Wu, Ruixuan Liu, Hong Chen

Abstract: Differentially Private Stochastic Gradient Descent (DPSGD) is widely utilized to preserve training data privacy in deep learning, which first clips the gradients to a predefined norm and then injects calibrated noise into the training procedure. Existing DPSGD works typically assume the gradients follow sub-Gaussian distributions and design various clipping mechanisms to optimize training performance. However, recent studies have shown that the gradients in deep learning exhibit a heavy-tail phenomenon, that is, the tails of the gradient have infinite variance, which may lead to excessive clipping loss to the gradients with existing DPSGD mechanisms. To address this problem, we propose a novel approach, Discriminative Clipping (DC)-DPSGD, with two key designs. First, we introduce a subspace identification technique to distinguish between body and tail gradients. Second, we present a discriminative clipping mechanism that applies different clipping thresholds for body and tail gradients to reduce the clipping loss. Under the non-convex condition, DC-DPSGD reduces the empirical gradient norm from $\mathbb{O}\left(\log^{\max(0,θ-1)}(T/δ)\log^{2θ}(\sqrt{T})\right)$ to $\mathbb{O}\left(\log(\sqrt{T})\right)$ with heavy-tailed index $θ\geq 1/2$, iterations $T$, and arbitrary probability $δ$. Extensive experiments on four real-world datasets demonstrate that our approach outperforms three baselines by up to 9.72% in terms of accuracy.
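
The clip-then-noise step, with separate thresholds for body and tail gradients, might look roughly like the sketch below. Several things are assumptions rather than the paper's design: the `is_tail` flags are taken as given (the subspace identification technique is not reproduced), the noise is scaled to the larger threshold as a conservative placeholder, and no privacy accounting is done, so this code makes no differential-privacy guarantee by itself.

```python
import math
import random


def clip(grad, threshold):
    """Scale grad so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_clipped_mean(per_sample_grads, is_tail, c_body, c_tail, sigma, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    total = [0.0] * len(per_sample_grads[0])
    for grad, tail in zip(per_sample_grads, is_tail):
        clipped = clip(grad, c_tail if tail else c_body)
        total = [t + c for t, c in zip(total, clipped)]
    noise_scale = sigma * max(c_body, c_tail)  # placeholder calibration
    n = len(per_sample_grads)
    return [(t + rng.gauss(0.0, noise_scale)) / n for t in total]


grads = [[3.0, 4.0], [0.3, 0.4]]
# sigma=0.0 switches the noise off so the clipping effect is easy to inspect.
update = noisy_clipped_mean(grads, [True, False], 1.0, 2.5, 0.0, random.Random(0))
```

With a single threshold of 1.0 the first gradient would be shrunk by a factor of five; the larger tail threshold shrinks it only by half, which is the kind of clipping loss reduction the abstract refers to.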

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.17233 [pdf, other]

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

Authors: Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Abstract: Parameter quantization for Large Language Models (LLMs) has attracted increasing attention recently for reducing memory costs and improving computational efficiency. Early approaches have been widely adopted. However, the existing methods suffer from poor performance in low-bit (such as 2 to 3 bits) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantization (CLAQ) framework by introducing three different types of adaptive strategies for LLM quantization. Firstly, a K-Means clustering based algorithm is proposed that allows dynamic generation of quantization centroids for each column of a parameter matrix. Secondly, we design an outlier-guided adaptive precision search strategy which can dynamically assign varying bit-widths to different columns. Finally, a dynamic outlier reservation scheme is developed to retain some parameters in their original floating-point precision in exchange for improved model performance. Experiments on various mainstream open source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve the state-of-the-art results across different bit settings, especially in extremely low-bit scenarios. Code is available at https://github.com/fayuge/CLAQ.
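
The first strategy, per-column K-Means centroids, can be illustrated with a plain one-dimensional Lloyd iteration. This sketch is not taken from the CLAQ repository: the evenly spaced initialization, the fixed iteration count, and the function names are all assumptions, and the adaptive bit-width search and outlier reservation are left out.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars, starting from evenly spaced centroids."""
    lo, hi = min(values), max(values)
    if k == 1 or lo == hi:
        return [sum(values) / len(values)]
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            buckets[nearest].append(v)
        # An empty bucket keeps its previous centroid.
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids


def quantize_column(column, bits):
    """Replace each weight by its nearest entry in a 2**bits codebook."""
    codebook = kmeans_1d(column, 2 ** bits)
    return [min(codebook, key=lambda c: abs(w - c)) for w in column]


column = [-1.0, -0.9, 0.0, 0.1, 1.0, 1.1, 2.0, 2.1]
quantized = quantize_column(column, bits=2)
max_error = max(abs(w - q) for w, q in zip(column, quantized))
```

Fitting a codebook per column lets the centroids follow each column's own value distribution instead of a uniform grid shared by the whole matrix.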

Submitted 2 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.16889 [pdf]

Extraction of In-Phase and Quadrature Components by Time-Encoding Sampling

Authors: Y. H. Shao, S. Y. Chen, H. Z. Yang, F. Xi, H. Hong, Z. Liu

Abstract: Time encoding machine (TEM) is a biologically-inspired scheme to perform signal sampling using timing. In this paper, we study its application to the sampling of bandpass signals. We propose an integrate-and-fire TEM scheme by which the in-phase (I) and quadrature (Q) components are extracted through reconstruction. We design the TEM according to the signal bandwidth and amplitude instead of upper-edge frequency and amplitude as in the case of bandlimited/lowpass signals. We show that the I and Q components can be perfectly reconstructed from the TEM measurements if the minimum firing rate is equal to the Landau's rate of the signal. For the reconstruction of I and Q components, we develop an alternating projection onto convex sets (POCS) algorithm in which two POCS algorithms are alternately iterated. For the algorithm analysis, we define a solution space of vector-valued signals and prove that the proposed reconstruction algorithm converges to the correct unique solution in the noiseless case. The proposed TEM can operate regardless of the center frequencies of the bandpass signals. This is quite different from traditional bandpass sampling, where the center frequency should be carefully allocated for Landau's rate and its variations have a negative effect on the sampling performance. In addition, the proposed TEM achieves certain reconstructed signal-to-noise-plus-distortion ratios for small firing rates in thermal noise, which is unavoidably present and will be aliased to the Nyquist band in the traditional sampling such that high sampling rates are required. We demonstrate the reconstruction performance and substantiate our claims via simulation experiments.
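
For readers unfamiliar with time encoding, a generic integrate-and-fire encoder can be sketched as below. It is a textbook-style illustration only: the bias, threshold, and time step are hypothetical values, the integral is a simple Riemann sum, and neither the paper's bandpass-specific design nor its POCS reconstruction is reproduced.

```python
import math


def iaf_encode(signal, duration, bias, threshold, dt=1e-4):
    """Integrate signal(t) + bias and record a firing time at each threshold crossing.

    The bias must exceed the signal's peak magnitude so the integrand stays positive.
    """
    spikes, acc = [], 0.0
    for step in range(int(round(duration / dt))):
        t = step * dt
        acc += (signal(t) + bias) * dt
        if acc >= threshold:
            spikes.append(t + dt)
            acc -= threshold  # reset the integrator, keeping the overshoot
    return spikes


def tone(t):
    """Hypothetical 5 Hz test tone with amplitude 0.5."""
    return 0.5 * math.sin(2 * math.pi * 5 * t)


spikes = iaf_encode(tone, duration=1.0, bias=1.0, threshold=0.05)
intervals = [b - a for a, b in zip(spikes, spikes[1:])]
```

The signal is carried entirely by the firing times: intervals shorten where the input is large and lengthen where it is small, and the average firing rate is set by the bias and threshold.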

Submitted 27 May, 2024; originally announced May 2024.

Comments: 30 pages, 8 figures

arXiv:2405.14292 [pdf, other]

A New Method in Facial Registration in Clinics Based on Structure Light Images

Authors: Pengfei Li, Ziyue Ma, Hong Wang, Juan Deng, Yan Wang, Zhenyu Xu, Feng Yan, Wenjun Tu, Hong Sha

Abstract: Background and Objective: In neurosurgery, fusing clinical images and depth images that can improve the information and details is beneficial to surgery. We found that the registration of face depth images frequently failed using existing methods. To enrich traditional imaging methods with depth information, a method for registering depth images with traditional clinical images was investigated. Methods: We used the dlib library, a C++ library that could be used in face recognition, and recognized the key points on faces from the structure light camera and CT image. The two key point clouds were registered for coarse registration by the ICP method. Fine registration was finished after coarse registration by the ICP method. Results: RMSE after coarse and fine registration is as low as 0.995913 mm. Compared with traditional methods, it also takes less time. Conclusions: The new method successfully registered the facial depth image from structure light images and CT with a low error, and that would be promising and efficient in clinical application of neurosurgery.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.06607 [pdf, other]

SO(5) multicriticality in two-dimensional quantum magnets

Authors: Jun Takahashi, Hui Shao, Bowen Zhao, Wenan Guo, Anders W. Sandvik

Abstract: We resolve the nature of the quantum phase transition between a Néel antiferromagnet and a valence-bond solid in two-dimensional spin-1/2 magnets. We study a class of $J$-$Q$ models, in which Heisenberg exchange $J$ competes with interactions $Q_n$ formed by products of $n$ singlet projectors on adjacent parallel lattice links. QMC simulations provide unambiguous evidence for first-order transitions, with the discontinuities increasing with $n$. For $n=2$ and $n=3$ models, the first-order signatures are very weak. On intermediate length scales, we extract well-defined scaling dimensions (critical exponents) that are common to the models with small $n$, indicating proximity to a quantum critical point. By combining two $Q$ terms, the transition can be tuned from weak to more strongly first-order. The two coexisting orders on the first-order line scale with a large exponent $β\approx 0.85$. This exponent and others are close to bounds for an SO($5$) symmetric CFT with a relevant SO($5$) singlet. We characterize the emergent SO($5$) symmetry by the scaling dimensions of its leading irrelevant perturbations. The large $β$ value and a large correlation length exponent, $ν\approx 1.4$, partially explain why the transition remains near-critical even quite far away from the critical point and in many different models without fine-tuning. In addition, we find that few-spin lattice operators are dominated by the SO($5$) violating field (the traceless symmetric tensor), and interactions involving many spins are required to observe strong effects of the relevant SO($5$) singlet. The exponent that had previously been identified with the divergent correlation length when crossing between the two phases does not have a corresponding CFT operator. We explain this emergent pseudocritical scale by a mechanism relying on a dangerously irrelevant SO($5$) perturbation.

Submitted 10 May, 2024; originally announced May 2024.

Comments: 57 pages, 36 figures

arXiv:2405.03882 [pdf, other]

Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer

Authors: Huihong Shi, Haikuo Shao, Wendong Mao, Zhongfeng Wang

Abstract: Motivated by the huge success of Transformers in the field of natural language processing (NLP), Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks. However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization. Unfortunately, due to the existence of hardware-unfriendly and quantization-sensitive non-linear operations, particularly Softmax, it is non-trivial to completely quantize all operations in ViTs, yielding either significant accuracy drops or non-negligible hardware costs. In response to challenges associated with standard ViTs, we focus our attention towards the quantization and acceleration for efficient ViTs, which not only eliminate the troublesome Softmax but also integrate linear attention with low computational complexity, and propose Trio-ViT accordingly. Specifically, at the algorithm level, we develop a tailored post-training quantization engine taking the unique activation distributions of Softmax-free efficient ViTs into full consideration, aiming to boost quantization accuracy. Furthermore, at the hardware level, we build an accelerator dedicated to the specific Convolution-Transformer hybrid architecture of efficient ViTs, thereby enhancing hardware efficiency. Extensive experimental results consistently prove the effectiveness of our Trio-ViT framework. Particularly, we can gain up to $7.2\times$ and $14.6\times$ FPS under comparable accuracy over state-of-the-art ViT accelerators, as well as $5.9\times$ and $2.0\times$ DSP efficiency. Codes will be released publicly upon acceptance.

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2404.13046 [pdf, other]

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Authors: Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu

Abstract: As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects MLLM's understanding of diverse image content. Although some large-scale pretrained vision encoders such as vision encoders in CLIP and DINOv2 have brought promising performance, we found that there is still no single vision encoder that can dominate various image content understanding, e.g., the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose the MoVA, a powerful and novel MLLM, adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts according to the user instruction, input image, and expertise of vision experts. This benefits from the powerful model function understanding ability of the large language model (LLM) equipped with expert-routing low-rank adaptation (LoRA). In the fine-grained stage, we elaborately conduct the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from various experts. This coarse-to-fine paradigm effectively leverages representations from experts based on multimodal context and model expertise, further enhancing the generalization ability. We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoVA can achieve significant performance gains over current state-of-the-art methods in a wide range of challenging multimodal benchmarks. Codes and models will be available at https://github.com/TempleX98/MoVA.

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.12867 [pdf, other]

FipTR: A Simple yet Effective Transformer Framework for Future Instance Prediction in Autonomous Driving

Authors: Xingtai Gui, Tengteng Huang, Haonan Shao, Haotian Yao, Chi Zhang

Abstract: The future instance prediction from a Bird's Eye View (BEV) perspective is a vital component in autonomous driving, which involves future instance segmentation and instance motion prediction. Existing methods usually rely on a redundant and complex pipeline which requires multiple auxiliary outputs and post-processing procedures. Moreover, estimated errors on each of the auxiliary predictions will lead to degradation of the prediction performance. In this paper, we propose a simple yet effective fully end-to-end framework named Future Instance Prediction Transformer (FipTR), which views the task as BEV instance segmentation and prediction for future frames. We propose to adopt instance queries representing specific traffic participants to directly estimate the corresponding future occupied masks, and thus get rid of complex post-processing procedures. Besides, we devise a flow-aware BEV predictor for future BEV feature prediction composed of a flow-aware deformable attention that takes backward flow guiding the offset sampling. A novel future instance matching strategy is also proposed to further improve the temporal coherence. Extensive experiments demonstrate the superiority of FipTR and its effectiveness under different temporal BEV encoders.

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.08145 [pdf]

Polar vortex hidden in twisted bilayers of paraelectric SrTiO3

Authors: Haozhi Sha, Yixuan Zhang, Yunpeng Ma, Wei Li, Wenfeng Yang, Jizhe Cui, Qian Li, Houbing Huang, Rong Yu

Abstract: Polar topologies, such as vortex and skyrmion, have attracted significant interest due to their unique physical properties and promising applications in high-density memory devices. Currently, most polar vortices are observed in heterostructures containing ferroelectric materials and constrained by substrates. In this study, we unravel arrays of polar vortices formed in twisted freestanding bilayers composed of SrTiO3, a quantum-paraelectric material. Depth-resolved structures of the bilayers are measured with deep-sub-angstrom resolution and one picometer accuracy using multislice ptychography, enabling identification of the three-dimensional variations of polarization topology. Our findings reveal the evolution of the polar vortices in the twisted overlapping layers, demonstrating a reversal of the rotation sense along the depth direction. Twisted freestanding bilayers provide a unique platform for exploration and modulation of novel polar topologies.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.02571 [pdf]

Wenzhou TE: a first-principles calculated thermoelectric materials database

Authors: Ying Fang, Hezhu Shao

Abstract: Since the implementation of the Materials Genome Project by the Obama administration in the United States, the development of various computational materials databases has fundamentally expanded the choices of industries such as materials and energy. In the field of thermoelectric materials, the thermoelectric figure of merit ZT quantifies the performance of the material. From the viewpoint of calculations for vast materials, the ZT values are not easily obtained due to their computational complexity. Here, we show how to build a database of thermoelectric materials based on first-principles calculations for the electronic and heat transport of materials. Firstly, the initial structures are classified according to the values of bandgap and other basic properties using the K-means clustering algorithm from machine learning, and high-throughput first-principles calculations are carried out for narrow-bandgap semiconductors which exhibit potential for thermoelectric application. The present framework of calculations mainly includes a deformation potential module, an electrical transport performance module, and a mechanical and thermodynamic properties module. We have also set up a search webpage for the calculated database of thermoelectric materials, which supports searching and viewing the related physical properties of materials. Our work may inspire the construction of more computational databases of first-principles thermoelectric materials and accelerate research progress in the field of thermoelectrics.
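
The figure of merit mentioned in this abstract is conventionally defined as ZT = S²σT/κ, where S is the Seebeck coefficient, σ the electrical conductivity, κ the total thermal conductivity, and T the absolute temperature. The short sketch below evaluates it; the function name and the input values are illustrative only and are not taken from the database.

```python
def figure_of_merit(seebeck_v_per_k, conductivity_s_per_m, thermal_cond_w_per_m_k, temperature_k):
    """Dimensionless thermoelectric figure of merit ZT = S^2 * sigma * T / kappa."""
    power_factor = seebeck_v_per_k ** 2 * conductivity_s_per_m  # W / (m K^2)
    return power_factor * temperature_k / thermal_cond_w_per_m_k


# Illustrative inputs: S = 200 uV/K, sigma = 1e5 S/m, kappa = 1.5 W/(m K), T = 300 K.
zt = figure_of_merit(200e-6, 1.0e5, 1.5, 300.0)
```

Each of the three transport quantities requires its own calculation, which is why the abstract describes ZT as costly to obtain across many materials.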

Submitted 3 April, 2024; originally announced April 2024.

Comments: 13 pages, 5 figures

Journal ref: https://www.mdpi.com/1996-1944/17/10/2200

arXiv:2404.01448 [pdf]

Prior Frequency Guided Diffusion Model for Limited Angle (LA)-CBCT Reconstruction

Authors: Jiacheng Xie, Hua-Chieh Shao, Yunxiang Li, You Zhang

Abstract: Cone-beam computed tomography (CBCT) is widely used in image-guided radiotherapy. Reconstructing CBCTs from limited-angle acquisitions (LA-CBCT) is highly desired for improved imaging efficiency, dose reduction, and better mechanical clearance. LA-CBCT reconstruction, however, suffers from severe under-sampling artifacts, making it a highly ill-posed inverse problem. Diffusion models can generate data/images by reversing a data-noising process through learned data distributions, and can be incorporated as a denoiser/regularizer in LA-CBCT reconstruction. In this study, we developed a diffusion model-based framework, prior frequency-guided diffusion model (PFGDM), for robust and structure-preserving LA-CBCT reconstruction. PFGDM uses a conditioned diffusion model as a regularizer for LA-CBCT reconstruction, and the condition is based on high-frequency information extracted from patient-specific prior CT scans which provides a strong anatomical prior for LA-CBCT reconstruction. Specifically, we developed two variants of PFGDM (PFGDM-A and PFGDM-B) with different conditioning schemes. PFGDM-A applies the high-frequency CT information condition until a pre-optimized iteration step, and drops it afterwards to enable both similar and differing CT/CBCT anatomies to be reconstructed. PFGDM-B, on the other hand, continuously applies the prior CT information condition in every reconstruction step, but with a decaying mechanism, to gradually phase out the reconstruction guidance from the prior CT scans. The two variants of PFGDM were tested and compared with currently available LA-CBCT reconstruction solutions, via metrics including PSNR and SSIM. PFGDM outperformed all traditional and diffusion model-based methods. PFGDM reconstructs high-quality LA-CBCTs under very-limited gantry angles, allowing faster and more flexible CBCT scans with dose reductions.

Submitted 8 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: 20 pages, 8 figures, submitted to Physics in Medicine & Biology

arXiv:2403.20230 [pdf, other]

An FPGA-Based Reconfigurable Accelerator for Convolution-Transformer Hybrid EfficientViT

Authors: Haikuo Shao, Huihong Shi, Wendong Mao, Zhongfeng Wang

Abstract: Vision Transformers (ViTs) have achieved significant success in computer vision. However, their intensive computations and massive memory footprint challenge ViTs' deployment on embedded devices, calling for efficient ViTs. Among them, EfficientViT, the state-of-the-art one, features a Convolution-Transformer hybrid architecture, enhancing both accuracy and hardware efficiency. Unfortunately, existing accelerators cannot fully exploit the hardware benefits of EfficientViT due to its unique architecture. In this paper, we propose an FPGA-based accelerator for EfficientViT to advance the hardware efficiency frontier of ViTs. Specifically, we design a reconfigurable architecture to efficiently support various operation types, including lightweight convolutions and attention, boosting hardware utilization. Additionally, we present a time-multiplexed and pipelined dataflow to facilitate both intra- and inter-layer fusions, reducing off-chip data access costs. Experimental results show that our accelerator achieves up to 780.2 GOPS in throughput and 105.1 GOPS/W in energy efficiency at 200MHz on the Xilinx ZCU102 FPGA, which significantly outperforms prior works.

Submitted 29 March, 2024; originally announced March 2024.

Comments: To appear in the 2024 IEEE International Symposium on Circuits and Systems (ISCAS 2024)

arXiv:2403.16999 [pdf, other]

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Authors: Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, Hongsheng Li

Abstract: Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks. However, they often lack interpretability and struggle with complex visual inputs, especially when the resolution of the input image is high or when the interested region that could provide key information for answering the question is small. To address these challenges, we collect and introduce the large-scale Visual CoT dataset comprising 438k question-answer pairs, annotated with intermediate bounding boxes highlighting key regions essential for answering the questions. Additionally, about 98k pairs of them are annotated with detailed reasoning steps. Importantly, we propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts. We also introduce the related benchmark to evaluate the MLLMs in scenarios requiring specific local region identification. Extensive experiments demonstrate the effectiveness of our framework and shed light on better inference strategies. The Visual CoT dataset, benchmark, and pre-trained models are released to foster further research in this direction.

Submitted 7 July, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

Comments: Code: https://github.com/deepcs233/Visual-CoT

arXiv:2403.15464 [pdf, other]

LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction

Authors: Hejie Cui, Zhuocheng Shen, Jieyu Zhang, Hui Shao, Lianhui Qin, Joyce C. Ho, Carl Yang

Abstract: Electronic health records (EHRs) contain valuable patient data for health-related prediction tasks, such as disease prediction. Traditional approaches rely on supervised learning methods that require large labeled datasets, which can be expensive and challenging to obtain. In this study, we investigate the feasibility of applying Large Language Models (LLMs) to convert structured patient visit data (e.g., diagnoses, labs, prescriptions) into natural language narratives. We evaluate the zero-shot and few-shot performance of LLMs using various EHR-prediction-oriented prompting strategies. Furthermore, we propose a novel approach that utilizes LLM agents with different roles: a predictor agent that makes predictions and generates reasoning processes and a critic agent that analyzes incorrect predictions and provides guidance for improving the reasoning of the predictor agent. Our results demonstrate that with the proposed approach, LLMs can achieve decent few-shot performance compared to traditional supervised learning methods in EHR-based disease predictions, suggesting its potential for health-oriented applications.

Submitted 19 March, 2024; originally announced March 2024.

ACM Class: J.3; I.2.7

arXiv:2403.14693 [pdf]

A2CI: A Cloud-based, Service-oriented Geospatial Cyberinfrastructure to Support Atmospheric Research

Authors: Wenwen Li, Hu Shao, Sizhe Wang, Xiran Zhou, Sheng Wu

Abstract: Big earth science data offers the scientific community great opportunities. Many more studies at large scales, over long terms and at high resolution can now be conducted using the rich information collected by remote sensing satellites, ground-based sensor networks, and even social media input. However, the hundreds of terabytes of information collected and compiled on an hourly basis by NASA and other government agencies present a significant challenge for atmospheric scientists seeking to improve the understanding of the Earth atmospheric system. These challenges include effective discovery, organization, analysis and visualization of large amounts of data. This paper reports the outcomes of an NSF-funded project that developed a geospatial cyberinfrastructure -- the A2CI (Atmospheric Analysis Cyberinfrastructure) -- to support atmospheric research. We first introduce the service-oriented system framework, then describe in detail the implementation of the data discovery module, data management module, data integration module, and data analysis and visualization modules following the cloud computing principles: Data-as-a-Service, Software-as-a-Service, Platform-as-a-Service and Infrastructure-as-a-Service. We demonstrate the graphic user interface by performing an analysis between Sea Surface Temperature and the intensity of tropical storms in the North Atlantic and Pacific oceans.
We expect this work to contribute to the technical advancement of cyberinfrastructure research as well as to the development of an online, collaborative scientific analysis system for atmospheric science.

Submitted 15 March, 2024; originally announced March 2024.

MSC Class: big data; cyberinfrastructure; cloud computing

arXiv:2403.11492 [pdf, other]

SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction

Authors: Yang Zhou, Hao Shao, Letian Wang, Steven L. Waslander, Hongsheng Li, Yu Liu

Abstract: Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. Context information, such as road maps and surrounding agents' states, provides crucial geometric and semantic information for motion behavior prediction. To this end, recent works explore two-stage prediction frameworks where coarse trajectories are first proposed, and then used to select critical context information for trajectory refinement. However, they either incur a large amount of computation or bring limited improvement, if not both. In this paper, we introduce a novel scenario-adaptive refinement strategy, named SmartRefine, to refine prediction with minimal additional computation. Specifically, SmartRefine can comprehensively adapt refinement configurations based on each scenario's properties, and smartly chooses the number of refinement iterations by introducing a quality score to measure the prediction quality and remaining refinement potential of each scenario. SmartRefine is designed as a generic and flexible approach that can be seamlessly integrated into most state-of-the-art motion prediction models. Experiments on Argoverse (1 & 2) show that our method consistently improves the prediction accuracy of multiple state-of-the-art prediction models. Specifically, by adding SmartRefine to QCNet, we outperform all published ensemble-free works on the Argoverse 2 leaderboard (single agent track) at submission.
Comprehensive studies are also conducted to ablate design choices and explore the mechanism behind multi-iteration refinement. Codes are available at https://github.com/opendilab/SmartRefine/

Submitted 19 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: Camera-ready version for CVPR 2024

arXiv:2403.10779 [pdf, other]

LLM-based Conversational AI Therapist for Daily Functioning Screening and Psychotherapeutic Intervention via Everyday Smart Devices

Authors: Jingping Nie, Hanya Shao, Yuang Fan, Qijia Shao, Haoxuan You, Matthias Preindl, Xiaofan Jiang

Abstract: Despite the global mental health crisis, access to screenings, professionals, and treatments remains high. In collaboration with licensed psychotherapists, we propose a Conversational AI Therapist with psychotherapeutic Interventions (CaiTI), a platform that leverages large language models (LLMs) and smart devices to enable better mental health self-care. CaiTI can screen the day-to-day functioning using natural and psychotherapeutic conversations. CaiTI leverages reinforcement learning to provide personalized conversation flow. CaiTI can accurately understand and interpret user responses. When the user needs further attention during the conversation, CaiTI can provide conversational psychotherapeutic interventions, including cognitive behavioral therapy (CBT) and motivational interviewing (MI). Leveraging the datasets prepared by the licensed psychotherapists, we experiment and microbenchmark various LLMs' performance in tasks along CaiTI's conversation flow and discuss their strengths and weaknesses. With the psychotherapists, we implement CaiTI and conduct 14-day and 24-week studies. The study results, validated by therapists, demonstrate that CaiTI can converse with users naturally, accurately understand and interpret user responses, and provide psychotherapeutic interventions appropriately and effectively. We showcase the potential of CaiTI LLMs to assist the mental therapy diagnosis and treatment and improve day-to-day functioning screening and precautionary psychotherapeutic intervention systems.

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.10319 [pdf, other]

NetBench: A Large-Scale and Comprehensive Network Traffic Benchmark Dataset for Foundation Models

Authors: Chen Qian, Xiaochang Li, Qineng Wang, Gang Zhou, Huajie Shao

Abstract: In computer networking, network traffic refers to the amount of data transmitted in the form of packets between internetworked computers or Cyber-Physical Systems. Monitoring and analyzing network traffic is crucial for ensuring the performance, security, and reliability of a network. However, a significant challenge in network traffic analysis is to process diverse data packets including both ciphertext and plaintext. While many methods have been adopted to analyze network traffic, they often rely on different datasets for performance evaluation. This inconsistency results in substantial manual data processing efforts and unfair comparisons. Moreover, some data processing methods may cause data leakage due to improper separation of training and testing data. To address these issues, we introduce the NetBench, a large-scale and comprehensive benchmark dataset for assessing machine learning models, especially foundation models, in both network traffic classification and generation tasks. NetBench is built upon seven publicly available datasets and encompasses a broad spectrum of 20 tasks, including 15 classification tasks and 5 generation tasks. Furthermore, we evaluate eight State-Of-The-Art (SOTA) classification models (including two foundation models) and two generative models using our benchmark. The results show that foundation models significantly outperform the traditional deep learning methods in traffic classification.
We believe NetBench will facilitate fair comparisons among various approaches and advance the development of foundation models for network traffic. Our benchmark is available at https://github.com/WM-JayLab/NetBench.

Submitted 18 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.09615 [pdf, other]

PrompTHis: Visualizing the Process and Influence of Prompt Editing during Text-to-Image Creation

Authors: Yuhan Guo, Hanning Shao, Can Liu, Kai Xu, Xiaoru Yuan

Abstract: Generative text-to-image models, which allow users to create appealing images through a text prompt, have seen a dramatic increase in popularity in recent years. However, most users have a limited understanding of how such models work and it often requires many trials and errors to achieve satisfactory results. The prompt history contains a wealth of information that could provide users with insights into what has been explored and how the prompt changes impact the output image, yet little research attention has been paid to the visual analysis of such process to support users. We propose the Image Variant Graph, a novel visual representation designed to support comparing prompt-image pairs and exploring the editing history. The Image Variant Graph models prompt differences as edges between corresponding images and presents the distances between images through projection. Based on the graph, we developed the PrompTHis system through co-design with artists. Besides Image Variant Graph, PrompTHis also incorporates a detailed prompt-image history and a navigation mini-map. Based on the review and analysis of the prompting history, users can better understand the impact of prompt changes and have a more effective control of image generation. A quantitative user study with eleven amateur participants and qualitative interviews with five professionals and one amateur user were conducted to evaluate the effectiveness of PrompTHis.
The results demonstrate PrompTHis can help users review the prompt history, make sense of the model, and plan their creative process.

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.07390 [pdf, other]

Learning Correction Errors via Frequency-Self Attention for Blind Image Super-Resolution

Authors: Haochen Sun, Yan Yuan, Lijuan Su, Haotian Shao

Abstract: Previous approaches for blind image super-resolution (SR) have relied on degradation estimation to restore high-resolution (HR) images from their low-resolution (LR) counterparts. However, accurate degradation estimation poses significant challenges. The SR model's incompatibility with degradation estimation methods, particularly the Correction Filter, may significantly impair performance as a result of correction errors. In this paper, we introduce a novel blind SR approach that focuses on Learning Correction Errors (LCE). Our method employs a lightweight Corrector to obtain a corrected low-resolution (CLR) image. Subsequently, within an SR network, we jointly optimize SR performance by utilizing both the original LR image and the frequency learning of the CLR image. Additionally, we propose a new Frequency-Self Attention block (FSAB) that enhances the global information utilization ability of Transformer. This block integrates both self-attention and frequency spatial attention mechanisms. Extensive ablation and comparison experiments conducted across various settings demonstrate the superiority of our method in terms of visual quality and accuracy. Our approach effectively addresses the challenges associated with degradation estimation and correction errors, paving the way for more accurate blind image SR.

Submitted 12 March, 2024; originally announced March 2024.

Comments: 16 pages

arXiv:2402.19303 [pdf, ps, other]

Learnability Gaps of Strategic Classification

Authors: Lee Cohen, Yishay Mansour, Shay Moran, Han Shao

Abstract: In contrast with standard classification tasks, strategic classification involves agents strategically modifying their features in an effort to receive favorable predictions. For instance, given a classifier determining loan approval based on credit scores, applicants may open or close their credit cards to fool the classifier. The learning goal is to find a classifier robust against strategic manipulations. Various settings, based on what and when information is known, have been explored in strategic classification. In this work, we focus on addressing a fundamental question: the learnability gaps between strategic classification and standard learning. We essentially show that any learnable class is also strategically learnable: we first consider a fully informative setting, where the manipulation structure (which is modeled by a manipulation graph $G^\star$) is known and during training time the learner has access to both the pre-manipulation data and post-manipulation data. We provide nearly tight sample complexity and regret bounds, offering significant improvements over prior results. Then, we relax the fully informative setting by introducing two natural types of uncertainty. First, following Ahmadi et al. (2023), we consider the setting in which the learner only has access to the post-manipulation data. We improve the results of Ahmadi et al. (2023) and close the gap between mistake upper bound and lower bound raised by them. Our second relaxation of the fully informative setting introduces uncertainty to the manipulation structure.
That is, we assume that the manipulation graph is unknown but belongs to a known class of graphs. We provide nearly tight bounds on the learning complexity in various unknown manipulation graph settings. Notably, our algorithm in this setting is of independent interest and can be applied to other problems such as multi-label learning.

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.19221 [pdf, other]

doi 10.1007/JHEP07(2024)050

FKS subtraction for quarkonium production at NLO

Authors: Ajjath A H, Hua-Sheng Shao, Lukas Simon

Abstract: We extend the local infrared-divergence subtraction formalism, originally proposed by Frixione, Kunszt and Signer (FKS), to calculate short-distance (differential) cross section for any inclusive process involving a quarkonium particle in non-relativistic QCD (NRQCD) factorisation at next-to-leading order (NLO) accuracy in the strong coupling constant $\alpha_s$. The new formulas are generally applicable to the production of an S- or P-wave quarkonium state in association with any number of elementary particles. The main new ingredients derived in this paper are the local and integrated soft counterterms for the colour-singlet and colour-octet P-wave bound states. It, therefore, paves the way to the automation of the NLO calculations for heavy quarkonium inclusive and associated production processes.

Submitted 6 July, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Comments: 53 pages, 2 figures, v2 (journal version)

Journal ref: JHEP 07 (2024) 050

arXiv:2402.15991 [pdf, other]

$C^3$: Confidence Calibration Model Cascade for Inference-Efficient Cross-Lingual Natural Language Understanding

Authors: Taixi Lu, Haoyu Wang, Huajie Shao, Jing Gao, Huaxiu Yao

Abstract: Cross-lingual natural language understanding (NLU) is a critical task in natural language processing (NLP). Recent advancements have seen multilingual pre-trained language models (mPLMs) significantly enhance the performance of these tasks. However, mPLMs necessitate substantial resources and incur high computational costs during inference, posing challenges for deployment in real-world and real-time systems. Existing model cascade methods seek to enhance inference efficiency by greedily selecting the lightest model capable of processing the current input from a variety of models, based on model confidence scores. Nonetheless, deep models tend to exhibit overconfidence, and confidence distributions vary across languages. This leads to the emission of confident but incorrect predictions by smaller models, hindering their ability to generalize effectively across test languages. In this study, we introduce a confidence calibration model cascade ($C^3$) method. This approach, simple yet effective, involves calibration prior to cascade inference, thereby enhancing cascade accuracy through more reliable predictions. Extensive experiments conducted on three cross-lingual benchmarks demonstrate that $C^3$ significantly outperforms all state-of-the-art baselines.

Submitted 25 February, 2024; originally announced February 2024.

arXiv:2402.15758 [pdf, other]

Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens

Authors: Ziqian Zeng, Jiahong Yu, Qianshi Pang, Zihao Wang, Huiping Zhuang, Hongen Shao, Xiaofeng Zou

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their widespread application is hindered by the resource-intensive decoding process. To address this challenge, current approaches have incorporated additional decoding heads to enable parallel prediction of multiple subsequent tokens, thereby achieving inference acceleration. Nevertheless, the accuracy of these decoding heads falls short of the auto-regressive decoding approach. In light of these limitations, we propose Chimera, a novel framework specifically designed for speculative sampling. Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words. To ensure both accuracy and efficiency, we present two strategies within the lightweight draft model. Firstly, we focus on capturing short-range dependencies at the bottom layer. Secondly, we leverage the readily available representations from the original LLM. Through empirical evaluation on the Vicuna and LLaMA-2 series, Chimera demonstrates impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach. This highlights the potential of our proposed framework in significantly improving the efficiency of large language models during the decoding process.

Submitted 18 April, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

arXiv:2402.14605 [pdf, other]

Observation of the antiferromagnetic phase transition in the fermionic Hubbard model

Authors: Hou-Ji Shao, Yu-Xuan Wang, De-Zhi Zhu, Yan-Song Zhu, Hao-Nan Sun, Si-Yuan Chen, Chi Zhang, Zhi-Jie Fan, Youjin Deng, Xing-Can Yao, Yu-Ao Chen, Jian-Wei Pan

Abstract: The fermionic Hubbard model (FHM)[1], despite its simple form, captures essential features of strongly correlated electron physics. Ultracold fermions in optical lattices[2, 3] provide a clean and well-controlled platform for simulating FHM. Doping its antiferromagnetic ground state at half filling, various exotic phases are expected to arise in the FHM simulator, including stripe order[4], pseudogap[5], and d-wave superconductors[6], offering valuable insights into high-temperature superconductivity[7-9]. Although notable progress, such as the observation of antiferromagnetic correlations over short[10] and extended distances[11], has been obtained, the antiferromagnetic phase has yet to be realized due to the significant challenges of achieving low temperatures in a large and uniform quantum simulator. Here, we report the observation of the antiferromagnetic phase transition in a three-dimensional fermionic Hubbard system comprising lithium-6 atoms in a uniform optical lattice with approximately 800,000 sites. When the interaction strength, temperature, and doping concentration are finely tuned to approach their respective critical values, sharp increases in the spin structure factor (SSF) are observed. These observations can be well described by a power-law divergence, with a critical exponent of 1.396 from the Heisenberg universality class[12]. At half filling and with optimal interaction strength, the measured SSF reaches 123(8), signifying the establishment of an antiferromagnetic phase.
Our results set the stage for exploring the low-temperature phase diagram of FHM.

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.12634 [pdf, other]

doi 10.1145/3613904.3643022

Data Storytelling in Data Visualisation: Does it Enhance the Efficiency and Effectiveness of Information Retrieval and Insights Comprehension?

Authors: Honbo Shao, Roberto Martinez-Maldonado, Vanessa Echeverria, Lixiang Yan, Dragan Gasevic

Abstract: Data storytelling (DS) is rapidly gaining attention as an approach that integrates data, visuals, and narratives to create data stories that can help a particular audience to comprehend the key messages underscored by the data with enhanced efficiency and effectiveness. It has been posited that DS can be especially advantageous for audiences with limited visualisation literacy, by presenting the data clearly and concisely. However, empirical studies confirming whether data stories indeed provide these benefits over conventional data visualisations are scarce. To bridge this gap, we conducted a study with 103 participants to determine whether DS indeed improves both efficiency and effectiveness in tasks related to information retrieval and insights comprehension. Our findings suggest that data stories do improve the efficiency of comprehension tasks, as well as the effectiveness of comprehension tasks that involve a single insight compared with conventional visualisations. Interestingly, these benefits were not associated with participants' visualisation literacy.

Submitted 20 May, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

Comments: Accepted to CHI24 Edited two typos. One in the abstract, another in a formulae

arXiv:2402.09013 [pdf, other]

doi 10.1117/1.JATIS.10.1.015002

Asgard/NOTT: L-band nulling interferometry at the VLTI. II. Warm optical design and injection system

Authors: Germain Garreau, Azzurra Bigioli, Romain Laugier, Gert Raskin, Johan Morren, Jean-Philippe Berger, Colin Dandumont, Harry-Dean Kenchington Goldsmith, Simon Gross, Michael Ireland, Lucas Labadie, Jérôme Loicq, Stephen Madden, Guillermo Martin, Marc-Antoine Martinod, Alexandra Mazzoli, Ahmed Sanny, Hancheng Shao, Kunlun Yan, Denis Defrère

Abstract: Asgard/NOTT (previously Hi-5) is a European Research Council (ERC)-funded project hosted at KU Leuven and a new visitor instrument for the Very Large Telescope Interferometer (VLTI). Its primary goal is to image the snow line region around young stars using nulling interferometry in the L-band (3.5 to 4.0 $\mu$m), where the contrast between exoplanets and their host stars is advantageous. The breakthrough is the use of a photonic beam combiner, which only recently allowed the required theoretical raw contrast of $10^{-3}$ in this spectral range. Nulling interferometry observations of exoplanets also require a high degree of balancing between the four pupils of the VLTI in terms of intensity, phase, and polarization. The injection into the beam combiner and the requirements of nulling interferometry are driving the design of the warm optics and the injection system. The optical design up to the beam combiner is presented. It offers a technical solution to efficiently couple the light from the VLTI into the beam combiner. During the coupling, the objective is to limit throughput losses to 5% of the best expected efficiency for the injection. To achieve this, a list of different loss sources is considered with their respective impact on the injection efficiency. Solutions are also proposed to meet the requirements on beam balancing for intensity, phase, and polarization. The different properties of the design are listed, including the optics used, their alignment and tolerances, and their impact on the instrumental performances in terms of throughput and null depth. The performance evaluation gives an expected throughput loss of less than 6.4% of the best efficiency for the injection and a null depth of $\sim 2\times10^{-3}$, mainly from optical path delay errors outside the scope of this work.

Submitted 14 February, 2024; originally announced February 2024.

Comments: Accepted for publication in JATIS. 23 pages, 11 figures, 8 tables

Journal ref: J. Astron. Telesc. Instrum. Syst. 10(1), 015002 (2024)

arXiv:2402.05935 [pdf, other]

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Authors: Dongyang Liu, Renrui Zhang, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng **, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao, Peng Gao

Abstract: We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

Submitted 26 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

Comments: Accepted by ICML 2024. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

arXiv:2402.03646 [pdf, other]

Lens: A Foundation Model for Network Traffic in Cybersecurity

Authors: Qineng Wang, Chen Qian, Xiaochang Li, Ziyu Yao, Huajie Shao

Abstract: Network traffic refers to the amount of data being sent and received over the internet or any system that connects computers. Analyzing and understanding network traffic is vital for improving network security and management. However, the analysis of network traffic is challenging due to the diverse nature of data packets, which often feature heterogeneous headers and encrypted payloads lacking semantics. To capture the latent semantics of traffic, a few studies have adopted pre-training techniques based on the Transformer encoder or decoder to learn the representations from massive traffic data. However, these methods typically excel in traffic understanding (classification) or traffic generation tasks. To address this issue, we develop Lens, a foundation model for network traffic that leverages the T5 architecture to learn the pre-trained representations from large-scale unlabeled data. Harnessing the strength of the encoder-decoder framework, which captures the global information while preserving the generative ability, our model can better learn the representations from raw data. To further enhance pre-training effectiveness, we design a novel loss that combines three distinct tasks: Masked Span Prediction (MSP), Packet Order Prediction (POP), and Homologous Traffic Prediction (HTP). Evaluation results across various benchmark datasets demonstrate that the proposed Lens outperforms the baselines in most downstream tasks related to both traffic understanding and generation. Notably, it also requires much less labeled data for fine-tuning compared to current methods.

Submitted 28 March, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

arXiv:2402.02851 [pdf, other]

Enhancing Compositional Generalization via Compositional Feature Alignment

Authors: Haoxiang Wang, Haozhe Si, Huajie Shao, Han Zhao

Abstract: Real-world applications of machine learning models often confront data distribution shifts, wherein discrepancies exist between the training and test data distributions. In the common multi-domain multi-class setup, as the number of classes and domains scales up, it becomes infeasible to gather training data for every domain-class combination. This challenge naturally leads to the quest for models with Compositional Generalization (CG) ability, where models can generalize to unseen domain-class combinations. To delve into the CG challenge, we develop CG-Bench, a suite of CG benchmarks derived from existing real-world image datasets, and observe that the prevalent pretraining-finetuning paradigm on foundational models, such as CLIP and DINOv2, struggles with the challenge. To address this challenge, we propose Compositional Feature Alignment (CFA), a simple two-stage finetuning technique that i) learns two orthogonal linear heads on a pretrained encoder with respect to class and domain labels, and ii) fine-tunes the encoder with the newly learned head frozen. We theoretically and empirically justify that CFA encourages compositional feature learning of pretrained models. We further conduct extensive experiments on CG-Bench for CLIP and DINOv2, two powerful pretrained vision foundation models. Experiment results show that CFA outperforms common finetuning techniques in compositional generalization, corroborating CFA's efficacy in compositional feature learning.

Submitted 22 May, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: Published in Transactions on Machine Learning Research (TMLR). The code is released at https://github.com/Haoxiang-Wang/Compositional-Feature-Alignment

arXiv:2401.02439 [pdf]

Information limit of 15 pm achieved with bright-field ptychography

Authors: Haozhi Sha, Jizhe Cui, Wenfeng Yang, Rong Yu

Abstract: It is generally assumed that a high spatial resolution of a microscope requires a large numerical aperture of the imaging lens or detector. In this study, the information limit of 15 pm is achieved in transmission electron microscopy using only the bright-field disk (small numerical aperture) via multislice ptychography. The results indicate that high-frequency information has been encoded in the electrons scattered to low angles due to the multiple scattering of electrons in the objects, making it possible to break the diffraction limit of imaging via bright-field ptychography.

Submitted 20 December, 2023; originally announced January 2024.

Comments: 10 pages, 4 figures

arXiv:2401.01638 [pdf, other]

Radon Removal Commissioning of the PandaX-4T Cryogenic Distillation System

Authors: Xiangyi Cui, Zhou Wang, Jiafu Li, Shuaijie Li, Lin Si, Yonglin Ju, Wenbo Ma, Jianglai Liu, Li Zhao, Xiangdong Ji, Rui Yan, Haidong Sha, Peiyao Huang, Xiuli Wang, Huaxuan Liu

Abstract: The PandaX-4T distillation system, designed for the removal of krypton and radon from xenon, is evaluated for its radon removal efficiency using a $^{222}$Rn source during the online distillation process. The PandaX-4T dark matter detector is employed to monitor the temporal evolution of radon activity. To determine the radon reduction factor, the experimental data of radon atoms introduced into and bypassed the distillation system is compared. The results indicate that the PandaX-4T distillation system achieves a radon reduction factor exceeding 190 at the flow rate of 10 slpm and the reflux ratio of 1.44. Gas-only online distillation process of a flow rate of 20 slpm is also conducted without observing significant reduction of radon levels in the detector. This observation suggests that the migration flow of radon atoms from the liquid phase to the gas phase is limited, and the flow rate of gas circulation and duration of the process are insignificant compared to the total xenon mass of 5.6 tons in the detector. This study provides the experimental data to support the efficient removal of radon at $\sim$Bq level using the PandaX-4T distillation system, which is the prerequisite of the radon background control in the detector. The further operation with higher flow rate will be applied for the upcoming science run in PandaX-4T.

Submitted 19 April, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

Comments: 14 pages, 9 figures

arXiv:2401.01495 [pdf, other]

A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning

Authors: Wei Ai, FuChen Zhang, Tao Meng, YunTao Shou, HongEn Shao, Keqin Li

Abstract: In terms of human-computer interaction, it is becoming more and more important to correctly understand the user's emotional state in a conversation, so the task of multimodal emotion recognition (MER) started to receive more attention. However, existing emotion classification methods usually perform classification only once. Sentences are likely to be misclassified in a single round of classification. Previous work usually ignores the similarities and differences between different morphological features in the fusion process. To address the above issues, we propose a two-stage emotion recognition model based on graph contrastive learning (TS-GCL). First, we encode the original dataset with different preprocessing modalities. Second, a graph contrastive learning (GCL) strategy is introduced for these three modal data with other structures to learn similarities and differences within and between modalities. Finally, we use MLP twice to achieve the final emotion classification. This staged classification method can help the model to better focus on different levels of emotional information, thereby improving the performance of the model. Extensive experiments show that TS-GCL has superior performance on IEMOCAP and MELD datasets compared with previous methods.

Submitted 2 January, 2024; originally announced January 2024.

Comments: 9 pages, 3 figures

arXiv:2401.00376 [pdf, other]

Magnon, doublon and quarton excitations in 2D S=1/2 trimerized Heisenberg models

Authors: Yue-Yue Chang, Jun-Qing Cheng, Hui Shao, Dao-Xin Yao, Han-Qing Wu

Abstract: We investigate the magnetic excitations of the trimerized Heisenberg models with intra-trimer interaction $J_1$ and inter-trimer interaction $J_2$ on four different two-dimensional lattices using a combination of stochastic series expansion quantum Monte Carlo (SSE QMC) and stochastic analytic continuation methods (SAC), complemented by cluster perturbation theory (CPT). These models exhibit quasi-particle-like excitations when $g=J_2/J_1$ is small, characterized by low-energy magnons, intermediate-energy doublons, and high-energy quartons. The low-energy magnons are associated with the magnetic ground states. They can be described by the linear spin wave theory (LSWT) of the effective block spin model and the original spin model. Doublons and quartons emerge from the corresponding internal excitations of the trimers with distinct energy levels, which can be effectively analyzed using perturbation theory when the ratio of exchange interactions $g$ is small. In this small $g$ regime, we observe a clear separation between the magnon and higher-energy spectra. However, as $g$ increases, these three spectra gradually merge into the magnon modes or continua. Nevertheless, the LSWT fails to provide quantitative descriptions of the higher-energy excitation bands due to significant quantum fluctuations. Notably, in the Collinear II and trimerized hexagon lattice, a broad continuum emerges above the single-magnon spectrum, originating from the quasi-1D physics due to the dilute connections between chains. Our numerical analysis of these 2D trimers yields valuable theoretical predictions and explanations for the inelastic neutron scattering (INS) spectra of 2D magnetic materials featuring trimerized lattices.

Submitted 16 June, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

arXiv:2312.16966 [pdf, other]

doi 10.1007/JHEP03(2024)121

Two-loop massive QCD and QED helicity amplitudes for light-by-light scattering

Authors: Ajjath A H, Ekta Chaubey, Hua-Sheng Shao

Abstract: We present the analytic and compact two-loop helicity amplitudes for QCD and QED corrections to the light-by-light scattering process with massive internal fermions. We express the master integrals either in terms of multiple polylogarithms or in terms of iterated integrals with dlog one-forms. We also elaborate on optimizing the analytic results for each phase-space region. This makes the numerical evaluation of the scattering amplitudes fast, stable and suitable for phenomenological applications.

Submitted 21 March, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: 35 pages, 1 figure, v2: journal version (two long equations move into appendix)

Journal ref: JHEP 03 (2024) 121

arXiv:2312.16956 [pdf, other]

doi 10.1016/j.physletb.2024.138555

Light-by-Light Scattering at Next-to-Leading Order in QCD and QED

Authors: Ajjath A H, Ekta Chaubey, Mathijs Fraaije, Valentin Hirschi, Hua-Sheng Shao

Abstract: The recent experimental observation of Light-by-Light (LbL) scattering at the Large Hadron Collider has revived interest in this fundamental process, and especially of the accurate prediction of its cross-section, which we present here for the first time at Next-to-Leading Order (NLO) in both QCD and QED. We compare two radically different computational approaches, both exact in the fermion mass dependence, thus offering a strong cross-check of our results. The first approach is a fully analytic method to calculate compact and well-organized two-loop helicity amplitudes. The second one is entirely numerical and leverages the Local Unitarity construction. Our two calculations agree with each other and conclude that including the exact fermion mass contribution typically increases the size of the NLO corrections. Moreover, we find that the exact result converges slowly to the massless limit of the high-energy regime, thus emphasizing the importance of including the full mass dependence at NLO. We also compare our results with the ATLAS measurement of LbL in ultra-peripheral lead-lead collisions, and find that the inclusion of exact NLO corrections reduces, but does not eliminate, the existing tension with theoretical predictions.

Submitted 10 March, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: 11 pages, 6 figures (including appendix) v2: minor corrections and journal version

Journal ref: Phys.Lett.B 851 (2024) 138555

arXiv:2312.08866 [pdf, other]

MCANet: Medical Image Segmentation with Multi-Scale Cross-Axis Attention

Authors: Hao Shao, Quansheng Zeng, Qibin Hou, Jufeng Yang

Abstract: Efficiently capturing multi-scale information and building long-range dependencies among pixels are essential for medical image segmentation because of the various sizes and shapes of the lesion regions or organs. In this paper, we present Multi-scale Cross-axis Attention (MCA) to solve the above challenging issues based on the efficient axial attention. Instead of simply connecting axial attention along the horizontal and vertical directions sequentially, we propose to calculate dual cross attentions between two parallel axial attentions to capture global information better. To process the significant variations of lesion regions or organs in individual sizes and shapes, we also use multiple convolutions of strip-shape kernels with different kernel sizes in each axial attention path to improve the efficiency of the proposed MCA in encoding spatial information. We build the proposed MCA upon the MSCAN backbone, yielding our network, termed MCANet. Our MCANet with only 4M+ parameters performs even better than most previous works with heavy backbones (e.g., Swin Transformer) on four challenging tasks, including skin lesion segmentation, nuclei segmentation, abdominal multi-organ segmentation, and polyp segmentation. Code is available at https://github.com/haoshao-nku/medical_seg.

Submitted 19 December, 2023; v1 submitted 14 December, 2023; originally announced December 2023.

arXiv:2312.08735 [pdf, other]

Polyper: Boundary Sensitive Polyp Segmentation

Authors: Hao Shao, Yang Zhang, Qibin Hou

Abstract: We present a new boundary sensitive framework for polyp segmentation, called Polyper. Our method is motivated by a clinical approach in which seasoned medical practitioners often leverage the inherent features of interior polyp regions to tackle blurred boundaries. Inspired by this, we propose explicitly leveraging polyp regions to bolster the model's boundary discrimination capability while minimizing computation. Our approach first extracts boundary and polyp regions from the initial segmentation map through morphological operators. Then, we design the boundary sensitive attention that concentrates on augmenting the features near the boundary regions using the interior polyp regions' characteristics to generate good segmentation results. Our proposed method can be seamlessly integrated with classical encoder networks, like ResNet-50, MiT-B1, and Swin Transformer. To evaluate the effectiveness of Polyper, we conduct experiments on five publicly available challenging datasets, and achieve state-of-the-art performance on all of them. Code is available at https://github.com/haoshao-nku/medical_seg.git.

Submitted 14 December, 2023; originally announced December 2023.

Comments: Accepted to AAAI 2024

arXiv:2312.07488 [pdf, other]

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Authors: Hao Shao, Yuxuan Hu, Letian Wang, Steven L. Waslander, Yu Liu, Hongsheng Li

Abstract: Despite significant recent progress in the field of autonomous driving, modern methods still struggle and can incur serious accidents when encountering long-tail unforeseen events and challenging urban scenarios. On the one hand, large language models (LLM) have shown impressive reasoning capabilities that approach "Artificial General Intelligence". On the other hand, previous autonomous driving methods tend to rely on limited-format inputs (e.g. sensor data and navigation waypoints), restricting the vehicle's ability to understand language information and interact with humans. To this end, this paper introduces LMDrive, a novel language-guided, end-to-end, closed-loop autonomous driving framework. LMDrive uniquely processes and integrates multi-modal sensor data with natural language instructions, enabling interaction with humans and navigation software in realistic instructional settings. To facilitate further research in language-based closed-loop autonomous driving, we also publicly release the corresponding dataset which includes approximately 64K instruction-following data clips, and the LangAuto benchmark that tests the system's ability to handle complex instructions and challenging driving scenarios. Extensive closed-loop experiments are conducted to demonstrate LMDrive's effectiveness. To the best of our knowledge, we're the very first work to leverage LLMs for closed-loop end-to-end autonomous driving. Codes, models, and datasets can be found at https://github.com/opendilab/LMDrive

Submitted 21 December, 2023; v1 submitted 12 December, 2023; originally announced December 2023.

Comments: project page: https://hao-shao.com/projects/lmdrive.html

arXiv:2312.03792 [pdf, other]

PCDP-SGD: Improving the Convergence of Differentially Private SGD via Projection in Advance

Authors: Haichao Sha, Ruixuan Liu, Yixuan Liu, Hong Chen

Abstract: The paradigm of Differentially Private SGD (DP-SGD) can provide a theoretical guarantee for training data in both centralized and federated settings. However, the utility degradation caused by DP-SGD limits its wide application in high-stakes tasks, such as medical image diagnosis. In addition to the necessary perturbation, the convergence issue is attributed to the information loss on the gradient clipping. In this work, we propose a general framework PCDP-SGD, which aims to compress redundant gradient norms and preserve more crucial top gradient components via projection operation before gradient clipping. Additionally, we extend PCDP-SGD as a fundamental component in differential privacy federated learning (DPFL) for mitigating the data heterogeneous challenge and achieving efficient communication. We prove that pre-projection enhances the convergence of DP-SGD by reducing the dependence of clipping error and bias to a fraction of the top gradient eigenspace, and in theory, limits cross-client variance to improve the convergence under heterogeneous federation. Experimental results demonstrate that PCDP-SGD achieves higher accuracy compared with state-of-the-art DP-SGD variants in computer vision tasks. Moreover, PCDP-SGD outperforms current federated learning frameworks when DP is guaranteed on local training sets.

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2311.16459 [pdf, other]

On the Effect of Defections in Federated Learning and How to Prevent Them

Authors: Minbiao Han, Kumar Kshitij Patel, Han Shao, Lingxiao Wang

Abstract: Federated learning is a machine learning protocol that enables a large population of agents to collaborate over multiple rounds to produce a single consensus model. There are several federated learning applications where agents may choose to defect permanently, essentially withdrawing from the collaboration, if they are content with their instantaneous model in that round. This work demonstrates the detrimental impact of such defections on the final model's robustness and ability to generalize. We also show that current federated optimization algorithms fail to disincentivize these harmful defections. We introduce a novel optimization algorithm with theoretical guarantees to prevent defections while ensuring asymptotic convergence to an effective solution for all participating agents. We also provide numerical experiments to corroborate our findings and demonstrate the effectiveness of our algorithm.

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.12070 [pdf, other]

FDDM: Unsupervised Medical Image Translation with a Frequency-Decoupled Diffusion Model

Authors: Yunxiang Li, Hua-Chieh Shao, Xiaoxue Qian, You Zhang

Abstract: Diffusion models have demonstrated significant potential in producing high-quality images in medical image translation to aid disease diagnosis, localization, and treatment. Nevertheless, current diffusion models have limited success in achieving faithful image translations that can accurately preserve the anatomical structures of medical images, especially for unpaired datasets. The preservation of structural and anatomical details is essential to reliable medical diagnosis and treatment planning, as structural mismatches can lead to disease misidentification and treatment errors. In this study, we introduce the Frequency Decoupled Diffusion Model (FDDM) for MR-to-CT conversion. FDDM first obtains the anatomical information of the CT image from the MR image through an initial conversion module. This anatomical information then guides a subsequent diffusion model to generate high-quality CT images. Our diffusion model uses a dual-path reverse diffusion process for low-frequency and high-frequency information, achieving a better balance between image quality and anatomical accuracy. We extensively evaluated FDDM using public datasets for brain MR-to-CT and pelvis MR-to-CT translations, demonstrating its superior performance to other GAN-based, VAE-based, and diffusion-based models. The evaluation metrics included Frechet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM). FDDM achieved the best scores on all metrics for both datasets, particularly excelling in FID, with scores of 25.9 for brain data and 29.2 for pelvis data, significantly outperforming other methods. These results demonstrate that FDDM can generate high-quality target domain images while maintaining the accuracy of translated anatomical structures.

Submitted 26 June, 2024; v1 submitted 19 November, 2023; originally announced November 2023.

arXiv:2311.10036 [pdf]

Dynamic CBCT Imaging using Prior Model-Free Spatiotemporal Implicit Neural Representation (PMF-STINR)

Authors: Hua-Chieh Shao, Mengke Tielige, Tinsu Pan, You Zhang

Abstract: Dynamic cone-beam computed tomography (CBCT) can capture high-spatial-resolution, time-varying images for motion monitoring, patient setup, and adaptive planning of radiotherapy. However, dynamic CBCT reconstruction is an extremely ill-posed spatiotemporal inverse problem, as each CBCT volume in the dynamic sequence is only captured by one or a few X-ray projections. We developed a machine learning-based technique, prior-model-free spatiotemporal implicit neural representation (PMF-STINR), to reconstruct dynamic CBCTs from sequentially acquired X-ray projections. PMF-STINR employs a joint image reconstruction and registration approach to address the under-sampling challenge. Specifically, PMF-STINR uses spatial implicit neural representation to reconstruct a reference CBCT volume, and it applies temporal INR to represent the intra-scan dynamic motion with respect to the reference CBCT to yield dynamic CBCTs. PMF-STINR couples the temporal INR with a learning-based B-spline motion model to capture time-varying deformable motion during the reconstruction. Compared with previous methods, the spatial INR, the temporal INR, and the B-spline model of PMF-STINR are all learned on the fly during reconstruction in a one-shot fashion, without using any patient-specific prior knowledge or motion sorting/binning. PMF-STINR was evaluated via digital phantom simulations, physical phantom measurements, and a multi-institutional patient dataset featuring various imaging protocols (half-fan/full-fan, full sampling/sparse sampling, different energy and mAs settings, etc.). The results showed that the one-shot learning-based PMF-STINR can accurately and robustly reconstruct dynamic CBCTs and capture highly irregular motion with high temporal (~0.1s) resolution and sub-millimeter accuracy. It can be a promising tool for motion management by offering richer motion information than traditional 4D-CBCTs.

Submitted 4 December, 2023; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2311.07754 [pdf, other]

Efficient Prior-Free Mechanisms for No-Regret Agents

Authors: Natalie Collina, Aaron Roth, Han Shao

Abstract: We study a repeated Principal Agent problem between a long lived Principal and Agent pair in a prior free setting. In our setting, the sequence of realized states of nature may be adversarially chosen, the Agent is non-myopic, and the Principal aims for a strong form of policy regret. Following Camara, Hartline, and Johnson, we model the Agent's long-run behavior with behavioral assumptions that relax the common prior assumption (for example, that the Agent has no swap regret). Within this framework, we revisit the mechanism proposed by Camara et al., which informally uses calibrated forecasts of the unknown states of nature in place of a common prior. We give two main improvements. First, we give a mechanism that has an exponentially improved dependence (in terms of both running time and regret bounds) on the number of distinct states of nature. To do this, we show that our mechanism does not require truly calibrated forecasts, but rather forecasts that are unbiased subject to only a polynomially sized collection of events -- which can be produced with polynomial overhead. Second, in several important special cases -- including the focal linear contracting setting -- we show how to remove strong ``Alignment'' assumptions (which informally require that near-ties are always broken in favor of the Principal) by specifically deploying ``stable'' policies that do not have any near ties that are payoff relevant to the Principal. Taken together, our new mechanism makes the compelling framework proposed by Camara et al. much more powerful, now able to be realized over polynomially sized state spaces, and while requiring only mild assumptions on Agent behavior.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.05075 [pdf]

Mental Health Diagnosis in the Digital Age: Harnessing Sentiment Analysis on Social Media Platforms upon Ultra-Sparse Feature Content

Authors: Haijian Shao, Ming Zhu, Shengjie Zhai

Abstract: Amid growing global mental health concerns, particularly among vulnerable groups, natural language processing offers a tremendous potential for early detection and intervention of people's mental disorders via analyzing their postings and discussions on social media platforms. However, ultra-sparse training data, often due to vast vocabularies and low-frequency words, hinders the analysis accuracy. Multi-labeling and co-occurrences of symptoms may also blur the boundaries in distinguishing similar/co-related disorders. To address these issues, we propose a novel semantic feature preprocessing technique with a three-folded structure: 1) mitigating the feature sparsity with a weak classifier, 2) adaptive feature dimension with modulus loops, and 3) deep-mining and extending features among the contexts. With enhanced semantic features, we train a machine learning model to predict and classify mental disorders. We utilize the Reddit Mental Health Dataset 2022 to examine conditions such as Anxiety, Borderline Personality Disorder (BPD), and Bipolar-Disorder (BD) and present solutions to the data sparsity challenge, highlighted by 99.81% non-zero elements. After applying our preprocessing technique, the feature sparsity decreases to 85.4%. Overall, our methods, when compared to seven benchmark models, demonstrate significant performance improvements: 8.0% in accuracy, 0.069 in precision, 0.093 in recall, 0.102 in F1 score, and 0.059 in AUC. This research provides foundational insights for mental health prediction and monitoring, providing innovative solutions to navigate challenges associated with ultra-sparse data feature and intricate multi-label classification in the domain of mental health analysis.

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2311.00260 [pdf, ps, other]

Incentivized Collaboration in Active Learning

Authors: Lee Cohen, Han Shao

Abstract: In collaborative active learning, where multiple agents try to learn labels from a common hypothesis, we introduce an innovative framework for incentivized collaboration. Here, rational agents aim to obtain labels for their data sets while keeping label complexity at a minimum. We focus on designing (strict) individually rational (IR) collaboration protocols, ensuring that agents cannot reduce their expected label complexity by acting individually. We first show that given any optimal active learning algorithm, the collaboration protocol that runs the algorithm as is over the entire data is already IR. However, computing the optimal algorithm is NP-hard. We therefore provide collaboration protocols that achieve (strict) IR and are comparable with the best known tractable approximation algorithm in terms of label complexity.

Submitted 31 October, 2023; originally announced November 2023.

Showing 1–50 of 344 results for author: Shao, h