-
Theory on Mixture-of-Experts in Continual Learning
Authors:
Hongbo Li,
Sen Lin,
Lingjie Duan,
Yingbin Liang,
Ness B. Shroff
Abstract:
Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time. Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks. The Mixture-of-Experts (MoE) model has recently been shown to effectively mitigate catastrophic forgetting in CL, by employing a gating network to sparsify…
▽ More
Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time. Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks. The Mixture-of-Experts (MoE) model has recently been shown to effectively mitigate catastrophic forgetting in CL, by employing a gating network to sparsify and distribute diverse tasks among multiple experts. However, there is a lack of theoretical analysis of MoE and its impact on the learning performance in CL. This paper provides the first theoretical results to characterize the impact of MoE in CL via the lens of overparameterized linear regression tasks. We establish the benefit of MoE over a single expert by proving that the MoE model can diversify its experts to specialize in different tasks, while its router learns to select the right expert for each task and balance the loads across all experts. Our study further suggests an intriguing fact that the MoE in CL needs to terminate the update of the gating network after sufficient training rounds to attain system convergence, which is not needed in the existing MoE studies that do not consider the continual task arrival. Furthermore, we provide explicit expressions for the expected forgetting and overall generalization error to characterize the benefit of MoE in the learning performance in CL. Interestingly, adding more experts requires additional rounds before convergence, which may not enhance the learning performance. Finally, we conduct experiments on both synthetic and real datasets to extend these insights from linear models to deep neural networks (DNNs), which also shed light on the practical algorithm design for MoE in CL.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
On the Transformations across Reward Model, Parameter Update, and In-Context Prompt
Authors:
Deng Cai,
Huayang Li,
Tingchen Fu,
Siheng Li,
Weiwen Xu,
Shuaiyi Li,
Bowen Cao,
Zhisong Zhang,
Xinting Huang,
Leyang Cui,
Yan Wang,
Lemao Liu,
Taro Watanabe,
Shuming Shi
Abstract:
Despite the general capabilities of pre-trained large language models (LLMs), they still need further adaptation to better serve practical applications. In this paper, we demonstrate the interchangeability of three popular and distinct adaptation tools: parameter updating, reward modeling, and in-context prompting. This interchangeability establishes a triangular framework with six transformation…
▽ More
Despite the general capabilities of pre-trained large language models (LLMs), they still need further adaptation to better serve practical applications. In this paper, we demonstrate the interchangeability of three popular and distinct adaptation tools: parameter updating, reward modeling, and in-context prompting. This interchangeability establishes a triangular framework with six transformation directions, each of which facilitates a variety of applications. Our work offers a holistic view that unifies numerous existing studies and suggests potential research directions. We envision our work as a useful roadmap for future research on LLMs.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
RefXVC: Cross-Lingual Voice Conversion with Enhanced Reference Leveraging
Authors:
Mingyang Zhang,
Yi Zhou,
Yi Ren,
Chen Zhang,
Xiang Yin,
Haizhou Li
Abstract:
This paper proposes RefXVC, a method for cross-lingual voice conversion (XVC) that leverages reference information to improve conversion performance. Previous XVC works generally take an average speaker embedding to condition the speaker identity, which does not account for the changing timbre of speech that occurs with different pronunciations. To address this, our method uses both global and loc…
▽ More
This paper proposes RefXVC, a method for cross-lingual voice conversion (XVC) that leverages reference information to improve conversion performance. Previous XVC works generally take an average speaker embedding to condition the speaker identity, which does not account for the changing timbre of speech that occurs with different pronunciations. To address this, our method uses both global and local speaker embeddings to capture the timbre changes during speech conversion. Additionally, we observed a connection between timbre and pronunciation in different languages and utilized this by incorporating a timbre encoder and a pronunciation matching network into our model. Furthermore, we found that the variation in tones is not adequately reflected in a sentence, and therefore, we used multiple references to better capture the range of a speaker's voice. The proposed method outperformed existing systems in terms of both speech quality and speaker similarity, highlighting the effectiveness of leveraging reference information in cross-lingual voice conversion. The converted speech samples can be found on the website: \url{http://refxvc.dn3point.com}
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
Displaced Heavy Neutral Lepton from New Higgs Doublet
Authors:
Fa-Xin Yang,
Feng-Lan Shao,
Zhi-Long Han,
Yi **,
Honglei Li
Abstract:
Heavy neutral leptons $N$ are introduced to explain the tiny neutrino masses via the seesaw mechanism. For proper small mixing parameter $V_{\ell N}$, the heavy neutral leptons $N$ become long-lived, which leads to the displaced vertex signature at colliders. In this paper, we consider the displaced heavy neutral lepton from the neutrinophilic Higgs doublet $Φ_ν$ decay. The new Higgs doublet with…
▽ More
Heavy neutral leptons $N$ are introduced to explain the tiny neutrino masses via the seesaw mechanism. For proper small mixing parameter $V_{\ell N}$, the heavy neutral leptons $N$ become long-lived, which leads to the displaced vertex signature at colliders. In this paper, we consider the displaced heavy neutral lepton from the neutrinophilic Higgs doublet $Φ_ν$ decay. The new Higgs doublet with MeV scale VEV can naturally explain the tiny neutrino masses with TeV scale $N$. Different from current experimental searches via the $W^\pm\to \ell^\pm N$ decay, the new decays as $H^\pm\to \ell^\pm N$ are not suppressed by the small mixing parameter $V_{\ell N}$. Therefore, a larger parameter space is expected to be detected at colliders. We then investigate the promising region at the 14 TeV HL-LHC and the 3 TeV CLIC. According to our simulation, the DV signature could probe $|V_{\ell N}|^2\gtrsim10^{-19}$ with $m_N<m_{H^+}$, which covers the seesaw predicted value $|V_{\ell N}|^2\sim m_ν/m_N$. We could probe $m_{H^+}\lesssim1200$ GeV at the 14 TeV HL-LHC and $m_{H^+}\lesssim1490$ GeV at the 3 TeV CLIC.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing
Authors:
Jiangshu Du,
Yibo Wang,
Wenting Zhao,
Zhongfen Deng,
Shuaiqi Liu,
Renze Lou,
Henry Peng Zou,
Pranav Narayanan Venkit,
Nan Zhang,
Mukund Srinath,
Haoran Ranran Zhang,
Vipul Gupta,
Yinghui Li,
Tao Li,
Fei Wang,
Qin Liu,
Tianlin Liu,
Pengzhi Gao,
Congying Xia,
Chen Xing,
Jiayang Cheng,
Zhaowei Wang,
Ying Su,
Raj Sanjay Shah,
Ruohao Guo
, et al. (15 additional authors not shown)
Abstract:
This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as th…
▽ More
This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as they have to spend more time reading, writing, and reviewing papers. This raises the question: how can LLMs potentially assist researchers in alleviating their heavy workload?
This study focuses on the topic of LLMs assist NLP Researchers, particularly examining the effectiveness of LLM in assisting paper (meta-)reviewing and its recognizability. To address this, we constructed the ReviewCritique dataset, which includes two types of information: (i) NLP papers (initial submissions rather than camera-ready) with both human-written and LLM-generated reviews, and (ii) each review comes with "deficiency" labels and corresponding explanations for individual segments, annotated by experts. Using ReviewCritique, this study explores two threads of research questions: (i) "LLMs as Reviewers", how do reviews generated by LLMs compare with those written by humans in terms of quality and distinguishability? (ii) "LLMs as Metareviewers", how effectively can LLMs identify potential issues, such as Deficient or unprofessional review segments, within individual paper reviews? To our knowledge, this is the first work to provide such a comprehensive analysis.
△ Less
Submitted 25 June, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
Intensity Confusion Matters: An Intensity-Distance Guided Loss for Bronchus Segmentation
Authors:
Haifan Gong,
Wenhao Huang,
Huan Zhang,
Yu Wang,
Xiang Wan,
Hong Shen,
Guanbin Li,
Haofeng Li
Abstract:
Automatic segmentation of the bronchial tree from CT imaging is important, as it provides structural information for disease diagnosis. Despite the merits of previous automatic bronchus segmentation methods, they have paied less attention to the issue we term as \textit{Intensity Confusion}, wherein the intensity values of certain background voxels approach those of the foreground voxels within br…
▽ More
Automatic segmentation of the bronchial tree from CT imaging is important, as it provides structural information for disease diagnosis. Despite the merits of previous automatic bronchus segmentation methods, they have paied less attention to the issue we term as \textit{Intensity Confusion}, wherein the intensity values of certain background voxels approach those of the foreground voxels within bronchi. Conversely, the intensity values of some foreground voxels are nearly identical to those of background voxels. This proximity in intensity values introduces significant challenges to neural network methodologies. To address the issue, we introduce a novel Intensity-Distance Guided loss function, which assigns adaptive weights to different image voxels for mining hard samples that cause the intensity confusion. The proposed loss estimates the voxel-level hardness of samples, on the basis of the following intensity and distance priors. We regard a voxel as a hard sample if it is in: (1) the background and has an intensity value close to the bronchus region; (2) the bronchus region and is of higher intensity than most voxels inside the bronchus; (3) the background region and at a short distance from the bronchus. Extensive experiments not only show the superiority of our method compared with the state-of-the-art methods, but also verify that tackling the intensity confusion issue helps to significantly improve bronchus segmentation. Project page: https://github.com/lhaof/ICM.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning
Authors:
Erin J. Talvitie,
Zilei Shao,
Huiying Li,
**ghan Hu,
Jacob Boerma,
Rory Zhao,
Xintong Wang
Abstract:
In model-based reinforcement learning, simulated experiences from the learned model are often treated as equivalent to experience from the real environment. However, when the model is inaccurate, it can catastrophically interfere with policy learning. Alternatively, the agent might learn about the model's accuracy and selectively use it only when it can provide reliable predictions. We empirically…
▽ More
In model-based reinforcement learning, simulated experiences from the learned model are often treated as equivalent to experience from the real environment. However, when the model is inaccurate, it can catastrophically interfere with policy learning. Alternatively, the agent might learn about the model's accuracy and selectively use it only when it can provide reliable predictions. We empirically explore model uncertainty measures for selective planning and show that best results require distribution insensitive inference to estimate the uncertainty over model-based updates. To that end, we propose and evaluate bounding-box inference, which operates on bounding-boxes around sets of possible states and other quantities. We find that bounding-box inference can reliably support effective selective planning.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
Determining the Dielectric Constant of Solid/Liquid Interfaces
Authors:
Somaiyeh Dadashi,
Narendra M. Adhikari,
Hao Li,
Stefan M. Piontek,
Zheming Wang,
Kevin M. Rosso,
Eric Borguet
Abstract:
The dielectric constant ($\varepsilon^{\prime}$) of interfacial water is an important parameter, but its measurement has posed challenges, and no consensus has been reached on a generalized expression. We derived a formula for $\varepsilon^{\prime}$ of a buried interface using the slab model for a half-solvated sphere:…
▽ More
The dielectric constant ($\varepsilon^{\prime}$) of interfacial water is an important parameter, but its measurement has posed challenges, and no consensus has been reached on a generalized expression. We derived a formula for $\varepsilon^{\prime}$ of a buried interface using the slab model for a half-solvated sphere: $ \varepsilon^{\prime}=\varepsilon_1 \varepsilon_2\left(\varepsilon_2-\varepsilon_1+6\right) / 2\left(2 \varepsilon_2+\varepsilon_1\right)$, where $\varepsilon_1$ and $\varepsilon_2$ are the dielectric constants of the solid and liquid phases, respectively. We experimentally validated this expression using vibrational sum frequency generation and Fresnel factor calculations for interfaces of alumina with water ($ \mathrm{H_2O} $ and $ \mathrm{D_2O} $) and acetonitrile. This fills an important knowledge gap in the description of the dielectric constant of interfaces.
△ Less
Submitted 25 June, 2024; v1 submitted 22 June, 2024;
originally announced June 2024.
-
Decoupling Many-Body Interactions in CeO2 (111) Oxygen Vacancy Structure: Insights from Machine-Learning and Cluster Expansion
Authors:
Yu**g Zhang,
Zhong-Kang Han,
Beien Zhu,
Xiaojuan Hu,
Maria Troppenz,
Santiago Riga-monti,
Hui Li,
Claudia Draxl,
M. Verónica Ganduglia-Pirovano,
Yi Gao
Abstract:
Oxygen vacancies (VO's) are of paramount importance in influencing the properties and applications of ceria (CeO2). Yet, comprehending the distribution and nature of the VO's poses a significant challenge due to the vast number of electronic configurations and intricate many-body interactions among VO's and polarons (Ce3+'s). In this study, we employed a combination of LASSO regression in machine…
▽ More
Oxygen vacancies (VO's) are of paramount importance in influencing the properties and applications of ceria (CeO2). Yet, comprehending the distribution and nature of the VO's poses a significant challenge due to the vast number of electronic configurations and intricate many-body interactions among VO's and polarons (Ce3+'s). In this study, we employed a combination of LASSO regression in machine learning, in conjunction with a cluster expansion model and first-principles calculations to decouple the interactions among the Ce3+'s and VO's, thereby circumventing the limitations associated with sampling electronic configurations. By separating these interactions, we identified specific electronic configurations characterized by the most favorable VO-Ce3+ attractions and the least Ce3+-Ce3+/VO-VO repulsions, which are crucial in determining the stability of vacancy structures. Through more than 10^8 Metropolis Monte Carlo samplings of Vo's and Ce3+ in the near-surface of CeO2(111), we explored potential configurations within an 8x8 supercell. Our findings revealed that oxygen vacancies tend to aggregate and are most abundant in the third oxygen layer, primarily due to extensive geometric relaxation-an aspect previously overlooked. This behavior is notably dependent on the concentration of Vo. This work introduces a novel theoretical framework for unraveling the complex vacancy structures in metal oxides, with potential applications in redox and catalytic chemistry.
△ Less
Submitted 22 June, 2024;
originally announced June 2024.
-
Full-Space Wireless Sensing Enabled by Multi-Sector Intelligent Surfaces
Authors:
Yumeng Zhang,
Xiaodan Shao,
Hongyu Li,
Bruno Clerckx,
Rui Zhang
Abstract:
The multi-sector intelligent surface (IS), benefiting from a smarter wave manipulation capability, has been shown to enhance channel gain and offer full-space coverage in communications. However, the benefits of multi-sector IS in wireless sensing remain unexplored. This paper introduces the application of multi-sector IS for wireless sensing/localization. Specifically, we propose a new self-sensi…
▽ More
The multi-sector intelligent surface (IS), benefiting from a smarter wave manipulation capability, has been shown to enhance channel gain and offer full-space coverage in communications. However, the benefits of multi-sector IS in wireless sensing remain unexplored. This paper introduces the application of multi-sector IS for wireless sensing/localization. Specifically, we propose a new self-sensing system, where an active source controller uses the multi-sector IS geometry to reflect/scatter the emitted signals towards the entire space, thereby achieving full-space coverage for wireless sensing. Additionally, dedicated sensors are installed aligned with the IS elements at each sector, which collect echo signals from the target and cooperate to sense the target angle. In this context, we develop a maximum likelihood estimator of the target angle for the proposed multi-sector IS self-sensing system, along with the corresponding theoretical limits defined by the Cramér-Rao Bound. The analysis reveals that the advantages of the multi-sector IS self-sensing system stem from two aspects: enhancing the probing power on targets (thereby improving power efficiency) and increasing the rate of target angle (thereby enhancing the transceiver's sensitivity to target angles). Finally, our analysis and simulations confirm that the multi-sector IS self-sensing system, particularly the 4-sector architecture, achieves full-space sensing capability beyond the single-sector IS configuration. Furthermore, similarly to communications, employing directive antenna patterns on each sector's IS elements and sensors significantly enhances sensing capabilities. This enhancement originates from both aspects of improved power efficiency and target angle sensitivity, with the former also being observed in communications while the latter being unique in sensing.
△ Less
Submitted 25 June, 2024; v1 submitted 22 June, 2024;
originally announced June 2024.
-
Rethinking the Diffusion Models for Numerical Tabular Data Imputation from the Perspective of Wasserstein Gradient Flow
Authors:
Zhichao Chen,
Haoxuan Li,
Fangyikang Wang,
Odin Zhang,
Hu Xu,
Xiaoyu Jiang,
Zhihuan Song,
Eric H. Wang
Abstract:
Diffusion models (DMs) have gained attention in Missing Data Imputation (MDI), but there remain two long-neglected issues to be addressed: (1). Inaccurate Imputation, which arises from inherently sample-diversification-pursuing generative process of DMs. (2). Difficult Training, which stems from intricate design required for the mask matrix in model training stage. To address these concerns within…
▽ More
Diffusion models (DMs) have gained attention in Missing Data Imputation (MDI), but there remain two long-neglected issues to be addressed: (1). Inaccurate Imputation, which arises from inherently sample-diversification-pursuing generative process of DMs. (2). Difficult Training, which stems from intricate design required for the mask matrix in model training stage. To address these concerns within the realm of numerical tabular datasets, we introduce a novel principled approach termed Kernelized Negative Entropy-regularized Wasserstein gradient flow Imputation (KnewImp). Specifically, based on Wasserstein gradient flow (WGF) framework, we first prove that issue (1) stems from the cost functionals implicitly maximized in DM-based MDI are equivalent to the MDI's objective plus diversification-promoting non-negative terms. Based on this, we then design a novel cost functional with diversification-discouraging negative entropy and derive our KnewImp approach within WGF framework and reproducing kernel Hilbert space. After that, we prove that the imputation procedure of KnewImp can be derived from another cost functional related to the joint distribution, eliminating the need for the mask matrix and hence naturally addressing issue (2). Extensive experiments demonstrate that our proposed KnewImp approach significantly outperforms existing state-of-the-art methods.
△ Less
Submitted 22 June, 2024;
originally announced June 2024.
-
Predicting fluorescent labels in label-free microscopy images with pix2pix and adaptive loss in Light My Cells challenge
Authors:
Han Liu,
Hao Li,
Jiacheng Wang,
Yubo Fan,
Zhoubing Xu,
Ipek Oguz
Abstract:
Fluorescence labeling is the standard approach to reveal cellular structures and other subcellular constituents for microscopy images. However, this invasive procedure may perturb or even kill the cells and the procedure itself is highly time-consuming and complex. Recently, in silico labeling has emerged as a promising alternative, aiming to use machine learning models to directly predict the flu…
▽ More
Fluorescence labeling is the standard approach to reveal cellular structures and other subcellular constituents for microscopy images. However, this invasive procedure may perturb or even kill the cells and the procedure itself is highly time-consuming and complex. Recently, in silico labeling has emerged as a promising alternative, aiming to use machine learning models to directly predict the fluorescently labeled images from label-free microscopy. In this paper, we propose a deep learning-based in silico labeling method for the Light My Cells challenge. Built upon pix2pix, our proposed method can be trained using the partially labeled datasets with an adaptive loss. Moreover, we explore the effectiveness of several training strategies to handle different input modalities, such as training them together or separately. The results show that our method achieves promising performance for in silico labeling. Our code is available at https://github.com/MedICL-VU/LightMyCells.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Self-Supervised Alignment Learning for Medical Image Segmentation
Authors:
Haofeng Li,
Yiming Ouyang,
Xiang Wan
Abstract:
Recently, self-supervised learning (SSL) methods have been used in pre-training the segmentation models for 2D and 3D medical images. Most of these methods are based on reconstruction, contrastive learning and consistency regularization. However, the spatial correspondence of 2D slices from a 3D medical image has not been fully exploited. In this paper, we propose a novel self-supervised alignment…
▽ More
Recently, self-supervised learning (SSL) methods have been used in pre-training the segmentation models for 2D and 3D medical images. Most of these methods are based on reconstruction, contrastive learning and consistency regularization. However, the spatial correspondence of 2D slices from a 3D medical image has not been fully exploited. In this paper, we propose a novel self-supervised alignment learning framework to pre-train the neural network for medical image segmentation. The proposed framework consists of a new local alignment loss and a global positional loss. We observe that in the same 3D scan, two close 2D slices usually contain similar anatomic structures. Thus, the local alignment loss is proposed to make the pixel-level features of matched structures close to each other. Experimental results show that the proposed alignment learning is competitive with existing self-supervised pre-training approaches on CT and MRI datasets, under the setting of limited annotations.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Preliminary Design of a General Electronics Platform for Accelerator Facilities
Authors:
**fu Zhu,
Hongli Ding,
Haokui Li,
Qiaoye Ran,
Xiwen Dai,
Wei Li,
Jiawei Han,
Yue Li,
Zhiyuan Zhang,
Weixin Qiu,
Weiqing Zhang
Abstract:
Many accelerators require considerable electronic systems for tests, verification, and operation. In Shenzhen Superconducting Soft X-ray Free Electron Laser (S3FEL), to meet the early tests and verification of various systems, save development expenses, and improve the reusability of hardware, firmware, and software systems, we have considered the needs of each system and preliminarily designed a…
▽ More
Many accelerators require considerable electronic systems for tests, verification, and operation. In Shenzhen Superconducting Soft X-ray Free Electron Laser (S3FEL), to meet the early tests and verification of various systems, save development expenses, and improve the reusability of hardware, firmware, and software systems, we have considered the needs of each system and preliminarily designed a general electronics platform based on MicroTCA.4. The Advanced Mezzanine Card (AMC) will place an FPGA Mezzanine Card (FMC) that supports 500 MSPS to 2 GSPS ADC/DAC. We will design two FMC cards on the Rear Transition Module (RTM), which can be used for analog signal conditioning and waveform digitization by 10 MSPS to 250 MSPS ADC/DAC or motor control. The commercial MCH, CPU, power module, and MTCA crate are deployed. This platform can also be applied to other accelerator facilities.
△ Less
Submitted 11 May, 2024;
originally announced June 2024.
-
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
Authors:
Daniel Dauner,
Marcel Hallgarten,
Tianyu Li,
Xinshuo Weng,
Zhiyu Huang,
Zetong Yang,
Hongyang Li,
Igor Gilitschenski,
Boris Ivanovic,
Marco Pavone,
Andreas Geiger,
Kashyap Chitta
Abstract:
Benchmarking vision-based driving policies is challenging. On one hand, open-loop evaluation with real data is easy, but these results do not reflect closed-loop performance. On the other, closed-loop evaluation is possible in simulation, but is hard to scale due to its significant computational demands. Further, the simulators available today exhibit a large domain gap to real data. This has resu…
▽ More
Benchmarking vision-based driving policies is challenging. On one hand, open-loop evaluation with real data is easy, but these results do not reflect closed-loop performance. On the other, closed-loop evaluation is possible in simulation, but is hard to scale due to its significant computational demands. Further, the simulators available today exhibit a large domain gap to real data. This has resulted in an inability to draw clear conclusions from the rapidly growing body of research on end-to-end autonomous driving. In this paper, we present NAVSIM, a middle ground between these evaluation paradigms, where we use large datasets in combination with a non-reactive simulator to enable large-scale real-world benchmarking. Specifically, we gather simulation-based metrics, such as progress and time to collision, by unrolling bird's eye view abstractions of the test scenes for a short simulation horizon. Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other. As we demonstrate empirically, this decoupling allows open-loop metric computation while being better aligned with closed-loop evaluations than traditional displacement errors. NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights. On a large set of challenging scenarios, we observe that simple methods with moderate compute requirements such as TransFuser can match recent large-scale end-to-end driving architectures such as UniAD. Our modular framework can potentially be extended with new datasets, data curation strategies, and metrics, and will be continually maintained to host future challenges. Our code is available at https://github.com/autonomousvision/navsim.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Gradient-Mask Tuning Elevates the Upper Limits of LLM Performance
Authors:
Haoling Li,
Xin Zhang,
Xiao Liu,
Yeyun Gong,
Yifan Wang,
Yujiu Yang,
Qi Chen,
Peng Cheng
Abstract:
Large language models (LLMs) have revolutionized lots of fields of research. Although it is well-known that fine-tuning is essential for enhancing the capabilities of LLMs, existing research suggests that there is potential redundancy in the fine-tuning process and therefore proposes to update only a subset of parameters. However, these methods fail to leverage the task-specific information to ide…
▽ More
Large language models (LLMs) have revolutionized lots of fields of research. Although it is well-known that fine-tuning is essential for enhancing the capabilities of LLMs, existing research suggests that there is potential redundancy in the fine-tuning process and therefore proposes to update only a subset of parameters. However, these methods fail to leverage the task-specific information to identify important parameters during training. Based on the insight that gradients inherently contain information on task-specific data, we propose Gradient-Mask Tuning (GMT), a method that selectively updates parameters during training based on their gradient information. Specifically, we compute the absolute values of the gradients and apply masking to those with relatively smaller magnitudes. Our empirical results across various tasks demonstrate that GMT not only outperforms traditional fine-tuning methods but also elevates the upper limits of LLM performance. Further analysis indicates that GMT exhibits insensitivity to mask ratio and possesses computational efficiency comparable to vanilla SFT.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
ADR: Attention Diversification Regularization for Mitigating Overfitting in Multiple Instance Learning based Whole Slide Image Classification
Authors:
Yunlong Zhang,
Zhongyi Shui,
Yunxuan Sun,
Honglin Li,
**gxiong Li,
Chenglu Zhu,
Sunyi Zheng,
Lin Yang
Abstract:
Multiple Instance Learning (MIL) has demonstrated effectiveness in analyzing whole slide images (WSIs), yet it often encounters overfitting challenges in real-world applications. This paper reveals the correlation between MIL's performance and the entropy of attention values. Based on this observation, we propose Attention Diversity Regularization (ADR), a simple but effective technique aimed at p…
▽ More
Multiple Instance Learning (MIL) has demonstrated effectiveness in analyzing whole slide images (WSIs), yet it often encounters overfitting challenges in real-world applications. This paper reveals the correlation between MIL's performance and the entropy of attention values. Based on this observation, we propose Attention Diversity Regularization (ADR), a simple but effective technique aimed at promoting high entropy in attention values. Specifically, ADR introduces a negative Shannon entropy loss for attention values into the regular MIL framework. Compared to existing methods aimed at alleviating overfitting, which often necessitate additional modules or processing steps, our ADR approach requires no such extras, demonstrating simplicity and efficiency. We evaluate our ADR on three WSI classification tasks. ADR achieves superior performance over the state-of-the-art on most of them. We also show that ADR can enhance heatmaps, aligning them better with pathologists' diagnostic criteria. The source code is available at \url{https://github.com/dazhangyu123/ADR}.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
SaTor: Satellite Routing in Tor to Reduce Latency
Authors:
Haozhi Li,
Tariq Elahi
Abstract:
High latency is a critical limitation within the Tor network. A key factor exacerbating Tor latency is the creation of lengthy circuits that span across geographically distant regions, causing significant transmission delays. To address this issue, a common strategy involves modifying Tor's circuit building process to reduce the likelihood of selecting lengthy circuits. However, this strategy comp…
▽ More
High latency is a critical limitation within the Tor network. A key factor exacerbating Tor latency is the creation of lengthy circuits that span across geographically distant regions, causing significant transmission delays. To address this issue, a common strategy involves modifying Tor's circuit building process to reduce the likelihood of selecting lengthy circuits. However, this strategy compromises the randomness of Tor's routing, thereby increasing the risk of deanonymization. Improving Tor's latency performance while minimizing security degradation presents a critical challenge.
This paper proposes SaTor, a latency-improving scheme for Tor using satellite routing technology. SaTor proposes equip** a targeted subset of Tor relays with satellite network access, utilizing long-distance satellite transmission to accelerate slow circuits, without biasing the existing path selection process. Our SaTor performance evaluation, using a simulator we developed coupled with real-world measurements, demonstrates that over the long-term, SaTor offers an expected speed-up of roughly 40 ms for over 70% of circuits under common conditions. This improvement necessitates outfitting the top approx. 30-40% relays with satellite access. Our research uncovers a viable way to overcome Tor's latency bottleneck, serving as a practical reference for its future enhancement.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Search for the $e^+e^- \to φχ_{c1}(3872)$ process at BESIII
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (639 additional authors not shown)
Abstract:
Based on 368.5 pb$^{-1}$ of $e^+e^-$ collision data collected at center-of-mass energies 4.914 and 4.946 GeV by the BESIII detector, the $e^+e^- \to φχ_{c1}(3872)$ process is searched for the first time. No significant signal is observed and the upper limits at the 90\% confidence level on the product of the Born cross section $σ(e^+e^- \to φχ_{c1}(3872))$ and the branching fraction…
▽ More
Based on 368.5 pb$^{-1}$ of $e^+e^-$ collision data collected at center-of-mass energies 4.914 and 4.946 GeV by the BESIII detector, the $e^+e^- \to φχ_{c1}(3872)$ process is searched for the first time. No significant signal is observed and the upper limits at the 90\% confidence level on the product of the Born cross section $σ(e^+e^- \to φχ_{c1}(3872))$ and the branching fraction $\mathcal{B}[χ_{c1}(3872)\toπ^+π^- J/ψ]$ at 4.914 and 4.946 GeV are set to be 0.85 and 0.96 pb, respectively. These measurements provide useful information for the production of the $χ_{c1}(3872)$ at $e^+e^-$ collider and deepen our understanding about the nature of this particle.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Optimizing Novelty of Top-k Recommendations using Large Language Models and Reinforcement Learning
Authors:
Amit Sharma,
Hua Li,
Xue Li,
Jian Jiao
Abstract:
Given an input query, a recommendation model is trained using user feedback data (e.g., click data) to output a ranked list of items. In real-world systems, besides accuracy, an important consideration for a new model is novelty of its top-k recommendations w.r.t. an existing deployed model. However, novelty of top-k items is a difficult goal to optimize a model for, since it involves a non-differ…
▽ More
Given an input query, a recommendation model is trained using user feedback data (e.g., click data) to output a ranked list of items. In real-world systems, besides accuracy, an important consideration for a new model is novelty of its top-k recommendations w.r.t. an existing deployed model. However, novelty of top-k items is a difficult goal to optimize a model for, since it involves a non-differentiable sorting operation on the model's predictions. Moreover, novel items, by definition, do not have any user feedback data. Given the semantic capabilities of large language models, we address these problems using a reinforcement learning (RL) formulation where large language models provide feedback for the novel items. However, given millions of candidate items, the sample complexity of a standard RL algorithm can be prohibitively high. To reduce sample complexity, we reduce the top-k list reward to a set of item-wise rewards and reformulate the state space to consist of <query, item> tuples such that the action space is reduced to a binary decision; and show that this reformulation results in a significantly lower complexity when the number of items is large. We evaluate the proposed algorithm on improving novelty for a query-ad recommendation task on a large-scale search engine. Compared to supervised finetuning on recent <query, ad> pairs, the proposed RL-based algorithm leads to significant novelty gains with minimal loss in recall. We obtain similar results on the ORCAS query-webpage matching dataset and a product recommendation dataset based on Amazon reviews.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
Prediction and Reference Quality Adaptation for Learned Video Compression
Authors:
Xihua Sheng,
Li Li,
Dong Liu,
Houqiang Li
Abstract:
Temporal prediction is one of the most important technologies for video compression. Various prediction coding modes are designed in traditional video codecs. Traditional video codecs will adaptively to decide the optimal coding mode according to the prediction quality and reference quality. Recently, learned video codecs have made great progress. However, they ignore the prediction and reference…
▽ More
Temporal prediction is one of the most important technologies for video compression. Various prediction coding modes are designed in traditional video codecs. Traditional video codecs will adaptively to decide the optimal coding mode according to the prediction quality and reference quality. Recently, learned video codecs have made great progress. However, they ignore the prediction and reference quality adaptation, which leads to incorrect utilization of temporal prediction and reconstruction error propagation. Therefore, in this paper, we first propose a confidence-based prediction quality adaptation (PQA) module to provide explicit discrimination for the spatial and channel-wise prediction quality difference. With this module, the prediction with low quality will be suppressed and that with high quality will be enhanced. The codec can adaptively decide which spatial or channel location of predictions to use. Then, we further propose a reference quality adaptation (RQA) module and an associated repeat-long training strategy to provide dynamic spatially variant filters for diverse reference qualities. With the filters, it is easier for our codec to achieve the target reconstruction quality according to reference qualities, thus reducing the propagation of reconstruction errors. Experimental results show that our codec obtains higher compression performance than the reference software of H.266/VVC and the previous state-of-the-art learned video codecs in both RGB and YUV420 colorspaces.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models
Authors:
Ziche Liu,
Rui Ke,
Feng Jiang,
Haizhou Li
Abstract:
Data selection for fine-tuning Large Language Models (LLMs) aims to select a high-quality subset from a given candidate dataset to train a Pending Fine-tune Model (PFM) into a Selective-Enhanced Model (SEM). It can improve the model performance and accelerate the training process. Although a few surveys have investigated related works of data selection, there is a lack of comprehensive comparison…
▽ More
Data selection for fine-tuning Large Language Models (LLMs) aims to select a high-quality subset from a given candidate dataset to train a Pending Fine-tune Model (PFM) into a Selective-Enhanced Model (SEM). It can improve the model performance and accelerate the training process. Although a few surveys have investigated related works of data selection, there is a lack of comprehensive comparison between existing methods due to their various experimental settings. To address this issue, we first propose a three-stage scheme for data selection and comprehensively review existing works according to this scheme. Then, we design a unified comparing method with ratio-based efficiency indicators and ranking-based feasibility indicators to overcome the difficulty of comparing various models with diverse experimental settings. After an in-depth comparative analysis, we find that the more targeted method with data-specific and model-specific quality labels has higher efficiency, but the introduction of additional noise information should be avoided when designing selection algorithms. Finally, we summarize the trends in data selection and highlight the short-term and long-term challenges to guide future research.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
PAPR Reduction with Pre-chirp Selection for Affine Frequency Division Multiple
Authors:
Haozhi Yuan,
Yin Xu,
Xinghao Guo,
Tianyao Ma,
Haoyang Li,
Dazhi He,
Wenjun Zhang
Abstract:
Affine frequency division multiplexing (AFDM) is a promising new multicarrier technique based on discrete affine Fourier transform (DAFT). By properly tuning pre-chirp parameter and post-chirp parameter in the DAFT, the effective channel in the DAFT domain can completely avoid overlap of different paths, thus constitutes a full representation of delay-Doppler profile, which significantly improves…
▽ More
Affine frequency division multiplexing (AFDM) is a promising new multicarrier technique based on discrete affine Fourier transform (DAFT). By properly tuning pre-chirp parameter and post-chirp parameter in the DAFT, the effective channel in the DAFT domain can completely avoid overlap of different paths, thus constitutes a full representation of delay-Doppler profile, which significantly improves the system performance in high mobility scenarios. However, AFDM has the crucial problem of high peak-to-average power ratio (PAPR) caused by phase randomness of modulated symbols. In this letter, an algorithm named grouped pre-chirp selection (GPS) is proposed to reduce the PAPR by changing the value of pre-chirp parameter on sub-carriers group by group. Specifically, it is demonstrated first that the important properties of AFDM system are maintained when implementing GPS. Secondly, we elaborate the operation steps of GPS algorithm, illustrating its effect on PAPR reduction and its advantage in terms of computational complexity compared with the ungrouped approach. Finally, simulation results of PAPR reduction in the form of complementary cumulative distribution function (CCDF) show the effectiveness of the proposed GPS algorithm.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
Gaze-directed Vision GNN for Mitigating Shortcut Learning in Medical Image
Authors:
Shaoxuan Wu,
Xiao Zhang,
Bin Wang,
Zhuo **,
Hansheng Li,
Jun Feng
Abstract:
Deep neural networks have demonstrated remarkable performance in medical image analysis. However, its susceptibility to spurious correlations due to shortcut learning raises concerns about network interpretability and reliability. Furthermore, shortcut learning is exacerbated in medical contexts where disease indicators are often subtle and sparse. In this paper, we propose a novel gaze-directed V…
▽ More
Deep neural networks have demonstrated remarkable performance in medical image analysis. However, its susceptibility to spurious correlations due to shortcut learning raises concerns about network interpretability and reliability. Furthermore, shortcut learning is exacerbated in medical contexts where disease indicators are often subtle and sparse. In this paper, we propose a novel gaze-directed Vision GNN (called GD-ViG) to leverage the visual patterns of radiologists from gaze as expert knowledge, directing the network toward disease-relevant regions, and thereby mitigating shortcut learning. GD-ViG consists of a gaze map generator (GMG) and a gaze-directed classifier (GDC). Combining the global modelling ability of GNNs with the locality of CNNs, GMG generates the gaze map based on radiologists' visual patterns. Notably, it eliminates the need for real gaze data during inference, enhancing the network's practical applicability. Utilizing gaze as the expert knowledge, the GDC directs the construction of graph structures by incorporating both feature distances and gaze distances, enabling the network to focus on disease-relevant foregrounds. Thereby avoiding shortcut learning and improving the network's interpretability. The experiments on two public medical image datasets demonstrate that GD-ViG outperforms the state-of-the-art methods, and effectively mitigates shortcut learning. Our code is available at https://github.com/SX-SS/GD-ViG.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
EAGER: Two-Stream Generative Recommender with Behavior-Semantic Collaboration
Authors:
Ye Wang,
Jiahao Xun,
Mingjie Hong,
Jieming Zhu,
Tao **,
Wang Lin,
Haoyuan Li,
Linjun Li,
Yan Xia,
Zhou Zhao,
Zhenhua Dong
Abstract:
Generative retrieval has recently emerged as a promising approach to sequential recommendation, framing candidate item retrieval as an autoregressive sequence generation problem. However, existing generative methods typically focus solely on either behavioral or semantic aspects of item information, neglecting their complementary nature and thus resulting in limited effectiveness. To address this…
▽ More
Generative retrieval has recently emerged as a promising approach to sequential recommendation, framing candidate item retrieval as an autoregressive sequence generation problem. However, existing generative methods typically focus solely on either behavioral or semantic aspects of item information, neglecting their complementary nature and thus resulting in limited effectiveness. To address this limitation, we introduce EAGER, a novel generative recommendation framework that seamlessly integrates both behavioral and semantic information. Specifically, we identify three key challenges in combining these two types of information: a unified generative architecture capable of handling two feature types, ensuring sufficient and independent learning for each type, and fostering subtle interactions that enhance collaborative information utilization. To achieve these goals, we propose (1) a two-stream generation architecture leveraging a shared encoder and two separate decoders to decode behavior tokens and semantic tokens with a confidence-based ranking strategy; (2) a global contrastive task with summary tokens to achieve discriminative decoding for each type of information; and (3) a semantic-guided transfer task designed to implicitly promote cross-interactions through reconstruction and estimation objectives. We validate the effectiveness of EAGER on four public benchmarks, demonstrating its superior performance compared to existing methods.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
Knowledge Tagging System on Math Questions via LLMs with Flexible Demonstration Retriever
Authors:
Hang Li,
Tianlong Xu,
Jiliang Tang,
Qingsong Wen
Abstract:
Knowledge tagging for questions plays a crucial role in contemporary intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations are always conducted by pedagogical experts, as the task requires not only a strong semantic understanding of both question stems and knowledge definitio…
▽ More
Knowledge tagging for questions plays a crucial role in contemporary intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations are always conducted by pedagogical experts, as the task requires not only a strong semantic understanding of both question stems and knowledge definitions but also deep insights into connecting question-solving logic with corresponding knowledge concepts. With the recent emergence of advanced text encoding algorithms, such as pre-trained language models, many researchers have developed automatic knowledge tagging systems based on calculating the semantic similarity between the knowledge and question embeddings. In this paper, we explore automating the task using Large Language Models (LLMs), in response to the inability of prior encoding-based methods to deal with the hard cases which involve strong domain knowledge and complicated concept definitions. By showing the strong performance of zero- and few-shot results over math questions knowledge tagging tasks, we demonstrate LLMs' great potential in conquering the challenges faced by prior methods. Furthermore, by proposing a reinforcement learning-based demonstration retriever, we successfully exploit the great potential of different-sized LLMs in achieving better performance results while kee** the in-context demonstration usage efficiency high.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Enhance the Image: Super Resolution using Artificial Intelligence in MRI
Authors:
Ziyu Li,
Zihan Li,
Haoxiang Li,
Qiuyun Fan,
Karla L. Miller,
Wenchuan Wu,
Akshay S. Chaudhari,
Qiyuan Tian
Abstract:
This chapter provides an overview of deep learning techniques for improving the spatial resolution of MRI, ranging from convolutional neural networks, generative adversarial networks, to more advanced models including transformers, diffusion models, and implicit neural representations. Our exploration extends beyond the methodologies to scrutinize the impact of super-resolved images on clinical an…
▽ More
This chapter provides an overview of deep learning techniques for improving the spatial resolution of MRI, ranging from convolutional neural networks, generative adversarial networks, to more advanced models including transformers, diffusion models, and implicit neural representations. Our exploration extends beyond the methodologies to scrutinize the impact of super-resolved images on clinical and neuroscientific assessments. We also cover various practical topics such as network architectures, image evaluation metrics, network loss functions, and training data specifics, including downsampling methods for simulating low-resolution images and dataset selection. Finally, we discuss existing challenges and potential future directions regarding the feasibility and reliability of deep learning-based MRI super-resolution, with the aim to facilitate its wider adoption to benefit various clinical and neuroscientific applications.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments
Authors:
Yuhan Zhu,
Jian Wang,
Bing Li,
Xuxian Tang,
Hao Li,
Neng Zhang,
Yuqi Zhao
Abstract:
With the development of cloud-native technologies, microservice-based software systems face challenges in accurately localizing root causes when failures occur. Additionally, the cloud-edge collaborative environment introduces more difficulties, such as unstable networks and high latency across network segments. Accurately identifying the root cause of microservices in a cloud-edge collaborative e…
▽ More
With the development of cloud-native technologies, microservice-based software systems face challenges in accurately localizing root causes when failures occur. Additionally, the cloud-edge collaborative environment introduces more difficulties, such as unstable networks and high latency across network segments. Accurately identifying the root cause of microservices in a cloud-edge collaborative environment has thus become an urgent problem. In this paper, we propose MicroCERCL, a novel approach that pinpoints root causes at the kernel and application level in the cloud-edge collaborative environment. Our key insight is that failures propagate through direct invocations and indirect resource-competition dependencies in a cloud-edge collaborative environment characterized by instability and high latency. This will become more complex in the hybrid deployment that simultaneously involves multiple microservice systems. Leveraging this insight, we extract valid contents from kernel-level logs to prioritize localizing the kernel-level root cause. Moreover, we construct a heterogeneous dynamic topology stack and train a graph neural network model to accurately localize the application-level root cause without relying on historical data. Notably, we released the first benchmark hybrid deployment microservice system in a cloud-edge collaborative environment (the largest and most complex within our knowledge). Experiments conducted on the dataset collected from the benchmark show that MicroCERCL can accurately localize the root cause of microservice systems in such environments, significantly outperforming state-of-the-art approaches with an increase of at least 24.1% in top-1 accuracy.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Farey tree locking of terahertz semiconductor laser frequency combs
Authors:
Guibin Liu,
Xuhong Ma,
Kang Zhou,
Binbin Liu,
Lulu Zheng,
Xianglong Bi,
Shumin Wu,
Yanming Lu,
Zi** Li,
Wenjian Wan,
Zhenzhen Zhang,
Junsong Peng,
Ya Zhang,
He** Zeng,
Hua Li
Abstract:
Frequency combs show various applications in molecular fingerprinting, imaging, communications, and so on. In the terahertz frequency range, semiconductor-based quantum cascade lasers (QCLs) are ideal platforms for realizing the frequency comb operation. Although self-started frequency comb operation can be obtained in free-running terahertz QCLs due to the four-wave mixing locking effects, resona…
▽ More
Frequency combs show various applications in molecular fingerprinting, imaging, communications, and so on. In the terahertz frequency range, semiconductor-based quantum cascade lasers (QCLs) are ideal platforms for realizing the frequency comb operation. Although self-started frequency comb operation can be obtained in free-running terahertz QCLs due to the four-wave mixing locking effects, resonant/off-resonant microwave injection, phase locking, and femtosecond laser based locking techniques have been widely used to broaden and stabilize terahertz QCL combs. These active locking methods indeed show significant effects on the frequency stabilization of terahertz QCL combs, but they simultaneously have drawbacks, such as introducing large phase noise and requiring complex optical coupling and/or electrical circuits. Here, we demonstrate Farey tree locking of terahertz QCL frequency combs under microwave injection. The frequency competition between the Farey fraction frequency and the cavity round-trip frequency results in the frequency locking of terahertz QCL combs, and the Farey fraction frequencies can be accurately anticipated based on the downward trend of the Farey tree hierarchy. Furthermore, dual-comb experimental results show that the phase noise of the dual-comb spectral lines is significantly reduced by employing the Farey tree locking method. These results pave the way to deploying compact and low phase noise terahertz frequency comb sources.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Stabilizing the Kerr arbitrary cat states and holonomic universal control
Authors:
Ke-hui Yu,
Fan Zhu,
Jiao-jiao Xue,
Hong-rong Li
Abstract:
The interference-free double potential wells realized by the two-photon driving Kerr nonlinear resonator (KNR) can stabilize cat states and protect them from decoherence through a large energy gap. In this work, we use a parametrically driving KNR to propose a novel engineering Hamiltonian that can stabilize arbitrary cat states and independently manipulate the superposed coherent states to move a…
▽ More
The interference-free double potential wells realized by the two-photon driving Kerr nonlinear resonator (KNR) can stabilize cat states and protect them from decoherence through a large energy gap. In this work, we use a parametrically driving KNR to propose a novel engineering Hamiltonian that can stabilize arbitrary cat states and independently manipulate the superposed coherent states to move arbitrarily in phase space. This greater degree of control allows us to make the two potential wells collide and merge, generating a collision state with many novel properties. Furthermore, the potential wells carrying quantum states move adiabatically in phase space produce quantum holonomy. We explore the quantum holonomy of collision states for the first time and propose a holonomy-free preparation method for arbitrary cat states. Additionally, we develop a universal holonomic quantum computing protocol utilizing the quantum holonomy of coherent and collision states, including single-qubit rotation gates and multi-qubit control gates. Finally, we propose an experimentally feasible physical realization in superconducting circuits to achieve the Hamiltonian described above. Our proposal provides a platform with greater control degrees of freedom, enabling more operations on bosonic modes and the exploration of intriguing physics.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
Authors:
Junyi Ao,
Yuancheng Wang,
Xiaohai Tian,
Dekun Chen,
Jun Zhang,
Lu Lu,
Yuxuan Wang,
Haizhou Li,
Zhizheng Wu
Abstract:
Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, includin…
▽ More
Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a similar process as SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation using objective evaluation methods (e.g. BLEU and ROUGE), subjective evaluations and LLM-based metrics for the generated responses. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate LLM-based metrics show a higher correlation with human evaluation compared to traditional metrics. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Measurement of Spin-Density Matrix Elements in $Δ^{++}(1232)$ photoproduction
Authors:
F. Afzal,
C. S. Akondi,
M. Albrecht,
M. Amaryan,
S. Arrigo,
V. Arroyave,
A. Asaturyan,
A. Austregesilo,
Z. Baldwin,
F. Barbosa,
J. Barlow,
E. Barriga,
R. Barsotti,
D. Barton,
V. Baturin,
V. V. Berdnikov,
T. Black,
W. Boeglin,
M. Boer,
W. J. Briscoe,
T. Britton,
S. Cao,
E. Chudakov,
G. Chung,
P. L. Cole
, et al. (124 additional authors not shown)
Abstract:
We report the measurement of spin-density matrix elements of the $Δ^{++}(1232)$ in the photoproduction reaction $γp \to π^-Δ^{++}(1232)$ with the GlueX experiment in Hall D at Jefferson Lab. The measurement used a linearly polarized photon beam with $E_γ=8.2-8.8$~GeV and the statistical precision exceeds the previous measurement from SLAC by three orders of magnitude for the momentum transfer squa…
▽ More
We report the measurement of spin-density matrix elements of the $Δ^{++}(1232)$ in the photoproduction reaction $γp \to π^-Δ^{++}(1232)$ with the GlueX experiment in Hall D at Jefferson Lab. The measurement used a linearly polarized photon beam with $E_γ=8.2-8.8$~GeV and the statistical precision exceeds the previous measurement from SLAC by three orders of magnitude for the momentum transfer squared region $-t < 1.4$ GeV$^2$. The data are sensitive to the previously undetermined relative sign between couplings in existing Regge exchange models. Linear combinations of the extracted SDMEs allow for a decomposition into natural and unnatural exchange amplitudes, which shows that the unnatural exchange plays an important role in the low $-t$ region.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Composited-Nested-Learning with Data Augmentation for Nested Named Entity Recognition
Authors:
Xingming Liao,
Nankai Lin,
Haowen Li,
Lianglun Cheng,
Zhuowei Wang,
Chong Chen
Abstract:
Nested Named Entity Recognition (NNER) focuses on addressing overlapped entity recognition. Compared to Flat Named Entity Recognition (FNER), annotated resources are scarce in the corpus for NNER. Data augmentation is an effective approach to address the insufficient annotated corpus. However, there is a significant lack of exploration in data augmentation methods for NNER. Due to the presence of…
▽ More
Nested Named Entity Recognition (NNER) focuses on addressing overlapped entity recognition. Compared to Flat Named Entity Recognition (FNER), annotated resources are scarce in the corpus for NNER. Data augmentation is an effective approach to address the insufficient annotated corpus. However, there is a significant lack of exploration in data augmentation methods for NNER. Due to the presence of nested entities in NNER, existing data augmentation methods cannot be directly applied to NNER tasks. Therefore, in this work, we focus on data augmentation for NNER and resort to more expressive structures, Composited-Nested-Label Classification (CNLC) in which constituents are combined by nested-word and nested-label, to model nested entities. The dataset is augmented using the Composited-Nested-Learning (CNL). In addition, we propose the Confidence Filtering Mechanism (CFM) for a more efficient selection of generated data. Experimental results demonstrate that this approach results in improvements in ACE2004 and ACE2005 and alleviates the impact of sample imbalance.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Rationale-based Ensemble of Multiple QA Strategies for Zero-shot Knowledge-based VQA
Authors:
Miaoyu Li,
Haoxin Li,
Zilin Du,
Boyang Li
Abstract:
Knowledge-based Visual Qustion-answering (K-VQA) necessitates the use of background knowledge beyond what is depicted in the image. Current zero-shot K-VQA methods usually translate an image to a single type of textual decision context and use a text-based model to answer the question based on it, which conflicts with the fact that K-VQA questions often require the combination of multiple question…
▽ More
Knowledge-based Visual Qustion-answering (K-VQA) necessitates the use of background knowledge beyond what is depicted in the image. Current zero-shot K-VQA methods usually translate an image to a single type of textual decision context and use a text-based model to answer the question based on it, which conflicts with the fact that K-VQA questions often require the combination of multiple question-answering strategies. In light of this, we propose Rationale-based Ensemble of Answer Context Tactics (REACT) to achieve a dynamic ensemble of multiple question-answering tactics, comprising Answer Candidate Generation (ACG) and Rationale-based Strategy Fusion (RSF). In ACG, we generate three distinctive decision contexts to provide different strategies for each question, resulting in the generation of three answer candidates. RSF generates automatic and mechanistic rationales from decision contexts for each candidate, allowing the model to select the correct answer from all candidates. We conduct comprehensive experiments on the OK-VQA and A-OKVQA datasets, and our method significantly outperforms state-of-the-art LLM-based baselines on all datasets.
△ Less
Submitted 22 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Tactile SoftHand-A: 3D-Printed, Tactile, Highly-underactuated, Anthropomorphic Robot Hand with an Antagonistic Tendon Mechanism
Authors:
Haoran Li,
Christopher J. Ford,
Chenghua Lu,
Yijiong Lin,
Matteo Bianchi,
Manuel G. Catalano,
Efi Psomopoulou,
Nathan F. Lepora
Abstract:
For tendon-driven multi-fingered robotic hands, ensuring grasp adaptability while minimizing the number of actuators needed to provide human-like functionality is a challenging problem. Inspired by the Pisa/IIT SoftHand, this paper introduces a 3D-printed, highly-underactuated, five-finger robotic hand named the Tactile SoftHand-A, which features only two actuators. The dual-tendon design allows f…
▽ More
For tendon-driven multi-fingered robotic hands, ensuring grasp adaptability while minimizing the number of actuators needed to provide human-like functionality is a challenging problem. Inspired by the Pisa/IIT SoftHand, this paper introduces a 3D-printed, highly-underactuated, five-finger robotic hand named the Tactile SoftHand-A, which features only two actuators. The dual-tendon design allows for the active control of specific (distal or proximal interphalangeal) joints to adjust the hand's grasp gesture. We have also developed a new design of fully 3D-printed tactile sensor that requires no hand assembly and is printed directly as part of the robotic finger. This sensor is integrated into the fingertips and combined with the antagonistic tendon mechanism to develop a human-hand-guided tactile feedback gras** system. The system can actively mirror human hand gestures, adaptively stabilize grasp gestures upon contact, and adjust grasp gestures to prevent object movement after detecting slippage. Finally, we designed four different experiments to evaluate the novel fingers coupled with the antagonistic mechanism for controlling the robotic hand's gestures, adaptive gras** ability, and human-hand-guided tactile feedback gras** capability. The experimental results demonstrate that the Tactile SoftHand-A can adaptively grasp objects of a wide range of shapes and automatically adjust its grip** gestures upon detecting contact and slippage. Overall, this study points the way towards a class of low-cost, accessible, 3D-printable, underactuated human-like robotic hands, and we openly release the designs to facilitate others to build upon this work. This work is Open-sourced at github.com/SoutheastWind/Tactile_SoftHand_A
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
ED-sKWS: Early-Decision Spiking Neural Networks for Rapid,and Energy-Efficient Keyword Spotting
Authors:
Zeyang Song,
Qianhui Liu,
Qu Yang,
Yizhou Peng,
Haizhou Li
Abstract:
Keyword Spotting (KWS) is essential in edge computing requiring rapid and energy-efficient responses. Spiking Neural Networks (SNNs) are well-suited for KWS for their efficiency and temporal capacity for speech. To further reduce the latency and energy consumption, this study introduces ED-sKWS, an SNN-based KWS model with an early-decision mechanism that can stop speech processing and output the…
▽ More
Keyword Spotting (KWS) is essential in edge computing requiring rapid and energy-efficient responses. Spiking Neural Networks (SNNs) are well-suited for KWS for their efficiency and temporal capacity for speech. To further reduce the latency and energy consumption, this study introduces ED-sKWS, an SNN-based KWS model with an early-decision mechanism that can stop speech processing and output the result before the end of speech utterance. Furthermore, we introduce a Cumulative Temporal (CT) loss that can enhance prediction accuracy at both the intermediate and final timesteps. To evaluate early-decision performance, we present the SC-100 dataset including 100 speech commands with beginning and end timestamp annotation. Experiments on the Google Speech Commands v2 and our SC-100 datasets show that ED-sKWS maintains competitive accuracy with 61% timesteps and 52% energy consumption compared to SNN models without early-decision mechanism, ensuring rapid response and energy efficiency.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
The small boundary property in products
Authors:
David Kerr,
Hanfeng Li
Abstract:
For a continuous action $G\curvearrowright X$ of a countable group on a compact metrizable space we show that the following are equivalent: (i) the action $G\curvearrowright X$ has the small boundary property and no finite orbits, (ii) for every continuous action $H\curvearrowright Y$ of a countable group on a compact metrizable space, the product action $G\times H\curvearrowright X\times Y$ has t…
▽ More
For a continuous action $G\curvearrowright X$ of a countable group on a compact metrizable space we show that the following are equivalent: (i) the action $G\curvearrowright X$ has the small boundary property and no finite orbits, (ii) for every continuous action $H\curvearrowright Y$ of a countable group on a compact metrizable space, the product action $G\times H\curvearrowright X\times Y$ has the small boundary property. In particular, (ii) is automatic when $G$ is infinite and the action $G\curvearrowright X$ is minimal and has the small boundary property. The argument relies on a small boundary version of the Urysohn lemma.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Formation of the Supersonic Stellar Wind: Parker's Theory Revisited
Authors:
Paul Song,
Jiannan Tu,
Stanley W. H. Cowley,
Chi Wang,
Hui Li
Abstract:
We examine the classical theory of stellar wind formation. The theory requires that to form a supersonic stellar wind, a subsonic flow speed must start at a specific initial speed from the coronal base, called eigenspeed, go along a continuous eigenfunction, and reach the sonic point, which is where the flow speed equals the sonic speed, while the critical condition, which is where the effective d…
▽ More
We examine the classical theory of stellar wind formation. The theory requires that to form a supersonic stellar wind, a subsonic flow speed must start at a specific initial speed from the coronal base, called eigenspeed, go along a continuous eigenfunction, and reach the sonic point, which is where the flow speed equals the sonic speed, while the critical condition, which is where the effective driving force is zero, is satisfied. Any mismatch between the sonic point and critical condition distances results in either subsonic winds when the initial speed is below the eigenspeed, or no stellar wind when the initial speed is above the eigenspeed. However, because the critical condition is determined by the stellar wind temperature profile, which depends on ionization process at the top of the chromosphere and the heating process around the coronal base but not by the processes at the sonic point, the requirement of sonic point satisfying the critical condition is generally not met and hence the momentum equation in the conventional theory encounters difficulty when the initial speed is above the eigenspeed. To resolve the difficulty, we propose a discontinuity between the sonic point and the critical condition to reach supersonic stellar wind solutions. As a result, supersonic stellar winds can be produced when the initial speed in the coronal base is greater than the eigenspeed. The critical solution or eigen function provided by the conventional stellar wind model describes the condition that separates the supersonic stellar winds from subsonic ones.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding
Authors:
Linrui Xu,
Ling Zhao,
Wang Guo,
Qiujun Li,
Kewang Long,
Kaiqi Zou,
Yuhan Wang,
Haifeng Li
Abstract:
The remote sensing image intelligence understanding model is undergoing a new profound paradigm shift which has been promoted by multi-modal large language model (MLLM), i.e. from the paradigm learning a domain model (LaDM) shifts to paradigm learning a pre-trained general foundation model followed by an adaptive domain model (LaGD). Under the new LaGD paradigm, the old datasets, which have led to…
▽ More
The remote sensing image intelligence understanding model is undergoing a new profound paradigm shift which has been promoted by multi-modal large language model (MLLM), i.e. from the paradigm learning a domain model (LaDM) shifts to paradigm learning a pre-trained general foundation model followed by an adaptive domain model (LaGD). Under the new LaGD paradigm, the old datasets, which have led to advances in RSI intelligence understanding in the last decade, are no longer suitable for fire-new tasks. We argued that a new dataset must be designed to lighten tasks with the following features: 1) Generalization: training model to learn shared knowledge among tasks and to adapt to different tasks; 2) Understanding complex scenes: training model to understand the fine-grained attribute of the objects of interest, and to be able to describe the scene with natural language; 3) Reasoning: training model to be able to realize high-level visual reasoning. In this paper, we designed a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding produced by GPT-4V and existing datasets, which we called RS-GPT4V. To achieve generalization, we used a (Question, Answer) which was deduced from GPT-4V via instruction-following to unify the tasks such as captioning and localization; To achieve complex scene, we proposed a hierarchical instruction description with local strategy in which the fine-grained attributes of the objects and their spatial relationships are described and global strategy in which all the local information are integrated to yield detailed instruction descript; To achieve reasoning, we designed multiple-turn QA pair to provide the reasoning ability for a model. The empirical results show that the fine-tuned MLLMs by RS-GPT4V can describe fine-grained information. The dataset is available at: https://github.com/GeoX-Lab/RS-GPT4V.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Text-aware Speech Separation for Multi-talker Keyword Spotting
Authors:
Haoyu Li,
Baochen Yang,
Yu Xi,
Linfeng Yu,
Tian Tan,
Hao Li,
Kai Yu
Abstract:
For noisy environments, ensuring the robustness of keyword spotting (KWS) systems is essential. While much research has focused on noisy KWS, less attention has been paid to multi-talker mixed speech scenarios. Unlike the usual cocktail party problem where multi-talker speech is separated using speaker clues, the key challenge here is to extract the target speech for KWS based on text clues. To ad…
▽ More
For noisy environments, ensuring the robustness of keyword spotting (KWS) systems is essential. While much research has focused on noisy KWS, less attention has been paid to multi-talker mixed speech scenarios. Unlike the usual cocktail party problem where multi-talker speech is separated using speaker clues, the key challenge here is to extract the target speech for KWS based on text clues. To address it, this paper proposes a novel Text-aware Permutation Determinization Training method for multi-talker KWS with a clue-based Speech Separation front-end (TPDT-SS). Our research highlights the critical role of SS front-ends and shows that incorporating keyword-specific clues into these models can greatly enhance the effectiveness. TPDT-SS shows remarkable success in addressing permutation problems in mixed keyword speech, thereby greatly boosting the performance of the backend. Additionally, fine-tuning our system on unseen mixed speech results in further performance improvement.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
PruningBench: A Comprehensive Benchmark of Structural Pruning
Authors:
Haoling Li,
Changhao Li,
Mengqi Xue,
Gongfan Fang,
Sheng Zhou,
Zunlei Feng,
Huiqiong Wang,
Yong Wang,
Lechao Cheng,
Mingli Song,
Jie Song
Abstract:
Structural pruning has emerged as a promising approach for producing more efficient models. Nevertheless, the community suffers from a lack of standardized benchmarks and metrics, leaving the progress in this area not fully comprehended. To fill this gap, we present the first comprehensive benchmark, termed \textit{PruningBench}, for structural pruning. PruningBench showcases the following three c…
▽ More
Structural pruning has emerged as a promising approach for producing more efficient models. Nevertheless, the community suffers from a lack of standardized benchmarks and metrics, leaving the progress in this area not fully comprehended. To fill this gap, we present the first comprehensive benchmark, termed \textit{PruningBench}, for structural pruning. PruningBench showcases the following three characteristics: 1) PruningBench employs a unified and consistent framework for evaluating the effectiveness of diverse structural pruning techniques; 2) PruningBench systematically evaluates 16 existing pruning methods, encompassing a wide array of models (e.g., CNNs and ViTs) and tasks (e.g., classification and detection); 3) PruningBench provides easily implementable interfaces to facilitate the implementation of future pruning methods, and enables the subsequent researchers to incorporate their work into our leaderboards. We provide an online pruning platform http://pruning.vipazoo.cn for customizing pruning tasks and reproducing all results in this paper. Codes will be made publicly on https://github.com/HollyLee2000/PruningBench.
△ Less
Submitted 28 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
DASSF: Dynamic-Attention Scale-Sequence Fusion for Aerial Object Detection
Authors:
Haodong Li,
Haicheng Qu
Abstract:
The detection of small objects in aerial images is a fundamental task in the field of computer vision. Moving objects in aerial photography have problems such as different shapes and sizes, dense overlap, occlusion by the background, and object blur, however, the original YOLO algorithm has low overall detection accuracy due to its weak ability to perceive targets of different scales. In order to…
▽ More
The detection of small objects in aerial images is a fundamental task in the field of computer vision. Moving objects in aerial photography have problems such as different shapes and sizes, dense overlap, occlusion by the background, and object blur, however, the original YOLO algorithm has low overall detection accuracy due to its weak ability to perceive targets of different scales. In order to improve the detection accuracy of densely overlap** small targets and fuzzy targets, this paper proposes a dynamic-attention scale-sequence fusion algorithm (DASSF) for small target detection in aerial images. First, we propose a dynamic scale sequence feature fusion (DSSFF) module that improves the up-sampling mechanism and reduces computational load. Secondly, a x-small object detection head is specially added to enhance the detection capability of small targets. Finally, in order to improve the expressive ability of targets of different types and sizes, we use the dynamic head (DyHead). The model we proposed solves the problem of small target detection in aerial images and can be applied to multiple different versions of the YOLO algorithm, which is universal. Experimental results show that when the DASSF method is applied to YOLOv8, compared to YOLOv8n, on the VisDrone-2019 and DIOR datasets, the model shows an increase of 9.2% and 2.4% in the mean average precision (mAP), respectively, and outperforms the current mainstream methods.
△ Less
Submitted 22 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Precision measurement of the $Ξ^-_b$ baryon lifetime
Authors:
LHCb collaboration,
R. Aaij,
A. S. W. Abdelmotteleb,
C. Abellan Beteta,
F. Abudinén,
T. Ackernley,
A. A. Adefisoye,
B. Adeva,
M. Adinolfi,
P. Adlarson,
C. Agapopoulou,
C. A. Aidala,
Z. Ajaltouni,
S. Akar,
K. Akiba,
P. Albicocco,
J. Albrecht,
F. Alessio,
M. Alexander,
Z. Aliouche,
P. Alvarez Cartelle,
R. Amalric,
S. Amato,
J. L. Amey,
Y. Amhis
, et al. (1064 additional authors not shown)
Abstract:
A sample of $pp$ collision data, corresponding to an integrated luminosity of 5.5 fb$^{-1}$ and collected by the LHCb experiment during Run 2, is used to measure the ratio of the lifetime of the $Ξ^-_b$ baryon to that of the $Λ^0_b$ baryon, $r_τ\equivτ_{Ξ^-_b}/τ_{Λ^0_b}$. The value ${r_τ^{\rm Run\,2}=1.076\pm0.013\pm0.006}$ is obtained, where the first uncertainty is statistical and the second sys…
▽ More
A sample of $pp$ collision data, corresponding to an integrated luminosity of 5.5 fb$^{-1}$ and collected by the LHCb experiment during Run 2, is used to measure the ratio of the lifetime of the $Ξ^-_b$ baryon to that of the $Λ^0_b$ baryon, $r_τ\equivτ_{Ξ^-_b}/τ_{Λ^0_b}$. The value ${r_τ^{\rm Run\,2}=1.076\pm0.013\pm0.006}$ is obtained, where the first uncertainty is statistical and the second systematic. This value is averaged with the corresponding value from Run 1 to obtain ${r_τ^{\rm Run\,1,2} = 1.078\pm0.012\pm0.007}$. Multiplying by the world-average value of the $Λ^0_b$ lifetime yields $τ_{Ξ^-_b}^{\rm Run~1,2} = 1.578\pm0.018\pm0.010\pm0.011$ ps, where the uncertainties are statistical, systematic, and due to the limited knowledge of the $Λ^0_b$ lifetime. This measurement improves the precision of the current world average of the $Ξ^-_b$ lifetime by about a factor of two, and is in good agreement with the most recent theoretical predictions.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Unraveling the Mechanics of Learning-Based Demonstration Selection for In-Context Learning
Authors:
Hui Liu,
Wenya Wang,
Hao Sun,
Chris Xing Tian,
Chenqi Kong,
Xin Dong,
Haoliang Li
Abstract:
Large Language Models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities from few-shot demonstration exemplars. While recent learning-based demonstration selection methods have proven beneficial to ICL by choosing more useful exemplars, their underlying mechanisms are opaque, hindering efforts to address limitations such as high training costs and poor generalization across…
▽ More
Large Language Models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities from few-shot demonstration exemplars. While recent learning-based demonstration selection methods have proven beneficial to ICL by choosing more useful exemplars, their underlying mechanisms are opaque, hindering efforts to address limitations such as high training costs and poor generalization across tasks. These methods generally assume the selection process captures similarities between the exemplar and the target instance, however, it remains unknown what kinds of similarities are captured and vital to performing ICL. To dive into this question, we analyze the working mechanisms of the learning-based demonstration selection methods and empirically identify two important factors related to similarity measurement: 1) The ability to integrate different levels of task-agnostic text similarities between the input of exemplars and test cases enhances generalization power across different tasks. 2) Incorporating task-specific labels when measuring the similarities significantly improves the performance on each specific task. We validate these two findings through extensive quantitative and qualitative analyses across ten datasets and various LLMs. Based on our findings, we introduce two effective yet simplified exemplar selection methods catering to task-agnostic and task-specific demands, eliminating the costly LLM inference overhead.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Autoregressive Image Generation without Vector Quantization
Authors:
Tianhong Li,
Yonglong Tian,
He Li,
Mingyang Deng,
Kaiming He
Abstract:
Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space can facilitate representing a categorical distribution, it is not a necessity for autoregressive modeling. In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us t…
▽ More
Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space can facilitate representing a categorical distribution, it is not a necessity for autoregressive modeling. In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space. Rather than using categorical cross-entropy loss, we define a Diffusion Loss function to model the per-token probability. This approach eliminates the need for discrete-valued tokenizers. We evaluate its effectiveness across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants. By removing vector quantization, our image generator achieves strong results while enjoying the speed advantage of sequence modeling. We hope this work will motivate the use of autoregressive generation in other continuous-valued domains and applications.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
Authors:
Bingqi Ma,
Zhuofan Zong,
Guanglu Song,
Hongsheng Li,
Yu Liu
Abstract:
Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observed an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the…
▽ More
Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observed an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the prompt-following ability in image generation. We identified two main obstacles behind this issue. One is the misalignment between the next token prediction training in LLM and the requirement for discriminative prompt features in diffusion models. The other is the intrinsic positional bias introduced by the decoder-only architecture. To deal with this issue, we propose a novel framework to fully harness the capabilities of LLMs. Through the carefully designed usage guidance, we effectively enhance the text representation capability for prompt encoding and eliminate its inherent positional bias. This allows us to integrate state-of-the-art LLMs into the text-to-image generation model flexibly. Furthermore, we also provide an effective manner to fuse multiple LLMs into our framework. Considering the excellent performance and scaling capabilities demonstrated by the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework. We conduct extensive experiments to validate LI-DiT across model size and data size. Benefiting from the inherent ability of the LLMs and our innovative designs, the prompt understanding performance of LI-DiT easily surpasses state-of-the-art open-source models as well as mainstream closed-source commercial models including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The powerful LI-DiT-10B will be available through the online platform and API after further optimization and security checks.
△ Less
Submitted 21 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Nemotron-4 340B Technical Report
Authors:
Nvidia,
:,
Bo Adler,
Niket Agarwal,
Ashwath Aithal,
Dong H. Anh,
Pallab Bhattacharya,
Annika Brundyn,
Jared Casper,
Bryan Catanzaro,
Sharon Clay,
Jonathan Cohen,
Sirshak Das,
Ayush Dattagupta,
Olivier Delalleau,
Leon Derczynski,
Yi Dong,
Daniel Egert,
Ellie Evans,
Aleksander Ficek,
Denys Fridman,
Shaona Ghosh,
Boris Ginsburg,
Igor Gitman,
Tomasz Grzegorzek
, et al. (58 additional authors not shown)
Abstract:
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation be…
▽ More
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Association between a Failed Prominence Eruption and the Drainage of Mass from Another Prominence
Authors:
Jianchao Xue,
Li Feng,
Hui Li,
** Zhang,
Jun Chen,
Guanglu Shi,
Kaifan Ji,
Ye Qiu,
Chuan Li,
Lei Lu,
Beili Ying,
Ying Li,
Yu Huang,
You** Li,
**gwei Li,
Jie Zhao,
Dechao Song,
Shuting Li,
Zhengyuan Tian,
Yingna Su,
Qingmin Zhang,
Yunyi Ge,
Jiahui Shan,
Qiao Li,
Gen Li
, et al. (9 additional authors not shown)
Abstract:
Sympathetic eruptions of solar prominences have been studied for decades, however, it is usually difficult to identify their causal links. Here we present two failed prominence eruptions on 26 October 2022 and explore their connections. Using stereoscopic observations, the south prominence (PRO-S) erupts with untwisting motions, flare ribbons occur underneath, and new connections are formed during…
▽ More
Sympathetic eruptions of solar prominences have been studied for decades, however, it is usually difficult to identify their causal links. Here we present two failed prominence eruptions on 26 October 2022 and explore their connections. Using stereoscopic observations, the south prominence (PRO-S) erupts with untwisting motions, flare ribbons occur underneath, and new connections are formed during the eruption. The north prominence (PRO-N) rises up along with PRO-S, and its upper part disappears due to catastrophic mass draining along an elongated structure after PRO-S failed eruption. We suggest that the eruption of PRO-S initiates due to a kink instability, further rises up, and fails to erupt due to reconnection with surrounding fields. The elongated structure connecting PRO-N overlies PRO-S, which causes the rising up of PRO-N along with PRO-S and mass drainage after PRO-S eruption. This study suggests that a prominence may end its life through mass drainage forced by an eruption underneath.
△ Less
Submitted 20 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Holographic spectral function of fermion in instantonic plasma
Authors:
Si-wen Li,
Yi-peng Zhang,
Hao-qian Li
Abstract:
Using the gauge-gravity duality, we investigate the fermionic correlation function in the D(-1)-D3 brane system which describes the instantonic plasma in holography. In this system, the charge of the D(-1) brane as the D-instanton gives the gluon condensate. To simplify the holographic setup, we first reduce briefly the ten-dimensional supergravity background produced by D(-1)-D3-branes to an equi…
▽ More
Using the gauge-gravity duality, we investigate the fermionic correlation function in the D(-1)-D3 brane system which describes the instantonic plasma in holography. In this system, the charge of the D(-1) brane as the D-instanton gives the gluon condensate. To simplify the holographic setup, we first reduce briefly the ten-dimensional supergravity background produced by D(-1)-D3-branes to an equivalently five-dimensional background. Then using the standard method for computing the Green function in the AdS/CFT dictionary, we derive the equations for the fermionic correlation functions and solve them numerically with the infalling boundary condition. Our numerical results illustrate that the fermionic correlation function as spectral function includes two branches of the dispersion curves whose behavior is very close to the results obtained from the method of hard thermal loop. And the effective mass generated by the medium effect of fermion splits into two values due to the spin-dependent interactions induced by instantons. Therefore, this work in holography demonstrates the instantonic configuration is very influential to the fermionic plasmino in plasma.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Obfuscating IoT Device Scanning Activity via Adversarial Example Generation
Authors:
Haocong Li,
Yaxin Zhang,
Long Cheng,
Wenjia Niu,
Haining Wang,
Qiang Li
Abstract:
Nowadays, attackers target Internet of Things (IoT) devices for security exploitation, and search engines for devices and services compromise user privacy, including IP addresses, open ports, device types, vendors, and products.Typically, application banners are used to recognize IoT device profiles during network measurement and reconnaissance. In this paper, we propose a novel approach to obfusc…
▽ More
Nowadays, attackers target Internet of Things (IoT) devices for security exploitation, and search engines for devices and services compromise user privacy, including IP addresses, open ports, device types, vendors, and products.Typically, application banners are used to recognize IoT device profiles during network measurement and reconnaissance. In this paper, we propose a novel approach to obfuscating IoT device banners (BANADV) based on adversarial examples. The key idea is to explore the susceptibility of fingerprinting techniques to a slight perturbation of an IoT device banner. By modifying device banners, BANADV disrupts the collection of IoT device profiles. To validate the efficacy of BANADV, we conduct a set of experiments. Our evaluation results show that adversarial examples can spoof state-of-the-art fingerprinting techniques, including learning- and matching-based approaches. We further provide a detailed analysis of the weakness of learning-based/matching-based fingerprints to carefully crafted samples. Overall, the innovations of BANADV lie in three aspects: (1) it utilizes an IoT-related semantic space and a visual similarity space to locate available manipulating perturbations of IoT banners; (2) it achieves at least 80\% success rate for spoofing IoT scanning techniques; and (3) it is the first to utilize adversarial examples of IoT banners in network measurement and reconnaissance.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.