-
Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems
Authors:
Zhengyang Chen,
Xuechen Liu,
Erica Cooper,
Junichi Yamagishi,
Yanmin Qian
Abstract:
This paper proposes a speech synthesis system that allows users to specify and control the acoustic characteristics of a speaker by means of prompts describing the speaker's traits of synthesized speech. Unlike previous approaches, our method utilizes listener impressions to construct prompts, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We ado…
▽ More
This paper proposes a speech synthesis system that allows users to specify and control the acoustic characteristics of a speaker by means of prompts describing the speaker's traits of synthesized speech. Unlike previous approaches, our method utilizes listener impressions to construct prompts, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We adopt the Low-rank Adaptation (LoRA) technique to swiftly tailor a pre-trained language model to our needs, facilitating the extraction of speaker-related traits from the prompt text. Besides, different from other prompt-driven text-to-speech (TTS) systems, we separate the prompt-to-speaker module from the multi-speaker TTS system, enhancing system flexibility and compatibility with various pre-trained multi-speaker TTS systems. Moreover, for the prompt-to-speaker characteristic module, we also compared the discriminative method and flow-matching based generative method and we found that combining both methods can help the system simultaneously capture speaker-related information from prompts better and generate speech with higher fidelity.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
LLM-based Knowledge Pruning for Time Series Data Analytics on Edge-computing Devices
Authors:
Ruibing **,
Qing Xu,
Min Wu,
Yuecong Xu,
Dan Li,
Xiaoli Li,
Zhenghua Chen
Abstract:
Limited by the scale and diversity of time series data, the neural networks trained on time series data often overfit and show unsatisfacotry performances. In comparison, large language models (LLMs) recently exhibit impressive generalization in diverse fields. Although massive LLM based approaches are proposed for time series tasks, these methods require to load the whole LLM in both training and…
▽ More
Limited by the scale and diversity of time series data, the neural networks trained on time series data often overfit and show unsatisfacotry performances. In comparison, large language models (LLMs) recently exhibit impressive generalization in diverse fields. Although massive LLM based approaches are proposed for time series tasks, these methods require to load the whole LLM in both training and reference. This high computational demands limit practical applications in resource-constrained settings, like edge-computing and IoT devices. To address this issue, we propose Knowledge Pruning (KP), a novel paradigm for time series learning in this paper. For a specific downstream task, we argue that the world knowledge learned by LLMs is much redundant and only the related knowledge termed as "pertinent knowledge" is useful. Unlike other methods, our KP targets to prune the redundant knowledge and only distill the pertinent knowledge into the target model. This reduces model size and computational costs significantly. Additionally, different from existing LLM based approaches, our KP does not require to load the LLM in the process of training and testing, further easing computational burdens. With our proposed KP, a lightweight network can effectively learn the pertinent knowledge, achieving satisfactory performances with a low computation cost. To verify the effectiveness of our KP, two fundamental tasks on edge-computing devices are investigated in our experiments, where eight diverse environments or benchmarks with different networks are used to verify the generalization of our KP. Through experiments, our KP demonstrates effective learning of pertinent knowledge, achieving notable performance improvements in regression (19.7% on average) and classification (up to 13.7%) tasks, showcasing state-of-the-art results.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Constraints on Ultra Heavy Dark Matter Properties from Dwarf Spheroidal Galaxies with LHAASO Observations
Authors:
Zhen Cao,
F. Aharonian,
Q. An,
Axikegu,
Y. X. Bai,
Y. W. Bao,
D. Bastieri,
X. J. Bi,
Y. J. Bi,
J. T. Cai,
Q. Cao,
W. Y. Cao,
Zhe Cao,
J. Chang,
J. F. Chang,
A. M. Chen,
E. S. Chen,
Liang Chen,
Lin Chen,
Long Chen,
M. J. Chen,
M. L. Chen,
Q. H. Chen,
S. H. Chen,
S. Z. Chen
, et al. (255 additional authors not shown)
Abstract:
In this work we try to search for signals generated by ultra-heavy dark matter at the Large High Altitude Air Shower Observatory (LHAASO) data. We look for possible gamma-ray by dark matter annihilation or decay from 16 dwarf spheroidal galaxies in the field of view of LHAASO. Dwarf spheroidal galaxies are among the most promising targets for indirect detection of dark matter which have low fluxes…
▽ More
In this work we try to search for signals generated by ultra-heavy dark matter at the Large High Altitude Air Shower Observatory (LHAASO) data. We look for possible gamma-ray by dark matter annihilation or decay from 16 dwarf spheroidal galaxies in the field of view of LHAASO. Dwarf spheroidal galaxies are among the most promising targets for indirect detection of dark matter which have low fluxes of astrophysical $γ$-ray background while large amount of dark matter. By analyzing more than 700 days observational data at LHAASO, no significant dark matter signal from 1 TeV to 1 EeV is detected. Accordingly we derive the most stringent constraints on the ultra-heavy dark matter annihilation cross-section up to EeV. The constraints on the lifetime of dark matter in decay mode are also derived.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Authors:
Qingyun Li,
Zhe Chen,
Weiyun Wang,
Wenhai Wang,
Shenglong Ye,
Zhenjiang **,
Guanzhou Chen,
Yinan He,
Zhangwei Gao,
Erfei Cui,
Jiashuo Yu,
Hao Tian,
Jiasheng Zhou,
Chao Xu,
Bin Wang,
Xingjian Wei,
Wei Li,
Wenjian Zhang,
Bo Zhang,
Pinlong Cai,
Licheng Wen,
Xiangchao Yan,
Zhenxiang Li,
Pei Chu,
Yi Wang
, et al. (15 additional authors not shown)
Abstract:
Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale an…
▽ More
Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.
△ Less
Submitted 13 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Authors:
Jiannan Wu,
Muyan Zhong,
Sen Xing,
Zeqiang Lai,
Zhaoyang Liu,
Wenhai Wang,
Zhe Chen,
Xizhou Zhu,
Lewei Lu,
Tong Lu,
** Luo,
Yu Qiao,
Jifeng Dai
Abstract:
We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such a…
▽ More
We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link", as a medium to connect MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support the diverse range of tasks, we carefully collected and combed training data from hundreds of public vision and vision-language tasks. In this way, our model can be joint-trained end-to-end on hundreds of vision language tasks and generalize to these tasks using a set of shared parameters through different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.
△ Less
Submitted 14 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Practical, Automated Scenario-based Mobile App Testing
Authors:
Shengcheng Yu,
Chunrong Fang,
Mingzhe Du,
Zimin Ding,
Zhenyu Chen,
Zhendong Su
Abstract:
The importance of mobile application (app) quality insurance is increasing with the rapid development of the mobile Internet. Automated test generation approaches, as a dominant direction of app quality insurance, follow specific models or strategies, targeting at optimizing the code coverage. Such approaches lead to a huge gap between testing execution and app business logic. Test scripts develop…
▽ More
The importance of mobile application (app) quality insurance is increasing with the rapid development of the mobile Internet. Automated test generation approaches, as a dominant direction of app quality insurance, follow specific models or strategies, targeting at optimizing the code coverage. Such approaches lead to a huge gap between testing execution and app business logic. Test scripts developed by human testers consider business logic by focusing on testing scenarios. Due to the GUI-intensive feature of mobile apps, human testers always understand app GUI to organize test scripts for scenarios. This inspires us to utilize domain knowledge from app GUI understanding for scenario-based test generation.
In this paper, we propose a novel approach, ScenTest, for scenario-based mobile app testing with event knowledge graph (EKG) via GUI image understanding. ScenTest tries to start automated testing by imitating human practices and integrating domain knowledge into scenario-based mobile app testing, realizing fully automated testing on target testing scenarios for the first time. ScenTest extracts four kinds of entities and five kinds of corresponding relationships from crowdsourced test reports, where the test events and app GUI information are presented, and constructs the EKGs for specific scenarios. Then, ScenTest conducts test generation for specific scenarios on different apps with the guidance of EKG with the combination consideration of app current state and testing context. We conduct an evaluation on ScenTest on different aspects. The results show that the test generation of ScenTest on the basis of EKG is effective, and ScenTest can reveal 80+ distinct real-world bugs in specific scenarios compared with representative baselines.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Observation of $η_{c}$(1S, 2S) and $χ_{cJ}$ decays to 2$(π^{+}π^{-})η$ via $ψ$(3686) radiative transitions
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (636 additional authors not shown)
Abstract:
Based on $2.7 \times 10^9~ψ(3686)$ decays collected with the BESIII detector, the radiative decay $ψ(3686)\to\gamma2(π^{+}π^{-})η$ is investigated to measure properties of S- and P-wave charmonium states. The branching fraction of the decay $η_{c}(1S) \to 2(π^{+}π^{-})η$, which is found to have a strong dependence on the interference pattern between $η_c(1S)$ and non-$η_c(1S)$ processes, is measur…
▽ More
Based on $2.7 \times 10^9~ψ(3686)$ decays collected with the BESIII detector, the radiative decay $ψ(3686)\to\gamma2(π^{+}π^{-})η$ is investigated to measure properties of S- and P-wave charmonium states. The branching fraction of the decay $η_{c}(1S) \to 2(π^{+}π^{-})η$, which is found to have a strong dependence on the interference pattern between $η_c(1S)$ and non-$η_c(1S)$ processes, is measured in both destructive and constructive interference scenarios for the first time. The mass and width of the $η_{c}(1S)$ are measured to be $M=(2984.14 \pm 0.13 \pm 0.38)$ MeV/$c^{2}$ and $Γ=(28.82 \pm 0.11 \pm 0.82)$ MeV, respectively. Clear signals for the decays of the $χ_{cJ}(J=0,1,2)$ and the $η_{c}(2S)$ to $2(π^{+}π^{-})η$ are also observed for the first time, and the corresponding branching fractions are measured. The ratio of the branching fractions between the $η_{c}(2S)$ and $η_{c}(1S)$ decays is significantly lower than the theoretical prediction, which might suggest different dynamics in their decays.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Aggregation Design for Personalized Federated Multi-Modal Learning over Wireless Networks
Authors:
Benshun Yin,
Zhiyong Chen,
Meixia Tao
Abstract:
Federated Multi-Modal Learning (FMML) is an emerging field that integrates information from different modalities in federated learning to improve the learning performance. In this letter, we develop a parameter scheduling scheme to improve personalized performance and communication efficiency in personalized FMML, considering the non-independent and nonidentically distributed (non-IID) data along…
▽ More
Federated Multi-Modal Learning (FMML) is an emerging field that integrates information from different modalities in federated learning to improve the learning performance. In this letter, we develop a parameter scheduling scheme to improve personalized performance and communication efficiency in personalized FMML, considering the non-independent and nonidentically distributed (non-IID) data along with the modality heterogeneity. Specifically, a learning-based approach is utilized to obtain the aggregation coefficients for parameters of different modalities on distinct devices. Based on the aggregation coefficients and channel state, a subset of parameters is scheduled to be uploaded to a server for each modality. Experimental results show that the proposed algorithm can effectively improve the personalized performance of FMML.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
100 Drivers, 2200 km: A Natural Dataset of Driving Style toward Human-centered Intelligent Driving Systems
Authors:
Chaopeng Zhang,
Wenshuo Wang,
Zhaokun Chen,
Junqiang Xi
Abstract:
Effective driving style analysis is critical to develo** human-centered intelligent driving systems that consider drivers' preferences. However, the approaches and conclusions of most related studies are diverse and inconsistent because no unified datasets tagged with driving styles exist as a reliable benchmark. The absence of explicit driving style labels makes verifying different approaches a…
▽ More
Effective driving style analysis is critical to develo** human-centered intelligent driving systems that consider drivers' preferences. However, the approaches and conclusions of most related studies are diverse and inconsistent because no unified datasets tagged with driving styles exist as a reliable benchmark. The absence of explicit driving style labels makes verifying different approaches and algorithms difficult. This paper provides a new benchmark by constructing a natural dataset of Driving Style (100-DrivingStyle) tagged with the subjective evaluation of 100 drivers' driving styles. In this dataset, the subjective quantification of each driver's driving style is from themselves and an expert according to the Likert-scale questionnaire. The testing routes are selected to cover various driving scenarios, including highways, urban, highway ramps, and signalized traffic. The collected driving data consists of lateral and longitudinal manipulation information, including steering angle, steering speed, lateral acceleration, throttle position, throttle rate, brake pressure, etc. This dataset is the first to provide detailed manipulation data with driving-style tags, and we demonstrate its benchmark function using six classifiers. The 100-DrivingStyle dataset is available via https://github.com/chaopengzhang/100-DrivingStyle-Dataset
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Multi-agent Reinforcement Learning with Deep Networks for Diverse Q-Vectors
Authors:
Zhenglong Luo,
Zhiyong Chen,
James Welsh
Abstract:
Multi-agent reinforcement learning (MARL) has become a significant research topic due to its ability to facilitate learning in complex environments. In multi-agent tasks, the state-action value, commonly referred to as the Q-value, can vary among agents because of their individual rewards, resulting in a Q-vector. Determining an optimal policy is challenging, as it involves more than just maximizi…
▽ More
Multi-agent reinforcement learning (MARL) has become a significant research topic due to its ability to facilitate learning in complex environments. In multi-agent tasks, the state-action value, commonly referred to as the Q-value, can vary among agents because of their individual rewards, resulting in a Q-vector. Determining an optimal policy is challenging, as it involves more than just maximizing a single Q-value. Various optimal policies, such as a Nash equilibrium, have been studied in this context. Algorithms like Nash Q-learning and Nash Actor-Critic have shown effectiveness in these scenarios. This paper extends this research by proposing a deep Q-networks (DQN) algorithm capable of learning various Q-vectors using Max, Nash, and Maximin strategies. The effectiveness of this approach is demonstrated in an environment where dual robotic arms collaborate to lift a pot.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR
Authors:
Yerbolat Khassanov,
Zhipeng Chen,
Tianfeng Chen,
Tze Yuang Chong,
Wei Li,
Jun Zhang,
Lu Lu,
Yuxuan Wang
Abstract:
This paper addresses challenges in integrating new languages into a pre-trained multilingual automatic speech recognition (mASR) system, particularly in scenarios where training data for existing languages is limited or unavailable. The proposed method employs a dual-pipeline with low-rank adaptation (LoRA). It maintains two data flow pipelines-one for existing languages and another for new langua…
▽ More
This paper addresses challenges in integrating new languages into a pre-trained multilingual automatic speech recognition (mASR) system, particularly in scenarios where training data for existing languages is limited or unavailable. The proposed method employs a dual-pipeline with low-rank adaptation (LoRA). It maintains two data flow pipelines-one for existing languages and another for new languages. The primary pipeline follows the standard flow through the pre-trained parameters of mASR, while the secondary pipeline additionally utilizes language-specific parameters represented by LoRA and a separate output decoder module. Importantly, the proposed approach minimizes the performance degradation of existing languages and enables a language-agnostic operation mode, facilitated by a decoder selection strategy. We validate the effectiveness of the proposed method by extending the pre-trained Whisper model to 19 new languages from the FLEURS dataset
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
IceCube Search for Neutrino Emission from X-ray Bright Seyfert Galaxies
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
S. K. Agarwalla,
J. A. Aguilar,
M. Ahlers,
J. M. Alameddine,
N. M. Amin,
K. Andeen,
C. Argüelles,
Y. Ashida,
S. Athanasiadou,
L. Ausborm,
S. N. Axani,
X. Bai,
A. Balagopal V.,
M. Baricevic,
S. W. Barwick,
S. Bash,
V. Basu,
R. Bay,
J. J. Beatty,
J. Becker Tjus,
J. Beise,
C. Bellenghi
, et al. (400 additional authors not shown)
Abstract:
The recent IceCube detection of TeV neutrino emission from the nearby active galaxy NGC 1068 suggests that active galactic nuclei (AGN) could make a sizable contribution to the diffuse flux of astrophysical neutrinos. The absence of TeV $γ$-rays from NGC 1068 indicates neutrino production in the vicinity of the supermassive black hole, where the high radiation density leads to $γ$-ray attenuation.…
▽ More
The recent IceCube detection of TeV neutrino emission from the nearby active galaxy NGC 1068 suggests that active galactic nuclei (AGN) could make a sizable contribution to the diffuse flux of astrophysical neutrinos. The absence of TeV $γ$-rays from NGC 1068 indicates neutrino production in the vicinity of the supermassive black hole, where the high radiation density leads to $γ$-ray attenuation. Therefore, any potential neutrino emission from similar sources is not expected to correlate with high-energy $γ$-rays. Disk-corona models predict neutrino emission from Seyfert galaxies to correlate with keV X-rays, as they are tracers of coronal activity. Using through-going track events from the Northern Sky recorded by IceCube between 2011 and 2021, we report results from a search for individual and aggregated neutrino signals from 27 additional Seyfert galaxies that are contained in the BAT AGN Spectroscopic Survey (BASS). Besides the generic single power-law, we evaluate the spectra predicted by the disk-corona model. Assuming all sources to be intrinsically similar to NGC 1068, our findings constrain the collective neutrino emission from X-ray bright Seyfert galaxies in the Northern Hemisphere, but, at the same time, show excesses of neutrinos that could be associated with the objects NGC 4151 and CGCG 420-015. These excesses result in a 2.7$σ$ significance with respect to background expectations.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions
Authors:
Renjie Pi,
Jianshu Zhang,
Jipeng Zhang,
Rui Pan,
Zhekai Chen,
Tong Zhang
Abstract:
Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scra** of image-text pairs from the web. Despite their abundance, these descriptions are often of low quality and noisy. Another is t…
▽ More
Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scra** of image-text pairs from the web. Despite their abundance, these descriptions are often of low quality and noisy. Another is through human labeling. Datasets such as COCO are generally very short and lack details. Although detailed image descriptions can be annotated by humans, the high annotation cost limits the feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models in a collaborative manner, which maximally convert the visual information into text. To address the current lack of benchmarks for detailed descriptions, we propose several benchmarks for comprehensive evaluation, which verifies the quality of image descriptions created by our framework. Furthermore, we show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquire improved capability to generate richer image descriptions, substantially increasing the length and detail of their output with less hallucination.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention
Authors:
Mingshuai Liu,
Zhuangqi Chen,
Xiaopeng Yan,
Yuanjun Lv,
Xianjun Xia,
Chuanzeng Huang,
Yijian Xiao,
Lei Xie
Abstract:
In real-time speech communication systems, speech signals are often degraded by multiple distortions. Recently, a two-stage Repair-and-Denoising network (RaD-Net) was proposed with superior speech quality improvement in the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. However, failure to use future information and constraint receptive field of convolution layers limit the system's perfor…
▽ More
In real-time speech communication systems, speech signals are often degraded by multiple distortions. Recently, a two-stage Repair-and-Denoising network (RaD-Net) was proposed with superior speech quality improvement in the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. However, failure to use future information and constraint receptive field of convolution layers limit the system's performance. To mitigate these problems, we extend RaD-Net to its upgraded version, RaD-Net 2. Specifically, a causality-based knowledge distillation is introduced in the first stage to use future information in a causal way. We use the non-causal repairing network as the teacher to improve the performance of the causal repairing network. In addition, in the second stage, complex axial self-attention is applied in the denoising network's complex feature encoder/decoder. Experimental results on the ICASSP 2024 SSI Challenge blind test set show that RaD-Net 2 brings 0.10 OVRL DNSMOS improvement compared to RaD-Net.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting
Authors:
Zhiqi Ai,
Zhiyong Chen,
Shugong Xu
Abstract:
In this paper, we propose MM-KWS, a novel approach to user-defined keyword spotting leveraging multi-modal enrollments of text and speech templates. Unlike previous methods that focus solely on either text or speech features, MM-KWS extracts phoneme, text, and speech embeddings from both modalities. These embeddings are then compared with the query speech embedding to detect the target keywords. T…
▽ More
In this paper, we propose MM-KWS, a novel approach to user-defined keyword spotting leveraging multi-modal enrollments of text and speech templates. Unlike previous methods that focus solely on either text or speech features, MM-KWS extracts phoneme, text, and speech embeddings from both modalities. These embeddings are then compared with the query speech embedding to detect the target keywords. To ensure the applicability of MM-KWS across diverse languages, we utilize a feature extractor incorporating several multilingual pre-trained models. Subsequently, we validate its effectiveness on Mandarin and English tasks. In addition, we have integrated advanced data augmentation tools for hard case mining to enhance MM-KWS in distinguishing confusable words. Experimental results on the LibriPhrase and WenetPhrase datasets demonstrate that MM-KWS outperforms prior methods significantly.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Improved criteria of detecting multipartite entanglement structure
Authors:
Kai Wu,
Zhihua Chen,
Zhen-Peng Xu,
Zhihao Ma,
Shao-Ming Fei
Abstract:
Multipartite entanglement is one of the crucial resources in quantum information processing tasks such as quantum metrology, quantum computing and quantum communications. It is essential to verify not only the multipartite entanglement, but also the entanglement structure in both fundamental theories and the applications of quantum information technologies. However, it is proved to be challenging…
▽ More
Multipartite entanglement is one of the crucial resources in quantum information processing tasks such as quantum metrology, quantum computing and quantum communications. It is essential to verify not only the multipartite entanglement, but also the entanglement structure in both fundamental theories and the applications of quantum information technologies. However, it is proved to be challenging to detect the entanglement structures, including entanglement depth, entanglement intactness and entanglement stretchability, especially for general states and large-scale quantum systems. By using the partitions of the tensor product space we propose a systematic method to construct powerful entanglement witnesses which identify better the multipartite entanglement structures. Besides, an efficient algorithm using semi-definite programming and a gradient descent algorithm are designed to detect entanglement structure from the inner polytope of the convex set containing all the states with the same entanglement structure. We demonstrate by detailed examples that our criteria perform better than other known ones. Our results may be applied to many quantum information processing tasks.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Needle In A Multimodal Haystack
Authors:
Weiyun Wang,
Shuibo Zhang,
Yiming Ren,
Yuchen Duan,
Tiantong Li,
Shuo Liu,
Mengkang Hu,
Zhe Chen,
Kaipeng Zhang,
Lewei Lu,
Xizhou Zhu,
** Luo,
Yu Qiao,
Jifeng Dai,
Wenqi Shao,
Wenhai Wang
Abstract:
With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplored. In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capab…
▽ More
With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplored. In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning. In each task, the model is required to answer the questions according to different key information scattered throughout the given multimodal document. Evaluating the leading MLLMs on MM-NIAH, we observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation. We hope this work can provide a platform for further research on long multimodal document comprehension and contribute to the advancement of MLLMs. Code and benchmark are released at https://github.com/OpenGVLab/MM-NIAH.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Target Speech Diarization with Multimodal Prompts
Authors:
Yidi Jiang,
Ruijie Tao,
Zhengyang Chen,
Yanmin Qian,
Haizhou Li
Abstract:
Traditional speaker diarization seeks to detect ``who spoke when'' according to speaker characteristics. Extending to target speech diarization, we detect ``when target event occurs'' according to the semantic characteristics of speech. We propose a novel Multimodal Target Speech Diarization (MM-TSD) framework, which accommodates diverse and multi-modal prompts to specify target events in a flexib…
▽ More
Traditional speaker diarization seeks to detect ``who spoke when'' according to speaker characteristics. Extending to target speech diarization, we detect ``when target event occurs'' according to the semantic characteristics of speech. We propose a novel Multimodal Target Speech Diarization (MM-TSD) framework, which accommodates diverse and multi-modal prompts to specify target events in a flexible and user-friendly manner, including semantic language description, pre-enrolled speech, pre-registered face image, and audio-language logical prompts. We further propose a voice-face aligner module to project human voice and face representation into a shared space. We develop a multi-modal dataset based on VoxCeleb2 for MM-TSD training and evaluation. Additionally, we conduct comparative analysis and ablation studies for each category of prompts to validate the efficacy of each component in the proposed framework. Furthermore, our framework demonstrates versatility in performing various signal processing tasks, including speaker diarization and overlap speech detection, using task-specific prompts. MM-TSD achieves robust and comparable performance as a unified system compared to specialized models. Moreover, MM-TSD shows capability to handle complex conversations for real-world dataset.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
ULog: Unsupervised Log Parsing with Large Language Models through Log Contrastive Units
Authors:
Junjie Huang,
Zhihan Jiang,
Zhuangbin Chen,
Michael R. Lyu
Abstract:
Log parsing serves as an essential prerequisite for various log analysis tasks. Recent advancements in this field have improved parsing accuracy by leveraging the semantics in logs through fine-tuning large language models (LLMs) or learning from in-context demonstrations. However, these methods heavily depend on labeled examples to achieve optimal performance. In practice, collecting sufficient l…
▽ More
Log parsing serves as an essential prerequisite for various log analysis tasks. Recent advancements in this field have improved parsing accuracy by leveraging the semantics in logs through fine-tuning large language models (LLMs) or learning from in-context demonstrations. However, these methods heavily depend on labeled examples to achieve optimal performance. In practice, collecting sufficient labeled data is challenging due to the large scale and continuous evolution of logs, leading to performance degradation of existing log parsers after deployment. To address this issue, we propose ULog, an unsupervised LLM-based method for efficient and off-the-shelf log parsing. Our key insight is that while LLMs may struggle with direct log parsing, their performance can be significantly enhanced through comparative analysis across multiple logs that differ only in their parameter parts. We refer to such groups of logs as Log Contrastive Units (LCUs). Given the vast volume of logs, obtaining LCUs is difficult. Therefore, ULog introduces a hybrid ranking scheme to effectively search for LCUs by jointly considering the commonality and variability among logs. Additionally, ULog crafts a novel parsing prompt for LLMs to identify contrastive patterns and extract meaningful log structures from LCUs. Experiments on large-scale public datasets demonstrate that ULog significantly outperforms state-of-the-art log parsers in terms of accuracy and efficiency, providing an effective and scalable solution for real-world deployment.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Learning Discrete Latent Variable Structures with Tensor Rank Conditions
Authors:
Zhengming Chen,
Ruichu Cai,
Feng Xie,
Jie Qiao,
Anpeng Wu,
Zijian Li,
Zhifeng Hao,
Kun Zhang
Abstract:
Unobserved discrete data are ubiquitous in many scientific disciplines, and how to learn the causal structure of these latent variables is crucial for uncovering data patterns. Most studies focus on the linear latent variable model or impose strict constraints on latent structures, which fail to address cases in discrete data involving non-linear relationships or complex latent structures. To achi…
▽ More
Unobserved discrete data are ubiquitous in many scientific disciplines, and how to learn the causal structure of these latent variables is crucial for uncovering data patterns. Most studies focus on the linear latent variable model or impose strict constraints on latent structures, which fail to address cases in discrete data involving non-linear relationships or complex latent structures. To achieve this, we explore a tensor rank condition on contingency tables for an observed variable set $\mathbf{X}_p$, showing that the rank is determined by the minimum support of a specific conditional set (not necessary in $\mathbf{X}_p$) that d-separates all variables in $\mathbf{X}_p$. By this, one can locate the latent variable through probing the rank on different observed variables set, and further identify the latent causal structure under some structure assumptions. We present the corresponding identification algorithm and conduct simulated experiments to verify the effectiveness of our method. In general, our results elegantly extend the identification boundary for causal discovery with discrete latent variables and expand the application scope of causal discovery with latent variables.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
TraceMesh: Scalable and Streaming Sampling for Distributed Traces
Authors:
Zhuangbin Chen,
Zhihan Jiang,
Yuxin Su,
Michael R. Lyu,
Zibin Zheng
Abstract:
Distributed tracing serves as a fundamental element in the monitoring of cloud-based and datacenter systems. It provides visibility into the full lifecycle of a request or operation across multiple services, which is essential for understanding system dependencies and performance bottlenecks. To mitigate computational and storage overheads, most tracing frameworks adopt a uniform sampling strategy…
▽ More
Distributed tracing serves as a fundamental element in the monitoring of cloud-based and datacenter systems. It provides visibility into the full lifecycle of a request or operation across multiple services, which is essential for understanding system dependencies and performance bottlenecks. To mitigate computational and storage overheads, most tracing frameworks adopt a uniform sampling strategy, which inevitably captures overlap** and redundant information. More advanced methods employ learning-based approaches to bias the sampling toward more informative traces. However, existing methods fall short of considering the high-dimensional and dynamic nature of trace data, which is essential for the production deployment of trace sampling. To address these practical challenges, in this paper we present TraceMesh, a scalable and streaming sampler for distributed traces. TraceMesh employs Locality-Sensitivity Hashing (LSH) to improve sampling efficiency by projecting traces into a low-dimensional space while preserving their similarity. In this process, TraceMesh accommodates previously unseen trace features in a unified and streamlined way. Subsequently, TraceMesh samples traces through evolving clustering, which dynamically adjusts the sampling decision to avoid over-sampling of recurring traces. The proposed method is evaluated with trace data collected from both open-source microservice benchmarks and production service systems. Experimental results demonstrate that TraceMesh outperforms state-of-the-art methods by a significant margin in both sampling accuracy and efficiency.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Authors:
Zigeng Chen,
Xinyin Ma,
Gongfan Fang,
Zhenxiong Tan,
Xinchao Wang
Abstract:
Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, thereby precluding the possibilities of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enable…
▽ More
Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, thereby precluding the possibilities of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components, assigning each to a different device. To break the dependency chain between these components, it transforms the conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component is facilitated to compute in parallel on separate devices. The proposed strategy significantly reduces inference latency while minimally impacting the generative quality. Specifically, for the Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performances. The code is available at https://github.com/czg1225/AsyncDiff.
△ Less
Submitted 27 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Search for neutrino emission from hard X-ray AGN with IceCube
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
S. K. Agarwalla,
J. A. Aguilar,
M. Ahlers,
J. M. Alameddine,
N. M. Amin,
K. Andeen,
C. Argüelles,
Y. Ashida,
S. Athanasiadou,
L. Ausborm,
S. N. Axani,
X. Bai,
A. Balagopal V.,
M. Baricevic,
S. W. Barwick,
S. Bash,
V. Basu,
R. Bay,
J. J. Beatty,
J. Becker Tjus,
J. Beise,
C. Bellenghi
, et al. (401 additional authors not shown)
Abstract:
Active Galactic Nuclei (AGN) are promising candidate sources of high-energy astrophysical neutrinos since they provide environments rich in matter and photon targets where cosmic ray interactions may lead to the production of gamma rays and neutrinos. We searched for high-energy neutrino emission from AGN using the $\textit{Swift}$-BAT Spectroscopic Survey (BASS) catalog of hard X-ray sources and…
▽ More
Active Galactic Nuclei (AGN) are promising candidate sources of high-energy astrophysical neutrinos since they provide environments rich in matter and photon targets where cosmic ray interactions may lead to the production of gamma rays and neutrinos. We searched for high-energy neutrino emission from AGN using the $\textit{Swift}$-BAT Spectroscopic Survey (BASS) catalog of hard X-ray sources and 12 years of IceCube muon track data. First, upon performing a stacked search, no significant emission was found. Second, we searched for neutrinos from a list of 43 candidate sources and found an excess from the direction of two sources, Seyfert galaxies NGC 1068 and NGC 4151. We observed NGC 1068 at flux $φ_{ν_μ+\barν_μ}$ = $4.02_{-1.52}^{+1.58} \times 10^{-11}$ TeV$^{-1}$ cm$^{-2}$ s$^{-1}$ normalized at 1 TeV, with power-law spectral index, $γ$ = 3.10$^{+0.26}_{-0.22}$, consistent with previous IceCube results. The observation of a neutrino excess from the direction of NGC 4151 is at a post-trial significance of 2.9$σ$. If interpreted as an astrophysical signal, the excess observed from NGC 4151 corresponds to a flux $φ_{ν_μ+\barν_μ}$ = $1.51_{-0.81}^{+0.99} \times 10^{-11}$ TeV$^{-1}$ cm$^{-2}$ s$^{-1}$ normalized at 1 TeV and $γ$ = 2.83$^{+0.35}_{-0.28}$.
△ Less
Submitted 12 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Authors:
Owen Dugan,
Donato Manuel Jimenez Beneto,
Charlotte Loh,
Zhuo Chen,
Rumen Dangovski,
Marin Soljačić
Abstract:
Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. To achieve accurate calculations, language model systems often enable LLMs to generate code for arithmetic operations. However, this approach compromises speed and security and, if finetuning is involved, risks the language mo…
▽ More
Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. To achieve accurate calculations, language model systems often enable LLMs to generate code for arithmetic operations. However, this approach compromises speed and security and, if finetuning is involved, risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in \textit{a single autoregressive step}, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of an LLM to control a symbolic architecture which performs arithmetic. Our implementation using Llama 3 8B Instruct with OccamNet as a symbolic model (OccamLlama) achieves 100\% accuracy on single arithmetic operations ($+,-,\times,÷,\sin{},\cos{},\log{},\exp{},\sqrt{}$), outperforming GPT 4o and on par with GPT 4o using a code interpreter. OccamLlama also outperforms GPT 4o both with and without a code interpreter on mathematical problem solving benchmarks involving challenging arithmetic, thus enabling small LLMs to match the arithmetic performance of even much larger models. We will make our code public shortly.
△ Less
Submitted 29 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
GaussianCity: Generative Gaussian Splatting for Unbounded 3D City Generation
Authors:
Haozhe Xie,
Zhaoxi Chen,
Fangzhou Hong,
Ziwei Liu
Abstract:
3D city generation with NeRF-based methods shows promising generation results but is computationally inefficient. Recently 3D Gaussian Splatting (3D-GS) has emerged as a highly efficient alternative for object-level 3D generation. However, adapting 3D-GS from finite-scale 3D objects and humans to infinite-scale 3D cities is non-trivial. Unbounded 3D city generation entails significant storage over…
▽ More
3D city generation with NeRF-based methods shows promising generation results but is computationally inefficient. Recently 3D Gaussian Splatting (3D-GS) has emerged as a highly efficient alternative for object-level 3D generation. However, adapting 3D-GS from finite-scale 3D objects and humans to infinite-scale 3D cities is non-trivial. Unbounded 3D city generation entails significant storage overhead (out-of-memory issues), arising from the need to expand points to billions, often demanding hundreds of Gigabytes of VRAM for a city scene spanning 10km^2. In this paper, we propose GaussianCity, a generative Gaussian Splatting framework dedicated to efficiently synthesizing unbounded 3D cities with a single feed-forward pass. Our key insights are two-fold: 1) Compact 3D Scene Representation: We introduce BEV-Point as a highly compact intermediate representation, ensuring that the growth in VRAM usage for unbounded scenes remains constant, thus enabling unbounded city generation. 2) Spatial-aware Gaussian Attribute Decoder: We present spatial-aware BEV-Point decoder to produce 3D Gaussian attributes, which leverages Point Serializer to integrate the structural and contextual characteristics of BEV points. Extensive experiments demonstrate that GaussianCity achieves state-of-the-art results in both drone-view and street-view 3D city generation. Notably, compared to CityDreamer, GaussianCity exhibits superior performance with a speedup of 60 times (10.72 FPS v.s. 0.18 FPS).
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Merlin: A Vision Language Foundation Model for 3D Computed Tomography
Authors:
Louis Blankemeier,
Joseph Paul Cohen,
Ashwin Kumar,
Dave Van Veen,
Syed Jamal Safdar Gardezi,
Magdalini Paschali,
Zhihong Chen,
Jean-Benoit Delbrouck,
Eduardo Reis,
Cesar Truyts,
Christian Bluethgen,
Malte Engmann Kjeldskov Jensen,
Sophie Ostmeier,
Maya Varma,
Jeya Maria Jose Valanarasu,
Zhongnan Fang,
Zepeng Huo,
Zaid Nabulsi,
Diego Ardila,
Wei-Hung Weng,
Edson Amaro Junior,
Neera Ahuja,
Jason Fries,
Nigam H. Shah,
Andrew Johnston
, et al. (6 additional authors not shown)
Abstract:
Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current radiologist shortage, there is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies. Prior state-of-the-art approaches for automated medical image interpretation leverage vision la…
▽ More
Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current radiologist shortage, there is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies. Prior state-of-the-art approaches for automated medical image interpretation leverage vision language models (VLMs). However, current medical VLMs are generally limited to 2D images and short reports, and do not leverage electronic health record (EHR) data for supervision. We introduce Merlin - a 3D VLM that we train using paired CT scans (6+ million images from 15,331 CTs), EHR diagnosis codes (1.8+ million codes), and radiology reports (6+ million tokens). We evaluate Merlin on 6 task types and 752 individual tasks. The non-adapted (off-the-shelf) tasks include zero-shot findings classification (31 findings), phenotype classification (692 phenotypes), and zero-shot cross-modal retrieval (image to findings and image to impressions), while model adapted tasks include 5-year disease prediction (6 diseases), radiology report generation, and 3D semantic segmentation (20 organs). We perform internal validation on a test set of 5,137 CTs, and external validation on 7,000 clinical CTs and on two public CT datasets (VerSe, TotalSegmentator). Beyond these clinically-relevant evaluations, we assess the efficacy of various network architectures and training strategies to depict that Merlin has favorable performance to existing task-specific baselines. We derive data scaling laws to empirically assess training data needs for requisite downstream task performance. Furthermore, unlike conventional VLMs that require hundreds of GPUs for training, we perform all training on a single GPU.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Nodewise Loreg: Nodewise $L_0$-penalized Regression for High-dimensional Sparse Precision Matrix Estimation
Authors:
Hai Shu,
Ziqi Chen,
Yingjie Zhang,
Hongtu Zhu
Abstract:
We propose Nodewise Loreg, a nodewise $L_0$-penalized regression method for estimating high-dimensional sparse precision matrices. We establish its asymptotic properties, including convergence rates, support recovery, and asymptotic normality under high-dimensional sub-Gaussian settings. Notably, the Nodewise Loreg estimator is asymptotically unbiased and normally distributed, eliminating the need…
▽ More
We propose Nodewise Loreg, a nodewise $L_0$-penalized regression method for estimating high-dimensional sparse precision matrices. We establish its asymptotic properties, including convergence rates, support recovery, and asymptotic normality under high-dimensional sub-Gaussian settings. Notably, the Nodewise Loreg estimator is asymptotically unbiased and normally distributed, eliminating the need for debiasing required by Nodewise Lasso. We also develop a desparsified version of Nodewise Loreg, similar to the desparsified Nodewise Lasso estimator. The asymptotic variances of the undesparsified Nodewise Loreg estimator are upper bounded by those of both desparsified Nodewise Loreg and Lasso estimators for Gaussian data, potentially offering more powerful statistical inference. Extensive simulations show that the undesparsified Nodewise Loreg estimator generally outperforms the two desparsified estimators in asymptotic normal behavior. Moreover, Nodewise Loreg surpasses Nodewise Lasso, CLIME, and GLasso in most simulations in terms of matrix norm losses, support recovery, and timing performance. Application to a breast cancer gene expression dataset further demonstrates Nodewise Loreg's superiority over the three $L_1$-norm based methods.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Accurate Prediction of Core Level Binding Energies from Ground-State Density Functional Calculations: The Importance of Localization and Screening
Authors:
**cheng Yu,
Yuncai Mei,
Zehua Chen,
Weitao Yang
Abstract:
A new method for predicting core level binding energies (CLBEs) is developed by both localizing the core-level states and describing the screening effect. CLBEs contain important information about the electronic structure, elemental chemistry, and chemical environment of molecules and materials. Theoretical study of CLBEs can provide insights for analyzing and interpreting the experimental results…
▽ More
A new method for predicting core level binding energies (CLBEs) is developed by both localizing the core-level states and describing the screening effect. CLBEs contain important information about the electronic structure, elemental chemistry, and chemical environment of molecules and materials. Theoretical study of CLBEs can provide insights for analyzing and interpreting the experimental results obtained from the X-ray photoelectron spectroscopy, in which the overlap** of signals is very common. The localization of core-level holes is important for the theoretical calculation of CLBEs. Predicting CLBEs from commonly used density functional approximations (DFAs) is challenging, because conventional DFAs often produce delocalized core-level states, especially when degenerate core-level states exist. In this work, we combine the localization procedure from the localized orbital scaling correction method and the curvature matrix generalized from the exact second-order correction method that contains the screening effect, and the resulting approach can accurately predict CLBEs from ground-state density functional calculations.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Multi-Prompting Decoder Helps Better Language Understanding
Authors:
Zifeng Cheng,
Zhaoling Chen,
Zhiwei Jiang,
Yafeng Yin,
Shi** Ge,
Yuliang Liu,
Qing Gu
Abstract:
Recent Pre-trained Language Models (PLMs) usually only provide users with the inference APIs, namely the emerging Model-as-a-Service (MaaS) setting. To adapt MaaS PLMs to downstream tasks without accessing their parameters and gradients, some existing methods focus on the output-side adaptation of PLMs, viewing the PLM as an encoder and then optimizing a task-specific decoder for decoding the outp…
▽ More
Recent Pre-trained Language Models (PLMs) usually only provide users with the inference APIs, namely the emerging Model-as-a-Service (MaaS) setting. To adapt MaaS PLMs to downstream tasks without accessing their parameters and gradients, some existing methods focus on the output-side adaptation of PLMs, viewing the PLM as an encoder and then optimizing a task-specific decoder for decoding the output hidden states and class scores of the PLM. Despite the effectiveness of these methods, they only use a single prompt to query PLMs for decoding, leading to a heavy reliance on the quality of the adopted prompt. In this paper, we propose a simple yet effective Multi-Prompting Decoder (MPD) framework for MaaS adaptation. The core idea is to query PLMs with multiple different prompts for each sample, thereby obtaining multiple output hidden states and class scores for subsequent decoding. Such multi-prompting decoding paradigm can simultaneously mitigate reliance on the quality of a single prompt, alleviate the issue of data scarcity under the few-shot setting, and provide richer knowledge extracted from PLMs. Specifically, we propose two decoding strategies: multi-prompting decoding with optimal transport for hidden states and calibrated decoding for class scores. Extensive experiments demonstrate that our method achieves new state-of-the-art results on multiple natural language understanding datasets under the few-shot setting.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Strong and weak $CP$ tests in sequential decays of polarized $Σ^0$ hyperons
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (644 additional authors not shown)
Abstract:
The $J/ψ, ψ(3686) \to Σ^0 \barΣ^{0}$ processes and subsequent decays are studied using the world's largest $J/ψ$ and $ψ(3686)$ data samples collected with the BESIII detector. The strong-$CP$ symmetry is tested in the decays of the $Σ^0$ hyperons for the first time by measuring the decay parameters, $α_{Σ^0} = -0.0017 \pm 0.0021 \pm 0.0018$ and $\barα_{Σ^0} = 0.0021 \pm 0.0020 \pm 0.0022$. The wea…
▽ More
The $J/ψ, ψ(3686) \to Σ^0 \barΣ^{0}$ processes and subsequent decays are studied using the world's largest $J/ψ$ and $ψ(3686)$ data samples collected with the BESIII detector. The strong-$CP$ symmetry is tested in the decays of the $Σ^0$ hyperons for the first time by measuring the decay parameters, $α_{Σ^0} = -0.0017 \pm 0.0021 \pm 0.0018$ and $\barα_{Σ^0} = 0.0021 \pm 0.0020 \pm 0.0022$. The weak-$CP$ test is performed in the subsequent decays of their daughter particles $Λ$ and $\barΛ$. Also for the first time, the transverse polarizations of the $Σ^0$ hyperons in $J/ψ$ and $ψ(3686)$ decays are observed with opposite directions, and the ratios between the S-wave and D-wave contributions of the $J/ψ, ψ(3686) \to Σ^0 \barΣ^{0}$ decays are obtained. These results are crucial to understand the decay dynamics of the charmonium states and the production mechanism of the $Σ^0-\barΣ^0$ pairs.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
GAIA: Rethinking Action Quality Assessment for AI-Generated Videos
Authors:
Zijian Chen,
Wei Sun,
Yuan Tian,
Jun Jia,
Zicheng Zhang,
Jiarui Wang,
Ru Huang,
Xiongkuo Min,
Guangtao Zhai,
Wenjun Zhang
Abstract:
Assessing action quality is both imperative and challenging due to its significant impact on the quality of AI-generated videos, further complicated by the inherently ambiguous nature of actions within AI-generated video (AIGV). Current action quality assessment (AQA) algorithms predominantly focus on actions from real specific scenarios and are pre-trained with normative action features, thus ren…
▽ More
Assessing action quality is both imperative and challenging due to its significant impact on the quality of AI-generated videos, further complicated by the inherently ambiguous nature of actions within AI-generated video (AIGV). Current action quality assessment (AQA) algorithms predominantly focus on actions from real specific scenarios and are pre-trained with normative action features, thus rendering them inapplicable in AIGVs. To address these problems, we construct GAIA, a Generic AI-generated Action dataset, by conducting a large-scale subjective evaluation from a novel causal reasoning-based perspective, resulting in 971,244 ratings among 9,180 video-action pairs. Based on GAIA, we evaluate a suite of popular text-to-video (T2V) models on their ability to generate visually rational actions, revealing their pros and cons on different categories of actions. We also extend GAIA as a testbed to benchmark the AQA capacity of existing automatic evaluation methods. Results show that traditional AQA methods, action-related metrics in recent T2V benchmarks, and mainstream video quality methods correlate poorly with human opinions, indicating a sizable gap between current models and human action perception patterns in AIGVs. Our findings underscore the significance of action quality as a unique perspective for studying AIGVs and can catalyze progress towards methods with enhanced capacities for AQA in AIGVs.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Enabling Large-Scale and High-Precision Fluid Simulations on Near-Term Quantum Computers
Authors:
Zhao-Yun Chen,
Teng-Yang Ma,
Chuang-Chao Ye,
Liang Xu,
Ming-Yang Tan,
Xi-Ning Zhuang,
Xiao-Fan Xu,
Yun-Jie Wang,
Tai-** Sun,
Yong Chen,
Lei Du,
Liang-Liang Guo,
Hai-Feng Zhang,
Hao-Ran Tao,
Tian-Le Wang,
Xiao-Yan Yang,
Ze-An Zhao,
Peng Wang,
Sheng Zhang,
Chi Zhang,
Ren-Ze Zhao,
Zhi-Long Jia,
Wei-Cheng Kong,
Meng-Han Dou,
Jun-Chao Wang
, et al. (7 additional authors not shown)
Abstract:
Quantum computational fluid dynamics (QCFD) offers a promising alternative to classical computational fluid dynamics (CFD) by leveraging quantum algorithms for higher efficiency. This paper introduces a comprehensive QCFD method, including an iterative method "Iterative-QLS" that suppresses error in quantum linear solver, and a subspace method to scale the solution to a larger size. We implement o…
▽ More
Quantum computational fluid dynamics (QCFD) offers a promising alternative to classical computational fluid dynamics (CFD) by leveraging quantum algorithms for higher efficiency. This paper introduces a comprehensive QCFD method, including an iterative method "Iterative-QLS" that suppresses error in quantum linear solver, and a subspace method to scale the solution to a larger size. We implement our method on a superconducting quantum computer, demonstrating successful simulations of steady Poiseuille flow and unsteady acoustic wave propagation. The Poiseuille flow simulation achieved a relative error of less than $0.2\%$, and the unsteady acoustic wave simulation solved a 5043-dimensional matrix. We emphasize the utilization of the quantum-classical hybrid approach in applications of near-term quantum computers. By adapting to quantum hardware constraints and offering scalable solutions for large-scale CFD problems, our method paves the way for practical applications of near-term quantum computers in computational science.
△ Less
Submitted 19 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
Authors:
Peng Xia,
Ze Chen,
Juanxi Tian,
Yangrui Gong,
Ruibo Hou,
Yue Xu,
Zhenbang Wu,
Zhiyuan Fan,
Yiyang Zhou,
Kangyu Zhu,
Wenhao Zheng,
Zhaoyang Wang,
Xiao Wang,
Xuchao Zhang,
Chetan Bansal,
Marc Niethammer,
Junzhou Huang,
Hongtu Zhu,
Yun Li,
Jimeng Sun,
Zongyuan Ge,
Gang Li,
James Zou,
Huaxiu Yao
Abstract:
Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment. In this paper, we introduce CARES and aim to comprehen…
▽ More
Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment. In this paper, we introduce CARES and aim to comprehensively evaluate the Trustworthiness of Med-LVLMs across the medical domain. We assess the trustworthiness of Med-LVLMs across five dimensions, including trustfulness, fairness, safety, privacy, and robustness. CARES comprises about 41K question-answer pairs in both closed and open-ended formats, covering 16 medical image modalities and 27 anatomical regions. Our analysis reveals that the models consistently exhibit concerns regarding trustworthiness, often displaying factual inaccuracies and failing to maintain fairness across different demographic groups. Furthermore, they are vulnerable to attacks and demonstrate a lack of privacy awareness. We publicly release our benchmark and code in https://github.com/richard-peng-xia/CARES.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Expressive Power of Graph Neural Networks for (Mixed-Integer) Quadratic Programs
Authors:
Ziang Chen,
Xiaohan Chen,
Jialin Liu,
Xinshang Wang,
Wotao Yin
Abstract:
Quadratic programming (QP) is the most widely applied category of problems in nonlinear programming. Many applications require real-time/fast solutions, though not necessarily with high precision. Existing methods either involve matrix decomposition or use the preconditioned conjugate gradient method. For relatively large instances, these methods cannot achieve the real-time requirement unless the…
▽ More
Quadratic programming (QP) is the most widely applied category of problems in nonlinear programming. Many applications require real-time/fast solutions, though not necessarily with high precision. Existing methods either involve matrix decomposition or use the preconditioned conjugate gradient method. For relatively large instances, these methods cannot achieve the real-time requirement unless there is an effective precondition.
Recently, graph neural networks (GNNs) opened new possibilities for QP. Some promising empirical studies of applying GNNs for QP tasks show that GNNs can capture key characteristics of an optimization instance and provide adaptive guidance accordingly to crucial configurations during the solving process, or directly provide an approximate solution. Despite notable empirical observations, theoretical foundations are still lacking. In this work, we investigate the expressive or representative power of GNNs, a crucial aspect of neural network theory, specifically in the context of QP tasks, with both continuous and mixed-integer settings. We prove the existence of message-passing GNNs that can reliably represent key properties of quadratic programs, including feasibility, optimal objective value, and optimal solution. Our theory is validated by numerical results.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
OmniControlNet: Dual-stage Integration for Conditional Image Generation
Authors:
Yilin Wang,
Haiyang Xu,
Xiang Zhang,
Zeyuan Chen,
Zhizhou Sha,
Zirui Wang,
Zhuowen Tu
Abstract:
We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method and incorporating its individually trained image generation processes into a single model. Despite its tremendous success, the ControlNet of a two-stage pipeline bears limitations in being not self-contained (e.g. calls the external condit…
▽ More
We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method and incorporating its individually trained image generation processes into a single model. Despite its tremendous success, the ControlNet of a two-stage pipeline bears limitations in being not self-contained (e.g. calls the external condition generation algorithms) with a large model redundancy (separately trained models for different types of conditioning inputs). Our proposed OmniControlNet consolidates 1) the condition generation (e.g., HED edges, depth maps, user scribble, and animal pose) by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance. OmniControlNet achieves significantly reduced model complexity and redundancy while capable of producing images of comparable quality for conditioned text-to-image generation.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Measurement of the integrated luminosity of the data collected at 3.773 GeV by BESIII from 2021 to 2024
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (634 additional authors not shown)
Abstract:
We present a measurement of the integrated luminosity of $e^+e^-$ collision data collected with the BESIII detector at the BEPCII collider at a center-of-mass energy of $E_{\rm cm} = 3.773$~GeV. The integrated luminosities of the data sets taken from December 2021 to June 2022, from November 2022 to June 2023, and from October 2023 to February 2024 are determined to be $4.995 \pm 0.019$~fb$^{-1}$,…
▽ More
We present a measurement of the integrated luminosity of $e^+e^-$ collision data collected with the BESIII detector at the BEPCII collider at a center-of-mass energy of $E_{\rm cm} = 3.773$~GeV. The integrated luminosities of the data sets taken from December 2021 to June 2022, from November 2022 to June 2023, and from October 2023 to February 2024 are determined to be $4.995 \pm 0.019$~fb$^{-1}$, $8.157 \pm 0.031$~fb$^{-1}$, and $4.191 \pm 0.016$~fb$^{-1}$, respectively, by analyzing large angle Bhabha scattering events. The uncertainties are dominated by systematic effects and the statistical uncertainties are negligible. Our results provide essential input for future analyses and precision measurements.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Above room-temperature two-dimensional ferromagnetic half-metals in Mn-based Janus magnets
Authors:
Xiang-Fan Huang,
Kang-Jie Li,
Zequan Wang,
Shi-Bo Zhao,
Bing Shen,
Zu-Xin Chen,
Yusheng Hou
Abstract:
Two-dimensional (2D) ferromagnets and their heterostructures offer fertile grounds for designing fascinating functionalities in ultra-thin spintronic devices. Here, by first-principles calculations, we report the discovery of energetically and thermodynamically stable 2D ferromagnets with very strong inplane magnetic anisotropy in MnXY (X = S, and Se; Y = Cl, Br and I) monolayers. Remarkably, we f…
▽ More
Two-dimensional (2D) ferromagnets and their heterostructures offer fertile grounds for designing fascinating functionalities in ultra-thin spintronic devices. Here, by first-principles calculations, we report the discovery of energetically and thermodynamically stable 2D ferromagnets with very strong inplane magnetic anisotropy in MnXY (X = S, and Se; Y = Cl, Br and I) monolayers. Remarkably, we find that the Curie temperatures of the ferromagnetic MnSBr, MnSI, MnSeCl, and MnSeI monolayers are as high as 271, 273, 231 and 418 K, respectively. In addition, we demonstrate that these ferromagnetic monolayers are intrinsic half-metals with large spin band gaps ranging from 2.5 eV to 3.2 eV. When spin-orbit coupling is considered in these ferromagnetic monolayers, the nature of their half-metal is almost unaffected. Finally, the strong inplane magnetic anisotropy of MnSY (Y = Br, I) and MnSeY (Y = Cl, I) monolayers originate mainly from halogen and chalcogen atoms, respectively. Our work shows 2D Janus Mn-based ferromagnetic half-metals may have appealing functionalities in high-performance spintronic applications.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Binarized Diffusion Model for Image Super-Resolution
Authors:
Zheng Chen,
Haotong Qin,
Yong Guo,
Xiongfei Su,
Xin Yuan,
Linghe Kong,
Yulun Zhang
Abstract:
Advanced diffusion models (DMs) perform impressively in image super-resolution (SR), but the high memory and computational costs hinder their deployment. Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating DMs. Nonetheless, due to the model structure and the multi-step iterative attribute of DMs, existing binarization methods result in significant perfor…
▽ More
Advanced diffusion models (DMs) perform impressively in image super-resolution (SR), but the high memory and computational costs hinder their deployment. Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating DMs. Nonetheless, due to the model structure and the multi-step iterative attribute of DMs, existing binarization methods result in significant performance degradation. In this paper, we introduce a novel binarized diffusion model, BI-DiffSR, for image SR. First, for the model structure, we design a UNet architecture optimized for binarization. We propose the consistent-pixel-downsample (CP-Down) and consistent-pixel-upsample (CP-Up) to maintain dimension consistent and facilitate the full-precision information transfer. Meanwhile, we design the channel-shuffle-fusion (CS-Fusion) to enhance feature fusion in skip connection. Second, for the activation difference across timestep, we design the timestep-aware redistribution (TaR) and activation function (TaA). The TaR and TaA dynamically adjust the distribution of activations based on different timesteps, improving the flexibility and representation alability of the binarized module. Comprehensive experiments demonstrate that our BI-DiffSR outperforms existing binarization methods. Code is available at https://github.com/zhengchen1999/BI-DiffSR.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Demystifying the Characteristics for Smart Contract Upgrades
Authors:
Ye Liu,
Shuo Li,
Xiuheng Wu,
Yi Li,
Zhiyang Chen,
David Lo
Abstract:
Upgradable smart contracts play an important role in the decentralized application ecosystem, to support routine maintenance, security patching, and feature additions. In this paper, we conduct an empirical study on proxy-based upgradable smart contracts to understand the characteristics of contract upgrading. Through our study on 57,118 open source proxy contracts, we found that 583 contracts hav…
▽ More
Upgradable smart contracts play an important role in the decentralized application ecosystem, to support routine maintenance, security patching, and feature additions. In this paper, we conduct an empirical study on proxy-based upgradable smart contracts to understand the characteristics of contract upgrading. Through our study on 57,118 open source proxy contracts, we found that 583 contracts have ever been upgraded on Ethereum, involving 973 unique implementation contract versions. The results show that developers often intend to improve usability of contracts if upgrading, where functionality addition and update are the most frequent upgrade intentions. We investigated the practical impacts of contract upgrades, e.g., breaking changes causing compatibility issues, storage collisions and initialization risks leading to security vulnerabilities. The results demonstrate that there are 4,334 ABI breaking changes due to the upgrades of 276 proxies, causing real-world broken usages within 584 transactions witnessed by the blockchain; 36 contract upgrades had storage collisions and five proxies with 59 implementation contracts are vulnerable to initialization attacks.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Intrinsic second-order topological insulators in two-dimensional polymorphic graphyne with sublattice approximation
Authors:
Z. J. Chen,
S. G. Xu,
Z. J. Xie,
H. Xu,
H. M. Weng
Abstract:
In two dimensions, intrinsic second-order topological insulators (SOTIs) are characterized by topological corner states that emerge at the intersections of distinct edges with reversed mass signs, enforced by spatial symmetries. Here, we present a comprehensive investigation within the class BDI to clarify the symmetry conditions ensuring the presence of intrinsic SOTIs in two dimensions. We revea…
▽ More
In two dimensions, intrinsic second-order topological insulators (SOTIs) are characterized by topological corner states that emerge at the intersections of distinct edges with reversed mass signs, enforced by spatial symmetries. Here, we present a comprehensive investigation within the class BDI to clarify the symmetry conditions ensuring the presence of intrinsic SOTIs in two dimensions. We reveal that the (anti-)commutation relationship between spatial symmetries and chiral symmetry is a reliable indicator of intrinsic corner states. Through first-principles calculations, we identify several ideal candidates within carbon-based polymorphic graphyne structures for realizing intrinsic SOTIs under sublattice approximation. Furthermore, we show that the corner states in these materials persist even in the absence of sublattice approximation. Our findings not only deepen the understanding of higher-order topological phases but also open new pathways for realizing topological corner states that are readily observable.
△ Less
Submitted 20 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding
Authors:
Yiqing Shen,
Zan Chen,
Michail Mamalakis,
Luhan He,
Haiyang Xia,
Tianbin Li,
Yanzhou Su,
Junjun He,
Yu Guang Wang
Abstract:
The parallels between protein sequences and natural language in their sequential structures have inspired the application of large language models (LLMs) to protein understanding. Despite the success of LLMs in NLP, their effectiveness in comprehending protein sequences remains an open question, largely due to the absence of datasets linking protein sequences to descriptive text. Researchers have…
▽ More
The parallels between protein sequences and natural language in their sequential structures have inspired the application of large language models (LLMs) to protein understanding. Despite the success of LLMs in NLP, their effectiveness in comprehending protein sequences remains an open question, largely due to the absence of datasets linking protein sequences to descriptive text. Researchers have then attempted to adapt LLMs for protein understanding by integrating a protein sequence encoder with a pre-trained LLM. However, this adaptation raises a fundamental question: "Can LLMs, originally designed for NLP, effectively comprehend protein sequences as a form of language?" Current datasets fall short in addressing this question due to the lack of a direct correlation between protein sequences and corresponding text descriptions, limiting the ability to train and evaluate LLMs for protein understanding effectively. To bridge this gap, we introduce ProteinLMDataset, a dataset specifically designed for further self-supervised pretraining and supervised fine-tuning (SFT) of LLMs to enhance their capability for protein sequence comprehension. Specifically, ProteinLMDataset includes 17.46 billion tokens for pretraining and 893,000 instructions for SFT. Additionally, we present ProteinLMBench, the first benchmark dataset consisting of 944 manually verified multiple-choice questions for assessing the protein understanding capabilities of LLMs. ProteinLMBench incorporates protein-related details and sequences in multiple languages, establishing a new standard for evaluating LLMs' abilities in protein comprehension. The large language model InternLM2-7B, pretrained and fine-tuned on the ProteinLMDataset, outperforms GPT-4 on ProteinLMBench, achieving the highest accuracy score. The dataset and the benchmark are available at https://huggingface.co/datasets/tsynbio/ProteinLMBench.
△ Less
Submitted 8 June, 2024;
originally announced June 2024.
-
LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis
Authors:
Lecheng Zheng,
Zhengzhang Chen,
Dongjie Wang,
Chengyuan Deng,
Reon Matsuoka,
Haifeng Chen
Abstract:
Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap, we introduce LEMMA-RCA, a large dataset designed for diverse RCA tasks across multiple domains and modalities. LEMMA-RCA features various real-world fault scena…
▽ More
Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap, we introduce LEMMA-RCA, a large dataset designed for diverse RCA tasks across multiple domains and modalities. LEMMA-RCA features various real-world fault scenarios from IT and OT operation systems, encompassing microservices, water distribution, and water treatment systems, with hundreds of system entities involved. We evaluate the quality of LEMMA-RCA by testing the performance of eight baseline methods on this dataset under various settings, including offline and online modes as well as single and multiple modalities. Our experimental results demonstrate the high quality of LEMMA-RCA. The dataset is publicly available at https://lemma-rca.github.io/.
△ Less
Submitted 8 June, 2024;
originally announced June 2024.
-
Planning Like Human: A Dual-process Framework for Dialogue Planning
Authors:
Tao He,
Lizi Liao,
Yixin Cao,
Yuanxing Liu,
Ming Liu,
Zerui Chen,
Bing Qin
Abstract:
In proactive dialogue, the challenge lies not just in generating responses but in steering conversations toward predetermined goals, a task where Large Language Models (LLMs) typically struggle due to their reactive nature. Traditional approaches to enhance dialogue planning in LLMs, ranging from elaborate prompt engineering to the integration of policy networks, either face efficiency issues or d…
▽ More
In proactive dialogue, the challenge lies not just in generating responses but in steering conversations toward predetermined goals, a task where Large Language Models (LLMs) typically struggle due to their reactive nature. Traditional approaches to enhance dialogue planning in LLMs, ranging from elaborate prompt engineering to the integration of policy networks, either face efficiency issues or deliver suboptimal performance. Inspired by the dualprocess theory in psychology, which identifies two distinct modes of thinking - intuitive (fast) and analytical (slow), we propose the Dual-Process Dialogue Planning (DPDP) framework. DPDP embodies this theory through two complementary planning systems: an instinctive policy model for familiar contexts and a deliberative Monte Carlo Tree Search (MCTS) mechanism for complex, novel scenarios. This dual strategy is further coupled with a novel two-stage training regimen: offline Reinforcement Learning for robust initial policy model formation followed by MCTS-enhanced on-the-fly learning, which ensures a dynamic balance between efficiency and strategic depth. Our empirical evaluations across diverse dialogue tasks affirm DPDP's superiority in achieving both high-quality dialogues and operational efficiency, outpacing existing methods.
△ Less
Submitted 8 June, 2024;
originally announced June 2024.
-
Thermalization dynamics in photonic lattices of different geometries
Authors:
Guowen Yang,
Domenico Bongiovanni,
Daohong Song,
Roberto Morandotti,
Zhigang Chen,
Nikolaos K. Efremidis
Abstract:
The statistical mechanical behavior of weakly nonlinear multimoded optical settings is attracting increased interest during the last few years. The main purpose of this work is to numerically investigate the main factors that affect the thermalization process in photonic lattices. In particular, we find that lattices with identically selected properties (such as temperature, coupling coefficient,…
▽ More
The statistical mechanical behavior of weakly nonlinear multimoded optical settings is attracting increased interest during the last few years. The main purpose of this work is to numerically investigate the main factors that affect the thermalization process in photonic lattices. In particular, we find that lattices with identically selected properties (such as temperature, coupling coefficient, lattice size, and excitation conditions) can exhibit very different thermalization dynamics and thus thermalization distances. Our investigation is focused on two different two-dimensional lattices: the honeycomb lattice and the triangular lattice. Our numerical results show that, independently of the excitation conditions, the honeycomb lattice always thermalizes faster than the triangular lattice. We mainly explain this behavior to the quasilinear spectrum that promotes wave-mixing in the honeycomb lattice in comparison to the power-like spectrum of the triangular lattice. In addition, we investigate the combined effects of temperature as well as the sign and magnitude of the nonlinearity. Switching either the sign of the Kerr nonlinear coefficient or the sign of the temperature can lead to significant differences in the thermalization dynamics, a phenomenon that can be physically explained in terms of wave instabilities. Larger absolute values of the temperature |T| result in more uniform distributions for the power occupation numbers and faster thermalization speeds. Finally, as expected, increasing the magnitude of the nonlinearity results in accelerated thermalization. Our findings provide valuable insights into optical thermalization in discrete systems where experimental realization may bring about new possibilities for light manipulation and applications.
△ Less
Submitted 8 June, 2024;
originally announced June 2024.
-
Generative Explore-Exploit: Training-free Optimization of Generative Recommender Systems using LLM Optimizers
Authors:
Lütfi Kerem Senel,
Besnik Fetahu,
Davis Yoshida,
Zhiyu Chen,
Giuseppe Castellucci,
Nikhita Vedula,
Jason Choi,
Shervin Malmasi
Abstract:
Recommender systems are widely used to suggest engaging content, and Large Language Models (LLMs) have given rise to generative recommenders. Such systems can directly generate items, including for open-set tasks like question suggestion. While the world knowledge of LLMs enable good recommendations, improving the generated content through user feedback is challenging as continuously fine-tuning L…
▽ More
Recommender systems are widely used to suggest engaging content, and Large Language Models (LLMs) have given rise to generative recommenders. Such systems can directly generate items, including for open-set tasks like question suggestion. While the world knowledge of LLMs enable good recommendations, improving the generated content through user feedback is challenging as continuously fine-tuning LLMs is prohibitively expensive. We present a training-free approach for optimizing generative recommenders by connecting user feedback loops to LLM-based optimizers. We propose a generative explore-exploit method that can not only exploit generated items with known high engagement, but also actively explore and discover hidden population preferences to improve recommendation quality. We evaluate our approach on question generation in two domains (e-commerce and general knowledge), and model user feedback with Click Through Rate (CTR). Experiments show our LLM-based explore-exploit approach can iteratively improve recommendations, and consistently increase CTR. Ablation analysis shows that generative exploration is key to learning user preferences, avoiding the pitfalls of greedy exploit-only approaches. A human evaluation strongly supports our quantitative findings.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
ON-OFF Neuromorphic ISING Machines using Fowler-Nordheim Annealers
Authors:
Zihao Chen,
Zhili Xiao,
Mahmoud Akl,
Johannes Leugring,
Omowuyi Olajide,
Adil Malik,
Nik Dennler,
Chad Harper,
Subhankar Bose,
Hector A. Gonzalez,
Jason Eshraghian,
Riccardo Pignari,
Gianvito Urgese,
Andreas G. Andreou,
Sadasivan Shankar,
Christian Mayr,
Gert Cauwenberghs,
Shantanu Chakrabartty
Abstract:
We introduce NeuroSA, a neuromorphic architecture specifically designed to ensure asymptotic convergence to the ground state of an Ising problem using an annealing process that is governed by the physics of quantum mechanical tunneling using Fowler-Nordheim (FN). The core component of NeuroSA consists of a pair of asynchronous ON-OFF neurons, which effectively map classical simulated annealing (SA…
▽ More
We introduce NeuroSA, a neuromorphic architecture specifically designed to ensure asymptotic convergence to the ground state of an Ising problem using an annealing process that is governed by the physics of quantum mechanical tunneling using Fowler-Nordheim (FN). The core component of NeuroSA consists of a pair of asynchronous ON-OFF neurons, which effectively map classical simulated annealing (SA) dynamics onto a network of integrate-and-fire (IF) neurons. The threshold of each ON-OFF neuron pair is adaptively adjusted by an FN annealer which replicates the optimal escape mechanism and convergence of SA, particularly at low temperatures. To validate the effectiveness of our neuromorphic Ising machine, we systematically solved various benchmark MAX-CUT combinatorial optimization problems. Across multiple runs, NeuroSA consistently generates solutions that approach the state-of-the-art level with high accuracy (greater than 99%), and without any graph-specific hyperparameter tuning. For practical illustration, we present results from an implementation of NeuroSA on the SpiNNaker2 platform, highlighting the feasibility of map** our proposed architecture onto a standard neuromorphic accelerator platform.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion
Authors:
Xingrui Wang,
Xin Li,
Zhibo Chen
Abstract:
Tuning-free long video diffusion has been proposed to generate extended-duration videos with enriched content by reusing the knowledge from pre-trained short video diffusion model without retraining. However, most works overlook the fine-grained long-term video consistency modeling, resulting in limited scene consistency (i.e., unreasonable object or background transitions), especially with multip…
▽ More
Tuning-free long video diffusion has been proposed to generate extended-duration videos with enriched content by reusing the knowledge from pre-trained short video diffusion model without retraining. However, most works overlook the fine-grained long-term video consistency modeling, resulting in limited scene consistency (i.e., unreasonable object or background transitions), especially with multiple text inputs. To mitigate this, we propose the Consistency Noise Injection, dubbed CoNo, which introduces the "look-back" mechanism to enhance the fine-grained scene transition between different video clips, and designs the long-term consistency regularization to eliminate the content shifts when extending video contents through noise prediction. In particular, the "look-back" mechanism breaks the noise scheduling process into three essential parts, where one internal noise prediction part is injected into two video-extending parts, intending to achieve a fine-grained transition between two video clips. The long-term consistency regularization focuses on explicitly minimizing the pixel-wise distance between the predicted noises of the extended video clip and the original one, thereby preventing abrupt scene transitions. Extensive experiments have shown the effectiveness of the above strategies by performing long-video generation under both single- and multi-text prompt conditions. The project has been available in https://wxrui182.github.io/CoNo.github.io/.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
Targeted Mining Precise-positioning Episode Rules
Authors:
Jian Zhu,
Xiaoye Chen,
Wensheng Gan,
Zefeng Chen,
Philip S. Yu
Abstract:
The era characterized by an exponential increase in data has led to the widespread adoption of data intelligence as a crucial task. Within the field of data mining, frequent episode mining has emerged as an effective tool for extracting valuable and essential information from event sequences. Various algorithms have been developed to discover frequent episodes and subsequently derive episode rules…
▽ More
The era characterized by an exponential increase in data has led to the widespread adoption of data intelligence as a crucial task. Within the field of data mining, frequent episode mining has emerged as an effective tool for extracting valuable and essential information from event sequences. Various algorithms have been developed to discover frequent episodes and subsequently derive episode rules using the frequency function and anti-monotonicity principles. However, currently, there is a lack of algorithms specifically designed for mining episode rules that encompass user-specified query episodes. To address this challenge and enable the mining of target episode rules, we introduce the definition of targeted precise-positioning episode rules and formulate the problem of targeted mining precise-positioning episode rules. Most importantly, we develop an algorithm called Targeted Mining Precision Episode Rules (TaMIPER) to address the problem and optimize it using four proposed strategies, leading to significant reductions in both time and space resource requirements. As a result, TaMIPER offers high accuracy and efficiency in mining episode rules of user interest and holds promising potential for prediction tasks in various domains, such as weather observation, network intrusion, and e-commerce. Experimental results on six real datasets demonstrate the exceptional performance of TaMIPER.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter
Authors:
Jitai Hao,
WeiWei Sun,
Xin Xin,
Qi Meng,
Zhumin Chen,
Pengjie Ren,
Zhaochun Ren
Abstract:
Parameter-Efficient Fine-tuning (PEFT) facilitates the fine-tuning of Large Language Models (LLMs) under limited resources. However, the fine-tuning performance with PEFT on complex, knowledge-intensive tasks is limited due to the constrained model capacity, which originates from the limited number of additional trainable parameters. To overcome this limitation, we introduce a novel mechanism that…
▽ More
Parameter-Efficient Fine-tuning (PEFT) facilitates the fine-tuning of Large Language Models (LLMs) under limited resources. However, the fine-tuning performance with PEFT on complex, knowledge-intensive tasks is limited due to the constrained model capacity, which originates from the limited number of additional trainable parameters. To overcome this limitation, we introduce a novel mechanism that fine-tunes LLMs with adapters of larger size yet memory-efficient. This is achieved by leveraging the inherent activation sparsity in the Feed-Forward Networks (FFNs) of LLMs and utilizing the larger capacity of Central Processing Unit (CPU) memory compared to Graphics Processing Unit (GPU). We store and update the parameters of larger adapters on the CPU. Moreover, we employ a Mixture of Experts (MoE)-like architecture to mitigate unnecessary CPU computations and reduce the communication volume between the GPU and CPU. This is particularly beneficial over the limited bandwidth of PCI Express (PCIe). Our method can achieve fine-tuning results comparable to those obtained with larger memory capacities, even when operating under more limited resources such as a 24GB memory single GPU setup, with acceptable loss in training efficiency. Our codes are available at https://github.com/CURRENTF/MEFT.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
M17 MIR: A Massive Star is Forming via Episodic Mass Accretion
Authors:
Wei Zhou,
Zhiwei Chen,
Zhibo Jiang,
Haoran Feng,
Yu Jiang
Abstract:
We analyzed the Atacama Large Millimeter/submillimeter Array (ALMA) band 6 data for the outbursting massive protostar M17~MIR. The ALMA CO $J=2-1$ data reveal a collimated and bipolar north-south outflow from M17~MIR. The blue-shifted outflow exhibits four CO knots (N1 to N4) along the outflow axis, while the red-shifted outflow appears as a single knot (S1). The extremely high velocity (EHV) emis…
▽ More
We analyzed the Atacama Large Millimeter/submillimeter Array (ALMA) band 6 data for the outbursting massive protostar M17~MIR. The ALMA CO $J=2-1$ data reveal a collimated and bipolar north-south outflow from M17~MIR. The blue-shifted outflow exhibits four CO knots (N1 to N4) along the outflow axis, while the red-shifted outflow appears as a single knot (S1). The extremely high velocity (EHV) emissions of N1 and S1 are jet-like and contain sub-knots along the outflow axis. Assuming the nearest EHV sub-knots trace the ejecta from the accretion outbursts in the past decades, a tangential ejection velocity of $\sim421\,\mathrm{km\,s^{-1}}$ is derived for M17~MIR. Assuming the same velocity, the dynamical times of the multiple ejecta, traced by the four blue-shifted CO knots, range from 20 to 364 years. The four blue-shifted CO knots imply four clustered accretion outbursts with a duration of tens of years in the past few hundred years. The intervals between the four clustered accretion outbursts are also about tens of years. These properties of the four clustered accretion outbursts are in line with the disk gravitational instability and fragmentation model. The episodic accretion history of M17~MIR traced by episodic outflow suggests that a massive star can form from a lower-mass protostar via frequent episodic accretion events triggered by disk gravitational instability and fragmentation. The first detection of the knotty outflow from an outbursting massive protostar suggests that mass ejections accompanied with accretion events could serve as an effective diagnostic tool for the episodic accretion histories of massive protostars.
△ Less
Submitted 17 June, 2024; v1 submitted 7 June, 2024;
originally announced June 2024.