-
Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)
Authors:
Nick Bryan-Kinns,
Corey Ford,
Shuoyang Zheng,
Helen Kennedy,
Alan Chamberlain,
Makayla Lewis,
Drew Hemment,
Zi** Li,
Qiong Wu,
Lanxi Xiao,
Gus Xia,
Jeba Rezwana,
Michael Clemens,
Gabriel Vigliensoni
Abstract:
This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.
This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.
△ Less
Submitted 22 June, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision
Authors:
Yafeng Chen,
Siqi Zheng,
Hui Wang,
Luyao Cheng,
Qian Chen,
Shiliang Zhang,
Wen Wang
Abstract:
Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an…
▽ More
Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. Originally, due to the lack of negative pairs in the SDPN training process, the network tends to align positive pairs very closely in the embedding space, a phenomenon known as model collapse. To alleviate this problem, we introduce a diversity regularization term to embeddings in SDPN. Comprehensive experiments on the VoxCeleb datasets demonstrate the superiority of SDPN in self-supervised speaker verification. SDPN sets a new state-of-the-art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rate 1.80%, 1.99%, and 3.62% for trial VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H respectively, without using any speaker labels in training.
△ Less
Submitted 25 June, 2024; v1 submitted 16 June, 2024;
originally announced June 2024.
-
Beyond the Visible: Jointly Attending to Spectral and Spatial Dimensions with HSI-Diffusion for the FINCH Spacecraft
Authors:
Ian Vyse,
Rishit Dagli,
Dav Vrat Chadha,
John P. Ma,
Hector Chen,
Isha Ruparelia,
Prithvi Seran,
Matthew Xie,
Eesa Aamer,
Aidan Armstrong,
Naveen Black,
Ben Borstein,
Kevin Caldwell,
Orrin Dahanaggamaarachchi,
Joe Dai,
Abeer Fatima,
Stephanie Lu,
Maxime Michet,
Anoushka Paul,
Carrie Ann Po,
Shivesh Prakash,
Noa Prosser,
Riddhiman Roy,
Mirai Shinjo,
Iliya Shofman
, et al. (4 additional authors not shown)
Abstract:
Satellite remote sensing missions have gained popularity over the past fifteen years due to their ability to cover large swaths of land at regular intervals, making them ideal for monitoring environmental trends. The FINCH mission, a 3U+ CubeSat equipped with a hyperspectral camera, aims to monitor crop residue cover in agricultural fields. Although hyperspectral imaging captures both spectral and…
▽ More
Satellite remote sensing missions have gained popularity over the past fifteen years due to their ability to cover large swaths of land at regular intervals, making them ideal for monitoring environmental trends. The FINCH mission, a 3U+ CubeSat equipped with a hyperspectral camera, aims to monitor crop residue cover in agricultural fields. Although hyperspectral imaging captures both spectral and spatial information, it is prone to various types of noise, including random noise, stripe noise, and dead pixels. Effective denoising of these images is crucial for downstream scientific tasks. Traditional methods, including hand-crafted techniques encoding strong priors, learned 2D image denoising methods applied across different hyperspectral bands, or diffusion generative models applied independently on bands, often struggle with varying noise strengths across spectral bands, leading to significant spectral distortion. This paper presents a novel approach to hyperspectral image denoising using latent diffusion models that integrate spatial and spectral information. We particularly do so by building a 3D diffusion model and presenting a 3-stage training approach on real and synthetically crafted datasets. The proposed method preserves image structure while reducing noise. Evaluations on both popular hyperspectral denoising datasets and synthetically crafted datasets for the FINCH mission demonstrate the effectiveness of this approach.
△ Less
Submitted 15 June, 2024;
originally announced June 2024.
-
Sustainable Wireless Networks via Reconfigurable Intelligent Surfaces (RISs): Overview of the ETSI ISG RIS
Authors:
Ruiqi Liu,
Shuang Zheng,
Qingqing Wu,
Yifan Jiang,
Nan Zhang,
Yuanwei Liu,
Marco Di Renzo,
and George C. Alexandropoulos
Abstract:
Reconfigurable Intelligent Surfaces (RISs) are a novel form of ultra-low power devices that are capable to increase the communication data rates as well as the cell coverage in a cost- and energy-efficient way. This is attributed to their programmable operation that enables them to dynamically manipulate the wireless propagation environment, a feature that has lately inspired numerous research inv…
▽ More
Reconfigurable Intelligent Surfaces (RISs) are a novel form of ultra-low power devices that are capable to increase the communication data rates as well as the cell coverage in a cost- and energy-efficient way. This is attributed to their programmable operation that enables them to dynamically manipulate the wireless propagation environment, a feature that has lately inspired numerous research investigations and applications. To pave the way to the formal standardization of RISs, the European Telecommunications Standards Institute (ETSI) launched the Industry Specification Group (ISG) on the RIS technology in September 2021. This article provides a comprehensive overview of the status of the work conducted by the ETSI ISG RIS, covering typical deployment scenarios of reconfigurable metasurfaces, use cases and operating applications, requirements, emerging hardware architectures and operating modes, as well as the latest insights regarding future directions of RISs and the resulting smart wireless environments.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency
Authors:
Yafeng Chen,
Siqi Zheng,
Hui Wang,
Luyao Cheng,
Qian Chen,
Shiliang Zhang,
Junjie Li
Abstract:
Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings. To address this challenge, a multi-scale feature fusion approach has been proposed to effectively capture speaker characteristics from short utterances. Constrained by the model's size, a robust backbone Enhanced Res2Net (ERes2Net) combining global and local feature fusion…
▽ More
Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings. To address this challenge, a multi-scale feature fusion approach has been proposed to effectively capture speaker characteristics from short utterances. Constrained by the model's size, a robust backbone Enhanced Res2Net (ERes2Net) combining global and local feature fusion demonstrates sub-optimal performance in short-duration speaker verification. To further improve the short-duration feature extraction capability of ERes2Net, we expand the channel dimension within each stage. However, this modification also increases the number of model parameters and computational complexity. To alleviate this problem, we propose an improved ERes2NetV2 by pruning redundant structures, ultimately reducing both the model parameters and its computational cost. A range of experiments conducted on the VoxCeleb datasets exhibits the superiority of ERes2NetV2, which achieves EER of 0.61% for the full-duration trial, 0.98% for the 3s-duration trial, and 1.48% for the 2s-duration trial on VoxCeleb1-O, respectively.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec
Authors:
Shengpeng Ji,
Jialong Zuo,
Minghui Fang,
Siqi Zheng,
Qian Chen,
Wen Wang,
Ziyue Jiang,
Hai Huang,
Xize Cheng,
Rongjie Huang,
Zhou Zhao
Abstract:
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual style description prompt. Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and…
▽ More
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual style description prompt. Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and adjustment capabilities or were unrelated to speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging new task-a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture corresponding codec representations in a discrete decoupling codec space. Moreover, we discovered the issue of text style controllability in a many-to-many map** fashion and proposed the Style Mixture Semantic Density (SMSD) model to resolve this problem. SMSD module which is based on Gaussian mixture density networks, is designed to enhance the fine-grained partitioning and sampling capabilities of style semantic information and generate speech with more diverse styles. In terms of experiments, we make available a controllable model toolkit called ControlToolkit with a new style controllable dataset, some replicated baseline models and propose new metrics to evaluate both the control capability and the quality of generated audio in ControlSpeech. The relevant ablation studies validate the necessity of each component in ControlSpeech is necessary. We hope that ControlSpeech can establish the next foundation paradigm of controllable speech synthesis. The relevant code and demo are available at https://github.com/jishengpeng/ControlSpeech .
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
AudioLCM: Text-to-Audio Generation with Latent Consistency Models
Authors:
Huadai Liu,
Rongjie Huang,
Yang Liu,
Hengyuan Cao,
Jialei Wang,
Xize Cheng,
Siqi Zheng,
Zhou Zhao
Abstract:
Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficie…
▽ More
Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, facilitating rapid inference through a map** from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of steps while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational framework of transformers. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-sound generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audios, while it maintains sample quality competitive with state-of-the-art models using hundreds of steps. AudioLCM enables a sampling speed of 333x faster than real-time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio generation deployment. Our extensive preliminary analysis shows that each design in AudioLCM is effective.
△ Less
Submitted 1 June, 2024;
originally announced June 2024.
-
3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization
Authors:
Yafeng Chen,
Siqi Zheng,
Hui Wang,
Luyao Cheng,
Tinglong Zhu,
Changhe Song,
Rongjie Huang,
Ziyang Ma,
Qian Chen,
Shiliang Zhang,
Xihao Li
Abstract:
This paper introduces 3D-Speaker-Toolkit, an open source toolkit for multi-modal speaker verification and diarization. It is designed for the needs of academic researchers and industrial practitioners. The 3D-Speaker-Toolkit adeptly leverages the combined strengths of acoustic, semantic, and visual data, seamlessly fusing these modalities to offer robust speaker recognition capabilities. The acous…
▽ More
This paper introduces 3D-Speaker-Toolkit, an open source toolkit for multi-modal speaker verification and diarization. It is designed for the needs of academic researchers and industrial practitioners. The 3D-Speaker-Toolkit adeptly leverages the combined strengths of acoustic, semantic, and visual data, seamlessly fusing these modalities to offer robust speaker recognition capabilities. The acoustic module extracts speaker embeddings from acoustic features, employing both fully-supervised and self-supervised learning approaches. The semantic module leverages advanced language models to apprehend the substance and context of spoken language, thereby augmenting the system's proficiency in distinguishing speakers through linguistic patterns. Finally, the visual module applies image processing technologies to scrutinize facial features, which bolsters the precision of speaker diarization in multi-speaker environments. Collectively, these modules empower the 3D-Speaker-Toolkit to attain elevated levels of accuracy and dependability in executing speaker-related tasks, establishing a new benchmark in multi-modal speaker analysis. The 3D-Speaker project also includes a handful of open-sourced state-of-the-art models and a large dataset containing over 10,000 speakers. The toolkit is publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
△ Less
Submitted 29 March, 2024;
originally announced March 2024.
-
Human Activity Recognition with Low-Resolution Infrared Array Sensor Using Semi-supervised Cross-domain Neural Networks for Indoor Environment
Authors:
Cunyi Yin,
Xiren Miao,
**g Chen,
Hao Jiang,
Deying Chen,
Yixuan Tong,
Shaocong Zheng
Abstract:
Low-resolution infrared-based human activity recognition (HAR) attracted enormous interests due to its low-cost and private. In this paper, a novel semi-supervised crossdomain neural network (SCDNN) based on 8 $\times$ 8 low-resolution infrared sensor is proposed for accurately identifying human activity despite changes in the environment at a low-cost. The SCDNN consists of feature extractor, dom…
▽ More
Low-resolution infrared-based human activity recognition (HAR) attracted enormous interests due to its low-cost and private. In this paper, a novel semi-supervised crossdomain neural network (SCDNN) based on 8 $\times$ 8 low-resolution infrared sensor is proposed for accurately identifying human activity despite changes in the environment at a low-cost. The SCDNN consists of feature extractor, domain discriminator and label classifier. In the feature extractor, the unlabeled and minimal labeled target domain data are trained for domain adaptation to achieve a map** of the source domain and target domain data. The domain discriminator employs the unsupervised learning to migrate data from the source domain to the target domain. The label classifier obtained from training the source domain data improves the recognition of target domain activities due to the semi-supervised learning utilized in training the target domain data. Experimental results show that the proposed method achieves 92.12\% accuracy for recognition of activities in the target domain by migrating the source and target domains. The proposed approach adapts superior to cross-domain scenarios compared to the existing deep learning methods, and it provides a low-cost yet highly adaptable solution for cross-domain scenarios.
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models
Authors:
Shengpeng Ji,
Minghui Fang,
Ziyue Jiang,
Siqi Zheng,
Qian Chen,
Rongjie Huang,
Jialung Zuo,
Shulei Wang,
Zhou Zhao
Abstract:
In recent years, large language models have achieved significant success in generative tasks (e.g., speech cloning and audio generation) related to speech, audio, music, and other signal domains. A crucial element of these models is the discrete acoustic codecs, which serves as an intermediate representation replacing the mel-spectrogram. However, there exist several gaps between discrete codecs a…
▽ More
In recent years, large language models have achieved significant success in generative tasks (e.g., speech cloning and audio generation) related to speech, audio, music, and other signal domains. A crucial element of these models is the discrete acoustic codecs, which serves as an intermediate representation replacing the mel-spectrogram. However, there exist several gaps between discrete codecs and downstream speech language models. Specifically, 1) most codec models are trained on only 1,000 hours of data, whereas most speech language models are trained on 60,000 hours; 2) Achieving good reconstruction performance requires the utilization of numerous codebooks, which increases the burden on downstream speech language models; 3) The initial channel of the codebooks contains excessive information, making it challenging to directly generate acoustic tokens from weakly supervised signals such as text in downstream tasks. Consequently, leveraging the characteristics of speech language models, we propose Language-Codec. In the Language-Codec, we introduce a Mask Channel Residual Vector Quantization (MCRVQ) mechanism along with improved Fourier transform structures and larger training datasets to address the aforementioned gaps. We compare our method with competing audio compression algorithms and observe significant outperformance across extensive evaluations. Furthermore, we also validate the efficiency of the Language-Codec on downstream speech language models. The source code and pre-trained models can be accessed at https://github.com/jishengpeng/languagecodec .
△ Less
Submitted 27 April, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
Authors:
Ziyang Ma,
Guanrou Yang,
Yifan Yang,
Zhifu Gao,
Jiaming Wang,
Zhihao Du,
Fan Yu,
Qian Chen,
Siqi Zheng,
Shiliang Zhang,
Xie Chen
Abstract:
In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning f…
▽ More
In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM. We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task. To be more specific, we benchmark and explore various combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup and little task-specific design, where only the linear projector is trained. To the best of our knowledge, SLAM-ASR achieves the best performance on the Librispeech benchmark among LLM-based ASR models and even outperforms the latest LLM-based audio-universal model trained on massive pair data. Finally, we explore the capability emergence of LLM-based ASR in the process of modal alignment. We hope that our study can facilitate the research on extending LLM with cross-modality capacity and shed light on the LLM-based ASR community.
△ Less
Submitted 13 February, 2024;
originally announced February 2024.
-
Multi-modal Learning with Missing Modality in Predicting Axillary Lymph Node Metastasis
Authors:
Shichuan Zhang,
Sunyi Zheng,
Zhongyi Shui,
Honglin Li,
Lin Yang
Abstract:
Multi-modal Learning has attracted widespread attention in medical image analysis. Using multi-modal data, whole slide images (WSIs) and clinical information, can improve the performance of deep learning models in the diagnosis of axillary lymph node metastasis. However, clinical information is not easy to collect in clinical practice due to privacy concerns, limited resources, lack of interoperab…
▽ More
Multi-modal Learning has attracted widespread attention in medical image analysis. Using multi-modal data, whole slide images (WSIs) and clinical information, can improve the performance of deep learning models in the diagnosis of axillary lymph node metastasis. However, clinical information is not easy to collect in clinical practice due to privacy concerns, limited resources, lack of interoperability, etc. Although patient selection can ensure the training set to have multi-modal data for model development, missing modality of clinical information can appear during test. This normally leads to performance degradation, which limits the use of multi-modal models in the clinic. To alleviate this problem, we propose a bidirectional distillation framework consisting of a multi-modal branch and a single-modal branch. The single-modal branch acquires the complete multi-modal knowledge from the multi-modal branch, while the multi-modal learns the robust features of WSI from the single-modal. We conduct experiments on a public dataset of Lymph Node Metastasis in Early Breast Cancer to validate the method. Our approach not only achieves state-of-the-art performance with an AUC of 0.861 on the test set without missing data, but also yields an AUC of 0.842 when the rate of missing modality is 80\%. This shows the effectiveness of the approach in dealing with multi-modal data and missing modality. Such a model has the potential to improve treatment decision-making for early breast cancer patients who have axillary lymph node metastatic status.
△ Less
Submitted 3 January, 2024;
originally announced January 2024.
-
Adaptive Kalman-based hybrid car following strategy using TD3 and CACC
Authors:
Yuqi Zheng,
Ruidong Yan,
Bin Jia,
Rui Jiang,
Adriana TAPUS,
Xiao**g Chen,
Shiteng Zheng,
Ying Shang
Abstract:
In autonomous driving, the hybrid strategy of deep reinforcement learning and cooperative adaptive cruise control (CACC) can fully utilize the advantages of the two algorithms and significantly improve the performance of car following. However, it is challenging for the traditional hybrid strategy based on fixed coefficients to adapt to mixed traffic flow scenarios, which may decrease the performa…
▽ More
In autonomous driving, the hybrid strategy of deep reinforcement learning and cooperative adaptive cruise control (CACC) can fully utilize the advantages of the two algorithms and significantly improve the performance of car following. However, it is challenging for the traditional hybrid strategy based on fixed coefficients to adapt to mixed traffic flow scenarios, which may decrease the performance and even lead to accidents. To address the above problems, a hybrid car following strategy based on an adaptive Kalman Filter is proposed by regarding CACC and Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithms. Different from traditional hybrid strategy based on fixed coefficients, the Kalman gain H, using as an adaptive coefficient, is derived from multi-timestep predictions and Monte Carlo Tree Search. At the end of study, simulation results with 4157745 timesteps indicate that, compared with the TD3 and HCFS algorithms, the proposed algorithm in this study can substantially enhance the safety of car following in mixed traffic flow without compromising the comfort and efficiency.
△ Less
Submitted 26 December, 2023;
originally announced December 2023.
-
Joint Beam Scheduling and Power Optimization for Beam Hop** LEO Satellite Systems
Authors:
Shuang Zheng,
Xing Zhang,
Peng Wang,
Wenbo Wang
Abstract:
Low earth orbit (LEO) satellite communications can provide ubiquitous and reliable services, making it an essential part of the Internet of Everything network. Beam hop** (BH) is an emerging technology for effectively addressing the issue of low resource utilization caused by the non-uniform spatio-temporal distribution of traffic demands. However, how to allocate multi-dimensional resources in…
▽ More
Low earth orbit (LEO) satellite communications can provide ubiquitous and reliable services, making it an essential part of the Internet of Everything network. Beam hop** (BH) is an emerging technology for effectively addressing the issue of low resource utilization caused by the non-uniform spatio-temporal distribution of traffic demands. However, how to allocate multi-dimensional resources in a timely and efficient way for the highly dynamic LEO satellite systems remains a challenge. This paper proposes a joint beam scheduling and power optimization beam hop** (JBSPO-BH) algorithm considering the differences in the geographic distribution of sink nodes. The JBSPO-BH algorithm decouples the original problem into two sub-problems. The beam scheduling problem is modelled as a potential game, and the Nash equilibrium (NE) point is obtained as the beam scheduling strategy. Moreover, the penalty function interior point method is applied to optimize the power allocation. Simulation results show that the JBSPO-BH algorithm has low time complexity and fast convergence and achieves better performance both in throughput and fairness. Compared with greedy-based BH, greedy-based BH with the power optimization, round-robin BH, Max-SINR BH and satellite resource allocation algorithm, the throughput of the proposed algorithm is improved by 44.99%, 20.79%, 156.06%, 15.39% and 8.17%, respectively.
△ Less
Submitted 3 December, 2023;
originally announced December 2023.
-
Deep Learning-Based Frequency Offset Estimation
Authors:
Tao Chen,
Shilian Zheng,
Jiawei Zhu,
Qi Xuan,
Xiaoniu Yang
Abstract:
In wireless communication systems, the asynchronization of the oscillators in the transmitter and the receiver along with the Doppler shift due to relative movement may lead to the presence of carrier frequency offset (CFO) in the received signals. Estimation of CFO is crucial for subsequent processing such as coherent demodulation. In this brief, we demonstrate the utilization of deep learning fo…
▽ More
In wireless communication systems, the asynchronization of the oscillators in the transmitter and the receiver along with the Doppler shift due to relative movement may lead to the presence of carrier frequency offset (CFO) in the received signals. Estimation of CFO is crucial for subsequent processing such as coherent demodulation. In this brief, we demonstrate the utilization of deep learning for CFO estimation by employing a residual network (ResNet) to learn and extract signal features from the raw in-phase (I) and quadrature (Q) components of the signals. We use multiple modulation schemes in the training set to make the trained model adaptable to multiple modulations or even new signals. In comparison to the commonly used traditional CFO estimation methods, our proposed IQ-ResNet method exhibits superior performance across various scenarios including different oversampling ratios, various signal lengths, and different channels
△ Less
Submitted 8 November, 2023;
originally announced November 2023.
-
Correlation-Distance Graph Learning for Treatment Response Prediction from rs-fMRI
Authors:
Xiatian Zhang,
Sisi Zheng,
Hubert P. H. Shum,
Haozheng Zhang,
Nan Song,
Mingkang Song,
Hongxiao Jia
Abstract:
Resting-state fMRI (rs-fMRI) functional connectivity (FC) analysis provides valuable insights into the relationships between different brain regions and their potential implications for neurological or psychiatric disorders. However, specific design efforts to predict treatment response from rs-fMRI remain limited due to difficulties in understanding the current brain state and the underlying mech…
▽ More
Resting-state fMRI (rs-fMRI) functional connectivity (FC) analysis provides valuable insights into the relationships between different brain regions and their potential implications for neurological or psychiatric disorders. However, specific design efforts to predict treatment response from rs-fMRI remain limited due to difficulties in understanding the current brain state and the underlying mechanisms driving the observed patterns, which limited the clinical application of rs-fMRI. To overcome that, we propose a graph learning framework that captures comprehensive features by integrating both correlation and distance-based similarity measures under a contrastive loss. This approach results in a more expressive framework that captures brain dynamic features at different scales and enables more accurate prediction of treatment response. Our experiments on the chronic pain and depersonalization disorder datasets demonstrate that our proposed method outperforms current methods in different scenarios. To the best of our knowledge, we are the first to explore the integration of distance-based and correlation-based neural similarity into graph learning for treatment response prediction.
△ Less
Submitted 17 November, 2023;
originally announced November 2023.
-
Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR
Authors:
Qian Chen,
Wen Wang,
Qinglin Zhang,
Siqi Zheng,
Shiliang Zhang,
Chong Deng,
Yukun Ma,
Hai Yu,
Jiaqing Liu,
Chong Zhang
Abstract:
Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Mask…
▽ More
Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld
△ Less
Submitted 4 February, 2024; v1 submitted 8 November, 2023;
originally announced November 2023.
-
Augmenting Radio Signals with Wavelet Transform for Deep Learning-Based Modulation Recognition
Authors:
Tao Chen,
Shilian Zheng,
Kunfeng Qiu,
Luxin Zhang,
Qi Xuan,
Xiaoniu Yang
Abstract:
The use of deep learning for radio modulation recognition has become prevalent in recent years. This approach automatically extracts high-dimensional features from large datasets, facilitating the accurate classification of modulation schemes. However, in real-world scenarios, it may not be feasible to gather sufficient training data in advance. Data augmentation is a method used to increase the d…
▽ More
The use of deep learning for radio modulation recognition has become prevalent in recent years. This approach automatically extracts high-dimensional features from large datasets, facilitating the accurate classification of modulation schemes. However, in real-world scenarios, it may not be feasible to gather sufficient training data in advance. Data augmentation is a method used to increase the diversity and quantity of training dataset and to reduce data sparsity and imbalance. In this paper, we propose data augmentation methods that involve replacing detail coefficients decomposed by discrete wavelet transform for reconstructing to generate new samples and expand the training set. Different generation methods are used to generate replacement sequences. Simulation results indicate that our proposed methods significantly outperform the other augmentation methods.
△ Less
Submitted 7 November, 2023;
originally announced November 2023.
-
Foundation Model Based Native AI Framework in 6G with Cloud-Edge-End Collaboration
Authors:
Xiang Chen,
Zhiheng Guo,
Xijun Wang,
Howard H. Yang,
Chenyuan Feng,
Junshen Su,
Sihui Zheng,
Tony Q. S. Quek
Abstract:
Future wireless communication networks are in a position to move beyond data-centric, device-oriented connectivity and offer intelligent, immersive experiences based on task-oriented connections, especially in the context of the thriving development of pre-trained foundation models (PFM) and the evolving vision of 6G native artificial intelligence (AI). Therefore, redefining modes of collaboration…
▽ More
Future wireless communication networks are in a position to move beyond data-centric, device-oriented connectivity and offer intelligent, immersive experiences based on task-oriented connections, especially in the context of the thriving development of pre-trained foundation models (PFM) and the evolving vision of 6G native artificial intelligence (AI). Therefore, redefining modes of collaboration between devices and servers and constructing native intelligence libraries become critically important in 6G. In this paper, we analyze the challenges of achieving 6G native AI from the perspectives of data, intelligence, and networks. Then, we propose a 6G native AI framework based on foundation models, provide a customization approach for intent-aware PFM, present a construction of a task-oriented AI toolkit, and outline a novel cloud-edge-end collaboration paradigm. As a practical use case, we apply this framework for orchestration, achieving the maximum sum rate within a wireless communication system, and presenting preliminary evaluation results. Finally, we outline research directions for achieving native AI in 6G.
△ Less
Submitted 26 October, 2023;
originally announced October 2023.
-
Accurate battery lifetime prediction across diverse aging conditions with deep learning
Authors:
Han Zhang,
Yuqi Li,
Shun Zheng,
Ziheng Lu,
Xiaofan Gui,
Wei Xu,
Jiang Bian
Abstract:
Accurately predicting the lifetime of battery cells in early cycles holds tremendous value for battery research and development as well as numerous downstream applications. This task is rather challenging because diverse conditions, such as electrode materials, operating conditions, and working environments, collectively determine complex capacity-degradation behaviors. However, current prediction…
▽ More
Accurately predicting the lifetime of battery cells in early cycles holds tremendous value for battery research and development as well as numerous downstream applications. This task is rather challenging because diverse conditions, such as electrode materials, operating conditions, and working environments, collectively determine complex capacity-degradation behaviors. However, current prediction methods are developed and validated under limited aging conditions, resulting in questionable adaptability to varied aging conditions and an inability to fully benefit from historical data collected under different conditions. Here we introduce a universal deep learning approach that is capable of accommodating various aging conditions and facilitating effective learning under low-resource conditions by leveraging data from rich conditions. Our key finding is that incorporating inter-cell feature differences, rather than solely considering single-cell characteristics, significantly increases the accuracy of battery lifetime prediction and its cross-condition robustness. Accordingly, we develop a holistic learning framework accommodating both single-cell and inter-cell modeling. A comprehensive benchmark is built for evaluation, encompassing 401 battery cells utilizing 5 prevalent electrode materials across 168 cycling conditions. We demonstrate remarkable capabilities in learning across diverse aging conditions, exclusively achieving 10% prediction error using the first 100 cycles, and in facilitating low-resource learning, almost halving the error of single-cell modeling in many cases. More broadly, by breaking the learning boundaries among different aging conditions, our approach could significantly accelerate the development and optimization of lithium-ion batteries.
△ Less
Submitted 24 November, 2023; v1 submitted 8 October, 2023;
originally announced October 2023.
-
LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT
Authors:
Jiaming Wang,
Zhihao Du,
Qian Chen,
Yunfei Chu,
Zhifu Gao,
Zerui Li,
Kai Hu,
Xiaohuan Zhou,
** Xu,
Ziyang Ma,
Wen Wang,
Siqi Zheng,
Chang Zhou,
Zhijie Yan,
Shiliang Zhang
Abstract:
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks. However, there has been limited research on applying similar frameworks to audio tasks. Previously proposed large language models for audio tasks either lack sufficient quantitative evaluations, or are limited to tasks for recognizing and understanding audio content, o…
▽ More
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks. However, there has been limited research on applying similar frameworks to audio tasks. Previously proposed large language models for audio tasks either lack sufficient quantitative evaluations, or are limited to tasks for recognizing and understanding audio content, or significantly underperform existing state-of-the-art (SOTA) models. In this paper, we propose LauraGPT, a unified GPT model for audio recognition, understanding, and generation. LauraGPT is a versatile language model that can process both audio and text inputs and generate outputs in either modalities. It can perform a wide range of tasks related to content, semantics, paralinguistics, and audio-signal analysis. Some of its noteworthy tasks include automatic speech recognition, speech-to-text translation, text-to-speech synthesis, machine translation, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding. To achieve this goal, we use a combination of continuous and discrete features for audio. We encode input audio into continuous representations using an audio encoder and decode output audio from discrete codec codes. We then fine-tune a large decoder-only Transformer-based language model on multiple audio-to-text, text-to-audio, audio-to-audio, and text-to-text tasks using a supervised multitask learning approach. Extensive experiments show that LauraGPT achieves competitive or superior performance compared to existing SOTA models on various audio processing benchmarks.
△ Less
Submitted 10 October, 2023; v1 submitted 6 October, 2023;
originally announced October 2023.
-
Analysis on Multi-robot Relative 6-DOF Pose Estimation Error Based on UWB Range
Authors:
Xinran Li,
Shuaikang Zheng,
Pengcheng Zheng,
Haifeng Zhang,
Zhitian Li,
Xudong Zou
Abstract:
Relative pose estimation is the foundational requirement for multi-robot system, while it is a challenging research topic in infrastructure-free scenes. In this study, we analyze the relative 6-DOF pose estimation error of multi-robot system in GNSS-denied and anchor-free environment. An analytical lower bound of position and orientation estimation error is given under the assumption that distance…
▽ More
Relative pose estimation is the foundational requirement for multi-robot system, while it is a challenging research topic in infrastructure-free scenes. In this study, we analyze the relative 6-DOF pose estimation error of multi-robot system in GNSS-denied and anchor-free environment. An analytical lower bound of position and orientation estimation error is given under the assumption that distance between the nodes are far more than the size of robotic platform. Through simulation, impact of distance between nodes, altitudes and circumradius of tag simplex on pose estimation accuracy is discussed, which verifies the analysis results. Our analysis is expected to determine parameters (e.g. deployment of tags) of UWB based multi-robot systems.
△ Less
Submitted 26 September, 2023;
originally announced September 2023.
-
Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation
Authors:
Luyao Cheng,
Siqi Zheng,
Qinglin Zhang,
Hui Wang,
Yafeng Chen,
Qian Chen,
Shiliang Zhang
Abstract:
Speaker diarization has gained considerable attention within speech processing research community. Mainstream speaker diarization rely primarily on speakers' voice characteristics extracted from acoustic signals and often overlook the potential of semantic information. Considering the fact that speech signals can efficiently convey the content of a speech, it is of our interest to fully exploit th…
▽ More
Speaker diarization has gained considerable attention within speech processing research community. Mainstream speaker diarization rely primarily on speakers' voice characteristics extracted from acoustic signals and often overlook the potential of semantic information. Considering the fact that speech signals can efficiently convey the content of a speech, it is of our interest to fully exploit these semantic cues utilizing language models. In this work we propose a novel approach to effectively leverage semantic information in clustering-based speaker diarization systems. Firstly, we introduce spoken language understanding modules to extract speaker-related semantic information and utilize these information to construct pairwise constraints. Secondly, we present a novel framework to integrate these constraints into the speaker diarization pipeline, enhancing the performance of the entire system. Extensive experiments conducted on the public dataset demonstrate the consistent superiority of our proposed approach over acoustic-only speaker diarization systems.
△ Less
Submitted 4 February, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
-
FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec
Authors:
Zhihao Du,
Shiliang Zhang,
Kai Hu,
Siqi Zheng
Abstract:
This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as…
▽ More
This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as speech recognition. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes. Based on the toolkit, we further propose the frequency-domain codec models, FreqCodec, which can achieve comparable speech quality with much lower computation and parameter complexity. Experimental results show that, under the same compression ratio, FunCodec can achieve better reconstruction quality compared with other toolkits and released models. We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis. This toolkit is publicly available at https://github.com/alibaba-damo-academy/FunCodec.
△ Less
Submitted 6 October, 2023; v1 submitted 13 September, 2023;
originally announced September 2023.
-
Leveraging Large Language Models for Exploiting ASR Uncertainty
Authors:
Pranay Dighe,
Yi Su,
Shangshang Zheng,
Yunshu Liu,
Vineet Garg,
Xiaochuan Niu,
Ahmed Tewfik
Abstract:
While large language models excel in a variety of natural language processing (NLP) tasks, to perform well on spoken language understanding (SLU) tasks, they must either rely on off-the-shelf automatic speech recognition (ASR) systems for transcription, or be equipped with an in-built speech modality. This work focuses on the former scenario, where LLM's accuracy on SLU tasks is constrained by the…
▽ More
While large language models excel in a variety of natural language processing (NLP) tasks, to perform well on spoken language understanding (SLU) tasks, they must either rely on off-the-shelf automatic speech recognition (ASR) systems for transcription, or be equipped with an in-built speech modality. This work focuses on the former scenario, where LLM's accuracy on SLU tasks is constrained by the accuracy of a fixed ASR system on the spoken input. Specifically, we tackle speech-intent classification task, where a high word-error-rate can limit the LLM's ability to understand the spoken intent. Instead of chasing a high accuracy by designing complex or specialized architectures regardless of deployment costs, we seek to answer how far we can go without substantially changing the underlying ASR and LLM, which can potentially be shared by multiple unrelated tasks. To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We explore prompt-engineering to explain the concept of n-best lists to the LLM; followed by the finetuning of Low-Rank Adapters on the downstream tasks. Our approach using n-best lists proves to be effective on a device-directed speech detection task as well as on a keyword spotting task, where systems using n-best list prompts outperform those using 1-best ASR hypothesis; thus paving the way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications.
△ Less
Submitted 12 September, 2023; v1 submitted 9 September, 2023;
originally announced September 2023.
-
Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision
Authors:
Yafeng Chen,
Siqi Zheng,
Hui Wang,
Luyao Cheng,
Qian Chen,
Shiliang Zhang,
Wen Wang
Abstract:
Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an…
▽ More
Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. Originally, due to the lack of negative pairs in the SDPN training process, the network tends to align positive pairs very closely in the embedding space, a phenomenon known as model collapse. To alleviate this problem, we introduce a diversity regularization term to embeddings in SDPN. Comprehensive experiments on the VoxCeleb datasets demonstrate the superiority of SDPN in self-supervised speaker verification. SDPN sets a new state-of-the-art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rate 1.80%, 1.99%, and 3.62% for trial VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H respectively, without using any speaker labels in training. Ablation studies show that both proposed learnable prototypes in self-distillation network and diversity regularization contribute to the verification performance.
△ Less
Submitted 26 June, 2024; v1 submitted 4 August, 2023;
originally announced August 2023.
-
Learning to Segment from Noisy Annotations: A Spatial Correction Approach
Authors:
Jiachen Yao,
Yikai Zhang,
Songzhu Zheng,
Mayank Goswami,
Prateek Prasanna,
Chao Chen
Abstract:
Noisy labels can significantly affect the performance of deep neural networks (DNNs). In medical image segmentation tasks, annotations are error-prone due to the high demand in annotation time and in the annotators' expertise. Existing methods mostly assume noisy labels in different pixels are \textit{i.i.d}. However, segmentation label noise usually has strong spatial correlation and has prominen…
▽ More
Noisy labels can significantly affect the performance of deep neural networks (DNNs). In medical image segmentation tasks, annotations are error-prone due to the high demand in annotation time and in the annotators' expertise. Existing methods mostly assume noisy labels in different pixels are \textit{i.i.d}. However, segmentation label noise usually has strong spatial correlation and has prominent bias in distribution. In this paper, we propose a novel Markov model for segmentation noisy annotations that encodes both spatial correlation and bias. Further, to mitigate such label noise, we propose a label correction method to recover true label progressively. We provide theoretical guarantees of the correctness of the proposed method. Experiments show that our approach outperforms current state-of-the-art methods on both synthetic and real-world noisy annotations.
△ Less
Submitted 20 July, 2023;
originally announced August 2023.
-
3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement
Authors:
Siqi Zheng,
Luyao Cheng,
Yafeng Chen,
Hui Wang,
Qian Chen
Abstract:
Disentangling uncorrelated information in speech utterances is a crucial research topic within speech community. Different speech-related tasks focus on extracting distinct speech representations while minimizing the affects of other uncorrelated information. We present a large-scale speech corpus to facilitate the research of speech representation disentanglement. 3D-Speaker contains over 10,000…
▽ More
Disentangling uncorrelated information in speech utterances is a crucial research topic within speech community. Different speech-related tasks focus on extracting distinct speech representations while minimizing the affects of other uncorrelated information. We present a large-scale speech corpus to facilitate the research of speech representation disentanglement. 3D-Speaker contains over 10,000 speakers, each of whom are simultaneously recorded by multiple Devices, locating at different Distances, and some speakers are speaking multiple Dialects. The controlled combinations of multi-dimensional audio data yield a matrix of a diverse blend of speech representation entanglement, thereby motivating intriguing methods to untangle them. The multi-domain nature of 3D-Speaker also makes it a suitable resource to evaluate large universal speech models and experiment methods of out-of-domain learning and self-supervised learning. https://3dspeaker.github.io/
△ Less
Submitted 24 September, 2023; v1 submitted 27 June, 2023;
originally announced June 2023.
-
Computationally Enhanced Approach for Chance-Constrained OPF Considering Voltage Stability
Authors:
Yuanxi Wu,
Zhi Wu,
Yijun Xu,
Huan Long,
Wei Gu,
Shu Zheng,
**gtao Zhao
Abstract:
The effective management of stochastic characteristics of renewable power generations is vital for ensuring the stable and secure operation of power systems. This paper addresses the task of optimizing the chance-constrained voltage-stability-constrained optimal power flow (CC-VSC-OPF) problem, which is hindered by the implicit voltage stability index and intractable chance constraints Leveraging…
▽ More
The effective management of stochastic characteristics of renewable power generations is vital for ensuring the stable and secure operation of power systems. This paper addresses the task of optimizing the chance-constrained voltage-stability-constrained optimal power flow (CC-VSC-OPF) problem, which is hindered by the implicit voltage stability index and intractable chance constraints Leveraging a neural network (NN)-based surrogate model, the stability constraint is explicitly formulated and directly integrated into the model. To perform uncertainty propagation without relying on presumptions or complicated transformations, an advanced data-driven method known as adaptive polynomial chaos expansion (APCE) is developed. To extend the scalability of the proposed algorithm, a partial least squares (PLS)-NN framework is designed, which enables the establishment of a parsimonious surrogate model and efficient computation of large-scale Hessian matrices. In addition, a dimensionally decomposed APCE (DD-APCE) is proposed to alleviate the "curse of dimensionality" by restricting the interaction order among random variables. Finally, the above techniques are merged into an iterative scheme to update the operation point. Simulation results reveal the cost-effective performances of the proposed method in several test systems.
△ Less
Submitted 3 January, 2024; v1 submitted 26 June, 2023;
originally announced June 2023.
-
Hybrid Driven Learning for Channel Estimation in Intelligent Reflecting Surface Aided Millimeter Wave Communications
Authors:
Shuntian Zheng,
Sheng Wu,
Chunxiao Jiang,
Wei Zhang,
Xiaojun **g
Abstract:
Intelligent reflecting surfaces (IRS) have been proposed in millimeter wave (mmWave) and terahertz (THz) systems to achieve both coverage and capacity enhancement, where the design of hybrid precoders, combiners, and the IRS typically relies on channel state information. In this paper, we address the problem of uplink wideband channel estimation for IRS aided multiuser multiple-input single-output…
▽ More
Intelligent reflecting surfaces (IRS) have been proposed in millimeter wave (mmWave) and terahertz (THz) systems to achieve both coverage and capacity enhancement, where the design of hybrid precoders, combiners, and the IRS typically relies on channel state information. In this paper, we address the problem of uplink wideband channel estimation for IRS aided multiuser multiple-input single-output (MISO) systems with hybrid architectures. Combining the structure of model driven and data driven deep learning approaches, a hybrid driven learning architecture is devised for joint estimation and learning the properties of the channels. For a passive IRS aided system, we propose a residual learned approximate message passing as a model driven network. A denoising and attention network in the data driven network is used to jointly learn spatial and frequency features. Furthermore, we design a flexible hybrid driven network in a hybrid passive and active IRS aided system. Specifically, the depthwise separable convolution is applied to the data driven network, leading to less network complexity and fewer parameters at the IRS side. Numerical results indicate that in both systems, the proposed hybrid driven channel estimation methods significantly outperform existing deep learning-based schemes and effectively reduce the pilot overhead by about 60% in IRS aided systems.
△ Less
Submitted 30 May, 2023;
originally announced May 2023.
-
Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization
Authors:
Luyao Cheng,
Siqi Zheng,
Zhang Qinglin,
Hui Wang,
Yafeng Chen,
Qian Chen
Abstract:
Speaker diarization(SD) is a classic task in speech processing and is crucial in multi-party scenarios such as meetings and conversations. Current mainstream speaker diarization approaches consider acoustic information only, which result in performance degradation when encountering adverse acoustic conditions. In this paper, we propose methods to extract speaker-related information from semantic c…
▽ More
Speaker diarization(SD) is a classic task in speech processing and is crucial in multi-party scenarios such as meetings and conversations. Current mainstream speaker diarization approaches consider acoustic information only, which result in performance degradation when encountering adverse acoustic conditions. In this paper, we propose methods to extract speaker-related information from semantic content in multi-party meetings, which, as we will show, can further benefit speaker diarization. We introduce two sub-tasks, Dialogue Detection and Speaker-Turn Detection, in which we effectively extract speaker information from conversational semantics. We also propose a simple yet effective algorithm to jointly model acoustic and semantic information and obtain speaker-identified texts. Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
△ Less
Submitted 22 May, 2023;
originally announced May 2023.
-
An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification
Authors:
Yafeng Chen,
Siqi Zheng,
Hui Wang,
Luyao Cheng,
Qian Chen,
Jiajun Qi
Abstract:
Effective fusion of multi-scale features is crucial for improving speaker verification performance. While most existing methods aggregate multi-scale features in a layer-wise manner via simple operations, such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve the…
▽ More
Effective fusion of multi-scale features is crucial for improving speaker verification performance. While most existing methods aggregate multi-scale features in a layer-wise manner via simple operations, such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve the performance. The local feature fusion (LFF) fuses the features within one single residual block to extract the local signal. The global feature fusion (GFF) takes acoustic features of different scales as input to aggregate global signal. To facilitate effective feature fusion in both LFF and GFF, an attentional feature fusion module is employed in the ERes2Net architecture, replacing summation or concatenation operations. A range of experiments conducted on the VoxCeleb datasets demonstrate the superiority of the ERes2Net in speaker verification. Code has been made publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
△ Less
Submitted 3 August, 2023; v1 submitted 22 May, 2023;
originally announced May 2023.
-
Review of X-ray pulsar spacecraft autonomous navigation
Authors:
Yidi Wang,
Wei Zheng,
Shuangnan Zhang,
Minyu Ge,
Liansheng Li,
Kun Jiang,
Xiaoqian Chen,
Xiang Zhang,
Shijie Zheng,
Fangjun Lu
Abstract:
This article provides a review on X-ray pulsar-based navigation (XNAV). The review starts with the basic concept of XNAV, and briefly introduces the past, present and future projects concerning XNAV. This paper focuses on the advances of the key techniques supporting XNAV, including the navigation pulsar database, the X-ray detection system, and the pulse time of arrival estimation. Moreover, the…
▽ More
This article provides a review on X-ray pulsar-based navigation (XNAV). The review starts with the basic concept of XNAV, and briefly introduces the past, present and future projects concerning XNAV. This paper focuses on the advances of the key techniques supporting XNAV, including the navigation pulsar database, the X-ray detection system, and the pulse time of arrival estimation. Moreover, the methods to improve the estimation performance of XNAV are reviewed. Finally, some remarks on the future development of XNAV are provided.
△ Less
Submitted 9 April, 2023;
originally announced April 2023.
-
A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI
Authors:
Chenshuang Zhang,
Chaoning Zhang,
Sheng Zheng,
Mengchun Zhang,
Maryam Qamar,
Sung-Ho Bae,
In So Kweon
Abstract:
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement. This work conducts a survey on audio diffusion model, which is complementary to existing surveys that either lack the r…
▽ More
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement. This work conducts a survey on audio diffusion model, which is complementary to existing surveys that either lack the recent progress of diffusion-based speech synthesis or highlight an overall picture of applying diffusion model in multiple fields. Specifically, this work first briefly introduces the background of audio and diffusion model. As for the text-to-speech task, we divide the methods into three categories based on the stage where diffusion model is adopted: acoustic model, vocoder and end-to-end framework. Moreover, we categorize various speech enhancement tasks by either certain signals are removed or added into the input speech. Comparisons of experimental results and discussions are also covered in this survey.
△ Less
Submitted 2 April, 2023; v1 submitted 23 March, 2023;
originally announced March 2023.
-
Adaptive Local Adversarial Attacks on 3D Point Clouds for Augmented Reality
Authors:
Weiquan Liu,
Shijun Zheng,
Cheng Wang
Abstract:
As the key technology of augmented reality (AR), 3D recognition and tracking are always vulnerable to adversarial examples, which will cause serious security risks to AR systems. Adversarial examples are beneficial to improve the robustness of the 3D neural network model and enhance the stability of the AR system. At present, most 3D adversarial attack methods perturb the entire point cloud to gen…
▽ More
As the key technology of augmented reality (AR), 3D recognition and tracking are always vulnerable to adversarial examples, which will cause serious security risks to AR systems. Adversarial examples are beneficial to improve the robustness of the 3D neural network model and enhance the stability of the AR system. At present, most 3D adversarial attack methods perturb the entire point cloud to generate adversarial examples, which results in high perturbation costs and difficulty in reconstructing the corresponding real objects in the physical world. In this paper, we propose an adaptive local adversarial attack method (AL-Adv) on 3D point clouds to generate adversarial point clouds. First, we analyze the vulnerability of the 3D network model and extract the salient regions of the input point cloud, namely the vulnerable regions. Second, we propose an adaptive gradient attack algorithm that targets vulnerable regions. The proposed attack algorithm adaptively assigns different disturbances in different directions of the three-dimensional coordinates of the point cloud. Experimental results show that our proposed method AL-Adv achieves a higher attack success rate than the global attack method. Specifically, the adversarial examples generated by the AL-Adv demonstrate good imperceptibility and small generation costs.
△ Less
Submitted 12 March, 2023;
originally announced March 2023.
-
CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking
Authors:
Hui Wang,
Siqi Zheng,
Yafeng Chen,
Luyao Cheng,
Qian Chen
Abstract:
Time delay neural network (TDNN) has been proven to be efficient for speaker verification. One of its successful variants, ECAPA-TDNN, achieved state-of-the-art performance at the cost of much higher computational complexity and slower inference speed. This makes it inadequate for scenarios with demanding inference rate and limited computational resources. We are thus interested in finding an arch…
▽ More
Time delay neural network (TDNN) has been proven to be efficient for speaker verification. One of its successful variants, ECAPA-TDNN, achieved state-of-the-art performance at the cost of much higher computational complexity and slower inference speed. This makes it inadequate for scenarios with demanding inference rate and limited computational resources. We are thus interested in finding an architecture that can achieve the performance of ECAPA-TDNN and the efficiency of vanilla TDNN. In this paper, we propose an efficient network based on context-aware masking, namely CAM++, which uses densely connected time delay neural network (D-TDNN) as backbone and adopts a novel multi-granularity pooling to capture contextual information at different levels. Extensive experiments on two public benchmarks, VoxCeleb and CN-Celeb, demonstrate that the proposed architecture outperforms other mainstream speaker verification systems with lower computational cost and faster inference speed.
△ Less
Submitted 16 June, 2023; v1 submitted 1 March, 2023;
originally announced March 2023.
-
Efficient Constrained Codes That Enable Page Separation in Modern Flash Memories
Authors:
Ahmed Hareedy,
Simeng Zheng,
Paul Siegel,
Robert Calderbank
Abstract:
The pivotal storage density win achieved by solid-state devices over magnetic devices recently is a result of multiple innovations in physics, architecture, and signal processing. Constrained coding is used in Flash devices to increase reliability via mitigating inter-cell interference. Recently, capacity-achieving constrained codes were introduced to serve that purpose. While these codes result i…
▽ More
The pivotal storage density win achieved by solid-state devices over magnetic devices recently is a result of multiple innovations in physics, architecture, and signal processing. Constrained coding is used in Flash devices to increase reliability via mitigating inter-cell interference. Recently, capacity-achieving constrained codes were introduced to serve that purpose. While these codes result in minimal redundancy, they result in non-negligible complexity increase and access speed limitation since pages cannot be read separately. In this paper, we suggest new constrained coding schemes that have low-complexity and preserve the desirable high access speed in modern Flash devices. The idea is to eliminate error-prone patterns by coding data either only on the left-most page (binary coding) or only on the two left-most pages ($4$-ary coding) while leaving data on all the remaining pages uncoded. Our coding schemes are systematic and capacity-approaching. We refer to the proposed schemes as read-and-run (RR) constrained coding schemes. The $4$-ary RR coding scheme is introduced to limit the rate loss. We analyze the new RR coding schemes and discuss their impact on the probability of occurrence of different charge levels. We also demonstrate the performance improvement achieved via RR coding on a practical triple-level cell Flash device.
△ Less
Submitted 3 February, 2023;
originally announced February 2023.
-
Decentralized Energy Market Integrating Carbon Allowance Trade and Uncertainty Balance in Energy Communities
Authors:
Yuanxi Wu,
Zhi Wu,
Wei Gu,
Zheng Xu,
Shu Zheng,
Qirun Sun
Abstract:
With the sustained attention on carbon neutrality, the personal carbon trading (PCT) scheme has been embraced as an auspicious paradigm for scaling down carbon emissions. To facilitate the simultaneous clearance of energy and carbon allowance inside the energy community while hedging against uncertainty, a joint trading framework is proposed in this article. The energy trading is implemented in a…
▽ More
With the sustained attention on carbon neutrality, the personal carbon trading (PCT) scheme has been embraced as an auspicious paradigm for scaling down carbon emissions. To facilitate the simultaneous clearance of energy and carbon allowance inside the energy community while hedging against uncertainty, a joint trading framework is proposed in this article. The energy trading is implemented in a peer-to-peer (P2P) manner without the intervention of a central operator, and the uncertainty trading is materialized through procuring reserve of conventional generators and flexibility of users. Under the PCT scheme, carbon allowance is transacted via a sharing mechanism. Possible excessive carbon emissions due to uncertainty balance are tackled by obliging renewable agents to procure sufficient carbon allowances, following the consumption responsibility principle. A two-stage iterative method consisting of tightening McCormick envelope and alternating direction method of multipliers (ADMM) is devised to transform the model into a mixed-integer second-order cone program (MISOCP) and to allow for a fully decentralized market-clearing procedure. Numerical results have validated the effectiveness of the proposed market model.
△ Less
Submitted 28 January, 2023;
originally announced January 2023.
-
DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect
Authors:
**glin Liu,
Zhenhui Ye,
Qian Chen,
Siqi Zheng,
Wen Wang,
Qinglin Zhang,
Zhou Zhao
Abstract:
Recently, binaural audio synthesis (BAS) has emerged as a promising research field for its applications in augmented and virtual realities. Binaural audio helps users orient themselves and establish immersion by providing the brain with interaural time differences reflecting spatial information. However, existing BAS methods are limited in terms of phase estimation, which is crucial for spatial he…
▽ More
Recently, binaural audio synthesis (BAS) has emerged as a promising research field for its applications in augmented and virtual realities. Binaural audio helps users orient themselves and establish immersion by providing the brain with interaural time differences reflecting spatial information. However, existing BAS methods are limited in terms of phase estimation, which is crucial for spatial hearing. In this paper, we propose the \textbf{DopplerBAS} method to explicitly address the Doppler effect of the moving sound source. Specifically, we calculate the radial relative velocity of the moving speaker in spherical coordinates, which further guides the synthesis of binaural audio. This simple method introduces no additional hyper-parameters and does not modify the loss functions, and is plug-and-play: it scales well to different types of backbones. DopperBAS distinctly improves the representative WarpNet and BinauralGrad backbones in the phase error metric and reaches a new state of the art (SOTA): 0.780 (versus the current SOTA 0.807). Experiments and ablation studies demonstrate the effectiveness of our method.
△ Less
Submitted 1 June, 2023; v1 submitted 13 December, 2022;
originally announced December 2022.
-
Contextual Expressive Text-to-Speech
Authors:
Jianhong Tu,
Zeyu Cui,
Xiaohuan Zhou,
Siqi Zheng,
Kai Hu,
Ju Fan,
Chang Zhou
Abstract:
The goal of expressive Text-to-speech (TTS) is to synthesize natural speech with desired content, prosody, emotion, or timbre, in high expressiveness. Most of previous studies attempt to generate speech from given labels of styles and emotions, which over-simplifies the problem by classifying styles and emotions into a fixed number of pre-defined categories. In this paper, we introduce a new task…
▽ More
The goal of expressive Text-to-speech (TTS) is to synthesize natural speech with desired content, prosody, emotion, or timbre, in high expressiveness. Most of previous studies attempt to generate speech from given labels of styles and emotions, which over-simplifies the problem by classifying styles and emotions into a fixed number of pre-defined categories. In this paper, we introduce a new task setting, Contextual TTS (CTTS). The main idea of CTTS is that how a person speaks depends on the particular context she is in, where the context can typically be represented as text. Thus, in the CTTS task, we propose to utilize such context to guide the speech synthesis process instead of relying on explicit labels of styles and emotions. To achieve this task, we construct a synthetic dataset and develop an effective framework. Experiments show that our framework can generate high-quality expressive speech based on the given context both in synthetic datasets and real-world scenarios.
△ Less
Submitted 26 November, 2022;
originally announced November 2022.
-
Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis
Authors:
Zhihao Du,
Shiliang Zhang,
Siqi Zheng,
Zhijie Yan
Abstract:
Recently, hybrid systems of clustering and neural diarization models have been successfully applied in multi-party meeting analysis. However, current models always treat overlapped speaker diarization as a multi-label classification problem, where speaker dependency and overlaps are not well considered. To overcome the disadvantages, we reformulate overlapped speaker diarization task as a single-l…
▽ More
Recently, hybrid systems of clustering and neural diarization models have been successfully applied in multi-party meeting analysis. However, current models always treat overlapped speaker diarization as a multi-label classification problem, where speaker dependency and overlaps are not well considered. To overcome the disadvantages, we reformulate overlapped speaker diarization task as a single-label prediction problem via the proposed power set encoding (PSE). Through this formulation, speaker dependency and overlaps can be explicitly modeled. To fully leverage this formulation, we further propose the speaker overlap-aware neural diarization (SOND) model, which consists of a context-independent (CI) scorer to model global speaker discriminability, a context-dependent scorer (CD) to model local discriminability, and a speaker combining network (SCN) to combine and reassign speaker activities. Experimental results show that using the proposed formulation can outperform the state-of-the-art methods based on target speaker voice activity detection, and the performance can be further improved with SOND, resulting in a 6.30% relative diarization error reduction.
△ Less
Submitted 18 November, 2022;
originally announced November 2022.
-
Pushing the limits of self-supervised speaker verification using regularized distillation framework
Authors:
Yafeng Chen,
Siqi Zheng,
Hui Wang,
Luyao Cheng,
Qian Chen
Abstract:
Training robust speaker verification systems without speaker labels has long been a challenging task. Previous studies observed a large performance gap between self-supervised and fully supervised methods. In this paper, we apply a non-contrastive self-supervised learning framework called DIstillation with NO labels (DINO) and propose two regularization terms applied to embeddings in DINO. One reg…
▽ More
Training robust speaker verification systems without speaker labels has long been a challenging task. Previous studies observed a large performance gap between self-supervised and fully supervised methods. In this paper, we apply a non-contrastive self-supervised learning framework called DIstillation with NO labels (DINO) and propose two regularization terms applied to embeddings in DINO. One regularization term guarantees the diversity of the embeddings, while the other regularization term decorrelates the variables of each embedding. The effectiveness of various data augmentation techniques are explored, on both time and frequency domain. A range of experiments conducted on the VoxCeleb datasets demonstrate the superiority of the regularized DINO framework in speaker verification. Our method achieves the state-of-the-art speaker verification performance under a single-stage self-supervised setting on VoxCeleb. Code has been made publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
△ Less
Submitted 2 August, 2023; v1 submitted 8 November, 2022;
originally announced November 2022.
-
The Brain-Inspired Cooperative Shared Control for Brain-Machine Interface
Authors:
Shengjie Zheng,
Ling Liu,
Junjie Yang,
Lang Qian,
Gang Gao,
Xin Chen,
Wenqi **,
Chunshan Deng,
Xiaojian Li
Abstract:
In the practical application of brain-machine interface technology, the problem often faced is the low information content and high noise of the neural signals collected by the electrode and the difficulty of decoding by the decoder, which makes it difficult for the robotic to obtain stable instructions to complete the task. The idea based on the principle of cooperative shared control can be achi…
▽ More
In the practical application of brain-machine interface technology, the problem often faced is the low information content and high noise of the neural signals collected by the electrode and the difficulty of decoding by the decoder, which makes it difficult for the robotic to obtain stable instructions to complete the task. The idea based on the principle of cooperative shared control can be achieved by extracting general motor commands from brain activity, while the fine details of the movement can be hosted to the robot for completion, or the brain can have complete control. This study proposes a brain-machine interface shared control system based on spiking neural networks for robotic arm movement control and wheeled robots wheel speed control and steering, respectively. The former can reliably control the robotic arm to move to the destination position, while the latter controls the wheeled robots for object tracking and map generation. The results show that the shared control based on brain-inspired intelligence can perform some typical tasks in complex environments and positively improve the fluency and ease of use of brain-machine interaction, and also demonstrate the potential of this control method in clinical applications of brain-machine interfaces.
△ Less
Submitted 25 June, 2024; v1 submitted 17 October, 2022;
originally announced October 2022.
-
Invertible Rescaling Network and Its Extensions
Authors:
Mingqing Xiao,
Shuxin Zheng,
Chang Liu,
Zhouchen Lin,
Tie-Yan Liu
Abstract:
Image rescaling is a commonly used bidirectional operation, which first downscales high-resolution images to fit various display screens or to be storage- and bandwidth-friendly, and afterward upscales the corresponding low-resolution images to recover the original resolution or the details in the zoom-in images. However, the non-injective downscaling map** discards high-frequency contents, lead…
▽ More
Image rescaling is a commonly used bidirectional operation, which first downscales high-resolution images to fit various display screens or to be storage- and bandwidth-friendly, and afterward upscales the corresponding low-resolution images to recover the original resolution or the details in the zoom-in images. However, the non-injective downscaling map** discards high-frequency contents, leading to the ill-posed problem for the inverse restoration task. This can be abstracted as a general image degradation-restoration problem with information loss. In this work, we propose a novel invertible framework to handle this general problem, which models the bidirectional degradation and restoration from a new perspective, i.e. invertible bijective transformation. The invertibility enables the framework to model the information loss of pre-degradation in the form of distribution, which could mitigate the ill-posed problem during post-restoration. To be specific, we develop invertible models to generate valid degraded images and meanwhile transform the distribution of lost contents to the fixed distribution of a latent variable during the forward degradation. Then restoration is made tractable by applying the inverse transformation on the generated degraded image together with a randomly-drawn latent variable. We start from image rescaling and instantiate the model as Invertible Rescaling Network (IRN), which can be easily extended to the similar decolorization-colorization task. We further propose to combine the invertible framework with existing degradation methods such as image compression for wider applications. Experimental results demonstrate the significant improvement of our model over existing methods in terms of both quantitative and qualitative evaluations of upscaling and colorizing reconstruction from downscaled and decolorized images, and rate-distortion of image compression.
△ Less
Submitted 9 October, 2022;
originally announced October 2022.
-
Modeling and Control of Discrete Event Systems under Joint Sensor-Actuator Cyber Attacks
Authors:
Shengbao Zheng,
Shaolong Shu,
Feng Lin
Abstract:
In this paper, we investigate joint sensor-actuator cyber attacks in discrete event systems. We assume that attackers can attack some sensors and actuators at the same time by altering observations and control commands. Because of the nondeterminism in observation and control caused by cyber attacks, the behavior of the supervised system becomes nondeterministic and may deviate from the safety spe…
▽ More
In this paper, we investigate joint sensor-actuator cyber attacks in discrete event systems. We assume that attackers can attack some sensors and actuators at the same time by altering observations and control commands. Because of the nondeterminism in observation and control caused by cyber attacks, the behavior of the supervised system becomes nondeterministic and may deviate from the safety specification. We define the upper-bound on all possible languages that can be generated by the supervised system to investigate the safety supervisory control problem under cyber attacks. After introducing CA-controllability and CA-observability, we prove that the supervisory control problem under cyber attacks is solvable if and only if the given specification language is CA-controllable and CA-observable. Furthermore, we obtain methods to calculate the state estimates under sensor attacks and to synthesize a state-estimate-based supervisor to achieve a given safety specification under cyber attacks. We further show that of all the solutions, the proposed state-estimate-based supervisor is maximally-permissive.
△ Less
Submitted 11 January, 2023; v1 submitted 27 September, 2022;
originally announced September 2022.
-
FIAT: Fine-grained Information Audit for Trustless Transborder Data Flow
Authors:
Shuhao Zheng,
Yanxi Lin,
Yang Yu,
Ye Yuan,
Yongzheng Jia,
Xue Liu
Abstract:
Auditing the information leakage of latent sensitive features during the transborder data flow has attracted sufficient attention from global digital regulators. However, there is missing a technical approach for the audit practice due to two technical challenges. Firstly, there is a lack of theory and tools for measuring the information of sensitive latent features in a dataset. Secondly, the tra…
▽ More
Auditing the information leakage of latent sensitive features during the transborder data flow has attracted sufficient attention from global digital regulators. However, there is missing a technical approach for the audit practice due to two technical challenges. Firstly, there is a lack of theory and tools for measuring the information of sensitive latent features in a dataset. Secondly, the transborder data flow involves multi-stakeholders with diverse interests, which means the audit must be trustless. Despite the tremendous efforts in protecting data privacy, an important issue that has long been neglected is that the transmitted data in data flows can leak other regulated information that is not explicitly contained in the data, leading to unaware information leakage risks. To unveil such risks trustfully before the actual data transfer, we propose FIAT, a Fine-grained Information Audit system for Trustless transborder data flow. In FIAT, we use a learning approach to quantify the amount of information leakage, while the technologies of zero-knowledge proof and smart contracts are applied to provide trustworthy and privacy-preserving auditing results. Experiments show that large information leakage can boost the predictability of uninvolved information using simple machine-learning models, revealing the importance of information auditing. Further performance benchmarking also validates the efficiency and scalability of the FIAT auditing system.
△ Less
Submitted 10 February, 2023; v1 submitted 23 September, 2022;
originally announced September 2022.
-
Performance-Driven Controller Tuning via Derivative-Free Reinforcement Learning
Authors:
Yuheng Lei,
Jianyu Chen,
Shengbo Eben Li,
Sifa Zheng
Abstract:
Choosing an appropriate parameter set for the designed controller is critical for the final performance but usually requires a tedious and careful tuning process, which implies a strong need for automatic tuning methods. However, among existing methods, derivative-free ones suffer from poor scalability or low efficiency, while gradient-based ones are often unavailable due to possibly non-different…
▽ More
Choosing an appropriate parameter set for the designed controller is critical for the final performance but usually requires a tedious and careful tuning process, which implies a strong need for automatic tuning methods. However, among existing methods, derivative-free ones suffer from poor scalability or low efficiency, while gradient-based ones are often unavailable due to possibly non-differentiable controller structure. To resolve the issues, we tackle the controller tuning problem using a novel derivative-free reinforcement learning (RL) framework, which performs timestep-wise perturbation in parameter space during experience collection and integrates derivative-free policy updates into the advanced actor-critic RL architecture to achieve high versatility and efficiency. To demonstrate the framework's efficacy, we conduct numerical experiments on two concrete examples from autonomous driving, namely, adaptive cruise control with PID controller and trajectory tracking with MPC controller. Experimental results show that the proposed method outperforms popular baselines and highlight its strong potential for controller tuning.
△ Less
Submitted 11 September, 2022;
originally announced September 2022.
-
Test-time Adaptation with Calibration of Medical Image Classification Nets for Label Distribution Shift
Authors:
Wenao Ma,
Cheng Chen,
Shuang Zheng,
**g Qin,
Huimao Zhang,
Qi Dou
Abstract:
Class distribution plays an important role in learning deep classifiers. When the proportion of each class in the test set differs from the training set, the performance of classification nets usually degrades. Such a label distribution shift problem is common in medical diagnosis since the prevalence of disease vary over location and time. In this paper, we propose the first method to tackle labe…
▽ More
Class distribution plays an important role in learning deep classifiers. When the proportion of each class in the test set differs from the training set, the performance of classification nets usually degrades. Such a label distribution shift problem is common in medical diagnosis since the prevalence of disease vary over location and time. In this paper, we propose the first method to tackle label shift for medical image classification, which effectively adapt the model learned from a single training label distribution to arbitrary unknown test label distribution. Our approach innovates distribution calibration to learn multiple representative classifiers, which are capable of handling different one-dominating-class distributions. When given a test image, the diverse classifiers are dynamically aggregated via the consistency-driven test-time adaptation, to deal with the unknown test label distribution. We validate our method on two important medical image classification tasks including liver fibrosis staging and COVID-19 severity prediction. Our experiments clearly show the decreased model performance under label shift. With our method, model performance significantly improves on all the test datasets with different label shifts for both medical image diagnosis tasks.
△ Less
Submitted 9 July, 2022; v1 submitted 2 July, 2022;
originally announced July 2022.
-
AS-IntroVAE: Adversarial Similarity Distance Makes Robust IntroVAE
Authors:
Changjie Lu,
Shen Zheng,
Zirui Wang,
Omar Dib,
Gaurav Gupta
Abstract:
Recently, introspective models like IntroVAE and S-IntroVAE have excelled in image generation and reconstruction tasks. The principal characteristic of introspective models is the adversarial learning of VAE, where the encoder attempts to distinguish between the real and the fake (i.e., synthesized) images. However, due to the unavailability of an effective metric to evaluate the difference betwee…
▽ More
Recently, introspective models like IntroVAE and S-IntroVAE have excelled in image generation and reconstruction tasks. The principal characteristic of introspective models is the adversarial learning of VAE, where the encoder attempts to distinguish between the real and the fake (i.e., synthesized) images. However, due to the unavailability of an effective metric to evaluate the difference between the real and the fake images, the posterior collapse and the vanishing gradient problem still exist, reducing the fidelity of the synthesized images. In this paper, we propose a new variation of IntroVAE called Adversarial Similarity Distance Introspective Variational Autoencoder (AS-IntroVAE). We theoretically analyze the vanishing gradient problem and construct a new Adversarial Similarity Distance (AS-Distance) using the 2-Wasserstein distance and the kernel trick. With weight annealing on AS-Distance and KL-Divergence, the AS-IntroVAE are able to generate stable and high-quality images. The posterior collapse problem is addressed by making per-batch attempts to transform the image so that it better fits the prior distribution in the latent space. Compared with the per-image approach, this strategy fosters more diverse distributions in the latent space, allowing our model to produce images of great diversity. Comprehensive experiments on benchmark datasets demonstrate the effectiveness of AS-IntroVAE on image generation and reconstruction tasks.
△ Less
Submitted 31 October, 2022; v1 submitted 28 June, 2022;
originally announced June 2022.
-
Deep Representation Decomposition for Rate-Invariant Speaker Verification
Authors:
Fuchuan Tong,
Siqi Zheng,
Haodong Zhou,
Xingjia Xie,
Qingyang Hong,
Lin Li
Abstract:
While promising performance for speaker verification has been achieved by deep speaker embeddings, the advantage would reduce in the case of speaking-style variability. Speaking rate mismatch is often observed in practical speaker verification systems, which may actually degrade the system performance. To reduce intra-class discrepancy caused by speaking rate, we propose a deep representation deco…
▽ More
While promising performance for speaker verification has been achieved by deep speaker embeddings, the advantage would reduce in the case of speaking-style variability. Speaking rate mismatch is often observed in practical speaker verification systems, which may actually degrade the system performance. To reduce intra-class discrepancy caused by speaking rate, we propose a deep representation decomposition approach with adversarial learning to learn speaking rate-invariant speaker embeddings. Specifically, adopting an attention block, we decompose the original embedding into an identity-related component and a rate-related component through multi-task training. Additionally, to reduce the latent relationship between the two decomposed components, we further propose a cosine map** block to train the parameters adversarially to minimize the cosine similarity between the two decomposed components. As a result, identity-related features become robust to speaking rate and then are used for verification. Experiments are conducted on VoxCeleb1 data and HI-MIA data to demonstrate the effectiveness of our proposed approach.
△ Less
Submitted 27 May, 2022;
originally announced May 2022.