-
MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research
Authors:
Song Li,
Yongbin You,
Xuezhi Wang,
Zhengkun Tian,
Ke Ding,
Guanglu Wan
Abstract:
Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers' efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on YouTube, comprising 15 languages and a total of 86,300 hours of transcribed ASR data. We also introduce how to use the MSR-86K corpus and other open-source corpora to train a robust multilingual ASR model that is competitive with Whisper. MSR-86K will be publicly released on HuggingFace, and we believe that such a large corpus will pave new avenues for research in multilingual ASR.
Submitted 26 June, 2024;
originally announced June 2024.
-
Timeliness of Status Update System: The Effect of Parallel Transmission Using Heterogeneous Updating Devices
Authors:
Zhengchuan Chen,
Kang Lang,
Nikolaos Pappas,
Howard H. Yang,
Min Wang,
Zhong Tian,
Tony Q. S. Quek
Abstract:
Timely status updating is the premise of emerging interaction-based applications in the Internet of Things (IoT). Using redundant devices to update the status of interest is a promising method to improve the timeliness of information. However, parallel status updating leads to out-of-order arrivals at the monitor, significantly challenging timeliness analysis. This work studies the Age of Information (AoI) of a multi-queue status update system where multiple devices monitor the same physical process. Specifically, two systems are considered: the Basic System, which only has type-1 devices that are ad hoc devices located close to the source, and the Hybrid System, which contains additional type-2 devices that are infrastructure-based devices located in fixed points compared to the Basic System. Using the Stochastic Hybrid Systems (SHS) framework, a mathematical model that combines discrete and continuous dynamics, we derive the expressions of the average AoI of the considered two systems in closed form. Numerical results verify the accuracy of the analysis. It is shown that when the number and parameters of the type-1 devices/type-2 devices are fixed, the logarithm of average AoI will linearly decrease with the logarithm of the total arrival rate of type-2 devices or that of the number of type-1 devices under specific condition. It has also been demonstrated that the proposed systems can significantly outperform the FCFS M/M/N status update system.
Submitted 27 May, 2024;
originally announced May 2024.
-
ComposerX: Multi-Agent Symbolic Music Composition with LLMs
Authors:
Qixin Deng,
Qikai Yang,
Ruibin Yuan,
Yipeng Huang,
Yi Wang,
Xubo Liu,
Zeyue Tian,
Jiahao Pan,
Ge Zhang,
Hanfeng Lin,
Yizhi Li,
Yinghao Ma,
Jie Fu,
Chenghua Lin,
Emmanouil Benetos,
Wenwu Wang,
Guangyu Xia,
Wei Xue,
Yike Guo
Abstract:
Music composition represents the creative side of humanity, and itself is a complex task that requires abilities to understand and generate information with long dependency and harmony constraints. While demonstrating impressive capabilities in STEM subjects, current LLMs easily fail in this task, generating ill-written music even when equipped with modern techniques like In-Context-Learning and Chain-of-Thoughts. To further explore and enhance LLMs' potential in music composition by leveraging their reasoning ability and the large knowledge base in music history and theory, we propose ComposerX, an agent-based symbolic music generation framework. We find that applying a multi-agent approach significantly improves the music composition quality of GPT-4. The results demonstrate that ComposerX is capable of producing coherent polyphonic music compositions with captivating melodies, while adhering to user instructions.
Submitted 30 April, 2024; v1 submitted 28 April, 2024;
originally announced April 2024.
-
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
Authors:
Yazhou Xing,
Yingqing He,
Zeyue Tian,
Xintao Wang,
Qifeng Chen
Abstract:
Video and audio content creation serves as the core technique for the movie industry and professional users. Recently, existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry. In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation. We observe the powerful generation ability of off-the-shelf video or audio generation models. Thus, instead of training the giant models from scratch, we propose to bridge the existing strong models with a shared latent representation space. Specifically, we propose a multimodality latent aligner with the pre-trained ImageBind model. Our latent aligner shares a similar core as the classifier guidance that guides the diffusion denoising process during inference time. Through carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/
Submitted 27 February, 2024;
originally announced February 2024.
-
Emergency Caching: Coded Caching-based Reliable Map Transmission in Emergency Networks
Authors:
Zeyu Tian,
Lianming Xu,
Liang Li,
Li Wang,
Aiguo Fei
Abstract:
Many rescue missions demand effective perception and real-time decision making, which highly rely on effective data collection and processing. In this study, we propose a three-layer architecture of emergency caching networks focusing on data collection and reliable transmission, by leveraging efficient perception and edge caching technologies. Based on this architecture, we propose a disaster map collection framework that integrates coded caching technologies. Our framework strategically caches coded fragments of maps across unmanned aerial vehicles (UAVs), fostering collaborative uploading for augmented transmission reliability. Additionally, we establish a comprehensive probability model to assess the effective recovery area of disaster maps. Towards the goal of utility maximization, we propose a deep reinforcement learning (DRL) based algorithm that jointly makes decisions about cooperative UAVs selection, bandwidth allocation and coded caching parameter adjustment, accommodating the real-time map updates in a dynamic disaster situation. Our proposed scheme is more effective than the non-coding caching scheme, as validated by simulation.
Submitted 27 February, 2024;
originally announced February 2024.
-
ChatMusician: Understanding and Generating Music Intrinsically with LLM
Authors:
Ruibin Yuan,
Hanfeng Lin,
Yi Wang,
Zeyue Tian,
Shangda Wu,
Tianhao Shen,
Ge Zhang,
Yuhang Wu,
Cong Liu,
Ziya Zhou,
Ziyang Ma,
Liumeng Xue,
Ziyu Wang,
Qin Liu,
Tianyu Zheng,
Yizhi Li,
Yinghao Ma,
Yiming Liang,
Xiaowei Chi,
Ruibo Liu,
Zili Wang,
Pengfei Li,
Jingcheng Wu,
Chenghua Lin,
Qifeng Liu
, et al. (10 additional authors not shown)
Abstract:
While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc, surpassing GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 on zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B token music-language corpora MusicPile, the collected MusicTheoryBench, code, model and demo in GitHub.
Submitted 25 February, 2024;
originally announced February 2024.
-
DISCO Might Not Be Funky: Random Intelligent Reflective Surface Configurations That Attack
Authors:
Huan Huang,
Lipeng Dai,
Hongliang Zhang,
Chongfu Zhang,
Zhongxing Tian,
Yi Cai,
A. Lee Swindlehurst,
Zhu Han
Abstract:
Emerging intelligent reflective surfaces (IRSs) significantly improve system performance, but also pose a significant risk for physical layer security (PLS). Unlike the extensive research on legitimate IRS-enhanced communications, in this article we present an adversarial IRS-based fully-passive jammer (FPJ). We describe typical application scenarios for Disco IRS (DIRS)-based FPJ, where an illegitimate IRS with random, time-varying reflection properties acts like a "disco ball" to randomly change the propagation environment. We introduce the principles of DIRS-based FPJ and overview existing investigations of the technology, including a design example employing one-bit phase shifters. The DIRS-based FPJ can be implemented without either jamming power or channel state information (CSI) for the legitimate users (LUs). It does not suffer from the energy constraints of traditional active jammers, nor does it require any knowledge of the LU channels. In addition to the proposed jamming attack, we also propose an anti-jamming strategy that requires only statistical rather than instantaneous CSI. Furthermore, we present a data frame structure that enables the legitimate access point (AP) to estimate the DIRS-jammed channels' statistical characteristics in the presence of the DIRS jamming. Typical cases are discussed to show the impact of the DIRS-based FPJ and the feasibility of the anti-jamming precoder (AJP). Moreover, we outline future research directions and challenges for the DIRS-based FPJ and its anti-jamming precoding to stimulate this line of research and pave the way for practical applications.
Submitted 10 June, 2024; v1 submitted 1 October, 2023;
originally announced October 2023.
-
CPPF: A contextual and post-processing-free model for automatic speech recognition
Authors:
Lei Zhang,
Zhengkun Tian,
Xiang Chen,
Jiaming Sun,
Hongyu Xiang,
Ke Ding,
Guanglu Wan
Abstract:
ASR systems have become increasingly widespread in recent years. However, their textual outputs often require post-processing tasks before they can be practically utilized. To address this issue, we draw inspiration from the multifaceted capabilities of LLMs and Whisper, and focus on integrating multiple ASR text processing tasks related to speech recognition into the ASR model. This integration not only shortens the multi-stage pipeline, but also prevents the propagation of cascading errors, resulting in direct generation of post-processed text. In this study, we focus on ASR-related processing tasks, including Contextual ASR and multiple ASR post processing tasks. To achieve this objective, we introduce the CPPF model, which offers a versatile and highly effective alternative to ASR processing. CPPF seamlessly integrates these tasks without any significant loss in recognition performance.
Submitted 20 September, 2023; v1 submitted 13 September, 2023;
originally announced September 2023.
-
Cross-Utterance Conditioned VAE for Speech Generation
Authors:
Yang Li,
Cheng Yu,
Guangzhi Sun,
Weiqin Zu,
Zheng Tian,
Ying Wen,
Wei Pan,
Chao Zhang,
Jun Wang,
Yang Yang,
Fanglei Sun
Abstract:
Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.
Submitted 8 September, 2023;
originally announced September 2023.
-
Anti-Jamming Precoding Against Disco Intelligent Reflecting Surfaces Based Fully-Passive Jamming Attacks
Authors:
Huan Huang,
Lipeng Dai,
Hongliang Zhang,
Zhongxing Tian,
Yi Cai,
Chongfu Zhang,
A. Lee Swindlehurst,
Zhu Han
Abstract:
Emerging intelligent reflecting surfaces (IRSs) significantly improve system performance, but also pose a huge risk for physical layer security. Existing works have illustrated that a disco IRS (DIRS), i.e., an illegitimate IRS with random time-varying reflection properties (like a "disco ball"), can be employed by an attacker to actively age the channels of legitimate users (LUs). Such active channel aging (ACA) generated by the DIRS can be employed to jam multi-user multiple-input single-output (MU-MISO) systems without relying on either jamming power or LU channel state information (CSI). To address the significant threats posed by DIRS-based fully-passive jammers (FPJs), an anti-jamming precoder is proposed that requires only the statistical characteristics of the DIRS-based ACA channels instead of their CSI. The statistical characteristics of DIRS-jammed channels are first derived, and then the anti-jamming precoder is derived based on the statistical characteristics. Furthermore, we prove that the anti-jamming precoder can achieve the maximum signal-to-jamming-plus-noise ratio (SJNR). To acquire the ACA statistics without changing the system architecture or cooperating with the illegitimate DIRS, we design a data frame structure that the legitimate access point (AP) can use to estimate the statistical characteristics. During the designed data frame, the LUs only need to feed back their received power to the legitimate AP when they detect jamming attacks. Numerical results are also presented to evaluate the effectiveness of the proposed anti-jamming precoder against the DIRS-based FPJs and the feasibility of the designed data frame used by the legitimate AP to estimate the statistical characteristics.
Submitted 24 January, 2024; v1 submitted 29 August, 2023;
originally announced August 2023.
-
Can We Transfer Noise Patterns? A Multi-environment Spectrum Analysis Model Using Generated Cases
Authors:
Haiwen Du,
Zheng Ju,
Yu An,
Honghui Du,
Dongjie Zhu,
Zhaoshuo Tian,
Aonghus Lawlor,
Ruihai Dong
Abstract:
Spectrum analysis systems in online water quality testing are designed to detect types and concentrations of pollutants and enable regulatory agencies to respond promptly to pollution incidents. However, spectral data-based testing devices suffer from complex noise patterns when deployed in non-laboratory environments. To make the analysis model applicable to more environments, we propose a noise patterns transferring model, which takes the spectrum of standard water samples in different environments as cases and learns the differences in their noise patterns, thus enabling noise patterns to transfer to unknown samples. Unfortunately, the inevitable sample-level baseline noise makes the model unable to obtain the paired data that only differ in dataset-level environmental noise. To address the problem, we generate a sample-to-sample case-base to exclude the interference of sample-level noise on dataset-level noise learning, enhancing the system's learning performance. Experiments on spectral data with different background noises demonstrate the good noise-transferring ability of the proposed method against baseline systems ranging from wavelet denoising, deep neural networks, and generative models. From this research, we posit that our method can enhance the performance of DL models by generating high-quality cases. The source code is made publicly available online at https://github.com/Magnomic/CNST.
Submitted 14 August, 2023; v1 submitted 2 August, 2023;
originally announced August 2023.
-
TST: Time-Sparse Transducer for Automatic Speech Recognition
Authors:
Xiaohui Zhang,
Mangui Liang,
Zhengkun Tian,
Jiangyan Yi,
Jianhua Tao
Abstract:
End-to-end models, especially the Recurrent Neural Network Transducer (RNN-T), have achieved great success in speech recognition. However, the transducer requires a large memory footprint and long computing time when processing a long decoding sequence. To solve this problem, we propose a model named time-sparse transducer, which introduces a time-sparse mechanism into the transducer. In this mechanism, we obtain the intermediate representations by reducing the time resolution of the hidden states. Then a weighted average algorithm is used to combine these representations into sparse hidden states, which are then fed to the decoder. All the experiments are conducted on the Mandarin dataset AISHELL-1. The character error rate of the time-sparse transducer is close to that of RNN-T, and the real-time factor is 50.00% of the original. By adjusting the time resolution, the time-sparse transducer can also reduce the real-time factor to 16.54% of the original at the expense of a 4.94% loss of precision.
Submitted 17 July, 2023;
originally announced July 2023.
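The time-sparse idea in the abstract above (lower the time resolution of the hidden states, then merge frames by a weighted average) can be illustrated with a small sketch. This is not the authors' implementation: the function name `time_sparse`, the fixed non-overlapping window, and the hand-picked weights are all assumptions made for illustration, and the paper may compute its weights and windows differently.

```python
from typing import List, Sequence


def time_sparse(
    hidden: Sequence[Sequence[float]],
    window: int,
    weights: Sequence[float],
) -> List[List[float]]:
    """Merge every `window` consecutive frames into one by a weighted average.

    `hidden` is a list of frames, each a list of feature values.
    `weights` holds one non-negative weight per position in the window
    (hypothetical fixed weights; a real model could learn them).
    """
    if window <= 0 or len(weights) != window:
        raise ValueError("need exactly one weight per window position")
    sparse = []
    for start in range(0, len(hidden), window):
        chunk = hidden[start:start + window]
        w = weights[:len(chunk)]  # the last chunk may be shorter
        total = sum(w)
        dim = len(chunk[0])
        sparse.append(
            [sum(wi * frame[d] for wi, frame in zip(w, chunk)) / total
             for d in range(dim)]
        )
    return sparse


# Four 2-dimensional frames reduced to two frames with window=2.
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
sparse_frames = time_sparse(frames, window=2, weights=[0.5, 0.5])
print(sparse_frames)  # [[2.0, 1.0], [6.0, 5.0]]
```

With a window of 2 the downstream decoder sees half as many frames, which is the kind of resolution/cost trade-off the abstract describes; the exact real-time factors reported there depend on the authors' model and are not reproduced by this sketch.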
-
MARBLE: Music Audio Representation Benchmark for Universal Evaluation
Authors:
Ruibin Yuan,
Yinghao Ma,
Yizhi Li,
Ge Zhang,
Xingran Chen,
Hanzhi Yin,
Le Zhuo,
Yiqi Liu,
Jiawen Huang,
Zeyue Tian,
Binyue Deng,
Ningzhi Wang,
Chenghua Lin,
Emmanouil Benetos,
Anton Ragni,
Norbert Gyenge,
Roger Dannenberg,
Wenhu Chen,
Gus Xia,
Wei Xue,
Si Liu,
Shi Wang,
Ruibo Liu,
Yike Guo,
Jie Fu
Abstract:
In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines. Besides, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues on datasets. Results suggest recently proposed large-scale pre-trained musical language models perform the best in most tasks, with room for further improvement. The leaderboard and toolkit repository are published at https://marble-bm.shef.ac.uk to promote future music AI research.
Submitted 23 November, 2023; v1 submitted 18 June, 2023;
originally announced June 2023.
-
Distributed Learning over Networks with Graph-Attention-Based Personalization
Authors:
Zhuojun Tian,
Zhaoyang Zhang,
Zhaohui Yang,
Richeng Jin,
Huaiyu Dai
Abstract:
In conventional distributed learning over a network, multiple agents collaboratively build a common machine learning model. However, due to the underlying non-i.i.d. data distribution among agents, the unified learning model becomes inefficient for each agent to process its locally accessible data. To address this problem, we propose a graph-attention-based personalized training algorithm (GATTA) for distributed deep learning. The GATTA enables each agent to train its local personalized model while exploiting its correlation with neighboring nodes and utilizing their useful information for aggregation. In particular, the personalized model in each agent is composed of a global part and a node-specific part. By treating each agent as one node in a graph and the node-specific parameters as its features, the benefits of the graph attention mechanism can be inherited. Namely, instead of aggregation based on averaging, it learns the specific weights for different neighboring nodes without requiring prior knowledge about the graph structure or the neighboring nodes' data distribution. Furthermore, relying on the weight-learning procedure, we develop a communication-efficient GATTA by skipping the transmission of information with small aggregation weights. Additionally, we theoretically analyze the convergence properties of GATTA for non-convex loss functions. Numerical results validate the excellent performance of the proposed algorithms in terms of convergence and communication cost.
Submitted 22 May, 2023;
originally announced May 2023.
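The aggregation step described in the abstract above (attention-style weights per neighbor instead of plain averaging, with low-weight neighbors skipped to save communication) might look roughly like the following. This is a sketch under assumptions, not the GATTA algorithm itself: the raw `scores` are taken as given rather than learned, and the names `attention_weights`, `aggregate`, and `min_weight` are hypothetical.

```python
import math
from typing import Dict, List


def attention_weights(scores: Dict[str, float]) -> Dict[str, float]:
    """Softmax over per-neighbor scores (scores are assumed inputs here)."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    norm = sum(exps.values())
    return {name: e / norm for name, e in exps.items()}


def aggregate(
    neighbor_params: Dict[str, List[float]],
    scores: Dict[str, float],
    min_weight: float = 0.0,
) -> List[float]:
    """Weighted average of neighbor parameters, skipping low-weight neighbors.

    Neighbors whose attention weight falls below `min_weight` are dropped
    (standing in for "not transmitted"), and the rest are renormalized.
    """
    weights = attention_weights(scores)
    kept = {n: w for n, w in weights.items() if w >= min_weight}
    if not kept:
        raise ValueError("min_weight removed every neighbor")
    norm = sum(kept.values())
    dim = len(next(iter(neighbor_params.values())))
    return [
        sum(kept[n] / norm * neighbor_params[n][d] for n in kept)
        for d in range(dim)
    ]


# Neighbor "c" gets a negligible weight and is skipped.
params = {"a": [1.0, 1.0], "b": [3.0, 5.0], "c": [100.0, 100.0]}
scores = {"a": 0.0, "b": 0.0, "c": -10.0}
merged = aggregate(params, scores, min_weight=0.01)
print(merged)
```

In this toy case the result is close to the plain average of neighbors "a" and "b", since "c" falls under the threshold; with `min_weight=0.0` every neighbor contributes in proportion to its weight.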
-
Super-Resolution Harmonic Retrieval of Non-Circular Signals
Authors:
Yu Zhang,
Yue Wang,
Zhi Tian,
Geert Leus,
Gong Zhang
Abstract:
This paper proposes a super-resolution harmonic retrieval method for uncorrelated strictly non-circular signals, whose covariance and pseudo-covariance present Toeplitz and Hankel structures, respectively. Accordingly, the augmented covariance matrix constructed by the covariance and pseudo-covariance matrices is not only low rank but also jointly Toeplitz-Hankel structured. To efficiently exploit such a desired structure for high estimation accuracy, we develop a low-rank Toeplitz-Hankel covariance reconstruction (LRTHCR) solution employed over the augmented covariance matrix. Further, we design a fitting error constraint to flexibly implement the LRTHCR algorithm without knowing the noise statistics. In addition, performance analysis is provided for the proposed LRTHCR in practical settings. Simulation results reveal that the LRTHCR outperforms the benchmark methods in terms of lower estimation errors.
Submitted 17 January, 2023;
originally announced January 2023.
-
Semantic-Aware Sensing Information Transmission for Metaverse: A Contest Theoretic Approach
Authors:
Jiacheng Wang,
Hongyang Du,
Zengshan Tian,
Dusit Niyato,
Jiawen Kang,
Xuemin Shen
Abstract:
With the advancement of network and computer technologies, virtual cyberspace keeps evolving, and Metaverse is the main representative. As an irreplaceable technology that supports Metaverse, the sensing information transmission from the physical world to Metaverse is vital. Inspired by emerging semantic communication, in this paper, we propose a semantic transmission framework for transmitting sensing information from the physical world to Metaverse. Leveraging the in-depth understanding of sensing information, we define the semantic bases, through which the semantic encoding of sensing data is achieved for the first time. Consequently, the amount of sensing data that needs to be transmitted is dramatically reduced. Unlike conventional methods that undergo data degradation and require data recovery, our approach achieves the sensing goal without data recovery while maintaining performance. To further improve Metaverse service quality, we introduce contest theory to create an incentive mechanism that motivates users to upload data more frequently. Experimental results show that the average data amount after semantic encoding is reduced to about 27.87% of that before encoding, while ensuring the sensing performance. Additionally, the proposed contest theoretic based incentive mechanism increases the sum of data uploading frequency by 27.47% compared to the uniform award scheme.
Submitted 23 November, 2022;
originally announced November 2022.
-
Compressive Spectrum Sensing Using Blind-Block Orthogonal Least Squares
Authors:
Liyang Lu,
Wenbo Xu,
Yue Wang,
Zhi Tian
Abstract:
Compressive sensing (CS) has recently emerged as an extremely efficient technology for wideband spectrum sensing. In compressive spectrum sensing (CSS), it is necessary to know the sparsity or the noise information in advance for reliable reconstruction. However, such information is usually absent in practical applications. In this paper, we propose a blind-block orthogonal least squares-based compressive spectrum sensing (B-BOLS-CSS) algorithm, which utilizes a novel blind stopping rule to cut the ties to this prior information. Specifically, we first present both the noiseless and noisy recovery guarantees for the BOLS algorithm based on the mutual incoherence property (MIP). Motivated by them, we then formulate the blind stopping rule, which exploits an $\ell_{2,\infty}$ sufficient statistic to blindly test the support atoms in the remaining measurement matrix. We further evaluate the theoretical performance of the holistic B-BOLS-CSS algorithm by developing a lower bound of the signal-to-noise ratio (SNR) to ensure that the probability of exact recovery is no lower than a given threshold. Simulations not only demonstrate the improvement of our derived theoretical results, but also illustrate that B-BOLS-CSS works well in both low and high SNR environments.
Submitted 14 November, 2022;
originally announced November 2022.
-
Compressive Spectrum Sensing Using Sampling-Controlled Block Orthogonal Matching Pursuit
Authors:
Liyang Lu,
Wenbo Xu,
Yue Wang,
Zhi Tian
Abstract:
This paper proposes two novel schemes of wideband compressive spectrum sensing (CSS) via block orthogonal matching pursuit (BOMP) algorithm, for achieving high sensing accuracy in real time. These schemes aim to reliably recover the spectrum by adaptively adjusting the number of required measurements without inducing unnecessary sampling redundancy. To this end, the minimum number of required measurements for successful recovery is first derived in terms of its probabilistic lower bound. Then, a CSS scheme is proposed by tightening the derived lower bound, where the key is the design of a nonlinear exponential indicator through a general-purpose sampling-controlled algorithm (SCA). In particular, a sampling-controlled BOMP (SC-BOMP) is developed through a holistic integration of the existing BOMP and the proposed SCA. For fast implementation, a modified version of SC-BOMP is further developed by exploring the block orthogonality in the form of sub-coherence of measurement matrices, which allows more compressive sampling in terms of smaller lower bound of the number of measurements. Such a fast SC-BOMP scheme achieves a desired tradeoff between the complexity and the performance. Simulations demonstrate that the two SC-BOMP schemes outperform the other benchmark algorithms.
Submitted 13 April, 2023; v1 submitted 13 November, 2022;
originally announced November 2022.
-
SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection
Authors:
Jiangyan Yi,
Chenglong Wang,
Jianhua Tao,
Chu Yuan Zhang,
Cunhang Fan,
Zhengkun Tian,
Haoxin Ma,
Ruibo Fu
Abstract:
Many datasets have been designed to further the development of fake audio detection. However, fake utterances in previous datasets are mostly generated by altering timbre, prosody, linguistic content or channel noise of original audio. These datasets leave out a scenario in which the acoustic scene of an original audio is replaced with a forged one. It will pose a major threat to our society if some people misuse the manipulated audio with malicious purpose. This motivates us to fill in the gap. This paper proposes such a dataset for scene fake audio detection named SceneFake, where a manipulated audio is generated by only tampering with the acoustic scene of a real utterance by using speech enhancement technologies. Some scene fake audio detection benchmark results on the SceneFake dataset are reported in this paper. In addition, an analysis of fake attacks with different speech enhancement technologies and signal-to-noise ratios is presented in this paper. The results indicate that scene fake utterances cannot be reliably detected by baseline models trained on the ASVspoof 2019 dataset. Although these models perform well on the SceneFake training set and seen testing set, their performance is poor on the unseen test set. The dataset (https://zenodo.org/record/7663324#.Y_XKMuPYuUk) and benchmark source codes (https://github.com/ADDchallenge/SceneFake) are publicly available.
Submitted 4 April, 2024; v1 submitted 11 November, 2022;
originally announced November 2022.
-
Peak-First CTC: Reducing the Peak Latency of CTC Models by Applying Peak-First Regularization
Authors:
Zhengkun Tian,
Hongyu Xiang,
Min Li,
Feifei Lin,
Ke Ding,
Guanglu Wan
Abstract:
The CTC model has been widely applied to many application scenarios because of its simple structure, excellent performance, and fast inference speed. There are many peaks in the probability distribution predicted by the CTC models, and each peak represents a non-blank token. The recognition latency of CTC models can be reduced by encouraging the model to predict peaks earlier. Existing methods to reduce latency require modifying the transition relationship between tokens in the forward-backward algorithm, and the gradient calculation. Some of these methods even depend on the forced alignment results provided by other pretrained models. The above methods are complex to implement. To reduce the peak latency, we propose a simple and novel method named peak-first regularization, which utilizes a frame-wise knowledge distillation function to force the probability distribution of the CTC model to shift left along the time axis instead of directly modifying the calculation process of CTC loss and gradients. All the experiments are conducted on a Chinese Mandarin dataset AISHELL-1. We have verified the effectiveness of the proposed regularization on both streaming and non-streaming CTC models respectively. The results show that the proposed method can reduce the average peak latency by about 100 to 200 milliseconds with almost no degradation of recognition accuracy.
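As a rough illustration of the frame-wise distillation idea, the sketch below computes a regularization term in which each frame's distribution is pulled toward the distribution of the following frame. This is an assumption about the general shape of such a term, not the paper's exact formulation; the function name `peak_first_regularizer` and the toy posteriors are hypothetical, and in training the term would presumably be added to the CTC loss with some weight.

```python
import math


def peak_first_regularizer(frame_probs, eps=1e-12):
    """Frame-wise distillation term: frame t is pulled toward frame t+1.

    frame_probs is a list of per-frame probability distributions.
    The next frame plays the role of a fixed teacher (assumption of this sketch).
    """
    loss = 0.0
    for student, teacher in zip(frame_probs[:-1], frame_probs[1:]):
        # Cross-entropy between the next frame (teacher) and this frame (student).
        loss -= sum(q * math.log(p + eps) for p, q in zip(student, teacher))
    return loss


# Toy posteriors over [blank, token] for two consecutive frames.
late_peak = [[0.9, 0.1], [0.1, 0.9]]   # token peak appears only at the second frame
early_peak = [[0.1, 0.9], [0.1, 0.9]]  # first frame already predicts the token

late_loss = peak_first_regularizer(late_peak)
early_loss = peak_first_regularizer(early_peak)
```

In this toy case the term is smaller when the earlier frame already carries the peak, which is the direction of pressure the abstract describes.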
Submitted 15 March, 2023; v1 submitted 6 November, 2022;
originally announced November 2022.
-
Distributed Swarm Learning for Internet of Things at the Edge: Where Artificial Intelligence Meets Biological Intelligence
Authors:
Yue Wang,
Zhi Tian,
Xin Fan,
Yan Huo,
Cameron Nowzari,
Kai Zeng
Abstract:
With the proliferation of versatile Internet of Things (IoT) services, smart IoT devices are increasingly deployed at the edge of wireless networks to perform collaborative machine learning tasks using locally collected data, giving rise to the edge learning paradigm. Due to device restrictions and resource constraints, edge learning among massive IoT devices faces major technical challenges caused by the communication bottleneck, data and device heterogeneity, non-convex optimization, privacy and security concerns, and dynamic environments. To overcome these challenges, this article studies a new framework of distributed swarm learning (DSL) through a holistic integration of artificial intelligence and biological swarm intelligence. Leveraging efficient and robust signal processing and communication techniques, DSL contributes to novel tools for learning and optimization tailored for real-time operations of large-scale IoT in edge wireless environments, which will benefit a wide range of edge IoT applications.
Submitted 29 October, 2022;
originally announced October 2022.
-
Robust Distributed Learning Against Both Distributional Shifts and Byzantine Attacks
Authors:
Guanqiang Zhou,
Ping Xu,
Yue Wang,
Zhi Tian
Abstract:
In distributed learning systems, robustness issues may arise from two sources. On one hand, due to distributional shifts between training data and test data, the trained model could exhibit poor out-of-sample performance. On the other hand, a portion of working nodes might be subject to byzantine attacks which could invalidate the learning result. Existing works mostly deal with these two issues separately. In this paper, we propose a new algorithm that equips distributed learning with robustness measures against both distributional shifts and byzantine attacks. Our algorithm is built on recent advances in distributionally robust optimization as well as norm-based screening (NBS), a robust aggregation scheme against byzantine attacks. We provide convergence proofs in three cases of the learning model being nonconvex, convex, and strongly convex for the proposed algorithm, shedding light on its convergence behaviors and endurability against byzantine attacks. In particular, we deduce that any algorithm employing NBS (including ours) cannot converge when the percentage of byzantine nodes is 1/3 or higher, instead of 1/2, which is the common belief in current literature. The experimental results demonstrate the effectiveness of our algorithm against both robustness issues. To the best of our knowledge, this is the first work to address distributional shifts and byzantine attacks simultaneously.
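A minimal sketch of a norm-based screening step, assuming the common form in which the server discards the gradients with the largest norms and averages the remainder. The function name, the `byzantine_fraction` parameter and the toy gradients are illustrative assumptions; the abstract does not give the exact rule used in the paper.

```python
import math


def norm_based_screening(gradients, byzantine_fraction):
    """Average the gradients that remain after dropping the largest-norm ones."""
    if not gradients:
        raise ValueError("need at least one gradient")
    n = len(gradients)
    n_drop = math.ceil(byzantine_fraction * n)
    if n_drop >= n:
        raise ValueError("screening would discard every gradient")
    # Rank worker gradients by Euclidean norm, smallest first.
    ranked = sorted(gradients, key=lambda g: math.sqrt(sum(x * x for x in g)))
    kept = ranked[: n - n_drop]
    dim = len(kept[0])
    return [sum(g[i] for g in kept) / len(kept) for i in range(dim)]


honest = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
attack = [[100.0, -100.0]]  # a large-norm gradient sent by a Byzantine worker
aggregated = norm_based_screening(honest + attack, byzantine_fraction=0.25)
```

Here the single large-norm gradient is screened out and the aggregate stays close to the honest mean. An attacker who keeps its norm small passes the screen, which is one intuition for why the tolerable fraction of Byzantine nodes is limited.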
Submitted 29 October, 2022;
originally announced October 2022.
-
Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings
Authors:
Jian Zhu,
Zuoyu Tian,
Yadong Liu,
Cong Zhang,
Chia-wen Lo
Abstract:
Inducing semantic representations directly from speech signals is a highly challenging task but has many useful applications in speech mining and spoken language understanding. This study tackles the unsupervised learning of semantic representations for spoken utterances. Through converting speech signals into hidden units generated from acoustic unit discovery, we propose WavEmbed, a multimodal sequential autoencoder that predicts hidden units from a dense representation of speech. Secondly, we also propose S-HuBERT to induce meaning through knowledge distillation, in which a sentence embedding model is first trained on hidden units and passes its knowledge to a speech encoder through contrastive learning. The best performing model achieves a moderate correlation (0.5~0.6) with human judgments, without relying on any labels or transcriptions. Furthermore, these models can also be easily extended to leverage textual transcriptions of speech to learn much better speech embeddings that are strongly correlated with human annotations. Our proposed methods are applicable to the development of purely data-driven systems for speech mining, indexing and search.
Submitted 23 October, 2022;
originally announced October 2022.
-
Fully Automated End-to-End Fake Audio Detection
Authors:
Chenglong Wang,
Jiangyan Yi,
Jianhua Tao,
Haiyang Sun,
Xun Chen,
Zhengkun Tian,
Haoxin Ma,
Cunhang Fan,
Ruibo Fu
Abstract:
The existing fake audio detection systems often rely on expert experience to design the acoustic features or manually design the hyperparameters of the network structure. However, manual adjustment of the parameters can have a relatively obvious influence on the results. It is almost impossible to manually set the best set of parameters. Therefore, this paper proposes a fully automated end-to-end fake audio detection method. We first use a pre-trained wav2vec model to obtain a high-level representation of the speech. Furthermore, for the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS. It learns deep speech representations while automatically learning and optimizing complex neural structures consisting of convolutional operations and residual blocks. The experimental results on the ASVspoof 2019 LA dataset show that our proposed system achieves an equal error rate (EER) of 1.08%, which outperforms the state-of-the-art single system.
Submitted 20 August, 2022;
originally announced August 2022.
-
CB-DSL: Communication-efficient and Byzantine-robust Distributed Swarm Learning on Non-i.i.d. Data
Authors:
Xin Fan,
Yue Wang,
Yan Huo,
Zhi Tian
Abstract:
The valuable data collected by IoT devices in edge networks together with the resurgence of ML stimulate the latest trend of edge AI. However, recent FL methods face major challenges including communication bottleneck, data heterogeneity and security concerns in edge IoT scenarios, especially when being adopted for distributed learning among massive IoT devices equipped with limited data and transmission resources. Meanwhile, the swarm nature of IoT systems is overlooked by most existing literature, which calls for new designs of distributed learning algorithms. Inspired by the success of biological intelligence (BI) of gregarious organisms, we propose a novel edge learning approach for swarm IoT, called communication-efficient and Byzantine-robust distributed swarm learning (CB-DSL), through a holistic integration of AI-enabled stochastic gradient descent and BI-enabled particle swarm optimization. To deal with non-i.i.d. data issues and Byzantine attacks, global data samples are introduced in CB-DSL and shared among IoT workers, which not only alleviates the local data heterogeneity effectively but also enables to fully utilize the exploration-exploitation mechanism of swarm intelligence. Further, we provide convergence analysis to theoretically demonstrate that the proposed CB-DSL is superior to the standard FL with better convergence behavior. In addition, to measure the effectiveness of the introduction of the globally shared dataset, we also evaluate the model divergence by deriving its upper bound, which is related to the distance between the data distribution at local IoT devices and the population distribution for the whole datasets. Numerical results verify that the proposed CB-DSL outperforms the existing benchmarks in terms of faster convergence speed, higher convergent accuracy, lower communication cost, and better robustness against non-i.i.d. data and Byzantine attacks.
Submitted 20 October, 2022; v1 submitted 10 August, 2022;
originally announced August 2022.
-
SCAI: A Spectral data Classification framework with Adaptive Inference for the IoT platform
Authors:
Yundong Sun,
Dongjie Zhu,
Haiwen Du,
Yansong Wang,
Zhaoshuo Tian
Abstract:
Accurate, efficient, and real-time identification of massive spectral data with the help of deep learning and IoT technology is currently a hot research topic. Deep neural networks play a key role in spectral analysis. However, the inference of deeper models is performed in a static manner and cannot be adjusted according to the device. Not all samples need the full computation to reach a confident prediction, so static inference hinders maximizing the overall performance. To address the above issues, we propose a Spectral data Classification framework with Adaptive Inference. Specifically, to allocate different computations for different samples while better exploiting the collaboration among different devices, we leverage an Early-exit architecture: intermediate classifiers are placed at different depths of the architecture, and the model outputs the result when the prediction confidence reaches a preset threshold. We propose a training paradigm of self-distillation learning, in which the deepest classifier performs soft supervision on the shallow ones to maximize their performance and training speed. At the same time, to mitigate the vulnerability of performance to the location and number settings of intermediate classifiers in the Early-exit paradigm, we propose a Position-Adaptive residual network. It can adjust the number of layers in each block at different curve positions, so it can focus on important positions of the curve (e.g., a Raman peak), and accurately allocate the appropriate computational budget based on task performance and computing resources. To the best of our knowledge, this paper is the first attempt to conduct optimization by adaptive inference for spectral detection under the IoT platform. Extensive experiments show that our proposed method can achieve higher performance with less computational budget than existing methods.
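The early-exit rule described above can be sketched as follows, assuming each intermediate classifier yields a vector of logits and that confidence is taken as the largest softmax probability. The function names and the toy logits are hypothetical; passing a generator lets deeper exits be computed only when needed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exit_logits, threshold):
    """Return (class, exit_index, confidence) from the first confident exit.

    exit_logits iterates over the logits of each intermediate classifier,
    ordered from shallow to deep. The deepest exit is used as a fallback.
    """
    result = None
    for depth, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)
        result = (probs.index(confidence), depth, confidence)
        if confidence >= threshold:
            break  # confident enough: skip the deeper classifiers
    if result is None:
        raise ValueError("need at least one exit")
    return result


easy_sample = [[4.0, 0.0, 0.0], [9.0, 0.0, 0.0]]
hard_sample = [[0.2, 0.1, 0.0], [0.5, 3.5, 0.0]]
easy = early_exit_predict(easy_sample, threshold=0.9)
hard = early_exit_predict(hard_sample, threshold=0.9)
```

The easy sample leaves at the first exit, while the hard sample falls through to the deeper classifier, which is how per-sample computation varies under a fixed threshold.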
Submitted 24 June, 2022;
originally announced June 2022.
-
Blind Orthogonal Least Squares based Compressive Spectrum Sensing
Authors:
Liyang Lu,
Wenbo Xu,
Yue Wang,
Zhi Tian
Abstract:
As an enabling technique of cognitive radio (CR), compressive spectrum sensing (CSS) based on compressive sensing (CS) can detect the spectrum opportunities from wide frequency bands efficiently and accurately by using a sub-Nyquist sampling rate. However, the sensing performance of most existing CSS relies excessively on prior information such as spectrum sparsity or noise variance. Thus, a key challenge in practical CSS is how to work effectively even in the absence of such information. In this paper, we propose a blind orthogonal least squares based CSS algorithm (B-OLS-CSS), which functions properly without requiring prior information. Specifically, we develop a novel blind stopping rule for the OLS algorithm based on its probabilistic recovery condition. This rule removes the need for spectrum sparsity or noise information and requires only the computationally feasible mutual incoherence property of the given measurement matrix. Our theoretical analysis indicates that the signal-to-noise ratio required by the proposed B-OLS-CSS for achieving a certain sensing accuracy is more relaxed than that required by the benchmark CSS using the OMP algorithm, which is verified by extensive simulation results.
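The mutual incoherence property is usually stated in terms of the mutual coherence of the measurement matrix, that is, the largest absolute normalized inner product between two distinct columns. The sketch below computes that quantity for a small matrix to show why it is cheap to evaluate. The function name and the example matrix are illustrative, and the specific coherence condition used by the paper is not reproduced here.

```python
import math


def mutual_coherence(matrix):
    """Largest absolute normalized inner product between distinct columns.

    matrix is a list of rows; each column is treated as one atom.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    cols = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    if any(norm == 0.0 for norm in norms):
        raise ValueError("columns must be nonzero")
    mu = 0.0
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            inner = sum(a * b for a, b in zip(cols[i], cols[j]))
            mu = max(mu, abs(inner) / (norms[i] * norms[j]))
    return mu


# Columns are (1, 0), (0, 1) and (1, 1).
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
mu = mutual_coherence(A)
```

The computation only needs pairwise column inner products, so it can be done once for a given measurement matrix without knowing the signal's sparsity or the noise level.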
Submitted 13 November, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
Reconfigurable Intelligent Surface-Aided Spectrum Sharing Coexisting with Multiple Primary Networks
Authors:
Zhong Tian,
Zhengchuan Chen,
Min Wang,
Yunjian Jia,
Wanli Wen
Abstract:
Considering the spectrum sharing system (SSS) coexisting with multiple primary networks, we have employed a well-designed reconfigurable intelligent surface (RIS) to control the radio environments of wireless channels and relieve the scarcity of the spectrum resource in this work. Specifically, the enhancement of the spectral efficiency of the secondary user in the considered SSS is decomposed into two subproblems which are a second-order cone programming (SOCP) and a fractional programming of the convex quadratic form (CQFP), respectively, to optimize alternatively the beamforming vector at the secondary access point (S-AP) and the reflecting coefficients at the RIS. The SOCP subproblem is shown as a concave problem, which can be solved optimally using standard convex optimization tools. The CQFP subproblem can be solved by a low-complexity method of gradient-based linearization with domain (GLD), providing a sub-optimal solution for fast deployment. Taking the discrete phase control at the RIS into account, a nearest point searching with penalty (NPSP) method is also developed, realizing the discretization of the phase shifts of the RIS in practice. The simulation results indicate that both GLD and NPSP can achieve an excellent performance.
Submitted 4 November, 2022; v1 submitted 1 March, 2022;
originally announced March 2022.
-
Hybrid Mechanical and Electronic Beam Steering for Maximizing OAM Channel Capacity
Authors:
Rui Chen,
Zhenyang Tian,
Wen-Xuan Long,
Xiaodong Wang,
Wei Zhang
Abstract:
Radio frequency-orbital angular momentum (RF-OAM) is a novel approach of multiplexing a set of orthogonal modes on the same frequency channel to achieve high spectrum efficiencies. Since OAM requires precise alignment of the transmit and the receive antennas, the electronic beam steering approach has been proposed for the uniform circular array (UCA)-based OAM communication system to circumvent the large performance degradation induced by small antenna misalignment in practical environments. However, in the case of large-angle misalignment, the OAM channel capacity cannot be effectively compensated by electronic beam steering alone. To solve this problem, we propose a hybrid mechanical and electronic beam steering scheme, in which mechanical rotating devices controlled by pulse width modulation (PWM) signals serve as the execution unit to eliminate the large misalignment angle, while electronic beam steering handles the remaining small misalignment angle caused by perturbations. Furthermore, due to the interferometry, the receive signal-to-noise ratios (SNRs) are not uniform at the elements of the receive UCA. Therefore, a rotatable UCA structure is proposed for the OAM receiver to maximize the channel capacity: the simulated annealing algorithm is first adopted to obtain the optimal rotation angle, the servo system then performs the mechanical rotation, and finally the electronic beam steering is adjusted accordingly. Both mathematical analysis and simulation results validate that the proposed hybrid mechanical and electronic beam steering scheme can effectively eliminate the effect of diverse misalignment errors of any practical OAM channel and maximize the OAM channel capacity.
Submitted 5 August, 2022; v1 submitted 28 January, 2022;
originally announced February 2022.
-
ADD 2022: the First Audio Deep Synthesis Detection Challenge
Authors:
Jiangyan Yi,
Ruibo Fu,
Jianhua Tao,
Shuai Nie,
Haoxin Ma,
Chenglong Wang,
Tao Wang,
Zhengkun Tian,
Ye Bai,
Cunhang Fan,
Shan Liang,
Shiming Wang,
Shuai Zhang,
Xinrui Yan,
Le Xu,
Zhengqi Wen,
Haizhou Li,
Zheng Lian,
Bin Liu
Abstract:
Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021. However, the recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was motivated to fill in the gap. The ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake game (FG). The LF track focuses on dealing with bona fide and fully fake utterances with various real-world noises etc. The PF track aims to distinguish the partially fake audio from the real. The FG track is a rivalry game, which includes two tasks: an audio generation task and an audio fake detection task. In this paper, we describe the datasets, evaluation metrics, and protocols. We also report major findings that reflect the recent advances in audio deepfake detection tasks.
Submitted 26 February, 2022; v1 submitted 16 February, 2022;
originally announced February 2022.
-
Reducing language context confusion for end-to-end code-switching automatic speech recognition
Authors:
Shuai Zhang,
Jiangyan Yi,
Zhengkun Tian,
Jianhua Tao,
Yu Ting Yeung,
Liqun Deng
Abstract:
Code-switching deals with alternative languages in communication process. Training end-to-end (E2E) automatic speech recognition (ASR) systems for code-switching is especially challenging as code-switching training data are always insufficient to combat the increased multilingual context confusion due to the presence of more than one language. We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model based on the Equivalence Constraint (EC) Theory. The linguistic theory requires that any monolingual fragment that occurs in the code-switching sentence must occur in one of the monolingual sentences. The theory establishes a bridge between monolingual data and code-switching data. We leverage this linguistics theory to design the code-switching E2E ASR model. The proposed model efficiently transfers language knowledge from rich monolingual data to improve the performance of the code-switching ASR model. We evaluate our model on ASRU 2019 Mandarin-English code-switching challenge dataset. Compared to the baseline model, our proposed model achieves a 17.12% relative error reduction.
Submitted 29 June, 2022; v1 submitted 28 January, 2022;
originally announced January 2022.
-
Recovery Conditions of Sparse Signals Using Orthogonal Least Squares-Type Algorithms
Authors:
L. Lu,
W. Xu,
Y. Wang,
Z. Tian
Abstract:
Orthogonal least squares (OLS)-type algorithms are efficient in reconstructing sparse signals, which include the well-known OLS, multiple OLS (MOLS) and block OLS (BOLS). In this paper, we first investigate the noiseless exact recovery conditions of these algorithms. Specifically, based on mutual incoherence property (MIP), we provide theoretical analysis of OLS and MOLS to ensure that the correct nonzero support can be selected during the iterative procedure. Nevertheless, theoretical analysis for BOLS utilizes the block-MIP to deal with the block sparsity. Furthermore, the noiseless MIP-based analyses are extended to the noisy scenario. Our results indicate that for K-sparse signals, when MIP or SNR satisfies certain conditions, OLS and MOLS obtain reliable reconstruction in at most K iterations, while BOLS succeeds in at most (K/d) iterations where d is the block length. It is shown that our derived theoretical results improve the existing ones, which are verified by simulation tests.
Submitted 12 October, 2022; v1 submitted 13 January, 2022;
originally announced January 2022.
-
Broadband beam steering for misaligned multi-mode OAM communication systems
Authors:
Zhengjuan Tian,
Rui Chen,
Wen-Xuan Long,
Hong Zhou,
Marco Moretti
Abstract:
Orbital angular momentum (OAM) at radio frequency (RF) has attracted more and more attention as a novel approach of multiplexing a set of orthogonal OAM modes on the same frequency channel to achieve high spectral efficiency (SE). However, the precondition for maintaining the orthogonality among different OAM modes is perfect alignment of the transmit and receive uniform circular arrays (UCAs), which is difficult to be satisfied in practical wireless communication scenario. Therefore, to achieve available multi-mode OAM broadband wireless communication, we first investigate the effect of oblique angles on the transmission performance of the multi-mode OAM broadband system in the non-parallel misalignment case. Then, we compare the UCA-based RF analog and baseband digital transceiver structures and corresponding beam steering schemes. Mathematical analysis and numerical simulations validate that the SE of the misaligned multi-mode OAM broadband system is quite low, while analog and digital beam steering both can significantly improve the SE of the system. However, digital beam steering can obtain higher SE than analog beam steering especially when the bandwidth and the number of array elements are large, which validates that baseband digital transceiver with digital beam steering is more suitable for multi-mode OAM broadband wireless communication systems in practice.
Submitted 1 December, 2021;
originally announced December 2021.
-
A Unified 3D Beam Training and Tracking Procedure for Terahertz Communication
Authors:
Boyu Ning,
Zhi Chen,
Zhongbao Tian,
Chong Han,
Shaoqian Li
Abstract:
Terahertz (THz) communication is considered as an attractive way to overcome the bandwidth bottleneck and satisfy the ever-increasing capacity demand in the future. Due to the high directivity and propagation loss of THz waves, a massive MIMO system using beamforming is envisioned as a promising technology in THz communication to realize high-gain and directional transmission. However, pilots, which are the fundamentals for many beamforming schemes, are challenging to be accurately detected in the THz band owing to the severe propagation loss. In this paper, a unified 3D beam training and tracking procedure is proposed to effectively realize the beamforming in THz communications, by considering the line-of-sight (LoS) propagation. In particular, a novel quadruple-uniform planar array (QUPA) architecture is analyzed to enlarge the signal coverage, increase the beam gain, and reduce the beam squint loss. Then, a new 3D grid-based (GB) beam training is developed with low complexity, including the design of the 3D codebook and training protocol. Finally, a simple yet effective grid-based hybrid (GBH) beam tracking is investigated to support THz beamforming in an efficient manner. The communication framework based on this procedure can dynamically trigger beam training/tracking depending on the real-time quality of service. Numerical results are presented to demonstrate the superiority of our proposed beam training and tracking over the benchmark methods.
Submitted 3 October, 2021;
originally announced October 2021.
-
Convex Optimization of Speed and Energy Management System for Fuel Cell Hybrid Trains
Authors:
Rabee Jibrin,
Stuart Hillmansen,
Clive Roberts,
Ning Zhao,
Zhongbei Tian
Abstract:
We look into minimizing the hydrogen fuel consumption of hydrogen hybrid trains by optimizing their operation. The powertrain considered is a fuel cell charge-sustaining hybrid. Convex optimization is utilized to compute optimal speed and energy management trajectories. The barrier method is used to solve the optimization problems quickly on the order of tens of seconds for the entire journey. Simulations show a considerable reduction in fuel consumption when both trajectories -- speed and energy management -- are optimized concurrently within a single optimization problem in comparison to being optimized separately in a sequential manner -- optimizing energy management after optimizing speed. It is concluded that the concurrent method greatly benefits from its holistic powertrain knowledge while optimizing all trajectories together within a single optimization problem.
Submitted 28 September, 2021;
originally announced September 2021.
-
Generative adversarial network based single pixel imaging
Authors:
Ming Zhao,
Fengqiang Li,
Fengyue Huo,
Zhiming Tian
Abstract:
Single pixel imaging can reconstruct two-dimensional images of a scene with only a single-pixel detector. It has been widely used for imaging in non-visible bands (e.g., near-infrared and X-ray), where focal-plane array sensors are challenging to manufacture. In this paper, we propose a generative adversarial network based reconstruction algorithm for single pixel imaging, which demonstrates efficient reconstruction in 10 ms and higher quality. We verify the proposed method with both synthetic and real-world experiments, and demonstrate good reconstruction quality for a real-world plaster using a 0.05 sampling rate.
Submitted 11 July, 2021;
originally announced July 2021.
-
Beamforming Technologies for Ultra-Massive MIMO in Terahertz Communications
Authors:
Boyu Ning,
Zhongbao Tian,
Weidong Mei,
Zhi Chen,
Chong Han,
Shaoqian Li,
Jinhong Yuan,
Rui Zhang
Abstract:
Terahertz (THz) communications with a frequency band of $0.1-10$ THz are envisioned as a promising solution for future high-speed wireless communication. Despite tens of gigahertz of available bandwidth, THz signals suffer from severe free-spreading loss and molecular-absorption loss, which limit the wireless transmission distance. To compensate for the propagation loss, ultra-massive multiple-input-multiple-output (UM-MIMO) can be applied to generate a high-gain directional beam through beamforming technologies. In this paper, a review of beamforming technologies for THz UM-MIMO systems is provided. Specifically, we first present the system model of THz UM-MIMO and identify its channel parameters and architecture types. Then, we illustrate the basic principles of beamforming via UM-MIMO and discuss the far-field and near-field assumptions in THz UM-MIMO. Moreover, an important beamforming strategy in the THz band, i.e., beam training, is introduced, wherein the beam training protocol and codebook design approaches are summarized. The intelligent-reflecting-surface (IRS)-assisted joint beamforming and multi-user beamforming in THz UM-MIMO systems are studied, respectively. The spatial-wideband effect and frequency-wideband effect in THz beamforming are analyzed and the corresponding solutions are provided. Further, we present the corresponding fabrication techniques and illuminate the emerging applications benefiting from THz beamforming. Open challenges and future research directions on THz UM-MIMO systems are finally highlighted.
Submitted 13 March, 2023; v1 submitted 7 July, 2021;
originally announced July 2021.
-
Continual Learning for Fake Audio Detection
Authors:
Haoxin Ma,
Jiangyan Yi,
Jianhua Tao,
Ye Bai,
Zhengkun Tian,
Chenglong Wang
Abstract:
Fake audio attacks have become a major threat to speaker verification systems. Although current detection approaches have achieved promising results in dataset-specific scenarios, they encounter difficulties on unseen spoofing data. Fine-tuning and retraining from scratch have been applied to incorporate new data. However, fine-tuning leads to performance degradation on previous data, and retraining takes a lot of time and computation resources. Besides, previous data are unavailable in some situations due to privacy. To solve the above problems, this paper proposes detecting fake without forgetting, a continual-learning-based method, to make the model learn new spoofing attacks incrementally. A knowledge distillation loss is introduced into the loss function to preserve the memory of the original model. Supposing the distribution of genuine voice is consistent among different scenarios, an extra embedding similarity loss is used as another constraint to further align positive samples. Experiments are conducted on the ASVspoof2019 dataset. The results show that our proposed method outperforms fine-tuning, with a relative reduction in average equal error rate of up to 81.62%.
Submitted 15 April, 2021;
originally announced April 2021.
-
Half-Truth: A Partially Fake Audio Detection Dataset
Authors:
Jiangyan Yi,
Ye Bai,
Jianhua Tao,
Haoxin Ma,
Zhengkun Tian,
Chenglong Wang,
Tao Wang,
Ruibo Fu
Abstract:
Diverse promising datasets have been designed to advance the development of fake audio detection, such as the ASVspoof databases. However, previous datasets ignore an attack scenario in which the hacker hides some small fake clips in real speech audio. This poses a serious threat because it is difficult to distinguish the small fake clip from the whole speech utterance. Therefore, this paper develops such a dataset for half-truth audio detection (HAD). Partially fake audio in the HAD dataset involves changing only a few words in an utterance. The audio of the words is generated with the very latest state-of-the-art speech synthesis technology. Using this dataset, we can not only detect fake utterances but also localize manipulated regions in the speech. Some benchmark results are presented on this dataset. The results show that partially fake audio is much more challenging than fully fake audio for fake audio detection. The HAD dataset is publicly available: https://zenodo.org/records/10377492.
Submitted 15 December, 2023; v1 submitted 8 April, 2021;
originally announced April 2021.
-
FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization
Authors:
Zhengkun Tian,
Jiangyan Yi,
Ye Bai,
Jianhua Tao,
Shuai Zhang,
Zhengqi Wen
Abstract:
Transducer-based models, such as RNN-Transducer and transformer-transducer, have achieved great success in speech recognition. A typical transducer model decodes the output sequence step by step, conditioned on the current acoustic state and previously predicted tokens. Statistically, blank tokens account for nearly 90% of all tokens in the prediction results. It takes a lot of computation and time to predict the blank tokens, but only the non-blank tokens appear in the final output sequence. Therefore, we propose a method named fast-skip regularization, which tries to align the blank positions predicted by a transducer with those predicted by a CTC model. During inference, the transducer model can predict the blank tokens in advance with a simple CTC projection layer, without many complicated forward calculations of the transducer decoder, and then skip them, which greatly reduces the computation and improves the inference speed. All experiments are conducted on a public Chinese Mandarin dataset, AISHELL-1. The results show that fast-skip regularization can indeed help the transducer model learn the blank position alignments. Besides, inference with fast-skip can be sped up nearly 4 times with only a little performance degradation.
Submitted 6 April, 2021;
originally announced April 2021.
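As a rough illustration of the fast-skip idea described in the abstract above, the following minimal Python sketch marks frames whose CTC blank probability is high and leaves only the remaining frames for the expensive transducer decoder. The function names, the 0.9 threshold, and the toy probabilities are assumptions made for illustration; the abstract does not specify the actual decision rule.

```python
from typing import List


def frames_to_decode(ctc_blank_probs: List[float], threshold: float = 0.9) -> List[int]:
    """Return indices of frames that are NOT confidently blank under the CTC layer.

    The threshold is an assumed hyperparameter, not a value taken from the paper.
    """
    return [t for t, p in enumerate(ctc_blank_probs) if p < threshold]


def skipped_fraction(ctc_blank_probs: List[float], threshold: float = 0.9) -> float:
    """Fraction of frames for which the transducer decoder call would be skipped."""
    if not ctc_blank_probs:
        return 0.0
    kept = frames_to_decode(ctc_blank_probs, threshold)
    return 1.0 - len(kept) / len(ctc_blank_probs)


# Toy per-frame blank probabilities (hypothetical values).
blank_probs = [0.99, 0.98, 0.10, 0.97, 0.95, 0.20, 0.99, 0.99, 0.96, 0.93]
print(frames_to_decode(blank_probs), skipped_fraction(blank_probs))
```

In this toy input only two of ten frames fall below the threshold, so the full decoder would run on those two frames alone.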
-
TSNAT: Two-Step Non-Autoregressvie Transformer Models for Speech Recognition
Authors:
Zhengkun Tian,
Jiangyan Yi,
Jianhua Tao,
Ye Bai,
Shuai Zhang,
Zhengqi Wen,
Xuefei Liu
Abstract:
Autoregressive (AR) models, such as attention-based encoder-decoder models and RNN-Transducer, have achieved great success in speech recognition. They predict the output sequence conditioned on the previous tokens and acoustic encoded states, which is inefficient on GPUs. Non-autoregressive (NAR) models can get rid of the temporal dependency between the output tokens and predict all output tokens in at least one step. However, NAR models still face two major problems. On the one hand, there is still a large performance gap between NAR models and advanced AR models. On the other hand, most NAR models are difficult to train and slow to converge. To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT), which improves the performance and accelerates the convergence of the NAR model by learning prior knowledge from a parameter-sharing AR model. Furthermore, we introduce the two-stage method into the inference process, which greatly improves the model performance. All experiments are conducted on a public Chinese Mandarin dataset, AISHELL-1. The results show that TSNAT achieves performance competitive with the AR model and outperforms many complicated NAR models.
Submitted 3 April, 2021;
originally announced April 2021.
-
Fast End-to-End Speech Recognition via Non-Autoregressive Models and Cross-Modal Knowledge Transferring from BERT
Authors:
Ye Bai,
Jiangyan Yi,
Jianhua Tao,
Zhengkun Tian,
Zhengqi Wen,
Shuai Zhang
Abstract:
Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because the decoder predicts text tokens (such as characters or words) in an autoregressive manner, it is difficult for an AED model to predict all tokens in parallel. This makes the inference speed relatively slow. We believe that because the encoder already captures the whole speech utterance, which has the token-level relationship implicitly, we can predict a token without explicitly autoregressive language modeling. When the prediction of a token does not rely on other tokens, the parallel prediction of all tokens in the sequence is realizable. Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position dependent summarizer (PDS). The three modules are based on basic attention blocks. The encoder extracts high-level representations from the speech. The PDS uses positional encodings corresponding to tokens to convert the acoustic representations into token-level representations. The decoder further captures token-level relationships with the self-attention mechanism. At last, the probability distribution on the vocabulary is computed for each token position. Therefore, speech recognition is re-formulated as a position-wise classification problem. Further, we propose a cross-modal transfer learning method to refine semantics from a large-scale pre-trained language model BERT for improving the performance.
Submitted 29 August, 2021; v1 submitted 15 February, 2021;
originally announced February 2021.
-
D2A U-Net: Automatic Segmentation of COVID-19 Lesions from CT Slices with Dilated Convolution and Dual Attention Mechanism
Authors:
Xiangyu Zhao,
Peng Zhang,
Fan Song,
Guangda Fan,
Yangyang Sun,
Yujia Wang,
Zheyuan Tian,
Luqi Zhang,
Guanglei Zhang
Abstract:
Coronavirus Disease 2019 (COVID-19) has caused great casualties and has become one of the most urgent public health events worldwide. Computed tomography (CT) is a significant screening tool for COVID-19 infection, and automated segmentation of lung infection in COVID-19 CT images will greatly assist diagnosis and health care of patients. However, accurate and automatic segmentation of COVID-19 lung infections remains challenging. In this paper, we propose a dilated dual attention U-Net (D2A U-Net) for COVID-19 lesion segmentation in CT slices, based on dilated convolution and a novel dual attention mechanism, to address the issues above. We introduce a dilated convolution module in the model decoder to achieve a large receptive field, which refines the decoding process and contributes to segmentation accuracy. We also present a dual attention mechanism composed of two attention modules, which are inserted into the skip connections and the model decoder, respectively. The dual attention mechanism is used to refine feature maps and reduce the semantic gap between different levels of the model. The proposed method has been evaluated on an open-source dataset and outperforms cutting-edge methods in semantic segmentation. Our proposed D2A U-Net with a pretrained encoder achieves a Dice score of 0.7298 and a recall score of 0.7071. In addition, we build a simplified D2A U-Net without a pretrained encoder to provide a fair comparison with other models trained from scratch, which still outperforms popular U-Net family models with a Dice score of 0.7047 and a recall score of 0.6626. Our experimental results show that by introducing dilated convolution and the dual attention mechanism, the number of false positives is significantly reduced, which improves sensitivity to COVID-19 lesions and subsequently brings a significant increase in Dice score.
Submitted 9 February, 2021;
originally announced February 2021.
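For readers unfamiliar with the two metrics quoted in the abstract above, the following short Python sketch computes the Dice score and recall for flat binary masks using their standard definitions. The function name, the toy masks, and the convention of returning 1.0 when a denominator is zero are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence, Tuple


def dice_and_recall(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Dice = 2*TP / (2*TP + FP + FN); recall = TP / (TP + FN), for 0/1 masks."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    dice_denom = 2 * tp + fp + fn
    # Assumed convention: empty prediction and empty target count as a perfect match.
    dice = 2 * tp / dice_denom if dice_denom else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return dice, recall


# Toy masks (hypothetical): the second prediction removes one false positive.
target = [1, 0, 0, 1, 1, 0]
with_false_positive = [1, 1, 0, 0, 1, 0]
without_false_positive = [1, 0, 0, 0, 1, 0]

dice_a, _ = dice_and_recall(with_false_positive, target)
dice_b, _ = dice_and_recall(without_false_positive, target)
print(dice_a, dice_b)
```

Removing the single false positive in the toy example raises the Dice score, which matches the direction of the effect the abstract reports.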
-
Instance and Panoptic Segmentation Using Conditional Convolutions
Authors:
Zhi Tian,
Bowen Zhang,
Hao Chen,
Chunhua Shen
Abstract:
We propose a simple yet effective framework for instance and panoptic segmentation, termed CondInst (conditional convolutions for instance and panoptic segmentation). In the literature, top-performing instance segmentation methods typically follow the paradigm of Mask R-CNN and rely on ROI operations (typically ROIAlign) to attend to each instance. In contrast, we propose to attend to the instances with dynamic conditional convolutions. Instead of using instance-wise ROIs as inputs to the instance mask head of fixed weights, we design dynamic instance-aware mask heads, conditioned on the instances to be predicted. CondInst enjoys three advantages: 1.) Instance and panoptic segmentation are unified into a fully convolutional network, eliminating the need for ROI cropping and feature alignment. 2.) The elimination of the ROI cropping also significantly improves the output instance mask resolution. 3.) Due to the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference time per instance and making the overall inference time almost constant, irrelevant to the number of instances. We demonstrate a simpler method that can achieve improved accuracy and inference speed on both instance and panoptic segmentation tasks. On the COCO dataset, we outperform a few state-of-the-art methods. We hope that CondInst can be a strong baseline for instance and panoptic segmentation. Code is available at: https://git.io/AdelaiDet
Submitted 19 January, 2022; v1 submitted 5 February, 2021;
originally announced February 2021.
-
Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition
Authors:
Cunhang Fan,
Jiangyan Yi,
Jianhua Tao,
Zhengkun Tian,
Bin Liu,
Zhengqi Wen
Abstract:
Joint training frameworks for speech enhancement and recognition have achieved quite good performance for robust end-to-end automatic speech recognition (ASR). However, these methods only use the enhanced feature as the input of the speech recognition component, which is affected by the speech distortion problem. To address this problem, this paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR. The GRF algorithm is used to dynamically combine the noisy and enhanced features. Therefore, the GRF can not only remove the noise signals from the enhanced features, but also learn the raw fine structures from the noisy features so that it can alleviate the speech distortion. The proposed method consists of speech enhancement, GRF and speech recognition. Firstly, the mask-based speech enhancement network is applied to enhance the input speech. Secondly, the GRF is applied to address the speech distortion problem. Thirdly, to improve the performance of ASR, the state-of-the-art speech transformer algorithm is used as the speech recognition component. Finally, the joint training framework is used to optimize these three components simultaneously. Our experiments are conducted on an open-source Mandarin speech corpus called AISHELL-1. Experimental results show that the proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method that uses only the enhanced features. Especially at a low signal-to-noise ratio (0 dB), our proposed method achieves better performance, with a 12.67% CER reduction, which suggests the potential of our proposed method.
Submitted 9 November, 2020;
originally announced November 2020.
-
Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition
Authors:
Shuai Zhang,
Jiangyan Yi,
Zhengkun Tian,
Ye Bai,
Jianhua Tao,
Zhengqi Wen
Abstract:
Despite the recent significant advances in end-to-end (E2E) ASR systems for code-switching, the hunger for audio-text paired data limits further improvement of the models' performance. In this paper, we propose a decoupled transformer model that uses monolingual paired data and unpaired text data to alleviate the shortage of code-switching data. The model is decoupled into two parts: an audio-to-phoneme (A2P) network and a phoneme-to-text (P2T) network. The A2P network can learn acoustic pattern scenarios using large-scale monolingual paired data. Meanwhile, it generates multiple phoneme sequence candidates for a single audio sample in real time during the training process. The generated phoneme-text paired data is then used to train the P2T network. This network can be pre-trained with large amounts of external unpaired text data. By using monolingual data and unpaired text data, the decoupled transformer model reduces, to a certain extent, the E2E model's high dependency on code-switching paired training data. Finally, the two networks are optimized jointly through attention fusion. We evaluate the proposed method on the public Mandarin-English code-switching dataset. Compared with our transformer baseline, the proposed method achieves an 18.14% relative mix error rate reduction.
Submitted 28 October, 2020;
originally announced October 2020.
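Several abstracts in this listing, including the one above, report gains as a relative error rate reduction. The sketch below shows the usual arithmetic behind such a figure; the function name and the example error rates are hypothetical and are not results from any of the listed papers.

```python
def relative_reduction(baseline_error: float, proposed_error: float) -> float:
    """Relative error rate reduction in percent: 100 * (baseline - proposed) / baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - proposed_error) / baseline_error


# Hypothetical error rates, chosen only to illustrate the formula.
example = relative_reduction(baseline_error=10.0, proposed_error=8.0)
print(f"{example:.2f}% relative reduction")
```

A relative figure depends on the baseline: the same absolute drop of two points would be a smaller relative reduction against a higher baseline error rate.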
-
One In A Hundred: Select The Best Predicted Sequence from Numerous Candidates for Streaming Speech Recognition
Authors:
Zhengkun Tian,
Jiangyan Yi,
Ye Bai,
Jianhua Tao,
Shuai Zhang,
Zhengqi Wen
Abstract:
RNN-Transducers and improved attention-based encoder-decoder models are widely applied to streaming speech recognition. Compared with these two end-to-end models, the CTC model is more efficient in training and inference. However, it cannot capture the linguistic dependencies between the output tokens. Inspired by the success of two-pass end-to-end models, we introduce a transformer decoder and the two-stage inference method into the streaming CTC model. During inference, the CTC decoder first generates many candidates in a streaming fashion. Then the transformer decoder selects the best candidate based on the corresponding acoustic encoded states. The second-stage transformer decoder can be regarded as a conditional language model. We assume that a sufficiently large and diverse set of candidates generated in the first stage can compensate for the CTC model's lack of language modeling ability. All experiments are conducted on a Chinese Mandarin dataset, AISHELL-1. The results show that our proposed model can implement streaming decoding in a fast and straightforward way. Our model achieves up to a 20% reduction in character error rate compared with the baseline CTC model. In addition, our model can also perform non-streaming inference with only a little performance degradation.
Submitted 3 April, 2021; v1 submitted 28 October, 2020;
originally announced October 2020.
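The two-stage inference described in the abstract above can be sketched, under simplifying assumptions, as picking the first-stage candidate that a second-pass scorer rates highest. In the Python sketch below, the `select_best` function, the candidate strings, and the lookup-table scorer are all hypothetical stand-ins; in the paper the second stage is a transformer decoder conditioned on acoustic encoded states, which is not modeled here.

```python
from typing import Callable, Dict, List


def select_best(candidates: List[str], second_pass_score: Callable[[str], float]) -> str:
    """Return the candidate with the highest second-pass score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=second_pass_score)


# Hypothetical first-stage candidates with made-up second-pass log-scores.
toy_scores: Dict[str, float] = {
    "ni hao": -1.2,
    "ni hao ma": -0.4,
    "li hao": -3.0,
}
best = select_best(list(toy_scores), lambda hyp: toy_scores[hyp])
print(best)
```

Keeping the scorer as a plain callable means the same selection step works whether the second pass is a lookup table, as here, or a neural decoder.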
-
SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving
Authors:
Ming Zhou,
Jun Luo,
Julian Villella,
Yaodong Yang,
David Rusu,
Jiayu Miao,
Weinan Zhang,
Montgomery Alban,
Iman Fadakar,
Zheng Chen,
Aurora Chongxi Huang,
Ying Wen,
Kimia Hassanzadeh,
Daniel Graves,
Dong Chen,
Zhengbang Zhu,
Nhat Nguyen,
Mohamed Elsayed,
Kun Shao,
Sanjeevan Ahilan,
Baokuan Zhang,
Jiannan Wu,
Zhengang Fu,
Kasra Rezaee,
Peyman Yadmellat
, et al. (12 additional authors not shown)
Abstract:
Multi-agent interaction is a fundamental aspect of autonomous driving in the real world. Despite more than a decade of research and development, the problem of how to competently interact with diverse road users in diverse scenarios remains largely unsolved. Learning methods have much to offer towards solving this problem. But they require a realistic multi-agent simulator that generates diverse and competent driving interactions. To meet this need, we develop a dedicated simulation platform called SMARTS (Scalable Multi-Agent RL Training School). SMARTS supports the training, accumulation, and use of diverse behavior models of road users. These are in turn used to create increasingly more realistic and diverse interactions that enable deeper and broader research on multi-agent interaction. In this paper, we describe the design goals of SMARTS, explain its basic architecture and its key features, and illustrate its use through concrete multi-agent experiments on interactive scenarios. We open-source the SMARTS platform and the associated benchmark tasks and evaluation metrics to encourage and empower research on multi-agent learning for autonomous driving. Our code is available at https://github.com/huawei-noah/SMARTS.
Submitted 31 October, 2020; v1 submitted 19 October, 2020;
originally announced October 2020.
-
Distributed ADMM with Synergetic Communication and Computation
Authors:
Zhuojun Tian,
Zhaoyang Zhang,
Jue Wang,
Xiaoming Chen,
Wei Wang,
Huaiyu Dai
Abstract:
In this paper, we propose a novel distributed alternating direction method of multipliers (ADMM) algorithm with synergetic communication and computation, called SCCD-ADMM, to reduce the total communication and computation cost of the system. Explicitly, in the proposed algorithm, each node interacts with only part of its neighboring nodes, the number of which is progressively determined according to a heuristic searching procedure, which takes into account both the predicted convergence rate and the communication and computation costs at each iteration, resulting in a trade-off between communication and computation. Then the node chooses its neighboring nodes according to an importance sampling distribution derived theoretically to minimize the variance with the latest information it locally stores. Finally, the node updates its local information with a new update rule which adapts to the number of communication nodes. We prove the convergence of the proposed algorithm and provide an upper bound of the convergence variance brought by randomness. Extensive simulations validate the excellent performances of the proposed algorithm in terms of convergence rate and variance, the overall communication and computation cost, the impact of network topology as well as the time for evaluation, in comparison with the traditional counterparts.
Submitted 29 September, 2020;
originally announced September 2020.
-
Third-Order Statistics Reconstruction from Compressive Measurements
Authors:
Yanbo Wang,
Zhi Tian
Abstract:
Estimation of third-order statistics relies on the availability of a huge number of data records, which can pose severe challenges to the data-collection hardware in terms of considerable storage costs, overwhelming energy consumption, and unaffordably high sampling rates, especially when dealing with high-dimensional data such as wideband signals. To overcome these challenges, this paper focuses on the reconstruction of third-order cumulants under the compressive sensing framework. Specifically, this paper derives a transformed linear system that directly connects the cross-cumulants of compressive measurements to the desired third-order statistics. We provide sufficient conditions for lossless third-order statistics reconstruction by solving a simple least-squares problem, along with the strongest achievable compression ratio. To reduce the computational burden, we also propose an approach to recover diagonal cumulant slices directly from compressive measurements, which is useful when the cumulant slices are sufficient for the inference task at hand. All the proposed techniques are tested via extensive simulations. The developed joint sampling and reconstruction approach to third-order statistics estimation is able to reduce the required sampling rates significantly by exploiting the cumulant structure resulting from signal stationarity, even in the absence of any sparsity constraints on the signal or cumulants.
Submitted 13 May, 2021; v1 submitted 29 July, 2020;
originally announced July 2020.