Search | arXiv e-print repository

Multitask frame-level learning for few-shot sound event detection

Authors: Liang Zou, Genwei Yan, Ruoyu Wang, Jun Du, Meng Lei, Tian Gao, Xin Fang

Abstract: This paper focuses on few-shot Sound Event Detection (SED), which aims to automatically recognize and classify sound events with limited samples. However, prevailing methods in few-shot SED predominantly rely on segment-level predictions, which often fail to provide detailed, fine-grained predictions, particularly for events of brief duration. Although frame-level prediction strategies have been proposed to overcome these limitations, these strategies commonly face difficulties with prediction truncation caused by background noise. To alleviate this issue, we introduce an innovative multitask frame-level SED framework. In addition, we introduce TimeFilterAug, a linear timing mask for data augmentation, to increase the model's robustness and adaptability to diverse acoustic environments. The proposed method achieves an F-score of 63.8%, securing the 1st rank in the few-shot bioacoustic event detection category of the Detection and Classification of Acoustic Scenes and Events Challenge 2023.

Submitted 17 March, 2024; originally announced March 2024.

Comments: 6 pages, 4 figures, conference

arXiv:2402.02775 [pdf]

Instant square lattice structured illumination microscopy: an optimal strategy towards photon-saving and real-time super-resolution observation

Authors: Tianyu Zhao, Zhaojun Wang, Manming Shu, Jingxiang Zhang, Yansheng Liang, Shaowei Wang, Ming Lei

Abstract: Over the past decade, structured illumination microscopy (SIM) has found its niche in super-resolution (SR) microscopy due to its fast imaging speed and low excitation intensity. However, due to the significantly higher light dose compared to wide-field microscopy and the time-consuming post-processing procedures, long-term, real-time, super-resolution observation of living cells is still out of reach for most SIM setups, which inevitably limits its routine use by cell biologists. Here, we describe square lattice SIM (SL-SIM) for long-duration live cell imaging by using the square lattice optical field as illumination, which allows continuous super-resolved observation over long periods of time. In addition, by extending the previous joint spatial-frequency reconstruction concept to SL-SIM, a high-speed reconstruction strategy is validated in the GPU environment, whose reconstruction time is even shorter than image acquisition time, thus enabling real-time observation. We have demonstrated the potential of SL-SIM on various biological applications, ranging from microtubule cytoskeleton dynamics to the interactions of mitochondrial cristae and DNAs in COS7 cells. The inherent lower light dose and user-friendly workflow of the SL-SIM could help make long-duration, real-time and super-resolved observations accessible to biological laboratories.

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2312.01423 [pdf, other]

Self-Critical Alternate Learning based Semantic Broadcast Communication

Authors: Zhilin Lu, Rongpeng Li, Ming Lei, Chan Wang, Zhifeng Zhao, Honggang Zhang

Abstract: Semantic communication (SemCom) has been deemed a promising communication paradigm to break through the bottleneck of traditional communications. Nonetheless, most existing works focus on point-to-point communication scenarios, and the extension to multi-user scenarios is not straightforward because directly scaling the JSCC framework to a multi-user communication system is cost-inefficient. Meanwhile, previous methods optimize the system by differentiable bit-level supervision, easily leading to a "semantic gap". Therefore, we delve into multi-user broadcast communication (BC) based on the universal transformer (UT) and propose a reinforcement learning (RL) based self-critical alternate learning (SCAL) algorithm, named SemanticBC-SCAL, to capably adapt to the different BC channels from one transmitter (TX) to multiple receivers (RXs) for the sentence generation task. In particular, to enable stable optimization via a nondifferentiable semantic metric, we regard sentence similarity as a reward and formulate this learning process as an RL problem. Considering the huge decision space, we adopt a lightweight but efficient self-critical supervision to guide the learning process. Meanwhile, an alternate learning mechanism is developed to provide cost-effective learning, in which the encoder and decoders are updated asynchronously with different iterations.
Notably, the incorporation of RL makes SemanticBC-SCAL compliant with any user-defined semantic similarity metric and simultaneously addresses the channel non-differentiability issue by alternate learning. Besides, the convergence of SemanticBC-SCAL is also theoretically established. Extensive simulations have been conducted to verify the effectiveness and superiority of our approach, especially at low SNRs.

Submitted 3 December, 2023; originally announced December 2023.
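
The self-critical supervision named in the abstract above can be illustrated with a minimal sketch. It assumes the common self-critical formulation, in which the reward of a greedily decoded sentence serves as the baseline for a sampled sentence; the abstract does not spell out the exact loss, so the function name and the numbers below are hypothetical.

```python
import math


def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """REINFORCE-style loss with the greedy decode's reward as baseline.

    Assumption: rewards are sentence-similarity scores; no particular
    similarity metric is implied here.
    """
    # Positive advantage: the sampled sentence beat the greedy one.
    advantage = sample_reward - greedy_reward
    # Minimizing this loss raises the likelihood of better-than-greedy samples.
    return -advantage * sum(sample_log_probs)


# Hypothetical token log-probabilities and rewards, for illustration only.
log_probs = [math.log(0.5), math.log(0.25)]
loss = self_critical_loss(log_probs, sample_reward=0.8, greedy_reward=0.6)
```

Because only scalar rewards enter the loss, the similarity metric itself never needs to be differentiable, which is the property the abstract relies on.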

arXiv:2311.08188 [pdf, ps, other]

Fast List Decoding of High-Rate Polar Codes

Authors: Yang Lu, Ming-Min Zhao, Ming Lei, Min-Jian Zhao

Abstract: Due to the ability to provide superior error-correction performance, the successive cancellation list (SCL) algorithm is widely regarded as one of the most promising decoding algorithms for polar codes with short-to-moderate code lengths. However, the application of SCL decoding in low-latency communication scenarios is limited due to its sequential nature. To reduce the decoding latency, developing tailored fast and efficient list decoding algorithms of specific polar substituent codes (special nodes) is a promising solution. Recently, fast list decoding algorithms have been proposed by considering special nodes with low code rates. Aiming to further speed up SCL decoding, this paper presents fast list decoding algorithms for two types of high-rate special nodes, namely single-parity-check (SPC) nodes and sequence rate one or single-parity-check (SR1/SPC) nodes. In particular, we develop two classes of fast list decoding algorithms for these nodes, where the first class uses a sequential decoding procedure to yield decoding latency that is linear with the list size, and the second further parallelizes the decoding process by pre-determining the redundant candidate paths offline. Simulation results show that the proposed list decoding algorithms are able to achieve up to 70.7% lower decoding latency than state-of-the-art fast SCL decoders, while exhibiting the same error-correction performance.

Submitted 14 November, 2023; originally announced November 2023.

Comments: 13 pages, 8 figures
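
For readers unfamiliar with SPC nodes, the sketch below shows the classic single-candidate hard-decision rule (often called Wagner decoding) for one such node. It is not the list decoding algorithm of the paper above, only the basic building block that list variants extend; the function name and the LLR sign convention (positive means bit 0) are assumptions.

```python
def decode_spc_hard(llrs):
    """Decode a single-parity-check node from log-likelihood ratios.

    Assumed convention: a non-negative LLR favors bit 0.
    The decoded word must have even parity.
    """
    # Per-bit hard decisions.
    bits = [0 if llr >= 0 else 1 for llr in llrs]
    # If the parity check fails, flip the least reliable bit.
    if sum(bits) % 2 == 1:
        weakest = min(range(len(llrs)), key=lambda i: abs(llrs[i]))
        bits[weakest] ^= 1
    return bits


# Hypothetical LLRs: the second bit is the least reliable one.
decoded = decode_spc_hard([2.0, -0.3, 1.5, 4.0])
```

All bits are decided at once rather than one by one, which is why special nodes of this kind cut latency relative to plain sequential decoding.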

arXiv:2302.02587 [pdf, other]

Joint Scattering Environment Sensing and Channel Estimation Based on Non-stationary Markov Random Field

Authors: Wenkang Xu, Yongbo Xiao, An Liu, Ming Lei, Minjian Zhao

Abstract: This paper considers an integrated sensing and communication system, where some radar targets also serve as communication scatterers. A location domain channel modeling method is proposed based on the position of targets and scatterers in the scattering environment, and the resulting radar and communication channels exhibit a two-dimensional (2-D) joint burst sparsity. We propose a joint scattering environment sensing and channel estimation scheme to enhance the target/scatterer localization and channel estimation performance simultaneously, where a spatially non-stationary Markov random field (MRF) model is proposed to capture the 2-D joint burst sparsity. An expectation maximization (EM) based method is designed to solve the joint estimation problem, where the E-step obtains the Bayesian estimation of the radar and communication channels and the M-step automatically learns the dynamic position grid and prior parameters in the MRF. However, the existing sparse Bayesian inference methods used in the E-step involve a high-complexity matrix inverse per iteration. Moreover, due to the complicated non-stationary MRF prior, the complexity of the M-step is exponentially large. To address these difficulties, we propose an inverse-free variational Bayesian inference algorithm for the E-step and a low-complexity method based on pseudo-likelihood approximation for the M-step. In the simulations, the proposed scheme achieves better performance than the state-of-the-art method while reducing the computational overhead significantly.

Submitted 18 July, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

Comments: 15 pages, 13 figures, submitted to IEEE Transactions on Wireless Communications

arXiv:2206.12281 [pdf]

Real-time Dual-channel 2 * 2 MIMO Fiber-THz-Fiber Seamless Integration System at 385 GHz and 435 GHz

Authors: Jiao Zhang, Min Zhu, Bingchang Hua, Mingzheng Lei, Yuancheng Cai, Liang Tian, Yucong Zou, Like Ma, Yongming Huang, Jianjun Yu, Xiaohu You

Abstract: We demonstrate the first practical real-time dual-channel fiber-THz-fiber 2 * 2 MIMO seamless integration system with a record net data rate of 2 * 103.125 Gb/s at 385 GHz and 435 GHz over two spans of 20 km SSMF and 3 m wireless link.

Submitted 24 June, 2022; originally announced June 2022.

Comments: This paper has been accepted by ECOC 2022

arXiv:2204.12115 [pdf, ps, other]

Fast Successive-Cancellation Decoding of Polar Codes with Sequence Nodes

Authors: Yang Lu, Ming-Min Zhao, Ming Lei, Min-Jian Zhao

Abstract: Due to the sequential nature of the successive-cancellation (SC) algorithm, the decoding of polar codes suffers from significant decoding latencies. Fast SC decoding is able to speed up the SC decoding process by implementing parallel decoders at the intermediate levels of the SC decoding tree for some special nodes with specific information and frozen bit patterns. To further improve the parallelism of SC decoding, this paper presents a new class of special nodes composed of a sequence of rate one or single-parity-check (SR1/SPC) nodes, which can be easily found, especially in high-rate polar codes, and is able to envelop a wide variety of existing special node types. Then, we analyse the parity constraints caused by the frozen bits in each descendant node, such that the decoding performance of the SR1/SPC node can be preserved once the parity constraints are satisfied. Finally, a generalized fast decoding algorithm is proposed to decode SR1/SPC nodes efficiently, where the corresponding parity constraints are taken into consideration. Simulation results show that the proposed decoding algorithm of the SR1/SPC node can achieve near-ML performance, and the overall decoding latency can be reduced by 43.8% as compared to the state-of-the-art fast SC decoder.

Submitted 18 November, 2022; v1 submitted 26 April, 2022; originally announced April 2022.

Comments: 30 pages, 6 figures, submitted for possible journal publication

arXiv:2202.07816 [pdf, other]

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Authors: Yi Ren, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen, Zhijie Yan, Zhou Zhao

Abstract: Expressive text-to-speech (TTS) has become a hot research topic recently, mainly focusing on modeling prosody in speech. Prosody modeling has several challenges: 1) the extracted pitch used in previous prosody modeling works has inevitable errors, which hurt prosody modeling; 2) different attributes of prosody (e.g., pitch, duration and energy) are dependent on each other and produce the natural prosody together; and 3) due to the high variability of prosody and the limited amount of high-quality data for TTS training, the distribution of prosody cannot be fully shaped. To tackle these issues, we propose ProsoSpeech, which enhances the prosody using quantized latent vectors pre-trained on large-scale unpaired and low-quality text and speech data. Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes in the latent prosody vector (LPV). Then we introduce an LPV predictor, which predicts the LPV given a word sequence. We pre-train the LPV predictor on large-scale text and low-quality speech data and fine-tune it on the high-quality TTS dataset. Finally, our model can generate expressive speech conditioned on the predicted LPV. Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.

Submitted 15 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP 2022

arXiv:2111.13694 [pdf, other]

Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information

Authors: Zhihao Du, Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei

Abstract: Overlapping speech diarization is always treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with power set. Specifically, we propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels according to the similarities between speech features and given speaker embeddings. Our method is further extended and integrated with downstream tasks by utilizing the textual information, which has not been well studied in previous literature. The experimental results show that our method achieves lower diarization error rate than the target-speaker voice activity detection. When textual information is involved, the diarization errors can be further reduced. For the real meeting scenario, our method can achieve 34.11% relative improvement compared with the Bayesian hidden Markov model based clustering algorithm.

Submitted 28 November, 2021; originally announced November 2021.

Comments: Submitted to ICASSP 2022, 5 pages, 2 figures
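
The power set encoding mentioned in the SEND abstract above can be sketched as follows. This is a minimal illustration that assumes each speaker's activity maps to one bit of an integer class index; the bit ordering and function names are hypothetical, and the paper may restrict which speaker combinations are allowed.

```python
def powerset_encode(active):
    """Map per-speaker activity flags to a single class index."""
    label = 0
    for i, is_active in enumerate(active):
        if is_active:
            # Assumption: speaker i corresponds to bit i.
            label |= 1 << i
    return label


def powerset_decode(label, num_speakers):
    """Recover per-speaker activity flags from the class index."""
    return [(label >> i) & 1 for i in range(num_speakers)]


# Speakers 0 and 2 talking at the same time become one class.
label = powerset_encode([1, 0, 1])
flags = powerset_decode(label, 3)
```

With N speakers there are 2^N classes, so a single softmax over these classes replaces N independent sigmoid outputs.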

arXiv:2110.07216 [pdf, other]

doi 10.24963/ijcai.2021/527

FedSpeech: Federated Text-to-Speech with Continual Learning

Authors: Ziyue Jiang, Yi Ren, Ming Lei, Zhou Zhao

Abstract: Federated learning enables collaborative training of machine learning models under strict privacy restrictions, and federated text-to-speech aims to synthesize natural speech of multiple users with a few audio training samples stored locally on their devices. However, federated text-to-speech faces several challenges: very few training samples from each speaker are available, training samples are all stored on the local device of each user, and the global model is vulnerable to various attacks. In this paper, we propose a novel federated learning architecture based on continual learning approaches to overcome the difficulties above. Specifically, 1) we use gradual pruning masks to isolate parameters for preserving speakers' tones; 2) we apply selective masks for effectively reusing knowledge from tasks; 3) a private speaker embedding is introduced to keep users' privacy. Experiments on a reduced VCTK dataset demonstrate the effectiveness of FedSpeech: it nearly matches multi-task training in terms of multi-speaker speech quality; moreover, it sufficiently retains the speakers' tones and even outperforms the multi-task training in the speaker similarity experiment.

Submitted 22 May, 2023; v1 submitted 14 October, 2021; originally announced October 2021.

Comments: Accepted by IJCAI 2021

Journal ref: 2021. Main Track. Pages 3829-3835

arXiv:2109.04049 [pdf, other]

BeamTransformer: Microphone Array-based Overlapping Speech Detection

Authors: Siqi Zheng, Shiliang Zhang, Weilong Huang, Qian Chen, Hongbin Suo, Ming Lei, Jinwei Feng, Zhijie Yan

Abstract: We propose BeamTransformer, an efficient architecture to leverage the beamformer's edge in spatial filtering and the transformer's capability in context sequence modeling. BeamTransformer seeks to optimize modeling of the sequential relationship among signals from different spatial directions. Overlapping speech detection is one of the tasks where such optimization is favorable. In this paper we effectively apply BeamTransformer to detect overlapping segments. Compared with a single-channel approach, BeamTransformer excels at learning to identify the relationship among different beam sequences and is hence able to make predictions not only from the acoustic signals but also from the localization of the source. The results indicate that a successful incorporation of microphone array signals can lead to remarkable gains. Moreover, BeamTransformer takes one step further, as speech from overlapped speakers has been internally separated into different beams.

Submitted 9 September, 2021; originally announced September 2021.

arXiv:2106.09317 [pdf, other]

EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model

Authors: Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao

Abstract: Recently, there has been an increasing interest in neural speech synthesis. While deep neural networks achieve state-of-the-art results in text-to-speech (TTS) tasks, generating more emotional and more expressive speech is becoming a new challenge for researchers due to the scarcity of high-quality emotional speech datasets and the lack of advanced emotional TTS models. In this paper, we first briefly introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and their human-labeled emotion annotations. After that, we propose a simple but efficient architecture for emotional speech synthesis called EMSpeech. Unlike models that need additional reference audio as input, our model can predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding. In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations. Finally, by showing a comparable performance in the emotional speech synthesis task, we successfully demonstrate the ability of the proposed model.

Submitted 17 June, 2021; originally announced June 2021.

Comments: Accepted by Interspeech 2021

arXiv:2104.05784 [pdf, other]

Extremely Low Footprint End-to-End ASR System for Smart Device

Authors: Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, Ian McLoughlin

Abstract: Recently, end-to-end (E2E) speech recognition has become popular, since it can integrate the acoustic, pronunciation and language models into a single neural network, which outperforms conventional models. Among E2E approaches, attention-based models, e.g. Transformer, have emerged as being superior. Such models have opened the door to deployment of ASR on smart devices; however, they still suffer from requiring a large number of model parameters. We propose an extremely low footprint E2E ASR system for smart devices, to achieve the goal of satisfying resource constraints without sacrificing recognition accuracy. We design cross-layer weight sharing to improve parameter efficiency and further exploit model compression methods, including sparsification and quantization, to reduce memory storage and boost decoding efficiency. We evaluate our approaches on the public AISHELL-1 and AISHELL-2 benchmarks. On the AISHELL-2 task, the proposed method achieves more than 10x compression (model size reduces from 248 to 24 MB), at the cost of only minor performance loss (CER increases from 6.49% to 6.92%).

Submitted 6 July, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

Comments: 5 pages, 2 figures, accepted by INTERSPEECH 2021
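
The trade-off reported for AISHELL-2 in the abstract above can be checked with a few lines of arithmetic. The sketch uses only the figures quoted there (model size 248 MB to 24 MB, CER 6.49% to 6.92%); the variable names are hypothetical.

```python
# Figures quoted in the abstract (AISHELL-2 task).
size_before_mb, size_after_mb = 248, 24
cer_before, cer_after = 6.49, 6.92

# Compression ratio: about 10.3x, consistent with "more than 10x".
compression_ratio = size_before_mb / size_after_mb

# Absolute CER change in percentage points: about 0.43.
abs_cer_change = cer_after - cer_before

# Relative CER change: about 6.6% worse than the uncompressed model.
rel_cer_change = abs_cer_change / cer_before * 100
```

A higher CER means more recognition errors, so the compressed model gives up a small amount of accuracy in exchange for the roughly tenfold size reduction.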

arXiv:2010.15311 [pdf, other]

DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-Speech

Authors: Zhiying Huang, Hao Li, Ming Lei

Abstract: With the number of smart devices increasing, the demand for on-device text-to-speech (TTS) increases rapidly. In recent years, many prominent End-to-End TTS methods have been proposed, and have greatly improved the quality of synthesized speech. However, to ensure speech quality, most TTS systems depend on large and complex neural network models, and it is hard to deploy these TTS systems on-device. In this paper, a small-footprint, fast, stable network for on-device TTS is proposed, named DeviceTTS. DeviceTTS makes use of a duration predictor as a bridge between encoder and decoder so as to avoid the problem of word skipping and repeating in Tacotron. Model size is a key factor for on-device TTS. For DeviceTTS, the Deep Feedforward Sequential Memory Network (DFSMN) is used as the basic component. Moreover, to speed up inference, a mix-resolution decoder is proposed to balance inference speed and speech quality. Experiments are conducted with the WORLD and LPCNet vocoders. Finally, with only 1.4 million model parameters and 0.099 GFLOPS, DeviceTTS achieves comparable performance with Tacotron and FastSpeech. As far as we know, DeviceTTS can meet the needs of most devices in practical applications.

Submitted 14 January, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

Comments: 5 pages, 1 figure, Submitted to ICASSP2021

arXiv:2010.14099 [pdf, other]

Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

Authors: Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin

Abstract: Recently, online end-to-end ASR has gained increasing attention. However, the performance of online systems still lags far behind that of offline systems, with a large gap in quality of recognition. For specific scenarios, we can trade-off between performance and latency, and can train multiple systems with different delays to match the performance and latency requirements of various application scenarios. In this work, in contrast to trading-off between performance and latency, we envisage a single system that can match the needs of different scenarios. We propose a novel architecture, termed Universal ASR that can unify streaming and non-streaming ASR models into one system. The embedded streaming ASR model can configure different delays according to requirements to obtain real-time recognition results, while the non-streaming model is able to refresh the final recognition result for better performance. We have evaluated our approach on the public AISHELL-2 benchmark and an industrial-level 20,000-hour Mandarin speech recognition task. The experimental results show that the Universal ASR provides an efficient mechanism to integrate streaming and non-streaming models that can recognize speech quickly and accurately. On the AISHELL-2 task, Universal ASR comfortably outperforms other state-of-the-art systems.

Submitted 27 October, 2020; originally announced October 2020.

Comments: 5 pages, 2 figures, submitted to ICASSP 2021

arXiv:2009.04293 [pdf]

An Infrared Communication System based on Handstand Pendulum

Authors: Xingchen Li, Changlu Li, Yun Wang, Mengqi Lei

Abstract: This paper introduces an infrared optical communication system based on a stable handstand pendulum. The system mounts the infrared light emitting end on a handstand pendulum to make the infrared transmission light path stable and controllable. In this system, 940 nm infrared light is mainly used for audio signal transmission, and a PID-based handstand pendulum is used to control the angle and stability of infrared light emission. Experimental results show that the system can effectively ensure the stability of the transmission optical path and is suitable for accurate and stable signal transmission in bumpy environments.

Submitted 9 September, 2020; originally announced September 2020.
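
The PID control mentioned in the abstract above can be illustrated with a generic discrete-time controller. This is a textbook sketch, not the authors' implementation; the class name, the gains and the error values are hypothetical, and the error is assumed to be the pendulum's angular deviation from upright.

```python
class PID:
    """Minimal discrete PID controller (textbook form)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        """Return the control output for the current error sample."""
        # Integral term accumulates the error over time.
        self.integral += error * dt
        # Derivative term is zero on the first sample (no history yet).
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical gains and angle errors, for illustration only.
pid = PID(kp=2.0, ki=1.0, kd=0.5)
first = pid.update(error=1.0, dt=0.5)
second = pid.update(error=0.5, dt=0.5)
```

In a setup like the one described, the output would drive the pendulum's actuator so that the emitter stays aimed along the optical path.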

arXiv:2006.12761 [pdf, other]

Benchmarking features from different radiomics toolkits / toolboxes using Image Biomarkers Standardization Initiative

Authors: Mingxi Lei, Bino Varghese, Darryl Hwang, Steven Cen, Xiaomeng Lei, Afshin Azadikhah, Bhushan Desai, Assad Oberai, Vinay Duddalwar

Abstract: There is no consensus regarding radiomic feature terminology, the underlying mathematics, or their implementation. This creates a scenario where features extracted using different toolboxes cannot be used to build or validate the same model, which limits the generalizability of radiomic results. In this study, the image biomarker standardization initiative (IBSI) established phantom and benchmark values were used to compare the variation of the radiomic features while using 6 publicly available software programs and 1 in-house radiomics pipeline. All IBSI-standardized features (11 classes, 173 in total) were extracted. The relative differences between the extracted feature values from the different software and the IBSI benchmark values were calculated to measure the inter-software agreement. To better understand the variations, features are further grouped into 3 categories according to their properties: 1) morphology, 2) statistic/histogram and 3) texture features. While a good agreement was observed for a majority of radiomics features across the various programs, relatively poor agreement was observed for morphology features. Significant differences were also found in programs that use different gray level discretization approaches. Since these programs do not include all IBSI features, the level of quantitative assessment for each category was analyzed using Venn and UpSet diagrams and also quantified using two ad hoc metrics. Morphology features earn the lowest scores for both metrics, indicating that morphological features are not consistently evaluated among software programs. We conclude that radiomic features calculated using different software programs may not be identical or reliable. Further studies are needed to standardize the workflow of radiomic feature extraction.
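The abstract above mentions relative differences between extracted feature values and IBSI benchmark values but does not give the formula. A minimal sketch, assuming the common signed definition (value minus benchmark, divided by the magnitude of the benchmark), could look like this. The function names and the 1% default tolerance are hypothetical and are not taken from the paper or from IBSI.

```python
def relative_difference(extracted, benchmark):
    """Signed relative difference of an extracted value from its benchmark.

    Assumed definition: (extracted - benchmark) / |benchmark|.
    """
    if benchmark == 0:
        raise ValueError("relative difference is undefined for a zero benchmark")
    return (extracted - benchmark) / abs(benchmark)


def within_tolerance(extracted, benchmark, tol=0.01):
    """Return True when the absolute relative difference is at most tol.

    The default tol of 0.01 (1%) is an illustrative assumption.
    """
    return abs(relative_difference(extracted, benchmark)) <= tol
```

A per-feature agreement check would then call `within_tolerance` once for each feature that a given software program reports.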

Submitted 23 June, 2020; originally announced June 2020.

Comments: 21 pages, 8 figures

arXiv:2006.06240 [pdf, ps, other]

A PDD Decoder for Binary Linear Codes With Neural Check Polytope Projection

Authors: Yi Wei, Ming-Min Zhao, Min-Jian Zhao, Ming Lei

Abstract: Linear Programming (LP) is an important decoding technique for binary linear codes. However, the advantages of LP decoding, such as a low error floor and strong theoretical guarantees, come at the cost of high computational complexity and poor performance in the low signal-to-noise ratio (SNR) region. In this letter, we adopt the penalty dual decomposition (PDD) framework and propose a PDD algorithm to address the fundamental polytope based maximum likelihood (ML) decoding problem. Furthermore, we propose to integrate machine learning techniques into the most time-consuming part of the PDD decoding algorithm, i.e., check polytope projection (CPP). Inspired by the fact that a multi-layer perceptron (MLP) can theoretically approximate any nonlinear mapping function, we present a specially designed neural CPP (NCPP) algorithm to decrease the decoding latency. Simulation results demonstrate the effectiveness of the proposed algorithms.

Submitted 11 June, 2020; originally announced June 2020.

Comments: This paper has been accepted for publication in IEEE Wireless Communications Letters

arXiv:2006.01713 [pdf, other]

SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition

Authors: Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin

Abstract: End-to-end speech recognition has become popular in recent years, since it can integrate the acoustic, pronunciation and language models into a single neural network. Among end-to-end approaches, attention-based methods have emerged as being superior. One example is the Transformer, which adopts an encoder-decoder architecture. The key improvement introduced by Transformer is the utilization of self-attention instead of recurrent mechanisms, enabling both encoder and decoder to capture long-range dependencies with lower computational complexity. In this work, we propose boosting the self-attention ability with a DFSMN memory block, forming the proposed memory equipped self-attention (SAN-M) mechanism. Theoretical and empirical comparisons have been made to demonstrate the relevancy and complementarity between self-attention and the DFSMN memory block. Furthermore, the proposed SAN-M provides an efficient mechanism to integrate these two modules. We have evaluated our approach on the public AISHELL-1 benchmark and an industrial-level 20,000-hour Mandarin speech recognition task. On both tasks, SAN-M systems achieved much better performance than the self-attention based Transformer baseline system. Specifically, it can achieve a CER of 6.46% on the AISHELL-1 task even without using any external LM, comfortably outperforming other state-of-the-art systems.

Submitted 20 May, 2020; originally announced June 2020.

Comments: submitted to INTERSPEECH2020

arXiv:2006.01712 [pdf, other]

Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition

Authors: Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie

Abstract: Recently, streaming end-to-end automatic speech recognition (E2E-ASR) has gained more and more attention. Much effort has been devoted to turning the non-streaming attention-based E2E-ASR system into a streaming architecture. In this work, we propose a novel online E2E-ASR system by using Streaming Chunk-Aware Multihead Attention (SCAMA) and a latency control memory equipped self-attention network (LC-SAN-M). LC-SAN-M uses chunk-level input to control the latency of the encoder. As to SCAMA, a jointly trained predictor is used to control the output of the encoder when feeding to the decoder, which enables the decoder to generate output in a streaming manner. Experimental results on the open 170-hour AISHELL-1 and an industrial-level 20000-hour Mandarin speech recognition task show that our approach can significantly outperform the MoChA-based baseline system under a comparable setup. On the AISHELL-1 task, our proposed method achieves a character error rate (CER) of 7.39%, which, to the best of our knowledge, is the best published performance for online ASR.
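Several entries in this listing report results as a character error rate (CER). As a rough illustration of the metric itself, and not of any scoring tool used in these papers, the sketch below computes CER in its standard form: the character-level edit distance between a reference and a hypothesis, divided by the reference length. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # deleting the first i reference symbols
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, a CER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.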

Submitted 20 May, 2020; originally announced June 2020.

Comments: submitted to INTERSPEECH2020

arXiv:2005.10463 [pdf, other]

Simplified Self-Attention for Transformer-based End-to-End Speech Recognition

Authors: Haoneng Luo, Shiliang Zhang, Ming Lei, Lei Xie

Abstract: Transformer models have been introduced into end-to-end speech recognition with state-of-the-art performance on various tasks owing to their superiority in modeling long-term dependencies. However, such improvements are usually obtained through the use of very large neural networks. Transformer models mainly include two submodules: position-wise feedforward layers and self-attention (SAN) layers. In this paper, to reduce the model complexity while maintaining good performance, we propose a simplified self-attention (SSAN) layer which employs an FSMN memory block instead of projection layers to form query and key vectors for transformer-based end-to-end speech recognition. We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1, internal 1000-hour and 20,000-hour large-scale Mandarin tasks. Results show that our proposed SSAN-based transformer model can achieve over 20% relative reduction in model parameters and 6.7% relative CER reduction on the AISHELL-1 task. With an impressive 20% parameter reduction, our model shows no loss of recognition performance on the 20,000-hour large-scale task.

Submitted 17 November, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

Comments: Accepted to SLT 2021

arXiv:2002.07601 [pdf, other]

ADMM-based Decoder for Binary Linear Codes Aided by Deep Learning

Authors: Yi Wei, Ming-Min Zhao, Min-Jian Zhao, Ming Lei

Abstract: Inspired by the recent advances in deep learning (DL), this work presents a deep neural network aided decoding algorithm for binary linear codes. Based on the concept of deep unfolding, we design a decoding network by unfolding the alternating direction method of multipliers (ADMM)-penalized decoder. In addition, we propose two improved versions of the proposed network. The first one transforms the penalty parameter into a set of iteration-dependent ones, and the second one adopts a specially designed penalty function, which is based on a piecewise linear function with adjustable slopes. Numerical results show that the resulting DL-aided decoders outperform the original ADMM-penalized decoder for various low density parity check (LDPC) codes with similar computational complexity.

Submitted 13 February, 2020; originally announced February 2020.

Comments: 5 pages, 4 figures, accepted for publication in IEEE Communications Letters

arXiv:1906.03814 [pdf, other]

doi 10.1109/TSP.2020.3035832

Learned Conjugate Gradient Descent Network for Massive MIMO Detection

Authors: Yi Wei, Ming-Min Zhao, Mingyi Hong, Min-jian Zhao, Ming Lei

Abstract: In this work, we consider the use of model-driven deep learning techniques for massive multiple-input multiple-output (MIMO) detection. Compared with conventional MIMO systems, massive MIMO promises improved spectral efficiency, coverage and range. Unfortunately, these benefits come at the cost of significantly increased computational complexity. To reduce the complexity of signal detection and guarantee the performance, we present a learned conjugate gradient descent network (LcgNet), which is constructed by unfolding the iterative conjugate gradient descent (CG) detector. In the proposed network, instead of calculating the exact values of the scalar step-sizes, we explicitly learn their universal values. Also, we can enhance the proposed network by augmenting the dimensions of these step-sizes. Furthermore, in order to reduce the memory costs, a novel quantized LcgNet is proposed, where a low-resolution nonuniform quantizer is integrated into the LcgNet to smartly quantize the aforementioned step-sizes. The quantizer is based on a specially designed soft staircase function with learnable parameters to adjust its shape. Meanwhile, due to the fact that the number of learnable parameters is limited, the proposed networks are easy and fast to train. Numerical results demonstrate that the proposed network can achieve promising performance with much lower complexity.

Submitted 1 June, 2020; v1 submitted 10 June, 2019; originally announced June 2019.

Comments: Part of this work has been accepted by IEEE ICC 2020

arXiv:1904.10045 [pdf, other]

Automatic Spelling Correction with Transformer for CTC-based End-to-End Speech Recognition

Authors: Shiliang Zhang, Ming Lei, Zhijie Yan

Abstract: Connectionist Temporal Classification (CTC) based end-to-end speech recognition systems usually need to incorporate an external language model by using WFST-based decoding in order to achieve promising results. This is even more essential for Mandarin speech recognition because of its many homophones, which cause a lot of substitution errors. The linguistic information introduced by the language model helps to distinguish these substitution errors. In this work, we propose a transformer based spelling correction model to automatically correct errors, especially the substitution errors, made by a CTC-based Mandarin speech recognition system. Specifically, we investigate using the recognition results generated by CTC-based systems as input and the ground-truth transcriptions as output to train a transformer with encoder-decoder architecture, which is very similar to machine translation. Results on a 20,000-hour Mandarin speech recognition task show that the proposed spelling correction model can achieve a CER of 3.41%, which results in 22.9% and 53.2% relative improvement compared to the baseline CTC-based systems decoded with and without a language model, respectively.

Submitted 27 March, 2019; originally announced April 2019.

Comments: 6 pages, 5 figures

arXiv:1811.02353 [pdf]

An amplitudes-perturbation data augmentation method in convolutional neural networks for EEG decoding

Authors: Xian-Rui Zhang, Meng-Ying Lei, Yang Li

Abstract: A Brain-Computer Interface (BCI) system provides a pathway between humans and the outside world by analyzing brain signals which contain potential neural information. Electroencephalography (EEG) is one of the most commonly used brain signals and EEG recognition is an important part of a BCI system. Recently, convolutional neural networks (ConvNet) in deep learning are becoming the new cutting edge tools to tackle the problem of EEG recognition. However, training an effective deep learning model requires a large amount of data, which limits the application of EEG datasets with a small number of samples. In order to solve the issue of data insufficiency in deep learning for EEG decoding, we propose a novel data augmentation method that adds perturbations to the amplitudes of EEG signals after transforming them to the frequency domain. In experiments, we explore the performance of signal recognition with the state-of-the-art models before and after data augmentation on BCI Competition IV dataset 2a and our local dataset. The results show that our data augmentation technique can improve the accuracy of EEG recognition effectively.
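The augmentation idea described above (transform to the frequency domain, perturb the amplitudes, transform back) can be sketched for a single channel as follows. This is only an illustration under stated assumptions: the abstract does not specify the transform, the noise distribution, or its scale, so the naive DFT, the multiplicative Gaussian noise, and the names `perturb_amplitudes` and `noise_std` are all hypothetical choices.

```python
import cmath
import random


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def perturb_amplitudes(signal, noise_std=0.05, rng=None):
    """Scale each frequency amplitude by (1 + Gaussian noise), keeping the phase.

    The multiplicative Gaussian noise model is an assumption for illustration.
    """
    rng = rng or random.Random()
    n = len(signal)
    spectrum = dft(signal)
    perturbed = list(spectrum)
    for k in range(n // 2 + 1):
        amp, phase = cmath.polar(spectrum[k])
        new_amp = max(0.0, amp * (1.0 + rng.gauss(0.0, noise_std)))
        perturbed[k] = cmath.rect(new_amp, phase)
        if 0 < k < n - k:
            # Mirror the bin so the spectrum stays conjugate-symmetric
            # and the reconstructed signal stays real.
            perturbed[n - k] = perturbed[k].conjugate()
    return [v.real for v in idft(perturbed)]
```

With `noise_std=0.0` the function reproduces the input up to floating-point error, which is a convenient sanity check before generating augmented copies.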

Submitted 6 November, 2018; originally announced November 2018.

arXiv:1810.09119 [pdf]

A Parametric Time Frequency-Conditional Granger Causality Method Using Ultra-regularized Orthogonal Least Squares and Multiwavelets for Dynamic Connectivity Analysis in EEGs

Authors: Yang Li, Mengying Lei, Weigang Cui, Yuzhu Guo, Hua-Liang Wei

Abstract: Objective: This study proposes a new parametric TF (time frequency) CGC (conditional Granger causality) method for high precision connectivity analysis over time and frequency in multivariate coupling nonstationary systems, and applies it to scalp and source EEG signals to reveal dynamic interaction patterns in oscillatory neocortical sensorimotor networks. Methods: The Geweke spectral measure is combined with the TVARX (time varying autoregressive with exogenous input) modelling approach, which uses multiwavelets and the ultra regularized orthogonal least squares (UROLS) algorithm aided by APRESS (adjustable prediction error sum of squares), to obtain high resolution time varying CGC representations. The UROLS APRESS algorithm, which adopts both the regularization technique and the ultra least squares criterion to measure not only the signal data themselves but also their weak derivatives, is a novel powerful method for constructing time varying models with good generalization performance, and can accurately track smooth and fast changing causalities. The generalized measurement based on CGC decomposition is able to eliminate indirect influences in multivariate systems.
Results: The proposed method is validated on two simulations and then applied to multichannel motor imagery (MI) EEG signals at scalp and source level, where the predicted distributions are well recovered with high TF precision, and the detected connectivity patterns of MI EEG data are physiologically and anatomically interpretable and yield new insights into the dynamical organization of oscillatory cortical networks. Conclusion: Experimental results confirm the effectiveness of the proposed TF CGC method in tracking rapidly varying causalities of EEG based oscillatory networks. Significance: The novel TF CGC method is expected to provide important information about the neural mechanisms of perception and cognition.

Submitted 22 October, 2018; originally announced October 2018.

arXiv:1803.02445 [pdf, other]

Linear networks based speaker adaptation for speech synthesis

Authors: Zhiying Huang, Heng Lu, Ming Lei, Zhijie Yan

Abstract: Speaker adaptation methods aim to create fair-quality synthetic speech voice fonts for target speakers when only limited resources are available. Recently, as deep neural network based statistical parametric speech synthesis (SPSS) methods have become dominant in SPSS TTS back-end modeling, speaker adaptation under the neural network based SPSS framework has also become an important task. In this paper, linear networks (LN) are inserted in multiple neural network layers and fine-tuned together with the output layer for best speaker adaptation performance. When the adaptation data is extremely small, the low-rank plus diagonal (LRPD) decomposition for LN is employed to make the adapted voice more stable. Speaker adaptation experiments are conducted under a range of numbers of adaptation utterances. Moreover, speaker adaptation from 1) female to female, 2) male to female and 3) female to male are investigated. Objective measurement and subjective tests show that LN with LRPD decomposition is the most stable when adaptation data is extremely limited, and our best speaker adaptation (SA) model with only 200 adaptation utterances achieves comparable quality with a speaker dependent (SD) model trained with 1000 utterances, in both naturalness and similarity to the target speaker.

Submitted 5 March, 2018; originally announced March 2018.

Comments: 5 pages, 6 figures, accepted by ICASSP 2018

arXiv:1709.07747 [pdf]

SNR-based adaptive acquisition method for fast Fourier ptychographic microscopy

Authors: An Pan, Yan Zhang, Maosen Li, Meiling Zhou, Junwei Min, Ming Lei, Baoli Yao

Abstract: Fourier ptychographic microscopy (FPM) is a computational imaging technique with both high resolution and large field-of-view. However, the effective numerical aperture (NA) achievable with a typical LED panel is ambiguous and usually relies on repeated tests of different illumination NAs. The imaging quality of each raw image usually depends on visual assessment, which is subjective and inaccurate, especially for dark field images. Moreover, the acquisition process is really time-consuming. In this paper, we propose an SNR-based adaptive acquisition method for quantitative evaluation and adaptive collection of each raw image according to the signal-to-noise ratio (SNR) value, to improve the FPM's acquisition efficiency and automatically obtain the maximum achievable NA, reducing the time of collection, storage and subsequent calculation. The widely used EPRY-FPM algorithm is applied without adding any algorithm complexity and computational burden. The performance has been demonstrated on both USAF targets and biological samples with different imaging sensors, which have either Poisson or Gaussian noise models. Further combined with the sparse LEDs strategy, the number of collected images can be reduced to around 25 frames, compared with the 361 images needed previously, a reduction of over 90%. This method will make FPM more practical and automatic, and can also be used in different configurations of FPM.

Submitted 2 October, 2017; v1 submitted 19 September, 2017; originally announced September 2017.

Comments: 11 pages, 6 figures

Showing 1–28 of 28 results for author: Lei, M