Search | arXiv e-print repository

Integrated Sensing and Communication for Anti-Jamming with OAM

Authors: Li** Liang, Wenchi Cheng, Wei Zhang, Zhuohui Yao

Abstract: The spectrum share and open nature of wireless channels enable integrated sensing and communication (ISAC) susceptible to hostile jamming attacks. Due to the intrinsic orthogonality and rich azimuth angle information of orbital angular momentum (OAM), vortex electromagnetic waves with helical phase fronts have shown great potential to achieve high-resolution imaging and strong anti-jamming capabil… ▽ More The spectrum share and open nature of wireless channels enable integrated sensing and communication (ISAC) susceptible to hostile jamming attacks. Due to the intrinsic orthogonality and rich azimuth angle information of orbital angular momentum (OAM), vortex electromagnetic waves with helical phase fronts have shown great potential to achieve high-resolution imaging and strong anti-jamming capability of wireless communication. Focusing on significantly enhancing the anti-jamming results of ISAC systems with limited bandwidth under hostile jamming, in this paper we propose a novel ISAC for anti-jamming with OAM scheme, where the OAM legitimate transmitter can simultaneously sense the position of jammers with dynamic behavior and send data to multiple OAM legitimate users. Specifically, the OAM modes for sensing and communications are respectively hopped according to pre-set index modulation information to suppress jamming. To acquire the position of the jammer, we develop the enhanced multiple-signal-classification-based three-dimension position estimation scheme with continuous sensing in both frequency and angular domains, where the OAM transmitter is designed with the concentric uniform-circular-array mono-static method, to significantly increase the azimuthal resolution. Then, based on the acquired jamming channel state information, we develop the joint transmit-receive beamforming and power allocation scheme, where the transmit and receive beamforming matrices are dynamically adjusted to mitigate the mixed interference containing inter-mode interference, inter-user interference, and jamming, thus maximizing the achievable sum rates (ASRs) of all users. Numerical results demonstrate that our proposed scheme can significantly increase the ASR under broadband jamming attacks and achieve high detection accuracy of targets . △ Less

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2404.17400 [pdf, other]

Spatial-frequency Dual-Domain Feature Fusion Network for Low-Light Remote Sensing Image Enhancement

Authors: Zishu Yao, Guodong Fan, **fu Fan, Min Gan, C. L. Philip Chen

Abstract: Low-light remote sensing images generally feature high resolution and high spatial complexity, with continuously distributed surface features in space. This continuity in scenes leads to extensive long-range correlations in spatial domains within remote sensing images. Convolutional Neural Networks, which rely on local correlations for long-distance modeling, struggle to establish long-range corre… ▽ More Low-light remote sensing images generally feature high resolution and high spatial complexity, with continuously distributed surface features in space. This continuity in scenes leads to extensive long-range correlations in spatial domains within remote sensing images. Convolutional Neural Networks, which rely on local correlations for long-distance modeling, struggle to establish long-range correlations in such images. On the other hand, transformer-based methods that focus on global information face high computational complexities when processing high-resolution remote sensing images. From another perspective, Fourier transform can compute global information without introducing a large number of parameters, enabling the network to more efficiently capture the overall image structure and establish long-range correlations. Therefore, we propose a Dual-Domain Feature Fusion Network (DFFN) for low-light remote sensing image enhancement. Specifically, this challenging task of low-light enhancement is divided into two more manageable sub-tasks: the first phase learns amplitude information to restore image brightness, and the second phase learns phase information to refine details. To facilitate information exchange between the two phases, we designed an information fusion affine block that combines data from different phases and scales. Additionally, we have constructed two dark light remote sensing datasets to address the current lack of datasets in dark light remote sensing image enhancement. Extensive evaluations show that our method outperforms existing state-of-the-art methods. The code is available at https://github.com/iijjlk/DFFN. △ Less

Submitted 26 April, 2024; originally announced April 2024.

Comments: 14 page

arXiv:2404.13955 [pdf, other]

doi 10.33012/2023.19426

GNSS Measurement-Based Context Recognition for Vehicle Navigation using Gated Recurrent Unit

Authors: Sheng Liu, Zhiqiang Yao, Xuemeng Cao, Xiaowen Cai

Abstract: Recent years, people have put forward higher and higher requirements for context-adaptive navigation (CAN). CAN system realizes seamless navigation in complex environments by recognizing the ambient surroundings of vehicles, and it is crucial to develop a fast, reliable, and robust navigational context recognition (NCR) method to enable CAN systems to operate effectively. Environmental context rec… ▽ More Recent years, people have put forward higher and higher requirements for context-adaptive navigation (CAN). CAN system realizes seamless navigation in complex environments by recognizing the ambient surroundings of vehicles, and it is crucial to develop a fast, reliable, and robust navigational context recognition (NCR) method to enable CAN systems to operate effectively. Environmental context recognition based on Global Navigation Satellite System (GNSS) measurements has attracted widespread attention due to its low cost because it does not require additional infrastructure. The performance and application value of NCR methods depend on three main factors: context categorization, feature extraction, and classification models. In this paper, a fine-grained context categorization framework comprising seven environment categories (open sky, tree-lined avenue, semi-outdoor, urban canyon, viaduct-down, shallow indoor, and deep indoor) is proposed, which currently represents the most elaborate context categorization framework known in this research domain. To improve discrimination between categories, a new feature called the C/N0-weighted azimuth distribution factor, is designed. Then, to ensure real-time performance, a lightweight gated recurrent unit (GRU) network is adopted for its excellent sequence data processing capabilities. A dataset containing 59,996 samples is created and made publicly available to researchers in the NCR community on Github. Extensive experiments have been conducted on the dataset, and the results show that the proposed method achieves an overall recognition accuracy of 99.41\% for isolated scenarios and 94.95\% for transition scenarios, with an average transition delay of 2.14 seconds. △ Less

Submitted 22 April, 2024; originally announced April 2024.

Comments: 9 pages, 9 figures, 5 tables

Journal ref: Proceedings of the 36th International Technical Meeting of the Satellite Division of The Institute of Navigation (ION GNSS+ 2023)

arXiv:2404.09729 [pdf]

Amplitude-Phase Fusion for Enhanced Electrocardiogram Morphological Analysis

Authors: Shuaicong Hu, Yanan Wang, Jian Liu, **gyu Lin, Shengmei Qin, Zhenning Nie, Zhifeng Yao, Wenjie Cai, Cuiwei Yang

Abstract: Considering the variability of amplitude and phase patterns in electrocardiogram (ECG) signals due to cardiac activity and individual differences, existing entropy-based studies have not fully utilized these two patterns and lack integration. To address this gap, this paper proposes a novel fusion entropy metric, morphological ECG entropy (MEE) for the first time, specifically designed for ECG mor… ▽ More Considering the variability of amplitude and phase patterns in electrocardiogram (ECG) signals due to cardiac activity and individual differences, existing entropy-based studies have not fully utilized these two patterns and lack integration. To address this gap, this paper proposes a novel fusion entropy metric, morphological ECG entropy (MEE) for the first time, specifically designed for ECG morphology, to comprehensively describe the fusion of amplitude and phase patterns. MEE is computed based on beat-level samples, enabling detailed analysis of each cardiac cycle. Experimental results demonstrate that MEE achieves rapid, accurate, and label-free localization of abnormal ECG arrhythmia regions. Furthermore, MEE provides a method for assessing sample diversity, facilitating compression of imbalanced training sets (via representative sample selection), and outperforms random pruning. Additionally, MEE exhibits the ability to describe areas of poor quality. By discussing, it proves the robustness of MEE value calculation to noise interference and its low computational complexity. Finally, we integrate this method into a clinical interactive interface to provide a more convenient and intuitive user experience. These findings indicate that MEE serves as a valuable clinical descriptor for ECG characterization. The implementation code can be referenced at the following link: https://github.com/fdu-harry/ECG-MEE-metric. △ Less

Submitted 15 April, 2024; originally announced April 2024.

Comments: 16 pages, 12 figures

ACM Class: I.5.2

arXiv:2403.08170 [pdf, other]

Versatile Defense Against Adversarial Attacks on Image Recognition

Authors: Haibo Zhang, Zhihua Yao, Kouichi Sakurai

Abstract: Adversarial attacks present a significant security risk to image recognition tasks. Defending against these attacks in a real-life setting can be compared to the way antivirus software works, with a key consideration being how well the defense can adapt to new and evolving attacks. Another important factor is the resources involved in terms of time and cost for training defense models and updating… ▽ More Adversarial attacks present a significant security risk to image recognition tasks. Defending against these attacks in a real-life setting can be compared to the way antivirus software works, with a key consideration being how well the defense can adapt to new and evolving attacks. Another important factor is the resources involved in terms of time and cost for training defense models and updating the model database. Training many models that are specific to each type of attack can be time-consuming and expensive. Ideally, we should be able to train one single model that can handle a wide range of attacks. It appears that a defense method based on image-to-image translation may be capable of this. The proposed versatile defense approach in this paper only requires training one model to effectively resist various unknown adversarial attacks. The trained model has successfully improved the classification accuracy from nearly zero to an average of 86%, performing better than other defense methods proposed in prior studies. When facing the PGD attack and the MI-FGSM attack, versatile defense model even outperforms the attack-specific models trained based on these two attacks. The robustness check also shows that our versatile defense model performs stably regardless with the attack strength. △ Less

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2402.00313 [pdf, other]

Control in Stochastic Environment with Delays: A Model-based Reinforcement Learning Approach

Authors: Zhiyuan Yao, Ionut Florescu, Chihoon Lee

Abstract: In this paper we are introducing a new reinforcement learning method for control problems in environments with delayed feedback. Specifically, our method employs stochastic planning, versus previous methods that used deterministic planning. This allows us to embed risk preference in the policy optimization problem. We show that this formulation can recover the optimal policy for problems with dete… ▽ More In this paper we are introducing a new reinforcement learning method for control problems in environments with delayed feedback. Specifically, our method employs stochastic planning, versus previous methods that used deterministic planning. This allows us to embed risk preference in the policy optimization problem. We show that this formulation can recover the optimal policy for problems with deterministic transitions. We contrast our policy with two prior methods from literature. We apply the methodology to simple tasks to understand its features. Then, we compare the performance of the methods in controlling multiple Atari games. △ Less

Submitted 31 January, 2024; originally announced February 2024.

Comments: Under Review

arXiv:2401.09508 [pdf, other]

4D-ONIX: A deep learning approach for reconstructing 3D movies from sparse X-ray projections

Authors: Yuhe Zhang, Zisheng Yao, Robert Klöfkorn, Tobias Ritschel, Pablo Villanueva-Perez

Abstract: The X-ray flux provided by X-ray free-electron lasers and storage rings offers new spatiotemporal possibilities to study in-situ and operando dynamics, even using single pulses of such facilities. X-ray Multi-Projection Imaging (XMPI) is a novel technique that enables volumetric information using single pulses of such facilities and avoids centrifugal forces induced by state-of-the-art time-resolv… ▽ More The X-ray flux provided by X-ray free-electron lasers and storage rings offers new spatiotemporal possibilities to study in-situ and operando dynamics, even using single pulses of such facilities. X-ray Multi-Projection Imaging (XMPI) is a novel technique that enables volumetric information using single pulses of such facilities and avoids centrifugal forces induced by state-of-the-art time-resolved 3D methods such as time-resolved tomography. As a result, XMPI can acquire 3D movies (4D) at least three orders of magnitude faster than current methods. However, it is exceptionally challenging to reconstruct 4D from highly sparse projections as acquired by XMPI with current algorithms. Here, we present 4D-ONIX, a Deep Learning (DL)-based approach that learns to reconstruct 3D movies (4D) from an extremely limited number of projections. It combines the computational physical model of X-ray interaction with matter and state-of-the-art DL methods. We demonstrate the potential of 4D-ONIX to generate high-quality 4D by generalizing over multiple experiments with only two projections per timestamp for binary droplet collisions. We envision that 4D-ONIX will become an enabling tool for 4D analysis, offering new spatiotemporal resolutions to study processes not possible before. △ Less

Submitted 2 February, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

arXiv:2401.05425 [pdf, other]

An Unobtrusive and Lightweight Ear-worn System for Continuous Epileptic Seizure Detection

Authors: Abdul Aziz, Nhat Pham, Neel Vora, Cody Reynolds, Jaime Lehnen, Pooja Venkatesh, Zhuoran Yao, Jay Harvey, Tam Vu, Kan Ding, Phuc Nguyen

Abstract: Epilepsy is one of the most common neurological diseases globally, affecting around 50 million people worldwide. Fortunately, up to 70 percent of people with epilepsy could live seizure-free if properly diagnosed and treated, and a reliable technique to monitor the onset of seizures could improve the quality of life of patients who are constantly facing the fear of random seizure attacks. The scal… ▽ More Epilepsy is one of the most common neurological diseases globally, affecting around 50 million people worldwide. Fortunately, up to 70 percent of people with epilepsy could live seizure-free if properly diagnosed and treated, and a reliable technique to monitor the onset of seizures could improve the quality of life of patients who are constantly facing the fear of random seizure attacks. The scalp-based EEG test, despite being the gold standard for diagnosing epilepsy, is costly, necessitates hospitalization, demands skilled professionals for operation, and is discomforting for users. In this paper, we propose EarSD, a novel lightweight, unobtrusive, and socially acceptable ear-worn system to detect epileptic seizure onsets by measuring the physiological signals from behind the user's ears. EarSD includes an integrated custom-built sensing, computing, and communication PCB to collect and amplify the signals of interest, remove the noises caused by motion artifacts and environmental impacts, and stream the data wirelessly to the computer or mobile phone nearby, where data are uploaded to the host computer for further processing. We conducted both in-lab and in-hospital experiments with epileptic seizure patients who were hospitalized for seizure studies. The preliminary results confirm that EarSD can detect seizures with up to 95.3 percent accuracy by just using classical machine learning algorithms. △ Less

Submitted 1 January, 2024; originally announced January 2024.

arXiv:2311.16149 [pdf, other]

doi 10.1364/OE.510800

Development towards high-resolution kHz-speed rotation-free volumetric imaging

Authors: Eleni Myrto Asimakopoulou, Valerio Bellucci, Sarlota Birnsteinova, Zisheng Yao, Yuhe Zhang, Ilia Petrov, Carsten Deiter, Andrea Mazzolari, Marco Romagnoni, Dusan Korytar, Zdenko Zaprazny, Zuzana Kuglerova, Libor Juha, Bratislav Lukic, Alexander Rack, Liubov Samoylova, Francisco Garcia Moreno, Stephen A Hall, Tillmann Neu, Xiaoyu Liang, Patrik Vagovic, Pablo Villanueva-Perez

Abstract: X-ray multi-projection imaging (XMPI) provides rotation-free 3D movies of optically opaque samples. The absence of rotation enables superior imaging speed and preserves fragile sample dynamics by avoiding the shear forces introduced by conventional rotary tomography. Here, we present our XMPI observations at the ID19 beamline (ESRF, France) of 3D dynamics in melted aluminum with 1000 frames per se… ▽ More X-ray multi-projection imaging (XMPI) provides rotation-free 3D movies of optically opaque samples. The absence of rotation enables superior imaging speed and preserves fragile sample dynamics by avoiding the shear forces introduced by conventional rotary tomography. Here, we present our XMPI observations at the ID19 beamline (ESRF, France) of 3D dynamics in melted aluminum with 1000 frames per second and 8 $μ$m resolution per projection using the full dynamical range of our detectors. Since XMPI is a method under development, we also provide different tests for the instrumentation of up to 3000 frames per second. As the flux of X-ray sources grows globally, XMPI is a promising technique for current and future X-ray imaging instruments. △ Less

Submitted 7 November, 2023; originally announced November 2023.

Comments: 12 pages, 7 figures

Journal ref: Opt. Express 32, (2024), 4413-4426

arXiv:2311.03062 [pdf]

Imaging through multimode fibres with physical prior

Authors: Chuncheng Zhang, Yingjie Shi, Zheyi Yao, Xiubao Sui, Qian Chen

Abstract: Imaging through perturbed multimode fibres based on deep learning has been widely researched. However, existing methods mainly use target-speckle pairs in different configurations. It is challenging to reconstruct targets without trained networks. In this paper, we propose a physics-assisted, unsupervised, learning-based fibre imaging scheme. The role of the physical prior is to simplify the mappi… ▽ More Imaging through perturbed multimode fibres based on deep learning has been widely researched. However, existing methods mainly use target-speckle pairs in different configurations. It is challenging to reconstruct targets without trained networks. In this paper, we propose a physics-assisted, unsupervised, learning-based fibre imaging scheme. The role of the physical prior is to simplify the map** relationship between the speckle pattern and the target image, thereby reducing the computational complexity. The unsupervised network learns target features according to the optimized direction provided by the physical prior. Therefore, the reconstruction process of the online learning only requires a few speckle patterns and unpaired targets. The proposed scheme also increases the generalization ability of the learning-based method in perturbed multimode fibres. Our scheme has the potential to extend the application of multimode fibre imaging. △ Less

Submitted 13 November, 2023; v1 submitted 6 November, 2023; originally announced November 2023.

arXiv:2311.02818 [pdf, other]

Signal Processing Meets SGD: From Momentum to Filter

Authors: Zhipeng Yao, Guiyuan Fu, Ying Li, Yu Zhang, Dazhou Li, Rui Yu

Abstract: In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization, but they typically suffer from slow convergence. Conversely, existing adaptive learning rate optimizers speed up convergence but often compromise generalization. To resolve this issue, we propose a novel optimization method designed to accelerate SGD's convergence without sacrifici… ▽ More In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization, but they typically suffer from slow convergence. Conversely, existing adaptive learning rate optimizers speed up convergence but often compromise generalization. To resolve this issue, we propose a novel optimization method designed to accelerate SGD's convergence without sacrificing generalization. Our approach reduces the variance of the historical gradient, improves first-order moment estimation of SGD by applying Wiener filter theory, and introduces a time-varying adaptive gain. Empirical results demonstrate that SGDF (SGD with Filter) effectively balances convergence and generalization compared to state-of-the-art optimizers. △ Less

Submitted 24 May, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

arXiv:2310.11230 [pdf, other]

Zipformer: A faster and better encoder for automatic speech recognition

Authors: Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui **, Long Lin, Daniel Povey

Abstract: The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame… ▽ More The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm allows us to retain some length information; 4) new activation functions SwooshR and SwooshL work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explictly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall. △ Less

Submitted 9 April, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: Published as a conference paper at ICLR 2024

arXiv:2309.08105 [pdf, other]

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

Authors: Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, Daniel Povey

Abstract: In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely-available corpus of speech with supervisions. Different from other open-sourced datasets that only provide normalized transcriptions, Libriheavy contains richer information such as punctuation, casin… ▽ More In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely-available corpus of speech with supervisions. Different from other open-sourced datasets that only provide normalized transcriptions, Libriheavy contains richer information such as punctuation, casing and text context, which brings more flexibility for system building. Specifically, we propose a general and efficient pipeline to locate, align and segment the audios in previously published Librilight to its corresponding texts. The same as Librilight, Libriheavy also has three training subsets small, medium, large of the sizes 500h, 5000h, 50000h respectively. We also extract the dev and test evaluation sets from the aligned audios and guarantee there is no overlap** speakers and books in training sets. Baseline systems are built on the popular CTC-Attention and transducer models. Additionally, we open-source our dataset creatation pipeline which can also be used to other audio alignment tasks. △ Less

Submitted 14 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: Submitted to ICASSP 2024

arXiv:2309.07414 [pdf, other]

PromptASR for contextualized ASR with controllable style

Authors: Xiaoyu Yang, Wei Kang, Zengwei Yao, Yifan Yang, Liyong Guo, Fangjun Kuang, Long Lin, Daniel Povey

Abstract: Prompts are crucial to large language models as they provide context information such as topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts in end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with controllable style of transcriptions. Specifically, a dedicated text encoder encodes the text prompts and t… ▽ More Prompts are crucial to large language models as they provide context information such as topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts in end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with controllable style of transcriptions. Specifically, a dedicated text encoder encodes the text prompts and the encodings are injected into the speech encoder by cross-attending the features from two modalities. When using the ground truth text from preceding utterances as content prompt, the proposed system achieves 21.9% and 6.8% relative word error rate reductions on a book reading dataset and an in-house dataset compared to a baseline ASR system. The system can also take word-level biasing lists as prompt to improve recognition accuracy on rare words. An additional style prompt can be given to the text encoder and guide the ASR system to output different styles of transcriptions. The code is available at icefall. △ Less

Submitted 24 January, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: Proc. ICASSP 2024

arXiv:2306.06284 [pdf, other]

doi 10.1145/3587819.3592542

Everybody Compose: Deep Beats To Music

Authors: Conghao Shen, Violet Z. Yao, Yixin Liu

Abstract: This project presents a deep learning approach to generate monophonic melodies based on input beats, allowing even amateurs to create their own music compositions. Three effective methods - LSTM with Full Attention, LSTM with Local Attention, and Transformer with Relative Position Representation - are proposed for this novel task, providing great variation, harmony, and structure in the generated… ▽ More This project presents a deep learning approach to generate monophonic melodies based on input beats, allowing even amateurs to create their own music compositions. Three effective methods - LSTM with Full Attention, LSTM with Local Attention, and Transformer with Relative Position Representation - are proposed for this novel task, providing great variation, harmony, and structure in the generated music. This project allows anyone to compose their own music by tap** their keyboards or ``recoloring'' beat sequences from existing works. △ Less

Submitted 9 June, 2023; originally announced June 2023.

Comments: Accepted MMSys '23

Journal ref: Proceedings of the 14th Conference on ACM Multimedia Systems (2023)

arXiv:2305.11920 [pdf, other]

Megahertz X-ray Multi-projection imaging

Authors: Pablo Villanueva-Perez, Valerio Bellucci, Yuhe Zhang, Sarlota Birnsteinova, Rita Graceffa, Luigi Adriano, Eleni Myrto Asimakopoulou, Ilia Petrov, Zisheng Yao, Marco Romagnoni, Andrea Mazzolari, Romain Letrun, Chan Kim, Jayanath C. P. Koliyadu, Carsten Deiter, Richard Bean, Gabriele Giovanetti, Luca Gelisio, Tobias Ritschel, Adrian Mancuso, Henry N. Chapman, Alke Meents, Tokushi Sato, Patrik Vagovic

Abstract: X-ray time-resolved tomography is one of the most popular X-ray techniques to probe dynamics in three dimensions (3D). Recent developments in time-resolved tomography opened the possibility of recording kilohertz-rate 3D movies. However, tomography requires rotating the sample with respect to the X-ray beam, which prevents characterization of faster structural dynamics. Here, we present megahertz… ▽ More X-ray time-resolved tomography is one of the most popular X-ray techniques to probe dynamics in three dimensions (3D). Recent developments in time-resolved tomography opened the possibility of recording kilohertz-rate 3D movies. However, tomography requires rotating the sample with respect to the X-ray beam, which prevents characterization of faster structural dynamics. Here, we present megahertz (MHz) X-ray multi-projection imaging (MHz-XMPI), a technique capable of recording volumetric information at MHz rates and micrometer resolution without scanning the sample. We achieved this by harnessing the unique megahertz pulse structure and intensity of the European X-ray Free-electron Laser with a combination of novel detection and reconstruction approaches that do not require sample rotations. Our approach enables generating multiple X-ray probes that simultaneously record several angular projections for each pulse in the megahertz pulse burst. We provide a proof-of-concept demonstration of the MHz-XMPI technique's capability to probe 4D (3D+time) information on stochastic phenomena and non-reproducible processes three orders of magnitude faster than state-of-the-art time-resolved X-ray tomography, by generating 3D movies of binary droplet collisions. We anticipate that MHz-XMPI will enable in-situ and operando studies that were impossible before, either due to the lack of temporal resolution or because the systems were opaque (such as for MHz imaging based on optical microscopy). △ Less

Submitted 19 May, 2023; originally announced May 2023.

arXiv:2305.11558 [pdf, other]

Blank-regularized CTC for Frame Skip** in Neural Transducer

Authors: Yifan Yang, Xiaoyu Yang, Liyong Guo, Zengwei Yao, Wei Kang, Fangjun Kuang, Long Lin, Xie Chen, Daniel Povey

Abstract: Neural Transducer and connectionist temporal classification (CTC) are popular end-to-end automatic speech recognition systems. Due to their frame-synchronous design, blank symbols are introduced to address the length mismatch between acoustic frames and output tokens, which might bring redundant computation. Previous studies managed to accelerate the training and inference of neural Transducers by… ▽ More Neural Transducer and connectionist temporal classification (CTC) are popular end-to-end automatic speech recognition systems. Due to their frame-synchronous design, blank symbols are introduced to address the length mismatch between acoustic frames and output tokens, which might bring redundant computation. Previous studies managed to accelerate the training and inference of neural Transducers by discarding frames based on the blank symbols predicted by a co-trained CTC. However, there is no guarantee that the co-trained CTC can maximize the ratio of blank symbols. This paper proposes two novel regularization methods to explicitly encourage more blanks by constraining the self-loop of non-blank symbols in the CTC. It is interesting to find that the frame reduction ratio of the neural Transducer can approach the theoretical boundary. Experiments on LibriSpeech corpus show that our proposed method accelerates the inference of neural Transducer by 4 times without sacrificing performance. Our work is open-sourced and publicly available https://github.com/k2-fsa/icefall. △ Less

Submitted 19 May, 2023; originally announced May 2023.

Comments: Accepted in INTERSPEECH 2023

arXiv:2305.11539 [pdf, other]

Delay-penalized CTC implemented based on Finite State Transducer

Authors: Zengwei Yao, Wei Kang, Fangjun Kuang, Liyong Guo, Xiaoyu Yang, Yifan Yang, Long Lin, Daniel Povey

Abstract: Connectionist Temporal Classification (CTC) suffers from the latency problem when applied to streaming models. We argue that in CTC lattice, the alignments that can access more future context are preferred during training, thereby leading to higher symbol delay. In this work we propose the delay-penalized CTC which is augmented with latency penalty regularization. We devise a flexible and efficien… ▽ More Connectionist Temporal Classification (CTC) suffers from the latency problem when applied to streaming models. We argue that in CTC lattice, the alignments that can access more future context are preferred during training, thereby leading to higher symbol delay. In this work we propose the delay-penalized CTC which is augmented with latency penalty regularization. We devise a flexible and efficient implementation based on the differentiable Finite State Transducer (FST). Specifically, by attaching a binary attribute to CTC topology, we can locate the frames that firstly emit non-blank tokens on the resulting CTC lattice, and add the frame offsets to the log-probabilities. Experimental results demonstrate the effectiveness of our proposed delay-penalized CTC, which is able to balance the delay-accuracy trade-off. Furthermore, combining the delay-penalized transducer enables the CTC model to achieve better performance and lower latency. Our work is open-sourced and publicly available https://github.com/k2-fsa/k2. △ Less

Submitted 19 May, 2023; originally announced May 2023.

Comments: Accepted in INTERSPEECH 2023

arXiv:2304.11316 [pdf, other]

Iterative fluctuation ghost imaging

Authors: Huan Zhao, Xiao-Qian Wang, Chao Gao, Zhuo Yu, Hong Wang, Yu Wang, Li-Dan Gou, Zhi-Hai Yao

Abstract: We present a new technique, iterative fluctuation ghost imaging (IFGI) which dramatically enhances the resolution of ghost imaging (GI). It is shown that, by the fluctuation characteristics of the second-order correlation function, the imaging information with the narrower point spread function (PSF) than the original information can be got. The effects arising from the PSF and the iteration times… ▽ More We present a new technique, iterative fluctuation ghost imaging (IFGI) which dramatically enhances the resolution of ghost imaging (GI). It is shown that, by the fluctuation characteristics of the second-order correlation function, the imaging information with the narrower point spread function (PSF) than the original information can be got. The effects arising from the PSF and the iteration times also be discussed. △ Less

Submitted 22 April, 2023; originally announced April 2023.

arXiv:2212.03830 [pdf, other]

A Hierarchical Deep Reinforcement Learning Framework for 6-DOF UCAV Air-to-Air Combat

Authors: Jiajun Chai, Wenzhang Chen, Yuanheng Zhu, Zong-xin Yao, Dongbin Zhao

Abstract: Unmanned combat air vehicle (UCAV) combat is a challenging scenario with continuous action space. In this paper, we propose a general hierarchical framework to resolve the within-vision-range (WVR) air-to-air combat problem under 6 dimensions of degree (6-DOF) dynamics. The core idea is to divide the whole decision process into two loops and use reinforcement learning (RL) to solve them separately… ▽ More Unmanned combat air vehicle (UCAV) combat is a challenging scenario with continuous action space. In this paper, we propose a general hierarchical framework to resolve the within-vision-range (WVR) air-to-air combat problem under 6 dimensions of degree (6-DOF) dynamics. The core idea is to divide the whole decision process into two loops and use reinforcement learning (RL) to solve them separately. The outer loop takes into account the current combat situation and decides the expected macro behavior of the aircraft according to a combat strategy. Then the inner loop tracks the macro behavior with a flight controller by calculating the actual input signals for the aircraft. We design the Markov decision process for both the outer loop strategy and inner loop controller, and train them by proximal policy optimization (PPO) algorithm. For the inner loop controller, we design an effective reward function to accurately track various macro behavior. For the outer loop strategy, we further adopt a fictitious self-play mechanism to improve the combat performance by constantly combating against the historical strategies. Experiment results show that the inner loop controller can achieve better tracking performance than fine-tuned PID controller, and the outer loop strategy can perform complex maneuvers to get higher and higher winning rate, with the generation evolves. △ Less

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2211.13443 [pdf, other]

TESSP: Text-Enhanced Self-Supervised Speech Pre-training

Authors: Zhuoyuan Yao, Shuo Ren, Sanyuan Chen, Ziyang Ma, Pengcheng Guo, Lei Xie

Abstract: Self-supervised speech pre-training empowers the model with the contextual structure inherent in the speech signal while self-supervised text pre-training empowers the model with linguistic information. Both of them are beneficial for downstream speech tasks such as ASR. However, the distinct pre-training objectives make it challenging to jointly optimize the speech and text representation in the… ▽ More Self-supervised speech pre-training empowers the model with the contextual structure inherent in the speech signal while self-supervised text pre-training empowers the model with linguistic information. Both of them are beneficial for downstream speech tasks such as ASR. However, the distinct pre-training objectives make it challenging to jointly optimize the speech and text representation in the same model. To solve this problem, we propose Text-Enhanced Self-Supervised Speech Pre-training (TESSP), aiming to incorporate the linguistic information into speech pre-training. Our model consists of three parts, i.e., a speech encoder, a text encoder and a shared encoder. The model takes unsupervised speech and text data as the input and leverages the common HuBERT and MLM losses respectively. We also propose phoneme up-sampling and representation swap** to enable joint modeling of the speech and text information. Specifically, to fix the length mismatching problem between speech and text data, we phonemize the text sequence and up-sample the phonemes with the alignment information extracted from a small set of supervised data. Moreover, to close the gap between the learned speech and text representations, we swap the text representation with the speech representation extracted by the respective private encoders according to the alignment information. Experiments on the Librispeech dataset shows the proposed TESSP model achieves more than 10% improvement compared with WavLM on the test-clean and test-other sets. We also evaluate our model on the SUPERB benchmark, showing our model has better performance on Phoneme Recognition, Acoustic Speech Recognition and Speech Translation compared with WavLM. △ Less

Submitted 24 November, 2022; originally announced November 2022.

Comments: 9 pages, 4 figures

arXiv:2211.13005 [pdf, other]

A CNN-Transformer Deep Learning Model for Real-time Sleep Stage Classification in an Energy-Constrained Wireless Device

Authors: Zongyan Yao, Xilin Liu

Abstract: This paper proposes a deep learning (DL) model for automatic sleep stage classification based on single-channel EEG data. The DL model features a convolutional neural network (CNN) and transformers. The model was designed to run on energy and memory-constrained devices for real-time operation with local processing. The Fpz-Cz EEG signals from a publicly available Sleep-EDF dataset are used to trai… ▽ More This paper proposes a deep learning (DL) model for automatic sleep stage classification based on single-channel EEG data. The DL model features a convolutional neural network (CNN) and transformers. The model was designed to run on energy and memory-constrained devices for real-time operation with local processing. The Fpz-Cz EEG signals from a publicly available Sleep-EDF dataset are used to train and test the model. Four convolutional filter layers were used to extract features and reduce the data dimension. Then, transformers were utilized to learn the time-variant features of the data. To improve performance, we also implemented a subject specific training before the inference (i.e., prediction) stage. With the subject specific training, the F1 score was 0.91, 0.37, 0.84, 0.877, and 0.73 for wake, N1-N3, and rapid eye movement (REM) stages, respectively. The performance of the model was comparable to the state-of-the-art works with significantly greater computational costs. We tested a reduced-sized version of the proposed model on a low-cost Arduino Nano 33 BLE board and it was fully functional and accurate. In the future, a fully integrated wireless EEG sensor with edge DL will be developed for sleep research in pre-clinical and clinical experiments, such as real-time sleep modulation. △ Less

Submitted 20 November, 2022; originally announced November 2022.

arXiv:2211.00508 [pdf, other]

Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation

Authors: Liyong Guo, Xiaoyu Yang, Quandong Wang, Yuxiang Kong, Zengwei Yao, Fan Cui, Fangjun Kuang, Wei Kang, Long Lin, Mingshuang Luo, Piotr Zelasko, Daniel Povey

Abstract: Knowledge distillation(KD) is a common approach to improve model performance in automatic speech recognition (ASR), where a student model is trained to imitate the output behaviour of a teacher model. However, traditional KD methods suffer from teacher label storage issue, especially when the training corpora are large. Although on-the-fly teacher label generation tackles this issue, the training… ▽ More Knowledge distillation(KD) is a common approach to improve model performance in automatic speech recognition (ASR), where a student model is trained to imitate the output behaviour of a teacher model. However, traditional KD methods suffer from teacher label storage issue, especially when the training corpora are large. Although on-the-fly teacher label generation tackles this issue, the training speed is significantly slower as the teacher model has to be evaluated every batch. In this paper, we reformulate the generation of teacher label as a codec problem. We propose a novel Multi-codebook Vector Quantization (MVQ) approach that compresses teacher embeddings to codebook indexes (CI). Based on this, a KD training framework (MVQ-KD) is proposed where a student model predicts the CI generated from the embeddings of a self-supervised pre-trained teacher model. Experiments on the LibriSpeech clean-100 hour show that MVQ-KD framework achieves comparable performance as traditional KD methods (l1, l2), while requiring 256 times less storage. When the full LibriSpeech dataset is used, MVQ-KD framework results in 13.8% and 8.2% relative word error rate reductions (WERRs) for non -streaming transducer on test-clean and test-other and 4.0% and 4.9% for streaming transducer. The implementation of this work is already released as a part of the open-source project icefall. △ Less

Submitted 31 October, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2022

arXiv:2211.00490 [pdf, other]

Delay-penalized transducer for low-latency streaming ASR

Authors: Wei Kang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Xiaoyu Yang, Long lin, Piotr Żelasko, Daniel Povey

Abstract: In streaming automatic speech recognition (ASR), it is desirable to reduce latency as much as possible while having minimum impact on recognition accuracy. Although a few existing methods are able to achieve this goal, they are difficult to implement due to their dependency on external alignments. In this paper, we propose a simple way to penalize symbol delay in transducer model, so that we can b… ▽ More In streaming automatic speech recognition (ASR), it is desirable to reduce latency as much as possible while having minimum impact on recognition accuracy. Although a few existing methods are able to achieve this goal, they are difficult to implement due to their dependency on external alignments. In this paper, we propose a simple way to penalize symbol delay in transducer model, so that we can balance the trade-off between symbol delay and accuracy for streaming models without external alignments. Specifically, our method adds a small constant times (T/2 - t), where T is the number of frames and t is the current frame, to all the non-blank log-probabilities (after normalization) that are fed into the two dimensional transducer recursion. For both streaming Conformer models and unidirectional long short-term memory (LSTM) models, experimental results show that it can significantly reduce the symbol delay with an acceptable performance degradation. Our method achieves similar delay-accuracy trade-off to the previously published FastEmit, but we believe our method is preferable because it has a better justification: it is equivalent to penalizing the average symbol delay. Our work is open-sourced and publicly available (https://github.com/k2-fsa/k2). △ Less

Submitted 31 October, 2022; originally announced November 2022.

Comments: Submitted to 2023 IEEE International Conference on Acoustics, Speech and Signal Processing

arXiv:2211.00484 [pdf, ps, other]

Fast and parallel decoding for transducer

Authors: Wei Kang, Liyong Guo, Fangjun Kuang, Long Lin, Mingshuang Luo, Zengwei Yao, Xiaoyu Yang, Piotr Żelasko, Daniel Povey

Abstract: The transducer architecture is becoming increasingly popular in the field of speech recognition, because it is naturally streaming as well as high in accuracy. One of the drawbacks of transducer is that it is difficult to decode in a fast and parallel way due to an unconstrained number of symbols that can be emitted per time step. In this work, we introduce a constrained version of transducer loss… ▽ More The transducer architecture is becoming increasingly popular in the field of speech recognition, because it is naturally streaming as well as high in accuracy. One of the drawbacks of transducer is that it is difficult to decode in a fast and parallel way due to an unconstrained number of symbols that can be emitted per time step. In this work, we introduce a constrained version of transducer loss to learn strictly monotonic alignments between the sequences; we also improve the standard greedy search and beam search algorithms by limiting the number of symbols that can be emitted per time step in transducer decoding, making it more efficient to decode in parallel with batches. Furthermore, we propose an finite state automaton-based (FSA) parallel beam search algorithm that can run with graphs on GPU efficiently. The experiment results show that we achieve slight word error rate (WER) improvement as well as significant speedup in decoding. Our work is open-sourced and publicly available\footnote{https://github.com/k2-fsa/icefall}. △ Less

Submitted 31 October, 2022; originally announced November 2022.

Comments: Submitted to 2023 IEEE International Conference on Acoustics, Speech and Signal Processing

arXiv:2209.15329 [pdf, other]

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Authors: Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, **yu Li, Furu Wei

Abstract: How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discret… ▽ More How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM. △ Less

Submitted 15 June, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: We have corrected the errors in the pre-training data for SpeechLM-P Base models, new results are updated

arXiv:2207.13334 [pdf]

Fast optical refocusing through multimode fiber bend using Cake-Cutting Hadamard encoding algorithm to improve robustness

Authors: Chuncheng Zhang, Zheyi Yao, Zhengyue Qin, Guohua Gu, Qian Chen, Zhihua Xie, Guodong Liu, Xiubao Sui

Abstract: Multimode fibres offer the advantages of high resolution and miniaturization over single mode fibers in the field of optical imaging. However, multimode fibre's imaging is susceptible to perturbations of MMF that can lead to secondary spatial distortions in the transmitted image. Perturbations include random disturbances in the fiber as well as environmental noise. Here, we exploit the fast focusi… ▽ More Multimode fibres offer the advantages of high resolution and miniaturization over single mode fibers in the field of optical imaging. However, multimode fibre's imaging is susceptible to perturbations of MMF that can lead to secondary spatial distortions in the transmitted image. Perturbations include random disturbances in the fiber as well as environmental noise. Here, we exploit the fast focusing capability of the Cake-Cutting Hadamard coding algorithm to counteract the effects of perturbations and improve the system's robustness. Simulation shows that it can approach the theoretical enhancement at 2000 measurements. Experimental results show that the algorithm can help the system to refocus in a short time when MMFs are perturbed. This research will further contribute to using multimode fibres in medicine, communication, and detection. △ Less

Submitted 27 July, 2022; originally announced July 2022.

arXiv:2206.13236 [pdf, other]

Pruned RNN-T for fast, memory-efficient ASR training

Authors: Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, Daniel Povey

Abstract: The RNN-Transducer (RNN-T) framework for speech recognition has been growing in popularity, particularly for deployed real-time ASR systems, because it combines high accuracy with naturally streaming recognition. One of the drawbacks of RNN-T is that its loss function is relatively slow to compute, and can use a lot of memory. Excessive GPU memory usage can make it impractical to use RNN-T loss in… ▽ More The RNN-Transducer (RNN-T) framework for speech recognition has been growing in popularity, particularly for deployed real-time ASR systems, because it combines high accuracy with naturally streaming recognition. One of the drawbacks of RNN-T is that its loss function is relatively slow to compute, and can use a lot of memory. Excessive GPU memory usage can make it impractical to use RNN-T loss in cases where the vocabulary size is large: for example, for Chinese character-based ASR. We introduce a method for faster and more memory-efficient RNN-T loss computation. We first obtain pruning bounds for the RNN-T recursion using a simple joiner network that is linear in the encoder and decoder embeddings; we can evaluate this without using much memory. We then use those pruning bounds to evaluate the full, non-linear joiner network. △ Less

Submitted 23 June, 2022; originally announced June 2022.

arXiv:2206.01451 [pdf, other]

Learning Distributed and Fair Policies for Network Load Balancing as Markov Potential Game

Authors: Zhiyuan Yao, Zihan Ding

Abstract: This paper investigates the network load balancing problem in data centers (DCs) where multiple load balancers (LBs) are deployed, using the multi-agent reinforcement learning (MARL) framework. The challenges of this problem consist of the heterogeneous processing architecture and dynamic environments, as well as limited and partial observability of each LB agent in distributed networking systems,… ▽ More This paper investigates the network load balancing problem in data centers (DCs) where multiple load balancers (LBs) are deployed, using the multi-agent reinforcement learning (MARL) framework. The challenges of this problem consist of the heterogeneous processing architecture and dynamic environments, as well as limited and partial observability of each LB agent in distributed networking systems, which can largely degrade the performance of in-production load balancing algorithms in real-world setups. Centralised-training-decentralised-execution (CTDE) RL scheme has been proposed to improve MARL performance, yet it incurs -- especially in distributed networking systems, which prefer distributed and plug-and-play design scheme -- additional communication and management overhead among agents. We formulate the multi-agent load balancing problem as a Markov potential game, with a carefully and properly designed workload distribution fairness as the potential function. A fully distributed MARL algorithm is proposed to approximate the Nash equilibrium of the game. Experimental evaluations involve both an event-driven simulator and real-world system, where the proposed MARL load balancing algorithm shows close-to-optimal performance in simulations, and superior results over in-production LBs in the real-world system. △ Less

Submitted 14 October, 2022; v1 submitted 3 June, 2022; originally announced June 2022.

arXiv:2203.15455 [pdf, other]

WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit

Authors: Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fu** Pan, Jianwei Niu

Abstract: Recently, we made available WeNet, a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address the streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper, we present WeNet 2.0 with four important updates. (1) W… ▽ More Recently, we made available WeNet, a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address the streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper, we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which includes the future contextual information by a right-to-left attention decoder to improve the representative ability of the shared encoder and the performance during the rescoring stage. (2) We introduce an n-gram based language model and a WFST-based decoder into WeNet 2.0, promoting the use of rich text data in production scenarios. (3) We design a unified contextual biasing framework, which leverages user-specific context (e.g., contact lists) to provide rapid adaptation ability for production and improves ASR accuracy in both with-LM and without-LM scenarios. (4) We design a unified IO to support large-scale data for effective model training. In summary, the brand-new WeNet 2.0 achieves up to 10\% relative recognition performance improvement over the original WeNet on various corpora and makes available several important production-oriented features. △ Less

Submitted 5 July, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

arXiv:2203.00682 [pdf, other]

ONIX: an X-ray deep-learning tool for 3D reconstructions from sparse views

Authors: Yuhe Zhang, Zisheng Yao, Tobias Ritschel, Pablo Villanueva-Perez

Abstract: Three-dimensional (3D) X-ray imaging techniques like tomography and confocal microscopy are crucial for academic and industrial applications. These approaches access 3D information by scanning the sample with respect to the X-ray source. However, the scanning process limits the temporal resolution when studying dynamics and is not feasible for some applications, such as surgical guidance in medica… ▽ More Three-dimensional (3D) X-ray imaging techniques like tomography and confocal microscopy are crucial for academic and industrial applications. These approaches access 3D information by scanning the sample with respect to the X-ray source. However, the scanning process limits the temporal resolution when studying dynamics and is not feasible for some applications, such as surgical guidance in medical applications. Alternatives to obtaining 3D information when scanning is not possible are X-ray stereoscopy and multi-projection imaging. However, these approaches suffer from limited volumetric information as they only acquire a small number of views or projections compared to traditional 3D scanning techniques. Here, we present ONIX (Optimized Neural Implicit X-ray imaging), a deep-learning algorithm capable of retrieving 3D objects with arbitrary large resolution from only a set of sparse projections. ONIX, although it does not have access to any volumetric information, outperforms current 3D reconstruction approaches because it includes the physics of image formation with X-rays, and it generalizes across different experiments over similar samples to overcome the limited volumetric information provided by sparse views. We demonstrate the capabilities of ONIX compared to state-of-the-art tomographic reconstruction algorithms by applying it to simulated and experimental datasets, where a maximum of eight projections are acquired. We anticipate that ONIX will become a crucial tool for the X-ray community by i) enabling the study of fast dynamics not possible today when implemented together with X-ray multi-projection imaging, and ii) enhancing the volumetric information and capabilities of X-ray stereoscopic imaging in medical applications. △ Less

Submitted 28 February, 2022; originally announced March 2022.

arXiv:2110.04791 [pdf, other]

doi 10.1109/TASLP.2022.3140556

Stepwise-Refining Speech Separation Network via Fine-Grained Encoding in High-order Latent Domain

Authors: Zengwei Yao, Wenjie Pei, Fanglin Chen, Guangming Lu, David Zhang

Abstract: The crux of single-channel speech separation is how to encode the mixture of signals into such a latent embedding space that the signals from different speakers can be precisely separated. Existing methods for speech separation either transform the speech signals into frequency domain to perform separation or seek to learn a separable embedding space by constructing a latent domain based on convol… ▽ More The crux of single-channel speech separation is how to encode the mixture of signals into such a latent embedding space that the signals from different speakers can be precisely separated. Existing methods for speech separation either transform the speech signals into frequency domain to perform separation or seek to learn a separable embedding space by constructing a latent domain based on convolutional filters. While the latter type of methods learning an embedding space achieves substantial improvement for speech separation, we argue that the embedding space defined by only one latent domain does not suffice to provide a thoroughly separable encoding space for speech separation. In this paper, we propose the Stepwise-Refining Speech Separation Network (SRSSN), which follows a coarse-to-fine separation framework. It first learns a 1-order latent domain to define an encoding space and thereby performs a rough separation in the coarse phase. Then the proposed SRSSN learns a new latent domain along each basis function of the existing latent domain to obtain a high-order latent domain in the refining phase, which enables our model to perform a refining separation to achieve a more precise speech separation. We demonstrate the effectiveness of our SRSSN by conducting extensive experiments, including speech separation in a clean (noise-free) setting on WSJ0-2/3mix datasets as well as in noisy/reverberant settings on WHAM!/WHAMR! datasets. Furthermore, we also perform experiments of speech recognition on separated speech signals by our model to evaluate the performance of speech separation indirectly. △ Less

Submitted 31 January, 2022; v1 submitted 10 October, 2021; originally announced October 2021.

Comments: Accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

arXiv:2109.00374 [pdf, other]

ImageTBAD: A 3D Computed Tomography Angiography Image Dataset for Automatic Segmentation of Type-B Aortic Dissection

Authors: Zeyang Yao, Jiawei Zhang, Hailong Qiu, Tianchen Wang, Yiyu Shi, Jian Zhuang, Yuhao Dong, Mei** Huang, Xiaowei Xu

Abstract: Type-B Aortic Dissection (TBAD) is one of the most serious cardiovascular events characterized by a growing yearly incidence,and the severity of disease prognosis. Currently, computed tomography angiography (CTA) has been widely adopted for the diagnosis and prognosis of TBAD. Accurate segmentation of true lumen (TL), false lumen (FL), and false lumen thrombus (FLT) in CTA are crucial for the prec… ▽ More Type-B Aortic Dissection (TBAD) is one of the most serious cardiovascular events characterized by a growing yearly incidence,and the severity of disease prognosis. Currently, computed tomography angiography (CTA) has been widely adopted for the diagnosis and prognosis of TBAD. Accurate segmentation of true lumen (TL), false lumen (FL), and false lumen thrombus (FLT) in CTA are crucial for the precise quantification of anatomical features. However, existing works only focus on only TL and FL without considering FLT. In this paper, we propose ImageTBAD, the first 3D computed tomography angiography (CTA) image dataset of TBAD with annotation of TL, FL, and FLT. The proposed dataset contains 100 TBAD CTA images, which is of decent size compared with existing medical imaging datasets. As FLT can appear almost anywhere along the aorta with irregular shapes, segmentation of FLT presents a wide class of segmentation problems where targets exist in a variety of positions with irregular shapes. We further propose a baseline method for automatic segmentation of TBAD. Results show that the baseline method can achieve comparable results with existing works on aorta and TL segmentation. However, the segmentation accuracy of FLT is only 52%, which leaves large room for improvement and also shows the challenge of our dataset. To facilitate further research on this challenging problem, our dataset and codes are released to the public. △ Less

Submitted 1 September, 2021; originally announced September 2021.

arXiv:2105.02822 [pdf]

doi 10.1109/TIM.2021.3060586

A High-Performance, Reconfigurable, Fully Integrated Time-Domain Reflectometry Architecture Using Digital I/Os

Authors: Zhenyu Xu, Thomas Mauldin, Zheyi Yao, Gerald Hefferman, Tao Wei

Abstract: Time-domain reflectometry (TDR) is an established means of measuring impedance inhomogeneity of a variety of waveguides, providing critical data necessary to characterize and optimize the performance of high-bandwidth computational and communication systems. However, TDR systems with both the high spatial resolution (sub-cm) and voltage resolution (sub-$\muV$) required to evaluate high-performance… ▽ More Time-domain reflectometry (TDR) is an established means of measuring impedance inhomogeneity of a variety of waveguides, providing critical data necessary to characterize and optimize the performance of high-bandwidth computational and communication systems. However, TDR systems with both the high spatial resolution (sub-cm) and voltage resolution (sub-$\muV$) required to evaluate high-performance waveguides are physically large and often cost-prohibitive, severely limiting their utility as testing platforms and greatly limiting their use in characterizing and trouble-shooting fielded hardware. Consequently, there exists a growing technical need for an electronically simple, portable, and low-cost TDR technology. The receiver of a TDR system plays a key role in recording reflection waveforms; thus, such a receiver must have high analog bandwidth, high sampling rate, and high-voltage resolution. However, these requirements are difficult to meet using low-cost analog-to-digital converters (ADCs). This article describes a new TDR architecture, namely, jitter-based APC (JAPC), which obviates the need for external components based on an alternative concept, analog-to-probability conversion (APC) that was recently proposed. These results demonstrate that a fully reconfigurable and highly integrated TDR (iTDR) can be implemented on a field-programmable gate array (FPGA) chip without using any external circuit components. Empirical evaluation of the system was conducted using an HDMI cable as the device under test (DUT), and the resulting impedance inhomogeneity pattern (IIP) of the DUT was extracted with spatial and voltage resolutions of 5 cm and 80 $\muV$, respectively. These results demonstrate the feasibility of using the prototypical JAPC-based iTDR for real-world waveguide characterization applications △ Less

Submitted 1 May, 2021; originally announced May 2021.

Comments: 8 pages, 8 figures

Journal ref: February 2021, IEEE Transactions on Instrumentation and Measurement PP(99):1-1

arXiv:2104.04702 [pdf, other]

Boundary and Context Aware Training for CIF-based Non-Autoregressive End-to-end ASR

Authors: Fan Yu, Haoneng Luo, Pengcheng Guo, Yuhao Liang, Zhuoyuan Yao, Lei Xie, Yingying Gao, Lei**g Hou, Shilei Zhang

Abstract: Continuous integrate-and-fire (CIF) based models, which use a soft and monotonic alignment mechanism, have been well applied in non-autoregressive (NAR) speech recognition with competitive performance compared with other NAR methods. However, such an alignment learning strategy may suffer from an erroneous acoustic boundary estimation, severely hindering the convergence speed as well as the system… ▽ More Continuous integrate-and-fire (CIF) based models, which use a soft and monotonic alignment mechanism, have been well applied in non-autoregressive (NAR) speech recognition with competitive performance compared with other NAR methods. However, such an alignment learning strategy may suffer from an erroneous acoustic boundary estimation, severely hindering the convergence speed as well as the system performance. In this paper, we propose a boundary and context aware training approach for CIF based NAR models. Firstly, the connectionist temporal classification (CTC) spike information is utilized to guide the learning of acoustic boundaries in the CIF. Besides, an additional contextual decoder is introduced behind the CIF decoder, aiming to capture the linguistic dependencies within a sentence. Finally, we adopt a recently proposed Conformer architecture to improve the capacity of acoustic modeling. Experiments on the open-source Mandarin AISHELL-1 corpus show that the proposed method achieves a comparable character error rates (CERs) of 4.9% with only 1/24 latency compared with a state-of-the-art autoregressive (AR) Conformer model. Futhermore, when evaluating on an internal 7500 hours Mandarin corpus, our model still outperforms other NAR methods and even reaches the AR Conformer model on a challenging real-world noisy test set. △ Less

Submitted 26 September, 2021; v1 submitted 10 April, 2021; originally announced April 2021.

Comments: 5 pages,4 figures

arXiv:2103.16827 [pdf, other]

Integer-only Zero-shot Quantization for Efficient Speech Recognition

Authors: Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Aniruddha Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, Kurt Keutzer

Abstract: End-to-end neural network models achieve improved performance on various automatic speech recognition (ASR) tasks. However, these models perform poorly on edge hardware due to large memory and computation requirements. While quantizing model weights and/or activations to low-precision can be a promising solution, previous research on quantizing ASR models is limited. In particular, the previous ap… ▽ More End-to-end neural network models achieve improved performance on various automatic speech recognition (ASR) tasks. However, these models perform poorly on edge hardware due to large memory and computation requirements. While quantizing model weights and/or activations to low-precision can be a promising solution, previous research on quantizing ASR models is limited. In particular, the previous approaches use floating-point arithmetic during inference and thus they cannot fully exploit efficient integer processing units. Moreover, they require training and/or validation data during quantization, which may not be available due to security or privacy concerns. To address these limitations, we propose an integer-only, zero-shot quantization scheme for ASR models. In particular, we generate synthetic data whose runtime statistics resemble the real data, and we use it to calibrate models during quantization. We apply our method to quantize QuartzNet, Jasper, and Conformer and show negligible WER degradation as compared to the full-precision baseline models, even without using any data. Moreover, we achieve up to 2.35x speedup on a T4 GPU and 4x compression rate, with a modest WER degradation of <1% with INT8 quantization. △ Less

Submitted 30 January, 2022; v1 submitted 31 March, 2021; originally announced March 2021.

Journal ref: ICASSP 2022

arXiv:2102.01547 [pdf, other]

WeNet: Production oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit

Authors: Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, Xin Lei

Abstract: In this paper, we propose an open source, production first, and production ready speech recognition toolkit called WeNet in which a new two-pass approach is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. The main motivation of WeNet is to close the gap between the research and the production of E2E speechrecognition models. WeNet provides an… ▽ More In this paper, we propose an open source, production first, and production ready speech recognition toolkit called WeNet in which a new two-pass approach is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. The main motivation of WeNet is to close the gap between the research and the production of E2E speechrecognition models. WeNet provides an efficient way to ship ASR applications in several real-world scenarios, which is the main difference and advantage to other open source E2E speech recognition toolkits. In our toolkit, a new two-pass method is implemented. Our method propose a dynamic chunk-based attention strategy of the the transformer layers to allow arbitrary right context length modifies in hybrid CTC/attention architecture. The inference latency could be easily controlled by only changing the chunk size. The CTC hypotheses are then rescored by the attention decoder to get the final result. Our experiments on the AISHELL-1 dataset using WeNet show that, our model achieves 5.03\% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streaming transformer. After model quantification, our model perform reasonable RTF and latency. △ Less

Submitted 29 December, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

Comments: 5 pages, 2 figures, 4 tables

arXiv:2102.00131 [pdf, other]

doi 10.1109/TSP.2022.3168363

New Closed-form Joint Localization and Synchronization using Sequential One-way TOAs

Authors: Ningyan Guo, Sihao Zhao, Xiao-** Zhang, Zheng Yao, Xiaowei Cui, Mingquan Lu

Abstract: It is an essential technique for the moving user nodes (UNs) with clock offset and clock skew to resolve the joint localization and synchronization (JLAS) problem. Existing iterative maximum likelihood methods using sequential one-way time-of-arrival (TOA) measurements from the anchor nodes' (AN) broadcast signals require a good initial guess and have a computational complexity that grows with the… ▽ More It is an essential technique for the moving user nodes (UNs) with clock offset and clock skew to resolve the joint localization and synchronization (JLAS) problem. Existing iterative maximum likelihood methods using sequential one-way time-of-arrival (TOA) measurements from the anchor nodes' (AN) broadcast signals require a good initial guess and have a computational complexity that grows with the number of iterations, given the size of the problem. In this paper, we propose a new closed-form JLAS approach, namely CFJLAS, which achieves the asymptotically optimal solution in one shot without initialization when the noise is small, and has a low computational complexity. After squaring and differencing the sequential TOA measurement equations, we devise two intermediate variables to reparameterize the non-linear problem. In this way, we convert the problem to a simpler one of solving two simultaneous quadratic equations. We then solve the equations analytically to obtain a raw closed-form JLAS estimation. Finally, we apply a weighted least squares (WLS) step to optimize the estimation. We derive the Cramer-Rao lower bound (CRLB), analyze the estimation error, and show that the estimation accuracy of the CFJLAS reaches the CRLB under the small noise condition. The complexity of the new CFJLAS is only determined by the size of the problem, unlike the conventional iterative method, whose complexity is additionally multiplied by the number of iterations. Simulations in a 2D scene verify that the estimation accuracies of the new CFJLAS method in position, velocity, clock offset, and clock skew all reach the CRLB under the small noise condition. Compared with the conventional iterative method, the proposed new CFJLAS method does not require initialization, obtains the optimal solution under the small noise condition, and has a low computational complexity. △ Less

Submitted 14 April, 2022; v1 submitted 29 January, 2021; originally announced February 2021.

arXiv:2101.09334 [pdf, other]

doi 10.1109/JAS.2021.1004272

Robotic Knee Tracking Control to Mimic the Intact Human Knee Profile Based on Actor-critic Reinforcement Learning

Authors: Ruofan Wu, Zhikai Yao, Jennie Si, He, Huang

Abstract: We address a state-of-the-art reinforcement learning (RL) control approach to automatically configure robotic prosthesis impedance parameters to enable end-to-end, continuous locomotion intended for transfemoral amputee subjects. Specifically, our actor-critic based RL provides tracking control of a robotic knee prosthesis to mimic the intact knee profile. This is a significant advance from our pr… ▽ More We address a state-of-the-art reinforcement learning (RL) control approach to automatically configure robotic prosthesis impedance parameters to enable end-to-end, continuous locomotion intended for transfemoral amputee subjects. Specifically, our actor-critic based RL provides tracking control of a robotic knee prosthesis to mimic the intact knee profile. This is a significant advance from our previous RL based automatic tuning of prosthesis control parameters which have centered on regulation control with a designer prescribed robotic knee profile as the target. In addition to presenting the complete tracking control algorithm based on direct heuristic dynamic programming (dHDP), we provide an analytical framework for the tracking controller with constrained inputs. We show that our proposed tracking control possesses several important properties, such as weight convergence of the learning networks, Bellman (sub)optimality of the cost-to-go value function and control input, and practical stability of the human-robot system under input constraint. We further provide a systematic simulation of the proposed tracking control using a realistic human-robot system simulator, the OpenSim, to emulate how the dHDP enables level ground walking, walking on different terrains and at different paces. These results show that our proposed dHDP based tracking control is not only theoretically suitable, but also practically useful. △ Less

Submitted 22 January, 2021; originally announced January 2021.

arXiv:2101.03487 [pdf, other]

Reinforcement Learning Enabled Automatic Impedance Control of a Robotic Knee Prosthesis to Mimic the Intact Knee Motion in a Co-Adapting Environment

Authors: Ruofan Wu, Minhan Li, Zhikai Yao, Jennie Si, He, Huang

Abstract: Automatically configuring a robotic prosthesis to fit its user's needs and physical conditions is a great technical challenge and a roadblock to the adoption of the technology. Previously, we have successfully developed reinforcement learning (RL) solutions toward addressing this issue. Yet, our designs were based on using a subjectively prescribed target motion profile for the robotic knee during… ▽ More Automatically configuring a robotic prosthesis to fit its user's needs and physical conditions is a great technical challenge and a roadblock to the adoption of the technology. Previously, we have successfully developed reinforcement learning (RL) solutions toward addressing this issue. Yet, our designs were based on using a subjectively prescribed target motion profile for the robotic knee during level ground walking. This is not realistic for different users and for different locomotion tasks. In this study for the first time, we investigated the feasibility of RL enabled automatic configuration of impedance parameter settings for a robotic knee to mimic the intact knee motion in a co-adapting environment. We successfully achieved such tracking control by an online policy iteration. We demonstrated our results in both OpenSim simulations and two able-bodied (AB) subjects. △ Less

Submitted 10 January, 2021; originally announced January 2021.

arXiv:2101.00068 [pdf, other]

Toward Reliable Designs of Data-Driven Reinforcement Learning Tracking Control for Euler-Lagrange Systems

Authors: Zhikai Yao, Jennie Si, Ruofan Wu, Jianyong Yao

Abstract: This paper addresses reinforcement learning based, direct signal tracking control with an objective of develo** mathematically suitable and practically useful design approaches. Specifically, we aim to provide reliable and easy to implement designs in order to reach reproducible neural network-based solutions. Our proposed new design takes advantage of two control design frameworks: a reinforcem… ▽ More This paper addresses reinforcement learning based, direct signal tracking control with an objective of develo** mathematically suitable and practically useful design approaches. Specifically, we aim to provide reliable and easy to implement designs in order to reach reproducible neural network-based solutions. Our proposed new design takes advantage of two control design frameworks: a reinforcement learning based, data-driven approach to provide the needed adaptation and (sub)optimality, and a backstep** based approach to provide closed-loop system stability framework. We develop this work based on an established direct heuristic dynamic programming (dHDP) learning paradigm to perform online learning and adaptation and a backstep** design for a class of important nonlinear dynamics described as Euler-Lagrange systems. We provide a theoretical guarantee for the stability of the overall dynamic system, weight convergence of the approximating nonlinear neural networks, and the Bellman (sub)optimality of the resulted control policy. We use simulations to demonstrate significantly improved design performance of the proposed approach over the original dHDP. △ Less

Submitted 30 March, 2021; v1 submitted 31 December, 2020; originally announced January 2021.

arXiv:2012.07237 [pdf]

Accurate Cell Segmentation in Digital Pathology Images via Attention Enforced Networks

Authors: Muyi Sun, Zeyi Yao, Guanhong Zhang

Abstract: Automatic cell segmentation is an essential step in the pipeline of computer-aided diagnosis (CAD), such as the detection and grading of breast cancer. Accurate segmentation of cells can not only assist the pathologists to make a more precise diagnosis, but also save much time and labor. However, this task suffers from stain variation, cell inhomogeneous intensities, background clutters and cells… ▽ More Automatic cell segmentation is an essential step in the pipeline of computer-aided diagnosis (CAD), such as the detection and grading of breast cancer. Accurate segmentation of cells can not only assist the pathologists to make a more precise diagnosis, but also save much time and labor. However, this task suffers from stain variation, cell inhomogeneous intensities, background clutters and cells from different tissues. To address these issues, we propose an Attention Enforced Network (AENet), which is built on spatial attention module and channel attention module, to integrate local features with global dependencies and weight effective channels adaptively. Besides, we introduce a feature fusion branch to bridge high-level and low-level features. Finally, the marker controlled watershed algorithm is applied to post-process the predicted segmentation maps for reducing the fragmented regions. In the test stage, we present an individual color normalization method to deal with the stain variation problem. We evaluate this model on the MoNuSeg dataset. The quantitative comparisons against several prior methods demonstrate the superiority of our approach. △ Less

Submitted 27 December, 2020; v1 submitted 13 December, 2020; originally announced December 2020.

Comments: 6 pages. Accepted by ICPR2020 in the first round

arXiv:2012.05481 [pdf, other]

Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition

Authors: Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, Fan Yu, Chao Yang, Liyong Guo, Yaguang Hu, Lei Xie, Xin Lei

Abstract: In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified. We propose a dynamic chunk-based attention strategy to allow arbitrary right context length. At inference time, the CTC decoder generates n-b… ▽ More In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified. We propose a dynamic chunk-based attention strategy to allow arbitrary right context length. At inference time, the CTC decoder generates n-best hypotheses in a streaming way. The inference latency could be easily controlled by only changing the chunk size. The CTC hypotheses are then rescored by the attention decoder to get the final result. This efficient rescoring process causes very little sentence-level latency. Our experiments on the open 170-hour AISHELL-1 dataset show that, the proposed method can unify the streaming and non-streaming model simply and efficiently. On the AISHELL-1 test set, our unified model achieves 5.60% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streaming transformer. The same model achieves 5.42% CER with 640ms latency in a streaming ASR system. △ Less

Submitted 29 December, 2021; v1 submitted 10 December, 2020; originally announced December 2020.

arXiv:2011.08469 [pdf, other]

Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter

Authors: Xiong Wang, Zhuoyuan Yao, Xian Shi, Lei Xie

Abstract: End-to-end models are favored in automatic speech recognition (ASR) because of its simplified system structure and superior performance. Among these models, recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high-accuracy and low-latency. RNN-T adopts a prediction network to enhance language information, but its la… ▽ More End-to-end models are favored in automatic speech recognition (ASR) because of its simplified system structure and superior performance. Among these models, recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high-accuracy and low-latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still needs paired speech-text data to train. Further strengthening the language modeling ability through extra text data, such as shallow fusion with an external language model, only brings a small performance gain. In view of the fact that Mandarin Chinese is a character-based language and each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our approach firstly uses an RNN-T to transform acoustic feature into syllable sequence, and then converts the syllable sequence into character sequence through an RNN-T-based syllable-to-character converter. Thus a rich text repository can be easily used to strengthen the language model ability. By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency. △ Less

Submitted 17 November, 2020; originally announced November 2020.

Comments: 7 pages, 3 figures, 5 tables

arXiv:2011.06724 [pdf, other]

The SLT 2021 children speech recognition challenge: Open datasets, rules and baselines

Authors: Fan Yu, Zhuoyuan Yao, Xiong Wang, Keyu An, Lei Xie, Zhijian Ou, Bo Liu, Xiulin Li, Guanqiong Miao

Abstract: Automatic speech recognition (ASR) has been significantly advanced with the use of deep learning and big data. However improving robustness, including achieving equally good performance on diverse speakers and accents, is still a challenging problem. In particular, the performance of children speech recognition (CSR) still lags behind due to 1) the speech and language characteristics of children's… ▽ More Automatic speech recognition (ASR) has been significantly advanced with the use of deep learning and big data. However improving robustness, including achieving equally good performance on diverse speakers and accents, is still a challenging problem. In particular, the performance of children speech recognition (CSR) still lags behind due to 1) the speech and language characteristics of children's voice are substantially different from those of adults and 2) sizable open dataset for children speech is still not available in the research community. To address these problems, we launch the Children Speech Recognition Challenge (CSRC), as a flagship satellite event of IEEE SLT 2021 workshop. The challenge will release about 400 hours of Mandarin speech data for registered teams and set up two challenge tracks and provide a common testbed to benchmark the CSR performance. In this paper, we introduce the datasets, rules, evaluation method as well as baselines. △ Less

Submitted 16 November, 2020; v1 submitted 12 November, 2020; originally announced November 2020.

Comments: 7 pages, 3 figures, 3 tables

arXiv:2011.02198 [pdf, other]

IEEE SLT 2021 Alpha-mini Speech Challenge: Open Datasets, Tracks, Rules and Baselines

Authors: Yihui Fu, Zhuoyuan Yao, Weipeng He, Jian Wu, Xiong Wang, Zhanheng Yang, Shimin Zhang, Lei Xie, Dongyan Huang, Hui Bu, Petr Motlicek, Jean-Marc Odobez

Abstract: The IEEE Spoken Language Technology Workshop (SLT) 2021 Alpha-mini Speech Challenge (ASC) is intended to improve research on keyword spotting (KWS) and sound source location (SSL) on humanoid robots. Many publications report significant improvements in deep learning based KWS and SSL on open source datasets in recent years. For deep learning model training, it is necessary to expand the data cover… ▽ More The IEEE Spoken Language Technology Workshop (SLT) 2021 Alpha-mini Speech Challenge (ASC) is intended to improve research on keyword spotting (KWS) and sound source location (SSL) on humanoid robots. Many publications report significant improvements in deep learning based KWS and SSL on open source datasets in recent years. For deep learning model training, it is necessary to expand the data coverage to improve the robustness of model. Thus, simulating multi-channel noisy and reverberant data from single-channel speech, noise, echo and room impulsive response (RIR) is widely adopted. However, this approach may generate mismatch between simulated data and recorded data in real application scenarios, especially echo data. In this challenge, we open source a sizable speech, keyword, echo and noise corpus for promoting data-driven methods, particularly deep-learning approaches on KWS and SSL. We also choose Alpha-mini, a humanoid robot produced by UBTECH equipped with a built-in four-microphone array on its head, to record development and evaluation sets under the actual Alpha-mini robot application scenario, including noise as well as echo and mechanical noise generated by the robot itself for model evaluation. Furthermore, we illustrate the rules, evaluation methods and baselines for researchers to quickly assess their achievements and optimize their models. △ Less

Submitted 14 November, 2020; v1 submitted 4 November, 2020; originally announced November 2020.

Comments: Accepted at IEEE SLT 2021

arXiv:2007.07511 [pdf, other]

doi 10.1109/JSEN.2020.3007701

A Novel Posture Positioning Method for Multi-Joint Manipulators

Authors: Zhi-Qiang Yao, Yi-Jue Dai, Qing-Na Li, Dang Xie, Ze-Hui Liu

Abstract: Safety and automatic control are extremely important when operating manipulators. For large engineering manipulators, the main challenge is to accurately recognize the posture of all arm segments. In classical sensing methods, the accuracy of an inclinometer is easily affected by the elastic deformation in the manipulator's arms. This results in big error accumulations when sensing the angle of jo… ▽ More Safety and automatic control are extremely important when operating manipulators. For large engineering manipulators, the main challenge is to accurately recognize the posture of all arm segments. In classical sensing methods, the accuracy of an inclinometer is easily affected by the elastic deformation in the manipulator's arms. This results in big error accumulations when sensing the angle of joints between arms one by one. In addition, the sensing method based on machine vision is not suitable for such kind of outdoor working situation yet. In this paper, we propose a novel posture positioning method for multi-joint manipulators based on wireless sensor network localization. The posture sensing problem is formulated as a Nearest-Euclidean-Distance-Matrix (NEDM) model. The resulting approach is referred to as EDM-based posture positioning approach (EPP) and it satisfies the following guiding principles: (i) The posture of each arm segment on a multi-joint manipulator must be estimated as accurately as possible; (ii) The approach must be computationally fast; (iii) The designed approach should not be susceptible to obstructions. To further improve accuracy, we explore the inherent structure of manipulators, i.e., fixed-arm length. This is naturally presented as linear constraints in the NEDM model. For concrete pumps, a typical multi-joint manipulator, the mechanical property that all arm segments always lie in a 2D plane is used for dimension-reduction operation. Simulation and experimental results show that the proposed method provides efficient solutions for posture sensing problem and can obtain preferable localization performance with faster speed than applying the existing localization methods. △ Less

Submitted 15 July, 2020; originally announced July 2020.

Comments: 7 pages, 8 figures

Journal ref: IEEE Sensors Journal, 2020

Showing 1–47 of 47 results for author: Yao, Z