Search | arXiv e-print repository

Shorter SPECT Scans Using Self-supervised Coordinate Learning to Synthesize Skipped Projection Views

Authors: Zongyu Li, Yixuan Jia, Xiaojian Xu, Jason Hu, Jeffrey A. Fessler, Yuni K. Dewaraja

Abstract: Purpose: This study addresses the challenge of extended SPECT imaging duration under low-count conditions, as encountered in Lu-177 SPECT imaging, by developing a self-supervised learning approach to synthesize skipped SPECT projection views, thus shortening scan times in clinical settings. Methods: We employed a self-supervised coordinate-based learning technique, adapting the neural radiance field (NeRF) concept in computer vision to synthesize under-sampled SPECT projection views. For each single scan, we used self-supervised coordinate learning to estimate skipped SPECT projection views. The method was tested with various down-sampling factors (DFs=2, 4, 8) on both Lu-177 phantom SPECT/CT measurements and clinical SPECT/CT datasets, from 11 patients undergoing Lu-177 DOTATATE and 6 patients undergoing Lu-177 PSMA-617 radiopharmaceutical therapy. Results: For SPECT reconstructions, our method outperformed the use of linearly interpolated projections and partial projection views in relative contrast-to-noise-ratios (RCNR) averaged across different downsampling factors: 1) DOTATATE: 83% vs. 65% vs. 67% for lesions and 86% vs. 70% vs. 67% for kidney, 2) PSMA: 76% vs. 69% vs. 68% for lesions and 75% vs. 55% vs. 66% for organs, including kidneys, lacrimal glands, parotid glands, and submandibular glands. Conclusion: The proposed method enables reduction in acquisition time (by factors of 2, 4, or 8) while maintaining quantitative accuracy in clinical SPECT protocols by allowing for the collection of fewer projections. Importantly, the self-supervised nature of this NeRF-based approach eliminates the need for extensive training data, instead learning from each patient's projection data alone. The reduction in acquisition time is particularly relevant for imaging under low-count conditions and for protocols that require multiple-bed positions such as whole-body imaging.

Submitted 26 June, 2024; originally announced June 2024.

Comments: 25 pages, 5568 words
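
The abstract above compares the proposed method against linearly interpolated projections. As a rough illustration of that baseline only (not of the NeRF-based method), the sketch below fills in skipped views by linear interpolation between the two nearest acquired views. The function name, the array layout `(views, height, width)`, and the assumption of a full circular orbit (so the last view wraps around to the first) are all illustrative choices, not details taken from the paper.

```python
import numpy as np


def interpolate_skipped_views(kept, df):
    """Linearly interpolate projection views skipped by a down-sampling factor.

    kept: array of shape (n_kept, H, W) holding every df-th view.
    df:   down-sampling factor (e.g. 2, 4, or 8).
    Assumes (hypothetically) a full circular orbit, so the view after the
    last acquired one is the first acquired one.
    """
    kept = np.asarray(kept, dtype=float)
    n_kept = kept.shape[0]
    n_full = n_kept * df
    out = np.empty((n_full,) + kept.shape[1:], dtype=float)
    for i in range(n_full):
        lo = i // df                 # acquired view at or before index i
        hi = (lo + 1) % n_kept       # next acquired view, with wrap-around
        w = (i % df) / df            # fractional position between lo and hi
        out[i] = (1.0 - w) * kept[lo] + w * kept[hi]
    return out
```

Acquired views are reproduced unchanged at indices `0, df, 2*df, ...`; only the views in between are synthesized.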

arXiv:2406.08806 [pdf, ps, other]

Adaptive Cooperative Streaming of Holographic Video Over Wireless Networks: A Proximal Policy Optimization Solution

Authors: Wanli Wen, Jiping Yan, Yulu Zhang, Zhen Huang, Liang Liang, Yunjian Jia

Abstract: Adapting holographic video streaming to fluctuating wireless channels is essential to maintain consistent and satisfactory Quality of Experience (QoE) for users, which, however, is a challenging task due to the dynamic and uncertain characteristics of wireless networks. To address this issue, we propose a holographic video cooperative streaming framework designed for a generic wireless network in which multiple access points can cooperatively transmit video with different bitrates to multiple users. Additionally, we model a novel QoE metric tailored specifically for holographic video streaming, which can effectively encapsulate the nuances of holographic video quality, quality fluctuations, and rebuffering occurrences simultaneously. Furthermore, we formulate a formidable QoE maximization problem, which is a non-convex mixed integer nonlinear programming problem. Using proximal policy optimization (PPO), a new class of reinforcement learning algorithms, we devise a joint beamforming and bitrate control scheme, which can be wisely adapted to fluctuations in the wireless channel. The numerical results demonstrate the superiority of the proposed scheme over representative baselines.

Submitted 13 June, 2024; originally announced June 2024.

Comments: This paper has been accepted for publication in IEEE Wireless Communications Letters

arXiv:2406.02133 [pdf, other]

SimulTron: On-Device Simultaneous Speech to Speech Translation

Authors: Alex Agranovich, Eliya Nachmani, Oleg Rybakov, Yifan Ding, Ye Jia, Nadav Bar, Heiga Zen, Michelle Tadmor Ramanovich

Abstract: Simultaneous speech-to-speech translation (S2ST) holds the promise of breaking down communication barriers and enabling fluid conversations across languages. However, achieving accurate, real-time translation through mobile devices remains a major challenge. We introduce SimulTron, a novel S2ST architecture designed to tackle this task. SimulTron is a lightweight direct S2ST model that uses the strengths of the Translatotron framework while incorporating key modifications for streaming operation, and an adjustable fixed delay. Our experiments show that SimulTron surpasses Translatotron 2 in offline evaluations. Furthermore, real-time evaluations reveal that SimulTron improves upon the performance achieved by Translatotron 1. Additionally, SimulTron achieves superior BLEU scores and latency compared to previous real-time S2ST method on the MuST-C dataset. Significantly, we have successfully deployed SimulTron on a Pixel 7 Pro device, showing its potential for simultaneous S2ST on-device.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.02055 [pdf]

Stochastic Carbon Footprint Tracing Methods in Power Systems

Authors: Jiashuo Hu, Xiao-Ping Zhang, Youwei Jia

Abstract: As the penetration of distributed energy resources (DER) and renewable energy sources (RES) increases, carbon footprint tracking requires more granular analysis results. Existing carbon footprint tracking methods focus on deterministic steady-state analysis where the high uncertainties of RES cannot be considered. Considering the deficiency of the existing deterministic method, this paper proposes two stochastic carbon footprint tracking methods to cope with the impact of RES uncertainty on load-side carbon footprint tracing. The first method introduces probabilistic analysis in the framework of carbon emissions flow (CEF) to provide a global reference for the spatial characteristic of the power system component carbon intensity distribution. Considering that the CEF network expands with the increasing penetration of DERs, the second method can effectively improve the computational efficiency over the first method while ensuring the computational accuracy on the large power systems. These proposed models are tested and compared in a synthetic 1004-bus test system in the case study to demonstrate the performance of the two proposed methods.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2404.12598 [pdf, ps, other]

Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty

Authors: Yanwei Jia

Abstract: This paper studies continuous-time risk-sensitive reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation with the exponential-form objective. The risk-sensitive objective arises either as the agent's risk attitude or as a distributionally robust approach against the model uncertainty. Owing to the martingale perspective in Jia and Zhou (2023), the risk-sensitive RL problem is shown to be equivalent to ensuring the martingale property of a process involving both the value function and the q-function, augmented by an additional penalty term: the quadratic variation of the value process, capturing the variability of the value-to-go along the trajectory. This characterization allows for the straightforward adaptation of existing RL algorithms developed for non-risk-sensitive scenarios to incorporate risk sensitivity by adding the realized variance of the value process. Additionally, I highlight that the conventional policy gradient representation is inadequate for risk-sensitive problems due to the nonlinear nature of quadratic variation; however, q-learning offers a solution and extends to infinite horizon settings. Finally, I prove the convergence of the proposed algorithm for Merton's investment problem and quantify the impact of the temperature parameter on the behavior of the learning procedure. I also conduct simulation experiments to demonstrate how risk-sensitive RL improves the finite-sample performance in the linear-quadratic control problem.

Submitted 18 April, 2024; originally announced April 2024.

Comments: 49 pages, 2 figures, 1 table

MSC Class: 62L20; 68T05; 93E03; 93E20; 93E35
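
The abstract above describes the penalty term as the quadratic variation of the value process, estimated in practice by its realized variance along a trajectory. A minimal sketch of that quantity, using the standard sum-of-squared-increments estimator on a discretely sampled path, might look as follows. The function name and input layout are hypothetical, and how the paper weights or combines this term with the rest of the learning objective is not specified in the abstract and is not shown here.

```python
import numpy as np


def realized_quadratic_variation(values):
    """Sum of squared increments of a discretely sampled value path.

    values: 1-D sequence of value estimates V(t_0), V(t_1), ..., V(t_n)
    observed along a single trajectory (hypothetical input layout).
    """
    values = np.asarray(values, dtype=float)
    return float(np.sum(np.diff(values) ** 2))
```

A constant path gives zero, and a path with larger swings in the value estimate gives a larger result, which matches the abstract's description of the term as capturing the variability of the value-to-go.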

arXiv:2403.15156 [pdf, other]

Infrastructure-Assisted Collaborative Perception in Automated Valet Parking: A Safety Perspective

Authors: Yukuan Jia, Jiawen Zhang, Shimeng Lu, Baokang Fan, Ruiqing Mao, Sheng Zhou, Zhisheng Niu

Abstract: Environmental perception in Automated Valet Parking (AVP) has been a challenging task due to severe occlusions in parking garages. Although Collaborative Perception (CP) can be applied to broaden the field of view of connected vehicles, the limited bandwidth of vehicular communications restricts its application. In this work, we propose a BEV feature-based CP network architecture for infrastructure-assisted AVP systems. The model takes the roadside camera and LiDAR as optional inputs and adaptively fuses them with onboard sensors in a unified BEV representation. Autoencoder and downsampling are applied for channel-wise and spatial-wise dimension reduction, while sparsification and quantization further compress the feature map with little loss in data precision. Combining these techniques, the size of a BEV feature map is effectively compressed to fit in the feasible data rate of the NR-V2X network. With the synthetic AVP dataset, we observe that CP can effectively increase perception performance, especially for pedestrians. Moreover, the advantage of infrastructure-assisted CP is demonstrated in two typical safety-critical scenarios in the AVP setting, increasing the maximum safe cruising speed by up to 3 m/s in both scenarios.

Submitted 22 March, 2024; originally announced March 2024.

Comments: 7 pages, 7 figures, 4 tables, accepted by IEEE VTC2024-Spring

arXiv:2403.10622 [pdf, other]

NeuralOCT: Airway OCT Analysis via Neural Fields

Authors: Yining Jiao, Amy Oldenburg, Yinghan Xu, Srikamal Soundararajan, Carlton Zdanski, Julia Kimbell, Marc Niethammer

Abstract: Optical coherence tomography (OCT) is a popular modality in ophthalmology and is also used intravascularly. Our interest in this work is OCT in the context of airway abnormalities in infants and children where the high resolution of OCT and the fact that it is radiation-free are important. The goal of airway OCT is to provide accurate estimates of airway geometry (in 2D and 3D) to assess airway abnormalities such as subglottic stenosis. We propose $\texttt{NeuralOCT}$, a learning-based approach to process airway OCT images. Specifically, $\texttt{NeuralOCT}$ extracts 3D geometries from OCT scans by robustly bridging two steps: point cloud extraction via 2D segmentation and 3D reconstruction from point clouds via neural fields. Our experiments show that $\texttt{NeuralOCT}$ produces accurate and robust 3D airway reconstructions with an average A-line error smaller than 70 micrometers. Our code will be available on GitHub.

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2402.02694 [pdf, other]

Description on IEEE ICME 2024 Grand Challenge: Semi-supervised Acoustic Scene Classification under Domain Shift

Authors: Jisheng Bai, Mou Wang, Haohe Liu, Han Yin, Yafei Jia, Siwei Huang, Yutong Du, Dongzhe Zhang, Dongyuan Shi, Woon-Seng Gan, Mark D. Plumbley, Susanto Rahardja, Bin Xiang, Jianfeng Chen

Abstract: Acoustic scene classification (ASC) is a crucial research problem in computational auditory scene analysis, and it aims to recognize the unique acoustic characteristics of an environment. One of the challenges of the ASC task is the domain shift between training and testing data. Since 2018, ASC challenges have focused on the generalization of ASC models across different recording devices. Although this task, in recent years, has achieved substantial progress in device generalization, the challenge of domain shift between different geographical regions, involving discrepancies such as time, space, culture, and language, remains insufficiently explored at present. In addition, considering the abundance of unlabeled acoustic scene data in the real world, it is important to study the possible ways to utilize these unlabeled data. Therefore, we introduce the task Semi-supervised Acoustic Scene Classification under Domain Shift in the ICME 2024 Grand Challenge. We encourage participants to innovate with semi-supervised learning techniques, aiming to develop more robust ASC models under domain shift.

Submitted 28 February, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

arXiv:2312.14482 [pdf, other]

On Smart Morphing Wing Aircraft Robust Adaptive Beamforming

Authors: Yizhen Jia, Hui Chen, Wen-Qin Wang, Jie Cheng

Abstract: The smart morphing wing aircraft (SMWA) is a highly adaptable platform that can be widely used for intelligent warfare due to its real-time variable structure. The flexible conformal array (FCA) is a vital detection component of SMWA; when the deformation parameters of FCA are mismatched or array elements are mutually coupled, detection performance will be degraded. To overcome this problem and ensure robust beamforming for FCA, deviations in array control parameters (ACPs), array perturbations, and the effect of mutual coupling, in addition to looking-direction errors, should be considered. In this paper, we propose a robust adaptive beamforming (RAB) algorithm by reconstructing a multi-domain interference plus noise covariance matrix (INCM) and estimating steering vector (SV) for FCA. We first reconstruct the INCM using multi-domain processing, including ACP and angular domains. Then, SV estimation is executed through an optimization procedure. Experimental results have shown that the proposed beamformer outperforms existing beamformers in various mismatch conditions and harsh environments, such as high interference-to-noise ratios and mutual coupling of antennas.

Submitted 22 December, 2023; originally announced December 2023.

Comments: Conference extended version

arXiv:2309.07141 [pdf]

Design of Recognition and Evaluation System for Table Tennis Players' Motor Skills Based on Artificial Intelligence

Authors: Zhuo-yong Shi, Ye-tao Jia, Ke-xin Zhang, Ding-han Wang, Long-meng Ji, Yong Wu

Abstract: With the rapid development of electronic science and technology, the research on wearable devices is constantly updated, but for now, it is not comprehensive for wearable devices to recognize and analyze the movement of specific sports. Based on this, this paper improves wearable devices of table tennis sport, and realizes the pattern recognition and evaluation of table tennis players' motor skills through artificial intelligence. Firstly, a device is designed to collect the movement information of table tennis players and the actual movement data is processed. Secondly, a sliding window is made to divide the collected motion data into a characteristic database of six table tennis benchmark movements. Thirdly, motion features were constructed based on feature engineering, and motor skills were identified for different models after dimensionality reduction. Finally, the hierarchical evaluation system of motor skills is established with the loss functions of different evaluation indexes. The results show that in the recognition of table tennis players' motor skills, the feature-based BP neural network proposed in this paper has higher recognition accuracy and stronger generalization ability than the traditional convolutional neural network.

Submitted 4 September, 2023; originally announced September 2023.

Comments: 34 pages, 16 figures

MSC Class: 93-01 ACM Class: G.1; H.4
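
The abstract above mentions using a sliding window to divide the collected motion data into segments. As a generic sketch of that step, the function below cuts a sequence of samples into fixed-length, possibly overlapping windows. The function name, the window size, and the step are hypothetical; the abstract does not state the values or the sensor data layout used in the paper.

```python
def sliding_windows(samples, size, step):
    """Split a sequence into fixed-length windows taken every `step` samples.

    samples: list of sensor readings in time order (hypothetical layout).
    size:    number of samples per window.
    step:    offset between the starts of consecutive windows.
    Trailing samples that do not fill a whole window are dropped.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]
```

For example, six samples with `size=4` and `step=2` give two windows that overlap by two samples.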

arXiv:2308.13839 [pdf, other]

A Conflict Resolution Dataset Derived from Argoverse-2: Analysis of the Safety and Efficiency Impacts of Autonomous Vehicles at Intersections

Authors: Guopeng Li, Yiru Jiao, Simeon C. Calvert, J. W. C. van Lint

Abstract: As the deployment of autonomous vehicles (AVs) in mixed traffic flow becomes increasingly prevalent, ensuring safe and smooth interactions between AVs and human agents is of critical importance. How road users resolve conflicts at intersections has significant impacts on driving safety and traffic efficiency. These impacts depend on both the behaviours of AVs and humans' reactions to the presence of AVs. Therefore, using real-world data to assess and compare the safety and efficiency measures of AV-involved and AV-free scenarios is crucial. To this end, this paper presents a high-quality conflict resolution dataset derived from the open Argoverse-2 motion forecasting data to analyse the safety and efficiency impacts of AVs. The contribution is twofold: First, we propose and apply a specific data processing pipeline to select scenarios of interest, rectify data errors, and enhance the raw data in Argoverse-2. As a result, 5000+ cases where an AV resolves conflict with a human road user and 16000+ conflict resolution cases without AVs are obtained. Motion data is smooth and consistent in these cases. This open dataset comprises diverse and balanced conflict resolution regimes. Second, this paper employs surrogate safety measures and a novel efficiency measure to assess the impact of AVs at intersections. The results suggest that human drivers exhibit similar safety and efficiency performances when interacting with AVs and with other human drivers. In contrast, pedestrians demonstrate more diverse reactions. Furthermore, due to the safety-prior strategy of AVs, the average efficiency of AV-involved conflict resolution decreases by 8.6% compared to AV-free cases. This informative dataset provides a valuable resource for researchers and the findings give insights into the possible impacts of AVs. The dataset is openly available via https://github.com/RomainLITUD/conflict_resolution_dataset.

Submitted 9 December, 2023; v1 submitted 26 August, 2023; originally announced August 2023.

Comments: 20 pages, 16 figures

arXiv:2307.08556 [pdf, other]

Machine-Learning-based Colorectal Tissue Classification via Acoustic Resolution Photoacoustic Microscopy

Authors: Shangqing Tong, Peng Ge, Yanan Jiao, Zhaofu Ma, Ziye Li, Longhai Liu, Feng Gao, Xiaohui Du, Fei Gao

Abstract: Colorectal cancer is a deadly disease that has become increasingly prevalent in recent years. Early detection is crucial for saving lives, but traditional diagnostic methods such as colonoscopy and biopsy have limitations. Colonoscopy cannot provide detailed information within the tissues affected by cancer, while biopsy involves tissue removal, which can be painful and invasive. In order to improve diagnostic efficiency and reduce patient suffering, we studied a machine-learning-based approach for colorectal tissue classification that uses acoustic resolution photoacoustic microscopy (ARPAM). With this tool, we were able to classify benign and malignant tissue using multiple machine learning methods. Our results were analyzed both quantitatively and qualitatively to evaluate the effectiveness of our approach.

Submitted 17 July, 2023; originally announced July 2023.

arXiv:2307.08239 [pdf, other]

Dynamic Kernel Convolution Network with Scene-dedicate Training for Sound Event Localization and Detection

Authors: Siwei Huang, Jianfeng Chen, Jisheng Bai, Yafei Jia, Dongzhe Zhang

Abstract: DNN-based methods have shown high performance in sound event localization and detection (SELD). However, in real spatial sound scenes, reverberation and the imbalanced presence of various sound events increase the complexity of the SELD task. In this paper, we propose an effective SELD system in real spatial scenes. In our approach, a dynamic kernel convolution module is introduced after the convolution blocks to adaptively model the channel-wise features with different receptive fields. Secondly, we incorporate the SELDnet and EINv2 framework into the proposed SELD system with multi-track ACCDOA. Moreover, two scene-dedicated strategies are introduced into the training stage to improve the generalization of the system in realistic spatial sound scenes. Finally, we apply data augmentation methods to extend the dataset using channel rotation and spatial data synthesis. Four joint metrics are used to evaluate the performance of the SELD system on the Sony-TAu Realistic Spatial Soundscapes 2022 dataset. Experimental results show that the proposed systems outperform the fixed-kernel convolution SELD systems. In addition, the proposed system achieved an SELD score of 0.348 in the DCASE SELD task and surpassed the SOTA methods.

Submitted 17 July, 2023; originally announced July 2023.

Comments: 11 pages, 6 figures

arXiv:2306.04987 [pdf, other]

Convolutional Recurrent Neural Network with Attention for 3D Speech Enhancement

Authors: Han Yin, Jisheng Bai, Mou Wang, Siwei Huang, Yafei Jia, Jianfeng Chen

Abstract: 3D speech enhancement can effectively improve the auditory experience and plays a crucial role in augmented reality technology. However, traditional convolutional-based speech enhancement methods have limitations in extracting dynamic voice information. In this paper, we incorporate a dual-path recurrent neural network block into the U-Net to iteratively extract dynamic audio information in both the time and frequency domains. An attention mechanism is also proposed to fuse the original signal, reference signal, and generated masks. Moreover, we introduce a loss function to simultaneously optimize the network in the time-frequency and time domains. Experimental results show that our system outperforms the state-of-the-art systems on the dataset of ICASSP L3DAS23 challenge.

Submitted 19 November, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: Published on IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC 2023)

arXiv:2305.18921 [pdf, other]

Large Car-following Data Based on Lyft level-5 Open Dataset: Following Autonomous Vehicles vs. Human-driven Vehicles

Authors: Guopeng Li, Yiru Jiao, Victor L. Knoop, Simeon C. Calvert, J. W. C. van Lint

Abstract: Car-Following (CF), as a fundamental driving behaviour, has significant influences on the safety and efficiency of traffic flow. Investigating how human drivers react differently when following autonomous vs. human-driven vehicles (HV) is thus critical for mixed traffic flow. Research in this field can be expedited with trajectory datasets collected by Autonomous Vehicles (AVs). However, trajectories collected by AVs are noisy and not readily applicable for studying CF behaviour. This paper extracts and enhances two categories of CF data, HV-following-AV (H-A) and HV-following-HV (H-H), from the open Lyft level-5 dataset. First, CF pairs are selected based on specific rules. Next, the quality of raw data is assessed by anomaly analysis. Then, the raw CF data is corrected and enhanced via motion planning, Kalman filtering, and wavelet denoising. As a result, 29k+ H-A and 42k+ H-H car-following segments are obtained, with a total driving distance of 150k+ km. A diversity assessment shows that the processed data cover complete CF regimes for calibrating CF models. This open and ready-to-use dataset provides the opportunity to investigate the CF behaviours of following AVs vs. HVs from real-world data. It can further facilitate studies on exploring the impact of AVs on mixed urban traffic.

Submitted 21 November, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

Comments: 6 pages, 9 figures

arXiv:2305.00579 [pdf, other]

RAPID: Autonomous Multi-Agent Racing using Constrained Potential Dynamic Games

Authors: Yixuan Jia, Maulik Bhatt, Negar Mehr

Abstract: In this work, we consider the problem of autonomous racing with multiple agents where agents must interact closely and influence each other to compete. We model interactions among agents through a game-theoretical framework and propose an efficient algorithm for tractably solving the resulting game in real time. More specifically, we capture interactions among multiple agents through a constrained dynamic game. We show that the resulting dynamic game is an instance of a simple-to-analyze class of games. Namely, we show that our racing game is an instance of a constrained dynamic potential game. An important and appealing property of dynamic potential games is that a generalized Nash equilibrium of the underlying game can be computed by solving a single constrained optimal control problem instead of multiple coupled constrained optimal control problems. Leveraging this property, we show that the problem of autonomous racing is greatly simplified and develop RAPID (autonomous multi-agent RAcing using constrained PotentIal Dynamic games), a racing algorithm that can be solved tractably in real-time. Through simulation studies, we demonstrate that our algorithm outperforms the state-of-the-art approach. We further show the real-time capabilities of our algorithm in hardware experiments.

Submitted 30 April, 2023; originally announced May 2023.

Comments: 8 pages

arXiv:2303.10510 [pdf, other]

A Deep Learning System for Domain-specific Speech Recognition

Authors: Yanan Jia

Abstract: As human-machine voice interfaces provide easy access to increasingly intelligent machines, many state-of-the-art automatic speech recognition (ASR) systems are proposed. However, commercial ASR systems usually have poor performance on domain-specific speech especially under low-resource settings. The author works with pre-trained DeepSpeech2 and Wav2Vec2 acoustic models to develop benefit-specific ASR systems. The domain-specific data are collected using proposed semi-supervised learning annotation with little human intervention. The best performance comes from a fine-tuned Wav2Vec2-Large-LV60 acoustic model with an external KenLM, which surpasses the Google and AWS ASR systems on benefit-specific speech. The viability of using error prone ASR transcriptions as part of spoken language understanding (SLU) is also investigated. Results of a benefit-specific natural language understanding (NLU) task show that the domain-specific fine-tuned ASR system can outperform the commercial ASR systems even when its transcriptions have higher word error rate (WER), and the results between fine-tuned ASR and human transcriptions are similar.

Submitted 27 September, 2023; v1 submitted 18 March, 2023; originally announced March 2023.

Comments: 4th International Conference on Natural Language Processing and Computational Linguistics (NLPCL 2023)

arXiv:2301.06304 [pdf]

LYSTO: The Lymphocyte Assessment Hackathon and Benchmark Dataset

Authors: Yi** Jiao, Jeroen van der Laak, Shadi Albarqouni, Zhang Li, Tao Tan, Abhir Bhalerao, Jiabo Ma, Jiamei Sun, Johnathan Pocock, Josien P. W. Pluim, Navid Alemi Koohbanani, Raja Muhammad Saad Bashir, Shan E Ahmed Raza, Sibo Liu, Simon Graham, Suzanne Wetstein, Syed Ali Khurram, Thomas Watson, Nasir Rajpoot, Mitko Veta, Francesco Ciompi

Abstract: We introduce LYSTO, the Lymphocyte Assessment Hackathon, which was held in conjunction with the MICCAI 2019 Conference in Shenzhen (China). The competition required participants to automatically assess the number of lymphocytes, in particular T-cells, in histopathological images of colon, breast, and prostate cancer stained with CD3 and CD8 immunohistochemistry. Unlike other challenges set up in medical image analysis, LYSTO participants were given only a few hours to address this problem. In this paper, we describe the goal and the multi-phase organization of the hackathon; we describe the proposed methods and the on-site results. Additionally, we present post-competition results where we show how the presented methods perform on an independent set of lung cancer slides, which was not part of the initial competition, as well as a comparison on lymphocyte assessment between presented methods and a panel of pathologists. We show that some of the participants were capable of achieving pathologist-level performance at lymphocyte assessment. After the hackathon, LYSTO was left as a lightweight plug-and-play benchmark dataset on the grand-challenge website, together with an automatic evaluation platform. LYSTO has supported a number of research studies in lymphocyte assessment in oncology. LYSTO will be a long-lasting educational challenge for deep learning and digital pathology; it is available at https://lysto.grand-challenge.org/.

Submitted 13 April, 2023; v1 submitted 16 January, 2023; originally announced January 2023.

Comments: will be submitted to IEEE-JBHI

MSC Class: 68T07 ACM Class: I.4.9; I.5.4; I.2.1

arXiv:2212.06299 [pdf]

Interpretable Diabetic Retinopathy Diagnosis based on Biomarker Activation Map

Authors: Pengxiao Zang, Tristan T. Hormel, Jie Wang, Yukun Guo, Steven T. Bailey, Christina J. Flaxel, David Huang, Thomas S. Hwang, Yali Jia

Abstract: Deep learning classifiers provide the most accurate means of automatically diagnosing diabetic retinopathy (DR) based on optical coherence tomography (OCT) and its angiography (OCTA). The power of these models is attributable in part to the inclusion of hidden layers that provide the complexity required to achieve a desired task. However, hidden layers also render algorithm outputs difficult to interpret. Here we introduce a novel biomarker activation map (BAM) framework based on generative adversarial learning that allows clinicians to verify and understand classifiers' decision-making. A data set including 456 macular scans was graded as non-referable or referable DR based on current clinical standards. A DR classifier that was used to evaluate our BAM was first trained based on this data set. The BAM generation framework was designed by combining two U-shaped generators to provide meaningful interpretability to this classifier. The main generator was trained to take referable scans as input and produce an output that would be classified by the classifier as non-referable. The BAM is then constructed as the difference image between the output and input of the main generator. To ensure that the BAM only highlights classifier-utilized biomarkers, an assistant generator was trained to do the opposite, producing scans that would be classified as referable by the classifier from non-referable scans. The generated BAMs highlighted known pathologic features including nonperfusion area and retinal fluid. A fully interpretable classifier based on these highlights could help clinicians better utilize and verify automated DR diagnosis.

Submitted 26 June, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: This paper has been accepted by IEEE TBME

ACM Class: I.2.0; I.4.0; J.3

arXiv:2211.14830 [pdf, other]

Medical Image Segmentation Review: The success of U-Net

Authors: Reza Azad, Ehsan Khodapanah Aghdam, Amelie Rauland, Yiwei Jia, Atlas Haddadi Avval, Afshin Bozorgpour, Sanaz Karimijafarbigloo, Joseph Paul Cohen, Ehsan Adeli, Dorit Merhof

Abstract: Automatic medical image segmentation is a crucial topic in the medical domain and successively a critical counterpart in the computer-aided diagnosis paradigm. U-Net is the most widespread image segmentation architecture due to its flexibility, optimized modular design, and success in all medical image modalities. Over the years, the U-Net model achieved tremendous attention from academic and industrial researchers. Several extensions of this network have been proposed to address the scale and complexity created by medical tasks. Addressing the deficiency of the naive U-Net model is the foremost step for vendors to utilize the proper U-Net variant model for their business. Having a compendium of different variants in one place makes it easier for builders to identify the relevant research. Also, for ML researchers it will help them understand the challenges of the biological tasks that challenge the model. To address this, we discuss the practical aspects of the U-Net model and suggest a taxonomy to categorize each network variant. Moreover, to measure the performance of these strategies in a clinical application, we propose fair evaluations of some unique and famous designs on well-known datasets. We provide a comprehensive implementation library with trained models for future research. In addition, for ease of future studies, we created an online list of U-Net papers with their possible official implementation. All information is gathered in https://github.com/NITR098/Awesome-U-Net repository.

Submitted 27 November, 2022; originally announced November 2022.

Comments: Submitted to the IEEE Transactions on Pattern Analysis and Machine Intelligence Journal

arXiv:2211.09658 [pdf, other]

doi 10.1109/TIV.2023.3234261

Energy-Efficient Driving in Connected Corridors via Minimum Principle Control: Vehicle-in-the-Loop Experimental Verification in Mixed Fleets

Authors: Tyler Ard, Longxiang Guo, Jihun Han, Yunyi Jia, Ardalan Vahidi, Dominik Karbowski

Abstract: Connected and automated vehicles (CAVs) can plan and actuate control that explicitly considers performance, system safety, and actuation constraints in a manner more efficient than their human-driven counterparts. In particular, eco-driving is enabled through connected exchange of information from signalized corridors that share their upcoming signal phase and timing (SPaT). This is accomplished in the proposed control approach, which follows first principles to plan a free-flow acceleration-optimal trajectory through green traffic light intervals by Pontryagin's Minimum Principle in a feedback manner. Urban conditions are then imposed from exogenous traffic comprised of a mixture of human-driven vehicles (HVs) as well as other CAVs. As such, safe disturbance compensation is achieved by implementing a model predictive controller (MPC) to anticipate and avoid collisions by issuing braking commands as necessary. The control strategy is experimentally vetted through vehicle-in-the-loop (VIL) of a prototype CAV that is embedded into a virtual traffic corridor realized through microsimulation. Up to 36% fuel savings are measured with the proposed control approach over a human-modelled driver, and it was found that connectivity in the automation approach improved fuel economy by up to 26% over automation without. Additionally, the passive energy benefits realizable for human drivers when driving behind downstream CAVs are measured, showing up to 22% fuel savings in a HV when driving behind a small penetration of connectivity-enabled automated vehicles.

Submitted 17 November, 2022; originally announced November 2022.

Comments: 13 Figures

arXiv:2211.00115 [pdf, other]

Textless Direct Speech-to-Speech Translation with Discrete Speech Representation

Authors: Xinjian Li, Ye Jia, Chung-Cheng Chiu

Abstract: Research on speech-to-speech translation (S2ST) has progressed rapidly in recent years. Many end-to-end systems have been proposed and show advantages over conventional cascade systems, which are often composed of recognition, translation and synthesis sub-systems. However, most of the end-to-end systems still rely on intermediate textual supervision during training, which makes it infeasible to work for languages without written forms. In this work, we propose a novel model, Textless Translatotron, which is based on Translatotron 2, for training an end-to-end direct S2ST model without any textual supervision. Instead of jointly training with an auxiliary task predicting target phonemes as in Translatotron 2, the proposed model uses an auxiliary task predicting discrete speech representations which are obtained from learned or random speech quantizers. When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on-par with Translatotron 2 on the multilingual CVSS-C corpus as well as the bilingual Fisher Spanish-English corpus. On the latter, it outperforms the prior state-of-the-art textless model by +18.5 BLEU.

Submitted 31 October, 2022; originally announced November 2022.

arXiv:2210.12361 [pdf]

doi 10.2147/JMDH.S417068

MS-DCANet: A Novel Segmentation Network For Multi-Modality COVID-19 Medical Images

Authors: Xiaoyu Pan, Huazheng Zhu, **glong Du, Guangtao Hu, Baoru Han, Yuanyuan Jia

Abstract: The Coronavirus Disease 2019 (COVID-19) pandemic has increased the public health burden and brought profound disaster to humans. For the particularity of the COVID-19 medical images with blurred boundaries, low contrast and different infection sites, some researchers have improved the accuracy by adding more complexity. Also, they overlook the complexity of lesions, which hinders their ability to capture the relationship between segmentation sites and the background, as well as the edge contours and global context. However, increasing the computational complexity, parameters and inference speed is unfavorable for model transfer from laboratory to clinic. A perfect segmentation network needs to balance the above three factors completely. To solve the above issues, this paper proposes a symmetric automatic segmentation framework named MS-DCANet. We introduce the Tokenized MLP block, a novel attention scheme that uses a shift-window mechanism to conditionally fuse local and global features to get more continuous boundaries and spatial positioning capabilities. It has a greater understanding of irregular lesion contours. MS-DCANet also uses several Dual Channel blocks and a Res-ASPP block to improve the ability to recognize small targets. On multi-modality COVID-19 tasks, MS-DCANet achieved state-of-the-art performance compared with other baselines. It can trade off accuracy and complexity well. To prove the strong generalization ability of our proposed model, we apply it to other tasks (ISIC 2018 and BAA) and achieve satisfactory results.

Submitted 19 July, 2023; v1 submitted 22 October, 2022; originally announced October 2022.

Comments: 21 pages, 13 figures, 9 tables

Journal ref: J Multidiscip Healthc. 2023;16:2023-2043

arXiv:2210.07749

LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge

Authors: Yan Jia, Mi Hong, **gyu Hou, Kailong Ren, Sifan Ma, ** Wang, Fangzhen Peng, Yinglin Ji, Lin Yang, Junjie Wang

Abstract: This paper describes LeVoice automatic speech recognition systems to track2 of intelligent cockpit speech recognition challenge 2022. Track2 is a speech recognition task without limits on the scope of model size. Our main points include deep learning based speech enhancement, text-to-speech based speech generation, training data augmentation via various techniques and speech recognition model fusion. We compared and fused the hybrid architecture and two kinds of end-to-end architecture. For end-to-end modeling, we used models based on connectionist temporal classification/attention-based encoder-decoder architecture and recurrent neural network transducer/attention-based encoder-decoder architecture. The performance of these models is evaluated with an additional language model to improve word error rates. As a result, our system achieved 10.2% character error rate on the challenge test set data and ranked third place among the submitted systems in the challenge.

Submitted 16 October, 2022; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: There are experimental errors

arXiv:2209.13786 [pdf, other]

A Parameter-free Nonconvex Low-rank Tensor Completion Model for Spatiotemporal Traffic Data Recovery

Authors: Yang He, Yuheng Jia, Liyang Hu, Chengchuan An, Zhenbo Lu, **gxin Xia

Abstract: Traffic data chronically suffer from missing and corruption, leading to accuracy and utility reduction in subsequent Intelligent Transportation System (ITS) applications. Noticing the inherent low-rank property of traffic data, numerous studies formulated missing traffic data recovery as a low-rank tensor completion (LRTC) problem. Due to the non-convexity and discreteness of the rank minimization in LRTC, existing methods either replaced rank with convex surrogates that are quite far away from the rank function or approximated rank with nonconvex surrogates involving many parameters. In this study, we proposed a Parameter-Free Non-Convex Tensor Completion model (TC-PFNC) for traffic data recovery, in which a log-based relaxation term was designed to approximate tensor algebraic rank. Moreover, previous studies usually assumed the observations are reliable without any outliers. Therefore, we extended the TC-PFNC to a robust version (RTC-PFNC) by modeling potential traffic data outliers, which can recover the missing value from partial and corrupted observations and remove the anomalies in observations. The numerical solutions of TC-PFNC and RTC-PFNC were elaborated based on the alternating direction multiplier method (ADMM). The extensive experimental results conducted on four real-world traffic data sets demonstrated that the proposed methods outperform other state-of-the-art methods in both missing and corrupted data recovery. The code used in this paper is available at: https://github.com/YoungHe49/T-ITSPFNC.

Submitted 27 September, 2022; originally announced September 2022.

Comments: 10 pages, 7 figures

arXiv:2209.11451 [pdf, other]

FIAT: Fine-grained Information Audit for Trustless Transborder Data Flow

Authors: Shuhao Zheng, Yanxi Lin, Yang Yu, Ye Yuan, Yongzheng Jia, Xue Liu

Abstract: Auditing the information leakage of latent sensitive features during the transborder data flow has attracted sufficient attention from global digital regulators. However, a technical approach for the audit practice is missing due to two technical challenges. Firstly, there is a lack of theory and tools for measuring the information of sensitive latent features in a dataset. Secondly, the transborder data flow involves multi-stakeholders with diverse interests, which means the audit must be trustless. Despite the tremendous efforts in protecting data privacy, an important issue that has long been neglected is that the transmitted data in data flows can leak other regulated information that is not explicitly contained in the data, leading to unaware information leakage risks. To unveil such risks trustfully before the actual data transfer, we propose FIAT, a Fine-grained Information Audit system for Trustless transborder data flow. In FIAT, we use a learning approach to quantify the amount of information leakage, while the technologies of zero-knowledge proof and smart contracts are applied to provide trustworthy and privacy-preserving auditing results. Experiments show that large information leakage can boost the predictability of uninvolved information using simple machine-learning models, revealing the importance of information auditing. Further performance benchmarking also validates the efficiency and scalability of the FIAT auditing system.

Submitted 10 February, 2023; v1 submitted 23 September, 2022; originally announced September 2022.

Comments: 10 pages, 6 figures, 1 table

arXiv:2208.13183 [pdf, other]

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Authors: Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alexey Petelin, Jonathan Shen, Vincent Wan, Yu Zhang, Yonghui Wu, Rob Clark

Abstract: Transfer tasks in text-to-speech (TTS) synthesis, where one or more aspects of the speech of one set of speakers are transferred to another set of speakers that do not feature these aspects originally, remain challenging. One of the challenges is that models that have high-quality transfer capabilities can have issues in stability, making them impractical for user-facing critical tasks. This paper demonstrates that transfer can be obtained by training a robust TTS system on data generated by a less robust TTS system designed for a high-quality transfer task; in particular, a CHiVE-BERT monolingual TTS system is trained on the output of a Tacotron model designed for accent transfer. While some quality loss is inevitable with this approach, experimental results show that the models trained on synthetic data this way can produce high quality audio displaying accent transfer, while preserving speaker characteristics such as speaking style.

Submitted 28 August, 2022; originally announced August 2022.

Comments: To be published in Interspeech 2022

arXiv:2207.07609 [pdf, other]

doi 10.1007/978-3-031-26348-4_29

DOLPHINS: Dataset for Collaborative Perception enabled Harmonious and Interconnected Self-driving

Authors: Ruiqing Mao, **gyu Guo, Yukuan Jia, Yuxuan Sun, Sheng Zhou, Zhisheng Niu

Abstract: Vehicle-to-Everything (V2X) network has enabled collaborative perception in autonomous driving, which is a promising solution to the fundamental defect of stand-alone intelligence including blind zones and long-range perception. However, the lack of datasets has severely blocked the development of collaborative perception algorithms. In this work, we release DOLPHINS: Dataset for cOllaborative Perception enabled Harmonious and INterconnected Self-driving, as a new simulated large-scale various-scenario multi-view multi-modality autonomous driving dataset, which provides a ground-breaking benchmark platform for interconnected autonomous driving. DOLPHINS outperforms current datasets in six dimensions: temporally-aligned images and point clouds from both vehicles and Road Side Units (RSUs) enabling both Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) based collaborative perception; 6 typical scenarios with dynamic weather conditions make the most various interconnected autonomous driving dataset; meticulously selected viewpoints providing full coverage of the key areas and every object; 42376 frames and 292549 objects, as well as the corresponding 3D annotations, geo-positions, and calibrations, compose the largest dataset for collaborative perception; Full-HD images and 64-line LiDARs construct high-resolution data with sufficient details; well-organized APIs and open-source codes ensure the extensibility of DOLPHINS. We also construct a benchmark of 2D detection, 3D detection, and multi-view collaborative perception tasks on DOLPHINS. The experiment results show that the raw-level fusion scheme through V2X communication can help to improve the precision as well as to reduce the necessity of expensive LiDAR equipment on vehicles when RSUs exist, which may accelerate the popularity of interconnected self-driving vehicles. DOLPHINS is now available on https://dolphins-dataset.net/.

Submitted 15 July, 2022; originally announced July 2022.

arXiv:2203.13339 [pdf, other]

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

Authors: Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobuyuki Morioka

Abstract: End-to-end speech-to-speech translation (S2ST) without relying on intermediate text representations is a rapidly emerging frontier of research. Recent works have demonstrated that the performance of such direct S2ST systems is approaching that of conventional cascade S2ST when trained on comparable datasets. However, in practice, the performance of direct S2ST is bounded by the availability of paired S2ST training data. In this work, we explore multiple approaches for leveraging much more widely available unsupervised and weakly-supervised speech and text data to improve the performance of direct S2ST based on Translatotron 2. With our most effective approaches, the average translation quality of direct S2ST on 21 language pairs on the CVSS-C corpus is improved by +13.6 BLEU (or +113% relatively), as compared to the previous state-of-the-art trained without additional data. The improvements on low-resource language are even more significant (+398% relatively on average). Our comparative studies suggest future research directions for S2ST and speech representation learning.

Submitted 27 June, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

Comments: Interspeech 2022

arXiv:2203.00508 [pdf, ps, other]

Reconfigurable Intelligent Surface-Aided Spectrum Sharing Coexisting with Multiple Primary Networks

Authors: Zhong Tian, Zhengchuan Chen, Min Wang, Yunjian Jia, Wanli Wen

Abstract: Considering the spectrum sharing system (SSS) coexisting with multiple primary networks, we have employed a well-designed reconfigurable intelligent surface (RIS) to control the radio environments of wireless channels and relieve the scarcity of the spectrum resource in this work. Specifically, the enhancement of the spectral efficiency of the secondary user in the considered SSS is decomposed into two subproblems which are a second-order cone programming (SOCP) and a fractional programming of the convex quadratic form (CQFP), respectively, to optimize alternatively the beamforming vector at the secondary access point (S-AP) and the reflecting coefficients at the RIS. The SOCP subproblem is shown as a concave problem, which can be solved optimally using standard convex optimization tools. The CQFP subproblem can be solved by a low-complexity method of gradient-based linearization with domain (GLD), providing a sub-optimal solution for fast deployment. Taking the discrete phase control at the RIS into account, a nearest point searching with penalty (NPSP) method is also developed, realizing the discretization of the phase shifts of the RIS in practice. The simulation results indicate that both GLD and NPSP can achieve an excellent performance.

Submitted 4 November, 2022; v1 submitted 1 March, 2022; originally announced March 2022.

arXiv:2201.03713 [pdf, other]

CVSS Corpus and Massively Multilingual Speech-to-Speech Translation

Authors: Ye Jia, Michelle Tadmor Ramanovich, Quan Wang, Heiga Zen

Abstract: We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems. Two versions of translation speeches are provided: 1) CVSS-C: All the translation speeches are in a single high-quality canonical voice; 2) CVSS-T: The translation speeches are in voices transferred from the corresponding source speeches. In addition, CVSS provides normalized translation text which matches the pronunciation in the translation speech. On each version of CVSS, we built baseline multilingual direct S2ST models and cascade S2ST models, verifying the effectiveness of the corpus. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state-of-the-art trained on the corpus without extra data by 5.8 BLEU. Nevertheless, the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch, and with only 0.1 or 0.7 BLEU difference on ASR transcribed translation when initialized from matching ST models.

Submitted 26 June, 2022; v1 submitted 10 January, 2022; originally announced January 2022.

Comments: LREC 2022

arXiv:2201.00167 [pdf, other]

Generating Adversarial Samples For Training Wake-up Word Detection Systems Against Confusing Words

Authors: Haoxu Wang, Yan Jia, Zeqing Zhao, Xuyang Wang, Junjie Wang, Ming Li

Abstract: Wake-up word detection models are widely used in real life, but suffer from severe performance degradation when encountering adversarial samples. In this paper we discuss the concept of confusing words in adversarial samples. Confusing words are commonly encountered, which are various kinds of words that sound similar to the predefined keywords. To enhance the wake word detection system's robustness against confusing words, we propose several methods to generate the adversarial confusing samples for simulating real confusing words scenarios in which we usually do not have any real confusing samples in the training set. The generated samples include concatenated audio, synthesized data, and partially masked keywords. Moreover, we use a domain embedding concatenated system to improve the performance. Experimental results show that the adversarial samples generated in our approach help improve the system's robustness in both the common scenario and the confusing words scenario. In addition, we release the confusing words testing database called HI-MIA-CW for future research.

Submitted 1 January, 2022; originally announced January 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2011.01460

arXiv:2112.15354 [pdf, ps, other]

Statistical Device Activity Detection for OFDM-based Massive Grant-Free Access

Authors: Yuhang Jia, Ying Cui, Wuyang Jiang

Abstract: Existing works on grant-free access, proposed to support massive machine-type communication (mMTC) for the Internet of things (IoT), mainly concentrate on narrow band systems under flat fading. However, little is known about massive grant-free access for wideband systems under frequency-selective fading. This paper investigates massive grant-free access in a wideband system under frequency-selective fading. First, we present an orthogonal frequency division multiplexing (OFDM)-based massive grant-free access scheme. Then, we propose two different but equivalent models for the received pilot signal, which are essential for designing various device activity detection and channel estimation methods for OFDM-based massive grant-free access. One directly models the received signal for actual devices, whereas the other can be interpreted as a signal model for virtual devices. Next, we investigate statistical device activity detection under frequency-selective Rayleigh fading based on the two signal models. We first model device activities as unknown deterministic quantities and propose three maximum likelihood (ML) estimation-based device activity detection methods with different detection accuracies and computation times. We also model device activities as random variables with a known joint distribution and propose three maximum a posterior probability (MAP) estimation-based device activity methods, which further enhance the accuracies of the corresponding ML estimation-based methods. Optimization techniques and matrix analysis are applied in designing and analyzing these methods. Finally, numerical results show that the proposed statistical device activity detection methods outperform existing state-of-the-art device activity detection methods under frequency-selective Rayleigh fading.

Submitted 31 December, 2021; originally announced December 2021.

Comments: 30 pages, 7 figures, to be submitted to IEEE Transactions on Wireless Communications

arXiv:2112.13369 [pdf, other]

Stop Line Aided Cooperative Positioning of Connected Vehicles

Authors: Xingqi Wang, Chaoyang Jiang, Shuxuan Sheng, Yanjie Xu, Yifei Jia

Abstract: This paper develops a stop line aided cooperative positioning framework for connected vehicles, which creatively utilizes the location of the stop-line to achieve the positioning enhancement for a vehicular ad-hoc network (VANET) in intersection scenarios via Vehicle-to-Vehicle (V2V) communication. Firstly, a self-positioning correction scheme for the first stopped vehicle is presented, which applies the stop line information as benchmarks to correct the GNSS/INS positioning results. Then, the local observations of each vehicle are fused with the position estimates of other vehicles and the inter-vehicle distance measurements by using an extended Kalman filter (EKF). In this way, the benefits of the first stopped vehicle are extended to the whole VANET. Such a cooperative inertial navigation (CIN) framework can greatly improve the positioning performance of the VANET. Finally, experiments in Beijing show the effectiveness of the proposed stop line aided cooperative positioning framework.

Submitted 26 December, 2021; originally announced December 2021.

arXiv:2111.14486 [pdf, other]

Just Least Squares: Binary Compressive Sampling with Low Generative Intrinsic Dimension

Authors: Yuling Jiao, Dingwei Li, Min Liu, Xiangliang Lu, Yuanyuan Yang

Abstract: In this paper, we consider recovering $n$ dimensional signals from $m$ binary measurements corrupted by noises and sign flips under the assumption that the target signals have low generative intrinsic dimension, i.e., the target signals can be approximately generated via an $L$-Lipschitz generator $G: \mathbb{R}^k\rightarrow\mathbb{R}^{n}, k\ll n$. Although the binary measurements model is highly nonlinear, we propose a least square decoder and prove that, up to a constant $c$, with high probability, the least square decoder achieves a sharp estimation error $\mathcal{O} (\sqrt{\frac{k\log (Ln)}{m}})$ as long as $m\geq \mathcal{O}( k\log (Ln))$. Extensive numerical simulations and comparisons with state-of-the-art methods demonstrated the least square decoder is robust to noise and sign flips, as indicated by our theory. By constructing a ReLU network with properly chosen depth and width, we verify the (approximately) deep generative prior, which is of independent interest.

Submitted 29 November, 2021; originally announced November 2021.

arXiv:2107.08661 [pdf, other]

Translatotron 2: High-quality direct speech-to-speech translation with voice preservation

Authors: Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz

Abstract: We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a linguistic decoder, an acoustic synthesizer, and a single attention module that connects them together. Experimental results on three datasets consistently show that Translatotron 2 outperforms the original Translatotron by a large margin on both translation quality (up to +15.5 BLEU) and speech generation quality, and approaches the same of cascade systems. In addition, we propose a simple method for preserving speakers' voices from the source speech to the translation speech in a different language. Unlike existing approaches, the proposed method is able to preserve each speaker's voice on speaker turns without requiring for speaker segmentation. Furthermore, compared to existing approaches, it better preserves speaker's privacy and mitigates potential misuse of voice cloning for creating spoofing audio artifacts.

Submitted 17 May, 2022; v1 submitted 19 July, 2021; originally announced July 2021.

Comments: ICML 2022

arXiv:2106.08649 [pdf, ps, other]

doi 10.21437/Interspeech.2021-1555

Improving the expressiveness of neural vocoding with non-affine Normalizing Flows

Authors: Adam Gabryś, Yunlong Jiao, Viacheslav Klimkov, Daniel Korzekwa, Roberto Barra-Chicote

Abstract: This paper proposes a general enhancement to the Normalizing Flows (NF) used in neural vocoding. As a case study, we improve expressive speech vocoding with a revamped Parallel Wavenet (PW). Specifically, we propose to extend the affine transformation of PW to the more expressive invertible non-affine function. The greater expressiveness of the improved PW leads to better-perceived signal quality and naturalness in the waveform reconstruction and text-to-speech (TTS) tasks. We evaluate the model across different speaking styles on a multi-speaker, multi-lingual dataset. In the waveform reconstruction task, the proposed model closes the naturalness and signal quality gap from the original PW to recordings by $10\%$, and from other state-of-the-art neural vocoding systems by more than $60\%$. We also demonstrate improvements in objective metrics on the evaluation test set with L2 Spectral Distance and Cross-Entropy reduced by $3\%$ and $6\unicode{x2030}$ comparing to the affine PW. Furthermore, we extend the probability density distillation procedure proposed by the original PW paper, so that it works with any non-affine invertible and differentiable function.

Submitted 16 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021, 5 pages, 3 figures

arXiv:2106.02934 [pdf, other]

Lightweight Dual-channel Target Speaker Separation for Mobile Voice Communication

Authors: Yuanyuan Bao, Yanze Xu, Na Xu, Wenjing Yang, Hongfeng Li, Shicong Li, Yongtao Jia, Fei Xiang, Jincheng He, Ming Li

Abstract: Nowadays, there is a strong need to deploy the target speaker separation (TSS) model on mobile devices with a limitation of the model size and computational complexity. To better perform TSS for mobile voice communication, we first make a dual-channel dataset based on a specific scenario, LibriPhone. Specifically, to better mimic the real-case scenario, instead of simulating from the single-channel dataset, LibriPhone is made by simultaneously replaying pairs of utterances from LibriSpeech by two professional artificial heads and recording by two built-in microphones of the mobile. Then, we propose a lightweight time-frequency domain separation model, LSTM-Former, which is based on the LSTM framework with source-to-noise ratio (SI-SNR) loss. For the experiments on Libri-Phone, we explore the dual-channel LSTMFormer model and a single-channel version by a random single channel of Libri-Phone. Experimental result shows that the dual-channel LSTM-Former outperforms the single-channel LSTMFormer with relative 25% improvement. This work provides a feasible solution for the TSS task on mobile devices, playing back and recording multiple data sources in real application scenarios for getting dual-channel real data can assist the lightweight model to achieve higher performance.

Submitted 5 June, 2021; originally announced June 2021.

arXiv:2105.08280 [pdf, other]

Peer-to-Peer Energy Cooperation in Building Community over A Lossy Network

Authors: Cheng Lyu, Youwei Jia, Zhao Xu

Abstract: Energy management of buildings is of vital importance for the urban low-carbon transition. This paper proposes a sustainable energy cooperation framework for the building community by communication-efficient peer-to-peer transaction. Firstly, the energy cooperation of buildings is formulated as a social welfare maximization problem, in which buildings may directly trade energy with neighbors. In addition, considering privacy concerns and communication losses arisen in peer-to-peer energy trading, a communication-failure-robust distributed algorithm is developed to achieve the optimal energy dispatch solutions. Finally, simulation results show that the proposed framework substantially reduces the total cost of the building community and the algorithm is robust to communication losses in the network when only part of links (even one link) are active during iterations.

Submitted 19 June, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

Comments: 5 pages, 6 figures, accepted to IEEE PESGM 2021, Best Paper Award

arXiv:2104.08114 [pdf, other]

AI-driven Bayesian inference of statistical microstructure descriptors from finite-frequency waves

Authors: Wouter Klessens, Ivan Vasconcelos, Yang Jiao

Abstract: The ability to image materials at the microscale from long-wavelength wave data is a major challenge to the geophysical, engineering and medical fields. Here, we present a framework to constrain microstructure geometry and properties from long-scale waves. To realistically quantify microstructures we use two-point statistics, from which we derive scale-dependent effective wave properties - wavespeed and attenuation - using strong-contrast expansions (SCE) for (visco)elastic wavefields. By evaluating various two-point correlation functions we observe that both effective wavespeeds and attenuation of long-scale waves predominantly depend on volume fraction and phase properties, and that especially attenuation at small scales is highly sensitive to the geometry of microstructure heterogeneity (e.g. geometric hyperuniformity) due to incoherent inference of sub-wavelength multiple scattering. Our goal is to infer microstructure properties from observed effective wave parameters. To this end, we use the supervised machine learning method of Random Forests (RF) to construct a Bayesian inference approach. We can accurately resolve two-point correlation functions sampled from various microstructural configurations, including: a bead pack, Berea sandstone and Ketton limestone samples. Importantly, we show that inversion of small scale-induced effective elastic waves yields the best results, particularly compared to single-wave-mode (e.g., acoustic only) information. Additionally, we show that the retrieval of microscale medium contrasts is more difficult - as it is highly ill-posed - and can only be achieved with specific a priori knowledge. Our results are promising for many applications, such as earthquake hazard monitoring, non-destructive testing, imaging fluid flow in porous media, quantifying tissue properties in medical ultrasound, or designing materials with tailor-made wave properties.

Submitted 16 April, 2021; originally announced April 2021.

arXiv:2104.04993 [pdf, other]

The DKU System Description for The Interspeech 2021 Auto-KWS Challenge

Authors: Yechen Wang, Yan Jia, Murong Ma, Zexin Cai, Ming Li

Abstract: This paper introduces the system submitted by the DKU-SMIIP team for the Auto-KWS 2021 Challenge. Our implementation consists of a two-stage keyword spotting system based on query-by-example spoken term detection and a speaker verification system. We employ two different detection algorithms in our proposed keyword spotting system. The first stage adopts subsequence dynamic time warping for template matching based on frame-level language-independent bottleneck feature and phoneme posterior probability. We use a sliding window template matching algorithm based on acoustic word embeddings to further verify the detection from the first stage. As a result, our KWS system achieves an average score of 0.61 on the feedback dataset, which outperforms the baseline1 system by 0.25.

Submitted 11 April, 2021; originally announced April 2021.

Comments: 5 pages, 1 figure, submitted to INTERSPEECH

arXiv:2104.04819 [pdf]

Real-time Operation Optimization of Microgrids with Battery Energy Storage System: A Tube-based Model Predictive Control Approach

Authors: Cheng Lyu, Youwei Jia, Zhao Xu

Abstract: Battery energy storage systems (ESS) are widely used in microgrids to complement high renewables. However, the real-time energy management of microgrids with battery ESS is challenging in two aspects: 1) the evolution process of battery energy level is across-time coupled; 2) uncertainties unavoidably arise in the forecasting process for renewable generation. In this paper, a tube-based model predictive control (MPC) approach is innovatively proposed in accommodating the real-time energy management of microgrids with battery ESS. Firstly, a real-time operation model of battery, including the degradation cost and time-aware SoC range, is proposed for the battery ESS. In particular, the battery feature shallower-cheaper is depicted and the terminal SoC requirement is achieved. Secondly, two cascaded MPC controllers are designed in the proposed tube-based MPC, in which reference trajectories are generated by the nominal MPC without uncertainties, and then the ancillary MPC steers the actual trajectories to the nominal ones upon the realization of uncertainties. Specifically, in this paper, the battery SoC is viewed as the state variable of the system, while the generator power output and exchange power with the utility are seen as control variables. Lastly, numerous case studies demonstrate the effectiveness of the proposed approach, including both the low and high penetration level of renewables. Additional Monte Carlo simulations of consecutive 365 days show that the competitive ratio of the proposed approach is excellently below 1.10.

Submitted 10 April, 2021; originally announced April 2021.

Comments: 8 pages, 12 figures

arXiv:2103.15060 [pdf, other]

PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS

Authors: Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu

Abstract: This paper introduces PnG BERT, a new encoder model for neural TTS. This model is augmented from the original BERT model, by taking both phoneme and grapheme representations of text as input, as well as the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner, and fine-tuned in a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline model using only phoneme input with no pre-training. Subjective side-by-side preference evaluations show that raters have no statistically significant preference between the speech synthesized using a PnG BERT and ground truth recordings from professional speakers.

Submitted 7 June, 2021; v1 submitted 28 March, 2021; originally announced March 2021.

Comments: Accepted to Interspeech 2021

arXiv:2103.14574 [pdf, other]

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, RJ Skerry-Ryan, Yonghui Wu

Abstract: This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model which does not require supervised duration signals. The duration model is based on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping, this model can learn token-frame alignments as well as token durations automatically. Experimental results show that Parallel Tacotron 2 outperforms baselines in subjective naturalness in several diverse multi speaker evaluations. Its duration control capability is also demonstrated.

Submitted 29 August, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

Comments: Submitted to INTERSPEECH 2021

arXiv:2102.07000 [pdf]

Adaptive Optimization of Autonomous Vehicle Computational Resources for Performance and Energy Improvement

Authors: Saurabh Jambotkar, Longxiang Guo, Yunyi Jia

Abstract: Autonomous vehicles usually consume a large amount of computational power for their operations, especially for the tasks of sensing and perception with artificial intelligence algorithms. Such a computation may not only cost a significant amount of energy but also cause performance issues when the onboard computational resources are limited. To address this issue, this paper proposes an adaptive optimization method to online allocate the onboard computational resources of an autonomous vehicle amongst multiple vehicular subsystems depending on the contexts of the situations that the vehicle is facing. Different autonomous driving scenarios were designed to validate the proposed approach and the results showed that it could help improve the overall performance and energy consumption of autonomous vehicles compared to existing computational arrangement.

Submitted 30 July, 2021; v1 submitted 13 February, 2021; originally announced February 2021.

Comments: 7 pages

arXiv:2102.01106 [pdf, other]

Universal Neural Vocoding with Parallel WaveNet

Authors: Yunlong Jiao, Adam Gabrys, Georgi Tinchev, Bartosz Putrycz, Daniel Korzekwa, Viacheslav Klimkov

Abstract: We present a universal neural vocoder based on Parallel WaveNet, with an additional conditioning network called Audio Encoder. Our universal vocoder offers real-time high-quality speech synthesis on a wide range of use cases. We tested it on 43 internal speakers of diverse age and gender, speaking 20 languages in 17 unique styles, of which 7 voices and 5 styles were not exposed during training. We show that the proposed universal vocoder significantly outperforms speaker-dependent vocoders overall. We also show that the proposed vocoder outperforms several existing neural vocoder architectures in terms of naturalness and universality. These findings are consistent when we further test on more than 300 open-source voices.

Submitted 15 February, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

Comments: 5 pages, 2 figures. Accepted to ICASSP 2021

arXiv:2101.01935 [pdf, other]

The 2020 Personalized Voice Trigger Challenge: Open Database, Evaluation Metrics and the Baseline Systems

Authors: Yan Jia, Xingming Wang, Xiaoyi Qin, Yinping Zhang, Xuyang Wang, Junjie Wang, Ming Li

Abstract: The 2020 Personalized Voice Trigger Challenge (PVTC2020) addresses two different research problems in a unified setup: joint wake-up word detection with speaker verification on close-talking single microphone data and far-field multi-channel microphone array data. Specially, the second task poses an additional cross-channel matching challenge on top of the far-field condition. To simulate the real-life application scenario, the enrollment utterances are recorded from close-talking cell-phone only, while the test utterances are recorded from both the close-talking cell-phone and the far-field microphone arrays. This paper introduces our challenge setup and the released database as well as the evaluation metrics. In addition, we present a joint end-to-end neural network baseline system trained with the proposed database for speaker-dependent wake-up word detection. Results show that the cost calculated from the miss rate and the false alarm rate, can reach 0.37 in the close-talking single microphone task and 0.31 in the far-field microphone array task. The official website and the open-source baseline system have been released.

Submitted 6 January, 2021; originally announced January 2021.

arXiv:2012.10239 [pdf]

doi 10.1063/5.0041901

Computational interference microscopy enabled by deep learning

Authors: Yuheng Jiao, Yuchen R. He, Mikhail E. Kandel, Xiaojun Liu, Wenlong Lu, Gabriel Popescu

Abstract: Quantitative phase imaging (QPI) has been widely applied in characterizing cells and tissues. Spatial light interference microscopy (SLIM) is a highly sensitive QPI method, due to its partially coherent illumination and common path interferometry geometry. However, its acquisition rate is limited because of the four-frame phase-shifting scheme. On the other hand, off-axis methods like diffraction phase microscopy (DPM) allow for single-shot QPI. However, the laser-based DPM system is plagued by spatial noise due to speckles and multiple reflections. In a parallel development, deep learning was proven valuable in the field of bioimaging, especially due to its ability to translate one form of contrast into another. Here, we propose using deep learning to produce synthetic, SLIM-quality, high-sensitivity phase maps from DPM, single-shot images as input. We used an inverted microscope with its two ports connected to the DPM and SLIM modules, such that we have access to the two types of images on the same field of view. We constructed a deep learning model based on U-net and trained on over 1,000 pairs of DPM and SLIM images. The model learned to remove the speckles in laser DPM and overcame the background phase noise in both the test set and new data. Furthermore, we implemented the neural network inference into the live acquisition software, which now allows a DPM user to observe in real-time an extremely low-noise phase image. We demonstrated this principle of computational interference microscopy (CIM) imaging using blood smears, as they contain both erythrocytes and leukocytes, in static and dynamic conditions.

Submitted 17 December, 2020; originally announced December 2020.

arXiv:2011.01460 [pdf, other]

Training Wake Word Detection with Synthesized Speech Data on Confusion Words

Authors: Yan Jia, Zexin Cai, Murong Ma, Zeqing Zhao, Xuyang Wang, Junjie Wang, Ming Li

Abstract: Confusing-words are commonly encountered in real-life keyword spotting applications, which causes severe degradation of performance due to complex spoken terms and various kinds of words that sound similar to the predefined keywords. To enhance the wake word detection system's robustness on such scenarios, we investigate two data augmentation setups for training end-to-end KWS systems. One is involving the synthesized data from a multi-speaker speech synthesis system, and the other augmentation is performed by adding random noise to the acoustic feature. Experimental results show that augmentations help improve the system's robustness. Moreover, by augmenting the training set with the synthetic data generated by the multi-speaker text-to-speech system, we achieve a significant improvement regarding confusing words scenario.

Submitted 2 November, 2020; originally announced November 2020.

Comments: Submitted to ICASSP 2021

arXiv:2010.11439 [pdf, other]

Parallel Tacotron: Non-Autoregressive and Controllable TTS

Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu

Abstract: Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to its efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve the naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly decreased inference time.

Submitted 22 October, 2020; originally announced October 2020.

Showing 1–50 of 91 results for author: Jiao, Y