Search | arXiv e-print repository

Gull: A Generative Multifunctional Audio Codec

Authors: Yi Luo, Jianwei Yu, Hangting Chen, Rongzhi Gu, Chao Weng

Abstract: We introduce Gull, a generative multifunctional audio codec. Gull is a general purpose neural audio compression and decompression model which can be applied to a wide range of tasks and applications such as real-time communication, audio super-resolution, and codec language models. The key components of Gull include (1) universal-sample-rate modeling via subband modeling schemes motivated by recen… ▽ More We introduce Gull, a generative multifunctional audio codec. Gull is a general purpose neural audio compression and decompression model which can be applied to a wide range of tasks and applications such as real-time communication, audio super-resolution, and codec language models. The key components of Gull include (1) universal-sample-rate modeling via subband modeling schemes motivated by recent progress in audio source separation, (2) gain-shape representations motivated by traditional audio codecs, (3) improved residual vector quantization modules, (4) elastic decoder network that enables user-defined model size and complexity during inference time, (5) built-in ability for audio super-resolution without the increase of bitrate. We compare Gull with existing traditional and neural audio codecs and show that Gull is able to achieve on par or better performance across various sample rates, bitrates and model complexities in both subjective and objective evaluation metrics. △ Less

Submitted 7 June, 2024; v1 submitted 7 April, 2024; originally announced April 2024.

Comments: Demo page: https://yluo42.github.io/Gull/

arXiv:2404.01784 [pdf, other]

Learning-Based Joint Beamforming and Antenna Movement Design for Movable Antenna Systems

Authors: Caihao Weng, Yuanbin Chen, Lipeng Zhu, Ying Wang

Abstract: In this paper, we investigate a multi-receiver communication system enabled by movable antennas (MAs). Specifically, the transmit beamforming and the double-side antenna movement at the transceiver are jointly designed to maximize the sum-rate of all receivers under imperfect channel state information (CSI). Since the formulated problem is non-convex with highly coupled variables, conventional opt… ▽ More In this paper, we investigate a multi-receiver communication system enabled by movable antennas (MAs). Specifically, the transmit beamforming and the double-side antenna movement at the transceiver are jointly designed to maximize the sum-rate of all receivers under imperfect channel state information (CSI). Since the formulated problem is non-convex with highly coupled variables, conventional optimization methods cannot solve it efficiently. To address these challenges, an effective learning-based algorithm is proposed, namely heterogeneous multi-agent deep deterministic policy gradient (MADDPG), which incorporates two agents to learn policies for beamforming and movement of MAs, respectively. Based on the offline learning under numerous imperfect CSI, the proposed heterogeneous MADDPG can output the solutions for transmit beamforming and antenna movement in real time. Simulation results validate the effectiveness of the proposed algorithm, and the MA can significantly improve the sum-rate performance of multiple receivers compared to other benchmark schemes. △ Less

Submitted 2 April, 2024; originally announced April 2024.

Comments: 13 pages, 5 figures

arXiv:2401.09047 [pdf, other]

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Authors: Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan

Abstract: Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using… ▽ More Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition. △ Less

Submitted 17 January, 2024; originally announced January 2024.

Comments: Homepage: https://ailab-cvc.github.io/videocrafter; Github: https://github.com/AILab-CVC/VideoCrafter

arXiv:2401.06791 [pdf, other]

A Span-based Model for Extracting Overlap** PICO Entities from RCT Publications

Authors: Gongbo Zhang, Yiliang Zhou, Yan Hu, Hua Xu, Chunhua Weng, Yifan Peng

Abstract: Objectives Extraction of PICO (Populations, Interventions, Comparison, and Outcomes) entities is fundamental to evidence retrieval. We present a novel method PICOX to extract overlap** PICO entities. Materials and Methods PICOX first identifies entities by assessing whether a word marks the beginning or conclusion of an entity. Then it uses a multi-label classifier to assign one or more PICO l… ▽ More Objectives Extraction of PICO (Populations, Interventions, Comparison, and Outcomes) entities is fundamental to evidence retrieval. We present a novel method PICOX to extract overlap** PICO entities. Materials and Methods PICOX first identifies entities by assessing whether a word marks the beginning or conclusion of an entity. Then it uses a multi-label classifier to assign one or more PICO labels to a span candidate. PICOX was evaluated using one of the best-performing baselines, EBM-NLP, and three more datasets, i.e., PICO-Corpus, and RCT publications on Alzheimer's Disease or COVID-19, using entity-level precision, recall, and F1 scores. Results PICOX achieved superior precision, recall, and F1 scores across the board, with the micro F1 score improving from 45.05 to 50.87 (p << 0.01). On the PICO-Corpus, PICOX obtained higher recall and F1 scores than the baseline and improved the micro recall score from 56.66 to 67.33. On the COVID-19 dataset, PICOX also outperformed the baseline and improved the micro F1 score from 77.10 to 80.32. On the AD dataset, PICOX demonstrated comparable F1 scores with higher precision when compared to the baseline. Conclusion PICOX excels in identifying overlap** entities and consistently surpasses a leading baseline across multiple datasets. Ablation studies reveal that its data augmentation strategy effectively minimizes false positives and improves precision. △ Less

Submitted 7 January, 2024; originally announced January 2024.

arXiv:2312.15463 [pdf, other]

Consistent and Relevant: Rethink the Query Embedding in General Sound Separation

Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Jianwei Yu, Chao Weng, Zhiyong Wu, Helen Meng

Abstract: The query-based audio separation usually employs specific queries to extract target sources from a mixture of audio signals. Currently, most query-based separation models need additional networks to obtain query embedding. In this way, separation model is optimized to be adapted to the distribution of query embedding. However, query embedding may exhibit mismatches with separation models due to in… ▽ More The query-based audio separation usually employs specific queries to extract target sources from a mixture of audio signals. Currently, most query-based separation models need additional networks to obtain query embedding. In this way, separation model is optimized to be adapted to the distribution of query embedding. However, query embedding may exhibit mismatches with separation models due to inconsistent structures and independent information. In this paper, we present CaRE-SEP, a consistent and relevant embedding network for general sound separation to encourage a comprehensive reconsideration of query usage in audio separation. CaRE-SEP alleviates the potential mismatch between queries and separation in two aspects, including sharing network structure and sharing feature information. First, a Swin-Unet model with a shared encoder is conducted to unify query encoding and sound separation into one model, eliminating the network architecture difference and generating consistent distribution of query and separation features. Second, by initializing CaRE-SEP with a pretrained classification network and allowing gradient backpropagation, the query embedding is optimized to be relevant to the separation feature, further alleviating the feature mismatch problem. Experimental results indicate the proposed CaRE-SEP model substantially improves the performance of separation tasks. Moreover, visualizations validate the potential mismatch and how CaRE-SEP solves it. △ Less

Submitted 24 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP 2024

arXiv:2312.15320 [pdf]

GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts

Authors: Da Wu, **gye Yang, Cong Liu, Tzung-Chien Hsieh, Elaine Marchi, Justin Blair, Peter Krawitz, Chunhua Weng, Wendy Chung, Gholson J. Lyon, Ian D. Krantz, Jennifer M. Kalish, Kai Wang

Abstract: Individuals with suspected rare genetic disorders often undergo multiple clinical evaluations, imaging studies, laboratory tests and genetic tests, to find a possible answer over a prolonged period of time. Addressing this "diagnostic odyssey" thus has substantial clinical, psychosocial, and economic benefits. Many rare genetic diseases have distinctive facial features, which can be used by artifi… ▽ More Individuals with suspected rare genetic disorders often undergo multiple clinical evaluations, imaging studies, laboratory tests and genetic tests, to find a possible answer over a prolonged period of time. Addressing this "diagnostic odyssey" thus has substantial clinical, psychosocial, and economic benefits. Many rare genetic diseases have distinctive facial features, which can be used by artificial intelligence algorithms to facilitate clinical diagnosis, in prioritizing candidate diseases to be further examined by lab tests or genetic assays, or in hel** the phenotype-driven reinterpretation of genome/exome sequencing data. Existing methods using frontal facial photos were built on conventional Convolutional Neural Networks (CNNs), rely exclusively on facial images, and cannot capture non-facial phenotypic traits and demographic information essential for guiding accurate diagnoses. Here we introduce GestaltMML, a multimodal machine learning (MML) approach solely based on the Transformer architecture. It integrates facial images, demographic information (age, sex, ethnicity), and clinical notes (optionally, a list of Human Phenotype Ontology terms) to improve prediction accuracy. Furthermore, we also evaluated GestaltMML on a diverse range of datasets, including 528 diseases from the GestaltMatcher Database, several in-house datasets of Beckwith-Wiedemann syndrome (BWS, over-growth syndrome with distinct facial features), Sotos syndrome (overgrowth syndrome with overlap** features with BWS), NAA10-related neurodevelopmental syndrome, Cornelia de Lange syndrome (multiple malformation syndrome), and KBG syndrome (multiple malformation syndrome). Our results suggest that GestaltMML effectively incorporates multiple modalities of data, greatly narrowing candidate genetic diagnoses of rare diseases and may facilitate the reinterpretation of genome/exome sequencing data. △ Less

Submitted 21 April, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

Comments: Significant revisions

arXiv:2311.11211 [pdf]

Leveraging Generative AI for Clinical Evidence Summarization Needs to Ensure Trustworthiness

Authors: Gongbo Zhang, Qiao **, Denis Jered McInerney, Yong Chen, Fei Wang, Curtis L. Cole, Qian Yang, Yanshan Wang, Bradley A. Malin, Mor Peleg, Byron C. Wallace, Zhiyong Lu, Chunhua Weng, Yifan Peng

Abstract: Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, ho… ▽ More Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, develo** accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence. △ Less

Submitted 31 March, 2024; v1 submitted 18 November, 2023; originally announced November 2023.

arXiv:2310.20323 [pdf, other]

SemanticBoost: Elevating Motion Generation with Augmented Textual Cues

Authors: Xin He, Shaoli Huang, Xiaohang Zhan, Chao Weng, Ying Shan

Abstract: Current techniques face difficulties in generating motions from intricate semantic descriptions, primarily due to insufficient semantic annotations in datasets and weak contextual understanding. To address these issues, we present SemanticBoost, a novel framework that tackles both challenges simultaneously. Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser… ▽ More Current techniques face difficulties in generating motions from intricate semantic descriptions, primarily due to insufficient semantic annotations in datasets and weak contextual understanding. To address these issues, we present SemanticBoost, a novel framework that tackles both challenges simultaneously. Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD). The Semantic Enhancement module extracts supplementary semantics from motion data, enriching the dataset's textual description and ensuring precise alignment between text and motion data without depending on large language models. On the other hand, the CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences by effectively capturing context information and aligning the generated motion with the given textual descriptions. Distinct from existing methods, our approach can synthesize accurate orientational movements, combined motions based on specific body part descriptions, and motions generated from complex, extended sentences. Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques, achieving cutting-edge performance on the Humanml3D dataset while maintaining realistic and smooth motion generation quality. △ Less

Submitted 28 November, 2023; v1 submitted 31 October, 2023; originally announced October 2023.

arXiv:2310.19512 [pdf, other]

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Authors: Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, **bo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan

Abstract: Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video… ▽ More Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to the technological advancements within the community. △ Less

Submitted 30 October, 2023; originally announced October 2023.

Comments: Tech Report; Github: https://github.com/AILab-CVC/VideoCrafter Homepage: https://ailab-cvc.github.io/videocrafter/

arXiv:2310.14864 [pdf, other]

Diverse Priors for Deep Reinforcement Learning

Authors: Chenfan Weng, Zhongguo Li

Abstract: In Reinforcement Learning (RL), agents aim at maximizing cumulative rewards in a given environment. During the learning process, RL agents face the dilemma of exploitation and exploration: leveraging existing knowledge to acquire rewards or seeking potentially higher ones. Using uncertainty as a guiding principle provides an active and effective approach to solving this dilemma and ensemble-based… ▽ More In Reinforcement Learning (RL), agents aim at maximizing cumulative rewards in a given environment. During the learning process, RL agents face the dilemma of exploitation and exploration: leveraging existing knowledge to acquire rewards or seeking potentially higher ones. Using uncertainty as a guiding principle provides an active and effective approach to solving this dilemma and ensemble-based methods are one of the prominent avenues for quantifying uncertainty. Nevertheless, conventional ensemble-based uncertainty estimation lacks an explicit prior, deviating from Bayesian principles. Besides, this method requires diversity among members to generate less biased uncertainty estimation results. To address the above problems, previous research has incorporated random functions as priors. Building upon these foundational efforts, our work introduces an innovative approach with delicately designed prior NNs, which can incorporate maximal diversity in the initial value functions of RL. Our method has demonstrated superior performance compared with the random prior approaches in solving classic control problems and general exploration tasks, significantly improving sample efficiency. △ Less

Submitted 23 October, 2023; originally announced October 2023.

Comments: 8 pages, 4 figures

arXiv:2309.12792 [pdf, other]

DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis

Authors: Yu Gu, Yianrao Bian, Guangzhi Lei, Chao Weng, Dan Su

Abstract: This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, an auto-regressive model structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model is adopted. Meanwhile the proposed Du… ▽ More This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, an auto-regressive model structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model is adopted. Meanwhile the proposed DurIAN-E utilizes multiple stacked SwishRNN-based Transformer blocks as linguistic encoders. Style-Adaptive Instance Normalization (SAIN) layers are exploited into frame-level encoders to improve the modeling ability of expressiveness. A denoiser incorporating both denoising diffusion probabilistic model (DDPM) for mel-spectrograms and SAIN modules is conducted to further improve the synthetic speech quality and expressiveness. Experimental results prove that the proposed expressive TTS model in this paper can achieve better performance than the state-of-the-art approaches in both subjective mean opinion score (MOS) and preference tests. △ Less

Submitted 22 September, 2023; originally announced September 2023.

arXiv:2309.07803 [pdf, other]

SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias

Authors: Sipan Li, Songxiang Liu, Luwen Zhang, Xiang Li, Yanyao Bian, Chao Weng, Zhiyong Wu, Helen Meng

Abstract: Generative adversarial network (GAN)-based neural vocoders have been widely used in audio synthesis tasks due to their high generation quality, efficient inference, and small computation footprint. However, it is still challenging to train a universal vocoder which can generalize well to out-of-domain (OOD) scenarios, such as unseen speaking styles, non-speech vocalization, singing, and musical pi… ▽ More Generative adversarial network (GAN)-based neural vocoders have been widely used in audio synthesis tasks due to their high generation quality, efficient inference, and small computation footprint. However, it is still challenging to train a universal vocoder which can generalize well to out-of-domain (OOD) scenarios, such as unseen speaking styles, non-speech vocalization, singing, and musical pieces. In this work, we propose SnakeGAN, a GAN-based universal vocoder, which can synthesize high-fidelity audio in various OOD scenarios. SnakeGAN takes a coarse-grained signal generated by a differentiable digital signal processing (DDSP) model as prior knowledge, aiming at recovering high-fidelity waveform from a Mel-spectrogram. We introduce periodic nonlinearities through the Snake activation function and anti-aliased representation into the generator, which further brings the desired inductive bias for audio synthesis and significantly improves the extrapolation capacity for universal vocoding in unseen scenarios. To validate the effectiveness of our proposed method, we train SnakeGAN with only speech data and evaluate its performance for various OOD distributions with both subjective and objective metrics. Experimental results show that SnakeGAN significantly outperforms the compared approaches and can generate high-fidelity audio samples including unseen speakers with unseen styles, singing voices, instrumental pieces, and nonverbal vocalization. △ Less

Submitted 14 September, 2023; originally announced September 2023.

Comments: Accepted by ICME 2023

arXiv:2309.07757 [pdf, other]

Complexity Scaling for Speech Denoising

Authors: Hangting Chen, Jianwei Yu, Chao Weng

Abstract: Computational complexity is critical when deploying deep learning-based speech denoising models for on-device applications. Most prior research focused on optimizing model architectures to meet specific computational cost constraints, often creating distinct neural network architectures for different complexity limitations. This study conducts complexity scaling for speech denoising tasks, aiming… ▽ More Computational complexity is critical when deploying deep learning-based speech denoising models for on-device applications. Most prior research focused on optimizing model architectures to meet specific computational cost constraints, often creating distinct neural network architectures for different complexity limitations. This study conducts complexity scaling for speech denoising tasks, aiming to consolidate models with various complexities into a unified architecture. We present a Multi-Path Transform-based (MPT) architecture to handle both low- and high-complexity scenarios. A series of MPT networks present high performance covering a wide range of computational complexities on the DNS challenge dataset. Moreover, inspired by the scaling experiments in natural language processing, we explore the empirical relationship between model performance and computational cost on the denoising task. As the complexity number of multiply-accumulate operations (MACs) is scaled from 50M/s to 15G/s on MPT networks, we observe a linear increase in the values of PESQ-WB and SI-SNR, proportional to the logarithm of MACs, which might contribute to the understanding and application of complexity scaling in speech denoising tasks. △ Less

Submitted 14 September, 2023; originally announced September 2023.

Comments: Submitted to ICASSP2024

arXiv:2309.05786 [pdf, other]

doi 10.1145/3616195.3616209

Stringesthesia: Dynamically Shifting Musical Agency Between Audience and Performer Based on Trust in an Interactive and Improvised Performance

Authors: Torin Hopkins, Emily Doherty, Netta Ofer, Suibi Che Chuan Weng, Peter Gyrory, Chad Tobin, Leanne Hirshfield, Ellen Yi-Luen Do

Abstract: This paper introduces Stringesthesia, an interactive and improvised performance paradigm. Stringesthesia uses real-time neuroimaging to connect performers and audiences, enabling direct access to the performers mental state and determining audience participation during the performance. Functional near-infrared spectroscopy, or fNIRS, a noninvasive neuroimaging tool, was used to assess metabolic ac… ▽ More This paper introduces Stringesthesia, an interactive and improvised performance paradigm. Stringesthesia uses real-time neuroimaging to connect performers and audiences, enabling direct access to the performers mental state and determining audience participation during the performance. Functional near-infrared spectroscopy, or fNIRS, a noninvasive neuroimaging tool, was used to assess metabolic activity of brain areas collectively associated with a metric we call trust. A visualization representing the real-time measurement of the performers level of trust was projected behind the performer and used to dynamically restrict or promote audience participation. Throughout the paper we discuss prior work that heavily influenced our design, conceptual and methodological issues with using fNIRS technology, system architecture, and feedback from the audience and performer. △ Less

Submitted 11 September, 2023; originally announced September 2023.

Journal ref: Audio Mostly 2023, Edinburgh, UK

arXiv:2309.00842 [pdf, other]

DualStream: Spatially Sharing Selves and Surroundings using Mobile Devices and Augmented Reality

Authors: Rishi Vanukuru, Suibi Che-Chuan Weng, Krithik Ranjan, Torin Hopkins, Amy Banic, Mark D. Gross, Ellen Yi-Luen Do

Abstract: In-person human interaction relies on our spatial perception of each other and our surroundings. Current remote communication tools partially address each of these aspects. Video calls convey real user representations but without spatial interactions. Augmented and Virtual Reality (AR/VR) experiences are immersive and spatial but often use virtual environments and characters instead of real-life r… ▽ More In-person human interaction relies on our spatial perception of each other and our surroundings. Current remote communication tools partially address each of these aspects. Video calls convey real user representations but without spatial interactions. Augmented and Virtual Reality (AR/VR) experiences are immersive and spatial but often use virtual environments and characters instead of real-life representations. Bridging these gaps, we introduce DualStream, a system for synchronous mobile AR remote communication that captures, streams, and displays spatial representations of users and their surroundings. DualStream supports transitions between user and environment representations with different levels of visuospatial fidelity, as well as the creation of persistent shared spaces using environment snapshots. We demonstrate how DualStream can enable spatial communication in real-world contexts, and support the creation of blended spaces for collaboration. A formative evaluation of DualStream revealed that users valued the ability to interact spatially and move between representations, and could see DualStream fitting into their own remote communication practices in the near future. Drawing from these findings, we discuss new opportunities for designing more widely accessible spatial communication tools, centered around the mobile phone. △ Less

Submitted 2 September, 2023; originally announced September 2023.

Comments: 10 pages, 4 figures, 1 table; To appear in the proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2023

arXiv:2308.14553 [pdf, other]

Rep2wav: Noise Robust text-to-speech Using self-supervised representations

Authors: Qiushi Zhu, Yu Gu, Rilin Chen, Chao Weng, Yuchen Hu, Lirong Dai, Jie Zhang

Abstract: Benefiting from the development of deep learning, text-to-speech (TTS) techniques using clean speech have achieved significant performance improvements. The data collected from real scenes often contains noise and generally needs to be denoised by speech enhancement models. Noise-robust TTS models are often trained using the enhanced speech, which thus suffer from speech distortion and background… ▽ More Benefiting from the development of deep learning, text-to-speech (TTS) techniques using clean speech have achieved significant performance improvements. The data collected from real scenes often contains noise and generally needs to be denoised by speech enhancement models. Noise-robust TTS models are often trained using the enhanced speech, which thus suffer from speech distortion and background noise that affect the quality of the synthesized speech. Meanwhile, it was shown that self-supervised pre-trained models exhibit excellent noise robustness on many speech tasks, implying that the learned representation has a better tolerance for noise perturbations. In this work, we therefore explore pre-trained models to improve the noise robustness of TTS models. Based on HiFi-GAN, we first propose a representation-to-waveform vocoder, which aims to learn to map the representation of pre-trained models to the waveform. We then propose a text-to-representation FastSpeech2 model, which aims to learn to map text to pre-trained model representations. Experimental results on the LJSpeech and LibriTTS datasets show that our method outperforms those using speech enhancement methods in both subjective and objective metrics. Audio samples are available at: https://zqs01.github.io/rep2wav. △ Less

Submitted 3 September, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

Comments: 5 pages,2 figures

arXiv:2308.11053 [pdf, other]

doi 10.21437/Interspeech.2023-2302

Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression

Authors: Hangting Chen, Jianwei Yu, Yi Luo, Rongzhi Gu, Weihua Li, Zhuocheng Lu, Chao Weng

Abstract: Echo cancellation and noise reduction are essential for full-duplex communication, yet most existing neural networks have high computational costs and are inflexible in tuning model complexity. In this paper, we introduce time-frequency dual-path compression to achieve a wide range of compression ratios on computational cost. Specifically, for frequency compression, trainable filters are used to r… ▽ More Echo cancellation and noise reduction are essential for full-duplex communication, yet most existing neural networks have high computational costs and are inflexible in tuning model complexity. In this paper, we introduce time-frequency dual-path compression to achieve a wide range of compression ratios on computational cost. Specifically, for frequency compression, trainable filters are used to replace manually designed filters for dimension reduction. For time compression, only using frame skipped prediction causes large performance degradation, which can be alleviated by a post-processing network with full sequence modeling. We have found that under fixed compression ratios, dual-path compression combining both the time and frequency methods will give further performance improvement, covering compression ratios from 4x to 32x with little model size change. Moreover, the proposed models show competitive performance compared with fast FullSubNet and DeepFilterNet. △ Less

Submitted 10 October, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

Comments: Proceedings of INTERSPEECH

arXiv:2308.10107 [pdf, other]

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

Authors: **chuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu, Shinji Watanabe

Abstract: Automatic speech recognition (ASR) based on transducers is widely used. In training, a transducer maximizes the summed posteriors of all paths. The path with the highest posterior is commonly defined as the predicted alignment between the speech and the transcription. While the vanilla transducer does not have a prior preference for any of the valid paths, this work intends to enforce the preferre… ▽ More Automatic speech recognition (ASR) based on transducers is widely used. In training, a transducer maximizes the summed posteriors of all paths. The path with the highest posterior is commonly defined as the predicted alignment between the speech and the transcription. While the vanilla transducer does not have a prior preference for any of the valid paths, this work intends to enforce the preferred paths and achieve controllable alignment prediction. Specifically, this work proposes Bayes Risk Transducer (BRT), which uses a Bayes risk function to set lower risk values to the preferred paths so that the predicted alignment is more likely to satisfy specific desired properties. We further demonstrate that these predicted alignments with intentionally designed properties can provide practical advantages over the vanilla transducer. Experimentally, the proposed BRT saves inference cost by up to 46% for non-streaming ASR and reduces overall system latency by 41% for streaming ASR. △ Less

Submitted 19 August, 2023; originally announced August 2023.

Journal ref: Interspeech 2023

arXiv:2308.08660 [pdf, other]

Large Language Models for Granularized Barrett's Esophagus Diagnosis Classification

Authors: Jenna Kefeli, Ali Soroush, Courtney J. Diamond, Haley M. Zylberberg, Benjamin May, Julian A. Abrams, Chunhua Weng, Nicholas Tatonetti

Abstract: Diagnostic codes for Barrett's esophagus (BE), a precursor to esophageal cancer, lack granularity and precision for many research or clinical use cases. Laborious manual chart review is required to extract key diagnostic phenotypes from BE pathology reports. We developed a generalizable transformer-based method to automate data extraction. Using pathology reports from Columbia University Irving Me… ▽ More Diagnostic codes for Barrett's esophagus (BE), a precursor to esophageal cancer, lack granularity and precision for many research or clinical use cases. Laborious manual chart review is required to extract key diagnostic phenotypes from BE pathology reports. We developed a generalizable transformer-based method to automate data extraction. Using pathology reports from Columbia University Irving Medical Center with gastroenterologist-annotated targets, we performed binary dysplasia classification as well as granularized multi-class BE-related diagnosis classification. We utilized two clinically pre-trained large language models, with best model performance comparable to a highly tailored rule-based system developed using the same data. Binary dysplasia extraction achieves 0.964 F1-score, while the multi-class model achieves 0.911 F1-score. Our method is generalizable and faster to implement as compared to a tailored rule-based approach. △ Less

Submitted 16 August, 2023; originally announced August 2023.

arXiv:2308.06294 [pdf]

Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT

Authors: **gye Yang, Cong Liu, Wendy Deng, Da Wu, Chunhua Weng, Yunyun Zhou, Kai Wang

Abstract: We hypothesize that large language models (LLMs) based on the transformer architecture can enable automated detection of clinical phenotype terms, including terms not documented in the HPO. In this study, we developed two types of models: PhenoBCBERT, a BERT-based model, utilizing Bio+Clinical BERT as its pre-trained model, and PhenoGPT, a GPT-based model that can be initialized from diverse GPT m… ▽ More We hypothesize that large language models (LLMs) based on the transformer architecture can enable automated detection of clinical phenotype terms, including terms not documented in the HPO. In this study, we developed two types of models: PhenoBCBERT, a BERT-based model, utilizing Bio+Clinical BERT as its pre-trained model, and PhenoGPT, a GPT-based model that can be initialized from diverse GPT models, including open-source versions such as GPT-J, Falcon, and LLaMA, as well as closed-source versions such as GPT-3 and GPT-3.5. We compared our methods with PhenoTagger, a recently developed HPO recognition tool that combines rule-based and deep learning methods. We found that our methods can extract more phenotype concepts, including novel ones not characterized by HPO. We also performed case studies on biomedical literature to illustrate how new phenotype information can be recognized and extracted. We compared current BERT-based versus GPT-based models for phenotype tagging, in multiple aspects including model architecture, memory usage, speed, accuracy, and privacy protection. We also discussed the addition of a negation step and an HPO normalization layer to the transformer models for improved HPO term tagging. In conclusion, PhenoBCBERT and PhenoGPT enable the automated discovery of phenotype terms from clinical notes and biomedical literature, facilitating automated downstream tasks to derive new biological insights on human diseases. △ Less

Submitted 9 November, 2023; v1 submitted 10 August, 2023; originally announced August 2023.

arXiv:2307.13468 [pdf, other]

Gaussian Graph with Prototypical Contrastive Learning in E-Commerce Bundle Recommendation

Authors: Zhao-Yang Liu, Liucheng Sun, Chenwei Weng, Qi** Chen, Chengfu Huo

Abstract: Bundle recommendation aims to provide a bundle of items to satisfy the user preference on e-commerce platform. Existing successful solutions are based on the contrastive graph learning paradigm where graph neural networks (GNNs) are employed to learn representations from user-level and bundle-level graph views with a contrastive learning module to enhance the cooperative association between differ… ▽ More Bundle recommendation aims to provide a bundle of items to satisfy the user preference on e-commerce platform. Existing successful solutions are based on the contrastive graph learning paradigm where graph neural networks (GNNs) are employed to learn representations from user-level and bundle-level graph views with a contrastive learning module to enhance the cooperative association between different views. Nevertheless, they ignore the uncertainty issue which has a significant impact in real bundle recommendation scenarios due to the lack of discriminative information caused by highly sparsity or diversity. We further suggest that their instancewise contrastive learning fails to distinguish the semantically similar negatives (i.e., sampling bias issue), resulting in performance degradation. In this paper, we propose a novel Gaussian Graph with Prototypical Contrastive Learning (GPCL) framework to overcome these challenges. In particular, GPCL embeds each user/bundle/item as a Gaussian distribution rather than a fixed vector. We further design a prototypical contrastive learning module to capture the contextual information and mitigate the sampling bias issue. Extensive experiments demonstrate that benefiting from the proposed components, we achieve new state-of-the-art performance compared to previous methods on several public datasets. Moreover, GPCL has been deployed on real-world e-commerce platform and achieved substantial improvements. △ Less

Submitted 25 July, 2023; originally announced July 2023.

arXiv:2307.06940 [pdf, other]

Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

Authors: Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, **bo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen

Abstract: Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to utilize the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances. We achieve this by develo** a framework comprised of two functi… ▽ More Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to utilize the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances. We achieve this by develo** a framework comprised of two functional modules: (i) Motion Structure Retrieval, which provides video candidates with desired scene or motion context described by query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates plot-aligned videos under the guidance of motion structure and text prompts. For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure. For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters. The videos are synthesized by following the structural guidance and appearance instruction. To ensure visual consistency across clips, we propose an effective concept personalization approach, which allows the specification of the desired character identities through text prompts. Extensive experiments demonstrate that our approach exhibits significant advantages over various existing baselines. △ Less

Submitted 13 July, 2023; originally announced July 2023.

Comments: Github: https://github.com/VideoCrafter/Animate-A-Story Project page: https://videocrafter.github.io/Animate-A-Story

arXiv:2305.19269 [pdf, other]

Make-A-Voice: Unified Voice Synthesis With Discrete Representation

Authors: Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Lu** Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu

Abstract: Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, the majority of voice synthesis models currently rely on annotated audio data, but it is crucial to scale them to self-supervised datasets in order to effectively capture the wide range of acoustic variations present in human voice, including speak… ▽ More Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, the majority of voice synthesis models currently rely on annotated audio data, but it is crucial to scale them to self-supervised datasets in order to effectively capture the wide range of acoustic variations present in human voice, including speaker identity, emotion, and prosody. In this work, we propose Make-A-Voice, a unified framework for synthesizing and manipulating voice signals from discrete representations. Make-A-Voice leverages a "coarse-to-fine" approach to model the human voice, which involves three stages: 1) semantic stage: model high-level transformation between linguistic content and self-supervised semantic tokens, 2) acoustic stage: introduce varying control signals as acoustic conditions for semantic-to-acoustic modeling, and 3) generation stage: synthesize high-fidelity waveforms from acoustic tokens. Make-A-Voice offers notable benefits as a unified voice synthesis framework: 1) Data scalability: the major backbone (i.e., acoustic and generation stage) does not require any annotations, and thus the training data could be scaled up. 2) Controllability and conditioning flexibility: we investigate different conditioning mechanisms and effectively handle three voice synthesis applications, including text-to-speech (TTS), voice conversion (VC), and singing voice synthesis (SVS) by re-synthesizing the discrete voice representations with prompt guidance. Experimental results demonstrate that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models. Audio samples are available at https://Make-A-Voice.github.io △ Less

Submitted 30 May, 2023; originally announced May 2023.

arXiv:2305.16749 [pdf, other]

Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model

Authors: Xiang Li, Songxiang Liu, Max W. Y. Lam, Zhiyong Wu, Chao Weng, Helen Meng

Abstract: Expressive human speech generally abounds with rich and flexible speech prosody variations. The speech prosody predictors in existing expressive speech synthesis methods mostly produce deterministic predictions, which are learned by directly minimizing the norm of prosody prediction error. Its unimodal nature leads to a mismatch with ground truth distribution and harms the model's ability in makin… ▽ More Expressive human speech generally abounds with rich and flexible speech prosody variations. The speech prosody predictors in existing expressive speech synthesis methods mostly produce deterministic predictions, which are learned by directly minimizing the norm of prosody prediction error. Its unimodal nature leads to a mismatch with ground truth distribution and harms the model's ability in making diverse predictions. Thus, we propose a novel prosody predictor based on the denoising diffusion probabilistic model to take advantage of its high-quality generative modeling and training stability. Experiment results confirm that the proposed prosody predictor outperforms the deterministic baseline on both the expressiveness and diversity of prediction results with even fewer network parameters. △ Less

Submitted 7 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: Proceedings of Interspeech 2023 (doi: 10.21437/Interspeech.2023-715), demo site at https://thuhcsi.github.io/interspeech2023-DiffVar/

arXiv:2305.02765 [pdf, other]

HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec

Authors: Dongchao Yang, Songxiang Liu, Rongjie Huang, **chuan Tian, Chao Weng, Yuexian Zou

Abstract: Audio codec models are widely used in audio communication as a crucial technique for compressing audio into discrete representations. Nowadays, audio codec models are increasingly utilized in generation fields as intermediate representations. For instance, AudioLM is an audio generation model that uses the discrete representation of SoundStream as a training target, while VALL-E employs the Encode… ▽ More Audio codec models are widely used in audio communication as a crucial technique for compressing audio into discrete representations. Nowadays, audio codec models are increasingly utilized in generation fields as intermediate representations. For instance, AudioLM is an audio generation model that uses the discrete representation of SoundStream as a training target, while VALL-E employs the Encodec model as an intermediate feature to aid TTS tasks. Despite their usefulness, two challenges persist: (1) training these audio codec models can be difficult due to the lack of publicly available training processes and the need for large-scale data and GPUs; (2) achieving good reconstruction performance requires many codebooks, which increases the burden on generation models. In this study, we propose a group-residual vector quantization (GRVQ) technique and use it to develop a novel \textbf{Hi}gh \textbf{Fi}delity Audio Codec model, HiFi-Codec, which only requires 4 codebooks. We train all the models using publicly available TTS data such as LibriTTS, VCTK, AISHELL, and more, with a total duration of over 1000 hours, using 8 GPUs. Our experimental results show that HiFi-Codec outperforms Encodec in terms of reconstruction performance despite requiring only 4 codebooks. To facilitate research in audio codec and generation, we introduce AcademiCodec, the first open-source audio codec toolkit that offers training codes and pre-trained models for Encodec, SoundStream, and HiFi-Codec. Code and pre-trained model can be found on: \href{https://github.com/yangdongchao/AcademiCodec}{https://github.com/yangdongchao/AcademiCodec} △ Less

Submitted 7 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

Comments: The second version of HiFi-Codec

arXiv:2303.14622 [pdf, ps, other]

doi 10.1007/s11433-023-2105-7

Experimental quantum secret sharing based on phase encoding of coherent states

Authors: Ao Shen, Xiao-Yu Cao, Yang Wang, Yao Fu, Jie Gu, Wen-Bo Liu, Chen-Xun Weng, Hua-Lei Yin, Zeng-Bing Chen

Abstract: Quantum secret sharing (QSS) is one of the basic communication primitives in future quantum networks which addresses part of the basic cryptographic tasks of multiparty communication and computation. Nevertheless, it is a challenge to provide a practical QSS protocol with security against general attacks. A QSS protocol that balances security and practicality is still lacking. Here, we propose a Q… ▽ More Quantum secret sharing (QSS) is one of the basic communication primitives in future quantum networks which addresses part of the basic cryptographic tasks of multiparty communication and computation. Nevertheless, it is a challenge to provide a practical QSS protocol with security against general attacks. A QSS protocol that balances security and practicality is still lacking. Here, we propose a QSS protocol with simple phase encoding of coherent states among three parties. Removing the requirement of impractical entangled resources and the need for phase randomization, our protocol can be implemented with accessible technology. We provide the finite-key analysis against coherent attacks and implement a proof-of-principle experiment to demonstrate our scheme's feasibility. Our scheme achieves a key rate of 85.3 bps under a 35 dB channel loss. Combined with security against general attacks and accessible technology, our protocol is a promising candidate for practical multiparty quantum communication networks. △ Less

Submitted 27 March, 2023; v1 submitted 26 March, 2023; originally announced March 2023.

Comments: 10 pages, 5 figures, 3 tables, accepted by Sci. China-Phys. Mech. Astron

Journal ref: Sci. China-Phys. Mech. Astron. 66, 260311 (2023)

arXiv:2302.14349 [pdf, other]

doi 10.1103/PhysRevApplied.19.054070

Advantages of Asynchronous Measurement-Device-Independent Quantum Key Distribution in Intercity Networks

Authors: Yuan-Mei Xie, Jun-Lin Bai, Yu-Shuo Lu, Chen-Xun Weng, Hua-Lei Yin, Zeng-Bing Chen

Abstract: The new variant of measurement-device-independent quantum key distribution (MDI-QKD), called asynchronous MDI-QKD or mode-pairing MDI-QKD, offers similar repeater-like rate-loss scaling but has the advantage of simple technology implementation by exploiting an innovative post-measurement pairing technique. We herein present an evaluation of the practical aspects of decoy-state asynchronous MDI-QKD… ▽ More The new variant of measurement-device-independent quantum key distribution (MDI-QKD), called asynchronous MDI-QKD or mode-pairing MDI-QKD, offers similar repeater-like rate-loss scaling but has the advantage of simple technology implementation by exploiting an innovative post-measurement pairing technique. We herein present an evaluation of the practical aspects of decoy-state asynchronous MDI-QKD. To determine its effectiveness, we analyze the optimal method of decoy-state calculation and examine the impact of asymmetrical channels and multi-user networks. Our simulations show that, under realistic conditions, aynchronous MDI-QKD can furnish the highest key rate with MDI security as compared to other QKD protocols over distances ranging from 50 km to 480 km. At fiber distances of 50 km and 100 km, the key rates attain 6.02 Mbps and 2.29 Mbps respectively, which are sufficient to facilitate real-time one-time-pad video encryption. Our findings indicate that experimental implementation of asynchronous MDI-QKD in intercity networks can be both practical and efficient. △ Less

Submitted 24 July, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

Comments: 15 pages, 4 figures

Journal ref: Phys. Rev. Applied 19, 054070 (2023)

arXiv:2302.08504 [pdf, other]

PersonNeRF: Personalized Reconstruction from Photo Collections

Authors: Chung-Yi Weng, Pratul P. Srinivasan, Brian Curless, Ira Kemelmacher-Shlizerman

Abstract: We present PersonNeRF, a method that takes a collection of photos of a subject (e.g. Roger Federer) captured across multiple years with arbitrary body poses and appearances, and enables rendering the subject with arbitrary novel combinations of viewpoint, body pose, and appearance. PersonNeRF builds a customized neural volumetric 3D model of the subject that is able to render an entire space spann… ▽ More We present PersonNeRF, a method that takes a collection of photos of a subject (e.g. Roger Federer) captured across multiple years with arbitrary body poses and appearances, and enables rendering the subject with arbitrary novel combinations of viewpoint, body pose, and appearance. PersonNeRF builds a customized neural volumetric 3D model of the subject that is able to render an entire space spanned by camera viewpoint, body pose, and appearance. A central challenge in this task is dealing with sparse observations; a given body pose is likely only observed by a single viewpoint with a single appearance, and a given appearance is only observed under a handful of different body poses. We address this issue by recovering a canonical T-pose neural volumetric representation of the subject that allows for changing appearance across different observations, but uses a shared pose-dependent motion field across all observations. We demonstrate that this approach, along with regularization of the recovered volumetric geometry to encourage smoothness, is able to recover a model that renders compelling images from novel combinations of viewpoint, pose, and appearance from these challenging unstructured photo collections, outperforming prior work for free-viewpoint human rendering. △ Less

Submitted 16 February, 2023; originally announced February 2023.

Comments: Project Page: https://grail.cs.washington.edu/projects/personnerf/

arXiv:2301.13662 [pdf, other]

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Authors: Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, Helen Meng

Abstract: Expressive text-to-speech (TTS) aims to synthesize different speaking style speech according to human's demands. Nowadays, there are two common ways to control speaking styles: (1) Pre-defining a group of speaking style and using categorical index to denote different speaking style. However, there are limitations in the diversity of expressiveness, as these models can only generate the pre-defined… ▽ More Expressive text-to-speech (TTS) aims to synthesize different speaking style speech according to human's demands. Nowadays, there are two common ways to control speaking styles: (1) Pre-defining a group of speaking style and using categorical index to denote different speaking style. However, there are limitations in the diversity of expressiveness, as these models can only generate the pre-defined styles. (2) Using reference speech as style input, which results in a problem that the extracted style information is not intuitive or interpretable. In this study, we attempt to use natural language as style prompt to control the styles in the synthetic speech, e.g., "Sigh tone in full of sad mood with some helpless feeling". Considering that there is no existing TTS corpus which is proper to benchmark this novel task, we first construct a speech corpus, whose speech samples are annotated with not only content transcriptions but also style descriptions in natural language. Then we propose an expressive TTS model, named as InstructTTS, which is novel in the sense of following aspects: (1) We fully take the advantage of self-supervised learning and cross-modal metric learning, and propose a novel three-stage training procedure to obtain a robust sentence embedding model, which can effectively capture semantic information from the style prompts and control the speaking style in the generated speech. (2) We propose to model acoustic features in discrete latent space and train a novel discrete diffusion probabilistic model to generate vector-quantized (VQ) acoustic tokens rather than the commonly-used mel spectrogram. (3) We jointly apply mutual information (MI) estimation and minimization during acoustic model training to minimize style-speaker and style-content MI, avoiding possible content and speaker information leakage from the style prompt. △ Less

Submitted 25 June, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

Comments: Submit to TASLP

arXiv:2212.03080 [pdf, other]

Straggler-Resilient Differentially-Private Decentralized Learning

Authors: Yauhen Yakimenka, Chung-Wei Weng, Hsuan-Yin Lin, Eirik Rosnes, Jörg Kliewer

Abstract: We consider the straggler problem in decentralized learning over a logical ring while preserving user data privacy. Especially, we extend the recently proposed framework of differential privacy (DP) amplification by decentralization by Cyffers and Bellet to include overall training latency--comprising both computation and communication latency. Analytical results on both the convergence speed and… ▽ More We consider the straggler problem in decentralized learning over a logical ring while preserving user data privacy. Especially, we extend the recently proposed framework of differential privacy (DP) amplification by decentralization by Cyffers and Bellet to include overall training latency--comprising both computation and communication latency. Analytical results on both the convergence speed and the DP level are derived for both a skip** scheme (which ignores the stragglers after a timeout) and a baseline scheme that waits for each node to finish before the training continues. A trade-off between overall training latency, accuracy, and privacy, parameterized by the timeout of the skip** scheme, is identified and empirically validated for logistic regression on a real-world dataset and for image classification using the MNIST and CIFAR-10 datasets. △ Less

Submitted 28 June, 2024; v1 submitted 6 December, 2022; originally announced December 2022.

Comments: To appear in the IEEE Journal on Selected Areas in Information Theory (special issue on Information-Theoretic Methods for Trustworthy and Reliable Machine Learning)

arXiv:2211.02448 [pdf, other]

NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS

Authors: Dongchao Yang, Songxiang Liu, Jianwei Yu, Helin Wang, Chao Weng, Yuexian Zou

Abstract: Expressive text-to-speech (TTS) can synthesize a new speaking style by imiating prosody and timbre from a reference audio, which faces the following challenges: (1) The highly dynamic prosody information in the reference audio is difficult to extract, especially, when the reference audio contains background noise. (2) The TTS systems should have good generalization for unseen speaking styles. In t… ▽ More Expressive text-to-speech (TTS) can synthesize a new speaking style by imiating prosody and timbre from a reference audio, which faces the following challenges: (1) The highly dynamic prosody information in the reference audio is difficult to extract, especially, when the reference audio contains background noise. (2) The TTS systems should have good generalization for unseen speaking styles. In this paper, we present a \textbf{no}ise-\textbf{r}obust \textbf{e}xpressive TTS model (NoreSpeech), which can robustly transfer speaking style in a noisy reference utterance to synthesized speech. Specifically, our NoreSpeech includes several components: (1) a novel DiffStyle module, which leverages powerful probabilistic denoising diffusion models to learn noise-agnostic speaking style features from a teacher model by knowledge distillation; (2) a VQ-VAE block, which maps the style features into a controllable quantized latent space for improving the generalization of style transfer; and (3) a straight-forward but effective parameter-free text-style alignment module, which enables NoreSpeech to transfer style to a textual input from a length-mismatched reference utterance. Experiments demonstrate that NoreSpeech is more effective than previous expressive TTS models in noise environments. Audio samples and code are available at: \href{http://dongchaoyang.top/NoreSpeech\_demo/}{http://dongchaoyang.top/NoreSpeech\_demo/} △ Less

Submitted 4 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP2023

arXiv:2210.07499 [pdf, other]

Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

Authors: **chuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

Abstract: Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in multiple seq2seq tasks. Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target… ▽ More Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in multiple seq2seq tasks. Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target units. As there are multiple potential aligning sequences (called paths) that are equally considered in CTC formulation, the choice of which path will be most probable and become the predicted alignment is always uncertain. In addition, it is usually observed that the alignment predicted by vanilla CTC will drift compared with its reference and rarely provides practical functionalities. Thus, the motivation of this work is to make the CTC alignment prediction controllable and thus equip CTC with extra functionalities. The Bayes risk CTC (BRCTC) criterion is then proposed in this work, in which a customizable Bayes risk function is adopted to enforce the desired characteristics of the predicted alignment. With the risk function, the BRCTC is a general framework to adopt some customizable preference over the paths in order to concentrate the posterior into a particular subset of the paths. In applications, we explore one particular preference which yields models with the down-sampling ability and reduced inference costs. By using BRCTC with another preference for early emissions, we obtain an improved performance-latency trade-off for online models. Experimentally, the proposed BRCTC reduces the inference cost of offline models by up to 47% without performance degradation and cuts down the overall latency of online systems to an unseen level. △ Less

Submitted 31 January, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

Journal ref: International Conference on Learning Representations (ICLR), 2023

arXiv:2210.05092 [pdf, other]

The DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022

Authors: Xiaoyi Qin, Na Li, Yuke Lin, Yiwei Ding, Chao Weng, Dan Su, Ming Li

Abstract: This paper is the system description of the DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC22). In this challenge, we focus on track1 and track3. For track1, multiple backbone networks are adopted to extract frame-level features. Since track1 focus on the cross-age scenarios, we adopt the cross-age trials and perform QMF to calibrate score. The magnitude-based qualit… ▽ More This paper is the system description of the DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC22). In this challenge, we focus on track1 and track3. For track1, multiple backbone networks are adopted to extract frame-level features. Since track1 focus on the cross-age scenarios, we adopt the cross-age trials and perform QMF to calibrate score. The magnitude-based quality measures achieve a large improvement. For track3, the semi-supervised domain adaptation task, the pseudo label method is adopted to make domain adaptation. Considering the noise labels in clustering, the ArcFace is replaced by Sub-center ArcFace. The final submission achieves 0.107 mDCF in task1 and 7.135% EER in task3. △ Less

Submitted 10 October, 2022; originally announced October 2022.

arXiv:2207.09983 [pdf, other]

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

Authors: Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu

Abstract: Generating sound effects that humans want is an important topic. However, there are few studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder. The framework first uses t… ▽ More Generating sound effects that humans want is an important topic. However, there are few studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder. The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform. We found that the decoder significantly influences the generation performance. Thus, we focus on designing a good decoder in this study. We begin with the traditional autoregressive decoder, which has been proved as a state-of-the-art method in previous sound generation works. However, the AR decoder always predicts the mel-spectrogram tokens one by one in order, which introduces the unidirectional bias and accumulation of errors problems. Moreover, with the AR decoder, the sound generation time increases linearly with the sound duration. To overcome the shortcomings introduced by AR decoders, we propose a non-autoregressive decoder based on the discrete diffusion model, named Diffsound. Specifically, the Diffsound predicts all of the mel-spectrogram tokens in one step and then refines the predicted tokens in the next step, so the best-predicted results can be obtained after several steps. Our experiments show that our proposed Diffsound not only produces better text-to-sound generation results when compared with the AR decoder but also has a faster generation speed, e.g., MOS: 3.56 \textit{v.s} 2.786, and the generation speed is five times faster than the AR decoder. △ Less

Submitted 28 April, 2023; v1 submitted 20 July, 2022; originally announced July 2022.

Comments: Accepted by TASLP2022

arXiv:2207.05929 [pdf, other]

Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings

Authors: Xiaoyi Qin, Na Li, Chao Weng, Dan Su, Ming Li

Abstract: Automatic speaker verification has achieved remarkable progress in recent years. However, there is little research on cross-age speaker verification (CASV) due to insufficient relevant data. In this paper, we mine cross-age test sets based on the VoxCeleb dataset and propose our age-invariant speaker representation(AISR) learning method. Since the VoxCeleb is collected from the YouTube platform, t… ▽ More Automatic speaker verification has achieved remarkable progress in recent years. However, there is little research on cross-age speaker verification (CASV) due to insufficient relevant data. In this paper, we mine cross-age test sets based on the VoxCeleb dataset and propose our age-invariant speaker representation(AISR) learning method. Since the VoxCeleb is collected from the YouTube platform, the dataset consists of cross-age data inherently. However, the meta-data does not contain the speaker age label. Therefore, we adopt the face age estimation method to predict the speaker age value from the associated visual data, then label the audio recording with the estimated age. We construct multiple Cross-Age test sets on VoxCeleb (Vox-CA), which deliberately select the positive trials with large age-gap. Also, the effect of nationality and gender is considered in selecting negative pairs to align with Vox-H cases. The baseline system performance drops from 1.939\% EER on the Vox-H test set to 10.419\% on the Vox-CA20 test set, which indicates how difficult the cross-age scenario is. Consequently, we propose an age-decoupling adversarial learning (ADAL) method to alleviate the negative effect of the age gap and reduce intra-class variance. Our method outperforms the baseline system by over 10\% related EER reduction on the Vox-CA20 test set. The source code and trial resources are available on https://github.com/qinxiaoyi/Cross-Age_Speaker_Verification △ Less

Submitted 12 July, 2022; originally announced July 2022.

Comments: Accepted by Interspeech2022

arXiv:2206.09159 [pdf, other]

doi 10.34133/research.0272

Beating the fault-tolerance bound and security loopholes for Byzantine agreement with a quantum solution

Authors: Chen-Xun Weng, Rui-Qi Gao, Yu Bao, Bing-Hong Li, Wen-Bo Liu, Yuan-Mei Xie, Yu-Shuo Lu, Hua-Lei Yin, Zeng-Bing Chen

Abstract: Byzantine agreement, the underlying core of blockchain, aims to make every node in a decentralized network reach consensus. Classical Byzantine agreements unavoidably face two major problems. One is $1/3$ fault-tolerance bound, which means that the system to tolerate $f$ malicious players requires at least $3f+1$ players. The other is the security loopholes from its classical cryptography methods.… ▽ More Byzantine agreement, the underlying core of blockchain, aims to make every node in a decentralized network reach consensus. Classical Byzantine agreements unavoidably face two major problems. One is $1/3$ fault-tolerance bound, which means that the system to tolerate $f$ malicious players requires at least $3f+1$ players. The other is the security loopholes from its classical cryptography methods. Here, we propose a Byzantine agreement framework with unconditional security to break this bound with nearly $1/2$ fault tolerance due to multiparty correlation provided by quantum digital signatures. \textcolor{black}{It is intriguing that quantum entanglement is not necessary to break the $1/3$ fault-tolerance bound, and we show that weaker correlation, such as asymmetric relationship of quantum digital signature, can also work.} Our work strictly obeys two Byzantine conditions and can be extended to any number of players without requirements for multiparticle entanglement. We experimentally demonstrate three-party and five-party consensus for a digital ledger. Our work indicates the quantum advantage in terms of consensus problems and suggests an important avenue for quantum blockchain and quantum consensus networks. △ Less

Submitted 22 November, 2023; v1 submitted 18 June, 2022; originally announced June 2022.

Comments: 21 pages, 7 figures. All comments are welcome!

Journal ref: Research 6, 0272 (2023)

arXiv:2206.02093 [pdf, other]

LAE: Language-Aware Encoder for Monolingual and Multilingual ASR

Authors: **chuan Tian, Jianwei Yu, Chunlei Zhang, Chao Weng, Yuexian Zou, Dong Yu

Abstract: Despite the rapid progress in automatic speech recognition (ASR) research, recognizing multilingual speech using a unified ASR system remains highly challenging. Previous works on multilingual speech recognition mainly focus on two directions: recognizing multiple monolingual speech or recognizing code-switched speech that uses different languages interchangeably within a single utterance. However… ▽ More Despite the rapid progress in automatic speech recognition (ASR) research, recognizing multilingual speech using a unified ASR system remains highly challenging. Previous works on multilingual speech recognition mainly focus on two directions: recognizing multiple monolingual speech or recognizing code-switched speech that uses different languages interchangeably within a single utterance. However, a pragmatic multilingual recognizer is expected to be compatible with both directions. In this work, a novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information and generating frame-level language-aware representations during encoding. In the LAE, the primary encoding is implemented by the shared block while the language-specific blocks are used to extract specific representations for each language. To learn language-specific information discriminatively, a language-aware training method is proposed to optimize the language-specific blocks in LAE. Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating different languages in frame-level and shows superior performance on both monolingual and multilingual ASR tasks. With either a real-recorded or simulated code-switched dataset, the proposed LAE achieves statistically significant improvements on both CTC and neural transducer systems. Code is released △ Less

Submitted 5 June, 2022; originally announced June 2022.

arXiv:2204.00821 [pdf, other]

Improving Target Sound Extraction with Timestamp Information

Authors: Helin Wang, Dongchao Yang, Chao Weng, Jianwei Yu, Yuexian Zou

Abstract: Target sound extraction (TSE) aims to extract the sound part of a target sound event class from a mixture audio with multiple sound events. The previous works mainly focus on the problems of weakly-labelled data, jointly learning and new classes, however, no one cares about the onset and offset times of the target sound event, which has been emphasized in the auditory scene analysis. In this pap… ▽ More Target sound extraction (TSE) aims to extract the sound part of a target sound event class from a mixture audio with multiple sound events. The previous works mainly focus on the problems of weakly-labelled data, jointly learning and new classes, however, no one cares about the onset and offset times of the target sound event, which has been emphasized in the auditory scene analysis. In this paper, we study to utilize such timestamp information to help extract the target sound via a target sound detection network and a target-weighted time-frequency loss function. More specifically, we use the detection result of a target sound detection (TSD) network as the additional information to guide the learning of target sound extraction network. We also find that the result of TSE can further improve the performance of the TSD network, so that a mutual learning framework of the target sound detection and extraction is proposed. In addition, a target-weighted time-frequency loss function is designed to pay more attention to the temporal regions of the target sound during training. Experimental results on the synthesized data generated from the Freesound Datasets show that our proposed method can significantly improve the performance of TSE. △ Less

Submitted 2 April, 2022; originally announced April 2022.

Comments: submitted to interspeech2022

arXiv:2203.15614 [pdf, other]

doi 10.1109/TASLP.2022.3198555

Integrating Lattice-Free MMI into End-to-End Speech Recognition

Authors: **chuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu

Abstract: In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, the adoption of discriminative criteria is promising to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems.… ▽ More In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, the adoption of discriminative criteria is promising to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems. However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is only used in system training, which creates a mismatch between training and decoding; the on-the-fly decoding process in MBR-based methods results in the need for pre-trained models and slow training speeds. To this end, novel algorithms are proposed in this work to integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), into E2E ASR systems not only in the training stage but also in the decoding process. The proposed LF-MMI training and decoding methods show their effectiveness on two widely used E2E frameworks: Attention-Based Encoder-Decoders (AEDs) and Neural Transducers (NTs). Compared with MBR-based methods, the proposed LF-MMI method: maintains the consistency between training and decoding; eschews the on-the-fly decoding process; trains from randomly initialized models with superior training efficiency. Experiments suggest that the LF-MMI method outperforms its MBR counterparts and consistently leads to statistically significant performance improvements on various frameworks and datasets from 30 hours to 14.3k hours. The proposed method achieves state-of-the-art (SOTA) results on Aishell-1 (CER 4.10%) and Aishell-2 (CER 5.02%) datasets. Code is released. △ Less

Submitted 22 August, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022

arXiv:2203.03539 [pdf, other]

Understanding The Robustness of Self-supervised Learning Through Topic Modeling

Authors: Ze** Luo, Shiyou Wu, Cindy Weng, Mo Zhou, Rong Ge

Abstract: Self-supervised learning has significantly improved the performance of many NLP tasks. However, how can self-supervised learning discover useful representations, and why is it better than traditional approaches such as probabilistic models are still largely unknown. In this paper, we focus on the context of topic modeling and highlight a key advantage of self-supervised learning - when applied to… ▽ More Self-supervised learning has significantly improved the performance of many NLP tasks. However, how can self-supervised learning discover useful representations, and why is it better than traditional approaches such as probabilistic models are still largely unknown. In this paper, we focus on the context of topic modeling and highlight a key advantage of self-supervised learning - when applied to data generated by topic models, self-supervised learning can be oblivious to the specific model, and hence is less susceptible to model misspecification. In particular, we prove that commonly used self-supervised objectives based on reconstruction or contrastive samples can both recover useful posterior information for general topic models. Empirically, we show that the same objectives can perform on par with posterior inference using the correct model, while outperforming posterior inference using misspecified models. △ Less

Submitted 27 February, 2023; v1 submitted 2 February, 2022; originally announced March 2022.

Comments: Accepted at ICLR 2023. Camera ready version

arXiv:2202.01986 [pdf, other]

The CUHK-TENCENT speaker diarization system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge

Authors: Naijun Zheng, Na Li, Xixin Wu, Lingwei Meng, Jiawen Kang, Haibin Wu, Chao Weng, Dan Su, Helen Meng

Abstract: This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks. In these meeting scenarios, the uncertainty of the speaker number and the high ratio of overlapped speech present great challenges for d… ▽ More This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks. In these meeting scenarios, the uncertainty of the speaker number and the high ratio of overlapped speech present great challenges for diarization. Based on the assumption that there is valuable complementary information between acoustic features, spatial-related and speaker-related features, we propose a multi-level feature fusion mechanism based target-speaker voice activity detection (FFM-TS-VAD) system to improve the performance of the conventional TS-VAD system. Furthermore, we propose a data augmentation method during training to improve the system robustness when the angular difference between two speakers is relatively small. We provide comparisons for different sub-systems we used in M2MeT challenge. Our submission is a fusion of several sub-systems and ranks second in the diarization task. △ Less

Submitted 4 February, 2022; originally announced February 2022.

Comments: submitted to ICASSP2022

arXiv:2201.04127 [pdf, other]

HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video

Authors: Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, Ira Kemelmacher-Shlizerman

Abstract: We introduce a free-viewpoint rendering method -- HumanNeRF -- that works on a given monocular video of a human performing complex body motions, e.g. a video from YouTube. Our method enables pausing the video at any frame and rendering the subject from arbitrary new camera viewpoints or even a full 360-degree camera path for that particular frame and body pose. This task is particularly challengin… ▽ More We introduce a free-viewpoint rendering method -- HumanNeRF -- that works on a given monocular video of a human performing complex body motions, e.g. a video from YouTube. Our method enables pausing the video at any frame and rendering the subject from arbitrary new camera viewpoints or even a full 360-degree camera path for that particular frame and body pose. This task is particularly challenging, as it requires synthesizing photorealistic details of the body, as seen from various camera angles that may not exist in the input video, as well as synthesizing fine details such as cloth folds and facial appearance. Our method optimizes for a volumetric representation of the person in a canonical T-pose, in concert with a motion field that maps the estimated canonical representation to every frame of the video via backward warps. The motion field is decomposed into skeletal rigid and non-rigid motions, produced by deep networks. We show significant performance improvements over prior work, and compelling examples of free-viewpoint renderings from monocular video of moving humans in challenging uncontrolled capture scenarios. △ Less

Submitted 14 June, 2022; v1 submitted 11 January, 2022; originally announced January 2022.

Comments: CVPR 2022 (oral). Project page with videos: https://grail.cs.washington.edu/projects/humannerf/

arXiv:2201.01995 [pdf, other]

doi 10.1109/LSP.2022.3154241

Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

Authors: **chuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu

Abstract: Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), it has been shown that incorporating external language models (LMs) into the decoding can further improve the recognition performance of E2E ASR systems. To align with the modeling units adopted in E2E ASR systems, subword-level (e.g., characters, BPE) LMs are usually used to cooperate with current E2E ASR systems.… ▽ More Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), it has been shown that incorporating external language models (LMs) into the decoding can further improve the recognition performance of E2E ASR systems. To align with the modeling units adopted in E2E ASR systems, subword-level (e.g., characters, BPE) LMs are usually used to cooperate with current E2E ASR systems. However, the use of subword-level LMs will ignore the word-level information, which may limit the strength of the external LMs in E2E ASR. Although several methods have been proposed to incorporate word-level external LMs in E2E ASR, these methods are mainly designed for languages with clear word boundaries such as English and cannot be directly applied to languages like Mandarin, in which each character sequence can have multiple corresponding word sequences. To this end, we propose a novel decoding algorithm where a word-level lattice is constructed on-the-fly to consider all possible word sequences for each partial hypothesis. Then, the LM score of the hypothesis is obtained by intersecting the generated lattice with an external word N-gram LM. The proposed method is examined on both Attention-based Encoder-Decoder (AED) and Neural Transducer (NT) frameworks. Experiments suggest that our method consistently outperforms subword-level LMs, including N-gram LM and neural network LM. We achieve state-of-the-art results on both Aishell-1 (CER 4.18%) and Aishell-2 (CER 5.06%) datasets and reduce CER by 14.8% relatively on a 21K-hour Mandarin dataset. △ Less

Submitted 6 January, 2022; originally announced January 2022.

Comments: 5pages, 1 figure

arXiv:2112.11635 [pdf, other]

doi 10.1103/PRXQuantum.3.020315

Breaking the Rate-Loss Bound of Quantum Key Distribution with Asynchronous Two-Photon Interference

Authors: Yuan-Mei Xie, Yu-Shuo Lu, Chen-Xun Weng, Xiao-Yu Cao, Zhao-Ying Jia, Yu Bao, Yang Wang, Yao Fu, Hua-Lei Yin, Zeng-Bing Chen

Abstract: Twin-field quantum key distribution can overcome the secret key capacity of repeaterless quantum key distribution via single-photon interference. However, to compensate for the channel fluctuations and lock the laser fluctuations, the techniques of phase tracking and phase locking are indispensable in experiment, which drastically increase experimental complexity and hinder free-space realization.… ▽ More Twin-field quantum key distribution can overcome the secret key capacity of repeaterless quantum key distribution via single-photon interference. However, to compensate for the channel fluctuations and lock the laser fluctuations, the techniques of phase tracking and phase locking are indispensable in experiment, which drastically increase experimental complexity and hinder free-space realization. Inspired by the duality in entanglement, we herein present an asynchronous measurement-device-independent quantum key distribution protocol that can surpass the secret key capacity even without phase tracking and phase locking. Leveraging the concept of time multiplexing, asynchronous two-photon Bell-state measurement is realized by postmatching two interference detection events. For a 1 GHz system, the new protocol reaches a transmission distance of 450 km without phase tracking. After further removing phase locking, our protocol is still capable of breaking the capacity at 270 km. Intriguingly, when using the same experimental techniques, our protocol has a higher key rate than the phase-matching-type twin-field protocol. In the presence of imperfect intensity modulation, it also has a significant advantage in terms of the transmission distance over the sending-or-not-sending type twin-field protocol. With high key rates and accessible technology, our work provides a promising candidate for practical scalable quantum communication networks. △ Less

Submitted 26 April, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

Comments: 15 pages, 10 figures. arXiv admin note: text overlap with arXiv:2112.11165

Journal ref: PRX Quantum 3, 020315 (2022)

arXiv:2112.11165 [pdf, other]

doi 10.1103/PhysRevA.107.042603

Scalable High-Rate Twin-Field Quantum Key Distribution Networks without Constraint of Probability and Intensity

Authors: Yuan-Mei Xie, Chen-Xun Weng, Yu-Shuo Lu, Yao Fu, Yang Wang, Hua-Lei Yin, Zeng-Bing Chen

Abstract: Implementation of a twin-field quantum key distribution network faces limitations, including the low tolerance of interference errors for phase-matching type protocols and the strict constraint regarding intensity and probability for sending-or-not-sending type protocols. Here, we propose a two-photon twin-field quantum key distribution protocol and achieve twin-field-type two-photon interference… ▽ More Implementation of a twin-field quantum key distribution network faces limitations, including the low tolerance of interference errors for phase-matching type protocols and the strict constraint regarding intensity and probability for sending-or-not-sending type protocols. Here, we propose a two-photon twin-field quantum key distribution protocol and achieve twin-field-type two-photon interference through post-matching phase-correlated single-photon interference events. We exploit the non-interference mode as the code mode to highly tolerate interference errors, and the two-photon interference naturally removes the intensity and probability constraint. Therefore, our protocol can transcend the abovementioned limitations while breaking the secret key capacity of repeaterless quantum key distribution. Simulations show that for a four-user networks, under which each node with fixed system parameters can dynamically switch different attenuation links, the key rates of our protocol for all six links can either exceed or approach the secret key capacity. However, the key rates of all links are lower than the key capacity when using phase-matching type protocols. Additionally, four of the links could not extract the key when using sending-or-not-sending type protocols. We anticipate that our protocol can facilitate the development of practical and efficient quantum networks. △ Less

Submitted 9 April, 2023; v1 submitted 21 December, 2021; originally announced December 2021.

Comments: 17 pages, 6 figures, 3 tables, Accepted for Publication in Phys. Rev. A

Journal ref: Phys. Rev. A 107, 042603 (2023)

arXiv:2112.10821 [pdf]

Natural language processing to identify lupus nephritis phenotype in electronic health records

Authors: Yu Deng, Jennifer A. Pacheco, Anh Chung, Chengsheng Mao, Joshua C. Smith, Juan Zhao, Wei-Qi Wei, April Barnado, Chunhua Weng, Cong Liu, Adam Cordon, **gzhi Yu, Yacob Tedla, Abel Kho, Rosalind Ramsey-Goldman, Theresa Walunas, Yuan Luo

Abstract: Systemic lupus erythematosus (SLE) is a rare autoimmune disorder characterized by an unpredictable course of flares and remission with diverse manifestations. Lupus nephritis, one of the major disease manifestations of SLE for organ damage and mortality, is a key component of lupus classification criteria. Accurately identifying lupus nephritis in electronic health records (EHRs) would therefore b… ▽ More Systemic lupus erythematosus (SLE) is a rare autoimmune disorder characterized by an unpredictable course of flares and remission with diverse manifestations. Lupus nephritis, one of the major disease manifestations of SLE for organ damage and mortality, is a key component of lupus classification criteria. Accurately identifying lupus nephritis in electronic health records (EHRs) would therefore benefit large cohort observational studies and clinical trials where characterization of the patient population is critical for recruitment, study design, and analysis. Lupus nephritis can be recognized through procedure codes and structured data, such as laboratory tests. However, other critical information documenting lupus nephritis, such as histologic reports from kidney biopsies and prior medical history narratives, require sophisticated text processing to mine information from pathology reports and clinical notes. In this study, we developed algorithms to identify lupus nephritis with and without natural language processing (NLP) using EHR data. We developed four algorithms: a rule-based algorithm using only structured data (baseline algorithm) and three algorithms using different NLP models. The three NLP models are based on regularized logistic regression and use different sets of features including positive mention of concept unique identifiers (CUIs), number of appearances of CUIs, and a mixture of three components respectively. The baseline algorithm and the best performed NLP algorithm were external validated on a dataset from Vanderbilt University Medical Center (VUMC). Our best performing NLP model incorporating features from both structured data, regular expression concepts, and mapped CUIs improved F measure in both the NMEDW (0.41 vs 0.79) and VUMC (0.62 vs 0.96) datasets compared to the baseline lupus nephritis algorithm. △ Less

Submitted 20 December, 2021; originally announced December 2021.

arXiv:2112.02498 [pdf, other]

Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI

Authors: **chuan Tian, Jianwei Yu, Chao Weng, Shi-Xiong Zhang, Dan Su, Dong Yu, Yuexian Zou

Abstract: Recently, End-to-End (E2E) frameworks have achieved remarkable results on various Automatic Speech Recognition (ASR) tasks. However, Lattice-Free Maximum Mutual Information (LF-MMI), as one of the discriminative training criteria that show superior performance in hybrid ASR systems, is rarely adopted in E2E ASR frameworks. In this work, we propose a novel approach to integrate LF-MMI criterion int… ▽ More Recently, End-to-End (E2E) frameworks have achieved remarkable results on various Automatic Speech Recognition (ASR) tasks. However, Lattice-Free Maximum Mutual Information (LF-MMI), as one of the discriminative training criteria that show superior performance in hybrid ASR systems, is rarely adopted in E2E ASR frameworks. In this work, we propose a novel approach to integrate LF-MMI criterion into E2E ASR frameworks in both training and decoding stages. The proposed approach shows its effectiveness on two of the most widely used E2E frameworks including Attention-Based Encoder-Decoders (AEDs) and Neural Transducers (NTs). Experiments suggest that the introduction of the LF-MMI criterion consistently leads to significant performance improvements on various datasets and different E2E ASR frameworks. The best of our models achieves competitive CER of 4.1\% / 4.4\% on Aishell-1 dev/test set; we also achieve significant error reduction on Aishell-2 and Librispeech datasets over strong baselines. △ Less

Submitted 29 December, 2021; v1 submitted 5 December, 2021; originally announced December 2021.

arXiv:2111.15016 [pdf, other]

Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization

Authors: Brian Yan, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe, Dong Yu

Abstract: Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type. In this work, we propose a general framework to jointly model the likelihoods of the monolingual and code-switch sub-tasks that comprise bilingual speech recognition. By defining the monolingual sub-tasks with label-to-frame synchronization, our joint m… ▽ More Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type. In this work, we propose a general framework to jointly model the likelihoods of the monolingual and code-switch sub-tasks that comprise bilingual speech recognition. By defining the monolingual sub-tasks with label-to-frame synchronization, our joint modeling framework can be conditionally factorized such that the final bilingual output, which may or may not be code-switched, is obtained given only monolingual information. We show that this conditionally factorized joint framework can be modeled by an end-to-end differentiable neural network. We demonstrate the efficacy of our proposed model on bilingual Mandarin-English speech recognition across both monolingual and code-switched corpora. △ Less

Submitted 29 November, 2021; originally announced November 2021.

arXiv:2111.03775 [pdf, ps, other]

doi 10.1364/OL.443099

Long-distance twin-field quantum key distribution with entangled sources

Authors: Bing-Hong Li, Yuan-Mei Xie, Zhao Li, Chen-Xun Weng, Chen-Long Li, Hua-Lei Yin, Zeng-Bing Chen

Abstract: Twin-field quantum key distribution (TFQKD), using single-photon-type interference, offers a way to exceed the rate-distance limit without quantum repeaters. However, it still suffers from the photon losses and dark counts, which impose an ultimate limit on its transmission distance. In this letter, we propose a scheme to implement TFQKD with an entangled coherent state source in the middle to inc… ▽ More Twin-field quantum key distribution (TFQKD), using single-photon-type interference, offers a way to exceed the rate-distance limit without quantum repeaters. However, it still suffers from the photon losses and dark counts, which impose an ultimate limit on its transmission distance. In this letter, we propose a scheme to implement TFQKD with an entangled coherent state source in the middle to increase its range, as well as comparing its performance under coherent attacks with that of TFQKD variants. Simulations show that our protocol has a theoretical distance advantage of 400 kilometers. Moreover, the scheme has great robustness against the misalignment error and finite-size effects. Our work is a promising step toward long-distance secure communication and is greatly compatible with future global quantum network. △ Less

Submitted 5 November, 2021; originally announced November 2021.

Comments: 4+4 pages, 5 figures

Journal ref: Opt. Lett. 46, 5529 (2021)

arXiv:2110.06534 [pdf, other]

Simple Attention Module based Speaker Verification with Iterative noisy label detection

Authors: Xiaoyi Qin, Na Li, Chao Weng, Dan Su, Ming Li

Abstract: Recently, the attention mechanism such as squeeze-and-excitation module (SE) and convolutional block attention module (CBAM) has achieved great success in deep learning-based speaker verification system. This paper introduces an alternative effective yet simple one, i.e., simple attention module (SimAM), for speaker verification. The SimAM module is a plug-and-play module without extra modal param… ▽ More Recently, the attention mechanism such as squeeze-and-excitation module (SE) and convolutional block attention module (CBAM) has achieved great success in deep learning-based speaker verification system. This paper introduces an alternative effective yet simple one, i.e., simple attention module (SimAM), for speaker verification. The SimAM module is a plug-and-play module without extra modal parameters. In addition, we propose a noisy label detection method to iteratively filter out the data samples with a noisy label from the training data, considering that a large-scale dataset labeled with human annotation or other automated processes may contain noisy labels. Data with the noisy label may over parameterize a deep neural network (DNN) and result in a performance drop due to the memorization effect of the DNN. Experiments are conducted on VoxCeleb dataset. The speaker verification model with SimAM achieves the 0.675% equal error rate (EER) on VoxCeleb1 original test trials. Our proposed iterative noisy label detection method further reduces the EER to 0.643%. △ Less

Submitted 13 October, 2021; originally announced October 2021.

Comments: submitted to ICASSP2022

Showing 1–50 of 81 results for author: Weng, C