Showing 1–50 of 81 results for author: Weng, C

  1. arXiv:2404.04947  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Gull: A Generative Multifunctional Audio Codec

    Authors: Yi Luo, Jianwei Yu, Hangting Chen, Rongzhi Gu, Chao Weng

    Abstract: We introduce Gull, a generative multifunctional audio codec. Gull is a general purpose neural audio compression and decompression model which can be applied to a wide range of tasks and applications such as real-time communication, audio super-resolution, and codec language models. The key components of Gull include (1) universal-sample-rate modeling via subband modeling schemes motivated by recen…

    Submitted 7 June, 2024; v1 submitted 7 April, 2024; originally announced April 2024.

    Comments: Demo page: https://yluo42.github.io/Gull/

  2. arXiv:2404.01784  [pdf, other]

    cs.IT eess.SP

    Learning-Based Joint Beamforming and Antenna Movement Design for Movable Antenna Systems

    Authors: Caihao Weng, Yuanbin Chen, Lipeng Zhu, Ying Wang

    Abstract: In this paper, we investigate a multi-receiver communication system enabled by movable antennas (MAs). Specifically, the transmit beamforming and the double-side antenna movement at the transceiver are jointly designed to maximize the sum-rate of all receivers under imperfect channel state information (CSI). Since the formulated problem is non-convex with highly coupled variables, conventional opt…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 13 pages, 5 figures

  3. arXiv:2401.09047  [pdf, other]

    cs.CV

    VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

    Authors: Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan

    Abstract: Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using…

    Submitted 17 January, 2024; originally announced January 2024.

    Comments: Homepage: https://ailab-cvc.github.io/videocrafter; Github: https://github.com/AILab-CVC/VideoCrafter

  4. arXiv:2401.06791  [pdf, other]

    cs.IR cs.AI cs.CL

    A Span-based Model for Extracting Overlapping PICO Entities from RCT Publications

    Authors: Gongbo Zhang, Yiliang Zhou, Yan Hu, Hua Xu, Chunhua Weng, Yifan Peng

    Abstract: Objectives: Extraction of PICO (Populations, Interventions, Comparison, and Outcomes) entities is fundamental to evidence retrieval. We present a novel method PICOX to extract overlapping PICO entities. Materials and Methods: PICOX first identifies entities by assessing whether a word marks the beginning or conclusion of an entity. Then it uses a multi-label classifier to assign one or more PICO l…

    Submitted 7 January, 2024; originally announced January 2024.

  5. arXiv:2312.15463  [pdf, other]

    eess.AS cs.SD

    Consistent and Relevant: Rethink the Query Embedding in General Sound Separation

    Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Jianwei Yu, Chao Weng, Zhiyong Wu, Helen Meng

    Abstract: Query-based audio separation usually employs specific queries to extract target sources from a mixture of audio signals. Currently, most query-based separation models need additional networks to obtain query embedding. In this way, the separation model is optimized to be adapted to the distribution of query embedding. However, query embedding may exhibit mismatches with separation models due to in…

    Submitted 24 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  6. arXiv:2312.15320  [pdf]

    q-bio.QM cs.CV cs.LG cs.MM q-bio.GN

    GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts

    Authors: Da Wu, Jingye Yang, Cong Liu, Tzung-Chien Hsieh, Elaine Marchi, Justin Blair, Peter Krawitz, Chunhua Weng, Wendy Chung, Gholson J. Lyon, Ian D. Krantz, Jennifer M. Kalish, Kai Wang

    Abstract: Individuals with suspected rare genetic disorders often undergo multiple clinical evaluations, imaging studies, laboratory tests and genetic tests, to find a possible answer over a prolonged period of time. Addressing this "diagnostic odyssey" thus has substantial clinical, psychosocial, and economic benefits. Many rare genetic diseases have distinctive facial features, which can be used by artifi…

    Submitted 21 April, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

    Comments: Significant revisions

  7. arXiv:2311.11211  [pdf]

    cs.AI

    Leveraging Generative AI for Clinical Evidence Summarization Needs to Ensure Trustworthiness

    Authors: Gongbo Zhang, Qiao Jin, Denis Jered McInerney, Yong Chen, Fei Wang, Curtis L. Cole, Qian Yang, Yanshan Wang, Bradley A. Malin, Mor Peleg, Byron C. Wallace, Zhiyong Lu, Chunhua Weng, Yifan Peng

    Abstract: Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, ho…

    Submitted 31 March, 2024; v1 submitted 18 November, 2023; originally announced November 2023.

  8. arXiv:2310.20323  [pdf, other]

    cs.CV cs.AI cs.GR cs.HC

    SemanticBoost: Elevating Motion Generation with Augmented Textual Cues

    Authors: Xin He, Shaoli Huang, Xiaohang Zhan, Chao Weng, Ying Shan

    Abstract: Current techniques face difficulties in generating motions from intricate semantic descriptions, primarily due to insufficient semantic annotations in datasets and weak contextual understanding. To address these issues, we present SemanticBoost, a novel framework that tackles both challenges simultaneously. Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser…

    Submitted 28 November, 2023; v1 submitted 31 October, 2023; originally announced October 2023.

  9. arXiv:2310.19512  [pdf, other]

    cs.CV

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Authors: Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan

    Abstract: Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video…

    Submitted 30 October, 2023; originally announced October 2023.

    Comments: Tech Report; Github: https://github.com/AILab-CVC/VideoCrafter Homepage: https://ailab-cvc.github.io/videocrafter/

  10. arXiv:2310.14864  [pdf, other]

    cs.LG

    Diverse Priors for Deep Reinforcement Learning

    Authors: Chenfan Weng, Zhongguo Li

    Abstract: In Reinforcement Learning (RL), agents aim at maximizing cumulative rewards in a given environment. During the learning process, RL agents face the dilemma of exploitation and exploration: leveraging existing knowledge to acquire rewards or seeking potentially higher ones. Using uncertainty as a guiding principle provides an active and effective approach to solving this dilemma and ensemble-based…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: 8 pages, 4 figures

  11. arXiv:2309.12792  [pdf, other]

    eess.AS cs.SD

    DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis

    Authors: Yu Gu, Yianrao Bian, Guangzhi Lei, Chao Weng, Dan Su

    Abstract: This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, an auto-regressive model structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model is adopted. Meanwhile the proposed Du…

    Submitted 22 September, 2023; originally announced September 2023.

  12. arXiv:2309.07803  [pdf, other]

    eess.AS cs.SD

    SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias

    Authors: Sipan Li, Songxiang Liu, Luwen Zhang, Xiang Li, Yanyao Bian, Chao Weng, Zhiyong Wu, Helen Meng

    Abstract: Generative adversarial network (GAN)-based neural vocoders have been widely used in audio synthesis tasks due to their high generation quality, efficient inference, and small computation footprint. However, it is still challenging to train a universal vocoder which can generalize well to out-of-domain (OOD) scenarios, such as unseen speaking styles, non-speech vocalization, singing, and musical pi…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted by ICME 2023

  13. arXiv:2309.07757  [pdf, other]

    eess.AS cs.SD

    Complexity Scaling for Speech Denoising

    Authors: Hangting Chen, Jianwei Yu, Chao Weng

    Abstract: Computational complexity is critical when deploying deep learning-based speech denoising models for on-device applications. Most prior research focused on optimizing model architectures to meet specific computational cost constraints, often creating distinct neural network architectures for different complexity limitations. This study conducts complexity scaling for speech denoising tasks, aiming…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP2024

  14. Stringesthesia: Dynamically Shifting Musical Agency Between Audience and Performer Based on Trust in an Interactive and Improvised Performance

    Authors: Torin Hopkins, Emily Doherty, Netta Ofer, Suibi Che Chuan Weng, Peter Gyrory, Chad Tobin, Leanne Hirshfield, Ellen Yi-Luen Do

    Abstract: This paper introduces Stringesthesia, an interactive and improvised performance paradigm. Stringesthesia uses real-time neuroimaging to connect performers and audiences, enabling direct access to the performers' mental state and determining audience participation during the performance. Functional near-infrared spectroscopy, or fNIRS, a noninvasive neuroimaging tool, was used to assess metabolic ac…

    Submitted 11 September, 2023; originally announced September 2023.

    Journal ref: Audio Mostly 2023, Edinburgh, UK

  15. arXiv:2309.00842  [pdf, other]

    cs.HC

    DualStream: Spatially Sharing Selves and Surroundings using Mobile Devices and Augmented Reality

    Authors: Rishi Vanukuru, Suibi Che-Chuan Weng, Krithik Ranjan, Torin Hopkins, Amy Banic, Mark D. Gross, Ellen Yi-Luen Do

    Abstract: In-person human interaction relies on our spatial perception of each other and our surroundings. Current remote communication tools partially address each of these aspects. Video calls convey real user representations but without spatial interactions. Augmented and Virtual Reality (AR/VR) experiences are immersive and spatial but often use virtual environments and characters instead of real-life r…

    Submitted 2 September, 2023; originally announced September 2023.

    Comments: 10 pages, 4 figures, 1 table; To appear in the proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2023

  16. arXiv:2308.14553  [pdf, other]

    eess.AS cs.SD

    Rep2wav: Noise Robust text-to-speech Using self-supervised representations

    Authors: Qiushi Zhu, Yu Gu, Rilin Chen, Chao Weng, Yuchen Hu, Lirong Dai, Jie Zhang

    Abstract: Benefiting from the development of deep learning, text-to-speech (TTS) techniques using clean speech have achieved significant performance improvements. The data collected from real scenes often contains noise and generally needs to be denoised by speech enhancement models. Noise-robust TTS models are often trained using the enhanced speech, which thus suffer from speech distortion and background…

    Submitted 3 September, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: 5 pages, 2 figures

  17. Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression

    Authors: Hangting Chen, Jianwei Yu, Yi Luo, Rongzhi Gu, Weihua Li, Zhuocheng Lu, Chao Weng

    Abstract: Echo cancellation and noise reduction are essential for full-duplex communication, yet most existing neural networks have high computational costs and are inflexible in tuning model complexity. In this paper, we introduce time-frequency dual-path compression to achieve a wide range of compression ratios on computational cost. Specifically, for frequency compression, trainable filters are used to r…

    Submitted 10 October, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

    Comments: Proceedings of INTERSPEECH

  18. arXiv:2308.10107  [pdf, other]

    cs.CL

    Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

    Authors: Jinchuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu, Shinji Watanabe

    Abstract: Automatic speech recognition (ASR) based on transducers is widely used. In training, a transducer maximizes the summed posteriors of all paths. The path with the highest posterior is commonly defined as the predicted alignment between the speech and the transcription. While the vanilla transducer does not have a prior preference for any of the valid paths, this work intends to enforce the preferre…

    Submitted 19 August, 2023; originally announced August 2023.

    Journal ref: Interspeech 2023

  19. arXiv:2308.08660  [pdf, other]

    cs.CL

    Large Language Models for Granularized Barrett's Esophagus Diagnosis Classification

    Authors: Jenna Kefeli, Ali Soroush, Courtney J. Diamond, Haley M. Zylberberg, Benjamin May, Julian A. Abrams, Chunhua Weng, Nicholas Tatonetti

    Abstract: Diagnostic codes for Barrett's esophagus (BE), a precursor to esophageal cancer, lack granularity and precision for many research or clinical use cases. Laborious manual chart review is required to extract key diagnostic phenotypes from BE pathology reports. We developed a generalizable transformer-based method to automate data extraction. Using pathology reports from Columbia University Irving Me…

    Submitted 16 August, 2023; originally announced August 2023.

  20. arXiv:2308.06294  [pdf]

    q-bio.QM cs.AI

    Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT

    Authors: Jingye Yang, Cong Liu, Wendy Deng, Da Wu, Chunhua Weng, Yunyun Zhou, Kai Wang

    Abstract: We hypothesize that large language models (LLMs) based on the transformer architecture can enable automated detection of clinical phenotype terms, including terms not documented in the HPO. In this study, we developed two types of models: PhenoBCBERT, a BERT-based model, utilizing Bio+Clinical BERT as its pre-trained model, and PhenoGPT, a GPT-based model that can be initialized from diverse GPT m…

    Submitted 9 November, 2023; v1 submitted 10 August, 2023; originally announced August 2023.

  21. arXiv:2307.13468  [pdf, other]

    cs.IR cs.LG

    Gaussian Graph with Prototypical Contrastive Learning in E-Commerce Bundle Recommendation

    Authors: Zhao-Yang Liu, Liucheng Sun, Chenwei Weng, Qi** Chen, Chengfu Huo

    Abstract: Bundle recommendation aims to provide a bundle of items to satisfy the user preference on e-commerce platform. Existing successful solutions are based on the contrastive graph learning paradigm where graph neural networks (GNNs) are employed to learn representations from user-level and bundle-level graph views with a contrastive learning module to enhance the cooperative association between differ…

    Submitted 25 July, 2023; originally announced July 2023.

  22. arXiv:2307.06940  [pdf, other]

    cs.CV

    Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

    Authors: Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen

    Abstract: Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to utilize the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances. We achieve this by developing a framework comprised of two functi…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: Github: https://github.com/VideoCrafter/Animate-A-Story Project page: https://videocrafter.github.io/Animate-A-Story

  23. arXiv:2305.19269  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Make-A-Voice: Unified Voice Synthesis With Discrete Representation

    Authors: Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu

    Abstract: Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, the majority of voice synthesis models currently rely on annotated audio data, but it is crucial to scale them to self-supervised datasets in order to effectively capture the wide range of acoustic variations present in human voice, including speak…

    Submitted 30 May, 2023; originally announced May 2023.

  24. arXiv:2305.16749  [pdf, other]

    cs.SD eess.AS

    Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model

    Authors: Xiang Li, Songxiang Liu, Max W. Y. Lam, Zhiyong Wu, Chao Weng, Helen Meng

    Abstract: Expressive human speech generally abounds with rich and flexible speech prosody variations. The speech prosody predictors in existing expressive speech synthesis methods mostly produce deterministic predictions, which are learned by directly minimizing the norm of prosody prediction error. Its unimodal nature leads to a mismatch with ground truth distribution and harms the model's ability in makin…

    Submitted 7 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Proceedings of Interspeech 2023 (doi: 10.21437/Interspeech.2023-715), demo site at https://thuhcsi.github.io/interspeech2023-DiffVar/

  25. arXiv:2305.02765  [pdf, other]

    cs.SD eess.AS

    HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec

    Authors: Dongchao Yang, Songxiang Liu, Rongjie Huang, Jinchuan Tian, Chao Weng, Yuexian Zou

    Abstract: Audio codec models are widely used in audio communication as a crucial technique for compressing audio into discrete representations. Nowadays, audio codec models are increasingly utilized in generation fields as intermediate representations. For instance, AudioLM is an audio generation model that uses the discrete representation of SoundStream as a training target, while VALL-E employs the Encode…

    Submitted 7 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

    Comments: The second version of HiFi-Codec

  26. Experimental quantum secret sharing based on phase encoding of coherent states

    Authors: Ao Shen, Xiao-Yu Cao, Yang Wang, Yao Fu, Jie Gu, Wen-Bo Liu, Chen-Xun Weng, Hua-Lei Yin, Zeng-Bing Chen

    Abstract: Quantum secret sharing (QSS) is one of the basic communication primitives in future quantum networks which addresses part of the basic cryptographic tasks of multiparty communication and computation. Nevertheless, it is a challenge to provide a practical QSS protocol with security against general attacks. A QSS protocol that balances security and practicality is still lacking. Here, we propose a Q…

    Submitted 27 March, 2023; v1 submitted 26 March, 2023; originally announced March 2023.

    Comments: 10 pages, 5 figures, 3 tables, accepted by Sci. China-Phys. Mech. Astron

    Journal ref: Sci. China-Phys. Mech. Astron. 66, 260311 (2023)

  27. Advantages of Asynchronous Measurement-Device-Independent Quantum Key Distribution in Intercity Networks

    Authors: Yuan-Mei Xie, Jun-Lin Bai, Yu-Shuo Lu, Chen-Xun Weng, Hua-Lei Yin, Zeng-Bing Chen

    Abstract: The new variant of measurement-device-independent quantum key distribution (MDI-QKD), called asynchronous MDI-QKD or mode-pairing MDI-QKD, offers similar repeater-like rate-loss scaling but has the advantage of simple technology implementation by exploiting an innovative post-measurement pairing technique. We herein present an evaluation of the practical aspects of decoy-state asynchronous MDI-QKD…

    Submitted 24 July, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: 15 pages, 4 figures

    Journal ref: Phys. Rev. Applied 19, 054070 (2023)

  28. arXiv:2302.08504  [pdf, other]

    cs.CV cs.GR

    PersonNeRF: Personalized Reconstruction from Photo Collections

    Authors: Chung-Yi Weng, Pratul P. Srinivasan, Brian Curless, Ira Kemelmacher-Shlizerman

    Abstract: We present PersonNeRF, a method that takes a collection of photos of a subject (e.g. Roger Federer) captured across multiple years with arbitrary body poses and appearances, and enables rendering the subject with arbitrary novel combinations of viewpoint, body pose, and appearance. PersonNeRF builds a customized neural volumetric 3D model of the subject that is able to render an entire space spann…

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: Project Page: https://grail.cs.washington.edu/projects/personnerf/

  29. arXiv:2301.13662  [pdf, other]

    cs.SD eess.AS

    InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

    Authors: Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, Helen Meng

    Abstract: Expressive text-to-speech (TTS) aims to synthesize different speaking style speech according to human's demands. Nowadays, there are two common ways to control speaking styles: (1) Pre-defining a group of speaking style and using categorical index to denote different speaking style. However, there are limitations in the diversity of expressiveness, as these models can only generate the pre-defined…

    Submitted 25 June, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

    Comments: Submit to TASLP

  30. arXiv:2212.03080  [pdf, other]

    cs.LG cs.CR cs.IT

    Straggler-Resilient Differentially-Private Decentralized Learning

    Authors: Yauhen Yakimenka, Chung-Wei Weng, Hsuan-Yin Lin, Eirik Rosnes, Jörg Kliewer

    Abstract: We consider the straggler problem in decentralized learning over a logical ring while preserving user data privacy. Especially, we extend the recently proposed framework of differential privacy (DP) amplification by decentralization by Cyffers and Bellet to include overall training latency--comprising both computation and communication latency. Analytical results on both the convergence speed and…

    Submitted 28 June, 2024; v1 submitted 6 December, 2022; originally announced December 2022.

    Comments: To appear in the IEEE Journal on Selected Areas in Information Theory (special issue on Information-Theoretic Methods for Trustworthy and Reliable Machine Learning)

  31. arXiv:2211.02448  [pdf, other]

    cs.SD eess.AS

    NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS

    Authors: Dongchao Yang, Songxiang Liu, Jianwei Yu, Helin Wang, Chao Weng, Yuexian Zou

    Abstract: Expressive text-to-speech (TTS) can synthesize a new speaking style by imitating prosody and timbre from a reference audio, which faces the following challenges: (1) The highly dynamic prosody information in the reference audio is difficult to extract, especially, when the reference audio contains background noise. (2) The TTS systems should have good generalization for unseen speaking styles. In t…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP2023

  32. arXiv:2210.07499  [pdf, other]

    cs.CL cs.SD eess.AS

    Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

    Authors: Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

    Abstract: Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in multiple seq2seq tasks. Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target…

    Submitted 31 January, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Journal ref: International Conference on Learning Representations (ICLR), 2023

  33. arXiv:2210.05092  [pdf, other]

    cs.SD eess.AS

    The DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022

    Authors: Xiaoyi Qin, Na Li, Yuke Lin, Yiwei Ding, Chao Weng, Dan Su, Ming Li

    Abstract: This paper is the system description of the DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC22). In this challenge, we focus on track1 and track3. For track1, multiple backbone networks are adopted to extract frame-level features. Since track1 focuses on the cross-age scenarios, we adopt the cross-age trials and perform QMF to calibrate score. The magnitude-based qualit…

    Submitted 10 October, 2022; originally announced October 2022.

  34. arXiv:2207.09983  [pdf, other]

    cs.SD cs.AI eess.AS

    Diffsound: Discrete Diffusion Model for Text-to-sound Generation

    Authors: Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu

    Abstract: Generating sound effects that humans want is an important topic. However, there are few studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder. The framework first uses t…

    Submitted 28 April, 2023; v1 submitted 20 July, 2022; originally announced July 2022.

    Comments: Accepted by TASLP2022

  35. arXiv:2207.05929  [pdf, other]

    eess.AS cs.SD

    Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings

    Authors: Xiaoyi Qin, Na Li, Chao Weng, Dan Su, Ming Li

    Abstract: Automatic speaker verification has achieved remarkable progress in recent years. However, there is little research on cross-age speaker verification (CASV) due to insufficient relevant data. In this paper, we mine cross-age test sets based on the VoxCeleb dataset and propose our age-invariant speaker representation (AISR) learning method. Since the VoxCeleb is collected from the YouTube platform, t…

    Submitted 12 July, 2022; originally announced July 2022.

    Comments: Accepted by Interspeech2022

  36. Beating the fault-tolerance bound and security loopholes for Byzantine agreement with a quantum solution

    Authors: Chen-Xun Weng, Rui-Qi Gao, Yu Bao, Bing-Hong Li, Wen-Bo Liu, Yuan-Mei Xie, Yu-Shuo Lu, Hua-Lei Yin, Zeng-Bing Chen

    Abstract: Byzantine agreement, the underlying core of blockchain, aims to make every node in a decentralized network reach consensus. Classical Byzantine agreements unavoidably face two major problems. One is $1/3$ fault-tolerance bound, which means that the system to tolerate $f$ malicious players requires at least $3f+1$ players. The other is the security loopholes from its classical cryptography methods.…

    Submitted 22 November, 2023; v1 submitted 18 June, 2022; originally announced June 2022.

    Comments: 21 pages, 7 figures. All comments are welcome!

    Journal ref: Research 6, 0272 (2023)

  37. arXiv:2206.02093  [pdf, other]

    cs.CL cs.AI

    LAE: Language-Aware Encoder for Monolingual and Multilingual ASR

    Authors: Jinchuan Tian, Jianwei Yu, Chunlei Zhang, Chao Weng, Yuexian Zou, Dong Yu

    Abstract: Despite the rapid progress in automatic speech recognition (ASR) research, recognizing multilingual speech using a unified ASR system remains highly challenging. Previous works on multilingual speech recognition mainly focus on two directions: recognizing multiple monolingual speech or recognizing code-switched speech that uses different languages interchangeably within a single utterance. However…

    Submitted 5 June, 2022; originally announced June 2022.

  38. arXiv:2204.00821  [pdf, other]

    cs.SD eess.AS

    Improving Target Sound Extraction with Timestamp Information

    Authors: Helin Wang, Dongchao Yang, Chao Weng, Jianwei Yu, Yuexian Zou

    Abstract: Target sound extraction (TSE) aims to extract the sound part of a target sound event class from a mixture audio with multiple sound events. The previous works mainly focus on the problems of weakly-labelled data, jointly learning and new classes, however, no one cares about the onset and offset times of the target sound event, which has been emphasized in the auditory scene analysis. In this pap…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: submitted to interspeech2022

  39. arXiv:2203.15614  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Lattice-Free MMI into End-to-End Speech Recognition

    Authors: Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu

    Abstract: In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, the adoption of discriminative criteria is promising to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems.…

    Submitted 22 August, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022

  40. arXiv:2203.03539  [pdf, other]

    cs.CL cs.LG stat.ML

    Understanding The Robustness of Self-supervised Learning Through Topic Modeling

    Authors: Zeping Luo, Shiyou Wu, Cindy Weng, Mo Zhou, Rong Ge

    Abstract: Self-supervised learning has significantly improved the performance of many NLP tasks. However, how can self-supervised learning discover useful representations, and why is it better than traditional approaches such as probabilistic models are still largely unknown. In this paper, we focus on the context of topic modeling and highlight a key advantage of self-supervised learning - when applied to…

    Submitted 27 February, 2023; v1 submitted 2 February, 2022; originally announced March 2022.

    Comments: Accepted at ICLR 2023. Camera ready version

  41. arXiv:2202.01986  [pdf, other]

    eess.AS cs.SD

    The CUHK-TENCENT speaker diarization system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge

    Authors: Naijun Zheng, Na Li, Xixin Wu, Lingwei Meng, Jiawen Kang, Haibin Wu, Chao Weng, Dan Su, Helen Meng

    Abstract: This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks. In these meeting scenarios, the uncertainty of the speaker number and the high ratio of overlapped speech present great challenges for d…

    Submitted 4 February, 2022; originally announced February 2022.

    Comments: submitted to ICASSP2022

  42. arXiv:2201.04127  [pdf, other]

    cs.CV cs.GR

    HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video

    Authors: Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, Ira Kemelmacher-Shlizerman

    Abstract: We introduce a free-viewpoint rendering method -- HumanNeRF -- that works on a given monocular video of a human performing complex body motions, e.g. a video from YouTube. Our method enables pausing the video at any frame and rendering the subject from arbitrary new camera viewpoints or even a full 360-degree camera path for that particular frame and body pose. This task is particularly challengin…

    Submitted 14 June, 2022; v1 submitted 11 January, 2022; originally announced January 2022.

    Comments: CVPR 2022 (oral). Project page with videos: https://grail.cs.washington.edu/projects/humannerf/

  43. arXiv:2201.01995  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

    Authors: Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu

    Abstract: Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), it has been shown that incorporating external language models (LMs) into the decoding can further improve the recognition performance of E2E ASR systems. To align with the modeling units adopted in E2E ASR systems, subword-level (e.g., characters, BPE) LMs are usually used to cooperate with current E2E ASR systems.…

    Submitted 6 January, 2022; originally announced January 2022.

    Comments: 5 pages, 1 figure

  44. arXiv:2112.11635  [pdf, other]

    quant-ph cs.CR physics.optics

    Breaking the Rate-Loss Bound of Quantum Key Distribution with Asynchronous Two-Photon Interference

    Authors: Yuan-Mei Xie, Yu-Shuo Lu, Chen-Xun Weng, Xiao-Yu Cao, Zhao-Ying Jia, Yu Bao, Yang Wang, Yao Fu, Hua-Lei Yin, Zeng-Bing Chen

    Abstract: Twin-field quantum key distribution can overcome the secret key capacity of repeaterless quantum key distribution via single-photon interference. However, to compensate for the channel fluctuations and lock the laser fluctuations, the techniques of phase tracking and phase locking are indispensable in experiment, which drastically increase experimental complexity and hinder free-space realization.…

    Submitted 26 April, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

    Comments: 15 pages, 10 figures. arXiv admin note: text overlap with arXiv:2112.11165

    Journal ref: PRX Quantum 3, 020315 (2022)

  45. arXiv:2112.11165  [pdf, other]

    quant-ph cs.CR cs.NI physics.optics

    Scalable High-Rate Twin-Field Quantum Key Distribution Networks without Constraint of Probability and Intensity

    Authors: Yuan-Mei Xie, Chen-Xun Weng, Yu-Shuo Lu, Yao Fu, Yang Wang, Hua-Lei Yin, Zeng-Bing Chen

    Abstract: Implementation of a twin-field quantum key distribution network faces limitations, including the low tolerance of interference errors for phase-matching type protocols and the strict constraint regarding intensity and probability for sending-or-not-sending type protocols. Here, we propose a two-photon twin-field quantum key distribution protocol and achieve twin-field-type two-photon interference…

    Submitted 9 April, 2023; v1 submitted 21 December, 2021; originally announced December 2021.

    Comments: 17 pages, 6 figures, 3 tables, Accepted for Publication in Phys. Rev. A

    Journal ref: Phys. Rev. A 107, 042603 (2023)

  46. arXiv:2112.10821  [pdf]

    cs.LG

    Natural language processing to identify lupus nephritis phenotype in electronic health records

    Authors: Yu Deng, Jennifer A. Pacheco, Anh Chung, Chengsheng Mao, Joshua C. Smith, Juan Zhao, Wei-Qi Wei, April Barnado, Chunhua Weng, Cong Liu, Adam Cordon, Jingzhi Yu, Yacob Tedla, Abel Kho, Rosalind Ramsey-Goldman, Theresa Walunas, Yuan Luo

    Abstract: Systemic lupus erythematosus (SLE) is a rare autoimmune disorder characterized by an unpredictable course of flares and remission with diverse manifestations. Lupus nephritis, one of the major disease manifestations of SLE for organ damage and mortality, is a key component of lupus classification criteria. Accurately identifying lupus nephritis in electronic health records (EHRs) would therefore b…

    Submitted 20 December, 2021; originally announced December 2021.

  47. arXiv:2112.02498  [pdf, other]

    cs.AI cs.CL

    Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI

    Authors: Jinchuan Tian, Jianwei Yu, Chao Weng, Shi-Xiong Zhang, Dan Su, Dong Yu, Yuexian Zou

    Abstract: Recently, End-to-End (E2E) frameworks have achieved remarkable results on various Automatic Speech Recognition (ASR) tasks. However, Lattice-Free Maximum Mutual Information (LF-MMI), as one of the discriminative training criteria that show superior performance in hybrid ASR systems, is rarely adopted in E2E ASR frameworks. In this work, we propose a novel approach to integrate LF-MMI criterion int…

    Submitted 29 December, 2021; v1 submitted 5 December, 2021; originally announced December 2021.

  48. arXiv:2111.15016  [pdf, other]

    cs.CL cs.SD eess.AS

    Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization

    Authors: Brian Yan, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe, Dong Yu

    Abstract: Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type. In this work, we propose a general framework to jointly model the likelihoods of the monolingual and code-switch sub-tasks that comprise bilingual speech recognition. By defining the monolingual sub-tasks with label-to-frame synchronization, our joint m…

    Submitted 29 November, 2021; originally announced November 2021.

  49. arXiv:2111.03775  [pdf, ps, other]

    quant-ph cs.CR

    Long-distance twin-field quantum key distribution with entangled sources

    Authors: Bing-Hong Li, Yuan-Mei Xie, Zhao Li, Chen-Xun Weng, Chen-Long Li, Hua-Lei Yin, Zeng-Bing Chen

    Abstract: Twin-field quantum key distribution (TFQKD), using single-photon-type interference, offers a way to exceed the rate-distance limit without quantum repeaters. However, it still suffers from the photon losses and dark counts, which impose an ultimate limit on its transmission distance. In this letter, we propose a scheme to implement TFQKD with an entangled coherent state source in the middle to inc…

    Submitted 5 November, 2021; originally announced November 2021.

    Comments: 4+4 pages, 5 figures

    Journal ref: Opt. Lett. 46, 5529 (2021)

  50. arXiv:2110.06534  [pdf, other]

    cs.SD eess.AS

    Simple Attention Module based Speaker Verification with Iterative noisy label detection

    Authors: Xiaoyi Qin, Na Li, Chao Weng, Dan Su, Ming Li

    Abstract: Recently, the attention mechanism such as squeeze-and-excitation module (SE) and convolutional block attention module (CBAM) has achieved great success in deep learning-based speaker verification system. This paper introduces an alternative effective yet simple one, i.e., simple attention module (SimAM), for speaker verification. The SimAM module is a plug-and-play module without extra modal param…

    Submitted 13 October, 2021; originally announced October 2021.

    Comments: submitted to ICASSP2022