Showing 1–15 of 15 results for author: Tu, M

  1. arXiv:2404.06674  [pdf, other]

    cs.SD cs.AI eess.AS

    VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing

    Authors: Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, Mingbo Ma

    Abstract: We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion…

    Submitted 11 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  2. arXiv:2305.18566  [pdf]

    astro-ph.IM eess.SP

    The Scientific Investigation of Unidentified Aerial Phenomena (UAP) Using Multimodal Ground-Based Observatories

    Authors: Wesley Andrés Watters, Abraham Loeb, Frank Laukien, Richard Cloete, Alex Delacroix, Sergei Dobroshinsky, Benjamin Horvath, Ezra Kelderman, Sarah Little, Eric Masson, Andrew Mead, Mitch Randall, Forrest Schultz, Matthew Szenher, Foteini Vervelidou, Abigail White, Angelique Ahlström, Carol Cleland, Spencer Dockal, Natasha Donahue, Mark Elowitz, Carson Ezell, Alex Gersznowicz, Nicholas Gold, Michael G. Hercz, et al. (13 additional authors not shown)

    Abstract: (Abridged) Unidentified Aerial Phenomena (UAP) have resisted explanation and have received little formal scientific attention for 75 years. A primary objective of the Galileo Project is to build an integrated software and instrumentation system designed to conduct a multimodal census of aerial phenomena and to recognize anomalies. Here we present key motivations for the study of UAP and address hi…

    Submitted 31 May, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

    Comments: This paper is published in the Journal of Astronomical Instrumentation, 12(1), 2340006 (2023) https://doi.org/10.1142/S2251171723400068

    Journal ref: Journal of Astronomical Instrumentation, 12(1), 2340006 (2023)

  3. arXiv:2305.18551  [pdf]

    astro-ph.IM cs.SD eess.AS

    Multi-Band Acoustic Monitoring of Aerial Signatures

    Authors: Andrew Mead, Sarah Little, Paul Sail, Michelle Tu, Wesley Andrés Watters, Abigail White, Richard Cloete

    Abstract: The Galileo Project's acoustic monitoring, omni-directional system (AMOS) aids in the detection and characterization of aerial phenomena. It uses a multi-band microphone suite spanning infrasonic to ultrasonic frequencies, providing an independent signal modality for validation and characterization of detected objects. The system utilizes infrasonic, audible, and ultrasonic systems to cover a wide…

    Submitted 29 May, 2023; originally announced May 2023.

    Journal ref: Journal of Astronomical Instrumentation, 12(1), 2340005 (2023)

  4. arXiv:2305.15719  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Efficient Neural Music Generation

    Authors: Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang

    Abstract: Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modelings. Yet, sampling with the MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for a real…

    Submitted 25 May, 2023; originally announced May 2023.

  5. arXiv:2305.11576  [pdf, other]

    eess.AS cs.CL cs.SD

    Language-universal phonetic encoder for low-resource speech recognition

    Authors: Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

    Abstract: Multilingual training is effective in improving low-resource ASR, which may partially be explained by phonetic representation sharing between languages. In end-to-end (E2E) ASR systems, graphemes are often used as basic modeling units, however graphemes may not be ideal for multilingual phonetic sharing. In this paper, we leverage International Phonetic Alphabet (IPA) based language-universal phon…

    Submitted 19 May, 2023; originally announced May 2023.

    Comments: Accepted for publication in INTERSPEECH 2023

  6. arXiv:2305.11569  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

    Authors: Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

    Abstract: We improve low-resource ASR by integrating the ideas of multilingual training and self-supervised learning. Concretely, we leverage an International Phonetic Alphabet (IPA) multilingual model to create frame-level pseudo labels for unlabeled speech, and use these pseudo labels to guide hidden-unit BERT (HuBERT) based speech pretraining in a phonetically-informed manner. The experiments on the Mult…

    Submitted 19 May, 2023; originally announced May 2023.

    Comments: Accepted for publication in INTERSPEECH 2023

  7. arXiv:2301.00066  [pdf, other]

    cs.CL eess.AS

    Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition

    Authors: Yukun Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

    Abstract: Recent studies have shown that using an external Language Model (LM) benefits the end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. The long-tail prediction problems have been widely studied in many applications, but only been addressed by a few studies for ASR and LMs. In this paper, we propose a n…

    Submitted 30 December, 2022; originally announced January 2023.

    Comments: Submitted to ICASSP 2023

  8. arXiv:2210.15158  [pdf, other]

    eess.AS cs.SD

    Streaming Voice Conversion Via Intermediate Bottleneck Features And Non-streaming Teacher Guidance

    Authors: Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang, Qiao Tian, Yuping Wang, Yuxuan Wang

    Abstract: Streaming voice conversion (VC) is the task of converting the voice of one person to another in real-time. Previous streaming VC methods use phonetic posteriorgrams (PPGs) extracted from automatic speech recognition (ASR) systems to represent speaker-independent information. However, PPGs lack the prosody and vocalization information of the source speaker, and streaming PPGs contain undesired leak…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: The paper has been submitted to ICASSP2023

  9. arXiv:2110.03347  [pdf, ps, other]

    eess.AS cs.HC cs.SD

    Cloning one's voice using very limited data in the wild

    Authors: Dongyang Dai, Yuanzhe Chen, Li Chen, Ming Tu, Lu Liu, Rui Xia, Qiao Tian, Yuping Wang, Yuxuan Wang

    Abstract: With the increasing popularity of speech synthesis products, the industry has put forward more requirements for personalized speech synthesis: (1) How to use low-resource, easily accessible data to clone a person's voice. (2) How to clone a person's voice while controlling the style and prosody. To solve the above two problems, we proposed the Hieratron model framework in which the prosody and tim…

    Submitted 8 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

  10. arXiv:1911.01533  [pdf, other]

    eess.AS cs.LG cs.SD

    Speaker-invariant Affective Representation Learning via Adversarial Training

    Authors: Haoqi Li, Ming Tu, Jing Huang, Shrikanth Narayanan, Panayiotis Georgiou

    Abstract: Representation learning for speech emotion recognition is challenging due to labeled data sparsity issue and lack of gold standard references. In addition, there is much variability from input speech signals, human subjective perception of the signals and emotion label ambiguity. In this paper, we propose a machine learning framework to obtain speech emotion representations by limiting the effect…

    Submitted 12 August, 2021; v1 submitted 4 November, 2019; originally announced November 2019.

    Comments: Accepted by ICASSP 2020; 5 pages

  11. arXiv:1904.07386  [pdf, other]

    eess.AS cs.CL cs.SD

    I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences

    Authors: Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Hitoshi Yamamoto, Koji Okabe, Ville Vestman, Jing Huang, Guohong Ding, Hanwu Sun, Anthony Larcher, Rohan Kumar Das, Haizhou Li, Mickael Rouvier, Pierre-Michel Bousquet, Wei Rao, Qing Wang, Chunlei Zhang, Fahimeh Bahmaninezhad, Hector Delgado, Jose Patino, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Koichi Shinoda, et al. (21 additional authors not shown)

    Abstract: The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U consortium into NIST SRE series of evaluation. The primary objective of the current paper is to summarize the res…

    Submitted 15 April, 2019; originally announced April 2019.

    Comments: 5 pages

  12. arXiv:1903.09606  [pdf, other]

    eess.AS cs.SD

    Towards adversarial learning of speaker-invariant representation for speech emotion recognition

    Authors: Ming Tu, Yun Tang, Jing Huang, Xiaodong He, Bowen Zhou

    Abstract: Speech emotion recognition (SER) has attracted great attention in recent years due to the high demand for emotionally intelligent speech interfaces. Deriving speaker-invariant representations for speech emotion recognition is crucial. In this paper, we propose to apply adversarial training to SER to learn speaker-invariant representations. Our model consists of three parts: a representation learni…

    Submitted 22 March, 2019; originally announced March 2019.

  13. arXiv:1807.01738  [pdf, other]

    eess.AS cs.SD

    Investigating the role of L1 in automatic pronunciation evaluation of L2 speech

    Authors: Ming Tu, Anna Grabek, Julie Liss, Visar Berisha

    Abstract: Automatic pronunciation evaluation plays an important role in pronunciation training and second language education. This field draws heavily on concepts from automatic speech recognition (ASR) to quantify how close the pronunciation of non-native speech is to native-like pronunciation. However, it is known that the formation of accent is related to pronunciation patterns of both the target languag…

    Submitted 4 July, 2018; originally announced July 2018.

    Comments: To appear in Interspeech 2018

  14. arXiv:1804.10325  [pdf, other]

    eess.AS

    Simulating dysarthric speech for training data augmentation in clinical speech applications

    Authors: Yishan Jiao, Ming Tu, Visar Berisha, Julie Liss

    Abstract: Training machine learning algorithms for speech applications requires large, labeled training data sets. This is problematic for clinical applications where obtaining such data is prohibitively expensive because of privacy concerns or lack of access. As a result, clinical speech applications are typically developed using small data sets with only tens of speakers. In this paper, we propose a metho…

    Submitted 26 April, 2018; originally announced April 2018.

    Comments: Will appear in Proc. of ICASSP 2018

  15. arXiv:1804.08663  [pdf, other]

    eess.AS cs.SD

    A Discriminative Acoustic-Prosodic Approach for Measuring Local Entrainment

    Authors: Megan M. Willi, Stephanie A. Borrie, Tyson S. Barrett, Ming Tu, Visar Berisha

    Abstract: Acoustic-prosodic entrainment describes the tendency of humans to align or adapt their speech acoustics to each other in conversation. This alignment of spoken behavior has important implications for conversational success. However, modeling the subtle nature of entrainment in spoken dialogue continues to pose a challenge. In this paper, we propose a straightforward definition for local entrainmen…

    Submitted 12 July, 2018; v1 submitted 23 April, 2018; originally announced April 2018.