Skip to main content

Showing 51–100 of 845 results for author: Watanabe, S

.
  1. arXiv:2401.18045  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition

    Authors: Yihan Wu, Soumi Maiti, Yifan Peng, Wangyou Zhang, Chenda Li, Yuyue Wang, Xihua Wang, Shinji Watanabe, Ruihua Song

    Abstract: Recent advancements in language models have significantly enhanced performance in multiple speech-related tasks. Existing speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model. However, this design omits the intrinsic connections between different speech tasks, which can potentially boost the performance of each task. In this work, we… ▽ More

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: 11 pages, 2 figures

  2. arXiv:2401.17619  [pdf, ps, other

    cs.SD eess.AS

    Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing

    Authors: Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin **, Shinji Watanabe

    Abstract: In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability. This study proposes a unique strategy to address the data scarcity in SVS. We employ an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatu… ▽ More

    Submitted 12 June, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted by Interspeech2024

  3. arXiv:2401.17230  [pdf, other

    cs.SD cs.AI eess.AS

    ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

    Authors: Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

    Abstract: This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also… ▽ More

    Submitted 13 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures, 7 tables, Interspeech 2024

  4. arXiv:2401.16812  [pdf, other

    cs.SD eess.AS

    SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics

    Authors: Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, Hiroshi Saruwatari

    Abstract: While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that are highly correlated with human subjective judgments due to their cost efficiency. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTS… ▽ More

    Submitted 12 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: Accepted by Interspeech 2024. An extended version with Appendix. Code: https://github.com/Takaaki-Saeki/DiscreteSpeechMetrics

  5. arXiv:2401.16658  [pdf, ps, other

    cs.CL eess.AS

    OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

    Authors: Yifan Peng, **chuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

    Abstract: Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder archite… ▽ More

    Submitted 16 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted at INTERSPEECH 2024. Webpage: https://www.wavlab.org/activities/2024/owsm/

  6. arXiv:2401.14965  [pdf, other

    cs.IT

    An Improved Lower Bound on Oblivious Transfer Capacity via Interactive Erasure Emulation

    Authors: So Suda, Shun Watanabe, Haruya Yamaguchi

    Abstract: We revisit the oblivious transfer (OT) capacities of noisy channels against the passive adversary, which have been identified only for a limited class of channels. In the literature, the general construction of oblivious transfer has been known only for generalized erasure channels (GECs); for other channels, we first convert a given channel to a GEC via alphabet extension and erasure emulation, a… ▽ More

    Submitted 26 January, 2024; originally announced January 2024.

    Comments: 6 pages, 2 figures

  7. arXiv:2401.14271  [pdf, other

    eess.AS cs.SD

    Improving Design of Input Condition Invariant Speech Enhancement

    Authors: Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian

    Abstract: Building a single universal speech enhancement (SE) system that can handle arbitrary input is a demanded but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio duration, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model wa… ▽ More

    Submitted 15 February, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024, 5 pages, 2 figures, 3 tables (corrected the results of no processing on CHiME-4 (Simu) in Table 2)

  8. arXiv:2401.12473  [pdf, other

    eess.AS cs.SD

    Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor

    Authors: Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhong-Qiu Wang, Shinji Watanabe

    Abstract: We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations… ▽ More

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: 5 pages, 4 figures, accepted by ICASSP 2024

  9. arXiv:2401.10449  [pdf, other

    eess.AS cs.CL cs.SD

    Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

    Authors: Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or de… ▽ More

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: accepted by ICASSP20224

  10. arXiv:2401.08835  [pdf, other

    cs.CL eess.AS

    Improving ASR Contextual Biasing with Guided Attention

    Authors: Jiyang Tang, Kwangyoun Kim, Suwon Shon, Felix Wu, Prashant Sridhar, Shinji Watanabe

    Abstract: In this paper, we propose a Guided Attention (GA) auxiliary training loss, which improves the effectiveness and robustness of automatic speech recognition (ASR) contextual biasing without introducing additional parameters. A common challenge in previous literature is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases. To addres… ▽ More

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  11. arXiv:2401.06806  [pdf, ps, other

    cs.CL cs.AI

    AugSumm: towards generalizable speech summarization using synthetic labels from large language model

    Authors: Jee-weon Jung, Roshan Sharma, William Chen, Bhiksha Raj, Shinji Watanabe

    Abstract: Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech. Given variations in information captured and phrasing, recordings can be summarized in multiple ways. Therefore, it is more reasonable to consider a probabilistic distribution of all potential summaries rather than a single summary. However, conventional SSUM models are mostly trained and evaluated with a si… ▽ More

    Submitted 10 January, 2024; originally announced January 2024.

    Comments: This work has been submitted to the IEEE ICASSP for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. 5 pages

  12. Solar neutrino measurements using the full data period of Super-Kamiokande-IV

    Authors: Super-Kamiokande Collaboration, :, K. Abe, C. Bronner, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, S. Imaizumi, K. Iyogi, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, Y. Kato, Y. Kishimoto, S. Miki, S. Mine, M. Miura, T. Mochizuki, S. Moriyama, Y. Nagao, M. Nakahata , et al. (305 additional authors not shown)

    Abstract: An analysis of solar neutrino data from the fourth phase of Super-Kamiokande~(SK-IV) from October 2008 to May 2018 is performed and the results are presented. The observation time of the data set of SK-IV corresponds to $2970$~days and the total live time for all four phases is $5805$~days. For more precise solar neutrino measurements, several improvements are applied in this analysis: lowering th… ▽ More

    Submitted 20 February, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: 47 pages, 61 figures

    Journal ref: Phys. Rev. D 109, 092001 (2024)

  13. arXiv:2312.10019  [pdf, other

    cs.IT cs.LG eess.AS

    Understanding Probe Behaviors through Variational Bounds of Mutual Information

    Authors: Kwanghee Choi, Jee-weon Jung, Shinji Watanabe

    Abstract: With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation. Among various interpretability methods, we focus on classification-based linear probing. We aim to foster a solid understanding and provide guidelines for linear probing by constructing a novel mathematical framework leveraging information theory. Fi… ▽ More

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2024, implementation available at https://github.com/juice500ml/information_probing

  14. arXiv:2312.09895  [pdf, other

    cs.CL cs.SD eess.AS

    Generative Context-aware Fine-tuning of Self-supervised Speech Models

    Authors: Suwon Shon, Kwangyoun Kim, Prashant Sridhar, Yi-Te Hsu, Shinji Watanabe, Karen Livescu

    Abstract: When performing tasks like automatic speech recognition or spoken language understanding for a given utterance, access to preceding text or audio provides contextual information can improve performance. Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text. With appropriate prompts, L… ▽ More

    Submitted 15 December, 2023; originally announced December 2023.

  15. Identified charged-hadron production in $p$$+$Al, $^3$He$+$Au, and Cu$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV and in U$+$U collisions at $\sqrt{s_{_{NN}}}=193$ GeV

    Authors: PHENIX Collaboration, N. J. Abdulameer, U. Acharya, A. Adare, C. Aidala, N. N. Ajitanand, Y. Akiba, R. Akimoto, J. Alexander, M. Alfred, V. Andrieux, K. Aoki, N. Apadula, H. Asano, E. T. Atomssa, T. C. Awes, B. Azmoun, V. Babintsev, M. Bai, X. Bai, N. S. Bandara, B. Bannier, K. N. Barish, S. Bathe, V. Baublis , et al. (456 additional authors not shown)

    Abstract: The PHENIX experiment has performed a systematic study of identified charged-hadron ($π^\pm$, $K^\pm$, $p$, $\bar{p}$) production at midrapidity in $p$$+$Al, $^3$He$+$Au, Cu$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV and U$+$U collisions at $\sqrt{s_{_{NN}}}=193$ GeV. Identified charged-hadron invariant transverse-momentum ($p_T$) and transverse-mass ($m_T$) spectra are presented and interprete… ▽ More

    Submitted 22 May, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: 480 authors from 78 institutions, 18 pages, 6 tables, 16 figures. v2 is version accepted for publication in Physical Review C. HEPdata tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

    Journal ref: Phys. Rev. C 109, 054910 (2024)

  16. arXiv:2312.09582  [pdf, other

    cs.CL cs.SD eess.AS

    Phoneme-aware Encoding for Prefix-tree-based Contextual ASR

    Authors: Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Hiroaki Ogawa, Siddhant Arora, Shinji Watanabe

    Abstract: In speech recognition applications, it is important to recognize context-specific rare words, such as proper nouns. Tree-constrained Pointer Generator (TCPGen) has shown promise for this purpose, which efficiently biases such words with a prefix tree. While the original TCPGen relies on grapheme-based encoding, we propose extending it with phoneme-aware encoding to better recognize words of unusua… ▽ More

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP2024

  17. arXiv:2312.09522  [pdf, ps, other

    math.PR

    Annealed transition density of simple random walk on a high-dimensional loop-erased random walk

    Authors: David A. Croydon, Daisuke Shiraishi, Satomi Watanabe

    Abstract: We derive sub-Gaussian bounds for the annealed transition density of the simple random walk on a high-dimensional loop-erased random walk. The walk dimension that appears in these is the exponent governing the space-time scaling of the process with respect to the extrinsic Euclidean distance, which contrasts with the exponent given by the intrinsic graph distance that appears in the corresponding… ▽ More

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: 36 pages, 7 figures

    MSC Class: 60K37 (primary); 05C81; 60J35; 60K35; 82B41

  18. arXiv:2312.07915  [pdf, other

    astro-ph.IM hep-ex physics.ins-det

    2-mm-Thick Large-Area CdTe Double-sided Strip Detectors for High-Resolution Spectroscopic Imaging of X-ray and Gamma-ray with Depth-Of-Interaction Sensing

    Authors: Takahiro Minami, Miho Katsuragawa, Shunsaku Nagasawa, Shin'ichiro Takeda, Shin Watanabe, Yutaka Tsuzuki, Tadayuki Takahashi

    Abstract: We developed a 2-mm-thick CdTe double-sided strip detector (CdTe-DSD) with a 250 um strip pitch, which has high spatial resolution with a uniform large imaging area of 10 cm$^2$ and high energy resolution with high detection efficiency in tens to hundreds keV. The detector can be employed in a wide variety of fields for quantitative observations of hard X-ray and soft gamma-ray with spectroscopic… ▽ More

    Submitted 17 December, 2023; v1 submitted 13 December, 2023; originally announced December 2023.

    Comments: 13 pages, 11 figures, 1 table, Accepted for publication in NIM A

  19. arXiv:2312.07029  [pdf, other

    cond-mat.mtrl-sci physics.app-ph

    Spatiotemporal visualization of a surface acoustic wave coupled to magnons across a submillimeter-long sample by pulsed laser interferometry

    Authors: Kazuki Maezawa, Shun Fujii, Kazuto Yamanoi, Yukio Nozaki, Shinichi Watanabe

    Abstract: Surface acoustic waves (SAWs) coupled to magnons have attracted much attention because they allow for the long-range transport of magnetic information which cannot be achieved by magnon alone. We employed pulsed laser interferometry to visualize the entire spatiotemporal dynamics of a SAW that travels on a nickel (Ni) thin film and is coupled to magnons. It was possible to trace the coupling-induc… ▽ More

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: 24 pages, 6 figures

    Journal ref: Phys. Rev. Applied vol. 21 (4), 044047 (2024)

  20. arXiv:2312.06963  [pdf

    cond-mat.mtrl-sci

    Enhanced ionic conductivity through crystallization of glass-Li$_3$PS$_4$ by machine learning molecular dynamics simulations

    Authors: Koji Shimizu, Parth Bahuguna, Shigeo Mori, Akitoshi Hayashi, Satoshi Watanabe

    Abstract: Understanding the atomistic mechanism of ion conduction in solid electrolytes is critical for the advancement of all-solid-state batteries. Glass-ceramics, which undergo crystallization from a glass state, frequently exhibit unique properties including enhanced ionic conductivities compared to both the original crystalline and glass forms. Despite these distinctive features, specific details regar… ▽ More

    Submitted 11 December, 2023; originally announced December 2023.

  21. arXiv:2311.14328  [pdf, other

    cs.RO

    Offline Skill Generalization via Task and Motion Planning

    Authors: Shin Watanabe, Geir Horn, Jim Tørresen, Kai Olav Ellefsen

    Abstract: This paper presents a novel approach to generalizing robot manipulation skills by combining a sampling-based task-and-motion planner with an offline reinforcement learning algorithm. Starting with a small library of scripted primitive skills (e.g. Push) and object-centric symbolic predicates (e.g. On(block, plate)), the planner autonomously generates a demonstration dataset of manipulation skills… ▽ More

    Submitted 24 November, 2023; originally announced November 2023.

    Comments: 7 pages, 4 figures

  22. arXiv:2311.07069  [pdf, other

    cs.SD eess.AS

    Music ControlNet: Multiple Time-varying Controls for Music Generation

    Authors: Shih-Lun Wu, Chris Donahue, Shinji Watanabe, Nicholas J. Bryan

    Abstract: Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less suitable for precise control over time-varying attributes such as the positions of beats in time or the changing dynamics of the music. We propose Music ControlN… ▽ More

    Submitted 12 November, 2023; originally announced November 2023.

    Comments: 11 pages, 4 figure, 5 tables, Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

  23. arXiv:2310.17864  [pdf, other

    eess.AS cs.SD

    TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

    Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, **chuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

    Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by develo** impactful features. Here, we survey TorchAudio's devel… ▽ More

    Submitted 26 October, 2023; originally announced October 2023.

  24. arXiv:2310.08277  [pdf, other

    eess.AS cs.SD

    A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction

    Authors: Kohei Saijo, Wangyou Zhang, Zhong-Qiu Wang, Shinji Watanabe, Tetsunori Kobayashi, Tetsuji Ogawa

    Abstract: We propose a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement (SE) tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting. This is achieved by integrating two modules into an SE model: 1) an internal separation module that does both speaker counting and separation; and 2) a TSE module that extrac… ▽ More

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: 6 pages, 4 figures, 2 tables, accepted by ASRU2023

  25. arXiv:2310.05513  [pdf, other

    cs.SD cs.CL eess.AS

    Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

    Authors: Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-** Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe

    Abstract: The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification. The challenge comprises a research track focused on applying ML-SUPERB to specific multilingual subjects, a Challenge Track for model submissions, and a New Language Track w… ▽ More

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: Accepted by ASRU

  26. arXiv:2310.03975  [pdf, ps, other

    cs.SD cs.CL

    HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model

    Authors: Takashi Maekaku, Jiatong Shi, Xuankai Chang, Yuya Fujita, Shinji Watanabe

    Abstract: Recently, the usefulness of self-supervised representation learning (SSRL) methods has been confirmed in various downstream tasks. Many of these models, as exemplified by HuBERT and WavLM, use pseudo-labels generated from spectral features or the model's own representation features. From previous studies, it is known that the pseudo-labels contain semantic information. However, the masked predicti… ▽ More

    Submitted 5 October, 2023; originally announced October 2023.

    Comments: Submitted to IEEE ICASSP 2024

  27. arXiv:2310.03938  [pdf, other

    cs.SD eess.AS

    EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios

    Authors: Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe

    Abstract: Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing diverse SSL models could achieve superior performance compared to using one SSL model. However, fusing models increases the overall parameter size, leading to higher computational costs. We propose EFFUSE, a no… ▽ More

    Submitted 5 June, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

    Comments: 5 pages, 2 figures, 3 tables

  28. arXiv:2310.02973  [pdf, other

    cs.CL cs.SD eess.AS

    UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

    Authors: Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe

    Abstract: Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additio… ▽ More

    Submitted 3 April, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted at NAACL 2024

  29. arXiv:2310.01688  [pdf, other

    eess.AS cs.CL cs.SD

    One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

    Authors: Samuele Cornell, Jee-weon Jung, Shinji Watanabe, Stefano Squartini

    Abstract: This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary length inputs and can handle any number of speakers, effectively solving ``who spoke what, when'' concurrently. SLIDAR leverages a sliding window approach and consists of an end-to-end diarizat… ▽ More

    Submitted 2 October, 2023; originally announced October 2023.

  30. arXiv:2310.00704  [pdf, other

    cs.SD eess.AS

    UniAudio: An Audio Foundation Model Toward Universal Audio Generation

    Authors: Dongchao Yang, **chuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Shinji Watanabe, Helen Meng

    Abstract: Large Language models (LLM) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio (including speech, sounds, music, and singing) with given input conditions. UniAudio 1) first tokenizes all types of target audio along with other con… ▽ More

    Submitted 11 December, 2023; v1 submitted 1 October, 2023; originally announced October 2023.

  31. arXiv:2309.17384  [pdf, other

    eess.AS cs.SD eess.SP

    Toward Universal Speech Enhancement for Diverse Input Conditions

    Authors: Wangyou Zhang, Kohei Saijo, Zhong-Qiu Wang, Shinji Watanabe, Yanmin Qian

    Abstract: The past decade has witnessed substantial growth of data-driven speech enhancement (SE) techniques thanks to deep learning. While existing approaches have shown impressive performance in some common datasets, most of them are designed only for a single condition (e.g., single-channel, multi-channel, or a fixed sampling frequency) or only consider a single task (e.g., denoising or dereverberation).… ▽ More

    Submitted 15 February, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: 6 pages, 3 figures, 5 tables, published in ASRU 2023 (corrected the results of noisy speech on CHiME-4 (Simu) in Table 4)

  32. arXiv:2309.17352  [pdf, other

    cs.SD eess.AS

    Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

    Authors: Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, Shinji Watanabe

    Abstract: Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this w… ▽ More

    Submitted 9 January, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ICASSP 2024 camera-ready paper. Winner of the DCASE 2023 Challenge Task 6A: Automated Audio Captioning (AAC)

  33. arXiv:2309.15826  [pdf, other

    cs.CL cs.SD eess.AS

    Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

    Authors: Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe

    Abstract: Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose a ST/MT multi-tasking framework with hard parameter sharing in which all model parameters are shared cross-modal… ▽ More

    Submitted 27 September, 2023; originally announced September 2023.

  34. arXiv:2309.15800  [pdf, other

    cs.CL cs.SD eess.AS

    Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

    Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, **chuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

    Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning repre… ▽ More

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP 2024

  35. arXiv:2309.15686  [pdf, other

    cs.CL cs.SD eess.AS

    Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

    Authors: Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied. To bridge this gap, we introduce target language context in E2E-ST, enhancing coherence and overcoming memory constraints of extended audio segments. Additionally, we propose context dropout to ensure robustness to the absence of… ▽ More

    Submitted 27 September, 2023; originally announced September 2023.

  36. arXiv:2309.15674  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Speech collage: code-switched audio generation by collaging monolingual corpora

    Authors: Amir Hussein, Dorsa Zeinali, Ondřej Klejch, Matthew Wiesner, Brian Yan, Shammur Chowdhury, Ahmed Ali, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness quality of audio generation using an overlap-add approach. We… ▽ More

    Submitted 27 September, 2023; originally announced September 2023.

  37. arXiv:2309.15317  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

    Authors: William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe

    Abstract: Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expenses and complexity required to handle many languages. This further harms the reproducibility of SSL, which is already limited to few research groups due to its resource usage. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more… ▽ More

    Submitted 27 September, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to ASRU 2023

  38. Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

    Authors: Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe

    Abstract: Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference. This is primarily attributed to the incremental calculation of the decoder. This work proposes a partially AR framework, which employs segment-level vectorized beam sea… ▽ More

    Submitted 30 September, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023

    Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop 2023

  39. arXiv:2309.13927  [pdf, other

    quant-ph

    ZZ-Interaction-Free Single-Qubit-Gate Optimization in Superconducting Qubits

    Authors: Shu Watanabe, Yutaka Tabuchi, Kentaro Heya, Shuhei Tamate, Yasunobu Nakamura

    Abstract: Overcoming the issue of qubit-frequency fluctuations is essential to realize stable and practical quantum computing with solid-state qubits. Static ZZ interaction, which causes a frequency shift of a qubit depending on the state of neighboring qubits, is one of the major obstacles to integrating fixed-frequency transmon qubits. Here we propose and experimentally demonstrate ZZ-interaction-free sin… ▽ More

    Submitted 5 January, 2024; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: 11 pages, 8 figures

  40. arXiv:2309.13876  [pdf, other

    cs.CL cs.SD eess.AS

    Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

    Authors: Yifan Peng, **chuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe

    Abstract: Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for develo** such models (from data collection to training) is not publicly accessib… ▽ More

    Submitted 24 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023

  41. arXiv:2309.11379  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

    Authors: Peter Polák, Brian Yan, Shinji Watanabe, Alex Waibel, Ondřej Bojar

    Abstract: Blockwise self-attentional encoder models have recently emerged as one promising end-to-end approach to simultaneous speech translation. These models employ a blockwise beam search with hypothesis reliability scoring to determine when to wait for more input speech before translating further. However, this method maintains multiple hypotheses until the entire speech input is consumed -- this scheme… ▽ More

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: Accepted at INTERSPEECH 2023

    Journal ref: Polák, P., Yan, B., Watanabe, S., Waibel, A., Bojar, O. (2023) Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff. Proc. INTERSPEECH 2023, 3979-3983

  42. arXiv:2309.10926  [pdf, other

    cs.CL cs.SD eess.AS

    Semi-Autoregressive Streaming ASR With Label Context

    Authors: Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury

    Abstract: Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy. Since NAR automatic speech recognition (ASR) models must wait for the completion of the entire utterance before processing, some works explore streaming NAR models based… ▽ More

    Submitted 20 February, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  43. arXiv:2309.10787  [pdf, other

    eess.AS cs.CV cs.MM cs.SD

    AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

    Authors: Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee

    Abstract: Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual a… ▽ More

    Submitted 19 March, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024; Evaluation Code: https://github.com/roger-tseng/av-superb Submission Platform: https://av.superbbenchmark.org

  44. arXiv:2309.09510  [pdf, ps, other

    eess.AS cs.LG cs.SD

    Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

    Authors: Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-yi Lee

    Abstract: Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for bui… ▽ More

    Submitted 22 March, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: To appear in the proceedings of ICASSP 2024

  45. arXiv:2309.08876  [pdf, ps, other

    eess.AS cs.SD

    Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

    Authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

    Abstract: Collecting audio-text pairs is expensive; however, it is much easier to access text-only data. Unless using shallow fusion, end-to-end automatic speech recognition (ASR) models require architecture modifications or additional training schemes to use text-only data. Inspired by recent advances in decoder-only language models (LMs), such as GPT-3 and PaLM adopted for speech-processing tasks, we prop… ▽ More

    Submitted 9 January, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

  46. arXiv:2309.08535  [pdf, other

    cs.CV cs.AI eess.AS

    Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

    Authors: Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

    Abstract: This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited number of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the… ▽ More

    Submitted 12 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  47. arXiv:2309.08531  [pdf, other

    cs.CV cs.CL eess.AS eess.IV

    Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

    Authors: Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

    Abstract: In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-s… ▽ More

    Submitted 15 September, 2023; originally announced September 2023.

  48. arXiv:2309.08348  [pdf, other

    eess.AS cs.SD

    The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

    Authors: Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang, Hongbo Lan, Jun Du, Chin-Hui Lee, **gdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao

    Abstract: Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023… ▽ More

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures

  49. arXiv:2309.07937  [pdf, other

    eess.AS cs.LG cs.SD

    Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

    Authors: Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe

    Abstract: We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech syn… ▽ More

    Submitted 24 January, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

  50. arXiv:2308.10107  [pdf, other

    cs.CL

    Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

    Authors: **chuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu, Shinji Watanabe

    Abstract: Automatic speech recognition (ASR) based on transducers is widely used. In training, a transducer maximizes the summed posteriors of all paths. The path with the highest posterior is commonly defined as the predicted alignment between the speech and the transcription. While the vanilla transducer does not have a prior preference for any of the valid paths, this work intends to enforce the preferre… ▽ More

    Submitted 19 August, 2023; originally announced August 2023.

    Journal ref: Interspeech 2023