Search | arXiv e-print repository

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition

Authors: Yihan Wu, Soumi Maiti, Yifan Peng, Wangyou Zhang, Chenda Li, Yuyue Wang, Xihua Wang, Shinji Watanabe, Ruihua Song

Abstract: Recent advancements in language models have significantly enhanced performance in multiple speech-related tasks. Existing speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model. However, this design omits the intrinsic connections between different speech tasks, which can potentially boost the performance of each task. In this work, we… ▽ More Recent advancements in language models have significantly enhanced performance in multiple speech-related tasks. Existing speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model. However, this design omits the intrinsic connections between different speech tasks, which can potentially boost the performance of each task. In this work, we propose a novel decoder-only speech language model, SpeechComposer, that can unify common speech tasks by composing a fixed set of prompt tokens. Built upon four primary tasks -- speech synthesis, speech recognition, speech language modeling, and text language modeling -- SpeechComposer can easily extend to more speech tasks via compositions of well-designed prompt tokens, like voice conversion and speech enhancement. The unification of prompt tokens also makes it possible for knowledge sharing among different speech tasks in a more structured manner. Experimental results demonstrate that our proposed SpeechComposer can improve the performance of both primary tasks and composite tasks, showing the effectiveness of the shared prompt tokens. Remarkably, the unified decoder-only model achieves a comparable and even better performance than the baselines which are expert models designed for single tasks. △ Less

Submitted 31 January, 2024; originally announced January 2024.

Comments: 11 pages, 2 figures

arXiv:2401.17619 [pdf, ps, other]

Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing

Authors: Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin **, Shinji Watanabe

Abstract: In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability. This study proposes a unique strategy to address the data scarcity in SVS. We employ an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatu… ▽ More In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability. This study proposes a unique strategy to address the data scarcity in SVS. We employ an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatural voice synthesis. This innovative method has led to the creation of two expansive singing voice datasets, ACE-Opencpop and ACE-KiSing, which are instrumental for large-scale, multi-singer voice synthesis. Through thorough experimentation, we establish that these datasets not only serve as new benchmarks for SVS but also enhance SVS performance on other singing voice datasets when used as supplementary resources. The corpora, pre-trained models, and their related training recipes are publicly available at ESPnet-Muskits (\url{https://github.com/espnet/espnet}) △ Less

Submitted 12 June, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

Comments: Accepted by Interspeech2024

arXiv:2401.17230 [pdf, other]

ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

Authors: Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

Abstract: This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also… ▽ More This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also aspire to bridge developed models with other domains, facilitating the broad research community to effortlessly incorporate state-of-the-art embedding extractors. Pre-trained embedding extractors can be accessed in an off-the-shelf manner and we demonstrate the toolkit's versatility by showcasing its integration with two tasks. Another goal is to integrate with diverse self-supervised learning features. We release a reproducible recipe that achieves an equal error rate of 0.39% on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN. △ Less

Submitted 13 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: 5 pages, 3 figures, 7 tables, Interspeech 2024

arXiv:2401.16812 [pdf, other]

SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics

Authors: Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, Hiroshi Saruwatari

Abstract: While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that are highly correlated with human subjective judgments due to their cost efficiency. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTS… ▽ More While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that are highly correlated with human subjective judgments due to their cost efficiency. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTScore computes the BERTScore for self-supervised dense speech features of the generated and reference speech, which can have different sequential lengths. We also propose SpeechBLEU and SpeechTokenDistance, which are computed on speech discrete tokens. The evaluations on synthesized speech show that our method correlates better with human subjective ratings than mel cepstral distortion and a recent mean opinion score prediction model. Also, they are effective in noisy speech evaluation and have cross-lingual applicability. △ Less

Submitted 12 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: Accepted by Interspeech 2024. An extended version with Appendix. Code: https://github.com/Takaaki-Saeki/DiscreteSpeechMetrics

arXiv:2401.16658 [pdf, ps, other]

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Authors: Yifan Peng, **chuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

Abstract: Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder archite… ▽ More Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder architectures. This work aims to improve the performance and efficiency of OWSM without additional data. We present a series of E-Branchformer-based models named OWSM v3.1, ranging from 100M to 1B parameters. OWSM v3.1 outperforms its predecessor, OWSM v3, in most evaluation benchmarks, while showing an improved inference speed of up to 25%. We further reveal the emergent ability of OWSM v3.1 in zero-shot contextual biasing speech recognition. We also provide a model trained on a subset of data with low license restrictions. We will publicly release the code, pre-trained models, and training logs. △ Less

Submitted 16 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: Accepted at INTERSPEECH 2024. Webpage: https://www.wavlab.org/activities/2024/owsm/

arXiv:2401.14965 [pdf, other]

An Improved Lower Bound on Oblivious Transfer Capacity via Interactive Erasure Emulation

Authors: So Suda, Shun Watanabe, Haruya Yamaguchi

Abstract: We revisit the oblivious transfer (OT) capacities of noisy channels against the passive adversary, which have been identified only for a limited class of channels. In the literature, the general construction of oblivious transfer has been known only for generalized erasure channels (GECs); for other channels, we first convert a given channel to a GEC via alphabet extension and erasure emulation, a… ▽ More We revisit the oblivious transfer (OT) capacities of noisy channels against the passive adversary, which have been identified only for a limited class of channels. In the literature, the general construction of oblivious transfer has been known only for generalized erasure channels (GECs); for other channels, we first convert a given channel to a GEC via alphabet extension and erasure emulation, and then apply the general construction for GEC. In this paper, we derive an improved lower bound on the OT capacity of the binary symmetric channel (BSC) and binary symmetric erasure channel (BSEC) by proposing a new protocol; by using interactive communication between the sender and the receiver, our protocol emulates erasure events recursively in multiple rounds. We also discuss a potential necessity of multiple rounds interactive communication to attain the OT capacity. △ Less

Submitted 26 January, 2024; originally announced January 2024.

Comments: 6 pages, 2 figures

arXiv:2401.14271 [pdf, other]

Improving Design of Input Condition Invariant Speech Enhancement

Authors: Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian

Abstract: Building a single universal speech enhancement (SE) system that can handle arbitrary input is a demanded but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio duration, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model wa… ▽ More Building a single universal speech enhancement (SE) system that can handle arbitrary input is a demanded but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio duration, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model was recently proposed showing promising performance; however, its multi-channel performance degraded severely in real conditions. In this paper we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while real condition degradation is much mitigated. For this purpose, we redesign the key components that comprise such a system. First, we identify that the channel-modeling module's generalization to unseen scenarios can be sub-optimal and redesign this module. We further introduce a two-stage training strategy to enhance training efficiency. Second, we propose two novel dual-path time-frequency blocks, demonstrating superior performance with fewer parameters and computational costs compared to the existing method. All proposals combined, experiments on various public datasets validate the efficacy of the proposed model, with significantly improved performance on real conditions. Recipe with full model details is released at https://github.com/espnet/espnet. △ Less

Submitted 15 February, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP 2024, 5 pages, 2 figures, 3 tables (corrected the results of no processing on CHiME-4 (Simu) in Table 2)

arXiv:2401.12473 [pdf, other]

Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor

Authors: Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhong-Qiu Wang, Shinji Watanabe

Abstract: We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations… ▽ More We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations of these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding by feature-wise linear modulation conditioning, creating a speaker dimension. The mixture embedding, conditioned with speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the previous best reported in the literature, achieving 24.0 and 23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2 and 3mix respectively, with a single model trained to separate 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability at counting sources and separating mixtures with up to 5 speakers. △ Less

Submitted 22 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures, accepted by ICASSP 2024

arXiv:2401.10449 [pdf, other]

Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

Authors: Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, Shinji Watanabe

Abstract: End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or de… ▽ More End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to improve the contextualization performance during inference further, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that the proposed method consistently improves the word error rate and the character error rate of the target phrases in the bias list on both the Librispeech-960 (English) and our in-house (Japanese) dataset, respectively. △ Less

Submitted 18 January, 2024; originally announced January 2024.

Comments: accepted by ICASSP20224

arXiv:2401.08835 [pdf, other]

Improving ASR Contextual Biasing with Guided Attention

Authors: Jiyang Tang, Kwangyoun Kim, Suwon Shon, Felix Wu, Prashant Sridhar, Shinji Watanabe

Abstract: In this paper, we propose a Guided Attention (GA) auxiliary training loss, which improves the effectiveness and robustness of automatic speech recognition (ASR) contextual biasing without introducing additional parameters. A common challenge in previous literature is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases. To addres… ▽ More In this paper, we propose a Guided Attention (GA) auxiliary training loss, which improves the effectiveness and robustness of automatic speech recognition (ASR) contextual biasing without introducing additional parameters. A common challenge in previous literature is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases. To address this challenge, we employ a GA loss as an additional training objective besides the Transducer loss. The proposed GA loss aims to teach the cross attention how to align bias phrases with text tokens or audio frames. Compared to studies with similar motivations, the proposed loss operates directly on the cross attention weights and is easier to implement. Through extensive experiments based on Conformer Transducer with Contextual Adapter, we demonstrate that the proposed method not only leads to a lower WER but also retains its effectiveness as the number of bias phrases increases. Specifically, the GA loss decreases the WER of rare vocabularies by up to 19.2% on LibriSpeech compared to the contextual biasing baseline, and up to 49.3% compared to a vanilla Transducer. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted at ICASSP 2024

arXiv:2401.06806 [pdf, ps, other]

AugSumm: towards generalizable speech summarization using synthetic labels from large language model

Authors: Jee-weon Jung, Roshan Sharma, William Chen, Bhiksha Raj, Shinji Watanabe

Abstract: Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech. Given variations in information captured and phrasing, recordings can be summarized in multiple ways. Therefore, it is more reasonable to consider a probabilistic distribution of all potential summaries rather than a single summary. However, conventional SSUM models are mostly trained and evaluated with a si… ▽ More Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech. Given variations in information captured and phrasing, recordings can be summarized in multiple ways. Therefore, it is more reasonable to consider a probabilistic distribution of all potential summaries rather than a single summary. However, conventional SSUM models are mostly trained and evaluated with a single ground-truth (GT) human-annotated deterministic summary for every recording. Generating multiple human references would be ideal to better represent the distribution statistically, but is impractical because annotation is expensive. We tackle this challenge by proposing AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries for training and evaluation. First, we explore prompting strategies to generate synthetic summaries from ChatGPT. We validate the quality of synthetic summaries using multiple metrics including human evaluation, where we find that summaries generated using AugSumm are perceived as more valid to humans. Second, we develop methods to utilize synthetic summaries in training and evaluation. Experiments on How2 demonstrate that pre-training on synthetic summaries and fine-tuning on GT summaries improves ROUGE-L by 1 point on both GT and AugSumm-based test sets. AugSumm summaries are available at https://github.com/Jungjee/AugSumm. △ Less

Submitted 10 January, 2024; originally announced January 2024.

Comments: This work has been submitted to the IEEE ICASSP for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. 5 pages

arXiv:2312.12907 [pdf, ps, other]

doi 10.1103/PhysRevD.109.092001

Solar neutrino measurements using the full data period of Super-Kamiokande-IV

Authors: Super-Kamiokande Collaboration, :, K. Abe, C. Bronner, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, S. Imaizumi, K. Iyogi, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, Y. Kato, Y. Kishimoto, S. Miki, S. Mine, M. Miura, T. Mochizuki, S. Moriyama, Y. Nagao, M. Nakahata , et al. (305 additional authors not shown)

Abstract: An analysis of solar neutrino data from the fourth phase of Super-Kamiokande~(SK-IV) from October 2008 to May 2018 is performed and the results are presented. The observation time of the data set of SK-IV corresponds to $2970$~days and the total live time for all four phases is $5805$~days. For more precise solar neutrino measurements, several improvements are applied in this analysis: lowering th… ▽ More An analysis of solar neutrino data from the fourth phase of Super-Kamiokande~(SK-IV) from October 2008 to May 2018 is performed and the results are presented. The observation time of the data set of SK-IV corresponds to $2970$~days and the total live time for all four phases is $5805$~days. For more precise solar neutrino measurements, several improvements are applied in this analysis: lowering the data acquisition threshold in May 2015, further reduction of the spallation background using neutron clustering events, precise energy reconstruction considering the time variation of the PMT gain. The observed number of solar neutrino events in $3.49$--$19.49$ MeV electron kinetic energy region during SK-IV is $65,443^{+390}_{-388}\,(\mathrm{stat.})\pm 925\,(\mathrm{syst.})$ events. Corresponding $\mathrm{^{8}B}$ solar neutrino flux is $(2.314 \pm 0.014\, \rm{(stat.)} \pm 0.040 \, \rm{(syst.)}) \times 10^{6}~\mathrm{cm^{-2}\,s^{-1}}$, assuming a pure electron-neutrino flavor component without neutrino oscillations. The flux combined with all SK phases up to SK-IV is $(2.336 \pm 0.011\, \rm{(stat.)} \pm 0.043 \, \rm{(syst.)}) \times 10^{6}~\mathrm{cm^{-2}\,s^{-1}}$. Based on the neutrino oscillation analysis from all solar experiments, including the SK $5805$~days data set, the best-fit neutrino oscillation parameters are $\rm{sin^{2} θ_{12,\,solar}} = 0.306 \pm 0.013 $ and $Δm^{2}_{21,\,\mathrm{solar}} = (6.10^{+ 0.95}_{-0.81}) \times 10^{-5}~\rm{eV}^{2}$, with a deviation of about 1.5$σ$ from the $Δm^{2}_{21}$ parameter obtained by KamLAND. The best-fit neutrino oscillation parameters obtained from all solar experiments and KamLAND are $\sin^{2} θ_{12,\,\mathrm{global}} = 0.307 \pm 0.012 $ and $Δm^{2}_{21,\,\mathrm{global}} = (7.50^{+ 0.19}_{-0.18}) \times 10^{-5}~\rm{eV}^{2}$. △ Less

Submitted 20 February, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: 47 pages, 61 figures

Journal ref: Phys. Rev. D 109, 092001 (2024)

arXiv:2312.10019 [pdf, other]

Understanding Probe Behaviors through Variational Bounds of Mutual Information

Authors: Kwanghee Choi, Jee-weon Jung, Shinji Watanabe

Abstract: With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation. Among various interpretability methods, we focus on classification-based linear probing. We aim to foster a solid understanding and provide guidelines for linear probing by constructing a novel mathematical framework leveraging information theory. Fi… ▽ More With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation. Among various interpretability methods, we focus on classification-based linear probing. We aim to foster a solid understanding and provide guidelines for linear probing by constructing a novel mathematical framework leveraging information theory. First, we connect probing with the variational bounds of mutual information (MI) to relax the probe design, equating linear probing with fine-tuning. Then, we investigate empirical behaviors and practices of probing through our mathematical framework. We analyze the layer-wise performance curve being convex, which seemingly violates the data processing inequality. However, we show that the intermediate representations can have the biggest MI estimate because of the tradeoff between better separability and decreasing MI. We further suggest that the margin of linearly separable representations can be a criterion for measuring the "goodness of representation." We also compare accuracy with MI as the measuring criteria. Finally, we empirically validate our claims by observing the self-supervised speech models on retaining word and phoneme information. △ Less

Submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted to ICASSP 2024, implementation available at https://github.com/juice500ml/information_probing

arXiv:2312.09895 [pdf, other]

Generative Context-aware Fine-tuning of Self-supervised Speech Models

Authors: Suwon Shon, Kwangyoun Kim, Prashant Sridhar, Yi-Te Hsu, Shinji Watanabe, Karen Livescu

Abstract: When performing tasks like automatic speech recognition or spoken language understanding for a given utterance, access to preceding text or audio provides contextual information can improve performance. Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text. With appropriate prompts, L… ▽ More When performing tasks like automatic speech recognition or spoken language understanding for a given utterance, access to preceding text or audio provides contextual information can improve performance. Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text. With appropriate prompts, LLM could generate a prediction of the next sentence or abstractive text like titles or topics. In this paper, we study the use of LLM-generated context information and propose an approach to distill the generated information during fine-tuning of self-supervised speech models, which we refer to as generative context-aware fine-tuning. This approach allows the fine-tuned model to make improved predictions without access to the true surrounding segments or to the LLM at inference time, while requiring only a very small additional context module. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis. The results show that generative context-aware fine-tuning outperforms a context injection fine-tuning approach that accesses the ground-truth previous text, and is competitive with a generative context injection fine-tuning approach that requires the LLM at inference time. △ Less

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2312.09827 [pdf, other]

doi 10.1103/PhysRevC.109.054910

Identified charged-hadron production in $p$$+$Al, $^3$He$+$Au, and Cu$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV and in U$+$U collisions at $\sqrt{s_{_{NN}}}=193$ GeV

Authors: PHENIX Collaboration, N. J. Abdulameer, U. Acharya, A. Adare, C. Aidala, N. N. Ajitanand, Y. Akiba, R. Akimoto, J. Alexander, M. Alfred, V. Andrieux, K. Aoki, N. Apadula, H. Asano, E. T. Atomssa, T. C. Awes, B. Azmoun, V. Babintsev, M. Bai, X. Bai, N. S. Bandara, B. Bannier, K. N. Barish, S. Bathe, V. Baublis , et al. (456 additional authors not shown)

Abstract: The PHENIX experiment has performed a systematic study of identified charged-hadron ($π^\pm$, $K^\pm$, $p$, $\bar{p}$) production at midrapidity in $p$$+$Al, $^3$He$+$Au, Cu$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV and U$+$U collisions at $\sqrt{s_{_{NN}}}=193$ GeV. Identified charged-hadron invariant transverse-momentum ($p_T$) and transverse-mass ($m_T$) spectra are presented and interprete… ▽ More The PHENIX experiment has performed a systematic study of identified charged-hadron ($π^\pm$, $K^\pm$, $p$, $\bar{p}$) production at midrapidity in $p$$+$Al, $^3$He$+$Au, Cu$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV and U$+$U collisions at $\sqrt{s_{_{NN}}}=193$ GeV. Identified charged-hadron invariant transverse-momentum ($p_T$) and transverse-mass ($m_T$) spectra are presented and interpreted in terms of radially expanding thermalized systems. The particle ratios of $K/π$ and $p/π$ have been measured in different centrality ranges of large (Cu$+$Au, U$+$U) and small ($p$$+$Al, $^3$He$+$Au) collision systems. The values of $K/π$ ratios measured in all considered collision systems were found to be consistent with those measured in $p$$+$$p$ collisions. However the values of $p/π$ ratios measured in large collision systems reach the values of $\approx0.6$, which is $\approx2$ times larger than in $p$$+$$p$ collisions. These results can be qualitatively understood in terms of the baryon enhancement expected from hadronization by recombination. Identified charged-hadron nuclear-modification factors ($R_{AB}$) are also presented. Enhancement of proton $R_{AB}$ values over meson $R_{AB}$ values was observed in central $^3$He$+$Au, Cu$+$Au, and U$+$U collisions. The proton $R_{AB}$ values measured in $p$$+$Al collision system were found to be consistent with $R_{AB}$ values of $φ$, $π^\pm$, $K^\pm$, and $π^0$ mesons, which may indicate that the size of the system produced in $p$$+$Al collisions is too small for recombination to cause a noticeable increase in proton production. △ Less

Submitted 22 May, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: 480 authors from 78 institutions, 18 pages, 6 tables, 16 figures. v2 is version accepted for publication in Physical Review C. HEPdata tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

Journal ref: Phys. Rev. C 109, 054910 (2024)

arXiv:2312.09582 [pdf, other]

Phoneme-aware Encoding for Prefix-tree-based Contextual ASR

Authors: Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Hiroaki Ogawa, Siddhant Arora, Shinji Watanabe

Abstract: In speech recognition applications, it is important to recognize context-specific rare words, such as proper nouns. Tree-constrained Pointer Generator (TCPGen) has shown promise for this purpose, which efficiently biases such words with a prefix tree. While the original TCPGen relies on grapheme-based encoding, we propose extending it with phoneme-aware encoding to better recognize words of unusua… ▽ More In speech recognition applications, it is important to recognize context-specific rare words, such as proper nouns. Tree-constrained Pointer Generator (TCPGen) has shown promise for this purpose, which efficiently biases such words with a prefix tree. While the original TCPGen relies on grapheme-based encoding, we propose extending it with phoneme-aware encoding to better recognize words of unusual pronunciations. As TCPGen handles biasing words as subword units, we propose obtaining subword-level phoneme-aware encoding by using alignment between phonemes and subwords. Furthermore, we propose injecting phoneme-level predictions from CTC into queries of TCPGen so that the model better interprets the phoneme-aware encodings. We conducted ASR experiments with TCPGen for RNN transducer. We observed that proposed phoneme-aware encoding outperformed ordinary grapheme-based encoding on both the English LibriSpeech and Japanese CSJ datasets, demonstrating the robustness of our approach across linguistically diverse languages. △ Less

Submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted to ICASSP2024

arXiv:2312.09522 [pdf, ps, other]

Annealed transition density of simple random walk on a high-dimensional loop-erased random walk

Authors: David A. Croydon, Daisuke Shiraishi, Satomi Watanabe

Abstract: We derive sub-Gaussian bounds for the annealed transition density of the simple random walk on a high-dimensional loop-erased random walk. The walk dimension that appears in these is the exponent governing the space-time scaling of the process with respect to the extrinsic Euclidean distance, which contrasts with the exponent given by the intrinsic graph distance that appears in the corresponding… ▽ More We derive sub-Gaussian bounds for the annealed transition density of the simple random walk on a high-dimensional loop-erased random walk. The walk dimension that appears in these is the exponent governing the space-time scaling of the process with respect to the extrinsic Euclidean distance, which contrasts with the exponent given by the intrinsic graph distance that appears in the corresponding quenched bounds. We prove our main result using novel pointwise Gaussian estimates on the distribution of the high-dimensional loop-erased random walk. △ Less

Submitted 14 December, 2023; originally announced December 2023.

Comments: 36 pages, 7 figures

MSC Class: 60K37 (primary); 05C81; 60J35; 60K35; 82B41

arXiv:2312.07915 [pdf, other]

doi 10.1016/j.nima.2023.169024

2-mm-Thick Large-Area CdTe Double-sided Strip Detectors for High-Resolution Spectroscopic Imaging of X-ray and Gamma-ray with Depth-Of-Interaction Sensing

Authors: Takahiro Minami, Miho Katsuragawa, Shunsaku Nagasawa, Shin'ichiro Takeda, Shin Watanabe, Yutaka Tsuzuki, Tadayuki Takahashi

Abstract: We developed a 2-mm-thick CdTe double-sided strip detector (CdTe-DSD) with a 250 um strip pitch, which has high spatial resolution with a uniform large imaging area of 10 cm$^2$ and high energy resolution with high detection efficiency in tens to hundreds keV. The detector can be employed in a wide variety of fields for quantitative observations of hard X-ray and soft gamma-ray with spectroscopic… ▽ More We developed a 2-mm-thick CdTe double-sided strip detector (CdTe-DSD) with a 250 um strip pitch, which has high spatial resolution with a uniform large imaging area of 10 cm$^2$ and high energy resolution with high detection efficiency in tens to hundreds keV. The detector can be employed in a wide variety of fields for quantitative observations of hard X-ray and soft gamma-ray with spectroscopic imaging, for example, space observation, nuclear medicine, and non-destructive elemental analysis. This detector is thicker than the 0.75-mm-thick one previously developed by a factor of $\sim$2.7, thus providing better detection efficiency for hard X-rays and soft gamma rays. The increased thickness could potentially enhance bias-induced polarization if we do not apply sufficient bias and if we do not operate at a low temperature, but the polarization is not evident in our detector when a high voltage of 500 V is applied to the CdTe diode and the temperature is maintained at -20 $^\circ$C during one-day experiments. The ''Depth Of Interaction'' (DOI) dependence due to the CdTe diode's poor carrier-transport property is also more significant, resulting in much DOI information while complicated detector responses such as charge sharings or low-energy tails that exacerbate the loss in the energy resolution. In this paper, we developed 2-mm-thick CdTe-DSDs, studied their response, and evaluated their energy resolution, spatial resolution, and uniformity. We also constructed a theoretical model to understand the detector response theoretically, resulting in reconstructing the DOI with an accuracy of 100 um while estimating the carrier-transport property. We realized the detector that has high energy resolution and high 3D spatial resolution with a uniform large imaging area. △ Less

Submitted 17 December, 2023; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: 13 pages, 11 figures, 1 table, Accepted for publication in NIM A

arXiv:2312.07029 [pdf, other]

doi 10.1103/PhysRevApplied.21.044047

Spatiotemporal visualization of a surface acoustic wave coupled to magnons across a submillimeter-long sample by pulsed laser interferometry

Authors: Kazuki Maezawa, Shun Fujii, Kazuto Yamanoi, Yukio Nozaki, Shinichi Watanabe

Abstract: Surface acoustic waves (SAWs) coupled to magnons have attracted much attention because they allow for the long-range transport of magnetic information which cannot be achieved by magnon alone. We employed pulsed laser interferometry to visualize the entire spatiotemporal dynamics of a SAW that travels on a nickel (Ni) thin film and is coupled to magnons. It was possible to trace the coupling-induc… ▽ More Surface acoustic waves (SAWs) coupled to magnons have attracted much attention because they allow for the long-range transport of magnetic information which cannot be achieved by magnon alone. We employed pulsed laser interferometry to visualize the entire spatiotemporal dynamics of a SAW that travels on a nickel (Ni) thin film and is coupled to magnons. It was possible to trace the coupling-induced amplitude reduction and phase shift that occurs as the SAW propagates over a distance of 0.4 mm. The observed changes are consistent with results obtained from conventional radio-frequency transmission measurements, which probe the total SAW absorption due to magnon--phonon coupling. This result verifies that our method can accurately capture the spatiotemporal dynamics of a SAW coupled to magnons across the entire length of the sample. Additionally, we validated our time-resolved profiles by comparing them with theoretical results that take the echo wave due to reflection into account. The impact of the echo wave is significant even when it has propagated over a distance on the order of millimeters. Our imaging results highlight the visualization of the long-range propagation of the SAW coupled to magnons and offer more information about the surface vibration profiles in such devices. △ Less

Submitted 12 December, 2023; originally announced December 2023.

Comments: 24 pages, 6 figures

Journal ref: Phys. Rev. Applied vol. 21 (4), 044047 (2024)

arXiv:2312.06963 [pdf]

Enhanced ionic conductivity through crystallization of glass-Li$_3$PS$_4$ by machine learning molecular dynamics simulations

Authors: Koji Shimizu, Parth Bahuguna, Shigeo Mori, Akitoshi Hayashi, Satoshi Watanabe

Abstract: Understanding the atomistic mechanism of ion conduction in solid electrolytes is critical for the advancement of all-solid-state batteries. Glass-ceramics, which undergo crystallization from a glass state, frequently exhibit unique properties including enhanced ionic conductivities compared to both the original crystalline and glass forms. Despite these distinctive features, specific details regar… ▽ More Understanding the atomistic mechanism of ion conduction in solid electrolytes is critical for the advancement of all-solid-state batteries. Glass-ceramics, which undergo crystallization from a glass state, frequently exhibit unique properties including enhanced ionic conductivities compared to both the original crystalline and glass forms. Despite these distinctive features, specific details regarding the behavior of ion conduction in glass-ceramics, particularly concerning conduction pathways, remain elusive. In this study, we demonstrate the crystallization process of glass-Li$_3$PS$_4$ through molecular dynamics simulations employing machine learning interatomic potentials constructed from first principles calculation data. Our analyses of Li conduction using the obtained partially crystallized structures reveal that the diffusion barriers of Li decrease as the crystallinity in Li$_3$PS$_4$ glass-ceramics increases. Furthermore, Li displacements predominantly occur in the precipitated crystalline portion, suggesting that percolation conduction plays a significant role in enhanced Li conduction. These findings provide valuable insights for the future utilization of glass-ceramic materials. △ Less

Submitted 11 December, 2023; originally announced December 2023.

arXiv:2311.14328 [pdf, other]

Offline Skill Generalization via Task and Motion Planning

Authors: Shin Watanabe, Geir Horn, Jim Tørresen, Kai Olav Ellefsen

Abstract: This paper presents a novel approach to generalizing robot manipulation skills by combining a sampling-based task-and-motion planner with an offline reinforcement learning algorithm. Starting with a small library of scripted primitive skills (e.g. Push) and object-centric symbolic predicates (e.g. On(block, plate)), the planner autonomously generates a demonstration dataset of manipulation skills… ▽ More This paper presents a novel approach to generalizing robot manipulation skills by combining a sampling-based task-and-motion planner with an offline reinforcement learning algorithm. Starting with a small library of scripted primitive skills (e.g. Push) and object-centric symbolic predicates (e.g. On(block, plate)), the planner autonomously generates a demonstration dataset of manipulation skills in the context of a long-horizon task. An offline reinforcement learning algorithm then extracts a policy from the dataset without further interactions with the environment and replaces the scripted skill in the existing library. Refining the skill library improves the robustness of the planner, which in turn facilitates data collection for more complex manipulation skills. We validate our approach in simulation, on a block-pushing task. We show that the proposed method requires less training data than conventional reinforcement learning methods. Furthermore, interaction with the environment is collision-free because of the use of planner demonstrations, making the approach more amenable to persistent robot learning in the real world. △ Less

Submitted 24 November, 2023; originally announced November 2023.

Comments: 7 pages, 4 figures

arXiv:2311.07069 [pdf, other]

Music ControlNet: Multiple Time-varying Controls for Music Generation

Authors: Shih-Lun Wu, Chris Donahue, Shinji Watanabe, Nicholas J. Bryan

Abstract: Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less suitable for precise control over time-varying attributes such as the positions of beats in time or the changing dynamics of the music. We propose Music ControlN… ▽ More Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less suitable for precise control over time-varying attributes such as the positions of beats in time or the changing dynamics of the music. We propose Music ControlNet, a diffusion-based music generation model that offers multiple precise, time-varying controls over generated audio. To imbue text-to-music models with time-varying control, we propose an approach analogous to pixel-wise control of the image-domain ControlNet method. Specifically, we extract controls from training audio yielding paired data, and fine-tune a diffusion-based conditional generative model over audio spectrograms given melody, dynamics, and rhythm controls. While the image-domain Uni-ControlNet method already allows generation with any subset of controls, we devise a new strategy to allow creators to input controls that are only partially specified in time. We evaluate both on controls extracted from audio and controls we expect creators to provide, demonstrating that we can generate realistic music that corresponds to control inputs in both settings. While few comparable music generation models exist, we benchmark against MusicGen, a recent model that accepts text and melody input, and show that our model generates music that is 49% more faithful to input melodies despite having 35x fewer parameters, training on 11x less data, and enabling two additional forms of time-varying control. Sound examples can be found at https://MusicControlNet.github.io/web/. △ Less

Submitted 12 November, 2023; originally announced November 2023.

Comments: 11 pages, 4 figure, 5 tables, Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

arXiv:2310.17864 [pdf, other]

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, **chuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by develo** impactful features. Here, we survey TorchAudio's devel… ▽ More TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by develo** impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance. △ Less

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.08277 [pdf, other]

A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction

Authors: Kohei Saijo, Wangyou Zhang, Zhong-Qiu Wang, Shinji Watanabe, Tetsunori Kobayashi, Tetsuji Ogawa

Abstract: We propose a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement (SE) tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting. This is achieved by integrating two modules into an SE model: 1) an internal separation module that does both speaker counting and separation; and 2) a TSE module that extrac… ▽ More We propose a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement (SE) tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting. This is achieved by integrating two modules into an SE model: 1) an internal separation module that does both speaker counting and separation; and 2) a TSE module that extracts the target speech from the internal separation outputs using target speaker cues. The model is trained to perform TSE if the target speaker cue is given and SS otherwise. By training the model to remove noise and reverberation, we allow the model to tackle the five tasks mentioned above with a single model, which has not been accomplished yet. Evaluation results demonstrate that the proposed MUSE model can successfully handle multiple tasks with a single model. △ Less

Submitted 12 October, 2023; originally announced October 2023.

Comments: 6 pages, 4 figures, 2 tables, accepted by ASRU2023

arXiv:2310.05513 [pdf, other]

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

Authors: Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-** Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe

Abstract: The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification. The challenge comprises a research track focused on applying ML-SUPERB to specific multilingual subjects, a Challenge Track for model submissions, and a New Language Track w… ▽ More The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification. The challenge comprises a research track focused on applying ML-SUPERB to specific multilingual subjects, a Challenge Track for model submissions, and a New Language Track where language resource researchers can contribute and evaluate their low-resource language data in the context of the latest progress in multilingual speech recognition. The challenge garnered 12 model submissions and 54 language corpora, resulting in a comprehensive benchmark encompassing 154 languages. The findings indicate that merely scaling models is not the definitive solution for multilingual speech tasks, and a variety of speech/voice types present significant challenges in multilingual speech processing. △ Less

Submitted 9 October, 2023; originally announced October 2023.

Comments: Accepted by ASRU

arXiv:2310.03975 [pdf, ps, other]

HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model

Authors: Takashi Maekaku, Jiatong Shi, Xuankai Chang, Yuya Fujita, Shinji Watanabe

Abstract: Recently, the usefulness of self-supervised representation learning (SSRL) methods has been confirmed in various downstream tasks. Many of these models, as exemplified by HuBERT and WavLM, use pseudo-labels generated from spectral features or the model's own representation features. From previous studies, it is known that the pseudo-labels contain semantic information. However, the masked predicti… ▽ More Recently, the usefulness of self-supervised representation learning (SSRL) methods has been confirmed in various downstream tasks. Many of these models, as exemplified by HuBERT and WavLM, use pseudo-labels generated from spectral features or the model's own representation features. From previous studies, it is known that the pseudo-labels contain semantic information. However, the masked prediction task, the learning criterion of HuBERT, focuses on local contextual information and may not make effective use of global semantic information such as speaker, theme of speech, and so on. In this paper, we propose a new approach to enrich the semantic representation of HuBERT. We apply topic model to pseudo-labels to generate a topic label for each utterance. An auxiliary topic classification task is added to HuBERT by using topic labels as teachers. This allows additional global semantic information to be incorporated in an unsupervised manner. Experimental results demonstrate that our method achieves comparable or better performance than the baseline in most tasks, including automatic speech recognition and five out of the eight SUPERB tasks. Moreover, we find that topic labels include various information about utterance, such as gender, speaker, and its theme. This highlights the effectiveness of our approach in capturing multifaceted semantic nuances. △ Less

Submitted 5 October, 2023; originally announced October 2023.

Comments: Submitted to IEEE ICASSP 2024

arXiv:2310.03938 [pdf, other]

EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios

Authors: Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe

Abstract: Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing diverse SSL models could achieve superior performance compared to using one SSL model. However, fusing models increases the overall parameter size, leading to higher computational costs. We propose EFFUSE, a no… ▽ More Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing diverse SSL models could achieve superior performance compared to using one SSL model. However, fusing models increases the overall parameter size, leading to higher computational costs. We propose EFFUSE, a novel approach that uses a single SSL model to mimic the features of multiple SSL models via prediction, resulting in a lightweight framework with competitive performance. Our experiments show that EFFUSE outperforms individual SSL models in multilingual speech recognition tasks. Our best performing model achieves an average SUPERB score increase of 63.5 (6.3%) from the SSL baselines in Multilingual Speech Universal PERformance Benchmark (ML-SUPERB), while decreasing parameter size on average by 317M parameters (49%) from the fusion models. △ Less

Submitted 5 June, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

Comments: 5 pages, 2 figures, 3 tables

arXiv:2310.02973 [pdf, other]

UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

Authors: Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe

Abstract: Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additio… ▽ More Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additional tasks using single-token task specifiers. We enhance this approach through instruction tuning, i.e., finetuning by describing the task using natural language instructions followed by the list of label options. Our approach can generalize to new task descriptions for the seen tasks during inference, thereby enhancing its user-friendliness. We demonstrate the efficacy of our single multi-task learning model "UniverSLU" for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages. On most tasks, UniverSLU achieves competitive performance and often even surpasses task-specific models. Additionally, we assess the zero-shot capabilities, finding that the model generalizes to new datasets and languages for seen task types. △ Less

Submitted 3 April, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: Accepted at NAACL 2024

arXiv:2310.01688 [pdf, other]

One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Authors: Samuele Cornell, Jee-weon Jung, Shinji Watanabe, Stefano Squartini

Abstract: This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary length inputs and can handle any number of speakers, effectively solving ``who spoke what, when'' concurrently. SLIDAR leverages a sliding window approach and consists of an end-to-end diarizat… ▽ More This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary length inputs and can handle any number of speakers, effectively solving ``who spoke what, when'' concurrently. SLIDAR leverages a sliding window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally, for each window: transcripts, diarization and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and ``Whisper-style" prompting. The local outputs are then combined to get the final SD+ASR result by clustering the speaker embeddings to get global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios. △ Less

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2310.00704 [pdf, other]

UniAudio: An Audio Foundation Model Toward Universal Audio Generation

Authors: Dongchao Yang, **chuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Shinji Watanabe, Helen Meng

Abstract: Large Language models (LLM) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio (including speech, sounds, music, and singing) with given input conditions. UniAudio 1) first tokenizes all types of target audio along with other con… ▽ More Large Language models (LLM) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio (including speech, sounds, music, and singing) with given input conditions. UniAudio 1) first tokenizes all types of target audio along with other condition modalities, 2) concatenates source-target pair as a single sequence, and 3) performs next-token prediction using LLM. Also, a multi-scale Transformer model is proposed to handle the overly long sequences caused by the residual vector quantization based neural codec in tokenization. Training of UniAudio is scaled up to 165K hours of audio and 1B parameters, based on all generative tasks, aiming to obtain sufficient prior knowledge not only in the intrinsic properties of audio but also the inter-relationship between audio and other modalities. Therefore, the trained UniAudio model has the potential to become a foundation model for universal audio generation: it shows strong capability in all trained tasks and can seamlessly support new audio generation tasks after simple fine-tuning. Experiments demonstrate that UniAudio achieves state-of-the-art or at least competitive results on most of the 11 tasks. Demo and code are released at https://github.com/yangdongchao/UniAudio △ Less

Submitted 11 December, 2023; v1 submitted 1 October, 2023; originally announced October 2023.

arXiv:2309.17384 [pdf, other]

Toward Universal Speech Enhancement for Diverse Input Conditions

Authors: Wangyou Zhang, Kohei Saijo, Zhong-Qiu Wang, Shinji Watanabe, Yanmin Qian

Abstract: The past decade has witnessed substantial growth of data-driven speech enhancement (SE) techniques thanks to deep learning. While existing approaches have shown impressive performance in some common datasets, most of them are designed only for a single condition (e.g., single-channel, multi-channel, or a fixed sampling frequency) or only consider a single task (e.g., denoising or dereverberation).… ▽ More The past decade has witnessed substantial growth of data-driven speech enhancement (SE) techniques thanks to deep learning. While existing approaches have shown impressive performance in some common datasets, most of them are designed only for a single condition (e.g., single-channel, multi-channel, or a fixed sampling frequency) or only consider a single task (e.g., denoising or dereverberation). Currently, there is no universal SE approach that can effectively handle diverse input conditions with a single model. In this paper, we make the first attempt to investigate this line of research. First, we devise a single SE model that is independent of microphone channels, signal lengths, and sampling frequencies. Second, we design a universal SE benchmark by combining existing public corpora with multiple conditions. Our experiments on a wide range of datasets show that the proposed single model can successfully handle diverse conditions with strong performance. △ Less

Submitted 15 February, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

Comments: 6 pages, 3 figures, 5 tables, published in ASRU 2023 (corrected the results of noisy speech on CHiME-4 (Simu) in Table 4)

arXiv:2309.17352 [pdf, other]

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

Authors: Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, Shinji Watanabe

Abstract: Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this w… ▽ More Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this work, we strive to improve the performance of seq2seq AAC models by extensively leveraging pretrained models and large language models (LLMs). Specifically, we utilize BEATs to extract fine-grained audio features. Then, we employ Instructor LLM to fetch text embeddings of captions, and infuse their language-modality knowledge into BEATs audio features via an auxiliary InfoNCE loss function. Moreover, we propose a novel data augmentation method that uses ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mixtures, increase not only the amount but also the complexity and diversity of training data. During inference, we propose to employ nucleus sampling and a hybrid reranking algorithm, which has not been explored in AAC research. Combining our efforts, our model achieves a new state-of-the-art 32.6 SPIDEr-FL score on the Clotho evaluation split, and wins the 2023 DCASE AAC challenge. △ Less

Submitted 9 January, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

Comments: ICASSP 2024 camera-ready paper. Winner of the DCASE 2023 Challenge Task 6A: Automated Audio Captioning (AAC)

arXiv:2309.15826 [pdf, other]

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

Authors: Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe

Abstract: Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose a ST/MT multi-tasking framework with hard parameter sharing in which all model parameters are shared cross-modal… ▽ More Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose a ST/MT multi-tasking framework with hard parameter sharing in which all model parameters are shared cross-modally. Our method reduces the speech-text modality gap via a pre-processing stage which converts speech and text inputs into two discrete token sequences of similar length -- this allows models to indiscriminately process both modalities simply using a joint vocabulary. With experiments on MuST-C, we demonstrate that our multi-tasking framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU without any external MT data. Further, we show that this framework incorporates external MT data, yielding +0.8 BLEU, and also improves transfer learning from pre-trained textual models, yielding +1.8 BLEU. △ Less

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.15800 [pdf, other]

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, **chuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning repre… ▽ More Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations, which significantly compresses the size of speech data. Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length. Hence, training time is significantly reduced while retaining notable performance. In this study, we undertake a comprehensive and systematic exploration into the application of discrete units within end-to-end speech processing models. Experiments on 12 automatic speech recognition, 3 speech translation, and 1 spoken language understanding corpora demonstrate that discrete units achieve reasonably good results in almost all the settings. We intend to release our configurations and trained models to foster future research efforts. △ Less

Submitted 27 September, 2023; originally announced September 2023.

Comments: Submitted to IEEE ICASSP 2024

arXiv:2309.15686 [pdf, other]

Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

Authors: Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe, Sanjeev Khudanpur

Abstract: Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied. To bridge this gap, we introduce target language context in E2E-ST, enhancing coherence and overcoming memory constraints of extended audio segments. Additionally, we propose context dropout to ensure robustness to the absence of… ▽ More Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied. To bridge this gap, we introduce target language context in E2E-ST, enhancing coherence and overcoming memory constraints of extended audio segments. Additionally, we propose context dropout to ensure robustness to the absence of context, and further improve performance by adding speaker information. Our proposed contextual E2E-ST outperforms the isolated utterance-based E2E-ST approach. Lastly, we demonstrate that in conversational speech, contextual information primarily contributes to capturing context style, as well as resolving anaphora and named entities. △ Less

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.15674 [pdf, other]

Speech collage: code-switched audio generation by collaging monolingual corpora

Authors: Amir Hussein, Dorsa Zeinali, Ondřej Klejch, Matthew Wiesner, Brian Yan, Shammur Chowdhury, Ahmed Ali, Shinji Watanabe, Sanjeev Khudanpur

Abstract: Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness quality of audio generation using an overlap-add approach. We… ▽ More Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness quality of audio generation using an overlap-add approach. We investigate the impact of generated data on speech recognition in two scenarios: using in-domain CS text and a zero-shot approach with synthesized CS text. Empirical results highlight up to 34.4% and 16.2% relative reductions in Mixed-Error Rate and Word-Error Rate for in-domain and zero-shot scenarios, respectively. Lastly, we demonstrate that CS augmentation bolsters the model's code-switching inclination and reduces its monolingual bias. △ Less

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.15317 [pdf, other]

Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

Authors: William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe

Abstract: Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expenses and complexity required to handle many languages. This further harms the reproducibility of SSL, which is already limited to few research groups due to its resource usage. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more… ▽ More Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expenses and complexity required to handle many languages. This further harms the reproducibility of SSL, which is already limited to few research groups due to its resource usage. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more research groups. We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages. To build WavLabLM, we devise a novel multi-stage pre-training method, designed to address the language imbalance of multilingual data. WavLabLM achieves comparable performance to XLS-R on ML-SUPERB with less than 10% of the training data, making SSL realizable with academic compute. We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials. We open-source all code and models in ESPnet. △ Less

Submitted 27 September, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

Comments: Accepted to ASRU 2023

arXiv:2309.14922 [pdf, other]

doi 10.1109/ASRU57964.2023.10389796

Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

Authors: Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe

Abstract: Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference. This is primarily attributed to the incremental calculation of the decoder. This work proposes a partially AR framework, which employs segment-level vectorized beam sea… ▽ More Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference. This is primarily attributed to the incremental calculation of the decoder. This work proposes a partially AR framework, which employs segment-level vectorized beam search for improving the inference speed of an ASR model based on the hybrid connectionist temporal classification (CTC) attention-based architecture. It first generates an initial hypothesis using greedy CTC decoding, identifying low-confidence tokens based on their output probabilities. We then utilize the decoder to perform segment-level vectorized beam search on these tokens, re-predicting in parallel with minimal decoder calculations. Experimental results show that our method is 12 to 13 times faster in inference on the LibriSpeech corpus over AR decoding whilst preserving high accuracy. △ Less

Submitted 30 September, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

Comments: Accepted at ASRU 2023

Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop 2023

arXiv:2309.13927 [pdf, other]

ZZ-Interaction-Free Single-Qubit-Gate Optimization in Superconducting Qubits

Authors: Shu Watanabe, Yutaka Tabuchi, Kentaro Heya, Shuhei Tamate, Yasunobu Nakamura

Abstract: Overcoming the issue of qubit-frequency fluctuations is essential to realize stable and practical quantum computing with solid-state qubits. Static ZZ interaction, which causes a frequency shift of a qubit depending on the state of neighboring qubits, is one of the major obstacles to integrating fixed-frequency transmon qubits. Here we propose and experimentally demonstrate ZZ-interaction-free sin… ▽ More Overcoming the issue of qubit-frequency fluctuations is essential to realize stable and practical quantum computing with solid-state qubits. Static ZZ interaction, which causes a frequency shift of a qubit depending on the state of neighboring qubits, is one of the major obstacles to integrating fixed-frequency transmon qubits. Here we propose and experimentally demonstrate ZZ-interaction-free single-qubit-gate operations on a superconducting transmon qubit by utilizing a semi-analytically optimized pulse based on a perturbative analysis. The gate is designed to be robust against slow qubit-frequency fluctuations. The robustness of the optimized gate spans a few MHz, which is sufficient for suppressing the adverse effects of the ZZ interaction. Our result paves the way for an efficient approach to overcoming the issue of ZZ interaction without any additional hardware overhead. △ Less

Submitted 5 January, 2024; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: 11 pages, 8 figures

arXiv:2309.13876 [pdf, other]

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Authors: Yifan Peng, **chuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe

Abstract: Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for develo** such models (from data collection to training) is not publicly accessib… ▽ More Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for develo** such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science. △ Less

Submitted 24 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: Accepted at ASRU 2023

arXiv:2309.11379 [pdf, other]

doi 10.21437/Interspeech.2023-2225

Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

Authors: Peter Polák, Brian Yan, Shinji Watanabe, Alex Waibel, Ondřej Bojar

Abstract: Blockwise self-attentional encoder models have recently emerged as one promising end-to-end approach to simultaneous speech translation. These models employ a blockwise beam search with hypothesis reliability scoring to determine when to wait for more input speech before translating further. However, this method maintains multiple hypotheses until the entire speech input is consumed -- this scheme… ▽ More Blockwise self-attentional encoder models have recently emerged as one promising end-to-end approach to simultaneous speech translation. These models employ a blockwise beam search with hypothesis reliability scoring to determine when to wait for more input speech before translating further. However, this method maintains multiple hypotheses until the entire speech input is consumed -- this scheme cannot directly show a single \textit{incremental} translation to users. Further, this method lacks mechanisms for \textit{controlling} the quality vs. latency tradeoff. We propose a modified incremental blockwise beam search incorporating local agreement or hold-$n$ policies for quality-latency control. We apply our framework to models trained for online or offline translation and demonstrate that both types can be effectively used in online mode. Experimental results on MuST-C show 0.6-3.6 BLEU improvement without changing latency or 0.8-1.4 s latency improvement without changing quality. △ Less

Submitted 20 September, 2023; originally announced September 2023.

Comments: Accepted at INTERSPEECH 2023

Journal ref: Polák, P., Yan, B., Watanabe, S., Waibel, A., Bojar, O. (2023) Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff. Proc. INTERSPEECH 2023, 3979-3983

arXiv:2309.10926 [pdf, other]

Semi-Autoregressive Streaming ASR With Label Context

Authors: Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury

Abstract: Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy. Since NAR automatic speech recognition (ASR) models must wait for the completion of the entire utterance before processing, some works explore streaming NAR models based… ▽ More Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy. Since NAR automatic speech recognition (ASR) models must wait for the completion of the entire utterance before processing, some works explore streaming NAR models based on blockwise attention for low-latency applications. However, streaming NAR models significantly lag in accuracy compared to streaming AR and non-streaming NAR models. To address this, we propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context using a Language Model (LM) subnetwork. We also introduce a novel greedy decoding algorithm that addresses insertion and deletion errors near block boundaries while not significantly increasing the inference time. Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets. It also reduced the accuracy gap with streaming AR and non-streaming NAR models while achieving 2.5x lower latency. We also demonstrate that our approach can effectively utilize external text data to pre-train the LM subnetwork to further improve streaming ASR accuracy. △ Less

Submitted 20 February, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

Comments: Accepted at ICASSP 2024

arXiv:2309.10787 [pdf, other]

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Authors: Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee

Abstract: Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual a… ▽ More Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning. △ Less

Submitted 19 March, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

Comments: Accepted to ICASSP 2024; Evaluation Code: https://github.com/roger-tseng/av-superb Submission Platform: https://av.superbbenchmark.org

arXiv:2309.09510 [pdf, ps, other]

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

Authors: Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-yi Lee

Abstract: Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for bui… ▽ More Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion. To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark. To initiate, Dynamic-SUPERB features 55 evaluation instances by combining 33 tasks and 22 datasets. This spans a broad spectrum of dimensions, providing a comprehensive platform for evaluation. Additionally, we propose several approaches to establish benchmark baselines. These include the utilization of speech models, text language models, and the multimodal encoder. Evaluation results indicate that while these baselines perform reasonably on seen tasks, they struggle with unseen ones. We release all materials to the public and welcome researchers to collaborate on the project, advancing technologies in the field together. △ Less

Submitted 22 March, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: To appear in the proceedings of ICASSP 2024

arXiv:2309.08876 [pdf, ps, other]

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

Authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Abstract: Collecting audio-text pairs is expensive; however, it is much easier to access text-only data. Unless using shallow fusion, end-to-end automatic speech recognition (ASR) models require architecture modifications or additional training schemes to use text-only data. Inspired by recent advances in decoder-only language models (LMs), such as GPT-3 and PaLM adopted for speech-processing tasks, we prop… ▽ More Collecting audio-text pairs is expensive; however, it is much easier to access text-only data. Unless using shallow fusion, end-to-end automatic speech recognition (ASR) models require architecture modifications or additional training schemes to use text-only data. Inspired by recent advances in decoder-only language models (LMs), such as GPT-3 and PaLM adopted for speech-processing tasks, we propose using a decoder-only architecture for ASR with simple text augmentation. To provide audio information, encoder features compressed by CTC prediction are used as prompts for the decoder, which can be regarded as refining CTC prediction using the decoder-only model. Because the decoder architecture is the same as an autoregressive LM, it is simple to enhance the model by leveraging external text data with LM training. An experimental comparison using LibriSpeech and Switchboard shows that our proposed models with text augmentation training reduced word error rates from ordinary CTC by 0.3% and 1.4% on LibriSpeech test-clean and testother set, respectively, and 2.9% and 5.0% on Switchboard and CallHome. The proposed model had advantage on computational efficiency compared with conventional encoder-decoder ASR models with a similar parameter setup, and outperformed them on the LibriSpeech 100h and Switchboard training scenarios. △ Less

Submitted 9 January, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

arXiv:2309.08535 [pdf, other]

Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

Authors: Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

Abstract: This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited number of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the… ▽ More This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited number of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention. To this end, we employ a Whisper model which can conduct both language identification and audio-based speech recognition. It serves to filter data of the desired languages and transcribe labels from the unannotated, multilingual audio-visual data pool. By comparing the performances of VSR models trained on automatic labels and the human-annotated labels, we show that we can achieve similar VSR performance to that of human-annotated labels even without utilizing human annotations. Through the automated labeling process, we label large-scale unlabeled multilingual databases, VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four low VSR resource languages, French, Italian, Spanish, and Portuguese. With the automatic labels, we achieve new state-of-the-art performance on mTEDx in four languages, significantly surpassing the previous methods. The automatic labels are available online: https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages △ Less

Submitted 12 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

Comments: Accepted at ICASSP 2024

arXiv:2309.08531 [pdf, other]

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

Authors: Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

Abstract: In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-s… ▽ More In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model. The speech units mainly contain linguistic information while suppressing other characteristics of speech. This allows us to incorporate the language modeling capability of the pre-trained vision-language model into the spoken language modeling of Im2Sp. With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases, COCO and Flickr8k. Then, we further improve the efficiency of the Im2Sp model. Similar to the speech unit case, we convert the original image into image units, which are derived through vector quantization of the raw image. With these image units, we can drastically reduce the required data storage for saving image data to just 0.8% when compared to the original image data in terms of bits. Demo page: https://ms-dot-k.github.io/Image-to-Speech-Captioning. △ Less

Submitted 15 September, 2023; originally announced September 2023.

arXiv:2309.08348 [pdf, other]

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

Authors: Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang, Hongbo Lan, Jun Du, Chin-Hui Lee, **gdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao

Abstract: Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023… ▽ More Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhance-ment challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the ac-curacy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward. △ Less

Submitted 15 September, 2023; originally announced September 2023.

Comments: 5 pages, 4 figures

arXiv:2309.07937 [pdf, other]

Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

Authors: Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe

Abstract: We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech syn… ▽ More We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with improvements in both speech intelligibility from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over the single-task counterpart. Further, VoxtLM is trained with publicly available data and training recipes and model checkpoints are open-sourced to make fully reproducible work. △ Less

Submitted 24 January, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

arXiv:2308.10107 [pdf, other]

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

Authors: **chuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu, Shinji Watanabe

Abstract: Automatic speech recognition (ASR) based on transducers is widely used. In training, a transducer maximizes the summed posteriors of all paths. The path with the highest posterior is commonly defined as the predicted alignment between the speech and the transcription. While the vanilla transducer does not have a prior preference for any of the valid paths, this work intends to enforce the preferre… ▽ More Automatic speech recognition (ASR) based on transducers is widely used. In training, a transducer maximizes the summed posteriors of all paths. The path with the highest posterior is commonly defined as the predicted alignment between the speech and the transcription. While the vanilla transducer does not have a prior preference for any of the valid paths, this work intends to enforce the preferred paths and achieve controllable alignment prediction. Specifically, this work proposes Bayes Risk Transducer (BRT), which uses a Bayes risk function to set lower risk values to the preferred paths so that the predicted alignment is more likely to satisfy specific desired properties. We further demonstrate that these predicted alignments with intentionally designed properties can provide practical advantages over the vanilla transducer. Experimentally, the proposed BRT saves inference cost by up to 46% for non-streaming ASR and reduces overall system latency by 41% for streaming ASR. △ Less

Submitted 19 August, 2023; originally announced August 2023.

Journal ref: Interspeech 2023

Showing 51–100 of 845 results for author: Watanabe, S