
TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement

Authors: Yunyang Zeng, Joseph Konan, Shuo Han, David Bick, Muqiao Yang, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

Abstract: Speech enhancement models have greatly progressed in recent years, but still show limits in perceptual quality of their speech outputs. We propose an objective for perceptual quality based on temporal acoustic parameters. These are fundamental speech features that play an essential role in various applications, including speaker recognition and paralinguistic analysis. We provide a differentiable estimator for four categories of low-level acoustic descriptors involving: frequency-related parameters, energy or amplitude-related parameters, spectral balance parameters, and temporal features. Unlike prior work that looks at aggregated acoustic parameters or a few categories of acoustic parameters, our temporal acoustic parameter (TAP) loss enables auxiliary optimization and improvement of many fine-grain speech characteristics in enhancement workflows. We show that adding TAPLoss as an auxiliary objective in speech enhancement produces speech with improved perceptual quality and intelligibility. We use data from the Deep Noise Suppression 2020 Challenge to demonstrate that both time-domain models and time-frequency domain models can benefit from our method.

Submitted 15 February, 2023; originally announced February 2023.

Comments: Accepted at ICASSP 2023

arXiv:2302.08059 [pdf, other]

A Geometric Reduction Approach for Identity Testing of Reversible Markov Chains

Authors: Geoffrey Wolfer, Shun Watanabe

Abstract: We consider the problem of testing the identity of a reversible Markov chain against a reference from a single trajectory of observations. Employing the recently introduced notion of a lumping-congruent Markov embedding, we show that, at least in a mildly restricted setting, testing identity to a reversible chain reduces to testing to a symmetric chain over a larger state space and recover state-of-the-art sample complexity for the problem.

Submitted 15 February, 2023; originally announced February 2023.

arXiv:2302.07928 [pdf, other]

Multi-Channel Target Speaker Extraction with Refinement: The WavLab Submission to the Second Clarity Enhancement Challenge

Authors: Samuele Cornell, Zhong-Qiu Wang, Yoshiki Masuyama, Shinji Watanabe, Manuel Pariente, Nobutaka Ono

Abstract: This paper describes our submission to the Second Clarity Enhancement Challenge (CEC2), which consists of target speech enhancement for hearing-aid (HA) devices in noisy-reverberant environments with multiple interferers such as music and competing speakers. Our approach builds upon the powerful iterative neural/beamforming enhancement (iNeuBe) framework introduced in our recent work, and this paper extends it for target speaker extraction. We therefore name the proposed approach iNeuBe-X, where the X stands for extraction. To address the challenges encountered in the CEC2 setting, we introduce four major novelties: (1) we extend the state-of-the-art TF-GridNet model, originally designed for monaural speaker separation, for multi-channel, causal speech enhancement, and large improvements are observed by replacing the TCNDenseNet used in iNeuBe with this new architecture; (2) we leverage a recent dual window size approach with future-frame prediction to ensure that iNeuBe-X satisfies the 5 ms constraint on algorithmic latency required by CEC2; (3) we introduce a novel speaker-conditioning branch for TF-GridNet to achieve target speaker extraction; (4) we propose a fine-tuning step, where we compute an additional loss with respect to the target speaker signal compensated with the listener audiogram. Without using external data, on the official development set our best model reaches a hearing-aid speech perception index (HASPI) score of 0.942 and a scale-invariant signal-to-distortion ratio improvement (SI-SDRi) of 18.8 dB. These results are promising given that the CEC2 data is extremely challenging (e.g., on the development set the mixture SI-SDR is -12.3 dB). A demo of our submitted system is available at WAVLab CEC2 demo.

Submitted 15 February, 2023; originally announced February 2023.
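The SI-SDR and SI-SDRi figures quoted in the abstract above follow a standard definition: project the estimate onto the reference, treat the residual as distortion, and report the energy ratio in dB. The snippet below is a minimal sketch of that definition in plain Python, not the challenge's or the authors' evaluation code; the function names `si_sdr` and `si_sdr_improvement` are illustrative assumptions.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (textbook definition)."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        raise ValueError("reference signal is all zeros")
    # Optimal scaling of the reference removes any dependence on the estimate's gain.
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    distortion = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)
    if distortion_energy == 0:
        return math.inf   # estimate is a scaled copy of the reference
    if target_energy == 0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10 * math.log10(target_energy / distortion_energy)

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the enhanced signal over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because the reference is rescaled before the residual is computed, multiplying the estimate by any non-zero constant leaves the score unchanged, which is what "scale-invariant" refers to.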

arXiv:2302.06774 [pdf, other]

Speaker-Independent Acoustic-to-Articulatory Speech Inversion

Authors: Peter Wu, Li-Wei Chen, Cheol Jun Cho, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

Abstract: To build speech processing methods that can handle speech as naturally as humans, researchers have explored multiple ways of building an invertible mapping from speech to an interpretable space. The articulatory space is a promising inversion target, since this space captures the mechanics of speech production. To this end, we build an acoustic-to-articulatory inversion (AAI) model that leverages self-supervision to generalize to unseen speakers. Our approach obtains 0.784 correlation on an electromagnetic articulography (EMA) dataset, improving the state-of-the-art by 12.5%. Additionally, we show the interpretability of these representations through directly comparing the behavior of estimated representations with speech production behavior. Finally, we propose a resynthesis-based AAI evaluation metric that does not rely on articulatory labels, demonstrating its efficacy with an 18-speaker dataset.

Submitted 24 July, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

arXiv:2302.04689 [pdf, other]

doi 10.1051/0004-6361/202244822

The Venus' Cloud Discontinuity in 2022

Authors: J. Peralta, A. Cidadão, L. Morrone, C. Foster, M. Bullock, E. F. Young, I. Garate-Lopez, A. Sánchez-Lavega, T. Horinouchi, T. Imamura, E. Kardasis, A. Yamazaki, S. Watanabe

Abstract: First identified in 2016 by JAXA's Akatsuki mission, the discontinuity/disruption is a recurrent wave observed to propagate for decades at the deeper clouds of Venus (47--56 km above the surface), while its absence at the clouds' top ($\sim$70 km) suggests that it dissipates at the upper clouds and contributes to the maintenance of the puzzling atmospheric superrotation of Venus through wave-mean flow interaction. Taking advantage of the campaign of ground-based observations undertaken in coordination with the Akatsuki mission from December 2021 until July 2022, we aimed to undertake the longest uninterrupted monitoring of the cloud discontinuity to date to obtain a pioneering long-term characterization of its main properties and better constrain its recurrence and lifetime. The dayside upper, middle and nightside lower clouds were studied with images with suitable filters acquired by Akatsuki/UVI, amateur observers and NASA's IRTF/SpeX, respectively. Hundreds of images were inspected in search of manifestations of the discontinuity events and to measure key properties like its dimensions, orientation or rotation period. We succeeded in tracking the discontinuity at the middle clouds for 109 days without interruption. The discontinuity exhibited properties nearly identical to measurements in 2016 and 2020, with an orientation of $91^{\circ}\pm 8^{\circ}$, length/width of $4100\pm 800$ / $500\pm 100$ km and a rotation period of $5.11\pm 0.09$ days. Ultraviolet images during 13-14 June 2022 suggest that the discontinuity may have manifested at the top of the clouds for $\sim$21 hours as a result of an altitude change in the critical level for this wave due to slower zonal winds.

Submitted 9 February, 2023; originally announced February 2023.

Comments: 8 pages, 4 figures, 2 animated figures, 1 table

Journal ref: A&A 672, L2 (2023)

arXiv:2302.04215 [pdf, other]

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

Authors: Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

Abstract: Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe the mismatch between training and inference alignments in mel-spectrogram based autoregressive models, leading to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolve this issue. We introduce our MQTTS system whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to identify the efficacy of our methods. We show that MQTTS outperforms existing TTS systems in several objective and subjective measures.

Submitted 8 February, 2023; originally announced February 2023.

Comments: Accepted to AAAI 2023

arXiv:2301.12596 [pdf, other]

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Authors: Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari

Abstract: While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.

Submitted 27 May, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

Comments: To appear in IJCAI 2023

arXiv:2301.09099 [pdf, ps, other]

Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study

Authors: Massa Baali, Tomoki Hayashi, Hamdy Mubarak, Soumi Maiti, Shinji Watanabe, Wassim El-Hajj, Ahmed Ali

Abstract: Several high-resource Text to Speech (TTS) systems currently produce natural, well-established human-like speech. In contrast, low-resource languages, including Arabic, have very limited TTS systems due to the lack of resources. We propose a fully unsupervised method for building TTS, including automatic data selection and pre-training/fine-tuning strategies for TTS training, using broadcast news as a case study. We show how careful selection of data, even in smaller amounts, can improve the efficiency of a TTS system in generating more natural speech than a system trained on a bigger dataset. We propose different approaches for: 1) data: we applied automatic annotations using DNSMOS, automatic vowelization, and automatic speech recognition (ASR) for fixing transcription errors; 2) model: we used transfer learning from a high-resource language in the TTS model and fine-tuned it with one hour of broadcast recordings, then used this model to guide a FastSpeech2-based Conformer model for duration. Our objective evaluation shows 3.9% character error rate (CER), while the ground truth has 1.3% CER. As for the subjective evaluation, where 1 is bad and 5 is excellent, our FastSpeech2-based Conformer model achieved a mean opinion score (MOS) of 4.4 for intelligibility and 4.2 for naturalness, where many annotators recognized the voice of the broadcaster, which proves the effectiveness of our proposed unsupervised method.

Submitted 26 January, 2023; v1 submitted 22 January, 2023; originally announced January 2023.
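The character error rate (CER) cited in the abstract above is conventionally the character-level edit distance between a reference transcript and an ASR transcript of the synthesized audio, divided by the reference length. The following is a minimal sketch of that conventional definition, not the authors' scoring script; `edit_distance` and `cer` are illustrative names, and real evaluations typically normalize text (for example, stripping diacritics or punctuation) before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference character
                cur[j - 1] + 1,          # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, free when characters match
            ))
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `cer("speech", "speach")` counts one substitution over six reference characters. Note that CER can exceed 1.0 when the hypothesis is much longer than the reference.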

arXiv:2301.08547 [pdf, ps, other]

Infinite collision property for the three-dimensional uniform spanning tree

Authors: Satomi Watanabe

Abstract: Let $\mathcal{U}$ be the uniform spanning tree on $\mathbb{Z}^3$, whose probability law is denoted by $\mathbf{P}$. For $\mathbf{P}$-a.s. realization of $\mathcal{U}$, the recurrence of the simple random walk on $\mathcal{U}$ is proved in [5] and it is also demonstrated in [8] that two independent simple random walks on $\mathcal{U}$ collide infinitely often. In this article, we will give a quantitative estimate on the number of collisions of two independent simple random walks on $\mathcal{U}$, which provides another proof of the infinite collision property of $\mathcal{U}$.

Submitted 9 February, 2023; v1 submitted 20 January, 2023; originally announced January 2023.

Comments: 13 pages, 1 figure

MSC Class: 60K37

arXiv:2212.10818 [pdf, other]

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Authors: Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe

Abstract: The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models. Since each of these network architectures has pros and cons, a typical use case is to switch these separate models depending on the application requirement, resulting in the increased overhead of maintaining all models. Several methods for integrating two of these complementary models to mitigate the overhead issue have been proposed; however, if we integrate more models, we will further benefit from these complementary models and realize broader applications with a single system. This paper proposes four-decoder joint modeling (4D) of CTC, attention, RNN-T, and mask-predict, which has the following three advantages: 1) The four decoders are jointly trained so that they can be easily switched depending on the application scenarios. 2) Joint training may bring model regularization and improve the model robustness thanks to their complementary properties. 3) Novel one-pass joint decoding methods using CTC, attention, and RNN-T further improve the performance. The experimental results showed that the proposed model consistently reduced the WER.

Submitted 29 May, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

Comments: Accepted by INTERSPEECH 2023

arXiv:2212.10801 [pdf, other]

Measurement of the cosmogenic neutron yield in Super-Kamiokande with gadolinium loaded water

Authors: Super-Kamiokande Collaboration: M. Shinoki, K. Abe, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya, et al. (217 additional authors not shown)

Abstract: Cosmic-ray muons that enter the Super-Kamiokande detector cause hadronic showers due to spallation in water, producing neutrons and radioactive isotopes. Those are a major background source for studies of MeV-scale neutrinos and searches for rare events. In 2020, gadolinium was introduced in the ultra-pure water in the Super-Kamiokande detector to improve the detection efficiency of neutrons. In this study, the cosmogenic neutron yield was measured using data acquired during the period after the gadolinium loading. The yield was found to be $(2.76 \pm 0.02\,\mathrm{(stat.)} \pm 0.19\,\mathrm{(syst.)}) \times 10^{-4}\,\mu^{-1}\,\mathrm{g^{-1}\,cm^{2}}$ at 259 GeV of average muon energy at the Super-Kamiokande detector.

Submitted 25 October, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

Comments: 10 pages, 10 figures, 3 tables

arXiv:2212.10525 [pdf, other]

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

Authors: Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

Abstract: Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will be publishing for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparisons. In this work, we present the details of data collection and annotation and the performance of the baseline models. We also perform sensitivity analysis of pipeline models' performance (speech recognizer + text model) to the speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.

Submitted 15 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: accepted in ACL 2023 (long paper)

arXiv:2212.08542 [pdf, other]

Context-aware Fine-tuning of Self-supervised Speech Models

Authors: Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu, Shinji Watanabe

Abstract: Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tuning. We attach a context module on top of the last layer of a pre-trained model to encode the whole segment into a context embedding vector which is then used as an additional feature for the final prediction. During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments. This allows the model to make predictions without access to these surrounding segments at inference time and requires only a tiny overhead compared to standard fine-tuned models. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: Automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). The results show that context-aware fine-tuning not only outperforms a standard fine-tuning baseline but also rivals a strong context injection baseline that uses neighboring speech segments during inference.

Submitted 28 March, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

arXiv:2212.08055 [pdf, other]

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Authors: Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan Wang, Yu-An Chung, Yun Tang, Ann Lee, Shinji Watanabe, Juan Pino

Abstract: Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.

Submitted 26 May, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

Comments: ACL 2023 (main conference)

arXiv:2212.06751 [pdf, other]

Speeding Up Multi-Objective Hyperparameter Optimization by Task Similarity-Based Meta-Learning for the Tree-Structured Parzen Estimator

Authors: Shuhei Watanabe, Noor Awad, Masaki Onishi, Frank Hutter

Abstract: Hyperparameter optimization (HPO) is a vital step in improving performance in deep learning (DL). Practitioners are often faced with the trade-off between multiple criteria, such as accuracy and latency. Given the high computational needs of DL and the growing demand for efficient HPO, the acceleration of multi-objective (MO) optimization becomes ever more important. Despite the significant body of work on meta-learning for HPO, existing methods are inapplicable to MO tree-structured Parzen estimator (MO-TPE), a simple yet powerful MO-HPO algorithm. In this paper, we extend TPE's acquisition function to the meta-learning setting using a task similarity defined by the overlap of top domains between tasks. We also theoretically analyze and address the limitations of our task similarity. In the experiments, we demonstrate that our method speeds up MO-TPE on tabular HPO benchmarks and attains state-of-the-art performance. Our method was also validated externally by winning the AutoML 2022 competition on "Multiobjective Hyperparameter Optimization for Transformers".

Submitted 31 May, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

Comments: Accepted to IJCAI 2023

arXiv:2212.04559 [pdf, other]

SpeechLMScore: Evaluating speech generation using speech language model

Authors: Soumi Maiti, Yifan Peng, Takaaki Saeki, Shinji Watanabe

Abstract: While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming. Previous studies on automatic speech quality assessment address the problem by predicting human evaluation scores with machine learning models. However, they rely on supervised learning and thus suffer from high annotation costs and domain-shift problems. We propose SpeechLMScore, an unsupervised metric to evaluate generated speech using a speech-language model. SpeechLMScore computes the average log-probability of a speech signal by mapping it into discrete tokens and measures the average probability of generating the sequence of tokens. Therefore, it does not require human annotation and is a highly scalable framework. Evaluation results demonstrate that the proposed metric shows a promising correlation with human evaluation scores on different speech generation tasks including voice conversion, text-to-speech, and speech enhancement.

Submitted 8 December, 2022; originally announced December 2022.
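The abstract above describes the metric as the average log-probability of a discrete token sequence under a language model. The sketch below illustrates only that averaging step. It is not the SpeechLMScore implementation: the speech-to-token encoder is omitted, and a toy add-alpha smoothed bigram model (`train_bigram_lm`, a hypothetical helper) stands in for the speech language model, which would normally condition on the full token history rather than just the previous token.

```python
import math
from collections import Counter, defaultdict

def train_bigram_lm(sequences, vocab_size, alpha=1.0):
    """Toy stand-in for a speech LM: add-alpha smoothed bigram over token ids."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1

    def log_prob(cur, prev):
        total = sum(counts[prev].values())
        return math.log((counts[prev][cur] + alpha) / (total + alpha * vocab_size))

    return log_prob

def avg_log_prob(tokens, log_prob):
    """Mean log-probability of each token given its predecessor (higher = more likely)."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    pairs = list(zip(tokens, tokens[1:]))
    return sum(log_prob(cur, prev) for prev, cur in pairs) / len(pairs)
```

With a model fitted on alternating sequences, an alternating test sequence scores higher than a constant one, which mirrors the intuition that speech resembling the training distribution receives a higher score.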

arXiv:2211.17196 [pdf, other]

EURO: ESPnet Unsupervised ASR Open-source Toolkit

Authors: Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola Garcia, Hung-yi Lee, Shinji Watanabe, Sanjeev Khudanpur

Abstract: This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR). EURO adopts the state-of-the-art UASR learning method introduced by Wav2vec-U, originally implemented in FAIRSEQ, which leverages self-supervised speech representations and adversarial training. In addition to wav2vec2, EURO extends the functionality and promotes reproducibility for UASR tasks by integrating S3PRL and k2, resulting in flexible frontends from 27 self-supervised models and various graph-based decoding strategies. EURO is implemented in ESPnet and follows its unified pipeline to provide UASR recipes with a complete setup. This improves the pipeline's efficiency and allows EURO to be easily applied to existing datasets in ESPnet. Extensive experiments on three mainstream self-supervised models demonstrate the toolkit's effectiveness and achieve state-of-the-art UASR performance on TIMIT and LibriSpeech datasets. EURO will be publicly available at https://github.com/espnet/espnet, aiming to promote this exciting and emerging research area based on UASR through open-source activity.

Submitted 20 May, 2023; v1 submitted 30 November, 2022; originally announced November 2022.

arXiv:2211.15031 [pdf, ps, other]

Volume and heat kernel fluctuations for the three-dimensional uniform spanning tree

Authors: Daisuke Shiraishi, Satomi Watanabe

Abstract: Let $\mathcal{U}$ be the uniform spanning tree on $\mathbb{Z}^{3}$. We show the occurrence of log-logarithmic fluctuations around the leading order for the volume of intrinsic balls in $\mathcal{U}$. As an application, we obtain similar fluctuations for the quenched heat kernel of the simple random walk on $\mathcal{U}$.

Submitted 27 November, 2022; originally announced November 2022.

Comments: 39 pages, 20 figures

MSC Class: 60K37 (primary); 60D05

arXiv:2211.14411 [pdf, other]

c-TPE: Tree-structured Parzen Estimator with Inequality Constraints for Expensive Hyperparameter Optimization

Authors: Shuhei Watanabe, Frank Hutter

Abstract: Hyperparameter optimization (HPO) is crucial for strong performance of deep learning algorithms and real-world applications often impose some constraints, such as memory usage, or latency on top of the performance requirement. In this work, we propose constrained TPE (c-TPE), an extension of the widely-used versatile Bayesian optimization method, tree-structured Parzen estimator (TPE), to handle these constraints. Our proposed extension goes beyond a simple combination of an existing acquisition function and the original TPE, and instead includes modifications that address issues that cause poor performance. We thoroughly analyze these modifications both empirically and theoretically, providing insights into how they effectively overcome these challenges. In the experiments, we demonstrate that c-TPE exhibits the best average rank performance among existing methods with statistical significance on 81 expensive HPO with inequality constraints. Due to the lack of baselines, we only discuss the applicability of our method to hard-constrained optimization in Appendix D.

Submitted 26 May, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

Comments: Accepted to IJCAI 2023

arXiv:2211.12433 [pdf, other]

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

Authors: Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

Abstract: We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of input signals are stacked as features to predict target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation and dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on monaural reverberant speaker separation using the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-microphone complex spectral mapping, and integrate it into a two-DNN system with a beamformer in between (named as MISO-BF-MISO in earlier studies), where the beamformer proposed in this paper is a novel multi-frame Wiener filter computed based on the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and noisy-reverberant speech enhancement. State-of-the-art performance is obtained on a dereverberation dataset and on the dataset of the recent L3DAS22 multi-channel speech enhancement challenge.

Submitted 4 August, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: In IEEE/ACM Transactions on Audio, Speech, and Language Processing. A sound demo is available at https://zqwang7.github.io/demos/TF-GridNet-demo/index.html, and the code is available at https://github.com/espnet/espnet/pull/5395

arXiv:2211.10049 [pdf, ps, other]

Recent Advances in Algebraic Geometry and Bayesian Statistics

Authors: Sumio Watanabe

Abstract: This article is a review of theoretical advances in the research field of algebraic geometry and Bayesian statistics in the last two decades. Many statistical models and learning machines which contain hierarchical structures or latent variables are called nonidentifiable, because the map from a parameter to a statistical model is not one-to-one. In nonidentifiable models, both the likelihood function and the posterior distribution have singularities in general, hence it was difficult to analyze their statistical properties. However, from the end of the 20th century, new theory and methodology based on algebraic geometry have been established which enables us to investigate such models and machines in the real world. In this article, the following results in recent advances are reported. First, we explain the framework of Bayesian statistics and introduce a new perspective from the birational geometry. Second, two mathematical solutions are derived based on algebraic geometry. An appropriate parameter space can be found by a resolution map, which makes the posterior distribution be normal crossing and the log likelihood ratio function be well-defined. Third, three applications to statistics are introduced. The posterior distribution is represented by the renormalized form, the asymptotic free energy is derived, and the universal formula among the generalization loss, the cross validation, and the information criterion is established. Two mathematical solutions and three applications to statistics based on algebraic geometry reported in this article are now being used in many practical fields in data science and artificial intelligence.

Submitted 18 November, 2022; originally announced November 2022.

arXiv:2211.08989 [pdf, other]

Avoid Overthinking in Self-Supervised Models for Speech Recognition

Authors: Dan Berrebbi, Brian Yan, Shinji Watanabe

Abstract: Self-supervised learning (SSL) models reshaped our approach to speech, language and vision. However, their huge size and the opaque relations between their layers and tasks result in slow inference and network overthinking, where predictions made from the last layer of large models are worse than those made from intermediate layers. Early exit (EE) strategies can solve both issues by dynamically reducing computations at inference time for certain samples. Although popular for classification tasks in vision and language, EE has seen less use for sequence-to-sequence speech recognition (ASR) tasks where outputs from early layers are often degenerate. This challenge is further compounded when speech SSL models are applied on out-of-distribution (OOD) data. This paper first shows that SSL models do overthinking in ASR. We then motivate further research in EE by computing an optimal bound for performance versus speed trade-offs. To approach this bound we propose two new strategies for ASR: (1) we adapt the recently proposed patience strategy to ASR; and (2) we design a new EE strategy specific to ASR that performs better than all strategies previously introduced.

Submitted 1 November, 2022; originally announced November 2022.

arXiv:2211.08726 [pdf, other]

Streaming Joint Speech Recognition and Disfluency Detection

Authors: Hayato Futami, Emiru Tsunoo, Kentaro Shibata, Yosuke Kashiwagi, Takao Okuda, Siddhant Arora, Shinji Watanabe

Abstract: Disfluency detection has mainly been solved in a pipeline approach, as post-processing of speech recognition. In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner. Compared to pipeline approaches, the joint models can leverage acoustic information that makes disfluency detection robust to recognition errors and provide non-verbal clues. Moreover, joint modeling results in low-latency and lightweight inference. We investigate two joint model variants for streaming disfluency detection: a transcript-enriched model and a multi-task model. The transcript-enriched model is trained on text with special tags indicating the starting and ending points of the disfluent part. However, it has problems with latency and standard language model adaptation, which arise from the additional disfluency tags. We propose a multi-task model to solve such problems, which has two output layers at the Transformer decoder; one for speech recognition and the other for disfluency detection. It is modeled to be conditioned on the currently recognized token with an additional token-dependency mechanism. We show that the proposed joint models outperformed a BERT-based pipeline approach in both accuracy and latency, on both the Switchboard and the corpus of spontaneous Japanese.

Submitted 11 May, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

Comments: Accepted at ICASSP2023

arXiv:2211.06535 [pdf, other]

A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units

Authors: Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

Abstract: We present a unified system to realize one-shot voice conversion (VC) on the pitch, rhythm, and speaker attributes. Existing works generally ignore the correlation between prosody and language content, leading to the degradation of naturalness in converted speech. Additionally, the lack of proper language features prevents these systems from accurately preserving language content after conversion. To address these issues, we devise a cascaded modular system leveraging self-supervised discrete speech units as language representation. These discrete units provide duration information essential for rhythm modeling. Our system first extracts utterance-level prosody and speaker representations from the raw waveform. Given the prosody representation, a prosody predictor estimates pitch, energy, and duration for each discrete unit in the utterance. A synthesizer further reconstructs speech based on the predicted prosody, speaker representation, and discrete units. Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability. Code and samples are publicly available.

Submitted 11 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2023

arXiv:2211.05967 [pdf, ps, other]

Align, Write, Re-order: Explainable End-to-End Speech Translation via Operation Sequence Generation

Authors: Motoi Omachi, Brian Yan, Siddharth Dalmia, Yuya Fujita, Shinji Watanabe

Abstract: The black-box nature of end-to-end speech translation (E2E ST) systems makes it difficult to understand how source language inputs are being mapped to the target language. To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word. A major challenge arises from the fact that translation is a non-monotonic sequence transduction task due to word ordering differences between languages -- this clashes with the monotonic nature of ASR. Therefore, we propose to generate ST tokens out-of-order while remembering how to re-order them later. We achieve this by predicting a sequence of tuples consisting of a source word, the corresponding target words, and post-editing operations dictating the correct insertion points for the target word. We examine two variants of such operation sequences which enable generation of monotonic transcriptions and non-monotonic translations from the same speech input simultaneously. We apply our approach to offline and real-time streaming models, demonstrating that we can provide explainable translations without sacrificing quality or latency. In fact, the delayed re-ordering ability of our approach improves performance during streaming. As an added benefit, our method performs ASR and ST simultaneously, making it faster than using two separate systems to perform these tasks.

Submitted 10 November, 2022; originally announced November 2022.

arXiv:2211.05869 [pdf, other]

A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models for Spoken Language Understanding

Authors: Yifan Peng, Siddhant Arora, Yosuke Higuchi, Yushi Ueda, Sujay Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe

Abstract: Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition task, respectively.

Submitted 10 November, 2022; originally announced November 2022.

Comments: Accepted at SLT 2022

arXiv:2211.03541 [pdf, other]

Multi-blank Transducers for Speech Recognition

Authors: Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe, Boris Ginsburg

Abstract: This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90%/+139% to model inference for English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo (https://github.com/NVIDIA/NeMo) toolkit.

Submitted 11 April, 2024; v1 submitted 4 November, 2022; originally announced November 2022.

Journal ref: ICASSP 2023
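
Note on the entry above: the abstract states that a standard blank consumes one input frame while a big blank consumes two or more. The sketch below illustrates how that rule could change a greedy decoding loop. It is an illustration under assumptions, not the paper's or NeMo's implementation: `predict`, `blank_durations`, the symbol names, and the scripted toy predictor are all hypothetical.

```python
from typing import Callable, Dict, List


def greedy_decode(
    num_frames: int,
    predict: Callable[[int, List[str]], str],
    blank_durations: Dict[str, int],
    max_symbols_per_frame: int = 10,
) -> List[str]:
    """Greedy transducer-style decoding where each blank skips a set number of frames."""
    t = 0
    hyp: List[str] = []
    emitted_here = 0
    while t < num_frames:
        sym = predict(t, hyp)
        if sym in blank_durations:
            # A standard blank advances 1 frame; a big blank advances 2 or more.
            t += blank_durations[sym]
            emitted_here = 0
        else:
            # A non-blank token is emitted without advancing the frame index.
            hyp.append(sym)
            emitted_here += 1
            if emitted_here >= max_symbols_per_frame:
                # Safety guard so a predictor that never emits a blank cannot loop forever.
                t += 1
                emitted_here = 0
    return hyp


# Scripted toy predictor keyed by (frame index, hypothesis length).
script = {(0, 0): "a", (0, 1): "<b2>", (2, 1): "b", (2, 2): "<b>", (3, 2): "<b3>"}
durations = {"<b>": 1, "<b2>": 2, "<b3>": 3}
result = greedy_decode(6, lambda t, hyp: script[(t, len(hyp))], durations)
```

In this toy run the six frames are covered with five predictor calls; with only the standard blank, each frame would need at least one call, which is the intuition behind the inference speedups reported in the abstract.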

arXiv:2211.03025 [pdf, other]

Bridging Speech and Textual Pre-trained Models with Unsupervised ASR

Authors: Jiatong Shi, Chan-Jan Hsu, Holam Chung, Dongji Gao, Paola Garcia, Shinji Watanabe, Ann Lee, Hung-yi Lee

Abstract: Spoken language understanding (SLU) is a task aiming to extract high-level semantics from spoken utterances. Previous works have investigated the use of speech self-supervised models and textual pre-trained models, which have shown reasonable improvements to various SLU tasks. However, because of the mismatched modalities between speech signals and text tokens, previous methods usually need complex designs of the frameworks. This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models, resulting in an unsupervised speech-to-semantic pre-trained model for various tasks in SLU. To be specific, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges different modalities used in speech and textual pre-trained models. Our experiments show that unsupervised ASR itself can improve the representations from speech self-supervised models. More importantly, it is shown as an efficient connector between speech and textual pre-trained models, improving the performances of five different SLU tasks. Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.

Submitted 6 November, 2022; originally announced November 2022.

Comments: ICASSP2023 submission

arXiv:2211.02333 [pdf, other]

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

Authors: Yusuke Shinohara, Shinji Watanabe

Abstract: Sequence transducers, such as the RNN-T and the Conformer-T, are among the most promising models of end-to-end speech recognition, especially in streaming scenarios where both latency and accuracy are important. Although various methods, such as alignment-restricted training and FastEmit, have been studied to reduce the latency, latency reduction is often accompanied by a significant degradation in accuracy. We argue that this suboptimal performance might arise because none of the prior methods explicitly model and reduce the latency. In this paper, we propose a new training method to explicitly model and reduce the latency of sequence transducer models. First, we define the expected latency at each diagonal line on the lattice, and show that its gradient can be computed efficiently within the forward-backward algorithm. Then we augment the transducer loss with this expected latency, so that an optimal trade-off between latency and accuracy is achieved. Experimental results on the WSJ dataset show that the proposed minimum latency training reduces the latency of causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%, and outperforms conventional alignment-restricted training (110 ms) and FastEmit (67 ms) methods.

Submitted 4 November, 2022; originally announced November 2022.

Comments: Presented at INTERSPEECH 2022

arXiv:2211.01458 [pdf, other]

Towards Zero-Shot Code-Switched Speech Recognition

Authors: Brian Yan, Matthew Wiesner, Ondrej Klejch, Preethi Jyothi, Shinji Watanabe

Abstract: In this work, we seek to build effective code-switched (CS) automatic speech recognition systems (ASR) under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks which conditionally factorize the bilingual task into its constituent monolingual parts are a promising starting point for leveraging monolingual data efficiently. However, these methods require the monolingual modules to perform language segmentation. That is, each monolingual module has to simultaneously detect CS points and transcribe speech segments of one language while ignoring those of other languages -- not a trivial task. We propose to simplify each monolingual module by allowing them to transcribe all speech segments indiscriminately with a monolingual script (i.e. transliteration). This simple modification passes the responsibility of CS point detection to subsequent bilingual modules which determine the final output by considering multiple monolingual transliterations along with external language model information. We apply this transliteration-based approach in an end-to-end differentiable neural network and demonstrate its efficacy for zero-shot CS ASR on Mandarin-English SEAME test sets.

Submitted 9 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: 5 pages

arXiv:2211.00795 [pdf, other]

InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss

Authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

Abstract: This paper presents InterMPL, a semi-supervised learning method of end-to-end automatic speech recognition (ASR) that performs pseudo-labeling (PL) with intermediate supervision. Momentum PL (MPL) trains a connectionist temporal classification (CTC)-based model on unlabeled data by continuously generating pseudo-labels on the fly and improving their quality. In contrast to autoregressive formulations, such as the attention-based encoder-decoder and transducer, CTC is well suited for MPL, or PL-based semi-supervised ASR in general, owing to its simple/fast inference algorithm and robustness against generating collapsed labels. However, CTC generally yields inferior performance to the autoregressive models due to the conditional independence assumption, thereby limiting the performance of MPL. We propose to enhance MPL by introducing intermediate loss, inspired by the recent advances in CTC-based modeling. Specifically, we focus on self-conditional and hierarchical conditional CTC, which apply auxiliary CTC losses to intermediate layers such that the conditional independence assumption is explicitly relaxed. We also explore how pseudo-labels should be generated and used as supervision for intermediate losses. Experimental results in different semi-supervised settings demonstrate that the proposed approach outperforms MPL and improves an ASR model by up to a 12.1% absolute performance gain. In addition, our detailed analysis validates the importance of the intermediate loss.

Submitted 16 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: Accepted to ICASSP2023

arXiv:2211.00792 [pdf, other]

BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder

Authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

Abstract: We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech recognition (E2E-ASR) model formulated by the transducer with a BERT-enhanced encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR has been actively studied, aiming to utilize versatile linguistic knowledge for generating accurate text. One crucial factor that makes this integration challenging lies in the vocabulary mismatch; the vocabulary constructed for a pre-trained LM is generally too large for E2E-ASR training and is likely to have a mismatch against a target ASR domain. To overcome such an issue, we propose BECTRA, an extended version of our previous BERT-CTC, that realizes BERT-based E2E-ASR using a vocabulary of interest. BECTRA is a transducer-based model, which adopts BERT-CTC for its encoder and trains an ASR-specific decoder using a vocabulary suitable for a target task. With the combination of the transducer and BERT-CTC, we also propose a novel inference algorithm for taking advantage of both autoregressive and non-autoregressive decoding. Experimental results on several ASR tasks, varying in amounts of data, speaking styles, and languages, demonstrate that BECTRA outperforms BERT-CTC by effectively dealing with the vocabulary mismatch while exploiting BERT knowledge.

Submitted 16 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: Accepted to ICASSP2023

arXiv:2210.16663 [pdf, other]

BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

Authors: Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

Abstract: This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding. BERT-CTC attends to the full contexts of the input and hypothesized output sequences via the self-attention mechanism. This mechanism encourages a model to learn inner/inter-dependencies between the audio and token representations while maintaining CTC's training efficiency. During inference, BERT-CTC combines a mask-predict algorithm with CTC decoding, which iteratively refines an output sequence. The experimental results reveal that BERT-CTC improves over conventional approaches across variations in speaking styles and languages. Finally, we show that the semantic representations in BERT-CTC are beneficial towards downstream spoken language understanding tasks.

Submitted 19 April, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

Comments: v1: Accepted to Findings of EMNLP2022, v2: Minor corrections and clearer derivation of Eq. (21)

arXiv:2210.16498 [pdf, other]

Articulatory Representation Learning Via Joint Factor Analysis and Neural Matrix Factorization

Authors: Jiachen Lian, Alan W Black, Yijing Lu, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

Abstract: Articulatory representation learning is fundamental research in modeling the neural speech production system. Our previous work has established a deep paradigm to decompose the articulatory kinematics data into gestures, which explicitly model the phonological and linguistic structure encoded with human speech production mechanism, and corresponding gestural scores. We continue with this line of work by raising two concerns: (1) The articulators are entangled together in the original algorithm such that some of the articulators do not leverage effective moving patterns, which limits the interpretability of both gestures and gestural scores; (2) The EMA data is sparsely sampled from articulators, which limits the intelligibility of learned representations. In this work, we propose a novel articulatory representation decomposition algorithm that takes advantage of guided factor analysis to derive the articulatory-specific factors and factor scores. A neural convolutive matrix factorization algorithm is then employed on the factor scores to derive the new gestures and gestural scores. We experiment with the rtMRI corpus that captures the fine-grained vocal tract contours. Both subjective and objective evaluation results suggest that the newly proposed system delivers articulatory representations that are intelligible, generalizable, efficient and interpretable.

Submitted 20 February, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

Comments: Accepted to 2023 ICASSP. Camera Ready

arXiv:2210.15734 [pdf, other]

Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models

Authors: Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W Black, Shinji Watanabe

Abstract: End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation. However, these systems model sequence labeling as a sequence prediction task causing a divergence from its well-established token-level tagging formulation. We build compositional end-to-end SLU systems that explicitly separate the added complexity of recognizing spoken mentions in SLU from the NLU task of sequence labeling. By relying on intermediate decoders trained for ASR, our end-to-end systems transform the input modality from speech to token-level representations that can be used in the traditional sequence labeling framework. This composition of ASR and NLU formulations in our end-to-end SLU system offers direct compatibility with pre-trained ASR and NLU systems, allows performance monitoring of individual components and enables the use of globally normalized losses like CRF, making them attractive in practical scenarios. Our models outperform both cascaded and direct end-to-end models on a labeling task of named entity recognition across SLU benchmarks.

Submitted 27 October, 2022; originally announced October 2022.

Comments: Accepted at EMNLP 2022 Findings. Our code and models will be publicly available as part of the ESPnet-SLU toolkit: https://github.com/espnet/espnet and the release can be followed here: https://github.com/espnet/espnet/pull/4735

arXiv:2210.14682 [pdf, other]

In search of strong embedding extractors for speaker diarisation

Authors: Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung

Abstract: Speaker embedding extractors (EEs), which map input audio to a speaker discriminant latent space, are of paramount importance in speaker diarisation. However, there are several challenges when adopting EEs for diarisation, from which we tackle two key problems. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation. We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance. Second, embedding extractors have not seen utterances in which multiple speakers exist. These inputs are inevitably present in speaker diarisation because of overlapped speech and speaker changes; they degrade the performance. To mitigate the first problem, we generate speaker verification evaluation protocols that mimic the diarisation scenario better. We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input. One technique generates overlapped speech segments, and the other generates segments where two speakers utter sequentially. Extensive experimental results using three state-of-the-art speaker embedding extractors demonstrate that both proposed approaches are effective.

Submitted 26 October, 2022; originally announced October 2022.

Comments: 5 pages, 1 figure, 2 tables, submitted to ICASSP

arXiv:2210.12948 [pdf, other]

Searching for neutrinos from solar flares across solar cycles 23 and 24 with the Super-Kamiokande detector

Authors: K. Okamoto, K. Abe, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, Y. Kaneshima, Y. Kataoka, Y. Kashiwagi, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nagao, M. Nakahata, Y. Nakano, S. Nakayama, Y. Noguchi, K. Sato, H. Sekiya, K. Shimizu, M. Shiozawa , et al. (220 additional authors not shown)

Abstract: Neutrinos associated with solar flares (solar-flare neutrinos) provide information on particle acceleration mechanisms during the impulsive phase of solar flares. We searched using the Super-Kamiokande detector for neutrinos from solar flares that occurred during solar cycles $23$ and $24$, including the largest solar flare (X28.0) on November 4th, 2003. In order to minimize the background rate we searched for neutrino interactions within narrow time windows coincident with $\gamma$-rays and soft X-rays recorded by satellites. In addition, we performed the first attempt to search for solar-flare neutrinos from solar flares on the invisible side of the Sun by using the emission time of coronal mass ejections (CMEs). By selecting twenty powerful solar flares above X5.0 on the visible side and eight CMEs whose emission speed exceeds $2000$ $\mathrm{km \, s^{-1}}$ on the invisible side from 1996 to 2018, we found two (six) neutrino events coincident with solar flares occurring on the visible (invisible) side of the Sun, with a typical background rate of $0.10$ ($0.62$) events per flare in the MeV-GeV energy range. No significant solar-flare neutrino signal above the estimated background rate was observed. As a result we set the following upper limit on neutrino fluence at the Earth $\mathit{\Phi}<1.1\times10^{6}$ $\mathrm{cm^{-2}}$ at the $90\%$ confidence level for the largest solar flare. The resulting fluence limits allow us to constrain some of the theoretical models for solar-flare neutrino emission.

Submitted 26 October, 2022; v1 submitted 24 October, 2022; originally announced October 2022.

Comments: 36 pages, 18 figures, 9 tables (Figure 12 was replaced because it was incorrect in version 1.)

arXiv:2210.10985 [pdf, ps, other]

Large-scale learning of generalised representations for speaker recognition

Authors: Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesong Lee, Hye-jin Shim, Youngki Kwon, Joon Son Chung, Shinji Watanabe

Abstract: The objective of this work is to develop a speaker recognition model to be used in diverse scenarios. We hypothesise that two components should be adequately configured to build such a model. First, adequate architecture would be required. We explore several recent state-of-the-art models, including ECAPA-TDNN and MFA-Conformer, as well as other baselines. Second, a massive amount of data would be required. We investigate several new training data configurations combining a few existing datasets. The most extensive configuration includes over 87k speakers' 10.22k hours of speech. Four evaluation protocols are adopted to measure how the trained model performs in diverse scenarios. Through experiments, we find that MFA-Conformer with the least inductive bias generalises the best. We also show that training with proposed large data configurations gives better performance. A boost in generalisation is observed, where the average performance on four evaluation protocols improves by more than 20%. In addition, we also demonstrate that these models' performances can improve even further when increasing capacity.

Submitted 27 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

Comments: 5 pages, 5 tables, submitted to ICASSP

arXiv:2210.10742 [pdf, other]

End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation

Authors: Yoshiki Masuyama, Xuankai Chang, Samuele Cornell, Shinji Watanabe, Nobutaka Ono

Abstract: Self-supervised learning representation (SSLR) has demonstrated its significant effectiveness in automatic speech recognition (ASR), mainly with clean speech. Recent work pointed out the strength of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper further advances this integration by dealing with multi-channel input. We propose a novel end-to-end architecture by integrating dereverberation, beamforming, SSLR, and ASR within a single neural network. Our system achieves the best performance reported in the literature on the CHiME-4 6-channel track with a word error rate (WER) of 1.77%. While the WavLM-based strong SSLR demonstrates promising results by itself, the end-to-end integration with the weighted power minimization distortionless response beamformer, which simultaneously performs dereverberation and denoising, improves WER significantly. Its effectiveness is also validated on the REVERB dataset.

Submitted 19 October, 2022; originally announced October 2022.

Comments: Accepted to IEEE SLT 2022

arXiv:2210.08634 [pdf, other]

SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning

Authors: Tzu-hsun Feng, Annie Dong, Ching-Feng Yeh, Shu-wen Yang, Tzu-Quan Lin, Jiatong Shi, Kai-Wei Chang, Zili Huang, Haibin Wu, Xuankai Chang, Shinji Watanabe, Abdelrahman Mohamed, Shang-Wen Li, Hung-yi Lee

Abstract: We present the SUPERB challenge at SLT 2022, which aims at learning self-supervised speech representation for better performance, generalization, and efficiency. The challenge builds upon the SUPERB benchmark and implements metrics to measure the computation requirements of self-supervised learning (SSL) representation and to evaluate its generalizability and performance across the diverse SUPERB tasks. The SUPERB benchmark provides comprehensive coverage of popular speech processing tasks, from speech and speaker recognition to audio generation and semantic understanding. As SSL has gained interest in the speech community and showed promising outcomes, we envision the challenge to uplevel the impact of SSL techniques by motivating more practical designs of techniques beyond task performance. We summarize the results of 14 submitted models in this paper. We also discuss the main findings from those submissions and the future directions of SSL research.

Submitted 29 October, 2022; v1 submitted 16 October, 2022; originally announced October 2022.

Comments: Accepted by 2022 SLT Workshop

arXiv:2210.07499 [pdf, other]

Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

Authors: Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

Abstract: Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in multiple seq2seq tasks. Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target units. As there are multiple potential aligning sequences (called paths) that are equally considered in CTC formulation, the choice of which path will be most probable and become the predicted alignment is always uncertain. In addition, it is usually observed that the alignment predicted by vanilla CTC will drift compared with its reference and rarely provides practical functionalities. Thus, the motivation of this work is to make the CTC alignment prediction controllable and thus equip CTC with extra functionalities. The Bayes risk CTC (BRCTC) criterion is then proposed in this work, in which a customizable Bayes risk function is adopted to enforce the desired characteristics of the predicted alignment. With the risk function, the BRCTC is a general framework to adopt some customizable preference over the paths in order to concentrate the posterior into a particular subset of the paths. In applications, we explore one particular preference which yields models with the down-sampling ability and reduced inference costs. By using BRCTC with another preference for early emissions, we obtain an improved performance-latency trade-off for online models. Experimentally, the proposed BRCTC reduces the inference cost of offline models by up to 47% without performance degradation and cuts down the overall latency of online systems to an unseen level.

Submitted 31 January, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

Journal ref: International Conference on Learning Representations (ICLR), 2023

arXiv:2210.07189 [pdf, other]

On Compressing Sequences for Self-Supervised Speech Models

Authors: Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-yi Lee, Hao Tang

Abstract: Compressing self-supervised models has become increasingly necessary, as self-supervised models become larger. While previous approaches have primarily focused on compressing the model size, shortening sequences is also effective in reducing the computational cost. In this work, we study fixed-length and variable-length subsampling along the time axis in self-supervised learning. We explore how individual downstream tasks are sensitive to input frame rates. Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference. Variable-length subsampling performs particularly well under low frame rates. In addition, if we have access to phonetic boundaries, we find no degradation in performance for an average frame rate as low as 10 Hz.

Submitted 25 October, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: Accepted to IEEE SLT 2022

arXiv:2210.05200 [pdf, other]

CTC Alignments Improve Autoregressive Translation

Authors: Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

Abstract: Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment. However for translation, CTC exhibits clear limitations due to the contextual and non-monotonic nature of the task and thus lags behind attentional decoder approaches in terms of translation quality. In this work, we argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework wherein CTC's core properties can counteract several key weaknesses of pure-attention models during training and decoding. To validate this conjecture, we modify the Hybrid CTC/Attention model originally proposed for ASR to support text-to-text translation (MT) and speech-to-text translation (ST). Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.

Submitted 11 October, 2022; originally announced October 2022.

arXiv:2210.03459 [pdf, other]

Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization

Authors: Shota Horiguchi, Yuki Takashima, Shinji Watanabe, Paola Garcia

Abstract: Due to the high performance of multi-channel speech processing, we can use the outputs from a multi-channel model as teacher labels when training a single-channel model with knowledge distillation. To the contrary, it is also known that single-channel speech data can benefit multi-channel models by mixing it with multi-channel speech data during training or by using it for model pretraining. This paper focuses on speaker diarization and proposes to conduct the above bi-directional knowledge transfer alternately. We first introduce an end-to-end neural diarization model that can handle both single- and multi-channel inputs. Using this model, we alternately conduct i) knowledge distillation from a multi-channel model to a single-channel model and ii) finetuning from the distilled single-channel model to a multi-channel model. Experimental results on two-speaker data show that the proposed method mutually improved single- and multi-channel speaker diarization performances.

Submitted 7 October, 2022; originally announced October 2022.

Comments: Accepted to IEEE SLT 2022

arXiv:2210.00077 [pdf, other]

E-Branchformer: Branchformer with Enhanced merging for speech recognition

Authors: Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

Abstract: Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches of convolution and self-attention and merging local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) 1.81% and 3.65% on LibriSpeech test-clean and test-other sets without using any external training data.

Submitted 14 October, 2022; v1 submitted 30 September, 2022; originally announced October 2022.

Comments: Accepted to SLT 2022

arXiv:2209.14968 [pdf, other]

doi 10.1103/PhysRevLett.130.031802

Search for Cosmic-ray Boosted Sub-GeV Dark Matter using Recoil Protons at Super-Kamiokande

Authors: The Super-Kamiokande Collaboration: K. Abe, Y. Hayato, K. Hiraide, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya, H. Shiba, K. Shimizu, et al. (197 additional authors not shown)

Abstract: We report a search for cosmic-ray boosted dark matter with protons using the 0.37 megaton$\times$years data collected at Super-Kamiokande experiment during the 1996-2018 period (SKI-IV phase). We searched for an excess of proton recoils above the atmospheric neutrino background from the vicinity of the Galactic Center. No such excess is observed, and limits are calculated for two reference models of dark matter with either a constant interaction cross-section or through a scalar mediator. This is the first experimental search for boosted dark matter with hadrons using directional information. The results present the most stringent limits on cosmic-ray boosted dark matter and exclude the dark matter-nucleon elastic scattering cross-section between $10^{-33}\text{ cm}^{2}$ and $10^{-27}\text{ cm}^{2}$ for dark matter mass from 10 MeV/$c^2$ to 1 GeV/$c^2$.

Submitted 30 August, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

Comments: With 1-page appendix. A bug was found in July 2023. This version is updated to match the erratum

Journal ref: Phys. Rev. Lett. 130 (2023) 031802

arXiv:2209.10275 [pdf, other]

Tight Exponential Strong Converse for Source Coding Problem with Encoded Side Information

Authors: Daisuke Takeuchi, Shun Watanabe

Abstract: The source coding problem with encoded side information is considered. A lower bound on the strong converse exponent has been derived by Oohama, but its tightness has not been clarified. In this paper, we derive a tight strong converse exponent. For the special case such that the side-information does not exists, we demonstrate that our tight exponent of the WAK problem reduces to the known tight expression of that special case while Oohama's lower bound is strictly loose. The converse part is proved by a judicious use of the change-of-measure argument, which was introduced by Gu-Effros and further developed by Tyagi-Watanabe. Interestingly, the soft Markov constraint, which was introduced by Oohama as a proof technique, is naturally incorporated into the characterization of the exponent. A technical innovation of this paper is recognizing that the soft Markov constraint is a part of the exponent, rather than a penalty term that should be vanished. In fact, via numerical experiment, we provide evidence that the soft Markov constraint is strictly positive. The achievability part is derived by a careful analysis of the type argument; however, unlike the conventional analysis for the achievable rate region, we need to derive the soft Markov constraint in the analysis of the correct probability. Furthermore, we present an application of our derivation of strong converse exponent to the privacy amplification.

Submitted 3 April, 2024; v1 submitted 21 September, 2022; originally announced September 2022.

Comments: 15 pages, 5 figures; v2 adds an analysis of full-side information case; v3 adds numerical experiment and an application to the privacy amplification

arXiv:2209.09756 [pdf, other]

ESPnet-ONNX: Bridging a Gap Between Research and Production

Authors: Masao Someki, Yosuke Higuchi, Tomoki Hayashi, Shinji Watanabe

Abstract: In the field of deep learning, researchers often focus on inventing novel neural network models and improving benchmarks. In contrast, application developers are interested in making models suitable for actual products, which involves optimizing a model for faster inference and adapting a model to various platforms (e.g., C++ and Python). In this work, to fill the gap between the two, we establish an effective procedure for optimizing a PyTorch-based research-oriented model for deployment, taking ESPnet, a widely used toolkit for speech processing, as an instance. We introduce different techniques to ESPnet, including converting a model into an ONNX format, fusing nodes in a graph, and quantizing parameters, which lead to approximately 1.3-2$\times$ speedup in various tasks (i.e., ASR, TTS, speech translation, and spoken language understanding) while keeping its performance without any additional training. Our ESPnet-ONNX will be publicly available at https://github.com/espnet/espnet_onnx

Submitted 14 November, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

Comments: Accepted to APSIPA ASC 2022

arXiv:2209.09335 [pdf, ps, other]

doi 10.1073/pnas.2112202118

Topological magnetic textures and long-range orders in Tb-based quasicrystal and approximant

Authors: Shinji Watanabe

Abstract: The quasicrystal(QC)s have unique lattice structure with the rotational symmetry forbidden in the periodic crystals. The electric properties are far from complete understanding. It has been unresolved whether the magnetic long-range orders are realized in the QC. Here we report our theoretical discovery of the ferromagnetic long-range order in the Tb-based QC. The difficulty in past theoretical studies on the QC was lack of the microscopic theory of the crystalline electric field (CEF), which is crucially important in the rare-earth systems. By analyzing the CEF in the Tb-based QC, we clarify that magnetic anisotropy plays a key role in realizing unique magnetic textures in the Tb-based QC and approximant crystal (AC). By constructing the minimal model, we show that various magnetic textures on the icosahedron at whose vertices Tb atoms are located. We find that the hedgehog state is characterized by the topological charge of one and the whirling-moment state is characterized by unusually large topological charge of three. The hedgehog and whirling-moment states are shown to be realized as antiferromagnetic orders transcribed as the emergent monopole and antimonopole in the 1/1 AC. We find that these states exhibit the topological Hall effect under applied magnetic field accompanied by the topological as well as metamagnetic transition. Our model and the determined phase diagram are expected to be relevant to the broad range of the rare-earth based QCs and ACs with strong magnetic anisotropy, which are useful not only to understand magnetism but also to explore novel topological properties.

Submitted 21 September, 2022; v1 submitted 19 September, 2022; originally announced September 2022.

Comments: 15 pages, 5 figures

Journal ref: Proc. Natl. Acad. Sci. USA. Vol. 118 (43) (2021) e2112202118

arXiv:2209.08609 [pdf, other]

doi 10.1088/1748-0221/17/10/P10029

Neutron Tagging following Atmospheric Neutrino Events in a Water Cherenkov Detector

Authors: K. Abe, Y. Haga, Y. Hayato, K. Hiraide, K. Ieki, M. Ikeda, S. Imaizumi, K. Iyogi, J. Kameda, Y. Kanemura, Y. Kataoka, Y. Kato, Y. Kishimoto, S. Miki, S. Mine, M. Miura, T. Mochizuki, S. Moriyama, Y. Nagao, M. Nakahata, T. Nakajima, Y. Nakano, S. Nakayama, T. Okada, K. Okamoto , et al. (281 additional authors not shown)

Abstract: We present the development of neutron-tagging techniques in Super-Kamiokande IV using a neural network analysis. The detection efficiency of neutron capture on hydrogen is estimated to be 26%, with a mis-tag rate of 0.016 per neutrino event. The uncertainty of the tagging efficiency is estimated to be 9.0%. Measurement of the tagging efficiency with data from an Americium-Beryllium calibration agrees with this value within 10%. The tagging procedure was performed on 3,244.4 days of SK-IV atmospheric neutrino data, identifying 18,091 neutrons in 26,473 neutrino events. The fitted neutron capture lifetime was measured as $218 \pm 9$ μs.

Submitted 20 September, 2022; v1 submitted 18 September, 2022; originally announced September 2022.

Journal ref: JINST 17 P10029 (2022)

Showing 151–200 of 845 results for author: Watanabe, S