Showing 151–200 of 845 results for author: Watanabe, S

  1. arXiv:2302.08088  [pdf, other]

    cs.CL cs.SD eess.AS

    TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement

    Authors: Yunyang Zeng, Joseph Konan, Shuo Han, David Bick, Muqiao Yang, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

    Abstract: Speech enhancement models have greatly progressed in recent years, but still show limits in perceptual quality of their speech outputs. We propose an objective for perceptual quality based on temporal acoustic parameters. These are fundamental speech features that play an essential role in various applications, including speaker recognition and paralinguistic analysis. We provide a differentiable…

    Submitted 15 February, 2023; originally announced February 2023.

    Comments: Accepted at ICASSP 2023

  2. arXiv:2302.08059  [pdf, other]

    math.PR cs.IT stat.ML

    A Geometric Reduction Approach for Identity Testing of Reversible Markov Chains

    Authors: Geoffrey Wolfer, Shun Watanabe

    Abstract: We consider the problem of testing the identity of a reversible Markov chain against a reference from a single trajectory of observations. Employing the recently introduced notion of a lumping-congruent Markov embedding, we show that, at least in a mildly restricted setting, testing identity to a reversible chain reduces to testing to a symmetric chain over a larger state space and recover state-o…

    Submitted 15 February, 2023; originally announced February 2023.

  3. arXiv:2302.07928  [pdf, other]

    eess.AS cs.SD eess.SP

    Multi-Channel Target Speaker Extraction with Refinement: The WavLab Submission to the Second Clarity Enhancement Challenge

    Authors: Samuele Cornell, Zhong-Qiu Wang, Yoshiki Masuyama, Shinji Watanabe, Manuel Pariente, Nobutaka Ono

    Abstract: This paper describes our submission to the Second Clarity Enhancement Challenge (CEC2), which consists of target speech enhancement for hearing-aid (HA) devices in noisy-reverberant environments with multiple interferers such as music and competing speakers. Our approach builds upon the powerful iterative neural/beamforming enhancement (iNeuBe) framework introduced in our recent work, and this p…

    Submitted 15 February, 2023; originally announced February 2023.

  4. arXiv:2302.06774  [pdf, other]

    eess.AS cs.SD

    Speaker-Independent Acoustic-to-Articulatory Speech Inversion

    Authors: Peter Wu, Li-Wei Chen, Cheol Jun Cho, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

    Abstract: To build speech processing methods that can handle speech as naturally as humans, researchers have explored multiple ways of building an invertible mapping from speech to an interpretable space. The articulatory space is a promising inversion target, since this space captures the mechanics of speech production. To this end, we build an acoustic-to-articulatory inversion (AAI) model that leverages…

    Submitted 24 July, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

  5. arXiv:2302.04689  [pdf, other]

    astro-ph.EP physics.ao-ph

    The Venus' Cloud Discontinuity in 2022

    Authors: J. Peralta, A. Cidadão, L. Morrone, C. Foster, M. Bullock, E. F. Young, I. Garate-Lopez, A. Sánchez-Lavega, T. Horinouchi, T. Imamura, E. Kardasis, A. Yamazaki, S. Watanabe

    Abstract: First identified in 2016 by JAXA's Akatsuki mission, the discontinuity/disruption is a recurrent wave observed to propagate for decades at the deeper clouds of Venus (47--56 km above the surface), while its absence at the clouds' top ($\sim$70 km) suggests that it dissipates at the upper clouds and contributes to the maintenance of the puzzling atmospheric superrotation of Venus through wave-me…

    Submitted 9 February, 2023; originally announced February 2023.

    Comments: 8 pages, 4 figures, 2 animated figures, 1 table

    Journal ref: A&A 672, L2 (2023)

  6. arXiv:2302.04215  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

    Authors: Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

    Abstract: Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synt…

    Submitted 8 February, 2023; originally announced February 2023.

    Comments: Accepted to AAAI 2023

  7. arXiv:2301.12596  [pdf, other]

    eess.AS cs.CL

    Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

    Authors: Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource la…

    Submitted 27 May, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

    Comments: To appear in IJCAI 2023

  8. arXiv:2301.09099  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study

    Authors: Massa Baali, Tomoki Hayashi, Hamdy Mubarak, Soumi Maiti, Shinji Watanabe, Wassim El-Hajj, Ahmed Ali

    Abstract: Several high-resource Text to Speech (TTS) systems currently produce natural, well-established human-like speech. In contrast, low-resource languages, including Arabic, have very limited TTS systems due to the lack of resources. We propose a fully unsupervised method for building TTS, including automatic data selection and pre-training/fine-tuning strategies for TTS training, using broadcast news…

    Submitted 26 January, 2023; v1 submitted 22 January, 2023; originally announced January 2023.

  9. arXiv:2301.08547  [pdf, ps, other]

    math.PR

    Infinite collision property for the three-dimensional uniform spanning tree

    Authors: Satomi Watanabe

    Abstract: Let $\mathcal{U}$ be the uniform spanning tree on $\mathbb{Z}^3$, whose probability law is denoted by $\mathbf{P}$. For $\mathbf{P}$-a.s. realization of $\mathcal{U}$, the recurrence of the simple random walk on $\mathcal{U}$ is proved in [5] and it is also demonstrated in [8] that two independent simple random walks on $\mathcal{U}$ collide infinitely often. In this article, we will give a qu…

    Submitted 9 February, 2023; v1 submitted 20 January, 2023; originally announced January 2023.

    Comments: 13 pages, 1 figure

    MSC Class: 60K37

  10. arXiv:2212.10818  [pdf, other]

    cs.SD cs.CL eess.AS

    4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

    Authors: Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe

    Abstract: The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models. Since each of these network architectures has pros and cons, a typical use case is to switch these separate models d…

    Submitted 29 May, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

    Comments: Accepted by INTERSPEECH 2023

  11. arXiv:2212.10801  [pdf, other]

    hep-ex

    Measurement of the cosmogenic neutron yield in Super-Kamiokande with gadolinium loaded water

    Authors: Super-Kamiokande Collaboration: M. Shinoki, K. Abe, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya, et al. (217 additional authors not shown)

    Abstract: Cosmic-ray muons that enter the Super-Kamiokande detector cause hadronic showers due to spallation in water, producing neutrons and radioactive isotopes. Those are a major background source for studies of MeV-scale neutrinos and searches for rare events. Since 2020, gadolinium was introduced in the ultra-pure water in the Super-Kamiokande detector to improve the detection efficiency of neutrons. I…

    Submitted 25 October, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

    Comments: 10 pages, 10 figures, 3 tables

  12. arXiv:2212.10525  [pdf, other]

    cs.CL eess.AS

    SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

    Authors: Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

    Abstract: Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce suc…

    Submitted 15 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: accepted in ACL 2023 (long paper)

  13. arXiv:2212.08542  [pdf, other]

    eess.AS cs.CL

    Context-aware Fine-tuning of Self-supervised Speech Models

    Authors: Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu, Shinji Watanabe

    Abstract: Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tu…

    Submitted 28 March, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

  14. arXiv:2212.08055  [pdf, other]

    cs.CL cs.SD eess.AS

    UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

    Authors: Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan Wang, Yu-An Chung, Yun Tang, Ann Lee, Shinji Watanabe, Juan Pino

    Abstract: Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword predictio…

    Submitted 26 May, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

    Comments: ACL 2023 (main conference)

  15. arXiv:2212.06751  [pdf, other]

    cs.LG cs.AI

    Speeding Up Multi-Objective Hyperparameter Optimization by Task Similarity-Based Meta-Learning for the Tree-Structured Parzen Estimator

    Authors: Shuhei Watanabe, Noor Awad, Masaki Onishi, Frank Hutter

    Abstract: Hyperparameter optimization (HPO) is a vital step in improving performance in deep learning (DL). Practitioners are often faced with the trade-off between multiple criteria, such as accuracy and latency. Given the high computational needs of DL and the growing demand for efficient HPO, the acceleration of multi-objective (MO) optimization becomes ever more important. Despite the significant body o…

    Submitted 31 May, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

    Comments: Accepted to IJCAI 2023

  16. arXiv:2212.04559  [pdf, other]

    eess.AS cs.LG cs.SD

    SpeechLMScore: Evaluating speech generation using speech language model

    Authors: Soumi Maiti, Yifan Peng, Takaaki Saeki, Shinji Watanabe

    Abstract: While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming. Previous studies on automatic speech quality assessment address the problem by predicting human evaluation scores with machine learning models. However, they rely on supervised learning and thus suffer from high annotation costs and domain-shift problems. We propo…

    Submitted 8 December, 2022; originally announced December 2022.

  17. arXiv:2211.17196  [pdf, other]

    cs.CL cs.SD eess.AS

    EURO: ESPnet Unsupervised ASR Open-source Toolkit

    Authors: Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola Garcia, Hung-yi Lee, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR). EURO adopts the state-of-the-art UASR learning method introduced by the Wav2vec-U, originally implemented at FAIRSEQ, which leverages self-supervised speech representations and adversarial training. In addition to wav2vec2, EURO extend…

    Submitted 20 May, 2023; v1 submitted 30 November, 2022; originally announced November 2022.

  18. arXiv:2211.15031  [pdf, ps, other]

    math.PR

    Volume and heat kernel fluctuations for the three-dimensional uniform spanning tree

    Authors: Daisuke Shiraishi, Satomi Watanabe

    Abstract: Let $\mathcal{U}$ be the uniform spanning tree on $\mathbb{Z}^{3}$. We show the occurrence of log-logarithmic fluctuations around the leading order for the volume of intrinsic balls in $\mathcal{U}$. As an application, we obtain similar fluctuations for the quenched heat kernel of the simple random walk on $\mathcal{U}$.

    Submitted 27 November, 2022; originally announced November 2022.

    Comments: 39 pages, 20 figures

    MSC Class: 60K37 (primary); 60D05

  19. arXiv:2211.14411  [pdf, other]

    cs.LG cs.AI

    c-TPE: Tree-structured Parzen Estimator with Inequality Constraints for Expensive Hyperparameter Optimization

    Authors: Shuhei Watanabe, Frank Hutter

    Abstract: Hyperparameter optimization (HPO) is crucial for strong performance of deep learning algorithms and real-world applications often impose some constraints, such as memory usage, or latency on top of the performance requirement. In this work, we propose constrained TPE (c-TPE), an extension of the widely-used versatile Bayesian optimization method, tree-structured Parzen estimator (TPE), to handle t…

    Submitted 26 May, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

    Comments: Accepted to IJCAI 2023

  20. arXiv:2211.12433  [pdf, other]

    cs.SD eess.AS

    TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

    Authors: Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

    Abstract: We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI)…

    Submitted 4 August, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

    Comments: In IEEE/ACM Transactions on Audio, Speech, and Language Processing. A sound demo is available at https://zqwang7.github.io/demos/TF-GridNet-demo/index.html, and the code is available at https://github.com/espnet/espnet/pull/5395

  21. arXiv:2211.10049  [pdf, ps, other]

    math.ST cs.LG

    Recent Advances in Algebraic Geometry and Bayesian Statistics

    Authors: Sumio Watanabe

    Abstract: This article is a review of theoretical advances in the research field of algebraic geometry and Bayesian statistics in the last two decades. Many statistical models and learning machines which contain hierarchical structures or latent variables are called nonidentifiable, because the map from a parameter to a statistical model is not one-to-one. In nonidentifiable models, both the likelihood func…

    Submitted 18 November, 2022; originally announced November 2022.

  22. arXiv:2211.08989  [pdf, other]

    cs.CL cs.SD eess.AS

    Avoid Overthinking in Self-Supervised Models for Speech Recognition

    Authors: Dan Berrebbi, Brian Yan, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) models reshaped our approach to speech, language and vision. However, their huge size and the opaque relations between their layers and tasks result in slow inference and network overthinking, where predictions made from the last layer of large models are worse than those made from intermediate layers. Early exit (EE) strategies can solve both issues by dynamically red…

    Submitted 1 November, 2022; originally announced November 2022.

  23. arXiv:2211.08726  [pdf, other]

    cs.CL cs.SD eess.AS

    Streaming Joint Speech Recognition and Disfluency Detection

    Authors: Hayato Futami, Emiru Tsunoo, Kentaro Shibata, Yosuke Kashiwagi, Takao Okuda, Siddhant Arora, Shinji Watanabe

    Abstract: Disfluency detection has mainly been solved in a pipeline approach, as post-processing of speech recognition. In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner. Compared to pipeline approaches, the joint models can leverage acoustic information that makes disfluency detection robust to…

    Submitted 11 May, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

    Comments: Accepted at ICASSP2023

  24. arXiv:2211.06535  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units

    Authors: Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

    Abstract: We present a unified system to realize one-shot voice conversion (VC) on the pitch, rhythm, and speaker attributes. Existing works generally ignore the correlation between prosody and language content, leading to the degradation of naturalness in converted speech. Additionally, the lack of proper language features prevents these systems from accurately preserving language content after conversion.…

    Submitted 11 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  25. arXiv:2211.05967  [pdf, ps, other]

    cs.CL eess.AS

    Align, Write, Re-order: Explainable End-to-End Speech Translation via Operation Sequence Generation

    Authors: Motoi Omachi, Brian Yan, Siddharth Dalmia, Yuya Fujita, Shinji Watanabe

    Abstract: The black-box nature of end-to-end speech translation (E2E ST) systems makes it difficult to understand how source language inputs are being mapped to the target language. To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word. A major challenge arises f…

    Submitted 10 November, 2022; originally announced November 2022.

  26. arXiv:2211.05869  [pdf, other]

    cs.CL cs.SD eess.AS

    A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models for Spoken Language Understanding

    Authors: Yifan Peng, Siddhant Arora, Yosuke Higuchi, Yushi Ueda, Sujay Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe

    Abstract: Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and thei…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: Accepted at SLT 2022

  27. arXiv:2211.03541  [pdf, other]

    eess.AS cs.LG cs.SD

    Multi-blank Transducers for Speech Recognition

    Authors: Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe, Boris Ginsburg

    Abstract: This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training…

    Submitted 11 April, 2024; v1 submitted 4 November, 2022; originally announced November 2022.

    Journal ref: ICASSP 2023

  28. arXiv:2211.03025  [pdf, other]

    cs.CL cs.SD eess.AS

    Bridging Speech and Textual Pre-trained Models with Unsupervised ASR

    Authors: Jiatong Shi, Chan-Jan Hsu, Holam Chung, Dongji Gao, Paola Garcia, Shinji Watanabe, Ann Lee, Hung-yi Lee

    Abstract: Spoken language understanding (SLU) is a task aiming to extract high-level semantics from spoken utterances. Previous works have investigated the use of speech self-supervised models and textual pre-trained models, which have shown reasonable improvements to various SLU tasks. However, because of the mismatched modalities between speech signals and text tokens, previous methods usually need comple…

    Submitted 6 November, 2022; originally announced November 2022.

    Comments: ICASSP2023 submission

  29. arXiv:2211.02333  [pdf, other]

    eess.AS cs.CL cs.SD

    Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

    Authors: Yusuke Shinohara, Shinji Watanabe

    Abstract: Sequence transducers, such as the RNN-T and the Conformer-T, are one of the most promising models of end-to-end speech recognition, especially in streaming scenarios where both latency and accuracy are important. Although various methods, such as alignment-restricted training and FastEmit, have been studied to reduce the latency, latency reduction is often accompanied with a significant degradatio…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: Presented at INTERSPEECH 2022

  30. arXiv:2211.01458  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Zero-Shot Code-Switched Speech Recognition

    Authors: Brian Yan, Matthew Wiesner, Ondrej Klejch, Preethi Jyothi, Shinji Watanabe

    Abstract: In this work, we seek to build effective code-switched (CS) automatic speech recognition systems (ASR) under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks which conditionally factorize the bilingual task into its constituent monolingual parts are a promising starting point for leveraging monolingual data efficiently. However, th…

    Submitted 9 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: 5 pages

  31. arXiv:2211.00795  [pdf, other]

    eess.AS cs.CL cs.SD

    InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss

    Authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

    Abstract: This paper presents InterMPL, a semi-supervised learning method of end-to-end automatic speech recognition (ASR) that performs pseudo-labeling (PL) with intermediate supervision. Momentum PL (MPL) trains a connectionist temporal classification (CTC)-based model on unlabeled data by continuously generating pseudo-labels on the fly and improving their quality. In contrast to autoregressive formulati…

    Submitted 16 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP2023

  32. arXiv:2211.00792  [pdf, other]

    eess.AS cs.CL cs.SD

    BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder

    Authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

    Abstract: We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech recognition (E2E-ASR) model formulated by the transducer with a BERT-enhanced encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR has been actively studied, aiming to utilize versatile linguistic knowledge for generating accurate text. One crucial factor that makes this integration challenging…

    Submitted 16 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP2023

  33. arXiv:2210.16663  [pdf, other]

    eess.AS cs.CL

    BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

    Authors: Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

    Abstract: This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding. BERT-CTC attends to the full contexts of the…

    Submitted 19 April, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

    Comments: v1: Accepted to Findings of EMNLP2022, v2: Minor corrections and clearer derivation of Eq. (21)

  34. arXiv:2210.16498  [pdf, other]

    eess.AS cs.SD

    Articulatory Representation Learning Via Joint Factor Analysis and Neural Matrix Factorization

    Authors: Jiachen Lian, Alan W Black, Yijing Lu, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

    Abstract: Articulatory representation learning is the fundamental research in modeling neural speech production system. Our previous work has established a deep paradigm to decompose the articulatory kinematics data into gestures, which explicitly model the phonological and linguistic structure encoded with human speech production mechanism, and corresponding gestural scores. We continue with this line of w…

    Submitted 20 February, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

    Comments: Accepted to 2023 ICASSP. Camera Ready

  35. arXiv:2210.15734  [pdf, other]

    cs.CL cs.SD eess.AS

    Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models

    Authors: Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W Black, Shinji Watanabe

    Abstract: End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation. However, these systems model sequence labeling as a sequence prediction task causing a divergence from its well-established token-level tagging formulation. We build compositional end-to-end SLU systems that explicitly separate the a…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted at EMNLP 2022 Findings. Our code and models will be publicly available as part of the ESPnet-SLU toolkit: https://github.com/espnet/espnet and the release can be followed here: https://github.com/espnet/espnet/pull/4735

  36. arXiv:2210.14682  [pdf, other]

    cs.SD cs.AI eess.AS

    In search of strong embedding extractors for speaker diarisation

    Authors: Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung

    Abstract: Speaker embedding extractors (EEs), which map input audio to a speaker discriminant latent space, are of paramount importance in speaker diarisation. However, there are several challenges when adopting EEs for diarisation, from which we tackle two key problems. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: 5 pages, 1 figure, 2 tables, submitted to ICASSP

  37. arXiv:2210.12948  [pdf, other]

    astro-ph.SR astro-ph.HE hep-ex physics.space-ph

    Searching for neutrinos from solar flares across solar cycles 23 and 24 with the Super-Kamiokande detector

    Authors: K. Okamoto, K. Abe, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, Y. Kaneshima, Y. Kataoka, Y. Kashiwagi, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nagao, M. Nakahata, Y. Nakano, S. Nakayama, Y. Noguchi, K. Sato, H. Sekiya, K. Shimizu, M. Shiozawa, et al. (220 additional authors not shown)

    Abstract: Neutrinos associated with solar flares (solar-flare neutrinos) provide information on particle acceleration mechanisms during the impulsive phase of solar flares. We searched using the Super-Kamiokande detector for neutrinos from solar flares that occurred during solar cycles $23$ and $24$, including the largest solar flare (X28.0) on November 4th, 2003. In order to minimize the background rate we…

    Submitted 26 October, 2022; v1 submitted 24 October, 2022; originally announced October 2022.

    Comments: 36 pages, 18 figures, 9 tables (Figure 12 was replaced because it was incorrect in version 1.)

  38. arXiv:2210.10985  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Large-scale learning of generalised representations for speaker recognition

    Authors: Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesong Lee, Hye-jin Shim, Youngki Kwon, Joon Son Chung, Shinji Watanabe

    Abstract: The objective of this work is to develop a speaker recognition model to be used in diverse scenarios. We hypothesise that two components should be adequately configured to build such a model. First, adequate architecture would be required. We explore several recent state-of-the-art models, including ECAPA-TDNN and MFA-Conformer, as well as other baselines. Second, a massive amount of data would be…

    Submitted 27 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: 5 pages, 5 tables, submitted to ICASSP

  39. arXiv:2210.10742  [pdf, other]

    cs.SD eess.AS

    End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation

    Authors: Yoshiki Masuyama, Xuankai Chang, Samuele Cornell, Shinji Watanabe, Nobutaka Ono

    Abstract: Self-supervised learning representation (SSLR) has demonstrated its significant effectiveness in automatic speech recognition (ASR), mainly with clean speech. Recent work pointed out the strength of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper further advances this integration by dealing with multi-channel input. We propose a novel end-to-end ar…

    Submitted 19 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  40. arXiv:2210.08634  [pdf, other]

    cs.CL cs.SD eess.AS

    SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning

    Authors: Tzu-hsun Feng, Annie Dong, Ching-Feng Yeh, Shu-wen Yang, Tzu-Quan Lin, Jiatong Shi, Kai-Wei Chang, Zili Huang, Haibin Wu, Xuankai Chang, Shinji Watanabe, Abdelrahman Mohamed, Shang-Wen Li, Hung-yi Lee

    Abstract: We present the SUPERB challenge at SLT 2022, which aims at learning self-supervised speech representation for better performance, generalization, and efficiency. The challenge builds upon the SUPERB benchmark and implements metrics to measure the computation requirements of self-supervised learning (SSL) representation and to evaluate its generalizability and performance across the diverse SUPERB…

    Submitted 29 October, 2022; v1 submitted 16 October, 2022; originally announced October 2022.

    Comments: Accepted by 2022 SLT Workshop

  41. arXiv:2210.07499  [pdf, other]

    cs.CL cs.SD eess.AS

    Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

    Authors: Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

    Abstract: Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in multiple seq2seq tasks. Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target…

    Submitted 31 January, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Journal ref: International Conference on Learning Representations (ICLR), 2023

  42. arXiv:2210.07189  [pdf, other]

    cs.CL cs.SD eess.AS

    On Compressing Sequences for Self-Supervised Speech Models

    Authors: Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-yi Lee, Hao Tang

    Abstract: Compressing self-supervised models has become increasingly necessary, as self-supervised models become larger. While previous approaches have primarily focused on compressing the model size, shortening sequences is also effective in reducing the computational cost. In this work, we study fixed-length and variable-length subsampling along the time axis in self-supervised learning. We explore how in…

    Submitted 25 October, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  43. arXiv:2210.05200  [pdf, other]

    cs.CL cs.SD eess.AS

    CTC Alignments Improve Autoregressive Translation

    Authors: Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

    Abstract: Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment. However, for translation, CTC exhibits clear limitations due to the contextual and non-monotonic nature of the task and thus lags behind attentional decoder approaches in terms of translation quality. In this work, we argue that CT…

    Submitted 11 October, 2022; originally announced October 2022.

  44. arXiv:2210.03459  [pdf, other]

    eess.AS cs.CL cs.SD

    Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization

    Authors: Shota Horiguchi, Yuki Takashima, Shinji Watanabe, Paola Garcia

    Abstract: Due to the high performance of multi-channel speech processing, we can use the outputs from a multi-channel model as teacher labels when training a single-channel model with knowledge distillation. To the contrary, it is also known that single-channel speech data can benefit multi-channel models by mixing it with multi-channel speech data during training or by using it for model pretraining. This…

    Submitted 7 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  45. arXiv:2210.00077  [pdf, other]

    eess.AS cs.LG

    E-Branchformer: Branchformer with Enhanced merging for speech recognition

    Authors: Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

    Abstract: Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Bra…

    Submitted 14 October, 2022; v1 submitted 30 September, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  46. Search for Cosmic-ray Boosted Sub-GeV Dark Matter using Recoil Protons at Super-Kamiokande

    Authors: The Super-Kamiokande Collaboration: K. Abe, Y. Hayato, K. Hiraide, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya, H. Shiba, K. Shimizu, et al. (197 additional authors not shown)

    Abstract: We report a search for cosmic-ray boosted dark matter with protons using the 0.37 megaton$\times$years data collected at the Super-Kamiokande experiment during the 1996-2018 period (SKI-IV phase). We searched for an excess of proton recoils above the atmospheric neutrino background from the vicinity of the Galactic Center. No such excess is observed, and limits are calculated for two reference models…

    Submitted 30 August, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

    Comments: With 1-page appendix. A bug was found in July 2023. This version is updated to match the erratum

    Journal ref: Phys. Rev. Lett. 130 (2023) 031802

  47. arXiv:2209.10275  [pdf, other]

    cs.IT

    Tight Exponential Strong Converse for Source Coding Problem with Encoded Side Information

    Authors: Daisuke Takeuchi, Shun Watanabe

    Abstract: The source coding problem with encoded side information is considered. A lower bound on the strong converse exponent has been derived by Oohama, but its tightness has not been clarified. In this paper, we derive a tight strong converse exponent. For the special case in which the side information does not exist, we demonstrate that our tight exponent of the WAK problem reduces to the known tight…

    Submitted 3 April, 2024; v1 submitted 21 September, 2022; originally announced September 2022.

    Comments: 15 pages, 5 figures; v2 adds an analysis of full-side information case; v3 adds numerical experiment and an application to the privacy amplification

  48. arXiv:2209.09756  [pdf, other]

    eess.AS

    ESPnet-ONNX: Bridging a Gap Between Research and Production

    Authors: Masao Someki, Yosuke Higuchi, Tomoki Hayashi, Shinji Watanabe

    Abstract: In the field of deep learning, researchers often focus on inventing novel neural network models and improving benchmarks. In contrast, application developers are interested in making models suitable for actual products, which involves optimizing a model for faster inference and adapting a model to various platforms (e.g., C++ and Python). In this work, to fill the gap between the two, we establish…

    Submitted 14 November, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

    Comments: Accepted to APSIPA ASC 2022

  49. arXiv:2209.09335  [pdf, ps, other]

    cond-mat.str-el cond-mat.mtrl-sci

    Topological magnetic textures and long-range orders in Tb-based quasicrystal and approximant

    Authors: Shinji Watanabe

    Abstract: Quasicrystals (QCs) have a unique lattice structure with rotational symmetry forbidden in periodic crystals. Their electric properties are far from completely understood. It has been unresolved whether magnetic long-range orders are realized in QCs. Here we report our theoretical discovery of the ferromagnetic long-range order in the Tb-based QC. The difficulty in past theoretical st…

    Submitted 21 September, 2022; v1 submitted 19 September, 2022; originally announced September 2022.

    Comments: 15 pages, 5 figures

    Journal ref: Proc. Natl. Acad. Sci. USA. Vol. 118 (43) (2021) e2112202118

  50. arXiv:2209.08609  [pdf, other]

    hep-ex astro-ph.IM physics.ins-det

    Neutron Tagging following Atmospheric Neutrino Events in a Water Cherenkov Detector

    Authors: K. Abe, Y. Haga, Y. Hayato, K. Hiraide, K. Ieki, M. Ikeda, S. Imaizumi, K. Iyogi, J. Kameda, Y. Kanemura, Y. Kataoka, Y. Kato, Y. Kishimoto, S. Miki, S. Mine, M. Miura, T. Mochizuki, S. Moriyama, Y. Nagao, M. Nakahata, T. Nakajima, Y. Nakano, S. Nakayama, T. Okada, K. Okamoto, et al. (281 additional authors not shown)

    Abstract: We present the development of neutron-tagging techniques in Super-Kamiokande IV using a neural network analysis. The detection efficiency of neutron capture on hydrogen is estimated to be 26%, with a mis-tag rate of 0.016 per neutrino event. The uncertainty of the tagging efficiency is estimated to be 9.0%. Measurement of the tagging efficiency with data from an Americium-Beryllium calibration agr…

    Submitted 20 September, 2022; v1 submitted 18 September, 2022; originally announced September 2022.

    Journal ref: JINST 17 P10029 (2022)