Search | arXiv e-print repository

doi 10.7566/JPSJ.90.063701

Crystalline Electronic Field in Rare-Earth Based Quasicrystal and Approximant: Analysis of Quantum Critical Au-Al-Yb Quasicrystal and Approximant

Authors: Shinji Watanabe, Mina Kawamoto

Abstract: On the basis of the point charge model, we formulate the crystalline electronic field (CEF) Hamiltonian $H_{\rm CEF}$ in the rare-earth based quasicrystal (QC) and approximant crystal (AC) with ligand ions located at pseudo 5-fold configurations by using the operator equivalent method. By setting the total angular momentum $J=7/2$, the CEF in the quantum critical QC Au$_{51}$Al$_{34}$Yb$_{15}$ and the 1/1 AC Au$_{51}$Al$_{35}$Yb$_{14}$ is analyzed with consideration for the effect of Al/Au mixed sites. We find that the ratio of the valences of ligand ions $x=Z_{\rm Al}/Z_{\rm Au}$ plays an important role in characterizing the CEF ground state. As $x$ decreases from $x=3$, the 4f wave function of the CEF ground state with the flat shape lying in the mirror plane is deformed around $x\approx 0.8$ to the flat shape perpendicular to the pseudo 5-fold axis at $x=0$. The formulated $H_{\rm CEF}$ by $J$ is generally applicable to rare-earth-based QCs and ACs, which is useful to analyze the CEF.

Submitted 19 March, 2022; originally announced March 2022.

Comments: 5 pages, 6 figures

Journal ref: J. Phys. Soc. Jpn. 90 (2021) 063701

arXiv:2203.10242 [pdf]

doi 10.1063/5.0099155

Temporal-offset dual-comb vibrometer with picometer axial precision

Authors: A. Iwasaki, D. Nishikawa, M. Okano, S. Tateno, K. Yamanoi, Y. Nozaki, S. Watanabe

Abstract: We demonstrate a dual-comb vibrometer where the pulses of one frequency-comb are split into pulse pairs. We introduce a delay between the two pulses of each pulse pair in front of the sample, and after the corresponding two consecutive reflections at the vibrating sample surface, the initially introduced delay is cancelled by a modified Sagnac geometry. The remaining phase difference between the two pulses corresponds to the change in the axial position of the surface during the two consecutive reflections. The Sagnac geometry reduces the effect of phase jitter since both pulses propagate through nearly the same optical path (in opposite directions), and spurious signals are eliminated by time gating. We determine the amplitude of a surface vibration on a surface-acoustic-wave device with an axial precision of 4 pm. This technique enables highly accurate determination of extremely small displacements.

Submitted 19 March, 2022; originally announced March 2022.

Comments: 22 pages, 5 figures

Journal ref: APL Photonics 7(10), 106101 (2022)

arXiv:2203.09894 [pdf, other]

doi 10.1103/PhysRevC.107.024907

Measurements of second-harmonic Fourier coefficients from azimuthal anisotropies in $p$$+$$p$, $p$$+$Au, $d$$+$Au, and $^3$He$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV

Authors: N. J. Abdulameer, U. Acharya, A. Adare, C. Aidala, N. N. Ajitanand, Y. Akiba, M. Alfred, V. Andrieux, K. Aoki, N. Apadula, H. Asano, C. Ayuso, B. Azmoun, V. Babintsev, M. Bai, N. S. Bandara, B. Bannier, K. N. Barish, S. Bathe, A. Bazilevsky, M. Beaumier, S. Beckman, R. Belmont, A. Berdnikov, Y. Berdnikov, et al. (368 additional authors not shown)

Abstract: Recently, the PHENIX Collaboration has published second- and third-harmonic Fourier coefficients $v_2$ and $v_3$ for midrapidity ($|η|<0.35$) charged hadrons in 0\%--5\% central $p$$+$Au, $d$$+$Au, and $^3$He$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV utilizing three sets of two-particle correlations for two detector combinations with different pseudorapidity acceptance [Phys. Rev. C {\bf 105}, 024901 (2022)]. This paper extends these measurements of $v_2$ to all centralities in $p$$+$Au, $d$$+$Au, and $^3$He$+$Au collisions, as well as $p$$+$$p$ collisions, as a function of transverse momentum ($p_T$) and event multiplicity. The kinematic dependence of $v_2$ is quantified as the ratio $R$ of $v_2$ between the two detector combinations as a function of event multiplicity for $0.5$$<$$p_T$$<$$1$ and $2$$<$$p_T$$<$$2.5$ GeV/$c$. A multiphase-transport (AMPT) model can reproduce the observed $v_2$ in most-central to midcentral $d$$+$Au and $^3$He$+$Au collisions. However, the AMPT model systematically overestimates the measurements in $p$$+$$p$, $p$$+$Au, and peripheral $d$$+$Au and $^3$He$+$Au collisions, indicating a higher nonflow contribution in AMPT than in the experimental data. The AMPT model fails to describe the observed $R$ for $0.5$$<$$p_T$$<$$1$ GeV/$c$, but there is qualitative agreement with the measurements for $2$$<$$p_T$$<$$2.5$ GeV/$c$.

Submitted 4 March, 2023; v1 submitted 18 March, 2022; originally announced March 2022.

Comments: 393 authors from 72 institutions, 14 pages, 10 figures, 2014, 2015, and 2016 data. v2 is version accepted for publication in Physical Review C. HEPdata tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

Journal ref: Phys. Rev. C 107, 024907 (2023)

arXiv:2203.07960 [pdf, other]

Investigating self-supervised learning for speech enhancement and separation

Authors: Zili Huang, Shinji Watanabe, Shu-wen Yang, Paola Garcia, Sanjeev Khudanpur

Abstract: Speech enhancement and separation are two fundamental tasks for robust speech processing. Speech enhancement suppresses background noise while speech separation extracts target speech from interfering speakers. Despite a great number of supervised learning-based enhancement and separation methods having been proposed and achieving good performance, studies on applying self-supervised learning (SSL) to enhancement and separation are limited. In this paper, we evaluate 13 SSL upstream methods on speech enhancement and separation downstream tasks. Our experimental results on Voicebank-DEMAND and Libri2Mix show that some SSL representations consistently outperform baseline features including the short-time Fourier transform (STFT) magnitude and log Mel filterbank (FBANK). Furthermore, we analyze the factors that make existing SSL frameworks difficult to apply to speech enhancement and separation and discuss the representation properties desired for both tasks. Our study is included as the official speech enhancement and separation downstreams for SUPERB.

Submitted 15 March, 2022; originally announced March 2022.

Comments: To appear in ICASSP 2022

arXiv:2203.06884 [pdf, other]

Asymptotic Behavior of Bayesian Generalization Error in Multinomial Mixtures

Authors: Takumi Watanabe, Sumio Watanabe

Abstract: Multinomial mixtures are widely used in the information engineering field, however, their mathematical properties are not yet clarified because they are singular learning models. In fact, the models are non-identifiable and their Fisher information matrices are not positive definite. In recent years, the mathematical foundation of singular statistical models are clarified by using algebraic geometric methods. In this paper, we clarify the real log canonical thresholds and multiplicities of the multinomial mixtures and elucidate their asymptotic behaviors of generalization error and free energy.

Submitted 14 March, 2022; originally announced March 2022.
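
For readers unfamiliar with the quantities named in the abstract above, the standard asymptotic expansions of singular learning theory are sketched below. This is the general form from the theory, not a result quoted from this paper; the symbols $n$ (sample size), $\lambda$ (real log canonical threshold), $m$ (multiplicity), $G_n$ (Bayesian generalization error), $F_n$ (free energy), and $S_n$ (empirical entropy of the true distribution) are notation assumed here.

```latex
\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n \log n}
                  + o\!\left(\frac{1}{n \log n}\right),
\qquad
F_n = n S_n + \lambda \log n - (m-1) \log\log n + O_p(1)
```

For a regular model with $d$ parameters these reduce to $\lambda = d/2$ and $m = 1$, which is why determining $\lambda$ and $m$ is the key step for a singular model such as a mixture.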

arXiv:2203.06849 [pdf, other]

SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities

Authors: Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-wen Yang, Shuyan Dong, Andy T. Liu, Cheng-I Jeff Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

Abstract: Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology is limiting towards a holistic understanding of the efficacy of such models. SUPERB was a step towards introducing a common benchmark to evaluate pre-trained models across various speech tasks. In this paper, we introduce SUPERB-SG, a new benchmark focused on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain and quality across different types of tasks. It entails freezing pre-trained model parameters, only using simple task-specific trainable heads. The goal is to be inclusive of all researchers, and encourage efficient use of computational resources. We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.

Submitted 14 March, 2022; originally announced March 2022.

Comments: ACL 2022 main conference

arXiv:2203.06087 [pdf, other]

doi 10.1103/PhysRevC.106.014908

Study of $φ$-meson production in $p$$+$Al, $p$$+$Au, $d$$+$Au, and $^3$He$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV

Authors: U. Acharya, A. Adare, C. Aidala, N. N. Ajitanand, Y. Akiba, M. Alfred, V. Andrieux, N. Apadula, H. Asano, B. Azmoun, V. Babintsev, M. Bai, N. S. Bandara, B. Bannier, K. N. Barish, S. Bathe, A. Bazilevsky, M. Beaumier, S. Beckman, R. Belmont, A. Berdnikov, Y. Berdnikov, L. Bichon, B. Blankenship, D. S. Blau, et al. (346 additional authors not shown)

Abstract: Small nuclear collisions are mainly sensitive to cold-nuclear-matter effects; however, the collective behavior observed in these collisions shows a hint of hot-nuclear-matter effects. The identified-particle spectra, especially the $φ$ mesons which contain strange and antistrange quarks and have a relatively small hadronic-interaction cross section, are a good tool to study these effects. The PHENIX experiment has measured $φ$ mesons in a specific set of small collision systems $p$$+$Al, $p$$+$Au, and $^3$He$+$Au, as well as $d$$+$Au [Phys. Rev. C {\bf 83}, 024909 (2011)], at $\sqrt{s_{_{NN}}}=200$ GeV. The transverse-momentum spectra and nuclear-modification factors are presented and compared to theoretical-model predictions. The comparisons with different calculations suggest that quark-gluon plasma may be formed in these small collision systems at $\sqrt{s_{_{NN}}}=200$ GeV. However, the volume and the lifetime of the produced medium may be insufficient for observing strangeness-enhancement and jet-quenching effects. Comparison with calculations suggests that the main production mechanisms of $φ$ mesons at midrapidity may be different in $p$$+$Al versus $p/d/$$^3$He$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV. While thermal quark recombination seems to dominate in $p/d/$$^3$He$+$Au collisions, fragmentation seems to be the main production mechanism in $p$$+$Al collisions.

Submitted 26 July, 2022; v1 submitted 11 March, 2022; originally announced March 2022.

Comments: 371 authors from 72 institutions, 13 pages, 7 figures, 7 tables, 2014 and 2015 data. v2 is version accepted for publication Physical Review C. Plain text data tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

Journal ref: Phys. Rev. C 106, 014908 (2022)

arXiv:2203.05749 [pdf, ps, other]

doi 10.1162/neco_a_01580

Classification from Positive and Biased Negative Data with Skewed Labeled Posterior Probability

Authors: Shotaro Watanabe, Hidetoshi Matsui

Abstract: The binary classification problem has a situation where only biased data are observed in one of the classes. In this paper, we propose a new method to approach the positive and biased negative (PbN) classification problem, which is a weakly supervised learning method to learn a binary classifier from positive data and negative data with biased observations. We incorporate a method to correct the negative impact due to skewed confidence, which represents the posterior probability that the observed data are positive. This reduces the distortion of the posterior probability that the data are labeled, which is necessary for the empirical risk minimization of the PbN classification problem. We verified the effectiveness of the proposed method by numerical experiments and real data analysis.

Submitted 10 March, 2022; originally announced March 2022.

Comments: 14 pages, 1 figure

arXiv:2203.04575 [pdf, other]

Geometric Aspects of Data-Processing of Markov Chains

Authors: Geoffrey Wolfer, Shun Watanabe

Abstract: We examine data-processing of Markov chains through the lens of information geometry. We first establish a theory of congruent Markov morphisms within the framework of stochastic matrices. Specifically, we introduce and justify the concept of a linear right inverse (congruent embedding) for lumping, a well-known operation used in Markov chains to extract coarse information. Furthermore, we inspect information projections onto geodesically convex sets of stochastic matrices, and show that under some conditions, projecting (m-projection) onto doubly convex submanifolds can be regarded as a form of data-processing. Finally, we show that the family of lumpable stochastic matrices can be meaningfully endowed with the structure of a foliated manifold and motivate our construction in the context of embedded models and inference.

Submitted 20 December, 2023; v1 submitted 9 March, 2022; originally announced March 2022.

MSC Class: 60J10
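
The lumping operation mentioned in the abstract above can be illustrated with a minimal sketch. It uses the classical strong-lumpability condition (every state in a block must have the same total probability of jumping into each block), which is an assumption about the setting rather than the paper's own construction; the function name `lump` and the example matrix are hypothetical.

```python
import numpy as np

def lump(P, blocks, tol=1e-12):
    """Return the lumped transition matrix of P for the given partition,
    or None if P is not (strongly) lumpable with respect to it.

    P      : (n, n) row-stochastic matrix
    blocks : list of lists of state indices forming a partition of 0..n-1
    """
    P = np.asarray(P, dtype=float)
    k = len(blocks)
    Q = np.zeros((k, k))
    for a, A in enumerate(blocks):
        # Row i of `into` holds the probability of moving from state A[i]
        # into each block of the partition.
        into = np.stack([P[A][:, B].sum(axis=1) for B in blocks], axis=1)
        # Lumpability requires all states of A to agree on these sums.
        if not np.allclose(into, into[0], atol=tol):
            return None
        Q[a] = into[0]
    return Q

# Hypothetical 3-state chain: states 1 and 2 can be merged, states 0 and 1 cannot.
P = np.array([[0.5, 0.25, 0.25],
              [0.2, 0.40, 0.40],
              [0.2, 0.30, 0.50]])

print(lump(P, [[0], [1, 2]]))   # 2x2 lumped chain
print(lump(P, [[0, 1], [2]]))   # None: rows 0 and 1 disagree
```

The lumped matrix keeps only block-to-block transition probabilities, which is the sense in which lumping extracts coarse information from the chain.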

arXiv:2203.03022 [pdf, ps, other]

HEAR: Holistic Evaluation of Audio Representations

Authors: Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk

Abstract: What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. HEAR was launched as a NeurIPS 2021 shared challenge. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It still remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear.

Submitted 29 May, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

Comments: to appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track

arXiv:2203.00232 [pdf, other]

Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR

Authors: Xuankai Chang, Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

Abstract: Graph-based temporal classification (GTC), a generalized form of the connectionist temporal classification loss, was recently proposed to improve automatic speech recognition (ASR) systems using graph-based supervision. For example, GTC was first used to encode an N-best list of pseudo-label sequences into a graph for semi-supervised learning. In this paper, we propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network, which can be applied to a wider range of tasks. As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task. The transcriptions and speaker information of multi-speaker speech are represented by a graph, where the speaker information is associated with the transitions and ASR outputs with the nodes. Using GTC-e, multi-speaker ASR modelling becomes very similar to single-speaker ASR modeling, in that tokens by multiple speakers are recognized as a single merged sequence in chronological order. For evaluation, we perform experiments on a simulated multi-speaker speech dataset derived from LibriSpeech, obtaining promising results with performance close to classical benchmarks for the task.

Submitted 1 March, 2022; originally announced March 2022.

Comments: To appear in ICASSP2022

arXiv:2202.12298 [pdf, other]

Towards Low-distortion Multi-channel Speech Enhancement: The ESPNet-SE Submission to The L3DAS22 Challenge

Authors: Yen-Ju Lu, Samuele Cornell, Xuankai Chang, Wangyou Zhang, Chenda Li, Zhaoheng Ni, Zhong-Qiu Wang, Shinji Watanabe

Abstract: This paper describes our submission to the L3DAS22 Challenge Task 1, which consists of speech enhancement with 3D Ambisonic microphones. The core of our approach combines Deep Neural Network (DNN) driven complex spectral mapping with linear beamformers such as the multi-frame multi-channel Wiener filter. Our proposed system has two DNNs and a linear beamformer in between. Both DNNs are trained to perform complex spectral mapping, using a combination of waveform and magnitude spectrum losses. The estimated signal from the first DNN is used to drive a linear beamformer, and the beamforming result, together with this enhanced signal, are used as extra inputs for the second DNN which refines the estimation. Then, from this new estimated signal, the linear beamformer and second DNN are run iteratively. The proposed method was ranked first in the challenge, achieving, on the evaluation set, a ranking metric of 0.984, versus 0.833 of the challenge baseline.

Submitted 24 February, 2022; originally announced February 2022.

Comments: to be published in IEEE ICASSP 2022

arXiv:2202.08470 [pdf, other]

doi 10.21437/Interspeech.2021-2218

Acoustic Event Detection with Classifier Chains

Authors: Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi

Abstract: This paper proposes acoustic event detection (AED) with classifier chains, a new classifier based on the probabilistic chain rule. The proposed AED with classifier chains consists of a gated recurrent unit and performs iterative binary detection of each event one by one. In each iteration, the event's activity is estimated and used to condition the next output based on the probabilistic chain rule to form classifier chains. Therefore, the proposed method can handle the interdependence among events upon classification, while the conventional AED methods with multiple binary classifiers with a linear layer and sigmoid function have placed an assumption of conditional independence. In the experiments with a real-recording dataset, the proposed method demonstrates its superior AED performance to a relative 14.80% improvement compared to a convolutional recurrent neural network baseline system with the multiple binary classifiers.

Submitted 17 February, 2022; originally announced February 2022.

Comments: 5 pages, presented at Interspeech 2021
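
The chain-rule idea in the abstract above, $p(y_1,\dots,y_K \mid x) = \prod_k p(y_k \mid x, y_1,\dots,y_{k-1})$, can be sketched in a few lines. The paper describes a gated recurrent unit; the version below swaps in plain logistic detectors with hand-picked weights purely for illustration, so `chain_detect`, `W`, `V`, and `b` are all hypothetical names and values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_detect(x, W, V, b, threshold=0.5):
    """Greedy classifier-chain decoding for K binary events.

    x : (D,)   input features
    W : (K, D) feature weights, one row per event
    V : (K, K) chain weights; only V[k, :k] is used, so event k is
               conditioned on the decisions already made for events 0..k-1
    b : (K,)   biases
    """
    K = len(b)
    y = np.zeros(K)
    probs = np.zeros(K)
    for k in range(K):
        # p(y_k = 1 | x, y_0..y_{k-1}) under the logistic stand-in model.
        probs[k] = sigmoid(W[k] @ x + V[k, :k] @ y[:k] + b[k])
        y[k] = float(probs[k] >= threshold)
    return y, probs

# Toy setup (assumed values): event 1 is unlikely whenever event 0 is active.
x = np.array([1.0])
W = np.array([[2.0], [0.0]])
b = np.array([0.0, 1.0])
V_chain = np.array([[0.0, 0.0], [-4.0, 0.0]])
V_indep = np.zeros((2, 2))  # conditional-independence baseline

y_chain, _ = chain_detect(x, W, V_chain, b)
y_indep, _ = chain_detect(x, W, V_indep, b)
print(y_chain, y_indep)
```

With the chain weights, detecting event 0 suppresses event 1; with `V_indep` each event is decided on its own, which corresponds to the conditional-independence assumption the abstract attributes to conventional multi-binary-classifier AED.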

arXiv:2202.08158 [pdf, other]

doi 10.1103/PhysRevLett.130.251901

Measurement of Direct-Photon Cross Section and Double-Helicity Asymmetry at $\sqrt{s}=510$ GeV in $\vec{p}+\vec{p}$ Collisions

Authors: PHENIX Collaboration, N. J. Abdulameer, U. Acharya, A. Adare, C. Aidala, N. N. Ajitanand, Y. Akiba, R. Akimoto, M. Alfred, N. Apadula, Y. Aramaki, H. Asano, E. T. Atomssa, T. C. Awes, B. Azmoun, V. Babintsev, M. Bai, N. S. Bandara, B. Bannier, K. N. Barish, S. Bathe, A. Bazilevsky, M. Beaumier, S. Beckman, R. Belmont, et al. (336 additional authors not shown)

Abstract: We present measurements of the cross section and double-helicity asymmetry $A_{LL}$ of direct-photon production in $\vec{p}+\vec{p}$ collisions at $\sqrt{s}=510$ GeV. The measurements have been performed at midrapidity ($|η|<0.25$) with the PHENIX detector at the Relativistic Heavy Ion Collider. At relativistic energies, direct photons are dominantly produced from the initial quark-gluon hard scattering and do not interact via the strong force at leading order. Therefore, at $\sqrt{s}=510$ GeV, where leading-order-effects dominate, these measurements provide clean and direct access to the gluon helicity in the polarized proton in the gluon-momentum-fraction range $0.02<x<0.08$, with direct sensitivity to the sign of the gluon contribution.

Submitted 6 May, 2023; v1 submitted 16 February, 2022; originally announced February 2022.

Comments: 358 authors from 72 institutions, 8 pages, 2 figures, 1 table, 2013 data. v2 is version accepted by Physical Review Letters. Plain text data tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

arXiv:2202.06497 [pdf]

Universal and Efficient p-Doping of Organic Semiconductors by Electrophilic Attack of Cations

Authors: Jing Guo, Ying Liu, Ping-An Chen, Xinhao Wang, Yanpei Wang, Jing Guo, Xincan Qiu, Zebing Zeng, Lang Jiang, Yuanping Yi, Shun Watanabe, Lei Liao, Yugang Bai, Thuc-Quyen Nguyen, Yuanyuan Hu

Abstract: Doping is of great importance to tailor the electrical properties of semiconductors. However, the present doping methodologies for organic semiconductors (OSCs) are either inefficient or can only apply to a small number of OSCs, seriously limiting their general application. Herein, we reveal a novel p-doping mechanism by investigating the interactions between the dopant trityl cation and poly(3-hexylthiophene) (P3HT). It is found that electrophilic attack of the trityl cations on thiophenes results in the formation of alkylated ions that induce electron transfer from neighboring P3HT chains, resulting in p-doping. This unique p-doping mechanism can be employed to dope various OSCs including those with high ionization energy (IE=5.8 eV). Moreover, this doping mechanism endows trityl cation with strong doping ability, leading to polaron yielding efficiency of 100 % and doping efficiency of over 80 % in P3HT. The discovery and elucidation of this novel doping mechanism not only points out that strong electrophiles are a class of efficient p-dopants for OSCs, but also provides new opportunities towards highly efficient doping of OSCs.

Submitted 14 February, 2022; originally announced February 2022.

arXiv:2202.05256 [pdf, other]

Conditional Diffusion Probabilistic Model for Speech Enhancement

Authors: Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe, Alexander Richard, Cheng Yu, Yu Tsao

Abstract: Speech enhancement is a critical component of many user-oriented audio applications, yet current systems still suffer from distorted and unnatural outputs. While generative models have shown strong potential in speech synthesis, they are still lagging behind in speech enhancement. This work leverages recent advances in diffusion probabilistic models, and proposes a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes. More specifically, we propose a generalized formulation of the diffusion probabilistic model named conditional diffusion probabilistic model that, in its reverse process, can adapt to non-Gaussian real noises in the estimated speech signal. In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models, and investigate the generalization capability of our models to other datasets with noise characteristics unseen during training.

Submitted 10 February, 2022; originally announced February 2022.

arXiv:2202.03863 [pdf, other]

doi 10.1103/PhysRevC.105.064912

Measurement of $ψ(2S)$ nuclear modification at backward and forward rapidity in $p$$+$$p$, $p$$+$Al, and $p$$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV

Authors: U. A. Acharya, C. Aidala, Y. Akiba, M. Alfred, V. Andrieux, N. Apadula, H. Asano, B. Azmoun, V. Babintsev, N. S. Bandara, K. N. Barish, S. Bathe, A. Bazilevsky, M. Beaumier, R. Belmont, A. Berdnikov, Y. Berdnikov, L. Bichon, B. Blankenship, D. S. Blau, J. S. Bok, V. Borisov, M. L. Brooks, J. Bryslawskyj, V. Bumazhnov, et al. (291 additional authors not shown)

Abstract: Suppression of the $J/ψ$ nuclear-modification factor has been seen as a trademark signature of final-state effects in large collision systems for decades. In small systems, the nuclear modification was attributed to cold-nuclear-matter effects until the observation of strong differential suppression of the $ψ(2S)$ state in $p/d$$+$$A$ collisions suggested the presence of final-state effects. Results of $J/ψ$ and $ψ(2S)$ measurements in the dimuon decay channel are presented here for $p$$+$$p$, $p$$+$Al, and $p$$+$Au collision systems at $\sqrt{s_{_{NN}}}=200$ GeV. The results are predominantly shown in the form of the nuclear-modification factor, $R_{pA}$, the ratio of the $ψ(2S)$ invariant yield per nucleon-nucleon collision in collisions of proton on target nucleus to that in $p$$+$$p$ collisions. Measurements of the $J/ψ$ and $ψ(2S)$ nuclear-modification factor are compared with shadowing and transport-model predictions, as well as to complementary measurements at Large-Hadron-Collider energies.

Submitted 30 June, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

Comments: 315 authors from 69 institutions, 16 pages, 9 figures, 4 tables, 2015 data. v2 is version accepted for publication in Physical Review C. Plain text data tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

Journal ref: Phys. Rev. C 105, 064912 (2022)

arXiv:2202.01405 [pdf, other]

Joint Speech Recognition and Audio Captioning

Authors: Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe

Abstract: Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources. Most end-to-end monaural speech recognition systems either remove these background sounds using speech enhancement or train noise-robust models. For better model interpretability and holistic understanding, we aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR). The goal of AAC is to generate natural language descriptions of contents in audio samples. We propose several approaches for end-to-end joint modeling of ASR and AAC tasks and demonstrate their advantages over traditional approaches, which model these tasks independently. A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions. Therefore we also create a multi-task dataset by mixing the clean speech Wall Street Journal corpus with multiple levels of background noises chosen from the AudioCaps dataset. We also perform extensive experimental evaluation and show improvements of our proposed methods as compared to existing state-of-the-art ASR and AAC methods.

Submitted 2 February, 2022; originally announced February 2022.

Comments: 5 pages, 2 figures. Accepted for ICASSP 2022

arXiv:2201.13005 [pdf, ps, other]

On Sub-optimality of Random Binning for Distributed Hypothesis Testing

Authors: Shun Watanabe

Abstract: We investigate the quantize and binning scheme, known as the Shimokawa-Han-Amari (SHA) scheme, for the distributed hypothesis testing. We develop tools to evaluate the critical rate attainable by the SHA scheme. For a product of binary symmetric double sources, we present a sequential scheme that improves upon the SHA scheme.

Submitted 2 February, 2022; v1 submitted 31 January, 2022; originally announced January 2022.

Comments: 6 pages; v2 added a reference

arXiv:2201.10190 [pdf, ps, other]

Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR

Authors: Emiru Tsunoo, Chaitanya Narisetty, Michael Hentschel, Yosuke Kashiwagi, Shinji Watanabe

Abstract: A streaming style inference of encoder-decoder automatic speech recognition (ASR) system is important for reducing latency, which is essential for interactive use cases. To this end, we propose a novel blockwise synchronous decoding algorithm with a hybrid approach that combines endpoint prediction and endpoint post-determination. In the endpoint prediction, we compute the expectation of the number of tokens that are yet to be emitted in the encoder features of the current blocks using the CTC posterior. Based on the expectation value, the decoder predicts the endpoint to realize continuous block synchronization, as a running stitch. Meanwhile, endpoint post-determination probabilistically detects backward jump of the source-target attention, which is caused by the misprediction of endpoints. Then it resumes decoding by discarding those hypotheses, as back stitch. We combine these methods into a hybrid approach, namely run-and-back stitch search, which reduces the computational cost and latency. Evaluations of various ASR tasks show the efficiency of our proposed decoding algorithm, which achieves a latency reduction, for instance in the Librispeech test set from 1487 ms to 821 ms at the 90th percentile, while maintaining a high recognition accuracy.

Submitted 25 January, 2022; originally announced January 2022.

Comments: Accepted for ICASSP2022

arXiv:2201.10103 [pdf, other]

Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models

Authors: Keqi Deng, Zehui Yang, Shinji Watanabe, Yosuke Higuchi, Gaofeng Cheng, Pengyuan Zhang

Abstract: While Transformers have achieved promising results in end-to-end (E2E) automatic speech recognition (ASR), their autoregressive (AR) structure becomes a bottleneck for speeding up the decoding process. For real-world deployment, ASR systems are desired to be highly accurate while achieving fast inference. Non-autoregressive (NAR) models have become a popular alternative due to their fast inference speed, but they still fall behind AR systems in recognition accuracy. To fulfill the two demands, in this paper, we propose a NAR CTC/attention model utilizing both pre-trained acoustic and language models: wav2vec2.0 and BERT. To bridge the modality gap between speech and text representations obtained from the pre-trained models, we design a novel modality conversion mechanism, which is more suitable for logographic languages. During inference, we employ a CTC branch to generate a target length, which enables the BERT to predict tokens in parallel. We also design a cache-based CTC/attention joint decoding method to improve the recognition accuracy while keeping the decoding speed fast. Experimental results show that the proposed NAR model greatly outperforms our strong wav2vec2.0 CTC baseline (15.1% relative CER reduction on AISHELL-1). The proposed NAR model significantly surpasses previous NAR systems on the AISHELL-1 benchmark and shows a potential for English tasks.

Submitted 26 January, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

Comments: Accepted by ICASSP2022

arXiv:2201.05420 [pdf, other]

A Study of Transducer based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies

Authors: Florian Boyer, Yusuke Shinohara, Takaaki Ishii, Hirofumi Inaguma, Shinji Watanabe

Abstract: In this study, we present recent developments of models trained with the RNN-T loss in ESPnet. It involves the use of various architectures such as recently proposed Conformer, multi-task learning with different auxiliary criteria and multiple decoding strategies, including our own proposition. Through experiments and benchmarks, we show that our proposed systems can be competitive against other state-of-art systems on well-known datasets such as LibriSpeech and AISHELL-1. Additionally, we demonstrate that these models are promising against other already implemented systems in ESPnet in regards to both performance and decoding speed, enabling the possibility to have powerful systems for a streaming task. With these additions, we hope to expand the usefulness of the ESPnet toolkit for the research community and also give tools for the ASR industry to deploy our systems in realistic and production environments.

Submitted 14 January, 2022; originally announced January 2022.

arXiv:2112.09872 [pdf]

doi 10.1364/OE.451729

Hyperparameter tuning of optical neural network classifiers for high-order gaussian beams

Authors: Shunsuke Watanabe, Tomoyoshi Shimobaba, Takashi Kakue, Tomoyoshi Ito

Abstract: High-order Gaussian beams with multiple propagation modes have been studied for free-space optical communications. Fast classification of beams using a diffractive deep neural network, D2NN, has been proposed. D2NN optimization is important because it has numerous hyperparameters, such as interlayer distances and mode combinations. In this study, we classify Hermite-Gaussian beams, which are high-order Gaussian beams, using a D2NN, and automatically tune one of its hyperparameters known as the interlayer distance. We used the tree-structured Parzen estimator, a hyperparameter auto-tuning algorithm, to search for the best model. Results indicated that classification accuracy obtained by auto-tuning hyperparameters was higher than that obtained by manually setting interlayer distances at equal intervals. In addition, we confirmed that accuracy by auto-tuning improves as the number of classification modes increases.

Submitted 18 December, 2021; originally announced December 2021.

arXiv:2112.09382 [pdf, other]

Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem

Authors: Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu

Abstract: Deep learning based models have significantly improved the performance of speech separation with input mixtures like the cocktail party. Prominent methods (e.g., frequency-domain and time-domain speech separation) usually build regression models to predict the ground-truth speech from the mixture, using the masking-based design and the signal-level loss criterion (e.g., MSE or SI-SNR). This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem, with great flexibility and strong potential. Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of the speech separation/enhancement related tasks from regression to classification. By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized. Evaluation results based on the WSJ0-2mix and VCTK-noisy corpora in various settings show that our proposed method can steadily synthesize the separated speech with high speech quality and without any interference, which is difficult to avoid in regression-based methods. In addition, with negligible loss of listening quality, the speaker conversion of enhanced/separated speech could be easily realized through our method.

Submitted 9 January, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: 5 pages, https://shincling.github.io/discreteSeparation/

arXiv:2112.09323 [pdf, other]

JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification

Authors: Shinnosuke Takamichi, Ludwig Kürzinger, Takaaki Saeki, Sayaka Shiota, Shinji Watanabe

Abstract: In this paper, we construct a new Japanese speech corpus called "JTubeSpeech." Although recent end-to-end learning requires large-size speech corpora, open-sourced such corpora for languages other than English have not yet been established. In this paper, we describe the construction of a corpus from YouTube videos and subtitles for speech recognition and speaker verification. Our method can automatically filter the videos and subtitles with almost no language-dependent processes. We consistently employ Connectionist Temporal Classification (CTC)-based techniques for automatic speech recognition (ASR) and a speaker variation-based method for automatic speaker verification (ASV). We build 1) a large-scale Japanese ASR benchmark with more than 1,300 hours of data and 2) 900 hours of data for Japanese ASV.

Submitted 17 December, 2021; originally announced December 2021.

Comments: Submitted to ICASSP2022

arXiv:2112.05680 [pdf, other]

doi 10.1103/PhysRevD.105.032003

Transverse-single-spin asymmetries of charged pions at midrapidity in transversely polarized $p{+}p$ collisions at $\sqrt{s}=200$ GeV

Authors: U. A. Acharya, C. Aidala, Y. Akiba, M. Alfred, V. Andrieux, N. Apadula, H. Asano, B. Azmoun, V. Babintsev, N. S. Bandara, K. N. Barish, S. Bathe, A. Bazilevsky, M. Beaumier, R. Belmont, A. Berdnikov, Y. Berdnikov, L. Bichon, B. Blankenship, D. S. Blau, J. S. Bok, V. Borisov, M. L. Brooks, J. Bryslawskyj, V. Bumazhnov, et al. (286 additional authors not shown)

Abstract: In 2015, the PHENIX collaboration has measured single-spin asymmetries for charged pions in transversely polarized proton-proton collisions at the center of mass energy of $\sqrt{s}=200$ GeV. The pions were detected at central rapidities of $|η|<0.35$. The single-spin asymmetries are consistent with zero for each charge individually, as well as consistent with the previously published neutral-pion asymmetries in the same rapidity range. However, they show a slight indication of charge-dependent differences which may suggest a flavor dependence in the underlying mechanisms that create these asymmetries.

Submitted 9 February, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

Comments: 311 authors from 68 institutions, 8 pages, 3 figures, 1 table. 2015 data. v2 is version accepted for publication in Physical Review D. Plain text data tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

Journal ref: Phys. Rev. D 105, 032003 (2022)

arXiv:2112.02492 [pdf, other]

History-dependent deformation of a rotated granular pile governed by granular friction

Authors: T. Irie, R. Yamaguchi, S. Watanabe, H. Katsuragi

Abstract: We experimentally examined the history dependence of the rotation-induced granular deformation. As an initial state, we prepared a quasi-two-dimensional granular pile whose apex is at the rotational axis and its initial inclination is at the angle of repose. The rotation rate was increased from $0$ to $620$~(rpm) and then decreased back to $0$. During the rotation, deformation of the rotated granular pile was captured by a camera. From the acquired image data, granular friction coefficient $μ$ was measured as a function of the ratio between centrifugal force and gravity, $Γ$. To systematically evaluate the variation of $μ$ both in the increasing (spinning up) and decreasing (spinning down) rotation-rate regimes, surface profiles of the deformed granular piles were fitted to a model considering the force balance among gravity, friction, and centrifugal force at the surface. We found that $μ$ value grows in the increasing $Γ$ regime. However, when $Γ$ was reduced, $μ$ cannot recover its initial value. A part of the history-dependent behaviors of the rotated granular pile can be understood by the force balance model.

Submitted 2 June, 2022; v1 submitted 5 December, 2021; originally announced December 2021.

Comments: 11 pages, 10 figures

arXiv:2112.02489 [pdf, other]

doi 10.1103/PhysRevE.104.064902

Deformation of a rotated granular pile governed by body-force-dependent friction

Authors: T. Irie, R. Yamaguchi, S. Watanabe, H. Katsuragi

Abstract: Although the gravity dependence of granular friction is crucial to understand various natural phenomena, its precise characterization is difficult. We propose a method to characterize granular friction under various gravity (body force) conditions controlled by centrifugal force; specifically, the deformation of a rotated granular pile was measured. To understand the mechanics governing the observed nontrivial deformation of this pile, we introduced an analytic model considering local force balance. The excellent agreement between the experimental data and theoretical model suggests that the deformation is simply governed by the net body force (sum of gravity and centrifugal force) and friction angle. The body-force dependence of granular friction was precisely measured from the experimental results. The results reveal that the grain shape affects the degree of body-force dependence of the granular friction.

Submitted 5 December, 2021; originally announced December 2021.

Comments: 10 pages, 5 figures

arXiv:2111.15016 [pdf, other]

Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization

Authors: Brian Yan, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe, Dong Yu

Abstract: Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type. In this work, we propose a general framework to jointly model the likelihoods of the monolingual and code-switch sub-tasks that comprise bilingual speech recognition. By defining the monolingual sub-tasks with label-to-frame synchronization, our joint modeling framework can be conditionally factorized such that the final bilingual output, which may or may not be code-switched, is obtained given only monolingual information. We show that this conditionally factorized joint framework can be modeled by an end-to-end differentiable neural network. We demonstrate the efficacy of our proposed model on bilingual Mandarin-English speech recognition across both monolingual and code-switched corpora.

Submitted 29 November, 2021; originally announced November 2021.

arXiv:2111.14706 [pdf, other]

ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet

Authors: Siddhant Arora, Siddharth Dalmia, Pavel Denisov, Xuankai Chang, Yushi Ueda, Yifan Peng, Yuekai Zhang, Sujay Kumar, Karthik Ganesan, Brian Yan, Ngoc Thang Vu, Alan W Black, Shinji Watanabe

Abstract: As Automatic Speech Processing (ASR) systems are getting better, there is an increasing interest of using the ASR output to do downstream Natural Language Processing (NLP) tasks. However, there are few open source toolkits that can be used to generate reproducible results on different Spoken Language Understanding (SLU) benchmarks. Hence, there is a need to build an open source standard that can be used to have a faster start into SLU research. We present ESPnet-SLU, which is designed for quick development of spoken language understanding in a single framework. ESPnet-SLU is a project inside end-to-end speech processing toolkit, ESPnet, which is a widely used open-source standard for various speech processing tasks like ASR, Text to Speech (TTS) and Speech Translation (ST). We enhance the toolkit to provide implementations for various SLU benchmarks that enable researchers to seamlessly mix-and-match different ASR and NLU models. We also provide pretrained models with intensively tuned hyper-parameters that can match or even outperform the current state-of-the-art performances. The toolkit is publicly available at https://github.com/espnet/espnet.

Submitted 3 March, 2022; v1 submitted 29 November, 2021; originally announced November 2021.

Comments: Accepted at ICASSP 2022 (5 pages)

arXiv:2111.08201 [pdf, other]

Attention-based Multi-hypothesis Fusion for Speech Summarization

Authors: Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe

Abstract: Speech summarization, which generates a text summary from speech, can be achieved by combining automatic speech recognition (ASR) and text summarization (TS). With this cascade approach, we can exploit state-of-the-art models and large training datasets for both subtasks, i.e., Transformer for ASR and Bidirectional Encoder Representations from Transformers (BERT) for TS. However, ASR errors directly affect the quality of the output summary in the cascade approach. We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary. We investigate several schemes to combine ASR hypotheses. First, we propose using the sum of sub-word embedding vectors weighted by their posterior values provided by an ASR system as an input to a BERT-based TS system. Then, we introduce a more general scheme that uses an attention-based fusion module added to a pre-trained BERT module to align and combine several ASR hypotheses. Finally, we perform speech summarization experiments on the How2 dataset and a newly assembled TED-based dataset that we will release with this paper. These experiments show that retraining the BERT-based TS system with these schemes can improve summarization performance and that the attention-based fusion module is particularly effective.

Submitted 15 November, 2021; originally announced November 2021.

arXiv:2111.05756 [pdf, other]

doi 10.1103/PhysRevC.105.064902

Systematic study of nuclear effects in $p$$+$Al, $p$$+$Au, $d$$+$Au, and $^{3}$He$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV using $π^0$ production

Authors: U. A. Acharya, A. Adare, C. Aidala, N. N. Ajitanand, Y. Akiba, H. Al-Bataineh, J. Alexander, M. Alfred, V. Andrieux, A. Angerami, K. Aoki, N. Apadula, Y. Aramaki, H. Asano, E. T. Atomssa, R. Averbeck, T. C. Awes, B. Azmoun, V. Babintsev, M. Bai, G. Baksay, L. Baksay, N. S. Bandara, B. Bannier, K. N. Barish, et al. (529 additional authors not shown)

Abstract: The PHENIX collaboration presents a systematic study of $π^0$ production from $p$$+$$p$, $p$$+$Al, $p$$+$Au, $d$$+$Au, and $^{3}$He$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV. Measurements were performed with different centrality selections as well as the total inelastic, 0%--100%, selection for all collision systems. For 0%--100% collisions, the nuclear modification factors, $R_{xA}$, are consistent with unity for $p_T$ above 8 GeV/$c$, but exhibit an enhancement in peripheral collisions and a suppression in central collisions. The enhancement and suppression characteristics are similar for all systems for the same centrality class. It is shown that for high-$p_T$-$π^0$ production, the nucleons in the $d$ and $^3$He interact mostly independently with the Au nucleus and that the counter intuitive centrality dependence is likely due to a physical correlation between multiplicity and the presence of a hard scattering process. These observations disfavor models where parton energy loss has a significant contribution to nuclear modifications in small systems. Nuclear modifications at lower $p_T$ resemble the Cronin effect -- an increase followed by a peak in central or inelastic collisions and a plateau in peripheral collisions. The peak height has a characteristic ordering by system size as $p$$+$Au $>$ $d$$+$Au $>$ $^{3}$He$+$Au $>$ $p$$+$Al. For collisions with Au ions, current calculations based on initial state cold nuclear matter effects result in the opposite order, suggesting the presence of other contributions to nuclear modifications, in particular at lower $p_T$.

Submitted 6 June, 2022; v1 submitted 10 November, 2021; originally announced November 2021.

Comments: 554 authors from 81 institutions, 21 pages, 13 figures, and 3 tables. Data from 2008, 2014, and 2015. v2 is version accepted for publication in Physical Review C. Plain text data tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

Journal ref: Phys. Rev. C 105, 064902 (2022)

arXiv:2111.01326 [pdf, other]

Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity

Authors: Peter Wu, Jiatong Shi, Yifan Zhong, Shinji Watanabe, Alan W Black

Abstract: Speech processing systems currently do not support the vast majority of languages, in part due to the lack of data in low-resource languages. Cross-lingual transfer offers a compelling way to help bridge this digital divide by incorporating high-resource data into low-resource systems. Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages. However, scaling up speech systems to support hundreds of low-resource languages remains unsolved. To help bridge this gap, we propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages. We demonstrate the effectiveness of our approach in language family classification, speech recognition, and speech synthesis tasks.

Submitted 1 November, 2021; originally announced November 2021.

arXiv:2111.01272 [pdf, other]

Sequence Transduction with Graph-based Supervision

Authors: Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

Abstract: The recurrent neural network transducer (RNN-T) objective plays a major role in building today's best automatic speech recognition (ASR) systems for production. Similarly to the connectionist temporal classification (CTC) objective, the RNN-T loss uses specific rules that define how a set of alignments is generated to form a lattice for the full-sum training. However, it is yet largely unknown if these rules are optimal and do lead to the best possible ASR results. In this work, we present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels, thus providing a flexible and efficient framework to manipulate training lattices, e.g., for studying different transition rules, implementing different transducer losses, or restricting alignments. We demonstrate that transducer-based ASR with CTC-like lattice achieves better results compared to standard RNN-T, while also ensuring a strictly monotonic alignment, which will allow better optimization of the decoding procedure. For example, the proposed CTC-like transducer achieves an improvement of 4.8% on the test-other condition of LibriSpeech relative to an equivalent RNN-T based system.

Submitted 31 March, 2022; v1 submitted 1 November, 2021; originally announced November 2021.

Comments: Accepted for publication at IEEE ICASSP 2022

arXiv:2110.15018 [pdf, other]

TorchAudio: Building Blocks for Audio and Speech Processing

Authors: Yao-Yuan Yang, Moto Hira, Zhaoheng Ni, Anjali Chourdia, Artyom Astafurov, Caroline Chen, Ching-Feng Yeh, Christian Puhrsch, David Pollack, Dmitriy Genzel, Donny Greenberg, Edward Z. Yang, Jason Lian, Jay Mahadeokar, Jeff Hwang, Ji Chen, Peter Goldsborough, Prabhat Roy, Sean Narenthiran, Shinji Watanabe, Soumith Chintala, Vincent Quenneville-Bélair, Yangyang Shi

Abstract: This document describes version 0.10 of TorchAudio: building blocks for machine learning applications in the audio and speech processing domain. The objective of TorchAudio is to accelerate the development and deployment of machine learning applications for researchers and engineers by providing off-the-shelf building blocks. The building blocks are designed to be GPU-compatible, automatically differentiable, and production-ready. TorchAudio can be easily installed from Python Package Index repository and the source code is publicly available under a BSD-2-Clause License (as of September 2021) at https://github.com/pytorch/audio. In this document, we provide an overview of the design principles, functionalities, and benchmarks of TorchAudio. We also benchmark our implementation of several audio and speech operations and models. We verify through the benchmarks that our implementations of various operations and models are valid and perform similarly to other publicly available implementations.

Submitted 16 February, 2022; v1 submitted 28 October, 2021; originally announced October 2021.

Comments: Accepted by ICASSP 2022

arXiv:2110.14852 [pdf, ps, other]

doi 10.1214/22-ECP461

The Boué--Dupuis formula and the exponential hypercontractivity in the Gaussian space

Authors: Yuu Hariya, Sou Watanabe

Abstract: This paper concerns a variational representation formula for Wiener functionals. Let $B=\{ B_{t}\} _{t\ge 0}$ be a standard $d$-dimensional Brownian motion. Boué and Dupuis (1998) showed that, for any bounded measurable functional $F(B)$ of $B$ up to time $1$, the expectation $\mathbb{E}\!\left[ e^{F(B)}\right] $ admits a variational representation in terms of drifted Brownian motions. In this paper, with a slight modification of insightful reasoning by Lehec (2013) allowing also $F(B)$ to be a functional of $B$ over the whole time interval, we prove that the Boué--Dupuis formula holds true provided that both $e^{F(B)}$ and $F(B)$ are integrable, relaxing conditions in earlier works. We also show that the formula implies the exponential hypercontractivity of the Ornstein--Uhlenbeck semigroup in $\mathbb{R}^{d}$, and hence, due to their equivalence, implies the logarithmic Sobolev inequality in the $d$-dimensional Gaussian space.

Submitted 3 November, 2021; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: 15 pages: newly added reference [9] by Chandra et al. (arXiv:2006.15933); also added is a corollary (Corollary 2.1) to Theorem 1.1, in which the case of bounded drifts is treated

MSC Class: 60H30 (Primary) 60J65; 60E15 (Secondary)

arXiv:2110.14139 [pdf, other]

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

Authors: Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Yanmin Qian

Abstract: The deep learning based time-domain models, e.g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement. However, many experiments on the time-domain speech enhancement model are done in simulated conditions, and it is not well studied whether the good performance can generalize to real-world scenarios. In this paper, we aim to provide an insightful investigation of applying multi-channel Conv-TasNet based speech enhancement to both simulation and real data. Our preliminary experiments show a large performance gap between the two conditions in terms of the ASR performance. Several approaches are applied to close this gap, including the integration of multi-channel Conv-TasNet into the beamforming model with various strategies, and the joint training of speech enhancement and speech recognition models. Our experiments on the CHiME-4 corpus show that our proposed approaches can greatly reduce the speech recognition performance discrepancy between simulation and real data, while preserving the strong speech enhancement capability in the frontend.

Submitted 26 October, 2021; originally announced October 2021.

Comments: 5 pages, 3 figures, accepted by IEEE WASPAA 2021

arXiv:2110.07840 [pdf, other]

ESPnet2-TTS: Extending the Edge of TTS Research

Authors: Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe

Abstract: This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art E2E-TTS results. We also provide many pre-trained models in a unified Python interface for inference, offering a quick means for users to generate baseline samples and build demos. Experimental evaluations with English and Japanese corpora demonstrate that our provided models synthesize utterances comparable to ground-truth ones, achieving state-of-the-art TTS performance. The toolkit is available online at https://github.com/espnet/espnet.

Submitted 14 October, 2021; originally announced October 2021.

Comments: Submitted to ICASSP2022. Demo HP: https://espnet.github.io/icassp2022-tts/

arXiv:2110.07504 [pdf, other]

doi 10.1103/PhysRevD.105.032004

Transverse single spin asymmetries of forward neutrons in $p$$+$$p$, $p$$+$Al, and $p$$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV as a function of transverse and longitudinal momenta

Authors: U. A. Acharya, C. Aidala, Y. Akiba, M. Alfred, V. Andrieux, N. Apadula, H. Asano, B. Azmoun, V. Babintsev, N. S. Bandara, K. N. Barish, S. Bathe, A. Bazilevsky, M. Beaumier, R. Belmont, A. Berdnikov, Y. Berdnikov, L. Bichon, B. Blankenship, D. S. Blau, J. S. Bok, V. Borisov, M. L. Brooks, J. Bryslawskyj, V. Bumazhnov, et al. (286 additional authors not shown)

Abstract: In 2015 the PHENIX collaboration at the Relativistic Heavy Ion Collider recorded $p$$+$$p$, $p$$+$Al, and $p$$+$Au collision data at center of mass energies of $\sqrt{s_{_{NN}}}=200$ GeV with the proton beam(s) transversely polarized. At very forward rapidities $\eta>6.8$ relative to the polarized proton beam, neutrons were detected either inclusively or in (anti)correlation with detector activity related to hard collisions. The resulting single spin asymmetries, that were previously reported, have now been extracted as a function of the transverse momentum of the neutron as well as its longitudinal momentum fraction $x_F$. The explicit kinematic dependence, combined with the correlation information allows for a closer look at the interplay of different mechanisms suggested to describe these asymmetries, such as hadronic interactions or electromagnetic interactions in ultra-peripheral collisions, UPC. Events that are correlated with a hard collision indeed display a mostly negative asymmetry that increases in magnitude as a function of transverse momentum with only little dependence on $x_F$. In contrast, events that are not likely to have emerged from a hard collision display positive asymmetries for the nuclear collisions with a kinematic dependence that resembles that of a UPC based model. Because the UPC interaction depends strongly on the charge of the nucleus, those effects are very small for $p$$+$$p$ collisions, moderate for $p$$+$Al collisions, and large for $p$$+$Au collisions.

Submitted 9 February, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

Comments: 311 authors from 68 institutions, 12 pages, 8 figures, 2015 data. v2 is version accepted for publication in Physical Review D. Plain text data tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

Journal ref: Phys. Rev. D 105, 032004 (2022)

arXiv:2110.06280 [pdf, other]

S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations

Authors: Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe, Tomoki Toda

Abstract: This paper introduces S3PRL-VC, an open-source voice conversion (VC) framework based on the S3PRL toolkit. In the context of recognition-synthesis VC, self-supervised speech representation (S3R) is valuable in its potential to replace the expensive supervised representation adopted by state-of-the-art VC systems. Moreover, we claim that VC is a good probing task for S3R analysis. In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting. We also provide comparisons between not only different S3Rs but also top systems in VCC2020 with supervised representations. Systematic objective and subjective evaluation were conducted, and we show that S3R is comparable with VCC2020 top systems in the A2O setting in terms of similarity, and achieves state-of-the-art in S3R-based A2A VC. We believe the extensive analysis, as well as the toolkit itself, contribute to not only the S3R community but also the VC community. The codebase is now open-sourced.

Submitted 12 October, 2021; originally announced October 2021.

Comments: Submitted to ICASSP 2022. Code available at: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream/a2o-vc-vcc2020

arXiv:2110.05571 [pdf, other]

SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition

Authors: Jing Pan, Tao Lei, Kwangyoun Kim, Kyu Han, Shinji Watanabe

Abstract: The Transformer architecture has been well adopted as a dominant architecture in most sequence transduction tasks including automatic speech recognition (ASR), since its attention mechanism excels in capturing long-range dependencies. While models built solely upon attention can be better parallelized than regular RNN, a novel network architecture, SRU++, was recently proposed. By combining the fast recurrence and attention mechanism, SRU++ exhibits strong capability in sequence modeling and achieves near-state-of-the-art results in various language modeling and machine translation tasks with improved compute efficiency. In this work, we present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks and study how the benefits can be generalized to long-form speech inputs. On the popular LibriSpeech benchmark, our SRU++ model achieves 2.0% / 4.7% WER on test-clean / test-other, showing competitive performances compared with the state-of-the-art Conformer encoder under the same set-up. Specifically, SRU++ can surpass Conformer on long-form speech input with a large margin, based on our analysis.

Submitted 11 October, 2021; originally announced October 2021.

arXiv:2110.05249 [pdf, other]

A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

Authors: Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe

Abstract: Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly reduces the inference speed at the cost of accuracy drop compared to autoregressive baselines. Showing great potential for real-time applications, an increasing number of NAR models have been explored in different fields to mitigate the performance gap against AR models. In this work, we conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR). Experiments are performed in the state-of-the-art setting using ESPnet. The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances. We also show that the techniques can be combined for further improvement and applied to NAR end-to-end speech translation. All the implementations are publicly available to encourage further research in NAR speech processing.

Submitted 11 October, 2021; originally announced October 2021.

Comments: Accepted to ASRU2021

arXiv:2110.04694 [pdf, other]

Multi-Channel End-to-End Neural Diarization with Distributed Microphones

Authors: Shota Horiguchi, Yuki Takashima, Paola Garcia, Shinji Watanabe, Yohei Kawaguchi

Abstract: Recent progress on end-to-end neural diarization (EEND) has enabled overlap-aware speaker diarization with a single neural network. This paper proposes to enhance EEND by using multi-channel signals from distributed microphones. We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input: spatio-temporal and co-attention encoders. Both are independent of the number and geometry of microphones and suitable for distributed microphone settings. We also propose a model adaptation method using only single-channel recordings. With simulated and real-recorded datasets, we demonstrated that the proposed method outperformed conventional EEND when a multi-channel input was given while maintaining comparable performance with a single-channel input. We also showed that the proposed method performed well even when spatial information is inoperative given multi-channel inputs, such as in hybrid meetings in which the utterances of multiple remote participants are played back from the same loudspeaker.

Submitted 28 March, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

Comments: Accepted to ICASSP 2022

arXiv:2110.04590 [pdf, other]

An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition

Authors: Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-yi Lee, Shinji Watanabe

Abstract: Self-supervised pretraining on speech data has achieved a lot of progress. High-fidelity representation of the speech signal is learned from a lot of untranscribed data and shows promising performance. Recently, there are several works focusing on evaluating the quality of self-supervised pretrained representations on various tasks without domain restriction, e.g. SUPERB. However, such evaluations do not provide a comprehensive comparison among many ASR benchmark corpora. In this paper, we focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR. Without any modification of the back-end model architectures or training strategy, some of the experiments with pretrained representations, e.g., WSJ, WSJ0-2mix with HuBERT, reach or outperform current state-of-the-art (SOTA) recognition performance. Moreover, we further explore more scenarios for whether the pretraining representations are effective, such as the cross-language or overlapped speech. The scripts, configurations and the trained models have been released in ESPnet to let the community reproduce our experiments and improve them.

Submitted 9 October, 2021; originally announced October 2021.

Comments: To appear in ASRU2021

arXiv:2110.00285 [pdf, ps, other]

Independence and orthogonality of algebraic eigenvectors over the max-plus algebra

Authors: Yuki Nishida, Sennosuke Watanabe, Yoshihide Watanabe

Abstract: The max-plus algebra $\mathbb{R}\cup \{-\infty \}$ is a semiring with the two operations: addition $a \oplus b := \max(a,b)$ and multiplication $a \otimes b := a + b$. Roots of the characteristic polynomial of a max-plus matrix are called algebraic eigenvalues. Recently, algebraic eigenvectors with respect to algebraic eigenvalues were introduced as a generalized concept of eigenvectors. In this paper, we present properties of algebraic eigenvectors analogous to those of eigenvectors in the conventional linear algebra. First, we prove that for generic matrices algebraic eigenvectors with respect to distinct algebraic eigenvalues are linearly independent. We further prove that for symmetric matrices algebraic eigenvectors with respect to distinct algebraic eigenvalues are orthogonal to each other.

Submitted 3 October, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

Comments: 29 pages, 1 figure

MSC Class: 15A16; 15A80
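
Illustration (not part of the listing): the two max-plus operations defined in the abstract above can be sketched in a few lines of Python. The function names `oplus`, `otimes`, and `matvec` and the example matrix are assumptions made for this sketch and are not taken from the paper. The check at the end illustrates only the ordinary max-plus eigenvector relation $A \otimes x = \lambda \otimes x$, not the generalized algebraic eigenvectors the paper studies.

```python
NEG_INF = float("-inf")  # identity element for max-plus addition


def oplus(a, b):
    """Max-plus addition: a (+) b := max(a, b)."""
    return max(a, b)


def otimes(a, b):
    """Max-plus multiplication: a (x) b := a + b."""
    return a + b


def matvec(A, x):
    """Max-plus matrix-vector product: (A (x) x)_i = max_j (A[i][j] + x[j])."""
    result = []
    for row in A:
        acc = NEG_INF
        for a_ij, x_j in zip(row, x):
            acc = oplus(acc, otimes(a_ij, x_j))
        result.append(acc)
    return result


# Hypothetical 2x2 example chosen for this sketch.
A = [[2.0, 5.0],
     [3.0, 2.0]]
x = [1.0, 0.0]
lam = 4.0  # maximum cycle mean of A: (5 + 3) / 2

y = matvec(A, x)                        # A (x) x
scaled = [otimes(lam, v) for v in x]    # lam (x) x
print(y, scaled)                        # both are [5.0, 4.0]
```

Here `x` satisfies the ordinary eigenvector relation with eigenvalue 4, which equals the largest mean weight over the cycles of `A`.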

arXiv:2109.12804 [pdf, other]

Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates

Authors: Hirofumi Inaguma, Siddharth Dalmia, Brian Yan, Shinji Watanabe

Abstract: The multi-decoder (MD) end-to-end speech translation model has demonstrated high translation quality by searching for better intermediate automatic speech recognition (ASR) decoder states as hidden intermediates (HI). It is a two-pass decoding model decomposing the overall task into ASR and machine translation sub-tasks. However, the decoding speed is not fast enough for real-world applications because it conducts beam search for both sub-tasks during inference. We propose Fast-MD, a fast MD model that generates HI by non-autoregressive (NAR) decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder. We investigated two types of NAR HI: (1) parallel HI by using an autoregressive Transformer ASR decoder and (2) masked HI by using Mask-CTC, which combines CTC and the conditional masked language model. To reduce a mismatch in the ASR decoder between teacher-forcing during training and conditioning on CTC outputs during testing, we also propose sampling CTC outputs during training. Experimental evaluations on three corpora show that Fast-MD achieved about 2x and 4x faster decoding speed than that of the naïve MD model on GPU and CPU with comparable translation quality. Adopting the Conformer encoder and intermediate CTC loss further boosts its quality without sacrificing decoding speed.

Submitted 27 September, 2021; originally announced September 2021.

Comments: Accepted at IEEE ASRU 2021

arXiv:2109.11174 [pdf, other]

doi 10.1103/PhysRevD.104.122002

Diffuse Supernova Neutrino Background Search at Super-Kamiokande

Authors: Super-Kamiokande Collaboration: K. Abe, C. Bronner, Y. Hayato, K. Hiraide, M. Ikeda, S. Imaizumi, J. Kameda, Y. Kanemura, Y. Kataoka, S. Miki, M. Miura, S. Moriyama, Y. Nagao, M. Nakahata, S. Nakayama, T. Okada, K. Okamoto, A. Orii, G. Pronost, H. Sekiya, M. Shiozawa, Y. Sonoda, Y. Suzuki, et al. (197 additional authors not shown)

Abstract: A new search for the diffuse supernova neutrino background (DSNB) flux has been conducted at Super-Kamiokande (SK), with a $22.5\times2970$-kton$\cdot$day exposure from its fourth operational phase IV. The new analysis improves on the existing background reduction techniques and systematic uncertainties and takes advantage of an improved neutron tagging algorithm to lower the energy threshold compared to the previous phases of SK. This allows for setting the world's most stringent upper limit on the extraterrestrial $\bar{\nu}_e$ flux, for neutrino energies below 31.3 MeV. The SK-IV results are combined with the ones from the first three phases of SK to perform a joint analysis using $22.5\times5823$ kton$\cdot$days of data. This analysis has the world's best sensitivity to the DSNB $\bar{\nu}_e$ flux, comparable to the predictions from various models. For neutrino energies larger than 17.3 MeV, the new combined $90\%$ C.L. upper limits on the DSNB $\bar{\nu}_e$ flux lie around $2.7$ cm$^{-2}$$\cdot$$\text{sec}^{-1}$, strongly disfavoring the most optimistic predictions. Finally, potentialities of the gadolinium phase of SK and the future Hyper-Kamiokande experiment are discussed.

Submitted 2 November, 2021; v1 submitted 23 September, 2021; originally announced September 2021.

Comments: 42 pages, 37 figures, 14 tables

arXiv:2109.04411 [pdf, other]

Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring

Authors: Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

Abstract: This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models. End-to-end speech translation models have several advantages over traditional cascade systems such as inference latency reduction. However, conventional AR decoding methods are not fast enough because each token is generated incrementally. NAR models, however, can accelerate the decoding speed by generating multiple tokens in parallel on the basis of the token-wise conditional independence assumption. We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder. The auxiliary shallow AR decoder selects the best hypothesis by rescoring multiple candidates generated from the NAR decoder in parallel (parallel AR rescoring). We adopt conditional masked language model (CMLM) and a connectionist temporal classification (CTC)-based model as NAR decoders for Orthros, referred to as Orthros-CMLM and Orthros-CTC, respectively. We also propose two training methods to enhance the CMLM decoder. Experimental evaluations on three benchmark datasets with six language directions demonstrated that Orthros achieved large improvements in translation quality with a very small overhead compared with the baseline NAR model. Moreover, the Conformer encoder architecture enabled large quality improvements, especially for CTC-based models. Orthros-CTC with the Conformer encoder increased decoding speed by 3.63x on CPU with translation quality comparable to that of an AR model.

Submitted 9 September, 2021; originally announced September 2021.

arXiv:2109.03868 [pdf, ps, other]

doi 10.1038/s41598-021-95322-x

An operator-theoretical study on the BCS-Bogoliubov model of superconductivity near absolute zero temperature

Authors: Shuji Watanabe

Abstract: In the preceding papers the present author gave another proof of the existence and uniqueness of the solution to the BCS-Bogoliubov gap equation for superconductivity from the viewpoint of operator theory, and showed that the solution is partially differentiable with respect to the temperature twice. Thanks to these results, we can indeed partially differentiate the solution and the thermodynamic potential with respect to the temperature twice so as to obtain the entropy and the specific heat at constant volume of a superconductor. In this paper we show the behavior near absolute zero temperature of the thus-obtained entropy, the specific heat, the solution and the critical magnetic field from the viewpoint of operator theory since we did not study it in the preceding papers. Here, the potential in the BCS-Bogoliubov gap equation is an arbitrary, positive continuous function and need not be a constant.

Submitted 6 September, 2021; originally announced September 2021.

Comments: 9 pages

MSC Class: 45G10; 47H10; 47N50; 82D55

Journal ref: Scientific Reports 11 (2021), 15983

arXiv:2109.00360 [pdf, other]

doi 10.1016/j.nima.2021.166248

First Gadolinium Loading to Super-Kamiokande

Authors: K. Abe, C. Bronner, Y. Hayato, K. Hiraide, M. Ikeda, S. Imaizumi, J. Kameda, Y. Kanemura, Y. Kataoka, S. Miki, M. Miura, S. Moriyama, Y. Nagao, M. Nakahata, S. Nakayama, T. Okada, K. Okamoto, A. Orii, G. Pronost, H. Sekiya, M. Shiozawa, Y. Sonoda, Y. Suzuki, A. Takeda, Y. Takemoto, et al. (192 additional authors not shown)

Abstract: In order to improve Super-Kamiokande's neutron detection efficiency and to thereby increase its sensitivity to the diffuse supernova neutrino background flux, 13 tons of $\rm Gd_2(\rm SO_4)_3\cdot \rm 8H_2O$ (gadolinium sulfate octahydrate) was dissolved into the detector's otherwise ultrapure water from July 14 to August 17, 2020, marking the start of the SK-Gd phase of operations. During the loading, water was continuously recirculated at a rate of 60 m$^3$/h, extracting water from the top of the detector and mixing it with concentrated $\rm Gd_2(\rm SO_4)_3\cdot \rm 8H_2O$ solution to create a 0.02% solution of the Gd compound before injecting it into the bottom of the detector. A clear boundary between the Gd-loaded and pure water was maintained through the loading, enabling monitoring of the loading itself and the spatial uniformity of the Gd concentration over the 35 days it took to reach the top of the detector. During the subsequent commissioning the recirculation rate was increased to 120 m$^3$/h, resulting in a constant and uniform distribution of Gd throughout the detector and water transparency equivalent to that of previous pure-water operation periods. Using an Am-Be neutron calibration source the mean neutron capture time was measured to be $115\pm1$ $\mu$s, which corresponds to a Gd concentration of $111\pm2$ ppm, as expected for this level of Gd loading. This paper describes changes made to the water circulation system for this detector upgrade, the Gd loading procedure, detector commissioning, and the first neutron calibration measurements in SK-Gd.

Submitted 15 December, 2021; v1 submitted 1 September, 2021; originally announced September 2021.

Comments: 37 pages, 19 Figures, Accepted for publication in Nucl. Instrum. Meth. A

Journal ref: Nuclear Inst. and Methods in Physics Research, A 1027 (2022) 166248

Showing 251–300 of 845 results for author: Watanabe, S