Showing 1–17 of 17 results for author: Kawai, H

Searching in archive eess.

  1. arXiv:2312.10964  [pdf, other]

    cs.CL cs.SD eess.AS

    Generative linguistic representation for spoken language identification

    Authors: Peng Shen, Xugang Lu, Hisashi Kawai

    Abstract: Effective extraction and application of linguistic features are central to the enhancement of spoken Language IDentification (LID) performance. With the success of recent large models, such as GPT and Whisper, the potential to leverage such pre-trained models for extracting linguistic features for LID tasks has become a promising area of research. In this paper, we explore the utilization of the d…

    Submitted 18 December, 2023; originally announced December 2023.

    Comments: Accepted by IEEE ASRU2023

  2. arXiv:2312.10959  [pdf, other]

    cs.SD cs.CL eess.AS

    Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition

    Authors: Peng Shen, Xugang Lu, Hisashi Kawai

    Abstract: Multi-talker overlapped speech recognition remains a significant challenge, requiring not only speech recognition but also speaker diarization tasks to be addressed. In this paper, to better address these tasks, we first introduce speaker labels into an autoregressive transformer-based speech recognition model to support multi-speaker overlapped speech recognition. Then, to improve speaker diariza…

    Submitted 18 December, 2023; originally announced December 2023.

  3. arXiv:2310.13471  [pdf, ps, other]

    eess.AS cs.SD

    Neural domain alignment for spoken language recognition based on optimal transport

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Domain shift poses a significant challenge in cross-domain spoken language recognition (SLR) by reducing its effectiveness. Unsupervised domain adaptation (UDA) algorithms have been explored to address domain shifts in SLR without relying on class labels in the target domain. One successful UDA approach focuses on learning domain-invariant representations to align feature distributions between dom…

    Submitted 20 October, 2023; originally announced October 2023.

  4. arXiv:2309.16093  [pdf, ps, other]

    eess.AS cs.SD

    Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) still remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework in a temporal connectionist temporal classification (CTC) base…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  5. arXiv:2309.13650  [pdf, ps, other]

    eess.AS cs.SD

    Cross-modal Alignment with Optimal Transport for CTC-based ASR

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Temporal connectionist temporal classification (CTC)-based automatic speech recognition (ASR) is one of the most successful end to end (E2E) ASR frameworks. However, due to the token independence assumption in decoding, an external language model (LM) is required which destroys its fast parallel decoding property. Several studies have been proposed to transfer linguistic knowledge from a pretraine…

    Submitted 24 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE ASRU 2023

  6. arXiv:2207.14578  [pdf, other]

    cs.CL cs.SD eess.AS

    Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition

    Authors: Peng Shen, Xugang Lu, Hisashi Kawai

    Abstract: For Mandarin end-to-end (E2E) automatic speech recognition (ASR) tasks, compared to character-based modeling units, pronunciation-based modeling units could improve the sharing of modeling units in model training but meet homophone problems. In this study, we propose to use a novel pronunciation-aware unique character encoding for building E2E RNN-T-based Mandarin ASR systems. The proposed encodin…

    Submitted 29 July, 2022; originally announced July 2022.

  7. arXiv:2204.10561  [pdf, other]

    cs.SD eess.AS

    Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

    Authors: Detai Xin, Shinnosuke Takamichi, Takuma Okamoto, Hisashi Kawai, Hiroshi Saruwatari

    Abstract: This paper presents a speaking-rate-controllable HiFi-GAN neural vocoder. Original HiFi-GAN is a high-fidelity, computationally efficient, and tiny-footprint neural vocoder. We attempt to incorporate a speaking rate control function into HiFi-GAN for improving the accessibility of synthetic speech. The proposed method inserts a differentiable interpolation layer into the HiFi-GAN architecture. A s…

    Submitted 22 April, 2022; originally announced April 2022.

    Comments: submitted to INTERSPEECH 2022

  8. arXiv:2204.03888  [pdf, other]

    cs.CL cs.SD eess.AS

    Transducer-based language embedding for spoken language identification

    Authors: Peng Shen, Xugang Lu, Hisashi Kawai

    Abstract: The acoustic and linguistic features are important cues for the spoken language identification (LID) task. Recent advanced LID systems mainly use acoustic features that lack the usage of explicit linguistic feature encoding. In this paper, we propose a novel transducer-based language embedding approach for LID tasks by integrating an RNN transducer model into a language embedding framework. Benefi…

    Submitted 29 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: This paper was accepted by Interspeech 2022

  9. arXiv:2203.17036  [pdf, ps, other]

    eess.AS cs.CL

    Partial Coupling of Optimal Transport for Spoken Language Identification

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: In order to reduce domain discrepancy to improve the performance of cross-domain spoken language identification (SLID) system, as an unsupervised domain adaptation (UDA) method, we have proposed a joint distribution alignment (JDA) model based on optimal transport (OT). A discrepancy measurement based on OT was adopted for JDA between training and test data sets. In our previous study, it was supp…

    Submitted 31 March, 2022; originally announced March 2022.

    Comments: This work was submitted to INTERSPEECH 2022

  10. arXiv:2104.03004  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Siamese Neural Network with Joint Bayesian Model Structure for Speaker Verification

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Generative probability models are widely used for speaker verification (SV). However, the generative models are lack of discriminative feature selection ability. As a hypothesis test, the SV can be regarded as a binary classification task which can be designed as a Siamese neural network (SiamNN) with discriminative training. However, in most of the discriminative training for SiamNN, only the dis…

    Submitted 7 April, 2021; originally announced April 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2101.03329

  11. arXiv:2101.03329  [pdf, ps, other]

    eess.AS cs.SD

    Coupling a generative model with a discriminative learning framework for speaker verification

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: The speaker verification (SV) task is to decide whether an utterance is spoken by a target or an imposter speaker. For most studies, a log-likelihood ratio (LLR) score is estimated based on a generative probability model on speaker features and compared with a threshold for making a decision. However, the generative model usually focuses on individual feature distributions, does not have the discr…

    Submitted 24 November, 2021; v1 submitted 9 January, 2021; originally announced January 2021.

  12. arXiv:2012.13152  [pdf, ps, other]

    cs.LG cs.CL cs.SD eess.AS

    Unsupervised neural adaptation model based on optimal transport for spoken language identification

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded. In this paper, we propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID. In our model, we explicitly formulate the adaptation as to reduce the distribution dis…

    Submitted 24 December, 2020; originally announced December 2020.

  13. Quasi-Periodic Parallel WaveGAN: A Non-autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

    Authors: Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

    Abstract: In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model, whose generation time is much faster than real time because of its compact model and non-autoregre…

    Submitted 19 February, 2021; v1 submitted 25 July, 2020; originally announced July 2020.

    Comments: 15 pages, 10 figures, 8 tables

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 792-806, 2021

  14. arXiv:2005.08654  [pdf, other]

    eess.AS cs.SD

    Quasi-Periodic Parallel WaveGAN Vocoder: A Non-autoregressive Pitch-dependent Dilated Convolution Model for Parametric Speech Generation

    Authors: Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

    Abstract: In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG. PWG is a compact non-autoregressive (non-AR) speech generation model, whose generative speed is much faster than real time. While utilizing PWG as a vocoder to generate speech on the basis of acoustic features such as spectral and prosodic feat…

    Submitted 6 August, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: 5 pages, 6 figures, 2 tables. Proc. Interspeech, 2020

  15. arXiv:1912.12011  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Cross-scale Attention Model for Acoustic Event Classification

    Authors: Xugang Lu, Peng Shen, Sheng Li, Yu Tsao, Hisashi Kawai

    Abstract: A major advantage of a deep convolutional neural network (CNN) is that the focused receptive field size is increased by stacking multiple convolutional layers. Accordingly, the model can explore the long-range dependency of features from the top layers. However, a potential limitation of the network is that the discriminative features from the bottom layers (which can model the short-range depende…

    Submitted 15 June, 2020; v1 submitted 27 December, 2019; originally announced December 2019.

  16. arXiv:1904.13142  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Incorporating Symbolic Sequential Modeling for Speech Enhancement

    Authors: Chien-Feng Liao, Yu Tsao, Xugang Lu, Hisashi Kawai

    Abstract: In a noisy environment, a lossy speech signal can be automatically restored by a listener if he/she knows the language well. That is, with the built-in knowledge of a "language model", a listener may effectively suppress noise interference and retrieve the target speech signals. Accordingly, we argue that familiarity with the underlying linguistic content of spoken utterances benefits speech enhan…

    Submitted 1 July, 2019; v1 submitted 30 April, 2019; originally announced April 2019.

    Comments: Accepted to Interspeech 2019

  17. arXiv:1310.0296  [pdf, ps, other]

    eess.SY

    Tracking Control for FES-Cycling based on Force Direction Efficiency with Antagonistic Bi-Articular Muscles

    Authors: Hiroyuki Kawai, Matthew J. Bellman, Ryan J. Downey, Warren E. Dixon

    Abstract: A functional electrical stimulation (FES)-based tracking controller is developed to enable cycling based on a strategy to yield force direction efficiency by exploiting antagonistic bi-articular muscles. Given the input redundancy naturally occurring among multiple muscle groups, the force direction at the pedal is explicitly determined as a means to improve the efficiency of cycling. A model of a…

    Submitted 1 October, 2013; originally announced October 2013.

    Comments: 8 pages, 4 figures, submitted to ACC2014