Showing 1–30 of 30 results for author: Moriya, T

Searching in archive cs.
  1. arXiv:2406.18972  [pdf, ps, other]

    eess.AS cs.CL

    Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over

    Authors: Atsunori Ogawa, Naoyuki Kamo, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Takatomo Kano, Naohiro Tawara, Marc Delcroix

    Abstract: Large language models (LLMs) have been successfully applied for rescoring automatic speech recognition (ASR) hypotheses. However, their ability to rescore ASR hypotheses of casual conversations has not been sufficiently explored. In this study, we reveal it by performing N-best ASR hypotheses rescoring using Llama2 on the CHiME-7 distant ASR (DASR) task. Llama2 is one of the most representative LL…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 5 pages

  2. arXiv:2406.18910  [pdf, other]

    cs.CL cs.SD eess.AS

    Factor-Conditioned Speaking-Style Captioning

    Authors: Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura

    Abstract: This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which disturbs learning speaking-style information. To solve this problem, we introduce factor-conditioned capti…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  3. arXiv:2401.17632  [pdf, other]

    cs.CL cs.SD eess.AS

    What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis

    Authors: Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima

    Abstract: Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations. Speech SSL models, such as WavLM, employ masked prediction training to encode general-purpose representations. In contrast, speaker SSL models, exemplified by DINO-based models, adopt utterance-level training objectives primarily for speaker representation. Understanding how these model…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  4. arXiv:2401.05111  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters

    Authors: Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, Hiroki Kanagawa, Marc Delcroix, Takafumi Moriya, Yusuke Ijima

    Abstract: The zero-shot text-to-speech (TTS) method, based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations, can reproduce speaker characteristics very accurately. However, this approach suffers from degradation in speech synthesis quality when the reference speech contains noise. In this paper, we propose a noise-robust zero-shot TTS method.…

    Submitted 10 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures, Accepted to IEEE ICASSP 2024

  5. arXiv:2311.02797  [pdf, other]

    cs.IT

    Optimal Construction of N-bit-delay Almost Instantaneous Fixed-to-Variable-Length Codes

    Authors: Ryosuke Sugiura, Masaaki Nishino, Norihito Yasuda, Yutaka Kamamoto, Takehiro Moriya

    Abstract: This paper presents an optimal construction of $N$-bit-delay almost instantaneous fixed-to-variable-length (AIFV) codes, the general form of binary codes we can make when finite bits of decoding delay are allowed. The presented method enables us to optimize lossless codes among a broader class of codes compared to the conventional FV and AIFV codes. The paper first discusses the problem of code co…

    Submitted 5 November, 2023; originally announced November 2023.

    Comments: submitted to IEEE Trans. IT on 31st Oct. 2023

  6. arXiv:2311.01715  [pdf, other]

    cs.SD eess.AS eess.SP

    Acousto-optic reconstruction of exterior sound field based on concentric circle sampling with circular harmonic expansion

    Authors: Phuc Duc Nguyen, Kenji Ishikawa, Noboru Harada, Takehiro Moriya

    Abstract: Acousto-optic sensing provides an alternative approach to traditional microphone arrays by shedding light on the interaction of light with an acoustic field. Sound field reconstruction is a fascinating and advanced technique used in acousto-optics sensing. Current challenges in sound-field reconstruction methods pertain to scenarios in which the sound source is located within the reconstruction ar…

    Submitted 3 November, 2023; originally announced November 2023.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  7. arXiv:2310.11010  [pdf, ps, other]

    eess.AS cs.CL

    Iterative Shallow Fusion of Backward Language Model for End-to-End Speech Recognition

    Authors: Atsunori Ogawa, Takafumi Moriya, Naoyuki Kamo, Naohiro Tawara, Marc Delcroix

    Abstract: We propose a new shallow fusion (SF) method to exploit an external backward language model (BLM) for end-to-end automatic speech recognition (ASR). The BLM has complementary characteristics with a forward language model (FLM), and the effectiveness of their combination has been confirmed by rescoring ASR hypotheses as post-processing. In the proposed SF, we iteratively apply the BLM to partial ASR…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Accepted to ICASSP 2023

  8. arXiv:2306.17317  [pdf, ps, other]

    eess.AS cs.SD

    Modified Parametric Multichannel Wiener Filter for Low-latency Enhancement of Speech Mixtures with Unknown Number of Speakers

    Authors: Ning Guo, Tomohiro Nakatani, Shoko Araki, Takehiro Moriya

    Abstract: This paper introduces a novel low-latency online beamforming (BF) algorithm, named Modified Parametric Multichannel Wiener Filter (Mod-PMWF), for enhancing speech mixtures with unknown and varying number of speakers. Although conventional BFs such as linearly constrained minimum variance BF (LCMV BF) can enhance a speech mixture, they typically require such attributes of the speech mixture as the…

    Submitted 29 June, 2023; originally announced June 2023.

  9. arXiv:2306.08374  [pdf, other]

    cs.CL cs.SD eess.AS

    SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?

    Authors: Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, Yusuke Ijima, Taichi Asami, Marc Delcroix, Yukinori Honma

    Abstract: Self-supervised learning (SSL) for speech representation has been successfully applied in various downstream tasks, such as speech and speaker recognition. More recently, speech SSL models have also been shown to be beneficial in advancing spoken language understanding tasks, implying that the SSL models have the potential to learn not only acoustic but also linguistic information. In this paper,…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted at INTERSPEECH 2023

  10. arXiv:2306.04233  [pdf, other]

    cs.CL cs.SD eess.AS

    Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix

    Abstract: End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model. This approach is promising because, in contrast to the conventional cascade approach, it can utilize full acoustical information and mitigate the propagation of transcription errors. However, due to the high cost of collecting speech-summary pairs, an E2E SSum model…

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: Accepted by Interspeech 2023

  11. arXiv:2306.02273  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    End-to-End Joint Target and Non-Target Speakers ASR

    Authors: Ryo Masumura, Naoki Makishima, Taiga Yamane, Yoshihiko Yamazaki, Saki Mizuno, Mana Ihori, Mihiro Uchida, Keita Suzuki, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando

    Abstract: This paper proposes a novel automatic speech recognition (ASR) system that can transcribe individual speakers' speech while identifying whether they are target or non-target speakers from multi-talker overlapped speech. Target-speaker ASR systems are a promising way to transcribe only a target speaker's speech by enrolling the target speaker's information. However, in conversational ASR applicatio…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: Accepted at Interspeech 2023

  12. arXiv:2305.14723  [pdf, other]

    eess.AS cs.SD

    Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss

    Authors: Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo

    Abstract: Self-supervised learning (SSL) is the latest breakthrough in speech processing, especially for label-scarce downstream tasks by leveraging massive unlabeled audio data. The noise robustness of the SSL is one of the important challenges to expanding its application. We can use speech enhancement (SE) to tackle this issue. However, the mismatch between the SE model and SSL models potentially limits…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 4 pages, 2 figures, Accepted to Interspeech 2023

  13. arXiv:2305.05201  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models

    Authors: Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka

    Abstract: Self-supervised learning (SSL) has been dramatically successful not only in monolingual but also in cross-lingual settings. However, since the two settings have been studied individually in general, there has been little research focusing on how effective a cross-lingual model is in comparison with a monolingual model. In this paper, we investigate this fundamental question empirically with Japane…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted at ICASSP 2023

  14. arXiv:2304.14923  [pdf, ps, other]

    eess.SP cs.SD eess.AS eess.IV physics.optics

    Deep sound-field denoiser: optically-measured sound-field denoising using deep neural network

    Authors: Kenji Ishikawa, Daiki Takeuchi, Noboru Harada, Takehiro Moriya

    Abstract: This paper proposes a deep sound-field denoiser, a deep neural network (DNN) based denoising of optically measured sound-field images. Sound-field imaging using optical methods has gained considerable attention due to its ability to achieve high-spatial-resolution imaging of acoustic phenomena that conventional acoustic sensors cannot accomplish. However, the optically measured sound-field images…

    Submitted 21 September, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

    Comments: 16 pages, 10 figures, 2 tables

  15. Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model

    Authors: Kenichi Fujita, Takanori Ashihara, Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima

    Abstract: This paper proposes a zero-shot text-to-speech (TTS) conditioned by a self-supervised speech-representation model acquired through self-supervised learning (SSL). Conventional methods with embedding vectors from x-vector or global style tokens still have a gap in reproducing the speaker characteristics of unseen speakers. A novel point of the proposed method is the direct use of the SSL model to o…

    Submitted 24 April, 2023; originally announced April 2023.

    Comments: 5 pages, 3 figures, Accepted to IEEE ICASSP 2023 workshop Self-supervision in Audio, Speech and Beyond

    Journal ref: 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 2023, pp. 1-5

  16. arXiv:2303.00978  [pdf, other]

    cs.CL eess.AS

    Leveraging Large Text Corpora for End-to-End Speech Summarization

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura

    Abstract: End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech. Compared with the cascade approach, which combines automatic speech recognition (ASR) and text summarization models, the E2E approach is more promising because it mitigates ASR errors, incorporates nonverbal information, and simplifies the overall system. However, since collecting a large…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023

  17. arXiv:2210.15937  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis

    Authors: Atsushi Ando, Ryo Masumura, Akihiko Takashima, Satoshi Suzuki, Naoki Makishima, Keita Suzuki, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato

    Abstract: This paper investigates the effectiveness and implementation of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis (MSA). Although the effectiveness of pre-trained encoders in various fields has been reported, conventional MSA methods employ them for only linguistic modality, and their application has not been investigated. This paper compares the features yielded…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  18. arXiv:2209.04175  [pdf, other]

    eess.AS cs.SD

    Streaming Target-Speaker ASR with Neural Transducer

    Authors: Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki

    Abstract: Although recent advances in deep learning technology have boosted automatic speech recognition (ASR) performance in the single-talker case, it remains difficult to recognize multi-talker speech in which many voices overlap. One conventional approach to tackle this problem is to use a cascade of a speech separation or target speech extraction front-end with an ASR back-end. However, the extra compu…

    Submitted 19 September, 2022; v1 submitted 9 September, 2022; originally announced September 2022.

    Comments: Accepted to Interspeech 2022

  19. arXiv:2209.02926  [pdf, ps, other]

    math.AG cs.SC math.NT

    Some explicit arithmetics on curves of genus three and their applications

    Authors: Tomoki Moriya, Momonari Kudo

    Abstract: A Richelot isogeny between Jacobian varieties is an isogeny whose kernel is included in the $2$-torsion subgroup of the domain. In particular, a Richelot isogeny whose codomain is the product of two or more principally polarized abelian varieties is called a decomposed Richelot isogeny. In this paper, we develop some explicit arithmetics on curves of genus $3$, including algorithms to compute the…

    Submitted 1 March, 2023; v1 submitted 7 September, 2022; originally announced September 2022.

    Comments: Comments are welcome!

  20. arXiv:2207.06867  [pdf, other]

    cs.CL cs.SD eess.AS

    Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models

    Authors: Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka

    Abstract: Self-supervised learning (SSL) is seen as a very promising approach with high performance for several speech downstream tasks. Since the parameters of SSL models are generally so large that training and inference require a lot of memory and computational cost, it is desirable to produce compact SSL models without a significant performance degradation by applying compression methods such as knowled…

    Submitted 1 September, 2022; v1 submitted 14 July, 2022; originally announced July 2022.

    Comments: Accepted at Interspeech 2022

  21. arXiv:2206.08174  [pdf, other]

    eess.AS cs.SD eess.SP

    Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations

    Authors: Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura

    Abstract: Target speech extraction is a technique to extract the target speaker's voice from mixture signals using a pre-recorded enrollment utterance that characterizes the voice characteristics of the target speaker. One major difficulty of target speech extraction lies in handling variability in "intra-speaker" characteristics, i.e., characteristics mismatch between target speech and an enrollment utter…

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: 5 pages, 2 figures, 3 tables, Submitted to Interspeech 2022

  22. arXiv:2203.08437  [pdf, other]

    cs.IT

    General form of almost instantaneous fixed-to-variable-length codes

    Authors: Ryosuke Sugiura, Yutaka Kamamoto, Takehiro Moriya

    Abstract: A general class of the almost instantaneous fixed-to-variable-length (AIFV) codes is proposed, which contains every possible binary code we can make when allowing finite bits of decoding delay. The contribution of the paper lies in the following. (i) Introducing $N$-bit-delay AIFV codes, constructed by multiple code trees with higher flexibility than the conventional AIFV codes. (ii) Proving that…

    Submitted 7 September, 2023; v1 submitted 16 March, 2022; originally announced March 2022.

    Comments: submitted to IEEE Transactions on Information Theory. arXiv admin note: text overlap with arXiv:1607.07247 by other authors

  23. Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition

    Authors: Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Naoyuki Kamo, Takafumi Moriya

    Abstract: The combination of a deep neural network (DNN)-based speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end is a widely used approach to implement overlapping speech recognition. However, the SE front-end generates processing artifacts that can degrade the ASR performance. We previously found that such performance degradation can occur even under fully overlapping co…

    Submitted 11 January, 2022; originally announced January 2022.

    Comments: 5 pages, 2 figures

    Journal ref: In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6287-6291

  24. arXiv:2107.01569  [pdf, other]

    cs.CL cs.LG

    Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition

    Authors: Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima

    Abstract: We propose a cross-modal transformer-based neural correction model that refines the output of an automatic speech recognition (ASR) system so as to exclude ASR errors. Generally, neural correction models are composed of encoder-decoder networks, which can directly model sequence-to-sequence mapping problems. The most successful method is to use both input speech and its ASR output text as the inp…

    Submitted 4 July, 2021; originally announced July 2021.

    Comments: Accepted to Interspeech 2021

  25. Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition

    Authors: Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo

    Abstract: Although recent advances in deep learning technology improved automatic speech recognition (ASR), it remains difficult to recognize speech when it overlaps other people's voices. Speech separation or extraction is often used as a front-end to ASR to handle such overlapping speech. However, deep neural network-based speech enhancement can generate 'processing artifacts' as a side effect of the enha…

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: 5 pages, 1 figure

    Journal ref: in Proc. Interspeech 2021, 1149-1153

  26. arXiv:2004.03272  [pdf]

    cs.CV eess.IV

    Super-resolution of clinical CT volumes with modified CycleGAN using micro CT volumes

    Authors: Tong ZHENG, Hirohisa ODA, Takayasu MORIYA, Takaaki SUGINO, Shota NAKAMURA, Masahiro ODA, Masaki MORI, Hirotsugu TAKABATAKE, Hiroshi NATORI, Kensaku MORI

    Abstract: This paper presents a super-resolution (SR) method with an unpaired training dataset of clinical CT and micro CT volumes. For obtaining very detailed information such as cancer invasion from pre-operative clinical CT volumes of lung cancer patients, SR of clinical CT volumes to $\mu$CT level is desired. While most SR methods require paired low- and high-resolution images for training, it is infeasib…

    Submitted 7 April, 2020; originally announced April 2020.

    Comments: 6 pages, 2 figures

  27. arXiv:1912.12838  [pdf, other]

    eess.IV cs.CV

    Multi-modality super-resolution loss for GAN-based super-resolution of clinical CT images using micro CT image database

    Authors: Tong Zheng, Hirohisa Oda, Takayasu Moriya, Shota Nakamura, Masahiro Oda, Masaki Mori, Horitsugu Takabatake, Hiroshi Natori, Kensaku Mori

    Abstract: This paper introduces a multi-modality loss function for GAN-based super-resolution that can maintain image structure and intensity on an unpaired training dataset of clinical CT and micro CT volumes. Precise non-invasive diagnosis of lung cancer mainly utilizes 3D multidetector computed-tomography (CT) data. On the other hand, we can take micro CT images of resected lung specimen in 50 micro met…

    Submitted 7 April, 2020; v1 submitted 30 December, 2019; originally announced December 2019.

    Comments: 6 pages, 2 figures

  28. Unsupervised Segmentation of 3D Medical Images Based on Clustering and Deep Representation Learning

    Authors: Takayasu Moriya, Holger R. Roth, Shota Nakamura, Hirohisa Oda, Kai Nagara, Masahiro Oda, Kensaku Mori

    Abstract: This paper presents a novel unsupervised segmentation method for 3D medical images. Convolutional neural networks (CNNs) have brought significant advances in image segmentation. However, most of the recent methods rely on supervised learning, which requires large amounts of manually annotated data. Thus, it is challenging for these methods to cope with the growing amount of medical images. This pa…

    Submitted 11 April, 2018; originally announced April 2018.

    Comments: This paper was presented at SPIE Medical Imaging 2018, Houston, TX, USA

    Journal ref: Proc. SPIE 10578, Medical Imaging 2018: Biomedical Applications in Molecular, Structural, and Functional Imaging, 1057820 (12 March 2018)

  29. Unsupervised Pathology Image Segmentation Using Representation Learning with Spherical K-means

    Authors: Takayasu Moriya, Holger R. Roth, Shota Nakamura, Hirohisa Oda, Kai Nagara, Masahiro Oda, Kensaku Mori

    Abstract: This paper presents a novel method for unsupervised segmentation of pathology images. Staging of lung cancer is a major factor of prognosis. Measuring the maximum dimensions of the invasive component in a pathology image is an essential task. Therefore, image segmentation methods for visualizing the extent of invasive and noninvasive components on pathology images could support pathological exami…

    Submitted 11 April, 2018; originally announced April 2018.

    Comments: This paper was presented at SPIE Medical Imaging 2018, Houston, TX, USA

    Journal ref: Proc. SPIE 10581, Medical Imaging 2018: Digital Pathology, 1058111 (6 March 2018)

  30. arXiv:1801.01449  [pdf, other]

    cs.CV

    3D Surface-to-Structure Translation using Deep Convolutional Networks

    Authors: Takumi Moriya, Kazuyuki Saito, Hiroya Tanaka

    Abstract: Our demonstration shows a system that estimates internal body structures from 3D surface models using deep convolutional neural networks trained on CT (computed tomography) images of the human body. To take pictures of structures inside the body, we need to use a CT scanner or an MRI (Magnetic Resonance Imaging) scanner. However, assuming that the mutual information between outer shape of the body…

    Submitted 8 December, 2017; originally announced January 2018.

    Comments: 2 pages, 3 figures