Signal processing algorithm effective for sound quality of hearing loss simulators

Authors: Toshio Irino, Shintaro Doan, Minami Ishikawa

Abstract: Hearing loss (HL) simulators, which allow normal hearing (NH) listeners to experience HL, have been used in speech intelligibility experiments, but not in sound quality experiments due to perceptible distortion. If they produced less distortion, they might be useful for NH listeners to evaluate the sound quality of, for example, hearing aids. We conducted perceptual sound quality experiments to compare the Cambridge version of HL simulator (CamHLS) and the Wakayama version of the HL simulator (WHIS), which has the two algorithms of filterbank analysis synthesis (FBAS) and direct time-varying filter (DTVF). The experimental results showed that WHIS with DTVF produces less perceptible distortion in speech sounds than CamHLS and WHIS with FBAS, even when the nonlinear process is working. This advantage is mainly due to the use of the DTVF algorithm, which could be applied to various signal synthesis applications with filterbank analysis.

Submitted 7 June, 2024; originally announced June 2024.

Comments: This paper has been accepted for publication in Interspeech 2024

arXiv:2310.15399 [pdf, ps, other]

GESI: Gammachirp Envelope Similarity Index for Predicting Intelligibility of Simulated Hearing Loss Sounds

Authors: Ayako Yamamoto, Toshio Irino, Fuki Miyazaki, Honoka Tamaru

Abstract: We propose an objective intelligibility measure (OIM), called the Gammachirp Envelope Similarity Index (GESI), which can predict the speech intelligibility (SI) of simulated hearing loss (HL) sounds for normal hearing (NH) listeners. GESI is an intrusive method that computes the SI metric using the gammachirp filterbank (GCFB), the modulation filterbank, and the extended cosine similarity measure. The unique features of GESI are that i) it reflects the hearing impaired (HI) listener's HL that appears in the audiogram and is caused by active and passive cochlear dysfunction, ii) it provides a single goodness metric, as in the widely used STOI and ESTOI, that can be used immediately to evaluate SE algorithms, and iii) it provides a simple control parameter to accept the level asymmetry of the reference and test sounds and to deal with individual listening conditions and environments. We evaluated GESI and the conventional OIMs, STOI, ESTOI, MBSTOI, and HASPI versions 1 and 2 by using four SI experiments on words of male and female speech sounds in both laboratory and remote environments. GESI was shown to outperform the other OIMs in the evaluations. GESI could be used to improve SE algorithms in assistive listening devices for individual HI listeners.

Submitted 13 March, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: This paper was submitted to JASA on March 14, 2024

arXiv:2306.01522 [pdf, ps, other]

Auditory Representation Effective for Estimating Vocal Tract Information

Authors: Toshio Irino, Shintaro Doan

Abstract: We can estimate the size of the speakers based on their speech sounds alone. We had proposed an auditory computational theory of the Stabilised Wavelet-Mellin Transform (SWMT), which segregates information about the size and shape of the vocal tract and glottal vibration, to explain this observation. It has been shown that the auditory representation or excitation pattern (EP) associated with a weighting function based on the SWMT, termed the ``SSI weight,'' can account for the psychometric functions of size perception. In this study, we investigated whether EP with SSI weight can accurately estimate vocal tract lengths (VTLs) which were measured by magnetic resonance imaging (MRI) in male and female subjects. It was found that the use of SSI weight significantly improved the VTL estimation. Furthermore, the estimation errors in the EP with the SSI weight were significantly smaller than those in the commonly used spectra derived from the Fourier transform, Mel filterbank, and WORLD vocoder. It was also shown that the SSI weight can be easily introduced into these spectra to improve the performance.

Submitted 14 September, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

Comments: This manuscript is a revised version after acceptance for publication in Proc. APSIPA ASC 2023 on August 25, 2023

arXiv:2206.06604 [pdf, other]

doi 10.1109/ACCESS.2023.3298673

WHIS: Hearing impairment simulator based on the gammachirp auditory filterbank

Authors: Toshio Irino

Abstract: A new version of a hearing impairment simulator (WHIS) was implemented based on a revised version of the gammachirp filterbank (GCFB), which incorporates fast frame-based processing, absolute threshold (AT), an audiogram of a hearing-impaired (HI) listener, and a parameter to control the cochlear input-output (IO) function. The parameter referred to as the compression health $α$ controlled the slope of the IO function to range from normal hearing (NH) listeners to HI listeners, without largely changing the total hearing loss (HL). The new WHIS was designed to provide an NH listener the same EPs as those of a target HI listener. The analysis part of WHIS was almost the same as that of the revised GCFB, except that the IO function was used instead of the gain function. We proposed two synthesis methods: a direct time-varying filter for perceptually small distortion and a filterbank analysis-synthesis for further HI simulations including temporal smearing. We evaluated the WHIS family and a Cambridge version of the HL simulator (CamHLS) in terms of differences in the IO function and spectral distance. The IO functions were simulated fairly well at $α$ less than 0.5 but not at $α$ equal to 1. Thus, it is difficult to simulate the HL when the IO function is sufficiently healthy. This is a fundamental limit of any existing HL simulator as well as WHIS. The new WHIS yielded a smaller spectral distortion than CamHLS and was fairly compatible with the previous version.

Submitted 28 November, 2023; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: This preprint was an original version that was unsuccessfully submitted to Trends in Hearing on June 5, 2022. The revised version has been accepted for publication in IEEE Access. See https://doi.org/10.1109/ACCESS.2023.3298673 ( https://ieeexplore.ieee.org/document/10193769 )

Journal ref: IEEE Access, 25 July 2023

arXiv:2206.06573 [pdf, ps, other]

doi 10.21437/Interspeech.2022-211

Speech intelligibility of simulated hearing loss sounds and its prediction using the Gammachirp Envelope Similarity Index (GESI)

Authors: Toshio Irino, Honoka Tamaru, Ayako Yamamoto

Abstract: In the present study, speech intelligibility (SI) experiments were performed using simulated hearing loss (HL) sounds in laboratory and remote environments to clarify the effects of peripheral dysfunction. Noisy speech sounds were processed to simulate the average HL of 70- and 80-year-olds using Wadai Hearing Impairment Simulator (WHIS). These sounds were presented to normal hearing (NH) listeners whose cognitive function could be assumed to be normal. The results showed that the divergence was larger in the remote experiments than in the laboratory ones. However, the remote results could be equalized to the laboratory ones, mostly through data screening using the results of tone pip tests prepared on the experimental web page. In addition, a newly proposed objective intelligibility measure (OIM) called the Gammachirp Envelope Similarity Index (GESI) explained the psychometric functions in the laboratory and remote experiments fairly well. GESI has the potential to explain the SI of HI listeners by properly setting HL parameters.

Submitted 28 November, 2023; v1 submitted 13 June, 2022; originally announced June 2022.

Comments: This preprint is a copy of the final version accepted for Interspeech 2022. See https://doi.org/10.21437/Interspeech.2022-211

Journal ref: Proc. Interspeech 2022

arXiv:2203.16760 [pdf, ps, other]

doi 10.23919/APSIPAASC55919.2022.9979946

Effective data screening technique for crowdsourced speech intelligibility experiments: Evaluation with IRM-based speech enhancement

Authors: Ayako Yamamoto, Toshio Irino, Shoko Araki, Kenichi Arai, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

Abstract: It is essential to perform speech intelligibility (SI) experiments with human listeners in order to evaluate objective intelligibility measures for developing effective speech enhancement and noise reduction algorithms. Recently, crowdsourced remote testing has become a popular means for collecting a massive amount and variety of data at a relatively small cost and in a short time. However, careful data screening is essential for attaining reliable SI data. We performed SI experiments on speech enhanced by an "oracle" ideal ratio mask (IRM) in a well-controlled laboratory and in crowdsourced remote environments that could not be controlled directly. We introduced simple tone pip tests, in which participants were asked to report the number of audible tone pips, to estimate their listening levels above audible thresholds. The tone pip tests were very effective for data screening to reduce the variability of crowdsourced remote results so that the laboratory results would become similar. The results also demonstrated the SI of an oracle IRM, giving us the upper limit of the mask-based single-channel speech enhancement.

Submitted 19 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

Comments: This paper was submitted to APSIPA ASC 2022 (https://www.apsipa2022.org). The original title [v1] was "Subjective intelligibility of speech sounds enhanced by ideal ratio mask via crowdsourced remote experiments with effective data screening."

Journal ref: Proc. APSIPA ASC 2022

arXiv:2109.11594 [pdf, ps, other]

Implementation of interactive tools for investigating fundamental frequency response of voiced sounds to auditory stimulation

Authors: Hideki Kawahara, Toshie Matsui, Kohei Yatabe, Ken-Ichi Sakakibara, Minoru Tsuzaki, Masanori Morise, Toshio Irino

Abstract: We introduced a measurement procedure for the involuntary response of voice fundamental-frequency to frequency modulated auditory stimulation. This involuntary response plays an essential role in voice fundamental frequency control but has been less investigated due to technical difficulties. This article introduces an interactive and real-time tool for investigating this response and supporting tools adopting our new measurement method. The method enables simultaneous measurement of multiple system properties based on a novel set of extended time-stretched pulses combined with orthogonalization. We made MATLAB implementations of these tools available as an open-source repository. This article also provides the detailed measurement procedure using the interactive tool followed by offline measurement tools for conducting subjective experiments and statistical analyses. It also provides technical descriptions of constituent signal processing subsystems as appendices. This application serves as an example for adopting our method to biological system analysis.

Submitted 23 September, 2021; originally announced September 2021.

Comments: Accepted for APSIPA ASC 2021

MSC Class: 91E45; 92-04; 94A11

arXiv:2104.10001 [pdf, ps, other]

doi 10.21437/Interspeech.2021-174

Comparison of remote experiments using crowdsourcing and laboratory experiments on speech intelligibility

Authors: Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

Abstract: Many subjective experiments have been performed to develop objective speech intelligibility measures, but the novel coronavirus outbreak has made it very difficult to conduct experiments in a laboratory. One solution is to perform remote testing using crowdsourcing; however, because we cannot control the listening conditions, it is unclear whether the results are entirely reliable. In this study, we compared speech intelligibility scores obtained in remote and laboratory experiments. The results showed that the mean and standard deviation (SD) of the remote experiments' speech reception threshold (SRT) were higher than those of the laboratory experiments. However, the variance in the SRTs across the speech-enhancement conditions revealed similarities, implying that remote testing results may be as useful as laboratory experiments to develop an objective measure. We also show that the practice session scores correlate with the SRT values. This is a priori information before performing the main tests and would be useful for data screening to reduce the variability of the SRT distribution.

Submitted 16 April, 2021; originally announced April 2021.

Comments: This paper was submitted to Interspeech2021

Journal ref: Proc. Interspeech 2021

arXiv:2104.01444 [pdf, ps, other]

doi 10.21437/Interspeech.2021-2073

Mixture of orthogonal sequences made from extended time-stretched pulses enables measurement of involuntary voice fundamental frequency response to pitch perturbation

Authors: Hideki Kawahara, Toshie Matsui, Kohei Yatabe, Ken-Ichi Sakakibara, Minoru Tsuzaki, Masanori Morise, Toshio Irino

Abstract: Auditory feedback plays an essential role in the regulation of the fundamental frequency of voiced sounds. The fundamental frequency also responds to auditory stimulation other than the speaker's voice. We propose to use this response of the fundamental frequency of sustained vowels to frequency-modulated test signals for investigating involuntary control of voice pitch. This involuntary response is difficult to identify and isolate by the conventional paradigm, which uses step-shaped pitch perturbation. We recently developed a versatile measurement method using a mixture of orthogonal sequences made from a set of extended time-stretched pulses (TSP). In this article, we extended our approach and designed a set of test signals using the mixture to modulate the fundamental frequency of artificial signals. For testing the response, the experimenter presents the modulated signal aurally while the subject is voicing sustained vowels. We developed a tool for conducting this test quickly and interactively. We make the tool available as open source and also provide executable GUI-based applications. Preliminary tests revealed that the proposed method consistently provides compensatory responses with about 100 ms latency, representing involuntary control. Finally, we discuss future applications of the proposed method for objective and non-invasive auditory response measurements.

Submitted 3 April, 2021; originally announced April 2021.

Comments: 5 pages, 9 figures, submitted to Interspeech2021

MSC Class: 92C55

arXiv:1909.04301 [pdf, ps, other]

doi 10.1109/APSIPAASC47483.2019.9023247

Frequency domain variant of Velvet noise and its application to acoustic measurements

Authors: Hideki Kawahara, Ken-Ichi Sakakibara, Mitsunori Mizumachi, Hideki Banno, Masanori Morise, Toshio Irino

Abstract: We propose a new family of test signals for acoustic measurements such as impulse response, nonlinearity, and the effects of background noise. The proposed family complements difficulties in existing families, the Swept-Sine (SS) and pseudo-random noise such as the maximum length sequence (MLS). The proposed family uses the frequency domain variant of the Velvet noise (FVN) as its building block. An FVN is an impulse response of an all-pass filter and yields the unit impulse when convolved with the time-reversed version of itself. In this respect, FVN is a member of the time-stretched pulse (TSP) in the broadest sense. The high degree of freedom in designing an FVN opens a vast range of applications in acoustic measurement. We introduce the following applications and their specific procedures, among other possibilities. They are as follows. a) Spectrum shaping adaptive to background noise. b) Simultaneous measurement of impulse responses of multiple acoustic paths. d) Simultaneous measurement of linear and nonlinear components of an acoustic path. e) Automatic procedure for time axis alignment of the source and the receiver when they are using independent clocks in acoustic impulse response measurement. We implemented a reference measurement tool equipped with all these procedures. The MATLAB source code and related materials are open-sourced and placed in a GitHub repository.

Submitted 10 September, 2019; originally announced September 2019.

Comments: 10 pages, 14 figures, APSIPA ASC 2019. arXiv admin note: text overlap with arXiv:1806.06812

Journal ref: 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 2019, pp. 1523-1532
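
The abstract of this entry states that an FVN is the impulse response of an all-pass filter and yields the unit impulse when convolved with its own time-reversed version. The following is a minimal numerical sketch of that property only, not the authors' FVN design: it assumes a uniformly random phase as a stand-in for the velvet-noise-based phase, and it uses circular (DFT-based) convolution. All names (`n`, `phase`, `h`, `h_rev`) are illustrative.

```python
import numpy as np

n = 1024  # DFT length (even); an arbitrary choice for this sketch
rng = np.random.default_rng(0)

# Assumption: a random phase stands in for the FVN phase design.
# One phase value per non-negative frequency bin; DC and Nyquist are kept real.
phase = rng.uniform(-np.pi, np.pi, size=n // 2 + 1)
phase[0] = 0.0
phase[-1] = 0.0

# All-pass filter: unit magnitude at every bin, arbitrary phase.
h = np.fft.irfft(np.exp(1j * phase), n)

# Circular time reversal: h_rev[k] = h[(-k) mod n].
h_rev = np.roll(h[::-1], 1)

# Circular convolution of h with its time-reversed copy via the DFT.
y = np.fft.irfft(np.fft.rfft(h) * np.fft.rfft(h_rev), n)

unit_impulse = np.zeros(n)
unit_impulse[0] = 1.0
is_unit_impulse = np.allclose(y, unit_impulse)
```

Because the magnitude response is 1 at every bin, the product of the spectrum with its conjugate is 1 everywhere, which is why `y` collapses to a unit impulse regardless of the phase chosen.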

arXiv:1904.02096 [pdf, other]

doi 10.1016/j.specom.2020.06.001

GEDI: Gammachirp Envelope Distortion Index for Predicting Intelligibility of Enhanced Speech

Authors: Katsuhiko Yamamoto, Toshio Irino, Shoko Araki, Keisuke Kinoshita, Tomohiro Nakatani

Abstract: In this study, we propose a new concept, the gammachirp envelope distortion index (GEDI), based on the signal-to-distortion ratio in the auditory envelope, SDRenv to predict the intelligibility of speech enhanced by nonlinear algorithms. The objective of GEDI is to calculate the distortion between enhanced and clean-speech representations in the domain of a temporal envelope extracted by the gammachirp auditory filterbank and modulation filterbank. We also extend GEDI with multi-resolution analysis (mr-GEDI) to predict the speech intelligibility of sounds under non-stationary noise conditions. We evaluate GEDI in terms of speech intelligibility predictions of speech sounds enhanced by a classic spectral subtraction and a Wiener filtering method. The predictions are compared with human results for various signal-to-noise ratio conditions with additive pink and babble noises. The results showed that mr-GEDI predicted the intelligibility curves better than short-time objective intelligibility (STOI) measure, extended-STOI (ESTOI) measure, and hearing-aid speech perception index (HASPI) under pink-noise conditions, and better than HASPI under babble-noise conditions. The mr-GEDI method does not present an overestimation tendency and is considered a more conservative approach than STOI and ESTOI. Therefore, the evaluation with mr-GEDI may provide additional information in the development of speech enhancement algorithms.

Submitted 19 July, 2020; v1 submitted 3 April, 2019; originally announced April 2019.

Comments: Preprint, 37 pages, 6 tables, 9 figures

Journal ref: Speech Communication, Vol. 123, pp. 43-58, 2020

arXiv:1806.06812 [pdf, ps, other]

Frequency domain variants of velvet noise and their application to speech processing and synthesis: with appendices

Authors: Hideki Kawahara, Ken-Ichi Sakakibara, Masanori Morise, Hideki Banno, Tomoki Toda, Toshio Irino

Abstract: We propose a new excitation source signal for VOCODERs and an all-pass impulse response for post-processing of synthetic sounds and pre-processing of natural sounds for data-augmentation. The proposed signals are variants of velvet noise, which is a sparse discrete signal consisting of a few non-zero (1 or -1) elements and sounds smoother than Gaussian white noise. One of the proposed variants, FVN (Frequency domain Velvet Noise), applies the procedure to generate a velvet noise on the cyclic frequency domain of DFT (Discrete Fourier Transform). Then, smoothing the generated signal to design the phase of an all-pass filter, followed by an inverse Fourier transform, yields the proposed FVN. Temporally variable frequency weighted mixing of FVN generated by frozen and shuffled random numbers provides a unified excitation signal which can span from random noise to a repetitive pulse train. The other variant, which is an all-pass impulse response, significantly reduces the "buzzy" impression of VOCODER output by filtering. Finally, we will discuss applications of the proposed signal for watermarking and psychoacoustic research.

Submitted 18 June, 2018; originally announced June 2018.

Comments: 11 pages, 20 figures, and 1 table, Interspeech 2018
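
The abstract of this entry describes velvet noise as a sparse discrete signal with a few non-zero (1 or -1) elements. As a rough illustration of that time-domain definition only (not of the proposed FVN, and not the authors' code), the sketch below places one randomly signed pulse at a random position inside each fixed-length interval. The function name `velvet_noise` and the parameters `n_samples`, `grid`, and `seed` are hypothetical.

```python
import numpy as np

def velvet_noise(n_samples, grid, seed=0):
    """Return a sparse sequence with one +1/-1 pulse per `grid`-sample interval.

    Assumption for this sketch: pulse position and sign are drawn uniformly
    and independently within each interval.
    """
    rng = np.random.default_rng(seed)
    signal = np.zeros(n_samples)
    n_pulses = n_samples // grid                    # whole intervals only
    offsets = rng.integers(0, grid, size=n_pulses)  # position inside each interval
    signs = rng.choice([-1.0, 1.0], size=n_pulses)  # random polarity
    signal[np.arange(n_pulses) * grid + offsets] = signs
    return signal

# Example: 4410 samples with one pulse per 100 samples.
vn = velvet_noise(n_samples=4410, grid=100)
```

With these values, only 44 of the 4410 samples are non-zero, which illustrates the sparsity the abstract refers to.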

arXiv:1702.06724 [pdf, ps, other]

doi 10.21437/Interspeech.2017-15

A new cosine series antialiasing function and its application to aliasing-free glottal source models for speech and singing synthesis

Authors: Hideki Kawahara, Ken-Ichi Sakakibara, Hideki Banno, Masanori Morise, Tomoki Toda, Toshio Irino

Abstract: We formulated and implemented a procedure to generate aliasing-free excitation source signals. It uses a new antialiasing filter in the continuous time domain followed by an IIR digital filter for response equalization. We introduced a cosine-series-based general design procedure for the new antialiasing function. We applied this new procedure to implement the antialiased Fujisaki-Ljungqvist model. We also applied it to revise our previous implementation of the antialiased Fant-Liljencrants model. A combination of these signals and a lattice implementation of the time varying vocal tract model provides a reliable and flexible basis to test fo extractors and source aperiodicity analysis methods. MATLAB implementations of these antialiased excitation source models are available as part of our open source tools for speech science.

Submitted 8 June, 2017; v1 submitted 22 February, 2017; originally announced February 2017.

Comments: Submitted to Interspeech 2017

Journal ref: Proc. Interspeech 2017, pp.1358-1362

Showing 1–13 of 13 results for author: Irino, T