TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Authors: Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Sheng Zhao, Michael Zeng

Abstract: There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeline framework by concatenating speech recognition, machine translation and text-to-speech models. The primary challenges stem from the inherent complexities involved in direct translation tasks and the scarcity of data. In this study, we introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion yet facilitates end-to-end inference through joint probability. Furthermore, we propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process, making it highly suitable for scenarios such as video dubbing. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.

Submitted 28 May, 2024; originally announced May 2024.

Comments: Work in progress

arXiv:2404.15961 [pdf, other]

Soil analysis with machine-learning-based processing of stepped-frequency GPR field measurements: Preliminary study

Authors: Chunlei Xu, Michael Pregesbauer, Naga Sravani Chilukuri, Daniel Windhager, Mahsa Yousefi, Pedro Julian, Lothar Ratschbacher

Abstract: Ground Penetrating Radar (GPR) has been widely studied as a tool for extracting soil parameters relevant to agriculture and horticulture. When combined with Machine-Learning-based (ML) methods, high-resolution Stepped Frequency Continuous Wave Radar (SFCW) measurements hold the promise to give cost effective access to depth resolved soil parameters, including at root-level depth. In a first step in this direction, we perform an extensive field survey with a tractor mounted SFCW GPR instrument. Using ML data processing we test the GPR instrument's capabilities to predict the apparent electrical conductivity (ECaR) as measured by a simultaneously recording Electromagnetic Induction (EMI) instrument. The large-scale field measurement campaign with 3472 co-registered and geo-located GPR and EMI data samples distributed over ~6600 square meters was performed on a golf course. The selected terrain benefits from a high surface homogeneity, but also features the challenge of only small, and hence hard to discern, variations in the measured soil parameter. Based on the quantitative results we suggest the use of nugget-to-sill ratio as a performance metric for the evaluation of end-to-end ML performance in the agricultural setting and discuss the limiting factors in the multi-sensor regression setting. The code is released as open source and available at https://opensource.silicon-austria.com/xuc/soil-analysis-machine-learning-stepped-frequency-gpr.

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.07239 [pdf]

Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis

Authors: Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak

Abstract: Thyroid cancer is an increasing global health concern that requires advanced diagnostic methods. The application of AI and radiomics to thyroid cancer diagnosis is examined in this review. A review of multiple databases was conducted in compliance with PRISMA guidelines until October 2023. A combination of keywords led to the discovery of an English academic publication on thyroid cancer and related subjects. 267 papers were returned from the original search after 109 duplicates were removed. Relevant studies were selected according to predetermined criteria after 124 articles were eliminated based on an examination of their abstract and title. After the comprehensive analysis, an additional six studies were excluded. Among the 28 included studies, radiomics analysis, which incorporates ultrasound (US) images, demonstrated its effectiveness in diagnosing thyroid cancer. Various results were noted, some of the studies presenting new strategies that outperformed the status quo. The literature has emphasized various challenges faced by AI models, including interpretability issues, dataset constraints, and operator dependence. The synthesized findings of the 28 included studies mentioned the need for standardization efforts and prospective multicenter studies to address these concerns. Furthermore, approaches to overcome these obstacles were identified, such as advances in explainable AI technology and personalized medicine techniques. The review focuses on how AI and radiomics could transform the diagnosis and treatment of thyroid cancer. Despite challenges, future research on multidisciplinary cooperation, clinical applicability validation, and algorithm improvement holds the potential to improve patient outcomes and diagnostic precision in the treatment of thyroid cancer.

Submitted 9 April, 2024; originally announced April 2024.

Comments: 50 pages, 8 figures, 1 table, 119 references

ACM Class: J.3.2; J.3.3

arXiv:2404.06690 [pdf, other]

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Authors: Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

Abstract: Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. This is exemplified by instances generated in a single channel where one speaker's utterance is seamlessly mixed with another's interjections or laughter, indicating the latter's role as an attentive listener. Audio samples are available at https://aka.ms/covomix.

Submitted 29 May, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2310.05950 [pdf, other]

Quantization of Neural Network Equalizers in Optical Fiber Transmission Experiments

Authors: Jamal Darweesh, Nelson Costa, Antonio Napoli, Bernhard Spinnler, Yves Jaouen, Mansoor Yousefi

Abstract: The quantization of neural networks for the mitigation of the nonlinear and components' distortions in dual-polarization optical fiber transmission is studied. Two low-complexity neural network equalizers are applied in three 16-QAM 34.4 GBaud transmission experiments with different representative fibers. A number of post-training quantization and quantization-aware training algorithms are compared for casting the weights and activations of the neural network in few bits, combined with the uniform, additive power-of-two, and companding quantization. For quantization in the large bit-width regime of $\geq 5$ bits, the quantization-aware training with the straight-through estimation incurs a Q-factor penalty of less than 0.5 dB compared to the unquantized neural network. For quantization in the low bit-width regime, an algorithm dubbed companding successive alpha-blending quantization is suggested. This method compensates for the quantization error aggressively by successive grouping and retraining of the parameters, as well as an incremental transition from the floating-point representations to the quantized values within each group. The activations can be quantized at 8 bits and the weights on average at 1.75 bits, with a penalty of $\leq 0.5$ dB. If the activations are quantized at 6 bits, the weights can be quantized at 3.75 bits with minimal penalty. The computational complexity and required storage of the neural networks are drastically reduced, typically by over 90%. The results indicate that low-complexity neural networks can mitigate nonlinearities in optical fiber transmission.

Submitted 9 September, 2023; originally announced October 2023.

Comments: 15 pages, 9 figures, 5 tables

arXiv:2309.12521 [pdf, other]

Profile-Error-Tolerant Target-Speaker Voice Activity Detection

Authors: Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Midia Yousefi, Takuya Yoshioka, Jian Wu

Abstract: Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from errors in speaker profiles, as those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. This paper proposes an extension to TS-VAD, called Profile-Error-Tolerant TS-VAD (PET-TSVAD), which is robust to such speaker profile errors. This is achieved by employing transformer-based TS-VAD that can handle a variable number of speakers and further introducing a set of additional pseudo-speaker profiles to handle speakers undetected during the first pass diarization. During training, we use speaker profiles estimated by multiple different clustering algorithms to reduce the mismatch between the training and testing conditions regarding speaker profiles. Experimental results show that PET-TSVAD consistently outperforms the existing TS-VAD method on both the VoxConverse and DIHARD-I datasets.

Submitted 3 April, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

Comments: Submission for ICASSP 2024

arXiv:2307.06821 [pdf, other]

Equalization in Dispersion-Managed Systems Using Learned Digital Back-Propagation

Authors: Mohannad Abu-Romoh, Nelson Costa, Yves Jaouën, Antonio Napoli, João Pedro, Bernhard Spinnler, Mansoor Yousefi

Abstract: In this paper, we investigate the use of the learned digital back-propagation (LDBP) for equalizing dual-polarization fiber-optic transmission in dispersion-managed (DM) links. LDBP is a deep neural network that optimizes the parameters of DBP using the stochastic gradient descent. We evaluate DBP and LDBP in a simulated WDM dual-polarization fiber transmission system operating at the bitrate of 256 Gbit/s per channel, with a dispersion map designed for a 2016 km link with 15% residual dispersion. Our results show that in single-channel transmission, LDBP achieves an effective signal-to-noise ratio improvement of 6.3 dB and 2.5 dB, respectively, over linear equalization and DBP. In WDM transmission, the corresponding $Q$-factor gains are 1.1 dB and 0.4 dB, respectively. Additionally, we conduct a complexity analysis, which reveals that a frequency-domain implementation of LDBP and DBP is more favorable in terms of complexity than the time-domain implementation. These findings demonstrate the effectiveness of LDBP in mitigating the nonlinear effects in DM fiber-optic transmission systems.

Submitted 26 May, 2023; originally announced July 2023.

arXiv:2305.02234 [pdf]

Forged Channel: A Breakthrough Approach for Accurate Parkinson's Disease Classification using Leave-One-Subject-Out Cross-Validation

Authors: A. Hamidi, K. Mohamed-Pour, M. Yousefi

Abstract: This paper introduces a novel technique called "Forged Channel," which aims to comprehensively represent EEG signals in order to achieve accurate classification of Parkinson's disease. The forged channel method prepares EEG signals in a manner that allows a deep learning model to effectively perceive all EEG channels within a single input. By employing this approach alongside a convolutional neural network, an impressive accuracy of 90.32% was achieved using leave-one-subject-out cross-validation. This performance closely reflects real-world conditions, highlighting the superiority of our method compared to similar approaches.

Submitted 16 April, 2024; v1 submitted 3 May, 2023; originally announced May 2023.

Comments: 5 pages, 2 figures, 3 tables

arXiv:2210.05454 [pdf, ps, other]

doi 10.1364/SPPCOM.2021.SpM5C.5

Low Complexity Convolutional Neural Networks for Equalization in Optical Fiber Transmission

Authors: Mohannad Abu-romoh, Nelson Costa, Antonio Napoli, João Pedro, Yves Jaouën, Mansoor Yousefi

Abstract: A convolutional neural network is proposed to mitigate fiber transmission effects, achieving a five-fold reduction in trainable parameters compared to alternative equalizers, and 3.5 dB improvement in MSE compared to DBP with comparable complexity.

Submitted 11 October, 2022; originally announced October 2022.

Comments: 2 pages, 3 figures. Submitted to the OSA Advanced Photonics Congress 2021. Presented in Signal Processing in Photonic Communications (SPPCom) 2021. From the session: Neural Networks Applications for Photonic Systems (SpM5C)

arXiv:2207.12154 [pdf, other]

Complexity Reduction over Bi-RNN-Based Nonlinearity Mitigation in Dual-Pol Fiber-Optic Communications via a CRNN-Based Approach

Authors: Abtin Shahkarami, Mansoor Yousefi, Yves Jaouen

Abstract: Bidirectional recurrent neural networks (bi-RNNs), in particular, bidirectional long short term memory (bi-LSTM), bidirectional gated recurrent unit, and convolutional bi-LSTM models have recently attracted attention for nonlinearity mitigation in fiber-optic communication. The recently adopted approaches based on these models, however, incur a high computational complexity which may impede their real-time functioning. In this paper, by addressing the sources of complexity in these methods, we propose a more efficient network architecture, where a convolutional neural network encoder and a unidirectional many-to-one vanilla RNN operate in tandem, each best capturing one set of channel impairments while compensating for the shortcomings of the other. We deploy this model in two different receiver configurations. In one, the neural network is placed after a linear equalization chain and is merely responsible for nonlinearity mitigation; in the other, the neural network is directly placed after the chromatic dispersion compensation and is responsible for joint nonlinearity and polarization mode dispersion compensation. For a 16-QAM 64 GBd dual-polarization optical transmission over 14x80 km standard single-mode fiber, we demonstrate that the proposed hybrid model achieves the bit error probability of the state-of-the-art bi-RNN-based methods with greater than 50% lower complexity, in both receiver configurations.

Submitted 25 July, 2022; originally announced July 2022.

arXiv:2205.11376 [pdf, other]

Learned Digital Back-Propagation for Dual-Polarization Dispersion Managed Systems

Authors: Mohannad Abu-romoh, Nelson Costa, Antonio Napoli, Bernhard Spinnler, Yves Jaouën, Mansoor Yousefi

Abstract: Digital back-propagation (DBP) and learned DBP (LDBP) are proposed for nonlinearity mitigation in WDM dual-polarization dispersion-managed systems. LDBP achieves Q-factor improvement of 1.8 dB and 1.2 dB, respectively, over linear equalization and a variant of DBP adapted to DM systems.

Submitted 23 May, 2022; originally announced May 2022.

arXiv:2205.11284 [pdf, other]

Few-bit Quantization of Neural Networks for Nonlinearity Mitigation in a Fiber Transmission Experiment

Authors: Jamal Darweesh, Nelson Costa, Antonio Napoli, Bernhard Spinnler, Yves Jaouen, Mansoor Yousefi

Abstract: A neural network is quantized for the mitigation of nonlinear and components distortions in a 16-QAM 9x50km dual-polarization fiber transmission experiment. Post-training additive power-of-two quantization at 6 bits incurs a negligible Q-factor penalty. At 5 bits, the model size is reduced by 85%, with 0.8 dB penalty.

Submitted 25 May, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: 4 pages, 3 figures

arXiv:2204.04488 [pdf]

Comparison of EEG based epilepsy diagnosis using neural networks and wavelet transform

Authors: Mohammad Reza Yousefi, Saina Golnejad, Melika Mohammad Hosseini, Amin Dehghani

Abstract: Epilepsy is one of the common neurological disorders characterized by recurrent and uncontrollable seizures, which seriously affect the life of patients. In many cases, electroencephalogram signals can provide important physiological information about the activity of the human brain which can be used to diagnose epilepsy. However, visual inspection of a large number of electroencephalogram signals is very time-consuming and can often lead to inconsistencies in physicians' diagnoses. Quantification of abnormalities in brain signals can indicate brain conditions and pathology so the electroencephalogram (EEG) signal plays a key role in the diagnosis of epilepsy. In this article, an attempt has been made to create a single instruction for diagnosing epilepsy, which consists of two steps. In the first step, a low-pass filter was used to preprocess the data and three separate mid-pass filters for different frequency bands and a multilayer neural network were designed. In the second step, the wavelet transform technique was used to process data. In particular, this paper proposes a multilayer perceptron neural network classifier for the diagnosis of epilepsy, that requires normal data and epilepsy data for education, but this classifier can recognize normal disorders, epilepsy, and even other disorders taught in educational examples. Also, the value of using electroencephalogram signal has been evaluated in two ways: using wavelet transform and non-using wavelet transform. Finally, the evaluation results indicate a relatively uniform impact factor on the use or non-use of wavelet transform on the improvement of epilepsy data functions, but in the end, it was shown that the use of perceptron multilayer neural network can provide a higher accuracy coefficient for experts.

Submitted 12 August, 2023; v1 submitted 9 April, 2022; originally announced April 2022.

Comments: 8 pages, 4 tables, 3 figures

arXiv:2111.08635 [pdf, other]

Single-channel speech separation using Soft-minimum Permutation Invariant Training

Authors: Midia Yousefi, John H. L. Hansen

Abstract: The goal of speech separation is to extract multiple speech sources from a single microphone recording. Recently, with the advancement of deep learning and availability of large datasets, speech separation has been formulated as a supervised learning problem. These approaches aim to learn discriminative patterns of speech, speakers, and background noise using a supervised learning algorithm, typically a deep neural network. A long-lasting problem in supervised speech separation is finding the correct label for each separated speech signal, referred to as label permutation ambiguity. Permutation ambiguity refers to the problem of determining the output-label assignment between the separated sources and the available single-speaker speech labels. Finding the best output-label assignment is required for calculation of separation error, which is later used for updating parameters of the model. Recently, Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem. However, the overconfident choice of the output-label assignment by PIT results in a sub-optimal trained model. In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment. Our proposed method entitled trainable Soft-minimum PIT is then employed on the same Long-Short Term Memory (LSTM) architecture used in Permutation Invariant Training (PIT) speech separation method. The results of our experiments show that the proposed method outperforms conventional PIT speech separation significantly (p-value $< 0.01$) by +1dB in Signal to Distortion Ratio (SDR) and +1.5dB in Signal to Interference Ratio (SIR).

Submitted 16 November, 2021; originally announced November 2021.

arXiv:2111.00320 [pdf, other]

Speaker conditioning of acoustic models using affine transformation for multi-speaker speech recognition

Authors: Midia Yousefi, John H. L. Hansen

Abstract: This study addresses the problem of single-channel Automatic Speech Recognition of a target speaker within an overlap speech scenario. In the proposed method, the hidden representations in the acoustic model are modulated by speaker auxiliary information to recognize only the desired speaker. Affine transformation layers are inserted into the acoustic model network to integrate speaker information with the acoustic features. The speaker conditioning process allows the acoustic model to perform computation in the context of target-speaker auxiliary information. The proposed speaker conditioning method is a general approach and can be applied to any acoustic model architecture. Here, we employ speaker conditioning on a ResNet acoustic model. Experiments on the WSJ corpus show that the proposed speaker conditioning method is an effective solution to fuse speaker auxiliary information with acoustic features for multi-speaker speech recognition, achieving +9% and +20% relative WER reduction for clean and overlap speech scenarios, respectively, compared to the original ResNet acoustic model baseline.

Submitted 30 October, 2021; originally announced November 2021.

arXiv:2111.00316 [pdf, other]

Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network

Authors: Midia Yousefi, John H. L. Hansen

Abstract: Most current speech technology systems are designed to operate well even in the presence of multiple active speakers. However, most solutions assume that the number of co-current speakers is known. Unfortunately, this information might not always be available in real-world applications. In this study, we propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech. The proposed system extracts higher-level information from the speech spectral content using a CNN model. Next, the attention mechanism summarizes the extracted information into a compact feature vector without losing critical information. Finally, the active speakers are classified using a fully connected network. Experiments on simulated overlapping speech using WSJ corpus show that the attention solution is shown to improve the performance by almost 3% absolute over conventional temporal average pooling. The proposed Attention-guided CNN achieves 76.15% for both Weighted Accuracy and average Recall, and 75.80% Precision on speech segments as short as 20 frames (i.e., 200 ms). All the classification metrics exceed 92% for the attention-guided model in offline scenarios where the input signal is more than 100 frames long (i.e., 1s).

Submitted 30 October, 2021; originally announced November 2021.

arXiv:2001.09937 [pdf, other]

Frame-based overlapping speech detection using Convolutional Neural Networks

Authors: Midia Yousefi, John H. L. Hansen

Abstract: Naturalistic speech recordings usually contain speech signals from multiple speakers. This phenomenon can degrade the performance of speech technologies due to the complexity of tracing and recognizing individual speakers. In this study, we investigate the detection of overlapping speech on segments as short as 25 ms using Convolutional Neural Networks. We evaluate the detection performance using different spectral features, and show that pyknogram features outperform other commonly used speech features. The proposed system can predict overlapping speech with an accuracy of 84% and Fscore of 88% on a dataset of mixed speech generated based on the GRID dataset.

Submitted 12 February, 2020; v1 submitted 27 January, 2020; originally announced January 2020.

arXiv:1908.01768 [pdf, other]

Probabilistic Permutation Invariant Training for Speech Separation

Authors: Midia Yousefi, Soheil Khorram, John H. L. Hansen

Abstract: Single-microphone, speaker-independent speech separation is normally performed through two steps: (i) separating the specific speech sources, and (ii) determining the best output-label assignment to find the separation error. The second step is the main obstacle in training neural networks for speech separation. Recently proposed Permutation Invariant Training (PIT) addresses this problem by determining the output-label assignment which minimizes the separation error. In this study, we show that a major drawback of this technique is the overconfident choice of the output-label assignment, especially in the initial steps of training when the network generates unreliable outputs. To solve this problem, we propose Probabilistic PIT (Prob-PIT) which considers the output-label permutation as a discrete latent random variable with a uniform prior distribution. Prob-PIT defines a log-likelihood function based on the prior distributions and the separation errors of all permutations; it trains the speech separation networks by maximizing the log-likelihood function. Prob-PIT can be easily implemented by replacing the minimum function of PIT with a soft-minimum function. We evaluate our approach for speech separation on both TIMIT and CHiME datasets. The results show that the proposed method significantly outperforms PIT in terms of Signal to Distortion Ratio and Signal to Interference Ratio.

Submitted 4 August, 2019; originally announced August 2019.

Comments: Interspeech 2019

arXiv:1804.06941 [pdf, other]

Reducing Conservatism in Model-Invariant Safety-Preserving Control of Propofol Anesthesia Using Falsification

Authors: Mahdi Yousefi, Klaske van Heusden, Ian M. Mitchell, J. Mark Ansermino, Guy A. Dumont

Abstract: This work provides a formalized model-invariant safety system for closed-loop anesthesia that uses feedback from measured data for model falsification to reduce conservatism. The safety system maintains predicted propofol plasma concentrations, as well as the patient's blood pressure, within safety bounds despite uncertainty in patient responses to propofol. Model-invariant formal verification is used to formalize the safety system. This technique requires a multi-model description of model-uncertainty. Model-invariant verification considers all possible dynamics of an uncertain system, and the resulting safety system may be conservative for systems that do not exhibit the worst-case dynamical response. In this work, we employ model falsification to reduce conservatism of the model-invariant safety system. Members of a model set that characterizes model-uncertainty are falsified if discrepancy between predictions of those models and measured responses of the uncertain system is established, thereby reducing model uncertainty. We show that including falsification in a model-invariant safety system reduces conservatism of the safety system.

Submitted 18 April, 2018; originally announced April 2018.

Comments: 11 pages, 9 figures, submitted to IEEE TCST

Showing 1–19 of 19 results for author: Yousefi, M