-
Unsupervised Multi-channel Separation and Adaptation
Authors:
Cong Han,
Kevin Wilson,
Scott Wisdom,
John R. Hershey
Abstract:
A key challenge in machine learning is to generalize from training data to an application domain of interest. This work generalizes the recently-proposed mixture invariant training (MixIT) algorithm to perform unsupervised learning in the multi-channel setting. We use MixIT to train a model on far-field microphone array recordings of overlapping reverberant and noisy speech from the AMI Corpus. The models are trained on both supervised and unsupervised training data, and are tested on real AMI recordings containing overlapping speech. To objectively evaluate our models, we also use a synthetic multi-channel AMI test set. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest improvement to SI-SNR and to human listening ratings across synthetic and real datasets, outperforming supervised models trained on well-matched synthetic data. Our results demonstrate that unsupervised learning through MixIT enables model adaptation on both single- and multi-channel real-world speech recordings.
Submitted 22 March, 2024; v1 submitted 18 May, 2023;
originally announced May 2023.
-
Segment Anything Model (SAM) for Digital Pathology: Assess Zero-shot Segmentation on Whole Slide Imaging
Authors:
Ruining Deng,
Can Cui,
Quan Liu,
Tianyuan Yao,
Lucas W. Remedios,
Shunxing Bao,
Bennett A. Landman,
Lee E. Wheless,
Lori A. Coburn,
Keith T. Wilson,
Yaohong Wang,
Shilin Zhao,
Agnes B. Fogo,
Haichun Yang,
Yucheng Tang,
Yuankai Huo
Abstract:
The segment anything model (SAM) was released as a foundation model for image segmentation. The promptable segmentation model was trained on over 1 billion masks on 11M licensed and privacy-respecting images. The model supports zero-shot image segmentation with various segmentation prompts (e.g., points, boxes, masks). This makes SAM attractive for medical image analysis, especially for digital pathology, where training data are rare. In this study, we evaluate the zero-shot segmentation performance of the SAM model on representative segmentation tasks on whole slide imaging (WSI), including (1) tumor segmentation, (2) non-tumor tissue segmentation, and (3) cell nuclei segmentation. Core Results: The results suggest that the zero-shot SAM model achieves remarkable segmentation performance for large connected objects. However, it does not consistently achieve satisfying performance for dense instance object segmentation, even with 20 prompts (clicks/boxes) on each image. We also summarize the identified limitations for digital pathology: (1) image resolution, (2) multiple scales, (3) prompt selection, and (4) model fine-tuning. In the future, few-shot fine-tuning with images from downstream pathological segmentation tasks might help the model achieve better performance in dense object segmentation.
Submitted 9 April, 2023;
originally announced April 2023.
-
Cross-scale Multi-instance Learning for Pathological Image Diagnosis
Authors:
Ruining Deng,
Can Cui,
Lucas W. Remedios,
Shunxing Bao,
R. Michael Womick,
Sophie Chiron,
Jia Li,
Joseph T. Roland,
Ken S. Lau,
Qi Liu,
Keith T. Wilson,
Yaohong Wang,
Lori A. Coburn,
Bennett A. Landman,
Yuankai Huo
Abstract:
Analyzing high resolution whole slide images (WSIs) with regard to information across multiple scales poses a significant challenge in digital pathology. Multi-instance learning (MIL) is a common solution for working with high resolution images by classifying bags of objects (i.e. sets of smaller image patches). However, such processing is typically performed at a single scale (e.g., 20x magnification) of WSIs, disregarding the vital inter-scale information that is key to diagnoses by human pathologists. In this study, we propose a novel cross-scale MIL algorithm to explicitly aggregate inter-scale relationships into a single MIL network for pathological image diagnosis. The contribution of this paper is three-fold: (1) A novel cross-scale MIL (CS-MIL) algorithm that integrates the multi-scale information and the inter-scale relationships is proposed; (2) A toy dataset with scale-specific morphological features is created and released to examine and visualize differential cross-scale attention; (3) Superior performance on both in-house and public datasets is demonstrated by our simple cross-scale MIL strategy. The official implementation is publicly available at https://github.com/hrlblab/CS-MIL.
Submitted 16 February, 2024; v1 submitted 31 March, 2023;
originally announced April 2023.
-
Optimizing Real-Time Performances for Timed-Loop Racing under F1TENTH
Authors:
Nitish Gupta,
Kurt Wilson,
Zhishan Guo
Abstract:
Motion planning and control in autonomous car racing are among the most challenging and safety-critical tasks due to high speed and dynamism. The lower-level control nodes are expected to be highly optimized because of the resource constraints of onboard embedded processing units, while still meeting strict latency requirements. Some of these guarantees can be provided at the application level, such as by using ROS2's Real-Time executors. However, the performance can be far from satisfactory, as many modern control algorithms (such as Model Predictive Control) rely on solving complicated online optimization problems at each iteration. In this paper, we present a simple yet effective multi-threading technique to optimize the throughput of online-control algorithms for resource-constrained autonomous racing platforms. We achieve this by maintaining a systematic pool of worker threads that solve the optimization problem in parallel, which can improve system performance by reducing the latency between control input commands. We further demonstrate the effectiveness of our method using the Model Predictive Contouring Control (MPCC) algorithm running on Nvidia's Xavier AGX platform.
Submitted 8 December, 2022;
originally announced December 2022.
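The worker-pool idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: `solve_control` is a hypothetical stand-in for an MPC solve (here just a proportional steering law), `run_pipeline` is an illustrative name, and the pool size of 4 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_control(state):
    """Hypothetical stand-in for one optimization solve.

    A real controller would solve an MPC problem here; this sketch
    returns a proportional steering command toward heading 0.0.
    """
    target_heading = 0.0
    return 0.5 * (target_heading - state)


def run_pipeline(states, workers=4):
    """Solve one control problem per state snapshot on a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each state to an idle worker and yields the
        # results in the same order as the inputs.
        return list(pool.map(solve_control, states))


commands = run_pipeline([0.2, -0.4, 0.0])
```

In CPython, threads only run CPU-bound work in parallel when the solver releases the global interpreter lock (as many native solvers do), so this sketch shows the dispatch pattern rather than a guaranteed speed-up.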
-
Distance-Based Sound Separation
Authors:
Katharine Patterson,
Kevin Wilson,
Scott Wisdom,
John R. Hershey
Abstract:
We propose the novel task of distance-based sound separation, where sounds are separated based only on their distance from a single microphone. In the context of assisted listening devices, proximity provides a simple criterion for sound selection in noisy environments that would allow the user to focus on sounds relevant to a local conversation. We demonstrate the feasibility of this approach by training a neural network to separate near sounds from far sounds in single channel synthetic reverberant mixtures, relative to a threshold distance defining the boundary between near and far. With a single nearby speaker and four distant speakers, the model improves scale-invariant signal to noise ratio by 4.4 dB for near sounds and 6.8 dB for far sounds.
Submitted 1 July, 2022;
originally announced July 2022.
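For readers unfamiliar with the metric quoted in the abstract above, the following is a minimal sketch of scale-invariant signal-to-noise ratio (SI-SNR) in its standard form, written with the Python standard library only. The function name `si_snr_db` and the small `eps` guard are illustrative assumptions, not taken from the paper.

```python
import math


def si_snr_db(estimate, target, eps=1e-12):
    """Scale-invariant SNR in dB between two equal-length sample lists."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Project the estimate onto the target; the projection is the
    # optimally rescaled reference, which makes the metric scale-invariant.
    scale = dot / (target_energy + eps)
    projection = [scale * t for t in target]
    noise = [e - p for e, p in zip(estimate, projection)]
    signal_power = sum(p * p for p in projection)
    noise_power = sum(n * n for n in noise)
    # eps guards against division by zero for a perfect estimate.
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))


target = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, 0.1]
score = si_snr_db(estimate, target)
```

An improvement figure such as the ones reported is then the difference between the score of the separated output and the score of the unprocessed mixture, both measured against the same reference.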
-
Random Multi-Channel Image Synthesis for Multiplexed Immunofluorescence Imaging
Authors:
Shunxing Bao,
Yucheng Tang,
Ho Hin Lee,
Riqiang Gao,
Sophie Chiron,
Ilwoo Lyu,
Lori A. Coburn,
Keith T. Wilson,
Joseph T. Roland,
Bennett A. Landman,
Yuankai Huo
Abstract:
Multiplex immunofluorescence (MxIF) is an emerging imaging technique that produces the high sensitivity and specificity of single-cell mapping. With a tenet of 'seeing is believing', MxIF enables iterative staining and imaging of extensive antibodies, which provides comprehensive biomarkers to segment and group different cells on a single tissue section. However, considerable depletion of the scarce tissue is inevitable from extensive rounds of staining and bleaching ('missing tissue'). Moreover, the immunofluorescence (IF) imaging can globally fail for particular rounds ('missing stain'). In this work, we focus on the 'missing stain' issue. It would be appealing to develop digital image synthesis approaches to restore missing stain images without losing more tissue physically. Herein, we aim to develop image synthesis approaches for eleven MxIF structural molecular markers (i.e., epithelial and stromal) on real samples. We propose a novel multi-channel high-resolution image synthesis approach, called pixN2N-HD, to tackle possible missing stain scenarios via a high-resolution generative adversarial network (GAN). Our contribution is three-fold: (1) a single deep network framework is proposed to tackle missing stain in MxIF; (2) the proposed 'N-to-N' strategy reduces a theoretical four years of computational time to 20 hours when covering all possible missing stain scenarios, with up to five missing stains (e.g., '(N-1)-to-1', '(N-2)-to-2'); and (3) this work is the first comprehensive experimental study investigating cross-stain synthesis in MxIF. Our results elucidate a promising direction for advancing MxIF imaging with deep image synthesis.
Submitted 18 September, 2021;
originally announced September 2021.
-
A Joint Technique for Nonlinearity Compensation in CO-OFDM Superchannel Systems
Authors:
O. S. Sunish Kumar,
A. Amari,
O. A. Dobre,
R. Venkatesan,
S. K. Wilson
Abstract:
We propose a technique combining single-channel digital back-propagation (SC-DBP) with the phase-conjugated twin wave (PCTW) technique to compensate nonlinearities in CO-OFDM superchannel systems. The technique achieves performance similar to multi-channel DBP while providing increased transmission reach compared to SC-DBP, PCTW, and linear dispersion compensation (LDC).
Submitted 27 June, 2021;
originally announced June 2021.
-
A Spectrally Efficient Linear Polarization Coding Scheme for Fiber Nonlinearity Compensation in CO-OFDM Systems
Authors:
O. S. Sunish Kumar,
O. A. Dobre,
R. Venkatesan,
S. K. Wilson,
O. Omomukuyo,
A. Amari,
D. Chang
Abstract:
In this paper, we propose a linear polarization coding scheme (LPC) combined with the phase conjugated twin signals (PCTS) technique, referred to as LPC-PCTS, for fiber nonlinearity mitigation in coherent optical orthogonal frequency division multiplexing (CO-OFDM) systems. The LPC linearly combines the data symbols on the adjacent subcarriers of the OFDM symbol, one at full amplitude and the other at half amplitude. The linearly coded data is then transmitted as phase conjugate pairs on the same subcarriers of the two OFDM symbols on the two orthogonal polarizations. The nonlinear distortions added to these subcarriers are essentially anti-correlated, since they carry phase conjugate pairs of data. At the receiver, the coherent superposition of the information symbols received on these pairs of subcarriers eventually leads to the cancellation of the nonlinear distortions. We conducted numerical simulation of a single channel 200 Gb/s CO-OFDM system employing the LPC-PCTS technique. The results show a Q-factor improvement of 2.3 dB and 1.7 dB with and without the dispersion symmetry, respectively, when compared to the recently proposed phase conjugated subcarrier coding (PCSC) technique, at an average launch power of 3 dBm. In addition, our proposed LPC-PCTS technique shows a significant performance improvement when compared to the 16-quadrature amplitude modulation (QAM) with phase conjugated twin waves (PCTW) scheme, at the same spectral efficiency, for an uncompensated transmission distance of 2800 km.
Submitted 27 June, 2021;
originally announced June 2021.
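The cancellation step described in the abstract above can be illustrated with an idealized sketch. It assumes a first-order model in which the distortion on the conjugate copy is exactly the negated conjugate of the distortion on the original symbol; that assumption, the function names, and the sample values are all illustrative and not taken from the paper, and the linear coding of adjacent subcarriers is omitted.

```python
def transmit_pair(symbol, distortion):
    """Return the two received copies under the idealized model.

    Assumption: the conjugate copy picks up the anti-correlated
    distortion -conj(distortion).
    """
    rx_x = symbol + distortion
    rx_y = symbol.conjugate() - distortion.conjugate()
    return rx_x, rx_y


def superpose(rx_x, rx_y):
    """Coherent superposition: conjugate the twin and average the pair."""
    return (rx_x + rx_y.conjugate()) / 2


symbol = 1 + 1j
distortion = 0.1 - 0.2j
rx_x, rx_y = transmit_pair(symbol, distortion)
recovered = superpose(rx_x, rx_y)
```

Under this model the distortion terms cancel exactly, so `recovered` matches `symbol` up to floating-point rounding. In a real link the anti-correlation is only approximate, which is consistent with the abstract's wording "essentially anti-correlated".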
-
End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings
Authors:
Soumi Maiti,
Hakan Erdogan,
Kevin Wilson,
Scott Wisdom,
Shinji Watanabe,
John R. Hershey
Abstract:
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions. We introduce several components that appear to help with diarization performance, including a local convolutional network followed by a global self-attention module, multi-task transfer learning using a speaker identification component, and a sequential approach where the model is refined with a second stage. These are trained and validated on simulated meeting data based on LibriSpeech and LibriTTS datasets; final evaluations are done using LibriCSS, which consists of simulated meetings recorded using real acoustics via loudspeaker playback. The proposed model performs better than previously proposed end-to-end diarization models on these data.
Submitted 5 May, 2021;
originally announced May 2021.
-
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
Authors:
Quan Wang,
Ignacio Lopez Moreno,
Mert Saglam,
Kevin Wilson,
Alan Chiao,
Renjie Liu,
Yanzhang He,
Wei Li,
Jason Pelecanos,
Marily Nika,
Alexander Gruenstein
Abstract:
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: It should improve the performance when the input signal consists of overlapped speech, and must not hurt the speech recognition performance under all other acoustic conditions. Besides, this model must be tiny, fast, and perform inference in a streaming fashion, in order to have minimal impact on CPU, memory, battery and latency. We propose novel techniques to meet these multi-faceted requirements, including using a new asymmetric loss, and adopting adaptive runtime suppression strength. We also show that such a model can be quantized as an 8-bit integer model and run in real time.
Submitted 9 September, 2020;
originally announced September 2020.
-
Unsupervised Sound Separation Using Mixture Invariant Training
Authors:
Scott Wisdom,
Efthymios Tzinis,
Hakan Erdogan,
Ron J. Weiss,
Kevin Wilson,
John R. Hershey
Abstract:
In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon the degree of match between the training data and real-world audio, especially in terms of the acoustic conditions and distribution of sources. The acoustic properties can be challenging to accurately simulate, and the distribution of sound types may be hard to replicate. In this paper, we propose a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures. In MixIT, training examples are constructed by mixing together existing mixtures, and the model separates them into a variable number of latent sources, such that the separated sources can be remixed to approximate the original mixtures. We show that MixIT can achieve competitive performance compared to supervised methods on speech separation. Using MixIT in a semi-supervised learning setting enables unsupervised domain adaptation and learning from large amounts of real world data without ground-truth source waveforms. In particular, we significantly improve reverberant speech separation performance by incorporating reverberant mixtures, train a speech enhancement system from noisy mixtures, and improve universal sound separation by incorporating a large amount of in-the-wild data.
Submitted 23 October, 2020; v1 submitted 22 June, 2020;
originally announced June 2020.
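The remix-and-match objective described in the MixIT abstract above can be sketched as a brute-force search over assignments of separated sources to reference mixtures. This is a simplified illustration, not the paper's training code: it uses squared error as a stand-in loss, plain Python lists as signals, and the hypothetical name `mixit_loss`.

```python
from itertools import product


def mixit_loss(mixtures, sources):
    """Return (best_error, assignment) for a MixIT-style objective.

    mixtures: list of N reference mixtures (lists of samples).
    sources:  list of M separated outputs for the sum of the mixtures.
    assignment[m] is the index of the mixture that source m is remixed into.
    """
    n_mix = len(mixtures)
    n_samples = len(mixtures[0])
    best = None
    # Each candidate sends every estimated source to exactly one mixture.
    for assignment in product(range(n_mix), repeat=len(sources)):
        remixes = [[0.0] * n_samples for _ in range(n_mix)]
        for src, mix_idx in zip(sources, assignment):
            for t, value in enumerate(src):
                remixes[mix_idx][t] += value
        # Stand-in loss (assumption): total squared error over all mixtures.
        error = sum(
            (r - x) ** 2
            for remix, mix in zip(remixes, mixtures)
            for r, x in zip(remix, mix)
        )
        if best is None or error < best[0]:
            best = (error, assignment)
    return best


mixtures = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
sources = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 3.0]]
error, assignment = mixit_loss(mixtures, sources)
```

No ground-truth sources appear anywhere in the computation, only the two reference mixtures, which is what allows training on unlabeled recordings. The search visits N^M assignments, which stays cheap for two mixtures and a handful of output sources.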
-
Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement
Authors:
Zhong-Qiu Wang,
Hakan Erdogan,
Scott Wisdom,
Kevin Wilson,
Desh Raj,
Shinji Watanabe,
Zhuo Chen,
John R. Hershey
Abstract:
This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation. Our neural networks for separation use an advanced convolutional architecture trained with a novel stabilized signal-to-noise ratio loss function. For beamforming, we explore multiple ways of computing time-varying covariance matrices, including factorizing the spatial covariance into a time-varying amplitude component and a time-invariant spatial component, as well as using block-based techniques. In addition, we introduce a multi-frame beamforming method which improves the results significantly by adding contextual frames to the beamforming formulations. We extensively evaluate and analyze the effects of window size, block size, and multi-frame context size for these methods. Our best method utilizes a sequence of three neural separation and multi-frame time-invariant spatial beamforming stages, and demonstrates an average improvement of 2.75 dB in scale-invariant signal-to-noise ratio and 14.2% absolute reduction in a comparative speech recognition metric across four challenging reverberant speech enhancement and separation tasks. We also use our three-speaker separation model to separate real recordings in the LibriCSS evaluation set into non-overlapping tracks, and achieve a better word error rate as compared to a baseline mask based beamformer.
Submitted 3 November, 2020; v1 submitted 18 November, 2019;
originally announced November 2019.
-
Fully-automated patient-level malaria assessment on field-prepared thin blood film microscopy images, including Supplementary Information
Authors:
Charles B. Delahunt,
Mayoore S. Jaiswal,
Matthew P. Horning,
Samantha Janko,
Clay M. Thompson,
Sourabh Kulhare,
Liming Hu,
Travis Ostbye,
Grace Yun,
Roman Gebrehiwot,
Benjamin K. Wilson,
Earl Long,
Stephane Proux,
Dionicia Gamboa,
Peter Chiodini,
Jane Carter,
Mehul Dhorda,
David Isaboke,
Bernhards Ogutu,
Wellington Oyibo,
Elizabeth Villasis,
Kyaw Myo Tun,
Christine Bachman,
David Bell,
Courosh Mehanian
Abstract:
Malaria is a life-threatening disease affecting millions. Microscopy-based assessment of thin blood films is a standard method to (i) determine malaria species and (ii) quantitate high-parasitemia infections. Full automation of malaria microscopy by machine learning (ML) is a challenging task because field-prepared slides vary widely in quality and presentation, and artifacts often heavily outnumber relatively rare parasites. In this work, we describe a complete, fully-automated framework for thin film malaria analysis that applies ML methods, including convolutional neural nets (CNNs), trained on a large and diverse dataset of field-prepared thin blood films. Quantitation and species identification results are close to sufficiently accurate for the concrete needs of drug resistance monitoring and clinical use-cases on field-prepared samples. We focus our methods and our performance metrics on the field use-case requirements. We discuss key issues and important metrics for the application of ML methods to malaria microscopy.
Submitted 11 September, 2022; v1 submitted 5 August, 2019;
originally announced August 2019.
-
Universal Sound Separation
Authors:
Ilya Kavalerov,
Scott Wisdom,
Hakan Erdogan,
Brian Patton,
Kevin Wilson,
Jonathan Le Roux,
John R. Hershey
Abstract:
Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
Submitted 2 August, 2019; v1 submitted 8 May, 2019;
originally announced May 2019.
-
Differentiable Consistency Constraints for Improved Deep Speech Enhancement
Authors:
Scott Wisdom,
John R. Hershey,
Kevin Wilson,
Jeremy Thorpe,
Michael Chinen,
Brian Patton,
Rif A. Saurous
Abstract:
In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks. In this paper, we show that STFT consistency and mixture consistency can be jointly imposed by adding simple differentiable projection layers to the enhancement network. These layers are compatible with real or complex-valued masks. Using both of these constraints with complex-valued masks provides a 0.7 dB increase in scale-invariant signal-to-distortion ratio (SI-SDR) on a large dataset of speech corrupted by a wide variety of nonstationary noise across a range of input SNRs.
Submitted 20 November, 2018;
originally announced November 2018.
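The mixture consistency constraint named in the abstract above (the estimated sources should sum to the input mixture) can be sketched as a simple projection. The version below is the plain unweighted form, which spreads the residual equally across sources; the paper may use other weightings, and the name `project_mixture_consistency` is illustrative.

```python
def project_mixture_consistency(mixture, sources):
    """Adjust source estimates so they sum exactly to the mixture.

    mixture: list of samples (real or complex, e.g. STFT bins).
    sources: list of equal-length source estimates.
    Unweighted form (assumption): each source receives an equal share
    of the residual at every sample.
    """
    n_src = len(sources)
    projected = [list(s) for s in sources]
    for t, y in enumerate(mixture):
        residual = y - sum(s[t] for s in sources)
        for s in projected:
            s[t] += residual / n_src
    return projected


mixture = [1.0, 2.0]
estimates = [[0.5, 0.5], [0.0, 1.0]]
consistent = project_mixture_consistency(mixture, estimates)
```

Because the operation is only additions and a division by a constant, it is differentiable with respect to the estimates, which is what lets such a step sit inside a network as a layer.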
-
Exploring Tradeoffs in Models for Low-latency Speech Enhancement
Authors:
Kevin Wilson,
Michael Chinen,
Jeremy Thorpe,
Brian Patton,
John Hershey,
Rif A. Saurous,
Jan Skoglund,
Richard F. Lyon
Abstract:
We explore a variety of neural network configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ratio (SDR). We examine trade-offs such as non-causal look-ahead, computation, and parameter count versus enhancement performance and find that zero-look-ahead models can achieve, on average, within 0.03 dB SDR of our best bidirectional model. Further, we find that 200 milliseconds of look-ahead is sufficient to achieve equivalent performance to our best bidirectional model.
Submitted 16 November, 2018;
originally announced November 2018.
-
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Authors:
Quan Wang,
Hannah Muckenhirn,
Kevin Wilson,
Prashant Sridhar,
Zelin Wu,
John Hershey,
Rif A. Saurous,
Ron J. Weiss,
Ye Jia,
Ignacio Lopez Moreno
Abstract:
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
Submitted 19 June, 2019; v1 submitted 10 October, 2018;
originally announced October 2018.
-
AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies
Authors:
Sourish Chaudhuri,
Joseph Roth,
Daniel P. W. Ellis,
Andrew Gallagher,
Liat Kaver,
Radhika Marvin,
Caroline Pantofaru,
Nathan Reale,
Loretta Guarino Reid,
Kevin Wilson,
Zhonghua Xi
Abstract:
Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings, often tailored toward end applications. However, much of the prior work reports results in synthetic settings, on task-specific datasets, or on datasets that are not openly available. This makes it difficult to compare approaches and understand their strengths and weaknesses. In this paper, we describe a new dataset which we will release publicly containing densely labeled speech activity in YouTube videos, with the goal of creating a shared, available dataset for this task. The labels in the dataset annotate three different speech activity conditions: clean speech, speech co-occurring with music, and speech co-occurring with noise, which enable analysis of model performance in more challenging conditions based on the presence of overlapping noise. We report benchmark performance numbers on AVA-Speech using off-the-shelf, state-of-the-art audio and vision models that serve as a baseline to facilitate future research.
Submitted 23 August, 2018; v1 submitted 1 August, 2018;
originally announced August 2018.
-
Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Authors:
Ariel Ephrat,
Inbar Mosseri,
Oran Lang,
Tali Dekel,
Kevin Wilson,
Avinatan Hassidim,
William T. Freeman,
Michael Rubinstein
Abstract:
We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).
Submitted 9 August, 2018; v1 submitted 10 April, 2018;
originally announced April 2018.