Search | arXiv e-print repository

Binaural Angular Separation Network

Authors: Yang Yang, George Sung, Shao-Fu Shih, Hakan Erdogan, Chehung Lee, Matthias Grundmann

Abstract: We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones. The model is trained with simulated room impulse responses (RIRs) using omni-directional microphones without needing to collect real RIRs. By relying on specific angular regions and multiple room simulations, the model utilizes consistent time diffe… ▽ More We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones. The model is trained with simulated room impulse responses (RIRs) using omni-directional microphones without needing to collect real RIRs. By relying on specific angular regions and multiple room simulations, the model utilizes consistent time difference of arrival (TDOA) cues, or what we call delay contrast, to separate target and interference sources while remaining robust in various reverberation environments. We demonstrate the model is not only generalizable to a commercially available device with a slightly different microphone geometry, but also outperforms our previous work which uses one additional microphone on the same device. The model runs in real-time on-device and is suitable for low-latency streaming applications such as telephony and video conferencing. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted to ICASSP 2024

arXiv:2312.04229 [pdf]

Accelerated Real-Life (ARL) Testing and Characterization of Automotive LiDAR Sensors to facilitate the Development and Validation of Enhanced Sensor Models

Authors: Marcel Kettelgerdes, Tjorven Hillmann, Thomas Hirmer, Hüseyin Erdogan, Bernhard Wunderle, Gordon Elger

Abstract: In the realm of automated driving simulation and sensor modeling, the need for highly accurate sensor models is paramount for ensuring the reliability and safety of advanced driving assistance systems (ADAS). Hence, numerous works focus on the development of high-fidelity models of ADAS sensors, such as camera, Radar as well as modern LiDAR systems to simulate the sensor behavior in different driv… ▽ More In the realm of automated driving simulation and sensor modeling, the need for highly accurate sensor models is paramount for ensuring the reliability and safety of advanced driving assistance systems (ADAS). Hence, numerous works focus on the development of high-fidelity models of ADAS sensors, such as camera, Radar as well as modern LiDAR systems to simulate the sensor behavior in different driving scenarios, even under varying environmental conditions, considering for example adverse weather effects. However, aging effects of sensors, leading to suboptimal system performance, are mostly overlooked by current simulation techniques. This paper introduces a cutting-edge Hardware-in-the-Loop (HiL) test bench designed for the automated, accelerated aging and characterization of Automotive LiDAR sensors. The primary objective of this research is to address the aging effects of LiDAR sensors over the product life cycle, specifically focusing on aspects such as laser beam profile deterioration, output power reduction and intrinsic parameter drift, which are mostly neglected in current sensor models. By that, this proceeding research is intended to path the way, not only towards identifying and modeling respective degradation effects, but also to suggest quantitative model validation metrics. △ Less

Submitted 7 December, 2023; originally announced December 2023.

Comments: 9th Symposium Driving Simulation 2023, Brunswick, Germany

arXiv:2308.10415 [pdf, other]

TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition

Authors: Hakan Erdogan, Scott Wisdom, Xuankai Chang, Zalán Borsos, Marco Tagliasacchi, Neil Zeghidour, John R. Hershey

Abstract: We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder model that uses… ▽ More We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture. We also present a "refinement" version of the model that predicts enhanced audio tokens from the audio tokens of speech separated by a conventional separation model. Using both objective metrics and subjective MUSHRA listening tests, we show that our model achieves excellent performance in terms of separation, both with or without transcript conditioning. We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model. △ Less

Submitted 20 August, 2023; originally announced August 2023.

Comments: INTERSPEECH 2023, project webpage with audio demos at https://google-research.github.io/sound-separation/papers/tokensplit

arXiv:2303.07486 [pdf, other]

Guided Speech Enhancement Network

Authors: Yang Yang, Shao-Fu Shih, Hakan Erdogan, Jamie Menjay Lin, Chehung Lee, Yunpeng Li, George Sung, Matthias Grundmann

Abstract: High quality speech capture has been widely studied for both voice communication and human computer interface reasons. To improve the capture performance, we can often find multi-microphone speech enhancement techniques deployed on various devices. Multi-microphone speech enhancement problem is often decomposed into two decoupled steps: a beamformer that provides spatial filtering and a single-cha… ▽ More High quality speech capture has been widely studied for both voice communication and human computer interface reasons. To improve the capture performance, we can often find multi-microphone speech enhancement techniques deployed on various devices. Multi-microphone speech enhancement problem is often decomposed into two decoupled steps: a beamformer that provides spatial filtering and a single-channel speech enhancement model that cleans up the beamformer output. In this work, we propose a speech enhancement solution that takes both the raw microphone and beamformer outputs as the input for an ML model. We devise a simple yet effective training scheme that allows the model to learn from the cues of the beamformer by contrasting the two inputs and greatly boost its capability in spatial rejection, while conducting the general tasks of denoising and dereverberation. The proposed solution takes advantage of classical spatial filtering algorithms instead of competing with them. By design, the beamformer module then could be selected separately and does not require a large amount of data to be optimized for a given form factor, and the network model can be considered as a standalone module which is highly transferable independently from the microphone array. We name the ML module in our solution as GSENet, short for Guided Speech Enhancement Network. We demonstrate its effectiveness on real world data collected on multi-microphone devices in terms of the suppression of noise and interfering speech. △ Less

Submitted 13 March, 2023; originally announced March 2023.

Comments: Accepted to ICASSP 2023

arXiv:2203.15652 [pdf, other]

CycleGAN-Based Unpaired Speech Dereverberation

Authors: Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey

Abstract: Typically, neural network-based speech dereverberation models are trained on paired data, composed of a dry utterance and its corresponding reverberant utterance. The main limitation of this approach is that such models can only be trained on large amounts of data and a variety of room impulse responses when the data is synthetically reverberated, since acquiring real paired data is costly. In thi… ▽ More Typically, neural network-based speech dereverberation models are trained on paired data, composed of a dry utterance and its corresponding reverberant utterance. The main limitation of this approach is that such models can only be trained on large amounts of data and a variety of room impulse responses when the data is synthetically reverberated, since acquiring real paired data is costly. In this paper we propose a CycleGAN-based approach that enables dereverberation models to be trained on unpaired data. We quantify the impact of using unpaired data by comparing the proposed unpaired model to a paired model with the same architecture and trained on the paired version of the same dataset. We show that the performance of the unpaired model is comparable to the performance of the paired model on two different datasets, according to objective evaluation metrics. Furthermore, we run two subjective evaluations and show that both models achieve comparable subjective quality on the AMI dataset, which was not seen during training. △ Less

Submitted 29 March, 2022; originally announced March 2022.

Comments: Submitted to Interspeech 2022

arXiv:2110.10739 [pdf, other]

Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training

Authors: Aswin Sivaraman, Scott Wisdom, Hakan Erdogan, John R. Hershey

Abstract: The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investigate using MixIT to adapt a separation model on real far-field overlap** reverberant and noisy speech data from the AMI Corpus. The models are tested on real A… ▽ More The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investigate using MixIT to adapt a separation model on real far-field overlap** reverberant and noisy speech data from the AMI Corpus. The models are tested on real AMI recordings containing overlap** speech, and are evaluated subjectively by human listeners. To objectively evaluate our models, we also devise a synthetic AMI test set. For human evaluations on real recordings, we also propose a modification of the standard MUSHRA protocol to handle imperfect reference signals, which we call MUSHIRA. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest SI-SNR improvement, PESQ scores, and human listening ratings across synthetic and real datasets, outperforming unadapted generalist models trained on orders of magnitude more data. Our results show that unsupervised learning through MixIT enables model adaptation on real-world unlabeled spontaneous speech recordings. △ Less

Submitted 20 October, 2021; originally announced October 2021.

arXiv:2106.15813 [pdf, other]

DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement

Authors: Yuma Koizumi, Shigeki Karita, Scott Wisdom, Hakan Erdogan, John R. Hershey, Llion Jones, Michiel Bacchiani

Abstract: Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, such as the Conv-TasNet architecture. In such systems, the denoising performance and computational efficiency are mainly affected by the structure of the mask prediction network. In this study, we aim to improve the sequ… ▽ More Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, such as the Conv-TasNet architecture. In such systems, the denoising performance and computational efficiency are mainly affected by the structure of the mask prediction network. In this study, we aim to improve the sequential modeling ability of Conv-TasNet architectures by integrating Conformer layers into a new mask prediction network. To make the model computationally feasible, we extend the Conformer using linear complexity attention and stacked 1-D dilated depthwise convolution layers. We trained the model on 3,396 hours of noisy speech data, and show that (i) the use of linear complexity attention avoids high computational complexity, and (ii) our model achieves higher scale-invariant signal-to-noise ratio than the improved time-dilated convolution network (TDCN++), an extended version of Conv-TasNet. △ Less

Submitted 5 August, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

Comments: 5 pages, 2 figure. accepted for WASPAA 2021

arXiv:2106.00847 [pdf, other]

Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

Authors: Scott Wisdom, Aren Jansen, Ron J. Weiss, Hakan Erdogan, John R. Hershey

Abstract: Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. F… ▽ More Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. First, it produces models which tend to over-separate, producing more output sources than are present in the input. Second, the exponential computational complexity of the MixIT loss limits the number of feasible output sources. In this paper we address both issues. To combat over-separation we introduce new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs. We also experiment with a semantic classification loss by predicting weak class labels for each mixture. To handle larger numbers of sources, we introduce an efficient approximation using a fast least-squares solution, projected onto the MixIT constraint set. Our experiments show that the proposed losses curtail over-separation and improve overall performance. The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation. On the FUSS test set, we achieve over 13 dB in multi-source SI-SNR improvement, while boosting single-source reconstruction SI-SNR by over 17 dB. △ Less

Submitted 16 October, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

Comments: 5 pages, 1 figure. WASPAA 2021

arXiv:2105.02096 [pdf, other]

End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings

Authors: Soumi Maiti, Hakan Erdogan, Kevin Wilson, Scott Wisdom, Shinji Watanabe, John R. Hershey

Abstract: We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers,… ▽ More We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions. We introduce several components that appear to help with diarization performance, including a local convolutional network followed by a global self-attention module, multi-task transfer learning using a speaker identification component, and a sequential approach where the model is refined with a second stage. These are trained and validated on simulated meeting data based on LibriSpeech and LibriTTS datasets; final evaluations are done using LibriCSS, which consists of simulated meetings recorded using real acoustics via loudspeaker playback. The proposed model performs better than previously proposed end-to-end diarization models on these data. △ Less

Submitted 5 May, 2021; originally announced May 2021.

Comments: 5 pages, 2 figures, ICASSP 2021

Journal ref: ICASSP 2021, SPE-54.1

arXiv:2012.09727 [pdf, other]

Continuous Speech Separation Using Speaker Inventory for Long Multi-talker Recording

Authors: Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen

Abstract: Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years. Recent research includes extracting target speech by using the target speaker's voice snippet and jointly separating all participating speakers by using a pool of additional speaker signals, which is known as speech separation using speaker inventory (SSUSI). However, all th… ▽ More Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years. Recent research includes extracting target speech by using the target speaker's voice snippet and jointly separating all participating speakers by using a pool of additional speaker signals, which is known as speech separation using speaker inventory (SSUSI). However, all these systems ideally assume that the pre-enrolled speaker signals are available and are only evaluated on simple data configurations. In realistic multi-talker conversations, the speech signal contains a large proportion of non-overlapped regions, where we can derive robust speaker embedding of individual talkers. In this work, we adopt the SSUSI model in long recordings and propose a self-informed, clustering-based inventory forming scheme for long recording, where the speaker inventory is fully built from the input signal without the need for external speaker signals. Experiment results on simulated noisy reverberant long recording datasets show that the proposed method can significantly improve the separation performance across various conditions. △ Less

Submitted 18 December, 2020; v1 submitted 17 December, 2020; originally announced December 2020.

arXiv:2011.02014 [pdf, other]

Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

Authors: Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, **yu Li, Scott Wisdom, John R. Hershey

Abstract: Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this pape… ▽ More Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlap** speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlap** ASR. △ Less

Submitted 3 November, 2020; originally announced November 2020.

Comments: Accepted to IEEE SLT 2021

arXiv:2011.00803 [pdf, other]

What's All the FUSS About Free Universal Sound Separation Data?

Authors: Scott Wisdom, Hakan Erdogan, Daniel Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, John Hershey

Abstract: We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate… ▽ More We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box shaped rooms with frequency-dependent reflective walls. Additional open-source data augmentation tools are also provided to produce new mixtures with different combinations of sources and room simulations. Finally, we introduce an open-source baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.5 dB absolute SI-SNR. We hope this dataset will lower the barrier to new research and allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge. △ Less

Submitted 2 November, 2020; originally announced November 2020.

arXiv:2011.00801 [pdf, other]

Sound Event Detection and Separation: a Benchmark on Desed Synthetic Soundscapes

Authors: Nicolas Turpault, Romain Serizel, Scott Wisdom, Hakan Erdogan, John Hershey, Eduardo Fonseca, Prem Seetharaman, Justin Salamon

Abstract: We propose a benchmark of state-of-the-art sound event detection systems (SED). We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4 depending on time related modifications (time position of an event and length of clips) and we study the impact of non-target sound events and reverberation. We… ▽ More We propose a benchmark of state-of-the-art sound event detection systems (SED). We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4 depending on time related modifications (time position of an event and length of clips) and we study the impact of non-target sound events and reverberation. We show that the localization in time of sound events is still a problem for SED systems. We also show that reverberation and non-target sound events are severely degrading the performance of the SED systems. In the latter case, sound separation seems like a promising solution. △ Less

Submitted 2 November, 2020; originally announced November 2020.

arXiv:2007.03932 [pdf, other]

Improving Sound Event Detection In Domestic Environments Using Sound Separation

Authors: Nicolas Turpault, Scott Wisdom, Hakan Erdogan, John Hershey, Romain Serizel, Eduardo Fonseca, Prem Seetharaman, Justin Salamon

Abstract: Performing sound event detection on real-world recordings often implies dealing with overlap** target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing for sound event detection. In this paper we start from a sound separation model trained on t… ▽ More Performing sound event detection on real-world recordings often implies dealing with overlap** target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods to combine separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the sound separation model to the sound event detection data on both the sound separation and the sound event detection. △ Less

Submitted 8 July, 2020; originally announced July 2020.

arXiv:2006.12701 [pdf, other]

Unsupervised Sound Separation Using Mixture Invariant Training

Authors: Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron J. Weiss, Kevin Wilson, John R. Hershey

Abstract: In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon… ▽ More In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon the degree of match between the training data and real-world audio, especially in terms of the acoustic conditions and distribution of sources. The acoustic properties can be challenging to accurately simulate, and the distribution of sound types may be hard to replicate. In this paper, we propose a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures. In MixIT, training examples are constructed by mixing together existing mixtures, and the model separates them into a variable number of latent sources, such that the separated sources can be remixed to approximate the original mixtures. We show that MixIT can achieve competitive performance compared to supervised methods on speech separation. Using MixIT in a semi-supervised learning setting enables unsupervised domain adaptation and learning from large amounts of real world data without ground-truth source waveforms. In particular, we significantly improve reverberant speech separation performance by incorporating reverberant mixtures, train a speech enhancement system from noisy mixtures, and improve universal sound separation by incorporating a large amount of in-the-wild data. △ Less

Submitted 23 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

Comments: Accepted for spotlight presentation at NeurIPS 2020

arXiv:1911.07953 [pdf, other]

Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement

Authors: Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, Kevin Wilson, Desh Raj, Shinji Watanabe, Zhuo Chen, John R. Hershey

Abstract: This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation. Our neural networks for separation use an advanced convolutional architecture trained with a novel stabilized signal-to-noise ratio loss function. For beamforming, we explore multiple ways of computing time-varying covariance matrices, incl… ▽ More This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation. Our neural networks for separation use an advanced convolutional architecture trained with a novel stabilized signal-to-noise ratio loss function. For beamforming, we explore multiple ways of computing time-varying covariance matrices, including factorizing the spatial covariance into a time-varying amplitude component and a time-invariant spatial component, as well as using block-based techniques. In addition, we introduce a multi-frame beamforming method which improves the results significantly by adding contextual frames to the beamforming formulations. We extensively evaluate and analyze the effects of window size, block size, and multi-frame context size for these methods. Our best method utilizes a sequence of three neural separation and multi-frame time-invariant spatial beamforming stages, and demonstrates an average improvement of 2.75 dB in scale-invariant signal-to-noise ratio and 14.2% absolute reduction in a comparative speech recognition metric across four challenging reverberant speech enhancement and separation tasks. We also use our three-speaker separation model to separate real recordings in the LibriCSS evaluation set into non-overlap** tracks, and achieve a better word error rate as compared to a baseline mask based beamformer. △ Less

Submitted 3 November, 2020; v1 submitted 18 November, 2019; originally announced November 2019.

Comments: 7 pages, 7 figures, IEEE SLT 2021 (slt2020.org)

arXiv:1905.03330 [pdf, other]

Universal Sound Separation

Authors: Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, John R. Hershey

Abstract: Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a… ▽ More Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation. △ Less

Submitted 2 August, 2019; v1 submitted 8 May, 2019; originally announced May 2019.

Comments: 5 pages, accepted to WASPAA 2019

arXiv:1904.06478 [pdf, other]

Low-Latency Speaker-Independent Continuous Speech Separation

Authors: Takuya Yoshioka, Zhuo Chen, Changliang Liu, Xiong Xiao, Hakan Erdogan, Dimitrios Dimitriadis

Abstract: Speaker independent continuous speech separation (SI-CSS) is a task of converting a continuous audio stream, which may contain overlap** voices of unknown speakers, into a fixed number of continuous signals each of which contains no overlap** speech segment. A separated, or cleaned, version of each utterance is generated from one of SI-CSS's output channels nondeterministically without being s… ▽ More Speaker independent continuous speech separation (SI-CSS) is a task of converting a continuous audio stream, which may contain overlap** voices of unknown speakers, into a fixed number of continuous signals each of which contains no overlap** speech segment. A separated, or cleaned, version of each utterance is generated from one of SI-CSS's output channels nondeterministically without being split up and distributed to multiple channels. A typical application scenario is transcribing multi-party conversations, such as meetings, recorded with microphone arrays. The output signals can be simply sent to a speech recognition engine because they do not include speech overlaps. The previous SI-CSS method uses a neural network trained with permutation invariant training and a data-driven beamformer and thus requires much processing latency. This paper proposes a low-latency SI-CSS method whose performance is comparable to that of the previous method in a microphone array-based meeting transcription task.This is achieved (1) by using a new speech separation network architecture combined with a double buffering scheme and (2) by performing enhancement with a set of fixed beamformers followed by a neural post-filter. △ Less

Submitted 13 April, 2019; originally announced April 2019.

arXiv:1811.02508 [pdf, other]

SDR - half-baked or well done?

Authors: Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, John R. Hershey

Abstract: In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality. A decade ago, the BSS_eval toolkit was developed to give researchers worldwide a way to evaluate the quality of their algorithms in a simple, fair, and hopefully insightful way: it attempted to account for channel variations, and to not only evaluate the total disto… ▽ More In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality. A decade ago, the BSS_eval toolkit was developed to give researchers worldwide a way to evaluate the quality of their algorithms in a simple, fair, and hopefully insightful way: it attempted to account for channel variations, and to not only evaluate the total distortion in the estimated signal but also split it in terms of various factors such as remaining interference, newly added artifacts, and channel errors. In recent years, hundreds of papers have been relying on this toolkit to evaluate their proposed methods and compare them to previous works, often arguing that differences on the order of 0.1 dB proved the effectiveness of a method over others. We argue here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results. We propose to use a slightly modified definition, resulting in a simpler, more robust measure, called scale-invariant SDR (SI-SDR). We present various examples of critical failure of the original SDR that SI-SDR overcomes. △ Less

Submitted 6 November, 2018; originally announced November 2018.

arXiv:1810.03655 [pdf, other]

doi 10.21437/Interspeech.2018-2284

Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks

Authors: Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, Xiong Xiao, Fil Alleva

Abstract: The goal of this work is to develop a meeting transcription system that can recognize speech even when utterances of different speakers are overlapped. While speech overlaps have been regarded as a major obstacle in accurately transcribing meetings, a traditional beamformer with a single output has been exclusively used because previously proposed speech separation techniques have critical constra… ▽ More The goal of this work is to develop a meeting transcription system that can recognize speech even when utterances of different speakers are overlapped. While speech overlaps have been regarded as a major obstacle in accurately transcribing meetings, a traditional beamformer with a single output has been exclusively used because previously proposed speech separation techniques have critical constraints for application to real meetings. This paper proposes a new signal processing module, called an unmixing transducer, and describes its implementation using a windowed BLSTM. The unmixing transducer has a fixed number, say J, of output channels, where J may be different from the number of meeting attendees, and transforms an input multi-channel acoustic signal into J time-synchronous audio streams. Each utterance in the meeting is separated and emitted from one of the output channels. Then, each output signal can be simply fed to a speech recognition back-end for segmentation and transcription. Our meeting transcription system using the unmixing transducer outperforms a system based on a state-of-the-art neural mask-based beamformer by 10.8%. Significant improvements are observed in overlapped segments. To the best of our knowledge, this is the first report that applies overlapped speech recognition to unconstrained real meeting audio. △ Less

Submitted 8 October, 2018; originally announced October 2018.

Journal ref: Proc. Interspeech 2018, 3038-3042

arXiv:1711.08016 [pdf, other]

doi 10.1109/ICASSP.2017.7952160

Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

Authors: Zhong Meng, Shinji Watanabe, John R. Hershey, Hakan Erdogan

Abstract: Far-field speech recognition in noisy and reverberant conditions remains a challenging problem despite recent deep learning breakthroughs. This problem is commonly addressed by acquiring a speech signal from multiple microphones and performing beamforming over them. In this paper, we propose to use a recurrent neural network with long short-term memory (LSTM) architecture to adaptively estimate re… ▽ More Far-field speech recognition in noisy and reverberant conditions remains a challenging problem despite recent deep learning breakthroughs. This problem is commonly addressed by acquiring a speech signal from multiple microphones and performing beamforming over them. In this paper, we propose to use a recurrent neural network with long short-term memory (LSTM) architecture to adaptively estimate real-time beamforming filter coefficients to cope with non-stationary environmental noise and dynamic nature of source and microphones positions which results in a set of timevarying room impulse responses. The LSTM adaptive beamformer is jointly trained with a deep LSTM acoustic model to predict senone labels. Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients. The proposed system achieves 7.97% absolute gain over baseline systems with no beamforming on CHiME-3 real evaluation set. △ Less

Submitted 21 November, 2017; originally announced November 2017.

Comments: in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Journal ref: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 271-275

arXiv:1710.00116 [pdf, ps, other]

doi 10.1109/ICASSP.2015.7178884

PLDA-Based Diarization of Telephone Conversations

Authors: Ahmet E. Bulut, Hakan Demir, Yusuf Ziya Isik, Hakan Erdogan

Abstract: This paper investigates the application of the probabilistic linear discriminant analysis (PLDA) to speaker diarization of telephone conversations. We introduce using a variational Bayes (VB) approach for inference under a PLDA model for modeling segmental i-vectors in speaker diarization. Deterministic annealing (DA) algorithm is imposed in order to avoid local optimal solutions in VB iterations.… ▽ More This paper investigates the application of the probabilistic linear discriminant analysis (PLDA) to speaker diarization of telephone conversations. We introduce using a variational Bayes (VB) approach for inference under a PLDA model for modeling segmental i-vectors in speaker diarization. Deterministic annealing (DA) algorithm is imposed in order to avoid local optimal solutions in VB iterations. We compare our proposed system with a well-known system that applies k-means clustering on principal component analysis (PCA) coefficients of segmental i-vectors. We used summed channel telephone data from the National Institute of Standards and Technology (NIST) 2008 Speaker Recognition Evaluation (SRE) as the test set in order to evaluate the performance of the proposed system. We achieve about 20% relative improvement in Diarization Error Rate (DER) compared to the baseline system. △ Less

Submitted 29 September, 2017; originally announced October 2017.

Journal ref: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings 1 (2015) 4809-4813

arXiv:1507.02826 [pdf, ps, other]

Comments On "Multipath Matching Pursuit" by Kwon, Wang and Shim

Authors: Nazim Burak Karahanoglu, Hakan Erdogan

Abstract: Straightforward combination of tree search with matching pursuits, which was suggested in 2001 by Cotter and Rao, and then later developed by some other authors, has been revisited recently as multipath matching pursuit (MMP). In this comment, we would like to point out some major issues regarding this publication. First, the idea behind MMP is not novel, and the related literature has not been pr… ▽ More Straightforward combination of tree search with matching pursuits, which was suggested in 2001 by Cotter and Rao, and then later developed by some other authors, has been revisited recently as multipath matching pursuit (MMP). In this comment, we would like to point out some major issues regarding this publication. First, the idea behind MMP is not novel, and the related literature has not been properly referenced. MMP has not been compared to closely related algorithms such as A* orthogonal matching pursuit (A*OMP). The theoretical analyses do ignore the pruning strategies applied by the authors in practice. All these issues have the potential to mislead the reader and lead to misinterpretation of the results. With this short paper, we intend to clarify the relation of MMP to existing literature in the area and compare its performance with A*OMP. △ Less

Submitted 10 July, 2015; originally announced July 2015.

arXiv:1409.8212 [pdf, other]

doi 10.1186/s13634-015-0255-5

THRIVE: Threshold Homomorphic encryption based secure and privacy preserving bIometric VErification system

Authors: Cagatay Karabat, Mehmet Sabir Kiraz, Hakan Erdogan, Erkay Savas

Abstract: In this paper, we propose a new biometric verification and template protection system which we call the THRIVE system. The system includes novel enrollment and authentication protocols based on threshold homomorphic cryptosystem where the private key is shared between a user and the verifier. In the THRIVE system, only encrypted binary biometric templates are stored in the database and verificatio… ▽ More In this paper, we propose a new biometric verification and template protection system which we call the THRIVE system. The system includes novel enrollment and authentication protocols based on threshold homomorphic cryptosystem where the private key is shared between a user and the verifier. In the THRIVE system, only encrypted binary biometric templates are stored in the database and verification is performed via homomorphically randomized templates, thus, original templates are never revealed during the authentication stage. The THRIVE system is designed for the malicious model where the cheating party may arbitrarily deviate from the protocol specification. Since threshold homomorphic encryption scheme is used, a malicious database owner cannot perform decryption on encrypted templates of the users in the database. Therefore, security of the THRIVE system is enhanced using a two-factor authentication scheme involving the user's private key and the biometric data. We prove security and privacy preservation capability of the proposed system in the simulation-based model with no assumption. The proposed system is suitable for applications where the user does not want to reveal her biometrics to the verifier in plain form but she needs to proof her physical presence by using biometrics. The system can be used with any biometric modality and biometric feature extraction scheme whose output templates can be binarized. The overall connection time for the proposed THRIVE system is estimated to be 336 ms on average for 256-bit biohash vectors on a desktop PC running with quad-core 3.2 GHz CPUs at 10 Mbit/s up/down link connection speed. Consequently, the proposed system can be efficiently used in real life applications. △ Less

Submitted 29 September, 2014; originally announced September 2014.

arXiv:1311.2746 [pdf, other]

Deep neural networks for single channel source separation

Authors: Emad M. Grais, Mehmet Umut Sen, Hakan Erdogan

Abstract: In this paper, a novel approach for single channel source separation (SCSS) using a deep neural network (DNN) architecture is introduced. Unlike previous studies in which DNN and other classifiers were used for classifying time-frequency bins to obtain hard masks for each source, we use the DNN to classify estimated source spectra to check for their validity during separation. In the training stag… ▽ More In this paper, a novel approach for single channel source separation (SCSS) using a deep neural network (DNN) architecture is introduced. Unlike previous studies in which DNN and other classifiers were used for classifying time-frequency bins to obtain hard masks for each source, we use the DNN to classify estimated source spectra to check for their validity during separation. In the training stage, the training data for the source signals are used to train a DNN. In the separation stage, the trained DNN is utilized to aid in estimation of each source in the mixed signal. Single channel source separation problem is formulated as an energy minimization problem where each source spectra estimate is encouraged to fit the trained DNN model and the mixed signal spectrum is encouraged to be written as a weighted sum of the estimated source spectra. The proposed approach works regardless of the energy scale differences between the source signals in the training and separation stages. Nonnegative matrix factorization (NMF) is used to initialize the DNN estimate for each source. The experimental results show that using DNN initialized by NMF for source separation improves the quality of the separated signal compared with using NMF for source separation. △ Less

Submitted 12 November, 2013; originally announced November 2013.

Comments: 5 pages, 2 figures, 2 tables, submitted to ICASSP2014

arXiv:1307.1770 [pdf, ps, other]

doi 10.1016/j.sigpro.2015.06.011

Improving A*OMP: Theoretical and Empirical Analyses With a Novel Dynamic Cost Model

Authors: Nazim Burak Karahanoglu, Hakan Erdogan

Abstract: Best-first search has been recently utilized for compressed sensing (CS) by the A* orthogonal matching pursuit (A*OMP) algorithm. In this work, we concentrate on theoretical and empirical analyses of A*OMP. We present a restricted isometry property (RIP) based general condition for exact recovery of sparse signals via A*OMP. In addition, we develop online guarantees which promise improved recovery… ▽ More Best-first search has been recently utilized for compressed sensing (CS) by the A* orthogonal matching pursuit (A*OMP) algorithm. In this work, we concentrate on theoretical and empirical analyses of A*OMP. We present a restricted isometry property (RIP) based general condition for exact recovery of sparse signals via A*OMP. In addition, we develop online guarantees which promise improved recovery performance with the residue-based termination instead of the sparsity-based one. We demonstrate the recovery capabilities of A*OMP with extensive recovery simulations using the adaptive-multiplicative (AMul) cost model, which effectively compensates for the path length differences in the search tree. The presented results, involving phase transitions for different nonzero element distributions as well as recovery rates and average error, reveal not only the superior recovery accuracy of A*OMP, but also the improvements with the residue-based termination and the AMul cost model. Comparison of the run times indicate the speed up by the AMul cost model. We also demonstrate a hybrid of OMP and A?OMP to accelerate the search further. Finally, we run A*OMP on a sparse image to illustrate its recovery performance for more realistic coefcient distributions. △ Less

Submitted 10 July, 2015; v1 submitted 6 July, 2013; originally announced July 2013.

Journal ref: Signal Processing 118 (2016) 62-74

arXiv:1302.7283 [pdf, other]

Source Separation using Regularized NMF with MMSE Estimates under GMM Priors with Online Learning for The Uncertainties

Authors: Emad M. Grais, Hakan Erdogan

Abstract: We propose a new method to enforce priors on the solution of the nonnegative matrix factorization (NMF). The proposed algorithm can be used for denoising or single-channel source separation (SCSS) applications. The NMF solution is guided to follow the Minimum Mean Square Error (MMSE) estimates under Gaussian mixture prior models (GMM) for the source signal. In SCSS applications, the spectra of the… ▽ More We propose a new method to enforce priors on the solution of the nonnegative matrix factorization (NMF). The proposed algorithm can be used for denoising or single-channel source separation (SCSS) applications. The NMF solution is guided to follow the Minimum Mean Square Error (MMSE) estimates under Gaussian mixture prior models (GMM) for the source signal. In SCSS applications, the spectra of the observed mixed signal are decomposed as a weighted linear combination of trained basis vectors for each source using NMF. In this work, the NMF decomposition weight matrices are treated as a distorted image by a distortion operator, which is learned directly from the observed signals. The MMSE estimate of the weights matrix under GMM prior and log-normal distribution for the distortion is then found to improve the NMF decomposition results. The MMSE estimate is embedded within the optimization objective to form a novel regularized NMF cost function. The corresponding update rules for the new objectives are derived in this paper. Experimental results show that, the proposed regularized NMF algorithm improves the source separation performance compared with using NMF without prior or with other prior models. △ Less

Submitted 28 February, 2013; originally announced February 2013.

arXiv:1210.5991 [pdf, ps, other]

Online Recovery Guarantees and Analytical Results for OMP

Authors: Nazim Burak Karahanoglu, Hakan Erdogan

Abstract: Orthogonal Matching Pursuit (OMP) is a simple, yet empirically competitive algorithm for sparse recovery. Recent developments have shown that OMP guarantees exact recovery of K-sparse signals with K or more than K iterations if the observation matrix satisfies the restricted isometry property (RIP) with some conditions. We develop RIP-based online guarantees for recovery of a K-sparse signal with… ▽ More Orthogonal Matching Pursuit (OMP) is a simple, yet empirically competitive algorithm for sparse recovery. Recent developments have shown that OMP guarantees exact recovery of K-sparse signals with K or more than K iterations if the observation matrix satisfies the restricted isometry property (RIP) with some conditions. We develop RIP-based online guarantees for recovery of a K-sparse signal with more than K OMP iterations. Though these guarantees cannot be generalized to all sparse signals a priori, we show that they can still hold online when the state-of-the-art K-step recovery guarantees fail. In addition, we present bounds on the number of correct and false indices in the support estimate for the derived condition to be less restrictive than the K-step guarantees. Under these bounds, this condition guarantees exact recovery of a K-sparse signal within 3K/2 iterations, which is much less than the number of steps required for the state-of-the-art exact recovery guarantees with more than K steps. Moreover, we present phase transitions of OMP in comparison to basis pursuit and subspace pursuit, which are obtained after extensive recovery simulations involving different sparse signal types. Finally, we empirically analyse the number of false indices in the support estimate, which indicates that these do not violate the developed upper bound in practice. △ Less

Submitted 29 March, 2013; v1 submitted 22 October, 2012; originally announced October 2012.

arXiv:1210.5626 [pdf, ps, other]

doi 10.1016/j.dsp.2013.05.007

Compressed Sensing Signal Recovery via Forward-Backward Pursuit

Authors: Nazim Burak Karahanoglu, Hakan Erdogan

Abstract: Recovery of sparse signals from compressed measurements constitutes an l0 norm minimization problem, which is unpractical to solve. A number of sparse recovery approaches have appeared in the literature, including l1 minimization techniques, greedy pursuit algorithms, Bayesian methods and nonconvex optimization techniques among others. This manuscript introduces a novel two stage greedy approach,… ▽ More Recovery of sparse signals from compressed measurements constitutes an l0 norm minimization problem, which is unpractical to solve. A number of sparse recovery approaches have appeared in the literature, including l1 minimization techniques, greedy pursuit algorithms, Bayesian methods and nonconvex optimization techniques among others. This manuscript introduces a novel two stage greedy approach, called the Forward-Backward Pursuit (FBP). FBP is an iterative approach where each iteration consists of consecutive forward and backward stages. The forward step first expands the support estimate by the forward step size, while the following backward step shrinks it by the backward step size. The forward step size is larger than the backward step size, hence the initially empty support estimate is expanded at the end of each iteration. Forward and backward steps are iterated until the residual power of the observation vector falls below a threshold. This structure of FBP does not necessitate the sparsity level to be known a priori in contrast to the Subspace Pursuit or Compressive Sampling Matching Pursuit algorithms. FBP recovery performance is demonstrated via simulations including recovery of random sparse signals with different nonzero coefficient distributions in noisy and noise-free scenarios in addition to the recovery of a sparse image. △ Less

Submitted 6 July, 2013; v1 submitted 20 October, 2012; originally announced October 2012.

Comments: accepted for publication in Digital Signal Processing

Journal ref: Digital Signal Processing 23 (2013), pp. 1539-1548

arXiv:1108.3260 [pdf, other]

doi 10.1017/S1471068411000548

Finding Similar/Diverse Solutions in Answer Set Programming

Authors: Thomas Eiter, Esra Erdem, Halit Erdogan, Michael Fink

Abstract: For some computational problems (e.g., product configuration, planning, diagnosis, query answering, phylogeny reconstruction) computing a set of similar/diverse solutions may be desirable for better decision-making. With this motivation, we studied several decision/optimization versions of this problem in the context of Answer Set Programming (ASP), analyzed their computational complexity, and int… ▽ More For some computational problems (e.g., product configuration, planning, diagnosis, query answering, phylogeny reconstruction) computing a set of similar/diverse solutions may be desirable for better decision-making. With this motivation, we studied several decision/optimization versions of this problem in the context of Answer Set Programming (ASP), analyzed their computational complexity, and introduced offline/online methods to compute similar/diverse solutions of such computational problems with respect to a given distance function. All these methods rely on the idea of computing solutions to a problem by means of finding the answer sets for an ASP program that describes the problem. The offline methods compute all solutions in advance using the ASP formulation of the problem with an ASP solver, like Clasp, and then identify similar/diverse solutions using clustering methods. The online methods compute similar/diverse solutions following one of the three approaches: by reformulating the ASP representation of the problem to compute similar/diverse solutions at once using an ASP solver; by computing similar/diverse solutions iteratively (one after other) using an ASP solver; by modifying the search algorithm of an ASP solver to compute similar/diverse solutions incrementally. We modified Clasp to implement the last online method and called it Clasp-NK. In the first two online methods, the given distance function is represented in ASP; in the last one it is implemented in C++. We showed the applicability and the effectiveness of these methods on reconstruction of similar/diverse phylogenies for Indo-European languages, and on several planning problems in Blocks World. We observed that in terms of computational efficiency the last online method outperforms the others; also it allows us to compute similar/diverse solutions when the distance function cannot be represented in ASP. △ Less

Submitted 16 August, 2011; originally announced August 2011.

Comments: 57 pages, 17 figures, 4 tables. To appear in Theory and Practice of Logic Programming (TPLP)

Journal ref: Theory and Practice of Logic Programming, 13(3), 303-359, 2013

arXiv:1107.5850

Confidence-Based Dynamic Classifier Combination For Mean-Shift Tracking

Authors: Ibrahim Saygin Topkaya, Hakan Erdogan

Abstract: We introduce a novel tracking technique which uses dynamic confidence-based fusion of two different information sources for robust and efficient tracking of visual objects. Mean-shift tracking is a popular and well known method used in object tracking problems. Originally, the algorithm uses a similarity measure which is optimized by shifting a search area to the center of a generated weight image… ▽ More We introduce a novel tracking technique which uses dynamic confidence-based fusion of two different information sources for robust and efficient tracking of visual objects. Mean-shift tracking is a popular and well known method used in object tracking problems. Originally, the algorithm uses a similarity measure which is optimized by shifting a search area to the center of a generated weight image to track objects. Recent improvements on the original mean-shift algorithm involves using a classifier that differentiates the object from its surroundings. We adopt this classifier-based approach and propose an application of a classifier fusion technique within this classifier-based context in this work. We use two different classifiers, where one comes from a background modeling method, to generate the weight image and we calculate contributions of the classifiers dynamically using their confidences to generate a final weight image to be used in tracking. The contributions of the classifiers are calculated by using correlations between histograms of their weight images and histogram of a defined ideal weight image in the previous frame. We show with experiments that our dynamic combination scheme selects good contributions for classifiers for different cases and improves tracking accuracy significantly. △ Less

Submitted 22 July, 2014; v1 submitted 28 July, 2011; originally announced July 2011.

Comments: This paper has been withdrawn by the author due to an implementation issue

ACM Class: I.4.8

arXiv:1106.1684 [pdf, other]

Max-Margin Stacking and Sparse Regularization for Linear Classifier Combination and Selection

Authors: Mehmet Umut Sen, Hakan Erdogan

Abstract: The main principle of stacked generalization (or Stacking) is using a second-level generalizer to combine the outputs of base classifiers in an ensemble. In this paper, we investigate different combination types under the stacking framework; namely weighted sum (WS), class-dependent weighted sum (CWS) and linear stacked generalization (LSG). For learning the weights, we propose using regularized e… ▽ More The main principle of stacked generalization (or Stacking) is using a second-level generalizer to combine the outputs of base classifiers in an ensemble. In this paper, we investigate different combination types under the stacking framework; namely weighted sum (WS), class-dependent weighted sum (CWS) and linear stacked generalization (LSG). For learning the weights, we propose using regularized empirical risk minimization with the hinge loss. In addition, we propose using group sparsity for regularization to facilitate classifier selection. We performed experiments using two different ensemble setups with differing diversities on 8 real-world datasets. Results show the power of regularized learning with the hinge loss function. Using sparse regularization, we are able to reduce the number of selected classifiers of the diverse ensemble without sacrificing accuracy. With the non-diverse ensembles, we even gain accuracy on average by using sparse regularization. △ Less

Submitted 8 June, 2011; originally announced June 2011.

Comments: 8 pages, 3 figures, 6 tables, journal

arXiv:1012.1899 [pdf, other]

Querying Biomedical Ontologies in Natural Language using Answer Set

Authors: Halit Erdogan, Umut Oztok, Yelda Erdem, Esra Erdem

Abstract: In this work, we develop an intelligent user interface that allows users to enter biomedical queries in a natural language, and that presents the answers (possibly with explanations if requested) in a natural language. We develop a rule layer over biomedical ontologies and databases, and use automated reasoners to answer queries considering relevant parts of the rule layer. In this work, we develop an intelligent user interface that allows users to enter biomedical queries in a natural language, and that presents the answers (possibly with explanations if requested) in a natural language. We develop a rule layer over biomedical ontologies and databases, and use automated reasoners to answer queries considering relevant parts of the rule layer. △ Less

Submitted 8 December, 2010; originally announced December 2010.

Comments: in Adrian Paschke, Albert Burger, Andrea Splendiani, M. Scott Marshall, Paolo Romano: Proceedings of the 3rd International Workshop on Semantic Web Applications and Tools for the Life Sciences, Berlin,Germany, December 8-10, 2010

Report number: SWAT4LS 2010 ACM Class: J.3

arXiv:1009.0396 [pdf, ps, other]

doi 10.1016/j.dsp.2012.03.003

A* Orthogonal Matching Pursuit: Best-First Search for Compressed Sensing Signal Recovery

Authors: Nazim Burak Karahanoglu, Hakan Erdogan

Abstract: Compressed sensing is a develo** field aiming at reconstruction of sparse signals acquired in reduced dimensions, which make the recovery process under-determined. The required solution is the one with minimum $\ell_0$ norm due to sparsity, however it is not practical to solve the $\ell_0$ minimization problem. Commonly used techniques include $\ell_1$ minimization, such as Basis Pursuit (BP) an… ▽ More Compressed sensing is a develo** field aiming at reconstruction of sparse signals acquired in reduced dimensions, which make the recovery process under-determined. The required solution is the one with minimum $\ell_0$ norm due to sparsity, however it is not practical to solve the $\ell_0$ minimization problem. Commonly used techniques include $\ell_1$ minimization, such as Basis Pursuit (BP) and greedy pursuit algorithms such as Orthogonal Matching Pursuit (OMP) and Subspace Pursuit (SP). This manuscript proposes a novel semi-greedy recovery approach, namely A* Orthogonal Matching Pursuit (A*OMP). A*OMP performs A* search to look for the sparsest solution on a tree whose paths grow similar to the Orthogonal Matching Pursuit (OMP) algorithm. Paths on the tree are evaluated according to a cost function, which should compensate for different path lengths. For this purpose, three different auxiliary structures are defined, including novel dynamic ones. A*OMP also incorporates pruning techniques which enable practical applications of the algorithm. Moreover, the adjustable search parameters provide means for a complexity-accuracy trade-off. We demonstrate the reconstruction ability of the proposed scheme on both synthetically generated data and images using Gaussian and Bernoulli observation matrices, where A*OMP yields less reconstruction error and higher exact recovery frequency than BP, OMP and SP. Results also indicate that novel dynamic cost functions provide improved results as compared to a conventional choice. △ Less

Submitted 14 March, 2012; v1 submitted 2 September, 2010; originally announced September 2010.

Comments: accepted for publication in Digital Signal Processing

Journal ref: Digital Signal Processing, Volume 22, Issue 4, 2012, Pages 555-568

Showing 1–34 of 34 results for author: Erdogan, H