-
Directional MCLP Analysis and Reconstruction for Spatial Speech Communication
Authors:
Srikanth Raj Chetupalli,
Thippur V. Sreenivas
Abstract:
Spatial speech communication, i.e., the reconstruction of the spoken signal along with the relative speaker position in the enclosure (reverberation information), is considered in this paper. The directional and diffuse components and the source position information are estimated at the transmitter, and perceptually effective reproduction is considered at the receiver. We consider spatially distributed microphone arrays for signal acquisition, with node-specific signal estimation and direction of arrival (DoA) estimation. A short-time Fourier transform (STFT) domain multi-channel linear prediction (MCLP) approach is used to model the diffuse component, and the relative acoustic transfer function is used to model the direct signal component. A distortionless array response constraint and a time-varying complex Gaussian source model are used in the joint estimation of the source DoA and the constituent signal components, separately at each node. The intersection of the DoA directions estimated at the nodes is used to compute the source position. The signal components computed at the node nearest to the estimated source position are taken as the signals for transmission.
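The position step described above (intersecting the per-node DoA directions, then picking the nearest node) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a 2-D geometry, known node coordinates, and azimuth-only DoAs, and it uses a least-squares intersection of bearing lines so that more than two nodes can be combined. The function names `intersect_bearings` and `nearest_node` are hypothetical.

```python
import math

def intersect_bearings(nodes, doas_deg):
    """Least-squares intersection of 2-D bearing lines (assumed geometry).

    nodes: list of (x, y) node positions; doas_deg: azimuth seen from each node.
    Minimizes the summed squared perpendicular distance to all bearing lines.
    """
    s11 = s12 = s22 = b1 = b2 = 0.0
    for (ax, ay), doa in zip(nodes, doas_deg):
        dx, dy = math.cos(math.radians(doa)), math.sin(math.radians(doa))
        # Projector onto the direction perpendicular to the bearing: I - d d^T
        p11, p12, p22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        s11 += p11
        s12 += p12
        s22 += p22
        b1 += p11 * ax + p12 * ay
        b2 += p12 * ax + p22 * ay
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("bearings are parallel; no unique intersection")
    return ((s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)

def nearest_node(nodes, position):
    """Index of the node closest to the estimated source position."""
    return min(range(len(nodes)), key=lambda i: math.dist(nodes[i], position))
```

With two nodes the result is the ordinary intersection of the two lines; with more nodes and noisy DoAs it is the point closest to all bearing lines in the least-squares sense.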
At the receiver, a four-channel loudspeaker (LS) setup is used for spatial reproduction, in which the source spatial image is reproduced relative to a chosen virtual listener position in the transmitter enclosure. The vector base amplitude panning (VBAP) method is used to reproduce the direct component over the LS setup, and the diffuse component is reproduced equally from all the loudspeakers after decorrelation. This scheme of spatial speech communication is shown to be effective and more natural for hands-free telecommunication, through either loudspeaker listening or binaural headphone listening with head-related transfer function (HRTF) based presentation.
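As a rough illustration of the direct-component panning mentioned above, the following sketch computes 2-D VBAP gains for a four-loudspeaker ring. The loudspeaker azimuths (45, 135, 225, 315 degrees), the horizontal-only geometry, and the function name `vbap_2d` are assumptions made for the example; the abstract does not specify the layout.

```python
import math

def vbap_2d(source_deg, speaker_degs=(45.0, 135.0, 225.0, 315.0)):
    """Pairwise 2-D VBAP gains, one per loudspeaker (assumed layout).

    Assumes adjacent loudspeakers are less than 180 degrees apart.
    """
    p = (math.cos(math.radians(source_deg)), math.sin(math.radians(source_deg)))
    n = len(speaker_degs)
    order = sorted(range(n), key=lambda i: speaker_degs[i] % 360.0)
    gains = [0.0] * n
    for k in range(n):
        i, j = order[k], order[(k + 1) % n]
        a, b = math.radians(speaker_degs[i]), math.radians(speaker_degs[j])
        l1 = (math.cos(a), math.sin(a))
        l2 = (math.cos(b), math.sin(b))
        det = l1[0] * l2[1] - l1[1] * l2[0]
        if abs(det) < 1e-12:
            continue
        # Solve g1 * l1 + g2 * l2 = p for the active loudspeaker pair.
        g1 = (p[0] * l2[1] - p[1] * l2[0]) / det
        g2 = (l1[0] * p[1] - l1[1] * p[0]) / det
        if g1 >= -1e-9 and g2 >= -1e-9:
            g1, g2 = max(g1, 0.0), max(g2, 0.0)
            norm = math.hypot(g1, g2)  # normalize to unit total power
            gains[i], gains[j] = g1 / norm, g2 / norm
            return gains
    raise ValueError("no loudspeaker pair encloses the source direction")
```

Only the two loudspeakers adjacent to the source direction receive non-zero gain; the diffuse component would be handled separately, as the abstract describes.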
Submitted 9 September, 2021;
originally announced September 2021.
-
Joint spatial filter and time-varying MCLP for dereverberation and interference suppression of a dynamic/static speech source
Authors:
Srikanth Raj Chetupalli,
Thippur V. Sreenivas
Abstract:
Dereverberation of a moving speech source in the presence of other directional interferers is a harder problem than dereverberation and interference cancellation for a stationary source. We explore a joint multi-channel linear prediction (MCLP) and relative transfer function (RTF) formulation in a stochastic framework with maximum likelihood estimation. We find that combining spatial filtering under a distortionless response constraint with a time-varying complex Gaussian model for the desired source signal at a reference microphone provides better signal estimation. For a stationary source, we consider batch estimation and obtain an iterative solution. Extending to a moving source, we formulate a linear time-varying dynamic system model for the MCLP coefficients and an RTF-based online adaptive spatial filter. For the case of tracking a desired source in the presence of interfering sources, the same formulation is used by specifying the RTF. Simulated experimental results show that the proposed scheme provides better spatial selectivity and dereverberation than the traditional methods, for both stationary and dynamic sources, even in the presence of interfering sources.
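The distortionless response constraint mentioned above can be illustrated in isolation. The sketch below computes the standard minimum-variance distortionless-response weights w = R^{-1} a / (a^H R^{-1} a) for two microphones in one frequency bin, where a is the RTF vector with a reference-microphone entry of 1. It does not reproduce the paper's joint MCLP and RTF estimator; the two-channel restriction and the names `mvdr_weights_2ch` and `apply_filter` are assumptions for the example.

```python
def mvdr_weights_2ch(cov, rtf):
    """Distortionless spatial filter for two channels in one frequency bin.

    cov: 2x2 Hermitian covariance as nested tuples of complex numbers.
    rtf: (a1, a2) relative transfer function, a1 = 1 at the reference mic.
    """
    (r11, r12), (r21, r22) = cov
    det = r11 * r22 - r12 * r21
    if abs(det) < 1e-12:
        raise ValueError("covariance matrix is singular")
    # u = R^{-1} a using the closed-form 2x2 inverse
    u1 = (r22 * rtf[0] - r12 * rtf[1]) / det
    u2 = (-r21 * rtf[0] + r11 * rtf[1]) / det
    denom = rtf[0].conjugate() * u1 + rtf[1].conjugate() * u2  # a^H R^{-1} a
    return (u1 / denom, u2 / denom)

def apply_filter(w, x):
    """Filter output w^H x for one time-frequency point."""
    return w[0].conjugate() * x[0] + w[1].conjugate() * x[1]
```

By construction w^H a = 1, so a signal arriving through the specified RTF passes with unit gain at the reference microphone while the output power from other directions is minimized.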
Submitted 22 October, 2019;
originally announced October 2019.
-
LSTM based AE-DNN constraint for better late reverb suppression in multi-channel LP formulation
Authors:
Srikanth Raj Chetupalli,
Thippur V. Sreenivas
Abstract:
Prediction of the late reverberation component using multi-channel linear prediction (MCLP) in the short-time Fourier transform (STFT) domain is an effective means of enhancing reverberant speech. Traditionally, a speech power spectral density (PSD) weighted prediction error (WPE) minimization approach is used to estimate the prediction filters. The method is sensitive to the estimate of the desired signal PSD. In this paper, we propose a deep neural network (DNN) based non-linear estimate of the desired signal PSD. An autoencoder trained on clean speech STFT coefficients is used as the desired signal prior. We explore two different architectures based on (i) fully-connected (FC) feed-forward layers and (ii) recurrent long short-term memory (LSTM) layers. Experiments using real room impulse responses show that the LSTM-DNN based PSD estimate performs better than the traditional methods for late reverb suppression.
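For orientation, the traditional PSD-weighted prediction error iteration referred to above can be sketched in a heavily reduced form: one channel, one frequency bin, and a single delayed prediction tap. The PSD here is re-estimated from the current prediction error, which is the conventional choice; the DNN-based PSD estimate proposed in the paper is not modeled. The function name `wpe_single_tap` and all parameter defaults are assumptions.

```python
def wpe_single_tap(x, delay=2, iterations=3, eps=1e-8):
    """Toy WPE: one channel, one frequency bin, one delayed prediction tap.

    x: complex STFT coefficients of a single bin across time frames.
    Returns the prediction coefficient g and the prediction error d.
    """
    d = list(x)
    g = 0j
    for _ in range(iterations):
        # PSD of the desired signal, taken from the current estimate of d
        lam = [max(abs(v) ** 2, eps) for v in d]
        num, den = 0j, 0.0
        for n in range(delay, len(x)):
            past = x[n - delay]
            num += past.conjugate() * x[n] / lam[n]
            den += abs(past) ** 2 / lam[n]
        g = num / den if den > 0.0 else 0j
        # Subtract the predicted late component from the observation
        d = [x[n] - g * x[n - delay] if n >= delay else x[n]
             for n in range(len(x))]
    return g, d
```

Replacing the line that computes `lam` with an externally supplied PSD estimate is where a learned prior of the kind described in the abstract would enter.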
Submitted 4 December, 2018;
originally announced December 2018.
-
Latent variable approach to diarization of audio recordings using ad-hoc randomly placed mobile devices
Authors:
Srikanth Raj Chetupalli,
Anirban Bhowmick,
Thippur V. Sreenivas
Abstract:
Diarization of audio recordings from ad-hoc mobile devices using spatial information is considered in this paper. A two-channel synchronous recording is assumed for each mobile device, which is used to compute directional statistics separately at each device in a frame-wise manner. The recordings across the mobile devices are asynchronous, but a coarse synchronization is performed by aligning the signals using acoustic events or the real-time clock. The direction statistics computed for all the devices are then modeled jointly using a Dirichlet mixture model, and the posterior probability over the mixture components is used to derive the diarization information. Experiments on real-life recordings using mobile phones show a diarization error rate of less than 14%.
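The posterior step described above can be sketched as follows, assuming the mixture weights and Dirichlet parameters are already known and that each frame's direction statistic is a strictly positive vector summing to one. How the paper forms these vectors and fits the mixture is not stated in the abstract, so both are assumptions here, as are the function names.

```python
import math

def dirichlet_logpdf(x, alpha):
    """Log density of a Dirichlet distribution at a point x on the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))

def component_posteriors(x, weights, alphas):
    """Posterior probability of each mixture component for one frame."""
    logs = [math.log(w) + dirichlet_logpdf(x, a)
            for w, a in zip(weights, alphas)]
    peak = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]

def diarize(frames, weights, alphas):
    """Assign each frame to the component with the highest posterior."""
    labels = []
    for x in frames:
        post = component_posteriors(x, weights, alphas)
        labels.append(max(range(len(post)), key=post.__getitem__))
    return labels
```

Each mixture component plays the role of one speaker, so the per-frame labels give a frame-level diarization that could then be smoothed over time.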
Submitted 31 October, 2018;
originally announced October 2018.
-
Raga Identification using Repetitive Note Patterns from prescriptive notations of Carnatic Music
Authors:
Ranjani H. G.,
T. V. Sreenivas
Abstract:
Carnatic music, a form of Indian Art Music, has relied on an oral tradition for transferring knowledge across several generations. Over the last two hundred years, prescriptive notations have been adopted for learning, sight-playing and sight-singing. Prescriptive notations offer generic guidelines for a raga rendition and do not include information about the ornamentations, or gamakas, which are considered critical for characterizing a raga. In this paper, we show that prescriptive notations contain raga attributes and that a raga of Carnatic music can be reliably identified from its octave-folded prescriptive notations. We restrict the notations to 7 notes and suppress the finer note position information. A dictionary-based approach captures the statistics of repetitive note patterns within a raga notation. The proposed stochastic models of repetitive note patterns (SMRNP), obtained from raga notations of known compositions, outperform the state-of-the-art melody-based raga identification technique on equivalent melodic data corresponding to the same compositions. This in turn shows that, for Carnatic music, the note transitions and movements have a greater role in defining the raga structure than the exact note positions.
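A dictionary of repetitive note patterns can be pictured with a toy sketch. The code below is a stand-in, not the SMRNP model itself: it assumes notations are strings over the seven octave-folded note symbols S, R, G, M, P, D, N, keeps only fixed-length patterns that occur at least twice, and scores a query with add-one smoothed pattern frequencies. The pattern length, the smoothing, and the function names are all assumptions.

```python
import math
from collections import Counter

def pattern_counts(notation, n=3):
    """Dictionary of length-n note patterns that repeat within a notation."""
    grams = Counter(notation[i:i + n] for i in range(len(notation) - n + 1))
    return {g: c for g, c in grams.items() if c >= 2}

def log_score(notation, model, n=3, num_notes=7):
    """Add-one smoothed log-likelihood of a notation under one raga's model."""
    total = sum(model.values())
    vocab = num_notes ** n  # number of possible length-n patterns
    score = 0.0
    for i in range(len(notation) - n + 1):
        count = model.get(notation[i:i + n], 0)
        score += math.log((count + 1) / (total + vocab))
    return score

def identify(notation, models, n=3):
    """Return the raga label whose model scores the notation highest."""
    return max(models, key=lambda raga: log_score(notation, models[raga], n))
```

Because only the seven folded note symbols are used, the models depend on note transitions rather than exact note positions, which is the property the abstract highlights.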
Submitted 30 November, 2017;
originally announced November 2017.