Search | arXiv e-print repository

arXiv:2402.02950 [pdf, other]

Semantic Entropy Can Simultaneously Benefit Transmission Efficiency and Channel Security of Wireless Semantic Communications

Authors: Yankai Rong, Guoshun Nan, Minwei Zhang, Sihan Chen, Songtao Wang, Xuefei Zhang, Nan Ma, Shixun Gong, Zhaohui Yang, Qimei Cui, Xiaofeng Tao, Tony Q. S. Quek

Abstract: Recently proliferated deep learning-based semantic communications (DLSC) focus on how transmitted symbols efficiently convey a desired meaning to the destination. However, the sensitivity of neural models and the openness of wireless channels make DLSC systems extremely fragile to various malicious attacks. This inspires us to ask: "Can we further exploit the transmission-efficiency advantages of wireless semantic communications while also alleviating their security disadvantages?" Keeping this in mind, we propose SemEntropy, a novel method that answers this question by exploring the semantics of data for both adaptive transmission and physical layer encryption. Specifically, we first introduce semantic entropy, which indicates the expectation of various semantic scores regarding the transmission goal of the DLSC. Equipped with such semantic entropy, we can dynamically assign informative semantics to Orthogonal Frequency Division Multiplexing (OFDM) subcarriers with better channel conditions in a fine-grained manner. We also use the entropy to guide semantic key generation to safeguard communications over open wireless channels. By doing so, both transmission efficiency and channel security can be improved simultaneously. Extensive experiments over various benchmarks show the effectiveness of the proposed SemEntropy. We discuss why our proposed method benefits secure transmission of DLSC, and also give some interesting findings, e.g., SemEntropy can keep semantic accuracy at 95% with 60% less transmission.

Submitted 6 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: 13 pages, 12 figures
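The abstract does not spell out the subcarrier allocation rule. A minimal sketch, assuming each semantic feature already carries a scalar importance score (a hypothetical stand-in for the paper's semantic entropy) and each OFDM subcarrier a known channel gain, pairs the most informative features with the strongest subcarriers by rank. The helper `expected_semantic_score` only illustrates the "expectation of semantic scores" idea; both function names and their inputs are assumptions, not the authors' definitions.

```python
def expected_semantic_score(probs, scores):
    """Expectation of semantic scores under a probability vector (hypothetical inputs)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must sum to 1")
    return sum(p * s for p, s in zip(probs, scores))


def assign_to_subcarriers(importance, channel_gains):
    """Return mapping[feature] = subcarrier, best features onto best subcarriers.

    importance: per-feature score (higher = more informative), assumed given.
    channel_gains: per-subcarrier quality, e.g. |H_k|^2 (higher = better).
    """
    if len(importance) != len(channel_gains):
        raise ValueError("expected one subcarrier per feature")
    # Rank features and subcarriers independently, both in descending order.
    feats = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    subs = sorted(range(len(channel_gains)), key=lambda k: channel_gains[k], reverse=True)
    mapping = [0] * len(importance)
    for f, k in zip(feats, subs):
        mapping[f] = k  # i-th most important feature -> i-th best subcarrier
    return mapping


print(assign_to_subcarriers([0.1, 0.9, 0.5], [2.0, 0.5, 1.0]))  # [1, 0, 2]
```

Rank matching is about the simplest fine-grained rule; a practical system would also need to handle more features than subcarriers and convey the mapping to the receiver.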

arXiv:2401.15647 [pdf, other]

UP-CrackNet: Unsupervised Pixel-Wise Road Crack Detection via Adversarial Image Restoration

Authors: Nachuan Ma, Rui Fan, Lihua Xie

Abstract: Over the past decade, automated methods have been developed to detect cracks more efficiently, accurately, and objectively, with the ultimate goal of replacing conventional manual visual inspection techniques. Among these methods, semantic segmentation algorithms have demonstrated promising results in pixel-wise crack detection tasks. However, training such networks requires large human-annotated datasets with pixel-level annotations, which are highly labor-intensive and time-consuming to produce. Moreover, supervised learning-based methods often generalize poorly to unseen datasets. Therefore, we propose an unsupervised pixel-wise road crack detection network, known as UP-CrackNet. Our approach first generates multi-scale square masks and randomly selects them to corrupt undamaged road images by removing certain regions. Subsequently, a generative adversarial network is trained to restore the corrupted regions by leveraging the semantic context learned from surrounding uncorrupted regions. During the testing phase, an error map is generated by calculating the difference between the input and restored images, which allows for pixel-wise crack detection. Our comprehensive experimental results demonstrate that UP-CrackNet outperforms other general-purpose unsupervised anomaly detection algorithms, and exhibits satisfactory performance and superior generalizability when compared with state-of-the-art supervised crack segmentation algorithms. Our source code is publicly available at mias.group/UP-CrackNet.

Submitted 6 May, 2024; v1 submitted 28 January, 2024; originally announced January 2024.
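The test-time step above (error map from input vs. restored image) can be sketched as follows. This assumes grayscale images as nested lists with values in [0, 1] and a fixed, hypothetical threshold; the abstract does not say how the error map is binarised, and the restoration network itself is not shown. The intuition is that a generator trained only on undamaged roads cannot reproduce cracks, so crack pixels give large restoration errors.

```python
def error_map(image, restored):
    """Per-pixel absolute difference between input and restored grayscale images."""
    if len(image) != len(restored) or any(len(a) != len(b) for a, b in zip(image, restored)):
        raise ValueError("images must have the same shape")
    return [[abs(p - q) for p, q in zip(row_a, row_b)] for row_a, row_b in zip(image, restored)]


def crack_mask(image, restored, threshold):
    """Binary mask: 1 where the restoration error exceeds the (hypothetical) threshold."""
    return [[1 if e > threshold else 0 for e in row] for row in error_map(image, restored)]


img = [[0.2, 0.9], [0.1, 0.5]]
rec = [[0.2, 0.3], [0.1, 0.45]]
print(crack_mask(img, rec, threshold=0.1))  # [[0, 1], [0, 0]]
```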

arXiv:2310.19817 [pdf, other]

Intelligibility prediction with a pretrained noise-robust automatic speech recognition model

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: This paper describes two intelligibility prediction systems derived from a pretrained noise-robust automatic speech recognition (ASR) model for the second Clarity Prediction Challenge (CPC2). One system is intrusive and leverages the hidden representations of the ASR model. The other system is non-intrusive and makes predictions with derived ASR uncertainty. The ASR model is pretrained only on a simulated noisy speech corpus and does not use the CPC2 data. Given this, the accurate prediction performance on the CPC2 evaluation indicates that the intelligibility prediction systems are robust to unseen scenarios.

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2309.02171 [pdf, other]

A Wideband MIMO Channel Model for Aerial Intelligent Reflecting Surface-Assisted Wireless Communications

Authors: Shaoyi Liu, Nan Ma, Yaning Chen, Ke Peng, Dongsheng Xue

Abstract: Compared to traditional intelligent reflecting surfaces (IRS), aerial IRS (AIRS) has unique advantages, such as more flexible deployment and wider service coverage. However, modeling AIRS in the channel presents new challenges due to its mobility. In this paper, a three-dimensional (3D) wideband channel model for AIRS and IRS joint-assisted multiple-input multiple-output (MIMO) communication systems is proposed, which considers the rotational degrees of freedom in three directions and the motion angles of AIRS in space. Based on the proposed model, the channel impulse response (CIR), correlation function, and channel capacity are derived, and several feasible joint phase shift schemes for AIRS and IRS units are proposed. Simulation results show that the proposed model can capture the channel characteristics accurately, and the proposed phase shift methods can effectively improve the channel statistical characteristics and increase the system capacity. Additionally, we observe that in certain scenarios, the paths involving the IRS and the line-of-sight (LoS) paths exhibit similar characteristics. These findings provide valuable insights for the future development of intelligent communication systems.

Submitted 5 September, 2023; originally announced September 2023.

Comments: 6 pages, 7 figures

arXiv:2305.19069 [pdf, other]

doi 10.1016/j.asoc.2023.110675

Multi-source adversarial transfer learning for ultrasound image segmentation with limited similarity

Authors: Yifu Zhang, Hongru Li, Tao Yang, Rui Tao, Zhengyuan Liu, Shimeng Shi, Jiansong Zhang, Ning Ma, Wu** Feng, Zhanhu Zhang, Xinyu Zhang

Abstract: Lesion segmentation of ultrasound medical images based on deep learning techniques is a widely used method for diagnosing diseases. Although there is a large amount of ultrasound image data in medical centers and other places, labeled ultrasound datasets are a scarce resource, and it is likely that no datasets are available for new tissues/organs. Transfer learning offers a possible solution to this problem, but natural images contain many features that are unrelated to the target domain; when natural images serve as the source domain, redundant features that do not help the task will be extracted. Transfer between ultrasound images can avoid this problem, but there are few types of public datasets, and it is difficult to find sufficiently similar source domains. Compared with natural images, ultrasound images carry less information, and there are fewer transferable features between different ultrasound images, which may cause negative transfer. To this end, a multi-source adversarial transfer learning network for ultrasound image segmentation is proposed. Specifically, to address the lack of annotations, adversarial transfer learning is used to adaptively extract common features between a given pair of source and target domains, which makes it possible to utilize unlabeled ultrasound data. To alleviate the lack of knowledge in a single source domain, multi-source transfer learning is adopted to fuse knowledge from multiple source domains. To ensure the effectiveness of the fusion and make full use of scarce data, a multi-source domain independent strategy is also proposed to improve the estimation of the target domain data distribution, which further increases the learning ability of the multi-source adversarial transfer learning network in multiple domains.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Submitted to Applied Soft Computing Journal

arXiv:2305.13616 [pdf]

An Entire Renal Anatomy Extraction Network for Advanced CAD During Partial Nephrectomy

Authors: Nan Ma, Ying Yang, Dongkai Zhou

Abstract: Partial nephrectomy (PN) is a common surgery in urology. Digitization of renal anatomies greatly benefits many computer-aided diagnosis (CAD) techniques during PN. However, manual delineation of the kidney vascular system and tumor on each slice is time-consuming, error-prone, and inconsistent. Therefore, we propose a method, fully based on deep learning, for extracting the entire renal anatomy from Computed Tomographic Angiographic (CTA) images. We adopted a coarse-to-fine workflow to extract target tissues: first, we roughly located the kidney region, and then cropped the kidney region for more detailed extraction. The network used in our workflow is based on 3D U-Net. To deal with the imbalance of class contributions to the loss, we combined the Dice loss with focal loss, and added an extra weight to prevent excessive attention. We also improved the manual annotations of vessels by merging a semi-trained model's predictions with the original annotations under supervision. We performed several experiments to find the best-fitting combination of variables for training. We trained and evaluated the models on our dataset of 60 cases from 3 different sources. The average Dice score coefficients (DSC) of kidney, tumor, cyst, artery, and vein were 90.9%, 90.0%, 89.2%, 80.1%, and 82.2%, respectively. Our modulating weight and hybrid loss function strategy increased the average DSC of all tissues by about 8-20%. Our optimization of vessel annotation improved the average DSC by about 1-5%. These results demonstrate the efficiency of our network for renal anatomy segmentation. The high accuracy and full automation make it possible to quickly digitize personal renal anatomies, which greatly increases the feasibility and practicability of CAD applications in urological surgery.

Submitted 22 May, 2023; originally announced May 2023.
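The hybrid loss mentioned above (Dice loss combined with focal loss plus an extra weight) can be sketched for a single binary class over flattened voxels. This is a minimal illustration assuming the standard soft Dice and focal formulations; `focal_weight` and `gamma` are hypothetical parameters, since the abstract does not give the exact weighting the authors used.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss: 1 - 2|P*T| / (|P| + |T|)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; (1 - p_t)^gamma down-weights easy voxels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        pt = p if t == 1 else 1.0 - p    # probability assigned to the true class
        total += -((1.0 - pt) ** gamma) * math.log(pt)
    return total / len(probs)


def hybrid_loss(probs, targets, focal_weight=0.5):
    """Dice + weighted focal (focal_weight is a hypothetical choice)."""
    return dice_loss(probs, targets) + focal_weight * focal_loss(probs, targets)


print(hybrid_loss([0.9, 0.1, 0.8], [1, 0, 1]) < hybrid_loss([0.2, 0.7, 0.3], [1, 0, 1]))  # True
```

Dice is insensitive to how many background voxels there are, while the focal term keeps per-voxel gradients and concentrates them on hard voxels; combining the two is a common way to handle small structures such as vessels.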

arXiv:2205.09377 [pdf, other]

Coexistence between Task- and Data-Oriented Communications: A Whittle's Index Guided Multi-Agent Reinforcement Learning Approach

Authors: Ran Li, Chuan Huang, Xiaoqi Qin, Shengpei Jiang, Nan Ma, Shuguang Cui

Abstract: We investigate the coexistence of task-oriented and data-oriented communications in an IoT system that shares a group of channels, and study the scheduling problem to jointly optimize the weighted age of incorrect information (AoII) and throughput, which are the performance metrics of the two types of communications, respectively. This problem is formulated as a Markov decision problem, which is difficult to solve due to the large discrete action space and the time-varying action constraints induced by the stochastic availability of channels. By exploiting the intrinsic properties of this problem and reformulating the reward function based on channel statistics, we first simplify the solution space, state space, and optimality criteria, and convert the problem to an equivalent Markov game, in which the large discrete action space issue is greatly relieved. Then, we propose a Whittle's index guided multi-agent proximal policy optimization (WI-MAPPO) algorithm to solve the considered game, where the embedded Whittle's index module further shrinks the action space, and the proposed offline training algorithm extends the training kernel of conventional MAPPO to address the issue of time-varying constraints. Finally, numerical results validate that the proposed algorithm significantly outperforms state-of-the-art age of information (AoI) based algorithms under scenarios with insufficient channel resources.

Submitted 19 May, 2022; originally announced May 2022.
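To make the AoII metric concrete, the sketch below uses one common slotted, linear-penalty convention: the age grows by one each slot in which the receiver's estimate differs from the true source state and resets to zero when they match. This convention and the per-source `weight` are assumptions for illustration; the paper's exact weighted AoII definition may differ.

```python
def aoii_trace(source, estimate):
    """AoII per slot: +1 while the estimate is wrong, reset to 0 when correct."""
    age, trace = 0, []
    for x, x_hat in zip(source, estimate):
        age = age + 1 if x != x_hat else 0
        trace.append(age)
    return trace


def average_aoii(source, estimate, weight=1.0):
    """Time-averaged AoII scaled by a per-source weight (hypothetical weighting)."""
    trace = aoii_trace(source, estimate)
    return weight * sum(trace) / len(trace)


print(aoii_trace([0, 1, 1, 1, 0], [0, 0, 0, 1, 1]))  # [0, 1, 2, 0, 1]
```

Unlike plain AoI, which grows regardless of content, AoII stays at zero while the receiver's estimate is still correct, so a scheduler only pays for staleness that actually matters.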

arXiv:2204.04288 [pdf, other]

Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: Non-intrusive intelligibility prediction is important for its application in realistic scenarios, where a clean reference signal is difficult to access. The construction of many non-intrusive predictors requires either ground-truth intelligibility labels or clean reference signals for supervised learning. In this work, we leverage an unsupervised uncertainty estimation method for predicting speech intelligibility, which does not require intelligibility labels or reference signals to train the predictor. Our experiments demonstrate that the uncertainty from state-of-the-art end-to-end automatic speech recognition (ASR) models is highly correlated with speech intelligibility. The proposed method is evaluated on two databases, and the results show that the unsupervised uncertainty measures of ASR models are more correlated with speech intelligibility from listening results than the predictions made by widely used intrusive methods.

Submitted 6 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Accepted to INTERSPEECH 2022
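One simple unsupervised uncertainty measure of the kind described above is the mean per-frame entropy of an ASR model's output distributions. The sketch below assumes the posteriors are already available as lists of probabilities; it is an illustration of entropy-based uncertainty, not necessarily the specific measure the authors used, and the sign of the relationship (higher uncertainty suggesting lower intelligibility) is an assumption.

```python
import math


def frame_entropy(posterior):
    """Shannon entropy (nats) of one frame's output distribution."""
    return -sum(p * math.log(p) for p in posterior if p > 0)


def utterance_uncertainty(posteriors):
    """Mean frame entropy over an utterance (assumed: higher -> less intelligible)."""
    if not posteriors:
        raise ValueError("need at least one frame")
    return sum(frame_entropy(p) for p in posteriors) / len(posteriors)


confident = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]
uncertain = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]
print(utterance_uncertainty(confident) < utterance_uncertainty(uncertain))  # True
```

Such a score needs neither intelligibility labels nor a clean reference; mapping it to intelligibility would still require a monotonic fit against listener data.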

arXiv:2204.04287 [pdf, other]

Exploiting Hidden Representations from a DNN-based Speech Recogniser for Speech Intelligibility Prediction in Hearing-impaired Listeners

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: An accurate objective speech intelligibility prediction algorithm is of great interest for many applications such as speech enhancement for hearing aids. Most algorithms measure the signal-to-noise ratios or correlations between the acoustic features of clean reference signals and degraded signals. However, these hand-picked acoustic features are usually not explicitly correlated with recognition. Meanwhile, deep neural network (DNN) based automatic speech recognisers (ASR) are approaching human performance in some speech recognition tasks. This work leverages the hidden representations from DNN-based ASR as features for speech intelligibility prediction in hearing-impaired listeners. Experiments on a hearing aid intelligibility database show that the proposed method makes better predictions than a widely used binaural measure based on short-time objective intelligibility (STOI).

Submitted 6 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2204.04284 [pdf, other]

Auditory-Based Data Augmentation for End-to-End Automatic Speech Recognition

Authors: Zehai Tu, Jack Deadman, Ning Ma, Jon Barker

Abstract: End-to-end models have achieved significant improvements in automatic speech recognition. One common method to improve the performance of these models is expanding the data space through data augmentation. Meanwhile, human-auditory-inspired front-ends have also improved automatic speech recognisers. In this work, a well-verified auditory-based model, which can simulate various hearing abilities, is investigated for the purpose of data augmentation for end-to-end speech recognition. By introducing the auditory model into the data augmentation process, end-to-end systems are encouraged to ignore variation in the signal that cannot be heard and thereby focus on robust features for speech recognition. Two mechanisms in the auditory model, spectral smearing and loudness recruitment, are studied on the LibriSpeech dataset with a transformer-based end-to-end model. The results show that the proposed augmentation methods can bring statistically significant improvements over the state-of-the-art SpecAugment.

Submitted 8 April, 2022; originally announced April 2022.

arXiv:2106.04639 [pdf, other]

Optimising Hearing Aid Fittings for Speech in Noise with a Differentiable Hearing Loss Model

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: Current hearing aids normally provide amplification based on a general prescriptive fitting, and the benefits they provide vary among listening environments despite the inclusion of noise suppression features. Motivated by this fact, this paper proposes a data-driven machine learning technique to develop hearing aid fittings that are customised to speech in different noisy environments. A differentiable hearing loss model is proposed and used to optimise fittings with back-propagation. The customisation is based on data of speech in different noise types and also takes noise suppression into account. The objective evaluation shows the advantages of optimised custom fittings over general prescriptive fittings.

Submitted 8 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021

arXiv:2103.09030 [pdf, other]

A Large-Scale Dataset for Benchmarking Elevator Button Segmentation and Character Recognition

Authors: Jianbang Liu, Yuqi Fang, Delong Zhu, Nachuan Ma, ** Pan, Max Q. -H. Meng

Abstract: Human activities have recently been hugely restricted by COVID-19. Robots that can conduct inter-floor navigation attract much public attention, since they can substitute for human workers in service work. However, current robots either depend on human assistance or elevator retrofitting, and fully autonomous inter-floor navigation is still not available. As the very first step of inter-floor navigation, elevator button segmentation and recognition hold an important position. Therefore, we release the first large-scale publicly available elevator panel dataset in this work, containing 3,718 panel images with 35,100 button labels, to facilitate more powerful algorithms for autonomous elevator operation. Together with the dataset, a number of deep learning based implementations for button segmentation and recognition are also released to benchmark future methods in the community. The dataset will be available at https://github.com/zhudelong/elevator_button_recognition

Submitted 22 March, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

arXiv:2012.03166 [pdf, other]

Conditional Generative Adversarial Networks for Optimal Path Planning

Authors: Nachuan Ma, Jiankun Wang, Max Q. -H. Meng

Abstract: Path planning plays an important role in autonomous robot systems. Effective understanding of the surrounding environment and efficient generation of optimal collision-free paths are both critical to solving the path planning problem. Although conventional sampling-based algorithms, such as the rapidly-exploring random tree (RRT) and its improved optimal version (RRT*), have been widely used in path planning problems because of their ability to find a feasible path even in complex environments, they fail to find an optimal path efficiently. To solve this problem and satisfy the two aforementioned requirements, we propose a novel learning-based path planning algorithm which consists of a novel generative model based on conditional generative adversarial networks (CGAN) and a modified RRT* algorithm (denoted by CGAN-RRT*). Given the map information, our CGAN model can generate an efficient probability distribution of feasible paths, which can be utilized by the CGAN-RRT* algorithm to find the optimal path with a non-uniform sampling strategy. The CGAN model is trained by learning from ground truth maps, each of which is generated by overlaying all the results of executing the RRT algorithm 50 times on one raw map. We demonstrate the efficient performance of this CGAN model by testing it on two groups of maps and comparing the CGAN-RRT* algorithm with the conventional RRT* algorithm.

Submitted 5 December, 2020; originally announced December 2020.
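The non-uniform sampling step can be sketched as drawing grid cells in proportion to a learned probability map, with a fallback to uniform sampling. This assumes the generator's output is a 2D grid of non-negative weights; `sample_point` and `uniform_ratio` are hypothetical names, and the tree extension and rewiring parts of RRT* are omitted.

```python
import random


def sample_point(prob_map, uniform_ratio, rng):
    """Sample a grid cell (row, col) for a biased RRT*-style planner.

    prob_map: 2D list of non-negative weights (e.g. a generator's output).
    uniform_ratio: probability of falling back to uniform sampling (hypothetical).
    """
    rows, cols = len(prob_map), len(prob_map[0])
    if rng.random() < uniform_ratio:
        # Uniform fallback keeps every cell reachable if the learned map is wrong.
        return rng.randrange(rows), rng.randrange(cols)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [prob_map[r][c] for r, c in cells]
    return rng.choices(cells, weights=weights, k=1)[0]


rng = random.Random(0)
heat = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
print(sample_point(heat, uniform_ratio=0.0, rng=rng))  # (1, 1)
```

Mixing biased and uniform samples is a common design choice: the learned distribution speeds up convergence toward good paths, while the uniform share preserves the planner's ability to explore regions the model missed.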

arXiv:2003.14022 [pdf, ps, other]

Distributed Noise Covariance Matrices Estimation in Sensor Networks

Authors: Jiahong Li, Nan Ma, Fang Deng

Abstract: Adaptive algorithms based on in-network processing over networks are useful for online parameter estimation of historical data (e.g., noise covariance) in predictive control and machine learning areas. This paper focuses on the distributed noise covariance matrices estimation problem for multi-sensor linear time-invariant (LTI) systems. Conventional noise covariance estimation approaches, e.g., the auto-covariance least squares (ALS) method, suffer from the lack of the sensor's historical measurements and thus produce a high variance of the ALS estimate. To solve the problem, we propose the distributed auto-covariance least squares (D-ALS) algorithm based on the batch covariance intersection (BCI) method by enlarging the innovations from the neighbors. An accuracy analysis of the D-ALS algorithm is given to show the decrease in the variance of the D-ALS estimate. Numerical results of cooperative target tracking tasks in static and mobile sensor networks demonstrate the feasibility and superiority of the proposed D-ALS algorithm.

Submitted 31 March, 2020; originally announced March 2020.

Comments: 6 pages, 5 figures
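The abstract builds D-ALS on batch covariance intersection (BCI) without defining it. As background only, the sketch below shows standard two-estimate covariance intersection for 2-D estimates, which fuses estimates with unknown cross-correlation via $P^{-1} = w P_1^{-1} + (1-w) P_2^{-1}$ and $x = P\,(w P_1^{-1} x_1 + (1-w) P_2^{-1} x_2)$, choosing $w$ by a grid search that minimises the trace of $P$. It is not the BCI or D-ALS algorithm itself; the grid search and all function names are illustrative assumptions.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular matrix")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """2x2 matrix times 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def covariance_intersection(x1, P1, x2, P2, steps=100):
    """Fuse two 2-D estimates; pick w on a grid to minimise trace(P)."""
    I1, I2 = inv2(P1), inv2(P2)
    best = None
    for i in range(steps + 1):
        w = i / steps
        info = [[w * I1[r][c] + (1 - w) * I2[r][c] for c in range(2)] for r in range(2)]
        P = inv2(info)  # fused covariance for this weight
        trace = P[0][0] + P[1][1]
        if best is None or trace < best[0]:
            best = (trace, w, P)
    _, w, P = best
    a, b = matvec(I1, x1), matvec(I2, x2)
    x = matvec(P, [w * a[0] + (1 - w) * b[0], w * a[1] + (1 - w) * b[1]])
    return x, P, w


x, P, w = covariance_intersection([1.0, 0.0], [[1.0, 0.0], [0.0, 4.0]],
                                  [0.0, 1.0], [[4.0, 0.0], [0.0, 1.0]])
print(w)  # 0.5
```

A 2-D example is used because, for scalar estimates, the trace criterion simply selects whichever estimate has the smaller variance.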

arXiv:1912.11774 [pdf, other]

Autonomous Removal of Perspective Distortion for Robotic Elevator Button Recognition

Authors: Delong Zhu, Jianbang Liu, Nachuan Ma, Zhe Min, Max Q. -H. Meng

Abstract: Elevator button recognition is considered an indispensable function for enabling the autonomous elevator operation of mobile robots. However, due to unfavorable image conditions and various image distortions, the recognition accuracy still needs improvement. In this paper, we present a novel algorithm that can autonomously correct perspective distortions of elevator panel images. The algorithm first leverages the Gaussian Mixture Model (GMM) to conduct a grid fitting process based on button recognition results, then utilizes the estimated grid centers as reference features to estimate camera motions for correcting perspective distortions. The algorithm operates autonomously on a single image and does not need explicit feature detection or feature matching procedures, which makes it much more robust to noise and outliers than traditional feature-based geometric approaches. To verify the effectiveness of the algorithm, we collect an elevator panel dataset of 50 images captured from different angles of view. Experimental results show that the proposed algorithm can accurately estimate camera motions and effectively remove perspective distortions.

Submitted 25 December, 2019; originally announced December 2019.

arXiv:1904.03006 [pdf, other]

doi 10.1109/TASLP.2018.2855960

Robust Binaural Localization of a Target Sound Source by Combining Spectral Source Models and Deep Neural Networks

Authors: Ning Ma, Jose A. Gonzalez, Guy J. Brown

Abstract: Despite there being clear evidence for top-down (e.g., attentional) effects in biological spatial hearing, relatively few machine hearing systems exploit top-down model-based knowledge in sound localisation. This paper addresses this issue by proposing a novel framework for binaural sound localisation that combines model-based information about the spectral characteristics of sound sources and deep neural networks (DNNs). A target source model and a background source model are first estimated during a training phase using spectral features extracted from sound signals in isolation. When the identity of the background source is not available, a universal background model can be used. During testing, the source models are used jointly to explain the mixed observations and improve the localisation process by selectively weighting source azimuth posteriors output by a DNN-based localisation system. To address the possible mismatch between training and testing, a model adaptation process is further employed on-the-fly during testing, which adapts the background model parameters directly from the noisy observations in an iterative manner. The proposed system therefore combines model-based and data-driven information flow within a single computational framework. The evaluation task involved localisation of a target speech source in the presence of an interfering source and room reverberation. Our experiments show that by exploiting model-based information in this way, sound localisation performance can be improved substantially under various noisy and reverberant conditions.

Submitted 5 April, 2019; originally announced April 2019.

Comments: 10 pages

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 2122-2131, 2018

arXiv:1904.03001 [pdf, other]

doi 10.1109/TASLP.2017.2750760

Exploiting Deep Neural Networks and Head Movements for Robust Binaural Localisation of Multiple Sources in Reverberant Environments

Authors: Ning Ma, Tobias May, Guy J. Brown

Abstract: This paper presents a novel machine-hearing system that exploits deep neural networks (DNNs) and head movements for robust binaural localisation of multiple sources in reverberant environments. DNNs are used to learn the relationship between the source azimuth and binaural cues, consisting of the complete cross-correlation function (CCF) and interaural level differences (ILDs). In contrast to many previous binaural hearing systems, the proposed approach is not restricted to localisation of sound sources in the frontal hemifield. Due to the similarity of binaural cues in the frontal and rear hemifields, front-back confusions often occur. To address this, a head movement strategy is incorporated in the localisation model to help reduce the front-back errors. The proposed DNN system is compared to a Gaussian mixture model (GMM) based system that employs interaural time differences (ITDs) and ILDs as localisation features. Our experiments show that the DNN is able to exploit information in the CCF that is not available in the ITD cue, which together with head movements substantially improves localisation accuracy under challenging acoustic scenarios in which multiple talkers and room reverberation are present.

Submitted 5 April, 2019; originally announced April 2019.

Comments: 10 pages

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2444-2453, 2017
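The abstract above contrasts the ITD cue with the complete CCF. A minimal sketch of that distinction, assuming a single broadband frame per ear (systems like this typically compute the cues per auditory frequency channel, which is omitted here): the ITD is only the lag of the CCF peak, whereas the full CCF vector plus the ILD keeps the whole correlation shape as input features. The function names (`cross_correlation`, `interaural_level_difference`, `estimate_itd`, `binaural_features`) are hypothetical illustrations, not the authors' code.

```python
import math


def cross_correlation(left, right, max_lag):
    """Normalised cross-correlation for lags -max_lag..max_lag.

    A positive lag means the right-ear signal lags the left-ear signal.
    """
    n = len(left)
    norm = math.sqrt(sum(x * x for x in left) * sum(y * y for y in right))
    ccf = {}
    for lag in range(-max_lag, max_lag + 1):
        total = sum(left[i] * right[i + lag] for i in range(n) if 0 <= i + lag < n)
        ccf[lag] = total / norm if norm > 0 else 0.0
    return ccf


def interaural_level_difference(left, right, eps=1e-12):
    """ILD in dB; positive when the left-ear signal carries more energy."""
    e_left = sum(x * x for x in left)
    e_right = sum(y * y for y in right)
    return 10.0 * math.log10((e_left + eps) / (e_right + eps))


def estimate_itd(ccf, fs):
    """ITD in seconds: the single lag at which the CCF peaks."""
    return max(ccf, key=ccf.get) / fs


def binaural_features(left, right, max_lag):
    """Hypothetical feature vector: the complete CCF (ordered by lag) followed by the ILD."""
    ccf = cross_correlation(left, right, max_lag)
    return [ccf[lag] for lag in sorted(ccf)] + [interaural_level_difference(left, right)]
```

For a right-ear frame that is the left-ear frame delayed by 3 samples and halved in amplitude, the CCF peaks at lag 3 and the ILD is about 6.02 dB; `binaural_features` with `max_lag=5` returns 11 CCF values plus the ILD.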

arXiv:1904.02992 [pdf, other]

Deep Learning Features for Robust Detection of Acoustic Events in Sleep-Disordered Breathing

Authors: Hector E. Romero, Ning Ma, Guy J. Brown, Amy V. Beeston, Madina Hasan

Abstract: Sleep-disordered breathing (SDB) is a serious and prevalent condition, and acoustic analysis via consumer devices (e.g. smartphones) offers a low-cost solution to screening for it. We present a novel approach for the acoustic identification of SDB sounds, such as snoring, using bottleneck features learned from a corpus of whole-night sound recordings. Two types of bottleneck features are described, obtained by applying a deep autoencoder to the output of an auditory model or a short-term autocorrelation analysis. We investigate two architectures for snore sound detection: a tandem system and a hybrid system. In both cases, a 'language model' (LM) was incorporated to exploit information about the sequence of different SDB events. Our results show that the proposed bottleneck features give better performance than conventional mel-frequency cepstral coefficients, and that the tandem system outperforms the hybrid system given the limited amount of labelled training data available. The LM made a small improvement to the performance of both classifiers.

Submitted 5 April, 2019; originally announced April 2019.

Comments: Accepted by IEEE ICASSP 2018
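The abstract does not say how the 'language model' is combined with the classifiers. One common way to exploit event sequences, shown here purely as an assumed illustration, is a bigram model over event labels decoded with the Viterbi algorithm together with frame-level classifier probabilities. The labels, probabilities and the `viterbi_decode` helper are hypothetical; the sketch assumes uniform initial label probabilities.

```python
import math


def _log(p):
    # Log-probability, with zero probability mapped to -infinity.
    return math.log(p) if p > 0 else float("-inf")


def viterbi_decode(frame_scores, bigram, labels):
    """Most likely label sequence given per-frame scores and a bigram LM.

    frame_scores: list of {label: probability}, one dict per frame
    bigram: {(previous_label, label): transition probability}
    Initial label probabilities are assumed uniform and therefore omitted.
    """
    # best[label] = (log score of the best path ending in label, that path)
    best = {lab: (_log(frame_scores[0][lab]), [lab]) for lab in labels}
    for scores in frame_scores[1:]:
        new_best = {}
        for lab in labels:
            # Pick the predecessor maximising path score plus transition score.
            prev = max(labels, key=lambda p: best[p][0] + _log(bigram[(p, lab)]))
            score = best[prev][0] + _log(bigram[(prev, lab)]) + _log(scores[lab])
            new_best[lab] = (score, best[prev][1] + [lab])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]
```

With frame probabilities for snore of 0.7, 0.4 and 0.7 (the remainder assigned to a hypothetical "other" label), a flat bigram reproduces the frame-wise decision snore, other, snore, while a "sticky" bigram (0.9 to stay in the same event, 0.1 to switch) smooths the isolated flip to snore, snore, snore.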

arXiv:1904.01916 [pdf, other]

End-to-end Binaural Sound Localisation from the Raw Waveform

Authors: Paolo Vecchiotti, Ning Ma, Stefano Squartini, Guy J. Brown

Abstract: A novel end-to-end binaural sound localisation approach is proposed which estimates the azimuth of a sound source directly from the waveform. Instead of employing hand-crafted features commonly used for binaural sound localisation, such as the interaural time and level difference, our end-to-end approach uses a convolutional neural network (CNN) to extract specific features from the waveform that are suitable for localisation. Two systems are proposed which differ in the initial frequency analysis stage. The first system is auditory-inspired and makes use of a gammatone filtering layer, while the second system is fully data-driven and exploits a trainable convolutional layer to perform frequency analysis. In both systems, a set of dedicated convolutional kernels is then employed to search for specific localisation cues, which are coupled with a localisation stage using fully connected layers. Localisation experiments using binaural simulation in both anechoic and reverberant environments show that the proposed systems outperform a state-of-the-art deep neural network system. Furthermore, our investigation of the frequency analysis stage in the second system suggests that the CNN is able to exploit different frequency bands for localisation according to the characteristics of the reverberant environment.

Submitted 3 April, 2019; originally announced April 2019.

Comments: Accepted by ICASSP 2019
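The first system above uses a gammatone filtering layer for its frequency analysis. A minimal sketch of what such a layer computes, assuming the standard Glasberg and Moore ERB formulas, a 4th-order gammatone with bandwidth factor 1.019, zero phase, peak-normalised 25 ms impulse responses and centre frequencies equally spaced on the ERB-rate scale. These parameter choices are common defaults, not details taken from the paper, and all function names are hypothetical. In a CNN front end, kernels like these could serve as fixed first-layer convolution weights.

```python
import math


def erb(fc):
    # Equivalent rectangular bandwidth (Hz) at centre frequency fc (Glasberg & Moore).
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def erb_rate(f):
    # Position of frequency f on the ERB-rate scale.
    return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)


def inverse_erb_rate(e):
    return (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37


def centre_frequencies(low, high, num_channels):
    """Centre frequencies equally spaced on the ERB-rate scale (num_channels >= 2)."""
    lo, hi = erb_rate(low), erb_rate(high)
    step = (hi - lo) / (num_channels - 1)
    return [inverse_erb_rate(lo + k * step) for k in range(num_channels)]


def gammatone_ir(fc, fs, duration=0.025, order=4):
    """Sampled gammatone impulse response t^(n-1) exp(-2 pi b t) cos(2 pi fc t), peak-normalised."""
    b = 1.019 * erb(fc)
    taps = []
    for i in range(int(round(duration * fs))):
        t = i / fs
        envelope = t ** (order - 1) * math.exp(-2 * math.pi * b * t)
        taps.append(envelope * math.cos(2 * math.pi * fc * t))
    peak = max(abs(x) for x in taps)
    return [x / peak for x in taps]


def filter_signal(signal, taps):
    # Direct-form FIR convolution; the output has the same length as the input.
    return [
        sum(taps[k] * signal[n - k] for k in range(len(taps)) if n - k >= 0)
        for n in range(len(signal))
    ]
```

As a quick check, a 1 kHz tone sampled at 16 kHz leaves far more energy in the output of the 1 kHz channel than in the output of a 4 kHz channel, which is the band-splitting behaviour the later convolutional kernels would operate on.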

Showing 1–19 of 19 results for author: Ma, N