Search | arXiv e-print repository

Physics-Constrained Learning for PDE Systems with Uncertainty Quantified Port-Hamiltonian Models

Authors: Kaiyuan Tan, Peilun Li, Thomas Beckers

Abstract: Modeling the dynamics of flexible objects has become an emerging topic in the community as these objects become more present in many applications, e.g., soft robotics. Due to the properties of flexible materials, the movements of soft objects are often highly nonlinear and, thus, complex to predict. Data-driven approaches seem promising for modeling those complex dynamics but often neglect basic p… ▽ More Modeling the dynamics of flexible objects has become an emerging topic in the community as these objects become more present in many applications, e.g., soft robotics. Due to the properties of flexible materials, the movements of soft objects are often highly nonlinear and, thus, complex to predict. Data-driven approaches seem promising for modeling those complex dynamics but often neglect basic physical principles, which consequently makes them untrustworthy and limits generalization. To address this problem, we propose a physics-constrained learning method that combines powerful learning tools and reliable physical models. Our method leverages the data collected from observations by sending them into a Gaussian process that is physically constrained by a distributed Port-Hamiltonian model. Based on the Bayesian nature of the Gaussian process, we not only learn the dynamics of the system, but also enable uncertainty quantification. Furthermore, the proposed approach preserves the compositional nature of Port-Hamiltonian systems. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11619 [pdf, other]

AV-CrossNet: an Audiovisual Complex Spectral Map** Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

Authors: Vahid Ahmadi Kalkhorani, Cheng Yu, Anurag Kumar, Ke Tan, Buye Xu, DeLiang Wang

Abstract: Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral map** for speech separation by lever… ▽ More Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral map** for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before feeding to AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and COG-MHEAR challenge. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: 10 pages, 4 Figures, and 4 Tables

arXiv:2406.07390 [pdf, other]

DiffCom: Channel Received Signal is a Natural Condition to Guide Diffusion Posterior Sampling

Authors: Sixian Wang, **cheng Dai, Kailin Tan, Xiaoqi Qin, Kai Niu, ** Zhang

Abstract: End-to-end visual communication systems typically optimize a trade-off between channel bandwidth costs and signal-level distortion metrics. However, under challenging physical conditions, this traditional discriminative communication paradigm often results in unrealistic reconstructions with perceptible blurring and aliasing artifacts, despite the inclusion of perceptual or adversarial losses for… ▽ More End-to-end visual communication systems typically optimize a trade-off between channel bandwidth costs and signal-level distortion metrics. However, under challenging physical conditions, this traditional discriminative communication paradigm often results in unrealistic reconstructions with perceptible blurring and aliasing artifacts, despite the inclusion of perceptual or adversarial losses for optimizing. This issue primarily stems from the receiver's limited knowledge about the underlying data manifold and the use of deterministic decoding mechanisms. To address these limitations, this paper introduces DiffCom, a novel end-to-end generative communication paradigm that utilizes off-the-shelf generative priors and probabilistic diffusion models for decoding, thereby improving perceptual quality without heavily relying on bandwidth costs and received signal quality. Unlike traditional systems that rely on deterministic decoders optimized solely for distortion metrics, our DiffCom leverages raw channel-received signal as a fine-grained condition to guide stochastic posterior sampling. Our approach ensures that reconstructions remain on the manifold of real data with a novel confirming constraint, enhancing the robustness and reliability of the generated outcomes. Furthermore, DiffCom incorporates a blind posterior sampling technique to address scenarios with unknown forward transmission characteristics. Extensive experimental validations demonstrate that DiffCom not only produces realistic reconstructions with details faithful to the original data but also achieves superior robustness against diverse wireless transmission degradations. Collectively, these advancements establish DiffCom as a new benchmark in designing generative communication systems that offer enhanced robustness and generalization superiorities. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.07069 [pdf, other]

Optimal Gait Control for a Tendon-driven Soft Quadruped Robot by Model-based Reinforcement Learning

Authors: Xuezhi Niu, Kaige Tan, Lei Feng

Abstract: This study presents an innovative approach to optimal gait control for a soft quadruped robot enabled by four Compressible Tendon-driven Soft Actuators (CTSAs). Improving our previous studies of using model-free reinforcement learning for gait control, we employ model-based reinforcement learning (MBRL) to further enhance the performance of the gait controller. Compared to rigid robots, the propos… ▽ More This study presents an innovative approach to optimal gait control for a soft quadruped robot enabled by four Compressible Tendon-driven Soft Actuators (CTSAs). Improving our previous studies of using model-free reinforcement learning for gait control, we employ model-based reinforcement learning (MBRL) to further enhance the performance of the gait controller. Compared to rigid robots, the proposed soft quadruped robot has better safety, less weight, and a simpler mechanism for fabrication and control. However, the primary challenge lies in develo** sophisticated control algorithms to attain optimal gait control for fast and stable locomotion. The research employs a multi-stage methodology, including state space restriction, data-driven model training, and reinforcement learning algorithm development. Compared to benchmark methods, the proposed MBRL algorithm, combined with post-training, significantly improves the efficiency and performance of gait control policies. The developed policy is both robust and adaptable to the robot's deformable morphology. The study concludes by highlighting the practical applicability of these findings in real-world scenarios. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.07065 [pdf, other]

Optimal Gait Design for a Soft Quadruped Robot via Multi-fidelity Bayesian Optimization

Authors: Kaige Tan, Xuezhi Niu, Qinglei Ji, Lei Feng, Martin Törngren

Abstract: This study focuses on the locomotion capability improvement in a tendon-driven soft quadruped robot through an online adaptive learning approach. Leveraging the inverse kinematics model of the soft quadruped robot, we employ a central pattern generator to design a parametric gait pattern, and use Bayesian optimization (BO) to find the optimal parameters. Further, to address the challenges of model… ▽ More This study focuses on the locomotion capability improvement in a tendon-driven soft quadruped robot through an online adaptive learning approach. Leveraging the inverse kinematics model of the soft quadruped robot, we employ a central pattern generator to design a parametric gait pattern, and use Bayesian optimization (BO) to find the optimal parameters. Further, to address the challenges of modeling discrepancies, we implement a multi-fidelity BO approach, combining data from both simulation and physical experiments throughout training and optimization. This strategy enables the adaptive refinement of the gait pattern and ensures a smooth transition from simulation to real-world deployment for the controller. Moreover, we integrate a computational task off-loading architecture by edge computing, which reduces the onboard computational and memory overhead, to improve real-time control performance and facilitate an effective online learning process. The proposed approach successfully achieves optimal walking gait design for physical deployment with high efficiency, effectively addressing challenges related to the reality gap in soft robotics. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2403.01369 [pdf, other]

A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement

Authors: Ravi Shankar, Ke Tan, Buye Xu, Anurag Kumar

Abstract: Self-supervised learned models have been found to be very effective for certain speech tasks such as automatic speech recognition, speaker identification, keyword spotting and others. While the features are undeniably useful in speech recognition and associated tasks, their utility in speech enhancement systems is yet to be firmly established, and perhaps not properly understood. In this paper, we… ▽ More Self-supervised learned models have been found to be very effective for certain speech tasks such as automatic speech recognition, speaker identification, keyword spotting and others. While the features are undeniably useful in speech recognition and associated tasks, their utility in speech enhancement systems is yet to be firmly established, and perhaps not properly understood. In this paper, we investigate the uses of SSL representations for single-channel speech enhancement in challenging conditions and find that they add very little value for the enhancement task. Our constraints are designed around on-device real-time speech enhancement -- model is causal, the compute footprint is small. Additionally, we focus on low SNR conditions where such models struggle to provide good enhancement. In order to systematically examine how SSL representations impact performance of such enhancement models, we propose a variety of techniques to utilize these embeddings which include different forms of knowledge-distillation and pre-training. △ Less

Submitted 2 March, 2024; originally announced March 2024.

Comments: 8 pages; Shorter form accepted in ICASSP 2024

arXiv:2312.16884 [pdf]

Binaural recording methods with analysis on inter-aural time, level, and phase differences

Authors: Johann Kay Ann Tan

Abstract: Binaural recordings are a form of stereophonic recording method that replicates how human ears perceive sound, these types of recordings create a 3D aural image around the listener and are extremely immersive when well recorded and listened to appropriately with headphones. It has wide applications in video, podcast, and gaming formats -- allowing the listener to feel like they are there. Although… ▽ More Binaural recordings are a form of stereophonic recording method that replicates how human ears perceive sound, these types of recordings create a 3D aural image around the listener and are extremely immersive when well recorded and listened to appropriately with headphones. It has wide applications in video, podcast, and gaming formats -- allowing the listener to feel like they are there. Although binaural formats are seldom used for music applications, they have also been utilized in music ranging from Rock, Jazz, Acoustic, and Classical. In this paper, we will investigate the acoustical phenomenon that produces the binaural effect in audio recordings -- including the ITD (Inter-aural time difference), the ILD (inter-aural level difference), IPD (inter-aural phase difference) as well as the monaural spectral difference that occurs between two ears so we can better understand the replication of human hearing in binaural recordings. Binaural recordings differ from regular stereophonic recordings as they are arranged in a specific way to account for HRTF (Head-related transfer function). The most common method of binaural recordings is with two high-quality omni-directional microphones affixed on a dummy head where the ears are located, although other methods exist without the use of a full dummy head. △ Less

Submitted 28 December, 2023; originally announced December 2023.

arXiv:2311.15959 [pdf, other]

CheapNET: Improving Light-weight speech enhancement network by projected loss function

Authors: Kaijun Tan, Benzhe Dai, Jiakui Li, Wenyu Mao

Abstract: Noise suppression and echo cancellation are critical in speech enhancement and essential for smart devices and real-time communication. Deployed in voice processing front-ends and edge devices, these algorithms must ensure efficient real-time inference with low computational demands. Traditional edge-based noise suppression often uses MSE-based amplitude spectrum mask training, but this approach h… ▽ More Noise suppression and echo cancellation are critical in speech enhancement and essential for smart devices and real-time communication. Deployed in voice processing front-ends and edge devices, these algorithms must ensure efficient real-time inference with low computational demands. Traditional edge-based noise suppression often uses MSE-based amplitude spectrum mask training, but this approach has limitations. We introduce a novel projection loss function, diverging from MSE, to enhance noise suppression. This method uses projection techniques to isolate key audio components from noise, significantly improving model performance. For echo cancellation, the function enables direct predictions on LAEC pre-processed outputs, substantially enhancing performance. Our noise suppression model achieves near state-of-the-art results with only 3.1M parameters and 0.4GFlops/s computational load. Moreover, our echo cancellation model outperforms replicated industry-leading models, introducing a new perspective in speech enhancement. △ Less

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2310.17116 [pdf, other]

Real-time Neonatal Chest Sound Separation using Deep Learning

Authors: Yang Yi Poh, Ethan Grooby, Kenneth Tan, Lindsay Zhou, Arrabella King, Ashwin Ramanathan, Atul Malhotra, Mehrtash Harandi, Faezeh Marzbanrad

Abstract: Auscultation for neonates is a simple and non-invasive method of providing diagnosis for cardiovascular and respiratory disease. Such diagnosis often requires high-quality heart and lung sounds to be captured during auscultation. However, in most cases, obtaining such high-quality sounds is non-trivial due to the chest sounds containing a mixture of heart, lung, and noise sounds. As such, addition… ▽ More Auscultation for neonates is a simple and non-invasive method of providing diagnosis for cardiovascular and respiratory disease. Such diagnosis often requires high-quality heart and lung sounds to be captured during auscultation. However, in most cases, obtaining such high-quality sounds is non-trivial due to the chest sounds containing a mixture of heart, lung, and noise sounds. As such, additional preprocessing is needed to separate the chest sounds into heart and lung sounds. This paper proposes a novel deep-learning approach to separate such chest sounds into heart and lung sounds. Inspired by the Conv-TasNet model, the proposed model has an encoder, decoder, and mask generator. The encoder consists of a 1D convolution model and the decoder consists of a transposed 1D convolution. The mask generator is constructed using stacked 1D convolutions and transformers. The proposed model outperforms previous methods in terms of objective distortion measures by 2.01 dB to 5.06 dB in the artificial dataset, as well as computation time, with at least a 17-time improvement. Therefore, our proposed model could be a suitable preprocessing step for any phonocardiogram-based health monitoring system. △ Less

Submitted 25 October, 2023; originally announced October 2023.

arXiv:2310.07284 [pdf, other]

Ty** to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

Authors: Xiang Hao, Jibin Wu, Jianwei Yu, Chenglin Xu, Kay Chen Tan

Abstract: Humans possess an extraordinary ability to selectively focus on the sound source of interest amidst complex acoustic environments, commonly referred to as cocktail party scenarios. In an attempt to replicate this remarkable auditory attention capability in machines, target speaker extraction (TSE) models have been developed. These models leverage the pre-registered cues of the target speaker to ex… ▽ More Humans possess an extraordinary ability to selectively focus on the sound source of interest amidst complex acoustic environments, commonly referred to as cocktail party scenarios. In an attempt to replicate this remarkable auditory attention capability in machines, target speaker extraction (TSE) models have been developed. These models leverage the pre-registered cues of the target speaker to extract the sound source of interest. However, the effectiveness of these models is hindered in real-world scenarios due to the unreliable or even absence of pre-registered cues. To address this limitation, this study investigates the integration of natural language description to enhance the feasibility, controllability, and performance of existing TSE models. Specifically, we propose a model named LLM-TSE, wherein a large language model (LLM) extracts useful semantic cues from the user's typed text input. These cues can serve as independent extraction cues, task selectors to control the TSE process or complement the pre-registered cues. Our experimental results demonstrate competitive performance when only text-based cues are presented, the effectiveness of using input text as a task selector, and a new state-of-the-art when combining text-based cues with pre-registered cues. To our knowledge, this is the first study to successfully incorporate LLMs to guide target speaker extraction, which can be a cornerstone for cocktail party problem research. △ Less

Submitted 14 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: Under review, https://github.com/haoxiangsnr/llm-tse

arXiv:2305.19557 [pdf, other]

Dictionary Learning under Symmetries via Group Representations

Authors: Subhroshekhar Ghosh, Aaron Y. R. Low, Yong Sheng Soh, Zhuohang Feng, Brendan K. Y. Tan

Abstract: The dictionary learning problem can be viewed as a data-driven process to learn a suitable transformation so that data is sparsely represented directly from example data. In this paper, we examine the problem of learning a dictionary that is invariant under a pre-specified group of transformations. Natural settings include Cryo-EM, multi-object tracking, synchronization, pose estimation, etc. We s… ▽ More The dictionary learning problem can be viewed as a data-driven process to learn a suitable transformation so that data is sparsely represented directly from example data. In this paper, we examine the problem of learning a dictionary that is invariant under a pre-specified group of transformations. Natural settings include Cryo-EM, multi-object tracking, synchronization, pose estimation, etc. We specifically study this problem under the lens of mathematical representation theory. Leveraging the power of non-abelian Fourier analysis for functions over compact groups, we prescribe an algorithmic recipe for learning dictionaries that obey such invariances. We relate the dictionary learning problem in the physical domain, which is naturally modelled as being infinite dimensional, with the associated computational problem, which is necessarily finite dimensional. We establish that the dictionary learning problem can be effectively understood as an optimization instance over certain matrix orbitopes having a particular block-diagonal structure governed by the irreducible representations of the group of symmetries. This perspective enables us to introduce a band-limiting procedure which obtains dimensionality reduction in applications. We provide guarantees for our computational ansatz to provide a desirable dictionary learning outcome. We apply our paradigm to investigate the dictionary learning problem for the groups SO(2) and SO(3). While the SO(2)-orbitope admits an exact spectrahedral description, substantially less is understood about the SO(3)-orbitope. We describe a tractable spectrahedral outer approximation of the SO(3)-orbitope, and contribute an alternating minimization paradigm to perform optimization in this setting. We provide numerical experiments to highlight the efficacy of our approach in learning SO(3)-invariant dictionaries, both on synthetic and on real world data. △ Less

Submitted 25 July, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

Comments: 29 pages, 2 figures

arXiv:2304.01448 [pdf, other]

TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio

Authors: Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, Buye Xu

Abstract: Measuring quality and intelligibility of a speech signal is usually a critical step in development of speech processing systems. To enable this, a variety of metrics to measure quality and intelligibility under different assumptions have been developed. Through this paper, we introduce tools and a set of models to estimate such known metrics using deep neural networks. These models are made availa… ▽ More Measuring quality and intelligibility of a speech signal is usually a critical step in development of speech processing systems. To enable this, a variety of metrics to measure quality and intelligibility under different assumptions have been developed. Through this paper, we introduce tools and a set of models to estimate such known metrics using deep neural networks. These models are made available in the well-established TorchAudio library, the core audio and speech processing library within the PyTorch deep learning framework. We refer to it as TorchAudio-Squim, TorchAudio-Speech QUality and Intelligibility Measures. More specifically, in the current version of TorchAudio-squim, we establish and release models for estimating PESQ, STOI and SI-SDR among objective metrics and MOS among subjective metrics. We develop a novel approach for objective metric estimation and use a recently developed approach for subjective metric estimation. These models operate in a ``reference-less" manner, that is they do not require the corresponding clean speech as reference for speech assessment. Given the unavailability of clean speech and the effortful process of subjective evaluation in real-world situations, such easy-to-use tools would greatly benefit speech processing research and development. △ Less

Submitted 3 April, 2023; originally announced April 2023.

Comments: ICASSP 2023

arXiv:2301.04320 [pdf, other]

Rethinking complex-valued deep neural networks for monaural speech enhancement

Authors: Haibin Wu, Ke Tan, Buye Xu, Anurag Kumar, Daniel Wong

Abstract: Despite multiple efforts made towards adopting complex-valued deep neural networks (DNNs), it remains an open question whether complex-valued DNNs are generally more effective than real-valued DNNs for monaural speech enhancement. This work is devoted to presenting a critical assessment by systematically examining complex-valued DNNs against their real-valued counterparts. Specifically, we investi… ▽ More Despite multiple efforts made towards adopting complex-valued deep neural networks (DNNs), it remains an open question whether complex-valued DNNs are generally more effective than real-valued DNNs for monaural speech enhancement. This work is devoted to presenting a critical assessment by systematically examining complex-valued DNNs against their real-valued counterparts. Specifically, we investigate complex-valued DNN atomic units, including linear layers, convolutional layers, long short-term memory (LSTM), and gated linear units. By comparing complex- and real-valued versions of fundamental building blocks in the recently developed gated convolutional recurrent network (GCRN), we show how different mechanisms for basic blocks affect the performance. We also find that the use of complex-valued operations hinders the model capacity when the model size is small. In addition, we examine two recent complex-valued DNNs, i.e. deep complex convolutional recurrent network (DCCRN) and deep complex U-Net (DCUNET). Evaluation results show that both DNNs produce identical performance to their real-valued counterparts while requiring much more computation. Based on these comprehensive comparisons, we conclude that complex-valued DNNs do not provide a performance gain over their real-valued counterparts for monaural speech enhancement, and thus are less desirable due to their higher computational costs. △ Less

Submitted 11 January, 2023; originally announced January 2023.

arXiv:2211.08624 [pdf, ps, other]

Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral Map** for Single-channel Speech Enhancement

Authors: Kuan-Lin Chen, Daniel D. E. Wong, Ke Tan, Buye Xu, Anurag Kumar, Vamsi Krishna Ithapu

Abstract: Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral map** with a tempora… ▽ More Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral map** with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin. Due to unrestricted heteroscedastic uncertainty, the covariance introduces an undersampling effect, detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component with their uncertainty, effectively compensating severely undersampled components with more penalties. Our multivariate setting reveals common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular loss functions including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR). △ Less

Submitted 8 March, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: 5 pages. Accepted at ICASSP 2023

arXiv:2211.04339 [pdf, other]

Toward Adaptive Semantic Communications: Efficient Data Transmission via Online Learned Nonlinear Transform Source-Channel Coding

Authors: **cheng Dai, Sixian Wang, Ke Yang, Kailin Tan, Xiaoqi Qin, Zhongwei Si, Kai Niu, ** Zhang

Abstract: The emerging field semantic communication is driving the research of end-to-end data transmission. By utilizing the powerful representation ability of deep learning models, learned data transmission schemes have exhibited superior performance than the established source and channel coding methods. While, so far, research efforts mainly concentrated on architecture and model improvements toward a s… ▽ More The emerging field semantic communication is driving the research of end-to-end data transmission. By utilizing the powerful representation ability of deep learning models, learned data transmission schemes have exhibited superior performance than the established source and channel coding methods. While, so far, research efforts mainly concentrated on architecture and model improvements toward a static target domain. Despite their successes, such learned models are still suboptimal due to the limitations in model capacity and imperfect optimization and generalization, particularly when the testing data distribution or channel response is different from that adopted for model training, as is likely to be the case in real-world. To tackle this, we propose a novel online learned joint source and channel coding approach that leverages the deep learning model's overfitting property. Specifically, we update the off-the-shelf pre-trained models after deployment in a lightweight online fashion to adapt to the distribution shifts in source data and environment domain. We take the overfitting concept to the extreme, proposing a series of implementation-friendly methods to adapt the codec model or representations to an individual data or channel state instance, which can further lead to substantial gains in terms of the bandwidth ratio-distortion performance. The proposed methods enable the communication-efficient adaptation for all parameters in the network without sacrificing decoding speed. Our experiments, including user study, on continually changing target source data and wireless channel environments, demonstrate the effectiveness and efficiency of our approach, on which we outperform existing state-of-the-art engineered transmission scheme (VVC combined with 5G LDPC coded transmission). △ Less

Submitted 24 May, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

Comments: Accepted by IEEE JSAC

arXiv:2210.06664 [pdf]

Are Macula or Optic Nerve Head Structures better at Diagnosing Glaucoma? An Answer using AI and Wide-Field Optical Coherence Tomography

Authors: Charis Y. N. Chiang, Fabian Braeu, Thanadet Chuangsuwanich, Royston K. Y. Tan, Jacqueline Chua, Leopold Schmetterer, Alexandre Thiery, Martin Buist, Michaël J. A. Girard

Abstract: Purpose: (1) To develop a deep learning algorithm to automatically segment structures of the optic nerve head (ONH) and macula in 3D wide-field optical coherence tomography (OCT) scans; (2) To assess whether 3D macula or ONH structures (or the combination of both) provide the best diagnostic power for glaucoma. Methods: A cross-sectional comparative study was performed which included wide-field sw… ▽ More Purpose: (1) To develop a deep learning algorithm to automatically segment structures of the optic nerve head (ONH) and macula in 3D wide-field optical coherence tomography (OCT) scans; (2) To assess whether 3D macula or ONH structures (or the combination of both) provide the best diagnostic power for glaucoma. Methods: A cross-sectional comparative study was performed which included wide-field swept-source OCT scans from 319 glaucoma subjects and 298 non-glaucoma subjects. All scans were compensated to improve deep-tissue visibility. We developed a deep learning algorithm to automatically label all major ONH tissue structures by using 270 manually annotated B-scans for training. The performance of our algorithm was assessed using the Dice coefficient (DC). A glaucoma classification algorithm (3D CNN) was then designed using a combination of 500 OCT volumes and their corresponding automatically segmented masks. This algorithm was trained and tested on 3 datasets: OCT scans cropped to contain the macular tissues only, those to contain the ONH tissues only, and the full wide-field OCT scans. The classification performance for each dataset was reported using the AUC. Results: Our segmentation algorithm was able to segment ONH and macular tissues with a DC of 0.94 $\pm$ 0.003. The classification algorithm was best able to diagnose glaucoma using wide-field 3D-OCT volumes with an AUC of 0.99 $\pm$ 0.01, followed by ONH volumes with an AUC of 0.93 $\pm$ 0.06, and finally macular volumes with an AUC of 0.91 $\pm$ 0.11. Conclusions: this study showed that using wide-field OCT as compared to the typical OCT images containing just the ONH or macular may allow for a significantly improved glaucoma diagnosis. This may encourage the mainstream adoption of 3D wide-field OCT scans. For clinical AI studies that use traditional machines, we would recommend the use of ONH scans as opposed to macula scans. △ Less

Submitted 12 October, 2022; originally announced October 2022.

Comments: 23 pages, 5 figures

arXiv:2201.10105 [pdf, other]

doi 10.1109/EMBC48229.2022.9871449

Prediction of Neonatal Respiratory Distress in Term Babies at Birth from Digital Stethoscope Recorded Chest Sounds

Authors: Ethan Grooby, Chiranjibi Sitaula, Kenneth Tan, Lindsay Zhou, Arrabella King, Ashwin Ramanathan, Atul Malhotra, Guy A. Dumont, Faezeh Marzbanrad

Abstract: Neonatal respiratory distress is a common condition that if left untreated, can lead to short- and long-term complications. This paper investigates the usage of digital stethoscope recorded chest sounds taken within 1min post-delivery, to enable early detection and prediction of neonatal respiratory distress. Fifty-one term newborns were included in this study, 9 of whom developed respiratory dist… ▽ More Neonatal respiratory distress is a common condition that if left untreated, can lead to short- and long-term complications. This paper investigates the usage of digital stethoscope recorded chest sounds taken within 1min post-delivery, to enable early detection and prediction of neonatal respiratory distress. Fifty-one term newborns were included in this study, 9 of whom developed respiratory distress. For each newborn, 1min anterior and posterior recordings were taken. These recordings were pre-processed to remove noisy segments and obtain high-quality heart and lung sounds. The random undersampling boosting (RUSBoost) classifier was then trained on a variety of features, such as power and vital sign features extracted from the heart and lung sounds. The RUSBoost algorithm produced specificity, sensitivity, and accuracy results of 85.0%, 66.7% and 81.8%, respectively. △ Less

Submitted 25 January, 2022; originally announced January 2022.

Comments: 4 pages, 2 figures, 1 table. Paper submitted for potential publication as conference paper at 44th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2022

Journal ref: 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)

arXiv:2201.03211 [pdf, other]

doi 10.1109/JBHI.2022.3215995

Noisy Neonatal Chest Sound Separation for High-Quality Heart and Lung Sounds

Authors: Ethan Grooby, Chiranjibi Sitaula, Davood Fattahi, Reza Sameni, Kenneth Tan, Lindsay Zhou, Arrabella King, Ashwin Ramanathan, Atul Malhotra, Guy A. Dumont, Faezeh Marzbanrad

Abstract: Stethoscope-recorded chest sounds provide the opportunity for remote cardio-respiratory health monitoring of neonates. However, reliable monitoring requires high-quality heart and lung sounds. This paper presents novel Non-negative Matrix Factorisation (NMF) and Non-negative Matrix Co-Factorisation (NMCF) methods for neonatal chest sound separation. To assess these methods and compare with existin… ▽ More Stethoscope-recorded chest sounds provide the opportunity for remote cardio-respiratory health monitoring of neonates. However, reliable monitoring requires high-quality heart and lung sounds. This paper presents novel Non-negative Matrix Factorisation (NMF) and Non-negative Matrix Co-Factorisation (NMCF) methods for neonatal chest sound separation. To assess these methods and compare with existing single-source separation methods, an artificial mixture dataset was generated comprising of heart, lung and noise sounds. Signal-to-noise ratios were then calculated for these artificial mixtures. These methods were also tested on real-world noisy neonatal chest sounds and assessed based on vital sign estimation error and a signal quality score of 1-5 developed in our previous works. Additionally, the computational cost of all methods was assessed to determine the applicability for real-time processing. Overall, both the proposed NMF and NMCF methods outperform the next best existing method by 2.7dB to 11.6dB for the artificial dataset and 0.40 to 1.12 signal quality improvement for the real-world dataset. The median processing time for the sound separation of a 10s recording was found to be 28.3s for NMCF and 342ms for NMF. Because of stable and robust performance, we believe that our proposed methods are useful to denoise neonatal heart and lung sound in a real-world environment. Codes for proposed and existing methods can be found at: https://github.com/egrooby-monash/Heart-and-Lung-Sound-Separation. △ Less

Submitted 10 January, 2022; originally announced January 2022.

Comments: 12 pages, 4 figures, 3 tables. Paper submitted and under review for possible publication in IEEE

Journal ref: IEEE Journal of Biomedical and Health Informatics, 2022

arXiv:2111.01928 [pdf, other]

doi 10.1145/3501710.3519541

Verifying Switched System Stability With Logic

Authors: Yong Kiam Tan, Stefan Mitsch, André Platzer

Abstract: Switched systems are known to exhibit subtle (in)stability behaviors requiring system designers to carefully analyze the stability of closed-loop systems that arise from their proposed switching control laws. This paper presents a formal approach for verifying switched system stability that blends classical ideas from the controls and verification literature using differential dynamic logic (dL),… ▽ More Switched systems are known to exhibit subtle (in)stability behaviors requiring system designers to carefully analyze the stability of closed-loop systems that arise from their proposed switching control laws. This paper presents a formal approach for verifying switched system stability that blends classical ideas from the controls and verification literature using differential dynamic logic (dL), a logic for deductive verification of hybrid systems. From controls, we use standard stability notions for various classes of switching mechanisms and their corresponding Lyapunov function-based analysis techniques. From verification, we use dL's ability to verify quantified properties of hybrid systems and dL models of switched systems as loo** hybrid programs whose stability can be formally specified and proven by finding appropriate loop invariants, i.e., properties that are preserved across each loop iteration. This blend of ideas enables a trustworthy implementation of switched system stability verification in the KeYmaera X prover based on dL. For standard classes of switching mechanisms, the implementation provides fully automated stability proofs, including searching for suitable Lyapunov functions. Moreover, the generality of the deductive approach also enables verification of switching control laws that require non-standard stability arguments through the design of loop invariants that suitably express specific intuitions behind those control laws. This flexibility is demonstrated on three case studies: a model for longitudinal flight control by Branicky, an automatic cruise controller, and Brockett's nonholonomic integrator. △ Less

Submitted 8 April, 2022; v1 submitted 2 November, 2021; originally announced November 2021.

Comments: Long version of paper at HSCC 2022 (25th ACM International Conference on Hybrid Systems: Computation and Control, May 4-6, 2022)

MSC Class: 03B70; 34A38; 93C30 ACM Class: F.3.1; F.4.1; G.1.7; I.2.3

arXiv:2110.08664 [pdf, other]

Finding Critical Scenarios for Automated Driving Systems: A Systematic Literature Review

Authors: Xinhai Zhang, Jianbo Tao, Kaige Tan, Martin Törngren, José Manuel Gaspar Sánchez, Muhammad Rusyadi Ramli, Xin Tao, Magnus Gyllenhammar, Franz Wotawa, Naveen Mohan, Mihai Nica, Hermann Felbinger

Abstract: Scenario-based approaches have been receiving a huge amount of attention in research and engineering of automated driving systems. Due to the complexity and uncertainty of the driving environment, and the complexity of the driving task itself, the number of possible driving scenarios that an ADS or ADAS may encounter is virtually infinite. Therefore it is essential to be able to reason about the i… ▽ More Scenario-based approaches have been receiving a huge amount of attention in research and engineering of automated driving systems. Due to the complexity and uncertainty of the driving environment, and the complexity of the driving task itself, the number of possible driving scenarios that an ADS or ADAS may encounter is virtually infinite. Therefore it is essential to be able to reason about the identification of scenarios and in particular critical ones that may impose unacceptable risk if not considered. Critical scenarios are particularly important to support design, verification and validation efforts, and as a basis for a safety case. In this paper, we present the results of a systematic literature review in the context of autonomous driving. The main contributions are: (i) introducing a comprehensive taxonomy for critical scenario identification methods; (ii) giving an overview of the state-of-the-art research based on the taxonomy encompassing 86 papers between 2017 and 2020; and (iii) identifying open issues and directions for further research. The provided taxonomy comprises three main perspectives encompassing the problem definition (the why), the solution (the methods to derive scenarios), and the assessment of the established scenarios. In addition, we discuss open research issues considering the perspectives of coverage, practicability, and scenario space explosion. △ Less

Submitted 16 October, 2021; originally announced October 2021.

Comments: 37 pages, 24 figures

arXiv:2110.04289 [pdf, other]

Location-based training for multi-channel talker-independent speaker separation

Authors: Hassan Taherian, Ke Tan, DeLiang Wang

Abstract: Permutation-invariant training (PIT) is a dominant approach for addressing the permutation ambiguity problem in talker-independent speaker separation. Leveraging spatial information afforded by microphone arrays, we propose a new training approach to resolving permutation ambiguities for multi-channel speaker separation. The proposed approach, named location-based training (LBT), assigns speakers… ▽ More Permutation-invariant training (PIT) is a dominant approach for addressing the permutation ambiguity problem in talker-independent speaker separation. Leveraging spatial information afforded by microphone arrays, we propose a new training approach to resolving permutation ambiguities for multi-channel speaker separation. The proposed approach, named location-based training (LBT), assigns speakers on the basis of their spatial locations. This training strategy is easy to apply, and organizes speakers according to their positions in physical space. Specifically, this study investigates azimuth angles and source distances for location-based training. Evaluation results on separating two- and three-speaker mixtures show that azimuth-based training consistently outperforms PIT, and distance-based training further improves the separation performance when speaker azimuths are close. Furthermore, we dynamically select azimuth-based or distance-based training by estimating the azimuths of separated speakers, which further improves separation performance. LBT has a linear training complexity with respect to the number of speakers, as opposed to the factorial complexity of PIT. We further demonstrate the effectiveness of LBT for the separation of four and five concurrent speakers. △ Less

Submitted 8 October, 2021; originally announced October 2021.

Comments: submitted to ICASSP 22

arXiv:2109.15127 [pdf, other]

doi 10.1109/ACCESS.2022.3144355

Real-Time Multi-Level Neonatal Heart and Lung Sound Quality Assessment for Telehealth Applications

Authors: Ethan Grooby, Chiranjibi Sitaula, Davood Fattahi, Reza Sameni, Kenneth Tan, Lindsay Zhou, Arrabella King, Ashwin Ramanathan, Atul Malhotra, Guy A. Dumont, Faezeh Marzbanrad

Abstract: Digital stethoscopes in combination with telehealth allow chest sounds to be easily collected and transmitted for remote monitoring and diagnosis. Chest sounds contain important information about a newborn's cardio-respiratory health. However, low-quality recordings complicate the remote monitoring and diagnosis. In this study, a new method is proposed to objectively and automatically assess heart… ▽ More Digital stethoscopes in combination with telehealth allow chest sounds to be easily collected and transmitted for remote monitoring and diagnosis. Chest sounds contain important information about a newborn's cardio-respiratory health. However, low-quality recordings complicate the remote monitoring and diagnosis. In this study, a new method is proposed to objectively and automatically assess heart and lung signal quality on a 5-level scale in real-time and to assess the effect of signal quality on vital sign estimation. For the evaluation, a total of 207 10s long chest sounds were taken from 119 preterm and full-term babies. Thirty of the recordings from ten subjects were obtained with synchronous vital signs from the Neonatal Intensive Care Unit (NICU) based on electrocardiogram recordings. As reference, seven annotators independently assessed the signal quality. For automatic quality classification, 400 features were extracted from the chest sounds. After feature selection using minimum redundancy and maximum relevancy algorithm, class balancing, and hyper-parameter optimization, a variety of multi-class and ordinal classification and regression algorithms were trained. Then, heart rate and breathing rate were automatically estimated from the chest sounds using adapted pre-existing methods. The results of subject-wise leave-one-out cross-validation show that the best-performing models had a mean squared error (MSE) of 0.49 and 0.61, and balanced accuracy of 57% and 51% for heart and lung qualities, respectively. The best-performing models for real-time analysis (<200ms) had MSE of 0.459 and 0.67, and balanced accuracy of 57% and 46%, respectively. Our experimental results underscore that increasing the signal quality leads to a reduction in vital sign error, with only high-quality recordings having a mean absolute error of less than 5 beats per minute, as required for clinical usage. △ Less

Submitted 28 September, 2021; originally announced September 2021.

Comments: 13 pages, 8 figures, 3 tables. Paper submitted and under review in IEEE Access

Journal ref: IEEE Access, 2022

arXiv:2101.06195 [pdf, other]

doi 10.1016/j.ifacol.2021.08.506

Switched Systems as Hybrid Programs

Authors: Yong Kiam Tan, André Platzer

Abstract: Real world systems of interest often feature interactions between discrete and continuous dynamics. Various hybrid system formalisms have been used to model and analyze this combination of dynamics, ranging from mathematical descriptions, e.g., using impulsive differential equations and switching, to automata-theoretic and language-based approaches. This paper bridges two such formalisms by showin… ▽ More Real world systems of interest often feature interactions between discrete and continuous dynamics. Various hybrid system formalisms have been used to model and analyze this combination of dynamics, ranging from mathematical descriptions, e.g., using impulsive differential equations and switching, to automata-theoretic and language-based approaches. This paper bridges two such formalisms by showing how various classes of switched systems can be modeled using the language of hybrid programs from differential dynamic logic (dL). The resulting models enable the formal specification and verification of switched systems using dL and its existing deductive verification tools such as KeYmaera X. Switched systems also provide a natural avenue for the generalization of dL's deductive proof theory for differential equations. The completeness results for switched system invariants proved in this paper enable effective safety verification of those systems in dL. △ Less

Submitted 29 April, 2021; v1 submitted 15 January, 2021; originally announced January 2021.

Comments: Long version of paper at ADHS 2021 (7th IFAC Conference on Analysis and Design of Hybrid Systems, July 7-9, 2021)

MSC Class: 03B70; 34A38; 93C30 ACM Class: F.3.1; F.4.1; G.1.7; I.2.3

arXiv:2009.01381 [pdf, other]

doi 10.1109/LSP.2020.3043977

SAGRNN: Self-Attentive Gated RNN for Binaural Speaker Separation with Interaural Cue Preservation

Authors: Ke Tan, Buye Xu, Anurag Kumar, Eliya Nachmani, Yossi Adi

Abstract: Most existing deep learning based binaural speaker separation systems focus on producing a monaural estimate for each of the target speakers, and thus do not preserve the interaural cues, which are crucial for human listeners to perform sound localization and lateralization. In this study, we address talker-independent binaural speaker separation with interaural cues preserved in the estimated bin… ▽ More Most existing deep learning based binaural speaker separation systems focus on producing a monaural estimate for each of the target speakers, and thus do not preserve the interaural cues, which are crucial for human listeners to perform sound localization and lateralization. In this study, we address talker-independent binaural speaker separation with interaural cues preserved in the estimated binaural signals. Specifically, we extend a newly-developed gated recurrent neural network for monaural separation by additionally incorporating self-attention mechanisms and dense connectivity. We develop an end-to-end multiple-input multiple-output system, which directly maps from the binaural waveform of the mixture to those of the speech signals. The experimental results show that our proposed approach achieves significantly better separation performance than a recent binaural separation approach. In addition, our approach effectively preserves the interaural cues, which improves the accuracy of sound localization. △ Less

Submitted 14 November, 2020; v1 submitted 2 September, 2020; originally announced September 2020.

Comments: 5 pages, accepted by IEEE Signal Processing Letters

arXiv:2008.08791 [pdf, other]

Facial movement synergies and Action Unit detection from distal wearable Electromyography and Computer Vision

Authors: Monica Perusquia-Hernandez, Felix Dollack, Chun Kwang Tan, Shushi Namba, Saho Ayabe-Kanamura, Kenji Suzuki

Abstract: Distal facial Electromyography (EMG) can be used to detect smiles and frowns with reasonable accuracy. It capitalizes on volume conduction to detect relevant muscle activity, even when the electrodes are not placed directly on the source muscle. The main advantage of this method is to prevent occlusion and obstruction of the facial expression production, whilst allowing EMG measurements. However,… ▽ More Distal facial Electromyography (EMG) can be used to detect smiles and frowns with reasonable accuracy. It capitalizes on volume conduction to detect relevant muscle activity, even when the electrodes are not placed directly on the source muscle. The main advantage of this method is to prevent occlusion and obstruction of the facial expression production, whilst allowing EMG measurements. However, measuring EMG distally entails that the exact source of the facial movement is unknown. We propose a novel method to estimate specific Facial Action Units (AUs) from distal facial EMG and Computer Vision (CV). This method is based on Independent Component Analysis (ICA), Non-Negative Matrix Factorization (NNMF), and sorting of the resulting components to determine which is the most likely to correspond to each CV-labeled action unit (AU). Performance on the detection of AU06 (Orbicularis Oculi) and AU12 (Zygomaticus Major) was estimated by calculating the agreement with Human Coders. The results of our proposed algorithm showed an accuracy of 81% and a Cohen's Kappa of 0.49 for AU6; and accuracy of 82% and a Cohen's Kappa of 0.53 for AU12. This demonstrates the potential of distal EMG to detect individual facial movements. Using this multimodal method, several AU synergies were identified. We quantified the co-occurrence and timing of AU6 and AU12 in posed and spontaneous smiles using the human-coded labels, and for comparison, using the continuous CV-labels. The co-occurrence analysis was also performed on the EMG-based labels to uncover the relationship between muscle synergies and the kinematics of visible facial movement. △ Less

Submitted 20 August, 2020; originally announced August 2020.

Comments: 11 pages, 11 figures, 2 tables

arXiv:2004.12599 [pdf, other]

Deploying Image Deblurring across Mobile Devices: A Perspective of Quality and Latency

Authors: Cheng-Ming Chiang, Yu Tseng, Yu-Syuan Xu, Hsien-Kai Kuo, Yi-Min Tsai, Guan-Yu Chen, Koan-Sin Tan, Wei-Ting Wang, Yu-Chieh Lin, Shou-Yao Roy Tseng, Wei-Shiang Lin, Chia-Lin Yu, BY Shen, Kloze Kao, Chia-Ming Cheng, Hung-Jen Chen

Abstract: Recently, image enhancement and restoration have become important applications on mobile devices, such as super-resolution and image deblurring. However, most state-of-the-art networks present extremely high computational complexity. This makes them difficult to be deployed on mobile devices with acceptable latency. Moreover, when deploying to different mobile devices, there is a large latency var… ▽ More Recently, image enhancement and restoration have become important applications on mobile devices, such as super-resolution and image deblurring. However, most state-of-the-art networks present extremely high computational complexity. This makes them difficult to be deployed on mobile devices with acceptable latency. Moreover, when deploying to different mobile devices, there is a large latency variation due to the difference and limitation of deep learning accelerators on mobile devices. In this paper, we conduct a search of portable network architectures for better quality-latency trade-off across mobile devices. We further present the effectiveness of widely used network optimizations for image deblurring task. This paper provides comprehensive experiments and comparisons to uncover the in-depth analysis for both latency and image quality. Through all the above works, we demonstrate the successful deployment of image deblurring application on mobile devices with the acceleration of deep learning accelerators. To the best of our knowledge, this is the first paper that addresses all the deployment issues of image deblurring task across mobile devices. This paper provides practical deployment-guidelines, and is adopted by the championship-winning team in NTIRE 2020 Image Deblurring Challenge on Smartphone Track. △ Less

Submitted 27 April, 2020; originally announced April 2020.

Comments: CVPR 2020 Workshop on New Trends in Image Restoration and Enhancement (NTIRE)

arXiv:1911.06294 [pdf, other]

Deep Reinforcement Learning for Adaptive Traffic Signal Control

Authors: Kai Liang Tan, Subhadipto Poddar, Anuj Sharma, Soumik Sarkar

Abstract: Many existing traffic signal controllers are either simple adaptive controllers based on sensors placed around traffic intersections, or optimized by traffic engineers on a fixed schedule. Optimizing traffic controllers is time consuming and usually require experienced traffic engineers. Recent research has demonstrated the potential of using deep reinforcement learning (DRL) in this context. Howe… ▽ More Many existing traffic signal controllers are either simple adaptive controllers based on sensors placed around traffic intersections, or optimized by traffic engineers on a fixed schedule. Optimizing traffic controllers is time consuming and usually require experienced traffic engineers. Recent research has demonstrated the potential of using deep reinforcement learning (DRL) in this context. However, most of the studies do not consider realistic settings that could seamlessly transition into deployment. In this paper, we propose a DRL-based adaptive traffic signal control framework that explicitly considers realistic traffic scenarios, sensors, and physical constraints. In this framework, we also propose a novel reward function that shows significantly improved traffic performance compared to the typical baseline pre-timed and fully-actuated traffic signals controllers. The framework is implemented and validated on a simulation platform emulating real-life traffic scenarios and sensor data streams. △ Less

Submitted 14 November, 2019; originally announced November 2019.

Comments: ASME 2019 Dynamic Systems and Control Conference (DSCC), October 9-11, Park City, Utah, USA

arXiv:1910.13042 [pdf, other]

doi 10.1016/j.compmedimag.2021.101866

Deep Multi-Magnification Networks for Multi-Class Breast Cancer Image Segmentation

Authors: David Joon Ho, Dig V. K. Yarlagadda, Timothy M. D'Alfonso, Matthew G. Hanna, Anne Grabenstetter, Peter Ntiamoah, Edi Brogi, Lee K. Tan, Thomas J. Fuchs

Abstract: Pathologic analysis of surgical excision specimens for breast carcinoma is important to evaluate the completeness of surgical excision and has implications for future treatment. This analysis is performed manually by pathologists reviewing histologic slides prepared from formalin-fixed tissue. In this paper, we present Deep Multi-Magnification Network trained by partial annotation for automated mu… ▽ More Pathologic analysis of surgical excision specimens for breast carcinoma is important to evaluate the completeness of surgical excision and has implications for future treatment. This analysis is performed manually by pathologists reviewing histologic slides prepared from formalin-fixed tissue. In this paper, we present Deep Multi-Magnification Network trained by partial annotation for automated multi-class tissue segmentation by a set of patches from multiple magnifications in digitized whole slide images. Our proposed architecture with multi-encoder, multi-decoder, and multi-concatenation outperforms other single and multi-magnification-based architectures by achieving the highest mean intersection-over-union, and can be used to facilitate pathologists' assessments of breast cancer. △ Less

Submitted 4 January, 2021; v1 submitted 28 October, 2019; originally announced October 2019.

Comments: Accepted at Computerized Medical Imaging and Graphics

arXiv:1909.07352 [pdf, other]

doi 10.1109/JSTSP.2020.2987209

Audio-Visual Speech Separation and Dereverberation with a Two-Stage Multimodal Network

Authors: Ke Tan, Yong Xu, Shi-Xiong Zhang, Meng Yu, Dong Yu

Abstract: Background noise, interfering speech and room reverberation frequently distort target speech in real listening environments. In this study, we address joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation. In order to tackle this fundamentally difficult problem, we propose a novel multimodal network that e… ▽ More Background noise, interfering speech and room reverberation frequently distort target speech in real listening environments. In this study, we address joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation. In order to tackle this fundamentally difficult problem, we propose a novel multimodal network that exploits both audio and visual signals. The proposed network architecture adopts a two-stage strategy, where a separation module is employed to attenuate background noise and interfering speech in the first stage and a dereverberation module to suppress room reverberation in the second stage. The two modules are first trained separately, and then integrated for joint training, which is based on a new multi-objective loss function. Our experimental results show that the proposed multimodal network yields consistently better objective intelligibility and perceptual quality than several one-stage and two-stage baselines. We find that our network achieves a 21.10% improvement in ESTOI and a 0.79 improvement in PESQ over the unprocessed mixtures. Moreover, our network architecture does not require the knowledge of the number of speakers. △ Less

Submitted 10 April, 2020; v1 submitted 16 September, 2019; originally announced September 2019.

Comments: 13 pages, accepted by IEEE JSTSP Special Issue on Deep Learning for Multi-modal Intelligence across Speech, Language, Vision, and Heterogeneous Signals

arXiv:1906.05372 [pdf, other]

The Herbarium Challenge 2019 Dataset

Authors: Kiat Chuan Tan, Yulong Liu, Barbara Ambrose, Melissa Tulig, Serge Belongie

Abstract: Herbarium sheets are invaluable for botanical research, and considerable time and effort is spent by experts to label and identify specimens on them. In view of recent advances in computer vision and deep learning, develo** an automated approach to help experts identify specimens could significantly accelerate research in this area. Whereas most existing botanical datasets comprise photos of spe… ▽ More Herbarium sheets are invaluable for botanical research, and considerable time and effort is spent by experts to label and identify specimens on them. In view of recent advances in computer vision and deep learning, develo** an automated approach to help experts identify specimens could significantly accelerate research in this area. Whereas most existing botanical datasets comprise photos of specimens in the wild, herbarium sheets exhibit dried specimens, which poses new challenges. We present a challenge dataset of herbarium sheet images labeled by experts, with the intent of facilitating the development of automated identification techniques for this challenging scenario. △ Less

Submitted 15 June, 2019; v1 submitted 12 June, 2019; originally announced June 2019.

Comments: Part of the 6th Fine-Grained Visual Categorization Workshop (FGVC6) at CVPR 2019. Dataset available at https://github.com/visipedia/herbarium_comp

arXiv:1903.05073 [pdf, other]

doi 10.1109/LRA.2019.2923099

A Formal Safety Net for Waypoint Following in Ground Robots

Authors: Brandon Bohrer, Yong Kiam Tan, Stefan Mitsch, Andrew Sogokon, André Platzer

Abstract: We present a reusable formally verified safety net that provides end-to-end safety and liveness guarantees for 2D waypoint-following of Dubins-type ground robots with tolerances and acceleration. We: i) Model a robot in differential dynamic logic (dL), and specify assumptions on the controller and robot kinematics, ii) Prove formal safety and liveness properties for waypoint-following with speed l… ▽ More We present a reusable formally verified safety net that provides end-to-end safety and liveness guarantees for 2D waypoint-following of Dubins-type ground robots with tolerances and acceleration. We: i) Model a robot in differential dynamic logic (dL), and specify assumptions on the controller and robot kinematics, ii) Prove formal safety and liveness properties for waypoint-following with speed limits, iii) Synthesize a monitor, which is automatically proven to enforce model compliance at runtime, and iv) Our use of the VeriPhy toolchain makes these guarantees carry over down to the level of machine code with untrusted controllers, environments, and plans. The guarantees for the safety net apply to any robot as long as the waypoints are chosen safely and the physical assumptions in its model hold. Experiments show these assumptions hold in practice, with an inherent trade-off between compliance and performance. △ Less

Submitted 18 June, 2019; v1 submitted 12 March, 2019; originally announced March 2019.

MSC Class: 03B70; 34A38; 68Q60; 93C85 ACM Class: I.2.9; D.2.4; F.3.1; C.3

arXiv:1903.04567 [pdf, ps, other]

Bridging the Gap Between Monaural Speech Enhancement and Recognition with Distortion-Independent Acoustic Modeling

Authors: Peidong Wang, Ke Tan, DeLiang Wang

Abstract: Monaural speech enhancement has made dramatic advances since the introduction of deep learning a few years ago. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced expected improvements in ASR performance. The lack of an enhancement… ▽ More Monaural speech enhancement has made dramatic advances since the introduction of deep learning a few years ago. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced expected improvements in ASR performance. The lack of an enhancement benefit on recognition, or the gap between monaural speech enhancement and recognition, is often attributed to speech distortions introduced in the enhancement process. In this study, we analyze the distortion problem, compare different acoustic models, and investigate a distortion-independent training scheme for monaural speech recognition. Experimental results suggest that distortion-independent acoustic modeling is able to overcome the distortion problem. Such an acoustic model can also work with speech enhancement models different from the one used during training. Moreover, the models investigated in this paper outperform the previous best system on the CHiME-2 corpus. △ Less

Submitted 12 March, 2019; v1 submitted 11 March, 2019; originally announced March 2019.

arXiv:1811.09010 [pdf]

Deep Learning Based Phase Reconstruction for Speaker Separation: A Trigonometric Perspective

Authors: Zhong-Qiu Wang, Ke Tan, DeLiang Wang

Abstract: This study investigates phase reconstruction for deep learning based monaural talker-independent speaker separation in the short-time Fourier transform (STFT) domain. The key observation is that, for a mixture of two sources, with their magnitudes accurately estimated and under a geometric constraint, the absolute phase difference between each source and the mixture can be uniquely determined; in… ▽ More This study investigates phase reconstruction for deep learning based monaural talker-independent speaker separation in the short-time Fourier transform (STFT) domain. The key observation is that, for a mixture of two sources, with their magnitudes accurately estimated and under a geometric constraint, the absolute phase difference between each source and the mixture can be uniquely determined; in addition, the source phases at each time-frequency (T-F) unit can be narrowed down to only two candidates. To pick the right candidate, we propose three algorithms based on iterative phase reconstruction, group delay estimation, and phase-difference sign prediction. State-of-the-art results are obtained on the publicly available wsj0-2mix and 3mix corpus. △ Less

Submitted 21 November, 2018; originally announced November 2018.

Comments: 5 pages, in submission to ICASSP-2019

arXiv:1805.00367 [pdf, other]

A Multi-State Diagnosis and Prognosis Framework with Feature Learning for Tool Condition Monitoring

Authors: Chong Zhang, Geok Soon Hong, Jun-Hong Zhou, Kay Chen Tan, Haizhou Li, Huan Xu, Jihoon Hong, Hian-Leng Chan

Abstract: In this paper, a multi-state diagnosis and prognosis (MDP) framework is proposed for tool condition monitoring via a deep belief network based multi-state approach (DBNMS). For fault diagnosis, a cost-sensitive deep belief network (namely ECS-DBN) is applied to deal with the imbalanced data problem for tool state estimation. An appropriate prognostic degradation model is then applied for tool wear… ▽ More In this paper, a multi-state diagnosis and prognosis (MDP) framework is proposed for tool condition monitoring via a deep belief network based multi-state approach (DBNMS). For fault diagnosis, a cost-sensitive deep belief network (namely ECS-DBN) is applied to deal with the imbalanced data problem for tool state estimation. An appropriate prognostic degradation model is then applied for tool wear estimation based on the different tool states. The proposed framework has the advantage of automatic feature representation learning and shows better performance in accuracy and robustness. The effectiveness of the proposed DBNMS is validated using a real-world dataset obtained from the gun drilling process. This dataset contains a large amount of measured signals involving different tool geometries under various operating conditions. The DBNMS is examined for both the tool state estimation and tool wear estimation tasks. In the experimental studies, the prediction results are evaluated and compared with popular machine learning approaches, which show the superior performance of the proposed DBNMS approach. △ Less

Submitted 30 April, 2018; originally announced May 2018.

Comments: 14 pages, 12 figures, 10 tables, submitted to IEEE Transactions on Cybernetics

arXiv:1305.6402 [pdf, other]

From Parametric Model-based Optimization to robust PID Gain Scheduling

Authors: Minh Hoang-Tuan Nguyen, Kok Kiong Tan

Abstract: In chemical process applications, model predictive control effectively deals with input and state constraints during transient operations. However, industrial PID controllers directly manipulates the actuators, so they play the key role in small perturbation robustness. This paper considers the problem of augmenting the commonplace PID with the constraint handling and optimization functionalities… ▽ More In chemical process applications, model predictive control effectively deals with input and state constraints during transient operations. However, industrial PID controllers directly manipulates the actuators, so they play the key role in small perturbation robustness. This paper considers the problem of augmenting the commonplace PID with the constraint handling and optimization functionalities of MPC. First, we review the MPC framework, which employs a linear feedback gain in its unconstrained region. This linear gain can be any preexisting multiloop PID design, or based on the two stabilizing PI or PID designs for multivariable systems proposed in the paper. The resulting controller is a feedforward PID map**, a straightforward form without the need of tuning PID to fit an optimal input. The parametrized solution of MPC under constraints further leverages a familiar PID gain scheduling structure. Steady state robustness is achieved along with the PID design so that additional robustness analysis is avoided. △ Less

Submitted 28 May, 2013; originally announced May 2013.

Comments: 7 pages, 5 figures, submitted to JPC

arXiv:1305.6394 [pdf, other]

doi 10.1016/j.jprocont.2011.04.002

Enhanced Predictive Ratio Control of Interacting Systems

Authors: Minh Hoang-Tuan Nguyen, Kok Kiong Tan, Sunan Huang

Abstract: Ratio control for two interacting processes is proposed with a PID feedforward design based on model predictive control (MPC) scheme. At each sampling instant, the MPC control action minimizes a state-dependent performance index associated with a PID-type state vector, thus yielding a PID-type control structure. Compared to the standard MPC formulations with separated single-variable control, such… ▽ More Ratio control for two interacting processes is proposed with a PID feedforward design based on model predictive control (MPC) scheme. At each sampling instant, the MPC control action minimizes a state-dependent performance index associated with a PID-type state vector, thus yielding a PID-type control structure. Compared to the standard MPC formulations with separated single-variable control, such a control action allows one to take into account the non-uniformity of the two process outputs. After reformulating the MPC control law as a PID control law, we provide conditions for prediction horizon and weighting matrices so that the closed-loop control is asymptotically stable, and show the effectiveness of the approach with simulation and experiment results. △ Less

Submitted 28 May, 2013; originally announced May 2013.

Comments: 9 pages, 7 figures, 1 table

Journal ref: Journal of Process Control, Volume 21, Issue 7, August 2011, Pages 1115 to 1125

arXiv:1305.6379 [pdf, other]

Robust Precision Positioning Control on Linear Ultrasonic Motor

Authors: Minh H-T Nguyen, Kok Kiong Tan, Wenyu Liang, Chek Sing Teo

Abstract: Ultrasonic motors used in high-precision mechatronics are characterized by strong frictional effects, which are among the main problems in precision motion control. The traditional methods apply model-based nonlinear feedforward to compensate the friction, thus requiring closed-loop stability and safety constraint considerations. Implementation of these methods requires complex designed experiment… ▽ More Ultrasonic motors used in high-precision mechatronics are characterized by strong frictional effects, which are among the main problems in precision motion control. The traditional methods apply model-based nonlinear feedforward to compensate the friction, thus requiring closed-loop stability and safety constraint considerations. Implementation of these methods requires complex designed experiments. This paper introduces a systematic approach using piecewise affine models to emulate the friction effect of the motor motion. The well-known model predictive control method is employed to deal with piecewise affine models. The increased complexity of the model offers a higher tracking precision on a simpler gain scheduling scheme. △ Less

Submitted 28 May, 2013; originally announced May 2013.

Comments: 6 pages, 8 figures, conference

Showing 1–37 of 37 results for author: Tan, K