Search | arXiv e-print repository

Unrolling Plug-and-Play Gradient Graph Laplacian Regularizer for Image Restoration

Authors: Jianghe Cai, Gene Cheung, Fei Chen

Abstract: Generic deep learning (DL) networks for image restoration like denoising and interpolation lack mathematical interpretability, require voluminous training data to tune a large parameter set, and are fragile during covariate shift. To address these shortcomings, for a general linear image formation model, we first formulate a convex optimization problem with a new graph smoothness prior called gradient graph Laplacian regularizer (GGLR) that promotes piecewise planar (PWP) signal reconstruction. To solve the posed problem, we introduce a variable number of auxiliary variables to create a family of Plug-and-Play (PnP) ADMM algorithms and unroll them into variable-complexity feed-forward networks, amenable to parameter tuning via back-propagation. More complex unrolled networks require more labeled data to train more parameters, but have better potential performance. Experimental results show that our unrolled networks perform competitively with generic DL networks in image restoration quality while using a small fraction of parameters, and demonstrate improved robustness to covariate shift.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.16020 [pdf, other]

AudioBench: A Universal Benchmark for Audio Large Language Models

Authors: Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen

Abstract: We introduce AudioBench, a new benchmark designed to evaluate audio large language models (AudioLLMs). AudioBench encompasses 8 distinct tasks and 26 carefully selected or newly curated datasets, focusing on speech understanding, voice interpretation, and audio scene understanding. Despite the rapid advancement of large language models, including multimodal versions, a significant gap exists in comprehensive benchmarks for thoroughly evaluating their capabilities. AudioBench addresses this gap by providing relevant datasets and evaluation metrics. In our study, we evaluated the capabilities of four models across various aspects and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-source code, data, and leaderboard will offer a robust testbed for future model developments.

Submitted 25 June, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

Comments: 20 pages; v2 - typo update; Code: https://github.com/AudioLLMs/AudioBench

arXiv:2406.02963 [pdf, other]

Dataset-Distillation Generative Model for Speech Emotion Recognition

Authors: Fabian Ritter-Gutierrez, Kuan-Po Huang, Jeremy H. M Wong, Dianwen Ng, Hung-yi Lee, Nancy F. Chen, Eng Siong Chng

Abstract: Deep learning models for speech rely on large datasets, presenting computational challenges. Yet, performance hinges on training data size. Dataset Distillation (DD) aims to learn a smaller dataset without much performance degradation when training with it. DD has been investigated in computer vision but not yet in speech. This paper presents the first approach for DD to speech targeting Speech Emotion Recognition on IEMOCAP. We employ Generative Adversarial Networks (GANs) not to mimic real data but to distil key discriminative information of IEMOCAP that is useful for downstream training. The GAN then replaces the original dataset and can sample custom synthetic dataset sizes. It performs comparably when following the original class imbalance but improves performance by 0.3% absolute UAR with balanced classes. It also reduces dataset storage and accelerates downstream training by 95% in both cases and reduces speaker information which could help for a privacy application.

Submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech 2024

arXiv:2405.16398 [pdf, other]

Networked Integrated Sensing and Communications for 6G Wireless Systems

Authors: Jiapeng Li, Xiaodan Shao, Feng Chen, Shaohua Wan, Chang Liu, Zhiqiang Wei, Derrick Wing Kwan Ng

Abstract: Integrated sensing and communication (ISAC) is envisioned as a key pillar for enabling the upcoming sixth generation (6G) communication systems, requiring not only reliable communication functionalities but also highly accurate environmental sensing capabilities. In this paper, we design a novel networked ISAC framework to explore the collaboration among multiple users for environmental sensing. Specifically, multiple users can serve as powerful sensors, capturing backscattered signals from a target at various angles to facilitate reliable computational imaging. Centralized sensing approaches are extremely sensitive to the capability of the leader node because they require the leader node to process the signals sent by all the users. To this end, we propose a two-step distributed cooperative sensing algorithm that allows low-dimensional intermediate estimate exchange among neighboring users, thus eliminating the reliance on the centralized leader node and improving the robustness of sensing. This way, multiple users can cooperatively sense a target by exploiting the block-wise environment sparsity and the interference cancellation technique. Furthermore, we analyze the mean square error of the proposed distributed algorithm as a networked sensing performance metric and propose a beamforming design for the proposed networked ISAC scheme to maximize the networked sensing accuracy and communication performance subject to a transmit power constraint. Simulation results validate the effectiveness of the proposed algorithm compared with the state-of-the-art algorithms.

Submitted 25 May, 2024; originally announced May 2024.

Comments: Received by IEEE Internet of Things Journal

arXiv:2405.09787 [pdf, other]

Analysis of the BraTS 2023 Intracranial Meningioma Segmentation Challenge

Authors: Dominic LaBella, Ujjwal Baid, Omaditya Khanna, Shan McBurney-Lin, Ryan McLean, Pierre Nedelec, Arif Rashid, Nourel Hoda Tahon, Talissa Altes, Radhika Bhalerao, Yaseen Dhemesh, Devon Godfrey, Fathi Hilal, Scott Floyd, Anastasia Janas, Anahita Fathi Kazerooni, John Kirkpatrick, Collin Kent, Florian Kofler, Kevin Leu, Nazanin Maleki, Bjoern Menze, Maxence Pajot, Zachary J. Reitman, Jeffrey D. Rudie, et al. (96 additional authors not shown)

Abstract: We describe the design and results from the BraTS 2023 Intracranial Meningioma Segmentation Challenge. The BraTS Meningioma Challenge differed from prior BraTS Glioma challenges in that it focused on meningiomas, which are typically benign extra-axial tumors with diverse radiologic and anatomical presentation and a propensity for multiplicity. Nine participating teams each developed deep-learning automated segmentation models using image data from the largest multi-institutional systematically expert annotated multilabel multi-sequence meningioma MRI dataset to date, which included 1000 training set cases, 141 validation set cases, and 283 hidden test set cases. Each case included T2, T2/FLAIR, T1, and T1Gd brain MRI sequences with associated tumor compartment labels delineating enhancing tumor, non-enhancing tumor, and surrounding non-enhancing T2/FLAIR hyperintensity. Participant automated segmentation models were evaluated and ranked based on a scoring system evaluating lesion-wise metrics including Dice similarity coefficient (DSC) and 95% Hausdorff Distance. The top ranked team had a lesion-wise median DSC of 0.976, 0.976, and 0.964 for enhancing tumor, tumor core, and whole tumor, respectively, and a corresponding average DSC of 0.899, 0.904, and 0.871, respectively. These results serve as state-of-the-art benchmarks for future pre-operative meningioma automated segmentation algorithms. Additionally, we found that 1286 of 1424 cases (90.3%) had at least 1 compartment voxel abutting the edge of the skull-stripped image, which requires further investigation into optimal pre-processing face anonymization steps.

Submitted 15 May, 2024; originally announced May 2024.

Comments: 16 pages, 11 tables, 10 figures, MICCAI

arXiv:2403.09188 [pdf]

Design of an basis-projected layer for sparse datasets in deep learning training using gc-ms spectra as a case study

Authors: Yu Tang Chang, Shih Fang Chen

Abstract: Deep learning (DL) models encompass millions or even billions of parameters and learn complex patterns from big data. However, not all data are initially stored in a format suitable for effectively training a DL model, e.g., gas chromatography-mass spectrometry (GC-MS) spectra and DNA sequences. These datasets commonly contain many zero values, and the sparse data format causes difficulties in optimizing DL models. A DL module called the basis-projected layer (BPL) was proposed to mitigate the issue by transforming the sparse data into a dense representation. The transformed data is expected to facilitate the gradient calculation and fine-tuning in a DL training process. The dataset, an example of a sparse dataset, contained 362 specialty coffee odorant spectra detected from GC-MS. The BPL was placed at the beginning of the DL model. The tunable parameters in the layer were learnable projected axes that were the bases of a new representation space. The layer rotated these bases when its parameters were updated. When the number of bases was the same as the original dimension, the F1 score increased by 8.56%. Furthermore, when the number was set to 768 (the original dimension was 490), the F1 score increased by 11.49%. The layer not only maintained the model performance but also constructed a better representation space for analyzing sparse datasets.

Submitted 14 March, 2024; originally announced March 2024.

Comments: 5 pages, 2 figures, 2 tables, conference

MSC Class: 68-06 ACM Class: I.2.4; J.2

arXiv:2402.01665 [pdf, other]

Knowledge-Driven Deep Learning Paradigms for Wireless Network Optimization in 6G

Authors: Rui** Sun, Nan Cheng, Changle Li, Fangjiong Chen, Wen Chen

Abstract: In the sixth-generation (6G) networks, newly emerging diversified services of massive users in dynamic network environments are required to be satisfied by multi-dimensional heterogeneous resources. The resulting large-scale complicated network optimization problems are beyond the capability of model-based theoretical methods due to the overwhelming computational complexity and the long processing time. Despite its fast online inference and universal approximation ability, data-driven deep learning (DL) relies heavily on abundant training data and lacks interpretability. To address these issues, a new paradigm called knowledge-driven DL has emerged, aiming to integrate proven domain knowledge into the construction of neural networks, thereby exploiting the strengths of both methods. This article provides a systematic review of knowledge-driven DL in wireless networks. Specifically, a holistic framework of knowledge-driven DL in wireless networks is proposed, where knowledge sources, knowledge representation, knowledge integration and knowledge application form a closed loop. Then, a detailed taxonomy of knowledge integration approaches, including knowledge-assisted, knowledge-fused, and knowledge-embedded DL, is presented. Several open issues for future research are also discussed. The insights offered in this article provide a basic principle for the design of network optimization that incorporates communication-specific domain knowledge and DL, facilitating the realization of intelligent 6G networks.

Submitted 15 January, 2024; originally announced February 2024.

Comments: 9 pages, 5 figures

arXiv:2401.05819 [pdf]

TAnet: A New Temporal Attention Network for EEG-based Auditory Spatial Attention Decoding with a Short Decision Window

Authors: Yuting Ding, Fei Chen

Abstract: Auditory spatial attention detection (ASAD) is used to determine the direction of a listener's attention to a speaker by analyzing her/his electroencephalographic (EEG) signals. This study aimed to further improve the performance of ASAD with a short decision window (i.e., <1 s) rather than with long decision windows ranging from 1 to 5 seconds in previous studies. An end-to-end temporal attention network (i.e., TAnet) was introduced in this work. TAnet employs a multi-head attention (MHA) mechanism, which can more effectively capture the interactions among time steps in collected EEG signals and efficiently assign corresponding weights to those EEG time steps. Experiments demonstrated that, compared with the CNN-based method and recent ASAD methods, TAnet provided improved decoding performance in the KUL dataset, with decoding accuracies of 92.4% (decision window 0.1 s), 94.9% (0.25 s), 95.1% (0.3 s), 95.4% (0.4 s), and 95.5% (0.5 s) with short decision windows (i.e., <1 s). As a new ASAD model with a short decision window, TAnet can potentially facilitate the design of EEG-controlled intelligent hearing aids and sound recognition systems.

Submitted 14 May, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

arXiv:2401.04953 [pdf, other]

Adaptive-avg-pooling based Attention Vision Transformer for Face Anti-spoofing

Authors: Jichen Yang, Fangfan Chen, Rohan Kumar Das, Zhengyu Zhu, Shunsi Zhang

Abstract: The traditional vision transformer consists of two parts: a transformer encoder and a multi-layer perceptron (MLP). The former plays the role of feature learning to obtain a better representation, while the latter plays the role of classification. Here, the MLP consists of two fully connected (FC) layers, average value computing, an FC layer and a softmax layer. However, due to the use of the average value computing module, some useful information may get lost, which we plan to preserve by using an alternative framework. In this work, we propose a novel vision transformer referred to as adaptive-avg-pooling based attention vision transformer (AAViT) that uses modules of adaptive average pooling and attention to replace the module of average value computing. We explore the proposed AAViT for face anti-spoofing using the Replay-Attack database. The experiments show that the AAViT outperforms the vision transformer in face anti-spoofing by producing a reduced equal error rate. In addition, we found that the proposed AAViT can perform much better than some commonly used neural networks such as ResNet and some other known systems on the Replay-Attack corpus.

Submitted 10 January, 2024; originally announced January 2024.

Comments: Accepted for Publication in IEEE ICASSP 2024

arXiv:2401.00605 [pdf, other]

Distributed Multi-Object Tracking Under Limited Field of View Heterogeneous Sensors with Density Clustering

Authors: Fei Chen, Hoa Van Nguyen, Alex S. Leong, Sabita Panicker, Robin Baker, Damith C. Ranasinghe

Abstract: We consider the problem of tracking multiple, unknown, and time-varying numbers of objects using a distributed network of heterogeneous sensors. In an effort to derive a formulation for practical settings, we consider limited and unknown sensor field-of-views (FoVs), sensors with limited local computational resources and communication channel capacity. The resulting distributed multi-object tracking algorithm involves solving an NP-hard multidimensional assignment problem either optimally for small-size problems or sub-optimally for general practical problems. For general problems, we propose an efficient distributed multi-object tracking algorithm that performs track-to-track fusion using a clustering-based analysis of the state space transformed into a density space to mitigate the complexity of the assignment problem. The proposed algorithm can more efficiently group local track estimates for fusion than existing approaches. To ensure we achieve globally consistent identities for tracks across a network of nodes as objects move between FoVs, we develop a graph-based algorithm to achieve label consensus and minimise track segmentation. Numerical experiments with a synthetic and a real-world trajectory dataset demonstrate that our proposed method is significantly more computationally efficient than state-of-the-art solutions, achieving similar tracking accuracy and bandwidth requirements but with improved label consistency.

Submitted 31 December, 2023; originally announced January 2024.

arXiv:2312.12824 [pdf, other]

FedSODA: Federated Cross-assessment and Dynamic Aggregation for Histopathology Segmentation

Authors: Yuan Zhang, Yaolei Qi, Xiaoming Qi, Lotfi Senhadji, Yongyue Wei, Feng Chen, Guanyu Yang

Abstract: Federated learning (FL) for histopathology image segmentation involving multiple medical sites plays a crucial role in advancing the field of accurate disease diagnosis and treatment. However, it remains a highly challenging task due to the sample imbalance across clients and large data heterogeneity from disparate organs, variable segmentation tasks, and diverse distributions. Thus, we propose a novel FL approach for histopathology nuclei and tissue segmentation, FedSODA, via synthetic-driven cross-assessment operation (SO) and dynamic stratified-layer aggregation (DA). Our SO constructs a cross-assessment strategy to connect clients and mitigate the representation bias under sample imbalance. Our DA utilizes layer-wise interaction and dynamic aggregation to diminish heterogeneity and enhance generalization. The effectiveness of our FedSODA has been evaluated on the most extensive histopathology image segmentation dataset from 7 independent datasets. The code is available at https://github.com/yuanzhang7/FedSODA.

Submitted 20 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP2024

arXiv:2312.12153 [pdf, other]

Noise robust distillation of self-supervised speech models via correlation metrics

Authors: Fabian Ritter-Gutierrez, Kuan-Po Huang, Dianwen Ng, Jeremy H. M. Wong, Hung-yi Lee, Eng Siong Chng, Nancy F. Chen

Abstract: Compared to large speech foundation models, small distilled models exhibit degraded noise robustness. The student's robustness can be improved by introducing noise at the inputs during pre-training. Despite this, using the standard distillation loss still yields a student with degraded performance. Thus, this paper proposes improving student robustness via distillation with correlation metrics. Teacher behavior is learned by maximizing the teacher and student cross-correlation matrix between their representations towards identity. Noise robustness is encouraged via the student's self-correlation minimization. The proposed method is agnostic of the teacher model and consistently outperforms the previous approach. This work also proposes a heuristic to weigh the importance of the two correlation terms automatically. Experiments show consistently better clean and noise generalization on Intent Classification, Keyword Spotting, and Automatic Speech Recognition tasks on SUPERB Challenge.

Submitted 19 December, 2023; originally announced December 2023.

Comments: 6 pages

arXiv:2312.10979 [pdf, ps, other]

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

Authors: Shulin He, **jiang liu, Hao Li, Yang Yang, Fei Chen, Xueliang Zhang

Abstract: Target speaker extraction (TSE) aims to isolate a specific voice from multiple mixed speakers relying on a registered sample. Since voiceprint features usually vary greatly, current end-to-end neural networks require large numbers of model parameters, which makes them computationally intensive and impractical for real-time applications, especially on resource-constrained platforms. In this paper, we address the TSE task using a microphone array and introduce a novel three-stage solution that systematically decouples the process: First, a neural network is trained to estimate the direction of the target speaker. Second, with the direction determined, the Generalized Sidelobe Canceller (GSC) is used to extract the target speech. Third, an Inplace Convolutional Recurrent Neural Network (ICRN) acts as a denoising post-processor, refining the GSC output to yield the final separated speech. Our approach delivers superior performance while drastically reducing computational load, setting a new standard for efficient real-time target speaker extraction.

Submitted 4 January, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: Accepted to ICASSP 2024

arXiv:2312.10741 [pdf, other]

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

Authors: Yu Zhang, Rongjie Huang, Ruiqi Li, **Zheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao

Abstract: Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, the endeavor to model the intricate nuances of singing voice styles is an arduous task, as singing voices possess a remarkable degree of expressiveness. Moreover, existing SVS methods encounter a decline in the quality of synthesized singing voices in OOD scenarios, as they rest upon the assumption that the target vocal attributes are discernible during the training phase. To overcome these challenges, we propose StyleSinger, the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. StyleSinger incorporates two critical approaches for enhanced effectiveness: 1) the Residual Style Adaptor (RSA) which employs a residual quantization module to capture diverse style characteristics in singing voices, and 2) the Uncertainty Modeling Layer Normalization (UMLN) to perturb the style attributes within the content representation during the training phase and thus improve the model generalization. Our extensive evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples. Access to singing voice samples can be found at https://stylesinger.github.io/.

Submitted 2 January, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI 2024

arXiv:2312.03246 [pdf, ps, other]

On Topological Conditions for Enabling Transient Control in Leader-follower Networks

Authors: Fei Chen, Dimos V. Dimarogonas

Abstract: We derive necessary and sufficient conditions for leader-follower multi-agent systems such that we can further apply prescribed performance control to achieve the desired formation while satisfying certain transient constraints. A leader-follower framework is considered in the sense that a group of agents with external inputs are selected as leaders in order to drive the group of followers in a way that the entire system can achieve target formation within certain prescribed performance transient bounds. We first derive necessary conditions on the leader-follower graph topology under which the target formation together with the prescribed performance guarantees can be fulfilled. Afterwards, the derived necessary conditions are extended to necessary and sufficient conditions for leader-follower formation control under transient constraints. Finally, the proposed results are illustrated with simulation examples.

Submitted 5 December, 2023; originally announced December 2023.

Comments: under review at Automatica

arXiv:2312.03001 [pdf]

Computer Vision for Increased Operative Efficiency via Identification of Instruments in the Neurosurgical Operating Room: A Proof-of-Concept Study

Authors: Tanner J. Zachem, Sully F. Chen, Vishal Venkatraman, David AW Sykes, Ravi Prakash, Koumani W. Ntowe, Mikhail A. Bethell, Samantha Spellicy, Alexander D Suarez, Weston Ross, Patrick J. Codd

Abstract: Objectives: Computer vision (CV) is a field of artificial intelligence that enables machines to interpret and understand images and videos. CV has the potential to be of assistance in the operating room (OR) to track surgical instruments. We built a CV algorithm for identifying surgical instruments in the neurosurgical operating room as a potential solution for surgical instrument tracking and management to decrease surgical waste and opening of unnecessary tools. Methods: We collected 1660 images of 27 commonly used neurosurgical instruments. Images were labeled using the VGG Image Annotator and split into 80% training and 20% testing sets in order to train a U-Net Convolutional Neural Network using 5-fold cross validation. Results: Our U-Net achieved a tool identification accuracy of 80-100% when distinguishing 25 classes of instruments, with 19/25 classes having accuracy over 90%. The model performance was not adequate for subclassifying Adson, Gerald, and Debakey forceps, which had accuracies of 60-80%. Conclusions: We demonstrated the viability of using machine learning to accurately identify surgical instruments. Instrument identification could help optimize surgical tray packing, decrease tool usage and waste, decrease incidence of instrument misplacement events, and assist in timing of routine instrument maintenance. More training data will be needed to increase accuracy across all surgical instruments that would appear in a neurosurgical operating room. Such technology has the potential to be used as a method for proving what tools are truly needed in each type of operation, allowing surgeons across the world to do more with less.

Submitted 29 April, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

Comments: Data is openly available through The Open Science Framework: https://doi.org/10.17605/OSF.IO/BCQK2

arXiv:2311.08415 [pdf]

Scanning phase imaging without accurate positioning system

Authors: Tao Liu, Bingyang Wang, JiangTao Zhao, Fu rong Chen, Fucai Zhang

Abstract: Ptychography, a high-resolution phase imaging technique using precise in-plane translation information, has been widely applied in modern synchrotron radiation sources across the globe. A key requirement for successful ptychographic reconstruction is the precise knowledge of the scanning positions, which are typically obtained by a physical interferometric positioning system. However, high-throughput positioning poses an engineering challenge, especially at the nanometer scale or below. In this work, we propose a novel scanning imaging framework that does not require any prior position information from the positioning system. Specifically, our scheme utilizes the wavefront modulation mechanism to reconstruct the object functions at each scan position and the shared illumination function simultaneously. The scanning trajectory information is extracted by our subpixel image registration algorithm from the overlap region of reconstructed object functions. Then, a complete object function can be obtained by assembling each part of the reconstructed sample functions. High-quality imaging of a biological sample and position recovery with sub-pixel accuracy are demonstrated in a proof-of-concept experiment. Based on current results, we find it may have great potential applications in high-resolution and high-throughput phase imaging.

Submitted 31 October, 2023; originally announced November 2023.

Comments: 9 pages, 4 figures

arXiv:2311.03046 [pdf, other]

Antenna Positioning and Beamforming Design for Fluid-Antenna Enabled Multi-user Downlink Communications

Authors: Haoran Qin, Wen Chen, Zhendong Li, Qingqing Wu, Nan Cheng, Fangjiong Chen

Abstract: This paper investigates a multiple input single output (MISO) downlink communication system in which users are equipped with fluid antennas (FAs). First, we adopt a field-response based channel model to characterize the downlink channel with respect to FAs' positions. Then, we aim to minimize the total transmit power by jointly optimizing the FAs' positions and beamforming matrix. To solve the resulting non-convex problem, we employ an alternating optimization (AO) algorithm based on penalty method and successive convex approximation (SCA) to obtain a sub-optimal solution. Numerical results demonstrate that the FA-assisted communication system performs better than conventional fixed position antennas system.

Submitted 13 January, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

arXiv:2311.00271 [pdf, other]

EdgeDis: Enabling Fast, Economical, and Reliable Data Dissemination for Mobile Edge Computing

Authors: Bo Li, Qiang He, Feifei Chen, Lingjuan Lyu, Athman Bouguettaya, Yun Yang

Abstract: Mobile edge computing (MEC) enables web data caching in close geographic proximity to end users. Popular data can be cached on edge servers located less than hundreds of meters away from end users. This ensures bounded latency guarantees for various latency-sensitive web applications. However, transmitting a large volume of data out of the cloud onto many geographically-distributed web servers individually can be expensive. In addition, web content dissemination may be interrupted by various intentional and accidental events in the volatile MEC environment, which undermines dissemination efficiency and subsequently incurs extra transmission costs. To tackle the above challenges, we present a novel scheme named EdgeDis that coordinates data dissemination by distributed consensus among those servers. We analyze EdgeDis's validity theoretically and evaluate its performance experimentally. Results demonstrate that compared with baseline and state-of-the-art schemes, EdgeDis: 1) is 5.97x - 7.52x faster; 2) reduces dissemination costs by 48.21% to 91.87%; and 3) reduces performance loss caused by dissemination failures by up to 97.30% in time and 96.35% in costs.

Submitted 31 October, 2023; originally announced November 2023.

arXiv:2310.05072 [pdf, other]

Performance Analysis of RIS-Aided Double Spatial Scattering Modulation for mmWave MIMO Systems

Authors: Xusheng Zhu, Wen Chen, Qingqing Wu, Jun Li, Nan Cheng, Fangjiong Chen, Changle Li

Abstract: In this paper, we investigate a practical structure of reconfigurable intelligent surface (RIS)-based double spatial scattering modulation (DSSM) for millimeter-wave (mmWave) multiple-input multiple-output (MIMO) systems. A suboptimal detector is proposed, in which the beam direction is first demodulated according to the received beam strength, and then the remaining information is demodulated by adopting the maximum likelihood algorithm. Based on the proposed suboptimal detector, we derive the conditional pairwise error probability expression. Further, the exact numerical integral and closed-form expressions of unconditional pairwise error probability (UPEP) are derived via two different approaches. To provide more insights, we derive the upper bound and asymptotic expressions of UPEP. In addition, the diversity gain of the RIS-DSSM scheme was also given. Furthermore, the union upper bound of average bit error probability (ABEP) is obtained by combining the UPEP and the number of error bits. Simulation results are provided to validate the derived upper bound and asymptotic expressions of ABEP. We found an interesting phenomenon that the ABEP performance of the proposed system-based phase shift keying is better than that of the quadrature amplitude modulation. Additionally, the performance advantage of ABEP is more significant with the increase in the number of RIS elements.

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2309.09548 [pdf, other]

Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata

Authors: Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

Abstract: Automated speech intelligibility assessment is pivotal for hearing aid (HA) development. In this paper, we present three novel methods to improve intelligibility prediction accuracy and introduce MBI-Net+, an enhanced version of MBI-Net, the top-performing system in the 1st Clarity Prediction Challenge. MBI-Net+ leverages Whisper's embeddings to create cross-domain acoustic features and includes metadata from speech signals by using a classifier that distinguishes different enhancement methods. Furthermore, MBI-Net+ integrates the hearing-aid speech perception index (HASPI) as a supplementary metric into the objective function to further boost prediction performance. Experimental results demonstrate that MBI-Net+ surpasses several intrusive baseline systems and MBI-Net on the Clarity Prediction Challenge 2023 dataset, validating the effectiveness of incorporating Whisper embeddings, speech metadata, and related complementary metrics to improve prediction performance for HA.

Submitted 13 June, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: Accepted to Interspeech 2024

arXiv:2309.08323 [pdf]

MLP Based Continuous Gait Recognition of a Powered Ankle Prosthesis with Serial Elastic Actuator

Authors: Yanze Li, Feixing Chen, Jingqi Cao, Ruoqi Zhao, Xuan Yang, Xingbang Yang, Yubo Fan

Abstract: Powered ankle prostheses effectively assist people with lower limb amputation to perform daily activities. High-performance prostheses with adjustable compliance and the capability to predict and implement the amputee's intent are crucial for them to be comparable to or better than a real limb. However, current designs fail to provide simple yet effective compliance of the joint with full potential for modification, and lack an accurate real-time gait prediction method. This paper proposes an innovative design of a powered ankle prosthesis with a serial elastic actuator (SEA), and puts forward an MLP-based gait recognition method that can accurately and continuously predict more gait parameters for motion sensing and control. The prosthesis mimics a biological joint with similar weight, torque, and power, and can assist walking at up to 4 m/s. A new design of planar torsional spring is proposed for the SEA, which has better stiffness, endurance, and potential for modification than current designs. The gait recognition system simultaneously generates locomotive speed, gait phase, ankle angle, and angular velocity using only the signals of a single IMU, holding advantages in continuity, adaptability across the speed range, accuracy, and multi-function capability.

Submitted 30 March, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

Comments: Submitted to IROS 2024

arXiv:2308.14430 [pdf, other]

doi 10.1109/ICASSP48485.2024.10445879

TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

Authors: Shengpeng Ji, Jialong Zuo, Minghui Fang, Ziyue Jiang, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao

Abstract: Recently, there has been a growing interest in the field of controllable Text-to-Speech (TTS). While previous studies have relied on users providing specific style factor values based on acoustic knowledge or selecting reference speeches that meet certain requirements, generating speech solely from natural text prompts has emerged as a new challenge for researchers. This challenge arises due to the scarcity of high-quality speech datasets with natural text style prompt and the absence of advanced text-controllable TTS models. In light of this, 1) we propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes. The dataset comprises 236,220 pairs of style prompt in natural text descriptions with five style factors and corresponding speech samples. Through iterative experimentation, we introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes. 2) Furthermore, to address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle. This architecture treats text controllable TTS as a language model task, utilizing audio codec codes as an intermediate representation to replace the conventional mel-spectrogram. Finally, we successfully demonstrate the ability of the proposed model by showing a comparable performance in the controllable TTS task. Audio samples are available at https://sall-e.github.io/

Submitted 28 August, 2023; originally announced August 2023.

Journal ref: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:2307.15280 [pdf, other]

Active RIS-Assisted MIMO-OFDM System: Analyses and Prototype Measurements

Authors: De-Ming Chian, Feng-Ji Chen, Yu-Chen Chang, Chao-Kai Wen, Chi-Hung Wu, Fu-Kang Wang, Kai-Kit Wong, Chan-Byoung Chae

Abstract: In this study, we develop an active reconfigurable intelligent surface (RIS)-assisted multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) prototype compliant with the 5G New Radio standard at 3.5 GHz. The experimental results clearly indicate that active RIS plays a vital role in enhancing MIMO performance, surpassing passive RIS. Furthermore, when considering factors such as complexity, energy consumption, and performance, the comparative evaluation between passive RIS and active RIS reinforces the critical role of active RIS in MIMO systems. These findings underscore the practical significance of active RIS in improving MIMO gain in 5G scenarios.

Submitted 14 November, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

Comments: 5 pages, 5 figures, 1 table, accepted by IEEE Communications Letters, for demo video see: https://www.youtube.com/watch?v=3R6eZXizwns

arXiv:2307.06547 [pdf]

Full-resolution Lung Nodule Segmentation from Chest X-ray Images using Residual Encoder-Decoder Networks

Authors: Michael James Horry, Subrata Chakraborty, Biswajeet Pradhan, Manoranjan Paul, Jing Zhu, Prabal Datta Barua, U. Rajendra Acharya, Fang Chen, Jianlong Zhou

Abstract: Lung cancer is the leading cause of cancer death and early diagnosis is associated with a positive prognosis. Chest X-ray (CXR) provides an inexpensive imaging mode for lung cancer diagnosis. Suspicious nodules are difficult to distinguish from vascular and bone structures using CXR. Computer vision has previously been proposed to assist human radiologists in this task, however, leading studies use down-sampled images and computationally expensive methods with unproven generalization. Instead, this study localizes lung nodules using efficient encoder-decoder neural networks that process full resolution images to avoid any signal loss resulting from down-sampling. Encoder-decoder networks are trained and tested using the JSRT lung nodule dataset. The networks are used to localize lung nodules from an independent external CXR dataset. Sensitivity and false positive rates are measured using an automated framework to eliminate any observer subjectivity. These experiments allow for the determination of the optimal network depth, image resolution and pre-processing pipeline for generalized lung nodule localization. We find that nodule localization is influenced by subtlety, with more subtle nodules being detected in earlier training epochs. Therefore, we propose a novel self-ensemble model from three consecutive epochs centered on the validation optimum. This ensemble achieved a sensitivity of 85% in 10-fold internal testing with false positives of 8 per image. A sensitivity of 81% is achieved at a false positive rate of 6 following morphological false positive reduction. This result is comparable to more computationally complex systems based on linear and spatial filtering, but with a sub-second inference time that is faster than other methods. The proposed algorithm achieved excellent generalization results against an external dataset with sensitivity of 77% at a false positive rate of 7.6.

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2307.05385 [pdf, other]

Learned Kernels for Sparse, Interpretable, and Efficient Medical Time Series Processing

Authors: Sully F. Chen, Zhicheng Guo, Cheng Ding, Xiao Hu, Cynthia Rudin

Abstract: Background: Rapid, reliable, and accurate interpretation of medical signals is crucial for high-stakes clinical decision-making. The advent of deep learning allowed for an explosion of new models that offered unprecedented performance in medical time series processing but at a cost: deep learning models are often compute-intensive and lack interpretability. Methods: We propose Sparse Mixture of Learned Kernels (SMoLK), an interpretable architecture for medical time series processing. The method learns a set of lightweight flexible kernels to construct a single-layer neural network, providing not only interpretability, but also efficiency and robustness. We introduce novel parameter reduction techniques to further reduce the size of our network. We demonstrate the power of our architecture on two important tasks: photoplethysmography (PPG) artifact detection and atrial fibrillation detection from single-lead electrocardiograms (ECGs). Our approach has performance similar to the state-of-the-art deep neural networks with several orders of magnitude fewer parameters, allowing for deep neural network level performance with extremely low-power wearable devices. Results: Our interpretable method achieves greater than 99% of the performance of the state-of-the-art methods on the PPG artifact detection task, and even outperforms the state-of-the-art on a challenging out-of-distribution test set, while using dramatically fewer parameters (2% of the parameters of Segade, and about half of the parameters of Tiny-PPG). On single lead atrial fibrillation detection, our method matches the performance of a 1D-residual convolutional network, at less than 1% the parameter count, while exhibiting considerably better performance in the low-data regime, even when compared to a parameter-matched control deep network.

Submitted 2 April, 2024; v1 submitted 6 July, 2023; originally announced July 2023.

Comments: 26 pages, 9 figures

arXiv:2306.07505 [pdf]

Deep learning radiomics for assessment of gastroesophageal varices in people with compensated advanced chronic liver disease

Authors: Lan Wang, Ruiling He, Lili Zhao, Jia Wang, Zhengzi Geng, Tao Ren, Guo Zhang, Peng Zhang, Kaiqiang Tang, Chaofei Gao, Fei Chen, Liting Zhang, Yonghe Zhou, Xin Li, Fanbin He, Hui Huan, Wenjuan Wang, Yunxiao Liang, Juan Tang, Fang Ai, Tingyu Wang, Liyun Zheng, Zhongwei Zhao, Jiansong Ji, Wei Liu, et al. (22 additional authors not shown)

Abstract: Objective: Bleeding from gastroesophageal varices (GEV) is a medical emergency associated with high mortality. We aim to construct an artificial intelligence-based model of two-dimensional shear wave elastography (2D-SWE) of the liver and spleen to precisely assess the risk of GEV and high-risk gastroesophageal varices (HRV). Design: A prospective multicenter study was conducted in patients with compensated advanced chronic liver disease. 305 patients were enrolled from 12 hospitals, and finally 265 patients were included, with 1136 liver stiffness measurement (LSM) images and 1042 spleen stiffness measurement (SSM) images generated by 2D-SWE. We leveraged deep learning methods to uncover associations between image features and patient risk, and thus constructed models to predict GEV and HRV. Results: A multi-modality Deep Learning Risk Prediction model (DLRP) was constructed to assess GEV and HRV, based on LSM and SSM images, and clinical information. Validation analysis revealed that the AUCs of DLRP were 0.91 for GEV (95% CI 0.90 to 0.93, p < 0.05) and 0.88 for HRV (95% CI 0.86 to 0.89, p < 0.01), which were significantly and robustly better than canonical risk indicators, including the value of LSM and SSM. Moreover, DLRP was better than the model using individual parameters, including LSM and SSM images. In HRV prediction, the 2D-SWE images of SSM outperform LSM (p < 0.01). Conclusion: DLRP shows excellent performance in predicting GEV and HRV over canonical risk indicators LSM and SSM. Additionally, the 2D-SWE images of SSM provided more information for better accuracy in predicting HRV than the LSM.

Submitted 12 June, 2023; originally announced June 2023.

arXiv:2306.02719 [pdf, ps, other]

Multiple output samples per input in a single-output Gaussian process

Authors: Jeremy H. M. Wong, Huayun Zhang, Nancy F. Chen

Abstract: The standard Gaussian Process (GP) only considers a single output sample per input in the training set. Datasets for subjective tasks, such as spoken language assessment, may be annotated with output labels from multiple human raters per input. This paper proposes to generalise the GP to allow for these multiple output samples in the training set, and thus make use of available output uncertainty information. This differs from a multi-output GP, as all output samples are from the same task here. The output density function is formulated to be the joint likelihood of observing all output samples, and latent variables are not repeated to reduce computation cost. The test set predictions are inferred similarly to a standard GP, with a difference being in the optimised hyper-parameters. This is evaluated on speechocean762, showing that it allows the GP to compute a test set output distribution that is more similar to the collection of reference outputs from the multiple human raters.

Submitted 25 January, 2024; v1 submitted 5 June, 2023; originally announced June 2023.

Comments: This paper is presented in the "Symposium for Celebrating 40 Years of Bayesian Learning in Speech and Language Processing and Beyond", which is a satellite event of the ASRU workshop, on 20 December 2023. https://bayesian40.github.io/

arXiv:2305.19972 [pdf, other]

VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition

Authors: Ziyi Ni, Minglun Han, Feilong Chen, Linghui Meng, Jing Shi, Pin Lv, Bo Xu

Abstract: Enhancing automatic speech recognition (ASR) performance by leveraging additional multimodal information has shown promising results in previous studies. However, most of these works have primarily focused on utilizing visual cues derived from human lip motions. In fact, context-dependent visual and linguistic cues can also benefit in many scenarios. In this paper, we first propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism, which can integrate visual and textual context simultaneously or separately, to facilitate speech recognition. Next, we introduce an effective training strategy that improves performance in modal-incomplete test scenarios. Then, to explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions. Finally, empirical results are reported on the public Flickr8K and self-constructed VSDial datasets. We explore various cross-modal fusion schemes, analyze fine-grained crossmodal alignment on VSDial, and provide insights into the effects of integrating multimodal information on speech recognition.

Submitted 18 December, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

Comments: Accepted to ICASSP 2024

arXiv:2305.16342 [pdf, other]

InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition

Authors: Zhi-Hao Lai, Tian-Hao Zhang, Qi Liu, Xinyuan Qian, Li-Fang Wei, Song-Lu Chen, Feng Chen, Xu-Cheng Yin

Abstract: The local and global features are both essential for automatic speech recognition (ASR). Many recent methods have verified that simply combining local and global features can further promote ASR performance. However, these methods pay less attention to the interaction of local and global features, and their series architectures are rigid to reflect local and global relationships. To address these issues, this paper proposes InterFormer for interactive local and global features fusion to learn a better representation for ASR. Specifically, we combine the convolution block with the transformer block in a parallel design. Besides, we propose a bidirectional feature interaction module (BFIM) and a selective fusion module (SFM) to implement the interaction and fusion of local and global features, respectively. Extensive experiments on public ASR datasets demonstrate the effectiveness of our proposed InterFormer and its superior performance over the other Transformer and Conformer models.

Submitted 29 May, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023

arXiv:2305.14049 [pdf, other]

Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding

Authors: Tian-Hao Zhang, Hai-Bo Qin, Zhi-Hao Lai, Song-Lu Chen, Qi Liu, Feng Chen, Xinyuan Qian, Xu-Cheng Yin

Abstract: Attention-based encoder-decoder (AED) models have shown impressive performance in ASR. However, most existing AED methods neglect to simultaneously leverage both acoustic and semantic features in decoder, which is crucial for generating more accurate and informative semantic states. In this paper, we propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR. In particular, unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively. To prevent information leakage during training, we design a Causal Multimodal Mask. Moreover, a variant Semi-ASCD is proposed to balance accuracy and computational cost. Our proposal is evaluated on the publicly available AISHELL-1 and aidatatang_200zh datasets using Transformer, Conformer, and Branchformer as encoders, respectively. The experimental results show that ASCD significantly improves the performance by leveraging both the acoustic and semantic information cooperatively.

Submitted 23 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023

arXiv:2305.05107 [pdf, other]

Modeling Viral Information Spreading via Directed Acyclic Graph Diffusion

Authors: Chinthaka Dinesh, Gene Cheung, Fei Chen, Yuejiang Li, H. Vicky Zhao

Abstract: Viral information like rumors or fake news is spread over a communication network like a virus infection in a unidirectional manner: entity $i$ conveys information to a neighbor $j$, resulting in two equally informed (infected) parties. Existing graph diffusion works focus only on bidirectional diffusion on an undirected graph. Instead, we propose a new directed acyclic graph (DAG) diffusion model to estimate the probability $x_i(t)$ of node $i$'s infection at time $t$ given a source node $s$, where $x_i(\infty) = 1$. Specifically, given an undirected positive graph modeling node-to-node communication, we first compute its graph embedding: a latent coordinate for each node in an assumed low-dimensional manifold space from extreme eigenvectors via LOBPCG. Next, we construct a DAG based on Euclidean distances between latent coordinates. Spectrally, we prove that the asymmetric DAG Laplacian matrix contains real non-negative eigenvalues, and that the DAG diffusion converges to the all-infection vector $\mathbf{x}(\infty) = \mathbf{1}$ as $t \rightarrow \infty$. Simulation experiments show that our proposed DAG diffusion accurately models viral information spreading over a variety of graph structures at different time instants.

Submitted 22 December, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

arXiv:2305.04160 [pdf, other]

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Authors: Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu

Abstract: Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4, based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous visual language models. We attribute this to the use of more advanced LLMs compared with previous multimodal models. Unfortunately, the model architecture and training strategies of GPT-4 are unknown. To endow LLMs with multimodal capabilities, we propose X-LLM, which converts multi-modalities (images, speech, videos) into foreign languages using X2L interfaces and inputs them into a large language model (ChatGLM). Specifically, X-LLM aligns multiple frozen single-modal encoders and a frozen LLM using X2L interfaces, where "X" denotes multi-modalities such as image, speech, and videos, and "L" denotes languages. X-LLM's training consists of three stages: (1) Converting Multimodal Information: The first stage trains each X2L interface to align with its respective single-modal encoder separately to convert multimodal information into languages. (2) Aligning X2L representations with the LLM: single-modal encoders are aligned with the LLM through X2L interfaces independently. (3) Integrating multiple modalities: all single-modal encoders are aligned with the LLM through X2L interfaces to integrate multimodal capabilities into the LLM. Our experiments show that X-LLM demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 84.5% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. We also conduct quantitative tests on using LLM for ASR and multimodal ASR, hoping to promote the era of LLM-based speech recognition.

Submitted 21 May, 2023; v1 submitted 6 May, 2023; originally announced May 2023.

arXiv:2301.13003 [pdf, other]

Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation

Authors: Minglun Han, Feilong Chen, Jing Shi, Shuang Xu, Bo Xu

Abstract: Large-scale pre-trained language models (PLMs) have shown great potential in natural language processing tasks. Leveraging the capabilities of PLMs to enhance automatic speech recognition (ASR) systems has also emerged as a promising research direction. However, previous works may be limited by the inflexible structures of PLMs and the insufficient utilization of PLMs. To alleviate these problems, we propose the hierarchical knowledge distillation (HKD) on the continuous integrate-and-fire (CIF) based ASR models. To transfer knowledge from PLMs to the ASR models, HKD employs cross-modal knowledge distillation with contrastive loss at the acoustic level and knowledge distillation with regression loss at the linguistic level. Compared with the original CIF-based model, our method achieves 15% and 9% relative error rate reduction on the AISHELL-1 and LibriSpeech datasets, respectively.

Submitted 28 May, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2211.12911 [pdf, ps, other]

Data-driven approximation of control invariant set for linear system based on convex piecewise linear fitting

Authors: Jun Xu, Fanglin Chen

Abstract: Control invariant set is critical for guaranteeing safe control and the problem of computing control invariant set for linear discrete-time system is revisited in this paper by using a data-driven approach. Specifically, sample points on convergent trajectories of linear MPC are recorded, of which the convex hull formulates a control invariant set for the linear system. To approximate the convex hull of multiple sample points, a convex piecewise linear (PWL) fitting framework has been proposed, which yields a polyhedral approximation with predefined complexity. A descent algorithm for the convex PWL fitting problem is also developed, which is guaranteed to converge to a local optimum. The proposed strategy is flexible in computing the control invariant set in high dimension with a predefined complexity. Simulation results show that the proposed data-driven approximation can compute the approximated control invariant set with high accuracy and relatively low computational cost.

Submitted 23 November, 2022; originally announced November 2022.

arXiv:2211.07283 [pdf, other]

SNIPER Training: Single-Shot Sparse Training for Text-to-Speech

Authors: Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans

Abstract: Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models can improve on dense models via pruning and extra retraining, or converge faster than dense models with some performance loss. Thus, we propose training TTS models using decaying sparsity, i.e. a high initial sparsity to accelerate training first, followed by a progressive rate reduction to obtain better eventual performance. This decremental approach differs from current methods of incrementing sparsity to a desired target, which costs significantly more time than dense training. We call our method SNIPER training: Single-shot Initialization Pruning Evolving-Rate training. Our experiments on FastSpeech2 show that we were able to obtain better losses in the first few training epochs with SNIPER, and that the final SNIPER-trained models outperformed constant-sparsity models and edged out dense models, with negligible difference in training time.

Submitted 1 June, 2024; v1 submitted 14 November, 2022; originally announced November 2022.
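
Illustrative note: the "decaying sparsity" idea described in the abstract above (start sparse, then progressively reduce the sparsity rate) can be sketched as a toy schedule. The linear decay, the function name `sparsity_at_epoch`, and the values 0.8 and 4 are assumptions chosen for illustration; the abstract does not state the actual schedule or its end point.

```python
def sparsity_at_epoch(epoch, initial_sparsity=0.8, decay_epochs=4):
    """Hypothetical schedule: sparsity falls linearly from initial_sparsity
    to 0.0 over decay_epochs, then stays at 0.0 (dense training)."""
    if epoch >= decay_epochs:
        return 0.0
    return initial_sparsity * (1 - epoch / decay_epochs)


# Fraction of weights masked out at each of the first six epochs.
schedule = [sparsity_at_epoch(epoch) for epoch in range(6)]
print(schedule)
```

The schedule is non-increasing, which is the opposite of the incremental-sparsity methods that the abstract contrasts against.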

arXiv:2211.00715 [pdf, other]

doi 10.1109/LRA.2023.3244716

Tunable Dynamic Walking via Soft Twisted Beam Vibration

Authors: Yuhao Jiang, Fuchen Chen, Daniel M. Aukes

Abstract: We propose a novel mechanism that propagates vibration through soft twisted beams, taking advantage of dynamically-coupled anisotropic stiffness to simplify the actuation of walking robots. Using dynamic simulation and experimental approaches, we show that the coupled stiffness of twisted beams with terrain contact can be controlled to generate a variety of complex trajectories by changing the frequency of the input signal. This work reveals how ground contact influences the system's dynamic behavior, supporting the design of walking robots inspired by this phenomenon. We also show that the proposed twisted beam produces a tunable walking gait from a single vibrational input.

Submitted 1 November, 2022; originally announced November 2022.

Comments: 8 pages, 5 figures, this paper has been submitted to IEEE Robotics and Automation Letters, copyright may be transferred without notice, after which this version may no longer be accessible, the supplemental video is available at: https://youtu.be/HpvOvaIC1Z4

Journal ref: IEEE Robotics and Automation Letters, vol. 8, no. 4, pp. 1967-1974, April 2023

arXiv:2210.14481 [pdf]

Calibrationless Reconstruction of Uniformly-Undersampled Multi-Channel MR Data with Deep Learning Estimated ESPIRiT Maps

Authors: Junhao Zhang, Zheyuan Yi, Yujiao Zhao, Linfang Xiao, Jiahao Hu, Christopher Man, Vick Lau, Shi Su, Fei Chen, Alex T. L. Leong, Ed X. Wu

Abstract: Purpose: To develop a truly calibrationless reconstruction method that derives ESPIRiT maps from uniformly-undersampled multi-channel MR data by deep learning. Methods: ESPIRiT, one commonly used parallel imaging reconstruction technique, forms the images from undersampled MR k-space data using ESPIRiT maps that effectively represent coil sensitivity information. Accurate ESPIRiT map estimation requires quality coil sensitivity calibration or autocalibration data. We present a U-Net based deep learning model to estimate the multi-channel ESPIRiT maps directly from uniformly-undersampled multi-channel multi-slice MR data. The model is trained using fully-sampled multi-slice axial brain datasets from the same MR receiving coil system. To utilize subject-coil geometric parameters available for each dataset, the training imposes a hybrid loss on ESPIRiT maps at the original locations as well as their corresponding locations within the standard reference multi-slice axial stack. The performance of the approach was evaluated using publicly available T1-weighted brain and cardiac data. Results: The proposed model robustly predicted multi-channel ESPIRiT maps from uniformly-undersampled k-space data. They were highly comparable to the reference ESPIRiT maps directly computed from 24 consecutive central k-space lines. Further, they led to excellent ESPIRiT reconstruction performance even at high acceleration, exhibiting a similar level of errors and artifacts to that by using reference ESPIRiT maps. Conclusion: A new deep learning approach is developed to estimate ESPIRiT maps directly from uniformly-undersampled MR data. It presents a general strategy for calibrationless parallel imaging reconstruction through learning from coil and protocol specific data.

Submitted 27 October, 2022; v1 submitted 26 October, 2022; originally announced October 2022.

arXiv:2209.13112 [pdf, other]

Automated Sex Classification of Children's Voices and Changes in Differentiating Factors with Age

Authors: Fuling Chen, Roberto Togneri, Murray Maybery, Diana Weiting Tan

Abstract: Sex classification of children's voices allows for an investigation of the development of secondary sex characteristics which has been a key interest in the field of speech analysis. This research investigated a broad range of acoustic features from scripted and spontaneous speech and applied a hierarchical clustering-based machine learning model to distinguish the sex of children aged between 5 and 15 years. We proposed an optimal feature set and our modelling achieved an average F1 score (the harmonic mean of the precision and recall) of 0.84 across all ages. Our results suggest that the sex classification is generally more accurate when a model is developed for each year group rather than for children in 4-year age bands, with classification accuracy being better for older age groups. We found that spontaneous speech could provide more helpful cues in sex classification than scripted speech, especially for children younger than 7 years. For younger age groups, a broad range of acoustic factors contributed evenly to sex classification, while for older age groups, F0-related acoustic factors were found to be the most critical predictors generally. Other important acoustic factors for older age groups include vocal tract length estimators, spectral flux, loudness and unvoiced features.

Submitted 26 September, 2022; originally announced September 2022.
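
Illustrative note: the abstract above defines the F1 score as the harmonic mean of precision and recall. A minimal sketch of that formula follows; the function name and the sample inputs are assumptions for illustration and are not values reported in the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Equal precision and recall give back the same value.
balanced = f1_score(0.5, 0.5)
# A zero in either term pulls the harmonic mean to zero.
one_sided = f1_score(1.0, 0.0)
print(balanced, one_sided)
```

Because the harmonic mean is dominated by the smaller term, a high F1 requires both precision and recall to be high.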

arXiv:2209.10890 [pdf, other]

doi 10.21437/Interspeech.2022-10626

EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models

Authors: Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman

Abstract: Neural models are known to be over-parameterized, and recent work has shown that sparse text-to-speech (TTS) models can outperform dense models. Although a plethora of sparse methods has been proposed for other domains, such methods have rarely been applied in TTS. In this work, we seek to answer the question: what are the characteristics of selected sparse techniques on the performance and model complexity? We compare a Tacotron2 baseline and the results of applying five techniques. We then evaluate the performance via the factors of naturalness, intelligibility and prosody, while reporting model size and training time. Complementary to prior research, we find that pruning before or during training can achieve similar performance to pruning after training and can be trained much faster, while removing entire neurons degrades performance much more than removing parameters. To our best knowledge, this is the first work that compares sparsity paradigms in text-to-speech synthesis.

Submitted 22 September, 2022; originally announced September 2022.

Journal ref: Interspeech 2022, 823-827 (2022)

arXiv:2208.11439 [pdf, other]

A Consistency Constraint-Based Approach to Coupled State Constraints in Distributed Model Predictive Control

Authors: Adrian Wiltz, Fei Chen, Dimos V. Dimarogonas

Abstract: In this paper, we present a distributed model predictive control (DMPC) scheme for dynamically decoupled systems which are subject to state constraints, coupling state constraints and input constraints. In the proposed control scheme, neighbor-to-neighbor communication suffices and all subsystems solve their local optimization problem in parallel. The approach relies on consistency constraints which define a neighborhood around each subsystem's reference trajectory where the state of the respective subsystem is guaranteed to stay in. Reference trajectories and consistency constraints are known to neighboring subsystems. Contrary to other relevant approaches, the reference trajectories are improved iteratively. Besides, the presented approach allows the formulation of convex optimization problems even in the presence of non-convex state constraints. The algorithm's effectiveness is demonstrated with a simulation.

Submitted 24 August, 2022; originally announced August 2022.

Comments: accepted for presentation at the 61st IEEE Conference on Decision and Control 2022

arXiv:2208.08131 [pdf, other]

Domestic sound event detection by shift consistency mean-teacher training and adversarial domain adaptation

Authors: Fang-Ching Chen, Kuan-Dar Chen, Yi-Wen Liu

Abstract: Semi-supervised learning and domain adaptation techniques have drawn increasing attention in the field of domestic sound event detection thanks to the availability of large amounts of unlabeled data and the relative ease to generate synthetic strongly-labeled data. In a previous work, several semi-supervised learning strategies were designed to boost the performance of a mean-teacher model. Namely, these strategies include shift consistency training (SCT), interpolation consistency training (ICT), and pseudo-labeling. However, adversarial domain adaptation (ADA) did not seem to improve the event detection accuracy further when we attempt to compensate for the domain gap between synthetic and real data. In this research, we empirically found that ICT tends to pull apart the distributions of synthetic and real data in t-SNE plots. Therefore, ICT is abandoned while SCT, in contrast, is applied to train both the student and the teacher models. With these modifications, the system successfully integrates with an ADA network, and we achieve 47.2% in the F1 score on the DCASE 2020 task 4 dataset, which is 2.1% higher than what was reported in the previous work.

Submitted 17 August, 2022; originally announced August 2022.

arXiv:2208.00840 [pdf, other]

doi 10.1016/j.media.2022.102540

A Transformer-based Neural Language Model that Synthesizes Brain Activation Maps from Free-Form Text Queries

Authors: Gia H. Ngo, Minh Nguyen, Nancy F. Chen, Mert R. Sabuncu

Abstract: Neuroimaging studies are often limited by the number of subjects and cognitive processes that can be feasibly interrogated. However, a rapidly growing number of neuroscientific studies have collectively accumulated an extensive wealth of results. Digesting this growing literature and obtaining novel insights remains to be a major challenge, since existing meta-analytic tools are constrained to keyword queries. In this paper, we present Text2Brain, an easy to use tool for synthesizing brain activation maps from open-ended text queries. Text2Brain was built on a transformer-based neural network language model and a coordinate-based meta-analysis of neuroimaging studies. Text2Brain combines a transformer-based text encoder and a 3D image generator, and was trained on variable-length text snippets and their corresponding activation maps sampled from 13,000 published studies. In our experiments, we demonstrate that Text2Brain can synthesize meaningful neural activation patterns from various free-form textual descriptions. Text2Brain is available at https://braininterpreter.com as a web-based tool for efficiently searching through the vast neuroimaging literature and generating new hypotheses.

Submitted 24 July, 2022; originally announced August 2022.

Comments: arXiv admin note: text overlap with arXiv:2109.13814

Journal ref: Medical Image Analysis. 2022 Jul 19:102540

arXiv:2207.12941 [pdf, other]

Learning Generalizable Latent Representations for Novel Degradations in Super Resolution

Authors: Fengjun Li, Xin Feng, Fanglin Chen, Guangming Lu, Wenjie Pei

Abstract: Typical methods for blind image super-resolution (SR) focus on dealing with unknown degradations by directly estimating them or learning the degradation representations in a latent space. A potential limitation of these methods is that they assume the unknown degradations can be simulated by the integration of various handcrafted degradations (e.g., bicubic downsampling), which is not necessarily true. The real-world degradations can be beyond the simulation scope by the handcrafted degradations, which are referred to as novel degradations. In this work, we propose to learn a latent representation space for degradations, which can be generalized from handcrafted (base) degradations to novel degradations. The obtained representations for a novel degradation in this latent space are then leveraged to generate degraded images consistent with the novel degradation to compose paired training data for SR model. Furthermore, we perform variational inference to match the posterior of degradations in latent representation space with a prior distribution (e.g., Gaussian distribution). Consequently, we are able to sample more high-quality representations for a novel degradation to augment the training data for SR model. We conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness and advantages of our method for blind super-resolution with novel degradations.

Submitted 25 July, 2022; originally announced July 2022.

arXiv:2206.04245 [pdf, other]

Manifold Graph Signal Restoration using Gradient Graph Laplacian Regularizer

Authors: Fei Chen, Gene Cheung, Xue Zhang

Abstract: In the graph signal processing (GSP) literature, graph Laplacian regularizer (GLR) was used for signal restoration to promote piecewise smooth / constant reconstruction with respect to an underlying graph. However, for signals slowly varying across graph kernels, GLR suffers from an undesirable "staircase" effect. In this paper, focusing on manifold graphs -- collections of uniform discrete samples on low-dimensional continuous manifolds -- we generalize GLR to gradient graph Laplacian regularizer (GGLR) that promotes planar / piecewise planar (PWP) signal reconstruction. Specifically, for a graph endowed with sampling coordinates (e.g., 2D images, 3D point clouds), we first define a gradient operator, using which we construct a gradient graph for nodes' gradients in sampling manifold space. This maps to a gradient-induced nodal graph (GNG) and a positive semi-definite (PSD) Laplacian matrix with planar signals as the 0 frequencies. For manifold graphs without explicit sampling coordinates, we propose a graph embedding method to obtain node coordinates via fast eigenvector computation. We derive the mean-square-error minimizing weight parameter for GGLR efficiently, trading off bias and variance of the signal estimate. Experimental results show that GGLR outperformed previous graph signal priors like GLR and graph total variation (GTV) in a range of graph signal restoration tasks.

Submitted 4 April, 2024; v1 submitted 8 June, 2022; originally announced June 2022.
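
Illustrative note: the contrast the abstract above draws between GLR and a gradient-based regularizer can be seen on a 1D toy signal. For a graph with Laplacian $L$ and edge weights $w_{ij}$, the GLR is $x^\top L x = \sum_{(i,j)} w_{ij}(x_i - x_j)^2$. The sketch below evaluates it on a path graph with equal weights, then applies the same penalty to the signal's first differences. This is a simplified 1D stand-in, not the paper's GGLR construction, and all function names are assumptions.

```python
def glr_path_graph(x, weight=1.0):
    """x^T L x for a path graph with equal edge weights:
    the sum of weight * (x[i] - x[i+1])**2 over consecutive samples."""
    return sum(weight * (a - b) ** 2 for a, b in zip(x, x[1:]))


def first_differences(x):
    """Discrete gradient of a uniformly sampled 1D signal."""
    return [b - a for a, b in zip(x, x[1:])]


constant = [2.0, 2.0, 2.0, 2.0]
ramp = [0.0, 1.0, 2.0, 3.0]

glr_constant = glr_path_graph(constant)                    # constant signal is not penalized
glr_ramp = glr_path_graph(ramp)                            # linear ramp is penalized
gradient_glr_ramp = glr_path_graph(first_differences(ramp))  # ramp has constant gradient
print(glr_constant, glr_ramp, gradient_glr_ramp)
```

In this toy setting the plain GLR penalizes the ramp, whereas the same penalty on its gradient is zero, which matches the abstract's statement that planar signals sit at the zero frequencies of the gradient-based regularizer.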

arXiv:2205.07108 [pdf, other]

Formalizing PQRST Complex in Accelerometer-based Gait Cycle for Authentication

Authors: Frank Sicong Chen, Amith K. Belman, Vir V. Phoha

Abstract: Accelerometer signals generated through gait present a new frontier of human interface with mobile devices. Gait cycle detection based on these signals has applications in various areas, including authentication, health monitoring, and activity detection. Template-based studies focus on how the entire gait cycle represents walking patterns, but these are compute-intensive. Aggregate feature-based studies extract features in the time domain and frequency domain from the entire gait cycle to reduce the number of features. However, these methods may miss critical structural information needed to appropriately represent the intricacies of walking patterns. To the best of our knowledge, no study has formally proposed a structure to capture variations within gait cycles or phases from accelerometer readings. We propose a new structure named the PQRST Complex, which corresponds to the swing phase in a gait cycle and matches the foot movements during this phase, thus capturing the changes in foot position. In our experiments, based on the nine features derived from this structure, the accelerometer-based gait authentication system outperforms many state-of-the-art gait cycle-based authentication systems. Our work opens up a new paradigm of capturing the structure of gait and opens multiple areas of research and practice using gait analogous to the "QRS complex" structure of ECG signals related to the heart.

Submitted 14 May, 2022; originally announced May 2022.

arXiv:2204.11448 [pdf, other]

High-Efficiency Lossy Image Coding Through Adaptive Neighborhood Information Aggregation

Authors: Ming Lu, Fangdong Chen, Shiliang Pu, Zhan Ma

Abstract: Questing for learned lossy image coding (LIC) with superior compression performance and computation throughput is challenging. The vital factor behind it is how to intelligently explore Adaptive Neighborhood Information Aggregation (ANIA) in transform and entropy coding modules. To this end, Integrated Convolution and Self-Attention (ICSA) unit is first proposed to form a content-adaptive transform to characterize and embed neighborhood information dynamically of any input. Then a Multistage Context Model (MCM) is devised to progressively use available neighbors following a pre-arranged spatial-channel order for accurate probability estimation in parallel. ICSA and MCM are stacked under a Variational AutoEncoder (VAE) architecture to derive rate-distortion optimized compact representation of input image via end-to-end learning. Our method reports state-of-the-art compression performance surpassing the VVC Intra and other prevalent LIC approaches across Kodak, CLIC, and Tecnick datasets. More importantly, our method offers $>60\times$ decoding speedup using a comparable-size model when compared with the most popular LIC method. All materials are made publicly accessible at https://njuvision.github.io/TinyLIC for reproducible research.

Submitted 12 October, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

arXiv:2204.03310 [pdf, other]

MTI-Net: A Multi-Target Speech Intelligibility Prediction Model

Authors: Ryandhimas E. Zezario, Szu-wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

Abstract: Recently, deep learning (DL)-based non-intrusive speech assessment models have attracted great attention. Many studies report that these DL-based models yield satisfactory assessment performance and good flexibility, but their performance in unseen environments remains a challenge. Furthermore, compared to quality scores, fewer studies elaborate deep learning models to estimate intelligibility scores. This study proposes a multi-task speech intelligibility prediction model, called MTI-Net, for simultaneously predicting human and machine intelligibility measures. Specifically, given a speech utterance, MTI-Net is designed to predict human subjective listening test results and word error rate (WER) scores. We also investigate several methods that can improve the prediction performance of MTI-Net. First, we compare different features (including low-level features and embeddings from self-supervised learning (SSL) models) and prediction targets of MTI-Net. Second, we explore the effect of transfer learning and multi-tasking learning on training MTI-Net. Finally, we examine the potential advantages of fine-tuning SSL embeddings. Experimental results demonstrate the effectiveness of using cross-domain features, multi-task learning, and fine-tuning SSL embeddings. Furthermore, it is confirmed that the intelligibility and WER scores predicted by MTI-Net are highly correlated with the ground-truth scores.

Submitted 30 August, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: Accepted to Interspeech 2022

arXiv:2204.03305 [pdf, other]

MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids

Authors: Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

Abstract: Improving the user's hearing ability to understand speech in noisy environments is critical to the development of hearing aid (HA) devices. For this, it is important to derive a metric that can fairly predict speech intelligibility for HA users. A straightforward approach is to conduct a subjective listening test and use the test results as an evaluation metric. However, conducting large-scale listening tests is time-consuming and expensive. Therefore, several evaluation metrics were derived as surrogates for subjective listening test results. In this study, we propose a multi-branched speech intelligibility prediction model (MBI-Net), for predicting the subjective intelligibility scores of HA users. MBI-Net consists of two branches of models, with each branch consisting of a hearing loss model, a cross-domain feature extraction module, and a speech intelligibility prediction model, to process speech signals from one channel. The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores. Experimental results confirm the effectiveness of MBI-Net, which produces higher prediction scores than the baseline system in Track 1 and Track 2 on the Clarity Prediction Challenge 2022 dataset.

Submitted 30 August, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: Accepted to Interspeech 2022

arXiv:2203.16032 [pdf, other]

ConferencingSpeech 2022 Challenge: Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge for Online Conferencing Applications

Authors: Gaoxiong Yi, Wei Xiao, Yiming Xiao, Babak Naderi, Sebastian Möller, Wafaa Wardah, Gabriel Mittag, Ross Cutler, Zhuohuang Zhang, Donald S. Williamson, Fei Chen, Fuzheng Yang, Shidong Shang

Abstract: With the advances in speech communication systems such as online conferencing applications, we can seamlessly work with people regardless of where they are. However, during online meetings, speech quality can be significantly affected by background noise, reverberation, packet loss, network jitter, etc. Because of its nature, speech quality is traditionally assessed in subjective tests in laboratories and lately also in crowdsourcing following the international standards from ITU-T Rec. P.800 series. However, those approaches are costly and cannot be applied to customer data. Therefore, an effective objective assessment approach is needed to evaluate or monitor the speech quality of the ongoing conversation. The ConferencingSpeech 2022 challenge targets the non-intrusive deep neural network models for the speech quality assessment task. We open-sourced a training corpus with more than 86K speech clips in different languages, with a wide range of synthesized and live degradations and their corresponding subjective quality scores through crowdsourcing. 18 teams submitted their models for evaluation in this challenge. The blind test sets included about 4300 clips from wide ranges of degradations. This paper describes the challenge, the datasets, and the evaluation methods and reports the final results.

Submitted 31 March, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

Showing 1–50 of 108 results for author: Chen, F