Search | arXiv e-print repository

On the social bias of speech self-supervised models

Authors: Yi-Cheng Lin, Tzu-Quan Lin, Hsi-Che Lin, Andy T. Liu, Hung-yi Lee

Abstract: Self-supervised learning (SSL) speech models have achieved remarkable performance in various tasks, yet the biased outcomes, especially affecting marginalized groups, raise significant concerns. Social bias refers to the phenomenon where algorithms potentially amplify disparate properties between social groups present in the data used for training. Bias in SSL models can perpetuate injustice by automating discriminatory patterns and reinforcing inequitable systems. This work reveals that prevalent SSL models inadvertently acquire biased associations. We probe how various factors, such as model architecture, size, and training methodologies, influence the propagation of social bias within these models. Finally, we explore the efficacy of debiasing SSL models through regularization techniques, specifically via model compression. Our findings reveal that employing techniques such as row-pruning and training wider, shallower models can effectively mitigate social bias within SSL models.

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2405.16791 [pdf, ps, other]

Joint Node Selection and Resource Allocation Optimization for Cooperative Sensing with a Shared Wireless Backhaul

Authors: Mingxin Chen, Ming-Min Zhao, An Liu, Min Li, Qingjiang Shi

Abstract: In this paper, we consider a cooperative sensing framework in the context of future multi-functional network with both communication and sensing ability, where one base station (BS) serves as a sensing transmitter and several nearby BSs serve as sensing receivers. Each receiver receives the sensing signal reflected by the target and communicates with the fusion center (FC) through a wireless multiple access channel (MAC) for cooperative target localization. To improve the localization performance, we present a hybrid information-signal domain cooperative sensing (HISDCS) design, where each sensing receiver transmits both the estimated time delay/effective reflecting coefficient and the received sensing signal sampled around the estimated time delay to the FC. Then, we propose to minimize the number of channel uses by utilizing an efficient Karhunen-Loève transformation (KLT) encoding scheme for signal quantization and proper node selection, under the Cramér-Rao lower bound (CRLB) constraint and the capacity limits of MAC. A novel matrix-inequality constrained successive convex approximation (MCSCA) algorithm is proposed to optimize the wireless backhaul resource allocation, together with a greedy strategy for node selection. Despite the high non-convexness of the considered problem, we prove that the proposed MCSCA algorithm is able to converge to the set of Karush-Kuhn-Tucker (KKT) solutions of a relaxed problem obtained by relaxing the discrete variables. Besides, a low-complexity quantization bit reallocation algorithm is designed, which does not perform explicit node selection, and is able to harvest most of the performance gain brought by HISDCS. Finally, numerical simulations are presented to show that the proposed HISDCS design is able to significantly outperform the baseline schemes.

Submitted 26 May, 2024; originally announced May 2024.

Comments: 13 pages, 10 figures

arXiv:2405.04027 [pdf, other]

Joint Visibility Region Detection and Channel Estimation for XL-MIMO Systems via Alternating MAP

Authors: Wenkang Xu, An Liu, Min-jian Zhao

Abstract: We investigate a joint visibility region (VR) detection and channel estimation problem in extremely large-scale multiple-input-multiple-output (XL-MIMO) systems, where near-field propagation and spatial non-stationary effects exist. In this case, each scatterer can only see a subset of antennas, i.e., it has a certain VR over the antennas. Because of the spatial correlation among adjacent sub-arrays, VR of scatterers exhibits a two-dimensional (2D) clustered sparsity. We design a 2D Markov prior model to capture such a structured sparsity. Based on this, a novel alternating maximum a posteriori (MAP) framework is developed for high-accuracy VR detection and channel estimation. The alternating MAP framework consists of three basic modules: a channel estimation module, a VR detection module, and a grid update module. Specifically, the first module is a low-complexity inverse-free variational Bayesian inference (IF-VBI) algorithm that avoids the matrix inverse via minimizing a relaxed Kullback-Leibler (KL) divergence. The second module is a structured expectation propagation (EP) algorithm which has the ability to deal with complicated prior information. And the third module refines polar-domain grid parameters via gradient ascent. Simulations demonstrate the superiority of the proposed algorithm in both VR detection and channel estimation.

Submitted 21 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

Comments: 13 pages, 14 figures, submitted to IEEE TSP

arXiv:2405.01200 [pdf, other]

Learning-to-solve unit commitment based on few-shot physics-guided spatial-temporal graph convolution network

Authors: Mei Yang, Gao Qiu, Junyong Liu, Kai Liu

Abstract: This letter proposes a few-shot physics-guided spatial-temporal graph convolutional network (FPG-STGCN) to quickly solve unit commitment (UC). Firstly, STGCN is tailored to parameterize UC. Then, a few-shot physics-guided learning scheme is proposed. It exploits a few typical UC solutions yielded by a commercial optimizer to escape from local minima, and leverages the augmented Lagrangian method for constraint satisfaction. To further enable both feasibility and continuous relaxation for integers in the learning process, a straight-through estimator for the Tanh-Sign composition is proposed to fully differentiate the mixed-integer solution space. A case study on the IEEE benchmark shows that our method outperforms mainstream learning approaches on UC feasibility and surpasses a traditional solver on efficiency.

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2404.09385 [pdf, other]

A Large-Scale Evaluation of Speech Foundation Models

Authors: Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee

Abstract: The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows for result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models, the correctness of the weighted-sum benchmarking protocol and the statistical significance and robustness of the benchmark.

Submitted 29 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: The extended journal version for SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The Arxiv version is preferred

arXiv:2404.03440 [pdf, ps, other]

doi 10.1109/VTC2023-Fall60731.2023.10333715

Design and Optimization of Cooperative Sensing With Limited Backhaul Capacity

Authors: Wenrui Li, Min Li, An Liu, Tony Xiao Han

Abstract: This paper introduces a cooperative sensing framework designed for integrated sensing and communication cellular networks. The framework comprises one base station (BS) functioning as the sensing transmitter, while several nearby BSs act as sensing receivers. The primary objective is to facilitate cooperative target localization by enabling each receiver to share specific information with a fusion center (FC) over a limited capacity backhaul link. To achieve this goal, we propose an advanced cooperative sensing design that enhances the communication process between the receivers and the FC. Each receiver independently estimates the time delay and the reflecting coefficient associated with the reflected path from the target. Subsequently, each receiver transmits the estimated values and the received signal samples centered around the estimated time delay to the FC. To efficiently quantize the signal samples, a Karhunen-Loève Transform coding scheme is employed. Furthermore, an optimization problem is formulated to allocate backhaul resources for quantizing different samples, improving target localization. Numerical results validate the effectiveness of our proposed advanced design and demonstrate its superiority over a baseline design, where only the locally estimated values are transmitted from each receiver to the FC.

Submitted 4 April, 2024; originally announced April 2024.

Comments: This paper has been published in 2023 IEEE 98th Vehicular Technology Conference (VTC2023-Fall)

arXiv:2403.11974 [pdf, other]

OUCopula: Bi-Channel Multi-Label Copula-Enhanced Adapter-Based CNN for Myopia Screening Based on OU-UWF Images

Authors: Yang Li, Qiuyi Huang, Chong Zhong, Danjuan Yang, Meiyan Li, A. H. Welsh, Aiyi Liu, Bo Fu, Catherien C. Liu, Xingtao Zhou

Abstract: Myopia screening using cutting-edge ultra-widefield (UWF) fundus imaging is potentially significant for ophthalmic outcomes. Current multidisciplinary research between ophthalmology and deep learning (DL) concentrates primarily on disease classification and diagnosis using single-eye images, largely ignoring joint modeling and prediction for Oculus Uterque (OU, both eyes). Inspired by the complex relationships between OU and the high correlation between the (continuous) outcome labels (Spherical Equivalent and Axial Length), we propose a framework of copula-enhanced adapter convolutional neural network (CNN) learning with OU UWF fundus images (OUCopula) for joint prediction of multiple clinical scores. We design a novel bi-channel multi-label CNN that can (1) take bi-channel image inputs subject to both high correlation and heterogeneity (by sharing the same backbone network and employing adapters to parameterize the channel-wise discrepancy), and (2) incorporate correlation information between continuous output labels (using a copula). Solid experiments show that OUCopula achieves satisfactory performance in myopia score prediction compared to backbone models. Moreover, OUCopula can far exceed the performance of models constructed for single-eye inputs. Importantly, our study also hints at the potential extension of the bi-channel model to a multi-channel paradigm and the generalizability of OUCopula across various backbone CNNs.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.13236 [pdf, other]

Towards audio language modeling -- an overview

Authors: Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-wei Chang, Ho-Lam Chung, Alexander H. Liu, Hung-yi Lee

Abstract: Neural audio codecs were initially introduced to compress audio data into compact codes to reduce transmission latency. Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs). Numerous high-performance neural audio codecs and codec-based LMs have been developed. The paper aims to provide a thorough and systematic overview of the neural audio codec models and codec-based LMs.

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.13071 [pdf, other]

Codec-SUPERB: An In-Depth Analysis of Sound Codec Models

Authors: Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu-Hsuan Wang, Kai-Wei Chang, Alexander H. Liu, Hung-yi Lee

Abstract: The sound codec's dual roles in minimizing data transmission latency and serving as tokenizers underscore its critical importance. Recent years have witnessed significant developments in codec models. The ideal sound codec should preserve content, paralinguistics, speakers, and audio information. However, the question of which codec achieves optimal sound information preservation remains unanswered, as in different papers, models are evaluated on their selected experimental settings. This study introduces Codec-SUPERB, an acronym for Codec sound processing Universal PERformance Benchmark. It is an ecosystem designed to assess codec models across representative sound applications and signal-level metrics rooted in sound domain knowledge. Codec-SUPERB simplifies result sharing through an online leaderboard, promoting collaboration within a community-driven benchmark database, thereby stimulating new development cycles for codecs. Furthermore, we undertake an in-depth analysis to offer insights into codec models from both application and signal perspectives, diverging from previous codec papers mainly concentrating on signal-level comparisons. Finally, we will release codes, the leaderboard, and data to accelerate progress within the community.

Submitted 7 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Comments: Github: https://github.com/voidful/Codec-SUPERB

arXiv:2401.13947 [pdf, other]

Networked Multiagent Reinforcement Learning for Peer-to-Peer Energy Trading

Authors: Chen Feng, Andrew L. Liu

Abstract: Utilizing distributed renewable and energy storage resources in local distribution networks via peer-to-peer (P2P) energy trading has long been touted as a solution to improve energy systems' resilience and sustainability. Consumers and prosumers (those who have energy generation resources), however, do not have the expertise to engage in repeated P2P trading, and the zero-marginal costs of renewables present challenges in determining fair market prices. To address these issues, we propose multi-agent reinforcement learning (MARL) frameworks to help automate consumers' bidding and management of their solar PV and energy storage resources, under a specific P2P clearing mechanism that utilizes the so-called supply-demand ratio. In addition, we show how the MARL frameworks can integrate physical network constraints to realize voltage control, hence ensuring physical feasibility of the P2P energy trading and paving the way for real-world implementations.

Submitted 27 January, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

arXiv:2401.13914 [pdf, ps, other]

Analog Beamforming for In-Band Full-Duplex Phased Arrays with Quantized Phase Shifters under a Per-Antenna Received Power Constraint

Authors: Ao Liu, Ian P. Roberts, Taneli Riihonen, Weixing Sheng

Abstract: This letter develops a novel transmit beamforming (BF) design for canceling self-interference (SI) in analog in-band full-duplex phased arrays. Our design maximizes transmit BF gain in a desired direction while simultaneously reducing SI power to below a specified threshold on a per-antenna basis to avoid saturating receive-chain components, such as LNAs. Core to our approach is that it accounts for real-world phase shifters used in analog phased array systems, whose limited resolution imposes non-convex constraints on BF design. We overcome this by transforming these non-convex constraints into convex polygon constraints, which we then solve through semidefinite relaxation and a rank refinement procedure. Numerical results show that our proposed BF scheme reliably cancels SI to the target power threshold at each receive antenna while sacrificing little in transmit BF gain, even with modest phase shifter resolution.

Submitted 24 January, 2024; originally announced January 2024.

Comments: This paper has been submitted to the IEEE for review; copyright may change without notice

arXiv:2401.08833 [pdf, other]

Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective

Authors: Alexander H. Liu, Sung-Lin Yeh, James Glass

Abstract: Existing studies on self-supervised speech representation learning have focused on developing new training methods and applying pre-trained models for different applications. However, the quality of these models is often measured by the performance of different downstream tasks. How well the representations access the information of interest is less studied. In this work, we take a closer look into existing self-supervised methods of speech from an information-theoretic perspective. We aim to develop metrics using mutual information to help practical problems such as model design and selection. We use linear probes to estimate the mutual information between the target information and learned representations, showing another insight into the accessibility to the target information from speech representations. Further, we explore the potential of evaluating representations in a self-supervised fashion, where we estimate the mutual information between different parts of the data without using any labels. Finally, we show that both supervised and unsupervised measures echo the performance of the models on layer-wise linear probing and speech recognition.

Submitted 16 January, 2024; originally announced January 2024.

Comments: ICASSP 2024

arXiv:2401.00429 [pdf, other]

Deeper and Wider Networks for Performance Metrics Prediction in Communication Networks

Authors: Aijia Liu, Shiqing Liu, Xiaobing Pei

Abstract: In today's era, users have increasingly high expectations regarding the performance and efficiency of communication networks. Network operators aspire to achieve efficient network planning, operation, and optimization through Digital Twin Networks (DTN). The effectiveness of DTN heavily relies on the network model, with graph neural networks (GNN) playing a crucial role in network modeling. However, existing network modeling methods still lack a comprehensive understanding of communication networks. In this paper, we propose DWNet (Deeper and Wider Networks), a heterogeneous graph neural network modeling method based on data-driven approaches that aims to address end-to-end latency and jitter prediction in network models. This method stands out due to two distinctive features: firstly, it introduces deeper levels of state participation in the message passing process; secondly, it extensively integrates relevant features during the feature fusion process. Through experimental validation and evaluation, our model achieves higher prediction accuracy compared to previous research achievements, particularly when dealing with unseen network topologies during model training. Our model not only provides more accurate predictions but also demonstrates stronger generalization capabilities across diverse topological structures.

Submitted 31 December, 2023; originally announced January 2024.

arXiv:2310.16338 [pdf, other]

Generative Pre-training for Speech with Flow Matching

Authors: Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

Abstract: Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.

Submitted 25 March, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: ICLR 2024

arXiv:2310.08851 [pdf, ps, other]

A Two-Stage 2D Channel Extrapolation Scheme for TDD 5G NR Systems

Authors: Yubo Wan, An Liu

Abstract: Recently, channel extrapolation has been widely investigated in frequency division duplex (FDD) massive MIMO systems. However, in time division duplex (TDD) fifth generation (5G) new radio (NR) systems, the channel extrapolation problem also arises due to the hopping uplink pilot pattern, which has not been fully researched yet. This paper addresses this gap by formulating a channel extrapolation problem in TDD massive MIMO-OFDM systems for 5G NR, incorporating imperfection factors. A novel two-stage two-dimensional (2D) channel extrapolation scheme in both frequency and time domain is proposed, designed to mitigate the negative effects of imperfection factors and ensure high-accuracy channel estimation. Specifically, in the channel estimation stage, we propose a novel multi-band and multi-timeslot based high-resolution parameter estimation algorithm to achieve 2D channel extrapolation in the presence of imperfection factors. Then, to avoid repeated multi-timeslot based channel estimation, a channel tracking stage is designed during the subsequent time instants, in which a sparse Markov channel model is formulated to capture the dynamic sparsity of massive MIMO-OFDM channels under the influence of imperfection factors. Next, an expectation-maximization (EM) based compressive channel tracking algorithm is designed to jointly estimate unknown imperfection and channel parameters by exploiting the high-resolution prior information of the delay/angle parameters from the previous timeslots. Simulation results underscore the superior performance of our proposed channel extrapolation scheme over baselines.

Submitted 13 October, 2023; originally announced October 2023.

arXiv:2310.05382 [pdf, other]

A Stochastic Particle Variational Bayesian Inference Inspired Deep-Unfolding Network for Non-Convex Parameter Estimation

Authors: Zhixiang Hu, An Liu, Minjian Zhao

Abstract: Future wireless networks are envisioned to provide ubiquitous sensing services, which also gives rise to a substantial demand for high-dimensional non-convex parameter estimation, i.e., the associated likelihood function is non-convex and contains numerous local optima. Variational Bayesian inference (VBI) provides a powerful tool for modeling complex estimation problems and reasoning with prior information, but poses a long-standing challenge on computing intractable posteriori distributions. Most existing variational methods generally rely on assumptions about specific distribution families to derive closed-form solutions, and are difficult to apply in high-dimensional, non-convex scenarios. Given these challenges, firstly, we propose a parallel stochastic particle variational Bayesian inference (PSPVBI) algorithm. Thanks to innovations such as particle approximation, additional updates of particle positions, and parallel stochastic successive convex approximation (PSSCA), PSPVBI can flexibly drive particles to fit the posteriori distribution with acceptable complexity, yielding high-precision estimates of the target parameters. Furthermore, additional speedup can be obtained by deep-unfolding (DU) the PSPVBI algorithm. Specifically, superior hyperparameters are learned to dramatically reduce the number of algorithmic iterations. In this PSPVBI-induced Deep-Unfolding Network, some techniques related to gradient computation, data sub-sampling, differentiable sampling, and generalization ability are also employed to facilitate the practical deployment. Finally, we apply the LPSPVBI to solve several important parameter estimation problems in wireless sensing scenarios. Simulations indicate that the LPSPVBI algorithm outperforms existing solutions.

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2309.14405 [pdf, other]

Joint Audio and Speech Understanding

Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass

Abstract: Humans are surrounded by audio signals that include both speech and non-speech sounds. The recognition and understanding of speech and non-speech audio events, along with a profound comprehension of the relationship between them, constitute fundamental cognitive capabilities. For the first time, we build a machine learning model, called LTU-AS, that has a conceptually similar universal audio perception and advanced reasoning ability. Specifically, by integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events - almost everything perceivable from audio signals.

Submitted 10 December, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: Accepted at ASRU 2023. Code, dataset, and pretrained models are at https://github.com/yuangongnd/ltu. Interactive demo at https://huggingface.co/spaces/yuangongfdu/ltu-2

arXiv:2309.04171 [pdf, other]

PRISTA-Net: Deep Iterative Shrinkage Thresholding Network for Coded Diffraction Patterns Phase Retrieval

Authors: Aoxu Liu, Xiaohong Fan, Yin Yang, Jianping Zhang

Abstract: The problem of phase retrieval (PR) involves recovering an unknown image from limited amplitude measurement data and is a challenging nonlinear inverse problem in computational imaging and image processing. However, many of the PR methods are based on black-box network models that lack interpretability and plug-and-play (PnP) frameworks that are computationally complex and require careful parameter tuning. To address this, we have developed PRISTA-Net, a deep unfolding network (DUN) based on the first-order iterative shrinkage thresholding algorithm (ISTA). This network utilizes a learnable nonlinear transformation to address the proximal-point mapping sub-problem associated with the sparse priors, and an attention mechanism to focus on phase information containing image edges, textures, and structures. Additionally, the fast Fourier transform (FFT) is used to learn global features to enhance local information, and the designed logarithmic-based loss function leads to significant improvements when the noise level is low. All parameters in the proposed PRISTA-Net framework, including the nonlinear transformation, threshold parameters, and step size, are learned end-to-end instead of being manually set. This method combines the interpretability of traditional methods with the fast inference ability of deep learning and is able to handle noise at each iteration during the unfolding stage, thus improving recovery quality. Experiments on Coded Diffraction Patterns (CDPs) measurements demonstrate that our approach outperforms the existing state-of-the-art methods in terms of qualitative and quantitative evaluations. Our source codes are available at https://github.com/liuaxou/PRISTA-Net.

Submitted 8 September, 2023; originally announced September 2023.

Comments: 12 pages

arXiv:2309.03815 [pdf, other]

T2IW: Joint Text to Image & Watermark Generation

Authors: An-An Liu, Guokai Zhang, Yuting Su, Ning Xu, Yongdong Zhang, Lanjun Wang

Abstract: Recent developments in text-conditioned image generative models have revolutionized the production of realistic results. Unfortunately, this has also led to an increase in privacy violations and the spread of false information, which creates a need for traceability, privacy protection, and other security measures. However, existing text-to-image paradigms lack the technical capabilities to link traceable messages with image generation. In this study, we introduce a novel task for the joint generation of text to image and watermark (T2IW). This T2IW scheme ensures minimal damage to image quality when generating a compound image by forcing the semantic feature and the watermark signal to be compatible in pixels. Additionally, by utilizing principles from Shannon information theory and non-cooperative game theory, we are able to separate the revealed image and the revealed watermark from the compound image. Furthermore, we strengthen the watermark robustness of our approach by subjecting the compound image to various post-processing attacks, with minimal pixel distortion observed in the revealed watermark. Extensive experiments have demonstrated remarkable achievements in image quality, watermark invisibility, and watermark robustness, supported by our proposed set of evaluation metrics.

Submitted 7 September, 2023; originally announced September 2023.

arXiv:2308.03027 [pdf, other]

Causal Disentanglement Hidden Markov Model for Fault Diagnosis

Authors: Rihao Chang, Yongtao Ma, Weizhi Nie, Jie Nie, An-an Liu

Abstract: In modern industries, fault diagnosis has been widely applied with the goal of realizing predictive maintenance. The key issue for the fault diagnosis system is to extract representative characteristics of the fault signal and then accurately predict the fault type. In this paper, we propose a Causal Disentanglement Hidden Markov model (CDHM) to learn the causality in the bearing fault mechanism and thus, capture their characteristics to achieve a more robust representation. Specifically, we make full use of the time-series data and progressively disentangle the vibration signal into fault-relevant and fault-irrelevant factors. The ELBO is reformulated to optimize the learning of the causal disentanglement Markov model. Moreover, to expand the scope of the application, we adopt unsupervised domain adaptation to transfer the learned disentangled representations to other working environments. Experiments were conducted on the CWRU dataset and IMS dataset. Relevant results validate the superiority of the proposed method.

Submitted 6 August, 2023; originally announced August 2023.

arXiv:2307.09149 [pdf, other]

Successive Linear Approximation VBI for Joint Sparse Signal Recovery and Dynamic Grid Parameters Estimation

Authors: Wenkang Xu, An Liu, Bingpeng Zhou, Minjian Zhao

Abstract: For many practical applications in wireless communications, we need to recover a structured sparse signal from a linear observation model with dynamic grid parameters in the sensing matrix. Conventional expectation maximization (EM)-based compressed sensing (CS) methods, such as turbo compressed sensing (Turbo-CS) and turbo variational Bayesian inference (Turbo-VBI), have double-loop iterations, where the inner loop (E-step) obtains a Bayesian estimation of sparse signals and the outer loop (M-step) obtains a point estimation of dynamic grid parameters. This leads to a slow convergence rate. Furthermore, each iteration of the E-step involves a complicated matrix inverse in general. To overcome these drawbacks, we first propose a successive linear approximation VBI (SLA-VBI) algorithm that can provide Bayesian estimation of both sparse signals and dynamic grid parameters. Besides, we simplify the matrix inverse operation based on the majorization-minimization (MM) algorithmic framework. In addition, we extend our proposed algorithm from an independent sparse prior to more complicated structured sparse priors, which can exploit structured sparsity in specific applications to further enhance the performance. Finally, we apply our proposed algorithm to solve two practical application problems in wireless communications and verify that the proposed algorithm can achieve faster convergence, lower complexity, and better performance compared to the state-of-the-art EM-based methods.

Submitted 12 November, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

Comments: 14 pages, 15 figures, submitted to IEEE Transactions on Wireless Communications

arXiv:2306.08256 [pdf, other]

Data Augmentation for Seizure Prediction with Generative Diffusion Model

Authors: Kai Shu, Yuchang Zhao, Le Wu, Aiping Liu, Ruobing Qian, Xun Chen

Abstract: Objective: Seizure prediction is of great importance to improve the life of patients. The focal point is to distinguish preictal states from interictal ones. With the development of machine learning, seizure prediction methods have achieved significant progress. However, the severe imbalance problem between preictal and interictal data still poses a great challenge, restricting the performance of classifiers. Data augmentation is an intuitive way to solve this problem. Existing data augmentation methods generate samples by overlapping or recombining data. The distribution of generated samples is limited by original data, because such transformations cannot fully explore the feature space and offer new information. As the epileptic EEG representation varies among seizures, these generated samples cannot provide enough diversity to achieve high performance on a new seizure. As a consequence, we propose a novel data augmentation method with diffusion model called DiffEEG. Methods: Diffusion models are a class of generative models that consist of two processes. Specifically, in the diffusion process, the model adds noise to the input EEG sample step by step and converts the noisy sample into output random noise, exploring the distribution of data by minimizing the loss between the output and the noise added. In the denoising process, the model samples the synthetic data by removing the noise gradually, diffusing the data distribution to outward areas and narrowing the distance between different clusters. Results: We compared DiffEEG with existing methods, and integrated them into three representative classifiers. The experiments indicate that DiffEEG could further improve the performance and shows superiority to existing methods. Conclusion: This paper proposes a novel and effective method to solve the imbalanced problem and demonstrates the effectiveness and generality of our method.

Submitted 14 June, 2023; originally announced June 2023.

Comments: 12 pages, 6 figures

arXiv:2306.02436 [pdf, ps, other]

Joint Activity Detection and Channel Estimation in Massive Machine-Type Communications with Low-Resolution ADC

Authors: Ye Xue, An Liu, Yang Li, Qingjiang Shi, Vincent Lau

Abstract: In massive machine-type communications, data transmission is usually considered sporadic, and thus inherently has a sparse structure. This paper focuses on the joint activity detection (AD) and channel estimation (CE) problems in massive-connected communication systems with low-resolution analog-to-digital converters. To further exploit the sparse structure in transmission, we propose a maximum posterior probability (MAP) estimation problem based on both sporadic activity and sparse channels for joint AD and CE. Moreover, a majorization-minimization-based method is proposed for solving the MAP problem. Finally, various numerical experiments verify that the proposed scheme outperforms state-of-the-art methods.

Submitted 4 June, 2023; originally announced June 2023.

Comments: This paper has been accepted by ICC 2023 as a regular paper

arXiv:2306.01232 [pdf, other]

Deep Reinforcement Learning Framework for Thoracic Diseases Classification via Prior Knowledge Guidance

Authors: Weizhi Nie, Chen Zhang, Dan Song, Lina Zhao, Yunpeng Bai, Keliang Xie, Anan Liu

Abstract: The chest X-ray is often utilized for diagnosing common thoracic diseases. In recent years, many approaches have been proposed to handle the problem of automatic diagnosis based on chest X-rays. However, the scarcity of labeled data for related diseases still poses a huge challenge to an accurate diagnosis. In this paper, we focus on the thorax disease diagnostic problem and propose a novel deep reinforcement learning framework, which introduces prior knowledge to direct the learning of diagnostic agents, and whose model parameters can be continuously updated as the data increases, like a person's learning process. In particular, 1) prior knowledge can be learned from the pre-trained model based on old data or other domains' similar data, which can effectively reduce the dependence on target domain data, and 2) the framework of reinforcement learning can make the diagnostic agent as exploratory as a human being and improve the accuracy of diagnosis through continuous exploration. The method can also effectively solve the model learning problem in the case of few-shot data and improve the generalization ability of the model. Finally, our approach's performance was demonstrated using the well-known NIH ChestX-ray 14 and CheXpert datasets, and we achieved competitive results. The source code can be found here: https://github.com/NeaseZ/MARL.

Submitted 1 June, 2023; originally announced June 2023.

arXiv:2305.14997 [pdf, other]

3GPP-Like GBSM THz Channel Modeling for Indoor Office and Urban Microcellular Scenarios

Authors: Zhaowei Chang, Jianhua Zhang, Pan Tang, Lei Tian, Hao Jiang, Ximan Liu, and Guangyi Liu

Abstract: Terahertz (THz) communication is envisioned as one of the possible technologies for the sixth-generation (6G) communication system due to its rich spectrum. To evaluate the performance of THz communication, it is essential to propose THz channel models within the common framework of the geometry-based stochastic model (GBSM) in the 3rd Generation Partnership Project (3GPP). This paper focuses on 3GPP-like GBSM THz channel modeling based on channel measurements. We first present channel measurements at 100 GHz in an indoor office scenario and at 132 GHz in an urban microcellular scenario. Subsequently, channel characteristics such as PL, delay spread, angle spread, K-factor, cluster characteristic, cross-correlations, and correlation distance are obtained and analyzed using the measurement data. Additionally, statistical values of the channel characteristics are extracted based on the statistical distribution of 3GPP channel models, which can be used to reconstruct the channel impulse response (CIR). Furthermore, these obtained values are compared with the default values in the 3GPP channel model, revealing discrepancies that indicate the default values cannot accurately characterize the THz channel. For instance, for the case of line-of-sight links in the indoor office, the measured cluster number is 4 while the default value is 15. Finally, the channel capacity at THz frequency band is evaluated by the reconstructed CIRs generated by the GBSM using the measured statistical values and the 3GPP default values. It is observed that the 3GPP default values overestimate the THz channel capacity, equivalent to more than 10 bps/Hz larger at a signal-to-noise ratio of 30 dB. Overall, these findings are helpful in understanding and modeling the THz channel, facilitating the application of THz communication techniques for 6G.

Submitted 22 April, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

arXiv:2305.12072 [pdf, other]

Chest X-ray Image Classification: A Causal Perspective

Authors: Weizhi Nie, Chen Zhang, Dan Song, Lina Zhao, Yunpeng Bai, Keliang Xie, Anan Liu

Abstract: The chest X-ray (CXR) is one of the most common and easy-to-get medical tests used to diagnose common diseases of the chest. Recently, many deep learning-based methods have been proposed that are capable of effectively classifying CXRs. Even though these techniques have worked quite well, it is difficult to establish whether what these algorithms actually learn is the cause-and-effect link between diseases and their causes or just how to map labels to photos. In this paper, we propose a causal approach to address the CXR classification problem, which constructs a structural causal model (SCM) and uses the backdoor adjustment to select effective visual information for CXR classification. Specifically, we design different probability optimization functions to eliminate the influence of confounders on the learning of real causality. Experimental results demonstrate that our proposed method outperforms the open-source NIH ChestX-ray14 in terms of classification performance.

Submitted 19 May, 2023; originally announced May 2023.

arXiv:2305.12070 [pdf, other]

Instrumental Variable Learning for Chest X-ray Classification

Authors: Weizhi Nie, Chen Zhang, Dan Song, Yunpeng Bai, Keliang Xie, Anan Liu

Abstract: The chest X-ray (CXR) is commonly employed to diagnose thoracic illnesses, but the challenge of achieving accurate automatic diagnosis through this method persists due to the complex relationship between pathologies. In recent years, various deep learning-based approaches have been suggested to tackle this problem, but confounding factors such as image resolution or noise problems often damage model performance. In this paper, we focus on the chest X-ray classification task and propose an interpretable instrumental variable (IV) learning framework to eliminate the spurious association and obtain accurate causal representation. Specifically, we first construct a structural causal model (SCM) for our task and learn the confounders and the preliminary representations of IV. We then leverage electronic health record (EHR) as auxiliary information and fuse the above feature with our transformer-based semantic fusion module, so that the IV carries medical semantics. Meanwhile, the reliability of IV is further guaranteed via the constraints of mutual information between related causal variables. Finally, our approach's performance is demonstrated using the MIMIC-CXR, NIH ChestX-ray 14, and CheXpert datasets, and we achieve competitive results.

Submitted 19 May, 2023; originally announced May 2023.

arXiv:2305.11072 [pdf, other]

Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering

Authors: Heng-Jui Chang, Alexander H. Liu, James Glass

Abstract: Self-supervised speech representation models have succeeded in various tasks, but improving them for content-related problems using unlabeled data is challenging. We propose speaker-invariant clustering (Spin), a novel self-supervised learning method that clusters speech representations and performs swapped prediction between the original and speaker-perturbed utterances. Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU. Spin improves pre-trained networks and outperforms prior methods in speech recognition and acoustic unit discovery.

Submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted to Interspeech 2023

arXiv:2305.10790 [pdf, other]

Listen, Think, and Understand

Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James Glass

Abstract: The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-defined, discrete sound label sets. In contrast, humans possess the ability to not only classify sounds into general categories, but also to listen to the finer details of the sounds, explain the reason for the predictions, think about what the sound infers, and understand the scene and what action needs to be taken, if any. Such capabilities beyond perception are not yet present in existing audio models. On the other hand, modern large language models (LLMs) exhibit emerging reasoning ability but they lack audio perception capabilities. Therefore, we ask the question: can we build a model that has both audio perception and a reasoning ability? In this paper, we propose a new audio foundation model, called LTU (Listen, Think, and Understand). To train LTU, we created a new OpenAQA-5M dataset consisting of 1.9 million closed-ended and 3.7 million open-ended, diverse (audio, question, answer) tuples, and have used an autoregressive training framework with a perception-to-understanding curriculum. LTU demonstrates strong performance and generalization ability on conventional audio tasks such as classification and captioning. More importantly, it exhibits emerging audio reasoning and comprehension abilities that are absent in existing audio models. To the best of our knowledge, LTU is one of the first multimodal large language models that focus on general audio (rather than just speech) understanding.

Submitted 19 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted at ICLR 2024. Code, dataset, and models are available at https://github.com/YuanGongND/ltu. The interactive demo is at https://huggingface.co/spaces/yuangongfdu/ltu

arXiv:2305.07774 [pdf, other]

PanFlowNet: A Flow-Based Deep Network for Pan-sharpening

Authors: Gang Yang, Xiangyong Cao, Wenzhe Xiao, Man Zhou, Aiping Liu, Xun Chen, Deyu Meng

Abstract: Pan-sharpening aims to generate a high-resolution multispectral (HRMS) image by integrating the spectral information of a low-resolution multispectral (LRMS) image with the texture details of a high-resolution panchromatic (PAN) image. It essentially inherits the ill-posed nature of the super-resolution (SR) task that diverse HRMS images can degrade into an LRMS image. However, existing deep learning-based methods recover only one HRMS image from the LRMS image and PAN image using a deterministic mapping, thus ignoring the diversity of the HRMS image. In this paper, to alleviate this ill-posed issue, we propose a flow-based pan-sharpening network (PanFlowNet) to directly learn the conditional distribution of HRMS image given LRMS image and PAN image instead of learning a deterministic mapping. Specifically, we first transform this unknown conditional distribution into a given Gaussian distribution by an invertible network, and the conditional distribution can thus be explicitly defined. Then, we design an invertible Conditional Affine Coupling Block (CACB) and further build the architecture of PanFlowNet by stacking a series of CACBs. Finally, the PanFlowNet is trained by maximizing the log-likelihood of the conditional distribution given a training set and can then be used to predict diverse HRMS images. The experimental results verify that the proposed PanFlowNet can generate various HRMS images given an LRMS image and a PAN image. Additionally, the experimental results on different kinds of satellite datasets also demonstrate the superiority of our PanFlowNet compared with other state-of-the-art methods both visually and quantitatively.

Submitted 16 May, 2023; v1 submitted 12 May, 2023; originally announced May 2023.

arXiv:2303.07742 [pdf, other]

ForDigitStress: A multi-modal stress dataset employing a digital job interview scenario

Authors: Alexander Heimerl, Pooja Prajod, Silvan Mertes, Tobias Baur, Matthias Kraus, Ailin Liu, Helen Risack, Nicolas Rohleder, Elisabeth André, Linda Becker

Abstract: We present a multi-modal stress dataset that uses digital job interviews to induce stress. The dataset provides multi-modal data of 40 participants including audio, video (motion capturing, facial recognition, eye tracking) as well as physiological information (photoplethysmography, electrodermal activity). In addition to that, the dataset contains time-continuous annotations for stress and occurred emotions (e.g. shame, anger, anxiety, surprise). In order to establish a baseline, five different machine learning classifiers (Support Vector Machine, K-Nearest Neighbors, Random Forest, Long-Short-Term Memory Network) have been trained and evaluated on the proposed dataset for a binary stress classification task. The best-performing classifier achieved an accuracy of 88.3% and an F1-score of 87.5%.

Submitted 14 March, 2023; originally announced March 2023.

arXiv:2302.11334 [pdf, other]

Stabilization with Prescribed Instant via Lyapunov Method

Authors: Jiyuan Kuang, Yabin Gao, Yizhuo Sun, Jiahui Wang, Aohua Liu, Yue Zhao, Jianxing Liu

Abstract: This letter investigates the prescribed-instant stabilization problem for high-order integrator systems. In other words, the settling time under the presented controller is independent of the initial conditions and equals the prescribed time instant. The controller is designed with the concept of backstepping. A strict proof based on the Lyapunov method is presented to clamp the settling time to the prescribed time instant from both the left and right sides. This proof serves as an example to present a general framework to verify the designed stabilization property. It should be emphasized that the prescribed-time stability (PSTS) [1] can only prescribe the upper bound of the settling time and is different from this work. The detailed argumentation will be presented after a brief review of the existing important research.

Submitted 22 February, 2023; originally announced February 2023.

arXiv:2302.02587 [pdf, other]

Joint Scattering Environment Sensing and Channel Estimation Based on Non-stationary Markov Random Field

Authors: Wenkang Xu, Yongbo Xiao, An Liu, Ming Lei, Minjian Zhao

Abstract: This paper considers an integrated sensing and communication system, where some radar targets also serve as communication scatterers. A location domain channel modeling method is proposed based on the position of targets and scatterers in the scattering environment, and the resulting radar and communication channels exhibit a two-dimensional (2-D) joint burst sparsity. We propose a joint scattering environment sensing and channel estimation scheme to enhance the target/scatterer localization and channel estimation performance simultaneously, where a spatially non-stationary Markov random field (MRF) model is proposed to capture the 2-D joint burst sparsity. An expectation maximization (EM) based method is designed to solve the joint estimation problem, where the E-step obtains the Bayesian estimation of the radar and communication channels and the M-step automatically learns the dynamic position grid and prior parameters in the MRF. However, the existing sparse Bayesian inference methods used in the E-step involve a high-complexity matrix inverse per iteration. Moreover, due to the complicated non-stationary MRF prior, the complexity of M-step is exponentially large. To address these difficulties, we propose an inverse-free variational Bayesian inference algorithm for the E-step and a low-complexity method based on pseudo-likelihood approximation for the M-step. In the simulations, the proposed scheme can achieve a better performance than the state-of-the-art method while reducing the computational overhead significantly.

Submitted 18 July, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

Comments: 15 pages, 13 figures, submitted to IEEE Transactions on Wireless Communications

arXiv:2302.01619 [pdf, other]

Joint Scattering Environment Sensing and Channel Estimation for Integrated Sensing and Communication

Authors: Wenkang Xu, Yongbo Xiao, An Liu, Minjian Zhao

Abstract: This paper considers an integrated sensing and communication system, where some radar targets also serve as communication scatterers. A location domain channel modeling method is proposed based on the position of targets and scatterers in the scattering environment, and the resulting radar and communication channels exhibit a partially common sparsity. By exploiting this, we propose a joint scattering environment sensing and channel estimation scheme to enhance the target/scatterer localization and channel estimation performance simultaneously. Specifically, the base station (BS) first transmits downlink pilots to sense the targets in the scattering environment. Then the user transmits uplink pilots to estimate the communication channel. Finally, joint scattering environment sensing and channel estimation are performed at the BS based on the reflected downlink pilot signal and received uplink pilot signal. A message passing based algorithm is designed by combining the turbo approach and the expectation maximization method. The advantages of our proposed scheme are verified in the simulations.

Submitted 3 February, 2023; originally announced February 2023.

arXiv:2211.14313 [pdf, other]

AICOM-MP: an AI-based Monkeypox Detector for Resource-Constrained Environments

Authors: Tim Tianyi Yang, Tom Tianze Yang, Andrew Liu, Jie Tang, Na An, Shaoshan Liu, Xue Liu

Abstract: Under the Autonomous Mobile Clinics (AMCs) initiative, we are developing, open sourcing, and standardizing health AI technologies to enable healthcare access in least developed countries (LDCs). We deem AMCs as the next generation of health care delivery platforms, whereas health AI engines are applications on these platforms, similar to how various applications expand the usage scenarios of smart phones. Facing the recent global monkeypox outbreak, in this article, we introduce AICOM-MP, an AI-based monkeypox detector specially aiming for handling images taken from resource-constrained devices. Compared to existing AI-based monkeypox detectors, AICOM-MP has achieved state-of-the-art (SOTA) performance. We have hosted AICOM-MP as a web service to allow universal access to monkeypox screening technology. We have also open sourced both the source code and the dataset of AICOM-MP to allow health AI professionals to integrate AICOM-MP into their services. Also, through the AICOM-MP project, we have generalized a methodology of developing health AI technologies for AMCs to allow universal access even in resource-constrained environments.

Submitted 21 November, 2022; originally announced November 2022.

arXiv:2211.11749 [pdf]

Towards Automatic Prediction of Outcome in Treatment of Cerebral Aneurysms

Authors: Ashutosh Jadhav, Satyananda Kashyap, Hakan Bulu, Ronak Dholakia, Amon Y. Liu, Tanveer Syeda-Mahmood, William R. Patterson, Hussain Rangwala, Mehdi Moradi

Abstract: Intrasaccular flow disruptors treat cerebral aneurysms by diverting the blood flow from the aneurysm sac. Residual flow into the sac after the intervention is a failure that could be due to the use of an undersized device, or to vascular anatomy and clinical condition of the patient. We report a machine learning model based on over 100 clinical and imaging features that predict the outcome of wide-neck bifurcation aneurysm treatment with an intravascular embolization device. We combine clinical features with a diverse set of common and novel imaging measurements within a random forest model. We also develop neural network segmentation algorithms in 2D and 3D to contour the sac in angiographic images and automatically calculate the imaging features. These deliver 90% overlap with manual contouring in 2D and 83% in 3D. Our predictive model classifies complete vs. partial occlusion outcomes with an accuracy of 75.31%, and weighted F1-score of 0.74.

Submitted 18 November, 2022; originally announced November 2022.

Comments: 10 pages

Report number: https://s4.goeshow.com/amia/annual/2022/schedule_at_a_glance.cfm?session_key=1965BCBD-A832-92DD-9D05-FB2CB132FADB&session_date=

Journal ref: AMAI 2022 Annual Symposium

arXiv:2210.07839 [pdf, other]

Contrastive Audio-Visual Masked Autoencoder

Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

Abstract: In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task. Code and pretrained models are at https://github.com/yuangongnd/cav-mae.

Submitted 11 April, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

Comments: Accepted at ICLR 2023 as a notable top 25% paper. Code and pretrained models are at https://github.com/yuangongnd/cav-mae

arXiv:2210.01032 [pdf]

A New Hip Fracture Risk Index Derived from FEA-Computed Proximal Femur Fracture Loads and Energies-to-Failure

Authors: Xuewei Cao, Joyce H Keyak, Sigurdur Sigurdsson, Chen Zhao, Weihua Zhou, Anqi Liu, Thomas Lang, Hong-Wen Deng, Vilmundur Gudnason, Qiuying Sha

Abstract: Hip fracture risk assessment is an important but challenging task. Quantitative CT-based patient specific finite element analysis (FEA) computes the force (fracture load) to break the proximal femur in a particular loading condition. It provides different structural information about the proximal femur that can influence a subject's overall fracture risk. To obtain a more robust measure of fracture risk, we used principal component analysis (PCA) to develop a global FEA computed fracture risk index that incorporates the FEA-computed yield and ultimate failure loads and energies to failure in four loading conditions (single-limb stance and impact from a fall onto the posterior, posterolateral, and lateral aspects of the greater trochanter) of 110 hip fracture subjects and 235 age and sex matched control subjects from the AGES-Reykjavik study. We found that the first PC (PC1) of the FE parameters was the only significant predictor of hip fracture. Using a logistic regression model, we determined if prediction performance for hip fracture using PC1 differed from that using FE parameters combined by stratified random resampling with respect to hip fracture status. The results showed that the average of the area under the receiver operating characteristic curve (AUC) using PC1 was always higher than that using all FE parameters combined in the male subjects. The AUC of PC1 and AUC of the FE parameters combined were not significantly different than that in the female subjects or in all subjects.

Submitted 18 November, 2022; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: 27 pages, 4 figures

arXiv:2209.14505 [pdf, other]

Optimal Retail Tariff Design with Prosumers: Pursuing Equity at the Expenses of Economic Efficiencies?

Authors: Yihsu Chen, Andrew L. Liu, Makoto Tanaka, Ryuta Takashima

Abstract: Distributed renewable resources owned by prosumers can be an effective way of fortifying grid resilience and enhancing sustainability. However, prosumers serve their own interests and their objectives are unlikely to align with that of society. This paper develops a bilevel model to study the optimal design of retail electricity tariffs considering the balance between economic efficiency and energy equity. The retail tariff entails a fixed charge and a volumetric charge tied to electricity usage to recover utilities' fixed costs. We analyze solution properties of the bilevel problem and prove an optimal rate design, which is to use fixed charges to recover fixed costs and to balance energy equity among different income groups. This suggests that programs similar to CARE (California Alternative Rate of Energy), which offer lower retail rates to low-income households, are unlikely to be efficient, even if they are politically appealing.

Submitted 28 September, 2022; originally announced September 2022.

arXiv:2209.07773 [pdf, ps, other]

Event-Triggered Extended State Observer Based Distributed Control of Nonlinear Vehicle Platoons

Authors: Anquan Liu, Tao Li, Yu Gu

Abstract: We study the platoon control of vehicles with third-order nonlinear dynamics under the constant spacing policy. We consider a vehicle model with parameter uncertainties and external disturbances and propose a distributed control law based on an event-triggered extended state observer (ESO). First, an event-triggered ESO is designed to estimate the unmodeled dynamics in the vehicle model. Then based on the estimate of the unmodeled dynamics, a distributed control law is designed by using a modified dynamic surface control method. The control law of each follower vehicle only uses the information obtained by on-board sensors, including its own velocity, acceleration, the velocity of the preceding vehicle and the inter-vehicle distance. Finally, we give the range of the control parameters to ensure the stability of the vehicle platoon system. It is shown that the control parameters can be properly designed to make the observation errors of the ESOs bounded and ensure the string stability and closed-loop stability. We prove that the Zeno behavior is avoided under the designed event-triggered mechanism. The joint simulations of CarSim and MATLAB are given to demonstrate the effectiveness of the proposed control law.

Submitted 16 September, 2022; originally announced September 2022.

arXiv:2209.07030 [pdf, other]

doi 10.1145/3503161.3548068

Model-Guided Multi-Contrast Deep Unfolding Network for MRI Super-resolution Reconstruction

Authors: Gang Yang, Li Zhang, Man Zhou, Aiping Liu, Xun Chen, Zhiwei Xiong, Feng Wu

Abstract: Magnetic resonance imaging (MRI) with high resolution (HR) provides more detailed information for accurate diagnosis and quantitative image analysis. Despite the significant advances, most existing super-resolution (SR) reconstruction networks for medical images have two flaws: 1) All of them are designed in a black-box principle, thus lacking sufficient interpretability and further limiting their practical applications. Interpretable neural network models are of significant interest since they enhance the trustworthiness required in clinical practice when dealing with medical images. 2) Most existing SR reconstruction approaches only use a single contrast or use a simple multi-contrast fusion mechanism, neglecting the complex relationships between different contrasts that are critical for SR improvement. To deal with these issues, in this paper, a novel Model-Guided interpretable Deep Unfolding Network (MGDUN) for medical image SR reconstruction is proposed. The Model-Guided image SR reconstruction approach solves manually designed objective functions to reconstruct HR MRI. We show how to unfold an iterative MGDUN algorithm into a novel model-guided deep unfolding network by taking the MRI observation matrix and explicit multi-contrast relationship matrix into account during the end-to-end optimization. Extensive experiments on the multi-contrast IXI dataset and BraTs 2019 dataset demonstrate the superiority of our proposed model.

Submitted 14 September, 2022; originally announced September 2022.

Comments: Accepted to ACMMM 2022, 9 pages

arXiv:2208.00061 [pdf, other]

doi 10.1109/LSP.2022.3224688

UAVM: Towards Unifying Audio and Visual Models

Authors: Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass

Abstract: Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do not have.

Submitted 15 February, 2023; v1 submitted 29 July, 2022; originally announced August 2022.

Comments: Published in Signal Processing Letters. Code at https://github.com/YuanGongND/uavm

Journal ref: IEEE Signal Processing Letters, vol. 29, pp. 2437-2441, 2022

arXiv:2207.10427 [pdf, other]

A Two-stage Multiband WiFi Sensing Scheme via Stochastic Particle-Based Variational Bayesian Inference

Authors: Zhixiang Hu, An Liu, Yubo Wan, Tony Xiao Han, Minjian Zhao

Abstract: Multiband fusion enhances WiFi sensing by jointly utilizing signals from multiple non-contiguous frequency bands. However, in the multi-band WiFi sensing signal model, there are many local optimums in the associated likelihood function due to the existence of high frequency component and phase distortion factors, posing challenges for high-accuracy parameter estimation. To address this, we propose a two-stage scheme equipped with different signal models derived from the original model, where the first-stage coarse estimation is performed using a weighted root MUSIC algorithm to narrow down the search range for the subsequent stage, and the second-stage refined estimation utilizes a Bayesian approach to avoid convergence to bad suboptimal solutions. Specifically, we apply the block stochastic successive convex approximation (SSCA) approach to derive a novel stochastic particle-based variational Bayesian inference (SPVBI) algorithm in the refined stage. Unlike conventional particle-based VBI (PVBI) that optimizes only particle probability and incurs exponential per-iteration complexity with particle count, our more flexible SPVBI algorithm optimizes both the position and probability of each particle. Additionally, it utilizes block SSCA to significantly improve sampling efficiency by averaging over iterations, making it suitable for high-dimensional problems. Extensive simulations demonstrate the superiority of our proposed algorithm over various baseline methods.

Submitted 9 October, 2023; v1 submitted 21 July, 2022; originally announced July 2022.

arXiv:2207.10306 [pdf, ps, other]

Fundamental Limits and Optimization of Multiband Sensing

Authors: Yubo Wan, An Liu, Rui Du, Tony Xiao Han

Abstract: Multiband sensing is a promising technology that utilizes multiple non-contiguous frequency bands to achieve high-resolution target sensing. In this paper, we investigate the fundamental limits and optimization of multiband sensing, focusing on the fundamental limits associated with time delay. We first derive a Fisher information matrix (FIM) with a compact form using the Dirichlet kernel and then derive a closed-form expression of the Cramer-Rao bound (CRB) for the delay separation in a simplified case to reveal useful insights. Then, a metric called the statistical resolution limit (SRL) that provides a resolution limit is employed to investigate the fundamental limits of delay resolution. The fundamental limits of delay estimation are also investigated based on the CRB and Ziv-Zakai bound (ZZB). Based on the above derived fundamental limits, numerical results are presented to analyze the effect of frequency band apertures and phase distortions on the performance limits of the multiband sensing systems. We formulate an optimization problem to find the optimal system configuration in multiband sensing systems with the objective of minimizing the delay SRL. To solve this non-convex constrained problem, we propose an efficient alternating optimization (AO) algorithm which iteratively optimizes the variables using successive convex approximation (SCA) and one-dimensional search. Simulation results demonstrate the effectiveness of the proposed algorithm.

Submitted 31 January, 2023; v1 submitted 21 July, 2022; originally announced July 2022.

arXiv:2207.08123 [pdf, ps, other]

Latency Minimization for mmWave D2D Mobile Edge Computing Systems: Joint Task Allocation and Hybrid Beamforming Design

Authors: Yanzhen Liu, Yunlong Cai, An Liu, Minjian Zhao, Lajos Hanzo

Abstract: Mobile edge computing (MEC) and millimeter wave (mmWave) communications are capable of significantly reducing the network's delay and enhancing its capacity. In this paper we investigate a mmWave and device-to-device (D2D) assisted MEC system, in which user A carries out some computational tasks and shares the results with user B with the aid of a base station (BS). We propose a novel two-timescale joint hybrid beamforming and task allocation algorithm to reduce the system latency whilst cutting down the required signaling overhead. Specifically, the high-dimensional analog beamforming matrices are updated in a frame-based manner based on the channel state information (CSI) samples, where each frame consists of a number of time slots, while the low-dimensional digital beamforming matrices and the offloading ratio are optimized more frequently, relying on the low-dimensional effective channel matrices in each time slot. A stochastic successive convex approximation (SSCA) based algorithm is developed to design the long-term analog beamforming matrices. As for the short-term variables, the digital beamforming matrices are optimized relying on the innovative penalty-concave convex procedure (penalty-CCCP) for handling the mmWave non-linear transmit power constraint, and the offloading ratio can be obtained via the derived closed-form solution. Simulation results verify the effectiveness of the proposed algorithm by comparison with the benchmarks.

Submitted 17 July, 2022; originally announced July 2022.

arXiv:2206.09751 [pdf, ps, other]

Multiband Delay Estimation for Localization Using a Two-Stage Global Estimation Scheme

Authors: Yubo Wan, An Liu, Qiyu Hu, Mianyi Zhang, Yunlong Cai

Abstract: The time of arrival (TOA)-based localization techniques, which need to estimate the delay of the line-of-sight (LoS) path, have been widely employed in location-aware networks. To achieve a high-accuracy delay estimation, a number of multiband-based algorithms have been proposed recently, which exploit the channel state information (CSI) measurements over multiple non-contiguous frequency bands. However, to the best of our knowledge, there is still no efficient scheme that fully exploits the multiband gains when the phase distortion factors caused by hardware imperfections are considered, because the associated multi-parameter estimation problem contains many local optimums and the existing algorithms can easily get stuck in a "bad" local optimum. To address these issues, we propose a novel two-stage global estimation (TSGE) scheme for multiband delay estimation. In the coarse stage, we exploit the group sparsity structure of the multiband channel and propose a Turbo Bayesian inference (Turbo-BI) algorithm to achieve a good initial delay estimation based on a coarse signal model, which is transformed from the original multiband signal model by absorbing the carrier frequency terms. The estimation problem derived from the coarse signal model contains fewer local optimums and thus a more stable estimation can be achieved than directly using the original signal model. Then in the refined stage, with the help of coarse estimation results to narrow down the search range, we perform a global delay estimation using a particle swarm optimization-least square (PSO-LS) algorithm based on a refined multiband signal model to exploit the multiband gains to further improve the estimation accuracy. Simulation results show that the proposed TSGE significantly outperforms the benchmarks with comparative computational complexity.

Submitted 20 June, 2022; originally announced June 2022.

arXiv:2204.11806 [pdf, other]

doi 10.1109/TASLP.2023.3301212

Parallel Synthesis for Autoregressive Speech Generation

Authors: Po-chun Hsu, Da-rong Liu, Andy T. Liu, Hung-yi Lee

Abstract: Autoregressive neural vocoders have achieved outstanding performance in speech synthesis tasks such as text-to-speech and voice conversion. An autoregressive vocoder predicts a sample at some time step conditioned on those at previous time steps. Though it synthesizes natural human speech, the iterative generation inevitably makes the synthesis time proportional to the utterance length, leading to low efficiency. Many works were dedicated to generating the whole speech sequence in parallel and proposed GAN-based, flow-based, and score-based vocoders. This paper proposes a new approach to autoregressive generation. Instead of iteratively predicting samples in a time sequence, the proposed model performs frequency-wise autoregressive generation (FAR) and bit-wise autoregressive generation (BAR) to synthesize speech. In FAR, a speech utterance is split into frequency subbands, and a subband is generated conditioned on the previously generated one. Similarly, in BAR, an 8-bit quantized signal is generated iteratively from the first bit. By redesigning the autoregressive method to compute in domains other than the time domain, the number of iterations in the proposed model is no longer proportional to the utterance length but to the number of subbands/bits, significantly increasing inference efficiency. Besides, a post-filter is employed to sample signals from output posteriors; its training objective is designed based on the characteristics of the proposed methods. Experimental results show that the proposed model can synthesize speech faster than real-time without GPU acceleration. Compared with baseline vocoders, the proposed model achieves better MUSHRA results and shows good generalization ability for unseen speakers and 44 kHz speech.

Submitted 5 June, 2024; v1 submitted 25 April, 2022; originally announced April 2022.

Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2204.02524 [pdf, other]

Simple and Effective Unsupervised Speech Synthesis

Authors: Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass

Abstract: We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe. The framework leverages recent work in unsupervised speech recognition as well as existing neural-based speech synthesis. Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus. Experiments demonstrate the unsupervised system can synthesize speech similar to a supervised counterpart in terms of naturalness and intelligibility measured by human evaluation.

Submitted 20 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: preprint, equal contribution from first two authors

arXiv:2204.02492 [pdf, other]

Towards End-to-end Unsupervised Speech Recognition

Authors: Alexander H. Liu, Wei-Ning Hsu, Michael Auli, Alexei Baevski

Abstract: Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR) systems accessible to every language. However, existing methods still heavily rely on hand-crafted pre-processing. Similar to the trend of making supervised speech recognition end-to-end, we introduce wav2vec-U 2.0 which does away with all audio-side pre-processing and improves accuracy through better architecture. In addition, we introduce an auxiliary self-supervised objective that ties model predictions back to the input. Experiments show that wav2vec-U 2.0 improves unsupervised recognition results across different languages while being conceptually simpler.

Submitted 15 June, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: Preprint

arXiv:2203.06849 [pdf, other]

SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities

Authors: Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-wen Yang, Shuyan Dong, Andy T. Liu, Cheng-I Jeff Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

Abstract: Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology is limiting towards a holistic understanding of the efficacy of such models. SUPERB was a step towards introducing a common benchmark to evaluate pre-trained models across various speech tasks. In this paper, we introduce SUPERB-SG, a new benchmark focused on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain and quality across different types of tasks. It entails freezing pre-trained model parameters, only using simple task-specific trainable heads. The goal is to be inclusive of all researchers, and encourage efficient use of computational resources. We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.

Submitted 14 March, 2022; originally announced March 2022.

Comments: ACL 2022 main conference

Showing 1–50 of 103 results for author: Liu, A