-
Spatial-Frequency Dual Progressive Attention Network For Medical Image Segmentation
Authors:
Zhenhuan Zhou,
Along He,
Yanlin Wu,
Rui Yao,
Xueshuo Xie,
Tao Li
Abstract:
In medical images, various types of lesions often manifest significant differences in their shape and texture. Accurate medical image segmentation demands deep learning models with robust capabilities in multi-scale and boundary feature learning. However, previous networks still have limitations in addressing the above issues. Firstly, previous networks simultaneously fuse multi-level features or employ deep supervision to enhance multi-scale learning. However, this may lead to feature redundancy and excessive computational overhead, which is not conducive to network training and clinical deployment. Secondly, the majority of medical image segmentation networks exclusively learn features in the spatial domain, disregarding the abundant global information in the frequency domain. This results in a bias towards low-frequency components, neglecting crucial high-frequency information. To address these problems, we introduce SF-UNet, a spatial-frequency dual-domain attention network. It comprises two main components: the Multi-scale Progressive Channel Attention (MPCA) block, which progressively extracts multi-scale features across adjacent encoder layers, and the lightweight Frequency-Spatial Attention (FSA) block, with only 0.05M parameters, enabling concurrent learning of texture and boundary features from both spatial and frequency domains. We validate the effectiveness of the proposed SF-UNet on three public datasets. Experimental results show that compared to previous state-of-the-art (SOTA) medical image segmentation networks, SF-UNet achieves the best performance, and achieves up to 9.4% and 10.78% improvement in DSC and IOU. Codes will be released at https://github.com/nkicsl/SF-UNet.
Submitted 12 June, 2024;
originally announced June 2024.
-
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
Authors:
Yuepeng Jiang,
Tao Li,
Fengyu Yang,
Lei Xie,
Meng Meng,
Yujun Wang
Abstract:
Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Besides, given that prosody contains both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the baseline but also exhibits better naturalness and expressiveness.
Submitted 11 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
Medformer: A Multi-Granularity Patching Transformer for Medical Time-Series Classification
Authors:
Yihe Wang,
Nan Huang,
Taida Li,
Yujun Yan,
Xiang Zhang
Abstract:
Medical time series data, such as Electroencephalography (EEG) and Electrocardiography (ECG), play a crucial role in healthcare, such as diagnosing brain and heart diseases. Existing methods for medical time series classification primarily rely on handcrafted biomarkers extraction and CNN-based models, with limited exploration of transformers tailored for medical time series. In this paper, we introduce Medformer, a multi-granularity patching transformer tailored specifically for medical time series classification. Our method incorporates three novel mechanisms to leverage the unique characteristics of medical time series: cross-channel patching to leverage inter-channel correlations, multi-granularity embedding for capturing features at different scales, and two-stage (intra- and inter-granularity) multi-granularity self-attention for learning features and correlations within and among granularities. We conduct extensive experiments on five public datasets under both subject-dependent and challenging subject-independent setups. Results demonstrate Medformer's superiority over 10 baselines, achieving top averaged ranking across five datasets on all six evaluation metrics. These findings underscore the significant impact of our method on healthcare applications, such as diagnosing Myocardial Infarction, Alzheimer's, and Parkinson's disease. We release the source code at https://github.com/DL4mHealth/Medformer.
Submitted 24 May, 2024;
originally announced May 2024.
-
Graphon Particle Systems, Part I: Spatio-Temporal Approximation and Law of Large Numbers
Authors:
Yan Chen,
Tao Li
Abstract:
We study a class of graphon particle systems with time-varying random coefficients. In a graphon particle system, the interactions among particles are characterized by the coupled mean field terms through an underlying graphon and the randomness of the coefficients comes from the stochastic processes associated with the particle labels. By constructing two-level approximated sequences converging in 2-Wasserstein distance, we prove the existence and uniqueness of the solution to the system. Besides, by constructing two-level approximated functions converging to the graphon mean field terms, we establish the law of large numbers, which reveals that if the number of particles tends to infinity and the discretization step tends to zero, then the discrete-time interacting particle system over the large-scale network converges to the graphon particle system. As a byproduct, we discover that the graphon particle system can describe the dynamic evolution of the distributed stochastic gradient descent algorithm over the large-scale network and prove that if the gradients of the local cost functions are Lipschitz continuous, then the graphon particle system can be regarded as the spatio-temporal approximation of the discrete-time distributed stochastic gradient descent algorithm as the number of network nodes tends to infinity and the algorithm step size tends to zero.
Submitted 26 May, 2024;
originally announced May 2024.
-
A Flat Dual-Polarized Millimeter-Wave Luneburg Lens Antenna Using Transformation Optics with Reduced Anisotropy and Impedance Mismatch
Authors:
Yuanyan Su,
Teng Li,
Wei Hong,
Zhi Ning Chen,
Anja K. Skrivervik
Abstract:
In this paper, a compact wideband dual-polarized Luneburg lens antenna (LLA) with reduced anisotropy and improved impedance matching is proposed in Ka band with a wide 2D beamscanning capability. Based on transformation optics, the spherical Luneburg lens is compressed into a cylindrical one, while the merits of high gain, broad band, wide scanning, and free polarization are preserved. A trigonometric function is applied to the material property of the flattened Luneburg lens with reduced anisotropy, which effectively alleviates the strong reflection, the high sidelobes and back radiation with a free cost on the antenna weight and volume. Furthermore, a light thin wideband 7-by-1 metasurface phased array is studied as the primary feed for the LLA. The proposed metantenna, short for metamaterial-based antenna, has a high potential for B5G, future wireless communication and radar sensing as an onboard system.
Submitted 20 May, 2024;
originally announced May 2024.
-
Asynchronous MIMO-OFDM Massive Unsourced Random Access with Codeword Collisions
Authors:
Tianya Li,
Yongpeng Wu,
Junyuan Gao,
Wenjun Zhang,
Xiang-Gen Xia,
Derrick Wing Kwan Ng,
Chengshan Xiao
Abstract:
This paper investigates asynchronous MIMO massive unsourced random access in an orthogonal frequency division multiplexing (OFDM) system over frequency-selective fading channels, with the presence of both timing and carrier frequency offsets (TO and CFO) and non-negligible codeword collisions. The proposed coding framework segregates the data into two components, namely, preamble and coding parts, with the former being tree-coded and the latter LDPC-coded. By leveraging the dual sparsity of the equivalent channel across both codeword and delay domains (CD and DD), we develop a message passing-based sparse Bayesian learning algorithm, combined with belief propagation and mean field, to iteratively estimate DD channel responses, TO, and delay profiles. Furthermore, we establish a novel graph-based algorithm to iteratively separate the superimposed channels and compensate for the phase rotations. Additionally, the proposed algorithm is applied to the flat fading scenario to estimate both TO and CFO, where the channel and offset estimation is enhanced by leveraging the geometric characteristics of the signal constellation. Simulations reveal that the proposed algorithm achieves superior performance and substantial complexity reduction in both channel and offset estimation compared to the codebook enlarging-based counterparts, and enhanced data recovery performances compared to state-of-the-art URA schemes.
Submitted 20 May, 2024;
originally announced May 2024.
-
Convergence Conditions of Online Regularized Statistical Learning in Reproducing Kernel Hilbert Space With Non-Stationary Data
Authors:
Xiwei Zhang,
Tao Li
Abstract:
We study the convergence of recursive regularized learning algorithms in the reproducing kernel Hilbert space (RKHS) with dependent and non-stationary online data streams. Firstly, we study the mean square asymptotic stability of a class of random difference equations in RKHS, whose non-homogeneous terms are martingale difference sequences dependent on the homogeneous ones. Secondly, we introduce the concept of random Tikhonov regularization path, and show that if the regularization path is slowly time-varying in some sense, then the output of the algorithm is consistent with the regularization path in mean square. Furthermore, if the data streams also satisfy the RKHS persistence of excitation condition, i.e. there exists a fixed length of time period, such that the conditional expectation of the operators induced by the input data accumulated over every time period has a uniformly strictly positive compact lower bound in the sense of the operator order with respect to time, then the output of the algorithm is consistent with the unknown function in mean square. Finally, for the case with independent and non-identically distributed data streams, the algorithm achieves the mean square consistency provided the marginal probability measures induced by the input data are slowly time-varying and the average measure over each fixed-length time period has a uniformly strictly positive lower bound.
Submitted 9 June, 2024; v1 submitted 4 April, 2024;
originally announced April 2024.
-
On the Variational Interpretation of Mirror Play in Monotone Games
Authors:
Yunian Pan,
Tao Li,
Quanyan Zhu
Abstract:
Mirror play (MP) is a well-accepted primal-dual multi-agent learning algorithm where all agents simultaneously implement mirror descent in a distributed fashion. The advantage of MP over vanilla gradient play lies in its usage of mirror maps that better exploit the geometry of decision domains. Despite extensive literature dedicated to the asymptotic convergence of MP to equilibrium, the understanding of the finite-time behavior of MP before reaching equilibrium is still rudimentary. To facilitate the study of MP's non-equilibrium performance, this work establishes an equivalence between MP's finite-time primal-dual path (mirror path) in monotone games and the closed-loop Nash equilibrium path of a finite-horizon differential game, referred to as mirror differential game (MDG). Our construction of MDG rests on the Brezis-Ekeland variational principle, and the stage cost functional for MDG is Fenchel coupling between MP's iterates and associated gradient updates. The variational interpretation of mirror path in static games as the equilibrium path in MDG holds in deterministic and stochastic cases. Such a variational interpretation translates the non-equilibrium studies of learning dynamics into a more tractable equilibrium analysis of dynamic games, as demonstrated in a case study on the Cournot game, where MP dynamics corresponds to a linear quadratic game.
Submitted 22 March, 2024;
originally announced March 2024.
-
Digital Twin Channel for 6G: Concepts, Architectures and Potential Applications
Authors:
Heng Wang,
Jianhua Zhang,
Gaofeng Nie,
Li Yu,
Zhiqiang Yuan,
Tongjie Li,
Jialin Wang,
Guangyi Liu
Abstract:
Digital twin channel (DTC) is the real-time mapping of a wireless channel from the physical world to the digital world, which is expected to provide significant performance enhancements for the sixth-generation (6G) air-interface design. In this work, we first define five evolution levels of channel twins with the progression of wireless communication. The fifth level, autonomous DTC, is elaborated with multi-dimensional factors such as methodology, characterization precision, and data category. Then, we provide detailed insights into the requirements and architecture of a complete DTC for 6G. Subsequently, a sensing-enhanced real-time channel prediction platform and experimental validations are exhibited. Finally, drawing from the vision of the 6G network, we explore the potential applications and the open issues in future DTC research.
Submitted 31 March, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation
Authors:
Xi Liu,
Ying Guo,
Cheng Zhen,
Tong Li,
Yingying Ao,
Pengfei Yan
Abstract:
Listening head generation aims to synthesize a non-verbal responsive listener head by modeling the correlation between the speaker and the listener in dynamic conversation. The applications of listener agent generation in virtual interaction have promoted many works achieving the diverse and fine-grained motion generation. However, they can only manipulate motions through simple emotional labels, but cannot freely control the listener's motions. Since listener agents should have human-like attributes (e.g. identity, personality) which can be freely customized by users, this limits their realism. In this paper, we propose a user-friendly framework called CustomListener to realize the free-form text prior guided listener generation. To achieve speaker-listener coordination, we design a Static to Dynamic Portrait module (SDP), which interacts with speaker information to transform static text into dynamic portrait token with completion rhythm and amplitude information. To achieve coherence between segments, we design a Past Guided Generation Module (PGG) to maintain the consistency of customized listener attributes through the motion prior, and utilize a diffusion-based structure conditioned on the portrait token and the motion prior to realize the controllable generation. To train and evaluate our model, we have constructed two text-annotated listening head datasets based on ViCo and RealTalk, which provide text-video paired labels. Extensive experiments have verified the effectiveness of our model.
Submitted 29 March, 2024; v1 submitted 29 February, 2024;
originally announced March 2024.
-
Conjectural Online Learning with First-order Beliefs in Asymmetric Information Stochastic Games
Authors:
Tao Li,
Kim Hammar,
Rolf Stadler,
Quanyan Zhu
Abstract:
Asymmetric information stochastic games (AISGs) arise in many complex socio-technical systems, such as cyber-physical systems and IT infrastructures. Existing computational methods for AISGs are primarily offline and cannot adapt to equilibrium deviations. Further, current methods are limited to special classes of AISGs to avoid belief hierarchies. To address these limitations, we propose conjectural online learning (COL), an online method for generic AISGs. COL uses a forecaster-actor-critic (FAC) architecture where subjective forecasts are used to conjecture the opponents' strategies within a lookahead horizon, and Bayesian learning is used to calibrate the conjectures. To adapt strategies to nonstationary environments, COL uses online rollout with cost function approximation (actor-critic). We prove that the conjectures produced by COL are asymptotically consistent with the information feedback in the sense of a relaxed Bayesian consistency. We also prove that the empirical strategy profile induced by COL converges to the Berk-Nash equilibrium, a solution concept characterizing rationality under subjectivity. Experimental results from an intrusion response use case demonstrate COL's superiority over state-of-the-art reinforcement learning methods against nonstationary attacks.
Submitted 8 March, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
Automated Security Response through Online Learning with Adaptive Conjectures
Authors:
Kim Hammar,
Tao Li,
Rolf Stadler,
Quanyan Zhu
Abstract:
We study automated security response for an IT infrastructure and formulate the interaction between an attacker and a defender as a partially observed, non-stationary game. We relax the standard assumption that the game model is correctly specified and consider that each player has a probabilistic conjecture about the model, which may be misspecified in the sense that the true model has probability 0. This formulation allows us to capture uncertainty about the infrastructure and the intents of the players. To learn effective game strategies online, we design a novel method where a player iteratively adapts its conjecture using Bayesian learning and updates its strategy through rollout. We prove that the conjectures converge to best fits, and we provide a bound on the performance improvement that rollout enables with a conjectured model. To characterize the steady state of the game, we propose a variant of the Berk-Nash equilibrium. We present our method through an advanced persistent threat use case. Simulation studies based on testbed measurements show that our method produces effective security strategies that adapt to a changing environment. We also find that our method enables faster convergence than current reinforcement learning techniques.
Submitted 19 February, 2024;
originally announced February 2024.
-
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Authors:
Yutao Hu,
Tianbin Li,
Quanfeng Lu,
Wenqi Shao,
Junjun He,
Yu Qiao,
** Luo
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in various multimodal tasks. However, their potential in the medical domain remains largely unexplored. A significant challenge arises from the scarcity of diverse medical images spanning various modalities and anatomical regions, which is essential in real-world medical applications. To solve this problem, in this paper, we introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark. This benchmark is collected from 73 different medical datasets, including 12 different modalities and covering more than 20 distinct anatomical regions. Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs. Through our extensive experiments, we have found that existing LVLMs struggle to address these medical VQA problems effectively. Moreover, what surprises us is that medical-specialized LVLMs even exhibit inferior performance to those general-domain models, calling for a more versatile and robust LVLM in the biomedical field. The evaluation results not only reveal the current limitations of LVLM in understanding real medical images but also highlight our dataset's significance. Our code and dataset are available at https://github.com/OpenGVLab/Multi-Modality-Arena.
Submitted 21 April, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
An Optimization-based Baseline for Rigid 2D/3D Registration Applied to Spine Surgical Navigation Using CMA-ES
Authors:
Minheng Chen,
Tonglong Li,
Zhirun Zhang,
Youyong Kong
Abstract:
A robust and efficient optimization-based 2D/3D registration framework is crucial for the navigation system of orthopedic surgical robots. It can provide precise position information of surgical instruments and implants during surgery. While artificial intelligence technology has advanced rapidly in recent years, traditional optimization-based registration methods remain indispensable in the field of 2D/3D registration. The exceptional precision of this method enables it to be considered as a post-processing step of the learning-based methods, thereby offering a reliable assurance for registration. In this paper, we present a coarse-to-fine registration framework based on the CMA-ES algorithm. We conducted intensive testing of our method using data from different parts of the spine. The results show the effectiveness of the proposed framework on real orthopedic spine surgery clinical data. This work can be viewed as an additional extension that complements the optimization-based methods employed in our previous studies.
Submitted 8 February, 2024;
originally announced February 2024.
-
Rejection-Sampled Universal Quantization for Smaller Quantization Errors
Authors:
Chih Wei Ling,
Cheuk Ting Li
Abstract:
We construct a randomized vector quantizer which has a smaller maximum error compared to all known lattice quantizers with the same entropy for dimensions 5, 6, ..., 48, and also has a smaller mean squared error compared to known lattice quantizers with the same entropy for dimensions 35, ..., 48, in the high resolution limit. Moreover, our randomized quantizer has a desirable property that the quantization error is always uniform over the ball and independent of the input. Our construction is based on applying rejection sampling on universal quantization, which allows us to shape the error distribution to be any continuous distribution, not only uniform distributions over basic cells of a lattice as in conventional dithered quantization. We also characterize the high SNR limit of one-shot channel simulation for any additive noise channel under a mild assumption (e.g., the AWGN channel), up to an additive constant of 1.45 bits.
Submitted 5 February, 2024;
originally announced February 2024.
-
Arithmetic Average Density Fusion -- Part IV: Distributed Heterogeneous Fusion of RFS and LRFS Filters via Variational Approximation
Authors:
Tiancheng Li,
Haozhe Liang,
Guchong Li,
Jesús García Herrero,
Quan Pan
Abstract:
This paper, the fourth part of a series of papers on the arithmetic average (AA) density fusion approach and its application for target tracking, addresses the intricate challenge of distributed heterogeneous multisensor multitarget tracking, where each inter-connected sensor operates a probability hypothesis density (PHD) filter, a multiple Bernoulli (MB) filter or a labeled MB (LMB) filter and they cooperate with each other via information fusion. Earlier papers in this series have proven that the proper AA fusion of these filters is all exactly built on averaging their respective unlabeled/labeled PHDs. Based on this finding, two PHD-AA fusion approaches are proposed via variational minimization of the upper bound of the Kullback-Leibler divergence between the local and multi-filter averaged PHDs subject to cardinality consensus based on the Gaussian mixture implementation, enabling heterogeneous filter cooperation. One focuses solely on fitting the weights of the local Gaussian components (L-GCs), while the other simultaneously fits all the parameters of the L-GCs at each sensor, both seeking average consensus on the unlabeled PHD, irrespective of the specific posterior form of the local filters. For the distributed peer-to-peer communication, both the classic consensus and flooding paradigms have been investigated. Simulations have demonstrated the effectiveness and flexibility of the proposed approaches in both homogeneous and heterogeneous scenarios.
Submitted 30 January, 2024;
originally announced February 2024.
-
Improving Fairness of Automated Chest X-ray Diagnosis by Contrastive Learning
Authors:
Mingquan Lin,
Tianhao Li,
Zhaoyi Sun,
Gregory Holste,
Ying Ding,
Fei Wang,
George Shih,
Yifan Peng
Abstract:
Purpose: Limited studies have explored concrete methods or approaches to tackle and enhance model fairness in the radiology domain. Our proposed AI model utilizes supervised contrastive learning to minimize bias in CXR diagnosis.
Materials and Methods: In this retrospective study, we evaluated our proposed method on two datasets: the Medical Imaging and Data Resource Center (MIDRC) dataset with 77,887 CXR images from 27,796 patients collected as of April 20, 2023 for COVID-19 diagnosis, and the NIH Chest X-ray (NIH-CXR) dataset with 112,120 CXR images from 30,805 patients collected between 1992 and 2015. In the NIH-CXR dataset, thoracic abnormalities include atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax, consolidation, edema, emphysema, fibrosis, pleural thickening, or hernia. Our proposed method utilizes supervised contrastive learning with carefully selected positive and negative samples to generate fair image embeddings, which are fine-tuned for subsequent tasks to reduce bias in chest X-ray (CXR) diagnosis. We evaluated the methods using the marginal AUC difference ($δ$ mAUC).
Results: The proposed model showed a significant decrease in bias across all subgroups when compared to the baseline models, as evidenced by a paired T-test (p<0.0001). The $δ$ mAUC obtained by our method were 0.0116 (95\% CI, 0.0110-0.0123), 0.2102 (95% CI, 0.2087-0.2118), and 0.1000 (95\% CI, 0.0988-0.1011) for sex, race, and age on MIDRC, and 0.0090 (95\% CI, 0.0082-0.0097) for sex and 0.0512 (95% CI, 0.0512-0.0532) for age on NIH-CXR, respectively.
Conclusion: Employing supervised contrastive learning can mitigate bias in CXR diagnosis, addressing concerns of fairness and reliability in deep learning-based diagnostic methods.
Submitted 25 January, 2024;
originally announced January 2024.
-
A Unified NOMA Framework in Beam-Hopping Satellite Communication Systems
Authors:
Xuyang Zhang,
Xinwei Yue,
Tian Li,
Zhihao Han,
Yafei Wang,
Yong Ding,
Rongke Liu
Abstract:
This paper investigates the application of a unified non-orthogonal multiple access framework in beam hopping (U-NOMA-BH) based satellite communication systems. More specifically, the proposed U-NOMA-BH framework can be applied to code-domain NOMA based BH (CD-NOMA-BH) and power-domain NOMA based BH (PD-NOMA-BH) systems. To satisfy dynamic-uneven traffic demands, we formulate the optimization problem to minimize the square of discrete difference by jointly optimizing power allocation, carrier assignment and beam scheduling. The non-convexity of the objective function and the constraint condition is solved through Dinkelbach's transform and variable relaxation. As a further development, the closed-form and asymptotic expressions of outage probability are derived for CD/PD-NOMA-BH systems. Based on approximated results, the diversity orders of a pair of users are obtained in detail. In addition, the system throughput of U-NOMA-BH is discussed in delay-limited transmission mode. Numerical results verify that: i) The gap between traffic requests of CD/PD-NOMA-BH systems appears to be closer than in orthogonal multiple access based BH (OMA-BH); ii) The CD-NOMA-BH system is capable of providing the enhanced traffic request and capacity provision; and iii) The outage behaviors of CD/PD-NOMA-BH are better than those of OMA-BH.
Submitted 16 January, 2024;
originally announced January 2024.
-
Generalizable Sleep Staging via Multi-Level Domain Alignment
Authors:
Jiquan Wang,
Sha Zhao,
Haiteng Jiang,
Shijian Li,
Tao Li,
Gang Pan
Abstract:
Automatic sleep staging is essential for sleep assessment and disorder diagnosis. Most existing methods depend on one specific dataset, with training and testing data drawn from that same dataset, and generalize poorly to other unseen datasets. In this paper, we introduce domain generalization into automatic sleep staging and propose the task of generalizable sleep staging, which aims to improve the model's ability to generalize to unseen datasets. Inspired by existing domain generalization methods, we adopt the feature alignment idea and propose a framework called SleepDG to solve it. Considering that both local salient features and sequential features are important for sleep staging, we propose a Multi-level Feature Alignment combining epoch-level and sequence-level feature alignment to learn domain-invariant feature representations. Specifically, we design an Epoch-level Feature Alignment to align the feature distribution of each single sleep epoch among different domains, and a Sequence-level Feature Alignment to minimize the discrepancy of sequential features among different domains. SleepDG is validated on five public datasets, achieving state-of-the-art performance.
Submitted 27 January, 2024; v1 submitted 13 December, 2023;
originally announced January 2024.
-
Near-Space Communications: the Last Piece of 6G Space-Air-Ground-Sea Integrated Network Puzzle
Authors:
Hongshan Liu,
Tong Qin,
Zhen Gao,
Tianqi Mao,
Keke Ying,
Ziwei Wan,
Li Qiao,
Rui Na,
Zhongxiang Li,
Chun Hu,
Yikun Mei,
Tuan Li,
Guanghui Wen,
Lei Chen,
Zhonghuai Wu,
Ruiqi Liu,
Gaojie Chen,
Shuo Wang,
Dezhi Zheng
Abstract:
This article presents a comprehensive study on the emerging near-space communications (NS-COM) within the context of space-air-ground-sea integrated network (SAGSIN). Specifically, we firstly explore the recent technical developments of NS-COM, followed by the discussions about motivations behind integrating NS-COM into SAGSIN. To further demonstrate the necessity of NS-COM, a comparative analysis between the NS-COM network and other counterparts in SAGSIN is conducted, covering aspects of deployment, coverage, channel characteristics and unique problems of NS-COM network. Afterwards, the technical aspects of NS-COM, including channel modeling, random access, channel estimation, array-based beam management and joint network optimization, are examined in detail. Furthermore, we explore the potential applications of NS-COM, such as structural expansion in SAGSIN communication, civil aviation communication, remote and urgent communication, weather monitoring and carbon neutrality. Finally, some promising research avenues are identified, including stratospheric satellite (StratoSat) -to-ground direct links for mobile terminals, reconfigurable multiple-input multiple-output (MIMO) and holographic MIMO, federated learning in NS-COM networks, maritime communication, electromagnetic spectrum sensing and adversarial game, integrated sensing and communications, StratoSat-based radar detection and imaging, NS-COM assisted enhanced global navigation system, NS-COM assisted intelligent unmanned system and free space optical (FSO) communication. Overall, this paper highlights that the NS-COM plays an indispensable role in the SAGSIN puzzle, providing substantial performance and coverage enhancement to the traditional SAGSIN architecture.
Submitted 4 March, 2024; v1 submitted 30 December, 2023;
originally announced January 2024.
-
Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection
Authors:
Jiachen Lian,
Carly Feng,
Naasir Farooqi,
Steve Li,
Anshul Kashyap,
Cheol Jun Cho,
Peter Wu,
Robbie Netzorg,
Tingle Li,
Gopala Krishna Anumanchipalli
Abstract:
Dysfluent speech modeling requires time-accurate and silence-aware transcription at both the word-level and phonetic-level. However, current research in dysfluency modeling primarily focuses on either transcription or detection, and the performance of each aspect remains limited. In this work, we present an unconstrained dysfluency modeling (UDM) approach that addresses both transcription and detection in an automatic and hierarchical manner. UDM eliminates the need for extensive manual annotation by providing a comprehensive solution. Furthermore, we introduce a simulated dysfluent dataset called VCTK++ to enhance the capabilities of UDM in phonetic transcription. Our experimental results demonstrate the effectiveness and robustness of our proposed methods in both transcription and detection tasks.
Submitted 20 December, 2023;
originally announced December 2023.
-
MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis
Authors:
Wenhao Guan,
Yishuang Li,
Tao Li,
Hukai Huang,
Feng Wang,
Jiayan Lin,
Lingyan Huang,
Lin Li,
Qingyang Hong
Abstract:
The style transfer task in Text-to-Speech refers to the process of transferring style information into text content to generate corresponding speech with a specific style. However, most existing style transfer approaches are based on either fixed emotional labels or reference speech clips, which cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible multi-modal and style controllable TTS framework named MM-TTS. It can utilize any modality as the prompt in a unified multi-modal prompt space, including reference speech, emotional facial images, and text descriptions, to control the style of the generated speech in a single system. The challenges of modeling such a multi-modal style controllable TTS mainly lie in two aspects: 1) aligning the multi-modal information into a unified style space to enable the input of arbitrary modality as the style prompt in a single system, and 2) efficiently transferring the unified style representation into the given text content, thereby making it possible to generate speech in the style of the prompt. To address these problems, we propose an aligned multi-modal prompt encoder that embeds different modalities into a unified style space, supporting style transfer for different modalities. Additionally, we present a new adaptive style transfer method named Style Adaptive Convolutions to achieve a better style representation. Furthermore, we design a Rectified Flow based Refiner to solve the problem of over-smoothed Mel-spectrograms and generate audio of higher fidelity. Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking head. Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multi-modal prompts.
Submitted 31 January, 2024; v1 submitted 17 December, 2023;
originally announced December 2023.
-
Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation
Authors:
Qi Yang,
Xing Nie,
Tong Li,
Pengfei Gao,
Ying Guo,
Cheng Zhen,
Pengfei Yan,
Shiming Xiang
Abstract:
Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement. Regarding pixel entanglement, we employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate more precise visual features from the foundational model. For modality entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to align corresponding visual and auditory signals bi-directionally. As for temporal entanglement, we introduce an innovative adaptive inter-frame consistency loss based on inherent temporal rules. Comprehensive experiments and ablation studies on AVSBench-object (84.7 mIoU on S4, 59.2 mIoU on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods. Code and more results will be publicly available at https://yannqi.github.io/AVS-COMBO/.
Submitted 7 April, 2024; v1 submitted 11 December, 2023;
originally announced December 2023.
-
Keyword spotting -- Detecting commands in speech using deep learning
Authors:
Sumedha Rai,
Tong Li,
Bella Lyu
Abstract:
Speech recognition has become an important task in the development of machine learning and artificial intelligence. In this study, we explore the important task of keyword spotting using speech recognition machine learning and deep learning techniques. We implement feature engineering by converting raw waveforms to Mel Frequency Cepstral Coefficients (MFCCs), which we use as inputs to our models. We experiment with several different algorithms such as Hidden Markov Model with Gaussian Mixture, Convolutional Neural Networks and variants of Recurrent Neural Networks including Long Short-Term Memory and the Attention mechanism. In our experiments, RNN with BiLSTM and Attention achieves the best performance with an accuracy of 93.9%.
Submitted 9 December, 2023;
originally announced December 2023.
-
Coordinate-based Neural Network for Fourier Phase Retrieval
Authors:
Tingyou Li,
Zixin Xu,
Yong S. Chu,
Xiaojing Huang,
Jizhou Li
Abstract:
Fourier phase retrieval is essential for high-definition imaging of nanoscale structures across diverse fields, notably coherent diffraction imaging. This study presents the Single impliCit neurAl Network (SCAN), a tool built upon coordinate neural networks meticulously designed for enhanced phase retrieval performance. Remedying the drawbacks of conventional iterative methods, which are easily trapped in local minima and sensitive to noise, SCAN adeptly connects object coordinates to their amplitude and phase within a unified network in an unsupervised manner. While many existing methods primarily use Fourier magnitude in their loss function, our approach incorporates both the predicted magnitude and phase, enhancing retrieval accuracy. Comprehensive tests validate SCAN's superiority over traditional and other deep learning models regarding accuracy and noise robustness. We also demonstrate that SCAN excels in the ptychography setting.
Submitted 8 January, 2024; v1 submitted 24 November, 2023;
originally announced November 2023.
-
Exploiting Active RIS in NOMA Networks with Hardware Impairments
Authors:
Xinwei Yue,
Meiqi Song,
Chongjun Ouyang,
Yuanwei Liu,
Tian Li,
Tianwei Hou
Abstract:
Active reconfigurable intelligent surface (ARIS) is a promising way to compensate for multiplicative fading attenuation by amplifying and reflecting event signals to selected users. This paper investigates the performance of ARIS assisted non-orthogonal multiple access (NOMA) networks over cascaded Nakagami-m fading channels. The effects of hardware impairments (HIS) and reflection coefficients on ARIS-NOMA networks with imperfect successive interference cancellation (ipSIC) and perfect successive interference cancellation (pSIC) are considered. More specifically, we develop new precise and asymptotic expressions of outage probability and ergodic data rate with ipSIC/pSIC for ARIS-NOMA-HIS networks. According to the approximated analyses, the diversity orders and multiplexing gains for a pair of non-orthogonal users are obtained in detail. Additionally, the energy efficiency of ARIS-NOMA-HIS networks is examined in delay-limited and delay-tolerant transmission schemes. The simulation findings are presented to demonstrate that: i) The outage behaviors and ergodic data rates of ARIS-NOMA-HIS networks are superior to those of ARIS aided orthogonal multiple access (OMA) and passive reconfigurable intelligent surface (PRIS) aided OMA; ii) As the reflection coefficient of ARIS increases, ARIS-NOMA-HIS networks are able to provide improved outage performance; and iii) ARIS-NOMA-HIS networks are more energy efficient than ARIS/PRIS-OMA networks and conventional cooperative schemes.
Submitted 12 January, 2024; v1 submitted 24 November, 2023;
originally announced November 2023.
-
How AI-driven Digital Twins Can Empower Mobile Networks
Authors:
Tong Li,
Fenyu Jiang,
Qiaohong Yu,
Wenzhen Huang,
Tao Jiang,
Depeng Jin
Abstract:
The growing complexity of next-generation networks exacerbates the modeling and algorithmic flaws of conventional network optimization methodology. In this paper, we propose a mobile network digital twin (MNDT) architecture for 6G networks. To address the modeling and algorithmic shortcomings, the MNDT uses a simulation-optimization structure. The feedback from the network simulation engine, which serves as validation for the optimizer's decision outcomes, is used explicitly to train artificial intelligence (AI) empowered optimizers iteratively. In practice, we develop a network digital twin prototype system leveraging data-driven technology to accurately model the behaviors of mobile network elements (e.g., mobile users and base stations), wireless environments, and network performance. An AI-powered network optimizer has been developed based on the deployed MNDT prototype system for providing reliable and optimized network configurations. The results of the experiments demonstrate that the proposed MNDT infrastructure can provide practical network optimization solutions while adapting to the more complex environment.
Submitted 20 November, 2023;
originally announced November 2023.
-
SA-Med2D-20M Dataset: Segment Anything in 2D Medical Imaging with 20 Million masks
Authors:
Jin Ye,
Junlong Cheng,
Jianpin Chen,
Zhongying Deng,
Tianbin Li,
Haoyu Wang,
Yanzhou Su,
Ziyan Huang,
Jilong Chen,
Lei Jiang,
Hui Sun,
Min Zhu,
Shaoting Zhang,
Junjun He,
Yu Qiao
Abstract:
Segment Anything Model (SAM) has achieved impressive results for natural image segmentation with input prompts such as points and bounding boxes. Its success largely owes to massive labeled training data. However, directly applying SAM to medical image segmentation cannot perform well because SAM lacks medical knowledge -- it does not use medical images for training. To incorporate medical knowledge into SAM, we introduce SA-Med2D-20M, a large-scale segmentation dataset of 2D medical images built upon numerous public and private datasets. It consists of 4.6 million 2D medical images and 19.7 million corresponding masks, covering almost the whole body and showing significant diversity. This paper describes all the datasets collected in SA-Med2D-20M and details how to process these datasets. Furthermore, comprehensive statistics of SA-Med2D-20M are presented to facilitate the better use of our dataset, which can help the researchers build medical vision foundation models or apply their models to downstream medical applications. We hope that the large scale and diversity of SA-Med2D-20M can be leveraged to develop medical artificial intelligence for enhancing diagnosis, medical image analysis, knowledge sharing, and education. The data with the redistribution license is publicly available at https://github.com/OpenGVLab/SAM-Med2D.
Submitted 20 November, 2023;
originally announced November 2023.
-
Learning-Augmented Scheduling for Solar-Powered Electric Vehicle Charging
Authors:
Tongxin Li
Abstract:
We tackle the complex challenge of scheduling the charging of electric vehicles (EVs) equipped with solar panels and batteries, particularly under out-of-distribution (OOD) conditions. Traditional scheduling approaches, such as reinforcement learning (RL) and model predictive control (MPC), often fail to provide satisfactory results when faced with OOD data, struggling to balance robustness (worst-case performance) and consistency (near-optimal average performance). To address this gap, we introduce a novel learning-augmented policy. This policy employs a dynamic robustness budget, which is adapted in real-time based on the reinforcement learning policy's performance. Specifically, it leverages the temporal difference (TD) error, a measure of the learning policy's prediction accuracy, to assess the trustworthiness of the machine-learned policy. This method allows for a more effective balance between consistency and robustness in EV charging schedules, significantly enhancing adaptability and efficiency in real-world, unpredictable environments. Our results demonstrate that this approach markedly improves scheduling effectiveness and reliability, particularly in OOD contexts, paving the way for more resilient and adaptive EV charging systems.
Submitted 10 November, 2023;
originally announced November 2023.
-
Minimum Snap Trajectory Generation and Control for an Under-actuated Flapping Wing Aerial Vehicle
Authors:
Chen Qian,
Rui Chen,
Peiyao Shen,
Yongchun Fang,
Jifu Yan,
Tiefeng Li
Abstract:
This paper presents both the trajectory generation and tracking control strategies for an underactuated flapping wing aerial vehicle (FWAV). First, the FWAV dynamics are analyzed from a practical perspective. Then, based on these analyses, we demonstrate the differential flatness of the FWAV system, and develop a general-purpose trajectory generation strategy. Subsequently, the trajectory tracking controller is developed with the help of robust control and switch control techniques. After that, the overall system asymptotic stability is guaranteed by Lyapunov stability analysis. To make the controller applicable in real flight, we also provide several instructions. Finally, a series of experimental results demonstrate the successful implementation of the proposed trajectory generation strategy and tracking control strategy. This work is the first to achieve, at a practical level, the closed-loop integration of trajectory generation and control for real 3-dimensional flight of an underactuated FWAV.
Submitted 2 November, 2023;
originally announced November 2023.
-
Intelligent-Reflecting-Surface-Assisted UAV Communications for 6G Networks
Authors:
Zhaolong Ning,
Tengfeng Li,
Yu Wu,
Xiaojie Wang,
Qingqing Wu,
Fei Richard Yu,
Song Guo
Abstract:
In 6th-Generation (6G) mobile networks, Intelligent Reflective Surfaces (IRSs) and Unmanned Aerial Vehicles (UAVs) have emerged as promising technologies to address the coverage difficulties and resource constraints faced by terrestrial networks. UAVs, with their mobility and low costs, offer diverse connectivity options for mobile users and a novel deployment paradigm for 6G networks. However, the limited battery capacity of UAVs, dynamic and unpredictable channel environments, and communication resource constraints result in poor performance of traditional UAV-based networks. IRSs can not only reconstruct the wireless environment in a unique way, but also achieve wireless network relay in a cost-effective manner. Hence, IRSs have received significant attention as a promising solution to the above challenges. In this article, we conduct a comprehensive survey on IRS-assisted UAV communications for 6G networks. First, primary issues, key technologies, and application scenarios of IRS-assisted UAV communications for 6G networks are introduced. Then, we put forward specific solutions to the issues of IRS-assisted UAV communications. Finally, we discuss some open issues and future research directions to guide researchers in related fields.
Submitted 31 October, 2023;
originally announced October 2023.
-
Controllability of networked multiagent systems based on linearized Turing's model
Authors:
Tianhao Li,
Ruichang Zhang,
Zhixin Liu,
Zhuo Zou,
Xiaoming Hu
Abstract:
Turing's model has been widely used to explain how simple, uniform structures can give rise to complex, patterned structures during the development of organisms. However, it is very hard to establish rigorous theoretical results for the dynamic evolution behavior of Turing's model since it is described by nonlinear partial differential equations. We focus on controllability of Turing's model by linearization and spatial discretization. This linearized model is a networked system whose agents are second order linear systems and these agents interact with each other by Laplacian dynamics on a graph. A control signal can be added to agents of choice. Under mild conditions on the parameters of the linearized Turing's model, we prove the equivalence between controllability of the linearized Turing's model and controllability of a Laplace dynamic system with agents of first order dynamics. When the graph is a grid graph or a cylinder grid graph, we then give precisely the minimal number of control nodes and a corresponding control node set such that the Laplace dynamic systems on these graphs with agents of first order dynamics are controllable.
Submitted 26 October, 2023;
originally announced October 2023.
-
A Zero-Shot Language Agent for Computer Control with Structured Reflection
Authors:
Tao Li,
Gang Li,
Zhiwei Deng,
Bryan Wang,
Yang Li
Abstract:
Large language models (LLMs) have shown increasing capacity at planning and executing a high-level goal in a live computer environment (e.g. MiniWoB++). To perform a task, recent works often require a model to learn from trace examples of the task via either supervised learning or few/many-shot prompting. Without these trace examples, it remains a challenge how an agent can autonomously learn and improve its control on a computer, which limits the ability of an agent to perform a new task. We approach this problem with a zero-shot agent that requires no given expert traces. Our agent plans for executable actions on a partially observed environment, and iteratively progresses a task by identifying and learning from its mistakes via self-reflection and structured thought management. On the easy tasks of MiniWoB++, we show that our zero-shot agent often outperforms recent SoTAs, with more efficient reasoning. For tasks with more complexity, our reflective agent performs on par with prior best models, even though previous works had the advantages of accessing expert traces or additional screen information.
Submitted 23 October, 2023; v1 submitted 12 October, 2023;
originally announced October 2023.
-
Vec-Tok Speech: speech vectorization and tokenization for neural speech generation
Authors:
Xinfa Zhu,
Yuanjun Lv,
Yi Lei,
Tao Li,
Wendi He,
Hongbin Zhou,
Heng Lu,
Lei Xie
Abstract:
Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, the current speech generative models are still struggling regarding speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating expressive and high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details contributing to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate for lower exposure bias and longer context coverage, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, performs better than other SOTA models. Code will be available at https://github.com/BakerBunker/VecTok .
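The token-length reduction that BPE provides can be illustrated with a minimal sketch. It assumes the semantic tokens are plain integer IDs and uses textbook greedy BPE; the function names, the example IDs, and the choice of `first_new_id` are hypothetical and are not taken from the Vec-Tok Speech code.

```python
from collections import Counter

def most_frequent_pair(seq):
    """Return the most common adjacent pair in seq, or None if seq has fewer than 2 items."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of pair (scanning left to right) with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(seq, num_merges, first_new_id):
    """Greedily learn up to num_merges merges; return (shortened sequence, merge table)."""
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + step  # assumption: new IDs start above the base vocabulary
        merges[pair] = new_id
        seq = merge_pair(seq, pair, new_id)
    return seq, merges

# Hypothetical semantic-token IDs, for illustration only.
tokens = [7, 7, 3, 7, 7, 3, 5]
compressed, merges = learn_bpe(tokens, num_merges=2, first_new_id=100)
print(len(tokens), "->", len(compressed))
```

Each merge replaces a frequent adjacent pair with a single new ID, so the sequence the LM has to model becomes shorter while remaining losslessly decodable through the merge table.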
Submitted 12 October, 2023; v1 submitted 11 October, 2023;
originally announced October 2023.
-
U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning
Authors:
Tao Li,
Zhichao Wang,
Xinfa Zhu,
Jian Cong,
Qiao Tian,
Yuping Wang,
Lei Xie
Abstract:
Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.
Submitted 6 October, 2023;
originally announced October 2023.
-
HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS
Authors:
Dake Guo,
Xinfa Zhu,
Liumeng Xue,
Tao Li,
Yuanjun Lv,
Yuepeng Jiang,
Lei Xie
Abstract:
Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with high dynamic prosodic variations is still challenging. To address this problem, we expand the capabilities of GNNs with a hierarchical prosody modeling approach, named HiGNN-TTS. Specifically, we add a virtual global node in the graph to strengthen the interconnection of word nodes and introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we perform hierarchical supervision from acoustic prosody on each node of the graph to capture the prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech.
Submitted 6 October, 2023; v1 submitted 25 September, 2023;
originally announced September 2023.
-
Inter-vendor harmonization of Computed Tomography (CT) reconstruction kernels using unpaired image translation
Authors:
Aravind R. Krishnan,
Kaiwen Xu,
Thomas Li,
Chenyu Gao,
Lucas W. Remedios,
Praitayini Kanakaraj,
Ho Hin Lee,
Shunxing Bao,
Kim L. Sandler,
Fabien Maldonado,
Ivana Isgum,
Bennett A. Landman
Abstract:
The reconstruction kernel in computed tomography (CT) generation determines the texture of the image. Consistency in reconstruction kernels is important as the underlying CT texture can impact measurements during quantitative image analysis. Harmonization (i.e., kernel conversion) minimizes differences in measurements due to inconsistent reconstruction kernels. Existing methods investigate harmonization of CT scans in single or multiple manufacturers. However, these methods require paired scans of hard and soft reconstruction kernels that are spatially and anatomically aligned. Additionally, a large number of models need to be trained across different kernel pairs within manufacturers. In this study, we adopt an unpaired image translation approach to investigate harmonization between and across reconstruction kernels from different manufacturers by constructing a multipath cycle generative adversarial network (GAN). We use hard and soft reconstruction kernels from the Siemens and GE vendors from the National Lung Screening Trial dataset. We use 50 scans from each reconstruction kernel and train a multipath cycle GAN. To evaluate the effect of harmonization on the reconstruction kernels, we harmonize 50 scans each from Siemens hard kernel, GE soft kernel and GE hard kernel to a reference Siemens soft kernel (B30f) and evaluate percent emphysema. We fit a linear model by considering the age, smoking status, sex and vendor and perform an analysis of variance (ANOVA) on the emphysema scores. Our approach minimizes differences in emphysema measurement and highlights the impact of age, sex, smoking status and vendor on emphysema quantification.
Submitted 26 January, 2024; v1 submitted 22 September, 2023;
originally announced September 2023.
-
Frame Pairwise Distance Loss for Weakly-supervised Sound Event Detection
Authors:
Rui Tao,
Yuxing Huang,
Xiangdong Wang,
Long Yan,
Lufeng Zhai,
Kazushige Ouchi,
Taihao Li
Abstract:
Weakly-supervised learning has emerged as a promising approach to leverage limited labeled data in various domains by bridging the gap between fully supervised methods and unsupervised techniques. Acquisition of strong annotations for detecting sound events is prohibitively expensive, making weakly supervised learning a more cost-effective and broadly applicable alternative. To improve the recognition rate of weakly-supervised sound event detection, we introduce a Frame Pairwise Distance (FPD) loss branch, complemented with a minimal amount of synthesized data. The corresponding sampling and label processing strategies are also proposed. Two distinct distance metrics are employed to evaluate the proposed approach. Finally, the method is validated on the DCASE 2023 task4 dataset. The experimental results corroborate the efficacy of this approach.
Submitted 7 December, 2023; v1 submitted 21 September, 2023;
originally announced September 2023.
-
Joint Demosaicing and Denoising with Double Deep Image Priors
Authors:
Taihui Li,
Anish Lahiri,
Yutong Dai,
Owen Mayer
Abstract:
Demosaicing and denoising of RAW images are crucial steps in the processing pipeline of modern digital cameras. As only a third of the color information required to produce a digital image is captured by the camera sensor, the process of demosaicing is inherently ill-posed. The presence of noise further exacerbates this problem. Performing these two steps sequentially may distort the content of the captured RAW images and accumulate errors from one step to another. Recent deep neural-network-based approaches have shown the effectiveness of joint demosaicing and denoising to mitigate such challenges. However, these methods typically require a large number of training samples and do not generalize well to different types and intensities of noise. In this paper, we propose a novel joint demosaicing and denoising method, dubbed JDD-DoubleDIP, which operates directly on a single RAW image without requiring any training data. We validate the effectiveness of our method on two popular datasets -- Kodak and McMaster -- with various noises and noise intensities. The experimental results show that our method consistently outperforms other compared methods in terms of PSNR, SSIM, and qualitative visual perception.
Submitted 17 September, 2023;
originally announced September 2023.
-
Assessing cognitive function among older adults using machine learning and wearable device data: a feasibility study
Authors:
Collin Sakal,
Tingyou Li,
Juan Li,
Xinyue Li
Abstract:
Timely implementation of interventions to slow cognitive decline among older adults requires accurate monitoring to detect changes in cognitive function. Data gathered using wearable devices that can continuously monitor factors known to be associated with cognition could be used to train machine learning models and develop wearable-based cognitive monitoring systems. Using data from over 2,400 older adults in the National Health and Nutrition Examination Survey (NHANES) we developed prediction models to differentiate older adults with normal cognition from those with poor cognition based on outcomes from three cognitive tests measuring different domains of cognitive function. During repeated cross-validation, CatBoost, XGBoost, and Random Forest models performed best when predicting cognition based on processing speed, working memory, and attention (median AUCs >0.82) compared to immediate and delayed recall (median AUCs >0.72) and categorical verbal fluency (median AUC >0.68). Activity and sleep parameters were also more strongly associated with processing speed, working memory, and attention compared to other cognitive subdomains. Our work provides proof of concept that wearable-based cognitive monitoring systems may be a viable alternative to traditional methods for monitoring processing speeds, working memory, and attention. We further identified novel metrics that could be targets in future causal studies seeking to better understand how sleep and activity parameters influence cognitive function among older adults.
Submitted 24 March, 2024; v1 submitted 27 August, 2023;
originally announced September 2023.
-
A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation
Authors:
Ziyan Huang,
Zhongying Deng,
Jin Ye,
Haoyu Wang,
Yanzhou Su,
Tianbin Li,
Hui Sun,
Junlong Cheng,
Jianpin Chen,
Junjun He,
Yun Gu,
Shaoting Zhang,
Lixu Gu,
Yu Qiao
Abstract:
Although deep learning has revolutionized abdominal multi-organ segmentation, models often struggle with generalization due to training on small, specific datasets. With the recent emergence of large-scale datasets, some important questions arise: Can models trained on these datasets generalize well on different ones? If yes/no, how to further improve their generalizability? To address these questions, we introduce A-Eval, a benchmark for the cross-dataset Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation. We employ training sets from four large-scale public datasets: FLARE22, AMOS, WORD, and TotalSegmentator, each providing extensive labels for abdominal multi-organ segmentation. For evaluation, we incorporate the validation sets from these datasets along with the training set from the BTCV dataset, forming a robust benchmark comprising five distinct datasets. We evaluate the generalizability of various models using the A-Eval benchmark, with a focus on diverse data usage scenarios: training on individual datasets independently, utilizing unlabeled data via pseudo-labeling, mixing different modalities, and joint training across all available datasets. Additionally, we explore the impact of model sizes on cross-dataset generalizability. Through these analyses, we underline the importance of effective data usage in enhancing models' generalization capabilities, offering valuable insights for assembling large-scale datasets and improving training strategies. The code and pre-trained models are available at https://github.com/uni-medical/A-Eval.
Submitted 7 September, 2023;
originally announced September 2023.
-
Research on Damage Analysis of Key Parts of UAV Flight Control System
Authors:
Tianshun Li,
Huaimin Chen,
Ben Xiao,
Hao Li,
Shiyu Hao,
Di Hai,
Xuetong Wang
Abstract:
A set of hardware-in-the-loop simulation methods based on the UAV model is proposed to create fault data, which is used to identify the parts where faults occur. Actual flight experimental data is used to demonstrate the reliability of the Simulink models. A series of typical faults with various amplitudes is then injected into different channels of UAV parts on the hardware-in-the-loop simulation platform. Fault data is created this way, and the effect on UAV flight and task/control can be obtained through damage analysis. Typical fault characteristics are extracted, and the faulty parts can be analyzed and identified. The trend along which faults develop and the causes of faults can also be inferred from exterior performance, which supports precise attack and performance evaluation techniques.
Submitted 6 September, 2023;
originally announced September 2023.
-
Directionality-Aware Mixture Model Parallel Sampling for Efficient Linear Parameter Varying Dynamical System Learning
Authors:
Sunan Sun,
Haihui Gao,
Tianyu Li,
Nadia Figueroa
Abstract:
The Linear Parameter Varying Dynamical System (LPV-DS) is an effective approach that learns stable, time-invariant motion policies using statistical modeling and semi-definite optimization to encode complex motions for reactive robot control. Despite its strengths, the LPV-DS learning approach faces challenges in achieving a high model accuracy without compromising the computational efficiency. To address this, we introduce the Directionality-Aware Mixture Model (DAMM), a novel statistical model that applies the Riemannian metric on the n-sphere $\mathbb{S}^n$ to efficiently blend non-Euclidean directional data with $\mathbb{R}^m$ Euclidean states. Additionally, we develop a hybrid Markov chain Monte Carlo technique that combines Gibbs Sampling with Split/Merge Proposal, allowing for parallel computation to drastically speed up inference. Our extensive empirical tests demonstrate that LPV-DS integrated with DAMM achieves higher reproduction accuracy, better model efficiency, and near real-time/online learning compared to standard estimation methods on various datasets. Lastly, we demonstrate its suitability for incrementally learning multi-behavior policies in real-world robot experiments.
Submitted 24 March, 2024; v1 submitted 5 September, 2023;
originally announced September 2023.
-
Task Generalization with Stability Guarantees via Elastic Dynamical System Motion Policies
Authors:
Tianyu Li,
Nadia Figueroa
Abstract:
Dynamical System (DS) based Learning from Demonstration (LfD) allows learning of reactive motion policies with stability and convergence guarantees from a few trajectories. Yet, current DS learning techniques lack the flexibility to generalize to new task instances as they ignore explicit task parameters that inherently change the underlying trajectories. In this work, we propose Elastic-DS, a novel DS learning and generalization approach that embeds task parameters into the Gaussian Mixture Model (GMM) based Linear Parameter Varying (LPV) DS formulation. Central to our approach is the Elastic-GMM, a GMM constrained to SE(3) task-relevant frames. Given a new task instance/context, the Elastic-GMM is transformed with Laplacian Editing and used to re-estimate the LPV-DS policy. Elastic-DS is compositional in nature and can be used to construct flexible multi-step tasks. We showcase its strength on a myriad of simulated and real-robot experiments while preserving desirable control-theoretic guarantees. Supplementary videos can be found at https://sites.google.com/view/elastic-ds
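For orientation, the GMM-based LPV-DS formulation referred to above is commonly written as follows. This is the form usually found in the LPV-DS literature, not an equation quoted from this abstract; the symbols $K$, $A_k$, $\pi_k$, $\mu_k$, $\Sigma_k$, $P$, and the attractor $x^*$ are notation assumed here.

```latex
\dot{x} = f(x) = \sum_{k=1}^{K} \gamma_k(x)\, A_k\, (x - x^*),
\qquad
\gamma_k(x) = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)},
\qquad
A_k^{\top} P + P A_k \prec 0, \quad P = P^{\top} \succ 0 .
```

Under these constraints, $V(x) = (x - x^*)^{\top} P\, (x - x^*)$ serves as a Lyapunov function, which is where the stability and convergence guarantees typically come from. The GMM supplies the mixing weights $\gamma_k$, so transforming the GMM for a new task instance changes the policy once the $A_k$ are re-estimated.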
Submitted 4 September, 2023;
originally announced September 2023.
-
MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling
Authors:
Zhichao Wang,
Xinsheng Wang,
Qicong Xie,
Tao Li,
Lei Xie,
Qiao Tian,
Yuping Wang
Abstract:
In addition to conveying the linguistic content from source speech to converted speech, maintaining the speaking style of source speech also plays an important role in the voice conversion (VC) task, which is essential in many scenarios with highly expressive source speech, such as dubbing and data augmentation. Previous work generally took explicit prosodic features or fixed-length style embedding extracted from source speech to model the speaking style of source speech, which is insufficient to achieve comprehensive style modeling and target speaker timbre preservation. Inspired by the style's multi-scale nature of human speech, a multi-scale style modeling method for the VC task, referred to as MSM-VC, is proposed in this paper. MSM-VC models the speaking style of source speech from different levels. To effectively convey the speaking style and meanwhile prevent timbre leakage from source speech to converted speech, each level's style is modeled by specific representation. Specifically, prosodic features, pre-trained ASR model's bottleneck features, and features extracted by a model trained with a self-supervised strategy are adopted to model the frame, local, and global-level styles, respectively. Besides, to balance the performance of source style modeling and target speaker timbre preservation, an explicit constraint module consisting of a pre-trained speech emotion recognition model and a speaker classifier is introduced to MSM-VC. This explicit constraint module also makes it possible to simulate the style transfer inference process during the training to improve the disentanglement ability and alleviate the mismatch between training and inference. Experiments performed on the highly expressive speech corpus demonstrate that MSM-VC is superior to the state-of-the-art VC methods for modeling source speech style while maintaining good speech quality and speaker similarity.
Submitted 3 September, 2023;
originally announced September 2023.
-
DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin
Authors:
Tao Li,
Chenxu Hu,
Jian Cong,
Xinfa Zhu,
Jingbei Li,
Qiao Tian,
Yuping Wang,
Lei Xie
Abstract:
While the performance of cross-lingual TTS based on monolingual corpora has been significantly improved recently, generating cross-lingual speech still suffers from the foreign accent problem, leading to limited naturalness. Besides, current cross-lingual methods ignore modeling emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to the intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving the emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder with the emotion embedding as a condition. To address the weaker emotional expressiveness problem caused by speaker disentanglement in emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling ability of the speaker and the emotion in the reverse diffusion process to further improve emotion expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and the good design of OP-EDM in learning speaker-irrelevant but emotion-discriminative embedding.
Submitted 2 September, 2023;
originally announced September 2023.
-
CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis
Authors:
Yi Meng,
Xiang Li,
Zhiyong Wu,
Tingtian Li,
Zixun Sun,
Xinyu Xiao,
Chi Sun,
Hui Zhan,
Helen Meng
Abstract:
To further improve the speaking styles of synthesized speeches, current text-to-speech (TTS) synthesis systems commonly employ reference speeches to stylize their outputs instead of just the input texts. These reference speeches are obtained by manual selection, which is resource-consuming, or selected by semantic features. However, semantic features contain not only style-related information but also style-irrelevant information. The information irrelevant to speaking style in the text could interfere with the reference audio selection and result in improper speaking styles. To improve the reference selection, we propose the Contrastive Acoustic-Linguistic Module (CALM) to extract the Style-related Text Feature (STF) from the text. CALM optimizes the correlation between the speaking style embedding and the extracted STF with contrastive learning. Thus, a certain number of the most appropriate reference speeches for the input text are selected by retrieving the speeches with the top STF similarities. Then the style embeddings are combined in a weighted sum according to their STF similarities and used to stylize the synthesized speech of TTS. Experimental results demonstrate the effectiveness of our proposed approach: in both objective and subjective evaluations of the speaking styles of the synthesized speeches, it outperforms a baseline approach with semantic-feature-based reference selection.
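The retrieve-then-blend step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the CALM implementation: cosine similarity as the STF similarity measure, softmax normalization of the weights, and all names and example vectors are hypothetical choices that the abstract does not specify.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_and_blend(query_stf, ref_stfs, ref_styles, k):
    """Pick the k references most similar to the query STF and blend their style embeddings."""
    sims = [cosine(query_stf, r) for r in ref_stfs]
    # Indices of the k most similar references, best first.
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Assumption: weights are a softmax over the selected similarities.
    exps = [math.exp(sims[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(ref_styles[0])
    blended = [sum(w * ref_styles[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return top, weights, blended

# Hypothetical 2-D features and style embeddings, for illustration only.
query = [1.0, 0.0]
ref_stfs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_styles = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
top, weights, blended = select_and_blend(query, ref_stfs, ref_styles, k=2)
print(top, weights, blended)
```

Because the weights follow the similarities, the closest reference dominates the blended style embedding while the runner-up still contributes.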
Submitted 30 August, 2023;
originally announced August 2023.
-
Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder
Authors:
Xuyuan Li,
Zengqiang Shang,
Peiyang Shi,
Hua Hua,
Ta Li,
Pengyuan Zhang
Abstract:
Neural networks have been able to generate high-quality single-sentence speech. However, audiobook speech synthesis remains a challenge due to the intra-paragraph correlation of semantic and acoustic features as well as variable styles. In this paper, we propose a highly expressive paragraph speech synthesis system with a multi-step variational autoencoder, called EP-MSTTS. EP-MSTTS is the first VITS-based paragraph speech synthesis model and models the variable style of paragraph speech at five levels: frame, phoneme, word, sentence, and paragraph. We also propose a series of improvements to enhance the performance of this hierarchical model. In addition, we directly train EP-MSTTS on speech sliced by paragraph rather than sentence. Experimental results on the single-speaker French audiobook corpus released at Blizzard Challenge 2023 show that EP-MSTTS obtains better performance than baseline models.
Submitted 11 June, 2024; v1 submitted 25 August, 2023;
originally announced August 2023.
-
Flexible Distributed Flocking Control for Multi-agent Unicycle Systems
Authors:
Tinghua Li,
Bayu Jayawardhana
Abstract:
Currently, the general aim of flocking and formation control laws for multi-agent systems is to form and maintain a rigid configuration, such as the alpha-lattices in flocking control methods, where the desired distance between each pair of connected agents is fixed. This introduces a scalability issue for large-scale deployment of agents due to unrealizable geometrical constraints and the constant need for a centralized orchestrator to ensure the rigidity of the formation graph. This paper presents a flexible distributed flocking cohesion algorithm for nonholonomic multi-agent systems. The desired geometric configuration between each pair of agents is adaptive and flexible. The distributed flocking goal is achieved using limited information exchange (i.e., the local field gradient) between connected neighbor agents, and it does not rely on measurements of any other motion variables, such as (relative) position, velocity, or acceleration. Additionally, a flexible flocking scheme with safety is considered so that agents with limited sensing capability are able to maintain the connectedness of the communication topology at all times and avoid inter-agent collisions. The stability analysis of the proposed methods is presented along with numerical simulation results to show their effectiveness.
Submitted 8 August, 2023;
originally announced August 2023.
-
Learning Complex Motion Plans using Neural ODEs with Safety and Stability Guarantees
Authors:
Farhad Nawaz,
Tianyu Li,
Nikolai Matni,
Nadia Figueroa
Abstract:
We propose a Dynamical System (DS) approach to learn complex, possibly periodic motion plans from kinesthetic demonstrations using Neural Ordinary Differential Equations (NODE). To ensure reactivity and robustness to disturbances, we propose a novel approach that selects a target point at each time step for the robot to follow, by combining tools from control theory and the target trajectory generated by the learned NODE. A correction term to the NODE model is computed online by solving a quadratic program that guarantees stability and safety using control Lyapunov functions and control barrier functions, respectively. Our approach outperforms baseline DS learning techniques on the LASA handwriting dataset and complex periodic trajectories. It is also validated on the Franka Emika robot arm to produce stable motions for wiping and stirring tasks that do not have a single attractor, while being robust to perturbations and safe around humans and obstacles.
Submitted 22 March, 2024; v1 submitted 31 July, 2023;
originally announced August 2023.