-
Combining Neural Networks and Symbolic Regression for Analytical Lyapunov Function Discovery
Authors:
Jie Feng,
Haohan Zou,
Yuanyuan Shi
Abstract:
We propose CoNSAL (Combining Neural networks and Symbolic regression for Analytical Lyapunov function) to construct analytical Lyapunov functions for nonlinear dynamic systems. This framework contains a neural Lyapunov function and a symbolic regression component, where symbolic regression is applied to distill the neural network to precise analytical forms. Our approach utilizes symbolic regression not only as a tool for translation but also as a means to uncover counterexamples. This procedure terminates when no counterexamples are found in the analytical formulation. Compared with previous results, our algorithm directly produces an analytical form of the Lyapunov function with improved interpretability in both the learning process and the final results. We apply our algorithm to the 2-D inverted pendulum, path following, Van der Pol oscillator, 3-D trig dynamics, 4-D rotating wheel pendulum, and 6-D 3-bus power system, and demonstrate that it successfully finds valid Lyapunov functions for each.
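Illustrative note (not taken from the paper): the idea of searching for counterexamples to a candidate analytical Lyapunov function can be sketched as a sampled check of the two standard conditions, V(x) > 0 and dV/dt = ∇V(x)·f(x) < 0 away from the equilibrium. The linear system `f`, the two candidate functions, the grid, and the helper name `find_counterexamples` are all assumptions chosen for illustration; a grid check like this can only falsify a candidate, it does not certify one.

```python
import itertools

def find_counterexamples(f, V, grad_V, grid, tol=1e-9):
    """Return grid points where V(x) <= 0 or dV/dt >= 0 (origin excluded)."""
    bad = []
    for x in grid:
        if all(abs(c) < tol for c in x):
            continue  # skip the equilibrium at the origin
        v = V(x)
        # Lie derivative of V along f: gradient dotted with the vector field.
        v_dot = sum(g * fx for g, fx in zip(grad_V(x), f(x)))
        if v <= 0.0 or v_dot >= 0.0:
            bad.append(x)
    return bad

def f(x):
    """Hypothetical stable linear system, used only as a toy example."""
    x1, x2 = x
    return (-x1 + x2, -x1 - x2)

# Uniform grid on [-2, 2] x [-2, 2] with step 0.25.
axis = [i / 4.0 for i in range(-8, 9)]
grid = list(itertools.product(axis, repeat=2))

# Candidate 1: V = x1^2 + x2^2 (a valid Lyapunov function for this system).
good = find_counterexamples(
    f, lambda x: x[0] ** 2 + x[1] ** 2, lambda x: (2 * x[0], 2 * x[1]), grid
)
# Candidate 2: V = x1^2 - x2^2 (not positive definite, so it must fail).
bad = find_counterexamples(
    f, lambda x: x[0] ** 2 - x[1] ** 2, lambda x: (2 * x[0], -2 * x[1]), grid
)
print(len(good), len(bad))
```

In a counterexample-guided loop, the points returned for a failing candidate would typically be added to the training data before the candidate is refit; whether CoNSAL does exactly this is not stated in the abstract.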
Submitted 21 June, 2024;
originally announced June 2024.
-
Diffusion Model-based FOD Restoration from High Distortion in dMRI
Authors:
Shuo Huang,
Lujia Zhong,
Yonggang Shi
Abstract:
The fiber orientation distribution (FOD) is a popular model for representing diffusion MRI (dMRI) data. However, imaging artifacts such as susceptibility-induced distortion in dMRI can cause signal loss and lead to corrupted reconstruction of FODs, which prohibits successful fiber tracking and connectivity analysis in affected brain regions such as the brain stem. Generative models, such as diffusion models, have been successfully applied in various image restoration tasks. However, their application to FOD images poses unique challenges since FODs are 4-dimensional data represented by spherical harmonics (SPHARM) with the 4th dimension exhibiting order-related dependency. In this paper, we propose a novel diffusion model for FOD restoration that can recover the signal loss caused by distortion artifacts. We use volume-order encoding to enhance the ability of the diffusion model to generate individual FOD volumes at all SPHARM orders. Moreover, we add cross-attention features extracted across all SPHARM orders in generating every individual FOD volume to capture the order-related dependency across FOD volumes. We also condition the diffusion model with low-distortion FODs surrounding high-distortion areas to maintain the geometric coherence of the generated FODs. We trained and tested our model using data from the UK Biobank (n = 1315). On a test set with ground truth (n = 43), we demonstrate the high accuracy of the generated FODs in terms of root mean square errors of FOD volumes and angular errors of FOD peaks. We also apply our method to a test set with large distortion in the brain stem area (n = 1172) and demonstrate the efficacy of our method in restoring the FOD integrity and, hence, greatly improving tractography performance in affected brain regions.
Submitted 19 June, 2024;
originally announced June 2024.
-
Risk-Aware Value-Oriented Net Demand Forecasting for Virtual Power Plants
Authors:
Yufan Zhang,
Jiajun Han,
Yuanyuan Shi
Abstract:
This paper develops a risk-aware net demand forecasting product for virtual power plants, which helps reduce the risk of high operation costs. In the training phase, a bilevel program for parameter estimation is formulated, where the upper level optimizes over the forecast model parameters to minimize the conditional value-at-risk (a risk metric) of operation costs. The lower level solves the operation problems given the forecast. Leveraging the specific structure of the operation problem, we show that the bilevel program is equivalent to a convex program when the forecast model is linear. Numerical results show that our approach effectively reduces the risk of high costs compared to the forecasting approach developed for risk-neutral decision makers.
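Illustrative note (not taken from the paper): the conditional value-at-risk of a set of sampled operation costs is commonly estimated as the average of the worst (1 - alpha) fraction of samples. The sketch below uses that simple tail-average estimator; the function name `empirical_cvar`, the confidence level `alpha`, and the sample costs are assumptions for illustration, and the paper's bilevel formulation is not reproduced here.

```python
import math

def empirical_cvar(costs, alpha=0.9):
    """Average of the worst (1 - alpha) fraction of sampled costs.

    Higher cost is worse. At least one sample is always kept in the tail.
    """
    if not costs:
        raise ValueError("costs must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(costs, reverse=True)  # worst (largest) costs first
    k = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / len(tail)

# Hypothetical operation costs from four scenarios.
sample_costs = [1.0, 2.0, 3.0, 10.0]
print(empirical_cvar(sample_costs, alpha=0.5))  # mean of the two worst costs
print(empirical_cvar(sample_costs, alpha=0.0))  # plain mean of all costs
```

With `alpha=0.0` the estimator reduces to the ordinary mean, which corresponds to the risk-neutral case the abstract compares against.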
Submitted 14 June, 2024;
originally announced June 2024.
-
Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
Authors:
Frank Seide,
Morrie Doulaty,
Yangyang Shi,
Yashesh Gaur,
Junteng Jia,
Chunyang Wu
Abstract:
We introduce Speech ReaLLM, a new ASR architecture that marries "decoder-only" ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first "decoder-only" ASR architecture designed to handle continuous audio without explicit end-pointing. Speech ReaLLM is a special case of the more general ReaLLM ("real-time LLM") approach, also introduced here for the first time. The idea is inspired by RNN-T: Instead of generating a response only at the end of a user prompt, generate after every input token received in real time (it is often empty). On Librispeech "test", an 80M Speech ReaLLM achieves WERs of 3.0% and 7.4% in real time (without an external LM or auxiliary loss). This is only slightly above a 3x larger Attention-Encoder-Decoder baseline. We also show that this way, an LLM architecture can learn to represent and reproduce the flow of time; and that a pre-trained 7B LLM can be fine-tuned to do reasonably well on this task.
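Illustrative note (not taken from the paper): the WER figures quoted above are word error rates, conventionally computed as the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. A minimal sketch follows; the function name `word_error_rate` and the example sentences are assumptions, and real evaluations usually apply text normalization first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```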
Submitted 13 June, 2024;
originally announced June 2024.
-
Stability-Constrained Learning for Frequency Regulation in Power Grids with Variable Inertia
Authors:
Jie Feng,
Manasa Muralidharan,
Rodrigo Henriquez-Auba,
Patricia Hidalgo-Gonzalez,
Yuanyuan Shi
Abstract:
The increasing penetration of converter-based renewable generation has resulted in faster frequency dynamics and low, variable inertia. As a result, there is a need for frequency control methods that are able to stabilize a disturbance in the power system at timescales comparable to the fast converter dynamics. This paper proposes a combined linear and neural network controller for inverter-based primary frequency control that is stable at time-varying levels of inertia. We model the time-variance in inertia via a switched affine hybrid system model. We derive stability certificates for the proposed controller via a quadratic candidate Lyapunov function. We test the proposed control on a 12-bus 3-area test network, and compare its performance with a base case linear controller, an optimized linear controller, and a finite-horizon Linear Quadratic Regulator (LQR). Our proposed controller achieves faster mean settling time and over 50% reduction in average control cost across 100 inertia scenarios compared to the optimized linear controller. Unlike LQR, which requires complete knowledge of the inertia trajectories and system dynamics over the entire control time horizon, our proposed controller is real-time tractable and achieves comparable performance to LQR.
Submitted 11 June, 2024; v1 submitted 30 May, 2024;
originally announced May 2024.
-
USD: Unsupervised Soft Contrastive Learning for Fault Detection in Multivariate Time Series
Authors:
Hong Liu,
Xiuxiu Qiu,
Yiming Shi,
Zelin Zang
Abstract:
Unsupervised fault detection in multivariate time series is critical for maintaining the integrity and efficiency of complex systems, with current methodologies largely focusing on statistical and machine learning techniques. However, these approaches often rest on the assumption that data distributions conform to Gaussian models, overlooking the diversity of patterns that can manifest in both normal and abnormal states, thereby diminishing discriminative performance. Our innovation addresses this limitation by introducing a combination of data augmentation and soft contrastive learning, specifically designed to capture the multifaceted nature of state behaviors more accurately. The data augmentation process enriches the dataset with varied representations of normal states, while soft contrastive learning fine-tunes the model's sensitivity to the subtle differences between normal and abnormal patterns, enabling it to recognize a broader spectrum of anomalies. This dual strategy significantly boosts the model's ability to distinguish between normal and abnormal states, leading to a marked improvement in fault detection performance across multiple datasets and settings, thereby setting a new benchmark for unsupervised fault detection in complex systems. The code of our method is available at https://github.com/zangzelin/code_USD.git.
Submitted 25 May, 2024;
originally announced May 2024.
-
TauAD: MRI-free Tau Anomaly Detection in PET Imaging via Conditioned Diffusion Models
Authors:
Lujia Zhong,
Shuo Huang,
Jiaxin Yue,
Jianwei Zhang,
Zhiwei Deng,
Wenhao Chi,
Yonggang Shi
Abstract:
The emergence of tau PET imaging over the last decade has enabled Alzheimer's disease (AD) researchers to examine tau pathology in vivo and more effectively characterize the disease trajectories of AD. Current tau PET analysis methods, however, typically perform inferences on large cortical ROIs and are limited in the detection of localized tau pathology that varies across subjects. Furthermore, a high-resolution MRI is required to carry out conventional tau PET analysis, which is not commonly acquired in clinical practice and may not be acquired for many elderly patients with dementia due to strong motion artifacts, claustrophobia, or certain metal implants. In this work, we propose a novel conditional diffusion model to perform MRI-free anomaly detection from tau PET imaging data. By including individualized conditions and two complementary loss maps from pseudo-healthy and pseudo-unhealthy reconstructions, our model computes an anomaly map across the entire brain area that allows simply training a support vector machine (SVM) for classifying disease severity. We train our model on ADNI subjects (n=534) and evaluate its performance on a separate dataset from the preclinical subjects of the A4 clinical trial (n=447). We demonstrate that our method outperforms baseline generative models and the conventional Z-score-based method in anomaly localization without mis-detecting off-target bindings in sub-cortical and out-of-brain areas. By classifying the A4 subjects according to their anomaly maps using the SVM trained on ADNI data, we show that our method can successfully group preclinical subjects with significantly different cognitive functions, which further demonstrates the effectiveness of our method in capturing biologically relevant anomalies in tau PET imaging.
Submitted 21 May, 2024;
originally announced May 2024.
-
PDE Control Gym: A Benchmark for Data-Driven Boundary Control of Partial Differential Equations
Authors:
Luke Bhan,
Yuexin Bian,
Miroslav Krstic,
Yuanyuan Shi
Abstract:
Over the last decade, data-driven methods have surged in popularity, emerging as valuable tools for control theory. As such, neural network approximations of control feedback laws, system dynamics, and even Lyapunov functions have attracted growing attention. With the ascent of learning-based control, the need for accurate, fast, and easy-to-use benchmarks has increased. In this work, we present the first learning-based environment for boundary control of PDEs. In our benchmark, we introduce three foundational PDE problems - a 1D transport PDE, a 1D reaction-diffusion PDE, and a 2D Navier-Stokes PDE - whose solvers are bundled in a user-friendly reinforcement learning gym. With this gym, we then present the first set of model-free reinforcement learning algorithms for solving this series of benchmark problems, achieving stability, although at a higher cost compared to model-based PDE backstepping. With the set of benchmark environments and detailed examples, this work significantly lowers the barrier to entry for learning-based PDE control - a topic largely unexplored by the data-driven control community. The entire benchmark is available on GitHub along with detailed documentation, and the presented reinforcement learning models are open-sourced.
Submitted 23 May, 2024; v1 submitted 18 May, 2024;
originally announced May 2024.
-
Improving Sequential Market Clearing via Value-oriented Renewable Energy Forecasting
Authors:
Yufan Zhang,
Honglin Wen,
Yuexin Bian,
Yuanyuan Shi
Abstract:
Large penetration of renewable energy sources (RESs) brings huge uncertainty into the electricity markets. While existing deterministic market clearing fails to accommodate the uncertainty, the recently proposed stochastic market clearing struggles to achieve desirable market properties. In this work, we propose a value-oriented forecasting approach, which tactically determines the RES generation that enters the day-ahead market. With such a forecast, the existing deterministic market clearing framework can be maintained, and the overall day-ahead and real-time operation cost is reduced. In the training phase, the forecast model parameters are estimated to minimize the expected overall day-ahead and real-time operation cost, instead of minimizing forecast errors in a statistical sense. Theoretically, we derive the exact form of the loss function for training the forecast model that aligns with such a goal. For market clearing modeled by linear programs, this loss function is a piecewise linear function. Additionally, we derive the analytical gradient of the loss function with respect to the forecast, which inspires an efficient training strategy. A numerical study shows that our forecasts can bring significant overall cost reductions to deterministic market clearing, compared to a quality-oriented forecasting approach.
Submitted 14 May, 2024;
originally announced May 2024.
-
On the Foundations of Earth and Climate Foundation Models
Authors:
Xiao Xiang Zhu,
Zhitong Xiong,
Yi Wang,
Adam J. Stewart,
Konrad Heidler,
Yuanyuan Wang,
Zhenghang Yuan,
Thomas Dujardin,
Qingsong Xu,
Yilei Shi
Abstract:
Foundation models have enormous potential in advancing Earth and climate sciences; however, current approaches may not be optimal as they focus on a few basic features of a desirable Earth and climate foundation model. To craft the ideal Earth foundation model, we define eleven features which would allow such a foundation model to be beneficial for any geoscientific downstream application in an environmental- and human-centric manner. We further shed light on the way forward to achieve the ideal model and to evaluate Earth foundation models. What comes after foundation models? Energy-efficient adaptation, adversarial defenses, and interpretability are among the emerging directions.
Submitted 7 May, 2024;
originally announced May 2024.
-
Brain Storm Optimization Based Swarm Learning for Diabetic Retinopathy Image Classification
Authors:
Liang Qu,
Cunze Wang,
Yuhui Shi
Abstract:
The application of deep learning techniques to medical problems has garnered widespread research interest in recent years, such as applying convolutional neural networks to medical image classification tasks. However, data in the medical field is often highly private, preventing different hospitals from sharing data to train an accurate model. Federated learning, as a privacy-preserving machine learning architecture, has shown promising performance in balancing data privacy and model utility by keeping private data on the client's side and using a central server to coordinate a set of clients for model training through aggregating their uploaded model parameters. Yet, this architecture heavily relies on a trusted third-party server, which is challenging to achieve in real life. Swarm learning, as a specialized decentralized federated learning architecture that does not require a central server, utilizes blockchain technology to enable direct parameter exchanges between clients. However, the mining of blocks requires significant computational resources, limiting its scalability. To address this issue, this paper integrates the brain storm optimization algorithm into the swarm learning framework, named BSO-SL. This approach clusters similar clients into different groups based on their model distributions. Additionally, leveraging the architecture of BSO, clients are given the probability to engage in collaborative learning both within their cluster and with clients outside their cluster, preventing the model from converging to local optima. The proposed method has been validated on a real-world diabetic retinopathy image classification dataset, and the experimental results demonstrate the effectiveness of the proposed approach.
Submitted 23 April, 2024;
originally announced April 2024.
-
Collaborative Edge AI Inference over Cloud-RAN
Authors:
Pengfei Zhang,
Dingzhu Wen,
Guangxu Zhu,
Qimei Chen,
Kaifeng Han,
Yuanming Shi
Abstract:
In this paper, a cloud radio access network (Cloud-RAN) based collaborative edge AI inference architecture is proposed. Specifically, geographically distributed devices capture real-time noise-corrupted sensory data samples and extract the noisy local feature vectors, which are then aggregated at each remote radio head (RRH) to suppress sensing noise. To realize efficient uplink feature aggregation, we allow each RRH to receive local feature vectors from all devices over the same resource blocks simultaneously by leveraging an over-the-air computation (AirComp) technique. Thereafter, these aggregated feature vectors are quantized and transmitted to a central processor (CP) for further aggregation and downstream inference tasks. Our aim in this work is to maximize the inference accuracy via a surrogate accuracy metric called discriminant gain, which measures the discernibility of different classes in the feature space. The key challenges lie in simultaneously suppressing the coupled sensing noise, the AirComp distortion caused by hostile wireless channels, and the quantization error resulting from the limited capacity of fronthaul links. To address these challenges, this work proposes a joint transmit precoding, receive beamforming, and quantization error control scheme to enhance the inference accuracy. Extensive numerical experiments demonstrate the effectiveness and superiority of our proposed optimization algorithm compared to various baselines.
Submitted 9 April, 2024;
originally announced April 2024.
-
Satellite Federated Edge Learning: Architecture Design and Convergence Analysis
Authors:
Yuanming Shi,
Li Zeng,
Jingyang Zhu,
Yong Zhou,
Chunxiao Jiang,
Khaled B. Letaief
Abstract:
The proliferation of low-earth-orbit (LEO) satellite networks leads to the generation of vast volumes of remote sensing data, which is traditionally transferred to the ground server for centralized processing, raising privacy and bandwidth concerns. Federated edge learning (FEEL), as a distributed machine learning approach, has the potential to address these challenges by sharing only model parameters instead of raw data. Although promising, the dynamics of LEO networks, characterized by the high mobility of satellites and short ground-to-satellite link (GSL) duration, pose unique challenges for FEEL. Notably, frequent model transmission between the satellites and ground incurs prolonged waiting time and large transmission latency. This paper introduces a novel FEEL algorithm, named FEDMEGA, tailored to LEO mega-constellation networks. By integrating inter-satellite links (ISL) for intra-orbit model aggregation, the proposed algorithm significantly reduces the usage of the low-data-rate and intermittent GSL. Our proposed method includes a ring all-reduce based intra-orbit aggregation mechanism, coupled with a network flow-based transmission scheme for global model aggregation, which enhances transmission efficiency. Theoretical convergence analysis is provided to characterize the algorithm performance. Extensive simulations show that our FEDMEGA algorithm outperforms existing satellite FEEL algorithms, exhibiting an approximately 30% improvement in convergence rate.
Submitted 2 April, 2024;
originally announced April 2024.
-
A Distributionally Robust Model Predictive Control for Static and Dynamic Uncertainties in Smart Grids
Authors:
Qi Li,
Ye Shi,
Yuning Jiang,
Yuanming Shi,
Haoyu Wang,
H. Vincent Poor
Abstract:
The integration of various power sources, including renewables and electric vehicles, into smart grids is expanding, introducing uncertainties that can result in issues like voltage imbalances, load fluctuations, and power losses. These challenges negatively impact the reliability and stability of online scheduling in smart grids. Existing research often addresses uncertainties affecting current states but overlooks those that impact future states, such as the unpredictable charging patterns of electric vehicles. To distinguish between these, we term them static uncertainties and dynamic uncertainties, respectively. This paper introduces WDR-MPC, a novel approach that stands for two-stage Wasserstein-based Distributionally Robust (WDR) optimization within a Model Predictive Control (MPC) framework, aimed at effectively managing both types of uncertainties in smart grids. The dynamic uncertainties are first reformulated into ambiguity tubes, and then the distributionally robust bounds of both dynamic and static uncertainties can be established using WDR optimization. By employing ambiguity tubes and WDR optimization, the stochastic MPC system is converted into a nominal one. Moreover, we develop a convex reformulation method to speed up WDR computation during the two-stage optimization. The distinctive contribution of this paper lies in its holistic approach to both static and dynamic uncertainties in smart grids. Comprehensive experimental results on IEEE 38-bus and 94-bus systems reveal the method's superior performance and its potential to enhance grid stability and reliability.
Submitted 24 March, 2024;
originally announced March 2024.
-
IDF-CR: Iterative Diffusion Process for Divide-and-Conquer Cloud Removal in Remote-sensing Images
Authors:
Meilin Wang,
Yexing Song,
Pengxu Wei,
Xiaoyu Xian,
Yukai Shi,
Liang Lin
Abstract:
Deep learning technologies have demonstrated their effectiveness in removing cloud cover from optical remote-sensing images. Convolutional Neural Networks (CNNs) dominate cloud removal tasks. However, constrained by the inherent limitations of convolutional operations, CNNs can address only a modest fraction of cloud occlusion. In recent years, diffusion models have achieved state-of-the-art (SOTA) proficiency in image generation and reconstruction due to their formidable generative capabilities. Inspired by the rapid development of diffusion models, we first present an iterative diffusion process for cloud removal (IDF-CR), which exhibits strong generative capabilities to achieve component divide-and-conquer cloud removal. IDF-CR consists of a pixel space cloud removal module (Pixel-CR) and a latent space iterative noise diffusion network (IND). Specifically, IDF-CR is divided into two stages that address pixel space and latent space. The two-stage model facilitates a strategic transition from preliminary cloud reduction to meticulous detail refinement. In the pixel space stage, Pixel-CR initiates the processing of cloudy images, yielding a suboptimal cloud removal result that provides the diffusion model with prior cloud removal knowledge. In the latent space stage, the diffusion model transforms the low-quality cloud removal result into a high-quality clean output. We refine Stable Diffusion by implementing ControlNet. In addition, an unsupervised iterative noise refinement (INR) module is introduced for the diffusion model to optimize the distribution of the predicted noise, thereby enhancing advanced detail recovery. Our model outperforms other SOTA methods, including image reconstruction and optical remote-sensing cloud removal methods, on optical remote-sensing datasets.
Submitted 18 March, 2024;
originally announced March 2024.
-
Ventilation and Temperature Control for Energy-efficient and Healthy Buildings: A Differentiable PDE Approach
Authors:
Yuexin Bian,
Xiaohan Fu,
Rajesh K. Gupta,
Yuanyuan Shi
Abstract:
In this paper, we introduce a novel framework for building learning and control, focusing on ventilation and thermal management to enhance energy efficiency. We validate the performance of the proposed framework in system model learning via two case studies: a synthetic study focusing on the joint learning of temperature and CO2 fields, and an application to a real-world dataset for CO2 field learning. For building control, we demonstrate that the proposed framework can optimize the control actions and significantly reduce the energy cost while maintaining a comfortable and healthy indoor environment. When compared to existing traditional methods, an optimization-based method with ODE models, and reinforcement learning, our approach can significantly reduce energy consumption while guaranteeing that all safety-critical air quality and control constraints are satisfied. Promising future research directions involve validating and improving the proposed PDE models through accurate estimation of airflow fields within indoor environments. Additionally, incorporating uncertainty modeling into the PDE framework for HVAC control presents an opportunity to enhance the efficiency and reliability of building HVAC system management.
Submitted 13 March, 2024;
originally announced March 2024.
-
Anatomy-Guided Surface Diffusion Model for Alzheimer's Disease Normative Modeling
Authors:
Jianwei Zhang,
Yonggang Shi
Abstract:
Normative modeling has emerged as a pivotal approach for characterizing heterogeneity and individual variance in neurodegenerative diseases, notably Alzheimer's disease (AD). One of the challenges of cortical normative modeling is the anatomical structure mismatch due to folding pattern variability. Traditionally, registration is applied to address this issue, and recently many studies have utilized deep generative models to generate anatomically aligned samples for analyzing disease progression; however, these models are predominantly applied to volume-based data, which often falls short in capturing intricate morphological changes on the brain cortex. As an alternative, surface-based analysis has been proven to be more sensitive in modeling diseases such as AD, yet, like volume-based data, it also suffers from the mismatch problem. To address these limitations, we propose a novel generative normative modeling framework by transferring the conditional diffusion generative model to the spherical non-Euclidean domain. Additionally, this approach generates normal feature map distributions by explicitly conditioning on individual anatomical segmentation to ensure better geometrical alignment, which helps to reduce anatomical variance between subjects in analysis. We find that our model can generate samples that are better anatomically aligned than registered reference data, and ablation studies and normative assessment experiments show that the samples better measure individual differences from the normal distribution and increase sensitivity in differentiating cognitively normal (CN), mild cognitive impairment (MCI), and Alzheimer's disease (AD) patients.
Submitted 7 March, 2024;
originally announced March 2024.
-
A Hierarchical Dataflow-Driven Heterogeneous Architecture for Wireless Baseband Processing
Authors:
Limin Jiang,
Yi Shi,
Haiqin Hu,
Qingyu Deng,
Siyi Xu,
Yintao Liu,
Feng Yuan,
Si Wang,
Yihao Shen,
Fangfang Ye,
Shan Cao,
Zhiyuan Jiang
Abstract:
Wireless baseband processing (WBP) is a key element of wireless communications, with a series of signal processing modules to improve data throughput and counter channel fading. Conventional hardware solutions, such as digital signal processors (DSPs) and, more recently, graphics processing units (GPUs), provide various degrees of parallelism, yet they both fail to take into account the cyclical and consecutive character of WBP. Furthermore, the large amount of data in WBP cannot be processed quickly in symmetric multiprocessors (SMPs) due to the unpredictability of memory latency. To address this issue, we propose a hierarchical dataflow-driven architecture to accelerate WBP. A pack-and-ship approach is presented under a non-uniform memory access (NUMA) architecture to allow the subordinate tiles to operate in a bundled access-and-execute manner. We also propose a multi-level dataflow model and the related scheduling scheme to manage and allocate the heterogeneous hardware resources. Experimental results demonstrate that our prototype achieves $2\times$ and $2.3\times$ speedup in terms of normalized throughput and single-tile clock cycles compared with GPU and DSP counterparts in several critical WBP benchmarks. Additionally, a link-level throughput of $288$ Mbps can be achieved with a $45$-core configuration.
Submitted 28 February, 2024;
originally announced February 2024.
-
Not All Weights Are Created Equal: Enhancing Energy Efficiency in On-Device Streaming Speech Recognition
Authors:
Yang Li,
Yuan Shangguan,
Yuhao Wang,
Liangzhen Lai,
Ernie Chang,
Changsheng Zhao,
Yangyang Shi,
Vikas Chandra
Abstract:
Power consumption plays an important role in on-device streaming speech recognition, as it has a direct impact on the user experience. This study delves into how weight parameters in speech recognition models influence the overall power consumption of these models. We discovered that the impact of weight parameters on power consumption varies, influenced by factors including how often they are invoked and their placement in memory. Armed with this insight, we developed design guidelines aimed at optimizing on-device speech recognition models. These guidelines focus on minimizing power use without substantially affecting accuracy. Our method, which employs targeted compression based on the varying sensitivities of weight parameters, demonstrates superior performance compared to state-of-the-art compression methods. It achieves a reduction in energy usage of up to 47% while maintaining similar model accuracy and improving the real-time factor.
Submitted 20 February, 2024;
originally announced February 2024.
-
Innovation-triggered Learning for Data-driven Predictive Control: Deterministic and Stochastic Formulations
Authors:
Kaikai Zheng,
Dawei Shi,
Sandra Hirche,
Yang Shi
Abstract:
Data-driven control has attracted considerable attention in recent years, especially for plants that are difficult to model based on first principles. In particular, a key issue in data-driven approaches is how to make efficient use of data as the abundance of data becomes overwhelming. To address this issue, this work proposes an innovation-triggered learning framework and a corresponding data-driven controller design approach with guaranteed stability. Specifically, we consider a linear time-invariant system with unknown dynamics subject to deterministic and stochastic disturbances, respectively. Two kinds of data selection mechanisms are proposed by evaluating online the innovation contained in the sampled data, wherein the innovation is quantified by its effect of shrinking the set of potential system dynamics that are compatible with the sampled data. Next, after introducing a stability criterion using the set-valued estimation of system dynamics, a robust data-driven predictive controller is designed by minimizing a worst-case cost function. The closed-loop stability of the data-driven predictive controller equipped with the innovation-triggered learning protocol is proved within a high-probability framework. Finally, numerical experiments are performed to verify the validity of the proposed approaches, and the characteristics and the selection principle of the learning hyper-parameter are also discussed.
Submitted 28 January, 2024;
originally announced January 2024.
-
Adaptive Neural-Operator Backstepping Control of a Benchmark Hyperbolic PDE
Authors:
Maxence Lamarque,
Luke Bhan,
Yuanyuan Shi,
Miroslav Krstic
Abstract:
To stabilize PDEs, feedback controllers require gain kernel functions, which are themselves governed by PDEs. Furthermore, these gain-kernel PDEs depend on the PDE plants' functional coefficients. The functional coefficients in PDE plants are often unknown. This requires an adaptive approach to PDE control, i.e., an estimation of the plant coefficients conducted concurrently with control, where a separate PDE for the gain kernel must be solved at each timestep upon the update in the plant coefficient function estimate. Solving a PDE at each timestep is computationally expensive and a barrier to the implementation of real-time adaptive control of PDEs. Recently, results in neural operator (NO) approximations of functional mappings have been introduced into PDE control, for replacing the computation of the gain kernel with a neural network that is trained, once offline, and reused in real-time for rapid solution of the PDEs. In this paper, we present the first result on applying NOs in adaptive PDE control, presented for a benchmark 1-D hyperbolic PDE with recirculation. We establish global stabilization via Lyapunov analysis, in the plant and parameter error states, and also present an alternative approach, via passive identifiers, which avoids the strong assumptions on kernel differentiability. We then present numerical simulations demonstrating stability and observe speedups up to three orders of magnitude, highlighting the real-time efficacy of neural operators in adaptive control. Our code (Github) is made publicly available for future researchers.
Submitted 15 January, 2024;
originally announced January 2024.
-
FADI-AEC: Fast Score Based Diffusion Model Guided by Far-end Signal for Acoustic Echo Cancellation
Authors:
Yang Liu,
Li Wan,
Yun Li,
Yiteng Huang,
Ming Sun,
James Luan,
Yangyang Shi,
Xin Lei
Abstract:
Despite the potential of diffusion models in speech enhancement, their deployment in Acoustic Echo Cancellation (AEC) has been restricted. In this paper, we propose DI-AEC, pioneering a diffusion-based stochastic regeneration approach dedicated to AEC. Further, we propose FADI-AEC, a fast score-based diffusion AEC framework that reduces computational demands, making it favorable for edge devices. It stands out by running the score model once per frame, achieving a significant surge in processing efficiency. Apart from that, we introduce a novel noise generation technique where far-end signals are utilized, incorporating both far-end and near-end signals to refine the score model's accuracy. We test our proposed method on the ICASSP2023 Microsoft deep echo cancellation challenge evaluation dataset, where our method outperforms some of the end-to-end methods and other diffusion-based echo cancellation methods.
Submitted 8 January, 2024;
originally announced January 2024.
-
Moving-Horizon Estimators for Hyperbolic and Parabolic PDEs in 1-D
Authors:
Luke Bhan,
Yuanyuan Shi,
Iasson Karafyllis,
Miroslav Krstic,
James B. Rawlings
Abstract:
Observers for PDEs are themselves PDEs. Therefore, producing real time estimates with such observers is computationally burdensome. For both finite-dimensional and ODE systems, moving-horizon estimators (MHE) are operators whose output is the state estimate, while their inputs are the initial state estimate at the beginning of the horizon as well as the measured output and input signals over the moving time horizon. In this paper we introduce MHEs for PDEs which remove the need for a numerical solution of an observer PDE in real time. We accomplish this using the PDE backstepping method which, for certain classes of both hyperbolic and parabolic PDEs, produces moving-horizon state estimates explicitly. Precisely, to explicitly produce the state estimates, we employ a backstepping transformation of a hard-to-solve observer PDE into a target observer PDE, which is explicitly solvable. The MHEs we propose are not new observer designs but simply the explicit MHE realizations, over a moving horizon of arbitrary length, of the existing backstepping observers. Our PDE MHEs lack the optimality of the MHEs that arose as duals of MPC, but they are given explicitly, even for PDEs. In the paper we provide explicit formulae for MHEs for both hyperbolic and parabolic PDEs, as well as simulation results that illustrate theoretically guaranteed convergence of the MHEs.
Submitted 4 January, 2024;
originally announced January 2024.
-
Robust TOA-based Localization with Inaccurate Anchors for MANET
Authors:
Xinkai Yu,
Yang Zheng,
Min Sheng,
Yan Shi,
Jiandong Li
Abstract:
Accurate node localization is vital for mobile ad hoc networks (MANETs). Current methods like Time of Arrival (TOA) can estimate node positions using imprecise baseplates and achieve the Cramér-Rao lower bound (CRLB) accuracy. In multi-hop MANETs, some nodes lack direct links to base anchors, depending on neighbor nodes as dynamic anchors for chain localization. However, the dynamic nature of MANETs challenges TOA's robustness due to the availability and accuracy of base anchors, coupled with ranging errors. To address the issue of cascading positioning error divergence, we first derive the CRLB for any primary node in MANETs as a metric to tackle localization error in cascading scenarios. Second, we propose an advanced two-step TOA method based on CRLB which is able to approximate target node's CRLB with only local neighbor information. Finally, simulation results confirm the robustness of our algorithm, achieving CRLB-level accuracy for small ranging errors and maintaining precision for larger errors compared to existing TOA methods.
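As background for the TOA setting described in this abstract, the following is a minimal sketch of classical TOA position estimation in 2-D with three perfectly known anchors, using the standard linearization obtained by subtracting the first range equation from the others. It is not the two-step CRLB-based method proposed in the paper; the function name `toa_locate_2d`, the anchor coordinates, and the noise-free ranges are all illustrative assumptions.

```python
import math

def toa_locate_2d(anchors, ranges):
    """Estimate (x, y) from exactly three anchors and their measured ranges.

    Subtracting the first range equation from the other two removes the
    quadratic terms and leaves a 2x2 linear system, solved by Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = ranges
    # Row i: 2(xi - x1) x + 2(yi - y1) y = d1^2 - di^2 + xi^2 + yi^2 - x1^2 - y1^2
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 + y2**2 - x1**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 + y3**2 - x1**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not unique")
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)

# Hypothetical anchors and target, with ranges computed without noise.
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
target = (1.0, 1.0)
ranges = [math.dist(a, target) for a in anchors]
estimate = toa_locate_2d(anchors, ranges)
```

With noisy ranges or uncertain anchor positions, the situation the abstract targets, this closed-form solve no longer recovers the target exactly, which is why error bounds such as the CRLB become the relevant yardstick.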
Submitted 29 December, 2023;
originally announced December 2023.
-
Inter-domain Resource Collaboration in Satellite Networks: An Intelligent Scheduling Approach Towards Hybrid Missions
Authors:
Chenxi Bao,
Di Zhou,
Min Sheng,
Yan Shi,
Jiandong Li
Abstract:
As the next-generation satellite network, which consists of various service function domains such as communication, observation, and navigation, moves towards large scale, single-domain resources can hardly provide satisfactory and timely service guarantees for the rapidly increasing mission demands of each domain. Breaking the barriers of resource independence in each domain and realizing the cross-domain transmission of missions to efficiently coordinate inter-domain resources is a promising solution. However, the hybrid scheduling of different missions and the continuous increase in the number of service domains have strengthened the differences and dynamics of mission demands, making efficient cross-domain mission scheduling (CMS) challenging. To this end, this paper first accurately characterizes the inter-satellite communication resource state in real time by exploiting the sparse resource representation scheme, and systematically characterizes the differentiation of mission demands through the mission priority model. Based on the information of resources and missions, we construct the top- and bottom-layer mission scheduling models of reward association by exploiting the correlation of intra- and inter-domain mission scheduling, and formulate the Markov decision process-based hierarchical CMS problem. Further, to achieve higher adaptability and autonomy of CMS and efficiently mitigate the impact of network scale, a hierarchical intelligent CMS algorithm is developed to dynamically adjust and efficiently match the CMS policy according to different mission demands. Simulation results demonstrate that the proposed algorithm achieves significant performance gains compared with independent domains and the existing CMS algorithms, and can still guarantee high service performance under different network scales.
Submitted 15 December, 2023;
originally announced December 2023.
-
A Comprehensive Dataset and Automated Pipeline for Nailfold Capillary Analysis
Authors:
Linxi Zhao,
Jiankai Tang,
Dongyu Chen,
Xiaohong Liu,
Yong Zhou,
Yuanchun Shi,
Guangyu Wang,
Yuntao Wang
Abstract:
Nailfold capillaroscopy is widely used in assessing health conditions, highlighting the pressing need for an automated nailfold capillary analysis system. In this study, we present a pioneering effort in constructing a comprehensive nailfold capillary dataset (321 images and 219 videos from 68 subjects, with clinical reports and expert annotations) that serves as a crucial resource for training deep-learning models. Leveraging this dataset, we fine-tuned three deep learning models with expert annotations as supervised labels and integrated them into a novel end-to-end nailfold capillary analysis pipeline. This pipeline excels in automatically detecting and measuring a wide range of size factors, morphological features, and dynamic aspects of nailfold capillaries. We compared our outcomes with clinical reports. Experimental results showed that our automated pipeline achieves an average of sub-pixel level precision in measurements and 89.9% accuracy in identifying morphological abnormalities. These results underscore its potential for advancing quantitative medical research and enabling pervasive computing in healthcare. Our data and code are available at https://github.com/THU-CS-PI-LAB/ANFC-Automated-Nailfold-Capillary.
Submitted 14 March, 2024; v1 submitted 10 December, 2023;
originally announced December 2023.
-
Implementing Digital Twin in Field-Deployed Optical Networks: Uncertain Factors, Operational Guidance, and Field-Trial Demonstration
Authors:
Yuchen Song,
Min Zhang,
Yao Zhang,
Yan Shi,
Shikui Shen,
Bingli Guo,
Shanguo Huang,
Danshi Wang
Abstract:
Digital twin has revolutionized optical communication networks by enabling their full life-cycle management, including design, troubleshooting, optimization, upgrade, and prediction. While extensive literature exists on frameworks, standards, and applications of digital twin, there is a pressing need to implement digital twin in field-deployed optical networks operating in real-world environments, as opposed to controlled laboratory settings. This paper addresses this challenge by examining the uncertain factors behind the inaccuracy of digital twin in field-deployed optical networks from three main challenges and proposing operational guidance for implementing accurate digital twin in field-deployed optical networks. Through the proposed guidance, we demonstrate the effective implementation of digital twin in a field-trial C+L-band optical transmission link, showcasing its capabilities in performance recovery in a fiber cut scenario.
Submitted 6 December, 2023;
originally announced December 2023.
-
MARformer: An Efficient Metal Artifact Reduction Transformer for Dental CBCT Images
Authors:
Yuxuan Shi,
Jun Xu,
Dinggang Shen
Abstract:
Cone Beam Computed Tomography (CBCT) plays a key role in dental diagnosis and surgery. However, metal teeth implants can introduce metal artifacts during the CBCT imaging process, interfering with diagnosis and downstream processing such as tooth segmentation. In this paper, we develop an efficient Transformer to perform metal artifact reduction (MAR) on dental CBCT images. The proposed MAR Transformer (MARformer) reduces the computational complexity of multihead self-attention with a new Dimension-Reduced Self-Attention (DRSA) module, exploiting the fact that CBCT images have globally similar structure. A Patch-wise Perceptive Feed Forward Network (P2FFN) is also proposed to perceive local image information for fine-grained restoration. Experimental results on CBCT images with synthetic and real-world metal artifacts show that our MARformer is efficient and outperforms previous MAR methods and two restoration Transformers.
Submitted 18 April, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Resilient and constrained consensus against adversarial attacks: A distributed MPC framework
Authors:
Henglai Wei,
Kunwu Zhang,
Hui Zhang,
Yang Shi
Abstract:
There has been a growing interest in realizing the resilient consensus of the multi-agent system (MAS) under cyber-attacks, which aims to achieve the consensus of normal agents (i.e., agents without attacks) in a network, depending on the neighboring information. The literature has developed mean-subsequence-reduced (MSR) algorithms for the MAS with F adversarial attacks and has shown that the consensus is achieved for the normal agents when the communication network is at least (2F+1)-robust. However, such a stringent requirement on the communication network needs to be relaxed to enable more practical applications. Our objective is, for the first time, to achieve less stringent conditions on the network, while ensuring the resilient consensus for the general linear MAS subject to control input constraints. In this work, we propose a distributed resilient consensus framework, consisting of a pre-designed consensus protocol and distributed model predictive control (DMPC) optimization, which can help significantly reduce the requirement on the network robustness and effectively handle the general linear constrained MAS under adversarial attacks. By employing a novel distributed adversarial attack detection mechanism based on the history information broadcast by neighbors and a convex set (i.e., resilience set), we can evaluate the reliability of communication links. Moreover, we show that the recursive feasibility of the associated DMPC optimization problem can be guaranteed. The proposed consensus protocol features the following properties: 1) by minimizing a group of control variables, the consensus performance is optimized; 2) the resilient consensus of the general linear constrained MAS subject to F-locally adversarial attacks is achieved when the communication network is (F+1)-robust. Finally, numerical simulation results are presented to verify the theoretical results.
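For readers unfamiliar with the MSR family mentioned in this abstract, the sketch below shows one update step of a common variant (often called W-MSR) for a single agent with a scalar state. It illustrates only the prior-work filtering idea, not the DMPC-based framework the paper proposes; the function name `wmsr_step` and the uniform averaging weights are assumptions made for illustration.

```python
def wmsr_step(own, neighbor_values, F):
    """One W-MSR-style update for an agent with scalar state `own`.

    Neighbor values strictly above `own` lose their F largest entries
    (all of them if there are fewer than F); values strictly below `own`
    lose their F smallest entries. The agent then averages what is left
    together with its own value (uniform weights, an assumption here).
    """
    larger = sorted(v for v in neighbor_values if v > own)
    smaller = sorted(v for v in neighbor_values if v < own)
    equal = [v for v in neighbor_values if v == own]
    kept_larger = larger[:max(len(larger) - F, 0)]   # drop the F largest
    kept_smaller = smaller[min(F, len(smaller)):]    # drop the F smallest
    kept = kept_smaller + equal + kept_larger + [own]
    return sum(kept) / len(kept)

# Hypothetical example: one neighbor reports an extreme value (100.0).
new_state = wmsr_step(5.0, [1.0, 4.0, 6.0, 100.0], F=1)
```

Because each agent discards up to F values on each side, the filter needs enough redundant neighbors to remain informative, which is the intuition behind the (2F+1)-robustness requirement that the abstract sets out to relax.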
Submitted 10 November, 2023;
originally announced November 2023.
-
3DGAUnet: 3D generative adversarial networks with a 3D U-Net based generator to achieve the accurate and effective synthesis of clinical tumor image data for pancreatic cancer
Authors:
Yu Shi,
Hannah Tang,
Michael Baine,
Michael A. Hollingsworth,
Huijing Du,
Dandan Zheng,
Chi Zhang,
Hongfeng Yu
Abstract:
Pancreatic ductal adenocarcinoma (PDAC) presents a critical global health challenge, and early detection is crucial for improving the 5-year survival rate. Recent advances in medical imaging and computational algorithms offer potential solutions for early diagnosis. Deep learning, particularly in the form of convolutional neural networks (CNNs), has demonstrated success in medical image analysis tasks, including classification and segmentation. However, the limited availability of clinical data for training purposes continues to pose a significant obstacle. Data augmentation, generative adversarial networks (GANs), and cross-validation are potential techniques to address this limitation and improve model performance, but effective solutions are still rare for 3D PDAC, where contrast is especially poor owing to the high heterogeneity in both tumor and background tissues. In this study, we developed a new GAN-based model, named 3DGAUnet, for generating realistic 3D CT images of PDAC tumors and pancreatic tissue, which can generate the interslice connection data that the existing 2D CT image synthesis models lack. Our innovation is to develop a 3D U-Net architecture for the generator to improve shape and texture learning for PDAC tumors and pancreatic tissue. Our approach offers a promising path to tackle the urgent requirement for creative and synergistic methods to combat PDAC. The development of this GAN-based model has the potential to alleviate data scarcity issues, elevate the quality of synthesized data, and thereby facilitate the progression of deep learning models to enhance the accuracy and early detection of PDAC tumors, which could profoundly impact patient outcomes. Furthermore, this model has the potential to be adapted to other types of solid tumors, hence making significant contributions to the field of medical imaging in terms of image processing models.
Submitted 27 November, 2023; v1 submitted 9 November, 2023;
originally announced November 2023.
-
Contributions of Individual Generators to Nodal Carbon Emissions
Authors:
Yize Chen,
Deepjyoti Deka,
Yuanyuan Shi
Abstract:
Recent shifts toward sustainable energy systems have witnessed the fast deployment of carbon-free and carbon-efficient generation across power networks. However, the benefits of carbon reduction are not experienced evenly throughout the grid. Each generator can have a distinct carbon emission rate. Due to the existence of physical power flows, nodal power consumption is met by a combination of generators, and this combination is determined by network topology, generators' characteristics, and power demand. This paper describes a technique based on a physical power flow model, which can efficiently compute the nodal carbon emissions contributed by each single generator given the generation and power flow information. We also extend the technique to calculate both the nodal average carbon emission and marginal carbon emission rates. Simulation results validate the effectiveness of the calculations, while our technique provides a fundamental tool for applications such as carbon auditing, carbon-oriented demand management, and future carbon-oriented capacity expansion.
Submitted 7 November, 2023; v1 submitted 6 November, 2023;
originally announced November 2023.
-
Imaging through multimode fibres with physical prior
Authors:
Chuncheng Zhang,
Yingjie Shi,
Zheyi Yao,
Xiubao Sui,
Qian Chen
Abstract:
Imaging through perturbed multimode fibres based on deep learning has been widely researched. However, existing methods mainly use target-speckle pairs in different configurations. It is challenging to reconstruct targets without trained networks. In this paper, we propose a physics-assisted, unsupervised, learning-based fibre imaging scheme. The role of the physical prior is to simplify the mapping relationship between the speckle pattern and the target image, thereby reducing the computational complexity. The unsupervised network learns target features according to the optimized direction provided by the physical prior. Therefore, the reconstruction process of the online learning only requires a few speckle patterns and unpaired targets. The proposed scheme also increases the generalization ability of the learning-based method in perturbed multimode fibres. Our scheme has the potential to extend the application of multimode fibre imaging.
Submitted 13 November, 2023; v1 submitted 6 November, 2023;
originally announced November 2023.
-
On The Open Prompt Challenge In Conditional Audio Generation
Authors:
Ernie Chang,
Sidd Srinivasan,
Mahi Luthra,
Pin-Jie Lin,
Varun Nagaraja,
Forrest Iandola,
Zechun Liu,
Zhaoheng Ni,
Changsheng Zhao,
Yangyang Shi,
Vikas Chandra
Abstract:
Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text. However, commercializing audio generation is challenging as user-input prompts are often under-specified when compared to text descriptions used to train TTA models. In this work, we treat TTA models as a "blackbox" and address the user prompt challenge with two key insights: (1) User prompts are generally under-specified, leading to a large alignment gap between user prompts and training prompts. (2) There is a distribution of audio descriptions for which TTA models are better at generating higher quality audio, which we refer to as "audionese". To this end, we rewrite prompts with instruction-tuned models and propose utilizing text-audio alignment as feedback signals via margin ranking learning for audio improvements. On both objective and subjective human evaluations, we observed marked improvements in both text-audio alignment and music audio quality.
Submitted 1 November, 2023;
originally announced November 2023.
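The abstract above mentions margin ranking learning driven by text-audio alignment feedback. As a rough illustration of the generic margin ranking loss only (not the authors' training setup), the sketch below penalizes a pair of candidate prompt rewrites whenever the rewrite that should rank higher fails to beat the other by a margin. The score values and the margin are hypothetical.

```python
def margin_ranking_loss(score_preferred, score_other, margin=0.1):
    """Hinge-style ranking loss for one pair of scores.

    The loss is zero once the preferred item outscores the other one by at
    least `margin`; otherwise it grows linearly with the shortfall.
    """
    return max(0.0, margin - (score_preferred - score_other))


def mean_margin_ranking_loss(pairs, margin=0.1):
    """Average the pairwise loss over (preferred, other) score pairs."""
    return sum(margin_ranking_loss(a, b, margin) for a, b in pairs) / len(pairs)


# Hypothetical alignment scores for two rewrite pairs: the first pair is
# already ranked correctly with room to spare, the second is ranked wrongly.
batch_loss = mean_margin_ranking_loss([(0.9, 0.5), (0.5, 0.6)], margin=0.2)
```

Only the mis-ranked pair contributes to the loss, which is why this objective needs relative preferences between rewrites rather than absolute quality labels.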
-
In-Context Prompt Editing For Conditional Audio Generation
Authors:
Ernie Chang,
Pin-Jie Lin,
Yang Li,
Sidd Srinivasan,
Gael Le Lan,
David Kant,
Yangyang Shi,
Forrest Iandola,
Vikas Chandra
Abstract:
Distributional shift is a central challenge in the deployment of machine learning models as they can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation where the encoded representations are easily undermined by unseen prompts, which leads to the degradation of generated audio -- the limited set of the text-audio pairs remains inadequate for conditional audio generation in the wild as user prompts are under-specified. In particular, we observe a consistent audio quality degradation in generated audio samples with user prompts, as opposed to training set prompts. To this end, we present a retrieval-based in-context prompt editing framework that leverages the training captions as demonstrative exemplars to revisit the user prompts. We show that the framework enhanced the audio quality across the set of collected user prompts, which were edited with reference to the training captions as exemplars.
Submitted 1 November, 2023;
originally announced November 2023.
-
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch
Authors:
Jeff Hwang,
Moto Hira,
Caroline Chen,
Xiaohui Zhang,
Zhaoheng Ni,
Guangzhi Sun,
Pingchuan Ma,
Ruizhe Huang,
Vineel Pratap,
Yuekai Zhang,
Anurag Kumar,
Chin-Yun Yu,
Chuang Zhu,
Chunxi Liu,
Jacob Kahn,
Mirco Ravanelli,
Peng Sun,
Shinji Watanabe,
Yangyang Shi,
Yumeng Tao,
Robin Scheibler,
Samuele Cornell,
Sean Kim,
Stavros Petridis
Abstract:
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.
Submitted 26 October, 2023;
originally announced October 2023.
-
One-Bit Byzantine-Tolerant Distributed Learning via Over-the-Air Computation
Authors:
Yuhan Yang,
Youlong Wu,
Yuning Jiang,
Yuanming Shi
Abstract:
Distributed learning has become a promising computational parallelism paradigm that enables a wide scope of intelligent applications from the Internet of Things (IoT) to autonomous driving and the healthcare industry. This paper studies distributed learning in wireless data center networks, which contain a central edge server and multiple edge workers to collaboratively train a shared global model and benefit from parallel computing. However, the distributed nature causes the vulnerability of the learning process to faults and adversarial attacks from Byzantine edge workers, as well as the severe communication and computation overhead induced by the periodical information exchange process. To achieve fast and reliable model aggregation in the presence of Byzantine attacks, we develop a signed stochastic gradient descent (SignSGD)-based Hierarchical Vote framework via over-the-air computation (AirComp), where one voting process is performed locally at the wireless edge by taking advantage of Bernoulli coding while the other is operated over-the-air at the central edge server by utilizing the waveform superposition property of the multiple-access channels. We comprehensively analyze the proposed framework on the impacts including Byzantine attacks and the wireless environment (channel fading and receiver noise), followed by characterizing the convergence behavior under non-convex settings. Simulation results validate our theoretical achievements and demonstrate the robustness of our proposed framework in the presence of Byzantine attacks and receiver noise.
Submitted 18 October, 2023; v1 submitted 18 October, 2023;
originally announced October 2023.
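To make the voting idea in the abstract above concrete, here is a minimal sketch of plain SignSGD with a single majority vote at the server. It is not the paper's Hierarchical Vote framework: the local voting stage with Bernoulli coding, the over-the-air aggregation, and channel effects are all left out, and the gradients below are made-up numbers.

```python
def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def majority_vote(worker_gradients):
    """Coordinate-wise majority vote over one-bit (sign) gradients."""
    dim = len(worker_gradients[0])
    # Each worker contributes only the sign of each coordinate, so a
    # Byzantine worker's influence is capped at one vote per coordinate.
    votes = [sum(sign(g[i]) for g in worker_gradients) for i in range(dim)]
    return [sign(v) for v in votes]


def signsgd_step(params, worker_gradients, lr):
    """Move every parameter by lr against the voted sign."""
    direction = majority_vote(worker_gradients)
    return [p - lr * d for p, d in zip(params, direction)]


# Three honest workers agree on the gradient signs; two Byzantine workers
# send large gradients with flipped signs (hypothetical values).
honest = [[0.5, -2.0], [0.1, -0.3], [1.2, -0.7]]
byzantine = [[-100.0, 100.0], [-100.0, 100.0]]
voted = majority_vote(honest + byzantine)
updated = signsgd_step([0.0, 0.0], honest + byzantine, lr=0.1)
```

Because magnitudes are discarded before voting, the two attackers cannot outweigh the three honest workers no matter how large their gradients are, which is the intuition behind sign-based Byzantine tolerance.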
-
A High Fidelity and Low Complexity Neural Audio Coding
Authors:
Wenzhe Liu,
Wei Xiao,
Meng Wang,
Shan Yang,
Yupeng Shi,
Yuyong Kang,
Dan Su,
Shidong Shang,
Dong Yu
Abstract:
Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor high-frequency expression and high computational cost and storage consumption, we proposed an integrated framework that utilizes a neural network to model wide-band components and adopts traditional signal processing to compress high-band components according to psychological hearing knowledge. Inspired by auditory perception theory, a perception-based loss function is designed to improve harmonic modeling. Besides, generative adversarial network (GAN) compression is proposed for the first time for neural audio codecs. Our method is superior to prior advanced neural codecs across subjective and objective metrics and allows real-time inference on desktop and mobile.
Submitted 17 October, 2023;
originally announced October 2023.
-
Over-the-Air Federated Learning and Optimization
Authors:
Jingyang Zhu,
Yuanming Shi,
Yong Zhou,
Chunxiao Jiang,
Wei Chen,
Khaled B. Letaief
Abstract:
Federated learning (FL), as an emerging distributed machine learning paradigm, allows a mass of edge devices to collaboratively train a global model while preserving privacy. In this tutorial, we focus on FL via over-the-air computation (AirComp), which is proposed to reduce the communication overhead for FL over wireless networks at the cost of compromising in the learning performance due to model aggregation error arising from channel fading and noise. We first provide a comprehensive study on the convergence of AirComp-based FedAvg (AirFedAvg) algorithms under both strongly convex and non-convex settings with constant and diminishing learning rates in the presence of data heterogeneity. Through convergence and asymptotic analysis, we characterize the impact of aggregation error on the convergence bound and provide insights for system design with convergence guarantees. Then we derive convergence rates for AirFedAvg algorithms for strongly convex and non-convex objectives. For different types of local updates that can be transmitted by edge devices (i.e., local model, gradient, and model difference), we reveal that transmitting local model in AirFedAvg may cause divergence in the training procedure. In addition, we consider more practical signal processing schemes to improve the communication efficiency and further extend the convergence analysis to different forms of model aggregation error caused by these signal processing schemes. Extensive simulation results under different settings of objective functions, transmitted local information, and communication schemes verify the theoretical conclusions.
Submitted 16 October, 2023;
originally announced October 2023.
-
ADASR: An Adversarial Auto-Augmentation Framework for Hyperspectral and Multispectral Data Fusion
Authors:
Jinghui Qin,
Lihuang Fang,
Ruitao Lu,
Liang Lin,
Yukai Shi
Abstract:
Deep learning-based hyperspectral image (HSI) super-resolution, which aims to generate high spatial resolution HSI (HR-HSI) by fusing hyperspectral image (HSI) and multispectral image (MSI) with deep neural networks (DNNs), has attracted lots of attention. However, neural networks require large amounts of training data, hindering their application in real-world scenarios. In this letter, we propose a novel adversarial automatic data augmentation framework ADASR that automatically optimizes and augments HSI-MSI sample pairs to enrich data diversity for HSI-MSI fusion. Our framework is sample-aware and optimizes an augmentor network and two downsampling networks jointly by adversarial learning so that we can learn more robust downsampling networks for training the upsampling network. Extensive experiments on two public classical hyperspectral datasets demonstrate the effectiveness of our ADASR compared to the state-of-the-art methods.
Submitted 11 October, 2023;
originally announced October 2023.
-
Diffusion Prior Regularized Iterative Reconstruction for Low-dose CT
Authors:
Wenjun Xia,
Yongyi Shi,
Chuang Niu,
Wenxiang Cong,
Ge Wang
Abstract:
Computed tomography (CT) involves a patient's exposure to ionizing radiation. To reduce the radiation dose, we can either lower the X-ray photon count or down-sample projection views. However, either of the ways often compromises image quality. To address this challenge, here we introduce an iterative reconstruction algorithm regularized by a diffusion prior. Drawing on the exceptional imaging prowess of the denoising diffusion probabilistic model (DDPM), we merge it with a reconstruction procedure that prioritizes data fidelity. This fusion capitalizes on the merits of both techniques, delivering exceptional reconstruction results in an unsupervised framework. To further enhance the efficiency of the reconstruction process, we incorporate the Nesterov momentum acceleration technique. This enhancement facilitates superior diffusion sampling in fewer steps. As demonstrated in our experiments, our method offers a potential pathway to high-definition CT image reconstruction with minimized radiation.
Submitted 10 October, 2023;
originally announced October 2023.
-
A Glance is Enough: Extract Target Sentence By Looking at A keyword
Authors:
Ying Shi,
Dong Wang,
Lantian Li,
Jiqing Han
Abstract:
This paper investigates the possibility of extracting a target sentence from multi-talker speech using only a keyword as input. For example, in social security applications, the keyword might be "help", and the goal is to identify what the person who called for help is articulating while ignoring other speakers. To address this problem, we propose using the Transformer architecture to embed both the keyword and the speech utterance and then rely on the cross-attention mechanism to select the correct content from the concatenated or overlapping speech. Experimental results on Librispeech demonstrate that our proposed method can effectively extract target sentences from very noisy and mixed speech (SNR=-3dB), achieving a phone error rate (PER) of 26%, compared to the baseline system's PER of 96%.
Submitted 8 October, 2023;
originally announced October 2023.
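The abstract above reports results as phone error rate (PER). As a reminder of how such a figure is usually obtained, the sketch below uses the standard definition: the Levenshtein distance between reference and hypothesis phone sequences, divided by the reference length. The phone sequences are invented for illustration, and the paper's actual scoring tool is not stated in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """Substitutions + deletions + insertions, divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical phones for the keyword "help": one substitution, one deletion.
reference = ["HH", "EH", "L", "P"]
hypothesis = ["HH", "AH", "L"]
per = phone_error_rate(reference, hypothesis)
```

Since insertions count as errors, PER is not bounded by 100%, so a baseline near 96% indicates output that is almost entirely wrong rather than merely incomplete.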
-
High Visual-Fidelity Learned Video Compression
Authors:
Meng Li,
Yibo Shi,
Jing Wang,
Yunqi Huang
Abstract:
With the growing demand for video applications, many advanced learned video compression methods have been developed, outperforming traditional methods in terms of objective quality metrics such as PSNR. Existing methods primarily focus on objective quality but tend to overlook perceptual quality. Directly incorporating perceptual loss into a learned video compression framework is nontrivial and raises several perceptual quality issues that need to be addressed. In this paper, we investigated these issues in learned video compression and propose a novel High Visual-Fidelity Learned Video Compression framework (HVFVC). Specifically, we design a novel confidence-based feature reconstruction method to address the issue of poor reconstruction in newly-emerged regions, which significantly improves the visual quality of the reconstruction. Furthermore, we present a periodic compensation loss to mitigate the checkerboard artifacts related to deconvolution operation and optimization. Extensive experiments have shown that the proposed HVFVC achieves excellent perceptual quality, outperforming the latest VVC standard with only 50% required bitrate.
Submitted 6 October, 2023;
originally announced October 2023.
-
Blind CT Image Quality Assessment Using DDPM-derived Content and Transformer-based Evaluator
Authors:
Yongyi Shi,
Wenjun Xia,
Ge Wang,
Xuanqin Mou
Abstract:
Lowering radiation dose per view and utilizing sparse views per scan are two common CT scan modes, albeit often leading to distorted images characterized by noise and streak artifacts. Blind image quality assessment (BIQA) strives to evaluate perceptual quality in alignment with what radiologists perceive, which plays an important role in advancing low-dose CT reconstruction techniques. An intriguing direction involves developing BIQA methods that mimic the operational characteristic of the human visual system (HVS). The internal generative mechanism (IGM) theory reveals that the HVS actively deduces primary content to enhance comprehension. In this study, we introduce an innovative BIQA metric that emulates the active inference process of IGM. Initially, an active inference module, implemented as a denoising diffusion probabilistic model (DDPM), is constructed to anticipate the primary content. Then, the dissimilarity map is derived by assessing the interrelation between the distorted image and its primary content. Subsequently, the distorted image and dissimilarity map are combined into a multi-channel image, which is inputted into a transformer-based image quality evaluator. Remarkably, by exclusively utilizing this transformer-based quality evaluator, we won the second place in the MICCAI 2023 low-dose computed tomography perceptual image quality assessment grand challenge. Leveraging the DDPM-derived primary content, our approach further improves the performance on the challenge dataset.
Submitted 4 October, 2023;
originally announced October 2023.
-
Deriving Loss Function for Value-oriented Renewable Energy Forecasting
Authors:
Yufan Zhang,
Honglin Wen,
Yuexin Bian,
Yuanyuan Shi
Abstract:
Renewable energy forecasting is the workhorse for efficient energy dispatch. However, forecasts with small mean squared errors (MSE) may not necessarily lead to low operation costs. Here, we propose a forecasting approach specifically tailored for operational purposes, by incorporating operational problems into the estimation of forecast models via designing a loss function. We formulate a bilevel program, where the operation problem is at the lower level, and the forecast model estimation is at the upper level. We establish the relationship between the lower-level optimal solutions and forecasts through multiparametric programming. By integrating it into the upper-level objective for minimizing expected operation cost, we convert the bilevel problem to a single-level one and derive the loss function for training the model. It is proved to be piecewise linear, for linear operation problem. Compared to the commonly used loss functions, e.g. MSE, our approach achieves lower operation costs.
Submitted 1 October, 2023;
originally announced October 2023.
-
HyperLISTA-ABT: An Ultra-light Unfolded Network for Accurate Multi-component Differential Tomographic SAR Inversion
Authors:
Kun Qian,
Yuanyuan Wang,
Peter Jung,
Yilei Shi,
Xiao Xiang Zhu
Abstract:
Deep neural networks based on unrolled iterative algorithms have achieved remarkable success in sparse reconstruction applications, such as synthetic aperture radar (SAR) tomographic inversion (TomoSAR). However, the currently available deep learning-based TomoSAR algorithms are limited to three-dimensional (3D) reconstruction. The extension of deep learning-based algorithms to four-dimensional (4D) imaging, i.e., differential TomoSAR (D-TomoSAR) applications, is impeded mainly due to the high-dimensional weight matrices required by the network designed for D-TomoSAR inversion, which typically contain millions of freely trainable parameters. Learning such huge number of weights requires an enormous number of training samples, resulting in a large memory burden and excessive time consumption. To tackle this issue, we propose an efficient and accurate algorithm called HyperLISTA-ABT. The weights in HyperLISTA-ABT are determined in an analytical way according to a minimum coherence criterion, trimming the model down to an ultra-light one with only three hyperparameters. Additionally, HyperLISTA-ABT improves the global thresholding by utilizing an adaptive blockwise thresholding scheme, which applies block-coordinate techniques and conducts thresholding in local blocks, so that weak expressions and local features can be retained in the shrinkage step layer by layer. Simulations were performed and demonstrated the effectiveness of our approach, showing that HyperLISTA-ABT achieves superior computational efficiency and with no significant performance degradation compared to state-of-the-art methods. Real data experiments showed that a high-quality 4D point cloud could be reconstructed over a large area by the proposed HyperLISTA-ABT with affordable computational resources and in a fast time.
Submitted 28 September, 2023;
originally announced September 2023.
-
VoiceLens: Controllable Speaker Generation and Editing with Flow
Authors:
Yao Shi,
Ming Li
Abstract:
Currently, many multi-speaker speech synthesis and voice conversion systems address speaker variations with an embedding vector. Modeling it directly allows new voices outside of training data to be synthesized. GMM-based approaches such as Tacospawn are favored in the literature for this generation task, but there are still some limitations when difficult conditionings are involved. In this paper, we propose VoiceLens, a semi-supervised flow-based approach, to model speaker embedding distributions for multi-conditional speaker generation. VoiceLens maps speaker embeddings into a combination of independent attributes and residual information. It allows new voices associated with certain attributes to be generated for existing TTS models, and attributes of known voices to be meaningfully edited. We show that VoiceLens displays an unconditional generation capacity that is similar to Tacospawn while obtaining higher controllability and flexibility when used in a conditional manner. In addition, we show that synthesizing less noisy speech from known noisy speakers without re-training the TTS model is possible solely by editing their embeddings with an SNR-conditioned VoiceLens model. Demos are available at sos1sos2sixteen.github.io/voicelens.
Submitted 25 September, 2023;
originally announced September 2023.
-
BiSinger: Bilingual Singing Voice Synthesis
Authors:
Huali Zhou,
Yueqian Lin,
Yao Shi,
Peng Sun,
Ming Li
Abstract:
Although Singing Voice Synthesis (SVS) has made great strides with Text-to-Speech (TTS) techniques, multilingual singing voice modeling remains relatively unexplored. This paper presents BiSinger, a bilingual pop SVS system for English and Chinese Mandarin. Current systems require separate models per language and cannot accurately represent both Chinese and English, hindering code-switch SVS. To address this gap, we design a shared representation between Chinese and English singing voices, achieved by using the CMU dictionary with mapping rules. We fuse monolingual singing datasets with open-source singing voice conversion techniques to generate bilingual singing voices while also exploring the potential use of bilingual speech data. Experiments affirm that our language-independent representation and incorporation of related datasets enable a single model with enhanced performance in English and code-switch SVS while maintaining Chinese song performance. Audio samples are available at https://bisinger-svs.github.io.
Submitted 9 January, 2024; v1 submitted 25 September, 2023;
originally announced September 2023.
-
Self-supervised Domain-agnostic Domain Adaptation for Satellite Images
Authors:
Fahong Zhang,
Yilei Shi,
Xiao Xiang Zhu
Abstract:
Domain shift caused by, e.g., different geographical regions or acquisition conditions is a common issue in machine learning for global scale satellite image processing. A promising method to address this problem is domain adaptation, where the training and the testing datasets are split into two or multiple domains according to their distributions, and an adaptation method is applied to improve the generalizability of the model on the testing dataset. However, defining the domain to which each satellite image belongs is not trivial, especially under large-scale multi-temporal and multi-sensory scenarios, where a single image mosaic could be generated from multiple data sources. In this paper, we propose a self-supervised domain-agnostic domain adaptation (SS(DA)2) method to perform domain adaptation without such a domain definition. To achieve this, we first design a contrastive generative adversarial loss to train a generative network to perform image-to-image translation between any two satellite image patches. Then, we improve the generalizability of the downstream models by augmenting the training data with different testing spectral characteristics. The experimental results on public benchmarks verify the effectiveness of SS(DA)2.
Submitted 25 September, 2023; v1 submitted 20 September, 2023;
originally announced September 2023.
-
Exploring Speech Enhancement for Low-resource Speech Synthesis
Authors:
Zhaoheng Ni,
Sravya Popuri,
Ning Dong,
Kohei Saijo,
Xiaohui Zhang,
Gael Le Lan,
Yangyang Shi,
Vikas Chandra,
Changhan Wang
Abstract:
High-quality and intelligible speech is essential to text-to-speech (TTS) model training, however, obtaining high-quality data for low-resource languages is challenging and expensive. Applying speech enhancement on Automatic Speech Recognition (ASR) corpus mitigates the issue by augmenting the training data, while how the nonlinear speech distortion brought by speech enhancement models affects TTS training still needs to be investigated. In this paper, we train a TF-GridNet speech enhancement model and apply it to low-resource datasets that were collected for the ASR task, then train a discrete unit based TTS model on the enhanced speech. We use Arabic datasets as an example and show that the proposed pipeline significantly improves the low-resource TTS system compared with other baseline methods in terms of ASR WER metric. We also run empirical analysis on the correlation between speech enhancement and TTS performances.
Submitted 19 September, 2023;
originally announced September 2023.
-
FoleyGen: Visually-Guided Audio Generation
Authors:
Xinhao Mei,
Varun Nagaraja,
Gael Le Lan,
Zhaoheng Ni,
Ernie Chang,
Yangyang Shi,
Vikas Chandra
Abstract:
Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between the high-dimensional visual and auditory data, and the challenges associated with temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language modeling paradigm. FoleyGen leverages an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. The generation of audio tokens is facilitated by a single Transformer model, which is conditioned on visual features extracted from a visual encoder. A prevalent problem in V2A generation is the misalignment of generated audio with the visible actions in the video. To address this, we explore three novel visual attention mechanisms. We further undertake an exhaustive evaluation of multiple visual encoders, each pretrained on either single-modal or multi-modal tasks. The experimental results on VGGSound dataset show that our proposed FoleyGen outperforms previous systems across all objective metrics and human evaluations.
Submitted 19 September, 2023;
originally announced September 2023.