Search | arXiv e-print repository

arXiv:2406.19043 [pdf]

CMRxRecon2024: A Multi-Modality, Multi-View K-Space Dataset Boosting Universal Machine Learning for Accelerated Cardiac MRI

Authors: Zi Wang, Fanwen Wang, Chen Qin, Jun Lyu, Ouyang Cheng, Shuo Wang, Yan Li, Mengyao Yu, Haoyu Zhang, Kunyuan Guo, Zhang Shi, Qirong Li, Ziqiang Xu, Ya**g Zhang, Hao Li, Sha Hua, Binghua Chen, Longyu Sun, Mengting Sun, Qin Li, Ying-Hua Chu, Wenjia Bai, **g Qin, Xiahai Zhuang, Claudia Prieto , et al. (7 additional authors not shown)

Abstract: Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover h… ▽ More Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover high-quality, clinically interpretable images from undersampled measurements. However, the lack of publicly available cardiac MRI k-space dataset in terms of both quantity and diversity has severely hindered substantial technological progress, particularly for data-driven artificial intelligence. Here, we provide a standardized, diverse, and high-quality CMRxRecon2024 dataset to facilitate the technical development, fair evaluation, and clinical transfer of cardiac MRI reconstruction approaches, towards promoting the universal frameworks that enable fast and robust reconstructions across different cardiac MRI protocols in clinical practice. To the best of our knowledge, the CMRxRecon2024 dataset is the largest and most diverse publicly available cardiac k-space dataset. It is acquired from 330 healthy volunteers, covering commonly used modalities, anatomical views, and acquisition trajectories in clinical cardiac MRI workflows. Besides, an open platform with tutorials, benchmarks, and data processing tools is provided to facilitate data usage, advanced method development, and fair performance evaluation. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: 19 pages, 3 figures, 2 tables

arXiv:2406.18950 [pdf, other]

MMR-Mamba: Multi-Contrast MRI Reconstruction with Mamba and Spatial-Frequency Information Fusion

Authors: **g Zou, Lanqing Liu, Qi Chen, Shujun Wang, Xiaohan Xing, **g Qin

Abstract: Multi-contrast MRI acceleration has become prevalent in MR imaging, enabling the reconstruction of high-quality MR images from under-sampled k-space data of the target modality, using guidance from a fully-sampled auxiliary modality. The main crux lies in efficiently and comprehensively integrating complementary information from the auxiliary modality. Existing methods either suffer from quadratic… ▽ More Multi-contrast MRI acceleration has become prevalent in MR imaging, enabling the reconstruction of high-quality MR images from under-sampled k-space data of the target modality, using guidance from a fully-sampled auxiliary modality. The main crux lies in efficiently and comprehensively integrating complementary information from the auxiliary modality. Existing methods either suffer from quadratic computational complexity or fail to capture long-range correlated features comprehensively. In this work, we propose MMR-Mamba, a novel framework that achieves comprehensive integration of multi-contrast features through Mamba and spatial-frequency information fusion. Firstly, we design the \textit{Target modality-guided Cross Mamba} (TCM) module in the spatial domain, which maximally restores the target modality information by selectively absorbing useful information from the auxiliary modality. Secondly, leveraging global properties of the Fourier domain, we introduce the \textit{Selective Frequency Fusion} (SFF) module to efficiently integrate global information in the frequency domain and recover high-frequency signals for the reconstruction of structure details. Additionally, we present the \textit{Adaptive Spatial-Frequency Fusion} (ASFF) module, which enhances fused features by supplementing less informative features from one domain with corresponding features from the other domain. These innovative strategies ensure efficient feature fusion across spatial and frequency domains, avoiding the introduction of redundant information and facilitating the reconstruction of high-quality target images. Extensive experiments on the BraTS and fastMRI knee datasets demonstrate the superiority of the proposed MMR-Mamba over state-of-the-art MRI reconstruction methods. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: 10 pages, 5 figure

arXiv:2406.14534 [pdf, other]

Epicardium Prompt-guided Real-time Cardiac Ultrasound Frame-to-volume Registration

Authors: Long Lei, Jun Zhou, Jialun Pei, Baoliang Zhao, Yueming **, Yuen-Chun Jeremy Teoh, **g Qin, Pheng-Ann Heng

Abstract: A comprehensive guidance view for cardiac interventional surgery can be provided by the real-time fusion of the intraoperative 2D images and preoperative 3D volume based on the ultrasound frame-to-volume registration. However, cardiac ultrasound images are characterized by a low signal-to-noise ratio and small differences between adjacent frames, coupled with significant dimension variations betwe… ▽ More A comprehensive guidance view for cardiac interventional surgery can be provided by the real-time fusion of the intraoperative 2D images and preoperative 3D volume based on the ultrasound frame-to-volume registration. However, cardiac ultrasound images are characterized by a low signal-to-noise ratio and small differences between adjacent frames, coupled with significant dimension variations between 2D frames and 3D volumes to be registered, resulting in real-time and accurate cardiac ultrasound frame-to-volume registration being a very challenging task. This paper introduces a lightweight end-to-end Cardiac Ultrasound frame-to-volume Registration network, termed CU-Reg. Specifically, the proposed model leverages epicardium prompt-guided anatomical clues to reinforce the interaction of 2D sparse and 3D dense features, followed by a voxel-wise local-global aggregation of enhanced features, thereby boosting the cross-dimensional matching effectiveness of low-quality ultrasound modalities. We further embed an inter-frame discriminative regularization term within the hybrid supervised learning to increase the distinction between adjacent slices in the same ultrasound volume to ensure registration stability. Experimental results on the reprocessed CAMUS dataset demonstrate that our CU-Reg surpasses existing methods in terms of registration accuracy and efficiency, meeting the guidance requirements of clinical cardiac interventional surgery. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: This paper has been accepted by MICCAI 2024

arXiv:2405.09752 [pdf, other]

Time-Varying Graph Signal Recovery Using High-Order Smoothness and Adaptive Low-rankness

Authors: Weihong Guo, Yifei Lou, **g Qin, Ming Yan

Abstract: Time-varying graph signal recovery has been widely used in many applications, including climate change, environmental hazard monitoring, and epidemic studies. It is crucial to choose appropriate regularizations to describe the characteristics of the underlying signals, such as the smoothness of the signal over the graph domain and the low-rank structure of the spatial-temporal signal modeled in a… ▽ More Time-varying graph signal recovery has been widely used in many applications, including climate change, environmental hazard monitoring, and epidemic studies. It is crucial to choose appropriate regularizations to describe the characteristics of the underlying signals, such as the smoothness of the signal over the graph domain and the low-rank structure of the spatial-temporal signal modeled in a matrix form. As one of the most popular options, the graph Laplacian is commonly adopted in designing graph regularizations for reconstructing signals defined on a graph from partially observed data. In this work, we propose a time-varying graph signal recovery method based on the high-order Sobolev smoothness and an error-function weighted nuclear norm regularization to enforce the low-rankness. Two efficient algorithms based on the alternating direction method of multipliers and iterative reweighting are proposed, and convergence of one algorithm is shown in detail. We conduct various numerical experiments on synthetic and real-world data sets to demonstrate the proposed method's effectiveness compared to the state-of-the-art in graph signal recovery. △ Less

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2405.07648 [pdf, other]

CDFormer:When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution

Authors: Qingguo Liu, Chenyi Zhuang, Pan Gao, Jie Qin

Abstract: Existing Blind image Super-Resolution (BSR) methods focus on estimating either kernel or degradation information, but have long overlooked the essential content details. In this paper, we propose a novel BSR approach, Content-aware Degradation-driven Transformer (CDFormer), to capture both degradation and content representations. However, low-resolution images cannot provide enough content details… ▽ More Existing Blind image Super-Resolution (BSR) methods focus on estimating either kernel or degradation information, but have long overlooked the essential content details. In this paper, we propose a novel BSR approach, Content-aware Degradation-driven Transformer (CDFormer), to capture both degradation and content representations. However, low-resolution images cannot provide enough content details, and thus we introduce a diffusion-based module $CDFormer_{diff}$ to first learn Content Degradation Prior (CDP) in both low- and high-resolution images, and then approximate the real distribution given only low-resolution information. Moreover, we apply an adaptive SR network $CDFormer_{SR}$ that effectively utilizes CDP to refine features. Compared to previous diffusion-based SR methods, we treat the diffusion model as an estimator that can overcome the limitations of expensive sampling time and excessive diversity. Experiments show that CDFormer can outperform existing methods, establishing a new state-of-the-art performance on various benchmarks under blind settings. Codes and models will be available at \href{https://github.com/I2-Multimedia-Lab/CDFormer}{https://github.com/I2-Multimedia-Lab/CDFormer}. △ Less

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2404.01082 [pdf, other]

The state-of-the-art in Cardiac MRI Reconstruction: Results of the CMRxRecon Challenge in MICCAI 2023

Authors: Jun Lyu, Chen Qin, Shuo Wang, Fanwen Wang, Yan Li, Zi Wang, Kunyuan Guo, Cheng Ouyang, Michael Tänzer, Meng Liu, Longyu Sun, Mengting Sun, Qin Li, Zhang Shi, Sha Hua, Hao Li, Zhensen Chen, Zhenlin Zhang, Bingyu Xin, Dimitris N. Metaxas, George Yiasemis, Jonas Teuwen, Li** Zhang, Weitian Chen, Yidong Zhao , et al. (25 additional authors not shown)

Abstract: Cardiac MRI, crucial for evaluating heart structure and function, faces limitations like slow imaging and motion artifacts. Undersampling reconstruction, especially data-driven algorithms, has emerged as a promising solution to accelerate scans and enhance imaging performance using highly under-sampled data. Nevertheless, the scarcity of publicly available cardiac k-space datasets and evaluation p… ▽ More Cardiac MRI, crucial for evaluating heart structure and function, faces limitations like slow imaging and motion artifacts. Undersampling reconstruction, especially data-driven algorithms, has emerged as a promising solution to accelerate scans and enhance imaging performance using highly under-sampled data. Nevertheless, the scarcity of publicly available cardiac k-space datasets and evaluation platform hinder the development of data-driven reconstruction algorithms. To address this issue, we organized the Cardiac MRI Reconstruction Challenge (CMRxRecon) in 2023, in collaboration with the 26th International Conference on MICCAI. CMRxRecon presented an extensive k-space dataset comprising cine and map** raw data, accompanied by detailed annotations of cardiac anatomical structures. With overwhelming participation, the challenge attracted more than 285 teams and over 600 participants. Among them, 22 teams successfully submitted Docker containers for the testing phase, with 7 teams submitted for both cine and map** tasks. All teams use deep learning based approaches, indicating that deep learning has predominately become a promising solution for the problem. The first-place winner of both tasks utilizes the E2E-VarNet architecture as backbones. In contrast, U-Net is still the most popular backbone for both multi-coil and single-coil reconstructions. This paper provides a comprehensive overview of the challenge design, presents a summary of the submitted results, reviews the employed methods, and offers an in-depth discussion that aims to inspire future advancements in cardiac MRI reconstruction models. The summary emphasizes the effective strategies observed in Cardiac MRI reconstruction, including backbone architecture, loss function, pre-processing techniques, physical modeling, and model complexity, thereby providing valuable insights for further developments in this field. △ Less

Submitted 16 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: 25 pages, 17 figures

arXiv:2403.06197 [pdf, other]

DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency

Authors: Wenfang Yao, Ke**g Yin, William K. Cheung, Jia Liu, **g Qin

Abstract: The combination of electronic health records (EHR) and medical images is crucial for clinicians in making diagnoses and forecasting prognosis. Strategically fusing these two data modalities has great potential to improve the accuracy of machine learning models in clinical prediction tasks. However, the asynchronous and complementary nature of EHR and medical images presents unique challenges. Miss… ▽ More The combination of electronic health records (EHR) and medical images is crucial for clinicians in making diagnoses and forecasting prognosis. Strategically fusing these two data modalities has great potential to improve the accuracy of machine learning models in clinical prediction tasks. However, the asynchronous and complementary nature of EHR and medical images presents unique challenges. Missing modalities due to clinical and administrative factors are inevitable in practice, and the significance of each data modality varies depending on the patient and the prediction target, resulting in inconsistent predictions and suboptimal model performance. To address these challenges, we propose DrFuse to achieve effective clinical multi-modal fusion. It tackles the missing modality issue by disentangling the features shared across modalities and those unique within each modality. Furthermore, we address the modal inconsistency issue via a disease-wise attention layer that produces the patient- and disease-wise weighting for each modality to make the final prediction. We validate the proposed method using real-world large-scale datasets, MIMIC-IV and MIMIC-CXR. Experimental results show that the proposed method significantly outperforms the state-of-the-art models. Our implementation is publicly available at https://github.com/dorothy-yao/drfuse. △ Less

Submitted 10 March, 2024; originally announced March 2024.

Comments: Accepted by AAAI-24

arXiv:2402.00773 [pdf, other]

On the Choice of Loss Function in Learning-based Optimal Power Flow

Authors: Ge Chen, Junjie Qin

Abstract: We analyze and contrast two ways to train machine learning models for solving AC optimal power flow (OPF) problems, distinguished with the loss functions used. The first trains a map** from the loads to the optimal dispatch decisions, utilizing mean square error (MSE) between predicted and optimal dispatch decisions as the loss function. The other intends to learn the same map**, but directly… ▽ More We analyze and contrast two ways to train machine learning models for solving AC optimal power flow (OPF) problems, distinguished with the loss functions used. The first trains a map** from the loads to the optimal dispatch decisions, utilizing mean square error (MSE) between predicted and optimal dispatch decisions as the loss function. The other intends to learn the same map**, but directly uses the OPF cost of the predicted decisions, referred to as decision loss, as the loss function. In addition to better aligning with the OPF cost which results in reduced suboptimality, the use of decision loss can circumvent feasibility issues that arise with MSE when the underlying map** from loads to optimal dispatch is discontinuous. Since decision loss does not capture the OPF constraints, we further develop a neural network with a specific structure and introduce a modified training algorithm incorporating Lagrangian duality to improve feasibility.} This result in an improved performance measured by feasibility and suboptimality as demonstrated with an IEEE 39-bus case study. △ Less

Submitted 1 February, 2024; originally announced February 2024.

Comments: 5 pages, Accepted by PESGM2024

arXiv:2402.00772 [pdf, other]

Neural Risk Limiting Dispatch in Power Networks: Formulation and Generalization Guarantees

Authors: Ge Chen, Junjie Qin

Abstract: Risk limiting dispatch (RLD) has been proposed as an approach that effectively trades off economic costs with operational risks for power dispatch under uncertainty. However, how to solve the RLD problem with provably near-optimal performance still remains an open problem. This paper presents a learning-based solution to this challenge. We first design a data-driven formulation for the RLD problem… ▽ More Risk limiting dispatch (RLD) has been proposed as an approach that effectively trades off economic costs with operational risks for power dispatch under uncertainty. However, how to solve the RLD problem with provably near-optimal performance still remains an open problem. This paper presents a learning-based solution to this challenge. We first design a data-driven formulation for the RLD problem, which aims to construct a decision rule that directly maps day-ahead observable information to cost-effective dispatch decisions for the future delivery interval. Unlike most existing works that follow a predict-then-optimize paradigm, this end-to-end rule bypasses the additional suboptimality introduced by separately handling prediction and optimization. We then propose neural RLD, a novel solution method to the data-driven formulation. This method leverages an L2-regularized neural network to learn the decision rule, thereby transforming the data-driven formulation into a neural network training task that can be efficiently completed by stochastic gradient descent. A theoretical performance guarantee is further established to bound the suboptimality of our method, which implies that its suboptimality approaches to zero with high probability as more samples are utilized. Simulation tests across various systems demonstrate our method's superior performance in convergence, suboptimality, and computational efficiency compared with benchmarks. △ Less

Submitted 1 February, 2024; originally announced February 2024.

Comments: 10 pages

arXiv:2401.12789 [pdf, other]

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

Authors: W. Ronny Huang, Cyril Allauzen, Tongzhou Chen, Kilol Gupta, Ke Hu, James Qin, Yu Zhang, Yongqiang Wang, Shuo-Yiin Chang, Tara N. Sainath

Abstract: In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average… ▽ More In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average relative WER improvement across all languages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our comprehensive ablation study analyzes key parameters such as LLM size, context length, vocabulary size, fusion methodology. For instance, we explore the impact of LLM size ranging from 128M to 340B parameters on ASR performance. This study provides valuable insights into the factors influencing the effectiveness of practical large-scale LM-fused speech recognition systems. △ Less

Submitted 23 January, 2024; originally announced January 2024.

Comments: ICASSP 2024

arXiv:2401.07206 [pdf, other]

Probabilistic Reduced-Dimensional Vector Autoregressive Modeling with Oblique Projections

Authors: Yanfang Mo, S. Joe Qin

Abstract: In this paper, we propose a probabilistic reduced-dimensional vector autoregressive (PredVAR) model to extract low-dimensional dynamics from high-dimensional noisy data. The model utilizes an oblique projection to partition the measurement space into a subspace that accommodates the reduced-dimensional dynamics and a complementary static subspace. An optimal oblique decomposition is derived for th… ▽ More In this paper, we propose a probabilistic reduced-dimensional vector autoregressive (PredVAR) model to extract low-dimensional dynamics from high-dimensional noisy data. The model utilizes an oblique projection to partition the measurement space into a subspace that accommodates the reduced-dimensional dynamics and a complementary static subspace. An optimal oblique decomposition is derived for the best predictability regarding prediction error covariance. Building on this, we develop an iterative PredVAR algorithm using maximum likelihood and the expectation-maximization (EM) framework. This algorithm alternately updates the estimates of the latent dynamics and optimal oblique projection, yielding dynamic latent variables with rank-ordered predictability and an explicit latent VAR model that is consistent with the outer projection model. The superior performance and efficiency of the proposed approach are demonstrated using data sets from a synthesized Lorenz system and an industrial process from Eastman Chemical. △ Less

Submitted 14 January, 2024; originally announced January 2024.

Comments: 16pages, 5 figures

arXiv:2312.06995 [pdf, other]

Transformer-based No-Reference Image Quality Assessment via Supervised Contrastive Learning

Authors: **song Shi, Pan Gao, Jie Qin

Abstract: Image Quality Assessment (IQA) has long been a research hotspot in the field of image processing, especially No-Reference Image Quality Assessment (NR-IQA). Due to the powerful feature extraction ability, existing Convolution Neural Network (CNN) and Transformers based NR-IQA methods have achieved considerable progress. However, they still exhibit limited capability when facing unknown authentic d… ▽ More Image Quality Assessment (IQA) has long been a research hotspot in the field of image processing, especially No-Reference Image Quality Assessment (NR-IQA). Due to the powerful feature extraction ability, existing Convolution Neural Network (CNN) and Transformers based NR-IQA methods have achieved considerable progress. However, they still exhibit limited capability when facing unknown authentic distortion datasets. To further improve NR-IQA performance, in this paper, a novel supervised contrastive learning (SCL) and Transformer-based NR-IQA model SaTQA is proposed. We first train a model on a large-scale synthetic dataset by SCL (no image subjective score is required) to extract degradation features of images with various distortion types and levels. To further extract distortion information from images, we propose a backbone network incorporating the Multi-Stream Block (MSB) by combining the CNN inductive bias and Transformer long-term dependence modeling capability. Finally, we propose the Patch Attention Block (PAB) to obtain the final distorted image quality score by fusing the degradation features learned from contrastive learning with the perceptual distortion information extracted by the backbone network. Experimental results on seven standard IQA datasets show that SaTQA outperforms the state-of-the-art methods for both synthetic and authentic datasets. Code is available at https://github.com/I2-Multimedia-Lab/SaTQA △ Less

Submitted 12 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI24

arXiv:2311.15216 [pdf, other]

Solve Large-scale Unit Commitment Problems by Physics-informed Graph Learning

Authors: **gtao Qin, Nanpeng Yu

Abstract: Unit commitment (UC) problems are typically formulated as mixed-integer programs (MIP) and solved by the branch-and-bound (B&B) scheme. The recent advances in graph neural networks (GNN) enable it to enhance the B&B algorithm in modern MIP solvers by learning to dive and branch. Existing GNN models that tackle MIP problems are mostly constructed from mathematical formulation, which is computationa… ▽ More Unit commitment (UC) problems are typically formulated as mixed-integer programs (MIP) and solved by the branch-and-bound (B&B) scheme. The recent advances in graph neural networks (GNN) enable it to enhance the B&B algorithm in modern MIP solvers by learning to dive and branch. Existing GNN models that tackle MIP problems are mostly constructed from mathematical formulation, which is computationally expensive when dealing with large-scale UC problems. In this paper, we propose a physics-informed hierarchical graph convolutional network (PI-GCN) for neural diving that leverages the underlying features of various components of power systems to find high-quality variable assignments. Furthermore, we adopt the MIP model-based graph convolutional network (MB-GCN) for neural branching to select the optimal variables for branching at each node of the B&B tree. Finally, we integrate neural diving and neural branching into a modern MIP solver to establish a novel neural MIP solver designed for large-scale UC problems. Numeral studies show that PI-GCN has better performance and scalability than the baseline MB-GCN on neural diving. Moreover, the neural MIP solver yields the lowest operational cost and outperforms a modern MIP solver for all testing days after combining it with our proposed neural diving model and the baseline neural branching model. △ Less

Submitted 26 November, 2023; originally announced November 2023.

arXiv:2310.07255 [pdf, other]

ADASR: An Adversarial Auto-Augmentation Framework for Hyperspectral and Multispectral Data Fusion

Authors: **ghui Qin, Lihuang Fang, Ruitao Lu, Liang Lin, Yukai Shi

Abstract: Deep learning-based hyperspectral image (HSI) super-resolution, which aims to generate high spatial resolution HSI (HR-HSI) by fusing hyperspectral image (HSI) and multispectral image (MSI) with deep neural networks (DNNs), has attracted lots of attention. However, neural networks require large amounts of training data, hindering their application in real-world scenarios. In this letter, we propos… ▽ More Deep learning-based hyperspectral image (HSI) super-resolution, which aims to generate high spatial resolution HSI (HR-HSI) by fusing hyperspectral image (HSI) and multispectral image (MSI) with deep neural networks (DNNs), has attracted lots of attention. However, neural networks require large amounts of training data, hindering their application in real-world scenarios. In this letter, we propose a novel adversarial automatic data augmentation framework ADASR that automatically optimizes and augments HSI-MSI sample pairs to enrich data diversity for HSI-MSI fusion. Our framework is sample-aware and optimizes an augmentor network and two downsampling networks jointly by adversarial learning so that we can learn more robust downsampling networks for training the upsampling network. Extensive experiments on two public classical hyperspectral datasets demonstrate the effectiveness of our ADASR compared to the state-of-the-art methods. △ Less

Submitted 11 October, 2023; originally announced October 2023.

Comments: This paper has been accepted by IEEE Geoscience and Remote Sensing Letters. Code is released at https://github.com/fangfang11-plog/ADASR

arXiv:2309.17136 [pdf, other]

Latent Dynamic Networked System Identification with High-Dimensional Networked Data

Authors: Jiaxin Yu, Yanfang Mo, S. Joe Qin

Abstract: Networked dynamic systems are ubiquitous in various domains, such as industrial processes, social networks, and biological systems. These systems produce high-dimensional data that reflect the complex interactions among the network nodes with rich sensor measurements. In this paper, we propose a novel algorithm for latent dynamic networked system identification that leverages the network structure… ▽ More Networked dynamic systems are ubiquitous in various domains, such as industrial processes, social networks, and biological systems. These systems produce high-dimensional data that reflect the complex interactions among the network nodes with rich sensor measurements. In this paper, we propose a novel algorithm for latent dynamic networked system identification that leverages the network structure and performs dimension reduction for each node via dynamic latent variables (DLVs). The algorithm assumes that the DLVs of each node have an auto-regressive model with exogenous input and interactions from other nodes. The DLVs of each node are extracted to capture the most predictable latent variables in the high dimensional data, while the residual factors are not predictable. The advantage of the proposed framework is demonstrated on an industrial process network for system identification and dynamic data analytics. △ Less

Submitted 29 September, 2023; originally announced September 2023.

arXiv:2309.12963 [pdf, ps, other]

Massive End-to-end Models for Short Search Queries

Authors: Weiran Wang, Rohit Prabhavalkar, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li, James Qin, Xingyu Cai, Adam Stooke, Zhong Meng, CJ Zheng, Yanzhang He, Tara Sainath, Pedro Moreno Mengibar

Abstract: In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to signifi… ▽ More In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time reduction strategy, and its generalization performance on long-form test sets. Despite the speculation that, as the model size increases, CTC can be as good as RNN-T which builds label dependency into the prediction, we observe that a 900M RNN-T clearly outperforms a 1.8B CTC and is more tolerant to severe time reduction, although the WER gap can be largely removed by LM shallow fusion. △ Less

Submitted 22 September, 2023; originally announced September 2023.

arXiv:2309.01161 [pdf, other]

Probabilistic Reduced-Dimensional Vector Autoregressive Modeling for Dynamics Prediction and Reconstruction with Oblique Projections

Authors: Yanfang Mo, Jiaxin Yu, S. Joe Qin

Abstract: In this paper, we propose a probabilistic reduced-dimensional vector autoregressive (PredVAR) model with oblique projections. This model partitions the measurement space into a dynamic subspace and a static subspace that do not need to be orthogonal. The partition allows us to apply an oblique projection to extract dynamic latent variables (DLVs) from high-dimensional data with maximized predictab… ▽ More In this paper, we propose a probabilistic reduced-dimensional vector autoregressive (PredVAR) model with oblique projections. This model partitions the measurement space into a dynamic subspace and a static subspace that do not need to be orthogonal. The partition allows us to apply an oblique projection to extract dynamic latent variables (DLVs) from high-dimensional data with maximized predictability. We develop an alternating iterative PredVAR algorithm that exploits the interaction between updating the latent VAR dynamics and estimating the oblique projection, using expectation maximization (EM) and a statistical constraint. In addition, the noise covariance matrices are estimated as a natural outcome of the EM method. A simulation case study of the nonlinear Lorenz oscillation system illustrates the advantages of the proposed approach over two alternatives. △ Less

Submitted 3 September, 2023; originally announced September 2023.

arXiv:2306.12925 [pdf, other]

AudioPaLM: A Large Language Model That Can Speak and Listen

Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats , et al. (5 additional authors not shown)

Abstract: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the… ▽ More We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples △ Less

Submitted 22 June, 2023; originally announced June 2023.

Comments: Technical report

arXiv:2306.08131 [pdf, other]

Efficient Adapters for Giant Speech Models

Authors: Nanxin Chen, Izhak Shafran, Yu Zhang, Chung-Cheng Chiu, Hagen Soltau, James Qin, Yonghui Wu

Abstract: Large pre-trained speech models are widely used as the de-facto paradigm, especially in scenarios when there is a limited amount of labeled data available. However, finetuning all parameters from the self-supervised learned model can be computationally expensive, and becomes infeasiable as the size of the model and the number of downstream tasks scales. In this paper, we propose a novel approach c… ▽ More Large pre-trained speech models are widely used as the de-facto paradigm, especially in scenarios when there is a limited amount of labeled data available. However, finetuning all parameters from the self-supervised learned model can be computationally expensive, and becomes infeasiable as the size of the model and the number of downstream tasks scales. In this paper, we propose a novel approach called Two Parallel Adapter (TPA) that is inserted into the conformer-based model pre-trained model instead. TPA is based on systematic studies of the residual adapter, a popular approach for finetuning a subset of parameters. We evaluate TPA on various public benchmarks and experiment results demonstrates its superior performance, which is close to the full finetuning on different datasets and speech tasks. These results show that TPA is an effective and efficient approach for serving large pre-trained speech models. Ablation studies show that TPA can also be pruned, especially for lower blocks. △ Less

Submitted 13 June, 2023; originally announced June 2023.

arXiv:2306.04730 [pdf, other]

Stochastic Natural Thresholding Algorithms

Authors: Rachel Grotheer, Shuang Li, Anna Ma, Deanna Needell, **g Qin

Abstract: Sparse signal recovery is one of the most fundamental problems in various applications, including medical imaging and remote sensing. Many greedy algorithms based on the family of hard thresholding operators have been developed to solve the sparse signal recovery problem. More recently, Natural Thresholding (NT) has been proposed with improved computational efficiency. This paper proposes and disc… ▽ More Sparse signal recovery is one of the most fundamental problems in various applications, including medical imaging and remote sensing. Many greedy algorithms based on the family of hard thresholding operators have been developed to solve the sparse signal recovery problem. More recently, Natural Thresholding (NT) has been proposed with improved computational efficiency. This paper proposes and discusses convergence guarantees for stochastic natural thresholding algorithms by extending the NT from the deterministic version with linear measurements to the stochastic version with a general objective function. We also conduct various numerical experiments on linear and nonlinear measurements to demonstrate the performance of StoNT. △ Less

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2304.05723 [pdf, ps, other]

Distributed Coverage Control of Constrained Constant-Speed Unicycle Multi-Agent Systems

Authors: Qingchen Liu, Zengjie Zhang, Nhan Khanh Le, Jiahu Qin, Fangzhou Liu, Sandra Hirche

Abstract: This paper proposes a novel distributed coverage controller for a multi-agent system with constant-speed unicycle robots (CSUR). The work is motivated by the limitation of the conventional method that does not ensure the satisfaction of hard state- and input-dependent constraints and leads to feasibility issues for multi-CSUR systems. In this paper, we solve these problems by designing a novel cov… ▽ More This paper proposes a novel distributed coverage controller for a multi-agent system with constant-speed unicycle robots (CSUR). The work is motivated by the limitation of the conventional method that does not ensure the satisfaction of hard state- and input-dependent constraints and leads to feasibility issues for multi-CSUR systems. In this paper, we solve these problems by designing a novel coverage cost function and a saturated gradient-search-based control law. Invariant set theory and Lyapunov-based techniques are used to prove the state-dependent confinement and the convergence of the system state to the optimal coverage configuration, respectively. The controller is implemented in a distributed manner based on a novel communication standard among the agents. A series of simulation case studies are conducted to validate the effectiveness of the proposed coverage controller in different initial conditions and with control parameters. A comparison study in simulation reveals the advantage of the proposed method in terms of avoiding infeasibility. The experiment study verifies the applicability of the method to real robots with uncertainties. The development procedure of the method from theoretical analysis to experimental validation provides a novel framework for multi-agent system coordinate control with complex agent dynamics. △ Less

Submitted 14 March, 2024; v1 submitted 12 April, 2023; originally announced April 2023.

arXiv:2303.09704 [pdf, other]

Mobile Energy Storage in Power Network: Marginal Value and Optimal Operation

Authors: Utkarsha Agwan, Junjie Qin, Kameshwar Poolla, Pravin Varaiya

Abstract: This paper examines the marginal value of mobile energy storage, i.e., energy storage units that can be efficiently relocated to other locations in the power network. In particular, we formulate and analyze the joint problem for operating the power grid and a fleet of mobile storage units. We use two different storage models: rapid storage, which disregards travel time and power constraints, and g… ▽ More This paper examines the marginal value of mobile energy storage, i.e., energy storage units that can be efficiently relocated to other locations in the power network. In particular, we formulate and analyze the joint problem for operating the power grid and a fleet of mobile storage units. We use two different storage models: rapid storage, which disregards travel time and power constraints, and general storage, which incorporates them. By explicitly connecting the marginal value of mobile storage to locational marginal prices (LMPs), we propose efficient algorithms that only use LMPs and transportation costs to optimize the relocation trajectories of the mobile storage units. Furthermore, we provide examples and conditions under which the marginal value of mobile storage is strictly higher, equal to, or strictly lower than the sum of marginal value of corresponding stationary storage units and wires. We also propose faster algorithms to approximate the optimal operation and relocation for our general mobile storage model, and illustrate our results with simulations of an passenger EV and a heavy-duty electric truck in the PJM territory. △ Less

Submitted 16 March, 2023; originally announced March 2023.

Comments: 14 pages, submitted to IEEE Transactions on Smart Grids

arXiv:2303.08113 [pdf, other]

Learning Homeomorphic Image Registration via Conformal-Invariant Hyperelastic Regularisation

Authors: **g Zou, Noémie Debroux, Lihao Liu, **g Qin, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero

Abstract: Deformable image registration is a fundamental task in medical image analysis and plays a crucial role in a wide range of clinical applications. Recently, deep learning-based approaches have been widely studied for deformable medical image registration and achieved promising results. However, existing deep learning image registration techniques do not theoretically guarantee topology-preserving tr… ▽ More Deformable image registration is a fundamental task in medical image analysis and plays a crucial role in a wide range of clinical applications. Recently, deep learning-based approaches have been widely studied for deformable medical image registration and achieved promising results. However, existing deep learning image registration techniques do not theoretically guarantee topology-preserving transformations. This is a key property to preserve anatomical structures and achieve plausible transformations that can be used in real clinical settings. We propose a novel framework for deformable image registration. Firstly, we introduce a novel regulariser based on conformal-invariant properties in a nonlinear elasticity setting. Our regulariser enforces the deformation field to be smooth, invertible and orientation-preserving. More importantly, we strictly guarantee topology preservation yielding to a clinical meaningful registration. Secondly, we boost the performance of our regulariser through coordinate MLPs, where one can view the to-be-registered images as continuously differentiable entities. We demonstrate, through numerical and visual experiments, that our framework is able to outperform current techniques for image registration. △ Less

Submitted 30 June, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

Comments: 13 pages, 3 figures

arXiv:2303.01037 [pdf, other]

Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Authors: Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk , et al. (2 additional authors not shown)

Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quant… ▽ More We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages. △ Less

Submitted 24 September, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

Comments: 20 pages, 7 figures, 8 tables

arXiv:2303.00155 [pdf, other]

Exponential Consensus of Multiple Agents over Dynamic Network Topology: Controllability, Connectivity, and Compactness

Authors: Qichao Ma, Jiahu Qin, Brian D. O. Anderson, Long Wang

Abstract: This paper investigates the problem of securing exponentially fast consensus (exponential consensus for short) for identical agents with finite-dimensional linear system dynamics over dynamic network topology. Our aim is to find the weakest possible conditions that guarantee exponentially fast consensus using a Lyapunov function consisting of a sum of terms of the same functional form. We first in… ▽ More This paper investigates the problem of securing exponentially fast consensus (exponential consensus for short) for identical agents with finite-dimensional linear system dynamics over dynamic network topology. Our aim is to find the weakest possible conditions that guarantee exponentially fast consensus using a Lyapunov function consisting of a sum of terms of the same functional form. We first investigate necessary conditions, starting by examining the system (both agent and network) parameters. It is found that controllability of the linear agents is necessary for reaching consensus. Then, to work out necessary conditions incorporating the network topology, we construct a set of Laplacian matrix-valued functions. The precompactness of this set of functions is shown to be a significant generalization of existing assumptions on network topology, including the common assumption that the edge weights are bounded piecewise constant functions or continuous functions. With the aid of such a precompactness assumption and restricting the Lyapunov function to one consisting of a sum of terms of the same functional form, we prove that a joint $(δ, T)$-connectivity condition on the network topology is necessary for exponential consensus. Finally, we investigate how the above two ``necessities'' work together to guarantee exponential consensus. To partially address this problem, we define a synchronization index to characterize the interplay between agent parameters and network topology. Based on this notion, it is shown that by designing a proper feedback matrix and under the precompactness assumption, exponential consensus can be reached globally and uniformly if the joint $(δ,T)$-connectivity and controllability conditions are satisfied, and the synchronization index is not less than one. △ Less

Submitted 28 February, 2023; originally announced March 2023.

arXiv:2212.02198 [pdf, other]

Rethinking Generative Methods for Image Restoration in Physics-based Vision: A Theoretical Analysis from the Perspective of Information

Authors: Xudong Kang, Haoran Xie, Man-Leung Wong, **g Qin

Abstract: End-to-end generative methods are considered a more promising solution for image restoration in physics-based vision compared with the traditional deconstructive methods based on handcrafted composition models. However, existing generative methods still have plenty of room for improvement in quantitative performance. More crucially, these methods are considered black boxes due to weak interpretabi… ▽ More End-to-end generative methods are considered a more promising solution for image restoration in physics-based vision compared with the traditional deconstructive methods based on handcrafted composition models. However, existing generative methods still have plenty of room for improvement in quantitative performance. More crucially, these methods are considered black boxes due to weak interpretability and there is rarely a theory trying to explain their mechanism and learning process. In this study, we try to re-interpret these generative methods for image restoration tasks using information theory. Different from conventional understanding, we analyzed the information flow of these methods and identified three sources of information (extracted high-level information, retained low-level information, and external information that is absent from the source inputs) are involved and optimized respectively in generating the restoration results. We further derived their learning behaviors, optimization objectives, and the corresponding information boundaries by extending the information bottleneck principle. Based on this theoretic framework, we found that many existing generative methods tend to be direct applications of the general models designed for conventional generation tasks, which may suffer from problems including over-invested abstraction processes, inherent details loss, and vanishing gradients or imbalance in training. We analyzed these issues with both intuitive and theoretical explanations and proved them with empirical evidence respectively. Ultimately, we proposed general solutions or ideas to address the above issue and validated these approaches with performance boosts on six datasets of three different image restoration tasks. △ Less

Submitted 8 December, 2022; v1 submitted 5 December, 2022; originally announced December 2022.

arXiv:2209.06461 [pdf, other]

Electric Vehicle Battery Sharing Game for Mobile Energy Storage Provision in Power Networks

Authors: Utkarsha Agwan, Junjie Qin, Kameshwar Poolla, Pravin Varaiya

Abstract: Electric vehicles (EVs) equipped with a bidirectional charger can provide valuable grid services as mobile energy storage. However, proper financial incentives need to be in place to enlist EV drivers to provide services to the grid. In this paper, we consider two types of EV drivers who may be willing to provide mobile storage service using their EVs: commuters taking a fixed route, and on-demand… ▽ More Electric vehicles (EVs) equipped with a bidirectional charger can provide valuable grid services as mobile energy storage. However, proper financial incentives need to be in place to enlist EV drivers to provide services to the grid. In this paper, we consider two types of EV drivers who may be willing to provide mobile storage service using their EVs: commuters taking a fixed route, and on-demand EV drivers who receive incentives from a transportation network company (TNC) and are willing to take any route. We model the behavior of each type of driver using game theoretic methods, and characterize the Nash equilibrium (NE) of an EV battery sharing game where each EV driver withdraws power from the grid to charge its EV battery at the origin of a route, travels from the origin to the destination, and then discharges power back to the grid at the destination of the route. The driver earns a payoff that depends on the participation of other drivers and power network conditions. We characterize the NE in three situations: when there are only commuters, when there are only on-demand TNC drivers, and when the two groups of drivers co-exist. In particular, we show that the equilibrium outcome supports the social welfare in each of these three cases. △ Less

Submitted 16 December, 2022; v1 submitted 14 September, 2022; originally announced September 2022.

Comments: 10 pages, IEEE CDC 2022

arXiv:2209.04789 [pdf, other]

Optimal Ordering Policies for Multi-Echelon Supply Networks

Authors: Jose I. Caiza, Ian Walter, Jitesh H. Panchal, Junjie Qin, Philip E. Pare

Abstract: In this paper, we formulate an optimal ordering policy as a stochastic control problem where each firm decides the amount of input goods to order from their upstream suppliers based on the current inventory level of its output good. For this purpose, we provide a closed-form solution for the optimal request of the raw materials for given a fixed production policy. We implement the proposed policy… ▽ More In this paper, we formulate an optimal ordering policy as a stochastic control problem where each firm decides the amount of input goods to order from their upstream suppliers based on the current inventory level of its output good. For this purpose, we provide a closed-form solution for the optimal request of the raw materials for given a fixed production policy. We implement the proposed policy on a 15-firm acyclic network based on a real product supply chain. We first simulate ideal demand situations, and then we implement demand-side shocks (i.e., demand levels outside of those considered in the policy formulation) and supply-side shocks (i.e., halts in production for some suppliers) to evaluate the robustness of the proposed policies. △ Less

Submitted 11 September, 2022; originally announced September 2022.

arXiv:2207.00769 [pdf, other]

Test-time Adaptation with Calibration of Medical Image Classification Nets for Label Distribution Shift

Authors: Wenao Ma, Cheng Chen, Shuang Zheng, **g Qin, Huimao Zhang, Qi Dou

Abstract: Class distribution plays an important role in learning deep classifiers. When the proportion of each class in the test set differs from the training set, the performance of classification nets usually degrades. Such a label distribution shift problem is common in medical diagnosis since the prevalence of disease vary over location and time. In this paper, we propose the first method to tackle labe… ▽ More Class distribution plays an important role in learning deep classifiers. When the proportion of each class in the test set differs from the training set, the performance of classification nets usually degrades. Such a label distribution shift problem is common in medical diagnosis since the prevalence of disease vary over location and time. In this paper, we propose the first method to tackle label shift for medical image classification, which effectively adapt the model learned from a single training label distribution to arbitrary unknown test label distribution. Our approach innovates distribution calibration to learn multiple representative classifiers, which are capable of handling different one-dominating-class distributions. When given a test image, the diverse classifiers are dynamically aggregated via the consistency-driven test-time adaptation, to deal with the unknown test label distribution. We validate our method on two important medical image classification tasks including liver fibrosis staging and COVID-19 severity prediction. Our experiments clearly show the decreased model performance under label shift. With our method, model performance significantly improves on all the test datasets with different label shifts for both medical image diagnosis tasks. △ Less

Submitted 9 July, 2022; v1 submitted 2 July, 2022; originally announced July 2022.

Comments: This paper has been accepted by MICCAI 2022

arXiv:2207.00141 [pdf, other]

A New Dataset and A Baseline Model for Breast Lesion Detection in Ultrasound Videos

Authors: Zhi Lin, Junhao Lin, Lei Zhu, Huazhu Fu, **g Qin, Liansheng Wang

Abstract: Breast lesion detection in ultrasound is critical for breast cancer diagnosis. Existing methods mainly rely on individual 2D ultrasound images or combine unlabeled video and labeled 2D images to train models for breast lesion detection. In this paper, we first collect and annotate an ultrasound video dataset (188 videos) for breast lesion detection. Moreover, we propose a clip-level and video-leve… ▽ More Breast lesion detection in ultrasound is critical for breast cancer diagnosis. Existing methods mainly rely on individual 2D ultrasound images or combine unlabeled video and labeled 2D images to train models for breast lesion detection. In this paper, we first collect and annotate an ultrasound video dataset (188 videos) for breast lesion detection. Moreover, we propose a clip-level and video-level feature aggregated network (CVA-Net) for addressing breast lesion detection in ultrasound videos by aggregating video-level lesion classification features and clip-level temporal features. The clip-level temporal features encode local temporal information of ordered video frames and global temporal information of shuffled video frames. In our CVA-Net, an inter-video fusion module is devised to fuse local features from original video frames and global features from shuffled video frames, and an intra-video fusion module is devised to learn the temporal information among adjacent video frames. Moreover, we learn video-level features to classify the breast lesions of the original video as benign or malignant lesions to further enhance the final breast lesion detection performance in ultrasound videos. Experimental results on our annotated dataset demonstrate that our CVA-Net clearly outperforms state-of-the-art methods. The corresponding code and dataset are publicly available at \url{https://github.com/jhl-Det/CVA-Net}. △ Less

Submitted 30 June, 2022; originally announced July 2022.

Comments: 11 pages, 4 figures

Report number: Report-no: MICCAI-1016

Journal ref: Medical Image Computing and Computer Assisted Interventions 2022

arXiv:2206.04249 [pdf, other]

An Optimization Method-Assisted Ensemble Deep Reinforcement Learning Algorithm to Solve Unit Commitment Problems

Authors: **gtao Qin, Yuanqi Gao, Mikhail Bragin, Nanpeng Yu

Abstract: Unit commitment (UC) is a fundamental problem in the day-ahead electricity market, and it is critical to solve UC problems efficiently. Mathematical optimization techniques like dynamic programming, Lagrangian relaxation, and mixed-integer quadratic programming (MIQP) are commonly adopted for UC problems. However, the calculation time of these methods increases at an exponential rate with the amou… ▽ More Unit commitment (UC) is a fundamental problem in the day-ahead electricity market, and it is critical to solve UC problems efficiently. Mathematical optimization techniques like dynamic programming, Lagrangian relaxation, and mixed-integer quadratic programming (MIQP) are commonly adopted for UC problems. However, the calculation time of these methods increases at an exponential rate with the amount of generators and energy resources, which is still the main bottleneck in industry. Recent advances in artificial intelligence have demonstrated the capability of reinforcement learning (RL) to solve UC problems. Unfortunately, the existing research on solving UC problems with RL suffers from the curse of dimensionality when the size of UC problems grows. To deal with these problems, we propose an optimization method-assisted ensemble deep reinforcement learning algorithm, where UC problems are formulated as a Markov Decision Process (MDP) and solved by multi-step deep Q-learning in an ensemble framework. The proposed algorithm establishes a candidate action set by solving tailored optimization problems to ensure a relatively high performance and the satisfaction of operational constraints. Numerical studies on IEEE 118 and 300-bus systems show that our algorithm outperforms the baseline RL algorithm and MIQP. Furthermore, the proposed algorithm shows strong generalization capacity under unforeseen operational conditions. △ Less

Submitted 8 June, 2022; originally announced June 2022.

arXiv:2206.02609 [pdf, other]

Real-World Image Super-Resolution by Exclusionary Dual-Learning

Authors: Hao Li, **ghui Qin, Zhi**g Yang, Pengxu Wei, **shan Pan, Liang Lin, Yukai Shi

Abstract: Real-world image super-resolution is a practical image restoration problem that aims to obtain high-quality images from in-the-wild input, has recently received considerable attention with regard to its tremendous application potentials. Although deep learning-based methods have achieved promising restoration quality on real-world image super-resolution datasets, they ignore the relationship betwe… ▽ More Real-world image super-resolution is a practical image restoration problem that aims to obtain high-quality images from in-the-wild input, has recently received considerable attention with regard to its tremendous application potentials. Although deep learning-based methods have achieved promising restoration quality on real-world image super-resolution datasets, they ignore the relationship between L1- and perceptual- minimization and roughly adopt auxiliary large-scale datasets for pre-training. In this paper, we discuss the image types within a corrupted image and the property of perceptual- and Euclidean- based evaluation protocols. Then we propose a method, Real-World image Super-Resolution by Exclusionary Dual-Learning (RWSR-EDL) to address the feature diversity in perceptual- and L1- based cooperative learning. Moreover, a noise-guidance data collection strategy is developed to address the training time consumption in multiple datasets optimization. When an auxiliary dataset is incorporated, RWSR-EDL achieves promising results and repulses any training time increment by adopting the noise-guidance data collection strategy. Extensive experiments show that RWSR-EDL achieves competitive performance over state-of-the-art methods on four in-the-wild image super-resolution datasets. △ Less

Submitted 6 June, 2022; originally announced June 2022.

Comments: IEEE TMM 2022; Considering large volume of RealSR datasets, a multi-dataset sampling scheme is developed

arXiv:2204.02663 [pdf, other]

Towards An End-to-End Framework for Flow-Guided Video Inpainting

Authors: Zhen Li, Cheng-Ze Lu, Jianhua Qin, Chun-Le Guo, Ming-Ming Cheng

Abstract: Optical flow, which captures motion information across frames, is exploited in recent video inpainting methods through propagating pixels along its trajectories. However, the hand-crafted flow-based processes in these methods are applied separately to form the whole inpainting pipeline. Thus, these methods are less efficient and rely heavily on the intermediate results from earlier stages. In this… ▽ More Optical flow, which captures motion information across frames, is exploited in recent video inpainting methods through propagating pixels along its trajectories. However, the hand-crafted flow-based processes in these methods are applied separately to form the whole inpainting pipeline. Thus, these methods are less efficient and rely heavily on the intermediate results from earlier stages. In this paper, we propose an End-to-End framework for Flow-Guided Video Inpainting (E$^2$FGVI) through elaborately designed three trainable modules, namely, flow completion, feature propagation, and content hallucination modules. The three modules correspond with the three stages of previous flow-based methods but can be jointly optimized, leading to a more efficient and effective inpainting process. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods both qualitatively and quantitatively and shows promising efficiency. The code is available at https://github.com/MCG-NKU/E2FGVI. △ Less

Submitted 7 April, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

Comments: Accepted to CVPR 2022

arXiv:2204.01954 [pdf, other]

doi 10.1016/j.jsv.2022.117421

Application of a Spectral Method to Simulate Quasi-Three-Dimensional Underwater Acoustic Fields

Authors: Houwang Tu, Yongxian Wang, Wei Liu, Chunmei Yang, Jixing Qin, Shuqing Ma, Xiaodong Wang

Abstract: The calculation of a three-dimensional underwater acoustic field has always been a key problem in computational ocean acoustics. Traditionally, this solution is usually obtained by directly solving the acoustic Helmholtz equation using a finite difference or finite element algorithm. Solving the three-dimensional Helmholtz equation directly is computationally expensive. For quasi-three-dimensional… ▽ More The calculation of a three-dimensional underwater acoustic field has always been a key problem in computational ocean acoustics. Traditionally, this solution is usually obtained by directly solving the acoustic Helmholtz equation using a finite difference or finite element algorithm. Solving the three-dimensional Helmholtz equation directly is computationally expensive. For quasi-three-dimensional problems, the Helmholtz equation can be processed by the integral transformation approach, which can greatly reduce the computational cost. In this paper, a numerical algorithm for a quasi-three-dimensional sound field that combines an integral transformation technique, stepwise coupled modes and a spectral method is designed. The quasi-three-dimensional problem is transformed into a two-dimensional problem using an integral transformation strategy. A stepwise approximation is then used to discretize the range dependence of the two-dimensional problem; this approximation is essentially a physical discretization that further reduces the range-dependent two-dimensional problem to a one-dimensional problem. Finally, the Chebyshev--Tau spectral method is employed to accurately solve the one-dimensional problem. We provide the corresponding numerical program SPEC3D for the proposed algorithm and describe several representative numerical examples. In the numerical experiments, the consistency between SPEC3D and the analytical solution/high-precision finite difference program COACH verifies the reliability and capability of the proposed algorithm. A comparison of running times illustrates that the algorithm proposed in this paper is significantly faster than the full three-dimensional algorithm in terms of computational speed. △ Less

Submitted 10 November, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

Comments: 31 pages, 22 figures. arXiv admin note: text overlap with arXiv:2112.13602

arXiv:2203.13963 [pdf, other]

Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution

Authors: Guangyuan Li, Jun Lv, Yapeng Tian, Qi Dou, Chengyan Wang, Chenliang Xu, **g Qin

Abstract: Magnetic resonance imaging (MRI) can present multi-contrast images of the same anatomical structures, enabling multi-contrast super-resolution (SR) techniques. Compared with SR reconstruction using a single-contrast, multi-contrast SR reconstruction is promising to yield SR images with higher quality by leveraging diverse yet complementary information embedded in different imaging modalities. Howe… ▽ More Magnetic resonance imaging (MRI) can present multi-contrast images of the same anatomical structures, enabling multi-contrast super-resolution (SR) techniques. Compared with SR reconstruction using a single-contrast, multi-contrast SR reconstruction is promising to yield SR images with higher quality by leveraging diverse yet complementary information embedded in different imaging modalities. However, existing methods still have two shortcomings: (1) they neglect that the multi-contrast features at different scales contain different anatomical details and hence lack effective mechanisms to match and fuse these features for better reconstruction; and (2) they are still deficient in capturing long-range dependencies, which are essential for the regions with complicated anatomical structures. We propose a novel network to comprehensively address these problems by develo** a set of innovative Transformer-empowered multi-scale contextual matching and aggregation techniques; we call it McMRSR. Firstly, we tame transformers to model long-range dependencies in both reference and target images. Then, a new multi-scale contextual matching method is proposed to capture corresponding contexts from reference features at different scales. Furthermore, we introduce a multi-scale aggregation mechanism to gradually and interactively aggregate multi-scale matched features for reconstructing the target SR MR image. Extensive experiments demonstrate that our network outperforms state-of-the-art approaches and has great potential to be applied in clinical practice. Codes are available at https://github.com/XAIMI-Lab/McMRSR. △ Less

Submitted 25 March, 2022; originally announced March 2022.

Comments: CVPR 2022 accepted

arXiv:2202.01855 [pdf, other]

Self-supervised Learning with Random-projection Quantizer for Speech Recognition

Authors: Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu

Abstract: We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular the quantizer projects speech inputs with a randomly initialized matrix, and does a nearest-neighbor lookup in a randomly-initialized codebook. Neither… ▽ More We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular the quantizer projects speech inputs with a randomly initialized matrix, and does a nearest-neighbor lookup in a randomly-initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separated from the speech recognition model, the design makes the approach flexible and is compatible with universal speech recognition architecture. On LibriSpeech our approach achieves similar word-error-rates as previous work using self-supervised learning with non-streaming models, and provides lower word-error-rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT. △ Less

Submitted 29 June, 2022; v1 submitted 3 February, 2022; originally announced February 2022.

Comments: ICML 2022

arXiv:2112.02743 [pdf, other]

Separated Contrastive Learning for Organ-at-Risk and Gross-Tumor-Volume Segmentation with Limited Annotation

Authors: Jiacheng Wang, Xiaomeng Li, Yiming Han, **g Qin, Liansheng Wang, Zhou Qichao

Abstract: Automatic delineation of organ-at-risk (OAR) and gross-tumor-volume (GTV) is of great significance for radiotherapy planning. However, it is a challenging task to learn powerful representations for accurate delineation under limited pixel (voxel)-wise annotations. Contrastive learning at pixel-level can alleviate the dependency on annotations by learning dense representations from unlabeled data.… ▽ More Automatic delineation of organ-at-risk (OAR) and gross-tumor-volume (GTV) is of great significance for radiotherapy planning. However, it is a challenging task to learn powerful representations for accurate delineation under limited pixel (voxel)-wise annotations. Contrastive learning at pixel-level can alleviate the dependency on annotations by learning dense representations from unlabeled data. Recent studies in this direction design various contrastive losses on the feature maps, to yield discriminative features for each pixel in the map. However, pixels in the same map inevitably share semantics to be closer than they actually are, which may affect the discrimination of pixels in the same map and lead to the unfair comparison to pixels in other maps. To address these issues, we propose a separated region-level contrastive learning scheme, namely SepaReg, the core of which is to separate each image into regions and encode each region separately. Specifically, SepaReg comprises two components: a structure-aware image separation (SIS) module and an intra- and inter-organ distillation (IID) module. The SIS is proposed to operate on the image set to rebuild a region set under the guidance of structural information. The inter-organ representation will be learned from this set via typical contrastive losses cross regions. On the other hand, the IID is proposed to tackle the quantity imbalance in the region set as tiny organs may produce fewer regions, by exploiting intra-organ representations. We conducted extensive experiments to evaluate the proposed model on a public dataset and two private datasets. The experimental results demonstrate the effectiveness of the proposed model, consistently achieving better performance than state-of-the-art approaches. Code is available at https://github.com/jcwang123/Separate_CL. △ Less

Submitted 20 April, 2022; v1 submitted 5 December, 2021; originally announced December 2021.

Comments: Accepted in AAAI-22 (Oral)

arXiv:2111.04733 [pdf, other]

Real-time landmark detection for precise endoscopic submucosal dissection via shape-aware relation network

Authors: Jiacheng Wang, Yueming **, Shuntian Cai, Hongzhi Xu, Pheng-Ann Heng, **g Qin, Liansheng Wang

Abstract: We propose a novel shape-aware relation network for accurate and real-time landmark detection in endoscopic submucosal dissection (ESD) surgery. This task is of great clinical significance but extremely challenging due to bleeding, lighting reflection, and motion blur in the complicated surgical environment. Compared with existing solutions, which either neglect geometric relationships among targe… ▽ More We propose a novel shape-aware relation network for accurate and real-time landmark detection in endoscopic submucosal dissection (ESD) surgery. This task is of great clinical significance but extremely challenging due to bleeding, lighting reflection, and motion blur in the complicated surgical environment. Compared with existing solutions, which either neglect geometric relationships among targeting objects or capture the relationships by using complicated aggregation schemes, the proposed network is capable of achieving satisfactory accuracy while maintaining real-time performance by taking full advantage of the spatial relations among landmarks. We first devise an algorithm to automatically generate relation keypoint heatmaps, which are able to intuitively represent the prior knowledge of spatial relations among landmarks without using any extra manual annotation efforts. We then develop two complementary regularization schemes to progressively incorporate the prior knowledge into the training process. While one scheme introduces pixel-level regularization by multi-task learning, the other integrates global-level regularization by harnessing a newly designed grouped consistency evaluator, which adds relation constraints to the proposed network in an adversarial manner. Both schemes are beneficial to the model in training, and can be readily unloaded in inference to achieve real-time detection. We establish a large in-house dataset of ESD surgery for esophageal cancer to validate the effectiveness of our proposed method. Extensive experimental results demonstrate that our approach outperforms state-of-the-art methods in terms of accuracy and efficiency, achieving better detection results faster. Promising results on two downstream applications further corroborate the great potential of our method in ESD clinical practice. △ Less

Submitted 8 November, 2021; originally announced November 2021.

arXiv:2110.03864 [pdf, other]

doi 10.1007/978-3-030-87193-2_20

Boundary-aware Transformers for Skin Lesion Segmentation

Authors: Jiacheng Wang, Lan Wei, Liansheng Wang, Qichao Zhou, Lei Zhu, **g Qin

Abstract: Skin lesion segmentation from dermoscopy images is of great importance for improving the quantitative analysis of skin cancer. However, the automatic segmentation of melanoma is a very challenging task owing to the large variation of melanoma and ambiguous boundaries of lesion areas. While convolutional neutral networks (CNNs) have achieved remarkable progress in this task, most of existing soluti… ▽ More Skin lesion segmentation from dermoscopy images is of great importance for improving the quantitative analysis of skin cancer. However, the automatic segmentation of melanoma is a very challenging task owing to the large variation of melanoma and ambiguous boundaries of lesion areas. While convolutional neutral networks (CNNs) have achieved remarkable progress in this task, most of existing solutions are still incapable of effectively capturing global dependencies to counteract the inductive bias caused by limited receptive fields. Recently, transformers have been proposed as a promising tool for global context modeling by employing a powerful global attention mechanism, but one of their main shortcomings when applied to segmentation tasks is that they cannot effectively extract sufficient local details to tackle ambiguous boundaries. We propose a novel boundary-aware transformer (BAT) to comprehensively address the challenges of automatic skin lesion segmentation. Specifically, we integrate a new boundary-wise attention gate (BAG) into transformers to enable the whole network to not only effectively model global long-range dependencies via transformers but also, simultaneously, capture more local details by making full use of boundary-wise prior knowledge. Particularly, the auxiliary supervision of BAG is capable of assisting transformers to learn position embedding as it provides much spatial information. We conducted extensive experiments to evaluate the proposed BAT and experiments corroborate its effectiveness, consistently outperforming state-of-the-art methods in two famous datasets. △ Less

Submitted 7 October, 2021; originally announced October 2021.

Journal ref: Medical Image Computing and Computer Assisted Intervention 2021

arXiv:2109.13593 [pdf, other]

doi 10.1007/978-3-030-87202-1_33

Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video

Authors: Jiacheng Wang, Yueming **, Liansheng Wang, Shuntian Cai, Pheng-Ann Heng, **g Qin

Abstract: Performing a real-time and accurate instrument segmentation from videos is of great significance for improving the performance of robotic-assisted surgery. We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration. However, most existing works perform segmentation purely using… ▽ More Performing a real-time and accurate instrument segmentation from videos is of great significance for improving the performance of robotic-assisted surgery. We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration. However, most existing works perform segmentation purely using visual cues in a single frame. Optical flow is just used to model the motion between only two frames and brings heavy computational cost. We propose a novel dual-memory network (DMNet) to wisely relate both global and local spatio-temporal knowledge to augment the current features, boosting the segmentation performance and retaining the real-time prediction capability. We propose, on the one hand, an efficient local memory by taking the complementary advantages of convolutional LSTM and non-local mechanisms towards the relating reception field. On the other hand, we develop an active global memory to gather the global semantic correlation in long temporal range to current one, in which we gather the most informative frames derived from model uncertainty and frame similarity. We have extensively validated our method on two public benchmark surgical video datasets. Experimental results demonstrate that our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed. △ Less

Submitted 28 September, 2021; originally announced September 2021.

Journal ref: Medical Image Computing and Computer Assisted Intervention, 2021

arXiv:2109.13226 [pdf, other]

doi 10.1109/JSTSP.2022.3182537

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yan** Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang , et al. (1 additional authors not shown)

Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled da… ▽ More We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitudes of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks. △ Less

Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

arXiv:2108.06209 [pdf, other]

W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training

Authors: Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu

Abstract: Motivated by the success of masked language modeling~(MLM) in pre-training natural language processing models, we propose w2v-BERT that explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize input continuous speech signals into a finite set of discriminative speech tokens,… ▽ More Motivated by the success of masked language modeling~(MLM) in pre-training natural language processing models, we propose w2v-BERT that explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize input continuous speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations via solving a masked prediction task consuming the discretized tokens. In contrast to existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized in an end-to-end fashion by solving the two self-supervised tasks~(the contrastive task and MLM) simultaneously. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light~60k corpus as the unsupervised data. In particular, when compared to published models such as conformer-based wav2vec~2.0 and HuBERT, our model shows~5\% to~10\% relative WER reduction on the test-clean and test-other subsets. When applied to the Google's Voice Search traffic dataset, w2v-BERT outperforms our internal conformer-based wav2vec~2.0 by more than~30\% relatively. △ Less

Submitted 13 September, 2021; v1 submitted 7 August, 2021; originally announced August 2021.

arXiv:2105.04728 [pdf, other]

doi 10.1145/3447555.3464857

Optimal Online Algorithms for Peak-Demand Reduction Maximization with Energy Storage

Authors: Yanfang Mo, Qiulin Lin, Minghua Chen, Si-Zhao Joe Qin

Abstract: The high proportions of demand charges in electric bills motivate large-power customers to leverage energy storage for reducing the peak procurement from the outer grid. Given limited energy storage, we expect to maximize the peak-demand reduction in an online fashion, challenged by the highly uncertain demands and renewable injections, the non-cumulative nature of peak consumption, and the coupli… ▽ More The high proportions of demand charges in electric bills motivate large-power customers to leverage energy storage for reducing the peak procurement from the outer grid. Given limited energy storage, we expect to maximize the peak-demand reduction in an online fashion, challenged by the highly uncertain demands and renewable injections, the non-cumulative nature of peak consumption, and the coupling of online decisions. In this paper, we propose an optimal online algorithm that achieves the best competitive ratio, following the idea of maintaining a constant ratio between the online and the optimal offline peak-reduction performance. We further show that the optimal competitive ratio can be computed by solving a linear number of linear-fractional programs. Moreover, we extend the algorithm to adaptively maintain the best competitive ratio given the revealed inputs and actions at each decision-making round. The adaptive algorithm retains the optimal worst-case guarantee and attains improved average-case performance. We evaluate our proposed algorithms using real-world traces and show that they obtain up to 81% peak reduction of the optimal offline benchmark. Additionally, the adaptive algorithm achieves at least 20% more peak reduction against baseline alternatives. △ Less

Submitted 10 May, 2021; originally announced May 2021.

Comments: 12 pages, 6 figures

arXiv:2104.14830 [pdf, other]

Scaling End-to-End Models for Large-Scale Multilingual ASR

Authors: Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, Junwen Bai

Abstract: Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity.… ▽ More Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity. We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.6K to 53.5K hours. We adopt GShard [1] to efficiently scale up to 10B parameters. Empirically, we find that (1) scaling the number of model parameters is an effective way to solve the capacity bottleneck - our 500M-param model already outperforms monolingual baselines and scaling it to 1B and 10B brought further quality gains; (2) larger models are not only more data efficient, but also more efficient in terms of training cost as measured in TPU days - the 1B-param model reaches the same accuracy at 34% of training time as the 500M-param model; (3) given a fixed capacity budget, adding depth works better than width and large encoders do better than large decoders; (4) with continuous training, they can be adapted to new languages and domains. △ Less

Submitted 11 September, 2021; v1 submitted 30 April, 2021; originally announced April 2021.

Comments: ASRU 2021

arXiv:2103.09407 [pdf, ps, other]

Model-Free Design of Stochastic LQR Controller from Reinforcement Learning and Primal-Dual Optimization Perspective

Authors: Man Li, Jiahu Qin, Wei Xing Zheng, Yaonan Wang, Yu Kang

Abstract: To further understand the underlying mechanism of various reinforcement learning (RL) algorithms and also to better use the optimization theory to make further progress in RL, many researchers begin to revisit the linear-quadratic regulator (LQR) problem, whose setting is simple and yet captures the characteristics of RL. Inspired by this, this work is concerned with the model-free design of stoch… ▽ More To further understand the underlying mechanism of various reinforcement learning (RL) algorithms and also to better use the optimization theory to make further progress in RL, many researchers begin to revisit the linear-quadratic regulator (LQR) problem, whose setting is simple and yet captures the characteristics of RL. Inspired by this, this work is concerned with the model-free design of stochastic LQR controller for linear systems subject to Gaussian noises, from the perspective of both RL and primal-dual optimization. From the RL perspective, we first develop a new model-free off-policy policy iteration (MF-OPPI) algorithm, in which the sampled data is repeatedly used for updating the policy to alleviate the data-hungry problem to some extent. We then provide a rigorous analysis for algorithm convergence by showing that the involved iterations are equivalent to the iterations in the classical policy iteration (PI) algorithm. From the perspective of optimization, we first reformulate the stochastic LQR problem at hand as a constrained non-convex optimization problem, which is shown to have strong duality. Then, to solve this non-convex optimization problem, we propose a model-based primal-dual (MB-PD) algorithm based on the properties of the resulting Karush-Kuhn-Tucker (KKT) conditions. We also give a model-free implementation for the MB-PD algorithm by solving a transformed dual feasibility condition. More importantly, we show that the dual and primal update steps in the MB-PD algorithm can be interpreted as the policy evaluation and policy improvement steps in the PI algorithm, respectively. Finally, we provide one simulation example to show the performance of the proposed algorithms. △ Less

Submitted 16 March, 2021; originally announced March 2021.

arXiv:2011.11373 [pdf, other]

Optimal Power Control for DoS Attack over Fading Channel: A Game-Theoretic Approach

Authors: Jie Wang, Jiahu Qin, Menglin Li, Yang Shi

Abstract: In this paper, we investigate remote state estimation against an intelligent denial-of-service (DoS) attack over a vulnerable wireless network whose channel undergoes attenuation and distortion caused by fading. We use the sensor to observe system states and transmit its local state estimates to the remote center. Meanwhile, the attacker injects a jamming signal to destroy the packet accepted by t… ▽ More In this paper, we investigate remote state estimation against an intelligent denial-of-service (DoS) attack over a vulnerable wireless network whose channel undergoes attenuation and distortion caused by fading. We use the sensor to observe system states and transmit its local state estimates to the remote center. Meanwhile, the attacker injects a jamming signal to destroy the packet accepted by the remote center and causes the performance degradation. Most of the existing works are built on a time-invariant channel state information (CSI) model in which the channel fading is stationary. However, the wireless communication channels are more prone to dynamic changes. To capture this time-variant property in the channel quality of the real-world networks, we study the fading channel network whose channel model is characterized by a generalized finite-state Markov chain. With the goals of two players in infinite-time horizon, we describe the conflicting characteristic between the attacker and the sensor with a general-sum stochastic game. Moreover, the Q-learning techniques are applied to obtain an optimal strategy pair at a Nash equilibrium. Also the monotone structure of the optimal stationary strategy is constructed under a sufficient condition. Besides, when channel gain is known a priori, except for the full Channel State Information (CSI), we also investigate the partial CSI, where Bayesian games are employed. Based on the player's own channel information and the belief on the channel distribution of other players, the energy strategy at a Nash equilibrium is obtained. △ Less

Submitted 23 November, 2020; originally announced November 2020.

arXiv:2011.10798 [pdf, other]

A Better and Faster End-to-End Model for Streaming ASR

Authors: Bo Li, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han, Qiao Liang, Yu Zhang, Trevor Strohman, Yonghui Wu

Abstract: End-to-end (E2E) models have shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the end and thus has much higher partial latency compared to a conventional ASR model. To address this i… ▽ More End-to-end (E2E) models have shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the end and thus has much higher partial latency compared to a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving on latency results in a quality degradation. To address this, we explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which has shown good improvements for ASR. Secondly, we also explore running a 2nd-pass beam search to improve quality. In order to ensure the 2nd-pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, we find that the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR. △ Less

Submitted 11 February, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

Comments: Accepted in ICASSP 2021

arXiv:2010.10504 [pdf, other]

Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

Authors: Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, Yonghui Wu

Abstract: We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training. By doing so, we are able to achieve word-e… ▽ More We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training. By doing so, we are able to achieve word-error-rates (WERs) 1.4%/2.6% on the LibriSpeech test/test-other sets against the current state-of-the-art WERs 1.7%/3.3%. △ Less

Submitted 20 July, 2022; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: 11 pages, 3 figures, 5 tables. Accepted to NeurIPS SAS 2020 Workshop; v2: minor errors corrected

arXiv:2008.13093 [pdf, other]

Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition

Authors: Wei Li, James Qin, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He

Abstract: Recent advances of end-to-end models have outperformed conventional models through employing a two-pass model. The two-pass model provides better speed-quality trade-offs for on-device speech recognition, where a 1st-pass model generates hypotheses in a streaming fashion, and a 2nd-pass model re-scores the hypotheses with full audio sequence context. The 2nd-pass model plays a key role in the qual… ▽ More Recent advances of end-to-end models have outperformed conventional models through employing a two-pass model. The two-pass model provides better speed-quality trade-offs for on-device speech recognition, where a 1st-pass model generates hypotheses in a streaming fashion, and a 2nd-pass model re-scores the hypotheses with full audio sequence context. The 2nd-pass model plays a key role in the quality improvement of the end-to-end model to surpass the conventional model. One main challenge of the two-pass model is the computation latency introduced by the 2nd-pass model. Specifically, the original design of the two-pass model uses LSTMs for the 2nd-pass model, which are subject to long latency as they are constrained by the recurrent nature and have to run inference sequentially. In this work we explore replacing the LSTM layers in the 2nd-pass rescorer with Transformer layers, which can process the entire hypothesis sequences in parallel and can therefore utilize the on-device computation resources more efficiently. Compared with an LSTM-based baseline, our proposed Transformer rescorer achieves more than 50% latency reduction with quality improvement. △ Less

Submitted 2 September, 2020; v1 submitted 30 August, 2020; originally announced August 2020.

Comments: Proceedings of Interspeech, 2020

arXiv:2006.00413 [pdf]

doi 10.1016/j.fmre.2021.06.010

Two-stage short-term wind power forecasting algorithm using different feature-learning models

Authors: Jiancheng Qin, ** Yang, Ying Chen, Qiang Ye, Hua Li

Abstract: Two-stage ensemble-based forecasting methods have been studied extensively in the wind power forecasting field. However, deep learning-based wind power forecasting studies have not investigated two aspects. In the first stage, different learning structures considering multiple inputs and multiple outputs have not been discussed. In the second stage, the model extrapolation issue has not been inves… ▽ More Two-stage ensemble-based forecasting methods have been studied extensively in the wind power forecasting field. However, deep learning-based wind power forecasting studies have not investigated two aspects. In the first stage, different learning structures considering multiple inputs and multiple outputs have not been discussed. In the second stage, the model extrapolation issue has not been investigated. Therefore, we develop four deep neural networks for the first stage to learn data features considering the input-and-output structure. We then explore the model extrapolation issue in the second stage using different modeling methods. Considering the overfitting issue, we propose a new moving window-based algorithm using a validation set in the first stage to update the training data in both stages with two different moving window processes.Experiments were conducted at three wind farms, and the results demonstrate that the model with single input multiple output structure obtains better forecasting accuracy compared to existing models. In addition, the ridge regression method results in a better ensemble model that can further improve forecasting accuracy compared to existing machine learning methods. Finally, the proposed two-stage forecasting algorithm can generate more accurate and stable results than existing algorithms. △ Less

Submitted 28 June, 2021; v1 submitted 30 May, 2020; originally announced June 2020.

Journal ref: Fundamental Research, 2021

Showing 1–50 of 68 results for author: Qin, J