Search | arXiv e-print repository

arXiv:2406.16967 [pdf, other]

Remaining useful life prediction of rolling bearings based on refined composite multi-scale attention entropy and dispersion entropy

Authors: Yunchong Long, Qinkang Pang, Guangjie Zhu, Junxian Cheng, Xiangshun Li

Abstract: Remaining useful life (RUL) prediction based on vibration signals is crucial for ensuring the safe operation and effective health management of rotating machinery. Existing studies often extract health indicators (HI) from time domain and frequency domain features to analyze complex vibration signals, but these features may not accurately capture the degradation process. In this study, we propose a degradation feature extraction method called Fusion of Multi-Modal Multi-Scale Entropy (FMME), which utilizes multi-modal Refined Composite Multi-scale Attention Entropy (RCMATE) and Fluctuation Dispersion Entropy (RCMFDE), to solve the problem that the existing degradation features cannot accurately reflect the degradation process. Firstly, the Empirical Mode Decomposition (EMD) is employed to decompose the dual-channel vibration signals of bearings into multiple modals. The main modals are then selected for further analysis. The subsequent step involves the extraction of RCMATE and RCMFDE from each modal, followed by wavelet denoising. Next, a novel metric is proposed to evaluate the quality of degradation features. The attention entropy and dispersion entropy of the optimal scales under different modals are fused using Laplacian Eigenmap (LE) to obtain the health indicators. Finally, RUL prediction is performed through the similarity of health indicators between fault samples and bearings to be predicted. Experimental results demonstrate that the proposed method yields favorable outcomes across diverse operating conditions.

Submitted 22 June, 2024; originally announced June 2024.

Comments: 12 pages, 9 figures

arXiv:2404.15339 [pdf, other]

Efficient EndoNeRF Reconstruction and Its Application for Data-driven Surgical Simulation

Authors: Yuehao Wang, Bingchen Gong, Yonghao Long, Siu Hin Fan, Qi Dou

Abstract: The healthcare industry has a growing need for realistic modeling and efficient simulation of surgical scenes. With effective models of deformable surgical scenes, clinicians are able to conduct surgical planning and surgery training on scenarios close to real-world cases. However, a significant challenge in achieving such a goal is the scarcity of high-quality soft tissue models with accurate shapes and textures. To address this gap, we present a data-driven framework that leverages emerging neural radiance field technology to enable high-quality surgical reconstruction and explore its application for surgical simulations. We first focus on developing a fast NeRF-based surgical scene 3D reconstruction approach that achieves state-of-the-art performance. This method can significantly outperform traditional 3D reconstruction methods, which have failed to capture large deformations and produce fine-grained shapes and textures. We then propose an automated creation pipeline of interactive surgical simulation environments through a closed mesh extraction algorithm. Our experiments have validated the superior performance and efficiency of our proposed approach in surgical scene 3D reconstruction. We further utilize our reconstructed soft tissues to conduct FEM and MPM simulations, showcasing the practical application of our method in data-driven surgical simulations.

Submitted 10 April, 2024; originally announced April 2024.

Comments: 14 pages, 4 figures. Accepted by International Journal of Computer Assisted Radiology and Surgery

arXiv:2401.03623 [pdf]

A Video Coding Method Based on Neural Network for CLIC2024

Authors: Zhengang Li, Jingchi Zhang, Yonghua Wang, Xing Zeng, Zhen Zhang, Yunlin Long, Menghu Jia, Ning Wang

Abstract: This paper presents a video coding scheme that combines traditional optimization methods with deep learning methods based on the Enhanced Compression Model (ECM). In this paper, the traditional optimization methods adaptively adjust the quantization parameter (QP). The key frame QP offset is set according to the video content characteristics, and the coding tree unit (CTU) level QP of all frames is also adjusted according to the spatial-temporal perception information. Block importance mapping technology (BIM) is also introduced, which adjusts the QP according to the block importance. Meanwhile, the deep learning methods propose a convolutional neural network-based loop filter (CNNLF), which is turned on/off based on the rate-distortion optimization at the CTU and frame level. Besides, intra-prediction using neural networks (NN-intra) is proposed to further improve compression quality, where 8 neural networks are used for predicting blocks of different sizes. The experimental results show that compared with ECM-3.0, the proposed traditional methods and adding deep learning methods improve the PSNR by 0.54 dB and 1 dB at 0.05 Mbps, respectively; 0.38 dB and 0.71 dB at 0.5 Mbps, respectively, which proves the superiority of our method.

Submitted 7 January, 2024; originally announced January 2024.
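
The gains in the entry above are quoted in PSNR (dB). As a reminder of what that metric measures, here is a minimal sketch of the standard PSNR definition. It assumes 8-bit samples (peak value 255) given as flat sequences; the function name `psnr` is illustrative and is not taken from the paper or from ECM.

```python
import math

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length sample sequences."""
    if len(reference) != len(distorted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the reference and the distorted samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals have unbounded PSNR
    return 10.0 * math.log10(peak ** 2 / mse)
```

Under this definition, a 1 dB PSNR gain at a fixed peak value corresponds to the MSE shrinking by a factor of 10^0.1, i.e. roughly 21% lower MSE.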

arXiv:2311.12071 [pdf, other]

Enhancing Low-dose CT Image Reconstruction by Integrating Supervised and Unsupervised Learning

Authors: Ling Chen, Zhishen Huang, Yong Long, Saiprasad Ravishankar

Abstract: Traditional model-based image reconstruction (MBIR) methods combine forward and noise models with simple object priors. Recent application of deep learning methods for image reconstruction provides a successful data-driven approach to addressing the challenges when reconstructing images with undersampled measurements or various types of noise. In this work, we propose a hybrid supervised-unsupervised learning framework for X-ray computed tomography (CT) image reconstruction. The proposed learning formulation leverages both sparsity or unsupervised learning-based priors and neural network reconstructors to simulate a fixed-point iteration process. Each proposed trained block consists of a deterministic MBIR solver and a neural network. The information flows in parallel through these two reconstructors and is then optimally combined. Multiple such blocks are cascaded to form a reconstruction pipeline. We demonstrate the efficacy of this learned hybrid model for low-dose CT image reconstruction with limited training data, where we use the NIH AAPM Mayo Clinic Low Dose CT Grand Challenge dataset for training and testing. In our experiments, we study combinations of supervised deep network reconstructors and MBIR solver with learned sparse representation-based priors or analytical priors. Our results demonstrate the promising performance of the proposed framework compared to recent low-dose CT reconstruction methods.

Submitted 19 November, 2023; originally announced November 2023.

Comments: submitted to IEEE Transactions on Medical Imaging

arXiv:2311.08829 [pdf, other]

Autoencoder with Group-based Decoder and Multi-task Optimization for Anomalous Sound Detection

Authors: Yifan Zhou, Dongxing Xu, Haoran Wei, Yanhua Long

Abstract: In industry, machine anomalous sound detection (ASD) is in great demand. However, collecting enough abnormal samples is difficult due to the high cost, which boosts the rapid development of unsupervised ASD algorithms. Autoencoder (AE) based methods have been widely used for unsupervised ASD, but suffer from problems including 'shortcut', poor anti-noise ability and sub-optimal quality of features. To address these challenges, we propose a new AE-based framework termed AEGM. Specifically, we first insert an auxiliary classifier into AE to enhance ASD in a multi-task learning manner. Then, we design a group-based decoder structure, accompanied by an adaptive loss function, to endow the model with domain-specific knowledge. Results on the DCASE 2021 Task 2 development set show that our methods achieve a relative improvement of 13.11% and 15.20% respectively in average AUC over the official AE and MobileNetV2 across test sets of seven machines.

Submitted 15 November, 2023; originally announced November 2023.

Comments: Submitted to the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)

arXiv:2308.12526 [pdf, other]

UNISOUND System for VoxCeleb Speaker Recognition Challenge 2023

Authors: Yu Zheng, Yajun Zhang, Chuanying Niu, Yibin Zhan, Yanhua Long, Dongxing Xu

Abstract: This report describes the UNISOUND submission for Track 1 and Track 2 of VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC 2023). We submit the same system on Track 1 and Track 2, which is trained with only VoxCeleb2-dev. Large-scale ResNet and RepVGG architectures are developed for the challenge. We propose a consistency-aware score calibration method, which leverages the stability of audio voiceprints in similarity score by a Consistency Measure Factor (CMF). CMF brings a huge performance boost in this challenge. Our final system is a fusion of six models and achieves the first place in Track 1 and second place in Track 2 of VoxSRC 2023. The minDCF of our submission is 0.0855 and the EER is 1.5880%.

Submitted 23 August, 2023; originally announced August 2023.
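
The entry above reports an equal error rate (EER). For readers unfamiliar with the metric, the sketch below estimates EER from two lists of verification scores by sweeping a decision threshold over the observed scores. It is a coarse approximation under stated assumptions (higher score means "same speaker"; no ROC interpolation); the function name and inputs are hypothetical, and this is not the official VoxSRC scoring tool.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the operating point where false accepts and false rejects balance."""
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, best_eer = None, None
    # Candidate thresholds are the observed scores themselves.
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

With this convention a system whose target and non-target scores are perfectly separated gets an EER of 0, and lower values are better.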

arXiv:2306.11309 [pdf, other]

Multi-pass Training and Cross-information Fusion for Low-resource End-to-end Accented Speech Recognition

Authors: Xuefei Wang, Yanhua Long, Yijie Li, Haoran Wei

Abstract: Low-resource accented speech recognition is one of the important challenges faced by current ASR technology in practical applications. In this study, we propose a Conformer-based architecture, called Aformer, to leverage both the acoustic information from large non-accented and limited accented training data. Specifically, a general encoder and an accent encoder are designed in the Aformer to extract complementary acoustic information. Moreover, we propose to train the Aformer in a multi-pass manner, and investigate three cross-information fusion methods to effectively combine the information from both general and accent encoders. All experiments are conducted on both the accented English and Mandarin ASR tasks. Results show that our proposed methods outperform the strong Conformer baseline by relative 10.2% to 24.5% word/character error rate reduction on six in-domain and out-of-domain accented test sets.

Submitted 20 June, 2023; originally announced June 2023.

arXiv:2305.10055 [pdf, other]

Optimized Joint Beamforming for Wireless Powered Over-the-Air Computation

Authors: Siyao Zhang, Xinmin Li, Yin Long, Jie Xu, Shuguang Cui

Abstract: This correspondence studies the wireless powered over-the-air computation (AirComp) for achieving sustainable wireless data aggregation (WDA) by integrating AirComp and wireless power transfer (WPT) into a joint design. In particular, we consider that a multi-antenna hybrid access point (HAP) employs the transmit energy beamforming to charge multiple single-antenna low-power wireless devices (WDs) in the downlink, and the WDs use the harvested energy to simultaneously send their messages to the HAP for AirComp in the uplink. Under this setup, we minimize the computation mean square error (MSE), by jointly optimizing the transmit energy beamforming and the receive AirComp beamforming at the HAP, as well as the transmit power at the WDs, subject to the maximum transmit power constraint at the HAP and the wireless energy harvesting constraints at individual WDs. To tackle the non-convex computation MSE minimization problem, we present an efficient algorithm to find a converged high-quality solution by using the alternating optimization technique. Numerical results show that the proposed joint WPT-AirComp approach significantly reduces the computation MSE, as compared to other benchmark schemes.

Submitted 17 May, 2023; originally announced May 2023.

Comments: 3 figures

arXiv:2303.02388 [pdf, other]

Graph-based Representation for Image based on Granular-ball

Authors: Xia Shuyin, Dai Dawei, Yang Long, Zhany Li, Lan Danf, Zhu hao, Wang Guoy

Abstract: Current image processing methods usually operate on the finest-granularity unit; that is, the pixel, which leads to challenges in terms of efficiency, robustness, and understandability in deep learning models. We present an improved granular-ball computing method to represent the image as a graph, in which each node expresses a structural block in the image and each edge represents the association between two nodes. Specifically: (1) We design a gradient-based strategy for the adaptive reorganization of all pixels in the image into numerous rectangular regions, each of which can be regarded as one node. (2) Each node has a connection edge with the nodes with which it shares regions. (3) We design a low-dimensional vector as the attribute of each node. All nodes and their corresponding edges form a graphical representation of a digital image. In the experiments, our proposed graph representation is applied to benchmark datasets for image classification tasks, and the efficiency and good understandability demonstrate that our proposed method offers significant potential in artificial intelligence theory and application.

Submitted 4 March, 2023; originally announced March 2023.

Comments: 9 pages, 5 figures

arXiv:2301.10056 [pdf]

doi 10.1109/SP46215.2023.00059

Side Eye: Characterizing the Limits of POV Acoustic Eavesdropping from Smartphone Cameras with Rolling Shutters and Movable Lenses

Authors: Yan Long, Pirouz Naghavi, Blas Kojusner, Kevin Butler, Sara Rampazzi, Kevin Fu

Abstract: Our research discovers how the rolling shutter and movable lens structures widely found in smartphone cameras modulate structure-borne sounds onto camera images, creating a point-of-view (POV) optical-acoustic side channel for acoustic eavesdropping. The movement of smartphone camera hardware leaks acoustic information because images unwittingly modulate ambient sound as imperceptible distortions. Our experiments find that the side channel is further amplified by intrinsic behaviors of Complementary metal-oxide-semiconductor (CMOS) rolling shutters and movable lenses such as in Optical Image Stabilization (OIS) and Auto Focus (AF). Our paper characterizes the limits of acoustic information leakage caused by structure-borne sound that perturbs the POV of smartphone cameras. In contrast with traditional optical-acoustic eavesdropping on vibrating objects, this side channel requires no line of sight and no object within the camera's field of view (images of a ceiling suffice). Our experiments test the limits of this side channel with a novel signal processing pipeline that extracts and recognizes the leaked acoustic information. Our evaluation with 10 smartphones on a spoken digit dataset reports 80.66%, 91.28%, and 99.67% accuracies on recognizing 10 spoken digits, 20 speakers, and 2 genders respectively. We further systematically discuss the possible defense strategies and implementations. By modeling, measuring, and demonstrating the limits of acoustic eavesdropping from smartphone camera image streams, our contributions explain the physics-based causality and possible ways to reduce the threat on current and future devices.

Submitted 26 January, 2023; v1 submitted 24 January, 2023; originally announced January 2023.

Journal ref: 2023 IEEE Symposium on Security and Privacy

arXiv:2211.12097 [pdf, other]

Dynamic Acoustic Compensation and Adaptive Focal Training for Personalized Speech Enhancement

Authors: Xiaofeng Ge, Jiangyu Han, Haixin Guan, Yanhua Long

Abstract: Recently, more and more personalized speech enhancement systems (PSE) with excellent performance have been proposed. However, two critical issues still limit the performance and generalization ability of the model: 1) Acoustic environment mismatch between the test noisy speech and target speaker enrollment speech; 2) Hard sample mining and learning. In this paper, dynamic acoustic compensation (DAC) is proposed to alleviate the environment mismatch, by intercepting the noise or environmental acoustic segments from noisy speech and mixing them with the clean enrollment speech. To better exploit the hard samples in training data, we propose an adaptive focal training (AFT) strategy by assigning adaptive loss weights to hard and non-hard samples during training. A time-frequency multi-loss training is further introduced to improve and generalize our previous work sDPCCN for PSE. The effectiveness of the proposed methods is examined on the DNS4 Challenge dataset. Results show that the DAC brings large improvements in terms of multiple evaluation metrics, and AFT reduces the hard sample rate significantly and produces obvious MOS score improvement.

Submitted 22 November, 2022; originally announced November 2022.

arXiv:2211.01571 [pdf, other]

Phonetic-assisted Multi-Target Units Modeling for Improving Conformer-Transducer ASR system

Authors: Li Li, Dongxing Xu, Haoran Wei, Yanhua Long

Abstract: Exploiting effective target modeling units is very important and has always been a concern in end-to-end automatic speech recognition (ASR). In this work, we propose a phonetic-assisted multi target units (PMU) modeling approach, to enhance the Conformer-Transducer ASR system in a progressive representation learning manner. Specifically, PMU first uses the pronunciation-assisted subword modeling (PASM) and byte pair encoding (BPE) to produce phonetic-induced and text-induced target units separately. Then, three new frameworks are investigated to enhance the acoustic encoder, including a basic PMU, a paraCTC and a pcaCTC, which integrate the PASM and BPE units at different levels for CTC and transducer multi-task training. Experiments on both LibriSpeech and accented ASR tasks show that the proposed PMU significantly outperforms the conventional BPE: it reduces the WER of LibriSpeech clean, other, and six accented ASR testsets by relative 12.7%, 6.0% and 7.7%, respectively.

Submitted 7 July, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: Accepted by Interspeech 2023
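
The entry above uses byte pair encoding (BPE) to produce its text-induced target units. The sketch below shows the generic BPE merge-learning loop on a toy word-frequency table, to make concrete what "BPE units" are. It is an assumption-laden illustration, not the paper's pipeline: the function name `learn_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are choices made here, and PASM is not covered.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: frequency} table."""
    # Start with each word split into characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab
```

For example, `learn_bpe({"abab": 2}, 2)` first merges `("a", "b")` and then `("ab", "ab")`, leaving the single unit `abab` followed by the end-of-word marker.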

arXiv:2211.01266 [pdf, other]

Knowing the Past to Predict the Future: Reinforcement Virtual Learning

Authors: Peng Zhang, Yawen Huang, Bingzhang Hu, Shizheng Wang, Haoran Duan, Noura Al Moubayed, Yefeng Zheng, Yang Long

Abstract: Reinforcement Learning (RL)-based control system has received considerable attention in recent decades. However, in many real-world problems, such as Batch Process Control, the environment is uncertain, which requires expensive interaction to acquire the state and reward values. In this paper, we present a cost-efficient framework, such that the RL model can evolve for itself in a Virtual Space using the predictive models with only historical data. The proposed framework enables a step-by-step RL model to predict the future state and select optimal actions for long-sight decisions. The main focuses are summarized as: 1) how to balance the long-sight and short-sight rewards with an optimal strategy; 2) how to make the virtual model interact with the real environment to converge to a final learning policy. Under the experimental settings of Fed-Batch Process, our method consistently outperforms the existing state-of-the-art methods.

Submitted 2 November, 2022; originally announced November 2022.

arXiv:2210.17189

DiaCorrect: End-to-end error correction for speaker diarization

Authors: Jiangyu Han, Yuhang Cao, Heng Lu, Yanhua Long

Abstract: In recent years, speaker diarization has attracted widespread attention. To achieve better performance, some studies propose to diarize speech in multiple stages. Although these methods might bring additional benefits, most of them are quite complex. Motivated by spelling correction in automatic speech recognition (ASR), in this paper, we propose an end-to-end error correction framework, termed DiaCorrect, to refine the initial diarization results in a simple but efficient way. By exploiting the acoustic interactions between input mixture and its corresponding speaker activity, DiaCorrect could automatically adapt the initial speaker activity to minimize the diarization errors. Without bells and whistles, experiments on LibriSpeech based 2-speaker meeting-like data show that the self-attentive end-to-end neural diarization (SA-EEND) baseline with DiaCorrect could reduce its diarization error rate (DER) by over 62.4% from 12.31% to 4.63%. Our source code is available online at https://github.com/jyhan03/diacorrect.

Submitted 18 September, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

Comments: This paper has been superseded by arXiv:2309.08377 (merged from arXiv:2210.17189)
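
The DER figures in the entry above are an absolute drop (12.31% to 4.63%) quoted as a relative reduction. A minimal sketch of that arithmetic follows; the helper name `relative_reduction` is illustrative and not taken from the DiaCorrect code.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, as a percentage of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline

# DER values quoted in the abstract: 12.31% (baseline) and 4.63% (with DiaCorrect).
der_gain = relative_reduction(12.31, 4.63)
print(f"{der_gain:.2f}%")  # prints 62.39%
```

The result is the 7.68-point absolute drop divided by the 12.31% baseline, which is about 62.4% when rounded to one decimal place. The same helper applies to the relative WER and AUC changes quoted in other entries of this listing.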

arXiv:2205.09587 [pdf, other]

Combining Deep Learning and Adaptive Sparse Modeling for Low-dose CT Reconstruction

Authors: Ling Chen, Zhishen Huang, Yong Long, Saiprasad Ravishankar

Abstract: Traditional model-based image reconstruction (MBIR) methods combine forward and noise models with simple object priors. Recent application of deep learning methods for image reconstruction provides a successful data-driven approach to addressing the challenges when reconstructing images with measurement undersampling or various types of noise. In this work, we propose a hybrid supervised-unsupervised learning framework for X-ray computed tomography (CT) image reconstruction. The proposed learning formulation leverages both sparsity or unsupervised learning-based priors and neural network reconstructors to simulate a fixed-point iteration process. Each proposed trained block consists of a deterministic MBIR solver and a neural network. The information flows in parallel through these two reconstructors and is then optimally combined, and multiple such blocks are cascaded to form a reconstruction pipeline. We demonstrate the efficacy of this learned hybrid model for low-dose CT image reconstruction with limited training data, where we use the NIH AAPM Mayo Clinic Low Dose CT Grand Challenge dataset for training and testing. In our experiments, we study combinations of supervised deep network reconstructors and sparse representations-based (unsupervised) learned or analytical priors. Our results demonstrate the promising performance of the proposed framework compared to recent reconstruction methods.

Submitted 19 May, 2022; originally announced May 2022.

arXiv:2205.04821 [pdf, other]

Self-supervised regression learning using domain knowledge: Applications to improving self-supervised denoising in imaging

Authors: Il Yong Chun, Dongwon Park, Xuehang Zheng, Se Young Chun, Yong Long

Abstract: Regression that predicts continuous quantity is a central part of applications using computational imaging and computer vision technologies. Yet, studying and understanding self-supervised learning for regression tasks - except for a particular regression task, image denoising - have lagged behind. This paper proposes a general self-supervised regression learning (SSRL) framework that enables learning regression neural networks with only input data (but without ground-truth target data), by using a designable pseudo-predictor that encapsulates domain knowledge of a specific application. The paper underlines the importance of using domain knowledge by showing that under different settings, the better pseudo-predictor can lead properties of SSRL closer to those of ordinary supervised learning. Numerical experiments for low-dose computational tomography denoising and camera image denoising demonstrate that proposed SSRL significantly improves the denoising quality over several existing self-supervised denoising methods.

Submitted 10 May, 2022; originally announced May 2022.

Comments: 17 pages, 16 figures, 2 tables, submitted to IEEE T-IP

arXiv:2204.11032 [pdf, other]

Heterogeneous Separation Consistency Training for Adaptation of Unsupervised Speech Separation

Authors: Jiangyu Han, Yanhua Long

Abstract: Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing separation methods require ground-truth sources and are trained on synthetic datasets. This ground-truth reliance is problematic, because the ground-truth signals are usually unavailable in real conditions. Moreover, in many industry scenarios, the real acoustic characteristics deviate far from the ones in simulated datasets. Therefore, the performance usually degrades significantly when applying the supervised speech separation models to real applications. To address these problems, in this study, we propose a novel separation consistency training, termed SCT, to exploit the real-world unlabeled mixtures for improving cross-domain unsupervised speech separation in an iterative manner, by leveraging upon the complementary information obtained from heterogeneous (structurally distinct but behaviorally complementary) models. SCT follows a framework using two heterogeneous neural networks (HNNs) to produce high confidence pseudo labels of unlabeled real speech mixtures. These labels are then updated, and used to refine the HNNs to produce more reliable consistent separation results for real mixture pseudo-labeling. To maximally utilize the large complementary information between different separation networks, a cross-knowledge adaptation is further proposed. Together with simulated dataset, those real mixtures with high confidence pseudo labels are then used to update the HNN separation models iteratively. In addition, we find that combining the heterogeneous separation outputs by a simple linear fusion can further slightly improve the final system performance.

Submitted 6 August, 2022; v1 submitted 23 April, 2022; originally announced April 2022.

arXiv:2203.11565 [pdf, other]

Multi-layer Clustering-based Residual Sparsifying Transform for Low-dose CT Image Reconstruction

Authors: Xikai Yang, Zhishen Huang, Yong Long, Saiprasad Ravishankar

Abstract: The recently proposed sparsifying transform models incur low computational cost and have been applied to medical imaging. Meanwhile, deep models with nested network structure reveal great potential for learning features in different layers. In this study, we propose a network-structured sparsifying transform learning approach for X-ray computed tomography (CT), which we refer to as multi-layer clustering-based residual sparsifying transform (MCST) learning. The proposed MCST scheme learns multiple different unitary transforms in each layer by dividing each layer's input into several classes. We apply the MCST model to low-dose CT (LDCT) reconstruction by deploying the learned MCST model into the regularizer in penalized weighted least squares (PWLS) reconstruction. We conducted LDCT reconstruction experiments on XCAT phantom data and Mayo Clinic data and trained the MCST model with 2 (or 3) layers and with 5 clusters in each layer. The learned transforms in the same layer showed rich features while additional information is extracted from representation residuals. Our simulation results demonstrate that PWLS-MCST achieves better image reconstruction quality than the conventional FBP method and PWLS with edge-preserving (EP) regularizer. It also outperformed recent advanced methods like PWLS with a learned multi-layer residual sparsifying transform prior (MARS) and PWLS with a union of learned transforms (ULTRA), especially for displaying clear edges and preserving subtle details.

Submitted 22 March, 2022; originally announced March 2022.

Comments: 19 pages, 12 figures, submitted to Medical Physics

arXiv:2203.02263 [pdf, other]

PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement

Authors: Xiaofeng Ge, Jiangyu Han, Yanhua Long, Haixin Guan

Abstract: PercepNet, a recent extension of RNNoise, an efficient, high-quality and real-time full-band speech enhancement technique, has shown promising performance in various public deep noise suppression tasks. This paper proposes a new approach, named PercepNet+, to further extend PercepNet with four significant improvements. First, we introduce a phase-aware structure to leverage the phase information in PercepNet, by adding the complex features and complex subband gains as the deep network input and output respectively. Then, a signal-to-noise ratio (SNR) estimator and an SNR switched post-processing are specially designed to alleviate the over attenuation (OA) that appears in high SNR conditions of the original PercepNet. Moreover, the GRU layer is replaced by TF-GRU to model both temporal and frequency dependencies. Finally, we propose to integrate the loss of complex subband gain, SNR, pitch filtering strength, and an OA loss in a multi-objective learning manner to further improve the speech enhancement performance. Experimental results show that the proposed PercepNet+ outperforms the original PercepNet significantly in terms of both PESQ and STOI, without increasing the model size too much.

Submitted 4 March, 2022; originally announced March 2022.

Comments: This article was submitted to Interspeech 2022

arXiv:2203.02191 [pdf, other]

Selective Pseudo-labeling and Class-wise Discriminative Fusion for Sound Event Detection

Authors: Yunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang

Abstract: In recent years, exploring effective sound separation (SSep) techniques to improve overlapping sound event detection (SED) has attracted more and more attention. Creating accurate separation signals to avoid the catastrophic error accumulation during SED model training is very important and challenging. In this study, we first propose a novel selective pseudo-labeling approach, termed SPL, to produce high confidence separated target events from blind sound separation outputs. These target events are then used to fine-tune the original SED model that was pre-trained on the sound mixtures in a multi-objective learning style. Then, to further leverage the SSep outputs, a class-wise discriminative fusion is proposed to improve the final SED performances, by combining multiple frame-level event predictions of both sound mixtures and their separated signals. All experiments are performed on the public DCASE 2021 Task 4 dataset, and results show that our approaches significantly outperform the official baseline: the collar-based F1, PSDS1 and PSDS2 performances are improved from 44.3%, 37.3% and 54.9% to 46.5%, 44.5% and 75.4%, respectively.

Submitted 4 March, 2022; originally announced March 2022.

Comments: This article was submitted to Interspeech 2022

arXiv:2201.05267 [pdf, other]

doi 10.1109/TPWRS.2022.3142105

Bi-level Volt/VAR Optimization in Distribution Networks with Smart PV Inverters

Authors: Yao Long, Daniel S. Kirschen

Abstract: Optimal Volt/VAR control (VVC) in distribution networks relies on an effective coordination between the conventional utility-owned mechanical devices and the smart residential photovoltaic (PV) inverters. Typically, a central controller carries out a periodic optimization and sends setpoints to the local controller of each device. However, instead of tracking centrally dispatched setpoints, smart PV inverters can cooperate on a much faster timescale to reach optimality within a PV inverter group. To accommodate such PV inverter groups in the VVC architecture, this paper proposes a bi-level optimization framework. The upper-level determines the setpoints of the mechanical devices to minimize the network active power losses, while the lower-level represents the coordinated actions that the inverters take for their own objectives. The interactions between these two levels are captured in the bi-level optimization, which is solved using the Karush-Kuhn-Tucker (KKT) conditions. This framework fully exploits the capabilities of the different types of voltage regulation devices and enables them to cooperatively optimize their goals. Case studies on typical distribution networks with field-recorded data demonstrate the effectiveness and advantages of the proposed approach.

Submitted 13 January, 2022; originally announced January 2022.

arXiv:2112.13520 [pdf, other]

DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation And Extraction

Authors: Jiangyu Han, Yanhua Long, Lukas Burget, Jan Cernocky

Abstract: In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are very sensitive to the environments and wide domain coverage tasks. In this paper, from the time-frequency domain perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve the robustness of speech separation under complicated conditions. Furthermore, we generalize the DPCCN to target speech extraction (TSE) by integrating a new specially designed speaker encoder. Moreover, we also investigate the robustness of DPCCN to unsupervised cross-domain TSE tasks. A Mixture-Remix approach is proposed to adapt the target domain acoustic characteristics for fine-tuning the source model. We evaluate the proposed methods not only under noisy and reverberant in-domain conditions, but also in clean but cross-domain conditions. Results show that for both speech separation and extraction, the DPCCN-based systems achieve significantly better performance and robustness than the currently dominating time-domain methods, especially for the cross-domain tasks. Particularly, we find that the Mixture-Remix fine-tuning with DPCCN significantly outperforms the TD-SpeakerBeam for unsupervised cross-domain TSE, with around 3.5 dB SISNR improvement on the target domain test set, without any source domain performance degradation.

Submitted 29 January, 2022; v1 submitted 27 December, 2021; originally announced December 2021.

Comments: accepted by ICASSP 2022

arXiv:2110.15390 [pdf, other]

doi 10.1109/TPWRS.2021.3120195

Adaptive Coalition Formation-Based Coordinated Voltage Regulation in Distribution Networks

Authors: Yao Long, Ryan T. Elliott, Daniel S. Kirschen

Abstract: High penetrations of photovoltaic (PV) systems can cause severe voltage quality problems in distribution networks. This paper proposes a distributed control strategy based on the dynamic formation of coalitions to coordinate a large number of PV inverters for voltage regulation. In this strategy, a rule-based coalition formation scheme deals with the zonal voltage difference caused by the uneven integration of PV capacity. Under this scheme, PV inverters form into separate voltage regulation coalitions autonomously according to local, neighbor as well as coalition voltage magnitude and regulation capacity information. To coordinate control within each coalition, we develop a feedback-based leader-follower consensus algorithm which eliminates the voltage violations caused by the fast fluctuations of load and PV generation. This algorithm allocates the required reactive power contribution among the PV inverters according to their maximum available capacity to promote an effective and fair use of the overall voltage regulation capacity. Case studies based on realistic distribution networks and field-recorded data validate the effectiveness of the proposed control strategy. Moreover, comparison with a centralized network decomposition-based scheme shows the flexibility of coalition formation in organizing the distributed PV inverters. The robustness and generalizability of the proposed strategy are also demonstrated.

Submitted 28 October, 2021; originally announced October 2021.
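
The abstract above states that the required reactive power is allocated among the PV inverters according to their maximum available capacity. One simple reading of that sentence is a proportional split, sketched below. This is an illustration only: the proportional rule, the function name `allocate_reactive_power`, and the kvar units are assumptions, and the paper's feedback-based leader-follower consensus algorithm is not reproduced here.

```python
def allocate_reactive_power(required_kvar, capacities_kvar):
    """Split a coalition's reactive power request in proportion to capacity.

    Hypothetical helper: assumes a proportional sharing rule, which is one
    possible interpretation of "according to their maximum available capacity".
    """
    total = sum(capacities_kvar)
    if total <= 0:
        raise ValueError("coalition has no available reactive power capacity")
    # Saturate the request at what the coalition can physically deliver.
    request = max(-total, min(total, required_kvar))
    # Every inverter ends up with the same utilization ratio request / total.
    return [request * c / total for c in capacities_kvar]


shares = allocate_reactive_power(30.0, [10.0, 20.0, 30.0])
print(shares)  # each inverter is loaded to the same fraction of its capacity
```

Under this assumed rule, every inverter operates at the same fraction of its own capacity, which is one way to interpret the "fair use" of regulation capacity mentioned in the abstract.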

arXiv:2110.03912 [pdf, other]

doi 10.1109/TBME.2022.3195027

Stereo Dense Scene Reconstruction and Accurate Localization for Learning-Based Navigation of Laparoscope in Minimally Invasive Surgery

Authors: Ruofeng Wei, Bin Li, Hangjie Mo, Bo Lu, Yonghao Long, Bohan Yang, Qi Dou, Yunhui Liu, Dong Sun

Abstract: Objective: The computation of anatomical information and laparoscope position is a fundamental block of surgical navigation in Minimally Invasive Surgery (MIS). Recovering a dense 3D structure of surgical scene using visual cues remains a challenge, and the online laparoscopic tracking primarily relies on external sensors, which increases system complexity. Methods: Here, we propose a learning-driven framework, in which an image-guided laparoscopic localization with 3D reconstructions of complex anatomical structures is obtained. To reconstruct the 3D structure of the whole surgical environment, we first fine-tune a learning-based stereoscopic depth perception method, which is robust to the texture-less and variant soft tissues, for depth estimation. Then, we develop a dense visual reconstruction algorithm to represent the scene by surfels, estimate the laparoscope poses and fuse the depth maps into a unified reference coordinate for tissue reconstruction. To estimate poses of new laparoscope views, we achieve a coarse-to-fine localization method, which incorporates our reconstructed 3D model. Results: We evaluate the reconstruction method and the localization module on three datasets, namely, the stereo correspondence and reconstruction of endoscopic data (SCARED), the ex-vivo phantom and tissue data collected with Universal Robot (UR) and Karl Storz Laparoscope, and the in-vivo DaVinci robotic surgery dataset, where the reconstructed 3D structures have rich details of surface texture with an accuracy error under 1.71 mm and the localization module can accurately track the laparoscope with only images as input. Conclusions: Experimental results demonstrate the superior performance of the proposed method in 3D anatomy reconstruction and laparoscopic localization. Significance: The proposed framework can be potentially extended to the current surgical navigation system.

Submitted 27 November, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

Journal ref: IEEE Transactions on Biomedical Engineering 2022

arXiv:2109.14956 [pdf]

Comparative Validation of Machine Learning Algorithms for Surgical Workflow and Skill Analysis with the HeiChole Benchmark

Authors: Martin Wagner, Beat-Peter Müller-Stich, Anna Kisilenko, Duc Tran, Patrick Heger, Lars Mündermann, David M Lubotsky, Benjamin Müller, Tornike Davitashvili, Manuela Capek, Annika Reinke, Tong Yu, Armine Vardazaryan, Chinedu Innocent Nwoye, Nicolas Padoy, Xinyang Liu, Eung-Joo Lee, Constantin Disch, Hans Meine, Tong Xia, Fucang Jia, Satoshi Kondo, Wolfgang Reiter, Yueming Jin, Yonghao Long, et al. (16 additional authors not shown)

Abstract: PURPOSE: Surgical workflow and skill analysis are key technologies for the next generation of cognitive surgical assistance systems. These systems could increase the safety of the operation through context-sensitive warnings and semi-autonomous robotic assistance or improve training of surgeons via data-driven feedback. In surgical workflow analysis up to 91% average precision has been reported for phase recognition on an open data single-center dataset. In this work we investigated the generalizability of phase recognition algorithms in a multi-center setting including more difficult recognition tasks such as surgical action and surgical skill. METHODS: To achieve this goal, a dataset with 33 laparoscopic cholecystectomy videos from three surgical centers with a total operation time of 22 hours was created. Labels included annotation of seven surgical phases with 250 phase transitions, 5514 occurrences of four surgical actions, 6980 occurrences of 21 surgical instruments from seven instrument categories and 495 skill classifications in five skill dimensions. The dataset was used in the 2019 Endoscopic Vision challenge, sub-challenge for surgical workflow and skill analysis. Here, 12 teams submitted their machine learning algorithms for recognition of phase, action, instrument and/or skill assessment. RESULTS: F1-scores were achieved for phase recognition between 23.9% and 67.7% (n=9 teams), for instrument presence detection between 38.5% and 63.8% (n=8 teams), but for action recognition only between 21.8% and 23.3% (n=5 teams). The average absolute error for skill assessment was 0.78 (n=1 team). CONCLUSION: Surgical workflow and skill analysis are promising technologies to support the surgical team, but are not solved yet, as shown by our comparison of algorithms. This novel benchmark can be used for comparable evaluation and validation of future work.

Submitted 30 September, 2021; originally announced September 2021.

arXiv:2108.01997 [pdf, other]

DuCN: Dual-children Network for Medical Diagnosis and Similar Case Recommendation towards COVID-19

Authors: Chengtao Peng, Yunfei Long, Senhua Zhu, Dandan Tu, Bin Li

Abstract: Early detection of the coronavirus disease 2019 (COVID-19) helps to treat patients in a timely manner and increase the cure rate, thus further suppressing the spread of the disease. In this study, we propose a novel deep learning based detection and similar case recommendation network to help control the epidemic. Our proposed network contains two stages: the first one is a lung region segmentation step and is used to exclude irrelevant factors, and the second is a detection and recommendation stage. Under this framework, in the second stage, we develop a dual-children network (DuCN) based on a pre-trained ResNet-18 to simultaneously realize the disease diagnosis and similar case recommendation. Besides, we employ triplet loss and intrapulmonary distance maps to assist the detection, which helps incorporate tiny differences between two images and is conducive to improving the diagnostic accuracy. For each confirmed COVID-19 case, we give similar cases to provide radiologists with diagnosis and treatment references. We conduct experiments on a large publicly available dataset (CC-CCII) and compare the proposed model with state-of-the-art COVID-19 detection methods. The results show that our proposed model achieves a promising clinical performance.

Submitted 3 August, 2021; originally announced August 2021.

arXiv:2106.07564 [pdf]

An optimized Capsule-LSTM model for facial expression recognition with video sequences

Authors: Siwei Liu, Yuanpeng Long, Gao Xu, Lijia Yang, Shimei Xu, Xiaoming Yao, Kunxian Shu

Abstract: To overcome the limitations of convolutional neural networks in the process of facial expression recognition, a facial expression recognition model, Capsule-LSTM, based on video frame sequences is proposed. This model is composed of three networks: capsule encoders, capsule decoders and an LSTM network. The capsule encoder extracts the spatial information of facial expressions in video frames. The capsule decoder reconstructs the images to optimize the network. The LSTM extracts the temporal information between video frames and analyzes the differences in expression changes between frames. The experimental results from the MMI dataset show that the Capsule-LSTM model proposed in this paper can effectively improve the accuracy of video expression recognition.

Submitted 27 May, 2021; originally announced June 2021.

Comments: 14 pages, 4 figures

arXiv:2106.07563 [pdf]

BPLF: A Bi-Parallel Linear Flow Model for Facial Expression Generation from Emotion Set Images

Authors: Gao Xu, Yuanpeng Long, Siwei Liu, Lijia Yang, Shimei Xu, Xiaoming Yao, Kunxian Shu

Abstract: The flow-based generative model is a deep learning generative model, which obtains the ability to generate data by explicitly learning the data distribution. Theoretically its ability to restore data is stronger than other generative models. However, its implementation has many limitations, including limited model design, too many model parameters and tedious calculation. In this paper, a bi-parallel linear flow model for facial emotion generation from emotion set images is constructed, and a series of improvements have been made in terms of the expression ability of the model and the convergence speed in training. The model is mainly composed of several coupling layers superimposed to form a multi-scale structure, in which each coupling layer contains 1*1 reversible convolution and linear operation modules. Furthermore, this paper sorted out the current public datasets of facial emotion images, built a new emotion dataset, and verified the model on this dataset. The experimental results show that, under the traditional convolutional neural network, the 3-layer 3*3 convolution kernel is more conducive to extracting the features of the face images. The introduction of principal component decomposition can improve the convergence speed of the model.

Submitted 27 May, 2021; originally announced June 2021.

Comments: 20 pages, 10 figures

arXiv:2106.03113 [pdf, other]

Improving Channel Decorrelation for Multi-Channel Target Speech Extraction

Authors: Jiangyu Han, Wei Rao, Yannan Wang, Yanhua Long

Abstract: Target speech extraction has attracted widespread attention. When microphone arrays are available, the additional spatial information can be helpful in extracting the target speech. We have recently proposed a channel decorrelation (CD) mechanism to extract the inter-channel differential information to enhance the reference channel encoder representation. Although the proposed mechanism has shown promising results for extracting the target speech from mixtures, the extraction performance is still limited by the nature of the original decorrelation theory. In this paper, we propose two methods to broaden the horizon of the original channel decorrelation, by replacing the original softmax-based inter-channel similarity between encoder representations, using an unrolled probability and a normalized cosine-based similarity at the dimensional-level. Moreover, new combination strategies of the CD-based spatial information and target speaker adaptation of parallel encoder outputs are also investigated. Experiments on the reverberant WSJ0 2-mix show that the improved CD can result in more discriminative differential information and the new adaptation strategy is also very effective in improving the target speech extraction.

Submitted 6 June, 2021; originally announced June 2021.

Comments: accepted to Interspeech 2021. arXiv admin note: text overlap with arXiv:2010.09191

arXiv:2103.14297 [pdf, other]

CNN-based Discriminative Training for Domain Compensation in Acoustic Event Detection with Frame-wise Classifier

Authors: Tiantian Tang, Xinyuan Zhou, Yanhua Long, Yijie Li, Jiaen Liang

Abstract: Domain mismatch is a noteworthy issue in acoustic event detection tasks, as the target domain data is difficult to access in most real applications. In this study, we propose a novel CNN-based discriminative training framework as a domain compensation method to handle this issue. It uses a parallel CNN-based discriminator to learn a pair of high-level intermediate acoustic representations. Together with a binary discriminative loss, the discriminators are forced to maximally exploit the discrimination of heterogeneous acoustic information in each audio clip with target events, which results in robust paired representations that can well discriminate the target events and background/domain variations separately. Moreover, to better learn the transient characteristics of target events, a frame-wise classifier is designed to perform the final classification. In addition, a two-stage training with the CNN-based discriminator initialization is further proposed to enhance the system training. All experiments are performed on the DCASE 2018 Task3 datasets. Results show that our proposal significantly outperforms the official baseline on cross-domain conditions in AUC by a relative 1.8-12.1% without any performance degradation on in-domain evaluation conditions.

Submitted 26 March, 2021; originally announced March 2021.

arXiv:2103.13581 [pdf, other]

doi 10.1109/TASLP.2022.3182856

EfficientTDNN: Efficient Architecture Search for Speaker Recognition

Authors: Rui Wang, Zhihua Wei, Haoran Duan, Shouling Ji, Yang Long, Zhen Hong

Abstract: Convolutional neural networks (CNNs), such as the time-delay neural network (TDNN), have shown their remarkable capability in learning speaker embedding. However, they meanwhile bring a huge computational cost in storage size, processing, and memory. Discovering the specialized CNN that meets a specific constraint requires a substantial effort of human experts. Compared with hand-designed approaches, neural architecture search (NAS) appears as a practical technique in automating the manual architecture design process and has attracted increasing interest in spoken language processing tasks such as speaker recognition. In this paper, we propose EfficientTDNN, an efficient architecture search framework consisting of a TDNN-based supernet and a TDNN-NAS algorithm. The proposed supernet introduces temporal convolution of different ranges of the receptive field and feature aggregation of various resolutions from different layers to TDNN. On top of it, the TDNN-NAS algorithm quickly searches for the desired TDNN architecture via weight-sharing subnets, which surprisingly reduces computation while handling the vast number of devices with various resource requirements. Experimental results on the VoxCeleb dataset show the proposed EfficientTDNN enables approximately $10^{13}$ architectures concerning depth, kernel, and width. Considering different computation constraints, it achieves a 2.20% equal error rate (EER) with 204M multiply-accumulate operations (MACs), 1.41% EER with 571M MACs as well as 0.94% EER with 1.45G MACs. Comprehensive investigations suggest that the trained supernet generalizes subnets not sampled during training and obtains a favorable trade-off between accuracy and efficiency.

Submitted 18 June, 2022; v1 submitted 24 March, 2021; originally announced March 2021.

Comments: 13 pages, 12 figures, accepted to TASLP

arXiv:2103.12388 [pdf, other]

doi 10.1016/j.dsp.2022.103446

Joint framework with deep feature distillation and adaptive focal loss for weakly supervised audio tagging and acoustic event detection

Authors: Yunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang, Yuping Wang

Abstract: A good joint training framework is very helpful to improve the performances of weakly supervised audio tagging (AT) and acoustic event detection (AED) simultaneously. In this study, we propose three methods to improve the best teacher-student framework in the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 4 for both audio tagging and acoustic event detection tasks. A frame-level target-events based deep feature distillation is first proposed, which aims to leverage the potential of limited strong-labeled data in a weakly supervised framework to learn better intermediate feature maps. Then, we propose an adaptive focal loss and two-stage training strategy to enable an effective and more accurate model training, where the contribution of hard and easy acoustic events to the total cost function can be automatically adjusted. Furthermore, an event-specific post processing is designed to improve the prediction of target event time-stamps. Our experiments are performed on the public DCASE 2019 Task 4 dataset; results show that our approach achieves competitive performances in both AT (81.2% F1-score) and AED (49.8% F1-score) tasks.

Submitted 12 February, 2022; v1 submitted 23 March, 2021; originally announced March 2021.

Comments: Updated, please refer to "https://sciencedirect.53yu.com/science/article/abs/pii/S105120042200063X"
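
The abstract above says the loss adjusts how much hard and easy acoustic events contribute to the total cost. As a rough illustration of that idea only, here is a minimal sketch of the standard (non-adaptive) binary focal loss. It is not the paper's adaptive variant, whose weighting rule is not given in this listing; the function name `binary_focal_loss` and the default `gamma=2.0` are assumptions made for this example.

```python
import math


def binary_focal_loss(prob, target, gamma=2.0, eps=1e-7):
    """Standard binary focal loss for one predicted probability.

    prob   -- predicted probability of the positive class
    target -- ground-truth label, 0 or 1
    gamma  -- focusing parameter (assumed default); gamma = 0 gives cross-entropy
    """
    # Clamp to avoid log(0).
    prob = min(max(prob, eps), 1.0 - eps)
    # p_t is the probability the model assigned to the true class.
    p_t = prob if target == 1 else 1.0 - prob
    # The factor (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident and correct
hard = binary_focal_loss(0.1, 1)  # confident and wrong
```

With `gamma = 2`, the easy example keeps about 1% of its cross-entropy loss and the hard example about 81%, so hard events dominate the total cost.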

arXiv:2012.01986 [pdf, other]

An Improved Iterative Neural Network for High-Quality Image-Domain Material Decomposition in Dual-Energy CT

Authors: Zhipeng Li, Yong Long, Il Yong Chun

Abstract: Dual-energy computed tomography (DECT) has been widely used in many applications that need material decomposition. Image-domain methods directly decompose material images from high- and low-energy attenuation images, and thus are susceptible to noise and artifacts on attenuation images. The purpose of this study is to develop an improved iterative neural network (INN) for high-quality image-domain material decomposition in DECT, and to study its properties. We propose a new INN architecture for DECT material decomposition. The proposed INN architecture uses distinct cross-material convolutional neural networks (CNNs) in image refining modules, and uses image decomposition physics in image reconstruction modules. The distinct cross-material CNN refiners incorporate distinct encoding-decoding filters and a cross-material model that captures correlations between different materials. We study the distinct cross-material CNN refiner with patch-based reformulation and tight-frame condition. Numerical experiments with extended cardiac-torso (XCAT) phantom and clinical data show that the proposed INN significantly improves the image quality over several image-domain material decomposition methods, including a conventional model-based image decomposition (MBID) method using an edge-preserving regularizer, a recent MBID method using pre-learned material-wise sparsifying transforms, and a noniterative deep CNN method. Our study with patch-based reformulations reveals that learned filters of distinct cross-material CNN refiners can approximately satisfy the tight-frame condition.

Submitted 21 January, 2022; v1 submitted 2 December, 2020; originally announced December 2020.

arXiv:2011.00428 [pdf, other]

Two-layer clustering-based sparsifying transform learning for low-dose CT reconstruction

Authors: Xikai Yang, Yong Long, Saiprasad Ravishankar

Abstract: Achieving high-quality reconstructions from low-dose computed tomography (LDCT) measurements is of much importance in clinical settings. Model-based image reconstruction methods have been proven to be effective in removing artifacts in LDCT. In this work, we propose an approach to learn a rich two-layer clustering-based sparsifying transform model (MCST2), where image patches and their subsequent feature maps (filter residuals) are clustered into groups with different learned sparsifying filters per group. We investigate a penalized weighted least squares (PWLS) approach for LDCT reconstruction incorporating learned MCST2 priors. Experimental results show the superior performance of the proposed PWLS-MCST2 approach compared to other related recent schemes.

Submitted 1 November, 2020; originally announced November 2020.

Comments: 5 pages, 3 figures, submitted to ISBI2021

arXiv:2010.10923 [pdf, other]

Attention-based scaling adaptation for target speech extraction

Authors: Jiangyu Han, Wei Rao, Yanhua Long, Jiaen Liang

Abstract: Target speech extraction has attracted widespread attention in recent years. In this work, we focus on investigating the dynamic interaction between different mixtures and the target speaker to exploit the discriminative target speaker clues. We propose a special attention mechanism without introducing any additional parameters in a scaling adaptation layer to better adapt the network towards extracting the target speech. Furthermore, by introducing a mixture embedding matrix pooling method, our proposed attention-based scaling adaptation (ASA) can exploit the target speaker clues in a more efficient way. Experimental results on the spatialized reverberant WSJ0 2-mix dataset demonstrate that the proposed method can improve the performance of the target speech extraction effectively. Furthermore, we find that under the same network configurations, the ASA in a single-channel condition can achieve performance gains competitive with those achieved from two-channel mixtures with inter-microphone phase difference (IPD) features.

Submitted 18 October, 2021; v1 submitted 18 October, 2020; originally announced October 2020.

Comments: 5 pages, 2 figures. Accepted by ASRU 2021

arXiv:2010.09191 [pdf, other]

Multi-channel target speech extraction with channel decorrelation and target speaker adaptation

Authors: Jiangyu Han, Xinyuan Zhou, Yanhua Long, Yijie Li

Abstract: The end-to-end approaches for single-channel target speech extraction have attracted widespread attention. However, the studies for end-to-end multi-channel target speech extraction are still relatively limited. In this work, we propose two methods for exploiting the multi-channel spatial information to extract the target speech. The first one is using a target speech adaptation layer in a parallel encoder architecture. The second one is designing a channel decorrelation mechanism to extract the inter-channel differential information to enhance the multi-channel encoder representation. We compare the proposed methods with two strong state-of-the-art baselines. Experimental results on the multi-channel reverberant WSJ0 2-mix dataset demonstrate that our proposed methods achieve up to 11.2% and 11.5% relative improvements in SDR and SiSDR respectively, which are the best reported results on this task to the best of our knowledge.

Submitted 21 October, 2020; v1 submitted 18 October, 2020; originally announced October 2020.

Comments: 5 pages, 3 figures. Submitted to ICASSP 2021

arXiv:2010.06144 [pdf, other]

doi 10.1002/mp.15013

Multi-layer Residual Sparsifying Transform (MARS) Model for Low-dose CT Image Reconstruction

Authors: Xikai Yang, Yong Long, Saiprasad Ravishankar

Abstract: Signal models based on sparse representations have received considerable attention in recent years. On the other hand, deep models consisting of a cascade of functional layers, commonly known as deep neural networks, have been highly successful for the task of object classification and have been recently introduced to image reconstruction. In this work, we develop a new image reconstruction approach based on a novel multi-layer model learned in an unsupervised manner by combining both sparse representations and deep models. The proposed framework extends the classical sparsifying transform model for images to a Multi-lAyer Residual Sparsifying transform (MARS) model, wherein the transform domain data are jointly sparsified over layers. We investigate the application of MARS models learned from limited regular-dose images for low-dose CT reconstruction using Penalized Weighted Least Squares (PWLS) optimization. We propose new formulations for multi-layer transform learning and image reconstruction. We derive an efficient block coordinate descent algorithm to learn the transforms across layers, in an unsupervised manner from limited regular-dose images. The learned model is then incorporated into the low-dose image reconstruction phase. Low-dose CT experimental results with both the XCAT phantom and Mayo Clinic data show that the MARS model outperforms conventional methods such as FBP and PWLS methods based on the edge-preserving (EP) regularizer in terms of two numerical metrics (RMSE and SSIM) and noise suppression. Compared with the single-layer learned transform (ST) model, the MARS model performs better in maintaining some subtle details.

Submitted 28 May, 2021; v1 submitted 10 October, 2020; originally announced October 2020.

Comments: 28 pages, 12 figures, accepted by Medical Physics. arXiv admin note: text overlap with arXiv:2005.03825

arXiv:2010.02761 [pdf, other]

doi 10.1109/TMI.2021.3095310

Unified Supervised-Unsupervised (SUPER) Learning for X-ray CT Image Reconstruction

Authors: Siqi Ye, Zhipeng Li, Michael T. McCann, Yong Long, Saiprasad Ravishankar

Abstract: Traditional model-based image reconstruction (MBIR) methods combine forward and noise models with simple object priors. Recent machine learning methods for image reconstruction typically involve supervised learning or unsupervised learning, both of which have their advantages and disadvantages. In this work, we propose a unified supervised-unsupervised (SUPER) learning framework for X-ray computed tomography (CT) image reconstruction. The proposed learning formulation combines both unsupervised learning-based priors (or even simple analytical priors) together with (supervised) deep network-based priors in a unified MBIR framework based on a fixed point iteration analysis. The proposed training algorithm is also an approximate scheme for a bilevel supervised training optimization problem, wherein the network-based regularizer in the lower-level MBIR problem is optimized using an upper-level reconstruction loss. The training problem is optimized by alternating between updating the network weights and iteratively updating the reconstructions based on those weights. We demonstrate the learned SUPER models' efficacy for low-dose CT image reconstruction, for which we use the NIH AAPM Mayo Clinic Low Dose CT Grand Challenge dataset for training and testing. In our experiments, we studied different combinations of supervised deep network priors and unsupervised learning-based or analytical priors. Both numerical and visual results show the superiority of the proposed unified SUPER methods over standalone supervised learning-based methods, iterative MBIR methods, and variations of SUPER obtained via ablation studies. We also show that the proposed algorithm converges rapidly in practice.

Submitted 8 April, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

Comments: 18 pages, 21 figures, submitted journal paper

Journal ref: IEEE Transactions on Medical Imaging, vol. 40, no. 11, pp. 2986-3001, Nov. 2021

arXiv:2007.13401 [pdf, ps, other]

IEEE 802.11be-Wi-Fi 7: New Challenges and Opportunities

Authors: Cailian Deng, Xuming Fang, Xiao Han, Xianbin Wang, Li Yan, Rong He, Yan Long, Yuchen Guo

Abstract: With the emergence of 4k/8k video, the throughput requirement of video delivery will keep growing to tens of Gbps. Other new high-throughput and low-latency video applications, including augmented reality (AR), virtual reality (VR), and online gaming, are also proliferating. Due to the related stringent requirements, supporting these applications over wireless local area network (WLAN) is far beyond the capabilities of the new WLAN standard -- IEEE 802.11ax. To meet these emerging demands, the IEEE 802.11 will release a new amendment standard IEEE 802.11be -- Extremely High Throughput (EHT), also known as Wireless-Fidelity (Wi-Fi) 7. This article provides a comprehensive survey on the key medium access control (MAC) layer techniques and physical layer (PHY) techniques being discussed in the EHT task group, including the channelization and tone plan, multiple resource units (multi-RU) support, 4096 quadrature amplitude modulation (4096-QAM), preamble designs, multiple link operations (e.g., multi-link aggregation and channel access), multiple input multiple output (MIMO) enhancement, multiple access point (multi-AP) coordination (e.g., multi-AP joint transmission), enhanced link adaptation and retransmission protocols (e.g., hybrid automatic repeat request (HARQ)). This survey covers both the critical technologies being discussed in the EHT standard and the related latest progress from worldwide research. Besides, the potential developments beyond EHT are discussed to provide some possible future research directions for WLAN.

Submitted 3 August, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

Comments: Accepted for publication in IEEE Communications Surveys and Tutorials

arXiv:2006.10414 [pdf, other]

Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition

Authors: Xinyuan Zhou, Emre Yılmaz, Yanhua Long, Yijie Li, Haizhou Li

Abstract: Code-switching (CS) occurs when a speaker alternates words of two or more languages within a single sentence or across sentences. Automatic speech recognition (ASR) of CS speech has to deal with two or more languages at the same time. In this study, we propose a Transformer-based architecture with two symmetric language-specific encoders to capture the individual language attributes, which improve the acoustic representation of each language. These representations are combined using a language-specific multi-head attention mechanism in the decoder module. Each encoder and its corresponding attention module in the decoder are pre-trained using a large monolingual corpus aiming to alleviate the impact of limited CS training data. We call such a network a multi-encoder-decoder (MED) architecture. Experiments on the SEAME corpus show that the proposed MED architecture achieves 10.2% and 10.8% relative error rate reduction on the CS evaluation sets with Mandarin and English as the matrix language respectively.

Submitted 18 June, 2020; originally announced June 2020.

arXiv:2006.10407 [pdf, other]

Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-based LVCSR

Authors: Xinyuan Zhou, Grandee Lee, Emre Yılmaz, Yanhua Long, Jiaen Liang, Haizhou Li

Abstract: The Transformer has shown impressive performance in automatic speech recognition. It uses the encoder-decoder structure with self-attention to learn the relationship between the high-level representation of the source inputs and embedding of the target outputs. In this paper, we propose a novel decoder structure that features a self-and-mixed attention decoder (SMAD) with a deep acoustic structure (DAS) to improve the acoustic representation of Transformer-based LVCSR. Specifically, we introduce a self-attention mechanism to learn a multi-layer deep acoustic structure for multiple levels of acoustic abstraction. We also design a mixed attention mechanism that learns the alignment between different levels of acoustic abstraction and its corresponding linguistic information simultaneously in a shared embedding space. The ASR experiments on Aishell-1 show that the proposed structure achieves CERs of 4.8% on the dev set and 5.1% on the test set, which are the best results obtained on this task to the best of our knowledge.

Submitted 15 September, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: Accepted by INTERSPEECH 2020

arXiv:2005.03825 [pdf, other]

Learned Multi-layer Residual Sparsifying Transform Model for Low-dose CT Reconstruction

Authors: Xikai Yang, Xuehang Zheng, Yong Long, Saiprasad Ravishankar

Abstract: Signal models based on sparse representation have received considerable attention in recent years. Compared to synthesis dictionary learning, sparsifying transform learning involves highly efficient sparse coding and operator update steps. In this work, we propose a Multi-layer Residual Sparsifying Transform (MRST) learning model wherein the transform domain residuals are jointly sparsified over layers. In particular, the transforms for the deeper layers exploit the more intricate properties of the residual maps. We investigate the application of the learned MRST model for low-dose CT reconstruction using Penalized Weighted Least Squares (PWLS) optimization. Experimental results on Mayo Clinic data show that the MRST model outperforms conventional methods such as FBP and PWLS methods based on edge-preserving (EP) regularizer and single-layer transform (ST) model, especially for maintaining some subtle details.

Submitted 7 May, 2020; originally announced May 2020.

arXiv:2004.08498 [pdf, other]

doi 10.1364/JOSAB.391297

Enhanced principle component method for fringe removal in cold atom images

Authors: Feng Xiong, Yun Long, Colin V. Parker

Abstract: Many powerful imaging techniques for cold atoms are based on determining the optical density by comparing a beam image having passed through the atom cloud to a reference image taken under similar conditions with no atoms. In practice the beam profile typically contains interference fringes whose phase is not stable between camera exposures. To reduce the error of these fringes in the computed optical density, an algorithm based on principal component analysis (PCA) is often employed. However, PCA is general purpose and not tailored to the specific case of interference fringes. Here we demonstrate an algorithm that takes advantage of the Fourier-space structure of interference fringes to further reduce the residual fringe signatures in the optical density.

Submitted 17 April, 2020; originally announced April 2020.
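
The abstract above mentions the commonly used PCA-based baseline for fringe removal. A minimal sketch of that baseline (not the paper's Fourier-space enhancement) might look like the following. It assumes images are flattened to 1D pixel vectors; the function name `pca_reference` and its parameters are hypothetical. It also omits the masking of the atom region that a real analysis would likely need before projecting.

```python
import numpy as np


def pca_reference(refs, image, n_components):
    """Build a fringe-matched reference for `image` from atom-free shots.

    refs         -- array (n_refs, n_pixels) of reference images without atoms
    image        -- array (n_pixels,), the beam image to be matched
    n_components -- number of principal components to keep
    """
    mean = refs.mean(axis=0)
    centered = refs - mean
    # Rows of vt are orthonormal principal directions in pixel space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    # Project the image onto the retained components and reconstruct.
    coeffs = basis @ (image - mean)
    return mean + coeffs @ basis


# Synthetic example: a flat beam plus a fringe of varying amplitude.
x = np.linspace(0.0, 1.0, 64)
fringe = np.sin(2.0 * np.pi * 8.0 * x)
refs = np.stack([1.0 + a * fringe for a in (0.10, 0.20, 0.30)])
image = 1.0 + 0.25 * fringe
matched = pca_reference(refs, image, n_components=1)
```

The matched reference would then stand in for the raw reference image when computing the optical density, so that fringes common to both images cancel.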

arXiv:2002.12018 [pdf, other]

Momentum-Net for Low-Dose CT Image Reconstruction

Authors: Siqi Ye, Yong Long, Il Yong Chun

Abstract: This paper applies the recent fast iterative neural network framework, Momentum-Net, using appropriate models to low-dose X-ray computed tomography (LDCT) image reconstruction. At each layer of the proposed Momentum-Net, the model-based image reconstruction module solves the majorized penalized weighted least-square problem, and the image refining module uses a four-layer convolutional neural network (CNN). Experimental results with the NIH AAPM-Mayo Clinic Low Dose CT Grand Challenge dataset show that the proposed Momentum-Net architecture significantly improves image reconstruction accuracy, compared to a state-of-the-art noniterative image denoising deep neural network (NN), WavResNet (in LDCT). We also investigated the spectral normalization technique that applies to image refining NN learning to satisfy the nonexpansive NN property; however, experimental results show that this does not improve the image reconstruction performance of Momentum-Net.

Submitted 8 September, 2020; v1 submitted 27 February, 2020; originally announced February 2020.

Comments: Five pages conference paper. Accepted by 2020 Asilomar Conference on Signals, Systems, and Computers

arXiv:1910.12024 [pdf, other]

SUPER Learning: A Supervised-Unsupervised Framework for Low-Dose CT Image Reconstruction

Authors: Zhipeng Li, Siqi Ye, Yong Long, Saiprasad Ravishankar

Abstract: Recent years have witnessed growing interest in machine learning-based models and techniques for low-dose X-ray CT (LDCT) imaging tasks. The methods can typically be categorized into supervised learning methods and unsupervised or model-based learning methods. Supervised learning methods have recently shown success in image restoration tasks. However, they often rely on large training sets. Model-based learning methods such as dictionary or transform learning do not require large or paired training sets and often have good generalization properties, since they learn general properties of CT image sets. Recent works have shown the promising reconstruction performance of methods such as PWLS-ULTRA that rely on clustering the underlying (reconstructed) image patches into a learned union of transforms. In this paper, we propose a new Supervised-UnsuPERvised (SUPER) reconstruction framework for LDCT image reconstruction that combines the benefits of supervised learning methods and (unsupervised) transform learning-based methods such as PWLS-ULTRA that involve highly image-adaptive clustering. The SUPER model consists of several layers, each of which includes a deep network learned in a supervised manner and an unsupervised iterative method that involves image-adaptive components. The SUPER reconstruction algorithms are learned in a greedy manner from training data. The proposed SUPER learning methods dramatically outperform both the constituent supervised learning-based networks and iterative algorithms for LDCT, and use far fewer iterations in the iterative reconstruction modules.

Submitted 26 October, 2019; originally announced October 2019.

Comments: Accepted to International Conference on Computer Vision (ICCV) - Learning for Computational Imaging (LCI) Workshop, 2019

arXiv:1908.01287 [pdf, other]

BCD-Net for Low-dose CT Reconstruction: Acceleration, Convergence, and Generalization

Authors: Il Yong Chun, Xuehang Zheng, Yong Long, Jeffrey A. Fessler

Abstract: Obtaining accurate and reliable images from low-dose computed tomography (CT) is challenging. Regression convolutional neural network (CNN) models that are learned from training data are increasingly gaining attention in low-dose CT reconstruction. This paper modifies the architecture of an iterative regression CNN, BCD-Net, for fast, stable, and accurate low-dose CT reconstruction, and presents the convergence property of the modified BCD-Net. Numerical results with phantom data show that applying faster numerical solvers to model-based image reconstruction (MBIR) modules of BCD-Net leads to faster and more accurate BCD-Net; BCD-Net significantly improves the reconstruction accuracy, compared to the state-of-the-art MBIR method using learned transforms; BCD-Net achieves better image quality, compared to a state-of-the-art iterative NN architecture, ADMM-Net. Numerical results with clinical data show that BCD-Net generalizes significantly better than a state-of-the-art deep (non-iterative) regression NN, FBPConvNet, that lacks MBIR modules.

Submitted 4 August, 2019; originally announced August 2019.

Comments: Accepted to MICCAI 2019, and the authors indicated by asterisks (*) equally contributed to this work

arXiv:1906.00165 [pdf, other]

Two-layer Residual Sparsifying Transform Learning for Image Reconstruction

Authors: Xuehang Zheng, Saiprasad Ravishankar, Yong Long, Marc Louis Klasky, Brendt Wohlberg

Abstract: Signal models based on sparsity, low-rank and other properties have been exploited for image reconstruction from limited and corrupted data in medical imaging and other computational imaging applications. In particular, sparsifying transform models have shown promise in various applications, and offer numerous advantages such as efficiencies in sparse coding and learning. This work investigates pre-learning a two-layer extension of the transform model for image reconstruction, wherein the transform domain or filtering residuals of the image are further sparsified in the second layer. The proposed block coordinate descent optimization algorithms involve highly efficient updates. Preliminary numerical experiments demonstrate the usefulness of a two-layer model over the previous related schemes for CT image reconstruction from low-dose measurements.

Submitted 7 January, 2020; v1 submitted 1 June, 2019; originally announced June 2019.

Comments: Accepted to IEEE ISBI 2020

arXiv:1901.00106 [pdf, other]

DECT-MULTRA: Dual-Energy CT Image Decomposition With Learned Mixed Material Models and Efficient Clustering

Authors: Zhipeng Li, Saiprasad Ravishankar, Yong Long, Jeffrey A. Fessler

Abstract: Dual energy computed tomography (DECT) imaging plays an important role in advanced imaging applications due to its material decomposition capability. Image-domain decomposition operates directly on CT images using linear matrix inversion, but the decomposed material images can be severely degraded by noise and artifacts. This paper proposes a new method dubbed DECT-MULTRA for image-domain DECT material decomposition that combines conventional penalized weighted-least squares (PWLS) estimation with regularization based on a mixed union of learned transforms (MULTRA) model. Our proposed approach pre-learns a union of common-material sparsifying transforms from patches extracted from all the basis materials, and a union of cross-material sparsifying transforms from multi-material patches. The common-material transforms capture the common properties among different material images, while the cross-material transforms capture the cross-dependencies. The proposed PWLS formulation is optimized efficiently by alternating between an image update step and a sparse coding and clustering step, with both of these steps having closed-form solutions. The effectiveness of our method is validated with both XCAT phantom and clinical head data. The results demonstrate that our proposed method provides superior material image quality and decomposition accuracy compared to other competing methods.

Submitted 18 August, 2019; v1 submitted 1 January, 2019; originally announced January 2019.

arXiv:1810.12126 [pdf, other]

ActionXPose: A Novel 2D Multi-view Pose-based Algorithm for Real-time Human Action Recognition

Authors: Federico Angelini, Zeyu Fu, Yang Long, Ling Shao, Syed Mohsen Naqvi

Abstract: We present ActionXPose, a novel 2D pose-based algorithm for posture-level Human Action Recognition (HAR). The proposed approach exploits 2D human poses provided by the OpenPose detector from RGB videos. ActionXPose processes pose data to be provided to a Long Short-Term Memory Neural Network and to a 1D Convolutional Neural Network, which solve the classification problem. ActionXPose is one of the first algorithms that exploits 2D human poses for HAR. The algorithm has real-time performance, is robust to camera movement, subject proximity changes, viewpoint changes, and subject appearance changes, and provides a high degree of generalization. In fact, extensive simulations show that ActionXPose can be successfully trained using different datasets at once. State-of-the-art performance on popular datasets for posture-related HAR problems (i3DPost, KTH) is provided and results are compared with those obtained by other methods, including the selected ActionXPose baseline. Moreover, we also propose two novel datasets called MPOSE and ISLD, recorded in our Intelligent Sensing Lab, to show ActionXPose generalization performance.

Submitted 29 October, 2018; originally announced October 2018.

arXiv:1808.08791 [pdf, other]

doi 10.1109/TMI.2019.2934933

SPULTRA: Low-Dose CT Image Reconstruction with Joint Statistical and Learned Image Models

Authors: Siqi Ye, Saiprasad Ravishankar, Yong Long, Jeffrey A. Fessler

Abstract: Low-dose CT image reconstruction has been a popular research topic in recent years. A typical reconstruction method based on post-log measurements is called penalized weighted-least squares (PWLS). Due to the underlying limitations of the post-log statistical model, the PWLS reconstruction quality is often degraded in low-dose scans. This paper investigates a shifted-Poisson (SP) model based likelihood function that uses the pre-log raw measurements that better represents the measurement statistics, together with a data-driven regularizer exploiting a Union of Learned TRAnsforms (SPULTRA). Both the SP-induced data-fidelity term and the regularizer in the proposed framework are nonconvex. The proposed SPULTRA algorithm uses quadratic surrogate functions for the SP-induced data-fidelity term. Each iteration involves a quadratic subproblem for updating the image, and a sparse coding and clustering subproblem that has a closed-form solution. The SPULTRA algorithm has a computational cost per iteration similar to that of its recent counterpart PWLS-ULTRA, which uses post-log measurements, and it provides better image reconstruction quality than PWLS-ULTRA, especially in low-dose scans.

Submitted 12 August, 2019; v1 submitted 27 August, 2018; originally announced August 2018.

Comments: Accepted to IEEE Transactions on Medical Imaging

Showing 1–50 of 52 results for author: Long, Y