Search | arXiv e-print repository

Enhancing Medical Imaging with GANs Synthesizing Realistic Images from Limited Data

Authors: Yinqiu Feng, Bo Zhang, Lingxi Xiao, Yutian Yang, Tana Gegen, Zexi Chen

Abstract: In this research, we introduce an innovative method for synthesizing medical images using generative adversarial networks (GANs). Our proposed GANs method demonstrates the capability to produce realistic synthetic images even when trained on a limited quantity of real medical image data, showcasing commendable generalization prowess. To achieve this, we devised a generator and discriminator networ… ▽ More In this research, we introduce an innovative method for synthesizing medical images using generative adversarial networks (GANs). Our proposed GANs method demonstrates the capability to produce realistic synthetic images even when trained on a limited quantity of real medical image data, showcasing commendable generalization prowess. To achieve this, we devised a generator and discriminator network architecture founded on deep convolutional neural networks (CNNs), leveraging the adversarial training paradigm for model optimization. Through extensive experimentation across diverse medical image datasets, our method exhibits robust performance, consistently generating synthetic images that closely emulate the structural and textural attributes of authentic medical images. △ Less

Submitted 22 May, 2024; originally announced June 2024.

arXiv:2406.16981 [pdf]

Research on Feature Extraction Data Processing System For MRI of Brain Diseases Based on Computer Deep Learning

Authors: Lingxi Xiao, **xin Hu, Yutian Yang, Yinqiu Feng, Zichao Li, Zexi Chen

Abstract: Most of the existing wavelet image processing techniques are carried out in the form of single-scale reconstruction and multiple iterations. However, processing high-quality fMRI data presents problems such as mixed noise and excessive computation time. This project proposes the use of matrix operations by combining mixed noise elimination methods with wavelet analysis to replace traditional itera… ▽ More Most of the existing wavelet image processing techniques are carried out in the form of single-scale reconstruction and multiple iterations. However, processing high-quality fMRI data presents problems such as mixed noise and excessive computation time. This project proposes the use of matrix operations by combining mixed noise elimination methods with wavelet analysis to replace traditional iterative algorithms. Functional magnetic resonance imaging (fMRI) of the auditory cortex of a single subject is analyzed and compared to the wavelet domain signal processing technology based on repeated times and the world's most influential SPM8. Experiments show that this algorithm is the fastest in computing time, and its detection effect is comparable to the traditional iterative algorithm. However, this has a higher practical value for the processing of FMRI data. In addition, the wavelet analysis method proposed signal processing to speed up the calculation rate. △ Less

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.10910 [pdf, ps, other]

Fast Fractional Programming for Multi-Cell Integrated Sensing and Communications

Authors: Yannan Chen, Yi Feng, Xiaoyang Li, Licheng Zhao, Kaiming Shen

Abstract: This paper concerns the coordinate multi-cell beamforming design for integrated sensing and communications (ISAC). In particular, we assume that each base station (BS) has massive antennas. The optimization objective is to maximize a weighted sum of the data rates (for communications) and the Fisher information (for sensing). We first show that the conventional beamforming method for the multiple-… ▽ More This paper concerns the coordinate multi-cell beamforming design for integrated sensing and communications (ISAC). In particular, we assume that each base station (BS) has massive antennas. The optimization objective is to maximize a weighted sum of the data rates (for communications) and the Fisher information (for sensing). We first show that the conventional beamforming method for the multiple-input multiple-output (MIMO) transmission, i.e., the weighted minimum mean square error (WMMSE) algorithm, has a natural extension to the ISAC problem scenario from a fractional programming (FP) perspective. However, the extended WMMSE algorithm requires computing the $N\times N$ matrix inverse extensively, where $N$ is proportional to the antenna array size, so the algorithm becomes quite costly when antennas are massively deployed. To address this issue, we develop a nonhomogeneous bound and use it in conjunction with the FP technique to solve the ISAC beamforming problem without the need to invert any large matrices. It is further shown that the resulting new FP algorithm has an intimate connection with gradient projection, based on which we can accelerate the convergence via Nesterov's gradient extrapolation. △ Less

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.10606 [pdf, other]

Semantic Communication for Edge Intelligence Enabled Autonomous Driving System

Authors: Yunqi Feng, Hesheng Shen, Zhendong Shan, Qianqian Yang, Xiufang Shi

Abstract: Expected to provide higher transportation efficiency and security, autonomous driving has attracted substantial attentions from both industry and academia. Meanwhile, the emergence of edge intelligence has further introduced significant advancements to this field. However, the crucial demands of ultra-reliable and low-latency communications (URLLC) among the vehicles and edge servers have hindered… ▽ More Expected to provide higher transportation efficiency and security, autonomous driving has attracted substantial attentions from both industry and academia. Meanwhile, the emergence of edge intelligence has further introduced significant advancements to this field. However, the crucial demands of ultra-reliable and low-latency communications (URLLC) among the vehicles and edge servers have hindered the development of autonomous driving. In this article, we provide a brief overview of edge intelligence enabled autonomous driving system and current vehicle-to-everything (V2X) technologies. Moreover, challenges associated with massive data transmission in autonomous driving are highlighted from three perspectives: multi-modal data transmission and fusion, multi-user collaboration and connection, and multi-task training and execution. To cope with these challenges, we propose to incorporate semantic communication into autonomous driving to achieve highly efficient and task-oriented data transmission. Unlike traditional communications, semantic communication extracts task-relevant semantic feature from multi-sensory data. Specifically, a unified multi-user semantic communication system for transmitting multi-modal data and performing multi-task execution is designed for collaborative data transmission and decision making in autonomous driving. Simulation results demonstrate that the proposed system can significantly reduce data transmission volume without compromising task performance, as evidenced by the realization of a cooperative multi-vehicle target classification and detection task. △ Less

Submitted 15 June, 2024; originally announced June 2024.

Comments: This paper has been submitted to IEEE Network Magazine, and is ungergoing major revisions

arXiv:2406.07330 [pdf, other]

CTC-based Non-autoregressive Textless Speech-to-Speech Translation

Authors: Qingkai Fang, Zhengrui Ma, Yan Zhou, Min Zhang, Yang Feng

Abstract: Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding due to the considerable length of speech sequences. Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet the translation quality typically lags behind autoregressive (AR) models significantly. In this paper, we investig… ▽ More Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding due to the considerable length of speech sequences. Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet the translation quality typically lags behind autoregressive (AR) models significantly. In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation. Experimental results demonstrate that by combining pretraining, knowledge distillation, and advanced NAR training techniques such as glancing training and non-monotonic latent alignments, CTC-based NAR models achieve translation quality comparable to the AR model, while preserving up to 26.81$\times$ decoding speedup. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: ACL 2024 Findings

ACM Class: I.2.7

arXiv:2406.07289 [pdf, other]

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Authors: Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng

Abstract: Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, the training of these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data… ▽ More Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, the training of these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data and pretrained models, which have not been fully utilized in the development of S2ST models. Inspired by this, in this paper, we first introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model. Furthermore, to eliminate the reliance on parallel speech data, we propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data. It aligns representations in the latent space through contrastive learning, enabling the speech synthesis capability learned from the TTS data to generalize to S2ST in a zero-shot manner. Experimental results on the CVSS dataset show that when the parallel speech data is available, ComSpeech surpasses previous two-pass models like UnitY and Translatotron 2 in both translation quality and decoding speed. When there is no parallel speech data, ComSpeech-ZS lags behind \name by only 0.7 ASR-BLEU and outperforms the cascaded models. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: ACL 2024 main conference. Project Page: https://ictnlp.github.io/ComSpeech-Site/

ACM Class: I.2.7

arXiv:2406.06937 [pdf, other]

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

Authors: Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang

Abstract: Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization betwee… ▽ More Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2X outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: ACL 2024; Codes and demos are at https://github.com/ictnlp/NAST-S2x

arXiv:2406.03049 [pdf, other]

StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning

Authors: Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng

Abstract: Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opportune moment within speech inputs, thereby posing… ▽ More Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opportune moment within speech inputs, thereby posing a double challenge of translation and policy. In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified framework of multi-task learning. Adhering to a multi-task learning approach, StreamSpeech can perform offline and simultaneous speech recognition, speech translation and speech synthesis via an "All-in-One" seamless model. Experiments on CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art performance in both offline S2ST and Simul-S2ST tasks. Besides, StreamSpeech is able to present high-quality intermediate results (i.e., ASR or translation results) during simultaneous translation process, offering a more comprehensive real-time communication experience. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024 main conference, Project Page: https://ictnlp.github.io/StreamSpeech-site/

arXiv:2405.15339 [pdf, other]

Environment Sensing-aided Beam Prediction with Transfer Learning for Smart Factory

Authors: Yuan Feng, Chuanbing Zhao, Feifei Gao, Yong Zhang, Shaodan Ma

Abstract: In this paper, we propose an environment sensing-aided beam prediction model for smart factory that can be transferred from given environments to a new environment. In particular, we first design a pre-training model that predicts the optimal beam by sensing the present environmental information. When encountering a new environment, it generally requires collecting a large amount of new training d… ▽ More In this paper, we propose an environment sensing-aided beam prediction model for smart factory that can be transferred from given environments to a new environment. In particular, we first design a pre-training model that predicts the optimal beam by sensing the present environmental information. When encountering a new environment, it generally requires collecting a large amount of new training data to retrain the model, whose cost severely impedes the application of the designed pre-training model. Therefore, we next design a transfer learning strategy that fine-tunes the pre-trained model by limited labeled data of the new environment. Simulation results show that when the pre-trained model is fine-tuned by 30\% of labeled data from the new environment, the Top-10 beam prediction accuracy reaches 94\%. Moreover, compared with the way to completely re-training the prediction model, the amount of training data and the time cost of the proposed transfer learning strategy reduce 70\% and 75\% respectively. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.10463 [pdf, other]

Single-shot volumetric fluorescence imaging with neural fields

Authors: Oumeng Zhang, Haowen Zhou, Brandon Y. Feng, Elin M. Larsson, Reinaldo E. Alcalde, Siyuan Yin, Catherine Deng, Changhuei Yang

Abstract: Single-shot volumetric fluorescence (SVF) imaging offers a significant advantage over traditional imaging methods that require scanning across multiple axial planes as it can capture biological processes with high temporal resolution across a large field of view. The key challenges in SVF imaging include requiring sparsity constraints to meet the multiplexing requirements of compressed sensing, el… ▽ More Single-shot volumetric fluorescence (SVF) imaging offers a significant advantage over traditional imaging methods that require scanning across multiple axial planes as it can capture biological processes with high temporal resolution across a large field of view. The key challenges in SVF imaging include requiring sparsity constraints to meet the multiplexing requirements of compressed sensing, eliminating depth ambiguity in the reconstruction, and maintaining high resolution across a large field of view. In this paper, we introduce the QuadraPol point spread function (PSF) combined with neural fields, a novel approach for SVF imaging. This method utilizes a custom polarizer at the back focal plane and a polarization camera to detect fluorescence, effectively encoding the 3D scene within a compact PSF without depth ambiguity. Additionally, we propose a reconstruction algorithm based on the neural fields technique that provides improved reconstruction quality and addresses the inaccuracies of phase retrieval methods used to correct imaging system aberrations. This algorithm combines the accuracy of experimental PSFs with the long depth of field of computationally generated retrieved PSFs. QuadraPol PSF, combined with neural fields, significantly reduces the acquisition time of a conventional fluorescence microscope by approximately 20 times and captures a 100 mm$^3$ cubic volume in one shot. We validate the effectiveness of both our hardware and algorithm through all-in-focus imaging of bacterial colonies on sand surfaces and visualization of plant root morphology. Our approach offers a powerful tool for advancing biological research and ecological studies. △ Less

Submitted 4 June, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

arXiv:2404.13677 [pdf, other]

A Dataset and Model for Realistic License Plate Deblurring

Authors: Haoyan Gong, Yuzheng Feng, Zhenrong Zhang, Xianxu Hou, **gxin Liu, Siqi Huang, Hongbin Liu

Abstract: Vehicle license plate recognition is a crucial task in intelligent traffic management systems. However, the challenge of achieving accurate recognition persists due to motion blur from fast-moving vehicles. Despite the widespread use of image synthesis approaches in existing deblurring and recognition algorithms, their effectiveness in real-world scenarios remains unproven. To address this, we int… ▽ More Vehicle license plate recognition is a crucial task in intelligent traffic management systems. However, the challenge of achieving accurate recognition persists due to motion blur from fast-moving vehicles. Despite the widespread use of image synthesis approaches in existing deblurring and recognition algorithms, their effectiveness in real-world scenarios remains unproven. To address this, we introduce the first large-scale license plate deblurring dataset named License Plate Blur (LPBlur), captured by a dual-camera system and processed through a post-processing pipeline to avoid misalignment issues. Then, we propose a License Plate Deblurring Generative Adversarial Network (LPDGAN) to tackle the license plate deblurring: 1) a Feature Fusion Module to integrate multi-scale latent codes; 2) a Text Reconstruction Module to restore structure through textual modality; 3) a Partition Discriminator Module to enhance the model's perception of details in each letter. Extensive experiments validate the reliability of the LPBlur dataset for both model training and testing, showcasing that our proposed model outperforms other state-of-the-art motion deblurring methods in realistic license plate deblurring scenarios. The dataset and code are available at https://github.com/haoyGONG/LPDGAN. △ Less

Submitted 22 April, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

Comments: Accepted by IJCAI 2024

arXiv:2404.07985 [pdf, other]

WaveMo: Learning Wavefront Modulations to See Through Scattering

Authors: Mingyang Xie, Haiyun Guo, Brandon Y. Feng, Lingbo **, Ashok Veeraraghavan, Christopher A. Metzler

Abstract: Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation, which induces measurement diversity during image acquisition. Despite its importance, designing optimal wavefront modulations to image through scattering remains under-explored. This paper introdu… ▽ More Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation, which induces measurement diversity during image acquisition. Despite its importance, designing optimal wavefront modulations to image through scattering remains under-explored. This paper introduces a novel learning-based framework to address the gap. Our approach jointly optimizes wavefront modulations and a computationally lightweight feedforward "proxy" reconstruction network. This network is trained to recover scenes obscured by scattering, using measurements that are modified by these modulations. The learned modulations produced by our framework generalize effectively to unseen scattering scenarios and exhibit remarkable versatility. During deployment, the learned modulations can be decoupled from the proxy network to augment other more computationally expensive restoration algorithms. Through extensive experiments, we demonstrate our approach significantly advances the state of the art in imaging through scattering media. Our project webpage is at https://wavemo-2024.github.io/. △ Less

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2403.06463 [pdf, other]

A prediction-based forward-looking vehicle dispatching strategy for dynamic ride-pooling

Authors: Xiaolei Wang, Chen Yang, Yuzhen Feng, Luohan Hu, Zhengbing He

Abstract: For on-demand dynamic ride-pooling services, e.g., Uber Pool and Didi Pinche, a well-designed vehicle dispatching strategy is crucial for platform profitability and passenger experience. Most existing dispatching strategies overlook incoming pairing opportunities, therefore suffer from short-sighted limitations. In this paper, we propose a forward-looking vehicle dispatching strategy, which first… ▽ More For on-demand dynamic ride-pooling services, e.g., Uber Pool and Didi Pinche, a well-designed vehicle dispatching strategy is crucial for platform profitability and passenger experience. Most existing dispatching strategies overlook incoming pairing opportunities, therefore suffer from short-sighted limitations. In this paper, we propose a forward-looking vehicle dispatching strategy, which first predicts the expected distance saving that could be brought about by future orders and then solves a bipartite matching problem based on the prediction to match passengers with partially occupied or vacant vehicles or keep passengers waiting for next rounds of matching. To demonstrate the performance of the proposed strategy, a number of simulation experiments and comparisons are conducted based on the real-world road network and historical trip data from Haikou, China. Results show that the proposed strategy outperform the baseline strategies by generating approximately 31\% more distance saving and 18\% less average passenger detour distance. It indicates the significant benefits of considering future pairing opportunities in dispatching, and highlights the effectiveness of our innovative forward-looking vehicle dispatching strategy in improving system efficiency and user experience for dynamic ride-pooling services. △ Less

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2402.18856 [pdf, other]

Anatomy-guided fiber trajectory distribution estimation for cranial nerves tractography

Authors: Lei Xie, Qingrun Zeng, Huajun Zhou, Guoqiang Xie, Mingchu Li, Jiahao Huang, Jianan Cui, Hao Chen, Yuan**g Feng

Abstract: Diffusion MRI tractography is an important tool for identifying and analyzing the intracranial course of cranial nerves (CNs). However, the complex environment of the skull base leads to ambiguous spatial correspondence between diffusion directions and fiber geometry, and existing diffusion tractography methods of CNs identification are prone to producing erroneous trajectories and missing true po… ▽ More Diffusion MRI tractography is an important tool for identifying and analyzing the intracranial course of cranial nerves (CNs). However, the complex environment of the skull base leads to ambiguous spatial correspondence between diffusion directions and fiber geometry, and existing diffusion tractography methods of CNs identification are prone to producing erroneous trajectories and missing true positive connections. To overcome the above challenge, we propose a novel CNs identification framework with anatomy-guided fiber trajectory distribution, which incorporates anatomical shape prior knowledge during the process of CNs tracing to build diffusion tensor vector fields. We introduce higher-order streamline differential equations for continuous flow field representations to directly characterize the fiber trajectory distribution of CNs from the tract-based level. The experimental results on the vivo HCP dataset and the clinical MDM dataset demonstrate that the proposed method reduces false-positive fiber production compared to competing methods and produces reconstructed CNs (i.e. CN II, CN III, CN V, and CN VII/VIII) that are judged to better correspond to the known anatomy. △ Less

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2401.08120 [pdf]

Operation Scheme Optimizations to Achieve Ultra-high Endurance (1010) in Flash Memory with Robust Reliabilities

Authors: Yang Feng, Zhaohui Sun, Chengcheng Wang, Xinyi Guo, Junyao Mei, Yueran Qi, **g Liu, Junyu Zhang, Jixuan Wu, Xuepeng Zhan, Jiezhi Chen

Abstract: Flash memory has been widely adopted as stand-alone memory and embedded memory due to its robust reliability. However, the limited endurance obstacles its further applications in storage class memory (SCM) and to proceed endurance-required computing-in-memory (CIM) tasks. In this work, the optimization strategies have been studied to tackle this concern. It is shown that by adopting the channel ho… ▽ More Flash memory has been widely adopted as stand-alone memory and embedded memory due to its robust reliability. However, the limited endurance obstacles its further applications in storage class memory (SCM) and to proceed endurance-required computing-in-memory (CIM) tasks. In this work, the optimization strategies have been studied to tackle this concern. It is shown that by adopting the channel hot electrons injection (CHEI) and hot hole injection (HHI) to implement program/erase (PE) cycling together with a balanced memory window (MW) at the high-Vth (HV) mode, impressively, the endurance can be greatly extended to 1010 PE cycles, which is a record-high value in flash memory. Moreover, by using the proposed electric-field-assisted relaxation (EAR) scheme, the degradation of flash cells can be well suppressed with better subthreshold swings (SS) and lower leakage currents (sub-10pA after 1010 PE cycles). Our results shed light on the optimization strategy of flash memory to serve as SCM and implementendurance-required CIM tasks. △ Less

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.01685 [pdf]

Modality Exchange Network for Retinogeniculate Visual Pathway Segmentation

Authors: Hua Han, Cheng Li, Lei Xie, Yuan**g Feng, Alou Diakite, Shanshan Wang

Abstract: Accurate segmentation of the retinogeniculate visual pathway (RGVP) aids in the diagnosis and treatment of visual disorders by identifying disruptions or abnormalities within the pathway. However, the complex anatomical structure and connectivity of RGVP make it challenging to achieve accurate segmentation. In this study, we propose a novel Modality Exchange Network (ME-Net) that effectively utili… ▽ More Accurate segmentation of the retinogeniculate visual pathway (RGVP) aids in the diagnosis and treatment of visual disorders by identifying disruptions or abnormalities within the pathway. However, the complex anatomical structure and connectivity of RGVP make it challenging to achieve accurate segmentation. In this study, we propose a novel Modality Exchange Network (ME-Net) that effectively utilizes multi-modal magnetic resonance (MR) imaging information to enhance RGVP segmentation. Our ME-Net has two main contributions. Firstly, we introduce an effective multi-modal soft-exchange technique. Specifically, we design a channel and spatially mixed attention module to exchange modality information between T1-weighted and fractional anisotropy MR images. Secondly, we propose a cross-fusion module that further enhances the fusion of information between the two modalities. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches in terms of RGVP segmentation performance. △ Less

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2401.01654 [pdf, other]

LESEN: Label-Efficient deep learning for Multi-parametric MRI-based Visual Pathway Segmentation

Authors: Alou Diakite, Cheng Li, Lei Xie, Yuan**g Feng, Hua Han, Shanshan Wang

Abstract: Recent research has shown the potential of deep learning in multi-parametric MRI-based visual pathway (VP) segmentation. However, obtaining labeled data for training is laborious and time-consuming. Therefore, it is crucial to develop effective algorithms in situations with limited labeled samples. In this work, we propose a label-efficient deep learning method with self-ensembling (LESEN). LESEN… ▽ More Recent research has shown the potential of deep learning in multi-parametric MRI-based visual pathway (VP) segmentation. However, obtaining labeled data for training is laborious and time-consuming. Therefore, it is crucial to develop effective algorithms in situations with limited labeled samples. In this work, we propose a label-efficient deep learning method with self-ensembling (LESEN). LESEN incorporates supervised and unsupervised losses, enabling the student and teacher models to mutually learn from each other, forming a self-ensembling mean teacher framework. Additionally, we introduce a reliable unlabeled sample selection (RUSS) mechanism to further enhance LESEN's effectiveness. Our experiments on the human connectome project (HCP) dataset demonstrate the superior performance of our method when compared to state-of-the-art techniques, advancing multimodal VP segmentation for comprehensive analysis in clinical and research settings. The implementation code will be available at: https://github.com/aldiak/Semi-Supervised-Multimodal-Visual-Pathway- Delineation. △ Less

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2312.04679 [pdf, other]

ConVRT: Consistent Video Restoration Through Turbulence with Test-time Optimization of Neural Video Representations

Authors: Haoming Cai, **gxi Chen, Brandon Y. Feng, Weiyun Jiang, Mingyang Xie, Kevin Zhang, Ashok Veeraraghavan, Christopher Metzler

Abstract: tmospheric turbulence presents a significant challenge in long-range imaging. Current restoration algorithms often struggle with temporal inconsistency, as well as limited generalization ability across varying turbulence levels and scene content different than the training data. To tackle these issues, we introduce a self-supervised method, Consistent Video Restoration through Turbulence (ConVRT)… ▽ More tmospheric turbulence presents a significant challenge in long-range imaging. Current restoration algorithms often struggle with temporal inconsistency, as well as limited generalization ability across varying turbulence levels and scene content different than the training data. To tackle these issues, we introduce a self-supervised method, Consistent Video Restoration through Turbulence (ConVRT) a test-time optimization method featuring a neural video representation designed to enhance temporal consistency in restoration. A key innovation of ConVRT is the integration of a pretrained vision-language model (CLIP) for semantic-oriented supervision, which steers the restoration towards sharp, photorealistic images in the CLIP latent space. We further develop a principled selection strategy of text prompts, based on their statistical correlation with a perceptual metric. ConVRT's test-time optimization allows it to adapt to a wide range of real-world turbulence conditions, effectively leveraging the insights gained from pre-trained models on simulated data. ConVRT offers a comprehensive and effective solution for mitigating real-world turbulence in dynamic videos. △ Less

Submitted 7 December, 2023; originally announced December 2023.

Comments: https://convrt-2024.github.io/

arXiv:2311.12316 [pdf]

Generating Progressive Images from Pathological Transitions via Diffusion Model

Authors: Zeyu Liu, Tianyi Zhang, Yufang He, Yunlu Feng, Yu Zhao, Guanglei Zhang

Abstract: Deep learning is widely applied in computer-aided pathological diagnosis, which alleviates the pathologist workload and provide timely clinical analysis. However, most models generally require large-scale annotated data for training, which faces challenges due to the sampling and annotation scarcity in pathological images. The rapid develo** generative models shows potential to generate more tra… ▽ More Deep learning is widely applied in computer-aided pathological diagnosis, which alleviates the pathologist workload and provide timely clinical analysis. However, most models generally require large-scale annotated data for training, which faces challenges due to the sampling and annotation scarcity in pathological images. The rapid develo** generative models shows potential to generate more training samples from recent studies. However, they also struggle in generalization diversity with limited training data, incapable of generating effective samples. Inspired by the pathological transitions between different stages, we propose an adaptive depth-controlled diffusion (ADD) network to generate pathological progressive images for effective data augmentation. This novel approach roots in domain migration, where a hybrid attention strategy guides the bidirectional diffusion, blending local and global attention priorities. With feature measuring, the adaptive depth-controlled strategy ensures the migration and maintains locational similarity in simulating the pathological feature transition. Based on tiny training set (samples less than 500), the ADD yields cross-domain progressive images with corresponding soft-labels. Experiments on two datasets suggest significant improvements in generation diversity, and the effectiveness with generated progressive samples are highlighted in downstream classifications. The code is available at https://github.com/Rowerliu/ADD. △ Less

Submitted 9 March, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

Comments: 13 pages, 9 figs, 4 tabs

arXiv:2311.06712 [pdf, other]

PuzzleTuning: Explicitly Bridge Pathological and Natural Image with Puzzles

Authors: Tianyi Zhang, Shangqing Lyu, Yanli Lei, Sicheng Chen, Nan Ying, Yufang He, Yu Zhao, Yunlu Feng, Hwee Kuan Lee, Guanglei Zhang

Abstract: Pathological image analysis is a crucial field in computer vision. Due to the annotation scarcity in the pathological field, pre-training with self-supervised learning (SSL) is widely applied to learn on unlabeled images. However, the current SSL-based pathological pre-training: (1) does not explicitly explore the essential focuses of the pathological field, and (2) does not effectively bridge wit… ▽ More Pathological image analysis is a crucial field in computer vision. Due to the annotation scarcity in the pathological field, pre-training with self-supervised learning (SSL) is widely applied to learn on unlabeled images. However, the current SSL-based pathological pre-training: (1) does not explicitly explore the essential focuses of the pathological field, and (2) does not effectively bridge with and thus take advantage of the knowledge from natural images. To explicitly address them, we propose our large-scale PuzzleTuning framework, containing the following innovations. Firstly, we define three task focuses that can effectively bridge knowledge of pathological and natural domain: appearance consistency, spatial consistency, and restoration understanding. Secondly, we devise a novel multiple puzzle restoring task, which explicitly pre-trains the model regarding these focuses. Thirdly, we introduce an explicit prompt-tuning process to incrementally integrate the domain-specific knowledge. It builds a bridge to align the large domain gap between natural and pathological images. Additionally, a curriculum-learning training strategy is designed to regulate task difficulty, making the model adaptive to the puzzle restoring complexity. Experimental results show that our PuzzleTuning framework outperforms the previous state-of-the-art methods in various downstream tasks on multiple datasets. The code, demo, and pre-trained weights are available at https://github.com/sagizty/PuzzleTuning. △ Less

Submitted 22 April, 2024; v1 submitted 11 November, 2023; originally announced November 2023.

Comments: 13 pages, 9 figures, 8 tables

arXiv:2310.18529 [pdf, other]

FPM-INR: Fourier ptychographic microscopy image stack reconstruction using implicit neural representations

Authors: Haowen Zhou, Brandon Y. Feng, Haiyun Guo, Siyu Lin, Mingshu Liang, Christopher A. Metzler, Changhuei Yang

Abstract: Image stacks provide invaluable 3D information in various biological and pathological imaging applications. Fourier ptychographic microscopy (FPM) enables reconstructing high-resolution, wide field-of-view image stacks without z-stack scanning, thus significantly accelerating image acquisition. However, existing FPM methods take tens of minutes to reconstruct and gigabytes of memory to store a hig… ▽ More Image stacks provide invaluable 3D information in various biological and pathological imaging applications. Fourier ptychographic microscopy (FPM) enables reconstructing high-resolution, wide field-of-view image stacks without z-stack scanning, thus significantly accelerating image acquisition. However, existing FPM methods take tens of minutes to reconstruct and gigabytes of memory to store a high-resolution volumetric scene, impeding fast gigapixel-scale remote digital pathology. While deep learning approaches have been explored to address this challenge, existing methods poorly generalize to novel datasets and can produce unreliable hallucinations. This work presents FPM-INR, a compact and efficient framework that integrates physics-based optical models with implicit neural representations (INR) to represent and reconstruct FPM image stacks. FPM-INR is agnostic to system design or sample types and does not require external training data. In our demonstrated experiments, FPM-INR substantially outperforms traditional FPM algorithms with up to a 25-fold increase in speed and an 80-fold reduction in memory usage for continuous image stack representations. △ Less

Submitted 31 October, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

Comments: Project Page: https://hwzhou2020.github.io/FPM-INR-Web/

arXiv:2310.17940 [pdf, other]

Unified Segment-to-Segment Framework for Simultaneous Sequence Generation

Authors: Shaolei Zhang, Yang Feng

Abstract: Simultaneous sequence generation is a pivotal task for real-time scenarios, such as streaming speech recognition, simultaneous machine translation and simultaneous speech translation, where the target sequence is generated while receiving the source sequence. The crux of achieving high-quality generation with low latency lies in identifying the optimal moments for generating, accomplished by learn… ▽ More Simultaneous sequence generation is a pivotal task for real-time scenarios, such as streaming speech recognition, simultaneous machine translation and simultaneous speech translation, where the target sequence is generated while receiving the source sequence. The crux of achieving high-quality generation with low latency lies in identifying the optimal moments for generating, accomplished by learning a map** between the source and target sequences. However, existing methods often rely on task-specific heuristics for different sequence types, limiting the model's capacity to adaptively learn the source-target map** and hindering the exploration of multi-task learning for various simultaneous tasks. In this paper, we propose a unified segment-to-segment framework (Seg2Seg) for simultaneous sequence generation, which learns the map** in an adaptive and unified manner. During the process of simultaneous generation, the model alternates between waiting for a source segment and generating a target segment, making the segment serve as the natural bridge between the source and target. To accomplish this, Seg2Seg introduces a latent segment as the pivot between source to target and explores all potential source-target map**s via the proposed expectation training, thereby learning the optimal moments for generating. Experiments on multiple simultaneous generation tasks demonstrate that Seg2Seg achieves state-of-the-art performance and exhibits better generality across various tasks. △ Less

Submitted 30 November, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

Comments: Accepted at NeurIPS 2023

arXiv:2310.07403 [pdf, other]

DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

Authors: Qingkai Fang, Yan Zhou, Yang Feng

Abstract: Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressi… ▽ More Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST. To better capture the complex distribution of the target speech, DASpeech adopts the two-pass architecture to decompose the generation process into two steps, where a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech based on the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG). To consider all potential paths in the DAG during training, we calculate the expected hidden states for each target token via dynamic programming, and feed them into the acoustic decoder to predict the target mel-spectrogram. During inference, we select the most probable path and take hidden states on that path as input to the acoustic decoder. Experiments on the CVSS Fr-En benchmark demonstrate that DASpeech can achieve comparable or even better performance than the state-of-the-art S2ST model Translatotron 2, while preserving up to 18.53x speedup compared to the autoregressive baseline. Compared with the previous non-autoregressive S2ST model, DASpeech does not rely on knowledge distillation and iterative decoding, achieving significant improvements in both translation quality and decoding speed. Furthermore, DASpeech shows the ability to preserve the speaker's voice of the source speech during translation. △ Less

Submitted 11 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023. Audio samples are available at https://ictnlp.github.io/daspeech-demo/

ACM Class: I.2.7

arXiv:2310.02281 [pdf, other]

End-to-End Continuous Speech Emotion Recognition in Real-life Customer Service Call Center Conversations

Authors: Ya**g Feng, Laurence Devillers

Abstract: Speech Emotion recognition (SER) in call center conversations has emerged as a valuable tool for assessing the quality of interactions between clients and agents. In contrast to controlled laboratory environments, real-life conversations take place under uncontrolled conditions and are subject to contextual factors that influence the expression of emotions. In this paper, we present our approach t… ▽ More Speech Emotion recognition (SER) in call center conversations has emerged as a valuable tool for assessing the quality of interactions between clients and agents. In contrast to controlled laboratory environments, real-life conversations take place under uncontrolled conditions and are subject to contextual factors that influence the expression of emotions. In this paper, we present our approach to constructing a large-scale reallife dataset (CusEmo) for continuous SER in customer service call center conversations. We adopted the dimensional emotion annotation approach to capture the subtlety, complexity, and continuity of emotions in real-life call center conversations, while annotating contextual information. The study also addresses the challenges encountered during the application of the End-to-End (E2E) SER system to the dataset, including determining the appropriate label sampling rate and input segment length, as well as integrating contextual information (interlocutor's gender and empathy level) with different weights using multitask learning. The result shows that incorporating the empathy level information improved the model's performance. △ Less

Submitted 2 October, 2023; originally announced October 2023.

Journal ref: 2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), Sep 2023, Boston (MA), United States

arXiv:2309.11015

3D-U-SAM Network For Few-shot Tooth Segmentation in CBCT Images

Authors: Yifu Zhang, Zuozhu Liu, Yang Feng, Ren**g Xu

Abstract: Accurate representation of tooth position is extremely important in treatment. 3D dental image segmentation is a widely used method, however labelled 3D dental datasets are a scarce resource, leading to the problem of small samples that this task faces in many cases. To this end, we address this problem with a pretrained SAM and propose a novel 3D-U-SAM network for 3D dental image segmentation. Sp… ▽ More Accurate representation of tooth position is extremely important in treatment. 3D dental image segmentation is a widely used method, however labelled 3D dental datasets are a scarce resource, leading to the problem of small samples that this task faces in many cases. To this end, we address this problem with a pretrained SAM and propose a novel 3D-U-SAM network for 3D dental image segmentation. Specifically, in order to solve the problem of using 2D pre-trained weights on 3D datasets, we adopted a convolution approximation method; in order to retain more details, we designed skip connections to fuse features at all levels with reference to U-Net. The effectiveness of the proposed method is demonstrated in ablation experiments, comparison experiments, and sample size experiments. △ Less

Submitted 27 February, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

Comments: The paper needs to be updated

arXiv:2309.07524 [pdf, other]

doi 10.1109/TGRS.2024.3368760

A Multi-scale Generalized Shrinkage Threshold Network for Image Blind Deblurring in Remote Sensing

Authors: Yujie Feng, Yin Yang, Xiaohong Fan, Zhengpeng Zhang, Jian** Zhang

Abstract: Remote sensing images are essential for many applications of the earth's sciences, but their quality can usually be degraded due to limitations in sensor technology and complex imaging environments. To address this, various remote sensing image deblurring methods have been developed to restore sharp and high-quality images from degraded observational data. However, most traditional model-based deb… ▽ More Remote sensing images are essential for many applications of the earth's sciences, but their quality can usually be degraded due to limitations in sensor technology and complex imaging environments. To address this, various remote sensing image deblurring methods have been developed to restore sharp and high-quality images from degraded observational data. However, most traditional model-based deblurring methods usually require predefined {hand-crafted} prior assumptions, which are difficult to handle in complex applications. On the other hand, deep learning-based deblurring methods are often considered as black boxes, lacking transparency and interpretability. In this work, we propose a new blind deblurring learning framework that utilizes alternating iterations of shrinkage thresholds. This framework involves updating blurring kernels and images, with a theoretical foundation in network design. Additionally, we propose a learnable blur kernel proximal map** module to improve the accuracy of the blur kernel reconstruction. Furthermore, we propose a deep proximal map** module in the image domain, which combines a generalized shrinkage threshold with a multi-scale prior feature extraction block. This module also incorporates an attention mechanism to learn adaptively the importance of prior information, improving the flexibility and robustness of prior terms, and avoiding limitations similar to hand-crafted image prior terms. Consequently, we design a novel multi-scale generalized shrinkage threshold network (MGSTNet) that focuses specifically on learning deep geometric prior features to enhance image restoration. Experimental results on real and synthetic remote sensing image datasets demonstrate the superiority of our MGSTNet framework compared to existing deblurring methods. △ Less

Submitted 21 February, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: 16 pages,Accepted to IEEE Transactions on Geoscience and Remote Sensing,2024

MSC Class: 54H30; 68U10; 94A08

Journal ref: IEEE Transactions on Geoscience and Remote Sensing,2024

arXiv:2309.03440 [pdf, other]

Punctate White Matter Lesion Segmentation in Preterm Infants Powered by Counterfactually Generative Learning

Authors: Zehua Ren, Yongheng Sun, Miaomiao Wang, Yuying Feng, Xianjun Li, Chao **, Jian Yang, Chunfeng Lian, Fan Wang

Abstract: Accurate segmentation of punctate white matter lesions (PWMLs) are fundamental for the timely diagnosis and treatment of related developmental disorders. Automated PWMLs segmentation from infant brain MR images is challenging, considering that the lesions are typically small and low-contrast, and the number of lesions may dramatically change across subjects. Existing learning-based methods directl… ▽ More Accurate segmentation of punctate white matter lesions (PWMLs) are fundamental for the timely diagnosis and treatment of related developmental disorders. Automated PWMLs segmentation from infant brain MR images is challenging, considering that the lesions are typically small and low-contrast, and the number of lesions may dramatically change across subjects. Existing learning-based methods directly apply general network architectures to this challenging task, which may fail to capture detailed positional information of PWMLs, potentially leading to severe under-segmentations. In this paper, we propose to leverage the idea of counterfactual reasoning coupled with the auxiliary task of brain tissue segmentation to learn fine-grained positional and morphological representations of PWMLs for accurate localization and segmentation. A simple and easy-to-implement deep-learning framework (i.e., DeepPWML) is accordingly designed. It combines the lesion counterfactual map with the tissue probability map to train a lightweight PWML segmentation network, demonstrating state-of-the-art performance on a real-clinical dataset of infant T1w MR images. The code is available at \href{https://github.com/ladderlab-xjtu/DeepPWML}{https://github.com/ladderlab-xjtu/DeepPWML}. △ Less

Submitted 6 September, 2023; originally announced September 2023.

Comments: 10 pages, 3 figures, Medical Image Computing and Computer Assisted Intervention(MICCAI)

arXiv:2308.03807 [pdf, other]

doi 10.1109/TCI.2023.3315853

Nest-DGIL: Nesterov-optimized Deep Geometric Incremental Learning for CS Image Reconstruction

Authors: Xiaohong Fan, Yin Yang, Ke Chen, Yujie Feng, Jian** Zhang

Abstract: Proximal gradient-based optimization is one of the most common strategies to solve inverse problem of images, and it is easy to implement. However, these techniques often generate heavy artifacts in image reconstruction. One of the most popular refinement methods is to fine-tune the regularization parameter to alleviate such artifacts, but it may not always be sufficient or applicable due to incre… ▽ More Proximal gradient-based optimization is one of the most common strategies to solve inverse problem of images, and it is easy to implement. However, these techniques often generate heavy artifacts in image reconstruction. One of the most popular refinement methods is to fine-tune the regularization parameter to alleviate such artifacts, but it may not always be sufficient or applicable due to increased computational costs. In this work, we propose a deep geometric incremental learning framework based on the second Nesterov proximal gradient optimization. The proposed end-to-end network not only has the powerful learning ability for high-/low-frequency image features, but also can theoretically guarantee that geometric texture details will be reconstructed from preliminary linear reconstruction. Furthermore, it can avoid the risk of intermediate reconstruction results falling outside the geometric decomposition domains and achieve fast convergence. Our reconstruction framework is decomposed into four modules including general linear reconstruction, cascade geometric incremental restoration, Nesterov acceleration, and post-processing. In the image restoration step, a cascade geometric incremental learning module is designed to compensate for missing texture information from different geometric spectral decomposition domains. Inspired by the overlap-tile strategy, we also develop a post-processing module to remove the block effect in patch-wise-based natural image reconstruction. All parameters in the proposed model are learnable, an adaptive initialization technique of physical parameters is also employed to make model flexibility and ensure converging smoothly. We compare the reconstruction performance of the proposed method with existing state-of-the-art methods to demonstrate its superiority. Our source codes are available at https://github.com/fanxiaohong/Nest-DGIL. △ Less

Submitted 11 October, 2023; v1 submitted 6 August, 2023; originally announced August 2023.

Comments: 15 pages,our source codes are available at https://github.com/fanxiaohong/Nest-DGIL

Journal ref: This work is published in IEEE Transactions on Computational Imaging, vol. 9, pp. 819-833, 2023

arXiv:2307.15388 [pdf, other]

An Empirical Study of Large-Scale Data-Driven Full Waveform Inversion

Authors: Peng **, Yinan Feng, Shihang Feng, Hanchen Wang, Yinpeng Chen, Benjamin Consolvo, Zicheng Liu, Youzuo Lin

Abstract: This paper investigates the impact of big data on deep learning models to help solve the full waveform inversion (FWI) problem. While it is well known that big data can boost the performance of deep learning models in many tasks, its effectiveness has not been validated for FWI. To address this gap, we present an empirical study that investigates how deep learning models in FWI behave when trained… ▽ More This paper investigates the impact of big data on deep learning models to help solve the full waveform inversion (FWI) problem. While it is well known that big data can boost the performance of deep learning models in many tasks, its effectiveness has not been validated for FWI. To address this gap, we present an empirical study that investigates how deep learning models in FWI behave when trained on OpenFWI, a collection of large-scale, multi-structural, synthetic datasets published recently. In particular, we train and evaluate the FWI models on a combination of 10 2D subsets in OpenFWI that contain 470K pairs of seismic data and velocity maps in total. Our experiments demonstrate that training on the combined dataset yields an average improvement of 13.03% in MAE, 7.19% in MSE and 1.87% in SSIM compared to each split dataset, and an average improvement of 28.60%, 21.55% and 8.22% in the leave-one-out generalization test. We further demonstrate that model capacity needs to scale in accordance with data size for optimal improvement, where our largest model yields an average improvement of 20.06%, 13.39% and 0.72% compared to the smallest one. △ Less

Submitted 24 April, 2024; v1 submitted 28 July, 2023; originally announced July 2023.

arXiv:2307.03298 [pdf]

Spherical CNN for Medical Imaging Applications: Importance of Equivariance in image reconstruction and denoising

Authors: Amirreza Hashemi, Yuemeng Feng, Hamid Sabet

Abstract: This work highlights the significance of equivariant networks as efficient and high-performance approaches for tomography applications. Our study builds upon the limitations of conventional Convolutional Neural Networks (CNNs), which have shown promise in post-processing various medical imaging systems. However, the efficiency of conventional CNNs heavily relies on an undiminished and proper train… ▽ More This work highlights the significance of equivariant networks as efficient and high-performance approaches for tomography applications. Our study builds upon the limitations of conventional Convolutional Neural Networks (CNNs), which have shown promise in post-processing various medical imaging systems. However, the efficiency of conventional CNNs heavily relies on an undiminished and proper training set. To tackle this issue, in this study, we introduce an equivariant network, aiming to reduce CNN's dependency on specific training sets. We evaluate the efficacy of equivariant spherical CNNs (SCNNs) for 2- and 3- dimensional medical imaging problems. Our results demonstrate superior quality and computational efficiency of SCNNs in denoising and reconstructing benchmark problems. Furthermore, we propose a novel approach to employ SCNNs as a complement to conventional image reconstruction tools, enhancing the outcomes while reducing reliance on the training set. Across all cases, we observe a significant decrease in computational costs while maintaining the same or higher quality of image processing using SCNNs compared to CNNs. Additionally, we explore the potential of this network for broader tomography applications, particularly those requiring omnidirectional representation. △ Less

Submitted 26 October, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

arXiv:2307.02825 [pdf, other]

Bundle-specific Tractogram Distribution Estimation Using Higher-order Streamline Differential Equation

Authors: Yuan**g Feng, Lei Xie, **gqiang Wang, Jianzhong He, Fei Gao

Abstract: Tractography traces the peak directions extracted from fiber orientation distribution (FOD) suffering from ambiguous spatial correspondences between diffusion directions and fiber geometry, which is prone to producing erroneous tracks while missing true positive connections. The peaks-based tractography methods 'locally' reconstructed streamlines in 'single to single' manner, thus lacking of globa… ▽ More Tractography traces the peak directions extracted from fiber orientation distribution (FOD) suffering from ambiguous spatial correspondences between diffusion directions and fiber geometry, which is prone to producing erroneous tracks while missing true positive connections. The peaks-based tractography methods 'locally' reconstructed streamlines in 'single to single' manner, thus lacking of global information about the trend of the whole fiber bundle. In this work, we propose a novel tractography method based on a bundle-specific tractogram distribution function by using a higher-order streamline differential equation, which reconstructs the streamline bundles in 'cluster to cluster' manner. A unified framework for any higher-order streamline differential equation is presented to describe the fiber bundles with disjoint streamlines defined based on the diffusion tensor vector field. At the global level, the tractography process is simplified as the estimation of bundle-specific tractogram distribution (BTD) coefficients by minimizing the energy optimization model, and is used to characterize the relations between BTD and diffusion tensor vector under the prior guidance by introducing the tractogram bundle information to provide anatomic priors. Experiments are performed on simulated Hough, Sine, Circle data, ISMRM 2015 Tractography Challenge data, FiberCup data, and in vivo data from the Human Connectome Project (HCP) data for qualitative and quantitative evaluation. The results demonstrate that our approach can reconstruct the complex global fiber bundles directly. BTD reduces the error deviation and accumulation at the local level and shows better results in reconstructing long-range, twisting, and large fanning tracts. △ Less

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2307.01981 [pdf, other]

A ChatGPT Aided Explainable Framework for Zero-Shot Medical Image Diagnosis

Authors: Jiaxiang Liu, Tianxiang Hu, Yan Zhang, Xiaotang Gai, Yang Feng, Zuozhu Liu

Abstract: Zero-shot medical image classification is a critical process in real-world scenarios where we have limited access to all possible diseases or large-scale annotated data. It involves computing similarity scores between a query medical image and possible disease categories to determine the diagnostic result. Recent advances in pretrained vision-language models (VLMs) such as CLIP have shown great pe… ▽ More Zero-shot medical image classification is a critical process in real-world scenarios where we have limited access to all possible diseases or large-scale annotated data. It involves computing similarity scores between a query medical image and possible disease categories to determine the diagnostic result. Recent advances in pretrained vision-language models (VLMs) such as CLIP have shown great performance for zero-shot natural image recognition and exhibit benefits in medical applications. However, an explainable zero-shot medical image recognition framework with promising performance is yet under development. In this paper, we propose a novel CLIP-based zero-shot medical image classification framework supplemented with ChatGPT for explainable diagnosis, mimicking the diagnostic process performed by human experts. The key idea is to query large language models (LLMs) with category names to automatically generate additional cues and knowledge, such as disease symptoms or descriptions other than a single category name, to help provide more accurate and explainable diagnosis in CLIP. We further design specific prompts to enhance the quality of generated texts by ChatGPT that describe visual medical features. Extensive results on one private dataset and four public datasets along with detailed analysis demonstrate the effectiveness and explainability of our training-free zero-shot diagnosis pipeline, corroborating the great potential of VLMs and LLMs for medical applications. △ Less

Submitted 4 July, 2023; originally announced July 2023.

Comments: Workshop on Interpretable ML in Healthcare at International Conference on Machine Learning (ICML) 2023

arXiv:2307.01979 [pdf, other]

ToothSegNet: Image Degradation meets Tooth Segmentation in CBCT Images

Authors: Jiaxiang Liu, Tianxiang Hu, Yang Feng, Wanghui Ding, Zuozhu Liu

Abstract: In computer-assisted orthodontics, three-dimensional tooth models are required for many medical treatments. Tooth segmentation from cone-beam computed tomography (CBCT) images is a crucial step in constructing the models. However, CBCT image quality problems such as metal artifacts and blurring caused by shooting equipment and patients' dental conditions make the segmentation difficult. In this pa… ▽ More In computer-assisted orthodontics, three-dimensional tooth models are required for many medical treatments. Tooth segmentation from cone-beam computed tomography (CBCT) images is a crucial step in constructing the models. However, CBCT image quality problems such as metal artifacts and blurring caused by shooting equipment and patients' dental conditions make the segmentation difficult. In this paper, we propose ToothSegNet, a new framework which acquaints the segmentation model with generated degraded images during training. ToothSegNet merges the information of high and low quality images from the designed degradation simulation module using channel-wise cross fusion to reduce the semantic gap between encoder and decoder, and also refines the shape of tooth prediction through a structural constraint loss. Experimental results suggest that ToothSegNet produces more precise segmentation and outperforms the state-of-the-art medical image segmentation methods. △ Less

Submitted 4 July, 2023; originally announced July 2023.

Comments: IEEE ISBI 2023

arXiv:2306.10499 [pdf, other]

Channel-Spatial-Based Few-Shot Bird Sound Event Detection

Authors: Lingwen Liu, Yuxuan Feng, Haitao Fu, Yajie Yang, Xin Pan, Chenlei **

Abstract: In this paper, we propose a model for bird sound event detection that focuses on a small number of training samples within the everyday long-tail distribution. As a result, we investigate bird sound detection using the few-shot learning paradigm. By integrating channel and spatial attention mechanisms, improved feature representations can be learned from few-shot training datasets. We develop a Me… ▽ More In this paper, we propose a model for bird sound event detection that focuses on a small number of training samples within the everyday long-tail distribution. As a result, we investigate bird sound detection using the few-shot learning paradigm. By integrating channel and spatial attention mechanisms, improved feature representations can be learned from few-shot training datasets. We develop a Metric Channel-Spatial Network model by incorporating a Channel Spatial Squeeze-Excitation block into the prototype network, combining it with these attention mechanisms. We evaluate the Metric Channel Spatial Network model on the DCASE 2022 Take5 dataset benchmark, achieving an F-measure of 66.84% and a PSDS of 58.98%. Our experiment demonstrates that the combination of channel and spatial attention mechanisms effectively enhances the performance of bird sound classification and detection. △ Less

Submitted 25 June, 2023; v1 submitted 18 June, 2023; originally announced June 2023.

Comments: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference

arXiv:2305.18033 [pdf]

doi 10.1038/s41597-023-02422-6

The ACROBAT 2022 Challenge: Automatic Registration Of Breast Cancer Tissue

Authors: Philippe Weitz, Masi Valkonen, Leslie Solorzano, Circe Carr, Kimmo Kartasalo, Constance Boissin, Sonja Koivukoski, Aino Kuusela, Dusan Rasic, Yanbo Feng, Sandra Sinius Pouplier, Abhinav Sharma, Kajsa Ledesma Eriksson, Stephanie Robertson, Christian Marzahl, Chandler D. Gatenbee, Alexander R. A. Anderson, Marek Wodzinski, Artur Jurgas, Niccolò Marini, Manfredo Atzori, Henning Müller, Daniel Budelmann, Nick Weiss, Stefan Heldmann , et al. (16 additional authors not shown)

Abstract: The alignment of tissue between histopathological whole-slide-images (WSI) is crucial for research and clinical applications. Advances in computing, deep learning, and availability of large WSI datasets have revolutionised WSI analysis. Therefore, the current state-of-the-art in WSI registration is unclear. To address this, we conducted the ACROBAT challenge, based on the largest WSI registration… ▽ More The alignment of tissue between histopathological whole-slide-images (WSI) is crucial for research and clinical applications. Advances in computing, deep learning, and availability of large WSI datasets have revolutionised WSI analysis. Therefore, the current state-of-the-art in WSI registration is unclear. To address this, we conducted the ACROBAT challenge, based on the largest WSI registration dataset to date, including 4,212 WSIs from 1,152 breast cancer patients. The challenge objective was to align WSIs of tissue that was stained with routine diagnostic immunohistochemistry to its H&E-stained counterpart. We compare the performance of eight WSI registration algorithms, including an investigation of the impact of different WSI properties and clinical covariates. We find that conceptually distinct WSI registration methods can lead to highly accurate registration performances and identify covariates that impact performances across methods. These results establish the current state-of-the-art in WSI registration and guide researchers in selecting and develo** methods. △ Less

Submitted 29 May, 2023; originally announced May 2023.

arXiv:2305.16093 [pdf, other]

doi 10.18653/v1/2023.findings-acl.485

End-to-End Simultaneous Speech Translation with Differentiable Segmentation

Authors: Shaolei Zhang, Yang Feng

Abstract: End-to-end simultaneous speech translation (SimulST) outputs translation while receiving the streaming speech inputs (a.k.a. streaming speech translation), and hence needs to segment the speech inputs and then translate based on the current received speech. However, segmenting the speech inputs at unfavorable moments can disrupt the acoustic integrity and adversely affect the performance of the tr… ▽ More End-to-end simultaneous speech translation (SimulST) outputs translation while receiving the streaming speech inputs (a.k.a. streaming speech translation), and hence needs to segment the speech inputs and then translate based on the current received speech. However, segmenting the speech inputs at unfavorable moments can disrupt the acoustic integrity and adversely affect the performance of the translation model. Therefore, learning to segment the speech inputs at those moments that are beneficial for the translation model to produce high-quality translation is the key to SimulST. Existing SimulST methods, either using the fixed-length segmentation or external segmentation model, always separate segmentation from the underlying translation model, where the gap results in segmentation outcomes that are not necessarily beneficial for the translation process. In this paper, we propose Differentiable Segmentation (DiSeg) for SimulST to directly learn segmentation from the underlying translation model. DiSeg turns hard segmentation into differentiable through the proposed expectation training, enabling it to be jointly trained with the translation model and thereby learn translation-beneficial segmentation. Experimental results demonstrate that DiSeg achieves state-of-the-art performance and exhibits superior segmentation capability. △ Less

Submitted 17 June, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: Accepted at ACL 2023 findings

arXiv:2305.14635 [pdf, other]

CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation

Authors: Yan Zhou, Qingkai Fang, Yang Feng

Abstract: End-to-end speech translation (ST) is the task of translating speech signals in the source language into text in the target language. As a cross-modal task, end-to-end ST is difficult to train with limited data. Existing methods often try to transfer knowledge from machine translation (MT), but their performances are restricted by the modality gap between speech and text. In this paper, we propose… ▽ More End-to-end speech translation (ST) is the task of translating speech signals in the source language into text in the target language. As a cross-modal task, end-to-end ST is difficult to train with limited data. Existing methods often try to transfer knowledge from machine translation (MT), but their performances are restricted by the modality gap between speech and text. In this paper, we propose Cross-modal Mixup via Optimal Transport CMOT to overcome the modality gap. We find the alignment between speech and text sequences via optimal transport and then mix up the sequences from different modalities at a token level using the alignment. Experiments on the MuST-C ST benchmark demonstrate that CMOT achieves an average BLEU of 30.0 in 8 translation directions, outperforming previous methods. Further analysis shows CMOT can adaptively find the alignment between modalities, which helps alleviate the modality gap between speech and text. Code is publicly available at https://github.com/ictnlp/CMOT. △ Less

Submitted 25 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: ACL 2023 main conference

arXiv:2305.13314 [pdf, other]

Auto-Linear Phenomenon in Subsurface Imaging

Authors: Yinan Feng, Yinpeng Chen, Peng **, Shihang Feng, Zicheng Liu, Youzuo Lin

Abstract: Subsurface imaging involves solving full waveform inversion (FWI) to predict geophysical properties from measurements. This problem can be reframed as an image-to-image translation, with the usual approach being to train an encoder-decoder network using paired data from two domains: geophysical property and measurement. A recent seminal work (InvLINT) demonstrates there is only a linear map** be… ▽ More Subsurface imaging involves solving full waveform inversion (FWI) to predict geophysical properties from measurements. This problem can be reframed as an image-to-image translation, with the usual approach being to train an encoder-decoder network using paired data from two domains: geophysical property and measurement. A recent seminal work (InvLINT) demonstrates there is only a linear map** between the latent spaces of the two domains, and the decoder requires paired data for training. This paper extends this direction by demonstrating that only linear map** necessitates paired data, while both the encoder and decoder can be learned from their respective domains through self-supervised learning. This unveils an intriguing phenomenon (named Auto-Linear) where the self-learned features of two separate domains are automatically linearly correlated. Compared with existing methods, our Auto-Linear has four advantages: (a) solving both forward and inverse modeling simultaneously, (b) applicable to different subsurface imaging tasks and achieving markedly better results than previous methods, (c)enhanced performance, especially in scenarios with limited paired data and in the presence of noisy data, and (d) strong generalization ability of the trained encoder and decoder. △ Less

Submitted 21 May, 2024; v1 submitted 27 April, 2023; originally announced May 2023.

arXiv:2305.08709 [pdf, other]

Back Translation for Speech-to-text Translation Without Transcripts

Authors: Qingkai Fang, Yang Feng

Abstract: The success of end-to-end speech-to-text translation (ST) is often achieved by utilizing source transcripts, e.g., by pre-training with automatic speech recognition (ASR) and machine translation (MT) tasks, or by introducing additional ASR and MT data. Unfortunately, transcripts are only sometimes available since numerous unwritten languages exist worldwide. In this paper, we aim to utilize large… ▽ More The success of end-to-end speech-to-text translation (ST) is often achieved by utilizing source transcripts, e.g., by pre-training with automatic speech recognition (ASR) and machine translation (MT) tasks, or by introducing additional ASR and MT data. Unfortunately, transcripts are only sometimes available since numerous unwritten languages exist worldwide. In this paper, we aim to utilize large amounts of target-side monolingual data to enhance ST without transcripts. Motivated by the remarkable success of back translation in MT, we develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data. To ease the challenges posed by short-to-long generation and one-to-many map**, we introduce self-supervised discrete units and achieve back translation by cascading a target-to-unit model and a unit-to-speech model. With our synthetic ST data, we achieve an average boost of 2.3 BLEU on MuST-C En-De, En-Fr, and En-Es datasets. More experiments show that our method is especially effective in low-resource scenarios. △ Less

Submitted 15 May, 2023; originally announced May 2023.

Comments: ACL 2023 main conference

ACM Class: I.2.7

arXiv:2305.08706 [pdf, other]

Understanding and Bridging the Modality Gap for Speech Translation

Authors: Qingkai Fang, Yang Feng

Abstract: How to achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among various existing techniques, multi-task learning is one of the effective ways to share knowledge between ST and MT in which additional MT data can help to learn source-to-target map**. However, due to the differences between speech and text, there is always a gap between ST and MT.… ▽ More How to achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among various existing techniques, multi-task learning is one of the effective ways to share knowledge between ST and MT in which additional MT data can help to learn source-to-target map**. However, due to the differences between speech and text, there is always a gap between ST and MT. In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias. We find that the modality gap is relatively small during training except for some difficult cases, but keeps increasing during inference due to the cascading effect. To address these problems, we propose the Cross-modal Regularization with Scheduled Sampling (Cress) method. Specifically, we regularize the output predictions of ST and MT, whose target-side contexts are derived by sampling between ground truth words and self-generated words with a varying probability. Furthermore, we introduce token-level adaptive training which assigns different training weights to target tokens to handle difficult cases with large modality gaps. Experiments and analysis show that our approach effectively bridges the modality gap, and achieves promising results in all eight directions of the MuST-C dataset. △ Less

Submitted 15 May, 2023; originally announced May 2023.

Comments: ACL 2023 main conference

ACM Class: I.2.7

arXiv:2304.10095 [pdf, ps, other]

Transmit Power Minimization for STAR-RIS Empowered Symbiotic Radio Communications

Authors: Chao Zhou, Bin Lyu, Youhong Feng, Dinh Thai Hoang

Abstract: In this paper, we propose a simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) empowered transmission scheme for symbiotic radio (SR) systems to make more flexibility for network deployment and enhance system performance. The STAR-RIS is utilized to not only beam the primary signals from the base station (BS) towards multiple primary users on the same side of… ▽ More In this paper, we propose a simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) empowered transmission scheme for symbiotic radio (SR) systems to make more flexibility for network deployment and enhance system performance. The STAR-RIS is utilized to not only beam the primary signals from the base station (BS) towards multiple primary users on the same side of the STAR-RIS, but also achieve the secondary transmission to the secondary users on another side. We consider both the broadcasting signal model and unicasting signal model at the BS. For each model, we aim for minimizing the transmit power of the BS by designing the active beamforming and simultaneous reflection and transmission coefficients under the practical phase correlation constraint. To address the challenge of solving the formulated problem, we propose a block coordinate descent based algorithm with the semidefinite relaxation, penalty dual decomposition and successive convex approximation methods, which decomposes the original problem into one sub-problem about active beamforming and the other sub-problem about simultaneous reflection and transmission coefficients, and iteratively solve them until the convergence is achieved. Numerical results indicate that the proposed scheme can reduce up to 150.6% transmit power compared to the backscattering device enabled scheme. △ Less

Submitted 20 April, 2023; originally announced April 2023.

Comments: 32 pages, 12 figures

arXiv:2304.02883 [pdf, other]

GA-HQS: MRI reconstruction via a generically accelerated unfolding approach

Authors: Jiawei Jiang, Yuchao Feng, Honghui Xu, Wanjun Chen, Jianwei Zheng

Abstract: Deep unfolding networks (DUNs) are the foremost methods in the realm of compressed sensing MRI, as they can employ learnable networks to facilitate interpretable forward-inference operators. However, several daunting issues still exist, including the heavy dependency on the first-order optimization algorithms, the insufficient information fusion mechanisms, and the limitation of capturing long-ran… ▽ More Deep unfolding networks (DUNs) are the foremost methods in the realm of compressed sensing MRI, as they can employ learnable networks to facilitate interpretable forward-inference operators. However, several daunting issues still exist, including the heavy dependency on the first-order optimization algorithms, the insufficient information fusion mechanisms, and the limitation of capturing long-range relationships. To address the issues, we propose a Generically Accelerated Half-Quadratic Splitting (GA-HQS) algorithm that incorporates second-order gradient information and pyramid attention modules for the delicate fusion of inputs at the pixel level. Moreover, a multi-scale split transformer is also designed to enhance the global feature representation. Comprehensive experiments demonstrate that our method surpasses previous ones on single-coil MRI acceleration tasks. △ Less

Submitted 6 April, 2023; originally announced April 2023.

arXiv:2303.12289 [pdf]

doi 10.1109/TITS.2022.3220110

Adaptive Road Configurations for Improved Autonomous Vehicle-Pedestrian Interactions using Reinforcement Learning

Authors: Qiming Ye, Yuxiang Feng, Jose Javier Escribano Macias, Marc Stettler, Panagiotis Angeloudis

Abstract: The deployment of Autonomous Vehicles (AVs) poses considerable challenges and unique opportunities for the design and management of future urban road infrastructure. In light of this disruptive transformation, the Right-Of-Way (ROW) composition of road space has the potential to be renewed. Design approaches and intelligent control models have been proposed to address this problem, but we lack an… ▽ More The deployment of Autonomous Vehicles (AVs) poses considerable challenges and unique opportunities for the design and management of future urban road infrastructure. In light of this disruptive transformation, the Right-Of-Way (ROW) composition of road space has the potential to be renewed. Design approaches and intelligent control models have been proposed to address this problem, but we lack an operational framework that can dynamically generate ROW plans for AVs and pedestrians in response to real-time demand. Based on microscopic traffic simulation, this study explores Reinforcement Learning (RL) methods for evolving ROW compositions. We implement a centralised paradigm and a distributive learning paradigm to separately perform the dynamic control on several road network configurations. Experimental results indicate that the algorithms have the potential to improve traffic flow efficiency and allocate more space for pedestrians. Furthermore, the distributive learning algorithm outperforms its centralised counterpart regarding computational cost (49.55\%), benchmark rewards (25.35\%), best cumulative rewards (24.58\%), optimal actions (13.49\%) and rate of convergence. This novel road management technique could potentially contribute to the flow-adaptive and active mobility-friendly streets in the AVs era. △ Less

Submitted 21 March, 2023; originally announced March 2023.

Journal ref: IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 2, pp. 2024-2034, Feb. 2023

arXiv:2301.02367 [pdf]

doi 10.1109/TMI.2023.3261346

MR Elastography with Optimization-Based Phase Unwrap** and Traveling Wave Expansion-based Neural Network (TWENN)

Authors: Shengyuan Ma, Runke Wang, Suhao Qiu, Ruokun Li, Qi Yue, Qingfang Sun, Liang Chen, Fuhua Yan, Guang-Zhong Yang, Yuan Feng

Abstract: Magnetic Resonance Elastography (MRE) can characterize biomechanical properties of soft tissue for disease diagnosis and treatment planning. However, complicated wavefields acquired from MRE coupled with noise pose challenges for accurate displacement extraction and modulus estimation. Here we propose a pipeline for processing MRE images using optimization-based displacement extraction and Traveli… ▽ More Magnetic Resonance Elastography (MRE) can characterize biomechanical properties of soft tissue for disease diagnosis and treatment planning. However, complicated wavefields acquired from MRE coupled with noise pose challenges for accurate displacement extraction and modulus estimation. Here we propose a pipeline for processing MRE images using optimization-based displacement extraction and Traveling Wave Expansion-based Neural Network (TWENN) modulus estimation. Phase unwrap** and displacement extraction were achieved by optimization of an objective function with Dual Data Consistency (Dual-DC). A complex-valued neural network using displacement covariance as input has been constructed for the estimation of complex wavenumbers. A model of traveling wave expansion is used to generate training datasets with different levels of noise for the network. The complex shear modulus map is obtained by a fusion of multifrequency and multidirectional data. Validation using images of brain and liver simulation demonstrates the practical value of the proposed pipeline, which can estimate the biomechanical properties with minimum root-mean-square-errors compared with state-of-the-art methods. Applications of the proposed method for processing MRE images of phantom, brain, and liver show clear anatomical features and that the pipeline is robust to noise and has a good generalization capability. △ Less

Submitted 4 April, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

arXiv:2301.00066 [pdf, other]

Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition

Authors: Yukun Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

Abstract: Recent studies have shown that using an external Language Model (LM) benefits the end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. The long-tail prediction problems have been widely studied in many applications, but only been addressed by a few studies for ASR and LMs. In this paper, we propose a n… ▽ More Recent studies have shown that using an external Language Model (LM) benefits the end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. The long-tail prediction problems have been widely studied in many applications, but only been addressed by a few studies for ASR and LMs. In this paper, we propose a new memory augmented lookup dictionary based Transformer architecture for LM. The newly introduced lookup dictionary incorporates rich contextual information in training set, which is vital to correctly predict long-tail tokens. With intensive experiments on Chinese and English data sets, our proposed method is proved to outperform the baseline Transformer LM by a great margin on both word/character error rate and tail tokens error rate. This is achieved without impact on the decoding efficiency. Overall, we demonstrate the effectiveness of our proposed method in boosting the ASR decoding performance, especially for long-tail tokens. △ Less

Submitted 30 December, 2022; originally announced January 2023.

Comments: Submitted to ICASSP 2023

arXiv:2212.09206 [pdf, other]

Segmentation Ability Map: Interpret deep features for medical image segmentation

Authors: Sheng He, Yanfang Feng, P. Ellen Grant, Yangming Ou

Abstract: Deep convolutional neural networks (CNNs) have been widely used for medical image segmentation. In most studies, only the output layer is exploited to compute the final segmentation results and the hidden representations of the deep learned features have not been well understood. In this paper, we propose a prototype segmentation (ProtoSeg) method to compute a binary segmentation map based on deep… ▽ More Deep convolutional neural networks (CNNs) have been widely used for medical image segmentation. In most studies, only the output layer is exploited to compute the final segmentation results and the hidden representations of the deep learned features have not been well understood. In this paper, we propose a prototype segmentation (ProtoSeg) method to compute a binary segmentation map based on deep features. We measure the segmentation abilities of the features by computing the Dice between the feature segmentation map and ground-truth, named as the segmentation ability score (SA score for short). The corresponding SA score can quantify the segmentation abilities of deep features in different layers and units to understand the deep neural networks for segmentation. In addition, our method can provide a mean SA score which can give a performance estimation of the output on the test images without ground-truth. Finally, we use the proposed ProtoSeg method to compute the segmentation map directly on input images to further understand the segmentation ability of each input image. Results are presented on segmenting tumors in brain MRI, lesions in skin images, COVID-related abnormality in CT images, prostate segmentation in abdominal MRI, and pancreatic mass segmentation in CT images. Our method can provide new insights for interpreting and explainable AI systems for medical image segmentation. Our code is available on: \url{https://github.com/shengfly/ProtoSeg}. △ Less

Submitted 18 December, 2022; originally announced December 2022.

Journal ref: Medical Image Analysis, 2023

arXiv:2212.00687 [pdf]

3D-EPI Blip-Up/Down Acquisition (BUDA) with CAIPI and Joint Hankel Structured Low-Rank Reconstruction for Rapid Distortion-Free High-Resolution T2* Map**

Authors: Zhifeng Chen, Congyu Liao, Xiaozhi Cao, Benedikt A. Poser, Zhongbiao Xu, Wei-Ching Lo, Manyi Wen, Jae** Cho, Qiyuan Tian, Yaohui Wang, Yanqiu Feng, Ling Xia, Wufan Chen, Feng Liu, Berkin Bilgic

Abstract: Purpose: This work aims to develop a novel distortion-free 3D-EPI acquisition and image reconstruction technique for fast and robust, high-resolution, whole-brain imaging as well as quantitative T2* map**. Methods: 3D-Blip-Up and -Down Acquisition (3D-BUDA) sequence is designed for both single- and multi-echo 3D GRE-EPI imaging using multiple shots with blip-up and -down readouts to encode B0 fi… ▽ More Purpose: This work aims to develop a novel distortion-free 3D-EPI acquisition and image reconstruction technique for fast and robust, high-resolution, whole-brain imaging as well as quantitative T2* map**. Methods: 3D-Blip-Up and -Down Acquisition (3D-BUDA) sequence is designed for both single- and multi-echo 3D GRE-EPI imaging using multiple shots with blip-up and -down readouts to encode B0 field map information. Complementary k-space coverage is achieved using controlled aliasing in parallel imaging (CAIPI) sampling across the shots. For image reconstruction, an iterative hard-thresholding algorithm is employed to minimize the cost function that combines field map information informed parallel imaging with the structured low-rank constraint for multi-shot 3D-BUDA data. Extending 3D-BUDA to multi-echo imaging permits T2* map**. For this, we propose constructing a joint Hankel matrix along both echo and shot dimensions to improve the reconstruction. Results: Experimental results on in vivo multi-echo data demonstrate that, by performing joint reconstruction along with both echo and shot dimensions, reconstruction accuracy is improved compared to standard 3D-BUDA reconstruction. CAIPI sampling is further shown to enhance the image quality. For T2* map**, T2* values from 3D-Joint-CAIPI-BUDA and reference multi-echo GRE are within limits of agreement as quantified by Bland-Altman analysis. Conclusions: The proposed technique enables rapid 3D distortion-free high-resolution imaging and T2* map**. Specifically, 3D-BUDA enables 1-mm isotropic whole-brain imaging in 22 s at 3 T and 9 s on a 7 T scanner. The combination of multi-echo 3D-BUDA with CAIPI acquisition and joint reconstruction enables distortion-free whole-brain T2* map** in 47 s at 1.1x1.1x1.0 mm3 resolution. △ Less

Submitted 1 December, 2022; originally announced December 2022.

arXiv:2211.13621 [pdf]

ACROBAT -- a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology

Authors: Philippe Weitz, Masi Valkonen, Leslie Solorzano, Circe Carr, Kimmo Kartasalo, Constance Boissin, Sonja Koivukoski, Aino Kuusela, Dusan Rasic, Yanbo Feng, Sandra Kristiane Sinius Pouplier, Abhinav Sharma, Kajsa Ledesma Eriksson, Leena Latonen, Anne-Vibeke Laenkholm, Johan Hartman, Pekka Ruusuvuori, Mattias Rantalainen

Abstract: The analysis of FFPE tissue sections stained with haematoxylin and eosin (H&E) or immunohistochemistry (IHC) is an essential part of the pathologic assessment of surgically resected breast cancer specimens. IHC staining has been broadly adopted into diagnostic guidelines and routine workflows to manually assess status and scoring of several established biomarkers, including ER, PGR, HER2 and KI67.… ▽ More The analysis of FFPE tissue sections stained with haematoxylin and eosin (H&E) or immunohistochemistry (IHC) is an essential part of the pathologic assessment of surgically resected breast cancer specimens. IHC staining has been broadly adopted into diagnostic guidelines and routine workflows to manually assess status and scoring of several established biomarkers, including ER, PGR, HER2 and KI67. However, this is a task that can also be facilitated by computational pathology image analysis methods. The research in computational pathology has recently made numerous substantial advances, often based on publicly available whole slide image (WSI) data sets. However, the field is still considerably limited by the sparsity of public data sets. In particular, there are no large, high quality publicly available data sets with WSIs of matching IHC and H&E-stained tissue sections. Here, we publish the currently largest publicly available data set of WSIs of tissue sections from surgical resection specimens from female primary breast cancer patients with matched WSIs of corresponding H&E and IHC-stained tissue, consisting of 4,212 WSIs from 1,153 patients. The primary purpose of the data set was to facilitate the ACROBAT WSI registration challenge, aiming at accurately aligning H&E and IHC images. For research in the area of image registration, automatic quantitative feedback on registration algorithm performance remains available through the ACROBAT challenge website, based on more than 37,000 manually annotated landmark pairs from 13 annotators. Beyond registration, this data set has the potential to enable many different avenues of computational pathology research, including stain-guided learning, virtual staining, unsupervised pre-training, artefact detection and stain-independent models. △ Less

Submitted 24 November, 2022; originally announced November 2022.

arXiv:2211.02292 [pdf, other]

Boosting Binary Neural Networks via Dynamic Thresholds Learning

Authors: Jiehua Zhang, Xueyang Zhang, Zhuo Su, Zitong Yu, Yanghe Feng, Xin Lu, Matti Pietikäinen, Li Liu

Abstract: Develo** lightweight Deep Convolutional Neural Networks (DCNNs) and Vision Transformers (ViTs) has become one of the focuses in vision research since the low computational cost is essential for deploying vision models on edge devices. Recently, researchers have explored highly computational efficient Binary Neural Networks (BNNs) by binarizing weights and activations of Full-precision Neural Net… ▽ More Develo** lightweight Deep Convolutional Neural Networks (DCNNs) and Vision Transformers (ViTs) has become one of the focuses in vision research since the low computational cost is essential for deploying vision models on edge devices. Recently, researchers have explored highly computational efficient Binary Neural Networks (BNNs) by binarizing weights and activations of Full-precision Neural Networks. However, the binarization process leads to an enormous accuracy gap between BNN and its full-precision version. One of the primary reasons is that the Sign function with predefined or learned static thresholds limits the representation capacity of binarized architectures since single-threshold binarization fails to utilize activation distributions. To overcome this issue, we introduce the statistics of channel information into explicit thresholds learning for the Sign Function dubbed DySign to generate various thresholds based on input distribution. Our DySign is a straightforward method to reduce information loss and boost the representative capacity of BNNs, which can be flexibly applied to both DCNNs and ViTs (i.e., DyBCNN and DyBinaryCCT) to achieve promising performance improvement. As shown in our extensive experiments. For DCNNs, DyBCNNs based on two backbones (MobileNetV1 and ResNet18) achieve 71.2% and 67.4% top1-accuracy on ImageNet dataset, outperforming baselines by a large margin (i.e., 1.8% and 1.5% respectively). For ViTs, DyBinaryCCT presents the superiority of the convolutional embedding layer in fully binarized ViTs and achieves 56.1% on the ImageNet dataset, which is nearly 9% higher than the baseline. △ Less

Submitted 4 November, 2022; originally announced November 2022.

arXiv:2210.12357 [pdf, other]

Information-Transport-based Policy for Simultaneous Translation

Authors: Shaolei Zhang, Yang Feng

Abstract: Simultaneous translation (ST) outputs translation while receiving the source inputs, and hence requires a policy to determine whether to translate a target token or wait for the next source token. The major challenge of ST is that each target token can only be translated based on the current received source tokens, where the received source information will directly affect the translation quality.… ▽ More Simultaneous translation (ST) outputs translation while receiving the source inputs, and hence requires a policy to determine whether to translate a target token or wait for the next source token. The major challenge of ST is that each target token can only be translated based on the current received source tokens, where the received source information will directly affect the translation quality. So naturally, how much source information is received for the translation of the current target token is supposed to be the pivotal evidence for the ST policy to decide between translating and waiting. In this paper, we treat the translation as information transport from source to target and accordingly propose an Information-Transport-based Simultaneous Translation (ITST). ITST quantifies the transported information weight from each source token to the current target token, and then decides whether to translate the target token according to its accumulated received information. Experiments on both text-to-text ST and speech-to-text ST (a.k.a., streaming speech translation) tasks show that ITST outperforms strong baselines and achieves state-of-the-art performance. △ Less

Submitted 31 October, 2022; v1 submitted 22 October, 2022; originally announced October 2022.

Comments: Accept to EMNLP 2022 main conference. 22 pages, 16 figures, 8 tables

Showing 1–50 of 107 results for author: Feng, Y