Search | arXiv e-print repository

Improving Segment Anything on the Fly: Auxiliary Online Learning and Adaptive Fusion for Medical Image Segmentation

Authors: Tianyu Huang, Tao Zhou, Weidi Xie, Shuo Wang, Qi Dou, Yizhe Zhang

Abstract: The current variants of the Segment Anything Model (SAM), which include the original SAM and Medical SAM, still lack the capability to produce sufficiently accurate segmentation for medical images. In medical imaging contexts, it is not uncommon for human experts to rectify segmentations of specific test samples after SAM generates its segmentation predictions. These rectifications typically entai… ▽ More The current variants of the Segment Anything Model (SAM), which include the original SAM and Medical SAM, still lack the capability to produce sufficiently accurate segmentation for medical images. In medical imaging contexts, it is not uncommon for human experts to rectify segmentations of specific test samples after SAM generates its segmentation predictions. These rectifications typically entail manual or semi-manual corrections employing state-of-the-art annotation tools. Motivated by this process, we introduce a novel approach that leverages the advantages of online machine learning to enhance Segment Anything (SA) during test time. We employ rectified annotations to perform online learning, with the aim of improving the segmentation quality of SA on medical images. To improve the effectiveness and efficiency of online learning when integrated with large-scale vision models like SAM, we propose a new method called Auxiliary Online Learning (AuxOL). AuxOL creates and applies a small auxiliary model (specialist) in conjunction with SAM (generalist), entails adaptive online-batch and adaptive segmentation fusion. Experiments conducted on eight datasets covering four medical imaging modalities validate the effectiveness of the proposed method. Our work proposes and validates a new, practical, and effective approach for enhancing SA on downstream segmentation tasks (e.g., medical image segmentation). △ Less

Submitted 2 June, 2024; originally announced June 2024.

Comments: Project Link: https://sam-auxol.github.io/AuxOL/

arXiv:2404.10556 [pdf, other]

Generative AI for Advanced UAV Networking

Authors: Geng Sun, Wenwen Xie, Dusit Niyato, Hongyang Du, Jiawen Kang, **g Wu, Sumei Sun, ** Zhang

Abstract: With the impressive achievements of chatGPT and Sora, generative artificial intelligence (GAI) has received increasing attention. Not limited to the field of content generation, GAI is also widely used to solve the problems in wireless communication scenarios due to its powerful learning and generalization capabilities. Therefore, we discuss key applications of GAI in improving unmanned aerial veh… ▽ More With the impressive achievements of chatGPT and Sora, generative artificial intelligence (GAI) has received increasing attention. Not limited to the field of content generation, GAI is also widely used to solve the problems in wireless communication scenarios due to its powerful learning and generalization capabilities. Therefore, we discuss key applications of GAI in improving unmanned aerial vehicle (UAV) communication and networking performance in this article. Specifically, we first review the key technologies of GAI and the important roles of UAV networking. Then, we show how GAI can improve the communication, networking, and security performances of UAV systems. Subsequently, we propose a novel framework of GAI for advanced UAV networking, and then present a case study of UAV-enabled spectrum map estimation and transmission rate optimization based on the proposed framework to verify the effectiveness of GAI-enabled UAV systems. Finally, we discuss some important open directions. △ Less

Submitted 16 April, 2024; originally announced April 2024.

arXiv:2401.16423 [pdf, other]

Synchformer: Efficient Synchronization from Sparse Cues

Authors: Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

Abstract: Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art… ▽ More Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability. △ Less

Submitted 29 January, 2024; originally announced January 2024.

Comments: Extended version of the ICASSP 24 paper. Project page: https://www.robots.ox.ac.uk/~vgg/research/synchformer/ Code: https://github.com/v-iashin/Synchformer

arXiv:2401.11141 [pdf, other]

Wideband Beamforming for RIS Assisted Near-Field Communications

Authors: Ji Wang, Jian Xiao, Yixuan Zou, Wenwu Xie, Yuanwei Liu

Abstract: A near-field wideband beamforming scheme is investigated for reconfigurable intelligent surface (RIS) assisted multiple-input multiple-output (MIMO) systems, in which a deep learning-based end-to-end (E2E) optimization framework is proposed to maximize the system spectral efficiency. To deal with the near-field double beam split effect, the base station is equipped with frequency-dependent hybrid… ▽ More A near-field wideband beamforming scheme is investigated for reconfigurable intelligent surface (RIS) assisted multiple-input multiple-output (MIMO) systems, in which a deep learning-based end-to-end (E2E) optimization framework is proposed to maximize the system spectral efficiency. To deal with the near-field double beam split effect, the base station is equipped with frequency-dependent hybrid precoding architecture by introducing sub-connected true time delay (TTD) units, while two specific RIS architectures, namely true time delay-based RIS (TTD-RIS) and virtual subarray-based RIS (SA-RIS), are exploited to realize the frequency-dependent passive beamforming at the RIS. Furthermore, the efficient E2E beamforming models without explicit channel state information are proposed, which jointly exploits the uplink channel training module and the downlink wideband beamforming module. In the proposed network architecture of the E2E models, the classical communication signal processing methods, i.e., polarized filtering and sparsity transform, are leveraged to develop a signal-guided beamforming network. Numerical results show that the proposed E2E models have superior beamforming performance and robustness to conventional beamforming benchmarks. Furthermore, the tradeoff between the beamforming gain and the hardware complexity is investigated for different frequency-dependent RIS architectures, in which the TTD-RIS can achieve better spectral efficiency than the SA-RIS while requiring additional energy consumption and hardware cost. △ Less

Submitted 20 January, 2024; originally announced January 2024.

arXiv:2312.17183 [pdf, other]

One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts

Authors: Ziheng Zhao, Yao Zhang, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie

Abstract: In this study, we focus on building up a model that aims to Segment Anything in medical scenarios, driven by Text prompts, termed as SAT. Our main contributions are three folds: (i) for dataset construction, we combine multiple knowledge sources to construct the first multi-modal knowledge tree on human anatomy, including 6502 anatomical terminologies; Then we build up the largest and most compreh… ▽ More In this study, we focus on building up a model that aims to Segment Anything in medical scenarios, driven by Text prompts, termed as SAT. Our main contributions are three folds: (i) for dataset construction, we combine multiple knowledge sources to construct the first multi-modal knowledge tree on human anatomy, including 6502 anatomical terminologies; Then we build up the largest and most comprehensive segmentation dataset for training, by collecting over 22K 3D medical image scans from 72 segmentation datasets with careful standardization on both image scans and label space; (ii) for architecture design, we formulate a universal segmentation model, that can be prompted by inputting medical terminologies in text form. We present knowledge-enhanced representation learning on the combination of a large number of datasets; (iii) for model evaluation, we train a SAT-Pro with only 447M parameters, to segment 72 different segmentation datasets with text prompt, resulting in 497 classes. We have thoroughly evaluated the model from three aspects: averaged by body regions, averaged by classes, and average by datasets, demonstrating comparable performance to 72 specialist nnU-Nets, i.e., we train nnU-Net models on each dataset/subset, resulting in 72 nnU-Nets with around 2.2B parameters for the 72 datasets. We will release all the codes, and models in this work. △ Less

Submitted 1 May, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: 53 pages

arXiv:2311.18788 [pdf, other]

doi 10.1016/j.media.2020.101942

Automated interpretation of congenital heart disease from multi-view echocardiograms

Authors: **g Wang, Xiaofeng Liu, Fangyun Wang, Lin Zheng, Fengqiao Gao, Hanwen Zhang, Xin Zhang, Wanqing Xie, Binbin Wang

Abstract: Congenital heart disease (CHD) is the most common birth defect and the leading cause of neonate death in China. Clinical diagnosis can be based on the selected 2D key-frames from five views. Limited by the availability of multi-view data, most methods have to rely on the insufficient single view analysis. This study proposes to automatically analyze the multi-view echocardiograms with a practical… ▽ More Congenital heart disease (CHD) is the most common birth defect and the leading cause of neonate death in China. Clinical diagnosis can be based on the selected 2D key-frames from five views. Limited by the availability of multi-view data, most methods have to rely on the insufficient single view analysis. This study proposes to automatically analyze the multi-view echocardiograms with a practical end-to-end framework. We collect the five-view echocardiograms video records of 1308 subjects (including normal controls, ventricular septal defect (VSD) patients and atrial septal defect (ASD) patients) with both disease labels and standard-view key-frame labels. Depthwise separable convolution-based multi-channel networks are adopted to largely reduce the network parameters. We also approach the imbalanced class problem by augmenting the positive training samples. Our 2D key-frame model can diagnose CHD or negative samples with an accuracy of 95.4\%, and in negative, VSD or ASD classification with an accuracy of 92.3\%. To further alleviate the work of key-frame selection in real-world implementation, we propose an adaptive soft attention scheme to directly explore the raw video data. Four kinds of neural aggregation methods are systematically investigated to fuse the information of an arbitrary number of frames in a video. Moreover, with a view detection module, the system can work without the view records. Our video-based model can diagnose with an accuracy of 93.9\% (binary classification), and 92.1\% (3-class classification) in a collected 2D video testing set, which does not need key-frame selection and view annotation in testing. The detailed ablation study and the interpretability analysis are provided. △ Less

Submitted 30 November, 2023; originally announced November 2023.

Comments: Published in Medical Image Analysis

Journal ref: Medical Image Analysis (Volume 69, April 2021, 101942)

arXiv:2311.17624 [pdf, other]

Combating Multi-path Interference to Improve Chirp-based Underwater Acoustic Communication

Authors: Wenjun Xie, Enqi Zhang, Lizhao You, Deqing Wang, Zhaorui Wang, Liqun Fu

Abstract: Linear chirp-based underwater acoustic communication has been widely used due to its reliability and long-range transmission capability. However, unlike the counterpart chirp technology in wireless -- LoRa, its throughput is severely limited by the number of modulated chirps in a symbol. The fundamental challenge lies in the underwater multi-path channel, where the delayed copied of one symbol may… ▽ More Linear chirp-based underwater acoustic communication has been widely used due to its reliability and long-range transmission capability. However, unlike the counterpart chirp technology in wireless -- LoRa, its throughput is severely limited by the number of modulated chirps in a symbol. The fundamental challenge lies in the underwater multi-path channel, where the delayed copied of one symbol may cause inter-symbol and intra-symbol interfere. In this paper, we present UWLoRa+, a system that realizes the same chirp modulation as LoRa with higher data rate, and enhances LoRa's design to address the multi-path challenge via the following designs: a) we replace the linear chirp used by LoRa with the non-linear chirp to reduce the signal interference range and the collision probability; b) we design an algorithm that first demodulates each path and then combines the demodulation results of detected paths; and c) we replace the Hamming codes used by LoRa with the non-binary LDPC codes to mitigate the impact of the inevitable collision.Experiment results show that the new designs improve the bit error rate (BER) by 3x, and the packet error rate (PER) significantly, compared with the LoRa's naive design. Compared with an state-of-the-art system for decoding underwater LoRa chirp signal, UWLoRa+ improves the throughput by up to 50 times. △ Less

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2310.17470 [pdf, other]

doi 10.1109/TGCN.2023.3298039

Adaptive Digital Twin for UAV-Assisted Integrated Sensing, Communication, and Computation Networks

Authors: Bin Li, Wenshuai Liu, Wancheng Xie, Ning Zhang, Yan Zhang

Abstract: In this paper, we study a digital twin (DT)-empowered integrated sensing, communication, and computation network. Specifically, the users perform radar sensing and computation offloading on the same spectrum, while unmanned aerial vehicles (UAVs) are deployed to provide edge computing service. We first formulate a multi-objective optimization problem to minimize the beampattern performance of mult… ▽ More In this paper, we study a digital twin (DT)-empowered integrated sensing, communication, and computation network. Specifically, the users perform radar sensing and computation offloading on the same spectrum, while unmanned aerial vehicles (UAVs) are deployed to provide edge computing service. We first formulate a multi-objective optimization problem to minimize the beampattern performance of multi-input multi-output (MIMO) radars and the computation offloading energy consumption simultaneously. Then, we explore the prediction capability of DT to provide intelligent offloading decision, where the DT estimation deviation is considered. To track this challenge, we reformulate the original problem as a multi-agent Markov decision process and design a multi-agent proximal policy optimization (MAPPO) framework to achieve a flexible learning policy. Furthermore, the Beta-policy and attention mechanism are used to improve the training performance. Numerical results show that the proposed method is able to balance the performance tradeoff between sensing and computation functions, while reducing the energy consumption compared with the existing studies. △ Less

Submitted 26 October, 2023; originally announced October 2023.

Comments: 14 pages, 11 figures,

Journal ref: IEEE Transactions on Green Communications and Networking, 2023

arXiv:2310.15371 [pdf, other]

doi 10.1007/978-3-031-34048-2_28

Vicinal Feature Statistics Augmentation for Federated 3D Medical Volume Segmentation

Authors: Yongsong Huang, Wanqing Xie, Mingzhen Li, Mingmei Cheng, **zhou Wu, Weixiao Wang, Jane You, Xiaofeng Liu

Abstract: Federated learning (FL) enables multiple client medical institutes collaboratively train a deep learning (DL) model with privacy protection. However, the performance of FL can be constrained by the limited availability of labeled data in small institutes and the heterogeneous (i.e., non-i.i.d.) data distribution across institutes. Though data augmentation has been a proven technique to boost the g… ▽ More Federated learning (FL) enables multiple client medical institutes collaboratively train a deep learning (DL) model with privacy protection. However, the performance of FL can be constrained by the limited availability of labeled data in small institutes and the heterogeneous (i.e., non-i.i.d.) data distribution across institutes. Though data augmentation has been a proven technique to boost the generalization capabilities of conventional centralized DL as a "free lunch", its application in FL is largely underexplored. Notably, constrained by costly labeling, 3D medical segmentation generally relies on data augmentation. In this work, we aim to develop a vicinal feature-level data augmentation (VFDA) scheme to efficiently alleviate the local feature shift and facilitate collaborative training for privacy-aware FL segmentation. We take both the inner- and inter-institute divergence into consideration, without the need for cross-institute transfer of raw data or their mixup. Specifically, we exploit the batch-wise feature statistics (e.g., mean and standard deviation) in each institute to abstractly represent the discrepancy of data, and model each feature statistic probabilistically via a Gaussian prototype, with the mean corresponding to the original statistic and the variance quantifying the augmentation scope. From the vicinal risk minimization perspective, novel feature statistics can be drawn from the Gaussian distribution to fulfill augmentation. The variance is explicitly derived by the data bias in each individual institute and the underlying feature statistics characterized by all participating institutes. The added-on VFDA consistently yielded marked improvements over six advanced FL methods on both 3D brain tumor and cardiac segmentation. △ Less

Submitted 23 October, 2023; originally announced October 2023.

Comments: 28th biennial international conference on Information Processing in Medical Imaging (IPMI 2023): Oral Paper

Journal ref: In: Frangi, A., de Bruijne, M., Wassermann, D., Navab, N. (eds) Information Processing in Medical Imaging. IPMI 2023. Lecture Notes in Computer Science, vol 13939. Springer, Cham

arXiv:2309.11500 [pdf, other]

A Large-scale Dataset for Audio-Language Representation Learning

Authors: Luoyi Sun, Xuenan Xu, Mengyue Wu, Weidi Xie

Abstract: The AI community has made significant strides in develo** powerful foundation models, driven by large-scale multimodal datasets. However, in the audio representation learning community, the present audio-language datasets suffer from limitations such as insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative and automatic a… ▽ More The AI community has made significant strides in develo** powerful foundation models, driven by large-scale multimodal datasets. However, in the audio representation learning community, the present audio-language datasets suffer from limitations such as insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative and automatic audio caption generation pipeline based on a series of public tools or APIs, and construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on our dataset and show performance improvement on various downstream tasks, namely, audio-language retrieval, audio captioning, environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at https://auto-acd.github.io/. △ Less

Submitted 3 October, 2023; v1 submitted 20 September, 2023; originally announced September 2023.

arXiv:2309.02576 [pdf, other]

doi 10.1038/s41598-023-40116-6

Emphysema Subty** on Thoracic Computed Tomography Scans using Deep Neural Networks

Authors: Weiyi Xie, Colin Jacobs, Jean-Paul Charbonnier, Dirk Jan Slebos, Bram van Ginneken

Abstract: Accurate identification of emphysema subtypes and severity is crucial for effective management of COPD and the study of disease heterogeneity. Manual analysis of emphysema subtypes and severity is laborious and subjective. To address this challenge, we present a deep learning-based approach for automating the Fleischner Society's visual score system for emphysema subty** and severity analysis. W… ▽ More Accurate identification of emphysema subtypes and severity is crucial for effective management of COPD and the study of disease heterogeneity. Manual analysis of emphysema subtypes and severity is laborious and subjective. To address this challenge, we present a deep learning-based approach for automating the Fleischner Society's visual score system for emphysema subty** and severity analysis. We trained and evaluated our algorithm using 9650 subjects from the COPDGene study. Our algorithm achieved the predictive accuracy at 52\%, outperforming a previously published method's accuracy of 45\%. In addition, the agreement between the predicted scores of our method and the visual scores was good, where the previous method obtained only moderate agreement. Our approach employs a regression training strategy to generate categorical labels while simultaneously producing high-resolution localized activation maps for visualizing the network predictions. By leveraging these dense activation maps, our method possesses the capability to compute the percentage of emphysema involvement per lung in addition to categorical severity scores. Furthermore, the proposed method extends its predictive capabilities beyond centrilobular emphysema to include paraseptal emphysema subtypes. △ Less

Submitted 5 September, 2023; originally announced September 2023.

Journal ref: Sci Rep. 2023 Aug 29;13(1):14147

arXiv:2308.11980 [pdf, other]

Joint Prediction of Audio Event and Annoyance Rating in an Urban Soundscape by Hierarchical Graph Representation Learning

Authors: Yuanbo Hou, Siyang Song, Cheng Luo, Andrew Mitchell, Qiaoqiao Ren, Weicheng Xie, Jian Kang, Wenwu Wang, Dick Botteldooren

Abstract: Sound events in daily life carry rich information about the objective world. The composition of these sounds affects the mood of people in a soundscape. Most previous approaches only focus on classifying and detecting audio events and scenes, but may ignore their perceptual quality that may impact humans' listening mood for the environment, e.g. annoyance. To this end, this paper proposes a novel… ▽ More Sound events in daily life carry rich information about the objective world. The composition of these sounds affects the mood of people in a soundscape. Most previous approaches only focus on classifying and detecting audio events and scenes, but may ignore their perceptual quality that may impact humans' listening mood for the environment, e.g. annoyance. To this end, this paper proposes a novel hierarchical graph representation learning (HGRL) approach which links objective audio events (AE) with subjective annoyance ratings (AR) of the soundscape perceived by humans. The hierarchical graph consists of fine-grained event (fAE) embeddings with single-class event semantics, coarse-grained event (cAE) embeddings with multi-class event semantics, and AR embeddings. Experiments show the proposed HGRL successfully integrates AE with AR for AEC and ARP tasks, while coordinating the relations between cAE and fAE and further aligning the two different grains of AE information with the AR. △ Less

Submitted 23 August, 2023; originally announced August 2023.

Comments: INTERSPEECH 2023, Code and models: https://github.com/Yuanbo2020/HGRL

arXiv:2307.12717 [pdf, ps, other]

Dense Transformer based Enhanced Coding Network for Unsupervised Metal Artifact Reduction

Authors: Wangduo Xie, Matthew B. Blaschko

Abstract: CT images corrupted by metal artifacts have serious negative effects on clinical diagnosis. Considering the difficulty of collecting paired data with ground truth in clinical settings, unsupervised methods for metal artifact reduction are of high interest. However, it is difficult for previous unsupervised methods to retain structural information from CT images while handling the non-local charact… ▽ More CT images corrupted by metal artifacts have serious negative effects on clinical diagnosis. Considering the difficulty of collecting paired data with ground truth in clinical settings, unsupervised methods for metal artifact reduction are of high interest. However, it is difficult for previous unsupervised methods to retain structural information from CT images while handling the non-local characteristics of metal artifacts. To address these challenges, we proposed a novel Dense Transformer based Enhanced Coding Network (DTEC-Net) for unsupervised metal artifact reduction. Specifically, we introduce a Hierarchical Disentangling Encoder, supported by the high-order dense process, and transformer to obtain densely encoded sequences with long-range correspondence. Then, we present a second-order disentanglement method to improve the dense sequence's decoding process. Extensive experiments and model discussions illustrate DTEC-Net's effectiveness, which outperforms the previous state-of-the-art methods on a benchmark dataset, and greatly reduces metal artifacts while restoring richer texture details. △ Less

Submitted 28 July, 2023; v1 submitted 24 July, 2023; originally announced July 2023.

arXiv:2306.02054 [pdf]

Low-Complexity Acoustic Scene Classification Using Data Augmentation and Lightweight ResNet

Authors: Yanxiong Li, Wenchang Cao, Wei Xie, Qisheng Huang, Wenfeng Pang, Qianhua He

Abstract: We present a work on low-complexity acoustic scene classification (ASC) with multiple devices, namely the subtask A of Task 1 of the DCASE2021 challenge. This subtask focuses on classifying audio samples of multiple devices with a low-complexity model, where two main difficulties need to be overcome. First, the audio samples are recorded by different devices, and there is mismatch of recording dev… ▽ More We present a work on low-complexity acoustic scene classification (ASC) with multiple devices, namely the subtask A of Task 1 of the DCASE2021 challenge. This subtask focuses on classifying audio samples of multiple devices with a low-complexity model, where two main difficulties need to be overcome. First, the audio samples are recorded by different devices, and there is mismatch of recording devices in audio samples. We reduce the negative impact of the mismatch of recording devices by using some effective strategies, including data augmentation (e.g., mix-up, spectrum correction, pitch shift), usages of multi-patch network structure and channel attention. Second, the model size should be smaller than a threshold (e.g., 128 KB required by the DCASE2021 challenge). To meet this condition, we adopt a ResNet with both depthwise separable convolution and channel attention as the backbone network, and perform model compression. In summary, we propose a low-complexity ASC method using data augmentation and a lightweight ResNet. Evaluated on the official development and evaluation datasets, our method obtains classification accuracy scores of 71.6% and 66.7%, respectively; and obtains Log-loss scores of 1.038 and 1.136, respectively. Our final model size is 110.3 KB which is smaller than the maximum of 128 KB. △ Less

Submitted 3 June, 2023; originally announced June 2023.

Comments: 5 pages, 5 figures, 4 tables. Accepted for publication in the 16th IEEE International Conference on Signal Processing (IEEE ICSP)

arXiv:2306.02053 [pdf]

Few-shot Class-incremental Audio Classification Using Stochastic Classifier

Authors: Yanxiong Li, Wenchang Cao, Jialong Li, Wei Xie, Qianhua He

Abstract: It is generally assumed that number of classes is fixed in current audio classification methods, and the model can recognize pregiven classes only. When new classes emerge, the model needs to be retrained with adequate samples of all classes. If new classes continually emerge, these methods will not work well and even infeasible. In this study, we propose a method for fewshot class-incremental aud… ▽ More It is generally assumed that number of classes is fixed in current audio classification methods, and the model can recognize pregiven classes only. When new classes emerge, the model needs to be retrained with adequate samples of all classes. If new classes continually emerge, these methods will not work well and even infeasible. In this study, we propose a method for fewshot class-incremental audio classification, which continually recognizes new classes and remember old ones. The proposed model consists of an embedding extractor and a stochastic classifier. The former is trained in base session and frozen in incremental sessions, while the latter is incrementally expanded in all sessions. Two datasets (NS-100 and LS-100) are built by choosing samples from audio corpora of NSynth and LibriSpeech, respectively. Results show that our method exceeds four baseline ones in average accuracy and performance drop** rate. Code is at https://github.com/vinceasvp/meta-sc. △ Less

Submitted 3 June, 2023; originally announced June 2023.

Comments: 5 pages, 3 figures, 4 tables. Accepted for publication in INTERSPEECH 2023

arXiv:2305.19539 [pdf]

doi 10.1109/TMM.2023.3280011

Few-shot Class-incremental Audio Classification Using Dynamically Expanded Classifier with Self-attention Modified Prototypes

Authors: Yanxiong Li, Wenchang Cao, Wei Xie, Jialong Li, Emmanouil Benetos

Abstract: Most existing methods for audio classification assume that the vocabulary of audio classes to be classified is fixed. When novel (unseen) audio classes appear, audio classification systems need to be retrained with abundant labeled samples of all audio classes for recognizing base (initial) and novel audio classes. If novel audio classes continue to appear, the existing methods for audio classific… ▽ More Most existing methods for audio classification assume that the vocabulary of audio classes to be classified is fixed. When novel (unseen) audio classes appear, audio classification systems need to be retrained with abundant labeled samples of all audio classes for recognizing base (initial) and novel audio classes. If novel audio classes continue to appear, the existing methods for audio classification will be inefficient and even infeasible. In this work, we propose a method for few-shot class-incremental audio classification, which can continually recognize novel audio classes without forgetting old ones. The framework of our method mainly consists of two parts: an embedding extractor and a classifier, and their constructions are decoupled. The embedding extractor is the backbone of a ResNet based network, which is frozen after construction by a training strategy using only samples of base audio classes. However, the classifier consisting of prototypes is expanded by a prototype adaptation network with few samples of novel audio classes in incremental sessions. Labeled support samples and unlabeled query samples are used to train the prototype adaptation network and update the classifier, since they are informative for audio classification. Three audio datasets, named NSynth-100, FSC-89 and LS-100 are built by choosing samples from audio corpora of NSynth, FSD-MIX-CLIP and LibriSpeech, respectively. Results show that our method exceeds baseline methods in average accuracy and performance drop** rate. In addition, it is competitive compared to baseline methods in computational complexity and memory requirement. The code for our method is given at https://github.com/vinceasvp/FCAC. △ Less

Submitted 30 May, 2023; originally announced May 2023.

Comments: 13 pages, 8 figures, 12 tables. Accepted for publication in IEEE TMM

arXiv:2305.18045 [pdf, ps, other]

Few-shot Class-incremental Audio Classification Using Adaptively-refined Prototypes

Authors: Wei Xie, Yanxiong Li, Qianhua He, Wenchang Cao, Tuomas Virtanen

Abstract: New classes of sounds constantly emerge with a few samples, making it challenging for models to adapt to dynamic acoustic environments. This challenge motivates us to address the new problem of few-shot class-incremental audio classification. This study aims to enable a model to continuously recognize new classes of sounds with a few training samples of new classes while remembering the learned on… ▽ More New classes of sounds constantly emerge with a few samples, making it challenging for models to adapt to dynamic acoustic environments. This challenge motivates us to address the new problem of few-shot class-incremental audio classification. This study aims to enable a model to continuously recognize new classes of sounds with a few training samples of new classes while remembering the learned ones. To this end, we propose a method to generate discriminative prototypes and use them to expand the model's classifier for recognizing sounds of new and learned classes. The model is first trained with a random episodic training strategy, and then its backbone is used to generate the prototypes. A dynamic relation projection module refines the prototypes to enhance their discriminability. Results on two datasets (derived from the corpora of Nsynth and FSD-MIX-CLIPS) show that the proposed method exceeds three state-of-the-art methods in average accuracy and performance drop** rate. △ Less

Submitted 29 May, 2023; originally announced May 2023.

Comments: 5 pages,2 figures, Accepted by Interspeech 2023

arXiv:2303.10372 [pdf, other]

Just Noticeable Visual Redundancy Forecasting: A Deep Multimodal-driven Approach

Authors: Wuyuan Xie, Shukang Wang, Sukun Tian, Lirong Huang, Ye Liu, Miaohui Wang

Abstract: Just noticeable difference (JND) refers to the maximum visual change that human eyes cannot perceive, and it has a wide range of applications in multimedia systems. However, most existing JND approaches only focus on a single modality, and rarely consider the complementary effects of multimodal information. In this article, we investigate the JND modeling from an end-to-end homologous multimodal p… ▽ More Just noticeable difference (JND) refers to the maximum visual change that human eyes cannot perceive, and it has a wide range of applications in multimedia systems. However, most existing JND approaches only focus on a single modality, and rarely consider the complementary effects of multimodal information. In this article, we investigate the JND modeling from an end-to-end homologous multimodal perspective, namely hmJND-Net. Specifically, we explore three important visually sensitive modalities, including saliency, depth, and segmentation. To better utilize homologous multimodal information, we establish an effective fusion method via summation enhancement and subtractive offset, and align homologous multimodal features based on a self-attention driven encoder-decoder paradigm. Extensive experimental results on eight different benchmark datasets validate the superiority of our hmJND-Net over eight representative methods. △ Less

Submitted 18 March, 2023; originally announced March 2023.

Journal ref: AAAI 2023

arXiv:2302.09021 [pdf, ps, other]

doi 10.1109/TWC.2023.3235997

Energy Efficient Computation Offloading in Aerial Edge Networks With Multi-Agent Cooperation

Authors: Wenshuai Liu, Bin Li, Wancheng Xie, Yueyue Dai, Zesong Fei

Abstract: With the high flexibility of supporting resource-intensive and time-sensitive applications, unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) is proposed as an innovational paradigm to support the mobile users (MUs). As a promising technology, digital twin (DT) is capable of timely map** the physical entities to virtual models, and reflecting the MEC network state in real-time.… ▽ More With the high flexibility of supporting resource-intensive and time-sensitive applications, unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) is proposed as an innovational paradigm to support the mobile users (MUs). As a promising technology, digital twin (DT) is capable of timely map** the physical entities to virtual models, and reflecting the MEC network state in real-time. In this paper, we first propose an MEC network with multiple movable UAVs and one DT-empowered ground base station to enhance the MEC service for MUs. Considering the limited energy resource of both MUs and UAVs, we formulate an online problem of resource scheduling to minimize the weighted energy consumption of them. To tackle the difficulty of the combinational problem, we formulate it as a Markov decision process (MDP) with multiple types of agents. Since the proposed MDP has huge state space and action space, we propose a deep reinforcement learning approach based on multi-agent proximal policy optimization (MAPPO) with Beta distribution and attention mechanism to pursue the optimal computation offloading policy. Numerical results show that our proposed scheme is able to efficiently reduce the energy consumption and outperforms the benchmarks in performance, convergence speed and utilization of resources. △ Less

Submitted 14 February, 2023; originally announced February 2023.

Comments: 14 pages, 13 figures

arXiv:2301.02228 [pdf, other]

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology

Authors: Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie

Abstract: In this paper, we consider enhancing medical visual-language pre-training (VLP) with domain-specific knowledge, by exploiting the paired image-text reports from the radiological daily practice. In particular, we make the following contributions: First, unlike existing works that directly process the raw reports, we adopt a novel triplet extraction module to extract the medical-related information,… ▽ More In this paper, we consider enhancing medical visual-language pre-training (VLP) with domain-specific knowledge, by exploiting the paired image-text reports from the radiological daily practice. In particular, we make the following contributions: First, unlike existing works that directly process the raw reports, we adopt a novel triplet extraction module to extract the medical-related information, avoiding unnecessary complexity from language grammar and enhancing the supervision signals; Second, we propose a novel triplet encoding module with entity translation by querying a knowledge base, to exploit the rich domain knowledge in medical field, and implicitly build relationships between medical entities in the language embedding space; Third, we propose to use a Transformer-based fusion model for spatially aligning the entity description with visual signals at the image patch level, enabling the ability for medical diagnosis; Fourth, we conduct thorough experiments to validate the effectiveness of our architecture, and benchmark on numerous public benchmarks, e.g., ChestX-ray14, RSNA Pneumonia, SIIM-ACR Pneumothorax, COVIDx CXR-2, COVID Rural, and EdemaSeverity. In both zero-shot and fine-tuning settings, our model has demonstrated strong performance compared with the former methods on disease classification and grounding. △ Less

Submitted 3 April, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

arXiv:2210.07055 [pdf, other]

Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors

Authors: Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

Abstract: The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For such videos, the events that may be harnessed for synchronisation cues may be spatially small and may occur only infrequently during a many seconds-long video clip, i.e. the synchronisation signal is 'sparse in space and time'. This contrasts with the case of synchronising videos of talking heads, wher… ▽ More The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For such videos, the events that may be harnessed for synchronisation cues may be spatially small and may occur only infrequently during a many seconds-long video clip, i.e. the synchronisation signal is 'sparse in space and time'. This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space. We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors' to distil the long audio and visual streams into small sequences that are then used to predict the temporal offset between streams. (ii) We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task. (iii) We curate a dataset with only sparse in time and space synchronisation signals; and (iv) the effectiveness of the proposed model is shown on both dense and sparse datasets quantitatively and qualitatively. Project page: v-iashin.github.io/SparseSync △ Less

Submitted 13 October, 2022; originally announced October 2022.

Comments: Accepted as a spotlight presentation for the BMVC 2022. Code: https://github.com/v-iashin/SparseSync Project page: https://v-iashin.github.io/SparseSync

arXiv:2209.05477 [pdf, other]

Adaptive 3D Localization of 2D Freehand Ultrasound Brain Images

Authors: Pak-Hei Yeung, Moska Aliasi, Monique Haak, The INTERGROWTH-21st Consortium, Weidi Xie, Ana I. L. Namburete

Abstract: Two-dimensional (2D) freehand ultrasound is the mainstay in prenatal care and fetal growth monitoring. The task of matching corresponding cross-sectional planes in the 3D anatomy for a given 2D ultrasound brain scan is essential in freehand scanning, but challenging. We propose AdLocUI, a framework that Adaptively Localizes 2D Ultrasound Images in the 3D anatomical atlas without using any external… ▽ More Two-dimensional (2D) freehand ultrasound is the mainstay in prenatal care and fetal growth monitoring. The task of matching corresponding cross-sectional planes in the 3D anatomy for a given 2D ultrasound brain scan is essential in freehand scanning, but challenging. We propose AdLocUI, a framework that Adaptively Localizes 2D Ultrasound Images in the 3D anatomical atlas without using any external tracking sensor.. We first train a convolutional neural network with 2D slices sampled from co-aligned 3D ultrasound volumes to predict their locations in the 3D anatomical atlas. Next, we fine-tune it with 2D freehand ultrasound images using a novel unsupervised cycle consistency, which utilizes the fact that the overall displacement of a sequence of images in the 3D anatomical atlas is equal to the displacement from the first image to the last in that sequence. We demonstrate that AdLocUI can adapt to three different ultrasound datasets, acquired with different machines and protocols, and achieves significantly better localization accuracy than the baselines. AdLocUI can be used for sensorless 2D freehand ultrasound guidance by the bedside. The source code is available at https://github.com/pakheiyeung/AdLocUI. △ Less

Submitted 12 September, 2022; originally announced September 2022.

Comments: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2022

arXiv:2208.06222 [pdf, other]

Scale-free and Task-agnostic Attack: Generating Photo-realistic Adversarial Patterns with Patch Quilting Generator

Authors: Xiangbo Gao, Cheng Luo, Qinliang Lin, Weicheng Xie, Minmin Liu, Linlin Shen, Keerthy Kusumam, Siyang Song

Abstract: \noindent Traditional L_p norm-restricted image attack algorithms suffer from poor transferability to black box scenarios and poor robustness to defense algorithms. Recent CNN generator-based attack approaches can synthesize unrestricted and semantically meaningful entities to the image, which is shown to be transferable and robust. However, such methods attack images by either synthesizing local… ▽ More \noindent Traditional L_p norm-restricted image attack algorithms suffer from poor transferability to black box scenarios and poor robustness to defense algorithms. Recent CNN generator-based attack approaches can synthesize unrestricted and semantically meaningful entities to the image, which is shown to be transferable and robust. However, such methods attack images by either synthesizing local adversarial entities, which are only suitable for attacking specific contents or performing global attacks, which are only applicable to a specific image scale. In this paper, we propose a novel Patch Quilting Generative Adversarial Networks (PQ-GAN) to learn the first scale-free CNN generator that can be applied to attack images with arbitrary scales for various computer vision tasks. The principal investigation on transferability of the generated adversarial examples, robustness to defense frameworks, and visual quality assessment show that the proposed PQG-based attack framework outperforms the other nine state-of-the-art adversarial attack approaches when attacking the neural networks trained on two standard evaluation datasets (i.e., ImageNet and CityScapes). △ Less

Submitted 19 November, 2022; v1 submitted 12 August, 2022; originally announced August 2022.

arXiv:2206.12772 [pdf, other]

doi 10.1145/3503161.3548317

Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation

Authors: **xiang Liu, Chen Ju, Weidi Xie, Ya Zhang

Abstract: We present a simple yet effective self-supervised framework for audio-visual representation learning, to localize the sound source in videos. To understand what enables to learn useful representations, we systematically investigate the effects of data augmentations, and reveal that (1) composition of data augmentations plays a critical role, i.e. explicitly encouraging the audio-visual representat… ▽ More We present a simple yet effective self-supervised framework for audio-visual representation learning, to localize the sound source in videos. To understand what enables to learn useful representations, we systematically investigate the effects of data augmentations, and reveal that (1) composition of data augmentations plays a critical role, i.e. explicitly encouraging the audio-visual representations to be invariant to various transformations~({\em transformation invariance}); (2) enforcing geometric consistency substantially improves the quality of learned representations, i.e. the detected sound source should follow the same transformation applied on input video frames~({\em transformation equivariance}). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound localization benchmarks, namely, Flickr-SoundNet and VGG-Sound. Additionally, we also evaluate audio retrieval and cross-modal retrieval tasks. In both cases, our self-supervised models demonstrate superior retrieval performances, even competitive with the supervised approach in audio retrieval. This reveals the proposed framework learns strong multi-modal representations that are beneficial to sound localisation and generalization to further applications. \textit{All codes will be available}. △ Less

Submitted 15 August, 2022; v1 submitted 25 June, 2022; originally announced June 2022.

Comments: Camera-ready Version for ACMMM 2022, Project page is https://**xiang-liu.github.io/SSL-TIE/

arXiv:2206.06947 [pdf, other]

K-Space Transformer for Undersampled MRI Reconstruction

Authors: Ziheng Zhao, Tianjiao Zhang, Weidi Xie, Yanfeng Wang, Ya Zhang

Abstract: This paper considers the problem of undersampled MRI reconstruction. We propose a novel Transformer-based framework for directly processing signal in k-space, going beyond the limitation of regular grids as ConvNets do. We adopt an implicit representation of k-space spectrogram, treating spatial coordinates as inputs, and dynamically query the sparsely sampled points to reconstruct the spectrogram… ▽ More This paper considers the problem of undersampled MRI reconstruction. We propose a novel Transformer-based framework for directly processing signal in k-space, going beyond the limitation of regular grids as ConvNets do. We adopt an implicit representation of k-space spectrogram, treating spatial coordinates as inputs, and dynamically query the sparsely sampled points to reconstruct the spectrogram, i.e. learning the inductive bias in k-space. To strike a balance between computational cost and reconstruction quality, we build the decoder with hierarchical structure to generate low-resolution and high-resolution outputs respectively. To validate the effectiveness of our proposed method, we have conducted extensive experiments on two public datasets, and demonstrate superior or comparable performance to state-of-the-art approaches. △ Less

Submitted 10 November, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

arXiv:2205.03920 [pdf, other]

From Discovery to Production: Challenges and Novel Methodologies for Next Generation Biomanufacturing

Authors: Wei Xie, Giulia Pedrielli

Abstract: The increasingly pressing demand of novel drugs (e.g., gene therapies for personalized cancer care, ever evolving vaccines) with unprecedented levels of personalization, has put a remarkable pressure on the traditionally long time required by the pharma R&D and manufacturing to go from design to production of new products. The revolution has already brought important changes in the technologies us… ▽ More The increasingly pressing demand of novel drugs (e.g., gene therapies for personalized cancer care, ever evolving vaccines) with unprecedented levels of personalization, has put a remarkable pressure on the traditionally long time required by the pharma R&D and manufacturing to go from design to production of new products. The revolution has already brought important changes in the technologies used within the industry. In fact, practitioners are increasingly moving away from the classical paradigm of large-scale batch production to continuous biomanufacturing with flexible and modular design, which is further supported by the recent technology advance in single-use equipment. In contrast to long design processes, low product variability (one-fits-all), and highly rigid systems, modern pharma players are answering the question: can we bring design and process control up to the speed that novel production technologies give us to quickly set up a flexible production run? In this tutorial, we present key challenges and potential solutions from the world of operations research that can support answering such question. We first present technical challenges and novel methods for the design of next generation drugs, followed by the process modeling and control approaches to successfully and efficiently manufacture them. △ Less

Submitted 28 June, 2022; v1 submitted 8 May, 2022; originally announced May 2022.

Comments: 15 pages, 5 figures

arXiv:2205.03229 [pdf]

Multi-core fiber enabled fading noise suppression in φ-OFDR based quantitative distributed vibration sensing

Authors: Yuxiang Feng, Weilin Xie, Yinxia Meng, Jiang Yang, Qiang Yang, Yan Ren, Tianwai Bo, Zhongwei Tan, Wei Wei, Yi Dong

Abstract: Coherent fading has been regarded as a critical issue in phase-sensitive optical frequency domain reflectometry (φ-OFDR) based distributed fiber-optic sensing. Here, we report on an approach for fading noise suppression in φ-OFDR with multi-core fiber. By exploiting the independent nature of the randomness in the distribution of reflective index in each of the cores, the drastic phase fluctuations… ▽ More Coherent fading has been regarded as a critical issue in phase-sensitive optical frequency domain reflectometry (φ-OFDR) based distributed fiber-optic sensing. Here, we report on an approach for fading noise suppression in φ-OFDR with multi-core fiber. By exploiting the independent nature of the randomness in the distribution of reflective index in each of the cores, the drastic phase fluctuations due to the fading phenomina can be effectively alleviated by applying weighted vectorial averaging for the Rayleigh backscattering traces from each of the cores with distinct fading distributions. With the consistent linear response with respect to external excitation of interest for each of the cores, demonstration for the propsoed φ-OFDR with a commercial seven-core fiber has achieved highly sensitive quantitative distributed vibration sensing with about 2.2 nm length precision and 2 cm sensing resolution along the 500 m fiber, corresponding to a range resolution factor as high as about about 4E-5. Featuring long distance, high sensitivity, high resolution, and fading robustness, this approach has shown promising potentials in various sensing techniques for a wide range of practical scenarios. △ Less

Submitted 3 May, 2022; originally announced May 2022.

Comments: 4 pages

arXiv:2203.08980 [pdf, other]

Stochastic Simulation Uncertainty Analysis to Accelerate Flexible Biomanufacturing Process Development

Authors: Wei Xie, Russell R. Barton, Barry L. Nelson, Keqi Wang

Abstract: Motivated by critical challenges and needs from biopharmaceuticals manufacturing, we propose a general metamodel-assisted stochastic simulation uncertainty analysis framework to accelerate the development of a simulation model with modular design for flexible production processes. There are often very limited process observations. Thus, there exist both simulation and model uncertainties in the sy… ▽ More Motivated by critical challenges and needs from biopharmaceuticals manufacturing, we propose a general metamodel-assisted stochastic simulation uncertainty analysis framework to accelerate the development of a simulation model with modular design for flexible production processes. There are often very limited process observations. Thus, there exist both simulation and model uncertainties in the system performance estimates. In biopharmaceutical manufacturing, model uncertainty often dominates. The proposed framework can produce a confidence interval that accounts for simulation and model uncertainties by using a metamodel-assisted bootstrap** approach. Furthermore, a variance decomposition is utilized to estimate the relative contributions from each source of model uncertainty, as well as simulation uncertainty. This information can be used to improve the system mean performance estimation. Asymptotic analysis provides theoretical support for our approach, while the empirical study demonstrates that it has good finite-sample performance. △ Less

Submitted 3 September, 2022; v1 submitted 16 March, 2022; originally announced March 2022.

Comments: 32 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:2011.04207

arXiv:2201.03116 [pdf, other]

Opportunities of Hybrid Model-based Reinforcement Learning for Cell Therapy Manufacturing Process Control

Authors: Hua Zheng, Wei Xie, Keqi Wang, Zheng Li

Abstract: Driven by the key challenges of cell therapy manufacturing, including high complexity, high uncertainty, and very limited process observations, we propose a hybrid model-based reinforcement learning (RL) to efficiently guide process control. We first create a probabilistic knowledge graph (KG) hybrid model characterizing the risk- and science-based understanding of biomanufacturing process mechani… ▽ More Driven by the key challenges of cell therapy manufacturing, including high complexity, high uncertainty, and very limited process observations, we propose a hybrid model-based reinforcement learning (RL) to efficiently guide process control. We first create a probabilistic knowledge graph (KG) hybrid model characterizing the risk- and science-based understanding of biomanufacturing process mechanisms and quantifying inherent stochasticity, e.g., batch-to-batch variation. It can capture the key features, including nonlinear reactions, nonstationary dynamics, and partially observed state. This hybrid model can leverage existing mechanistic models and facilitate learning from heterogeneous process data. A computational sampling approach is used to generate posterior samples quantifying model uncertainty. Then, we introduce hybrid model-based Bayesian RL, accounting for both inherent stochasticity and model uncertainty, to guide optimal, robust, and interpretable dynamic decision making. Cell therapy manufacturing examples are used to empirically demonstrate that the proposed framework can outperform the classical deterministic mechanistic model assisted process optimization. △ Less

Submitted 25 January, 2022; v1 submitted 9 January, 2022; originally announced January 2022.

Comments: 14 pages, 2 figures

arXiv:2112.07948 [pdf, other]

Transcoded Video Restoration by Temporal Spatial Auxiliary Network

Authors: Li Xu, Gang He, **jia Zhou, Jie Lei, Weiying Xie, Yunsong Li, Yu-Wing Tai

Abstract: In most video platforms, such as Youtube, and TikTok, the played videos usually have undergone multiple video encodings such as hardware encoding by recording devices, software encoding by video editing apps, and single/multiple video transcoding by video application servers. Previous works in compressed video restoration typically assume the compression artifacts are caused by one-time encoding.… ▽ More In most video platforms, such as Youtube, and TikTok, the played videos usually have undergone multiple video encodings such as hardware encoding by recording devices, software encoding by video editing apps, and single/multiple video transcoding by video application servers. Previous works in compressed video restoration typically assume the compression artifacts are caused by one-time encoding. Thus, the derived solution usually does not work very well in practice. In this paper, we propose a new method, temporal spatial auxiliary network (TSAN), for transcoded video restoration. Our method considers the unique traits between video encoding and transcoding, and we consider the initial shallow encoded videos as the intermediate labels to assist the network to conduct self-supervised attention training. In addition, we employ adjacent multi-frame information and propose the temporal deformable alignment and pyramidal spatial fusion for transcoded video restoration. The experimental results demonstrate that the performance of the proposed method is superior to that of the previous techniques. The code is available at https://github.com/icecherylXuli/TSAN. △ Less

Submitted 15 December, 2021; originally announced December 2021.

Comments: Accepted by AAAI2022

arXiv:2112.04432 [pdf, other]

Audio-Visual Synchronisation in the wild

Authors: Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

Abstract: In this paper, we consider the problem of audio-visual synchronisation applied to videos `in-the-wild' (ie of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length, while sig… ▽ More In this paper, we consider the problem of audio-visual synchronisation applied to videos `in-the-wild' (ie of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length, while significantly reducing memory requirements during training. We further conduct an in-depth analysis on the curated dataset and define an evaluation metric for open domain audio-visual synchronisation. We apply our method on standard lip reading speech benchmarks, LRS2 and LRS3, with ablations on various aspects. Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset. In all cases, our proposed model outperforms the previous state-of-the-art by a significant margin. △ Less

Submitted 8 December, 2021; originally announced December 2021.

arXiv:2109.12108 [pdf, other]

ImplicitVol: Sensorless 3D Ultrasound Reconstruction with Deep Implicit Representation

Authors: Pak-Hei Yeung, Linde Hesse, Moska Aliasi, Monique Haak, the INTERGROWTH-21st Consortium, Weidi Xie, Ana I. L. Namburete

Abstract: The objective of this work is to achieve sensorless reconstruction of a 3D volume from a set of 2D freehand ultrasound images with deep implicit representation. In contrast to the conventional way that represents a 3D volume as a discrete voxel grid, we do so by parameterizing it as the zero level-set of a continuous function, i.e. implicitly representing the 3D volume as a map** from the spatia… ▽ More The objective of this work is to achieve sensorless reconstruction of a 3D volume from a set of 2D freehand ultrasound images with deep implicit representation. In contrast to the conventional way that represents a 3D volume as a discrete voxel grid, we do so by parameterizing it as the zero level-set of a continuous function, i.e. implicitly representing the 3D volume as a map** from the spatial coordinates to the corresponding intensity values. Our proposed model, termed as ImplicitVol, takes a set of 2D scans and their estimated locations in 3D as input, jointly refining the estimated 3D locations and learning a full reconstruction of the 3D volume. When testing on real 2D ultrasound images, novel cross-sectional views that are sampled from ImplicitVol show significantly better visual quality than those sampled from existing reconstruction approaches, outperforming them by over 30% (NCC and SSIM), between the output and ground-truth on the 3D volume testing data. The code will be made publicly available. △ Less

Submitted 24 September, 2021; originally announced September 2021.

arXiv:2107.13431 [pdf]

AI assisted method for efficiently generating breast ultrasound screening reports

Authors: Shuang Ge, Qiongyu Ye, Wenquan Xie, Desheng Sun, Huabin Zhang, Xiaobo Zhou, Kehong Yuan

Abstract: Background: Ultrasound is one of the preferred choices for early screening of dense breast cancer. Clinically, doctors have to manually write the screening report which is time-consuming and laborious, and it is easy to miss and miswrite. Aim: We proposed a new pipeline to automatically generate AI breast ultrasound screening reports based on ultrasound images, aiming to assist doctors in improvin… ▽ More Background: Ultrasound is one of the preferred choices for early screening of dense breast cancer. Clinically, doctors have to manually write the screening report which is time-consuming and laborious, and it is easy to miss and miswrite. Aim: We proposed a new pipeline to automatically generate AI breast ultrasound screening reports based on ultrasound images, aiming to assist doctors in improving the efficiency of clinical screening and reducing repetitive report writing. Methods: AI was used to efficiently generate personalized breast ultrasound screening preliminary reports, especially for benign and normal cases which account for the majority. Based on the preliminary AI report, doctors then make simple adjustments or corrections to quickly generate the final report. The approach has been trained and tested using a database of 4809 breast tumor instances. Results: Experimental results indicate that this pipeline improves doctors' work efficiency by up to 90%, which greatly reduces repetitive work. Conclusion: Personalized report generation is more widely recognized by doctors in clinical practice compared with non-intelligent reports based on fixed templates or containing options to fill in the blanks. △ Less

Submitted 22 May, 2022; v1 submitted 28 July, 2021; originally announced July 2021.

arXiv:2106.01351 [pdf, other]

Deep Clustering Activation Maps for Emphysema Subty**

Authors: Weiyi Xie, Colin Jacobs, Bram van Ginneken

Abstract: We propose a deep learning clustering method that exploits dense features from a segmentation network for emphysema subty** from computed tomography (CT) scans. Using dense features enables high-resolution visualization of image regions corresponding to the cluster assignment via dense clustering activation maps (dCAMs). This approach provides model interpretability. We evaluated clustering resu… ▽ More We propose a deep learning clustering method that exploits dense features from a segmentation network for emphysema subty** from computed tomography (CT) scans. Using dense features enables high-resolution visualization of image regions corresponding to the cluster assignment via dense clustering activation maps (dCAMs). This approach provides model interpretability. We evaluated clustering results on 500 subjects from the COPDGenestudy, where radiologists manually annotated emphysema sub-types according to their visual CT assessment. We achieved a 43% unsupervised clustering accuracy, outperforming our baseline at 41% and yielding results comparable to supervised classification at 45%. The proposed method also offers a better cluster formation than the baseline, achieving0.54 in silhouette coefficient and 0.55 in David-Bouldin scores. △ Less

Submitted 1 June, 2021; originally announced June 2021.

arXiv:2105.11748 [pdf, other]

Dense Regression Activation Maps For Lesion Segmentation in CT scans of COVID-19 patients

Authors: Weiyi Xie, Colin Jacobs, Jean-Paul Charbonnier, Bram van Ginneken

Abstract: Automatic lesion segmentation on thoracic CT enables rapid quantitative analysis of lung involvement in COVID-19 infections. However, obtaining a large amount of voxel-level annotations for training segmentation networks is prohibitively expensive. Therefore, we propose a weakly-supervised segmentation method based on dense regression activation maps (dRAMs). Most weakly-supervised segmentation ap… ▽ More Automatic lesion segmentation on thoracic CT enables rapid quantitative analysis of lung involvement in COVID-19 infections. However, obtaining a large amount of voxel-level annotations for training segmentation networks is prohibitively expensive. Therefore, we propose a weakly-supervised segmentation method based on dense regression activation maps (dRAMs). Most weakly-supervised segmentation approaches exploit class activation maps (CAMs) to localize objects. However, because CAMs were trained for classification, they do not align precisely with the object segmentations. Instead, we produce high-resolution activation maps using dense features from a segmentation network that was trained to estimate a per-lobe lesion percentage. In this way, the network can exploit knowledge regarding the required lesion volume. In addition, we propose an attention neural network module to refine dRAMs, optimized together with the main regression task. We evaluated our algorithm on 90 subjects. Results show our method achieved 70.2% Dice coefficient, substantially outperforming the CAM-based baseline at 48.6%. △ Less

Submitted 18 November, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

arXiv:2104.14332 [pdf, other]

Hypernetwork Dismantling via Deep Reinforcement Learning

Authors: Dengcheng Yan, Wenxin Xie, Yiwen Zhang, Qiang He, Yun Yang

Abstract: Network dismantling aims to degrade the connectivity of a network by removing an optimal set of nodes. It has been widely adopted in many real-world applications such as epidemic control and rumor containment. However, conventional methods usually focus on simple network modeling with only pairwise interactions, while group-wise interactions modeled by hypernetwork are ubiquitous and critical. In… ▽ More Network dismantling aims to degrade the connectivity of a network by removing an optimal set of nodes. It has been widely adopted in many real-world applications such as epidemic control and rumor containment. However, conventional methods usually focus on simple network modeling with only pairwise interactions, while group-wise interactions modeled by hypernetwork are ubiquitous and critical. In this work, we formulate the hypernetwork dismantling problem as a node sequence decision problem and propose a deep reinforcement learning (DRL)-based hypernetwork dismantling framework. Besides, we design a novel inductive hypernetwork embedding method to ensure the transferability to various real-world hypernetworks. Our framework first generates small-scale synthetic hypernetworks and embeds the nodes and hypernetworks into a low dimensional vector space to represent the action and state space in DRL, respectively. Then trial-and-error dismantling tasks are conducted by an agent on these synthetic hypernetworks, and the dismantling strategy is continuously optimized. Finally, the well-optimized strategy is applied to real-world hypernetwork dismantling tasks. Experimental results on five real-world hypernetworks demonstrate the effectiveness of our proposed framework. △ Less

Submitted 8 March, 2022; v1 submitted 29 April, 2021; originally announced April 2021.

arXiv:2104.12044 [pdf, other]

Multi-Cycle-Consistent Adversarial Networks for Edge Denoising of Computed Tomography Images

Authors: Xiaowe Xu, Jiawei Zhang, **glan Liu, Yukun Ding, Tianchen Wang, Hailong Qiu, Haiyun Yuan, Jian Zhuang, Wen Xie, Yuhao Dong, Qianjun Jia, Mei** Huang, Yiyu Shi

Abstract: As one of the most commonly ordered imaging tests, computed tomography (CT) scan comes with inevitable radiation exposure that increases the cancer risk to patients. However, CT image quality is directly related to radiation dose, thus it is desirable to obtain high-quality CT images with as little dose as possible. CT image denoising tries to obtain high dose like high-quality CT images (domain X… ▽ More As one of the most commonly ordered imaging tests, computed tomography (CT) scan comes with inevitable radiation exposure that increases the cancer risk to patients. However, CT image quality is directly related to radiation dose, thus it is desirable to obtain high-quality CT images with as little dose as possible. CT image denoising tries to obtain high dose like high-quality CT images (domain X) from low dose low-quality CTimages (domain Y), which can be treated as an image-to-image translation task where the goal is to learn the transform between a source domain X (noisy images) and a target domain Y (clean images). In this paper, we propose a multi-cycle-consistent adversarial network (MCCAN) that builds intermediate domains and enforces both local and global cycle-consistency for edge denoising of CT images. The global cycle-consistency couples all generators together to model the whole denoising process, while the local cycle-consistency imposes effective supervision on the process between adjacent domains. Experiments show that both local and global cycle-consistency are important for the success of MCCAN, which outperformsCCADN in terms of denoising quality with slightly less computation resource consumption. △ Less

Submitted 24 April, 2021; originally announced April 2021.

Comments: 16 pages, 7 figures, 4 tables, accepted by the ACM Journal on Emerging Technologies in Computing Systems (JETC). arXiv admin note: substantial text overlap with arXiv:2002.12130

arXiv:2104.02691 [pdf, other]

Localizing Visual Sounds the Hard Way

Authors: Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

Abstract: The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing… ▽ More The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing a mechanism to mine hard samples and add them to a contrastive learning formulation automatically. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset, where the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous existing ones, contains 5K videos spanning over 200 categories, and, differently from Flickr SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves state-of-the-art performance against several baselines. △ Less

Submitted 6 April, 2021; originally announced April 2021.

Comments: CVPR2021

arXiv:2103.08357 [pdf, other]

Learning Frequency-aware Dynamic Network for Efficient Super-Resolution

Authors: Wenbin Xie, Dehua Song, Chang Xu, Chun**g Xu, Hui Zhang, Yunhe Wang

Abstract: Deep learning based methods, especially convolutional neural networks (CNNs) have been successfully applied in the field of single image super-resolution (SISR). To obtain better fidelity and visual quality, most of existing networks are of heavy design with massive computation. However, the computation resources of modern mobile devices are limited, which cannot easily support the expensive cost.… ▽ More Deep learning based methods, especially convolutional neural networks (CNNs) have been successfully applied in the field of single image super-resolution (SISR). To obtain better fidelity and visual quality, most of existing networks are of heavy design with massive computation. However, the computation resources of modern mobile devices are limited, which cannot easily support the expensive cost. To this end, this paper explores a novel frequency-aware dynamic network for dividing the input into multiple parts according to its coefficients in the discrete cosine transform (DCT) domain. In practice, the high-frequency part will be processed using expensive operations and the lower-frequency part is assigned with cheap operations to relieve the computation burden. Since pixels or image patches belong to low-frequency areas contain relatively few textural details, this dynamic network will not affect the quality of resulting super-resolution images. In addition, we embed predictors into the proposed dynamic network to end-to-end fine-tune the handcrafted frequency-aware masks. Extensive experiments conducted on benchmark SISR models and datasets show that the frequency-aware dynamic network can be employed for various SISR neural architectures to obtain the better tradeoff between visual quality and computational complexity. For instance, we can reduce the FLOPs of SR models by approximate 50% while preserving state-of-the-art SISR performance. △ Less

Submitted 16 August, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

arXiv:2012.06867 [pdf, other]

VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge

Authors: Arsha Nagrani, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman

Abstract: We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020. The goal of this challenge was to assess how well current speaker recognition technology is able to diarise and recognize speakers in unconstrained or `in the wild' data. It consisted of: (i) a publicly available speaker recognition and diarisation dataset from YouTube videos together… ▽ More We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020. The goal of this challenge was to assess how well current speaker recognition technology is able to diarise and recognize speakers in unconstrained or `in the wild' data. It consisted of: (i) a publicly available speaker recognition and diarisation dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a virtual public challenge and workshop held at Interspeech 2020. This paper outlines the challenge, and describes the baselines, methods used, and results. We conclude with a discussion of the progress over the first installment of the challenge. △ Less

Submitted 12 December, 2020; originally announced December 2020.

arXiv:2009.05399 [pdf, other]

doi 10.1364/JOSAB.397469

Investigation of analog signal distortion introduced by a fiber phase sensitive amplifier

Authors: Debanuj Chatterjee, Yousra Bouasria, Weilin Xie, Tarek Labidi, Fabienne Goldfarb, Ihsan Fsaifes, Fabien Bretenaker

Abstract: We numerically simulate the distortion of an analog signal carried in a microwave photonics link containing a phase sensitive amplifier (PSA), focusing mainly on amplitude modulation format. The numerical model is validated by comparison with experimental measurements. By using the well known two-tone test, we compare the situations in which a standard intensity modulator is used with the one wher… ▽ More We numerically simulate the distortion of an analog signal carried in a microwave photonics link containing a phase sensitive amplifier (PSA), focusing mainly on amplitude modulation format. The numerical model is validated by comparison with experimental measurements. By using the well known two-tone test, we compare the situations in which a standard intensity modulator is used with the one where a perfectly linear modulator would be employed. We also investigate the role of gain saturation on the nonlinearity of the PSA. Finally, we establish the conditions, in which the signal nonlinearity introduced by the PSA itself can be extremely small. △ Less

Submitted 10 September, 2020; originally announced September 2020.

Journal ref: JOSA B 37, 2405 (2020)

arXiv:2009.03591 [pdf]

Are wave union methods still suitable for 20 nm FPGA-based high-resolution (< 2 ps) time-to-digital converters?

Authors: Wujun Xie, Haochang Chen, David Day-Uei Li

Abstract: This paper presents several new structures to pursue high-resolution (< 2 ps) time-to-digital converters (TDCs) in Xilinx 20 nm UltraScale field-programmable gate arrays (FPGAs). The proposed TDCs combined the advantages of 1) our newly proposed sub-tapped delay line (sub-TDL) architecture effective in removing bubbles and zero-bins and 2) the wave union (WU) A method to improve the resolution and… ▽ More This paper presents several new structures to pursue high-resolution (< 2 ps) time-to-digital converters (TDCs) in Xilinx 20 nm UltraScale field-programmable gate arrays (FPGAs). The proposed TDCs combined the advantages of 1) our newly proposed sub-tapped delay line (sub-TDL) architecture effective in removing bubbles and zero-bins and 2) the wave union (WU) A method to improve the resolution and reduce the impact introduced from ultrawide bins. We also compared the proposed WU/sub-TDL TDC with the TDC combining the dual sampling (DS) structure and the sub-TDL technique. Moreover, we introduced a binning method to improve the linearity and derived a formula of the total measurement uncertainty for a single-stage TDL-TDC to obtain its root-mean-square (RMS) resolution. Results conclude that the proposed designs are cost-effective in logic resources and have the potential for multiple-channel implementations. Different from the conclusions from a previous study, we found that the wave union is still influential in UltraScale devices when combining with our sub-TDL structure. We also compared with other published TDCs to demonstrate where the proposed TDCs stand. △ Less

Submitted 8 September, 2020; originally announced September 2020.

arXiv:2006.10915 [pdf, other]

Simulation-Based Digital Twin Development for Blockchain Enabled End-to-End Industrial Hemp Supply Chain Risk Management

Authors: Keqi Wang, Wei Xie, Wencen Wu, Bo Wang, **xiang Pei, Mike Baker, Qi Zhou

Abstract: With the passage of the 2018 U.S. Farm Bill, Industrial Hemp production is moved from limited pilot programs to a regulated agriculture production system. However, Industrial Hemp Supply Chain (IHSC) faces critical challenges, including: high complexity and variability, very limited production knowledge, lack of data and information tracking. In this paper, we propose blockchain-enabled IHSC and d… ▽ More With the passage of the 2018 U.S. Farm Bill, Industrial Hemp production is moved from limited pilot programs to a regulated agriculture production system. However, Industrial Hemp Supply Chain (IHSC) faces critical challenges, including: high complexity and variability, very limited production knowledge, lack of data and information tracking. In this paper, we propose blockchain-enabled IHSC and develop a preliminary simulation-based digital twin for this distributed cyber-physical system (CPS) to support the process learning and risk management. Basically, we develop a two-layer blockchain with proof of authority smart contract, which can track the data and key information, improve the supply chain transparency, and leverage local authorities and state regulators to ensure the quality control verification. Then, we introduce a stochastic simulation-based digital twin for IHSC risk management, which can characterize the process spatial-temporal causal interdependencies and dynamic evolution to guide risk control and decision making. Our empirical study demonstrates the promising performance of proposed platform. △ Less

Submitted 18 June, 2020; originally announced June 2020.

Comments: 11 pages, 2 figures, 2020 Winter Simulation Conference

arXiv:2006.02666 [pdf]

Deep Sequential Feature Learning in Clinical Image Classification of Infectious Keratitis

Authors: Yesheng Xu, Ming Kong, Wenjia Xie, Run** Duan, Zhengqing Fang, Yuxiao Lin, Qiang Zhu, Siliang Tang, Fei Wu, Yu-Feng Yao

Abstract: Infectious keratitis is the most common entities of corneal diseases, in which pathogen grows in the cornea leading to inflammation and destruction of the corneal tissues. Infectious keratitis is a medical emergency, for which a rapid and accurate diagnosis is needed for speedy initiation of prompt and precise treatment to halt the disease progress and to limit the extent of corneal damage; otherw… ▽ More Infectious keratitis is the most common entities of corneal diseases, in which pathogen grows in the cornea leading to inflammation and destruction of the corneal tissues. Infectious keratitis is a medical emergency, for which a rapid and accurate diagnosis is needed for speedy initiation of prompt and precise treatment to halt the disease progress and to limit the extent of corneal damage; otherwise it may develop sight-threatening and even eye-globe-threatening condition. In this paper, we propose a sequential-level deep learning model to effectively discriminate the distinction and subtlety of infectious corneal disease via the classification of clinical images. In this approach, we devise an appropriate mechanism to preserve the spatial structures of clinical images and disentangle the informative features for clinical image classification of infectious keratitis. In competition with 421 ophthalmologists, the performance of the proposed sequential-level deep model achieved 80.00% diagnostic accuracy, far better than the 49.27% diagnostic accuracy achieved by ophthalmologists over 120 test images. △ Less

Submitted 4 June, 2020; originally announced June 2020.

Comments: Accepted by Engineering

arXiv:2005.11684 [pdf]

Deep Learning-based Modulation Detection for NOMA Systems

Authors: Wenwu Xie, Jian Xiao, **xia Yang, Xin Peng, Chao Yu, Peng Zhu

Abstract: Since the signal with strong power should be demodulated first for successive interference cancellation (SIC) demodulation in non-orthogonal multiple access (NOMA) systems, the base station (BS) should inform the near user terminal (UT), which has allocated higher power, of modulation mode of the far user terminal. To avoid unnecessary signaling overhead in this process, a blind detection algorith… ▽ More Since the signal with strong power should be demodulated first for successive interference cancellation (SIC) demodulation in non-orthogonal multiple access (NOMA) systems, the base station (BS) should inform the near user terminal (UT), which has allocated higher power, of modulation mode of the far user terminal. To avoid unnecessary signaling overhead in this process, a blind detection algorithm of NOMA signal modulation mode is designed in this paper. Taking the joint constellation density diagrams of NOMA signal as the detection features, deep residual network is built for classification, so as to detect the modulation mode of NOMA signal. In view of the fact that the joint constellation diagrams are easily polluted by high intensity noise and lose their real distribution pattern, the wavelet denoising method is adopted to improve the quality of constellations. The simulation results represent that the proposed algorithm can achieve satisfactory detection accuracy in NOMA systems. In addition, the factors affecting the recognition performance are also verified and analyzed. △ Less

Submitted 16 October, 2020; v1 submitted 24 May, 2020; originally announced May 2020.

arXiv:2004.14368 [pdf, other]

VGGSound: A Large-scale Audio-Visual Dataset

Authors: Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman

Abstract: Our goal is to collect a large-scale audio-visual dataset with low label noise from videos in the wild using computer vision techniques. The resulting dataset can be used for training and evaluating audio recognition models. We make three contributions. First, we propose a scalable pipeline based on computer vision techniques to create an audio dataset from open-source media. Our pipeline involves… ▽ More Our goal is to collect a large-scale audio-visual dataset with low label noise from videos in the wild using computer vision techniques. The resulting dataset can be used for training and evaluating audio recognition models. We make three contributions. First, we propose a scalable pipeline based on computer vision techniques to create an audio dataset from open-source media. Our pipeline involves obtaining videos from YouTube; using image classification algorithms to localize audio-visual correspondence; and filtering out ambient noise using audio verification. Second, we use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes. Third, we investigate various Convolutional Neural Network~(CNN) architectures and aggregation approaches to establish audio recognition baselines for our new dataset. Compared to existing audio datasets, VGGSound ensures audio-visual correspondence and is collected under unconstrained conditions. Code and the dataset are available at http://www.robots.ox.ac.uk/~vgg/data/vggsound/ △ Less

Submitted 24 September, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

Comments: ICASSP2020

arXiv:2004.07443 [pdf, other]

doi 10.1109/TMI.2020.2995108

Relational Modeling for Robust and Efficient Pulmonary Lobe Segmentation in CT Scans

Authors: Weiyi Xie, Colin Jacobs, Jean-Paul Charbonnier, Bram van Ginneken

Abstract: Pulmonary lobe segmentation in computed tomography scans is essential for regional assessment of pulmonary diseases. Recent works based on convolution neural networks have achieved good performance for this task. However, they are still limited in capturing structured relationships due to the nature of convolution. The shape of the pulmonary lobes affect each other and their borders relate to the… ▽ More Pulmonary lobe segmentation in computed tomography scans is essential for regional assessment of pulmonary diseases. Recent works based on convolution neural networks have achieved good performance for this task. However, they are still limited in capturing structured relationships due to the nature of convolution. The shape of the pulmonary lobes affect each other and their borders relate to the appearance of other structures, such as vessels, airways, and the pleural wall. We argue that such structural relationships play a critical role in the accurate delineation of pulmonary lobes when the lungs are affected by diseases such as COVID-19 or COPD. In this paper, we propose a relational approach (RTSU-Net) that leverages structured relationships by introducing a novel non-local neural network module. The proposed module learns both visual and geometric relationships among all convolution features to produce self-attention weights. With a limited amount of training data available from COVID-19 subjects, we initially train and validate RTSU-Net on a cohort of 5000 subjects from the COPDGene study (4000 for training and 1000 for evaluation). Using models pre-trained on COPDGene, we apply transfer learning to retrain and evaluate RTSU-Net on 470 COVID-19 suspects (370 for retraining and 100 for evaluation). Experimental results show that RTSU-Net outperforms three baselines and performs robustly on cases with severe lung infection due to COVID-19. △ Less

Submitted 12 May, 2020; v1 submitted 15 April, 2020; originally announced April 2020.

arXiv:2002.04251 [pdf, other]

doi 10.1016/j.bspc.2023.104858

2.75D: Boosting learning by representing 3D Medical imaging to 2D features for small data

Authors: Xin Wang, Ruisheng Su, Weiyi Xie, Wen** Wang, Yi Xu, Ritse Mann, Jungong Han, Tao Tan

Abstract: In medical-data driven learning, 3D convolutional neural networks (CNNs) have started to show superior performance to 2D CNNs in numerous deep learning tasks, proving the added value of 3D spatial information in feature representation. However, the difficulty in collecting more training samples to converge, more computational resources and longer execution time make this approach less applied. Als… ▽ More In medical-data driven learning, 3D convolutional neural networks (CNNs) have started to show superior performance to 2D CNNs in numerous deep learning tasks, proving the added value of 3D spatial information in feature representation. However, the difficulty in collecting more training samples to converge, more computational resources and longer execution time make this approach less applied. Also, applying transfer learning on 3D CNN is challenging due to a lack of publicly available pre-trained 3D models. To tackle these issues, we proposed a novel 2D strategical representation of volumetric data, namely 2.75D. In this work, the spatial information of 3D images is captured in a single 2D view by a spiral-spinning technique. As a result, 2D CNN networks can also be used to learn volumetric information. Besides, we can fully leverage pre-trained 2D CNNs for downstream vision problems. We also explore a multi-view 2.75D strategy, 2.75D 3 channels (2.75Dx3), to boost the advantage of 2.75D. We evaluated the proposed methods on three public datasets with different modalities or organs (Lung CT, Breast MRI, and Prostate MRI), against their 2D, 2.5D, and 3D counterparts in classification tasks. Results show that the proposed methods significantly outperform other counterparts when all methods were trained from scratch on the lung dataset. Such performance gain is more pronounced with transfer learning or in the case of limited training data. Our methods also achieved comparable performance on other datasets. In addition, our methods achieved a substantial reduction in time consumption of training and inference compared with the 2.5D or 3D method. △ Less

Submitted 22 January, 2024; v1 submitted 11 February, 2020; originally announced February 2020.

arXiv:2002.00275 [pdf, other]

Data-Driven Stochastic Optimization for Power Grids Scheduling under High Wind Penetration

Authors: Wei Xie, Yuan Yi, Zhi Zhou, Keqi Wang

Abstract: To address the environmental concern and improve the economic efficiency, the wind power is rapidly integrated into smart grids. However, the inherent uncertainty of wind energy raises operational challenges. To ensure the cost-efficient, reliable and robust operation, it is critically important to find the optimal decision that can correctly and rigorously hedge against all sources of uncertainty… ▽ More To address the environmental concern and improve the economic efficiency, the wind power is rapidly integrated into smart grids. However, the inherent uncertainty of wind energy raises operational challenges. To ensure the cost-efficient, reliable and robust operation, it is critically important to find the optimal decision that can correctly and rigorously hedge against all sources of uncertainty. In this paper, we propose data-driven stochastic unit commitment (SUC) to guide the power grids scheduling. Specifically, given the finite historical data, the posterior predictive distribution is developed to quantify the wind power prediction uncertainty accounting for both inherent stochastic uncertainty of wind power generation and input model estimation error. For complex power grid systems, a finite number of scenarios is used to estimate the expected cost in the planning horizon. To further control the impact of finite sampling error induced by using the sample average approximation (SAA), we propose a parallel computing based optimization solution methodology, which can quickly find the reliable optimal unit commitment decision hedging against various sources of uncertainty. The empirical study over six-bus and 118-bus systems demonstrates that our approach can provide more efficient and robust performance than the existing deterministic and stochastic unit commitment approaches. △ Less

Submitted 9 November, 2020; v1 submitted 1 February, 2020; originally announced February 2020.

Comments: 24 pages, 2 figures

arXiv:1912.02522 [pdf, other]

VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

Authors: Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman

Abstract: The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or `in the wild' data. It consisted of: (i) a publicly available speaker recognition dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and workshop held at Inte… ▽ More The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or `in the wild' data. It consisted of: (i) a publicly available speaker recognition dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and workshop held at Interspeech 2019 in Graz, Austria. This paper outlines the challenge and provides its baselines, results and discussions. △ Less

Submitted 5 December, 2019; originally announced December 2019.

Comments: ISCA Archive

Showing 1–50 of 53 results for author: Xie, W