Search | arXiv e-print repository

Mamba-based Light Field Super-Resolution with Efficient Subspace Scanning

Authors: Ruisheng Gao, Zeyu Xiao, Zhiwei Xiong

Abstract: Transformer-based methods have demonstrated impressive performance in 4D light field (LF) super-resolution by effectively modeling long-range spatial-angular correlations, but their quadratic complexity hinders the efficient processing of high resolution 4D inputs, resulting in slow inference speed and high memory cost. As a compromise, most prior work adopts a patch-based strategy, which fails to… ▽ More Transformer-based methods have demonstrated impressive performance in 4D light field (LF) super-resolution by effectively modeling long-range spatial-angular correlations, but their quadratic complexity hinders the efficient processing of high resolution 4D inputs, resulting in slow inference speed and high memory cost. As a compromise, most prior work adopts a patch-based strategy, which fails to leverage the full information from the entire input LFs. The recently proposed selective state-space model, Mamba, has gained popularity for its efficient long-range sequence modeling. In this paper, we propose a Mamba-based Light Field Super-Resolution method, named MLFSR, by designing an efficient subspace scanning strategy. Specifically, we tokenize 4D LFs into subspace sequences and conduct bi-directional scanning on each subspace. Based on our scanning strategy, we then design the Mamba-based Global Interaction (MGI) module to capture global information and the local Spatial- Angular Modulator (SAM) to complement local details. Additionally, we introduce a Transformer-to-Mamba (T2M) loss to further enhance overall performance. Extensive experiments on public benchmarks demonstrate that MLFSR surpasses CNN-based models and rivals Transformer-based methods in performance while maintaining higher efficiency. With quicker inference speed and reduced memory demand, MLFSR facilitates full-image processing of high-resolution 4D LFs with enhanced performance. △ Less

Submitted 23 June, 2024; originally announced June 2024.

Comments: 17 pages,7 figures

arXiv:2406.07532 [pdf, other]

Hearing Anything Anywhere

Authors: Mason Wang, Ryosuke Sawata, Samuel Clarke, Ruohan Gao, Shangzhe Wu, Jiajun Wu

Abstract: Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However, alongside immersive visual experiences, immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper, we aim to reconstruct the spatial acoustic… ▽ More Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However, alongside immersive visual experiences, immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper, we aim to reconstruct the spatial acoustic characteristics of an arbitrary environment given only a sparse set of (roughly 12) room impulse response (RIR) recordings and a planar reconstruction of the scene, a setup that is easily achievable by ordinary users. To this end, we introduce DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene, including sound source directivity and surface reflectivity. This allows us to synthesize novel auditory experiences through the space with any source audio. To evaluate our method, we collect a dataset of RIR recordings and music in four diverse, real environments. We show that our model outperforms state-ofthe-art baselines on rendering monaural and binaural RIRs and music at unseen locations, and learns physically interpretable parameters characterizing acoustic properties of the sound source and surfaces in the scene. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: CVPR 2024. The first two authors contributed equally. Project page: https://masonlwang.com/hearinganythinganywhere/

ACM Class: I.2.10; I.4.8

arXiv:2405.14559 [pdf, other]

HemSeg-200: A Voxel-Annotated Dataset for Intracerebral Hemorrhages Segmentation in Brain CT Scans

Authors: Changwei Song, Qing Zhao, Jianqiang Li, Xin Yue, Ruoyun Gao, Zhaoxuan Wang, An Gao, Guanghui Fu

Abstract: Acute intracerebral hemorrhage is a life-threatening condition that demands immediate medical intervention. Intraparenchymal hemorrhage (IPH) and intraventricular hemorrhage (IVH) are critical subtypes of this condition. Clinically, when such hemorrhages are suspected, immediate CT scanning is essential to assess the extent of the bleeding and to facilitate the formulation of a targeted treatment… ▽ More Acute intracerebral hemorrhage is a life-threatening condition that demands immediate medical intervention. Intraparenchymal hemorrhage (IPH) and intraventricular hemorrhage (IVH) are critical subtypes of this condition. Clinically, when such hemorrhages are suspected, immediate CT scanning is essential to assess the extent of the bleeding and to facilitate the formulation of a targeted treatment plan. While current research in deep learning has largely focused on qualitative analyses, such as identifying subtypes of cerebral hemorrhages, there remains a significant gap in quantitative analysis crucial for enhancing clinical treatments. Addressing this gap, our paper introduces a dataset comprising 222 CT annotations, sourced from the RSNA 2019 Brain CT Hemorrhage Challenge and meticulously annotated at the voxel level for precise IPH and IVH segmentation. This dataset was utilized to train and evaluate seven advanced medical image segmentation algorithms, with the goal of refining the accuracy of segmentation for these hemorrhages. Our findings demonstrate that this dataset not only furthers the development of sophisticated segmentation algorithms but also substantially aids scientific research and clinical practice by improving the diagnosis and management of these severe hemorrhages. Our dataset and codes are available at \url{https://github.com/songchangwei/3DCT-SD-IVH-ICH}. △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2402.17718 [pdf]

Towards a Digital Twin Framework in Additive Manufacturing: Machine Learning and Bayesian Optimization for Time Series Process Optimization

Authors: Vispi Karkaria, Anthony Goeckner, Ru**g Zha, Jie Chen, Jian**g Zhang, Qi Zhu, Jian Cao, Robert X. Gao, Wei Chen

Abstract: Laser-directed-energy deposition (DED) offers advantages in additive manufacturing (AM) for creating intricate geometries and material grading. Yet, challenges like material inconsistency and part variability remain, mainly due to its layer-wise fabrication. A key issue is heat accumulation during DED, which affects the material microstructure and properties. While closed-loop control methods for… ▽ More Laser-directed-energy deposition (DED) offers advantages in additive manufacturing (AM) for creating intricate geometries and material grading. Yet, challenges like material inconsistency and part variability remain, mainly due to its layer-wise fabrication. A key issue is heat accumulation during DED, which affects the material microstructure and properties. While closed-loop control methods for heat management are common in DED research, few integrate real-time monitoring, physics-based modeling, and control in a unified framework. Our work presents a digital twin (DT) framework for real-time predictive control of DED process parameters to meet specific design objectives. We develop a surrogate model using Long Short-Term Memory (LSTM)-based machine learning with Bayesian Inference to predict temperatures in DED parts. This model predicts future temperature states in real time. We also introduce Bayesian Optimization (BO) for Time Series Process Optimization (BOTSPO), based on traditional BO but featuring a unique time series process profile generator with reduced dimensions. BOTSPO dynamically optimizes processes, identifying optimal laser power profiles to attain desired mechanical properties. The established process trajectory guides online optimizations, aiming to enhance performance. This paper outlines the digital twin framework's components, promoting its integration into a comprehensive system for AM. △ Less

Submitted 27 February, 2024; originally announced February 2024.

Comments: 12 Pages, 10 Figures, 1 Table, NAMRC Conference

arXiv:2402.06841 [pdf]

Point cloud-based registration and image fusion between cardiac SPECT MPI and CTA

Authors: Shaojie Tang, Penpen Miao, Xingyu Gao, Yu Zhong, Dantong Zhu, Haixing Wen, Zhihui Xu, Qiuyue Wei, Hong** Yao, Xin Huang, Rui Gao, Chen Zhao, Weihua Zhou

Abstract: A method was proposed for the point cloud-based registration and image fusion between cardiac single photon emission computed tomography (SPECT) myocardial perfusion images (MPI) and cardiac computed tomography angiograms (CTA). Firstly, the left ventricle (LV) epicardial regions (LVERs) in SPECT and CTA images were segmented by using different U-Net neural networks trained to generate the point c… ▽ More A method was proposed for the point cloud-based registration and image fusion between cardiac single photon emission computed tomography (SPECT) myocardial perfusion images (MPI) and cardiac computed tomography angiograms (CTA). Firstly, the left ventricle (LV) epicardial regions (LVERs) in SPECT and CTA images were segmented by using different U-Net neural networks trained to generate the point clouds of the LV epicardial contours (LVECs). Secondly, according to the characteristics of cardiac anatomy, the special points of anterior and posterior interventricular grooves (APIGs) were manually marked in both SPECT and CTA image volumes. Thirdly, we developed an in-house program for coarsely registering the special points of APIGs to ensure a correct cardiac orientation alignment between SPECT and CTA images. Fourthly, we employed ICP, SICP or CPD algorithm to achieve a fine registration for the point clouds (together with the special points of APIGs) of the LV epicardial surfaces (LVERs) in SPECT and CTA images. Finally, the image fusion between SPECT and CTA was realized after the fine registration. The experimental results showed that the cardiac orientation was aligned well and the mean distance error of the optimal registration method (CPD with affine transform) was consistently less than 3 mm. The proposed method could effectively fuse the structures from cardiac CTA and SPECT functional images, and demonstrated a potential in assisting in accurate diagnosis of cardiac diseases by combining complementary advantages of the two imaging modalities. △ Less

Submitted 9 February, 2024; originally announced February 2024.

arXiv:2402.02735 [pdf, other]

Timed-Elastic-Band Based Variable Splitting for Autonomous Trajectory Planning

Authors: Hao Zhu, Kefan **, Rui Gao, Jialin Wang, C. -J. Richard Shi

Abstract: Existing trajectory planning methods are struggling to handle the issue of autonomous track swinging during navigation, resulting in significant errors when reaching the destination. In this article, we address autonomous trajectory planning problems, which aims at develo** innovative solutions to enhance the adaptability and robustness of unmanned systems in navigating complex and dynamic envir… ▽ More Existing trajectory planning methods are struggling to handle the issue of autonomous track swinging during navigation, resulting in significant errors when reaching the destination. In this article, we address autonomous trajectory planning problems, which aims at develo** innovative solutions to enhance the adaptability and robustness of unmanned systems in navigating complex and dynamic environments. We first introduce the variable splitting (VS) method as a constrained optimization method to reimagine the renowned Timed-Elastic-Band (TEB) algorithm, resulting in a novel collision avoidance approach named Timed-Elastic-Band based variable splitting (TEB-VS). The proposed TEB-VS demonstrates superior navigation stability, while maintaining nearly identical resource consumption to TEB. We then analyze the convergence of the proposed TEB-VS method. To evaluate the effectiveness and efficiency of TEB-VS, extensive experiments have been conducted using TurtleBot2 in both simulated environments and real-world datasets. △ Less

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2311.03517 [pdf, other]

SoundCam: A Dataset for Finding Humans Using Room Acoustics

Authors: Mason Wang, Samuel Clarke, Jui-Hsien Wang, Ruohan Gao, Jiajun Wu

Abstract: A room's acoustic properties are a product of the room's geometry, the objects within the room, and their specific positions. A room's acoustic properties can be characterized by its impulse response (RIR) between a source and listener location, or roughly inferred from recordings of natural signals present in the room. Variations in the positions of objects in a room can effect measurable changes… ▽ More A room's acoustic properties are a product of the room's geometry, the objects within the room, and their specific positions. A room's acoustic properties can be characterized by its impulse response (RIR) between a source and listener location, or roughly inferred from recordings of natural signals present in the room. Variations in the positions of objects in a room can effect measurable changes in the room's acoustic properties, as characterized by the RIR. Existing datasets of RIRs either do not systematically vary positions of objects in an environment, or they consist of only simulated RIRs. We present SoundCam, the largest dataset of unique RIRs from in-the-wild rooms publicly released to date. It includes 5,000 10-channel real-world measurements of room impulse responses and 2,000 10-channel recordings of music in three different rooms, including a controlled acoustic lab, an in-the-wild living room, and a conference room, with different humans in positions throughout each room. We show that these measurements can be used for interesting tasks, such as detecting and identifying humans, and tracking their positions. △ Less

Submitted 15 January, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

Comments: In NeurIPS 2023 Datasets and Benchmarks Track. Project page: https://masonlwang.com/soundcam/. Wang and Clarke contributed equally to this work

arXiv:2309.09392 [pdf, other]

Deep conditional generative models for longitudinal single-slice abdominal computed tomography harmonization

Authors: Xin Yu, Qi Yang, Yucheng Tang, Riqiang Gao, Shunxing Bao, Leon Y. Cai, Ho Hin Lee, Yuankai Huo, Ann Zenobia Moore, Luigi Ferrucci, Bennett A. Landman

Abstract: Two-dimensional single-slice abdominal computed tomography (CT) provides a detailed tissue map with high resolution allowing quantitative characterization of relationships between health conditions and aging. However, longitudinal analysis of body composition changes using these scans is difficult due to positional variation between slices acquired in different years, which leading to different or… ▽ More Two-dimensional single-slice abdominal computed tomography (CT) provides a detailed tissue map with high resolution allowing quantitative characterization of relationships between health conditions and aging. However, longitudinal analysis of body composition changes using these scans is difficult due to positional variation between slices acquired in different years, which leading to different organs/tissues captured. To address this issue, we propose C-SliceGen, which takes an arbitrary axial slice in the abdominal region as a condition and generates a pre-defined vertebral level slice by estimating structural changes in the latent space. Our experiments on 2608 volumetric CT data from two in-house datasets and 50 subjects from the 2015 Multi-Atlas Abdomen Labeling Challenge dataset (BTCV) Challenge demonstrate that our model can generate high-quality images that are realistic and similar. We further evaluate our method's capability to harmonize longitudinal positional variation on 1033 subjects from the Baltimore Longitudinal Study of Aging (BLSA) dataset, which contains longitudinal single abdominal slices, and confirmed that our method can harmonize the slice positional variance in terms of visceral fat area. This approach provides a promising direction for map** slices from different vertebral levels to a target slice and reducing positional variance for single-slice longitudinal analysis. The source code is available at: https://github.com/MASILab/C-SliceGen. △ Less

Submitted 17 September, 2023; originally announced September 2023.

arXiv:2306.09944 [pdf, other]

RealImpact: A Dataset of Impact Sound Fields for Real Objects

Authors: Samuel Clarke, Ruohan Gao, Mason Wang, Mark Rau, Julia Xu, Jui-Hsien Wang, Doug L. James, Jiajun Wu

Abstract: Objects make unique sounds under different perturbations, environment conditions, and poses relative to the listener. While prior works have modeled impact sounds and sound propagation in simulation, we lack a standard dataset of impact sound fields of real objects for audio-visual learning and calibration of the sim-to-real gap. We present RealImpact, a large-scale dataset of real object impact s… ▽ More Objects make unique sounds under different perturbations, environment conditions, and poses relative to the listener. While prior works have modeled impact sounds and sound propagation in simulation, we lack a standard dataset of impact sound fields of real objects for audio-visual learning and calibration of the sim-to-real gap. We present RealImpact, a large-scale dataset of real object impact sounds recorded under controlled conditions. RealImpact contains 150,000 recordings of impact sounds of 50 everyday objects with detailed annotations, including their impact locations, microphone locations, contact force profiles, material labels, and RGBD images. We make preliminary attempts to use our dataset as a reference to current simulation methods for estimating object impact sounds that match the real world. Moreover, we demonstrate the usefulness of our dataset as a testbed for acoustic and audio-visual learning via the evaluation of two benchmark tasks, including listener location classification and visual acoustic matching. △ Less

Submitted 16 June, 2023; originally announced June 2023.

Comments: CVPR 2023 (Highlight). Project page: https://samuelpclarke.com/realimpact/

arXiv:2305.18994 [pdf, other]

Toward Real-World Light Field Super-Resolution

Authors: Zeyu Xiao, Ruisheng Gao, Yutong Liu, Yueyi Zhang, Zhiwei Xiong

Abstract: Deep learning has opened up new possibilities for light field super-resolution (SR), but existing methods trained on synthetic datasets with simple degradations (e.g., bicubic downsampling) suffer from poor performance when applied to complex real-world scenarios. To address this problem, we introduce LytroZoom, the first real-world light field SR dataset capturing paired low- and high-resolution… ▽ More Deep learning has opened up new possibilities for light field super-resolution (SR), but existing methods trained on synthetic datasets with simple degradations (e.g., bicubic downsampling) suffer from poor performance when applied to complex real-world scenarios. To address this problem, we introduce LytroZoom, the first real-world light field SR dataset capturing paired low- and high-resolution light fields of diverse indoor and outdoor scenes using a Lytro ILLUM camera. Additionally, we propose the Omni-Frequency Projection Network (OFPNet), which decomposes the omni-frequency components and iteratively enhances them through frequency projection operations to address spatially variant degradation processes present in all frequency components. Experiments demonstrate that models trained on LytroZoom outperform those trained on synthetic datasets and are generalizable to diverse content and devices. Quantitative and qualitative evaluations verify the superiority of OFPNet. We believe this work will inspire future research in real-world light field SR. △ Less

Submitted 30 May, 2023; originally announced May 2023.

Comments: CVPRW 2023

arXiv:2304.02836 [pdf, other]

Longitudinal Multimodal Transformer Integrating Imaging and Latent Clinical Signatures From Routine EHRs for Pulmonary Nodule Classification

Authors: Thomas Z. Li, John M. Still, Kaiwen Xu, Ho Hin Lee, Leon Y. Cai, Aravind R. Krishnan, Riqiang Gao, Mirza S. Khan, Sanja Antic, Michael Kammer, Kim L. Sandler, Fabien Maldonado, Bennett A. Landman, Thomas A. Lasko

Abstract: The accuracy of predictive models for solitary pulmonary nodule (SPN) diagnosis can be greatly increased by incorporating repeat imaging and medical context, such as electronic health records (EHRs). However, clinically routine modalities such as imaging and diagnostic codes can be asynchronous and irregularly sampled over different time scales which are obstacles to longitudinal multimodal learni… ▽ More The accuracy of predictive models for solitary pulmonary nodule (SPN) diagnosis can be greatly increased by incorporating repeat imaging and medical context, such as electronic health records (EHRs). However, clinically routine modalities such as imaging and diagnostic codes can be asynchronous and irregularly sampled over different time scales which are obstacles to longitudinal multimodal learning. In this work, we propose a transformer-based multimodal strategy to integrate repeat imaging with longitudinal clinical signatures from routinely collected EHRs for SPN classification. We perform unsupervised disentanglement of latent clinical signatures and leverage time-distance scaled self-attention to jointly learn from clinical signatures expressions and chest computed tomography (CT) scans. Our classifier is pretrained on 2,668 scans from a public dataset and 1,149 subjects with longitudinal chest CTs, billing codes, medications, and laboratory tests from EHRs of our home institution. Evaluation on 227 subjects with challenging SPNs revealed a significant AUC improvement over a longitudinal multimodal baseline (0.824 vs 0.752 AUC), as well as improvements over a single cross-section multimodal scenario (0.809 AUC) and a longitudinal imaging-only scenario (0.741 AUC). This work demonstrates significant advantages with a novel approach for co-learning longitudinal imaging and non-imaging phenotypes with transformers. Code available at https://github.com/MASILab/lmsignatures. △ Less

Submitted 29 June, 2023; v1 submitted 5 April, 2023; originally announced April 2023.

Comments: Accepted to MICCAI 2023

arXiv:2301.00322 [pdf, other]

Encrypted Data-driven Predictive Cloud Control with Disturbance Observer

Authors: Qiwen Li, Runze Gao, Yuanqing Xia

Abstract: In data-driven predictive cloud control tasks, the privacy of data stored and used in cloud services could be leaked to malicious attackers or curious eavesdroppers. Homomorphic encryption technique could be used to protect data privacy while allowing computation. However, extra errors are introduced by the homomorphic encryption extension to ensure the privacy-preserving properties, and the real… ▽ More In data-driven predictive cloud control tasks, the privacy of data stored and used in cloud services could be leaked to malicious attackers or curious eavesdroppers. Homomorphic encryption technique could be used to protect data privacy while allowing computation. However, extra errors are introduced by the homomorphic encryption extension to ensure the privacy-preserving properties, and the real number truncation also brings uncertainty. Also, process and measure noise existed in system input and output may bring disturbance. In this work, a data-driven predictive cloud controller is developed based on homomorphic encryption to protect the cloud data privacy. Besides, a disturbance observer is introduced to estimate and compensate the encrypted control signal sequence computed in the cloud. The privacy of data is guaranteed by encryption and experiment results show the effect of our cloud-edge cooperative design. △ Less

Submitted 6 February, 2023; v1 submitted 31 December, 2022; originally announced January 2023.

arXiv:2209.14467 [pdf, other]

doi 10.1007/978-3-031-16449-1_20

Reducing Positional Variance in Cross-sectional Abdominal CT Slices with Deep Conditional Generative Models

Authors: Xin Yu, Qi Yang, Yucheng Tang, Riqiang Gao, Shunxing Bao, LeonY. Cai, Ho Hin Lee, Yuankai Huo, Ann Zenobia Moore, Luigi Ferrucci, Bennett A. Landman

Abstract: 2D low-dose single-slice abdominal computed tomography (CT) slice enables direct measurements of body composition, which are critical to quantitatively characterizing health relationships on aging. However, longitudinal analysis of body composition changes using 2D abdominal slices is challenging due to positional variance between longitudinal slices acquired in different years. To reduce the posi… ▽ More 2D low-dose single-slice abdominal computed tomography (CT) slice enables direct measurements of body composition, which are critical to quantitatively characterizing health relationships on aging. However, longitudinal analysis of body composition changes using 2D abdominal slices is challenging due to positional variance between longitudinal slices acquired in different years. To reduce the positional variance, we extend the conditional generative models to our C-SliceGen that takes an arbitrary axial slice in the abdominal region as the condition and generates a defined vertebral level slice by estimating the structural changes in the latent space. Experiments on 1170 subjects from an in-house dataset and 50 subjects from BTCV MICCAI Challenge 2015 show that our model can generate high quality images in terms of realism and similarity. External experiments on 20 subjects from the Baltimore Longitudinal Study of Aging (BLSA) dataset that contains longitudinal single abdominal slices validate that our method can harmonize the slice positional variance in terms of muscle and visceral fat area. Our approach provides a promising direction of map** slices from different vertebral levels to a target slice to reduce positional variance for single slice longitudinal analysis. The source code is available at: https://github.com/MASILab/C-SliceGen. △ Less

Submitted 28 September, 2022; originally announced September 2022.

Comments: 11 pages, 4 figures

Journal ref: Medical Image Computing and Computer Assisted Intervention MICCAI 2022, Cham, 2022, pp202,212

arXiv:2209.14378 [pdf, other]

UNesT: Local Spatial Representation Learning with Hierarchical Transformer for Efficient Medical Segmentation

Authors: Xin Yu, Qi Yang, Yinchi Zhou, Leon Y. Cai, Riqiang Gao, Ho Hin Lee, Thomas Li, Shunxing Bao, Zhoubing Xu, Thomas A. Lasko, Richard G. Abramson, Zizhao Zhang, Yuankai Huo, Bennett A. Landman, Yucheng Tang

Abstract: Transformer-based models, capable of learning better global dependencies, have recently demonstrated exceptional representation learning capabilities in computer vision and medical image analysis. Transformer reformats the image into separate patches and realizes global communication via the self-attention mechanism. However, positional information between patches is hard to preserve in such 1D se… ▽ More Transformer-based models, capable of learning better global dependencies, have recently demonstrated exceptional representation learning capabilities in computer vision and medical image analysis. Transformer reformats the image into separate patches and realizes global communication via the self-attention mechanism. However, positional information between patches is hard to preserve in such 1D sequences, and loss of it can lead to sub-optimal performance when dealing with large amounts of heterogeneous tissues of various sizes in 3D medical image segmentation. Additionally, current methods are not robust and efficient for heavy-duty medical segmentation tasks such as predicting a large number of tissue classes or modeling globally inter-connected tissue structures. To address such challenges and inspired by the nested hierarchical structures in vision transformer, we proposed a novel 3D medical image segmentation method (UNesT), employing a simplified and faster-converging transformer encoder design that achieves local communication among spatially adjacent patch sequences by aggregating them hierarchically. We extensively validate our method on multiple challenging datasets, consisting of multiple modalities, anatomies, and a wide range of tissue classes, including 133 structures in the brain, 14 organs in the abdomen, 4 hierarchical components in the kidneys, inter-connected kidney tumors and brain tumors. We show that UNesT consistently achieves state-of-the-art performance and evaluate its generalizability and data efficiency. Particularly, the model achieves whole brain segmentation task complete ROI with 133 tissue classes in a single network, outperforming prior state-of-the-art method SLANT27 ensembled with 27 networks. △ Less

Submitted 7 September, 2023; v1 submitted 28 September, 2022; originally announced September 2022.

Comments: 19 pages, 17 figures. arXiv admin note: text overlap with arXiv:2203.02430

arXiv:2209.07884 [pdf, ps, other]

Workflow-based Fast Data-driven Predictive Control with Disturbance Observer in Cloud-edge Collaborative Architecture

Authors: Runze Gao, Qiwen Li, Li Dai, Yufeng Zhan, Yuanqing Xia

Abstract: Data-driven predictive control (DPC) has been studied and used in various scenarios, since it could generate the predicted control sequence only relying on the historical input and output data. Recently, based on cloud computing, data-driven predictive cloud control system (DPCCS) has been proposed with the advantage of sufficient computational resources. However, the existing computation mode of… ▽ More Data-driven predictive control (DPC) has been studied and used in various scenarios, since it could generate the predicted control sequence only relying on the historical input and output data. Recently, based on cloud computing, data-driven predictive cloud control system (DPCCS) has been proposed with the advantage of sufficient computational resources. However, the existing computation mode of DPCCS is centralized. This computation mode could not utilize fully the computing power of cloud computing, of which the structure is distributed. Thus, the computation delay could not been reduced and still affects the control quality. In this paper, a novel cloud-edge collaborative containerised workflow-based DPC system with disturbance observer (DOB) is proposed, to improve the computation efficiency and guarantee the control accuracy. First, a construction method for the DPC workflow is designed, to match the distributed processing environment of cloud computing. But the non-computation overheads of the workflow tasks are relatively high. Therefore, a cloud-edge collaborative control scheme with DOB is designed. The low-weight data could be truncated to reduce the non-computation overheads. Meanwhile, we design an edge DOB to estimate and compensate the uncertainty in cloud workflow processing, and obtain the composite control variable. The UUB stability of the DOB is also proved. Third, to execute the workflow-based DPC controller and evaluate the proposed cloud-edge collaborative control scheme with DOB in the real cloud environment, we design and implement a practical workflow-based cloud control experimental system based on container technology. Finally, a series of evaluations show that, the computation times are decreased by 45.19% and 74.35% for two real-time control examples, respectively, and by at most 85.10% for a high-dimension control example. △ Less

Submitted 16 September, 2022; originally announced September 2022.

Comments: 58 pages and 23 figures

arXiv:2209.01676 [pdf]

Time-distance vision transformers in lung cancer diagnosis from longitudinal computed tomography

Authors: Thomas Z. Li, Kaiwen Xu, Riqiang Gao, Yucheng Tang, Thomas A. Lasko, Fabien Maldonado, Kim Sandler, Bennett A. Landman

Abstract: Features learned from single radiologic images are unable to provide information about whether and how much a lesion may be changing over time. Time-dependent features computed from repeated images can capture those changes and help identify malignant lesions by their temporal behavior. However, longitudinal medical imaging presents the unique challenge of sparse, irregular time intervals in data… ▽ More Features learned from single radiologic images are unable to provide information about whether and how much a lesion may be changing over time. Time-dependent features computed from repeated images can capture those changes and help identify malignant lesions by their temporal behavior. However, longitudinal medical imaging presents the unique challenge of sparse, irregular time intervals in data acquisition. While self-attention has been shown to be a versatile and efficient learning mechanism for time series and natural images, its potential for interpreting temporal distance between sparse, irregularly sampled spatial features has not been explored. In this work, we propose two interpretations of a time-distance vision transformer (ViT) by using (1) vector embeddings of continuous time and (2) a temporal emphasis model to scale self-attention weights. The two algorithms are evaluated based on benign versus malignant lung cancer discrimination of synthetic pulmonary nodules and lung screening computed tomography studies from the National Lung Screening Trial (NLST). Experiments evaluating the time-distance ViTs on synthetic nodules show a fundamental improvement in classifying irregularly sampled longitudinal images when compared to standard ViTs. In cross-validation on screening chest CTs from the NLST, our methods (0.785 and 0.786 AUC respectively) significantly outperform a cross-sectional approach (0.734 AUC) and match the discriminative performance of the leading longitudinal medical imaging algorithm (0.779 AUC) on benign versus malignant classification. This work represents the first self-attention-based framework for classifying longitudinal medical images. Our code is available at https://github.com/tom1193/time-distance-transformer. △ Less

Submitted 4 September, 2022; originally announced September 2022.

Comments: Summited to SPIE 2023 - Medical Imaging. 10 pages

arXiv:2207.06551 [pdf, other]

Body Composition Assessment with Limited Field-of-view Computed Tomography: A Semantic Image Extension Perspective

Authors: Kaiwen Xu, Thomas Li, Mirza S. Khan, Riqiang Gao, Sanja L. Antic, Yuankai Huo, Kim L. Sandler, Fabien Maldonado, Bennett A. Landman

Abstract: Field-of-view (FOV) tissue truncation beyond the lungs is common in routine lung screening computed tomography (CT). This poses limitations for opportunistic CT- based body composition (BC) assessment as key anatomical structures are missing. Traditionally, extending the FOV of CT is considered as a CT reconstruction problem using limited data. However, this approach relies on the projection domai… ▽ More Field-of-view (FOV) tissue truncation beyond the lungs is common in routine lung screening computed tomography (CT). This poses limitations for opportunistic CT- based body composition (BC) assessment as key anatomical structures are missing. Traditionally, extending the FOV of CT is considered as a CT reconstruction problem using limited data. However, this approach relies on the projection domain data which might not be available in application. In this work, we formulate the problem from the semantic image extension perspective which only requires image data as inputs. The proposed two-stage method identifies a new FOV border based on the estimated extent of the complete body and imputes missing tissues in the truncated region. The training samples are simulated using CT slices with complete body in FOV, making the model development self-supervised. We evaluate the validity of the proposed method in automatic BC assessment using lung screening CT with limited FOV. The proposed method effectively restores the missing tissues and reduces BC assessment error introduced by FOV tissue truncation. In the BC assessment for a large-scale lung screening CT dataset, this correction improves both the intra-subject consistency and the correlation with anthropometric approximations. The developed method is available at https://github.com/MASILab/S-EFOV. △ Less

Submitted 15 April, 2023; v1 submitted 13 July, 2022; originally announced July 2022.

Comments: Updated with additional evaluation and clarification

arXiv:2206.07599 [pdf, other]

How GNNs Facilitate CNNs in Mining Geometric Information from Large-Scale Medical Images

Authors: Yiqing Shen, Bingxin Zhou, Xinye Xiong, Ruitian Gao, Yu Guang Wang

Abstract: Gigapixel medical images provide massive data, both morphological textures and spatial information, to be mined. Due to the large data scale in histology, deep learning methods play an increasingly significant role as feature extractors. Existing solutions heavily rely on convolutional neural networks (CNNs) for global pixel-level analysis, leaving the underlying local geometric structure such as… ▽ More Gigapixel medical images provide massive data, both morphological textures and spatial information, to be mined. Due to the large data scale in histology, deep learning methods play an increasingly significant role as feature extractors. Existing solutions heavily rely on convolutional neural networks (CNNs) for global pixel-level analysis, leaving the underlying local geometric structure such as the interaction between cells in the tumor microenvironment unexplored. The topological structure in medical images, as proven to be closely related to tumor evolution, can be well characterized by graphs. To obtain a more comprehensive representation for downstream oncology tasks, we propose a fusion framework for enhancing the global image-level representation captured by CNNs with the geometry of cell-level spatial information learned by graph neural networks (GNN). The fusion layer optimizes an integration between collaborative features of global images and cell graphs. Two fusion strategies have been developed: one with MLP which is simple but turns out efficient through fine-tuning, and the other with Transformer gains a champion in fusing multiple networks. We evaluate our fusion strategies on histology datasets curated from large patient cohorts of colorectal and gastric cancers for three biomarker prediction tasks. Both two models outperform plain CNNs or GNNs, reaching a consistent AUC improvement of more than 5% on various network backbones. The experimental results yield the necessity for combining image-level morphological features with cell spatial relations in medical image analysis. Codes are available at https://github.com/yiqings/HEGnnEnhanceCnn. △ Less

Submitted 15 June, 2022; originally announced June 2022.

Comments: 21 pages

arXiv:2205.05898 [pdf]

Pseudo-Label Guided Multi-Contrast Generalization for Non-Contrast Organ-Aware Segmentation

Authors: Ho Hin Lee, Yucheng Tang, Riqiang Gao, Qi Yang, Xin Yu, Shunxing Bao, James G. Terry, J. Jeffrey Carr, Yuankai Huo, Bennett A. Landman

Abstract: Non-contrast computed tomography (NCCT) is commonly acquired for lung cancer screening, assessment of general abdominal pain or suspected renal stones, trauma evaluation, and many other indications. However, the absence of contrast limits distinguishing organ in-between boundaries. In this paper, we propose a novel unsupervised approach that leverages pairwise contrast-enhanced CT (CECT) context t… ▽ More Non-contrast computed tomography (NCCT) is commonly acquired for lung cancer screening, assessment of general abdominal pain or suspected renal stones, trauma evaluation, and many other indications. However, the absence of contrast limits distinguishing organ in-between boundaries. In this paper, we propose a novel unsupervised approach that leverages pairwise contrast-enhanced CT (CECT) context to compute non-contrast segmentation without ground-truth label. Unlike generative adversarial approaches, we compute the pairwise morphological context with CECT to provide teacher guidance instead of generating fake anatomical context. Additionally, we further augment the intensity correlations in 'organ-specific' settings and increase the sensitivity to organ-aware boundary. We validate our approach on multi-organ segmentation with paired non-contrast & contrast-enhanced CT scans using five-fold cross-validation. Full external validations are performed on an independent non-contrast cohort for aorta segmentation. Compared with current abdominal organs segmentation state-of-the-art in fully supervised setting, our proposed pipeline achieves a significantly higher Dice by 3.98% (internal multi-organ annotated), and 8.00% (external aorta annotated) for abdominal organs segmentation. The code and pretrained models are publicly available at https://github.com/MASILab/ContrastMix. △ Less

Submitted 12 May, 2022; originally announced May 2022.

arXiv:2204.02389 [pdf, other]

ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer

Authors: Ruohan Gao, Zilin Si, Yen-Yu Chang, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, Jiajun Wu

Abstract: Objects play a crucial role in our everyday activities. Though multisensory object-centric learning has shown great potential lately, the modeling of objects in prior work is rather unrealistic. ObjectFolder 1.0 is a recent dataset that introduces 100 virtualized objects with visual, acoustic, and tactile sensory data. However, the dataset is small in scale and the multisensory data is of limited… ▽ More Objects play a crucial role in our everyday activities. Though multisensory object-centric learning has shown great potential lately, the modeling of objects in prior work is rather unrealistic. ObjectFolder 1.0 is a recent dataset that introduces 100 virtualized objects with visual, acoustic, and tactile sensory data. However, the dataset is small in scale and the multisensory data is of limited quality, hampering generalization to real-world scenarios. We present ObjectFolder 2.0, a large-scale, multisensory dataset of common household objects in the form of implicit neural representations that significantly enhances ObjectFolder 1.0 in three aspects. First, our dataset is 10 times larger in the amount of objects and orders of magnitude faster in rendering time. Second, we significantly improve the multisensory rendering quality for all three modalities. Third, we show that models learned from virtual objects in our dataset successfully transfer to their real-world counterparts in three challenging tasks: object scale estimation, contact localization, and shape reconstruction. ObjectFolder 2.0 offers a new path and testbed for multisensory learning in computer vision and robotics. The dataset is available at https://github.com/rhgao/ObjectFolder. △ Less

Submitted 5 April, 2022; originally announced April 2022.

Comments: In CVPR 2022. Gao, Si, and Chang contributed equally to this work. Project page: https://ai.stanford.edu/~rhgao/objectfolder2.0/

arXiv:2203.12910 [pdf, other]

SSGCNet: A Sparse Spectra Graph Convolutional Network for Epileptic EEG Signal Classification

Authors: Jialin Wang, Rui Gao, Haotian Zheng, Hao Zhu, C. -J. Richard Shi

Abstract: In this article, we propose a sparse spectra graph convolutional network (SSGCNet) for solving Epileptic EEG signal classification problems. The aim is to achieve a lightweight deep learning model without losing model classification accuracy. We propose a weighted neighborhood field graph (WNFG) to represent EEG signals, which reduces the redundant edges between graph nodes. WNFG has lower time co… ▽ More In this article, we propose a sparse spectra graph convolutional network (SSGCNet) for solving Epileptic EEG signal classification problems. The aim is to achieve a lightweight deep learning model without losing model classification accuracy. We propose a weighted neighborhood field graph (WNFG) to represent EEG signals, which reduces the redundant edges between graph nodes. WNFG has lower time complexity and memory usage than the conventional solutions. Using the graph representation, the sequential graph convolutional network is based on a combination of sparse weight pruning technique and the alternating direction method of multipliers (ADMM). Our approach can reduce computation complexity without effect on classification accuracy. We also present convergence results for the proposed approach. The performance of the approach is illustrated in public and clinical-real datasets. Compared with the existing literature, our WNFG of EEG signals achieves up to 10 times of redundant edge reduction, and our approach achieves up to 97 times of model pruning without loss of classification accuracy. △ Less

Submitted 24 March, 2022; originally announced March 2022.

Comments: 12 pages, 7 figures

arXiv:2203.09477 [pdf, other]

doi 10.1016/j.ins.2022.12.088

A Decomposition-Based Hybrid Ensemble CNN Framework for Driver Fatigue Recognition

Authors: Ruilin Li, Ruobin Gao, P. N. Suganthan

Abstract: Electroencephalogram (EEG) has become increasingly popular in driver fatigue monitoring systems. Several decomposition methods have been attempted to analyze the EEG signals that are complex, nonlinear and non-stationary and improve the EEG decoding performance in different applications. However, it remains challenging to extract more distinguishable features from different decomposed components f… ▽ More Electroencephalogram (EEG) has become increasingly popular in driver fatigue monitoring systems. Several decomposition methods have been attempted to analyze the EEG signals that are complex, nonlinear and non-stationary and improve the EEG decoding performance in different applications. However, it remains challenging to extract more distinguishable features from different decomposed components for driver fatigue recognition. In this work, we propose a novel decomposition-based hybrid ensemble convolutional neural network (CNN) framework to enhance the capability of decoding EEG signals. Four decomposition methods are employed to disassemble the EEG signals into components of different complexity. Instead of handcraft features, the CNNs in this framework directly learn from the decomposed components. In addition, a component-specific batch normalization layer is employed to reduce subject variability. Moreover, we employ two ensemble modes to integrate the outputs of all CNNs, comprehensively exploiting the diverse information of the decomposed components. Against the challenging cross-subject driver fatigue recognition task, the models under the framework all showed superior performance to the strong baselines. Specifically, the performance of different decomposition methods and ensemble modes was further compared. The results indicated that discrete wavelet transform-based ensemble CNN achieved the highest average classification accuracy of 83.48% among the compared methods. The proposed framework can be extended to any CNN architecture and be applied to any EEG-related tasks, opening the possibility of extracting more beneficial features from complex EEG data. △ Less

Submitted 10 January, 2023; v1 submitted 14 March, 2022; originally announced March 2022.

Journal ref: Information Sciences. 2023

arXiv:2203.02430 [pdf, other]

Characterizing Renal Structures with 3D Block Aggregate Transformers

Authors: Xin Yu, Yucheng Tang, Yinchi Zhou, Riqiang Gao, Qi Yang, Ho Hin Lee, Thomas Li, Shunxing Bao, Yuankai Huo, Zhoubing Xu, Thomas A. Lasko, Richard G. Abramson, Bennett A. Landman

Abstract: Efficiently quantifying renal structures can provide distinct spatial context and facilitate biomarker discovery for kidney morphology. However, the development and evaluation of the transformer model to segment the renal cortex, medulla, and collecting system remains challenging due to data inefficiency. Inspired by the hierarchical structures in vision transformer, we propose a novel method usin… ▽ More Efficiently quantifying renal structures can provide distinct spatial context and facilitate biomarker discovery for kidney morphology. However, the development and evaluation of the transformer model to segment the renal cortex, medulla, and collecting system remains challenging due to data inefficiency. Inspired by the hierarchical structures in vision transformer, we propose a novel method using a 3D block aggregation transformer for segmenting kidney components on contrast-enhanced CT scans. We construct the first cohort of renal substructures segmentation dataset with 116 subjects under institutional review board (IRB) approval. Our method yields the state-of-the-art performance (Dice of 0.8467) against the baseline approach of 0.8308 with the data-efficient design. The Pearson R achieves 0.9891 between the proposed method and manual standards and indicates the strong correlation and reproducibility for volumetric analysis. We extend the proposed method to the public KiTS dataset, the method leads to improved accuracy compared to transformer-based approaches. We show that the 3D block aggregation transformer can achieve local communication between sequence representations without modifying self-attention, and it can serve as an accurate and efficient quantification tool for characterizing renal structures. △ Less

Submitted 4 March, 2022; originally announced March 2022.

arXiv:2202.06875 [pdf, other]

Visual Acoustic Matching

Authors: Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman

Abstract: We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal tr… ▽ More We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines. △ Less

Submitted 13 June, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

Comments: Project page: https://vision.cs.utexas.edu/projects/visual-acoustic-matching. Accepted at CVPR 2022

arXiv:2202.06012

Cloud-based computational model predictive control using a parallel multi-block ADMM approach

Authors: Yaling Ma, Runze Gao, Li Dai, **xian Wu, Yuanqing Xia

Abstract: Heavy computational load for solving nonconvex problems for large-scale systems or systems with real-time demands at each sample step has been recognized as one of the reasons for preventing a wider application of nonlinear model predictive control (NMPC). To improve the real-time feasibility of NMPC with input nonlinearity, we devise an innovative scheme called cloud-based computational model pre… ▽ More Heavy computational load for solving nonconvex problems for large-scale systems or systems with real-time demands at each sample step has been recognized as one of the reasons for preventing a wider application of nonlinear model predictive control (NMPC). To improve the real-time feasibility of NMPC with input nonlinearity, we devise an innovative scheme called cloud-based computational model predictive control (MPC) by using an elaborately designed parallel multi-block alternating direction method of multipliers (ADMM) algorithm. This novel parallel multi-block ADMM algorithm is tailored to tackle the computational issue of solving a nonconvex problem with nonlinear constraints. △ Less

Submitted 15 April, 2022; v1 submitted 12 February, 2022; originally announced February 2022.

Comments: Statements and experiments are flawed

arXiv:2112.14349 [pdf, ps, other]

Fast Subspace Identification Method Based on Containerised Cloud Workflow Processing System

Authors: Runze Gao, Yuanqing Xia, Guan Wang, Liwen Yang, Yufeng Zhan

Abstract: Subspace identification (SID) has been widely used in system identification and control fields since it can estimate system models only relying on the input and output data by reliable numerical operations such as singular value decomposition (SVD). However, high-dimension Hankel matrices are involved to store these data and used to obtain the system models, which increases the computation amount… ▽ More Subspace identification (SID) has been widely used in system identification and control fields since it can estimate system models only relying on the input and output data by reliable numerical operations such as singular value decomposition (SVD). However, high-dimension Hankel matrices are involved to store these data and used to obtain the system models, which increases the computation amount of SID and leads SID not suitable for the large-scale or real-time identification tasks. In this paper, a novel fast SID method based on cloud workflow processing and container technology is proposed to accelerate the traditional algorithm. First, a workflow-based structure of SID is designed to match the distributed cloud environment, based on the computational feature of each calculation stage. Second, a containerised cloud workflow processing system is established to execute the logic- and data- dependent SID workflow mission based on Kubernetes system. Finally, the experiments show that the computation time is reduced by at most $91.6\%$ for large-scale SID mission and decreased to within 20 ms for the real-time mission parameter. △ Less

Submitted 28 December, 2021; originally announced December 2021.

arXiv:2112.14347 [pdf, ps, other]

Design and Implementation of Data-driven Predictive Cloud Control System

Authors: Runze Gao, Yuanqing Xia, Li Dai, Zhongqi Sun

Abstract: Nowadays, the rapid increases of the scale and complexity of the controlled plants bring new challenges such as computing power and storage for conventional control systems. Cloud computing is concerned as a powerful solution to handle the complex large-scale control missions using sufficient computing resources. However, the developed computing ability enables more complex devices and mass data b… ▽ More Nowadays, the rapid increases of the scale and complexity of the controlled plants bring new challenges such as computing power and storage for conventional control systems. Cloud computing is concerned as a powerful solution to handle the complex large-scale control missions using sufficient computing resources. However, the developed computing ability enables more complex devices and mass data being involved and thus the applications of model-based algorithms are constrained. Motivated by the above, we propose an original data-driven predictive cloud control system. To achieve the proposed system, a practical data-driven predictive cloud control platform rather than only a numerical simulator is established and together a cloud-edge communication scheme is developed. Finally, the verification of simulations and experiments as well as discussions demonstrate the effectiveness of the proposed system. △ Less

Submitted 28 December, 2021; originally announced December 2021.

arXiv:2111.10882 [pdf, other]

Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video

Authors: Rishabh Garg, Ruohan Gao, Kristen Grauman

Abstract: Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach ex… ▽ More Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process. In particular, we develop a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the visual stream's coherence with the sound source(s) positions, and the consistency in geometry of the sounding objects over time. Furthermore, we introduce a new large video dataset with realistic binaural audio simulated for real-world scanned environments. On two datasets, we demonstrate the efficacy of our method, which achieves state-of-the-art results. △ Less

Submitted 21 November, 2021; originally announced November 2021.

Comments: Published in BMVC 2021, project page: http://vision.cs.utexas.edu/projects/geometry-aware-binaural/

arXiv:2111.09957 [pdf, other]

Rethinking Dilated Convolution for Real-time Semantic Segmentation

Authors: Roland Gao

Abstract: The field-of-view is an important metric when designing a model for semantic segmentation. To obtain a large field-of-view, previous approaches generally choose to rapidly downsample the resolution, usually with average poolings or stride 2 convolutions. We take a different approach by using dilated convolutions with large dilation rates throughout the backbone, allowing the backbone to easily tun… ▽ More The field-of-view is an important metric when designing a model for semantic segmentation. To obtain a large field-of-view, previous approaches generally choose to rapidly downsample the resolution, usually with average poolings or stride 2 convolutions. We take a different approach by using dilated convolutions with large dilation rates throughout the backbone, allowing the backbone to easily tune its field-of-view by adjusting its dilation rates, and show that it's competitive with existing approaches. To effectively use the dilated convolution, we show a simple upper bound on the dilation rate in order to not leave gaps in between the convolutional weights, and design an SE-ResNeXt inspired block structure that uses two parallel $3\times 3$ convolutions with different dilation rates to preserve the local details. Manually tuning the dilation rates for every block can be difficult, so we also introduce a differentiable neural architecture search method that uses gradient descent to optimize the dilation rates. In addition, we propose a lightweight decoder that restores local information better than common alternatives. To demonstrate the effectiveness of our approach, our model RegSeg achieves competitive results on real-time Cityscapes and CamVid datasets. Using a T4 GPU with mixed precision, RegSeg achieves 78.3 mIOU on Cityscapes test set at $37$ FPS, and 80.9 mIOU on CamVid test set at $112$ FPS, both without ImageNet pretraining. △ Less

Submitted 27 November, 2023; v1 submitted 18 November, 2021; originally announced November 2021.

Comments: CVPR 2023 Efficient CV workshop

arXiv:2111.02493 [pdf]

doi 10.1088/1361-6501/ac2dbd

Roadmap on Signal Processing for Next Generation Measurement Systems

Authors: D. K. Iakovidis, M. Ooi, Y. C. Kuang, S. Demidenko, A. Shestakov, V. Sinitsin, M. Henry, A. Sciacchitano, A. Discetti, S. Donati, M. Norgia, A. Menychtas, I. Maglogiannis, S. C. Wriessnegger, L. A. Barradas Chacon, G. Dimas, D. Filos, A. H. Aletras, J. Töger, F. Dong, S. Ren, A. Uhl, J. Paziewski, J. Geng, F. Fioranelli , et al. (9 additional authors not shown)

Abstract: Signal processing is a fundamental component of almost any sensor-enabled system, with a wide range of applications across different scientific disciplines. Time series data, images, and video sequences comprise representative forms of signals that can be enhanced and analysed for information extraction and quantification. The recent advances in artificial intelligence and machine learning are shi… ▽ More Signal processing is a fundamental component of almost any sensor-enabled system, with a wide range of applications across different scientific disciplines. Time series data, images, and video sequences comprise representative forms of signals that can be enhanced and analysed for information extraction and quantification. The recent advances in artificial intelligence and machine learning are shifting the research attention towards intelligent, data-driven, signal processing. This roadmap presents a critical overview of the state-of-the-art methods and applications aiming to highlight future challenges and research opportunities towards next generation measurement systems. It covers a broad spectrum of topics ranging from basic to industrial research, organized in concise thematic sections that reflect the trends and the impacts of current and future developments per research field. Furthermore, it offers guidance to researchers and funding agencies in identifying new prospects. △ Less

Submitted 28 January, 2022; v1 submitted 3 November, 2021; originally announced November 2021.

Comments: 48 pages, https://iopscience.iop.org/article/10.1088/1361-6501/ac2dbd

Journal ref: Measurement Science and Technology 33(1) (2022) 1-48

arXiv:2110.08718 [pdf, other]

AE-StyleGAN: Improved Training of Style-Based Auto-Encoders

Authors: Ligong Han, Sri Harsha Musunuri, Martin Renqiang Min, Ruijiang Gao, Yu Tian, Dimitris Metaxas

Abstract: StyleGANs have shown impressive results on data generation and manipulation in recent years, thanks to its disentangled style latent space. A lot of efforts have been made in inverting a pretrained generator, where an encoder is trained ad hoc after the generator is trained in a two-stage fashion. In this paper, we focus on style-based generators asking a scientific question: Does forcing such a g… ▽ More StyleGANs have shown impressive results on data generation and manipulation in recent years, thanks to its disentangled style latent space. A lot of efforts have been made in inverting a pretrained generator, where an encoder is trained ad hoc after the generator is trained in a two-stage fashion. In this paper, we focus on style-based generators asking a scientific question: Does forcing such a generator to reconstruct real data lead to more disentangled latent space and make the inversion process from image to latent space easy? We describe a new methodology to train a style-based autoencoder where the encoder and generator are optimized end-to-end. We show that our proposed model consistently outperforms baselines in terms of image inversion and generation quality. Supplementary, code, and pretrained models are available on the project website. △ Less

Submitted 17 October, 2021; originally announced October 2021.

Comments: Accepted at WACV-22

arXiv:2109.09004 [pdf, other]

Random Multi-Channel Image Synthesis for Multiplexed Immunofluorescence Imaging

Authors: Shunxing Bao, Yucheng Tang, Ho Hin Lee, Riqiang Gao, Sophie Chiron, Ilwoo Lyu, Lori A. Coburn, Keith T. Wilson, Joseph T. Roland, Bennett A. Landman, Yuankai Huo

Abstract: Multiplex immunofluorescence (MxIF) is an emerging imaging technique that produces the high sensitivity and specificity of single-cell map**. With a tenet of 'seeing is believing', MxIF enables iterative staining and imaging extensive antibodies, which provides comprehensive biomarkers to segment and group different cells on a single tissue section. However, considerable depletion of the scarce… ▽ More Multiplex immunofluorescence (MxIF) is an emerging imaging technique that produces the high sensitivity and specificity of single-cell map**. With a tenet of 'seeing is believing', MxIF enables iterative staining and imaging extensive antibodies, which provides comprehensive biomarkers to segment and group different cells on a single tissue section. However, considerable depletion of the scarce tissue is inevitable from extensive rounds of staining and bleaching ('missing tissue'). Moreover, the immunofluorescence (IF) imaging can globally fail for particular rounds ('missing stain''). In this work, we focus on the 'missing stain' issue. It would be appealing to develop digital image synthesis approaches to restore missing stain images without losing more tissue physically. Herein, we aim to develop image synthesis approaches for eleven MxIF structural molecular markers (i.e., epithelial and stromal) on real samples. We propose a novel multi-channel high-resolution image synthesis approach, called pixN2N-HD, to tackle possible missing stain scenarios via a high-resolution generative adversarial network (GAN). Our contribution is three-fold: (1) a single deep network framework is proposed to tackle missing stain in MxIF; (2) the proposed 'N-to-N' strategy reduces theoretical four years of computational time to 20 hours when covering all possible missing stains scenarios, with up to five missing stains (e.g., '(N-1)-to-1', '(N-2)-to-2'); and (3) this work is the first comprehensive experimental study of investigating cross-stain synthesis in MxIF. Our results elucidate a promising direction of advancing MxIF imaging with deep image synthesis. △ Less

Submitted 18 September, 2021; originally announced September 2021.

Comments: Accepted at the third MICCAI workshop on Computational Pathology (COMPAY 2021)

arXiv:2107.14385 [pdf, other]

doi 10.1016/j.eswa.2022.117784

Random vector functional link neural network based ensemble deep learning for short-term load forecasting

Authors: Ruobin Gao, Liang Du, P. N. Suganthan, Qin Zhou, Kum Fai Yuen

Abstract: Electricity load forecasting is crucial for the power systems' planning and maintenance. However, its un-stationary and non-linear characteristics impose significant difficulties in anticipating future demand. This paper proposes a novel ensemble deep Random Vector Functional Link (edRVFL) network for electricity load forecasting. The weights of hidden layers are randomly initialized and kept fixe… ▽ More Electricity load forecasting is crucial for the power systems' planning and maintenance. However, its un-stationary and non-linear characteristics impose significant difficulties in anticipating future demand. This paper proposes a novel ensemble deep Random Vector Functional Link (edRVFL) network for electricity load forecasting. The weights of hidden layers are randomly initialized and kept fixed during the training process. The hidden layers are stacked to enforce deep representation learning. Then, the model generates the forecasts by ensembling the outputs of each layer. Moreover, we also propose to augment the random enhancement features by empirical wavelet transformation (EWT). The raw load data is decomposed by EWT in a walk-forward fashion, not introducing future data leakage problems in the decomposition process. Finally, all the sub-series generated by the EWT, including raw data, are fed into the edRVFL for forecasting purposes. The proposed model is evaluated on twenty publicly available time series from the Australian Energy Market Operator of the year 2020. The simulation results demonstrate the proposed model's superior performance over eleven forecasting methods in three error metrics and statistical tests on electricity load forecasting tasks. △ Less

Submitted 29 July, 2021; originally announced July 2021.

Journal ref: Expert Systems with Applications,2022

arXiv:2107.11882 [pdf, other]

Lung Cancer Risk Estimation with Incomplete Data: A Joint Missing Imputation Perspective

Authors: Riqiang Gao, Yucheng Tang, Kaiwen Xu, Ho Hin Lee, Steve Deppen, Kim Sandler, Pierre Massion, Thomas A. Lasko, Yuankai Huo, Bennett A. Landman

Abstract: Data from multi-modality provide complementary information in clinical prediction, but missing data in clinical cohorts limits the number of subjects in multi-modal learning context. Multi-modal missing imputation is challenging with existing methods when 1) the missing data span across heterogeneous modalities (e.g., image vs. non-image); or 2) one modality is largely missing. In this paper, we a… ▽ More Data from multi-modality provide complementary information in clinical prediction, but missing data in clinical cohorts limits the number of subjects in multi-modal learning context. Multi-modal missing imputation is challenging with existing methods when 1) the missing data span across heterogeneous modalities (e.g., image vs. non-image); or 2) one modality is largely missing. In this paper, we address imputation of missing data by modeling the joint distribution of multi-modal data. Motivated by partial bidirectional generative adversarial net (PBiGAN), we propose a new Conditional PBiGAN (C-PBiGAN) method that imputes one modality combining the conditional knowledge from another modality. Specifically, C-PBiGAN introduces a conditional latent space in a missing imputation framework that jointly encodes the available multi-modal data, along with a class regularization loss on imputed data to recover discriminative information. To our knowledge, it is the first generative adversarial model that addresses multi-modal missing imputation by modeling the joint distribution of image and non-image data. We validate our model with both the national lung screening trial (NLST) dataset and an external clinical validation cohort. The proposed C-PBiGAN achieves significant improvements in lung cancer risk estimation compared with representative imputation methods (e.g., AUC values increase in both NLST (+2.9\%) and in-house dataset (+4.3\%) compared with PBiGAN, p$<$0.05). △ Less

Submitted 25 July, 2021; originally announced July 2021.

Comments: Early Accepted by MICCAI 2021. Traveling Award

arXiv:2105.11332 [pdf, other]

Physical Layer Security for UAV Communications in 5G and Beyond Networks

Authors: Jue Wang, Xuanxuan Wang, Ruifeng Gao, Chengleyang Lei, Wei Feng, Ning Ge, Shi **, Tony Q. S. Quek

Abstract: Due to its high mobility and flexible deployment, unmanned aerial vehicle (UAV) is drawing unprecedented interest in both military and civil applications to enable agile wireless communications and provide ubiquitous connectivity. Mainly operating in an open environment, UAV communications can benefit from dominant line-of-sight links; however, it on the other hand renders the UAVs more vulnerable… ▽ More Due to its high mobility and flexible deployment, unmanned aerial vehicle (UAV) is drawing unprecedented interest in both military and civil applications to enable agile wireless communications and provide ubiquitous connectivity. Mainly operating in an open environment, UAV communications can benefit from dominant line-of-sight links; however, it on the other hand renders the UAVs more vulnerable to malicious eavesdrop** or jamming attacks. Recently, physical layer security (PLS), which exploits the inherent randomness of the wireless channels for secure communications, has been introduced to UAV systems as an important complement to the conventional cryptography-based approaches. In this paper, a comprehensive survey on the current achievements of the UAV-aided wireless communications is conducted from the PLS perspective. We first introduce the basic concepts of UAV communications including the typical static/mobile deployment scenarios, the unique characteristics of air-to-ground channels, as well as various roles that a UAV may act when PLS is concerned. Then, we introduce the widely used secrecy performance metrics and start by reviewing the secrecy performance analysis and enhancing techniques for statically deployed UAV systems, and extend the discussion to a more general scenario where the UAVs' mobility is further exploited. For both cases, respectively, we summarize the commonly adopted methodologies in the corresponding analysis and design, then describe important works in the literature in detail. Finally, potential research directions and challenges are discussed to provide an outlook for future works in the area of UAV-PLS in 5G and beyond networks. △ Less

Submitted 24 May, 2021; originally announced May 2021.

arXiv:2101.03711 [pdf, other]

Generalize Ultrasound Image Segmentation via Instant and Plug & Play Style Transfer

Authors: Zhendong Liu, Xiaoqiong Huang, Xin Yang, Rui Gao, Rui Li, Yuanji Zhang, Yankai Huang, Guangquan Zhou, Yi Xiong, Alejandro F Frangi, Dong Ni

Abstract: Deep segmentation models that generalize to images with unknown appearance are important for real-world medical image analysis. Retraining models leads to high latency and complex pipelines, which are impractical in clinical settings. The situation becomes more severe for ultrasound image analysis because of their large appearance shifts. In this paper, we propose a novel method for robust segment… ▽ More Deep segmentation models that generalize to images with unknown appearance are important for real-world medical image analysis. Retraining models leads to high latency and complex pipelines, which are impractical in clinical settings. The situation becomes more severe for ultrasound image analysis because of their large appearance shifts. In this paper, we propose a novel method for robust segmentation under unknown appearance shifts. Our contribution is three-fold. First, we advance a one-stage plug-and-play solution by embedding hierarchical style transfer units into a segmentation architecture. Our solution can remove appearance shifts and perform segmentation simultaneously. Second, we adopt Dynamic Instance Normalization to conduct precise and dynamic style transfer in a learnable manner, rather than previously fixed style normalization. Third, our solution is fast and lightweight for routine clinical adoption. Given 400*400 image input, our solution only needs an additional 0.2ms and 1.92M FLOPs to handle appearance shifts compared to the baseline pipeline. Extensive experiments are conducted on a large dataset from three vendors demonstrate our proposed method enhances the robustness of deep segmentation models. △ Less

Submitted 11 January, 2021; originally announced January 2021.

Comments: Accepted by IEEE ISBI 2021

arXiv:2101.03149 [pdf, other]

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

Authors: Ruohan Gao, Kristen Grauman

Abstract: We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate, we propose to leverage the speaker's face appearance as an additional… ▽ More We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate, we propose to leverage the speaker's face appearance as an additional prior to isolate the corresponding vocal qualities they are likely to produce. Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video. It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement, and generalizes well to challenging real-world videos of diverse scenarios. Our video results and code: http://vision.cs.utexas.edu/projects/VisualVoice/. △ Less

Submitted 6 April, 2021; v1 submitted 8 January, 2021; originally announced January 2021.

Comments: In CVPR 2021. Project page: http://vision.cs.utexas.edu/projects/VisualVoice/

arXiv:2012.03124 [pdf]

Development and Characterization of a Chest CT Atlas

Authors: Kaiwen Xu, Riqiang Gao, Mirza S. Khan, Shunxing Bao, Yucheng Tang, Steve A. Deppen, Yuankai Huo, Kim L. Sandler, Pierre P. Massion, Mattias P. Heinrich, Bennett A. Landman

Abstract: A major goal of lung cancer screening is to identify individuals with particular phenotypes that are associated with high risk of cancer. Identifying relevant phenotypes is complicated by the variation in body position and body composition. In the brain, standardized coordinate systems (e.g., atlases) have enabled separate consideration of local features from gross/global structure. To date, no an… ▽ More A major goal of lung cancer screening is to identify individuals with particular phenotypes that are associated with high risk of cancer. Identifying relevant phenotypes is complicated by the variation in body position and body composition. In the brain, standardized coordinate systems (e.g., atlases) have enabled separate consideration of local features from gross/global structure. To date, no analogous standard atlas has been presented to enable spatial map** and harmonization in chest computational tomography (CT). In this paper, we propose a thoracic atlas built upon a large low dose CT (LDCT) database of lung cancer screening program. The study cohort includes 466 male and 387 female subjects with no screening detected malignancy (age 46-79 years, mean 64.9 years). To provide spatial map**, we optimize a multi-stage inter-subject non-rigid registration pipeline for the entire thoracic space. We evaluate the optimized pipeline relative to two baselines with alternative non-rigid registration module: the same software with default parameters and an alternative software. We achieve a significant improvement in terms of registration success rate based on manual QA. For the entire study cohort, the optimized pipeline achieves a registration success rate of 91.7%. The application validity of the developed atlas is evaluated in terms of discriminative capability for different anatomic phenotypes, including body mass index (BMI), chronic obstructive pulmonary disease (COPD), and coronary artery calcification (CAC). △ Less

Submitted 5 December, 2020; originally announced December 2020.

Comments: Accepted by SPIE2021 Medical Imaging (oral)

arXiv:2011.11896 [pdf]

doi 10.1109/JLT.2021.3067146

A Data-Fusion-Assisted Telemetry Layer for Autonomous Optical Networks

Authors: Xiaomin Liu, Huazhi Lun, Ruoxuan Gao, Meng Cai, Lilin Yi, Weisheng Hu, Qunbi Zhuge

Abstract: For further improving the capacity and reliability of optical networks, a closed-loop autonomous architecture is preferred. Considering a large number of optical components in an optical network and many digital signal processing modules in each optical transceiver, massive real-time data can be collected. However, for a traditional monitoring structure, collecting, storing and processing a large… ▽ More For further improving the capacity and reliability of optical networks, a closed-loop autonomous architecture is preferred. Considering a large number of optical components in an optical network and many digital signal processing modules in each optical transceiver, massive real-time data can be collected. However, for a traditional monitoring structure, collecting, storing and processing a large size of data are challenging tasks. Moreover, strong correlations and similarities between data from different sources and regions are not properly considered, which may limit function extension and accuracy improvement. To address abovementioned issues, a data-fusion-assisted telemetry layer between the physical layer and control layer is proposed in this paper. The data fusion methodologies are elaborated on three different levels: Source Level, Space Level and Model Level. For each level, various data fusion algorithms are introduced and relevant works are reviewed. In addition, proof-of-concept use cases for each level are provided through simulations, where the benefits of the data-fusion-assisted telemetry layer are shown. △ Less

Submitted 24 November, 2020; originally announced November 2020.

arXiv:2010.09524 [pdf]

Deep Multi-path Network Integrating Incomplete Biomarker and Chest CT Data for Evaluating Lung Cancer Risk

Authors: Riqiang Gao, Yucheng Tang, Kaiwen Xu, Michael N. Kammer, Sanja L. Antic, Steve Deppen, Kim L. Sandler, Pierre P. Massion, Yuankai Huo, Bennett A. Landman

Abstract: Clinical data elements (CDEs) (e.g., age, smoking history), blood markers and chest computed tomography (CT) structural features have been regarded as effective means for assessing lung cancer risk. These independent variables can provide complementary information and we hypothesize that combining them will improve the prediction accuracy. In practice, not all patients have all these variables ava… ▽ More Clinical data elements (CDEs) (e.g., age, smoking history), blood markers and chest computed tomography (CT) structural features have been regarded as effective means for assessing lung cancer risk. These independent variables can provide complementary information and we hypothesize that combining them will improve the prediction accuracy. In practice, not all patients have all these variables available. In this paper, we propose a new network design, termed as multi-path multi-modal missing network (M3Net), to integrate the multi-modal data (i.e., CDEs, biomarker and CT image) considering missing modality with multiple paths neural network. Each path learns discriminative features of one modality, and different modalities are fused in a second stage for an integrated prediction. The network can be trained end-to-end with both medical image features and CDEs/biomarkers, or make a prediction with single modality. We evaluate M3Net with datasets including three sites from the Consortium for Molecular and Cellular Characterization of Screen-Detected Lesions (MCL) project. Our method is cross validated within a cohort of 1291 subjects (383 subjects with complete CDEs/biomarkers and CT images), and externally validated with a cohort of 99 subjects (99 with complete CDEs/biomarkers and CT images). Both cross-validation and external-validation results show that combining multiple modality significantly improves the predicting performance of single modality. The results suggest that integrating subjects with missing either CDEs/biomarker or CT imaging features can contribute to the discriminatory power of our model (p < 0.05, bootstrap two-tailed test). In summary, the proposed M3Net framework provides an effective way to integrate image and non-image data in the context of missing information. △ Less

Submitted 9 February, 2021; v1 submitted 19 October, 2020; originally announced October 2020.

Comments: RFW all-conference best paper finalist, SPIE2021 Medical Imaging

arXiv:2010.04928 [pdf, other]

Contrastive Rendering for Ultrasound Image Segmentation

Authors: Haoming Li, Xin Yang, Jiamin Liang, Wenlong Shi, Chaoyu Chen, Haoran Dou, Rui Li, Rui Gao, Guangquan Zhou, **ghui Fang, Xiaowen Liang, Ruobing Huang, Alejandro Frangi, Zhiyi Chen, Dong Ni

Abstract: Ultrasound (US) image segmentation embraced its significant improvement in deep learning era. However, the lack of sharp boundaries in US images still remains an inherent challenge for segmentation. Previous methods often resort to global context, multi-scale cues or auxiliary guidance to estimate the boundaries. It is hard for these methods to approach pixel-level learning for fine-grained bounda… ▽ More Ultrasound (US) image segmentation embraced its significant improvement in deep learning era. However, the lack of sharp boundaries in US images still remains an inherent challenge for segmentation. Previous methods often resort to global context, multi-scale cues or auxiliary guidance to estimate the boundaries. It is hard for these methods to approach pixel-level learning for fine-grained boundary generating. In this paper, we propose a novel and effective framework to improve boundary estimation in US images. Our work has three highlights. First, we propose to formulate the boundary estimation as a rendering task, which can recognize ambiguous points (pixels/voxels) and calibrate the boundary prediction via enriched feature representation learning. Second, we introduce point-wise contrastive learning to enhance the similarity of points from the same class and contrastively decrease the similarity of points from different classes. Boundary ambiguities are therefore further addressed. Third, both rendering and contrastive learning tasks contribute to consistent improvement while reducing network parameters. As a proof-of-concept, we performed validation experiments on a challenging dataset of 86 ovarian US volumes. Results show that our proposed method outperforms state-of-the-art methods and has the potential to be used in clinical practice. △ Less

Submitted 10 October, 2020; originally announced October 2020.

Comments: 10 pages, 5 figures, 2 tables, 13 references

arXiv:2005.01616 [pdf, other]

VisualEchoes: Spatial Image Representation Learning through Echolocation

Authors: Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman

Abstract: Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First we capture echo responses in photo-realistic 3D… ▽ More Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First we capture echo responses in photo-realistic 3D indoor scene environments. Then we propose a novel interaction-based representation learning framework that learns useful visual features via echolocation. We show that the learned image features are useful for multiple downstream vision tasks requiring spatial reasoning---monocular depth estimation, surface normal estimation, and visual navigation---with results comparable or even better than heavily supervised pre-training. Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world. △ Less

Submitted 17 July, 2020; v1 submitted 4 May, 2020; originally announced May 2020.

Comments: Appears in ECCV 2020

arXiv:2002.05844 [pdf, other]

Remove Appearance Shift for Ultrasound Image Segmentation via Fast and Universal Style Transfer

Authors: Zhendong Liu, Xin Yang, Rui Gao, Shengfeng Liu, Haoran Dou, Shuangchi He, Yuhao Huang, Yankai Huang, Huanjia Luo, Yuanji Zhang, Yi Xiong, Dong Ni

Abstract: Deep Neural Networks (DNNs) suffer from the performance degradation when image appearance shift occurs, especially in ultrasound (US) image segmentation. In this paper, we propose a novel and intuitive framework to remove the appearance shift, and hence improve the generalization ability of DNNs. Our work has three highlights. First, we follow the spirit of universal style transfer to remove appea… ▽ More Deep Neural Networks (DNNs) suffer from the performance degradation when image appearance shift occurs, especially in ultrasound (US) image segmentation. In this paper, we propose a novel and intuitive framework to remove the appearance shift, and hence improve the generalization ability of DNNs. Our work has three highlights. First, we follow the spirit of universal style transfer to remove appearance shifts, which was not explored before for US images. Without sacrificing image structure details, it enables the arbitrary style-content transfer. Second, accelerated with Adaptive Instance Normalization block, our framework achieved real-time speed required in the clinical US scanning. Third, an efficient and effective style image selection strategy is proposed to ensure the target-style US image and testing content US image properly match each other. Experiments on two large US datasets demonstrate that our methods are superior to state-of-the-art methods on making DNNs robust against various appearance shifts. △ Less

Submitted 13 February, 2020; originally announced February 2020.

Comments: IEEE International Symposium on Biomedical Imaging (IEEE ISBI 2020)

arXiv:2002.04102 [pdf]

Validation and Optimization of Multi-Organ Segmentation on Clinical Imaging Archives

Authors: Yuchen Xu, Olivia Tang, Yucheng Tang, Ho Hin Lee, Yunqiang Chen, Dashan Gao, Shizhong Han, Riqiang Gao, Michael R. Savona, Richard G. Abramson, Yuankai Huo, Bennett A. Landman

Abstract: Segmentation of abdominal computed tomography(CT) provides spatial context, morphological properties, and a framework for tissue-specific radiomics to guide quantitative Radiological assessment. A 2015 MICCAI challenge spurred substantial innovation in multi-organ abdominal CT segmentation with both traditional and deep learning methods. Recent innovations in deep methods have driven performance t… ▽ More Segmentation of abdominal computed tomography(CT) provides spatial context, morphological properties, and a framework for tissue-specific radiomics to guide quantitative Radiological assessment. A 2015 MICCAI challenge spurred substantial innovation in multi-organ abdominal CT segmentation with both traditional and deep learning methods. Recent innovations in deep methods have driven performance toward levels for which clinical translation is appealing. However, continued cross-validation on open datasets presents the risk of indirect knowledge contamination and could result in circular reasoning. Moreover, 'real world' segmentations can be challenging due to the wide variability of abdomen physiology within patients. Herein, we perform two data retrievals to capture clinically acquired deidentified abdominal CT cohorts with respect to a recently published variation on 3D U-Net (baseline algorithm). First, we retrieved 2004 deidentified studies on 476 patients with diagnosis codes involving spleen abnormalities (cohort A). Second, we retrieved 4313 deidentified studies on 1754 patients without diagnosis codes involving spleen abnormalities (cohort B). We perform prospective evaluation of the existing algorithm on both cohorts, yielding 13% and 8% failure rate, respectively. Then, we identified 51 subjects in cohort A with segmentation failures and manually corrected the liver and gallbladder labels. We re-trained the model adding the manual labels, resulting in performance improvement of 9% and 6% failure rate for the A and B cohorts, respectively. In summary, the performance of the baseline on the prospective cohorts was similar to that on previously published datasets. Moreover, adding data from the first cohort substantively improved performance when evaluated on the second withheld validation cohort. △ Less

Submitted 10 February, 2020; originally announced February 2020.

Comments: SPIE2020 Medical Imaging

arXiv:1912.04487 [pdf, other]

Listen to Look: Action Recognition by Previewing Audio

Authors: Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani

Abstract: In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalit… ▽ More In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalities---a single frame and its accompanying audio---reducing short-term temporal redundancy for efficient clip-level recognition. Second, building on ImgAud2Vid, we further propose ImgAud-Skimming, an attention-based long short-term memory network that iteratively selects useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition. Extensive experiments on four action recognition datasets demonstrate that our method achieves the state-of-the-art in terms of both recognition accuracy and speed. △ Less

Submitted 28 March, 2020; v1 submitted 9 December, 2019; originally announced December 2019.

Comments: Appears in CVPR 2020; Project page: http://vision.cs.utexas.edu/projects/listen_to_look/

arXiv:1911.07372 [pdf, other]

Deep Learning for the Digital Pathologic Diagnosis of Cholangiocarcinoma and Hepatocellular Carcinoma: Evaluating the Impact of a Web-based Diagnostic Assistant

Authors: Bora Uyumazturk, Amirhossein Kiani, Pranav Rajpurkar, Alex Wang, Robyn L. Ball, Rebecca Gao, Yifan Yu, Erik Jones, Curtis P. Langlotz, Brock Martin, Gerald J. Berry, Michael G. Ozawa, Florette K. Hazard, Ryanne A. Brown, Simon B. Chen, Mona Wood, Libby S. Allard, Lourdes Ylagan, Andrew Y. Ng, Jeanne Shen

Abstract: While artificial intelligence (AI) algorithms continue to rival human performance on a variety of clinical tasks, the question of how best to incorporate these algorithms into clinical workflows remains relatively unexplored. We investigated how AI can affect pathologist performance on the task of differentiating between two subtypes of primary liver cancer, hepatocellular carcinoma (HCC) and chol… ▽ More While artificial intelligence (AI) algorithms continue to rival human performance on a variety of clinical tasks, the question of how best to incorporate these algorithms into clinical workflows remains relatively unexplored. We investigated how AI can affect pathologist performance on the task of differentiating between two subtypes of primary liver cancer, hepatocellular carcinoma (HCC) and cholangiocarcinoma (CC). We developed an AI diagnostic assistant using a deep learning model and evaluated its effect on the diagnostic performance of eleven pathologists with varying levels of expertise. Our deep learning model achieved an accuracy of 0.885 on an internal validation set of 26 slides and an accuracy of 0.842 on an independent test set of 80 slides. Despite having high accuracy on a hold out test set, the diagnostic assistant did not significantly improve performance across pathologists (p-value: 0.184, OR: 1.287 (95% CI 0.886, 1.871)). Model correctness was observed to significantly bias the pathologist decisions. When the model was correct, assistance significantly improved accuracy across all pathologist experience levels and for all case difficulty levels (p-value: < 0.001, OR: 4.289 (95% CI 2.360, 7.794)). When the model was incorrect, assistance significantly decreased accuracy across all 11 pathologists and for all case difficulty levels (p-value < 0.001, OR: 0.253 (95% CI 0.126, 0.507)). Our results highlight the challenges of translating AI models to the clinical setting, especially for difficult subspecialty tasks such as tumor classification. In particular, they suggest that incorrect model predictions could strongly bias an expert's diagnosis, an important factor to consider when designing medical AI-assistance systems. △ Less

Submitted 17 November, 2019; originally announced November 2019.

Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract

arXiv:1911.06395 [pdf]

Contrast Phase Classification with a Generative Adversarial Network

Authors: Yucheng Tang, Ho Hin Lee, Yuchen Xu, Olivia Tang, Yunqiang Chen, Dashan Gao, Shizhong Han, Riqiang Gao, Camilo Bermudez, Michael R. Savona, Richard G. Abramson, Yuankai Huo, Bennett A. Landman

Abstract: Dynamic contrast enhanced computed tomography (CT) is an imaging technique that provides critical information on the relationship of vascular structure and dynamics in the context of underlying anatomy. A key challenge for image processing with contrast enhanced CT is that phase discrepancies are latent in different tissues due to contrast protocols, vascular dynamics, and metabolism variance. Pre… ▽ More Dynamic contrast enhanced computed tomography (CT) is an imaging technique that provides critical information on the relationship of vascular structure and dynamics in the context of underlying anatomy. A key challenge for image processing with contrast enhanced CT is that phase discrepancies are latent in different tissues due to contrast protocols, vascular dynamics, and metabolism variance. Previous studies with deep learning frameworks have been proposed for classifying contrast enhancement with networks inspired by computer vision. Here, we revisit the challenge in the context of whole abdomen contrast enhanced CTs. To capture and compensate for the complex contrast changes, we propose a novel discriminator in the form of a multi-domain disentangled representation learning network. The goal of this network is to learn an intermediate representation that separates contrast enhancement from anatomy and enables classification of images with varying contrast time. Briefly, our unpaired contrast disentangling GAN(CD-GAN) Discriminator follows the ResNet architecture to classify a CT scan from different enhancement phases. To evaluate the approach, we trained the enhancement phase classifier on 21060 slices from two clinical cohorts of 230 subjects. Testing was performed on 9100 slices from 30 independent subjects who had been imaged with CT scans from all contrast phases. Performance was quantified in terms of the multi-class normalized confusion matrix. The proposed network significantly improved correspondence over baseline UNet, ResNet50 and StarGAN performance of accuracy scores 0.54. 0.55, 0.62 and 0.91, respectively. The proposed discriminator from the disentangled network presents a promising technique that may allow deeper modeling of dynamic imaging against patient specific anatomies. △ Less

Submitted 14 November, 2019; originally announced November 2019.

Comments: 8 pages, 4 figures

Journal ref: SPIE2020

arXiv:1911.05113 [pdf]

Semi-Supervised Multi-Organ Segmentation through Quality Assurance Supervision

Authors: Ho Hin Lee, Yucheng Tang, Olivia Tang, Yuchen Xu, Yunqiang Chen, Dashan Gao, Shizhong Han, Riqiang Gao, Michael R. Savona, Richard G. Abramson, Yuankai Huo, Bennett A. Landman

Abstract: Human in-the-loop quality assurance (QA) is typically performed after medical image segmentation to ensure that the systems are performing as intended, as well as identifying and excluding outliers. By performing QA on large-scale, previously unlabeled testing data, categorical QA scores can be generatedIn this paper, we propose a semi-supervised multi-organ segmentation deep neural network consis… ▽ More Human in-the-loop quality assurance (QA) is typically performed after medical image segmentation to ensure that the systems are performing as intended, as well as identifying and excluding outliers. By performing QA on large-scale, previously unlabeled testing data, categorical QA scores can be generatedIn this paper, we propose a semi-supervised multi-organ segmentation deep neural network consisting of a traditional segmentation model generator and a QA involved discriminator. A large-scale dataset of 2027 volumes are used to train the generator, whose 2-D montage images and segmentation mask with QA scores are used to train the discriminator. To generate the QA scores, the 2-D montage images were reviewed manually and coded 0 (success), 1 (errors consistent with published performance), and 2 (gross failure). Then, the ResNet-18 network was trained with 1623 montage images in equal distribution of all three code labels and achieved an accuracy 94% for classification predictions with 404 montage images withheld for the test cohort. To assess the performance of using the QA supervision, the discriminator was used as a loss function in a multi-organ segmentation pipeline. The inclusion of QA-loss function boosted performance on the unlabeled test dataset from 714 patients to 951 patients over the baseline model. Additionally, the number of failures decreased from 606 (29.90%) to 402 (19.83%). The contributions of the proposed method are threefold: We show that (1) the QA scores can be used as a loss function to perform semi-supervised learning for unlabeled data, (2) the well trained discriminator is learnt by QA score rather than traditional true/false, and (3) the performance of multi-organ segmentation on unlabeled datasets can be fine-tuned with more robust and higher accuracy than the original baseline method. △ Less

Submitted 12 November, 2019; originally announced November 2019.

Comments: 7 pages, 5 figures, Accepted by SPIE 2020: Medical Imaging

arXiv:1909.05321 [pdf]

Distanced LSTM: Time-Distanced Gates in Long Short-Term Memory Models for Lung Cancer Detection

Authors: Riqiang Gao, Yuankai Huo, Shunxing Bao, Yucheng Tang, Sanja L. Antic, Emily S. Epstein, Aneri B. Balar, Steve Deppen, Alexis B. Paulson, Kim L. Sandler, Pierre P. Massion, Bennett A. Landman

Abstract: The field of lung nodule detection and cancer prediction has been rapidly develo** with the support of large public data archives. Previous studies have largely focused on cross-sectional (single) CT data. Herein, we consider longitudinal data. The Long Short-Term Memory (LSTM) model addresses learning with regularly spaced time points (i.e., equal temporal intervals). However, clinical imaging… ▽ More The field of lung nodule detection and cancer prediction has been rapidly develo** with the support of large public data archives. Previous studies have largely focused on cross-sectional (single) CT data. Herein, we consider longitudinal data. The Long Short-Term Memory (LSTM) model addresses learning with regularly spaced time points (i.e., equal temporal intervals). However, clinical imaging follows patient needs with often heterogeneous, irregular acquisitions. To model both regular and irregular longitudinal samples, we generalize the LSTM model with the Distanced LSTM (DLSTM) for temporally varied acquisitions. The DLSTM includes a Temporal Emphasis Model (TEM) that enables learning across regularly and irregularly sampled intervals. Briefly, (1) the time intervals between longitudinal scans are modeled explicitly, (2) temporally adjustable forget and input gates are introduced for irregular temporal sampling; and (3) the latest longitudinal scan has an additional emphasis term. We evaluate the DLSTM framework in three datasets including simulated data, 1794 National Lung Screening Trial (NLST) scans, and 1420 clinically acquired data with heterogeneous and irregular temporal accession. The experiments on the first two datasets demonstrate that our method achieves competitive performance on both simulated and regularly sampled datasets (e.g. improve LSTM from 0.6785 to 0.7085 on F1 score in NLST). In external validation of clinically and irregularly acquired data, the benchmarks achieved 0.8350 (CNN feature) and 0.8380 (LSTM) on the area under the ROC curve (AUC) score, while the proposed DLSTM achieves 0.8905. △ Less

Submitted 11 September, 2019; originally announced September 2019.

Comments: This paper is accepted by MLMI (oral), MICCAI workshop

arXiv:1907.03994 [pdf, other]

doi 10.1145/3351279

FarSense: Pushing the Range Limit of WiFi-based Respiration Sensing with CSI Ratio of Two Antennas

Authors: Youwei Zeng, Dan Wu, Jie Xiong, Enze Yi, Ruiyang Gao, Daqing Zhang

Abstract: The past few years have witnessed the great potential of exploiting channel state information retrieved from commodity WiFi devices for respiration monitoring. However, existing approaches only work when the target is close to the WiFi transceivers and the performance degrades significantly when the target is far away. On the other hand, most home environments only have one WiFi access point and i… ▽ More The past few years have witnessed the great potential of exploiting channel state information retrieved from commodity WiFi devices for respiration monitoring. However, existing approaches only work when the target is close to the WiFi transceivers and the performance degrades significantly when the target is far away. On the other hand, most home environments only have one WiFi access point and it may not be located in the same room as the target. This sensing range constraint greatly limits the application of the proposed approaches in real life. This paper presents FarSense--the first real-time system that can reliably monitor human respiration when the target is far away from the WiFi transceiver pair. FarSense works well even when one of the transceivers is located in another room, moving a big step towards real-life deployment. We propose two novel schemes to achieve this goal: (1) Instead of applying the raw CSI readings of individual antenna for sensing, we employ the ratio of CSI readings from two antennas, whose noise is mostly canceled out by the division operation to significantly increase the sensing range; (2) The division operation further enables us to utilize the phase information which is not usable with one single antenna for sensing. The orthogonal amplitude and phase are elaborately combined to address the "blind spots" issue and further increase the sensing range. Extensive experiments show that FarSense is able to accurately monitor human respiration even when the target is 8 meters away from the transceiver pair, increasing the sensing range by more than 100%. We believe this is the first system to enable through-wall respiration sensing with commodity WiFi devices and the proposed method could also benefit other sensing applications. △ Less

Submitted 10 August, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

Comments: This work is a pre-print version to appear at UbiComp 2019

Showing 1–50 of 53 results for author: Gao, R