Search | arXiv e-print repository

Enhancing RSS-Based Visible Light Positioning by Optimal Calibrating the LED Tilt and Gain

Authors: Fan Wu, Nobby Stevens, Lieven De Strycker, François Rottenberg

Abstract: This paper presents an optimal calibration scheme and a weighted least squares (LS) localization algorithm for received signal strength (RSS) based visible light positioning (VLP) systems, focusing on the often overlooked impact of light emitting diode (LED) tilt. By optimally calibrating LED tilt and gain, we significantly enhance VLP localization accuracy. Our algorithm outperforms both machine… ▽ More This paper presents an optimal calibration scheme and a weighted least squares (LS) localization algorithm for received signal strength (RSS) based visible light positioning (VLP) systems, focusing on the often overlooked impact of light emitting diode (LED) tilt. By optimally calibrating LED tilt and gain, we significantly enhance VLP localization accuracy. Our algorithm outperforms both machine learning Gaussian processes (GPs) and traditional multilateration techniques. Against GPs, it achieves improvements of 58% and 74% in the 50th and 99th percentiles, respectively. When compared to multilateration, it reduces the 50th percentile error from 7.4 cm to 3.2 cm and the 99th percentile error from 25.7 cm to 11 cm. We introduce a low-complexity estimator for tilt and gain that meets the Cramer-Rao lower bound (CRLB) for the mean squared error (MSE), emphasizing its precision and efficiency. Further, we elaborate on optimal calibration measurement placement and refine the observation model to include residual calibration errors, thereby improving localization performance. The weighted LS algorithm's effectiveness is validated through simulations and real-world data, consistently outperforming GPs and multilateration, across various training set sizes and reducing outlier errors. Our findings underscore the critical role of LED tilt calibration in advancing VLP system accuracy and contribute to a more precise model for indoor positioning technologies. △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.03408 [pdf, other]

Comparative Efficacy of Commercial Wearables for Circadian Rhythm Home Monitoring from Activity, Heart Rate, and Core Body Temperature

Authors: Fan Wu, Patrick Langer, **joo Shim, Elgar Fleisch, Filipe Barata

Abstract: Circadian rhythms govern biological patterns that follow a 24-hour cycle. Dysfunctions in circadian rhythms can contribute to various health problems, such as sleep disorders. Current circadian rhythm assessment methods, often invasive or subjective, limit circadian rhythm monitoring to laboratories. Hence, this study aims to investigate scalable consumer-centric wearables for circadian rhythm mon… ▽ More Circadian rhythms govern biological patterns that follow a 24-hour cycle. Dysfunctions in circadian rhythms can contribute to various health problems, such as sleep disorders. Current circadian rhythm assessment methods, often invasive or subjective, limit circadian rhythm monitoring to laboratories. Hence, this study aims to investigate scalable consumer-centric wearables for circadian rhythm monitoring outside traditional laboratories. In a two-week longitudinal study conducted in real-world settings, 36 participants wore an Actigraph, a smartwatch, and a core body temperature sensor to collect activity, temperature, and heart rate data. We evaluated circadian rhythms calculated from commercial wearables by comparing them with circadian rhythm reference measures, i.e., Actigraph activities and chronotype questionnaire scores. The circadian rhythm metric acrophases, determined from commercial wearables using activity, heart rate, and temperature data, significantly correlated with the acrophase derived from Actigraph activities (r=0.96, r=0.87, r=0.79; all p<0.001) and chronotype questionnaire (r=-0.66, r=-0.73, r=-0.61; all p<0.001). The acrophases obtained concurrently from consumer sensors significantly predicted the chronotype (R2=0.64; p<0.001). Our study validates commercial sensors for circadian rhythm assessment, highlighting their potential to support maintaining healthy rhythms and provide scalable and timely health monitoring in real-life scenarios. △ Less

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2403.11694 [pdf, other]

Object Segmentation-Assisted Inter Prediction for Versatile Video Coding

Authors: Zhuoyuan Li, Zikun Yuan, Li Li, Dong Liu, Xiaohu Tang, Feng Wu

Abstract: In modern video coding standards, block-based inter prediction is widely adopted, which brings high compression efficiency. However, in natural videos, there are usually multiple moving objects of arbitrary shapes, resulting in complex motion fields that are difficult to compactly represent. This problem has been tackled by more flexible block partitioning methods in the Versatile Video Coding (VV… ▽ More In modern video coding standards, block-based inter prediction is widely adopted, which brings high compression efficiency. However, in natural videos, there are usually multiple moving objects of arbitrary shapes, resulting in complex motion fields that are difficult to compactly represent. This problem has been tackled by more flexible block partitioning methods in the Versatile Video Coding (VVC) standard, but the more flexible partitions require more overhead bits to signal and still cannot be made arbitrary shaped. To address this limitation, we propose an object segmentation-assisted inter prediction method (SAIP), where objects in the reference frames are segmented by some advanced technologies. With a proper indication, the object segmentation mask is translated from the reference frame to the current frame as the arbitrary-shaped partition of different regions without any extra signal. Using the segmentation mask, motion compensation is separately performed for different regions, achieving higher prediction accuracy. The segmentation mask is further used to code the motion vectors of different regions more efficiently. Moreover, segmentation mask is considered in the joint rate-distortion optimization for motion estimation and partition estimation to derive the motion vector of different regions and partition more accurately. The proposed method is implemented into the VVC reference software, VTM version 12.0. Experimental results show that the proposed method achieves up to 1.98%, 1.14%, 0.79%, and on average 0.82%, 0.49%, 0.37% BD-rate reduction for common test sequences, under the Low-delay P, Low-delay B, and Random Access configurations, respectively. △ Less

Submitted 18 March, 2024; originally announced March 2024.

Comments: 22 pages, 15 figures

arXiv:2402.17043 [pdf, other]

Traffic Control via Connected and Automated Vehicles: An Open-Road Field Experiment with 100 CAVs

Authors: Jonathan W. Lee, Han Wang, Kathy Jang, Amaury Hayat, Matthew Bunting, Arwa Alanqary, William Barbour, Zhe Fu, Xiaoqian Gong, George Gunter, Sharon Hornstein, Abdul Rahman Kreidieh, Nathan Lichtlé, Matthew W. Nice, William A. Richardson, Adit Shah, Eugene Vinitsky, Fangyu Wu, Shengquan Xiang, Sulaiman Almatrudi, Fahd Althukair, Rahul Bhadani, Joy Carpio, Raphael Chekroun, Eric Cheng , et al. (39 additional authors not shown)

Abstract: The CIRCLES project aims to reduce instabilities in traffic flow, which are naturally occurring phenomena due to human driving behavior. These "phantom jams" or "stop-and-go waves,"are a significant source of wasted energy. Toward this goal, the CIRCLES project designed a control system referred to as the MegaController by the CIRCLES team, that could be deployed in real traffic. Our field experim… ▽ More The CIRCLES project aims to reduce instabilities in traffic flow, which are naturally occurring phenomena due to human driving behavior. These "phantom jams" or "stop-and-go waves,"are a significant source of wasted energy. Toward this goal, the CIRCLES project designed a control system referred to as the MegaController by the CIRCLES team, that could be deployed in real traffic. Our field experiment leveraged a heterogeneous fleet of 100 longitudinally-controlled vehicles as Lagrangian traffic actuators, each of which ran a controller with the architecture described in this paper. The MegaController is a hierarchical control architecture, which consists of two main layers. The upper layer is called Speed Planner, and is a centralized optimal control algorithm. It assigns speed targets to the vehicles, conveyed through the LTE cellular network. The lower layer is a control layer, running on each vehicle. It performs local actuation by overriding the stock adaptive cruise controller, using the stock on-board sensors. The Speed Planner ingests live data feeds provided by third parties, as well as data from our own control vehicles, and uses both to perform the speed assignment. The architecture of the speed planner allows for modular use of standard control techniques, such as optimal control, model predictive control, kernel methods and others, including Deep RL, model predictive control and explicit controllers. Depending on the vehicle architecture, all onboard sensing data can be accessed by the local controllers, or only some. Control inputs vary across different automakers, with inputs ranging from torque or acceleration requests for some cars, and electronic selection of ACC set points in others. The proposed architecture allows for the combination of all possible settings proposed above. Most configurations were tested throughout the ramp up to the MegaVandertest. △ Less

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2401.08835 [pdf, other]

Improving ASR Contextual Biasing with Guided Attention

Authors: Jiyang Tang, Kwangyoun Kim, Suwon Shon, Felix Wu, Prashant Sridhar, Shinji Watanabe

Abstract: In this paper, we propose a Guided Attention (GA) auxiliary training loss, which improves the effectiveness and robustness of automatic speech recognition (ASR) contextual biasing without introducing additional parameters. A common challenge in previous literature is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases. To addres… ▽ More In this paper, we propose a Guided Attention (GA) auxiliary training loss, which improves the effectiveness and robustness of automatic speech recognition (ASR) contextual biasing without introducing additional parameters. A common challenge in previous literature is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases. To address this challenge, we employ a GA loss as an additional training objective besides the Transducer loss. The proposed GA loss aims to teach the cross attention how to align bias phrases with text tokens or audio frames. Compared to studies with similar motivations, the proposed loss operates directly on the cross attention weights and is easier to implement. Through extensive experiments based on Conformer Transducer with Contextual Adapter, we demonstrate that the proposed method not only leads to a lower WER but also retains its effectiveness as the number of bias phrases increases. Specifically, the GA loss decreases the WER of rare vocabularies by up to 19.2% on LibriSpeech compared to the contextual biasing baseline, and up to 49.3% compared to a vanilla Transducer. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted at ICASSP 2024

arXiv:2312.01499 [pdf, other]

Towards Decentralized Task Offloading and Resource Allocation in User-Centric Mobile Edge Computing

Authors: Langtian Qin, Hancheng Lu, Yuang Chen, Baolin Chong, Feng Wu

Abstract: In the traditional cellular-based mobile edge computing (MEC), users at the edge of the cell are prone to suffer severe inter-cell interference and signal attenuation, leading to low throughput even transmission interruptions. Such edge effect severely obstructs offloading of tasks to MEC servers. To address this issue, we propose user-centric mobile edge computing (UCMEC), a novel MEC architectur… ▽ More In the traditional cellular-based mobile edge computing (MEC), users at the edge of the cell are prone to suffer severe inter-cell interference and signal attenuation, leading to low throughput even transmission interruptions. Such edge effect severely obstructs offloading of tasks to MEC servers. To address this issue, we propose user-centric mobile edge computing (UCMEC), a novel MEC architecture integrating user-centric transmission, which can ensure high throughput and reliable communication for task offloading. Then, we formulate an optimization problem with joint consideration of task offloading, power control, and computing resource allocation in UCMEC, aiming at obtaining the optimal performance in terms of long-term average total delay. To solve the intractable problem, we propose two decentralized joint optimization schemes based on multi-agent deep reinforcement learning (MADRL) and convex optimization, which consider both cooperation and non-cooperation among network nodes. Simulation results demonstrate that the proposed schemes in UCMEC can significantly improve the uplink transmission rate by at most 343.56% and reduce the long-term average total delay by at most 45.57% compared to traditional cellular-based MEC. △ Less

Submitted 3 December, 2023; originally announced December 2023.

Comments: 16 pages, 13 figures

arXiv:2311.18073 [pdf, other]

DiffGEPCI: 3D MRI Synthesis from mGRE Signals using 2.5D Diffusion Model

Authors: Yuyang Hu, Satya V. V. N. Kothapalli, Weijie Gan, Alexander L. Sukstanskii, Gregory F. Wu, Manu Goyal, Dmitriy A. Yablonskiy, Ulugbek S. Kamilov

Abstract: We introduce a new framework called DiffGEPCI for cross-modality generation in magnetic resonance imaging (MRI) using a 2.5D conditional diffusion model. DiffGEPCI can synthesize high-quality Fluid Attenuated Inversion Recovery (FLAIR) and Magnetization Prepared-Rapid Gradient Echo (MPRAGE) images, without acquiring corresponding measurements, by leveraging multi-Gradient-Recalled Echo (mGRE) MRI… ▽ More We introduce a new framework called DiffGEPCI for cross-modality generation in magnetic resonance imaging (MRI) using a 2.5D conditional diffusion model. DiffGEPCI can synthesize high-quality Fluid Attenuated Inversion Recovery (FLAIR) and Magnetization Prepared-Rapid Gradient Echo (MPRAGE) images, without acquiring corresponding measurements, by leveraging multi-Gradient-Recalled Echo (mGRE) MRI signals as conditional inputs. DiffGEPCI operates in a two-step fashion: it initially estimates a 3D volume slice-by-slice using the axial plane and subsequently applies a refinement algorithm (referred to as 2.5D) to enhance the quality of the coronal and sagittal planes. Experimental validation on real mGRE data shows that DiffGEPCI achieves excellent performance, surpassing generative adversarial networks (GANs) and traditional diffusion models. △ Less

Submitted 18 April, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

arXiv:2309.02318 [pdf, other]

TiAVox: Time-aware Attenuation Voxels for Sparse-view 4D DSA Reconstruction

Authors: Zhenghong Zhou, Huangxuan Zhao, Jiemin Fang, Dongqiao Xiang, Lei Chen, Lingxia Wu, Feihong Wu, Wenyu Liu, Chuansheng Zheng, Xinggang Wang

Abstract: Four-dimensional Digital Subtraction Angiography (4D DSA) plays a critical role in the diagnosis of many medical diseases, such as Arteriovenous Malformations (AVM) and Arteriovenous Fistulas (AVF). Despite its significant application value, the reconstruction of 4D DSA demands numerous views to effectively model the intricate vessels and radiocontrast flow, thereby implying a significant radiatio… ▽ More Four-dimensional Digital Subtraction Angiography (4D DSA) plays a critical role in the diagnosis of many medical diseases, such as Arteriovenous Malformations (AVM) and Arteriovenous Fistulas (AVF). Despite its significant application value, the reconstruction of 4D DSA demands numerous views to effectively model the intricate vessels and radiocontrast flow, thereby implying a significant radiation dose. To address this high radiation issue, we propose a Time-aware Attenuation Voxel (TiAVox) approach for sparse-view 4D DSA reconstruction, which paves the way for high-quality 4D imaging. Additionally, 2D and 3D DSA imaging results can be generated from the reconstructed 4D DSA images. TiAVox introduces 4D attenuation voxel grids, which reflect attenuation properties from both spatial and temporal dimensions. It is optimized by minimizing discrepancies between the rendered images and sparse 2D DSA images. Without any neural network involved, TiAVox enjoys specific physical interpretability. The parameters of each learnable voxel represent the attenuation coefficients. We validated the TiAVox approach on both clinical and simulated datasets, achieving a 31.23 Peak Signal-to-Noise Ratio (PSNR) for novel view synthesis using only 30 views on the clinically sourced dataset, whereas traditional Feldkamp-Davis-Kress methods required 133 views. Similarly, with merely 10 views from the synthetic dataset, TiAVox yielded a PSNR of 34.32 for novel view synthesis and 41.40 for 3D reconstruction. We also executed ablation studies to corroborate the essential components of TiAVox. The code will be publically available. △ Less

Submitted 19 December, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

Comments: 10 pages, 8 figures

arXiv:2308.12203 [pdf, ps, other]

A Robust ADMM-Based Optimization Algorithm For Underwater Acoustic Channel Estimation

Authors: Tian Tian, Agastya Raj, Bruno Missi Xavier, Ying Zhang, Feiyun Wu, Kunde Yang

Abstract: Accurate estimation of the Underwater acoustic (UWA) is a key part of underwater communications, especially for coherent systems. The severe multipath effects and large delay spreads make the estimation problem large-scale. The non-stationary, non-Gaussian, and impulsive nature of ocean ambient noise poses further obstacles to the design of estimation algorithms. Under the framework of compressed… ▽ More Accurate estimation of the Underwater acoustic (UWA) is a key part of underwater communications, especially for coherent systems. The severe multipath effects and large delay spreads make the estimation problem large-scale. The non-stationary, non-Gaussian, and impulsive nature of ocean ambient noise poses further obstacles to the design of estimation algorithms. Under the framework of compressed sensing (CS), this work addresses the issue of robust channel estimation when measurements are contaminated by impulsive noise. A first-order algorithm based on alternating direction method of multipliers (ADMM) is proposed. Numerical simulations of time-varying channel estimation are performed to show its improved performance in highly impulsive noise environments. △ Less

Submitted 24 August, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

Comments: This paper has been accepted and presented at OCEANS 2023 - Limerick Conference. This version is a preprint

arXiv:2307.09775 [pdf, other]

DisCover: Disentangled Music Representation Learning for Cover Song Identification

Authors: Jiahao Xun, Shengyu Zhang, Yanting Yang, Jieming Zhu, Liqun Deng, Zhou Zhao, Zhenhua Dong, Ruiqi Li, Lichao Zhang, Fei Wu

Abstract: In the field of music information retrieval (MIR), cover song identification (CSI) is a challenging task that aims to identify cover versions of a query song from a massive collection. Existing works still suffer from high intra-song variances and inter-song correlations, due to the entangled nature of version-specific and version-invariant factors in their modeling. In this work, we set the goal… ▽ More In the field of music information retrieval (MIR), cover song identification (CSI) is a challenging task that aims to identify cover versions of a query song from a massive collection. Existing works still suffer from high intra-song variances and inter-song correlations, due to the entangled nature of version-specific and version-invariant factors in their modeling. In this work, we set the goal of disentangling version-specific and version-invariant factors, which could make it easier for the model to learn invariant music representations for unseen query songs. We analyze the CSI task in a disentanglement view with the causal graph technique, and identify the intra-version and inter-version effects biasing the invariant learning. To block these effects, we propose the disentangled music representation learning framework (DisCover) for CSI. DisCover consists of two critical components: (1) Knowledge-guided Disentanglement Module (KDM) and (2) Gradient-based Adversarial Disentanglement Module (GADM), which block intra-version and inter-version biased effects, respectively. KDM minimizes the mutual information between the learned representations and version-variant factors that are identified with prior domain knowledge. GADM identifies version-variant factors by simulating the representation transitions between intra-song versions, and exploits adversarial distillation for effect blocking. Extensive comparisons with best-performing methods and in-depth analysis demonstrate the effectiveness of DisCover and the and necessity of disentanglement for CSI. △ Less

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2307.01710 [pdf]

A Scalable Arrangement Method for Aperiodic Array Antennas to Reduce Peak Sidelobe Level

Authors: Jiao Zhang, Hongtao Zhang, Xuelei Chen, Fengquan Wu, Yufeng Liu, Wenmei Zhang

Abstract: Peak sidelobe level reduction (PSLR) is crucial in the application of large-scale array antenna, which directly determines the radiation performance of array antenna. We study the PSLR of subarray level aperiodic arrays and propose three array structures: dislocated subarrays with uniform elements (DSUE), uniform subarrays with random elements (USRE), dislocated subarrays with random elements (DSR… ▽ More Peak sidelobe level reduction (PSLR) is crucial in the application of large-scale array antenna, which directly determines the radiation performance of array antenna. We study the PSLR of subarray level aperiodic arrays and propose three array structures: dislocated subarrays with uniform elements (DSUE), uniform subarrays with random elements (USRE), dislocated subarrays with random elements (DSRE). To optimize the dislocation position of subarrays and random position of elements, the improved Bat algorithm (IBA) is applied. To draw the comparison of PSLR effect among these three array structures, we take three size of array antennas from small to large as examples to simulate and calculate the redundancy and peak sidelobe level (PSLL) of them. The results show that DSRE is the optimal array structure by analyzing the dislocation distance of subarray, scanning angle and applicable frequency. The proposed design method is a universal and scalable method, which is of great application value to the design of large-scale aperiodic array antenna. △ Less

Submitted 4 July, 2023; originally announced July 2023.

arXiv:2305.19507 [pdf, other]

Manifold Constraint Regularization for Remote Sensing Image Generation

Authors: Xingzhe Su, Changwen Zheng, Wenwen Qiang, Fengge Wu, Junsuo Zhao, Fuchun Sun, Hui Xiong

Abstract: Generative Adversarial Networks (GANs) have shown notable accomplishments in remote sensing domain. However, this paper reveals that their performance on remote sensing images falls short when compared to their impressive results with natural images. This study identifies a previously overlooked issue: GANs exhibit a heightened susceptibility to overfitting on remote sensing images.To address this… ▽ More Generative Adversarial Networks (GANs) have shown notable accomplishments in remote sensing domain. However, this paper reveals that their performance on remote sensing images falls short when compared to their impressive results with natural images. This study identifies a previously overlooked issue: GANs exhibit a heightened susceptibility to overfitting on remote sensing images.To address this challenge, this paper analyzes the characteristics of remote sensing images and proposes manifold constraint regularization, a novel approach that tackles overfitting of GANs on remote sensing images for the first time. Our method includes a new measure for evaluating the structure of the data manifold. Leveraging this measure, we propose the manifold constraint regularization term, which not only alleviates the overfitting problem, but also promotes alignment between the generated and real data manifolds, leading to enhanced quality in the generated images. The effectiveness and versatility of this method have been corroborated through extensive validation on various remote sensing datasets and GAN models. The proposed method not only enhances the quality of the generated images, reflected in a 3.13\% improvement in Frechet Inception Distance (FID) score, but also boosts the performance of the GANs on downstream tasks, evidenced by a 3.76\% increase in classification accuracy. △ Less

Submitted 28 March, 2024; v1 submitted 30 May, 2023; originally announced May 2023.

arXiv:2305.11073 [pdf, other]

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

Authors: Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe

Abstract: Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance in various tasks, including automatic speech recognition (ASR), speech translation (ST) and spoken language understanding (SLU). Recently, a new encoder called E-Branchformer has outperformed Conformer in the LibriSpeech ASR benchmark, making it… ▽ More Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance in various tasks, including automatic speech recognition (ASR), speech translation (ST) and spoken language understanding (SLU). Recently, a new encoder called E-Branchformer has outperformed Conformer in the LibriSpeech ASR benchmark, making it promising for more general speech applications. This work compares E-Branchformer and Conformer through extensive experiments using different types of end-to-end sequence-to-sequence models. Results demonstrate that E-Branchformer achieves comparable or better performance than Conformer in almost all evaluation sets across 15 ASR, 2 ST, and 3 SLU benchmarks, while being more stable during training. We will release our training configurations and pre-trained models for reproducibility, which can benefit the speech community. △ Less

Submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted at INTERSPEECH 2023. Code: https://github.com/espnet/espnet

arXiv:2305.03354 [pdf, other]

doi 10.1109/LRA.2023.3307011

Close-range Human Following Control on a Cane-type Robot with Multi-camera Fusion

Authors: Haowen Liu, Fengxian Wu, Bin Zhong, Yijun Zhao, Jiatong Zhang, Wenxin Niu, Mingming Zhang

Abstract: Cane-type robots have been utilized to assist and supervise the mobility-impaired population. One essential technique for cane-type robots is human following control, which allows the robot to follow the user. However, the limited perceptible information of humans by sensors at close range, combined with the occlusion caused by lower limb swing during normal walking, affect the localization of use… ▽ More Cane-type robots have been utilized to assist and supervise the mobility-impaired population. One essential technique for cane-type robots is human following control, which allows the robot to follow the user. However, the limited perceptible information of humans by sensors at close range, combined with the occlusion caused by lower limb swing during normal walking, affect the localization of users. These limitations make it difficult to achieve human following at close range.To address these challenges, this study developed a new cane-type wheeled robot and proposed a novel human-following control with multi-camera fusion. This control system mainly consists of two parts: 1) a human following controller that locates a user by multi-camera fusion and generates control signals to follow the user. 2) a cane robot controller designed to steer the cane robot to a target position. The proposed strategy's effectiveness has been validated in outdoor experiments with six healthy subjects. The experimental scenarios included different terrains (i.e., straight, turning, and inclined paths), road conditions (i.e., flat and rough roads), and walking speeds. The obtained results showed that the average tracking error for position and orientation was less than 5 cm and 15° respectively across all scenarios. Moreover, the cane robot can effectively adapt to a wide range of individual gait patterns and achieve stable human following at daily walking speeds (0.75 m/s - 1.45 m/s). △ Less

Submitted 27 September, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

Comments: This paper has been accepted for publication in IEEE Robotics and Automation Letters.( Volume: 8, Issue: 10, October 2023)

arXiv:2303.05240 [pdf, other]

Intriguing Property and Counterfactual Explanation of GAN for Remote Sensing Image Generation

Authors: Xingzhe Su, Wenwen Qiang, Jie Hu, Fengge Wu, Changwen Zheng, Fuchun Sun

Abstract: Generative adversarial networks (GANs) have achieved remarkable progress in the natural image field. However, when applying GANs in the remote sensing (RS) image generation task, an extraordinary phenomenon is observed: the GAN model is more sensitive to the size of training data for RS image generation than for natural image generation. In other words, the generation quality of RS images will cha… ▽ More Generative adversarial networks (GANs) have achieved remarkable progress in the natural image field. However, when applying GANs in the remote sensing (RS) image generation task, an extraordinary phenomenon is observed: the GAN model is more sensitive to the size of training data for RS image generation than for natural image generation. In other words, the generation quality of RS images will change significantly with the number of training categories or samples per category. In this paper, we first analyze this phenomenon from two kinds of toy experiments and conclude that the amount of feature information contained in the GAN model decreases with reduced training data. Then we establish a structural causal model (SCM) of the data generation process and interpret the generated data as the counterfactuals. Based on this SCM, we theoretically prove that the quality of generated images is positively correlated with the amount of feature information. This provides insights for enriching the feature information learned by the GAN model during training. Consequently, we propose two innovative adjustment schemes, namely Uniformity Regularization (UR) and Entropy Regularization (ER), to increase the information learned by the GAN model at the distributional and sample levels, respectively. We theoretically and empirically demonstrate the effectiveness and versatility of our methods. Extensive experiments on three RS datasets and two natural datasets show that our methods outperform the well-established models on RS image generation tasks. The source code is available at https://github.com/rootSue/Causal-RSGAN. △ Less

Submitted 14 May, 2024; v1 submitted 9 March, 2023; originally announced March 2023.

arXiv:2302.14132 [pdf, ps, other]

Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding

Authors: Yifan Peng, Kwangyoun Kim, Felix Wu, Prashant Sridhar, Shinji Watanabe

Abstract: Self-supervised speech representation learning (SSL) has shown to be effective in various downstream tasks, but SSL models are usually large and slow. Model compression techniques such as pruning aim to reduce the model size and computation without degradation in accuracy. Prior studies focus on the pruning of Transformers; however, speech models not only utilize a stack of Transformer blocks, but… ▽ More Self-supervised speech representation learning (SSL) has shown to be effective in various downstream tasks, but SSL models are usually large and slow. Model compression techniques such as pruning aim to reduce the model size and computation without degradation in accuracy. Prior studies focus on the pruning of Transformers; however, speech models not only utilize a stack of Transformer blocks, but also combine a frontend network based on multiple convolutional layers for low-level feature representation learning. This frontend has a small size but a heavy computational cost. In this work, we propose three task-specific structured pruning methods to deal with such heterogeneous networks. Experiments on LibriSpeech and SLURP show that the proposed method is more accurate than the original wav2vec2-base with 10% to 30% less computation, and is able to reduce the computation by 40% to 50% without any degradation. △ Less

Submitted 27 February, 2023; originally announced February 2023.

Comments: Accepted at ICASSP 2023

arXiv:2302.10558 [pdf, other]

Joint Optimization of Base Station Clustering and Service Caching in User-Centric MEC

Authors: Langtian Qin, Hancheng Lu, Yao Lu, Chenwu Zhang, Feng Wu

Abstract: Edge service caching can effectively reduce the delay or bandwidth overhead for acquiring and initializing applications. To address single-base station (BS) transmission limitation and serious edge effect in traditional cellular-based edge service caching networks, in this paper, we proposed a novel user-centric edge service caching framework where each user is jointly provided with edge caching a… ▽ More Edge service caching can effectively reduce the delay or bandwidth overhead for acquiring and initializing applications. To address single-base station (BS) transmission limitation and serious edge effect in traditional cellular-based edge service caching networks, in this paper, we proposed a novel user-centric edge service caching framework where each user is jointly provided with edge caching and wireless transmission services by a specific BS cluster instead of a single BS. To minimize the long-term average delay under the constraint of the caching cost, a mixed integer non-linear programming (MINLP) problem is formulated by jointly optimizing the BS clustering and service caching decisions. To tackle the problem, we propose JO-CDSD, an efficiently joint optimization algorithm based on Lyapunov optimization and generalized benders decomposition (GBD). In particular, the long-term optimization problem can be transformed into a primal problem and a master problem in each time slot that is much simpler to solve. The near-optimal clustering and caching strategy can be obtained through solving the primal and master problem alternately. Extensive simulations show that the proposed joint optimization algorithm outperforms other algorithms and can effectively reduce the long-term delay by at most $93.75% and caching cost by at most $53.12%. △ Less

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2302.10515 [pdf, other]

Energy-Efficient Blockchain-enabled User-Centric Mobile Edge Computing

Authors: Langtian Qin, Hancheng Lu, Yuang Chen, Zhuojia Gu, Dan Zhao, Feng Wu

Abstract: In the traditional mobile edge computing (MEC) system, the availability of MEC services is greatly limited for the edge users of the cell due to serious signal attenuation and inter-cell interference. User-centric MEC (UC-MEC) can be seen as a promising solution to address this issue. In UC-MEC, each user is served by a dedicated access point (AP) cluster enabled with MEC capability instead of a s… ▽ More In the traditional mobile edge computing (MEC) system, the availability of MEC services is greatly limited for the edge users of the cell due to serious signal attenuation and inter-cell interference. User-centric MEC (UC-MEC) can be seen as a promising solution to address this issue. In UC-MEC, each user is served by a dedicated access point (AP) cluster enabled with MEC capability instead of a single MEC server, however, at the expense of more energy consumption and greater privacy risks. To achieve efficient and reliable resource utilization with user-centric services, we propose an energy efficient blockchain-enabled UC-MEC system where blockchain operations and resource optimization are jointly performed. Firstly, we design a resource-aware, reliable, replicated, redundant, and fault-tolerant (R-RAFT) consensus mechanism to implement secure and reliable resource trading. Then, an optimization framework based on alternating direction method of multipliers (ADMM) is proposed to minimize the total energy consumed by wireless transmission, consensus and task computing, where APs clustering, computing resource allocation and bandwidth allocation are jointly considered. Simulation results show superiority of the proposed UC-MEC system over reference schemes, at most 33.96% reduction in the total delay and 48.77% reduction in the total energy consumption. △ Less

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2301.06043 [pdf, other]

Unsupervised Cardiac Segmentation Utilizing Synthesized Images from Anatomical Labels

Authors: Sihan Wang, Fu** Wu, Lei Li, Zheyao Gao, Byung-Woo Hong, Xiahai Zhuang

Abstract: Cardiac segmentation is in great demand for clinical practice. Due to the enormous labor of manual delineation, unsupervised segmentation is desired. The ill-posed optimization problem of this task is inherently challenging, requiring well-designed constraints. In this work, we propose an unsupervised framework for multi-class segmentation with both intensity and shape constraints. Firstly, we ext… ▽ More Cardiac segmentation is in great demand for clinical practice. Due to the enormous labor of manual delineation, unsupervised segmentation is desired. The ill-posed optimization problem of this task is inherently challenging, requiring well-designed constraints. In this work, we propose an unsupervised framework for multi-class segmentation with both intensity and shape constraints. Firstly, we extend a conventional non-convex energy function as an intensity constraint and implement it with U-Net. For shape constraint, synthetic images are generated from anatomical labels via image-to-image translation, as shape supervision for the segmentation network. Moreover, augmentation invariance is applied to facilitate the segmentation network to learn the latent features in terms of shape. We evaluated the proposed framework using the public datasets from MICCAI2019 MSCMR Challenge and achieved promising results on cardiac MRIs with Dice scores of 0.5737, 0.7796, and 0.6287 in Myo, LV, and RV, respectively. △ Less

Submitted 15 January, 2023; originally announced January 2023.

arXiv:2212.10525 [pdf, other]

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

Authors: Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

Abstract: Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce suc… ▽ More Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will be publishing for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparisons. In this work, we present the details of data collection and annotation and the performance of the baseline models. We also perform sensitivity analysis of pipeline models' performance (speech recognizer + text model) to the speech recognition accuracy, using more than 20 state-of-the-art speech recognition models. △ Less

Submitted 15 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: accepted in ACL 2023 (long paper)

arXiv:2212.08542 [pdf, other]

Context-aware Fine-tuning of Self-supervised Speech Models

Authors: Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu, Shinji Watanabe

Abstract: Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tu… ▽ More Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tuning. We attach a context module on top of the last layer of a pre-trained model to encode the whole segment into a context embedding vector which is then used as an additional feature for the final prediction. During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments. This allows the model to make predictions without access to these surrounding segments at inference time and requires only a tiny overhead compared to standard fine-tuned models. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: Automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). The results show that context-aware fine-tuning not only outperforms a standard fine-tuning baseline but also rivals a strong context injection baseline that uses neighboring speech segments during inference. △ Less

Submitted 28 March, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

arXiv:2210.03402 [pdf]

Research on Self-adaptive Online Vehicle Velocity Prediction Strategy Considering Traffic Information Fusion

Authors: Ziyan Zhang, Junhao Shen, Dongwei Yao, Feng Wu

Abstract: In order to increase the prediction accuracy of the online vehicle velocity prediction (VVP) strategy, a self-adaptive velocity prediction algorithm fused with traffic information was presented for the multiple scenarios. Initially, traffic scenarios were established inside the co-simulation environment. In addition, the algorithm of a general regressive neural network (GRNN) paired with datasets… ▽ More In order to increase the prediction accuracy of the online vehicle velocity prediction (VVP) strategy, a self-adaptive velocity prediction algorithm fused with traffic information was presented for the multiple scenarios. Initially, traffic scenarios were established inside the co-simulation environment. In addition, the algorithm of a general regressive neural network (GRNN) paired with datasets of the ego-vehicle, the front vehicle, and traffic lights was used in traffic scenarios, which increasingly improved the prediction accuracy. To ameliorate the robustness of the algorithm, then the strategy was optimized by particle swarm optimization (PSO) and k-fold cross-validation to find the optimal parameters of the neural network in real-time, which constructed a self-adaptive online PSO-GRNN VVP strategy with multi-information fusion to adapt with different operating situations. The self-adaptive online PSO-GRNN VVP strategy was then deployed to a variety of simulated scenarios to test its efficacy under various operating situations. Finally, the simulation results reveal that in urban and highway scenarios, the prediction accuracy is separately increased by 27.8% and 54.5% when compared to the traditional GRNN VVP strategy with fixed parameters utilizing only the historical ego-vehicle velocity dataset. △ Less

Submitted 7 October, 2022; originally announced October 2022.

Comments: 9 pages, 7 figures

arXiv:2210.00077 [pdf, other]

E-Branchformer: Branchformer with Enhanced merging for speech recognition

Authors: Kwangyoun Kim, Felix Wu, Yifan Peng, **g Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

Abstract: Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Bra… ▽ More Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches of convolution and self-attention and merging local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) 1.81% and 3.65% on LibriSpeech test-clean and test-other sets without using any external training data. △ Less

Submitted 14 October, 2022; v1 submitted 30 September, 2022; originally announced October 2022.

Comments: Accepted to SLT 2022

arXiv:2209.07030 [pdf, other]

doi 10.1145/3503161.3548068

Model-Guided Multi-Contrast Deep Unfolding Network for MRI Super-resolution Reconstruction

Authors: Gang Yang, Li Zhang, Man Zhou, Ai** Liu, Xun Chen, Zhiwei Xiong, Feng Wu

Abstract: Magnetic resonance imaging (MRI) with high resolution (HR) provides more detailed information for accurate diagnosis and quantitative image analysis. Despite the significant advances, most existing super-resolution (SR) reconstruction network for medical images has two flaws: 1) All of them are designed in a black-box principle, thus lacking sufficient interpretability and further limiting their p… ▽ More Magnetic resonance imaging (MRI) with high resolution (HR) provides more detailed information for accurate diagnosis and quantitative image analysis. Despite the significant advances, most existing super-resolution (SR) reconstruction network for medical images has two flaws: 1) All of them are designed in a black-box principle, thus lacking sufficient interpretability and further limiting their practical applications. Interpretable neural network models are of significant interest since they enhance the trustworthiness required in clinical practice when dealing with medical images. 2) most existing SR reconstruction approaches only use a single contrast or use a simple multi-contrast fusion mechanism, neglecting the complex relationships between different contrasts that are critical for SR improvement. To deal with these issues, in this paper, a novel Model-Guided interpretable Deep Unfolding Network (MGDUN) for medical image SR reconstruction is proposed. The Model-Guided image SR reconstruction approach solves manually designed objective functions to reconstruct HR MRI. We show how to unfold an iterative MGDUN algorithm into a novel model-guided deep unfolding network by taking the MRI observation matrix and explicit multi-contrast relationship matrix into account during the end-to-end optimization. Extensive experiments on the multi-contrast IXI dataset and BraTs 2019 dataset demonstrate the superiority of our proposed model. △ Less

Submitted 14 September, 2022; originally announced September 2022.

Comments: Accepted to ACMMM 2022, 9 pages

arXiv:2207.05565 [pdf, other]

Towards Hybrid-Optimization Video Coding

Authors: Shuai Huo, Dong Liu, Li Li, Siwei Ma, Feng Wu, Wen Gao

Abstract: Video coding is a mathematical optimization problem of rate and distortion essentially. To solve this complex optimization problem, two popular video coding frameworks have been developed: block-based hybrid video coding and end-to-end learned video coding. If we rethink video coding from the perspective of optimization, we find that the existing two frameworks represent two directions of optimiza… ▽ More Video coding is a mathematical optimization problem of rate and distortion essentially. To solve this complex optimization problem, two popular video coding frameworks have been developed: block-based hybrid video coding and end-to-end learned video coding. If we rethink video coding from the perspective of optimization, we find that the existing two frameworks represent two directions of optimization solutions. Block-based hybrid coding represents the discrete optimization solution because those irrelevant coding modes are discrete in mathematics. It searches for the best one among multiple starting points (i.e. modes). However, the search is not efficient enough. On the other hand, end-to-end learned coding represents the continuous optimization solution because the gradient descent is based on a continuous function. It optimizes a group of model parameters efficiently by the numerical algorithm. However, limited by only one starting point, it is easy to fall into the local optimum. To better solve the optimization problem, we propose to regard video coding as a hybrid of the discrete and continuous optimization problem, and use both search and numerical algorithm to solve it. Our idea is to provide multiple discrete starting points in the global space and optimize the local optimum around each point by numerical algorithm efficiently. Finally, we search for the global optimum among those local optimums. Guided by the hybrid optimization idea, we design a hybrid optimization video coding framework, which is built on continuous deep networks entirely and also contains some discrete modes. We conduct a comprehensive set of experiments. Compared to the continuous optimization framework, our method outperforms pure learned video coding methods. Meanwhile, compared to the discrete optimization framework, our method achieves comparable performance to HEVC reference software HM16.10 in PSNR. △ Less

Submitted 12 July, 2022; originally announced July 2022.

arXiv:2206.05284 [pdf, other]

Decoupling Predictions in Distributed Learning for Multi-Center Left Atrial MRI Segmentation

Authors: Zheyao Gao, Lei Li, Fu** Wu, Sihan Wang, Xiahai Zhuang

Abstract: Distributed learning has shown great potential in medical image analysis. It allows to use multi-center training data with privacy protection. However, data distributions in local centers can vary from each other due to different imaging vendors, and annotation protocols. Such variation degrades the performance of learning-based methods. To mitigate the influence, two groups of methods have been p… ▽ More Distributed learning has shown great potential in medical image analysis. It allows to use multi-center training data with privacy protection. However, data distributions in local centers can vary from each other due to different imaging vendors, and annotation protocols. Such variation degrades the performance of learning-based methods. To mitigate the influence, two groups of methods have been proposed for different aims, i.e., the global methods and the personalized methods. The former are aimed to improve the performance of a single global model for all test data from unseen centers (known as generic data); while the latter target multiple models for each center (denoted as local data). However, little has been researched to achieve both goals simultaneously. In this work, we propose a new framework of distributed learning that bridges the gap between two groups, and improves the performance for both generic and local data. Specifically, our method decouples the predictions for generic data and local data, via distribution-conditioned adaptation matrices. Results on multi-center left atrial (LA) MRI segmentation showed that our method demonstrated superior performance over existing methods on both generic and local data. Our code is available at https://github.com/key1589745/decouple_predict △ Less

Submitted 10 June, 2022; originally announced June 2022.

Comments: Accepted by MICCAI 2022

arXiv:2205.14833 [pdf, other]

Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning

Authors: Chengfei Lv, Chaoyue Niu, Renjie Gu, Xiaotang Jiang, Zhaode Wang, Bin Liu, Ziqi Wu, Qiulin Yao, Congyu Huang, Panos Huang, Tao Huang, Hui Shu, **de Song, Bin Zou, Peng Lan, Guohuan Xu, Fei Wu, Shaojie Tang, Fan Wu, Guihai Chen

Abstract: To break the bottlenecks of mainstream cloud-based machine learning (ML) paradigm, we adopt device-cloud collaborative ML and build the first end-to-end and general-purpose system, called Walle, as the foundation. Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a c… ▽ More To break the bottlenecks of mainstream cloud-based machine learning (ML) paradigm, we adopt device-cloud collaborative ML and build the first end-to-end and general-purpose system, called Walle, as the foundation. Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment, while facilitating daily task iteration. Specifically, the compute container is based on Mobile Neural Network (MNN), a tensor compute engine along with the data processing and model execution libraries, which are exposed through a refined Python thread-level virtual machine (VM) to support diverse ML tasks and concurrent task execution. The core of MNN is the novel mechanisms of operator decomposition and semi-auto search, sharply reducing the workload in manually optimizing hundreds of operators for tens of hardware backends and further quickly identifying the best backend with runtime optimization for a computation graph. The data pipeline introduces an on-device stream processing framework to enable processing user behavior data at source. The deployment platform releases ML tasks with an efficient push-then-pull method and supports multi-granularity deployment policies. We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability. Extensive micro-benchmarks also highlight the superior performance of MNN and the Python thread-level VM. Walle has been in large-scale production use in Alibaba, while MNN has been open source with a broad impact in the community. △ Less

Submitted 29 May, 2022; originally announced May 2022.

Comments: Accepted by OSDI 2022

arXiv:2205.10781 [pdf, other]

A Hierarchical MPC Approach to Car-Following via Linearly Constrained Quadratic Programming

Authors: Fangyu Wu, Alexandre Bayen

Abstract: Single-lane car-following is a fundamental task in autonomous driving. A desirable car-following controller should keep a reasonable range of distances to the preceding vehicle and do so as smoothly as possible. To achieve this, numerous control methods have been proposed: some only rely on local sensing; others also make use of non-local downstream observations. While local methods are capable of… ▽ More Single-lane car-following is a fundamental task in autonomous driving. A desirable car-following controller should keep a reasonable range of distances to the preceding vehicle and do so as smoothly as possible. To achieve this, numerous control methods have been proposed: some only rely on local sensing; others also make use of non-local downstream observations. While local methods are capable of attenuating high-frequency velocity oscillation and are economical to compute, non-local methods can dampen a wider spectrum of oscillatory traffic but incur a larger cost in computing. In this article, we design a novel non-local tri-layer MPC controller that is capable of smoothing a wide range of oscillatory traffic and is amenable to real-time applications. At the core of the controller design are 1) an accessible prediction method based on ETA estimation and 2) a robust, light-weight optimization procedure, designed specifically for handling various headway constraints. Numerical simulations suggest that the proposed controller can simultaneously maintain a variable headway while driving with modest acceleration and is robust to imperfect traffic predictions. △ Less

Submitted 20 August, 2022; v1 submitted 22 May, 2022; originally announced May 2022.

Comments: 6 pages, 7 figures

arXiv:2205.01086 [pdf, other]

Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

Authors: Felix Wu, Kwangyoun Kim, Shinji Watanabe, Kyu Han, Ryan McDonald, Kilian Q. Weinberger, Yoav Artzi

Abstract: We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training… ▽ More We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods. △ Less

Submitted 2 May, 2022; originally announced May 2022.

Comments: Code available at https://github.com/asappresearch/wav2seq

arXiv:2204.11792 [pdf, other]

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

Authors: Zhenhui Ye, Zhou Zhao, Yi Ren, Fei Wu

Abstract: The recent progress in non-autoregressive text-to-speech (NAR-TTS) has made fast and high-quality speech synthesis possible. However, current NAR-TTS models usually use phoneme sequence as input and thus cannot understand the tree-structured syntactic information of the input sequence, which hurts the prosody modeling. To this end, we propose SyntaSpeech, a syntax-aware and light-weight NAR-TTS mo… ▽ More The recent progress in non-autoregressive text-to-speech (NAR-TTS) has made fast and high-quality speech synthesis possible. However, current NAR-TTS models usually use phoneme sequence as input and thus cannot understand the tree-structured syntactic information of the input sequence, which hurts the prosody modeling. To this end, we propose SyntaSpeech, a syntax-aware and light-weight NAR-TTS model, which integrates tree-structured syntactic information into the prosody modeling modules in PortaSpeech \cite{ren2021portaspeech}. Specifically, 1) We build a syntactic graph based on the dependency tree of the input sentence, then process the text encoding with a syntactic graph encoder to extract the syntactic information. 2) We incorporate the extracted syntactic encoding with PortaSpeech to improve the prosody prediction. 3) We introduce a multi-length discriminator to replace the flow-based post-net in PortaSpeech, which simplifies the training pipeline and improves the inference speed, while kee** the naturalness of the generated audio. Experiments on three datasets not only show that the tree-structured syntactic information grants SyntaSpeech the ability to synthesize better audio with expressive prosody, but also demonstrate the generalization ability of SyntaSpeech to adapt to multiple languages and multi-speaker text-to-speech. Ablation studies demonstrate the necessity of each component in SyntaSpeech. Source code and audio samples are available at https://syntaspeech.github.io △ Less

Submitted 25 April, 2022; originally announced April 2022.

Comments: Accepted by IJCAI-2022. 12 pages

arXiv:2204.10704 [pdf, other]

SUES-200: A Multi-height Multi-scene Cross-view Image Benchmark Across Drone and Satellite

Authors: Runzhe Zhu, Ling Yin, Mingze Yang, Fei Wu, Yuncheng Yang, Wenbo Hu

Abstract: Cross-view image matching aims to match images of the same target scene acquired from different platforms. With the rapid development of drone technology, cross-view matching by neural network models has been a widely accepted choice for drone position or navigation. However, existing public datasets do not include images obtained by drones at different heights, and the types of scenes are relativ… ▽ More Cross-view image matching aims to match images of the same target scene acquired from different platforms. With the rapid development of drone technology, cross-view matching by neural network models has been a widely accepted choice for drone position or navigation. However, existing public datasets do not include images obtained by drones at different heights, and the types of scenes are relatively homogeneous, which yields issues in assessing a model's capability to adapt to complex and changing scenes. In this end, we present a new cross-view dataset called SUES-200 to address these issues. SUES-200 contains 24120 images acquired by the drone at four different heights and corresponding satellite view images of the same target scene. To the best of our knowledge, SUES-200 is the first public dataset that considers the differences generated in aerial photography captured by drones flying at different heights. In addition, we developed an evaluation for efficient training, testing and evaluation of cross-view matching models, under which we comprehensively analyze the performance of nine architectures. Then, we propose a robust baseline model for use with SUES-200. Experimental results show that SUES-200 can help the model to learn highly discriminative features of the height of the drone. △ Less

Submitted 21 January, 2023; v1 submitted 22 April, 2022; originally announced April 2022.

arXiv:2201.03186 [pdf, other]

MyoPS: A Benchmark of Myocardial Pathology Segmentation Combining Three-Sequence Cardiac Magnetic Resonance Images

Authors: Lei Li, Fu** Wu, Sihan Wang, Xinzhe Luo, Carlos Martin-Isla, Shuwei Zhai, Jianpeng Zhang, Yanfei Liu7, Zhen Zhang, Markus J. Ankenbrand, Haochuan Jiang, Xiaoran Zhang, Linhong Wang, Tewodros Weldebirhan Arega, Elif Altunok, Zhou Zhao, Feiyan Li, Jun Ma, ** Yang, Elodie Puybareau, Ilkay Oksuz, Stephanie Bricq, Weisheng Li, Kumaradevan Punithakumar, Sotirios A. Tsaftaris , et al. (7 additional authors not shown)

Abstract: Assessment of myocardial viability is essential in diagnosis and treatment management of patients suffering from myocardial infarction, and classification of pathology on myocardium is the key to this assessment. This work defines a new task of medical image analysis, i.e., to perform myocardial pathology segmentation (MyoPS) combining three-sequence cardiac magnetic resonance (CMR) images, which… ▽ More Assessment of myocardial viability is essential in diagnosis and treatment management of patients suffering from myocardial infarction, and classification of pathology on myocardium is the key to this assessment. This work defines a new task of medical image analysis, i.e., to perform myocardial pathology segmentation (MyoPS) combining three-sequence cardiac magnetic resonance (CMR) images, which was first proposed in the MyoPS challenge, in conjunction with MICCAI 2020. The challenge provided 45 paired and pre-aligned CMR images, allowing algorithms to combine the complementary information from the three CMR sequences for pathology segmentation. In this article, we provide details of the challenge, survey the works from fifteen participants and interpret their methods according to five aspects, i.e., preprocessing, data augmentation, learning strategy, model architecture and post-processing. In addition, we analyze the results with respect to different factors, in order to examine the key obstacles and explore potential of solutions, as well as to provide a benchmark for future research. We conclude that while promising results have been reported, the research is still in the early stage, and more in-depth exploration is needed before a successful application to the clinics. Note that MyoPS data and evaluation tool continue to be publicly available upon registration via its homepage (www.sdspeople.fudan.edu.cn/zhuangxiahai/0/myops20/). △ Less

Submitted 10 January, 2022; originally announced January 2022.

arXiv:2201.00100 [pdf, other]

doi 10.1109/TIP.2021.3139232

Boosting RGB-D Saliency Detection by Leveraging Unlabeled RGB Images

Authors: Xiaoqiang Wang, Lei Zhu, Siliang Tang, Huazhu Fu, ** Li, Fei Wu, Yi Yang, Yueting Zhuang

Abstract: Training deep models for RGB-D salient object detection (SOD) often requires a large number of labeled RGB-D images. However, RGB-D data is not easily acquired, which limits the development of RGB-D SOD techniques. To alleviate this issue, we present a Dual-Semi RGB-D Salient Object Detection Network (DS-Net) to leverage unlabeled RGB images for boosting RGB-D saliency detection. We first devise a… ▽ More Training deep models for RGB-D salient object detection (SOD) often requires a large number of labeled RGB-D images. However, RGB-D data is not easily acquired, which limits the development of RGB-D SOD techniques. To alleviate this issue, we present a Dual-Semi RGB-D Salient Object Detection Network (DS-Net) to leverage unlabeled RGB images for boosting RGB-D saliency detection. We first devise a depth decoupling convolutional neural network (DDCNN), which contains a depth estimation branch and a saliency detection branch. The depth estimation branch is trained with RGB-D images and then used to estimate the pseudo depth maps for all unlabeled RGB images to form the paired data. The saliency detection branch is used to fuse the RGB feature and depth feature to predict the RGB-D saliency. Then, the whole DDCNN is assigned as the backbone in a teacher-student framework for semi-supervised learning. Moreover, we also introduce a consistency loss on the intermediate attention and saliency maps for the unlabeled data, as well as a supervised depth and saliency loss for labeled data. Experimental results on seven widely-used benchmark datasets demonstrate that our DDCNN outperforms state-of-the-art methods both quantitatively and qualitatively. We also demonstrate that our semi-supervised DS-Net can further improve the performance, even when using an RGB image with the pseudo depth map. △ Less

Submitted 31 December, 2021; originally announced January 2022.

Comments: Accepted by IEEE TIP

arXiv:2112.08655 [pdf, other]

Feature Distillation Interaction Weighting Network for Lightweight Image Super-Resolution

Authors: Guangwei Gao, Wenjie Li, Juncheng Li, Fei Wu, Huimin Lu, Yi Yu

Abstract: Convolutional neural networks based single-image super-resolution (SISR) has made great progress in recent years. However, it is difficult to apply these methods to real-world scenarios due to the computational and memory cost. Meanwhile, how to take full advantage of the intermediate features under the constraints of limited parameters and calculations is also a huge challenge. To alleviate these… ▽ More Convolutional neural networks based single-image super-resolution (SISR) has made great progress in recent years. However, it is difficult to apply these methods to real-world scenarios due to the computational and memory cost. Meanwhile, how to take full advantage of the intermediate features under the constraints of limited parameters and calculations is also a huge challenge. To alleviate these issues, we propose a lightweight yet efficient Feature Distillation Interaction Weighted Network (FDIWN). Specifically, FDIWN utilizes a series of specially designed Feature Shuffle Weighted Groups (FSWG) as the backbone, and several novel mutual Wide-residual Distillation Interaction Blocks (WDIB) form an FSWG. In addition, Wide Identical Residual Weighting (WIRW) units and Wide Convolutional Residual Weighting (WCRW) units are introduced into WDIB for better feature distillation. Moreover, a Wide-Residual Distillation Connection (WRDC) framework and a Self-Calibration Fusion (SCF) unit are proposed to interact features with different scales more flexibly and efficiently.Extensive experiments show that our FDIWN is superior to other models to strike a good balance between model performance and efficiency. The code is available at https://github.com/IVIPLab/FDIWN. △ Less

Submitted 11 April, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: 9 pages, 9 figures, 4 tables, AAAI2022

arXiv:2112.07648 [pdf, other]

On the Use of External Data for Spoken Named Entity Recognition

Authors: Ankita Pasad, Felix Wu, Suwon Shon, Karen Livescu, Kyu J. Han

Abstract: Spoken language understanding (SLU) tasks involve map** from speech audio signals to semantic labels. Given the complexity of such tasks, good performance might be expected to require large labeled datasets, which are difficult to collect for each new task and domain. However, recent advances in self-supervised speech representations have made it feasible to consider learning SLU models with lim… ▽ More Spoken language understanding (SLU) tasks involve map** from speech audio signals to semantic labels. Given the complexity of such tasks, good performance might be expected to require large labeled datasets, which are difficult to collect for each new task and domain. However, recent advances in self-supervised speech representations have made it feasible to consider learning SLU models with limited labeled data. In this work we focus on low-resource spoken named entity recognition (NER) and address the question: Beyond self-supervised pre-training, how can we use external speech and/or text data that are not annotated for the task? We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline (speech recognition followed by text NER model) approaches. We find that several of these approaches improve performance in resource-constrained settings beyond the benefits from pre-trained representations alone. Compared to prior work, we find improved F1 scores of up to 16%. While the best baseline model is a pipeline approach, the best performance when using external data is ultimately achieved by an end-to-end model. We provide detailed comparisons and analyses, showing for example that end-to-end models are able to focus on the more NER-specific words. △ Less

Submitted 8 July, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

Comments: Accepted at NAACL 2022. Codebase available at https://github.com/asappresearch/spoken-ner

arXiv:2112.07238 [pdf, other]

Composing MPC with LQR and Neural Network for Amortized Efficiency and Stable Control

Authors: Fangyu Wu, Guanhua Wang, Siyuan Zhuang, Kehan Wang, Alexander Keimer, Ion Stoica, Alexandre Bayen

Abstract: Model predictive control (MPC) is a powerful control method that handles dynamical systems with constraints. However, solving MPC iteratively in real time, i.e., implicit MPC, remains a computational challenge. To address this, common solutions include explicit MPC and function approximation. Both methods, whenever applicable, may improve the computational efficiency of the implicit MPC by several… ▽ More Model predictive control (MPC) is a powerful control method that handles dynamical systems with constraints. However, solving MPC iteratively in real time, i.e., implicit MPC, remains a computational challenge. To address this, common solutions include explicit MPC and function approximation. Both methods, whenever applicable, may improve the computational efficiency of the implicit MPC by several orders of magnitude. Nevertheless, explicit MPC often requires expensive pre-computation and does not easily apply to higher-dimensional problems. Meanwhile, function approximation, although scales better with dimension, still requires pre-training on a large dataset and generally cannot guarantee to find an accurate surrogate policy, the failure of which often leads to closed-loop instability. To address these issues, we propose a triple-mode hybrid control scheme, named Memory-Augmented MPC, by combining a linear quadratic regulator, a neural network, and an MPC. From its standard form, we further derive two variants of such hybrid control scheme: one customized for chaotic systems and the other for slow systems. The proposed scheme does not require pre-computation and can improve the amortized running time of the composed MPC with a well-trained neural network. In addition, the scheme maintains closed-loop stability with any neural networks of proper input and output dimensions, alleviating the need for certifying optimality of the neural network in safety-critical applications. △ Less

Submitted 2 August, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

Comments: 13 pages, 10 figures, 2 tables

arXiv:2111.10367 [pdf, other]

SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech

Authors: Suwon Shon, Ankita Pasad, Felix Wu, Pablo Brusco, Yoav Artzi, Karen Livescu, Kyu J. Han

Abstract: Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks. At the same time, rece… ▽ More Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks. At the same time, recent work shows the possibility of pre-training generic representations and then fine-tuning for several tasks using relatively little labeled data. We propose to create a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE) consisting of limited-size labeled training sets and corresponding evaluation sets. This resource would allow the research community to track progress, evaluate pre-trained representations for higher-level tasks, and study open questions such as the utility of pipeline versus end-to-end approaches. We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets. We focus on naturally produced (not read or synthesized) speech, and freely available datasets. We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models. △ Less

Submitted 29 July, 2022; v1 submitted 19 November, 2021; originally announced November 2021.

Comments: Updated preprint for SLUE Benchmark v0.2; Toolkit link https://github.com/asappresearch/slue-toolkit

arXiv:2111.04736 [pdf, other]

Multi-Modality Cardiac Image Analysis with Deep Learning

Authors: Lei Li, Fu** Wu, Sihang Wang, Xiahai Zhuang

Abstract: Accurate cardiac computing, analysis and modeling from multi-modality images are important for the diagnosis and treatment of cardiac disease. Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is a promising technique to visualize and quantify myocardial infarction (MI) and atrial scars. Automating quantification of MI and atrial scars can be challenging due to the low image quality… ▽ More Accurate cardiac computing, analysis and modeling from multi-modality images are important for the diagnosis and treatment of cardiac disease. Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is a promising technique to visualize and quantify myocardial infarction (MI) and atrial scars. Automating quantification of MI and atrial scars can be challenging due to the low image quality and complex enhancement patterns of LGE MRI. Moreover, compared with the other sequences LGE MRIs with gold standard labels are particularly limited, which represents another obstacle for develo** novel algorithms for automatic segmentation and quantification of LGE MRIs. This chapter aims to summarize the state-of-the-art and our recent advanced contributions on deep learning based multi-modality cardiac image analysis. Firstly, we introduce two benchmark works for multi-sequence cardiac MRI based myocardial and pathology segmentation. Secondly, two novel frameworks for left atrial scar segmentation and quantification from LGE MRI were presented. Thirdly, we present three unsupervised domain adaptation techniques for cross-modality cardiac image segmentation. △ Less

Submitted 8 November, 2021; originally announced November 2021.

Comments: Under review as a chapter of book 'Deep Learning for Medical Image Analysis, 2E'

arXiv:2109.14837 [pdf, other]

End-to-End Image Compression with Probabilistic Decoding

Authors: Haichuan Ma, Dong Liu, Cunhui Dong, Li Li, Feng Wu

Abstract: Lossy image compression is a many-to-one process, thus one bitstream corresponds to multiple possible original images, especially at low bit rates. However, this nature was seldom considered in previous studies on image compression, which usually chose one possible image as reconstruction, e.g. the one with the maximal a posteriori probability. We propose a learned image compression framework to n… ▽ More Lossy image compression is a many-to-one process, thus one bitstream corresponds to multiple possible original images, especially at low bit rates. However, this nature was seldom considered in previous studies on image compression, which usually chose one possible image as reconstruction, e.g. the one with the maximal a posteriori probability. We propose a learned image compression framework to natively support probabilistic decoding. The compressed bitstream is decoded into a series of parameters that instantiate a pre-chosen distribution; then the distribution is used by the decoder to sample and reconstruct images. The decoder may adopt different sampling strategies and produce diverse reconstructions, among which some have higher signal fidelity and some others have better visual quality. The proposed framework is dependent on a revertible neural network-based transform to convert pixels into coefficients that obey the pre-chosen distribution as much as possible. Our code and models will be made publicly available. △ Less

Submitted 30 September, 2021; originally announced September 2021.

arXiv:2109.06870 [pdf, other]

Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Authors: Felix Wu, Kwangyoun Kim, **g Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi

Abstract: This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improveme… ▽ More This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes. △ Less

Submitted 14 September, 2021; originally announced September 2021.

Comments: Code available at https://github.com/asappresearch/sew

arXiv:2106.09760 [pdf, other]

Multi-mode Transformer Transducer with Stochastic Future Context

Authors: Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

Abstract: Automatic speech recognition (ASR) models make fewer errors when more surrounding speech information is presented as context. Unfortunately, acquiring a larger future context leads to higher latency. There exists an inevitable trade-off between speed and accuracy. Naively, to fit different latency requirements, people have to store multiple models and pick the best one under the constraints. Inste… ▽ More Automatic speech recognition (ASR) models make fewer errors when more surrounding speech information is presented as context. Unfortunately, acquiring a larger future context leads to higher latency. There exists an inevitable trade-off between speed and accuracy. Naively, to fit different latency requirements, people have to store multiple models and pick the best one under the constraints. Instead, a more desirable approach is to have a single model that can dynamically adjust its latency based on different constraints, which we refer to as Multi-mode ASR. A Multi-mode ASR model can fulfill various latency requirements during inference -- when a larger latency becomes acceptable, the model can process longer future context to achieve higher accuracy and when a latency budget is not flexible, the model can be less dependent on future context but still achieve reliable accuracy. In pursuit of Multi-mode ASR, we propose Stochastic Future Context, a simple training procedure that samples one streaming configuration in each iteration. Through extensive experiments on AISHELL-1 and LibriSpeech datasets, we show that a Multi-mode ASR model rivals, if not surpasses, a set of competitive streaming baselines trained with different latency budgets. △ Less

Submitted 17 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021

arXiv:2106.08752 [pdf, other]

Unsupervised Domain Adaptation with Variational Approximation for Cardiac Segmentation

Authors: Fu** Wu, Xiahai Zhuang

Abstract: Unsupervised domain adaptation is useful in medical image segmentation. Particularly, when ground truths of the target images are not available, domain adaptation can train a target-specific model by utilizing the existing labeled images from other modalities. Most of the reported works mapped images of both the source and target domains into a common latent feature space, and then reduced their d… ▽ More Unsupervised domain adaptation is useful in medical image segmentation. Particularly, when ground truths of the target images are not available, domain adaptation can train a target-specific model by utilizing the existing labeled images from other modalities. Most of the reported works mapped images of both the source and target domains into a common latent feature space, and then reduced their discrepancy either implicitly with adversarial training or explicitly by directly minimizing a discrepancy metric. In this work, we propose a new framework, where the latent features of both domains are driven towards a common and parameterized variational form, whose conditional distribution given the image is Gaussian. This is achieved by two networks based on variational auto-encoders (VAEs) and a regularization for this variational approximation. Both of the VAEs, each for one domain, contain a segmentation module, where the source segmentation is trained in a supervised manner, while the target one is trained unsupervisedly. We validated the proposed domain adaptation method using two cardiac segmentation tasks, i.e., the cross-modality (CT and MR) whole heart segmentation and the cross-sequence cardiac MR segmentation. Results show that the proposed method achieved better accuracies compared to two state-of-the-art approaches and demonstrated good potential for cardiac segmentation. Furthermore, the proposed explicit regularization was shown to be effective and efficient in narrowing down the distribution gap between domains, which is useful for unsupervised domain adaptation. Our code and data has been released via https://zmiclab.github.io/projects.html. △ Less

Submitted 16 June, 2021; originally announced June 2021.

Comments: accepted by IEEE Transactions on Medical Imaging

arXiv:2105.03678 [pdf, other]

Nearly Minimax-Optimal Rates for Noisy Sparse Phase Retrieval via Early-Stopped Mirror Descent

Authors: Fan Wu, Patrick Rebeschini

Abstract: This paper studies early-stopped mirror descent applied to noisy sparse phase retrieval, which is the problem of recovering a $k$-sparse signal $\mathbf{x}^\star\in\mathbb{R}^n$ from a set of quadratic Gaussian measurements corrupted by sub-exponential noise. We consider the (non-convex) unregularized empirical risk minimization problem and show that early-stopped mirror descent, when equipped wit… ▽ More This paper studies early-stopped mirror descent applied to noisy sparse phase retrieval, which is the problem of recovering a $k$-sparse signal $\mathbf{x}^\star\in\mathbb{R}^n$ from a set of quadratic Gaussian measurements corrupted by sub-exponential noise. We consider the (non-convex) unregularized empirical risk minimization problem and show that early-stopped mirror descent, when equipped with the hyperbolic entropy mirror map and proper initialization, achieves a nearly minimax-optimal rate of convergence, provided the sample size is at least of order $k^2$ (modulo logarithmic term) and the minimum (in modulus) non-zero entry of the signal is on the order of $\|\mathbf{x}^\star\|_2/\sqrt{k}$. Our theory leads to a simple algorithm that does not rely on explicit regularization or thresholding steps to promote sparsity. More generally, our results establish a connection between mirror descent and sparsity in the non-convex problem of noisy sparse phase retrieval, adding to the literature on early stop** that has mostly focused on non-sparse, Euclidean, and convex settings via gradient descent. Our proof combines a potential-based analysis of mirror descent with a quantitative control on a variational coherence property that we establish along the path of mirror descent, up to a prescribed stop** time. △ Less

Submitted 8 May, 2021; originally announced May 2021.

Comments: arXiv admin note: text overlap with arXiv:2010.10168

arXiv:2103.13676 [pdf, other]

doi 10.1109/TMM.2023.3240880

JDSR-GAN: Constructing An Efficient Joint Learning Network for Masked Face Super-Resolution

Authors: Guangwei Gao, Lei Tang, Fei Wu, Huimin Lu, Jian Yang

Abstract: With the growing importance of preventing the COVID-19 virus, face images obtained in most video surveillance scenarios are low resolution with mask simultaneously. However, most of the previous face super-resolution solutions can not handle both tasks in one model. In this work, we treat the mask occlusion as image noise and construct a joint and collaborative learning network, called JDSR-GAN, f… ▽ More With the growing importance of preventing the COVID-19 virus, face images obtained in most video surveillance scenarios are low resolution with mask simultaneously. However, most of the previous face super-resolution solutions can not handle both tasks in one model. In this work, we treat the mask occlusion as image noise and construct a joint and collaborative learning network, called JDSR-GAN, for the masked face super-resolution task. Given a low-quality face image with the mask as input, the role of the generator composed of a denoising module and super-resolution module is to acquire a high-quality high-resolution face image. The discriminator utilizes some carefully designed loss functions to ensure the quality of the recovered face images. Moreover, we incorporate the identity information and attention mechanism into our network for feasible correlated feature expression and informative feature learning. By jointly performing denoising and face super-resolution, the two tasks can complement each other and attain promising performance. Extensive qualitative and quantitative results show the superiority of our proposed JDSR-GAN over some comparable methods which perform the previous two tasks separately. △ Less

Submitted 29 January, 2023; v1 submitted 25 March, 2021; originally announced March 2021.

Comments: IEEE Transactions on Multimedia, 8 pages, 7 figures

arXiv:2011.08001 [pdf, other]

doi 10.1016/j.media.2021.102138

Deep-LIBRA: Artificial intelligence method for robust quantification of breast density with independent validation in breast cancer risk assessment

Authors: Omid Haji Maghsoudi, Aimilia Gastounioti, Christopher Scott, Lauren Pantalone, Fang-Fang Wu, Eric A. Cohen, Stacey Winham, Emily F. Conant, Celine Vachon, Despina Kontos

Abstract: Breast density is an important risk factor for breast cancer that also affects the specificity and sensitivity of screening mammography. Current federal legislation mandates reporting of breast density for all women undergoing breast screening. Clinically, breast density is assessed visually using the American College of Radiology Breast Imaging Reporting And Data System (BI-RADS) scale. Here, we… ▽ More Breast density is an important risk factor for breast cancer that also affects the specificity and sensitivity of screening mammography. Current federal legislation mandates reporting of breast density for all women undergoing breast screening. Clinically, breast density is assessed visually using the American College of Radiology Breast Imaging Reporting And Data System (BI-RADS) scale. Here, we introduce an artificial intelligence (AI) method to estimate breast percentage density (PD) from digital mammograms. Our method leverages deep learning (DL) using two convolutional neural network architectures to accurately segment the breast area. A machine-learning algorithm combining superpixel generation, texture feature analysis, and support vector machine is then applied to differentiate dense from non-dense tissue regions, from which PD is estimated. Our method has been trained and validated on a multi-ethnic, multi-institutional dataset of 15,661 images (4,437 women), and then tested on an independent dataset of 6,368 digital mammograms (1,702 women; cases=414) for both PD estimation and discrimination of breast cancer. On the independent dataset, PD estimates from Deep-LIBRA and an expert reader were strongly correlated (Spearman correlation coefficient = 0.90). Moreover, Deep-LIBRA yielded a higher breast cancer discrimination performance (area under the ROC curve, AUC = 0.611 [95% confidence interval (CI): 0.583, 0.639]) compared to four other widely-used research and commercial PD assessment methods (AUCs = 0.528 to 0.588). Our results suggest a strong agreement of PD estimates between Deep-LIBRA and gold-standard assessment by an expert reader, as well as improved performance in breast cancer risk assessment over state-of-the-art open-source and commercial methods. △ Less

Submitted 18 October, 2021; v1 submitted 13 November, 2020; originally announced November 2020.

arXiv:2008.12205 [pdf, other]

Random Style Transfer based Domain Generalization Networks Integrating Shape and Spatial Information

Authors: Lei Li, Veronika A. Zimmer, Wangbin Ding, Fu** Wu, Liqin Huang, Julia A. Schnabel, Xiahai Zhuang

Abstract: Deep learning (DL)-based models have demonstrated good performance in medical image segmentation. However, the models trained on a known dataset often fail when performed on an unseen dataset collected from different centers, vendors and disease populations. In this work, we present a random style transfer network to tackle the domain generalization problem for multi-vendor and center cardiac imag… ▽ More Deep learning (DL)-based models have demonstrated good performance in medical image segmentation. However, the models trained on a known dataset often fail when performed on an unseen dataset collected from different centers, vendors and disease populations. In this work, we present a random style transfer network to tackle the domain generalization problem for multi-vendor and center cardiac image segmentation. Style transfer is used to generate training data with a wider distribution/ heterogeneity, namely domain augmentation. As the target domain could be unknown, we randomly generate a modality vector for the target modality in the style transfer stage, to simulate the domain shift for unknown domains. The model can be trained in a semi-supervised manner by simultaneously optimizing a supervised segmentation and an unsupervised style translation objective. Besides, the framework incorporates the spatial information and shape prior of the target by introducing two regularization terms. We evaluated the proposed framework on 40 subjects from the M\&Ms challenge2020, and obtained promising performance in the segmentation for data from unknown vendors and centers. △ Less

Submitted 3 September, 2020; v1 submitted 27 August, 2020; originally announced August 2020.

Comments: 11 pages

arXiv:2007.14177 [pdf]

Generative networks as inverse problems with fractional wavelet scattering networks

Authors: Jiasong Wu, **g Zhang, Fuzhi Wu, Youyong Kong, Guanyu Yang, Lotfi Senhadji, Huazhong Shu

Abstract: Deep learning is a hot research topic in the field of machine learning methods and applications. Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) provide impressive image generations from Gaussian white noise, but both of them are difficult to train since they need to train the generator (or encoder) and the discriminator (or decoder) simultaneously, which is easy to cau… ▽ More Deep learning is a hot research topic in the field of machine learning methods and applications. Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) provide impressive image generations from Gaussian white noise, but both of them are difficult to train since they need to train the generator (or encoder) and the discriminator (or decoder) simultaneously, which is easy to cause unstable training. In order to solve or alleviate the synchronous training difficult problems of GANs and VAEs, recently, researchers propose Generative Scattering Networks (GSNs), which use wavelet scattering networks (ScatNets) as the encoder to obtain the features (or ScatNet embeddings) and convolutional neural networks (CNNs) as the decoder to generate the image. The advantage of GSNs is the parameters of ScatNets are not needed to learn, and the disadvantage of GSNs is that the expression ability of ScatNets is slightly weaker than CNNs and the dimensional reduction method of Principal Component Analysis (PCA) is easy to lead overfitting in the training of GSNs, and therefore affect the generated quality in the testing process. In order to further improve the quality of generated images while keep the advantages of GSNs, this paper proposes Generative Fractional Scattering Networks (GFRSNs), which use more expressive fractional wavelet scattering networks (FrScatNets) instead of ScatNets as the encoder to obtain the features (or FrScatNet embeddings) and use the similar CNNs of GSNs as the decoder to generate the image. Additionally, this paper develops a new dimensional reduction method named Feature-Map Fusion (FMF) instead of PCA for better kee** the information of FrScatNets and the effect of image fusion on the quality of image generation is also discussed. △ Less

Submitted 28 July, 2020; originally announced July 2020.

Comments: 27 pages, 13 figures, 6 tables

arXiv:2006.02666 [pdf]

Deep Sequential Feature Learning in Clinical Image Classification of Infectious Keratitis

Authors: Yesheng Xu, Ming Kong, Wenjia Xie, Run** Duan, Zhengqing Fang, Yuxiao Lin, Qiang Zhu, Siliang Tang, Fei Wu, Yu-Feng Yao

Abstract: Infectious keratitis is the most common entities of corneal diseases, in which pathogen grows in the cornea leading to inflammation and destruction of the corneal tissues. Infectious keratitis is a medical emergency, for which a rapid and accurate diagnosis is needed for speedy initiation of prompt and precise treatment to halt the disease progress and to limit the extent of corneal damage; otherw… ▽ More Infectious keratitis is the most common entities of corneal diseases, in which pathogen grows in the cornea leading to inflammation and destruction of the corneal tissues. Infectious keratitis is a medical emergency, for which a rapid and accurate diagnosis is needed for speedy initiation of prompt and precise treatment to halt the disease progress and to limit the extent of corneal damage; otherwise it may develop sight-threatening and even eye-globe-threatening condition. In this paper, we propose a sequential-level deep learning model to effectively discriminate the distinction and subtlety of infectious corneal disease via the classification of clinical images. In this approach, we devise an appropriate mechanism to preserve the spatial structures of clinical images and disentangle the informative features for clinical image classification of infectious keratitis. In competition with 421 ophthalmologists, the performance of the proposed sequential-level deep model achieved 80.00% diagnostic accuracy, far better than the 49.27% diagnostic accuracy achieved by ophthalmologists over 120 test images. △ Less

Submitted 4 June, 2020; originally announced June 2020.

Comments: Accepted by Engineering

arXiv:2006.01065 [pdf, other]

Hadamard Wirtinger Flow for Sparse Phase Retrieval

Authors: Fan Wu, Patrick Rebeschini

Abstract: We consider the problem of reconstructing an $n$-dimensional $k$-sparse signal from a set of noiseless magnitude-only measurements. Formulating the problem as an unregularized empirical risk minimization task, we study the sample complexity performance of gradient descent with Hadamard parametrization, which we call Hadamard Wirtinger flow (HWF). Provided knowledge of the signal sparsity $k$, we p… ▽ More We consider the problem of reconstructing an $n$-dimensional $k$-sparse signal from a set of noiseless magnitude-only measurements. Formulating the problem as an unregularized empirical risk minimization task, we study the sample complexity performance of gradient descent with Hadamard parametrization, which we call Hadamard Wirtinger flow (HWF). Provided knowledge of the signal sparsity $k$, we prove that a single step of HWF is able to recover the support from $k(x^*_{max})^{-2}$ (modulo logarithmic term) samples, where $x^*_{max}$ is the largest component of the signal in magnitude. This support recovery procedure can be used to initialize existing reconstruction methods and yields algorithms with total runtime proportional to the cost of reading the data and improved sample complexity, which is linear in $k$ when the signal contains at least one large component. We numerically investigate the performance of HWF at convergence and show that, while not requiring any explicit form of regularization nor knowledge of $k$, HWF adapts to the signal sparsity and reconstructs sparse signals with fewer measurements than existing gradient based methods. △ Less

Submitted 24 February, 2021; v1 submitted 1 June, 2020; originally announced June 2020.

arXiv:2005.12848 [pdf, other]

Group-In: Group Inference from Wireless Traces of Mobile Devices

Authors: Gürkan Solmaz, Jonathan Fürst, Samet Aytaç, Fang-**g Wu

Abstract: This paper proposes Group-In, a wireless scanning system to detect static or mobile people groups in indoor or outdoor environments. Group-In collects only wireless traces from the Bluetooth-enabled mobile devices for group inference. The key problem addressed in this work is to detect not only static groups but also moving groups with a multi-phased approach based only noisy wireless Received Sig… ▽ More This paper proposes Group-In, a wireless scanning system to detect static or mobile people groups in indoor or outdoor environments. Group-In collects only wireless traces from the Bluetooth-enabled mobile devices for group inference. The key problem addressed in this work is to detect not only static groups but also moving groups with a multi-phased approach based only noisy wireless Received Signal Strength Indicator (RSSIs) observed by multiple wireless scanners without localization support. We propose new centralized and decentralized schemes to process the sparse and noisy wireless data, and leverage graph-based clustering techniques for group detection from short-term and long-term aspects. Group-In provides two outcomes: 1) group detection in short time intervals such as two minutes and 2) long-term linkages such as a month. To verify the performance, we conduct two experimental studies. One consists of 27 controlled scenarios in the lab environments. The other is a real-world scenario where we place Bluetooth scanners in an office environment, and employees carry beacons for more than one month. Both the controlled and real-world experiments result in high accuracy group detection in short time intervals and sampling liberties in terms of the Jaccard index and pairwise similarity coefficient. △ Less

Submitted 10 June, 2020; v1 submitted 26 May, 2020; originally announced May 2020.

Comments: This work has been funded by the EU Horizon 2020 Programme under Grant Agreements No. 731993 AUTOPILOT and No.871249 LOCUS projects. The content of this paper does not reflect the official opinion of the EU. Responsibility for the information and views expressed therein lies entirely with the authors. Proc. of ACM/IEEE IPSN'20, 2020

Showing 1–50 of 76 results for author: Wu, F