-
Learning Granularity-Aware Affordances from Human-Object Interaction for Tool-Based Functional Grasping in Dexterous Robotics
Authors:
Fan Yang,
Wenrui Chen,
Kailun Yang,
Haoran Lin,
DongSheng Luo,
Conghui Tang,
Zhiyong Li,
Yaonan Wang
Abstract:
To enable robots to use tools, the initial step is teaching robots to employ dexterous gestures for touching specific areas precisely where tasks are performed. Affordance features of objects serve as a bridge in the functional interaction between agents and objects. However, leveraging these affordance cues to help robots achieve functional tool grasping remains unresolved. To address this, we propose a granularity-aware affordance feature extraction method for locating functional affordance areas and predicting dexterous coarse gestures. We study the intrinsic mechanisms of human tool use. On one hand, we use fine-grained affordance features of object-functional finger contact areas to locate functional affordance regions. On the other hand, we use highly activated coarse-grained affordance features in hand-object interaction regions to predict grasp gestures. Additionally, we introduce a model-based post-processing module that includes functional finger coordinate localization, finger-to-end coordinate transformation, and force feedback-based coarse-to-fine grasping. This forms a complete dexterous robotic functional grasping framework, GAAF-Dex, which learns Granularity-Aware Affordances from human-object interaction for tool-based Functional grasping in Dexterous Robotics. Unlike fully-supervised methods that require extensive data annotation, we employ a weakly supervised approach to extract relevant cues from exocentric (Exo) images of hand-object interactions to supervise feature extraction in egocentric (Ego) images. We have constructed a small-scale dataset, FAH, which includes nearly 6K Exo and Ego images of functional hand-object interaction, covering 18 commonly used tools performing 6 tasks. Extensive experiments on the dataset demonstrate that our method outperforms state-of-the-art methods. The code will be made publicly available at https://github.com/yangfan293/GAAF-DEX.
Submitted 30 June, 2024;
originally announced July 2024.
-
A microwave photonic prototype for concurrent radar detection and spectrum sensing over an 8 to 40 GHz bandwidth
Authors:
Taixia Shi,
Dingding Liang,
Lu Wang,
Lin Li,
Shaogang Guo,
Jiawei Gao,
Xiaowei Li,
Chulun Lin,
Lei Shi,
Baogang Ding,
Shiyang Liu,
Fangyi Yang,
Chi Jiang,
Yang Chen
Abstract:
In this work, a microwave photonic prototype for concurrent radar detection and spectrum sensing is proposed, designed, built, and investigated. A direct digital synthesizer and an analog electronic circuit are integrated to generate an intermediate frequency (IF) linearly frequency-modulated (LFM) signal with a tunable center frequency from 2.5 to 9.5 GHz and an instantaneous bandwidth of 1 GHz. The IF LFM signal is converted to the optical domain via an intensity modulator and then filtered by a fiber Bragg grating (FBG) to generate only two 2nd-order optical LFM sidebands. In radar detection, the two optical LFM sidebands beat with each other to generate a frequency-and-bandwidth-quadrupled LFM signal, which is used for ranging, radial velocity measurement, and imaging. By changing the center frequency of the IF LFM signal, the radar function can be operated within 8 to 40 GHz. In spectrum sensing, one 2nd-order optical LFM sideband is selected by another FBG, which then works in conjunction with the stimulated Brillouin scattering gain spectrum to map the frequency of the signal under test to time with an instantaneous measurement bandwidth of 2 GHz. By using a frequency shift module to adjust the pump frequency, the frequency measurement range can be adjusted from 0 to 40 GHz. The prototype is comprehensively studied and tested. It achieves a range resolution of 3.75 cm, a range error of less than $\pm$ 2 cm, a radial velocity error within $\pm$ 1 cm/s, clear imaging of multiple small targets, a frequency measurement error of less than $\pm$ 7 MHz, and a frequency resolution of better than 20 MHz.
Submitted 20 June, 2024;
originally announced June 2024.
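The figures quoted in the radar abstract above can be cross-checked with a few lines of arithmetic. The sketch below assumes the textbook LFM range-resolution relation c / (2B) and an ideal frequency-and-bandwidth quadrupling; the function names are hypothetical and are not taken from the paper.

```python
C = 299_792_458.0  # speed of light in m/s


def quadrupled_band(if_center_ghz, if_bandwidth_ghz=1.0, factor=4):
    """Return (low, high) band edges in GHz after ideal frequency multiplication.

    Assumes both the center frequency and the bandwidth scale by `factor`.
    """
    center = factor * if_center_ghz
    bandwidth = factor * if_bandwidth_ghz
    return center - bandwidth / 2, center + bandwidth / 2


def range_resolution_cm(bandwidth_ghz):
    """Textbook LFM range resolution c / (2B), in centimetres."""
    return C / (2 * bandwidth_ghz * 1e9) * 100


# Lowest and highest IF center frequencies stated in the abstract.
low_edge, _ = quadrupled_band(2.5)
_, high_edge = quadrupled_band(9.5)

# A 1 GHz IF bandwidth quadruples to 4 GHz.
resolution = range_resolution_cm(4.0)
```

Under these assumptions the band edges come out at 8 and 40 GHz and the resolution at roughly 3.75 cm, which agrees with the operating range and range resolution stated in the abstract.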
-
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
Authors:
Yuepeng Jiang,
Tao Li,
Fengyu Yang,
Lei Xie,
Meng Meng,
Yujun Wang
Abstract:
Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Besides, given that prosody contains both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the baseline but also exhibits better naturalness and expressiveness.
Submitted 11 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
Optical IRS for Visible Light Communication: Modeling, Design, and Open Issues
Authors:
Shiyuan Sun,
Fang Yang,
Weidong Mei,
Jian Song,
Zhu Han,
Rui Zhang
Abstract:
Optical intelligent reflecting surface (OIRS) offers a new and effective approach to resolving the line-of-sight blockage issue in visible light communication (VLC) by enabling redirection of light to bypass obstacles, thereby dramatically enhancing indoor VLC coverage and reliability. This article provides a comprehensive overview of OIRS for VLC, including channel modeling, design techniques, and open issues. First, we present the characteristics of OIRS-reflected channels and introduce two practical models, namely, optics model and association model, which are then compared in terms of applicable conditions, configuration methods, and channel parameters. Next, under the more practically appealing association model, we discuss the main design techniques for OIRS-aided VLC systems, including beam alignment, channel estimation, and OIRS reflection optimization. Finally, open issues are identified to stimulate future research in this area.
Submitted 29 May, 2024;
originally announced May 2024.
-
Target Localization with Macro and Micro Base Stations Cooperative Sensing
Authors:
Haotian Liu,
Zhiqing Wei,
Furong Yang,
Huici Wu,
Kaifeng Han,
Zhiyong Feng
Abstract:
Addressing the communication and sensing demands of the sixth-generation (6G) mobile communication system, integrated sensing and communication (ISAC) has garnered traction in academia and industry. Given the sensing limitations of a single base station (BS), multi-BS cooperative sensing is regarded as a promising solution. The coexistence and overlapped coverage of macro BS (MBS) and micro BS (MiBS) are common in the development of 6G, making cooperative sensing between MBS and MiBS feasible. Since MBS and MiBS work in low and high frequency bands, respectively, the challenge of MBS and MiBS cooperative sensing lies in fusing the sensing information from the high and low frequency bands. To this end, this paper introduces a symbol-level fusion method and a grid-based three-dimensional discrete Fourier transform (3D-GDFT) algorithm to achieve precise localization of multiple targets with limited resources. Simulation results demonstrate that the proposed MBS and MiBS cooperative sensing scheme outperforms the traditional single-BS (MBS/MiBS) sensing scheme, showcasing superior sensing performance.
Submitted 5 May, 2024;
originally announced May 2024.
-
WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights
Authors:
Youngdong Jang,
Dong In Lee,
MinHyuk Jang,
Jong Wook Kim,
Feng Yang,
Sangpil Kim
Abstract:
The advances in the Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains, but protecting their copyrights has not yet been researched in depth. Recently, NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However, existing methods are designed to apply only to implicit or explicit NeRF representations. In this work, we introduce an innovative watermarking method that can be employed in both representations of NeRF. This is achieved by fine-tuning NeRF to embed binary messages in the rendering process. In detail, we propose utilizing the discrete wavelet transform in the NeRF space for watermarking. Furthermore, we adopt a deferred back-propagation technique and introduce a combination with the patch-wise loss to improve rendering quality and bit accuracy with minimum trade-offs. We evaluate our method in three different aspects: capacity, invisibility, and robustness of the embedded watermarks in the 2D-rendered images. Our method achieves state-of-the-art performance with faster training speed over the compared state-of-the-art methods.
Submitted 27 May, 2024; v1 submitted 3 May, 2024;
originally announced May 2024.
-
A Hypergraph Approach to Distributed Broadcast
Authors:
Qi Cao,
Yulin Shao,
Fan Yang
Abstract:
This paper explores the distributed broadcast problem within the context of network communications, a critical challenge in decentralized information dissemination. We put forth a novel hypergraph-based approach to address this issue, focusing on minimizing the number of broadcasts to ensure comprehensive data sharing among all network users. A key contribution of our work is the establishment of a general lower bound for the problem using the min-cut capacity of hypergraphs. Additionally, we present the distributed broadcast for quasi-trees (DBQT) algorithm tailored for the unique structure of quasi-trees, which is proven to be optimal. This paper advances both network communication strategies and hypergraph theory, with implications for a wide range of real-world applications, from vehicular and sensor networks to distributed storage systems.
Submitted 25 April, 2024;
originally announced April 2024.
-
Channel Estimation for Optical Intelligent Reflecting Surface-Assisted VLC System: A Joint Space-Time Sampling Approach
Authors:
Shiyuan Sun,
Fang Yang,
Weidong Mei,
Jian Song,
Zhu Han,
Rui Zhang
Abstract:
Optical intelligent reflecting surface (OIRS) has attracted increasing attention due to its capability of overcoming signal blockages in visible light communication (VLC), an emerging technology for the next-generation advanced transceivers. However, current works on OIRS predominantly assume known channel state information (CSI), which is essential to practical OIRS configuration. To bridge such a gap, this paper proposes a new and customized channel estimation protocol for OIRSs under the alignment-based channel model. Specifically, we first unveil OIRS spatial and temporal coherence characteristics and derive the coherence distance and the coherence time in closed form. Next, to achieve fast beam alignment across different coherence times, we propose to dynamically tune the rotational angles of the OIRS reflecting elements following a geometric optics-based non-uniform codebook. Given the above beam alignment, we propose an efficient joint space-time sampling-based algorithm to estimate the OIRS channel. In particular, we divide the OIRS into multiple subarrays based on the coherence distance and sequentially estimate their associated CSI, followed by a space-time interpolation to retrieve full CSI for other non-aligned transceiver antennas. Numerical results validate our theoretical analyses and demonstrate the efficacy of our proposed OIRS channel estimation scheme as compared to other benchmark schemes.
Submitted 23 April, 2024;
originally announced April 2024.
-
Channel Estimation for Optical IRS-Assisted VLC System via Spatial Coherence
Authors:
Shiyuan Sun,
Fang Yang,
Weidong Mei,
Jian Song,
Zhu Han,
Rui Zhang
Abstract:
Optical intelligent reflecting surface (OIRS) has been considered a promising technology for visible light communication (VLC) by constructing visual line-of-sight propagation paths to address the signal blockage issue. However, the existing works on OIRSs are mostly based on perfect channel state information (CSI), whose acquisition appears to be challenging due to the passive nature of the OIRS. To tackle this challenge, this paper proposes a customized channel estimation algorithm for OIRSs. Specifically, we first unveil the OIRS spatial coherence characteristics and derive the coherence distance in closed form. Based on this property, a spatial sampling-based algorithm is proposed to estimate the OIRS-reflected channel, by dividing the OIRS into multiple subarrays based on the coherence distance and sequentially estimating their associated CSI, followed by an interpolation to retrieve the full CSI. Simulation results validate the derived OIRS spatial coherence and demonstrate the efficacy of the proposed OIRS channel estimation algorithm.
Submitted 22 April, 2024;
originally announced April 2024.
-
APISR: Anime Production Inspired Real-World Anime Super-Resolution
Authors:
Boyang Wang,
Fengyu Yang,
Xihang Yu,
Chao Zhang,
Hanbin Zhao
Abstract:
While real-world anime super-resolution (SR) has gained increasing attention in the SR community, existing methods still adopt techniques from the photorealistic domain. In this paper, we analyze the anime production workflow and rethink how to use its characteristics for real-world anime SR. First, we argue that video networks and datasets are not necessary for anime SR due to the repeated use of hand-drawn frames. Instead, we propose an anime image collection pipeline by choosing the least compressed and the most informative frames from the video sources. Based on this pipeline, we introduce the Anime Production-oriented Image (API) dataset. In addition, we identify two anime-specific challenges of distorted and faint hand-drawn lines and unwanted color artifacts. We address the first issue by introducing a prediction-oriented compression module in the image degradation model and a pseudo-ground truth preparation with enhanced hand-drawn lines. In addition, we introduce the balanced twin perceptual loss combining both anime and photorealistic high-level features to mitigate unwanted color artifacts and increase visual clarity. We evaluate our method through extensive experiments on the public benchmark, showing that our method outperforms state-of-the-art anime dataset-trained approaches.
Submitted 4 April, 2024; v1 submitted 3 March, 2024;
originally announced March 2024.
-
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Authors:
Mateusz Łajszczak,
Guillermo Cámbara,
Yang Li,
Fatih Beyhan,
Arent van Korlaar,
Fan Yang,
Arnaud Joly,
Álvaro Martín-Cortinas,
Ammar Abbas,
Adam Michalski,
Alexis Moinet,
Sri Karlapati,
Ewa Muszyńska,
Haohan Guo,
Bartosz Putrycz,
Soledad López Gambino,
Kayeon Yoo,
Elena Sokolova,
Thomas Drugman
Abstract:
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
Submitted 15 February, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
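The BASE TTS abstract above mentions compressing speechcodes with byte-pair encoding. As a rough illustration of the general technique only, the sketch below runs greedy pair merging over a short list of integer codes. It is not the BASE TTS tokenizer; the function names, the example sequence, and the choice to number new tokens from 100 are all assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair in seq, or None if seq is too short."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of pair (left to right) with new_token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(seq, num_merges, first_new_token):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_token = first_new_token + step
        merges[new_token] = pair
        seq = merge_pair(seq, pair, new_token)
    return seq, merges


def bpe_expand(seq, merges):
    """Invert bpe_compress by recursively expanding merged tokens."""
    out = []
    for tok in seq:
        if tok in merges:
            out.extend(bpe_expand(list(merges[tok]), merges))
        else:
            out.append(tok)
    return out


# Hypothetical discrete codes; real speechcodes would come from a speech tokenizer.
codes = [7, 3, 7, 3, 5, 7, 3, 5]
compressed, merges = bpe_compress(codes, num_merges=2, first_new_token=100)
restored = bpe_expand(compressed, merges)
```

Here two merges shorten the eight-code sequence to three tokens, and expanding the merges recovers the original codes, so the compression is lossless.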
-
DARCS: Memory-Efficient Deep Compressed Sensing Reconstruction for Acceleration of 3D Whole-Heart Coronary MR Angiography
Authors:
Zhihao Xue,
Fan Yang,
Juan Gao,
Zhuo Chen,
Hao Peng,
Chao Zou,
Hang Jin,
Chenxi Hu
Abstract:
Three-dimensional coronary magnetic resonance angiography (CMRA) demands reconstruction algorithms that can significantly suppress the artifacts from a heavily undersampled acquisition. While unrolling-based deep reconstruction methods have achieved state-of-the-art performance on 2D image reconstruction, their application to 3D reconstruction is hindered by the large amount of memory needed to train an unrolled network. In this study, we propose a memory-efficient deep compressed sensing method by employing a sparsifying transform based on a pre-trained artifact estimation network. The motivation is that the artifact image estimated by a well-trained network is sparse when the input image is artifact-free, and less sparse when the input image is artifact-affected. Thus, the artifact-estimation network can be used as an inherent sparsifying transform. The proposed method, named De-Aliasing Regularization based Compressed Sensing (DARCS), was compared with a traditional compressed sensing method, de-aliasing generative adversarial network (DAGAN), model-based deep learning (MoDL), and plug-and-play for accelerations of 3D CMRA. The results demonstrate that the proposed method improved the reconstruction quality relative to the compared methods by a large margin. Furthermore, the proposed method generalized well across different undersampling rates and noise levels. The memory usage of the proposed method was only 63% of that needed by MoDL. In conclusion, the proposed method achieves improved reconstruction quality for 3D CMRA with reduced memory burden.
Submitted 2 February, 2024; v1 submitted 31 January, 2024;
originally announced February 2024.
-
Enhancing Safety in Nonlinear Systems: Design and Stability Analysis of Adaptive Cruise Control
Authors:
Fan Yang,
Haoqi Li,
Maolong Lv,
Jiangping Hu,
Qingrui Zhou,
Bijoy K. Ghosh
Abstract:
The safety of autonomous driving systems, particularly self-driving vehicles, remains of paramount concern. These systems exhibit affine nonlinear dynamics and face the challenge of executing predefined control tasks while adhering to state and input constraints to mitigate risks. However, achieving safety control within the framework of control input constraints, such as collision avoidance and maintaining system states within secure boundaries, presents challenges due to limited options. In this study, we introduce a novel approach to address safety concerns by transforming safety conditions into control constraints with a relative degree of 1. This transformation is facilitated through the design of control barrier functions, enabling the creation of a safety control system for affine nonlinear networks. Subsequently, we formulate a robust control strategy that incorporates safety protocols and conduct a comprehensive analysis of its stability and reliability. To illustrate the effectiveness of our approach, we apply it to a specific problem involving adaptive cruise control. Through simulations, we validate the efficiency of our model in ensuring safety without compromising control performance. Our approach signifies significant progress in the field, providing a practical solution to enhance safety for autonomous driving systems operating within the context of affine nonlinear dynamics.
Submitted 22 January, 2024;
originally announced January 2024.
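The adaptive cruise control abstract above turns a safety condition into a control constraint of relative degree 1 via a control barrier function. The sketch below illustrates that general idea on a simplified car-following model. It is not the controller from the paper: the time-headway barrier h = gap - headway * v_ego, the gains, the constant lead speed, and all function names are assumptions for illustration, and actuator limits are deliberately left out.

```python
def cbf_accel_bound(gap, v_ego, v_lead, headway=1.8, alpha=1.0):
    """Largest ego acceleration that satisfies dh/dt >= -alpha * h.

    With h = gap - headway * v_ego and d(gap)/dt = v_lead - v_ego,
    dh/dt = (v_lead - v_ego) - headway * u, so the acceleration u
    appears after one differentiation (relative degree 1).
    """
    h = gap - headway * v_ego
    return ((v_lead - v_ego) + alpha * h) / headway


def safe_accel(u_nominal, gap, v_ego, v_lead, headway=1.8, alpha=1.0):
    """Closest acceleration to u_nominal that respects the barrier constraint.

    For a scalar input with a single upper bound, this minimum is the
    closed-form solution of the usual quadratic-program safety filter.
    """
    return min(u_nominal, cbf_accel_bound(gap, v_ego, v_lead, headway, alpha))


def simulate(steps=100, dt=0.1, u_nominal=2.0):
    """Euler simulation with a nominal controller that always wants to speed up."""
    gap, v_ego, v_lead = 50.0, 20.0, 20.0  # metres, m/s, m/s (illustrative values)
    headway = 1.8
    min_h = gap - headway * v_ego
    intervened = False
    for _ in range(steps):
        u = safe_accel(u_nominal, gap, v_ego, v_lead, headway)
        intervened = intervened or u < u_nominal
        gap += dt * (v_lead - v_ego)
        v_ego += dt * u
        min_h = min(min_h, gap - headway * v_ego)
    return min_h, intervened


min_h, intervened = simulate()
```

In this toy run the filter overrides the nominal acceleration once the headway margin shrinks, and the barrier value stays non-negative throughout. A fuller treatment would also bound the braking and acceleration the vehicle can actually deliver, which the abstract refers to as input constraints.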
-
Centralized active reconfigurable intelligent surface: Architecture, path loss analysis and experimental verification
Authors:
Changhao Liu,
Fan Yang,
Shenheng Xu,
Yezhen Li,
Maokun Li
Abstract:
Reconfigurable intelligent surfaces (RISs) are promising candidates for 6G communication. Recently, active RIS has been proposed to compensate for the multiplicative fading effect inherent in passive RISs. However, conventional distributed active RISs, with at least one amplifier per element, are costly, complex, and power-intensive. To address these challenges, this paper proposes a novel architecture of active RIS: the centralized active RIS (CA-RIS), which amplifies the energy using a centralized amplifying reflector to reduce the number of amplifiers. Under this architecture, as few as one amplifier is needed for power amplification of the entire array, which can eliminate the mutual-coupling effect among amplifiers and significantly reduce the cost, noise level, and power consumption. We evaluate the performance of CA-RIS, specifically its path loss, and compare it with conventional passive RISs, revealing a moderate amplification gain. Furthermore, the proposed CA-RIS and the path loss model are experimentally verified, achieving a 9.6 dB net gain over passive RIS at 4 GHz. The CA-RIS offers a substantial simplification of active RIS architecture while preserving performance, striking a balance between system complexity and performance that is competitive in various scenarios.
Submitted 18 January, 2024; v1 submitted 17 January, 2024;
originally announced January 2024.
-
A Surrogate-Assisted Extended Generative Adversarial Network for Parameter Optimization in Free-Form Metasurface Design
Authors:
Manna Dai,
Yang Jiang,
Feng Yang,
Joyjit Chattoraj,
Yingzhi Xia,
Xinxing Xu,
Weijiang Zhao,
My Ha Dao,
Yong Liu
Abstract:
Metasurfaces have widespread applications in fifth-generation (5G) microwave communication. Among the metasurface family, free-form metasurfaces excel in achieving intricate spectral responses compared to regular-shape counterparts. However, conventional numerical methods for free-form metasurfaces are time-consuming and demand specialized expertise. Alternatively, recent studies demonstrate that deep learning has great potential to accelerate and refine metasurface designs. Here, we present XGAN, an extended generative adversarial network (GAN) with a surrogate for high-quality free-form metasurface designs. The proposed surrogate provides a physical constraint to XGAN so that XGAN can accurately generate metasurfaces monolithically from input spectral responses. In comparative experiments involving 20000 free-form metasurface designs, XGAN achieves 0.9734 average accuracy and is 500 times faster than the conventional methodology. This method facilitates the metasurface library building for specific spectral responses and can be extended to various inverse design problems, including optical metamaterials, nanophotonic devices, and drug discovery.
Submitted 18 October, 2023;
originally announced January 2024.
-
Selective-Memory Meta-Learning with Environment Representations for Sound Event Localization and Detection
Authors:
Jinbo Hu,
Yin Cao,
Ming Wu,
Qiuqiang Kong,
Feiran Yang,
Mark D. Plumbley,
Jun Yang
Abstract:
Environment shifts and conflicts present significant challenges for learning-based sound event localization and detection (SELD) methods. SELD systems, when trained in particular acoustic settings, often show restricted generalization capabilities for diverse acoustic environments. Furthermore, it is notably costly to obtain annotated samples for spatial sound events. Deploying a SELD system in a new environment requires extensive time for re-training and fine-tuning. To overcome these challenges, we propose environment-adaptive Meta-SELD, designed for efficient adaptation to new environments using minimal data. Our method specifically utilizes computationally synthesized spatial data and employs Model-Agnostic Meta-Learning (MAML) on a pre-trained, environment-independent model. The method then utilizes fast adaptation to unseen real-world environments using limited samples from the respective environments. Inspired by the Learning-to-Forget approach, we introduce the concept of selective memory as a strategy for resolving conflicts across environments. This approach involves selectively memorizing target-environment-relevant information and adapting to the new environments through the selective attenuation of model parameters. In addition, we introduce environment representations to characterize different acoustic settings, enhancing the adaptability of our attenuation approach to various environments. We evaluate our proposed method on the development set of the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset and computationally synthesized scenes. Experimental results demonstrate the superior performance of the proposed method compared to conventional supervised learning methods, particularly in localization.
Submitted 27 December, 2023;
originally announced December 2023.
-
Free Space Optical Integrated Sensing and Communication Based on DCO-OFDM: Performance Metrics and Resource Allocation
Authors:
Yunfeng Wen,
Fang Yang,
Jian Song,
Zhu Han
Abstract:
As one of the six usage scenarios of the sixth generation (6G) mobile communication system, integrated sensing and communication (ISAC) has garnered considerable attention, and numerous studies have been conducted on radio-frequency (RF)-ISAC. Benefitting from the communication and sensing capabilities of an optical system, free space optical (FSO)-ISAC becomes a potential complement to RF-ISAC. In this paper, a direct-current-biased optical orthogonal frequency division multiplexing (DCO-OFDM) scheme is proposed for FSO-ISAC. To derive the spectral efficiency for communication and the Fisher information for sensing as performance metrics, we model the clipping noise of DCO-OFDM as additive colored Gaussian noise to obtain the expression of the signal-to-noise ratio. Based on the derived performance metrics, joint power allocation problems are formulated for both communication-centric and sensing-centric scenarios. In addition, the non-convex joint optimization problems are decomposed into sub-problems for DC bias and subcarriers, which can be solved by block coordinate descent algorithms. Furthermore, numerical simulations demonstrate the proposed algorithms and reveal the trade-off between communication and sensing functionalities of the OFDM-based FSO-ISAC system.
Submitted 21 December, 2023;
originally announced December 2023.
-
Optical Integrated Sensing and Communication: Architectures, Potentials and Challenges
Authors:
Yunfeng Wen,
Fang Yang,
Jian Song,
Zhu Han
Abstract:
Integrated sensing and communication (ISAC) is viewed as a crucial component of future mobile networks and has gained much interest in both academia and industry. Similar to the emergence of radio-frequency (RF) ISAC, the integration of free space optical communication and optical sensing yields optical ISAC (O-ISAC), which is regarded as a powerful complement to its RF counterpart. In this article, we first introduce the generalized system structure of O-ISAC, and then elaborate on three advantages of O-ISAC, i.e., increasing communication rate, enhancing sensing precision, and reducing interference. Next, waveform design and resource allocation of O-ISAC are discussed based on pulsed waveform, constant-modulus waveform, and multi-carrier waveform. Furthermore, we put forward future trends and challenges of O-ISAC, which are expected to provide some valuable directions for future research.
Submitted 10 March, 2024; v1 submitted 21 December, 2023;
originally announced December 2023.
-
PKU-I2IQA: An Image-to-Image Quality Assessment Database for AI Generated Images
Authors:
Jiquan Yuan,
Xinyan Cao,
Chang** Li,
Fanyi Yang,
**long Lin,
Xixin Cao
Abstract:
As image generation technology advances, AI-based image generation has been applied in various fields and Artificial Intelligence Generated Content (AIGC) has garnered widespread attention. However, the development of AI-based image generative models also brings new problems and challenges. A significant challenge is that AI-generated images (AIGI) may exhibit unique distortions compared to natural images, and not all generated images meet the requirements of the real world. Therefore, it is of great significance to evaluate AIGIs more comprehensively. Although previous work has established several human perception-based AIGC image quality assessment (AIGCIQA) databases for text-generated images, AI image generation technology includes scenarios such as text-to-image and image-to-image, and assessing only the images generated by text-to-image models is insufficient. To address this issue, we establish a human perception-based image-to-image AIGCIQA database, named PKU-I2IQA. We conduct a well-organized subjective experiment to collect quality labels for AIGIs and then conduct a comprehensive analysis of the PKU-I2IQA database. Furthermore, we propose two benchmark models: NR-AIGCIQA, based on the no-reference image quality assessment method, and FR-AIGCIQA, based on the full-reference image quality assessment method. Finally, leveraging this database, we conduct benchmark experiments and compare the performance of the proposed benchmark models. The PKU-I2IQA database and benchmarks will be released at https://github.com/jiquan123/I2IQA to facilitate future research.
Submitted 29 November, 2023; v1 submitted 27 November, 2023;
originally announced November 2023.
-
VCISR: Blind Single Image Super-Resolution with Video Compression Synthetic Data
Authors:
Boyang Wang,
Bowen Liu,
Shiyu Liu,
Fengyu Yang
Abstract:
In the blind single image super-resolution (SISR) task, existing works have been successful in restoring image-level unknown degradations. However, when a single video frame becomes the input, these works usually fail to address degradations caused by video compression, such as mosquito noise, ringing, blockiness, and staircase noise. In this work, we present, for the first time, a video compression-based degradation model to synthesize low-resolution image data in the blind SISR task. Our proposed image synthesizing method is widely applicable to existing image datasets, so that a single degraded image can contain distortions caused by lossy video compression algorithms. This overcomes the lack of feature diversity in video data and thus retains the training efficiency. By introducing video coding artifacts to SISR degradation models, neural networks can super-resolve images with the ability to restore video compression degradations, and achieve better results on restoring generic distortions caused by image compression as well. Our proposed approach achieves superior performance in SOTA no-reference Image Quality Assessment, and shows better visual quality on various datasets. In addition, we evaluate the SISR neural network trained with our degradation model on video super-resolution (VSR) datasets. Compared to architectures specifically designed for the VSR purpose, our method exhibits similar or better performance, evidencing that the presented strategy of infusing video-based degradation generalizes to more complicated compression artifacts even without temporal cues.
Submitted 22 November, 2023; v1 submitted 2 November, 2023;
originally announced November 2023.
-
Free Space Optical Communication for Inter-Satellite Link: Architecture, Potentials and Trends
Authors:
Guanhua Wang,
Fang Yang,
Jian Song,
Zhu Han
Abstract:
The sixth-generation (6G) network is expected to achieve global coverage based on the space-air-ground integrated network, and the latest satellite network will play an important role in it. The introduction of inter-satellite links (ISLs) can significantly improve the throughput of the satellite network, and has recently attracted much attention from both academia and industry. In this paper, we illustrate the advantages of using lasers for ISLs: longer communication distance, higher data speed, and stronger security. Specifically, we describe space-borne laser terminals whose acquisition, pointing and tracking mechanism realizes long-distance communication, introduce advanced modulation and multiplexing modes that make high communication rates possible, and analyze the security of ISLs ensured by the characteristics of both the laser and the optical channel. Moreover, some open issues such as advanced optical beam steering, routing and scheduling algorithms, and integrated sensing and communication are discussed to direct future research.
Submitted 26 October, 2023;
originally announced October 2023.
-
Measuring Acoustics with Collaborative Multiple Agents
Authors:
Yinfeng Yu,
Changan Chen,
Lele Cao,
Fangkai Yang,
Fuchun Sun
Abstract:
As humans, we hear sound every second of our life. The sound we hear is often affected by the acoustics of the environment surrounding us. For example, a spacious hall leads to more reverberation. Room Impulse Responses (RIR) are commonly used to characterize environment acoustics as a function of the scene geometry, materials, and source/receiver locations. Traditionally, RIRs are measured by setting up a loudspeaker and microphone in the environment for all source/receiver locations, which is time-consuming and inefficient. We propose to let two robots measure the environment's acoustics by actively moving and emitting/receiving sweep signals. We also devise a collaborative multi-agent policy where these two robots are trained to explore the environment's acoustics while being rewarded for wide exploration and accurate prediction. We show that the robots learn to collaborate and move to explore environment acoustics while minimizing the prediction error. To the best of our knowledge, we present the very first problem formulation and solution to the task of collaborative environment acoustics measurements with multiple agents.
Submitted 8 October, 2023;
originally announced October 2023.
-
AOSR-Net: All-in-One Sandstorm Removal Network
Authors:
Yazhong Si,
Xulong Zhang,
Fan Yang,
Jianzong Wang,
Ning Cheng,
**g Xiao
Abstract:
Most existing sandstorm image enhancement methods are based on traditional theory and prior knowledge, which often restrict their applicability in real-world scenarios. In addition, these approaches often adopt a strategy of color correction followed by dust removal, which makes the algorithm structure too complex. To solve this issue, we introduce a novel image restoration model, named all-in-one sandstorm removal network (AOSR-Net). This model is developed based on a re-formulated sandstorm scattering model, which directly establishes the image mapping relationship by integrating intermediate parameters. This integration scheme effectively addresses the problems of over-enhancement and weak generalization in the field of sand dust image enhancement. Experimental results on synthetic and real-world sandstorm images demonstrate the superiority of the proposed AOSR-Net over state-of-the-art (SOTA) algorithms.
Submitted 15 September, 2023;
originally announced September 2023.
-
META-SELD: Meta-Learning for Fast Adaptation to the new environment in Sound Event Localization and Detection
Authors:
**bo Hu,
Yin Cao,
Ming Wu,
Feiran Yang,
Ziying Yu,
Wenwu Wang,
Mark D. Plumbley,
Jun Yang
Abstract:
For learning-based sound event localization and detection (SELD) methods, different acoustic environments in the training and test sets may result in large performance differences in the validation and evaluation stages. Different environments, such as different sizes of rooms, different reverberation times, and different background noise, may be reasons for a learning-based system to fail. On the other hand, acquiring annotated spatial sound event samples, which include onset and offset time stamps, class types of sound events, and direction-of-arrival (DOA) of sound sources, is very expensive. In addition, deploying a SELD system in a new environment often poses challenges due to time-consuming training and fine-tuning processes. To address these issues, we propose Meta-SELD, which applies meta-learning methods to achieve fast adaptation to new environments. More specifically, based on Model Agnostic Meta-Learning (MAML), the proposed Meta-SELD aims to find good meta-initialized parameters to adapt to new environments with only a small number of samples and parameter updating iterations. We can then quickly adapt the meta-trained SELD model to unseen environments. Our experiments compare fine-tuning methods from pre-trained SELD models with our Meta-SELD on the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset. The evaluation results demonstrate the effectiveness of Meta-SELD when adapting to new environments.
Submitted 17 August, 2023;
originally announced August 2023.
-
Multilingual context-based pronunciation learning for Text-to-Speech
Authors:
Giulia Comini,
Manuel Sam Ribeiro,
Fan Yang,
Heereen Shim,
Jaime Lorenzo-Trueba
Abstract:
Phonetic information and linguistic knowledge are essential components of a Text-to-speech (TTS) front-end. Given a language, a lexicon can be collected offline and Grapheme-to-Phoneme (G2P) relationships are usually modeled in order to predict the pronunciation for out-of-vocabulary (OOV) words. Additionally, post-lexical phonology, often defined in the form of rule-based systems, is used to correct pronunciation within or between words. In this work we showcase a multilingual unified front-end system that addresses any pronunciation-related task, typically handled by separate modules. We evaluate the proposed model on G2P conversion and other language-specific challenges, such as homograph and polyphone disambiguation, post-lexical rules and implicit diacritization. We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when compared to equivalent monolingual solutions.
Submitted 31 July, 2023;
originally announced July 2023.
-
A Message Passing Detection based Affine Frequency Division Multiplexing Communication System
Authors:
Lifan Wu,
Shan Luo,
Dongxiao Song,
Fan Yang,
Rong** Lin
Abstract:
The next generation of wireless communication technology is anticipated to address the communication reliability challenges encountered in high-speed mobile communication scenarios. An Orthogonal Time Frequency Space (OTFS) system has been introduced as a solution that effectively mitigates these issues. However, OTFS is associated with relatively high pilot overhead and multiuser multiplexing overhead. In response to these concerns within the OTFS framework, a novel modulation technology known as Affine Frequency Division Multiplexing (AFDM) which is based on the discrete affine Fourier transform has emerged. AFDM effectively resolves the challenges by achieving full diversity through parameter adjustments aligned with the channel's delay-Doppler profile. Consequently, AFDM is capable of achieving performance levels comparable to OTFS. As the research on AFDM detection is currently limited, we present a low-complexity yet efficient message passing (MP) algorithm. This algorithm handles joint interference cancellation and detection while capitalizing on the inherent sparsity of the channel. Based on simulation results, the MP detection algorithm outperforms Minimum Mean Square Error (MMSE) and Maximal Ratio Combining (MRC) detection techniques.
Submitted 30 August, 2023; v1 submitted 29 July, 2023;
originally announced July 2023.
-
One for Multiple: Physics-informed Synthetic Data Boosts Generalizable Deep Learning for Fast MRI Reconstruction
Authors:
Zi Wang,
Xiaotong Yu,
Chengyan Wang,
Weibo Chen,
Jiazheng Wang,
Ying-Hua Chu,
Hongwei Sun,
Rushuai Li,
Peiyong Li,
Fan Yang,
Haiwei Han,
Taishan Kang,
Jianzhong Lin,
Chen Yang,
Shufu Chang,
Zhang Shi,
Sha Hua,
Yan Li,
Juan Hu,
Liuhong Zhu,
Jianjun Zhou,
Mei**g Lin,
Jiefeng Guo,
Congbo Cai,
Zhong Chen
, et al. (3 additional authors not shown)
Abstract:
Magnetic resonance imaging (MRI) is a widely used radiological modality renowned for its radiation-free, comprehensive insights into the human body, facilitating medical diagnoses. However, the drawback of prolonged scan times hinders its accessibility. The k-space undersampling offers a solution, yet the resultant artifacts necessitate meticulous removal during image reconstruction. Although Deep Learning (DL) has proven effective for fast MRI image reconstruction, its broader applicability across various imaging scenarios has been constrained. Challenges include the high cost and privacy restrictions associated with acquiring large-scale, diverse training data, coupled with the inherent difficulty of addressing mismatches between training and target data in existing DL methodologies. Here, we present a novel Physics-Informed Synthetic data learning framework for Fast MRI, called PISF. PISF marks a breakthrough by enabling generalized DL for multi-scenario MRI reconstruction through a single trained model. Our approach separates the reconstruction of a 2D image into many 1D basic problems, commencing with 1D data synthesis to facilitate generalization. We demonstrate that training DL models on synthetic data, coupled with enhanced learning techniques, yields in vivo MRI reconstructions comparable to or surpassing those of models trained on matched realistic datasets, reducing the reliance on real-world MRI data by up to 96%. Additionally, PISF exhibits remarkable generalizability across multiple vendors and imaging centers. Its adaptability to diverse patient populations has been validated through evaluations by ten experienced medical professionals. PISF presents a feasible and cost-effective way to significantly boost the widespread adoption of DL in various fast MRI applications.
Submitted 28 February, 2024; v1 submitted 24 July, 2023;
originally announced July 2023.
-
Spatio-Temporal Classification of Lung Ventilation Patterns using 3D EIT Images: A General Approach for Individualized Lung Function Evaluation
Authors:
Shuzhe Chen,
Li Li,
Zhichao Lin,
Ke Zhang,
Ying Gong,
Lu Wang,
Xu Wu,
Maokun Li,
Yuanlin Song,
Fan Yang,
Shenheng Xu
Abstract:
The Pulmonary Function Test (PFT) is a widely utilized and rigorous classification test for lung function evaluation, serving as a comprehensive tool for lung diagnosis. Meanwhile, Electrical Impedance Tomography (EIT) is a rapidly advancing clinical technique that visualizes conductivity distribution induced by ventilation. EIT provides additional spatial and temporal information on lung ventilation beyond traditional PFT. However, relying solely on conventional isolated interpretations of PFT results and EIT images overlooks the continuous dynamic aspects of lung ventilation. This study aims to classify lung ventilation patterns by extracting spatial and temporal features from the 3D EIT image series. The study uses a Variational Autoencoder network with a MultiRes block to compress the spatial distribution in a 3D image into a one-dimensional vector. These vectors are then concatenated to create a feature map that exhibits temporal features. A simple convolutional neural network is used for classification. Data collected from 137 subjects were used for training. The model is first validated by ten-fold and leave-one-out cross-validation. The accuracy and sensitivity of normal ventilation mode are 0.95 and 1.00, and the F1-score is 0.94. Furthermore, we check the reliability and feasibility of the proposed pipeline by testing it on nine newly recruited subjects. Our results show that the pipeline correctly predicts the ventilation mode of 8 out of 9 subjects. The study demonstrates the potential of using image series for lung ventilation mode classification, providing a feasible method for patient prescreening and presenting an alternative form of PFT.
Submitted 1 July, 2023;
originally announced July 2023.
-
MLE-based Device Activity Detection under Rician Fading for Massive Grant-free Access with Perfect and Imperfect Synchronization
Authors:
Wang Liu,
Ying Cui,
Feng Yang,
Lianghui Ding,
Jun Sun
Abstract:
Most existing studies on massive grant-free access, proposed to support massive machine-type communications (mMTC) for the Internet of things (IoT), assume Rayleigh fading and perfect synchronization for simplicity. However, in practice, line-of-sight (LoS) components generally exist, and time and frequency synchronization are usually imperfect. This paper systematically investigates maximum likelihood estimation (MLE)-based device activity detection under Rician fading for massive grant-free access with perfect and imperfect synchronization. We assume that the large-scale fading powers, Rician factors, and normalized LoS components can be estimated offline. We formulate device activity detection in the synchronous case and joint device activity and offset detection in three asynchronous cases (i.e., time, frequency, and time and frequency asynchronous cases) as MLE problems. In the synchronous case, we propose an iterative algorithm to obtain a stationary point of the MLE problem. In each asynchronous case, we propose two iterative algorithms with identical detection performance but different computational complexities. In particular, one is computationally efficient for small ranges of offsets, whereas the other one, relying on fast Fourier transform (FFT) and inverse FFT, is computationally efficient for large ranges of offsets. The proposed algorithms generalize the existing MLE-based methods for Rayleigh fading and perfect synchronization. Numerical results show that the proposed algorithm for the synchronous case can reduce the detection error probability by up to 50.4% at a 78.6% computation time increase, compared to the MLE-based state of the art, and the proposed algorithms for the three asynchronous cases can reduce the detection error probabilities and computation times by up to 65.8% and 92.0%, respectively, compared to the MLE-based state of the art.
Submitted 11 January, 2024; v1 submitted 11 June, 2023;
originally announced June 2023.
-
Segmentation of Aortic Vessel Tree in CT Scans with Deep Fully Convolutional Networks
Authors:
Shaofeng Yuan,
Feng Yang
Abstract:
Automatic and accurate segmentation of the aortic vessel tree (AVT) in computed tomography (CT) scans is crucial for early detection, diagnosis and prognosis of aortic diseases, such as aneurysms, dissections and stenosis. However, this task remains challenging, due to the complexity of the aortic vessel tree and the amount of CT angiography (CTA) data. In this technical report, we use two-stage fully convolutional networks (FCNs) to automatically segment the AVT in CTA scans from multiple centers. Specifically, we first adopt a 3D FCN with a U-shaped network architecture to segment the AVT in order to produce topology attention and accelerate the medical image analysis pipeline. Then another 3D FCN is trained to segment branches of the AVT along its pseudo-centerline. In the 2023 MICCAI Segmentation of the Aorta (SEG.A.) Challenge, the reported method was evaluated on the public dataset of 56 cases. The resulting Dice Similarity Coefficient (DSC) is 0.920, Jaccard Similarity Coefficient (JSC) is 0.861, Recall is 0.922, and Precision is 0.926 on a 5-fold random split of the training and validation set.
Submitted 16 May, 2023;
originally announced May 2023.
-
The Ways of Words: The Impact of Word Choice on Information Engagement and Decision Making
Authors:
Nimrod Dvir,
Elaine Friedman,
Suraj Commuri,
Fan Yang,
Jennifer Romano
Abstract:
Little research has explored information engagement (IE), the degree to which individuals interact with and use information in a manner that manifests cognitively, behaviorally, and affectively. This study explored the impact of phrasing, specifically word choice, on IE and decision making. Synthesizing two theoretical models, User Engagement Theory (UET) and Information Behavior Theory (IBT), a theoretical framework illustrating the impact of and relationships among the three IE dimensions of perception, participation, and perseverance was developed and hypotheses generated. The framework was empirically validated in a large-scale user study measuring how word choice impacts the dimensions of IE. The findings provide evidence that IE differs from other forms of engagement in that it is driven and fostered by the expression of the information itself, regardless of the information system used to view, interact with, and use the information. The findings suggest that phrasing can have a significant effect on the interpretation of and interaction with digital information, indicating the importance of expression of information, in particular word choice, on decision making and IE. The research contributes to the literature by identifying methods for assessment and improvement of IE and decision making with digital text.
Submitted 16 May, 2023;
originally announced May 2023.
-
SSD-MonoDETR: Supervised Scale-aware Deformable Transformer for Monocular 3D Object Detection
Authors:
Xuan He,
Fan Yang,
Kailun Yang,
Jiacheng Lin,
Haolong Fu,
Meng Wang,
** Yuan,
Zhiyong Li
Abstract:
Transformer-based methods have demonstrated superior performance for monocular 3D object detection recently, which aims at predicting 3D attributes from a single 2D image. Most existing transformer-based methods leverage both visual and depth representations to explore valuable query points on objects, and the quality of the learned query points has a great impact on detection accuracy. Unfortunately, existing unsupervised attention mechanisms in transformers are prone to generate low-quality query features due to inaccurate receptive fields, especially on hard objects. To tackle this problem, this paper proposes a novel "Supervised Scale-aware Deformable Attention" (SSDA) for monocular 3D object detection. Specifically, SSDA presets several masks with different scales and utilizes depth and visual features to adaptively learn a scale-aware filter for object query augmentation. Imposing the scale awareness, SSDA could well predict the accurate receptive field of an object query to support robust query feature generation. Aside from this, SSDA is assigned with a Weighted Scale Matching (WSM) loss to supervise scale prediction, which presents more confident results as compared to the unsupervised attention mechanisms. Extensive experiments on the KITTI and Waymo Open datasets demonstrate that SSDA significantly improves the detection accuracy, especially on moderate and hard objects, yielding state-of-the-art performance as compared to the existing approaches. Our code will be made publicly available at https://github.com/mikasa3lili/SSD-MonoDETR.
Submitted 1 September, 2023; v1 submitted 12 May, 2023;
originally announced May 2023.
-
CSDN: Combing Shallow and Deep Networks for Accurate Real-time Segmentation of High-definition Intravascular Ultrasound Images
Authors:
Shaofeng Yuan,
Feng Yang
Abstract:
Intravascular ultrasound (IVUS) is the preferred modality for capturing real-time, high-resolution cross-sectional images of the coronary arteries and for evaluating stenosis. Accurate and real-time segmentation of IVUS images involves the delineation of lumen and external elastic membrane borders. In this paper, we propose a two-stream framework for efficient segmentation of 60 MHz high-resolution IVUS images. It combines shallow and deep networks and is named CSDN. The shallow network with thick channels focuses on extracting low-level details. The deep network with thin channels takes charge of learning high-level semantics. Treating these two kinds of information separately enables the model to achieve both high accuracy and high efficiency for real-time segmentation. To further improve the segmentation performance, a mutual guided fusion module is used to enhance and fuse the two types of feature representation. The experimental results show that our CSDN achieves a good trade-off between analysis speed and segmentation accuracy.
Submitted 30 January, 2023;
originally announced January 2023.
-
Does image resolution impact chest X-ray based fine-grained Tuberculosis-consistent lesion segmentation?
Authors:
Sivaramakrishnan Rajaraman,
Feng Yang,
Ghada Zamzmi,
Zhiyun Xue,
Sameer Antani
Abstract:
Deep learning (DL) models are state-of-the-art in segmenting anatomical and disease regions of interest (ROIs) in medical images. Particularly, a large number of DL-based techniques have been reported using chest X-rays (CXRs). However, these models are reportedly trained on reduced image resolutions for reasons related to the lack of computational resources. Literature is sparse in discussing the optimal image resolution to train these models for segmenting the Tuberculosis (TB)-consistent lesions in CXRs. In this study, we (i) investigated the performance variations of an Inception-V3 UNet model across various image resolutions, with and without lung ROI cropping and aspect ratio adjustments, and (ii) identified the optimal image resolution through extensive empirical evaluations to improve TB-consistent lesion segmentation performance. We used the Shenzhen CXR dataset for the study, which includes 326 normal patients and 336 TB patients. We proposed a combinatorial approach consisting of storing model snapshots, optimizing segmentation threshold and test-time augmentation (TTA), and averaging the snapshot predictions, to further improve performance with the optimal resolution. Our experimental results demonstrate that higher image resolutions are not always necessary; however, identifying the optimal image resolution is critical to achieving superior performance.
Submitted 27 January, 2023; v1 submitted 10 January, 2023;
originally announced January 2023.
-
Active RISs: Signal Modeling, Asymptotic Analysis, and Beamforming Design
Authors:
Zijian Zhang,
Linglong Dai,
Xibi Chen,
Changhao Liu,
Fan Yang,
Robert Schober,
H. Vincent Poor
Abstract:
Reconfigurable intelligent surfaces (RISs) have emerged as a candidate technology for future 6G networks. However, due to the "multiplicative fading" effect, the existing passive RISs only achieve a negligible capacity gain in environments with strong direct links. In this paper, the concept of active RISs is studied to overcome this fundamental limitation. Unlike the existing passive RISs that reflect signals without amplification, active RISs can amplify the reflected signals via amplifiers integrated into their elements. To characterize the signal amplification and incorporate the noise introduced by the active components, we verify the signal model of active RISs through the experimental measurements on a fabricated active RIS element. Based on the verified signal model, we formulate the sum-rate maximization problem for an active RIS aided multi-user multiple-input single-output (MU-MISO) system and a joint transmit precoding and reflect beamforming algorithm is proposed to solve this problem. Simulation results show that, in a typical wireless system, the existing passive RISs can realize only a negligible sum-rate gain of 3%, while the active RISs can achieve a significant sum-rate gain of 62%, thus overcoming the "multiplicative fading" effect. Finally, we develop a 64-element active RIS aided wireless communication prototype, and the significant gain of active RISs is validated by field test.
Submitted 31 December, 2022;
originally announced January 2023.
-
Improve Bilingual TTS Using Dynamic Language and Phonology Embedding
Authors:
Fengyu Yang,
Jian Luan,
Yujun Wang
Abstract:
In most cases, bilingual TTS needs to handle three types of input scripts: first language only, second language only, and second language embedded in the first language. In the latter two situations, the pronunciation and intonation of the second language are usually quite different due to the influence of the first language. Therefore, it is a big challenge to accurately model the pronunciation and intonation of the second language in different contexts without mutual interference. This paper builds a Mandarin-English TTS system to acquire more standard spoken English speech from a monolingual Chinese speaker. We introduce phonology embedding to capture the English differences between different phonology. Embedding mask is applied to language embedding for distinguishing information between different languages and to phonology embedding for focusing on English expression. We specially design an embedding strength modulator to capture the dynamic strength of language and phonology. Experiments show that our approach can produce significantly more natural and standard spoken English speech of the monolingual Chinese speaker. From analysis, we find that suitable phonology control contributes to better performance in different scenarios.
Submitted 6 December, 2022;
originally announced December 2022.
-
Generalizability of Deep Adult Lung Segmentation Models to the Pediatric Population: A Retrospective Study
Authors:
Sivaramakrishnan Rajaraman,
Feng Yang,
Ghada Zamzmi,
Zhiyun Xue,
Sameer Antani
Abstract:
Lung segmentation in chest X-rays (CXRs) is an important prerequisite for improving the specificity of diagnoses of cardiopulmonary diseases in a clinical decision support system. Current deep learning models for lung segmentation are trained and evaluated on CXR datasets in which the radiographic projections are captured predominantly from the adult population. However, the shape of the lungs is reported to be significantly different across the developmental stages from infancy to adulthood. This might result in age-related data domain shifts that would adversely impact lung segmentation performance when the models trained on the adult population are deployed for pediatric lung segmentation. In this work, our goal is to (i) analyze the generalizability of deep adult lung segmentation models to the pediatric population and (ii) improve performance through a stage-wise, systematic approach consisting of CXR modality-specific weight initializations, stacked ensembles, and an ensemble of stacked ensembles. To evaluate segmentation performance and generalizability, novel evaluation metrics consisting of mean lung contour distance (MLCD) and average hash score (AHS) are proposed in addition to the multi-scale structural similarity index measure (MS-SSIM), the intersection over union (IoU), Dice score, 95% Hausdorff distance (HD95), and average symmetric surface distance (ASSD). Our results showed a significant improvement (p < 0.05) in cross-domain generalization through our approach. This study could serve as a paradigm to analyze the cross-domain generalizability of deep segmentation models for other medical imaging modalities and applications.
Submitted 25 May, 2023; v1 submitted 4 November, 2022;
originally announced November 2022.
-
ISA-Net: Improved spatial attention network for PET-CT tumor segmentation
Authors:
Zhengyong Huang,
Sijuan Zou,
Guoshuai Wang,
Zixiang Chen,
Hao Shen,
Haiyan Wang,
Na Zhang,
Lu Zhang,
Fan Yang,
Haining Wang,
Dong Liang,
Tianye Niu,
Xiaohua Zhu,
Zhanli Hu
Abstract:
Achieving accurate and automated tumor segmentation plays an important role in both clinical practice and radiomics research. Segmentation in medicine is now often performed manually by experts, which is a laborious, expensive and error-prone task. Manual annotation relies heavily on the experience and knowledge of these experts. In addition, there is much intra- and interobserver variation. Therefore, it is of great significance to develop a method that can automatically segment tumor target regions. In this paper, we propose a deep learning segmentation method based on multimodal positron emission tomography-computed tomography (PET-CT), which combines the high sensitivity of PET and the precise anatomical information of CT. We design an improved spatial attention network (ISA-Net) to increase the accuracy of PET or CT in detecting tumors, which uses multi-scale convolution operations to extract feature information and can highlight the tumor region location information and suppress the non-tumor region location information. In addition, our network uses dual-channel inputs in the coding stage and fuses them in the decoding stage, which can take advantage of the differences and complementarities between PET and CT. We validated the proposed ISA-Net method on two clinical datasets, a soft tissue sarcoma (STS) dataset and a head and neck tumor (HECKTOR) dataset, and compared it with other attention methods for tumor segmentation. DSC scores of 0.8378 on the STS dataset and 0.8076 on the HECKTOR dataset show that the ISA-Net method achieves better segmentation performance and better generalization. Conclusions: The method proposed in this paper is based on multi-modal medical image tumor segmentation, which can effectively utilize the difference and complementarity of different modes. The method can also be applied to other multi-modal data or single-modal data by proper adjustment.
Submitted 4 November, 2022;
originally announced November 2022.
-
Distributed Optimal Control of Graph Symmetric Systems via Graph Filters
Authors:
Fengjun Yang,
Fernando Gama,
Somayeh Sojoudi,
Nikolai Matni
Abstract:
Designing distributed optimal controllers subject to communication constraints is a difficult problem unless structural assumptions are imposed on the underlying dynamics and information exchange structure, e.g., sparsity, delay, or spatial invariance. In this paper, we borrow ideas from graph signal processing and define and analyze a class of Graph Symmetric Systems (GSSs), which are systems that are symmetric with respect to an underlying graph topology. We show that for linear quadratic problems subject to dynamics defined by a GSS, the optimal centralized controller is given by a novel class of graph filters with transfer function valued filter taps and can be implemented via distributed message passing. We then propose several methods for approximating the optimal centralized graph filter by a distributed controller only requiring communication with a small subset of neighboring subsystems. We further provide stability and suboptimality guarantees for the resulting distributed controllers. Finally, we empirically demonstrate that our approach allows for a principled tradeoff between communication cost and performance while guaranteeing stability. Our results can be viewed as a first step towards bridging the fields of distributed optimal control and graph signal processing.
Submitted 27 October, 2022;
originally announced October 2022.
-
Sound Event Localization and Detection for Real Spatial Sound Scenes: Event-Independent Network and Data Augmentation Chains
Authors:
Jinbo Hu,
Yin Cao,
Ming Wu,
Qiuqiang Kong,
Feiran Yang,
Mark D. Plumbley,
Jun Yang
Abstract:
Sound event localization and detection (SELD) is a joint task of sound event detection and direction-of-arrival estimation. In DCASE 2022 Task 3, the data type changes from computationally generated spatial recordings to recordings of real sound scenes. Our system submitted to DCASE 2022 Task 3 is based on our previously proposed Event-Independent Network V2 (EINV2) with a novel data augmentation method. Our method employs EINV2 with a track-wise output format, permutation-invariant training, and a soft parameter-sharing strategy, to detect different sound events of the same class but in different locations. The Conformer structure is used for extending EINV2 to learn local and global features. A data augmentation method, which contains several data augmentation chains composed of stochastic combinations of several different data augmentation operations, is utilized to generalize the model. To mitigate the lack of real-scene recordings in the development dataset and the imbalance of sound events, we exploit FSD50K, AudioSet, and TAU Spatial Room Impulse Response Database (TAU-SRIR DB) to generate simulated datasets for training. We present results on the validation set of Sony-TAU Realistic Spatial Soundscapes 2022 (STARSS22) in detail. Experimental results indicate that the ability to generalize to different environments and unbalanced performance among different classes are two main challenges. We evaluate our proposed method in Task 3 of the DCASE 2022 challenge and obtain the second rank in the teams ranking. Source code is released.
Submitted 9 September, 2022; v1 submitted 5 September, 2022;
originally announced September 2022.
-
DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation
Authors:
Da-Yi Wu,
Wen-Yi Hsiao,
Fu-Rong Yang,
Oscar Friedman,
Warren Jackson,
Scott Bruzenak,
Yi-Wen Liu,
Yi-Hsuan Yang
Abstract:
A vocoder is a conditional audio generation model that converts acoustic features such as mel-spectrograms into waveforms. Taking inspiration from Differentiable Digital Signal Processing (DDSP), we propose a new vocoder named SawSing for singing voices. SawSing synthesizes the harmonic part of singing voices by filtering a sawtooth source signal with a linear time-variant finite impulse response filter whose coefficients are estimated from the input mel-spectrogram by a neural network. As this approach enforces phase continuity, SawSing can generate singing voices without the phase-discontinuity glitch of many existing vocoders. Moreover, the source-filter assumption provides an inductive bias that allows SawSing to be trained on a small amount of data. Our experiments show that SawSing converges much faster and outperforms state-of-the-art generative adversarial network and diffusion-based vocoders in a resource-limited scenario with only 3 training recordings and a 3-hour training time.
Submitted 18 August, 2022; v1 submitted 9 August, 2022;
originally announced August 2022.
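Illustrative note on the entry above: the subtractive idea in the abstract (a sawtooth source shaped by a time-varying FIR filter) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the SawSing implementation. The function names `sawtooth` and `ltv_fir` are hypothetical, the sawtooth is naive (not band-limited), and the per-frame filter taps are hard-coded here, whereas the abstract says they are estimated from the mel-spectrogram by a neural network.

```python
def sawtooth(freq_hz, sample_rate, n_samples):
    """Naive (not band-limited) sawtooth in [-1, 1). Hypothetical helper."""
    return [2.0 * ((freq_hz * n / sample_rate) % 1.0) - 1.0
            for n in range(n_samples)]


def ltv_fir(signal, taps_per_frame, frame_size):
    """Linear time-variant FIR filter: each frame uses its own tap vector.

    taps_per_frame stands in for coefficients that a network would predict;
    here they are supplied by the caller (an assumption for illustration).
    """
    out = []
    for n in range(len(signal)):
        # Pick the taps of the frame that sample n belongs to.
        frame = min(n // frame_size, len(taps_per_frame) - 1)
        taps = taps_per_frame[frame]
        # Causal convolution; samples before the start count as zero.
        out.append(sum(h * signal[n - k]
                       for k, h in enumerate(taps) if n - k >= 0))
    return out


source = sawtooth(freq_hz=100, sample_rate=800, n_samples=16)
# Frame 0: identity filter. Frame 1: two-tap moving average (low-pass).
filtered = ltv_fir(source, taps_per_frame=[[1.0], [0.5, 0.5]], frame_size=8)
```

Because the source is one continuous sawtooth and only the filter changes from frame to frame, the phase of the harmonic part stays continuous, which is the property the abstract highlights.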
-
Distributed Scheduling at Non-Signalized Intersections with Mixed Cooperative and Non-Cooperative Vehicles
Authors:
Feihong Yang,
Yuan Shen
Abstract:
Intersection management with mixed cooperative and non-cooperative vehicles is crucial in next-generation transportation systems. For fully non-cooperative systems, a minimax scheduling framework was established, while it is inefficient in mixed systems as the benefit of cooperation is not exploited. This letter focuses on the efficient scheduling in mixed systems and proposes a two-stage policy that makes full use of the cooperation relation. Specifically, a long-horizon self-organization policy is first developed to optimize the passing order of cooperative vehicles in a distributed manner, which is proved convergent when inbound roads are sufficiently long. Then a short-horizon trajectory planning policy is proposed to improve the efficiency when an ego-vehicle faces both cooperative and non-cooperative vehicles, and its safety and efficiency are theoretically validated. Furthermore, numerical simulations verify that the proposed policies can effectively reduce the scheduling cost and improve the throughput for cooperative vehicles.
Submitted 7 August, 2022; v1 submitted 30 July, 2022;
originally announced August 2022.
-
Degradation-Guided Meta-Restoration Network for Blind Super-Resolution
Authors:
Fuzhi Yang,
Huan Yang,
Yanhong Zeng,
Jianlong Fu,
Hongtao Lu
Abstract:
Blind super-resolution (SR) aims to recover high-quality visual textures from a low-resolution (LR) image, which is usually degraded by down-sampling blur kernels and additive noises. This task is extremely difficult due to the challenges of complicated image degradations in the real-world. Existing SR approaches either assume a predefined blur kernel or a fixed noise, which limits these approaches in challenging cases. In this paper, we propose a Degradation-guided Meta-restoration network for blind Super-Resolution (DMSR) that facilitates image restoration for real cases. DMSR consists of a degradation extractor and meta-restoration modules. The extractor estimates the degradations in LR inputs and guides the meta-restoration modules to predict restoration parameters for different degradations on-the-fly. DMSR is jointly optimized by a novel degradation consistency loss and reconstruction losses. Through such an optimization, DMSR outperforms SOTA by a large margin on three widely-used benchmarks. A user study including 16 subjects further validates the superiority of DMSR in real-world blind SR tasks.
Submitted 2 July, 2022;
originally announced July 2022.
-
Intelligent Reflecting Surface for MIMO VLC: Joint Design of Surface Configuration and Transceiver Signal Processing
Authors:
Shiyuan Sun,
Fang Yang,
Jian Song,
Rui Zhang
Abstract:
With the capability of reconfiguring the wireless electromagnetic environment, intelligent reflecting surface (IRS) is a new paradigm for designing future wireless communication systems. In this paper, we consider optical IRS for improving the performance of visible light communication (VLC) under a multiple-input and multiple-output (MIMO) setting. Specifically, we focus on the downlink communication of an indoor MIMO VLC system and aim to minimize the mean square error (MSE) of demodulated signals at the receiver. To this end, the MIMO channel gain of the IRS-aided VLC is first derived under the point source assumption, based on which the MSE minimization problem is then formulated subject to the emission power constraints. Next, we propose an alternating optimization algorithm, which decomposes the original problem into three subproblems, to iteratively optimize the IRS configuration and the precoding and detection matrices for minimizing the MSE. Moreover, theoretical analysis on the performance of the proposed algorithm in high and low signal-to-noise ratio (SNR) regimes is provided, revealing that the joint optimization process can be simplified in such special cases, and the algorithm's convergence property and computational complexity are also discussed. Finally, numerical results show that IRS-aided schemes significantly reduce the MSE as compared to their counterparts without IRS, and the proposed algorithm outperforms other baseline schemes.
Submitted 29 June, 2022;
originally announced June 2022.
-
Free-form Lesion Synthesis Using a Partial Convolution Generative Adversarial Network for Enhanced Deep Learning Liver Tumor Segmentation
Authors:
Yingao Liu,
Fei Yang,
Yidong Yang
Abstract:
Automatic deep learning segmentation models have been shown to improve both segmentation efficiency and accuracy. However, training a robust segmentation model requires a considerably large number of labeled training samples, which may be impractical. This study aimed to develop a deep learning framework for generating synthetic lesions that can be used to enhance network training. The lesion synthesis network is a modified generative adversarial network (GAN). Specifically, we devised a partial convolution strategy to construct a U-Net-like generator. The discriminator is designed using Wasserstein GAN with gradient penalty and spectral normalization. A mask generation method based on principal component analysis was developed to model various lesion shapes. The generated masks are then converted into liver lesions through a lesion synthesis network. The lesion synthesis framework was evaluated for lesion textures, and the synthetic lesions were used to train a lesion segmentation network to further validate the effectiveness of this framework. All the networks are trained and tested on the public dataset from LITS. The synthetic lesions generated by the proposed approach have very similar histogram distributions compared to the real lesions for the two employed texture parameters, GLCM-energy and GLCM-correlation. The Kullback-Leibler divergences of GLCM-energy and GLCM-correlation were 0.01 and 0.10, respectively. Including the synthetic lesions in the tumor segmentation network improved the segmentation Dice performance of U-Net significantly from 67.3% to 71.4% (p<0.05). Meanwhile, the volume precision and sensitivity improved from 74.6% to 76.0% (p=0.23) and from 66.1% to 70.9% (p<0.01), respectively. The synthetic data significantly improves the segmentation performance.
Submitted 25 October, 2022; v1 submitted 17 June, 2022;
originally announced June 2022.
-
PeQuENet: Perceptual Quality Enhancement of Compressed Video with Adaptation- and Attention-based Network
Authors:
Saiping Zhang,
Luis Herranz,
Marta Mrak,
Marc Gorriz Blanch,
Shuai Wan,
Fuzheng Yang
Abstract:
In this paper we propose a generative adversarial network (GAN) framework to enhance the perceptual quality of compressed videos. Our framework includes attention and adaptation to different quantization parameters (QPs) in a single model. The attention module exploits global receptive fields that can capture and align long-range correlations between consecutive frames, which can be beneficial for enhancing perceptual quality of videos. The frame to be enhanced is fed into the deep network together with its neighboring frames, and in the first stage features at different depths are extracted. Then extracted features are fed into attention blocks to explore global temporal correlations, followed by a series of upsampling and convolution layers. Finally, the resulting features are processed by the QP-conditional adaptation module which leverages the corresponding QP information. In this way, a single model can be used to enhance adaptively to various QPs without requiring multiple models specific for every QP value, while having similar performance. Experimental results demonstrate the superior performance of the proposed PeQuENet compared with the state-of-the-art compressed video quality enhancement algorithms.
Submitted 15 June, 2022;
originally announced June 2022.
-
Demo: low-power communications based on RIS and AI for 6G
Authors:
Mingyao Cui,
Zidong Wu,
Yuhao Chen,
Shenheng Xu,
Fan Yang,
Linglong Dai
Abstract:
Ultra-massive multiple-input-multiple-output (UM-MIMO) is promising to meet the high rate requirements for future 6G. However, due to the large number of antennas and high path loss, the hardware power consumption and computing power consumption of UM-MIMO will be unaffordable. To address this problem, we implement a low-power communication system based on reconfigurable intelligent surface (RIS) and artificial intelligence (AI) for 6G. For hardware design, we employ a 256-element RIS at the base station to replace the traditional phased array. Moreover, a 2304-element RIS is developed as a relay to assist communication with much reduced transmit power. For software implementation, we develop an AI-based transmission design to reduce computing power consumption. By jointly designing the hardware and software, this prototype can realize real-time 4K video transmission with much reduced power consumption.
Submitted 21 May, 2022;
originally announced June 2022.
-
Deep ensemble learning for segmenting tuberculosis-consistent manifestations in chest radiographs
Authors:
Sivaramakrishnan Rajaraman,
Feng Yang,
Ghada Zamzmi,
Peng Guo,
Zhiyun Xue,
Sameer K Antani
Abstract:
Automated segmentation of tuberculosis (TB)-consistent lesions in chest X-rays (CXRs) using deep learning (DL) methods can help reduce radiologist effort, supplement clinical decision-making, and potentially result in improved patient treatment. The majority of works in the literature discuss training automatic segmentation models using coarse bounding box annotations. However, the granularity of the bounding box annotation could result in the inclusion of a considerable fraction of false positives and negatives at the pixel level that may adversely impact overall semantic segmentation performance. This study (i) evaluates the benefits of using fine-grained annotations of TB-consistent lesions and (ii) trains and constructs ensembles of the variants of U-Net models for semantically segmenting TB-consistent lesions in both original and bone-suppressed frontal CXRs. We evaluated segmentation performance using several ensemble methods such as bitwise-AND, bitwise-OR, bitwise-MAX, and stacking. We observed that the stacking ensemble demonstrated superior segmentation performance (Dice score: 0.5743, 95% confidence interval: (0.4055, 0.7431)) compared to the individual constituent models and other ensemble methods. To the best of our knowledge, this is the first study to apply ensemble learning to improve fine-grained TB-consistent lesion segmentation performance.
Submitted 13 June, 2022;
originally announced June 2022.
-
Slimmable Video Codec
Authors:
Zhaocheng Liu,
Luis Herranz,
Fei Yang,
Saiping Zhang,
Shuai Wan,
Marta Mrak,
Marc Górriz Blanch
Abstract:
Neural video compression has emerged as a novel paradigm combining trainable multilayer neural networks and machine learning, achieving competitive rate-distortion (RD) performances, but still remaining impractical due to heavy neural architectures, with large memory and computational demands. In addition, models are usually optimized for a single RD tradeoff. Recent slimmable image codecs can dynamically adjust their model capacity to gracefully reduce the memory and computation requirements, without harming RD performance. In this paper we propose a slimmable video codec (SlimVC), by integrating a slimmable temporal entropy model in a slimmable autoencoder. Despite a significantly more complex architecture, we show that slimming remains a powerful mechanism to control rate, memory footprint, computational cost and latency, all being important requirements for practical video compression.
Submitted 13 May, 2022;
originally announced May 2022.
-
Limited-memory BFGS Optimisation of Phase-Only Computer-Generated Hologram for Fraunhofer Diffraction
Authors:
Jinze Sha,
Andrew Kadis,
Fan Yang,
Timothy D. Wilkinson
Abstract:
We implement a novel limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimisation algorithm with cross entropy (CE) loss function, to produce phase-only computer-generated hologram (CGH) for holographic displays, with validation on a binary-phase modulation holographic projector.
Submitted 10 May, 2022;
originally announced May 2022.