Transformer-based Learned Image Compression for Joint Decoding and Denoising

Authors: Yi-Hsin Chen, Kuan-Wei Ho, Shiau-Rung Tsai, Guan-Hsun Lin, Alessandro Gnutti, Wen-Hsiao Peng, Riccardo Leonardi

Abstract: This work introduces a Transformer-based image compression system. It has the flexibility to switch between the standard image reconstruction and the denoising reconstruction from a single compressed bitstream. Instead of training separate decoders for these tasks, we incorporate two add-on modules to adapt a pre-trained image decoder from performing the standard image reconstruction to joint decoding and denoising. Our scheme adopts a two-pronged approach. It features a latent refinement module to refine the latent representation of a noisy input image for reconstructing a noise-free image. Additionally, it incorporates an instance-specific prompt generator that adapts the decoding process to improve on the latent refinement. Experimental results show that our method achieves a similar level of denoising quality to training a separate decoder for joint decoding and denoising at the expense of only a modest increase in the decoder's model size and computational complexity.

Submitted 20 February, 2024; originally announced February 2024.

Comments: Accepted to PCS 2024

arXiv:2401.09648 [pdf]

Staggered Comb Reference Signal Design for Integrated Communication and Sensing

Authors: Rui Zhang, Shawn Tsai, Tzu-Han Chou, Jiaying Ren

Abstract: Ambiguity performance is a critical criterion in radar sensor design, which indicates the ambiguities arising from multiple target estimation and detection. We considered a requirement-driven selection of OFDM reference signal (RS) patterns based on ambiguity performances for bi-static sensing in integrated communication and sensing with minimal modifications of current RSs. An RS pattern with a staggering offset of a linear slope that is relatively prime to the RS comb size is suggested for standard-resolution sensing algorithms to obtain the best ambiguity performances. Moreover, an extended guard interval design is proposed to increase the maximum time delay that remains free of inter-symbol interference (ISI) when using post-FFT sensing algorithms. The proposed techniques are promising to extend the distance and speed without ambiguities and ISI for sensing.

Submitted 25 April, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

Comments: accepted by IEEE International Symposium on Personal, Indoor and Mobile Radio Communications. arXiv admin note: substantial text overlap with arXiv:2401.09643
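
The coprimality condition mentioned in the abstract above can be illustrated with a small sketch. It assumes, purely for illustration, that the comb offset on OFDM symbol `l` is `(slope * l) mod comb_size`; the function names and this offset rule are hypothetical and are not taken from the paper. Under that assumption, the offsets visit every subcarrier position within one comb period exactly when the slope is relatively prime to the comb size.

```python
from math import gcd


def staggered_offsets(comb_size, slope, num_symbols=None):
    """Comb offset per OFDM symbol under the assumed rule (slope * l) mod comb_size."""
    if num_symbols is None:
        num_symbols = comb_size
    return [(slope * l) % comb_size for l in range(num_symbols)]


def covers_all_offsets(comb_size, slope):
    """True if comb_size consecutive symbols occupy every offset exactly once."""
    return len(set(staggered_offsets(comb_size, slope))) == comb_size


if __name__ == "__main__":
    comb_size = 6
    for slope in range(1, comb_size):
        print(slope, gcd(slope, comb_size) == 1, staggered_offsets(comb_size, slope))
```

For a comb size of 6, only slopes 1 and 5 fill all six offsets in this toy model; slopes 2, 3, and 4 leave some offsets unused.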

arXiv:2401.09643 [pdf]

OFDM Reference Signal Pattern Design Criteria for Integrated Communication and Sensing

Authors: Rui Zhang, Shawn Tsai, Tzu-Han Chou, Jiaying Ren, Wenze Qu, Oliver Sun

Abstract: Ambiguity performance, which indicates the maximum detectable region for target parameter estimation, is critical to radar sensor design. Driven by ambiguity performance requirements of bi-static sensing, we propose design criteria for orthogonal frequency division multiplexing (OFDM) reference signal (RS) patterns. The design not only reduces ambiguities in both time delay and Doppler shift domains under different types of sensing algorithms, but also reduces resource overhead for integrated communication and sensing. With minimal modifications of post-FFT processing for current RS patterns, the guard interval is extended beyond the conventional cyclic prefix (CP), while maintaining inter-symbol-interference-(ISI)-free delay estimation. For standard-resolution sensing algorithms, a staggering offset of a linear slope that is relatively prime to the RS comb size is suggested. As for high-resolution sensing algorithms, necessary and sufficient conditions of comb RS staggering offsets, plus new patterns synthesized therefrom, are derived for the corresponding achievable ambiguity performance. Furthermore, we generalize the RS pattern design criterion for high-resolution sensing algorithms to irregular forms, which minimizes the number of resource elements (REs) for associated algorithms to eliminate all side peaks. Starting from the staggered comb pattern in the current positioning RS, our generalized design eventually removes any regular form for ultimate flexibility. Overall, the proposed techniques are promising to extend the ISI- and ambiguity-free range of distance and speed estimates for radar sensing.

Submitted 25 April, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

arXiv:2312.14495 [pdf, other]

Beam Foreseeing in Millimeter-Wave Systems with Situational Awareness: Fundamental Limits via Cramér-Rao Lower Bound

Authors: Wan-Ting Shih, Chao-Kai Wen, Shang-Ho Tsai, Shi Jin, Chau Yuen

Abstract: Millimeter-wave (mmWave) networks offer the potential for high-speed data transfer and precise localization, leveraging large antenna arrays and extensive bandwidths. However, these networks are challenged by significant path loss and susceptibility to blockages. In this study, we delve into the use of situational awareness for beam prediction within the 5G NR beam management framework. We introduce an analytical framework based on the Cramér-Rao Lower Bound, enabling the quantification of 6D position-related information of geometric reflectors. This includes both 3D locations and 3D orientation biases, facilitating accurate determinations of the beamforming gain achievable by each reflector or candidate beam. This framework empowers us to predict beam alignment performance at any given location in the environment, ensuring uninterrupted wireless access. Our analysis offers critical insights for choosing the most effective beam and antenna module strategies, particularly in scenarios where communication stability is threatened by blockages. Simulation results show that our approach closely approximates the performance of an ideal, Oracle-based solution within the existing 5G NR beam management system.

Submitted 22 December, 2023; originally announced December 2023.

Comments: 16 pages, 10 figures; IEEE Transactions on Wireless Communications

arXiv:2306.06653 [pdf, other]

Mandarin Electrolaryngeal Speech Voice Conversion using Cross-domain Features

Authors: Hsin-Hao Chen, Yung-Lun Chien, Ming-Chi Yen, Shu-Wei Tsai, Yu Tsao, Tai-shih Chi, Hsin-Min Wang

Abstract: Patients who have had their entire larynx removed, including the vocal folds, owing to throat cancer may experience difficulties in speaking. In such cases, electrolarynx devices are often prescribed to produce speech, which is commonly referred to as electrolaryngeal speech (EL speech). However, the quality and intelligibility of EL speech are poor. To address this problem, EL voice conversion (ELVC) is a method used to improve the intelligibility and quality of EL speech. In this paper, we propose a novel ELVC system that incorporates cross-domain features, specifically spectral features and self-supervised learning (SSL) embeddings. The experimental results show that applying cross-domain features can notably improve the conversion performance for the ELVC task compared with utilizing only traditional spectral features.

Submitted 11 June, 2023; originally announced June 2023.

Comments: Accepted to INTERSPEECH 2023

arXiv:2306.06652 [pdf, other]

Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion

Authors: Yung-Lun Chien, Hsin-Hao Chen, Ming-Chi Yen, Shu-Wei Tsai, Hsin-Min Wang, Yu Tsao, Tai-Shih Chi

Abstract: Electrolarynx is a commonly used assistive device to help patients with removed vocal cords regain their ability to speak. Although the electrolarynx can generate excitation signals like the vocal cords, the naturalness and intelligibility of electrolaryngeal (EL) speech are very different from those of natural (NL) speech. Many deep-learning-based models have been applied to electrolaryngeal speech voice conversion (ELVC) for converting EL speech to NL speech. In this study, we propose a multimodal voice conversion (VC) model that integrates acoustic and visual information into a unified network. We compared different pre-trained models as visual feature extractors and evaluated the effectiveness of these features in the ELVC task. The experimental results demonstrate that the proposed multimodal VC model outperforms single-modal models in both objective and subjective metrics, suggesting that the integration of visual information can significantly improve the quality of ELVC.

Submitted 11 June, 2023; originally announced June 2023.

Comments: Accepted to INTERSPEECH 2023

arXiv:2109.03551 [pdf, other]

Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion

Authors: Yi-Syuan Liou, Wen-Chin Huang, Ming-Chi Yen, Shu-Wei Tsai, Yu-Huai Peng, Tomoki Toda, Yu Tsao, Hsin-Min Wang

Abstract: Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device. In frame-based VC methods, time alignment needs to be performed prior to model training, and the dynamic time warping (DTW) algorithm is widely adopted to compute the best time alignment between each utterance pair. The validity is based on the assumption that the same phonemes of the speakers have similar features and can be mapped by measuring a pre-defined distance between speech frames of the source and the target. However, the special characteristics of the EL speech can break the assumption, resulting in a sub-optimal DTW alignment. In this work, we propose to use lip images for time alignment, as we assume that the lip movements of laryngectomees remain normal compared to healthy people. We investigate two naive lip representations and distance metrics, and experimental results demonstrate that the proposed method can significantly outperform the audio-only alignment in terms of objective and subjective evaluations.

Submitted 8 September, 2021; originally announced September 2021.

Comments: Accepted to APSIPA ASC 2021
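
For readers unfamiliar with the dynamic time warping step referred to in the abstract above, the following is a minimal sketch of textbook DTW on two one-dimensional sequences. It is not the paper's method: the function name `dtw_align`, the scalar frames, and the absolute-difference distance are illustrative assumptions, whereas the paper aligns utterance pairs using lip-image representations and its own distance metrics.

```python
def dtw_align(source, target, dist=lambda a, b: abs(a - b)):
    """Classic DTW: return (total cost, list of aligned (source_idx, target_idx) pairs)."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    total, path = dtw_align([1, 2, 3], [1, 2, 2, 3])
    print(total, path)
```

In a frame-based VC pipeline, the returned index pairs would determine which source frame is paired with which target frame for training; swapping `dist` for a distance over lip features is the kind of change the abstract describes.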

Showing 1–7 of 7 results for author: Tsai, S