Search | arXiv e-print repository

Double Privacy Guard: Robust Traceable Adversarial Watermarking against Face Recognition

Authors: Yunming Zhang, Dengpan Ye, Sipeng Shen, Caiyun Xie, Ziyi Liu, Jiacheng Deng, Long Tang

Abstract: The wide deployment of Face Recognition (FR) systems poses risks of privacy leakage. One countermeasure to address this issue is adversarial attacks, which deceive malicious FR searches but simultaneously interfere with the normal identity verification of trusted authorizers. In this paper, we propose the first Double Privacy Guard (DPG) scheme based on traceable adversarial watermarking. DPG employs a one-time watermark embedding to deceive unauthorized FR models and allows authorizers to perform identity verification by extracting the watermark. Specifically, we propose an information-guided adversarial attack against FR models. The encoder embeds an identity-specific watermark into the deep feature space of the carrier, guiding recognizable features of the image to deviate from the source identity. We further adopt a collaborative meta-optimization strategy compatible with sub-tasks, which regularizes the joint optimization direction of the encoder and decoder. This strategy enhances the representation of universal carrier features, mitigating multi-objective optimization conflicts in watermarking. Experiments confirm that DPG achieves significant attack success rates and traceability accuracy on state-of-the-art FR models, exhibiting remarkable robustness that outperforms the existing privacy protection methods using adversarial attacks and deep watermarking, or simple combinations of the two. Our work potentially opens up new insights into proactive protection for FR privacy.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2403.15735 [pdf, other]

3D-TransUNet for Brain Metastases Segmentation in the BraTS2023 Challenge

Authors: Siwei Yang, Xianhang Li, Jieru Mei, Jieneng Chen, Cihang Xie, Yuyin Zhou

Abstract: Segmenting brain tumors is complex due to their diverse appearances and scales. Brain metastases, the most common type of brain tumor, are a frequent complication of cancer. Therefore, an effective segmentation model for brain metastases must adeptly capture local intricacies to delineate small tumor regions while also integrating global context to understand broader scan features. The TransUNet model, which combines Transformer self-attention with U-Net's localized information, emerges as a promising solution for this task. In this report, we address brain metastases segmentation by training the 3D-TransUNet model on the Brain Tumor Segmentation (BraTS-METS) 2023 challenge dataset. Specifically, we explored two architectural configurations: the Encoder-only 3D-TransUNet, employing Transformers solely in the encoder, and the Decoder-only 3D-TransUNet, utilizing Transformers exclusively in the decoder. For Encoder-only 3D-TransUNet, we note that Masked-Autoencoder pre-training is required for a better initialization of the Transformer Encoder and thus accelerates the training process. We identify that the Decoder-only 3D-TransUNet model should offer enhanced efficacy in the segmentation of brain metastases, as indicated by our 5-fold cross-validation on the training set. However, our use of the Encoder-only 3D-TransUNet model already yields notable results, with an average lesion-wise Dice score of 59.8% on the test set, securing second place in the BraTS-METS 2023 challenge.

Submitted 23 March, 2024; originally announced March 2024.

arXiv:2306.03494 [pdf, other]

LegoNet: Alternating Model Blocks for Medical Image Segmentation

Authors: Ikboljon Sobirov, Cheng Xie, Muhammad Siddique, Parijat Patel, Kenneth Chan, Thomas Halborg, Christos Kotanidis, Zarqiash Fatima, Henry West, Keith Channon, Stefan Neubauer, Charalambos Antoniades, Mohammad Yaqub

Abstract: Since the emergence of convolutional neural networks (CNNs), and later vision transformers (ViTs), the common paradigm for model development has always been using a set of identical block types with varying parameters/hyper-parameters. To leverage the benefits of different architectural designs (e.g. CNNs and ViTs), we propose to alternate structurally different types of blocks to generate a new architecture, mimicking how Lego blocks can be assembled together. Using two CNN-based blocks and one SwinViT-based block, we investigate three variations of the so-called LegoNet, which applies the new concept of block alternation to the segmentation task in medical imaging. We also study a new clinical problem which has not been investigated before, namely the right internal mammary artery (RIMA) and perivascular space segmentation from computed tomography angiography (CTA), which has demonstrated a prognostic value to major cardiovascular outcomes. We compare the model performance against popular CNN and ViT architectures using two large datasets (e.g. achieving 0.749 dice similarity coefficient (DSC) on the larger dataset). We evaluate the performance of the model on three external testing cohorts as well, where an expert clinician made corrections to the model segmented results (DSC>0.90 for the three cohorts). To assess our proposed model for suitability in clinical use, we perform intra- and inter-observer variability analysis. Finally, we investigate a joint self-supervised learning approach to assess its impact on model performance. The code and the pretrained model weights will be available upon acceptance.

Submitted 6 June, 2023; originally announced June 2023.

Comments: 12 pages, 5 figures, 4 tables

arXiv:2304.00440 [pdf, other]

Near-Field Channel Estimation for Extremely Large-Scale Reconfigurable Intelligent Surface (XL-RIS)-Aided Wideband mmWave Systems

Authors: Songjie Yang, Chenfei Xie, Wanting Lyu, Boyu Ning, Zhongpei Zhang, Chau Yuen

Abstract: Near-field communications present new opportunities over near-field channels; however, the spherical wavefront propagation makes near-field signal processing challenging. In this context, this paper proposes efficient near-field channel estimation methods for wideband MIMO mmWave systems with the aid of extremely large-scale reconfigurable intelligent surfaces (XL-RIS). For the wideband signals reflected by the analog RIS, we characterize their near-field beam squint effect in both angle and distance domains. Based on the mathematical analysis of the near-field beam patterns over all frequencies, a wideband spherical-domain dictionary is constructed by minimizing the coherence of two arbitrary beams. In light of this, we formulate a two-dimensional compressive sensing problem to recover the channel parameter based on the spherical-domain sparsity of mmWave channels. To this end, we present a correlation coefficient-based atom matching method within our proposed multi-frequency parallelizable subspace recovery framework for efficient solutions. Additionally, we propose a two-dimensional oracle estimator as a benchmark and derive its lower bound across all subcarriers. Our findings emphasize the significance of system hyperparameters and the sensing matrix of each subcarrier in determining the accuracy of the estimation. Finally, numerical results show that our proposed method achieves considerable performance compared with the lower bound and has a time complexity linear in the number of RIS elements.

Submitted 1 April, 2023; originally announced April 2023.

arXiv:2303.09170 [pdf, other]

NLUT: Neural-based 3D Lookup Tables for Video Photorealistic Style Transfer

Authors: Yaosen Chen, Han Yang, Yuexin Yang, Yuegen Liu, Wei Wang, Xuming Wen,

Abstract: Video photorealistic style transfer is desired to generate videos with a similar photorealistic style to the style image while maintaining temporal consistency. However, existing methods obtain stylized video sequences by performing frame-by-frame photorealistic style transfer, which is inefficient and does not ensure the temporal consistency of the stylized video. To address this issue, we use neural network-based 3D Lookup Tables (LUTs) for the photorealistic transfer of videos, achieving a balance between efficiency and effectiveness. We first train a neural network for generating photorealistic stylized 3D LUTs on a large-scale dataset; then, when performing photorealistic style transfer for a specific video, we select a keyframe and style image in the video as the data source and fine-tune the neural network; finally, we query the 3D LUTs generated by the fine-tuned neural network for the colors in the video, resulting in a super-fast photorealistic style transfer: even processing 8K video takes less than 2 milliseconds per frame. The experimental results show that our method not only realizes the photorealistic style transfer of arbitrary style images but also outperforms the existing methods in terms of visual quality and consistency. Project page: https://semchan.github.io/NLUT_Project.

Submitted 17 March, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

arXiv:2210.08181 [pdf, other]

Panchromatic and Multispectral Image Fusion via Alternating Reverse Filtering Network

Authors: Keyu Yan, Man Zhou, Jie Huang, Feng Zhao, Chengjun Xie, Chongyi Li, Danfeng Hong

Abstract: Panchromatic (PAN) and multi-spectral (MS) image fusion, named Pan-sharpening, refers to super-resolving the low-resolution (LR) multi-spectral (MS) images in the spatial domain to generate the expected high-resolution (HR) MS images, conditioning on the corresponding high-resolution PAN images. In this paper, we present a simple yet effective alternating reverse filtering network for pan-sharpening. Inspired by the classical reverse filtering that reverses images to the status before filtering, we formulate pan-sharpening as an alternately iterative reverse filtering process, which fuses LR MS and HR MS in an interpretable manner. Different from existing model-driven methods that require well-designed priors and degradation assumptions, the reverse filtering process avoids the dependency on pre-defined exact priors. To guarantee the stability and convergence of the iterative process via contraction mapping on a metric space, we develop the learnable multi-scale Gaussian kernel module, instead of using specific filters. We demonstrate the theoretical feasibility of such formulations. Extensive experiments on diverse scenes thoroughly verify the performance of our method, which significantly outperforms the state of the art.

Submitted 14 October, 2022; originally announced October 2022.

Journal ref: NeurIPS2022

arXiv:2207.14182 [pdf, other]

Channel Estimation for Reconfigurable Intelligent Surface-Assisted Cell-Free Communications

Authors: Songjie Yang, Chenfei Xie, Mingwei Wang, Zhongpei Zhang

Abstract: Recent research has focused on reconfigurable intelligent surface (RIS)-assisted cell-free systems with the goal of enhancing coverage and lowering the cost of cell-free networks. However, current research makes the assumption that the perfect channel state information is known. Channel acquisition is, certainly, a difficulty in this case. This work is aimed at investigating RIS-assisted cell-free channel estimation. Toward this end, two unique characteristics are pointed out: 1) For all users, a common channel exists between the base station (BS) and the RIS; and 2) For all BSs, a common channel exists between the RIS and the user. Based on these two characteristics, cascaded and two-timescale channel estimation concerns are studied. Subsequently, two solutions for tackling the two issues are presented respectively: a three-dimensional multiple measurement vector (3D-MMV)-based compressive sensing technique and a multi-BS cooperative pilot-reduced methodology. Finally, simulations illustrate the effectiveness of the schemes we have presented.

Submitted 28 July, 2022; originally announced July 2022.

arXiv:2207.14107 [pdf, other]

Fast Compressive Channel Estimation for MmWave MIMO Hybrid Beamforming Systems

Authors: Songjie Yang, Chenfei Xie, Dongli Wang, Zhongpei Zhang

Abstract: Given the high degree of computational complexity of the channel estimation technique based on the conventional one-dimensional (1-D) compressive sensing (CS) framework employed in the hybrid beamforming architecture, this study proposes two low-complexity channel estimation strategies. One is two-stage CS, which exploits row-group sparsity to estimate angle-of-arrival (AoA) first and uses the conventional 1-D CS method to obtain angle-of-departure (AoD). The other is two-dimensional (2-D) CS, which utilizes a 2-D dictionary to reconstruct the 2-D sparse signal. To conduct a meaningful comparison of the three CS frameworks, i.e., 1-D, two-stage and 2-D CS, the orthogonal matching pursuit (OMP) algorithm is employed as the basic algorithm and is expanded to two variants for the proposed frameworks. Analysis and simulations demonstrate that, compared with the 1-D CS method, two-stage CS has somewhat lower performance but significantly lower computational complexity, while 2-D CS is not only the same as 1-D CS in terms of performance but also slightly lower in computational complexity than two-stage CS.

Submitted 28 July, 2022; originally announced July 2022.

arXiv:2206.15155 [pdf, other]

An Evaluation of Three-Stage Voice Conversion Framework for Noisy and Reverberant Conditions

Authors: Yeonjong Choi, Chao Xie, Tomoki Toda

Abstract: This paper presents a new voice conversion (VC) framework capable of dealing with both additive noise and reverberation, and its performance evaluation. Several VC studies have focused on real-world circumstances where speech data are corrupted by background noise and reverberation. To deal with more practical conditions where no clean target dataset is available, one possible approach is zero-shot VC, but its performance tends to degrade compared with VC using a sufficient amount of target speech data. To leverage a large amount of noisy-reverberant target speech data, we propose a three-stage VC framework based on a denoising process using a pretrained denoising model, a dereverberation process using a dereverberation model, and a VC process using a nonparallel VC model based on a variational autoencoder. The experimental results show that 1) noise and reverberation additively cause significant VC performance degradation, 2) the proposed method alleviates the adverse effects caused by both noise and reverberation, and significantly outperforms the baseline directly trained on the noisy-reverberant speech data, and 3) the potential degradation introduced by the denoising and dereverberation still causes noticeable adverse effects on VC performance.

Submitted 30 June, 2022; originally announced June 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2206.07142 [pdf]

Experimental Comparison of PAM-8 Probabilistic Shaping with Different Gaussian Orders at 200 Gb/s Net Rate in IM/DD System with O-Band TOSA

Authors: Md Sabbir-Bin Hossain, Georg Böcherer, Youxi Lin, Shuangxu Li, Stefano Calabrò, Andrei Nedelcu, Talha Rahman, Tom Wettlin, Jinlong Wei, Nebojša Stojanović, Changsong Xie, Maxim Kuschnerov, Stephan Pachnicke

Abstract: For 200Gb/s net rates, cap probabilistic shaped PAM-8 with different Gaussian orders are experimentally compared against uniform PAM-8. In back-to-back and 5km measurements, cap-shaped 85-GBd PAM-8 with Gaussian order of 5 outperforms 71-GBd uniform PAM-8 by up to 2.90dB and 3.80dB in receiver sensitivity, respectively.

Submitted 14 June, 2022; originally announced June 2022.

Comments: submitted to 2022 European Conference on Optical Communication (ECOC)

arXiv:2205.12781 [pdf, other]

doi 10.1145/3457388.3458656

Ultra-compact Binary Neural Networks for Human Activity Recognition on RISC-V Processors

Authors: Francesco Daghero, Chen Xie, Daniele Jahier Pagliari, Alessio Burrello, Marco Castellano, Luca Gandolfi, Andrea Calimera, Enrico Macii, Massimo Poncino

Abstract: Human Activity Recognition (HAR) is a relevant inference task in many mobile applications. State-of-the-art HAR at the edge is typically achieved with lightweight machine learning models such as decision trees and Random Forests (RFs), whereas deep learning is less common due to its high computational complexity. In this work, we propose a novel implementation of HAR based on deep neural networks, and precisely on Binary Neural Networks (BNNs), targeting low-power general purpose processors with a RISC-V instruction set. BNNs yield very small memory footprints and low inference complexity, thanks to the replacement of arithmetic operations with bit-wise ones. However, existing BNN implementations on general purpose processors impose constraints tailored to complex computer vision tasks, which result in over-parametrized models for simpler problems like HAR. Therefore, we also introduce a new BNN inference library, which targets ultra-compact models explicitly. With experiments on a single-core RISC-V processor, we show that BNNs trained on two HAR datasets obtain higher classification accuracy compared to a state-of-the-art baseline based on RFs. Furthermore, our BNN reaches the same accuracy of a RF with either less memory (up to 91%) or more energy-efficiency (up to 70%), depending on the complexity of the features extracted by the RF.

Submitted 25 May, 2022; originally announced May 2022.

Comments: Published in: 2021 18th ACM International Conference on Computing Frontiers (CF)

Journal ref: 18th ACM International Conference on Computing Frontiers (CF), 2021, pp. 3-11

arXiv:2205.08805 [pdf]

doi 10.1109/ECOC52684.2021.9605995

Experimental Comparison of Cap and Cup Probabilistically Shaped PAM for O-Band IM/DD Transmission System

Authors: Md Sabbir-Bin Hossain, Georg Boecherer, Talha Rahman, Nebojsa Stojanovic, Patrick Schulte, Stefano Calabrò, Jinlong Wei, Christian Bluemm, Tom Wettlin, Changsong Xie, Maxim Kuschnerov, Stephan Pachnicke

Abstract: For 200Gbit/s net rates, uniform PAM-4, 6 and 8 are experimentally compared against probabilistic shaped PAM-8 cap and cup variants. In back-to-back and 20km measurements, cap shaped 80GBd PAM-8 outperforms 72GBd PAM-8 and 83GBd PAM-6 by up to 3.50dB and 0.8dB in receiver sensitivity, respectively.

Submitted 18 May, 2022; originally announced May 2022.

Comments: Originally published in ECOC-2021. We have updated Figure 3. The change also affects the overall outcome. In contrast to the published version, compared to uniform PAM-8 72 GBd, PS-PAM-8 80 GBd performance is updated to 3.50 dB instead of 5.17 dB, while for PAM-6 83 GBd the gain becomes 0.8 dB instead of 2.17 dB. The changes are adapted in all sections except the experimental setup and DSP section

Journal ref: 2021 European Conference on Optical Communication (ECOC)

arXiv:2204.10541 [pdf, other]

Privacy-preserving Social Distance Monitoring on Microcontrollers with Low-Resolution Infrared Sensors and CNNs

Authors: Chen Xie, Francesco Daghero, Yukai Chen, Marco Castellano, Luca Gandolfi, Andrea Calimera, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari

Abstract: Low-resolution infrared (IR) array sensors offer a low-cost, low-power, and privacy-preserving alternative to optical cameras and smartphones/wearables for social distance monitoring in indoor spaces, permitting the recognition of basic shapes, without revealing the personal details of individuals. In this work, we demonstrate that an accurate detection of social distance violations can be achieved by processing the raw output of an 8x8 IR array sensor with a small-sized Convolutional Neural Network (CNN). Furthermore, the CNN can be executed directly on a Microcontroller (MCU)-based sensor node. With results on a newly collected open dataset, we show that our best CNN achieves 86.3% balanced accuracy, significantly outperforming the 61% achieved by a state-of-the-art deterministic algorithm. Changing the architectural parameters of the CNN, we obtain a rich Pareto set of models, spanning 70.5-86.3% accuracy and 0.18-75k parameters. Deployed on a STM32L476RG MCU, these models have a latency of 0.73-5.33ms, with an energy consumption per inference of 9.38-68.57μJ.

Submitted 22 April, 2022; originally announced April 2022.

Comments: Accepted as a conference paper at the 2022 IEEE International Symposium on Circuits and Systems (ISCAS)

arXiv:2204.08692 [pdf, other]

Time Domain Adversarial Voice Conversion for ADD 2022

Authors: Cheng Wen, Tingwei Guo, Xingjun Tan, Rui Yan, Shuran Zhou, Chuandong Xie, Wei Zou, Xiangang Li

Abstract: In this paper, we describe our speech generation system for the first Audio Deep Synthesis Detection Challenge (ADD 2022). Firstly, we build an any-to-many voice conversion (VC) system to convert source speech with arbitrary language content into the target speaker's fake speech. Then the converted speech generated from VC is post-processed in the time domain to improve the deception ability. The experimental results show that our system has adversarial ability against anti-spoofing detectors with a little compromise in audio quality and speaker similarity. This system ranks top in Track 3.1 in the ADD 2022, showing that our method could also gain good generalization ability against different detectors.

Submitted 19 April, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: Accepted to ICASSP 2022

arXiv:2204.08686 [pdf, ps, other]

Audio-Visual Wake Word Spotting System For MISP Challenge 2021

Authors: Yanguang Xu, Jianwei Sun, Yang Han, Shuaijiang Zhao, Chaoyang Mei, Tingwei Guo, Shuran Zhou, Chuandong Xie, Wei Zou, Xiangang Li

Abstract: This paper presents the details of our system designed for Task 1 of the Multimodal Information Based Speech Processing (MISP) Challenge 2021. The purpose of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, firstly, we take advantage of speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to address the multi-microphone conversational audio. Secondly, several data augmentation techniques are applied to simulate a more realistic far-field scenario. For the video information, the provided region of interest (ROI) is used to obtain visual representation. Then the multi-layer CNN is proposed to learn audio and visual representations, and these representations are fed into our two-branch attention-based network which can be employed for fusion, such as transformer and conformer. The focal loss is used to fine-tune the model and improve the performance significantly. Finally, multiple trained models are integrated by casting vote to achieve our final 0.091 score.

Submitted 19 April, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: Accepted to ICASSP 2022

arXiv:2111.07116 [pdf, other]

Direct Noisy Speech Modeling for Noisy-to-Noisy Voice Conversion

Authors: Chao Xie, Yi-Chiao Wu, Patrick Lumban Tobing, Wen-Chin Huang, Tomoki Toda

Abstract: Beyond the conventional voice conversion (VC) where the speaker information is converted without altering the linguistic content, the background sounds are informative and need to be retained in some real-world scenarios, such as VC in movie/video and VC in music where the voice is entangled with background sounds. As a new VC framework, we have developed a noisy-to-noisy (N2N) VC framework to convert the speaker's identity while preserving the background sounds. Although our framework consisting of a denoising module and a VC module well handles the background sounds, the VC module is sensitive to the distortion caused by the denoising module. To address this distortion issue, in this paper we propose the improved VC module to directly model the noisy speech waveform while controlling the background sounds. The experimental results have demonstrated that our improved framework significantly outperforms the previous one and achieves an acceptable score in terms of naturalness, while reaching comparable similarity performance to the upper bound of our framework.

Submitted 13 November, 2021; originally announced November 2021.

arXiv:2109.10608 [pdf, ps, other]

Noisy-to-Noisy Voice Conversion Framework with Denoising Model

Authors: Chao Xie, Yi-Chiao Wu, Patrick Lumban Tobing, Wen-Chin Huang, Tomoki Toda

Abstract: In a conventional voice conversion (VC) framework, a VC model is often trained with a clean dataset consisting of speech data carefully recorded and selected by minimizing background interference. However, collecting such a high-quality dataset is expensive and time-consuming. Leveraging crowd-sourced speech data in training is more economical. Moreover, for some real-world VC scenarios such as VC in video and VC-based data augmentation for speech recognition systems, the background sounds themselves are also informative and need to be maintained. In this paper, to explore VC with the flexibility of handling background sounds, we propose a noisy-to-noisy (N2N) VC framework composed of a denoising module and a VC module. With the proposed framework, we can convert the speaker's identity while preserving the background sounds. Both objective and subjective evaluations are conducted, and the results reveal the effectiveness of the proposed framework.

Submitted 22 September, 2021; originally announced September 2021.

arXiv:2108.01873 [pdf]

doi 10.1109/LPT.2022.3142538

1.71 Tb/s Single-Channel and 56.51 Tb/s DWDM Transmission over 96.5 km Field-Deployed SSMF

Authors: Fabio Pittala, Ralf-Peter Braun, Georg Boecherer, Patrick Schulte, Maximilian Schaedler, Stefano Bettelli, Stefano Calabro, Maxim Kuschnerov, Andreas Gladisch, Fritz-Joachim Westphal, Changsong Xie, Rongfu Chen, Qibing Wang, Bofang Zheng

Abstract: We report an industry-leading optical dense wavelength division multiplexing (DWDM) field trial with line rates per channel exceeding 1.66 Tb/s using 130 GBaud dual-polarization probabilistic constellation shaping 256-ary quadrature amplitude modulation (DP-PCS256QAM) in a high capacity data center interconnect (DCI) scenario. This research trial was performed on 96.5 km of field-deployed standard single mode G.652 fiber infrastructure of Deutsche Telekom in Germany employing Erbium-doped fiber amplifier (EDFA)-only amplification. A total of 34 channels were transmitted with 150 GHz spacing for a total fiber capacity of 56.51 Tb/s and a spectral efficiency higher than 11 bit/s/Hz. In the single-channel transmission scenario 1.71 Tb/s was achieved over the same link. In addition, we successfully demonstrate record net bitrates of 1.88 Tb/s in back-to-back (B2B) using 130 GBaud DP-PCS400QAM.

Submitted 4 August, 2021; originally announced August 2021.

Comments: This work has been submitted to the IEEE Photonics Technology Letters (PTL) for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:1912.01054 [pdf, other]

The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 Challenge

Authors: Nicholas Heller, Fabian Isensee, Klaus H. Maier-Hein, Xiaoshuai Hou, Chunmei Xie, Fengyi Li, Yang Nan, Guangrui Mu, Zhiyong Lin, Miofei Han, Guang Yao, Yaozong Gao, Yao Zhang, Yixin Wang, Feng Hou, Jiawei Yang, Guangwei Xiong, Jiang Tian, Cheng Zhong, Jun Ma, Jack Rickman, Joshua Dean, Bethany Stai, Resha Tejpaul, Makinna Oestreich, et al. (16 additional authors not shown)

Abstract: There is a large body of literature linking anatomic and geometric characteristics of kidney tumors to perioperative and oncologic outcomes. Semantic segmentation of these tumors and their host kidneys is a promising tool for quantitatively characterizing these lesions, but its adoption is limited due to the manual effort required to produce high-quality 3D segmentations of these structures. Recently, methods based on deep learning have shown excellent results in automatic 3D segmentation, but they require large datasets for training, and there remains little consensus on which methods perform best. The 2019 Kidney and Kidney Tumor Segmentation challenge (KiTS19) was a competition held in conjunction with the 2019 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) which sought to address these issues and stimulate progress on this automatic segmentation problem. A training set of 210 cross sectional CT images with kidney tumors was publicly released with corresponding semantic segmentation masks. 106 teams from five continents used this data to develop automated systems to predict the true segmentation masks on a test set of 90 CT images for which the corresponding ground truth segmentations were kept private. These predictions were scored and ranked according to their average Sørensen-Dice coefficient between the kidney and tumor across all 90 cases. The winning team achieved a Dice of 0.974 for kidney and 0.851 for tumor, approaching the inter-annotator performance on kidney (0.983) but falling short on tumor (0.923). This challenge has now entered an "open leaderboard" phase where it serves as a challenging benchmark in 3D semantic segmentation.

Submitted 7 August, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

Comments: 24 pages, 11 figures

Showing 1–19 of 19 results for author: Xie, C