
Showing 1–48 of 48 results for author: Song, F

Searching in archive eess.
  1. arXiv:2404.03327  [pdf, other]

    cs.CV eess.IV

    DI-Retinex: Digital-Imaging Retinex Theory for Low-Light Image Enhancement

    Authors: Shangquan Sun, Wenqi Ren, Jingyang Peng, Fenglong Song, Xiaochun Cao

    Abstract: Many existing methods for low-light image enhancement (LLIE) based on Retinex theory ignore important factors that affect the validity of this theory in digital imaging, such as noise, quantization error, non-linearity, and dynamic range overflow. In this paper, we propose a new expression called Digital-Imaging Retinex theory (DI-Retinex) through theoretical and experimental analysis of Retinex t…

    Submitted 4 April, 2024; originally announced April 2024.

  2. arXiv:2404.02661  [pdf]

    physics.app-ph eess.SP

    Terahertz channel modeling based on surface sensing characteristics

    Authors: Jiayuan Cui, Da Li, Jiabiao Zhao, Jiacheng Liu, Guohao Liu, Xiangkun He, Yue Su, Fei Song, Peian Li, Jianjun Ma

    Abstract: The dielectric properties of environmental surfaces, including walls, floors and the ground, etc., play a crucial role in shaping the accuracy of terahertz (THz) channel modeling, thereby directly impacting the effectiveness of communication systems. Traditionally, acquiring these properties has relied on methods such as terahertz time-domain spectroscopy (THz-TDS) or vector network analyzers (VNA…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: Submitted to Nano Communication Networks

  3. arXiv:2401.17133  [pdf, other]

    cs.SD cs.AI cs.CR cs.LG cs.MM eess.AS

    A Proactive and Dual Prevention Mechanism against Illegal Song Covers empowered by Singing Voice Conversion

    Authors: Guangke Chen, Yedi Zhang, Fu Song, Ting Wang, Xiaoning Du, Yang Liu

    Abstract: Singing voice conversion (SVC) automates song covers by converting one singer's singing voice into another target singer's singing voice with the original lyrics and melody. However, it raises serious concerns about copyright and civil right infringements to multiple entities. This work proposes SongBsAb, the first proactive approach to mitigate unauthorized SVC-based illegal song covers. SongBsAb…

    Submitted 30 January, 2024; originally announced January 2024.

  4. arXiv:2312.13310  [pdf, other]

    eess.IV cs.CV

    Computational Spectral Imaging with Unified Encoding Model: A Comparative Study and Beyond

    Authors: Xinyuan Liu, Lizhi Wang, Lingen Li, Chang Chen, Xue Hu, Fenglong Song, Youliang Yan

    Abstract: Computational spectral imaging is drawing increasing attention owing to the snapshot advantage, and amplitude, phase, and wavelength encoding systems are three types of representative implementations. Fairly comparing and understanding the performance of these systems is essential, but challenging due to the heterogeneity in encoding design. To overcome this limitation, we propose the unified enco…

    Submitted 20 December, 2023; originally announced December 2023.

  5. arXiv:2312.12833  [pdf, other]

    eess.IV cs.CV

    Learning Exhaustive Correlation for Spectral Super-Resolution: Where Spatial-Spectral Attention Meets Linear Dependence

    Authors: Hongyuan Wang, Lizhi Wang, Jiang Xu, Chang Chen, Xue Hu, Fenglong Song, Youliang Yan

    Abstract: Spectral super-resolution that aims to recover hyperspectral image (HSI) from easily obtainable RGB image has drawn increasing interest in the field of computational photography. The crucial aspect of spectral super-resolution lies in exploiting the correlation within HSIs. However, two types of bottlenecks in existing Transformers limit performance improvement and practical applications. First, e…

    Submitted 18 March, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

  6. arXiv:2309.07983  [pdf, other]

    cs.CR cs.LG cs.MM cs.SD eess.AS

    SLMIA-SR: Speaker-Level Membership Inference Attacks against Speaker Recognition Systems

    Authors: Guangke Chen, Yedi Zhang, Fu Song

    Abstract: Membership inference attacks allow adversaries to determine whether a particular example was contained in the model's training dataset. While previous works have confirmed the feasibility of such attacks in various applications, none has focused on speaker recognition (SR), a promising voice-based biometric recognition technique. In this work, we propose SLMIA-SR, the first membership inference at…

    Submitted 27 November, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: In Proceedings of the 31st Network and Distributed System Security (NDSS) Symposium, 2024

  7. ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading

    Authors: Yujia Xiao, Shaofei Zhang, Xi Wang, Xu Tan, Lei He, Sheng Zhao, Frank K. Soong, Tan Lee

    Abstract: While state-of-the-art Text-to-Speech systems can generate natural speech of very high quality at sentence level, they still meet great challenges in speech generation for paragraph / long-form reading. Such deficiencies are due to i) ignorance of cross-sentence contextual information, and ii) high computation and memory cost for long-form synthesis. To address these issues, this work develops a l…

    Submitted 7 October, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

    Comments: 5 pages, 4 figures, Proceedings of Interspeech 2023

  8. arXiv:2305.14097  [pdf, other]

    cs.CR cs.LG cs.MM cs.SD eess.AS

    QFA2SR: Query-Free Adversarial Transfer Attacks to Speaker Recognition Systems

    Authors: Guangke Chen, Yedi Zhang, Zhe Zhao, Fu Song

    Abstract: Current adversarial attacks against speaker recognition systems (SRSs) require either white-box access or heavy black-box queries to the target SRS, thus still falling behind practical attacks against proprietary commercial APIs and voice-controlled devices. To fill this gap, we propose QFA2SR, an effective and imperceptible query-free black-box attack, by leveraging the transferability of adversa…

    Submitted 23 September, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted by the 32nd USENIX Security Symposium (2023 USENIX Security); Full Version

  9. arXiv:2211.03058  [pdf, other]

    cs.CV eess.IV

    Towards Real World HDRTV Reconstruction: A Data Synthesis-based Approach

    Authors: Zhen Cheng, Tao Wang, Yong Li, Fenglong Song, Chang Chen, Zhiwei Xiong

    Abstract: Existing deep learning based HDRTV reconstruction methods assume one kind of tone mapping operators (TMOs) as the degradation procedure to synthesize SDRTV-HDRTV pairs for supervised training. In this paper, we argue that, although traditional TMOs exploit efficient dynamic range compression priors, they have several drawbacks on modeling the realistic degradation: information over-preservation, c…

    Submitted 6 November, 2022; originally announced November 2022.

  10. arXiv:2210.11153  [pdf, other]

    eess.IV cs.CV

    Reversed Image Signal Processing and RAW Reconstruction. AIM 2022 Challenge Report

    Authors: Marcos V. Conde, Radu Timofte, Yibin Huang, Jingyang Peng, Chang Chen, Cheng Li, Eduardo Pérez-Pellitero, Fenglong Song, Furui Bai, Shuai Liu, Chaoyu Feng, Xiaotao Wang, Lei Lei, Yu Zhu, Chenghua Li, Yingying Jiang, Yong A, Peisong Wang, Cong Leng, Jian Cheng, Xiaoyu Liu, Zhicun Yin, Zhilu Zhang, Junyi Li, Ming Liu , et al. (18 additional authors not shown)

    Abstract: Cameras capture sensor RAW images and transform them into pleasant RGB images, suitable for the human eyes, using their integrated Image Signal Processor (ISP). Numerous low-level vision tasks operate in the RAW domain (e.g. image denoising, white balance) due to its linear relationship with the scene irradiance, wide-range of information at 12bits, and sensor designs. Despite this, RAW image data…

    Submitted 20 October, 2022; originally announced October 2022.

    Comments: ECCV 2022 Advances in Image Manipulation (AIM) workshop

  11. arXiv:2209.10887  [pdf, other]

    cs.SD cs.CL eess.AS

    A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS

    Authors: Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng

    Abstract: We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis. A vector-quantized, variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by down-sampling progressively in multiple stages into MSMC Representations (MSMCRs) with different time resolutions, and quantizing them with multiple VQ codebooks,…

    Submitted 22 September, 2022; originally announced September 2022.

  12. arXiv:2209.06484  [pdf, other]

    cs.SD cs.CL eess.AS

    ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS

    Authors: Liumeng Xue, Frank K. Soong, Shaofei Zhang, Lei Xie

    Abstract: Recent advancements in neural end-to-end TTS models have shown high-quality, natural synthesized speech in a conventional sentence-based TTS. However, it is still challenging to reproduce similar high quality when a whole paragraph is considered in TTS, where a large amount of contextual information needs to be considered in building a paragraph-based TTS model. To alleviate the difficulty in trai…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing

  13. arXiv:2206.09611  [pdf, other]

    eess.IV cs.CV

    SJ-HD^2R: Selective Joint High Dynamic Range and Denoising Imaging for Dynamic Scenes

    Authors: Wei Li, Shuai Xiao, Tianhong Dai, Shanxin Yuan, Tao Wang, Cheng Li, Fenglong Song

    Abstract: Ghosting artifacts, motion blur, and low fidelity in highlight are the main challenges in High Dynamic Range (HDR) imaging from multiple Low Dynamic Range (LDR) images. These issues come from using the medium-exposed image as the reference frame in previous methods. To deal with them, we propose to use the under-exposed image as the reference to avoid these issues. However, the heavy noise in dark…

    Submitted 3 November, 2022; v1 submitted 20 June, 2022; originally announced June 2022.

  14. arXiv:2206.03393  [pdf, other]

    cs.SD cs.AI cs.CR cs.LG eess.AS

    Towards Understanding and Mitigating Audio Adversarial Examples for Speaker Recognition

    Authors: Guangke Chen, Zhe Zhao, Fu Song, Sen Chen, Lingling Fan, Feng Wang, Jiashui Wang

    Abstract: Speaker recognition systems (SRSs) have recently been shown to be vulnerable to adversarial attacks, raising significant security concerns. In this work, we systematically investigate transformation and adversarial training based defenses for securing SRSs. According to the characteristic of SRSs, we present 22 diverse transformations and thoroughly evaluate them using 7 recent promising adversari…

    Submitted 7 June, 2022; originally announced June 2022.

  15. arXiv:2206.03351  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    AS2T: Arbitrary Source-To-Target Adversarial Attack on Speaker Recognition Systems

    Authors: Guangke Chen, Zhe Zhao, Fu Song, Sen Chen, Lingling Fan, Yang Liu

    Abstract: Recent work has illuminated the vulnerability of speaker recognition systems (SRSs) against adversarial attacks, raising significant security concerns in deploying SRSs. However, they considered only a few settings (e.g., some combinations of source and target speakers), leaving many interesting and important settings in real-world attack scenarios alone. In this work, we present AS2T, the first a…

    Submitted 7 June, 2022; originally announced June 2022.

  16. arXiv:2205.04421  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

    Authors: Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, Tie-Yan Liu

    Abstract: Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise that whether a TTS system can achieve human-level quality, how to define/judge that quality and how to achieve it. In this paper, we answer these questions by first defining the human-level quality based on the statistical significance of subjective measure and introducing app…

    Submitted 10 May, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

    Comments: 19 pages, 3 figures, 8 tables

  17. arXiv:2202.12028  [pdf, other]

    eess.SP cs.LG

    Evolutionary Multi-Objective Reinforcement Learning Based Trajectory Control and Task Offloading in UAV-Assisted Mobile Edge Computing

    Authors: Fuhong Song, Huanlai Xing, Xinhan Wang, Shouxi Luo, Penglin Dai, Zhiwen Xiao, Bowen Zhao

    Abstract: This paper studies the trajectory control and task offloading (TCTO) problem in an unmanned aerial vehicle (UAV)-assisted mobile edge computing system, where a UAV flies along a planned trajectory to collect computation tasks from smart devices (SDs). We consider a scenario that SDs are not directly connected by the base station (BS) and the UAV has two roles to play: MEC server or wireless relay.…

    Submitted 24 February, 2022; originally announced February 2022.

  18. arXiv:2201.09472  [pdf, other]

    cs.SD eess.AS

    Disentangling Style and Speaker Attributes for TTS Style Transfer

    Authors: Xiaochun An, Frank K. Soong, Lei Xie

    Abstract: End-to-end neural TTS has shown improved performance in speech style transfer. However, the improvement is still limited by the available training data in both target styles and speakers. Additionally, degenerated performance is observed when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach…

    Submitted 24 January, 2022; originally announced January 2022.

  19. arXiv:2112.13513  [pdf]

    eess.IV cs.CV cs.LG

    MSHT: Multi-stage Hybrid Transformer for the ROSE Image Analysis of Pancreatic Cancer

    Authors: Tianyi Zhang, Yunlu Feng, Yu Zhao, Guangda Fan, Aiming Yang, Shangqin Lyu, Peng Zhang, Fan Song, Chenbin Ma, Yangyang Sun, Youdan Feng, Guanglei Zhang

    Abstract: Pancreatic cancer is one of the most malignant cancers in the world, which deteriorates rapidly with very high mortality. The rapid on-site evaluation (ROSE) technique innovates the workflow by immediately analyzing the fast stained cytopathological images with on-site pathologists, which enables faster diagnosis in this time-pressured process. However, the wider expansion of ROSE diagnosis has be…

    Submitted 27 December, 2021; originally announced December 2021.

    Comments: 12 pages, 10 figures

  20. arXiv:2110.09698  [pdf, other]

    cs.SD cs.CL eess.AS

    Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

    Authors: Mutian He, Jingzhou Yang, Lei He, Frank K. Soong

    Abstract: End-to-end TTS requires a large amount of speech/text paired data to cover all necessary knowledge, particularly how to pronounce different words in diverse contexts, so that a neural model may learn such knowledge accordingly. But in real applications, such high demand of training data is hard to be satisfied and additional knowledge often needs to be injected manually. For example, to capture pr…

    Submitted 24 June, 2022; v1 submitted 18 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures; accepted by Interspeech 2022

  21. arXiv:2110.07274  [pdf, other]

    cs.CL cs.SD eess.AS

    An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings

    Authors: Wenxuan Ye, Shaoguang Mao, Frank Soong, Wenshan Wu, Yan Xia, Jonathan Tien, Zhiyong Wu

    Abstract: Many mispronunciation detection and diagnosis (MD&D) research approaches try to exploit both the acoustic and linguistic features as input. Yet the improvement of the performance is limited, partially due to the shortage of large amount annotated training data at the phoneme level. Phonetic embeddings, extracted from ASR models trained with huge amount of word level annotations, can serve as a goo…

    Submitted 31 March, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: Accepted by ICASSP 2022

  22. arXiv:2110.04735  [pdf, other]

    eess.IV

    Prior Attention Network for Multi-Lesion Segmentation in Medical Images

    Authors: Xiangyu Zhao, Peng Zhang, Fan Song, Chenbin Ma, Guangda Fan, Yangyang Sun, Youdan Feng, Guanglei Zhang

    Abstract: The accurate segmentation of multiple types of lesions from adjacent tissues in medical images is significant in clinical practice. Convolutional neural networks (CNNs) based on the coarse-to-fine strategy have been widely used in this field. However, multi-lesion segmentation remains to be challenging due to the uncertainty in size, contrast, and high interclass similarity of tissues. In addition…

    Submitted 10 October, 2021; originally announced October 2021.

    Comments: 10 pages

  23. arXiv:2109.01766  [pdf, other]

    cs.CR cs.LG cs.MM cs.SD eess.AS

    SEC4SR: A Security Analysis Platform for Speaker Recognition

    Authors: Guangke Chen, Zhe Zhao, Fu Song, Sen Chen, Lingling Fan, Yang Liu

    Abstract: Adversarial attacks have been expanded to speaker recognition (SR). However, existing attacks are often assessed using different SR models, recognition tasks and datasets, and only few adversarial defenses borrowed from computer vision are considered. Yet, these defenses have not been thoroughly evaluated against adaptive attacks. Thus, there is still a lack of quantitative understanding about the…

    Submitted 3 September, 2021; originally announced September 2021.

  24. arXiv:2108.08697  [pdf, other]

    cs.CV eess.IV

    Real-time Image Enhancer via Learnable Spatial-aware 3D Lookup Tables

    Authors: Tao Wang, Yong Li, Jingyang Peng, Yipeng Ma, Xian Wang, Fenglong Song, Youliang Yan

    Abstract: Recently, deep learning-based image enhancement algorithms achieved state-of-the-art (SOTA) performance on several publicly available datasets. However, most existing methods fail to meet practical requirements either for visual perception or for computation efficiency, especially for high-resolution images. In this paper, we propose a novel real-time image enhancer via learnable spatial-aware 3-d…

    Submitted 19 August, 2021; originally announced August 2021.

    Comments: Accepted to ICCV2021

  25. arXiv:2106.15561  [pdf, other]

    eess.AS cs.CL cs.LG cs.MM cs.SD

    A Survey on Neural Speech Synthesis

    Authors: Xu Tan, Tao Qin, Frank Soong, Tie-Yan Liu

    Abstract: Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad applications in the industry. As the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years…

    Submitted 23 July, 2021; v1 submitted 29 June, 2021; originally announced June 2021.

    Comments: A comprehensive survey on TTS, 63 pages, 18 tables, 7 figures, 457 references

  26. arXiv:2106.10003  [pdf, other]

    cs.SD eess.AS

    Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS

    Authors: Xiaochun An, Frank K. Soong, Lei Xie

    Abstract: End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data in both target styles and speakers. Inadequate style transfer performance occurs when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to sty…

    Submitted 18 June, 2021; originally announced June 2021.

  27. arXiv:2106.04312  [pdf, other]

    eess.AS cs.SD

    Speech BERT Embedding For Improving Prosody in Neural TTS

    Authors: Liping Chen, Yan Deng, Xi Wang, Frank K. Soong, Lei He

    Abstract: This paper presents a speech BERT model to extract embedded prosody information in speech segments for improving the prosody of synthesized speech in neural text-to-speech (TTS). As a pre-trained model, it can learn prosody attributes from a large amount of speech data, which can utilize more data than the original training data used by the target TTS. The embedding is extracted from the previous…

    Submitted 14 September, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

    Journal ref: ICASSP 2021

  28. arXiv:2105.08629  [pdf, other]

    eess.IV cs.CV cs.LG

    Fast Camera Image Denoising on Mobile GPUs with Deep Learning, Mobile AI 2021 Challenge: Report

    Authors: Andrey Ignatov, Kim Byeoung-su, Radu Timofte, Angeline Pouget, Fenglong Song, Cheng Li, Shuai Xiao, Zhongqian Fu, Matteo Maggioni, Yibin Huang, Shen Cheng, Xin Lu, Yifeng Zhou, Liangyu Chen, Donghao Liu, Xiangyu Zhang, Haoqiang Fan, Jian Sun, Shuaicheng Liu, Minsu Kwon, Myungje Lee, Jaeyoon Yoo, Changbeom Kang, Shinjo Wang, Bin Huang , et al. (7 additional authors not shown)

    Abstract: Image denoising is one of the most critical problems in mobile photo processing. While many solutions have been proposed for this task, they are usually working with synthetic data and are too computationally expensive to run on mobile devices. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop an end-to-end deep learning-based image denoising solut…

    Submitted 17 May, 2021; originally announced May 2021.

    Comments: Mobile AI 2021 Workshop and Challenges: https://ai-benchmark.com/workshops/mai/2021/. arXiv admin note: substantial text overlap with arXiv:2105.07809, arXiv:2105.07825

  29. arXiv:2105.07809  [pdf, other]

    eess.IV cs.CV cs.LG

    Learned Smartphone ISP on Mobile NPUs with Deep Learning, Mobile AI 2021 Challenge: Report

    Authors: Andrey Ignatov, Cheng-Ming Chiang, Hsien-Kai Kuo, Anastasia Sycheva, Radu Timofte, Min-Hung Chen, Man-Yu Lee, Yu-Syuan Xu, Yu Tseng, Shusong Xu, ** Guo, Chao-Hung Chen, Ming-Chun Hsyu, Wen-Chia Tsai, Chao-Wei Chen, Grigory Malivenko, Minsu Kwon, Myungje Lee, Jaeyoon Yoo, Changbeom Kang, Shinjo Wang, Zheng Shaolong, Hao Dejun, Xie Fen, Feng Zhuang , et al. (16 additional authors not shown)

    Abstract: As the quality of mobile cameras starts to play a crucial role in modern smartphones, more and more attention is now being paid to ISP algorithms used to improve various perceptual aspects of mobile photos. In this Mobile AI challenge, the target was to develop an end-to-end deep learning-based image signal processing (ISP) pipeline that can replace classical hand-crafted ISPs and achieve nearly r…

    Submitted 17 May, 2021; originally announced May 2021.

    Comments: Mobile AI 2021 Workshop and Challenges: https://ai-benchmark.com/workshops/mai/2021/

  30. arXiv:2104.10781  [pdf, other]

    eess.IV cs.CV

    NTIRE 2021 Challenge on Quality Enhancement of Compressed Video: Methods and Results

    Authors: Ren Yang, Radu Timofte, Jing Liu, Yi Xu, Xinjian Zhang, Minyi Zhao, Shuigeng Zhou, Kelvin C. K. Chan, Shangchen Zhou, Xiangyu Xu, Chen Change Loy, Xin Li, Fanglong Liu, He Zheng, Lielin Jiang, Qi Zhang, Dongliang He, Fu Li, Qingqing Dang, Yibin Huang, Matteo Maggioni, Zhongqian Fu, Shuai Xiao, Cheng li, Thomas Tanay , et al. (47 additional authors not shown)

    Abstract: This paper reviews the first NTIRE challenge on quality enhancement of compressed video, with a focus on the proposed methods and results. In this challenge, the new Large-scale Diverse Video (LDV) dataset is employed. The challenge has three tracks. Tracks 1 and 2 aim at enhancing the videos compressed by HEVC at a fixed QP, while Track 3 is designed for enhancing the videos compressed by x265 at…

    Submitted 31 August, 2022; v1 submitted 21 April, 2021; originally announced April 2021.

    Comments: Corrected the MOS values in Table 2, and corrected some minor typos

  31. Efficient Multi-Stage Video Denoising with Recurrent Spatio-Temporal Fusion

    Authors: Matteo Maggioni, Yibin Huang, Cheng Li, Shuai Xiao, Zhongqian Fu, Fenglong Song

    Abstract: In recent years, denoising methods based on deep learning have achieved unparalleled performance at the cost of large computational complexity. In this work, we propose an Efficient Multi-stage Video Denoising algorithm, called EMVD, to drastically reduce the complexity while maintaining or even improving the performance. First, a fusion stage reduces the noise through a recursive combination of a…

    Submitted 30 March, 2023; v1 submitted 9 March, 2021; originally announced March 2021.

    Journal ref: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 3465-3474

  32. arXiv:2103.03541  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis

    Authors: Mutian He, Jingzhou Yang, Lei He, Frank K. Soong

    Abstract: To scale neural speech synthesis to various real-world languages, we present a multilingual end-to-end framework that maps byte inputs to spectrograms, thus allowing arbitrary input scripts. Besides strong results on 40+ languages, the framework demonstrates capabilities to adapt to new languages under extreme low-resource and even few-shot scenarios of merely 40s transcribed recording, without th…

    Submitted 9 July, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

    Comments: 17 pages

  33. arXiv:2103.00110  [pdf, other]

    cs.SD eess.AS

    MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network

    Authors: Yichong Leng, Xu Tan, Sheng Zhao, Frank Soong, Xiang-Yang Li, Tao Qin

    Abstract: Mean opinion score (MOS) is a popular subjective metric to assess the quality of synthesized speech, and usually involves multiple human judges to evaluate each speech utterance. To reduce the labor cost in MOS test, multiple methods have been proposed to automatically predict MOS scores. To our knowledge, for a speech utterance, all previous works only used the average of multiple scores from dif…

    Submitted 26 February, 2021; originally announced March 2021.

    Comments: Accepted by ICASSP 2021

  34. arXiv:2102.05210  [pdf, other]

    eess.IV cs.CV

    D2A U-Net: Automatic Segmentation of COVID-19 Lesions from CT Slices with Dilated Convolution and Dual Attention Mechanism

    Authors: Xiangyu Zhao, Peng Zhang, Fan Song, Guangda Fan, Yangyang Sun, Yujia Wang, Zheyuan Tian, Luqi Zhang, Guanglei Zhang

    Abstract: Coronavirus Disease 2019 (COVID-19) has caused great casualties and becomes almost the most urgent public health events worldwide. Computed tomography (CT) is a significant screening tool for COVID-19 infection, and automated segmentation of lung infection in COVID-19 CT images will greatly assist diagnosis and health care of patients. However, accurate and automatic segmentation of COVID-19 lung…

    Submitted 9 February, 2021; originally announced February 2021.

  35. arXiv:2011.08480  [pdf, other]

    eess.AS cs.SD

    s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis

    Authors: Xi Wang, Huaiping Ming, Lei He, Frank K. Soong

    Abstract: Neural end-to-end text-to-speech (TTS), which adopts either a recurrent model, e.g. Tacotron, or an attention one, e.g. Transformer, to characterize a speech utterance, has achieved significant improvement of speech synthesis. However, it is still very challenging to deal with different sentence lengths, particularly, for long sentences where sequence model has limitation of the effective context…

    Submitted 17 November, 2020; originally announced November 2020.

    Comments: 5 pages, 5 figures

  36. arXiv:2010.13339  [pdf, other]

    eess.AS cs.LG cs.SD

    Improving pronunciation assessment via ordinal regression with anchored reference samples

    Authors: Bin Su, Shaoguang Mao, Frank Soong, Yan Xia, Jonathan Tien, Zhiyong Wu

    Abstract: Sentence level pronunciation assessment is important for Computer Assisted Language Learning (CALL). Traditional speech pronunciation assessment, based on the Goodness of Pronunciation (GOP) algorithm, has some weakness in assessing a speech utterance: 1) Phoneme GOP scores cannot be easily translated into a sentence score with a simple average for effective assessment; 2) The rank ordering inform…

    Submitted 26 October, 2020; originally announced October 2020.

  37. arXiv:2008.04658  [pdf, other]

    eess.AS cs.SD

    Transfer Learning for Improving Singing-voice Detection in Polyphonic Instrumental Music

    Authors: Yuanbo Hou, Frank K. Soong, Jian Luan, Shengchen Li

    Abstract: Detecting singing-voice in polyphonic instrumental music is critical to music information retrieval. To train a robust vocal detector, a large dataset marked with vocal or non-vocal label at frame-level is essential. However, frame-level labeling is time-consuming and labor expensive, resulting there is little well-labeled dataset available for singing-voice detection (S-VD). Hence, we propose a d…

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: Accepted by INTERSPEECH 2020

  38. arXiv:2007.10795  [pdf]

    eess.IV cs.CV physics.app-ph

    Label-free detection of Giardia lamblia cysts using a deep learning-enabled portable imaging flow cytometer

    Authors: Zoltan Gorocs, David Baum, Fang Song, Kevin DeHaan, Hatice Ceylan Koydemir, Yunzhe Qiu, Zilin Cai, Thamira Skandakumar, Spencer Peterman, Miu Tamamitsu, Aydogan Ozcan

    Abstract: We report a field-portable and cost-effective imaging flow cytometer that uses deep learning to accurately detect Giardia lamblia cysts in water samples at a volumetric throughput of 100 mL/h. This flow cytometer uses lensfree color holographic imaging to capture and reconstruct phase and intensity images of microscopic objects in a continuously flowing sample, and automatically identifies Giardia…

    Submitted 12 July, 2020; originally announced July 2020.

    Comments: 17 Pages, 5 Figures, 1 Table

    Journal ref: Lab on a Chip (2020)

  39. arXiv:2005.10438  [pdf, other]

    cs.SD eess.AS

    Conversational End-to-End TTS for Voice Agent

    Authors: Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, Lei Xie

    Abstract: End-to-end neural TTS has achieved superior performance on reading style speech synthesis. However, it's still a challenge to build a high-quality conversational TTS due to the limitations of the corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under sequence to sequence modeling framework. We firstly construct a spontaneous conversational speech c…

    Submitted 16 November, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

    Comments: Accepted by SLT 2021; 7 pages

  40. arXiv:2004.00506  [pdf, other]

    eess.SP

    Dynamic Virtual Resource Allocation for 5G and Beyond Network Slicing

    Authors: Fei Song, Jun Li, Chuan Ma, Yijin Zhang, Long Shi, Dushantha Nalin K. Jayakody Li

    Abstract: The fifth generation and beyond wireless communication will support vastly heterogeneous services and use demands such as massive connection, low latency and high transmission rate. Network slicing has been envisaged as an efficient technology to meet these diverse demands. In this paper, we propose a dynamic virtual resources allocation scheme based on the radio access network (RAN) slicing for u…

    Submitted 29 March, 2020; originally announced April 2020.

  41. arXiv:2003.13201  [pdf, ps, other]

    eess.SP

    Probabilistic Caching for Small-Cell Networks with Terrestrial and Aerial Users

    Authors: Fei Song, Jun Li, Ming Ding, Long Shi, Feng Shu, Meixia Tao, Wen Chen, H. Vincent Poor

    Abstract: The support for aerial users has become the focus of recent 3GPP standardizations of 5G, due to their high maneuverability and flexibility for on-demand deployment. In this paper, probabilistic caching is studied for ultra-dense small-cell networks with terrestrial and aerial users, where a dynamic on-off architecture is adopted under a sophisticated path loss model incorporating both line-of-sigh…

    Submitted 29 March, 2020; originally announced March 2020.

    Comments: TVT

  42. arXiv:2001.11686  [pdf, other]

    eess.AS

    Improving LPCNet-based Text-to-Speech with Linear Prediction-structured Mixture Density Network

    Authors: Min-Jae Hwang, Eunwoo Song, Ryuichi Yamamoto, Frank Soong, Hong-Goo Kang

    Abstract: In this paper, we propose an improved LPCNet vocoder using a linear prediction (LP)-structured mixture density network (MDN). The recently proposed LPCNet vocoder has successfully achieved high-quality and lightweight speech synthesis systems by combining a vocal tract LP filter with a WaveRNN-based vocal source (i.e., excitation) generator. However, the quality of synthesized speech is often unst…

    Submitted 31 January, 2020; originally announced January 2020.

    Comments: Accepted to ICASSP 2020

    Journal ref: IEEE ICASSP 2020

  43. arXiv:1911.01840  [pdf, other]

    eess.AS cs.CR cs.LG cs.MM cs.SD

    Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems

    Authors: Guangke Chen, Sen Chen, Lingling Fan, Xiaoning Du, Zhe Zhao, Fu Song, Yang Liu

    Abstract: Speaker recognition (SR) is widely used in our daily life as a biometric authentication or identification mechanism. The popularity of SR brings in serious security concerns, as demonstrated by recent adversarial attacks. However, the impacts of such threats in the practical black-box setting are still open, since current attacks consider the white-box setting only. In this paper, we conduct the f…

    Submitted 23 April, 2020; v1 submitted 3 November, 2019; originally announced November 2019.

    Comments: IEEE Symposium on Security and Privacy 2021

  44. arXiv:1909.05249  [pdf, other]

    eess.IV cs.CV

    NODE: Extreme Low Light Raw Image Denoising using a Noise Decomposition Network

    Authors: Hao Guan, Liu Liu, Sean Moran, Fenglong Song, Gregory Slabaugh

    Abstract: Denoising extreme low light images is a challenging task due to the high noise level. When the illumination is low, digital cameras increase the ISO (electronic gain) to amplify the brightness of captured data. However, this in turn amplifies the noise, arising from read, shot, and defective pixel sources. In the raw domain, read and shot noise are effectively modelled using Gaussian and Poisson d…

    Submitted 11 September, 2019; originally announced September 2019.

  45. arXiv:1907.09006  [pdf, other]

    eess.AS cs.CL cs.SD

    Forward-Backward Decoding for Regularizing End-to-End TTS

    Authors: Yibin Zheng, Xi Wang, Lei He, Shifeng Pan, Frank K. Soong, Zhengqi Wen, Jianhua Tao

    Abstract: Neural end-to-end TTS can generate very high-quality synthesized speech, and even close to human recording within similar domain text. However, it performs unsatisfactorily when scaling it to challenging test sets. One concern is that the encoder-decoder with attention-based network adopts autoregressive generative sequence model with the limitation of "exposure bias". To address this issue, we propo…

    Submitted 18 July, 2019; originally announced July 2019.

    Comments: Accepted by INTERSPEECH2019. arXiv admin note: text overlap with arXiv:1808.04064, arXiv:1804.05374 by other authors

  46. arXiv:1901.00707  [pdf, other]

    cs.SD cs.CL eess.AS

    Feature reinforcement with word embedding and parsing information in neural TTS

    Authors: Huaiping Ming, Lei He, Haohan Guo, Frank K. Soong

    Abstract: In this paper, we propose a feature reinforcement method under the sequence-to-sequence neural text-to-speech (TTS) synthesis framework. The proposed method utilizes the multiple input encoder to take three levels of text information, i.e., phoneme sequence, pre-trained word embedding, and grammatical structure of sentences from parser as the input feature for the neural TTS system. The added word…

    Submitted 6 March, 2019; v1 submitted 3 January, 2019; originally announced January 2019.

  47. arXiv:1812.05253  [pdf, other]

    eess.AS cs.CL cs.SD

    Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice

    Authors: Yan Deng, Lei He, Frank Soong

    Abstract: Neural TTS has shown it can generate high quality synthesized speech. In this paper, we investigate the multi-speaker latent space to improve neural TTS for adapting the system to new speakers with only several minutes of speech or enhancing a premium voice by utilizing the data from other speakers for richer contextual coverage and better generalization. A multi-speaker neural TTS model is built…

    Submitted 1 September, 2019; v1 submitted 12 December, 2018; originally announced December 2018.

  48. arXiv:1811.11913  [pdf, other]

    eess.AS cs.SD

    LP-WaveNet: Linear Prediction-based WaveNet Speech Synthesis

    Authors: Min-Jae Hwang, Frank Soong, Eunwoo Song, Xi Wang, Hyeonjoo Kang, Hong-Goo Kang

    Abstract: We propose a linear prediction (LP)-based waveform generation method via WaveNet vocoding framework. A WaveNet-based neural vocoder has significantly improved the quality of parametric text-to-speech (TTS) systems. However, it is challenging to effectively train the neural vocoder when the target database contains massive amount of acoustical information such as prosody, style or expressiveness. A…

    Submitted 4 March, 2020; v1 submitted 28 November, 2018; originally announced November 2018.

    Comments: Submitted to EUSIPCO 2020