Showing 1–18 of 18 results for author: Luong, H

  1. arXiv:2406.17376  [pdf, other]

    cs.SD cs.AI eess.AS

    Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

    Authors: Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

    Abstract: Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in sp…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  2. arXiv:2305.04047  [pdf, other]

    eess.IV cs.CV

    Degradation-Noise-Aware Deep Unfolding Transformer for Hyperspectral Image Denoising

    Authors: Haijin Zeng, Jiezhang Cao, Kai Feng, Shaoguang Huang, Hongyan Zhang, Hiep Luong, Wilfried Philips

    Abstract: Hyperspectral imaging (HI) has emerged as a powerful tool in diverse fields such as medical diagnosis, industrial inspection, and agriculture, owing to its ability to detect subtle differences in physical properties through high spectral resolution. However, hyperspectral images (HSIs) are often quite noisy because of narrow band spectral filtering. To reduce the noise in HSI data cubes, both mode…

    Submitted 6 May, 2023; originally announced May 2023.

  3. arXiv:2303.13571  [pdf, other]

    cs.CV eess.IV

    Inheriting Bayer's Legacy-Joint Remosaicing and Denoising for Quad Bayer Image Sensor

    Authors: Haijin Zeng, Kai Feng, Jiezhang Cao, Shaoguang Huang, Yongqiang Zhao, Hiep Luong, Jan Aelterman, Wilfried Philips

    Abstract: Pixel binning based Quad sensors have emerged as a promising solution to overcome the hardware limitations of compact cameras in low-light imaging. However, binning results in lower spatial resolution and non-Bayer CFA artifacts. To address these challenges, we propose a dual-head joint remosaicing and denoising network (DJRD), which enables the conversion of noisy Quad Bayer and standard noise-fr…

    Submitted 23 March, 2023; originally announced March 2023.

  4. arXiv:2303.13404  [pdf, other]

    eess.IV cs.CV

    MSFA-Frequency-Aware Transformer for Hyperspectral Images Demosaicing

    Authors: Haijin Zeng, Kai Feng, Shaoguang Huang, Jiezhang Cao, Yongyong Chen, Hongyan Zhang, Hiep Luong, Wilfried Philips

    Abstract: Hyperspectral imaging systems that use multispectral filter arrays (MSFA) capture only one spectral component in each pixel. Hyperspectral demosaicing is used to recover the non-measured components. While deep learning methods have shown promise in this area, they still suffer from several challenges, including limited modeling of non-local dependencies, lack of consideration of the periodic MSFA…

    Submitted 23 March, 2023; originally announced March 2023.

  5. arXiv:2204.12879  [pdf, other]

    cs.CV eess.IV

    Low-rank Meets Sparseness: An Integrated Spatial-Spectral Total Variation Approach to Hyperspectral Denoising

    Authors: Haijin Zeng, Shaoguang Huang, Yongyong Chen, Hiep Luong, Wilfried Philips

    Abstract: Spatial-Spectral Total Variation (SSTV) can quantify local smoothness of image structures, so it is widely used in hyperspectral image (HSI) processing tasks. Essentially, SSTV assumes a sparse structure of gradient maps calculated along the spatial and spectral directions. In fact, these gradient tensors are not only sparse, but also (approximately) low-rank under FFT, which we have verified by n…

    Submitted 27 April, 2022; originally announced April 2022.

  6. arXiv:2204.00818  [pdf, other]

    eess.IV cs.CV

    RFVTM: A Recovery and Filtering Vertex Trichotomy Matching for Remote Sensing Image Registration

    Authors: Ming Zhao, Bowen An, Yongpeng Wu, Huynh Van Luong, André Kaup

    Abstract: Reliable feature point matching is a vital yet challenging process in feature-based image registration. In this paper, a robust feature point matching algorithm called Recovery and Filtering Vertex Trichotomy Matching (RFVTM) is proposed to remove outliers and retain sufficient inliers for remote sensing images. A novel affine invariant descriptor called vertex trichotomy descriptor is proposed on…

    Submitted 2 April, 2022; originally announced April 2022.

  7. arXiv:2110.04946  [pdf, other]

    cs.SD cs.LG eess.AS

    LaughNet: synthesizing laughter utterances from waveform silhouettes and a single laughter example

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: Emotional and controllable speech synthesis is a topic that has received much attention. However, most studies focused on improving the expressiveness and controllability in the context of linguistic content, even though natural verbal human communication is inseparable from spontaneous non-speech expressions such as laughter, crying, or grunting. We propose a model called LaughNet for synthesizin…

    Submitted 25 January, 2022; v1 submitted 10 October, 2021; originally announced October 2021.

  8. arXiv:2106.13479  [pdf, other]

    cs.SD cs.CL eess.AS

    Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: Generally speaking, the main objective when training a neural speech synthesis system is to synthesize natural and expressive speech from the output layer of the neural network without much attention given to the hidden layers. However, by learning useful latent representation, the system can be used for many more practical scenarios. In this paper, we investigate the use of quantized vectors to m…

    Submitted 25 June, 2021; originally announced June 2021.

    Comments: to be presented at SSW11

  9. arXiv:2010.03717  [pdf, other]

    eess.AS cs.CL cs.SD

    Latent linguistic embedding for cross-lingual text-to-speech and voice conversion

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: As the recently proposed voice cloning system, NAUTILUS, is capable of cloning unseen voices using untranscribed speech, we investigate the feasibility of using it to develop a unified cross-lingual TTS/VC system. Cross-lingual speech generation is the scenario in which speech utterances are generated with the voices of target speakers in a language not spoken by them originally. This type of syst…

    Submitted 7 October, 2020; originally announced October 2020.

    Comments: Accepted to Voice Conversion Challenge 2020 Online Workshop

  10. arXiv:2010.00929  [pdf, other]

    cs.LG cs.CV eess.IV stat.ML

    A Deep-Unfolded Reference-Based RPCA Network For Video Foreground-Background Separation

    Authors: Huynh Van Luong, Boris Joukovsky, Yonina C. Eldar, Nikos Deligiannis

    Abstract: Deep unfolded neural networks are designed by unrolling the iterations of optimization algorithms. They can be shown to achieve faster convergence and higher accuracy than their optimization counterparts. This paper proposes a new deep-unfolding-based network design for the problem of Robust Principal Component Analysis (RPCA) with application to video foreground-background separation. Unlike exis…

    Submitted 2 October, 2020; originally announced October 2020.

    Comments: 5 pages, accepted for publication

  11. arXiv:2005.11004  [pdf, other]

    eess.AS cs.CL cs.SD

    NAUTILUS: a Versatile Voice Cloning System

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker. By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the basis o…

    Submitted 6 October, 2020; v1 submitted 22 May, 2020; originally announced May 2020.

    Comments: Submitted to The IEEE/ACM Transactions on Audio, Speech, and Language Processing

  12. arXiv:1909.06532  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: Voice conversion (VC) and text-to-speech (TTS) are two tasks that share a similar objective, generating speech with a target voice. However, they are usually developed independently under vastly different frameworks. In this paper, we propose a methodology to bootstrap a VC system from a pretrained speaker-adaptive TTS model and unify the techniques as well as the interpretations of these two task…

    Submitted 14 September, 2019; originally announced September 2019.

    Comments: Accepted for IEEE ASRU 2019

  13. arXiv:1906.07414  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: By representing speaker characteristic as a single fixed-length vector extracted solely from speech, we can train a neural multi-speaker speech synthesis model by conditioning the model on those vectors. This model can also be adapted to unseen speakers regardless of whether the transcript of adaptation data is available or not. However, this setup restricts the speaker component to just a single…

    Submitted 7 October, 2019; v1 submitted 18 June, 2019; originally announced June 2019.

    Comments: 14 pages, 10 figures

  14. arXiv:1904.00771  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora

    Authors: Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa

    Abstract: When the available data of a target speaker is insufficient to train a high quality speaker-dependent neural text-to-speech (TTS) system, we can combine data from multiple speakers and train a multi-speaker TTS model instead. Many studies have shown that neural multi-speaker TTS model trained with a small amount of data from multiple speakers combined can generate synthetic speech with better quality…

    Submitted 7 April, 2019; v1 submitted 1 April, 2019; originally announced April 2019.

    Comments: Submitted to Interspeech 2019, Graz, Austria

  15. arXiv:1808.06288  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Multimodal speech synthesis architecture for unsupervised speaker adaptation

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: This paper proposes a new architecture for speaker adaptation of multi-speaker neural-network speech synthesis systems, in which an unseen speaker's voice can be built using a relatively small amount of speech data without transcriptions. This is sometimes called "unsupervised speaker adaptation". More specifically, we concatenate the layers to the audio inputs when performing unsupervised speaker…

    Submitted 19 August, 2018; originally announced August 2018.

    Comments: Accepted for Interspeech 2018, India

  16. arXiv:1808.00665  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects

    Authors: Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa

    Abstract: We investigated the impact of noisy linguistic features on the performance of a Japanese speech synthesis system based on neural network that uses WaveNet vocoder. We compared an ideal system that uses manually corrected linguistic features including phoneme and prosodic information in training and test sets against a few other systems that use corrupted linguistic features. Both subjective and ob…

    Submitted 2 August, 2018; originally announced August 2018.

    Comments: Accepted for Interspeech 2018

  17. arXiv:1807.11679  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

    Authors: Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke Saito, Nobuaki Minematsu

    Abstract: Recent neural networks such as WaveNet and sampleRNN that learn directly from speech waveform samples have achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity even in multi-speaker text-to-speech synthesis systems. Such neural networks are being used as an alternative to vocoders and hence they are often called neural vocoders. The neural vocoder uses ac…

    Submitted 31 July, 2018; originally announced July 2018.

  18. arXiv:1807.11632  [pdf, other]

    eess.AS cs.SD stat.ML

    Scaling and bias codes for modeling speaker-adaptive DNN-based speech synthesis systems

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: Most neural-network based speaker-adaptive acoustic models for speech synthesis can be categorized into either layer-based or input-code approaches. Although both approaches have their own pros and cons, most existing works on speaker adaptation focus on improving one or the other. In this paper, after we first systematically overview the common principles of neural-network based speaker-adaptive…

    Submitted 30 September, 2018; v1 submitted 30 July, 2018; originally announced July 2018.

    Comments: Accepted for 2018 IEEE Workshop on Spoken Language Technology (SLT), Athens, Greece