TypeII-CsiNet: CSI Feedback with TypeII Codebook

Authors: Yiliang Sang, Ke Ma, Yang Ming, ** Lian, Zhaocheng Wang

Abstract: The latest TypeII codebook selects partial strongest angular-delay ports for the feedback of downlink channel state information (CSI), whereas its performance is limited due to the deficiency of utilizing the correlations among the port coefficients. To tackle this issue, we propose a tailored autoencoder named TypeII-CsiNet to effectively integrate the TypeII codebook with deep learning, wherein three novel designs are developed for sufficiently boosting the sum rate performance. Firstly, a dedicated pre-processing module is designed to sort the selected ports for reserving the correlations of their corresponding coefficients. Secondly, a position-filling layer is developed in the decoder to fill the feedback coefficients into their ports in the recovered CSI matrix, so that the corresponding angular-delay-domain structure is adequately leveraged to enhance the reconstruction accuracy. Thirdly, a two-stage loss function is proposed to improve the sum rate performance while avoiding the trapping in local optimums during model training. Simulation results verify that our proposed TypeII-CsiNet outperforms the TypeII codebook and existing deep learning benchmarks.

Submitted 21 May, 2024; originally announced May 2024.
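
As a rough illustration of the position-filling idea described in the abstract above, the sketch below scatters fed-back coefficients into an otherwise zero angular-delay matrix at the positions of their selected ports. It is only a toy under stated assumptions, not the authors' decoder layer: the function name `position_fill`, the `(angular_index, delay_index)` port representation, and the matrix dimensions are all hypothetical.

```python
def position_fill(coeffs, ports, n_angular, n_delay):
    """Place each coefficient at its (angular, delay) port; all other entries stay zero.

    Hypothetical helper for illustration only; the paper's layer operates
    inside a neural decoder and may differ in every detail.
    """
    if len(coeffs) != len(ports):
        raise ValueError("need exactly one port position per coefficient")
    # Start from an all-zero complex matrix of shape n_angular x n_delay.
    matrix = [[0j] * n_delay for _ in range(n_angular)]
    for value, (a, d) in zip(coeffs, ports):
        matrix[a][d] = value
    return matrix


# Two coefficients fed back for ports (0, 2) and (3, 1) of a 4 x 3 grid.
filled = position_fill([1 + 1j, 2 - 1j], [(0, 2), (3, 1)], n_angular=4, n_delay=3)
```

Keeping the coefficients at their original port positions is what lets a later reconstruction stage see the sparse angular-delay structure rather than a flat vector.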

arXiv:2403.00529 [pdf, other]

VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis

Authors: Weiwei Lin, Chenhang He, Man-Wai Mak, Jiachen Lian, Kong Aik Lee

Abstract: Achieving nuanced and accurate emulation of human voice has been a longstanding goal in artificial intelligence. Although significant progress has been made in recent years, the mainstream of speech synthesis models still relies on supervised speaker modeling and explicit reference utterances. However, there are many aspects of human voice, such as emotion, intonation, and speaking style, for which it is hard to obtain accurate labels. In this paper, we propose VoxGenesis, a novel unsupervised speech synthesis framework that can discover a latent speaker manifold and meaningful voice editing directions without supervision. VoxGenesis is conceptually simple. Instead of mapping speech features to waveforms deterministically, VoxGenesis transforms a Gaussian distribution into speech distributions conditioned and aligned by semantic tokens. This forces the model to learn a speaker distribution disentangled from the semantic content. During the inference, sampling from the Gaussian distribution enables the creation of novel speakers with distinct characteristics. More importantly, the exploration of latent space uncovers human-interpretable directions associated with specific speaker characteristics such as gender attributes, pitch, tone, and emotion, allowing for voice editing by manipulating the latent codes along these identified directions.
We conduct extensive experiments to evaluate the proposed VoxGenesis using both subjective and objective metrics, finding that it produces significantly more diverse and realistic speakers with distinct characteristics than the previous approaches. We also show that latent space manipulation produces consistent and human-identifiable effects that are not detrimental to the speech quality, which was not possible with previous approaches. Audio samples of VoxGenesis can be found at: https://bit.ly/VoxGenesis.

Submitted 1 March, 2024; originally announced March 2024.

Comments: preprint

arXiv:2402.02411 [pdf, other]

Physics-Inspired Degradation Models for Hyperspectral Image Fusion

Authors: Jie Lian, Lizhi Wang, Lin Zhu, Renwei Dian, Zhiwei Xiong, Hua Huang

Abstract: The fusion of a low-spatial-resolution hyperspectral image (LR-HSI) with a high-spatial-resolution multispectral image (HR-MSI) has garnered increasing research interest. However, most fusion methods solely focus on the fusion algorithm itself and overlook the degradation models, which results in unsatisfactory performance in practical scenarios. To fill this gap, we propose physics-inspired degradation models (PIDM) to model the degradation of LR-HSI and HR-MSI, which comprises a spatial degradation network (SpaDN) and a spectral degradation network (SpeDN). SpaDN and SpeDN are designed based on two insights. First, we employ spatial warping and spectral modulation operations to simulate lens aberrations, thereby introducing non-uniformity into the spatial and spectral degradation processes. Second, we utilize asymmetric downsampling and parallel downsampling operations to separately reduce the spatial and spectral resolutions of the images, thus ensuring the matching of spatial and spectral degradation processes with specific physical characteristics. Once SpaDN and SpeDN are established, we adopt a self-supervised training strategy to optimize the network parameters and provide a plug-and-play solution for fusion methods. Comprehensive experiments demonstrate that our proposed PIDM can boost the fusion performance of existing fusion methods in practical scenarios.

Submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.10015 [pdf, other]

Towards Hierarchical Spoken Language Dysfluency Modeling

Authors: Jiachen Lian, Gopala Anumanchipalli

Abstract: Speech disfluency modeling is the bottleneck for both speech therapy and language learning. However, there is no effective AI solution to systematically tackle this problem. We solidify the concept of disfluent speech and disfluent speech modeling. We then present Hierarchical Unconstrained Disfluency Modeling (H-UDM) approach, the hierarchical extension of UDM that addresses both disfluency transcription and detection to eliminate the need for extensive manual annotation. Our experimental findings serve as clear evidence of the effectiveness and reliability of the methods we have introduced, encompassing both transcription and detection tasks.

Submitted 21 January, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Comments: 2024 EACL. Hierarchical extension of our previous workshop paper arXiv:2312.12810

arXiv:2312.12810 [pdf, other]

Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection

Authors: Jiachen Lian, Carly Feng, Naasir Farooqi, Steve Li, Anshul Kashyap, Cheol Jun Cho, Peter Wu, Robbie Netzorg, Tingle Li, Gopala Krishna Anumanchipalli

Abstract: Dysfluent speech modeling requires time-accurate and silence-aware transcription at both the word-level and phonetic-level. However, current research in dysfluency modeling primarily focuses on either transcription or detection, and the performance of each aspect remains limited. In this work, we present an unconstrained dysfluency modeling (UDM) approach that addresses both transcription and detection in an automatic and hierarchical manner. UDM eliminates the need for extensive manual annotation by providing a comprehensive solution. Furthermore, we introduce a simulated dysfluent dataset called VCTK++ to enhance the capabilities of UDM in phonetic transcription. Our experimental results demonstrate the effectiveness and robustness of our proposed methods in both transcription and detection tasks.

Submitted 20 December, 2023; originally announced December 2023.

Comments: 2023 ASRU

arXiv:2310.05962 [pdf, other]

Improving the Performance of R17 Type-II Codebook with Deep Learning

Authors: Ke Ma, Yiliang Sang, Yang Ming, ** Lian, Chang Tian, Zhaocheng Wang

Abstract: The Type-II codebook in Release 17 (R17) exploits the angular-delay-domain partial reciprocity between uplink and downlink channels to select part of angular-delay-domain ports for measuring and feeding back the downlink channel state information (CSI), where the performance of existing deep learning enhanced CSI feedback methods is limited due to the deficiency of sparse structures. To address this issue, we propose two new perspectives of adopting deep learning to improve the R17 Type-II codebook. Firstly, considering the low signal-to-noise ratio of uplink channels, deep learning is utilized to accurately select the dominant angular-delay-domain ports, where the focal loss is harnessed to solve the class imbalance problem. Secondly, we propose to adopt deep learning to reconstruct the downlink CSI based on the feedback of the R17 Type-II codebook at the base station, where the information of sparse structures can be effectively leveraged. Besides, a weighted shortcut module is designed to facilitate the accurate reconstruction. Simulation results demonstrate that our proposed methods could improve the sum rate performance compared with its traditional R17 Type-II codebook and deep learning benchmarks.

Submitted 13 September, 2023; originally announced October 2023.

Comments: Accepted by IEEE GLOBECOM 2023, conference version of arXiv:2305.08081
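
The abstract above mentions focal loss as a remedy for class imbalance when selecting dominant ports. As a hedged sketch, the snippet below implements the commonly used binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), for a single prediction. The defaults `gamma=2.0` and `alpha=0.25` are conventional choices rather than values taken from the paper, and the paper's exact formulation may differ.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one sample.

    p: predicted probability of the positive class (e.g. "port is dominant").
    y: ground-truth label, 1 or 0.
    gamma, alpha: assumed hyperparameters, not values from the paper.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, eps), 1.0)               # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction is down-weighted relative to a poor one.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.10, 1)
```

The factor (1 - p_t)^gamma shrinks the contribution of well-classified samples, so the many easy negatives (non-dominant ports) do not swamp the few positives. With `gamma=0` the expression reduces to a class-weighted cross-entropy.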

arXiv:2309.15203 [pdf, other]

Eve Said Yes: AirBone Authentication for Head-Wearable Smart Voice Assistant

Authors: Chenpei Huang, Hui Zhong, Jie Lian, Pavana Prakash, Dian Shi, Yuan Xu, Miao Pan

Abstract: Recent advances in machine learning and natural language processing have fostered the enormous prosperity of smart voice assistants and their services, e.g., Alexa, Google Home, Siri, etc. However, voice spoofing attacks are deemed to be one of the major challenges of voice control security, and never stop evolving such as deep-learning-based voice conversion and speech synthesis techniques. To solve this problem outside the acoustic domain, we focus on head-wearable devices, such as earbuds and virtual reality (VR) headsets, which are feasible to continuously monitor the bone-conducted voice in the vibration domain. Specifically, we identify that air and bone conduction (AC/BC) from the same vocalization are coupled (or concurrent) and user-level unique, which makes them suitable behavior and biometric factors for multi-factor authentication (MFA). The legitimate user can defeat acoustic domain and even cross-domain spoofing samples with the proposed two-stage AirBone authentication. The first stage answers whether air and bone conduction utterances are time domain consistent (TC) and the second stage runs bone conduction speaker recognition (BC-SR). The security level is hence increased for two reasons: (1) current acoustic attacks on smart voice assistants cannot affect bone conduction, which is in the vibration domain; (2) even for advanced cross-domain attacks, the unique bone conduction features can detect adversary's impersonation and machine-induced vibration.
Finally, AirBone authentication has good usability (the same level as voice authentication) compared with traditional MFA and those specially designed to enhance smart voice security. Our experimental results show that the proposed AirBone authentication is usable and secure, and can be easily equipped by commercial off-the-shelf head wearables with good user experience.

Submitted 26 September, 2023; originally announced September 2023.

Comments: 13 pages, 12 figures

arXiv:2309.09088 [pdf, other]

Enhancing GAN-Based Vocoders with Contrastive Learning Under Data-limited Condition

Authors: Haoming Guo, Seth Z. Zhao, Jiachen Lian, Gopala Anumanchipalli, Gerald Friedland

Abstract: Vocoder models have recently achieved substantial progress in generating authentic audio comparable to human quality while significantly reducing memory requirement and inference time. However, these data-hungry generative models require large-scale audio data for learning good representations. In this paper, we apply contrastive learning methods in training the vocoder to improve the perceptual quality of the vocoder without modifying its architecture or adding more data. We design an auxiliary task with mel-spectrogram contrastive learning to enhance the utterance-level quality of the vocoder model under data-limited conditions. We also extend the task to include waveforms to improve the multi-modality comprehension of the model and address the discriminator overfitting problem. We optimize the additional task simultaneously with GAN training objectives. Our results show that the tasks improve model performance substantially in data-limited settings.

Submitted 18 December, 2023; v1 submitted 16 September, 2023; originally announced September 2023.

arXiv:2307.02471 [pdf, other]

Deep Speech Synthesis from MRI-Based Articulatory Representations

Authors: Peter Wu, Tingle Li, Yijing Lu, Yubin Zhang, Jiachen Lian, Alan W Black, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

Abstract: In this paper, we study articulatory synthesis, a speech synthesis method using human vocal tract information that offers a way to develop efficient, generalizable and interpretable synthesizers. While recent advances have enabled intelligible articulatory synthesis using electromagnetic articulography (EMA), these methods lack critical articulatory information like excitation and nasality, limiting generalization capabilities. To bridge this gap, we propose an alternative MRI-based feature set that covers a much more extensive articulatory space than EMA. We also introduce normalization and denoising procedures to enhance the generalizability of deep learning methods trained on MRI data. Moreover, we propose an MRI-to-speech model that improves both computational efficiency and speech fidelity. Finally, through a series of ablations, we show that the proposed MRI representation is more comprehensive than EMA and identify the most suitable MRI feature subset for articulatory synthesis.

Submitted 5 July, 2023; originally announced July 2023.

arXiv:2302.06419 [pdf, other]

AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

Authors: Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli

Abstract: Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec which addresses these challenges and builds audio-visual representations based on predicting contextualized representations which has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings with the same amount of data and model size.

Submitted 21 January, 2024; v1 submitted 9 February, 2023; originally announced February 2023.

Comments: 2023 ASRU

arXiv:2210.16498 [pdf, other]

Articulatory Representation Learning Via Joint Factor Analysis and Neural Matrix Factorization

Authors: Jiachen Lian, Alan W Black, Yijing Lu, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

Abstract: Articulatory representation learning is the fundamental research in modeling neural speech production system. Our previous work has established a deep paradigm to decompose the articulatory kinematics data into gestures, which explicitly model the phonological and linguistic structure encoded with human speech production mechanism, and corresponding gestural scores. We continue with this line of work by raising two concerns: (1) The articulators are entangled together in the original algorithm such that some of the articulators do not leverage effective moving patterns, which limits the interpretability of both gestures and gestural scores; (2) The EMA data is sparsely sampled from articulators, which limits the intelligibility of learned representations. In this work, we propose a novel articulatory representation decomposition algorithm that takes the advantage of guided factor analysis to derive the articulatory-specific factors and factor scores. A neural convolutive matrix factorization algorithm is then employed on the factor scores to derive the new gestures and gestural scores. We experiment with the rtMRI corpus that captures the fine-grained vocal tract contours. Both subjective and objective evaluation results suggest that the newly proposed system delivers the articulatory representations that are intelligible, generalizable, efficient and interpretable.

Submitted 20 February, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

Comments: Accepted to 2023 ICASSP. Camera Ready

arXiv:2206.02512 [pdf, other]

UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder

Authors: Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu

Abstract: In this paper, we propose a novel unsupervised text-to-speech (UTTS) framework which does not require text-audio pairs for the TTS acoustic modeling (AM). UTTS is a multi-speaker speech synthesizer that supports zero-shot voice cloning, it is developed from a perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference. We leverage recent advancements in self-supervised speech representation learning as well as speech synthesis front-end techniques for system development. Specifically, we employ our recently formulated Conditional Disentangled Sequential Variational Auto-encoder (C-DSVAE) as the backbone UTTS AM, which offers well-structured content representations given unsupervised alignment (UA) as condition during training. For UTTS inference, we utilize a lexicon to map input text to the phoneme sequence, which is expanded to the frame-level forced alignment (FA) with a speaker-dependent duration model. Then, we develop an alignment mapping module that converts FA to UA. Finally, the C-DSVAE, serving as the self-supervised TTS AM, takes the predicted UA and a target speaker embedding to generate the mel spectrogram, which is ultimately converted to waveform with a neural vocoder. We show how our method enables speech synthesis without using a paired TTS corpus. Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
Audio samples are available at our demo page https://neurtts.github.io/utts_demo.

Submitted 11 October, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

Comments: Under Review

arXiv:2205.05227 [pdf, ps, other]

Towards Improved Zero-shot Voice Conversion with Conditional DSVAE

Authors: Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu

Abstract: Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion (VC). Our previous study investigated a novel framework with disentangled sequential variational autoencoder (DSVAE) as the backbone for information decomposition. We have demonstrated that simultaneous disentangling content embedding and speaker embedding from one utterance is feasible for zero-shot VC. In this study, we continue the direction by raising one concern about the prior distribution of content branch in the DSVAE baseline. We find the random initialized prior distribution will force the content embedding to reduce the phonetic-structure information during the learning process, which is not a desired property. Here, we seek to achieve a better content embedding with more phonetic information preserved. We propose conditional DSVAE, a new model that enables content bias as a condition to the prior modeling and reshapes the content embedding sampled from the posterior distribution. In our experiment on the VCTK dataset, we demonstrate that content embeddings derived from the conditional DSVAE overcome the randomness and achieve a much better phoneme classification accuracy, a stabilized vocalization and a better zero-shot VC performance compared with the competitive DSVAE baseline.

Submitted 20 June, 2022; v1 submitted 10 May, 2022; originally announced May 2022.

Comments: Accepted to 2022 Interspeech. Demo link is here https://jlian2.github.io/Improved-Voice-Conversion-with-Conditional-DSVAE/

arXiv:2204.00465 [pdf, other]

Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition

Authors: Jiachen Lian, Alan W Black, Louis Goldstein, Gopala Krishna Anumanchipalli

Abstract: Most of the research on data-driven speech representation learning has focused on raw audios in an end-to-end manner, paying little attention to their internal phonological or gestural structure. This work, investigating the speech representations derived from articulatory kinematics signals, uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores. By applying sparse constraints, the gestural scores leverage the discrete combinatorial properties of phonological gestures. Phoneme recognition experiments were additionally performed to show that gestural scores indeed code phonological information successfully. The proposed work thus makes a bridge between articulatory phonology and deep neural networks to leverage informative, intelligible, interpretable, and efficient speech representations.

Submitted 20 June, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

Comments: Accepted to 2022 Interspeech. Code is publicly available at https://github.com/Berkeley-Speech-Group/ema_gesture

arXiv:2203.16705 [pdf, other]

Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion

Authors: Jiachen Lian, Chunlei Zhang, Dong Yu

Abstract: Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers. Good voice conversion quality is obtained by exploring better alignment modules or expressive mapping functions. In this study, we investigate zero-shot VC from a novel perspective of self-supervised disentangled speech representation learning. Specifically, we achieve the disentanglement by balancing the information flow between global speaker representation and time-varying content representation in a sequential variational autoencoder (VAE). A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to the VAE decoder. Besides that, an on-the-fly data augmentation training strategy is applied to make the learned representation noise invariant. On TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on speaker embedding and content embedding, and subjective evaluation, i.e., voice naturalness and similarity, and remains to be robust even with noisy source/target utterances.

Submitted 30 March, 2022; originally announced March 2022.

Comments: Accepted to 2022 ICASSP

arXiv:2110.15018 [pdf, other]

TorchAudio: Building Blocks for Audio and Speech Processing

Authors: Yao-Yuan Yang, Moto Hira, Zhaoheng Ni, Anjali Chourdia, Artyom Astafurov, Caroline Chen, Ching-Feng Yeh, Christian Puhrsch, David Pollack, Dmitriy Genzel, Donny Greenberg, Edward Z. Yang, Jason Lian, Jay Mahadeokar, Jeff Hwang, Ji Chen, Peter Goldsborough, Prabhat Roy, Sean Narenthiran, Shinji Watanabe, Soumith Chintala, Vincent Quenneville-Bélair, Yangyang Shi

Abstract: This document describes version 0.10 of TorchAudio: building blocks for machine learning applications in the audio and speech processing domain. The objective of TorchAudio is to accelerate the development and deployment of machine learning applications for researchers and engineers by providing off-the-shelf building blocks. The building blocks are designed to be GPU-compatible, automatically differentiable, and production-ready. TorchAudio can be easily installed from Python Package Index repository and the source code is publicly available under a BSD-2-Clause License (as of September 2021) at https://github.com/pytorch/audio. In this document, we provide an overview of the design principles, functionalities, and benchmarks of TorchAudio. We also benchmark our implementation of several audio and speech operations and models. We verify through the benchmarks that our implementations of various operations and models are valid and perform similarly to other publicly available implementations.

Submitted 16 February, 2022; v1 submitted 28 October, 2021; originally announced October 2021.

Comments: Accepted by ICASSP 2022

arXiv:2110.12192 [pdf, other]

Dual Shape Guided Segmentation Network for Organs-at-Risk in Head and Neck CT Images

Authors: Shuai Wang, Theodore Yanagihara, Bhishamjit Chera, Colette Shen, Pew-Thian Yap, Jun Lian

Abstract: The accurate segmentation of organs-at-risk (OARs) in head and neck CT images is a critical step for radiation therapy of head and neck cancer patients. However, manual delineation for numerous OARs is time-consuming and laborious, even for expert oncologists. Moreover, manual delineation results are susceptible to high intra- and inter-variability. To this end, we propose a novel dual shape guided network (DSGnet) to automatically delineate nine important OARs in head and neck CT images. To deal with the large shape variation and unclear boundary of OARs in CT images, we represent the organ shape using an organ-specific unilateral inverse-distance map (UIDM) and guide the segmentation task from two different perspectives: direct shape guidance by following the segmentation prediction and across shape guidance by sharing the segmentation feature. In the direct shape guidance, the segmentation prediction is not only supervised by the true label mask, but also by the true UIDM, which is implemented through a simple yet effective encoder-decoder mapping from the label space to the distance space. In the across shape guidance, UIDM is used to facilitate the segmentation by optimizing the shared feature maps. For the experiments, we build a large head and neck CT dataset with a total of 699 images from different volunteers, and conduct comprehensive experiments and comparisons with other state-of-the-art methods to justify the effectiveness and efficiency of our proposed method.
The overall Dice Similarity Coefficient (DSC) value of 0.842 across the nine important OARs demonstrates great potential applications in improving the delineation quality and reducing the time cost.

Submitted 23 October, 2021; originally announced October 2021.

arXiv:2106.14143 [pdf, ps, other]

Sparse Control Synthesis for Uncertain Responsive Loads with Stochastic Stability Guarantees

Authors: Sai Pushpak Nandanoori, Soumya Kundu, Jianming Lian, Umesh Vaidya, Draguna Vrabie, Karanjit Kalsi

Abstract: Recent studies have demonstrated the potential of flexible loads in providing frequency response services. However, uncertainty and variability in various weather-related and end-use behavioral factors often affect the demand-side control performance. This work addresses this problem with the design of a demand-side control to achieve frequency response under load uncertainties. Our approach involves modeling the load uncertainties via stochastic processes that appear as both multiplicative and additive to the system states in closed-loop power system dynamics. Extending the recently developed mean square exponential stability (MSES) results for stochastic systems, we formulate multi-objective linear matrix inequality (LMI)-based optimal control synthesis problems to not only guarantee stochastic stability, but also promote sparsity, enhance closed-loop transient performance, and maximize allowable uncertainties. The fundamental trade-off between the maximum allowable (critical) uncertainty levels and the optimal stochastic stabilizing control efforts is established. Moreover, the sparse control synthesis problem is generalized to the realistic power systems scenario in which only partial-state measurements are available. Detailed numerical studies are carried out on the IEEE 39-bus system to demonstrate the closed-loop stochastic stabilizing performance of the sparse controllers in enhancing frequency response under load uncertainties, as well as to illustrate the fundamental trade-off between the allowable uncertainties and optimal control efforts.

Submitted 27 June, 2021; originally announced June 2021.

Comments: accepted for publication at the IEEE Transactions on Power Systems

Report number: PNNL-SA-156076

arXiv:2104.10326 [pdf, other]

A Structure-Aware Relation Network for Thoracic Diseases Detection and Segmentation

Authors: Jie Lian, Jingyu Liu, Shu Zhang, Kai Gao, Xiaoqing Liu, Dingwen Zhang, Yizhou Yu

Abstract: Instance level detection and segmentation of thoracic diseases or abnormalities are crucial for automatic diagnosis in chest X-ray images. Leveraging on constant structure and disease relations extracted from domain knowledge, we propose a structure-aware relation network (SAR-Net) extending Mask R-CNN. The SAR-Net consists of three relation modules: 1. the anatomical structure relation module encoding spatial relations between diseases and anatomical parts. 2. the contextual relation module aggregating clues based on query-key pair of disease RoI and lung fields. 3. the disease relation module propagating co-occurrence and causal relations into disease proposals. Towards making a practical system, we also provide ChestX-Det, a chest X-Ray dataset with instance-level annotations (boxes and masks). ChestX-Det is a subset of the public dataset NIH ChestX-ray14. It contains ~3500 images of 13 common disease categories labeled by three board-certified radiologists. We evaluate our SAR-Net on it and another dataset DR-Private. Experimental results show that it can enhance the strong baseline of Mask R-CNN with significant improvements. The ChestX-Det is released at https://github.com/Deepwise-AILab/ChestX-Det-Dataset.

Submitted 20 April, 2021; originally announced April 2021.

Comments: This paper has been accepted by IEEE Transactions on Medical Imaging

arXiv:2011.04491 [pdf, other]

doi 10.21437/Interspeech.2021-2190

Masked Proxy Loss For Text-Independent Speaker Verification

Authors: Jiachen Lian, Aiswarya Vinod Kumar, Hira Dhamyal, Bhiksha Raj, Rita Singh

Abstract: Open-set speaker recognition can be regarded as a metric learning problem, which is to maximize inter-class variance and minimize intra-class variance. Supervised metric learning can be categorized into entity-based learning and proxy-based learning. Most of the existing metric learning objectives, such as Contrastive, Triplet, Prototypical, and GE2E, belong to the former division, the performance of which is either highly dependent on sample mining strategy or restricted by insufficient label information in the mini-batch. Proxy-based losses mitigate both shortcomings; however, fine-grained connections among entities are either not leveraged or leveraged only indirectly. This paper proposes a Masked Proxy (MP) loss which directly incorporates both proxy-based relationships and pair-based relationships. We further propose Multinomial Masked Proxy (MMP) loss to leverage the hardness of speaker pairs. These methods are evaluated on the VoxCeleb test set and reach state-of-the-art Equal Error Rate (EER).

Submitted 24 June, 2021; v1 submitted 9 November, 2020; originally announced November 2020.

Comments: Accepted at Interspeech 2021

arXiv:2011.03689 [pdf, other]

Detection and Evaluation of human and machine generated speech in spoofing attacks on automatic speaker verification systems

Authors: Yang Gao, Jiachen Lian, Bhiksha Raj, Rita Singh

Abstract: Automatic speaker verification (ASV) systems utilize the biometric information in human speech to verify the speaker's identity. The techniques used for performing speaker verification are often vulnerable to malicious attacks that attempt to induce the ASV system to return wrong results, allowing an impostor to bypass the system and gain access. Attackers use a multitude of spoofing techniques for this, such as voice conversion, audio replay, speech synthesis, etc. In recent years, easily available tools to generate deepfaked audio have increased the potential threat to ASV systems. In this paper, we compare the potential of human impersonation (voice disguise) based attacks with attacks based on machine-generated speech, on black-box and white-box ASV systems. We also study countermeasures by using features that capture the unique aspects of human speech production, under the hypothesis that machines cannot emulate many of the fine-level intricacies of the human speech production mechanism. We show that fundamental frequency sequence-related entropy, spectral envelope, and aperiodic parameters are promising candidates for robust detection of deepfaked speech generated by unknown methods.

Submitted 24 November, 2020; v1 submitted 6 November, 2020; originally announced November 2020.

Comments: 6 pages excluding references. Paper accepted by IEEE Spoken Language Technology (SLT) 2021

arXiv:2010.10298 [pdf]

The Detection of Thoracic Abnormalities ChestX-Det10 Challenge Results

Authors: Jie Lian, Jingyu Liu, Yizhou Yu, Mengyuan Ding, Yaoci Lu, Yi Lu, Jie Cai, Deshou Lin, Miao Zhang, Zhe Wang, Kai He, Yijie Yu

Abstract: The detection of thoracic abnormalities challenge is organized by the Deepwise AI Lab. The challenge is divided into two rounds. In this paper, we present the results of the 6 teams that reached the second round. The challenge adopts the ChestX-Det10 dataset proposed by the Deepwise AI Lab. ChestX-Det10 is the first chest X-Ray dataset with instance-level annotations, including 10 categories of disease/abnormality of 3,543 images. The annotations are located at https://github.com/Deepwise-AILab/ChestX-Det10-Dataset. In the challenge, we randomly split all data into 3001 images for training and 542 images for testing.

Submitted 21 October, 2020; v1 submitted 19 October, 2020; originally announced October 2020.

arXiv:2008.00152 [pdf, other]

Transactive Energy System Deployment over Insecure Communication Links

Authors: Yang Lu, Jianming Lian, Minghui Zhu, Ke Ma

Abstract: In this paper, the privacy and security issues associated with the transactive energy system (TES) deployment over insecure communication links are addressed. In particular, it is ensured that (1) individual agents' bidding information is kept private throughout hierarchical market-based interactions; and (2) any extraneous data injection attack can be quickly and easily detected. An implementation framework is proposed to enable the cryptography-based enhancement of privacy and security for the deployment of any general hierarchical systems including TESs. Under the proposed framework, a unified cryptography-based approach is developed to achieve both privacy and security simultaneously. Specifically, privacy preservation is realized by an enhanced Paillier encryption scheme, where a block design is proposed to significantly improve computational efficiency. Attack detection is further achieved by an enhanced Paillier digital signature scheme, where a stamp-concatenation mechanism is proposed to enable detection of data replace and reorder attacks. Simulation results verify the effectiveness of the proposed cyber-resilient design for transactive energy systems.

Submitted 16 October, 2021; v1 submitted 31 July, 2020; originally announced August 2020.

Comments: 10 pages, 6 figures, journal submission

arXiv:2007.09770 [pdf, other]

Multi-stage Power Scheduling Framework for Data Center with Chilled Water Storage in Energy and Regulation Markets

Authors: Yangyang Fu, Xu Han, Jessica Stershic, Wangda Zuo, Kyri Baker, Jianming Lian

Abstract: Leveraging electrochemical and thermal energy storage systems has been proposed as a strategy to reduce peak power in data centers. Thermal energy storage systems, such as chilled water tanks, have gained increasing attention in data centers for load shifting due to their relatively small capital and operational costs compared to electrochemical energy storage. However, there are few studies investigating the possibility of utilizing thermal energy storage system with resources to provide ancillary services (e.g., frequency regulation) to the grid. This paper proposes a synergistic control strategy for the data center with a chilled water storage providing frequency regulation service by adjusting the chiller capacity, storage charging rate, and IT server CPU frequency. Then, a three-stage multi-market scheduling framework based on a model predictive control scheme is developed to minimize operational costs of data centers participating in both energy and regulation markets. The framework solves a power baseline scheduling problem, a regulation reserve problem, and a real-time power signal tracking problem sequentially. Simulation results show that utilizing the thermal energy storage can increase the regulation capacity bid, reduce energy costs and demand charges, and also harvest frequency regulation revenues. The proposed multi-market scheduling framework in a span of two days can reduce the operational costs up to 8.8% ($1,606.4) compared to the baseline with 0.2% ($38.7) energy cost reduction, 6.5% ($1,179.4) from demand reduction, and 2.1% ($338.3) from regulation revenues.

Submitted 19 July, 2020; originally announced July 2020.

arXiv:2006.10550 [pdf]

ChestX-Det10: Chest X-ray Dataset on Detection of Thoracic Abnormalities

Authors: Jingyu Liu, Jie Lian, Yizhou Yu

Abstract: Instance level detection of thoracic diseases or abnormalities is crucial for automatic diagnosis in chest X-ray images. Most existing works on chest X-rays focus on disease classification and weakly supervised localization. In order to push forward the research on disease classification and localization on chest X-rays, we provide a new benchmark called ChestX-Det10, including box-level annotations of 10 categories of disease/abnormality of ~3,500 images. The annotations are located at https://github.com/Deepwise-AILab/ChestX-Det10-Dataset.

Submitted 19 October, 2020; v1 submitted 17 June, 2020; originally announced June 2020.

arXiv:1701.02036 [pdf, other]

Decentralized Robust Control for Damping Inter-area Oscillations in Power Systems

Authors: Jianming Lian, Shaobu Wang, Ruisheng Diao, Zhenyu Huang

Abstract: As power systems become more and more interconnected, inter-area oscillations have become a serious factor limiting large power transfer among different areas. Underdamped (undamped) inter-area oscillations may cause system breakup and even lead to large-scale blackout. Traditional damping controllers include Power System Stabilizer (PSS) and Flexible AC Transmission System (FACTS) controllers, which add additional damping to the inter-area oscillation modes by affecting the real power in an indirect manner. However, the effectiveness of these controllers is restricted to the neighborhood of a prescribed set of operating conditions. In this paper, decentralized robust controllers are developed to improve the damping ratios of the inter-area oscillation modes by directly affecting the real power through the turbine governing system. The proposed control strategy requires only local signals and is robust to the variations in operation condition and system topology. The effectiveness of the proposed robust controllers is illustrated by detailed case studies on two different test systems.

Submitted 8 January, 2017; originally announced January 2017.

arXiv:1510.05071 [pdf, other]

doi 10.1109/TAC.2019.2962356

Distributed Robust Adaptive Frequency Control of Power Systems with Dynamic Loads

Authors: Hunmin Kim, Minghui Zhu, Jianming Lian

Abstract: This paper investigates the frequency control of multi-machine power systems subject to uncertain and dynamic net loads. We propose distributed internal model controllers that coordinate synchronous generators and demand response to tackle the unpredictable nature of net loads. Frequency stability is formally guaranteed via Lyapunov analysis. Numerical simulations on the IEEE 68-bus test system demonstrate the effectiveness of the controllers.

Submitted 8 January, 2020; v1 submitted 17 October, 2015; originally announced October 2015.

Comments: Published in the IEEE Transactions on Automatic Control

Showing 1–27 of 27 results for author: Lian, J