
Showing 1–50 of 116 results for author: Tang, H

Searching in archive eess.
  1. arXiv:2406.05464  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models

    Authors: Tzu-Quan Lin, Hung-yi Lee, Hao Tang

    Abstract: Self-supervised speech models have been shown to be useful for various tasks, but their large size limits the use in devices with low computing power and memory. In this work, we explore early exit, an approach for reducing latency by exiting the forward process of a network early. Most approaches of early exit need a separate early exit model for each task, with some even requiring fine-tuning of the…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  2. arXiv:2405.15927  [pdf]

    eess.SP cs.NE eess.SY

    Application based Evaluation of an Efficient Spike-Encoder, "Spiketrum"

    Authors: MHD Anas Alsakkal, Runze Wang, Jayawan Wijekoon, Hua** Tang

    Abstract: Spike-based encoders represent information as sequences of spikes or pulses, which are transmitted between neurons. A prevailing consensus suggests that spike-based approaches demonstrate exceptional capabilities in capturing the temporal dynamics of neural activity and have the potential to provide energy-efficient solutions for low-power applications. The Spiketrum encoder efficiently compresses…

    Submitted 31 May, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: To be published at "IEEE/ACM Transactions on Audio, Speech, and Language Processing"

  3. arXiv:2405.10691  [pdf, other]

    eess.IV cs.CV

    LoCI-DiffCom: Longitudinal Consistency-Informed Diffusion Model for 3D Infant Brain Image Completion

    Authors: Zihao Zhu, Tianli Tao, Yitian Tao, Haowen Deng, Xinyi Cai, Gaofeng Wu, Kaidong Wang, Haifeng Tang, Lixuan Zhu, Zhuoyang Gu, Jiawei Huang, Dinggang Shen, Han Zhang

    Abstract: The infant brain undergoes rapid development in the first few years after birth. Compared to cross-sectional studies, longitudinal studies can depict the trajectories of infant brain development with higher accuracy, statistical power and flexibility. However, the collection of infant longitudinal magnetic resonance (MR) data suffers a notorious dropout problem, resulting in incomplete datasets wit…

    Submitted 17 May, 2024; originally announced May 2024.

  4. arXiv:2405.08237  [pdf, other]

    cs.CL cs.SD eess.AS

    A predictive learning model can simulate temporal dynamics and context effects found in neural representations of continuous speech

    Authors: Oli Danyi Liu, Hao Tang, Naomi Feldman, Sharon Goldwater

    Abstract: Speech perception involves storing and integrating sequentially presented items. Recent work in cognitive neuroscience has identified temporal and contextual characteristics in humans' neural encoding of speech that may facilitate this temporal processing. In this study, we simulated similar analyses with representations extracted from a computational model that was trained on unlabelled speech wi…

    Submitted 13 May, 2024; originally announced May 2024.

    Comments: Accepted to CogSci 2024

  5. arXiv:2404.16484  [pdf, other]

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, **shan Pan, Jiangxin Dong, **hui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi **, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu, et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  6. arXiv:2404.15349  [pdf, other]

    eess.SP cs.LG cs.MM

    A Survey on Multimodal Wearable Sensor-based Human Action Recognition

    Authors: Jianyuan Ni, Hao Tang, Syed Tousiful Haque, Yan Yan, Anne H. H. Ngu

    Abstract: The combination of increased life expectancy and falling birth rates is resulting in an aging population. Wearable Sensor-based Human Activity Recognition (WSHAR) emerges as a promising assistive technology to support the daily lives of older individuals, unlocking vast potential for human-centric applications. However, recent surveys in WSHAR have been limited, focusing either solely on deep lear…

    Submitted 14 April, 2024; originally announced April 2024.

    Comments: Multimodal Survey for Wearable Sensor-based Human Action Recognition

  7. arXiv:2404.11152  [pdf, other]

    eess.IV cs.CV

    Multi-target and multi-stage liver lesion segmentation and detection in multi-phase computed tomography scans

    Authors: Abdullah F. Al-Battal, Soan T. M. Duong, Van Ha Tang, Quang Duc Tran, Steven Q. H. Truong, Chien Phan, Truong Q. Nguyen, Cheolhong An

    Abstract: Multi-phase computed tomography (CT) scans use contrast agents to highlight different anatomical structures within the body to improve the probability of identifying and detecting anatomical structures of interest and abnormalities such as liver lesions. Yet, detecting these lesions remains a challenging task as these lesions vary significantly in their size, shape, texture, and contrast with resp…

    Submitted 17 April, 2024; originally announced April 2024.

  8. arXiv:2403.19983  [pdf, other]

    eess.IV cs.CV

    A multi-stage semi-supervised learning for ankle fracture classification on CT images

    Authors: Hongzhi Liu, Guicheng Li, Jiacheng Nie, Hui Tang, Chunfeng Yang, Qian** Feng, Hailin Xu, Yang Chen

    Abstract: Because of the complicated mechanism of ankle injury, it is very difficult to diagnose ankle fracture in clinic. In order to simplify the process of fracture diagnosis, an automatic diagnosis model of ankle fracture was proposed. Firstly, a tibia-fibula segmentation network is proposed for the joint tibiofibular region of the ankle joint, and the corresponding segmentation dataset is established o…

    Submitted 29 March, 2024; originally announced March 2024.

  9. arXiv:2403.17701   

    eess.IV cs.CV cs.LG

    Rotate to Scan: UNet-like Mamba with Triplet SSM Module for Medical Image Segmentation

    Authors: Hao Tang, Lianglun Cheng, Guoheng Huang, Zhengguang Tan, Junhao Lu, Kaihong Wu

    Abstract: Image segmentation holds a vital position in the realms of diagnosis and treatment within the medical domain. Traditional convolutional neural networks (CNNs) and Transformer models have made significant advancements in this realm, but they still encounter challenges because of limited receptive field or high computing complexity. Recently, State Space Models (SSMs), particularly Mamba and its var…

    Submitted 3 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: Experimental method encountered errors, undergoing experiment again

  10. arXiv:2403.10585  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient Viewpoint

    Authors: Haoyue Tang, Tian Xie, Aosong Feng, Hanyu Wang, Chenyang Zhang, Yang Bai

    Abstract: Solving image inverse problems (e.g., super-resolution and inpainting) requires generating a high fidelity image that matches the given input (the low-resolution image or the masked image). By using the input image as guidance, we can leverage a pretrained diffusion generative model to solve a wide range of image inverse tasks without task specific model fine-tuning. To precisely estimate the guid…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Accepted and to Appear, AISTATS 2024

  11. arXiv:2403.00379  [pdf, other]

    eess.AS cs.SD

    The Impact of Frequency Bands on Acoustic Anomaly Detection of Machines using Deep Learning Based Model

    Authors: Tin Nguyen, Lam Pham, Phat Lam, Dat Ngo, Hieu Tang, Alexander Schindler

    Abstract: In this paper, we propose a deep learning based model for Acoustic Anomaly Detection of Machines, the task for detecting abnormal machines by analysing the machine sound. By conducting extensive experiments, we indicate that multiple techniques of pseudo audios, audio segment, data augmentation, Mahalanobis distance, and narrow frequency bands, which mainly focus on feature engineering, are effect…

    Submitted 1 March, 2024; originally announced March 2024.

  12. arXiv:2402.14349  [pdf, other]

    eess.IV cs.CV cs.LG

    Uncertainty-driven and Adversarial Calibration Learning for Epicardial Adipose Tissue Segmentation

    Authors: Kai Zhao, Zhiming Liu, Jiaqi Liu, **gbiao Zhou, Bihong Liao, Huifang Tang, Qiuyu Wang, Chunquan Li

    Abstract: Epicardial adipose tissue (EAT) is a type of visceral fat that can secrete large amounts of adipokines to affect the myocardium and coronary arteries. EAT volume and density can be used as independent risk markers. Measurement of volume by noninvasive magnetic resonance images is the best method of assessing EAT. However, segmenting EAT is challenging due to the low contrast between EAT and pericar…

    Submitted 23 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: 13 pages, 7 figures

  13. arXiv:2402.13776  [pdf, other]

    eess.IV cs.CV cs.LG

    Cas-DiffCom: Cascaded diffusion model for infant longitudinal super-resolution 3D medical image completion

    Authors: Lianghu Guo, Tianli Tao, Xinyi Cai, Zihao Zhu, Jiawei Huang, Lixuan Zhu, Zhuoyang Gu, Haifeng Tang, Rui Zhou, Siyan Han, Yan Liang, Qing Yang, Dinggang Shen, Han Zhang

    Abstract: Early infancy is a rapid and dynamic neurodevelopmental period for behavior and neurocognition. Longitudinal magnetic resonance imaging (MRI) is an effective tool to investigate such a crucial stage by capturing the developmental trajectories of the brain structures. However, longitudinal MRI acquisition always meets a serious data-missing problem due to participant dropout and failed scans, makin…

    Submitted 21 February, 2024; originally announced February 2024.

  14. arXiv:2402.09372  [pdf, other]

    eess.IV cs.AI cs.CV

    Deep Rib Fracture Instance Segmentation and Classification from CT on the RibFrac Challenge

    Authors: Jiancheng Yang, Rui Shi, Liang **, Xiaoyang Huang, Kaiming Kuang, Donglai Wei, Shixuan Gu, Jianying Liu, Pengfei Liu, Zhizhong Chai, Yongjie Xiao, Hao Chen, Liming Xu, Bang Du, Xiangyi Yan, Hao Tang, Adam Alessio, Gregory Holste, Jiapeng Zhang, Xiaoming Wang, Jianye He, Lixuan Che, Hanspeter Pfister, Ming Li, Bingbing Ni

    Abstract: Rib fractures are a common and potentially severe injury that can be challenging and labor-intensive to detect in CT scans. While there have been efforts to address this field, the lack of large-scale annotated datasets and evaluation benchmarks has hindered the development and validation of deep learning algorithms. To address this issue, the RibFrac Challenge was introduced, providing a benchmar…

    Submitted 14 February, 2024; originally announced February 2024.

    Comments: Challenge paper for MICCAI RibFrac Challenge (https://ribfrac.grand-challenge.org/)

  15. arXiv:2401.08166  [pdf, other]

    eess.AS cs.SD

    ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis

    Authors: Haobin Tang, Xulong Zhang, Ning Cheng, **g Xiao, Jianzong Wang

    Abstract: Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our p…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024)

  16. arXiv:2401.08096  [pdf, other]

    cs.SD eess.AS

    Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval

    Authors: Yimin Deng, Huaizhen Tang, Xulong Zhang, Ning Cheng, **g Xiao, Jianzong Wang

    Abstract: Voice conversion refers to transferring speaker identity with well-preserved content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from input audio has the potential ability to well represent content. Besides, the speaker-style modeling with pre-trained models makes the process more complex. To tackle these…

    Submitted 17 January, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

    Comments: Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024)

  17. 3DGAUnet: 3D generative adversarial networks with a 3D U-Net based generator to achieve the accurate and effective synthesis of clinical tumor image data for pancreatic cancer

    Authors: Yu Shi, Hannah Tang, Michael Baine, Michael A. Hollingsworth, Hui**g Du, Dandan Zheng, Chi Zhang, Hongfeng Yu

    Abstract: Pancreatic ductal adenocarcinoma (PDAC) presents a critical global health challenge, and early detection is crucial for improving the 5-year survival rate. Recent medical imaging and computational algorithm advances offer potential solutions for early diagnosis. Deep learning, particularly in the form of convolutional neural networks (CNNs), has demonstrated success in medical image analysis tasks…

    Submitted 27 November, 2023; v1 submitted 9 November, 2023; originally announced November 2023.

    Comments: Published on Cancers: Shi, Yu, Hannah Tang, Michael J. Baine, Michael A. Hollingsworth, Hui**g Du, Dandan Zheng, Chi Zhang, and Hongfeng Yu. 2023. "3DGAUnet: 3D Generative Adversarial Networks with a 3D U-Net Based Generator to Achieve the Accurate and Effective Synthesis of Clinical Tumor Image Data for Pancreatic Cancer" Cancers 15, no. 23: 5496

  18. arXiv:2311.00932  [pdf, other]

    cs.CV eess.IV

    Towards High-quality HDR Deghosting with Conditional Diffusion Models

    Authors: Qingsen Yan, Tao Hu, Yuan Sun, Hao Tang, Yu Zhu, Wei Dong, Luc Van Gool, Yanning Zhang

    Abstract: High Dynamic Range (HDR) images can be recovered from several Low Dynamic Range (LDR) images by existing Deep Neural Networks (DNNs) techniques. Despite the remarkable progress, DNN-based methods still generate ghosting artifacts when LDR images have saturation and large motion, which hinders potential applications in real-world scenarios. To address this challenge, we formulate the HDR deghosting…

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: accepted by IEEE TCSVT

  19. arXiv:2310.20275  [pdf, other]

    cs.NI eess.SP

    Age Optimum Sampling in Non-Stationary Environment

    Authors: **heng Zhang, Haoyue Tang, **tao Wang, Sastry Kompella, Leandros Tassiulas

    Abstract: In this work, we consider a status update system with a sensor and a receiver. The status update information is sampled by the sensor and then forwarded to the receiver through a channel with non-stationary delay distribution. The data freshness at the receiver is quantified by the Age-of-Information (AoI). The goal is to design an online sampling strategy that can minimize the average AoI when th…

    Submitted 31 October, 2023; originally announced October 2023.

  20. arXiv:2310.17558  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Towards Matching Phones and Speech Representations

    Authors: Gene-** Yang, Hao Tang

    Abstract: Learning phone types from phone instances has been a long-standing problem, while still being open. In this work, we revisit this problem in the context of self-supervised learning, and pose it as the problem of matching cluster centroids to phone embeddings. We study two key properties that enable matching, namely, whether cluster centroids of self-supervised representations reduce the variabilit…

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: Accepted to ASRU 2023

  21. arXiv:2310.09918  [pdf, other]

    eess.IV

    Pedestrian Accessible Infrastructure Inventory: Assessing Zero-Shot Segmentation on Multi-Mode Geospatial Data for All Pedestrian Types

    Authors: Jiahao Xia, Gavin Gong, Jiawei Liu, Zhigang Zhu, Hao Tang

    Abstract: In this paper, a Segment Anything Model (SAM)-based pedestrian infrastructure segmentation workflow is designed and optimized, which is capable of efficiently processing multi-sourced geospatial data including LiDAR data and satellite imagery data. We used an expanded definition of pedestrian infrastructure inventory which goes beyond the traditional transportation elements to include street furni…

    Submitted 27 November, 2023; v1 submitted 15 October, 2023; originally announced October 2023.

  22. Non-parametric Ensemble Empirical Mode Decomposition for extracting weak features to identify bearing defects

    Authors: Anil Kumar, Yaakoub Berrouche, Radosław Zimroz, Govind Vashishtha, Sumika Chauhan, C. P. Gandhi, Hesheng Tang, Jiawei Xiang

    Abstract: A non-parametric complementary ensemble empirical mode decomposition (NPCEEMD) is proposed for identifying bearing defects using weak features. NPCEEMD is non-parametric because, unlike existing decomposition methods such as ensemble empirical mode decomposition, it does not require defining the ideal SNR of noise and the number of ensembles, every time while processing the signals. The simulation…

    Submitted 2 October, 2023; v1 submitted 12 September, 2023; originally announced September 2023.

    Journal ref: Measurement 211, 112615 (2023)

  23. arXiv:2308.16573  [pdf, other]

    eess.IV cs.CV

    Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation for Semi-Supervised Medical Image Segmentation

    Authors: Yuanbin Chen, Tao Wang, Hui Tang, Longxuan Zhao, Ruige Zong, Shun Chen, Tao Tan, Xinlin Zhang, Tong Tong

    Abstract: While supervised learning has achieved remarkable success, obtaining large-scale labeled datasets in biomedical imaging is often impractical due to high costs and the time-consuming annotations required from radiologists. Semi-supervised learning emerges as an effective strategy to overcome this limitation by leveraging useful information from unlabeled datasets. In this paper, we present a novel…

    Submitted 18 January, 2024; v1 submitted 31 August, 2023; originally announced August 2023.

  24. arXiv:2308.14638  [pdf, other]

    eess.AS cs.SD

    The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge

    Authors: Ruoyu Wang, Maokui He, Jun Du, Hengshun Zhou, Shutong Niu, Hang Chen, Yanyan Yue, Gaobin Yang, Shilong Wu, Lei Sun, Yanhui Tu, Haitao Tang, Shuangqing Qian, Tian Gao, Mengzhi Wang, Genshun Wan, Jia Pan, Jianqing Gao, Chin-Hui Lee

    Abstract: This technical report details our submission system to the CHiME-7 DASR Challenge, which focuses on speaker diarization and speech recognition under complex multi-speaker scenarios. Additionally, it also evaluates the efficiency of systems in handling diverse array devices. To address these issues, we implemented an end-to-end speaker diarization system and introduced a rectification strategy base…

    Submitted 10 October, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: Accepted by 2023 CHiME Workshop, Oral

  25. PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

    Authors: Yimin Deng, Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, **g Xiao

    Abstract: Voice conversion, as the style transfer task applied to speech, refers to converting one person's speech into a new speech that sounds like another person's. Up to now, there has been a lot of research devoted to better implementation of VC tasks. However, a good voice conversion model should not only match the timbre information of the target speaker, but also expressive information such as prosod…

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: Accepted by the 31st ACM International Conference on Multimedia (MM2023)

  26. arXiv:2307.08688  [pdf, other]

    eess.AS

    Semi-supervised multi-channel speaker diarization with cross-channel attention

    Authors: Shilong Wu, Jun Du, Maokui He, Shutong Niu, Hang Chen, Haitao Tang, Chin-Hui Lee

    Abstract: Most neural speaker diarization systems rely on sufficient manual training data labels, which are hard to collect under real-world scenarios. This paper proposes a semi-supervised speaker diarization system to utilize large-scale multi-channel training data by generating pseudo-labels for unlabeled data. Furthermore, we introduce cross-channel attention into the Neural Speaker Diarization Using Me…

    Submitted 17 July, 2023; originally announced July 2023.

    Comments: 8 pages, 3 figures

  27. arXiv:2306.02153  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    Acoustic Word Embeddings for Untranscribed Target Languages with Continued Pretraining and Learned Pooling

    Authors: Ramon Sanabria, Ondrej Klejch, Hao Tang, Sharon Goldwater

    Abstract: Acoustic word embeddings are typically created by training a pooling function using pairs of word-like units. For unsupervised systems, these are mined using k-nearest neighbor (KNN) search, which is slow. Recently, mean-pooled representations from a pre-trained self-supervised English model were suggested as a promising alternative, but their performance on target languages was not fully competit…

    Submitted 3 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech 2023

  28. arXiv:2306.00648  [pdf, other]

    cs.SD eess.AS

    EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

    Authors: Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, **g Xiao

    Abstract: There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on the synthesis of a limited number of emotion types and have achieved unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with specified intensity or a mixture of emo…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted by 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)

  29. arXiv:2304.11547  [pdf, other]

    cs.SD eess.AS

    SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

    Authors: Jianzong Wang, Xulong Zhang, Haobin Tang, Aolan Sun, Ning Cheng, **g Xiao

    Abstract: In recent Text-to-Speech (TTS) systems, a neural vocoder often generates speech samples by solely conditioning on acoustic features predicted from an acoustic model. However, there are always distortions existing in the predicted acoustic features, compared to those of the groundtruth, especially in the common case of poor acoustic modeling due to low-quality training data. To overcome such limits…

    Submitted 23 April, 2023; originally announced April 2023.

    Comments: Accepted by IJCNN2023. 2023 International Joint Conference on Neural Networks (IJCNN2023)

  30. arXiv:2303.13072  [pdf, other]

    cs.SD cs.CL eess.AS

    Beyond Universal Transformer: block reusing with adaptor in Transformer for automatic speech recognition

    Authors: Haoyu Tang, Zhaoyi Liu, Chang Zeng, Xinfeng Li

    Abstract: Transformer-based models have recently made significant achievements in the application of end-to-end (E2E) automatic speech recognition (ASR). It is possible to deploy the E2E ASR system on smart devices with the help of Transformer-based models. However, these models still have the disadvantage of requiring a large number of model parameters. To overcome the drawback of universal Transformer models…

    Submitted 5 April, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

  31. arXiv:2303.07687  [pdf, other]

    cs.SD cs.CL eess.AS

    Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy

    Authors: Xulong Zhang, Haobin Tang, Jianzong Wang, Ning Cheng, Jian Luo, **g Xiao

    Abstract: Because they predict all the target tokens in parallel, non-autoregressive models greatly improve the decoding efficiency of speech recognition compared with traditional autoregressive models. In this work, we present dynamic alignment Mask CTC, introducing two methods: (1) Aligned Cross Entropy (AXE), finding the monotonic alignment that minimizes the cross-entropy loss through dynamic progr…

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023

  32. arXiv:2303.07682  [pdf, other]

    cs.SD cs.CL eess.AS

    QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis

    Authors: Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, **g Xiao

    Abstract: Recent expressive text to speech (TTS) models focus on synthesizing emotional speech, but some fine-grained styles such as intonation are neglected. In this paper, we propose QI-TTS which aims to better transfer and control intonation to further deliver the speaker's questioning intention while transferring emotion from reference speech. We propose a multi-style extractor to extract style embeddin…

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023

  33. arXiv:2302.12428  [pdf]

    eess.SY

    A holistically 3D-printed flexible millimeter-wave Doppler radar: Towards fully printed high-frequency multilayer flexible hybrid electronics systems

    Authors: Hong Tang, Yingjie Zhang, Bowen Zheng, Sensong An, Mohammad Haerinia, Yunxi Dong, Yi Huang, Wei Guo, Hualiang Zhang

    Abstract: Flexible hybrid electronics (FHE) is an emerging technology enabled through the integration of advanced semiconductor devices and 3D printing technology. It unlocks tremendous market potential by realizing low-cost flexible circuits and systems that can be conformally integrated into various applications. However, the operating frequencies of most reported FHE systems are relatively low. It is als…

    Submitted 23 February, 2023; originally announced February 2023.

    MSC Class: 78-05

  34. arXiv:2212.06557  [pdf, ps, other]

    eess.SP

    A Data Quality Assessment Framework for AI-enabled Wireless Communication

    Authors: Hanning Tang, Liusha Yang, Rui Zhou, **g Liang, Hong Wei, Xuan Wang, Qingjiang Shi, Zhi-Quan Luo

    Abstract: Using artificial intelligence (AI) to re-design and enhance the current wireless communication system is a promising pathway for the future sixth-generation (6G) wireless network. The performance of AI-enabled wireless communication depends heavily on the quality of wireless air-interface data. Although there are various approaches to data quality assessment (DQA) for different applications, none h…

    Submitted 13 December, 2022; originally announced December 2022.

  35. arXiv:2211.09949  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Compressing Transformer-based self-supervised models for speech processing

    Authors: Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen, Tzu-hsun Feng, Hung-yi Lee, Hao Tang

    Abstract: Despite the success of Transformers in self-supervised learning with applications to various downstream tasks, the computational cost of training and inference remains a major challenge for applying these models to a wide spectrum of devices. Several isolated attempts have been made to compress Transformers, but the settings and metrics are different across studies. Trade-off at various compressi…

    Submitted 26 January, 2024; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: Submitted to IEEE Transactions on Audio, Speech and Language Processing (TASLP)

  36. arXiv:2211.09944  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    MelHuBERT: A simplified HuBERT on Mel spectrograms

    Authors: Tzu-Quan Lin, Hung-yi Lee, Hao Tang

    Abstract: Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks. However, most self-supervised models require a large amount of compute and multiple GPUs to train, significantly hampering the development of self-supervised learning. In an attempt to reduce the computation of training, we revisit the training of HuBERT, a highly succe…

    Submitted 27 October, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: ASRU 2023

  37. arXiv:2210.17113  [pdf, ps, other]

    eess.SP

    Lightweight Neural Network with Knowledge Distillation for CSI Feedback

    Authors: Yiming Cui, Jiajia Guo, Zheng Cao, Huaze Tang, Chao-Kai Wen, Shi **, Xin Wang, Xiaolin Hou

    Abstract: Deep learning has shown promise in enhancing channel state information (CSI) feedback. However, many studies indicate that better feedback performance often accompanies higher computational complexity. Pursuing better performance-complexity tradeoffs is crucial to facilitate practical deployment, especially on computation-limited devices, which may have to use lightweight autoencoder with unfavora…

    Submitted 3 March, 2024; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: 13 pages, 5 figures

  38. arXiv:2210.16428  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

    Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sound…

    Submitted 28 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: INTERSPEECH 2023

  39. arXiv:2210.16043  [pdf, other]

    cs.CL cs.SD eess.AS

    Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised Speech Models

    Authors: Ramon Sanabria, Hao Tang, Sharon Goldwater

    Abstract: Given the strong results of self-supervised models on various tasks, there have been surprisingly few studies exploring self-supervised representations for acoustic word embeddings (AWE), fixed-dimensional vectors representing variable-length spoken word segments. In this work, we study several pre-trained models and pooling methods for constructing AWEs with self-supervised representations. Owing…

    Submitted 14 March, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE ICASSP 2023

  40. arXiv:2210.15793  [pdf, ps, other]

    eess.AS cs.SD eess.SP

    Conditioning and Sampling in Variational Diffusion Models for Speech Super-Resolution

    Authors: Chin-Yun Yu, Sung-Lin Yeh, György Fazekas, Hao Tang

    Abstract: Recently, diffusion models (DMs) have been increasingly used in audio processing tasks, including speech super-resolution (SR), which aims to restore high-frequency content given low-resolution speech utterances. This is commonly achieved by conditioning the network of noise predictor with low-resolution audio. In this paper, we propose a novel sampling algorithm that communicates the information…

    Submitted 24 November, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  41. arXiv:2210.15399  [pdf, other]

    eess.SP

    Joint Communication and Computation Design in Transmissive RMS Transceiver Enabled Multi-Tier Computing Networks

    Authors: Zhendong Li, Wen Chen, Ziwei Liu, Hongying Tang, Jianmin Lu

    Abstract: In this paper, a novel transmissive reconfigurable meta-surface (RMS) transceiver enabled multi-tier computing network architecture is proposed for improving computing capability, decreasing computing delay and reducing base station (BS) deployment cost, in which transmissive RMS equipped with a feed antenna can be regarded as a new type of multi-antenna system. We formulate a total energy consump…

    Submitted 27 October, 2022; originally announced October 2022.

  42. arXiv:2210.07189  [pdf, other]

    cs.CL cs.SD eess.AS

    On Compressing Sequences for Self-Supervised Speech Models

    Authors: Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-yi Lee, Hao Tang

    Abstract: Compressing self-supervised models has become increasingly necessary, as self-supervised models become larger. While previous approaches have primarily focused on compressing the model size, shortening sequences is also effective in reducing the computational cost. In this work, we study fixed-length and variable-length subsampling along the time axis in self-supervised learning. We explore how in…

    Submitted 25 October, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  43. arXiv:2208.08757  [pdf, other]

    eess.AS cs.LG cs.SD

    Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

    Authors: SiCheng Yang, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu, Aolan Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang, Helen Meng

    Abstract: One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic. Existing works generally disentangle timbre, while information about pitch, rhythm and content is still mixed together. To perform one-shot VC effectively with further disentangling these speech components, we employ random resampling for pitch and content encoder and use the va…

    Submitted 18 August, 2022; originally announced August 2022.

    Comments: 5 pages, 5 figures, INTERSPEECH 2022

  44. arXiv:2208.05163  [pdf, other]

    cs.CV cs.LG eess.IV

    Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization

    Authors: Zhengang Li, Mengshu Sun, Alec Lu, Haoyu Ma, Geng Yuan, Yanyue Xie, Hao Tang, Yanyu Li, Miriam Leeser, Zhangyang Wang, Xue Lin, Zhenman Fang

    Abstract: Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demand impose urgent needs for new hardware accelerator design methodology. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, thi…

    Submitted 10 August, 2022; originally announced August 2022.

    Comments: Published in FPL2022

  45. TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training

    Authors: Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Zhen Zeng, Edward Xiao, Jing Xiao

    Abstract: Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Recently, AutoVC, a conditional autoencoder based method, achieved excellent conversion results by disentangling the speaker identity and the speech content using information-constraining bottlenecks. However, due to the pure autoencoder training method, it is difficult to evaluate the separat…

    Submitted 8 August, 2022; originally announced August 2022.

    Comments: ASRU 6 pages

    Journal ref: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 938-945

  46. arXiv:2207.12577  [pdf, other]

    cs.CV cs.AR cs.LG eess.IV

    Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution

    Authors: Yushu Wu, Yifan Gong, Pu Zhao, Yanyu Li, Zheng Zhan, Wei Niu, Hao Tang, Minghai Qin, Bin Ren, Yanzhi Wang

    Abstract: Deep learning-based super-resolution (SR) has gained tremendous popularity in recent years because of its high image quality performance and wide application scenarios. However, prior methods typically suffer from large amounts of computations and huge power consumption, causing difficulties for real-time inference, especially on resource-limited platforms such as mobile devices. To mitigate this,…

    Submitted 25 July, 2022; originally announced July 2022.

  47. arXiv:2207.08265   

    eess.IV cs.CV

    MLP-GAN for Brain Vessel Image Segmentation

    Authors: Bin Xie, Hao Tang, Bin Duan, Dawen Cai, Yan Yan

    Abstract: Brain vessel image segmentation can be used as a promising biomarker for better prevention and treatment of different diseases. One successful approach is to consider the segmentation as an image-to-image translation task and perform a conditional Generative Adversarial Network (cGAN) to learn a transformation between two distributions. In this paper, we present a novel multi-view approach, MLP-GA…

    Submitted 26 October, 2022; v1 submitted 17 July, 2022; originally announced July 2022.

    Comments: Resubmit a conference

  48. arXiv:2206.01244  [pdf, other]

    cs.CV eess.IV

    Real-Time Portrait Stylization on the Edge

    Authors: Yanyu Li, Xuan Shen, Geng Yuan, Jiexiong Guan, Wei Niu, Hao Tang, Bin Ren, Yanzhi Wang

    Abstract: In this work we demonstrate real-time portrait stylization, specifically, translating self-portrait into cartoon or anime style on mobile devices. We propose a latency-driven differentiable architecture search method, maintaining realistic generative quality. With our framework, we obtain $10\times$ computation reduction on the generative model and achieve real-time video stylization on off-the-sh…

    Submitted 2 June, 2022; originally announced June 2022.

  49. arXiv:2205.14329  [pdf, other]

    cs.SD cs.CL eess.AS

    Speech Augmentation Based Unsupervised Learning for Keyword Spotting

    Authors: Jian Luo, Jianzong Wang, Ning Cheng, Haobin Tang, Jing Xiao

    Abstract: In this paper, we investigated a speech augmentation based unsupervised learning approach for keyword spotting (KWS) task. KWS is a useful speech application, yet also heavily depends on the labeled data. We designed a CNN-Attention architecture to conduct the KWS task. CNN layers focus on the local acoustic features, and attention layers model the long-time dependency. To improve the robustness o…

    Submitted 28 May, 2022; originally announced May 2022.

    Comments: Accepted by WCCI 2022

  50. arXiv:2205.12429  [pdf, other]

    eess.IV cs.CV

    Interaction of a priori Anatomic Knowledge with Self-Supervised Contrastive Learning in Cardiac Magnetic Resonance Imaging

    Authors: Makiya Nakashima, Inyeop Jang, Ramesh Basnet, Mitchel Benovoy, W. H. Wilson Tang, Christopher Nguyen, Deborah Kwon, Tae Hyun Hwang, David Chen

    Abstract: Training deep learning models on cardiac magnetic resonance imaging (CMR) can be a challenge due to the small amount of expert generated labels and inherent complexity of data source. Self-supervised contrastive learning (SSCL) has recently been shown to boost performance in several medical imaging tasks. However, it is unclear how much the pre-trained representation reflects the primary organ of…

    Submitted 24 May, 2022; originally announced May 2022.

    Comments: Under review at Machine Learning in Healthcare