Search | arXiv e-print repository

MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation

Authors: Ruibo Fu, Shuchen Shi, Hongming Guo, Tao Wang, Chunyu Qiang, Zhengqi Wen, Jianhua Tao, Xin Qi, Yi Lu, Xiaopeng Wang, Zhiyong Wang, Yukun Liu, Xuefei Liu, Shuai Zhang, Guanjun Li

Abstract: Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, the foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio technology, which relies on… ▽ More Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, the foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio technology, which relies on detailed and acoustically relevant textual descriptions, falls short in practical video dubbing applications. Existing datasets like AudioSet, AudioCaps, Clotho, Sound-of-Story, and WavCaps do not fully meet the requirements for real-world foley audio dubbing task. To address this, we introduce the Multi-modal Image and Narrative Text Dubbing Dataset (MINT), designed to enhance mainstream dubbing tasks such as literary story audiobooks dubbing, image/silent video dubbing. Besides, to address the limitations of existing TTA technology in understanding and planning complex prompts, a Foley Audio Content Planning, Generation, and Alignment (CPGA) framework is proposed, which includes a content planning module leveraging large language models for complex multi-modal prompts comprehension. Additionally, the training process is optimized using Proximal Policy Optimization based reinforcement learning, significantly improving the alignment and auditory realism of generated foley audio. Experimental results demonstrate that our approach significantly advances the field of foley audio dubbing, providing robust solutions for the challenges of multi-modal dubbing. Even when utilizing the relatively lightweight GPT-2 model, our framework outperforms open-source multimodal large models such as LLaVA, DeepSeek-VL, and Moondream2. The dataset is available at https://github.com/borisfrb/MINT . △ Less

Submitted 15 June, 2024; originally announced June 2024.

arXiv:2406.10056 [pdf, other]

UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

Authors: Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, Helen Meng

Abstract: The Large Language models (LLMs) have demonstrated supreme capabilities in text understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering the frozen LLMs to achieve multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel and LLMs-dr… ▽ More The Large Language models (LLMs) have demonstrated supreme capabilities in text understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering the frozen LLMs to achieve multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel and LLMs-driven audio codec model, LLM-Codec, to transfer the audio modality into the textual space, \textit{i.e.} representing audio tokens with words or sub-words in the vocabulary of LLMs, while kee** high audio reconstruction quality. The key idea is to reduce the modality heterogeneity between text and audio by compressing the audio modality into a well-trained LLMs token space. Thus, the audio representation can be viewed as a new \textit{foreign language}, and LLMs can learn the new \textit{foreign language} with several demonstrations. In experiments, we investigate the performance of the proposed approach across multiple audio understanding and generation tasks, \textit{e.g.} speech emotion classification, audio classification, text-to-speech generation, speech enhancement, etc. The experimental results demonstrate that the LLMs equipped with the proposed LLM-Codec, named as UniAudio 1.5, prompted by only a few examples, can achieve the expected functions in simple scenarios. It validates the feasibility and effectiveness of the proposed cross-modal in-context learning approach. To facilitate research on few-shot audio task learning and multi-modal LLMs, we have open-sourced the LLM-Codec model. △ Less

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.07422 [pdf, other]

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

Authors: Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, Yuanjun Lv, Lei Xie, Yunlin Chen, Hao Yin, Zhifei Li

Abstract: The multi-codebook speech codec enables the application of large language models (LLM) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence. Furthermor… ▽ More The multi-codebook speech codec enables the application of large language models (LLM) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence. Furthermore, the encoder is enhanced with 1) contextual modeling with a BLSTM module to exploit the temporal information, 2) a hybrid sampling module to alleviate distortion from upsampling and downsampling, and 3) a resampling module to encourage discrete units to carry more phonetic information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec, Single-Codec demonstrates higher reconstruction quality with a lower bandwidth of only 304bps. The effectiveness of Single-Code is further validated by LLM-TTS experiments, showing improved naturalness and intelligibility. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.02940 [pdf, other]

Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder

Authors: Haohan Guo, Fenglong Xie, Dongchao Yang, Hui Lu, Xixin Wu, Helen Meng

Abstract: VQ-VAE, as a mainstream approach of speech tokenizer, has been troubled by ``index collapse'', where only a small number of codewords are activated in large codebooks. This work proposes product-quantized (PQ) VAE with more codebooks but fewer codewords to address this problem and build large-codebook speech tokenizers. It encodes speech features into multiple VQ subspaces and composes them into c… ▽ More VQ-VAE, as a mainstream approach of speech tokenizer, has been troubled by ``index collapse'', where only a small number of codewords are activated in large codebooks. This work proposes product-quantized (PQ) VAE with more codebooks but fewer codewords to address this problem and build large-codebook speech tokenizers. It encodes speech features into multiple VQ subspaces and composes them into codewords in a larger codebook. Besides, to utilize each VQ subspace well, we also enhance PQ-VAE via a dual-decoding training strategy with the encoding and quantized sequences. The experimental results demonstrate that PQ-VAE addresses ``index collapse" effectively, especially for larger codebooks. The model with the proposed training strategy further improves codebook perplexity and reconstruction quality, outperforming other multi-codebook VQ approaches. Finally, PQ-VAE demonstrates its effectiveness in language-model-based TTS, supporting higher-quality speech generation with larger codebooks. △ Less

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.02328 [pdf, other]

SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

Authors: Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng

Abstract: In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simpleness shows in three aspects: (1) It can be trained on the speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech through an NAR way; (3) It tries to model speech in a finite and compac… ▽ More In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simpleness shows in three aspects: (1) It can be trained on the speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech through an NAR way; (3) It tries to model speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model (SQ-Codec) with scalar quantization, SQ-Codec effectively maps the complex speech signal into a finite and compact latent space, named scalar latent space. Benefits from SQ-Codec, we apply a novel transformer diffusion model in the scalar latent space of SQ-Codec. We train SimpleSpeech on 4k hours of a speech-only dataset, it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it presents significant speech quality and generation speed improvement. Demos are released. △ Less

Submitted 14 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted by InterSpeech 2024

arXiv:2405.08949 [pdf, other]

Task-Oriented Mulsemedia Communication using Unified Perceiver and Conformal Prediction in 6G Metaverse Systems

Authors: Hongzhi Guo, Ian F. Akyildiz

Abstract: The growing prominence of extended reality (XR), holographic-type communications, and metaverse demands truly immersive user experiences by using many sensory modalities, including sight, hearing, touch, smell, taste, etc. Additionally, the widespread deployment of sensors in areas such as agriculture, manufacturing, and smart homes is generating a diverse array of sensory data. A new media format… ▽ More The growing prominence of extended reality (XR), holographic-type communications, and metaverse demands truly immersive user experiences by using many sensory modalities, including sight, hearing, touch, smell, taste, etc. Additionally, the widespread deployment of sensors in areas such as agriculture, manufacturing, and smart homes is generating a diverse array of sensory data. A new media format known as multisensory media (mulsemedia) has emerged, which incorporates a wide range of sensory modalities beyond the traditional visual and auditory media. 6G wireless systems are envisioned to support the internet of senses, making it crucial to explore effective data fusion and communication strategies for mulsemedia. In this paper, we introduce a task-oriented multi-task mulsemedia communication system named MuSeCo, which is developed using unified Perceiver models and Conformal Prediction. This unified model can accept any sensory input and efficiently extract latent semantic features, making it adaptable for deployment across various Artificial Intelligence of Things (AIoT) devices. Conformal Prediction is employed for modality selection and combination, enhancing task accuracy while minimizing data communication overhead. The model has been trained using six sensory modalities across four classification tasks. Simulations and experiments demonstrate that MuSeCo can effectively select and combine sensory modalities, significantly reduce end-to-end communication latency and energy consumption, and maintain high accuracy in communication-constrained systems. △ Less

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2404.17069 [pdf, other]

Channel Modeling for FR3 Upper Mid-band via Generative Adversarial Networks

Authors: Yaqi Hu, Mingsheng Yin, Marco Mezzavilla, Hao Guo, Sundeep Rangan

Abstract: The upper mid-band (FR3) has been recently attracting interest for new generation of mobile networks, as it provides a promising balance between spectrum availability and coverage, which are inherent limitations of the sub 6GHz and millimeter wave bands, respectively. In order to efficiently design and optimize the network, channel modeling plays a key role since FR3 systems are expected to operat… ▽ More The upper mid-band (FR3) has been recently attracting interest for new generation of mobile networks, as it provides a promising balance between spectrum availability and coverage, which are inherent limitations of the sub 6GHz and millimeter wave bands, respectively. In order to efficiently design and optimize the network, channel modeling plays a key role since FR3 systems are expected to operate at multiple frequency bands. Data-driven methods, especially generative adversarial networks (GANs), can capture the intricate relationships among data samples, and provide an appropriate tool for FR3 channel modeling. In this work, we present the architecture, link state model, and path generative network of GAN-based FR3 channel modeling. The comparison of our model greatly matches the ray-tracing simulated data. △ Less

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.09179 [pdf]

doi 10.1109/JSTARS.2023.3310208

Change Guiding Network: Incorporating Change Prior to Guide Change Detection in Remote Sensing Imagery

Authors: Chengxi Han, Chen Wu, Haonan Guo, Meiqi Hu, Jiepan Li, Hongruixuan Chen

Abstract: The rapid advancement of automated artificial intelligence algorithms and remote sensing instruments has benefited change detection (CD) tasks. However, there is still a lot of space to study for precise detection, especially the edge integrity and internal holes phenomenon of change features. In order to solve these problems, we design the Change Guiding Network (CGNet), to tackle the insufficien… ▽ More The rapid advancement of automated artificial intelligence algorithms and remote sensing instruments has benefited change detection (CD) tasks. However, there is still a lot of space to study for precise detection, especially the edge integrity and internal holes phenomenon of change features. In order to solve these problems, we design the Change Guiding Network (CGNet), to tackle the insufficient expression problem of change features in the conventional U-Net structure adopted in previous methods, which causes inaccurate edge detection and internal holes. Change maps from deep features with rich semantic information are generated and used as prior information to guide multi-scale feature fusion, which can improve the expression ability of change features. Meanwhile, we propose a self-attention module named Change Guide Module (CGM), which can effectively capture the long-distance dependency among pixels and effectively overcome the problem of the insufficient receptive field of traditional convolutional neural networks. On four major CD datasets, we verify the usefulness and efficiency of the CGNet, and a large number of experiments and ablation studies demonstrate the effectiveness of CGNet. We're going to open-source our code at https://github.com/ChengxiHAN/CGNet-CD. △ Less

Submitted 14 April, 2024; originally announced April 2024.

arXiv:2404.07985 [pdf, other]

WaveMo: Learning Wavefront Modulations to See Through Scattering

Authors: Mingyang Xie, Haiyun Guo, Brandon Y. Feng, Lingbo **, Ashok Veeraraghavan, Christopher A. Metzler

Abstract: Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation, which induces measurement diversity during image acquisition. Despite its importance, designing optimal wavefront modulations to image through scattering remains under-explored. This paper introdu… ▽ More Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation, which induces measurement diversity during image acquisition. Despite its importance, designing optimal wavefront modulations to image through scattering remains under-explored. This paper introduces a novel learning-based framework to address the gap. Our approach jointly optimizes wavefront modulations and a computationally lightweight feedforward "proxy" reconstruction network. This network is trained to recover scenes obscured by scattering, using measurements that are modified by these modulations. The learned modulations produced by our framework generalize effectively to unseen scattering scenarios and exhibit remarkable versatility. During deployment, the learned modulations can be decoupled from the proxy network to augment other more computationally expensive restoration algorithms. Through extensive experiments, we demonstrate our approach significantly advances the state of the art in imaging through scattering media. Our project webpage is at https://wavemo-2024.github.io/. △ Less

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.04878 [pdf, other]

CycleINR: Cycle Implicit Neural Representation for Arbitrary-Scale Volumetric Super-Resolution of Medical Data

Authors: Wei Fang, Yuxing Tang, Heng Guo, Mingze Yuan, Tony C. W. Mok, Ke Yan, Jiawen Yao, Xin Chen, Zaiyi Liu, Le Lu, Ling Zhang, Minfeng Xu

Abstract: In the realm of medical 3D data, such as CT and MRI images, prevalent anisotropic resolution is characterized by high intra-slice but diminished inter-slice resolution. The lowered resolution between adjacent slices poses challenges, hindering optimal viewing experiences and impeding the development of robust downstream analysis algorithms. Various volumetric super-resolution algorithms aim to sur… ▽ More In the realm of medical 3D data, such as CT and MRI images, prevalent anisotropic resolution is characterized by high intra-slice but diminished inter-slice resolution. The lowered resolution between adjacent slices poses challenges, hindering optimal viewing experiences and impeding the development of robust downstream analysis algorithms. Various volumetric super-resolution algorithms aim to surmount these challenges, enhancing inter-slice resolution and overall 3D medical imaging quality. However, existing approaches confront inherent challenges: 1) often tailored to specific upsampling factors, lacking flexibility for diverse clinical scenarios; 2) newly generated slices frequently suffer from over-smoothing, degrading fine details, and leading to inter-slice inconsistency. In response, this study presents CycleINR, a novel enhanced Implicit Neural Representation model for 3D medical data volumetric super-resolution. Leveraging the continuity of the learned implicit function, the CycleINR model can achieve results with arbitrary up-sampling rates, eliminating the need for separate training. Additionally, we enhance the grid sampling in CycleINR with a local attention mechanism and mitigate over-smoothing by integrating cycle-consistent loss. We introduce a new metric, Slice-wise Noise Level Inconsistency (SNLI), to quantitatively assess inter-slice noise level inconsistency. The effectiveness of our approach is demonstrated through image quality evaluations on an in-house dataset and a downstream task analysis on the Medical Segmentation Decathlon liver tumor dataset. △ Less

Submitted 7 April, 2024; originally announced April 2024.

Comments: CVPR accepted paper

arXiv:2403.19785 [pdf, other]

Integrated Communication, Localization, and Sensing in 6G D-MIMO Networks

Authors: Hao Guo, Henk Wymeersch, Behrooz Makki, Hui Chen, Yibo Wu, Giuseppe Durisi, Musa Furkan Keskin, Mohammad H. Moghaddam, Charitha Madapatha, Han Yu, Peter Hammarberg, Hyowon Kim, Tommy Svensson

Abstract: Future generations of mobile networks call for concurrent sensing and communication functionalities in the same hardware and/or spectrum. Compared to communication, sensing services often suffer from limited coverage, due to the high path loss of the reflected signal and the increased infrastructure requirements. To provide a more uniform quality of service, distributed multiple input multiple out… ▽ More Future generations of mobile networks call for concurrent sensing and communication functionalities in the same hardware and/or spectrum. Compared to communication, sensing services often suffer from limited coverage, due to the high path loss of the reflected signal and the increased infrastructure requirements. To provide a more uniform quality of service, distributed multiple input multiple output (D-MIMO) systems deploy a large number of distributed nodes and efficiently control them, making distributed integrated sensing and communications (ISAC) possible. In this paper, we investigate ISAC in D-MIMO through the lens of different design architectures and deployments, revealing both conflicts and synergies. In addition, simulation and demonstration results reveal both opportunities and challenges towards the implementation of ISAC in D-MIMO. △ Less

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.19238 [pdf, other]

Taming Lookup Tables for Efficient Image Retouching

Authors: Sidi Yang, Binxiao Huang, Mingdeng Cao, Yatai Ji, Hanzhong Guo, Ngai Wong, Yujiu Yang

Abstract: The widespread use of high-definition screens in edge devices, such as end-user cameras, smartphones, and televisions, is spurring a significant demand for image enhancement. Existing enhancement models often optimize for high performance while falling short of reducing hardware inference time and power consumption, especially on edge devices with constrained computing and storage resources. To th… ▽ More The widespread use of high-definition screens in edge devices, such as end-user cameras, smartphones, and televisions, is spurring a significant demand for image enhancement. Existing enhancement models often optimize for high performance while falling short of reducing hardware inference time and power consumption, especially on edge devices with constrained computing and storage resources. To this end, we propose Image Color Enhancement Lookup Table (ICELUT) that adopts LUTs for extremely efficient edge inference, without any convolutional neural network (CNN). During training, we leverage pointwise (1x1) convolution to extract color information, alongside a split fully connected layer to incorporate global information. Both components are then seamlessly converted into LUTs for hardware-agnostic deployment. ICELUT achieves near-state-of-the-art performance and remarkably low power consumption. We observe that the pointwise network structure exhibits robust scalability, upkee** the performance even with a heavily downsampled 32x32 input image. These enable ICELUT, the first-ever purely LUT-based image enhancer, to reach an unprecedented speed of 0.4ms on GPU and 7ms on CPU, at least one order faster than any CNN solution. Codes are available at https://github.com/Stephen0808/ICELUT. △ Less

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.14268 [pdf]

Speech-Aware Neural Diarization with Encoder-Decoder Attractor Guided by Attention Constraints

Authors: PeiYing Lee, HauYun Guo, Berlin Chen

Abstract: End-to-End Neural Diarization with Encoder-Decoder based Attractor (EEND-EDA) is an end-to-end neural model for automatic speaker segmentation and labeling. It achieves the capability to handle flexible number of speakers by estimating the number of attractors. EEND-EDA, however, struggles to accurately capture local speaker dynamics. This work proposes an auxiliary loss that aims to guide the Tra… ▽ More End-to-End Neural Diarization with Encoder-Decoder based Attractor (EEND-EDA) is an end-to-end neural model for automatic speaker segmentation and labeling. It achieves the capability to handle flexible number of speakers by estimating the number of attractors. EEND-EDA, however, struggles to accurately capture local speaker dynamics. This work proposes an auxiliary loss that aims to guide the Transformer encoders at the lower layer of EEND-EDA model to enhance the effect of self-attention modules using speaker activity information. The results evaluated on public dataset Mini LibriSpeech, demonstrates the effectiveness of the work, reducing Diarization Error Rate from 30.95% to 28.17%. We will release the source code on GitHub to allow further research and reproducibility. △ Less

Submitted 21 March, 2024; originally announced March 2024.

Comments: Accepted to The 28th International Conference on Technologies and Applications of Artificial Intelligence (TAAI), in Chinese language

Report number: TAAI2023-Domestic-131

arXiv:2403.13820 [pdf, other]

Identity information based on human magnetocardiography signals

Authors: Pengju Zhang, Chenxi Sun, Jianwei Zhang, Hong Guo

Abstract: We have developed an individual identification system based on magnetocardiography (MCG) signals captured using optically pumped magnetometers (OPMs). Our system utilizes pattern recognition to analyze the signals obtained at different positions on the body, by scanning the matrices composed of MCG signals with a 2*2 window. In order to make use of the spatial information of MCG signals, we transf… ▽ More We have developed an individual identification system based on magnetocardiography (MCG) signals captured using optically pumped magnetometers (OPMs). Our system utilizes pattern recognition to analyze the signals obtained at different positions on the body, by scanning the matrices composed of MCG signals with a 2*2 window. In order to make use of the spatial information of MCG signals, we transform the signals from adjacent small areas into four channels of a dataset. We further transform the data into time-frequency matrices using wavelet transforms and employ a convolutional neural network (CNN) for classification. As a result, our system achieves an accuracy rate of 97.04% in identifying individuals. This finding indicates that the MCG signal holds potential for use in individual identification systems, offering a valuable tool for personalized healthcare management. △ Less

Submitted 2 March, 2024; originally announced March 2024.

Comments: 7 pages, 5 figures. Author manuscript accepted for AAAI 2024 Spring Symposium on Clinical Foundation Models

arXiv:2402.11769 [pdf, other]

Connection-Aware P2P Trading: Simultaneous Trading and Peer Selection

Authors: Cheng Feng, Kedi Zheng, Lanqing Shan, Hani Alers, Lampros Stergioulas, Hongye Guo, Qixin Chen

Abstract: Peer-to-peer (P2P) trading is seen as a viable solution to handle the growing number of distributed energy resources in distribution networks. However, when dealing with large-scale consumers, there are several challenges that must be addressed. One of these challenges is limited communication capabilities. Additionally, prosumers may have specific preferences when it comes to trading. Both can re… ▽ More Peer-to-peer (P2P) trading is seen as a viable solution to handle the growing number of distributed energy resources in distribution networks. However, when dealing with large-scale consumers, there are several challenges that must be addressed. One of these challenges is limited communication capabilities. Additionally, prosumers may have specific preferences when it comes to trading. Both can result in serious asynchrony in peer-to-peer trading, potentially impacting the effectiveness of negotiations and hindering convergence before the market closes. This paper introduces a connection-aware P2P trading algorithm designed for extensive prosumer trading. The algorithm facilitates asynchronous trading while respecting prosumer's autonomy in trading peer selection, an often overlooked aspect in traditional models. In addition, to optimize the use of limited connection opportunities, a smart trading peer connection selection strategy is developed to guide consumers to communicate strategically to accelerate convergence. A theoretical convergence guarantee is provided for the connection-aware P2P trading algorithm, which further details how smart selection strategies enhance convergence efficiency. Numerical studies are carried out to validate the effectiveness of the connection-aware algorithm and the performance of smart selection strategies in reducing the overall convergence time. △ Less

Submitted 18 February, 2024; originally announced February 2024.

Comments: Submitted to IEEE PES Transactions

arXiv:2402.08093 [pdf, other]

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Authors: Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts ra… ▽ More We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/. △ Less

Submitted 15 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: v1.1 (fixed typos)

arXiv:2402.03886 [pdf, other]

Full-Duplex Millimeter Wave MIMO Channel Estimation: A Neural Network Approach

Authors: Mehdi Sattari, Hao Guo, Deniz Gündüz, Ashkan Panahi, Tommy Svensson

Abstract: Millimeter wave (mmWave) multiple-input-multi-output (MIMO) is now a reality with great potential for further improvement. We study full-duplex transmissions as an effective way to improve mmWave MIMO systems. Compared to half-duplex systems, full-duplex transmissions may offer higher data rates and lower latency. However, full-duplex transmission is hindered by self-interference (SI) at the recei… ▽ More Millimeter wave (mmWave) multiple-input-multi-output (MIMO) is now a reality with great potential for further improvement. We study full-duplex transmissions as an effective way to improve mmWave MIMO systems. Compared to half-duplex systems, full-duplex transmissions may offer higher data rates and lower latency. However, full-duplex transmission is hindered by self-interference (SI) at the receive antennas, and SI channel estimation becomes a crucial step to make the full-duplex systems feasible. In this paper, we address the problem of channel estimation in full-duplex mmWave MIMO systems using neural networks (NNs). Our approach involves sharing pilot resources between user equipments (UEs) and transmit antennas at the base station (BS), aiming to reduce the pilot overhead in full-duplex systems and to achieve a comparable level to that of a half-duplex system. Additionally, in the case of separate antenna configurations in a full-duplex BS, providing channel estimates of transmit antenna (TX) arrays to the downlink UEs poses another challenge, as the TX arrays are not capable of receiving pilot signals. To address this, we employ an NN to map the channel from the downlink UEs to the receive antenna (RX) arrays to the channel from the TX arrays to the downlink UEs. We further elaborate on how NNs perform the estimation with different architectures, (e.g., different numbers of hidden layers), the introduction of non-linear distortion (e.g., with a 1-bit analog-to-digital converter (ADC)), and different channel conditions (e.g., low-correlated and high-correlated channels). Our work provides novel insights into NN-based channel estimators. △ Less

Submitted 18 June, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2401.11479 [pdf, other]

Battery-Free Sensor Array for Wireless Multi-Depth In-Situ Sensing

Authors: Hongzhi Guo, Adam Kamrath

Abstract: Underground in-situ sensing plays a vital role in precision agriculture and infrastructure monitoring. While existing sensing systems utilize wires to connect an array of sensors at various depths for spatial-temporal data collection, wireless underground sensor networks offer a cable-free alternative. However, these wireless sensors are typically battery-powered, necessitating periodic recharging… ▽ More Underground in-situ sensing plays a vital role in precision agriculture and infrastructure monitoring. While existing sensing systems utilize wires to connect an array of sensors at various depths for spatial-temporal data collection, wireless underground sensor networks offer a cable-free alternative. However, these wireless sensors are typically battery-powered, necessitating periodic recharging or replacement. This paper proposes a battery-free sensor array which can be used for wireless multi-depth in-situ sensing. Utilizing Near Field Communication (NFC)-which can penetrate soil with negligible signal power loss-this sensor array can form a virtual magnetic waveguide, achieving long communication ranges. An analytical model has been developed to offer insights and determine optimal design parameters. Moreover, a prototype, constructed using off-the-shelf NFC sensors, was tested to validate the proposed concept. While this system is primarily designed for underground applications, it holds potential for other multi-depth in-situ sensing scenarios, including underwater environments. △ Less

Submitted 21 January, 2024; originally announced January 2024.

arXiv:2401.09013 [pdf, other]

An Improved Virtual Force Approach for UAV Deployment and Resource Allocation in Emergency Communications

Authors: Hongying Guo, Li Wang, Ruoguang Li, Luyang Hou, Lianming Xu, Aiguo Fei

Abstract: In this paper, we consider an unmanned aerial vehicle (UAV)-enabled emergency communication system, which establishes temporary communication link with users equipment (UEs) in a typical disaster environment with mountainous forest and obstacles. Towards this end, a joint deployment, power allocation, and user association optimization problem is formulated to maximize the total transmission rate,… ▽ More In this paper, we consider an unmanned aerial vehicle (UAV)-enabled emergency communication system, which establishes temporary communication link with users equipment (UEs) in a typical disaster environment with mountainous forest and obstacles. Towards this end, a joint deployment, power allocation, and user association optimization problem is formulated to maximize the total transmission rate, while considering the demand of each UE and the disaster environment characteristics. Then, an alternating optimization algorithm is proposed by integrating coalition game and virtual force approach which captures the impact of the demand priority of UEs and the obstacles to the flight path and consumed power. Simulation results demonstrate that the computation time consumed by our proposed algorithm is only $5.6\%$ of the traditional heuristic algorithms, which validates its effectiveness in disaster scenarios. △ Less

Submitted 17 January, 2024; originally announced January 2024.

arXiv:2401.07791 [pdf, other]

Near-Far Field Channel Modeling for Holographic MIMO Using Expectation-Maximization Methods

Authors: Houfeng Chen, Shuhao Zeng, Hao Guo, Tommy Svensson, Hongliang Zhang

Abstract: Holographic Multiple-Input Multiple-Output (HMIMO), which densely integrates numerous antennas into a limited space, is anticipated to provide higher rates for future 6G wireless communications. The increase in antenna aperture size makes the near-field region enlarge, causing some users to be located in the near-field region. Thus, we are facing a hybrid near-field and far-field communication pro… ▽ More Holographic Multiple-Input Multiple-Output (HMIMO), which densely integrates numerous antennas into a limited space, is anticipated to provide higher rates for future 6G wireless communications. The increase in antenna aperture size makes the near-field region enlarge, causing some users to be located in the near-field region. Thus, we are facing a hybrid near-field and far-field communication problem, where conventional far-field modeling methods may not work well. In this paper, we propose a near-far field channel model that does not presuppose whether each path is near-field or far-field, different from the existing work requiring the ratio of the number of near-field paths to that of far-field paths as prior knowledge. However, this gives rise to a new challenge for accurately modeling the channel, as conventional methods of obtaining channel model parameters are not applicable to this model. Therefore, we propose a new method, Expectation-Maximization (EM)-based Near-Far Field Channel Modeling, to obtain channel model parameters, which considers whether each path is near-field or far-field as a hidden variable, and optimizes the hidden variables and channel model parameters through an alternating iteration method. Simulation results show that our method is superior to conventional near-field and far-field algorithms in fitting the near-far field channel in terms of outage probability. △ Less

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2401.04152 [pdf, other]

Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

Authors: Jiawen Kang, Lingwei Meng, Mingyu Cui, Haohan Guo, Xixin Wu, Xunying Liu, Helen Meng

Abstract: End-to-end multi-talker speech recognition has garnered great interest as an effective approach to directly transcribe overlapped speech from multiple speakers. Current methods typically adopt either 1) single-input multiple-output (SIMO) models with a branched encoder, or 2) single-input single-output (SISO) models based on attention-based encoder-decoder architecture with serialized output train… ▽ More End-to-end multi-talker speech recognition has garnered great interest as an effective approach to directly transcribe overlapped speech from multiple speakers. Current methods typically adopt either 1) single-input multiple-output (SIMO) models with a branched encoder, or 2) single-input single-output (SISO) models based on attention-based encoder-decoder architecture with serialized output training (SOT). In this work, we propose a Cross-Speaker Encoding (CSE) network to address the limitations of SIMO models by aggregating cross-speaker representations. Furthermore, the CSE model is integrated with SOT to leverage both the advantages of SIMO and SISO while mitigating their drawbacks. To the best of our knowledge, this work represents an early effort to integrate SIMO and SISO for multi-talker speech recognition. Experiments on the two-speaker LibrispeechMix dataset show that the CES model reduces word error rate (WER) by 8% over the SIMO baseline. The CSE-SOT model reduces WER by 10% overall and by 16% on high-overlap speech compared to the SOT model. △ Less

Submitted 8 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP2024

arXiv:2311.11260 [pdf, other]

doi 10.1145/3643832.3661871

Radarize: Enhancing Radar SLAM with Generalizable Doppler-Based Odometry

Authors: Emerson Sie, Xinyu Wu, Heyu Guo, Deepak Vasisht

Abstract: Millimeter-wave (mmWave) radar is increasingly being considered as an alternative to optical sensors for robotic primitives like simultaneous localization and map** (SLAM). While mmWave radar overcomes some limitations of optical sensors, such as occlusions, poor lighting conditions, and privacy concerns, it also faces unique challenges, such as missed obstacles due to specular reflections or fa… ▽ More Millimeter-wave (mmWave) radar is increasingly being considered as an alternative to optical sensors for robotic primitives like simultaneous localization and map** (SLAM). While mmWave radar overcomes some limitations of optical sensors, such as occlusions, poor lighting conditions, and privacy concerns, it also faces unique challenges, such as missed obstacles due to specular reflections or fake objects due to multipath. To address these challenges, we propose Radarize, a self-contained SLAM pipeline that uses only a commodity single-chip mmWave radar. Our radar-native approach uses techniques such as Doppler shift-based odometry and multipath artifact suppression to improve performance. We evaluate our method on a large dataset of 146 trajectories spanning 4 buildings and mounted on 3 different platforms, totaling approximately 4.7 Km of travel distance. Our results show that our method outperforms state-of-the-art radar and radar-inertial approaches by approximately 5x in terms of odometry and 8x in terms of end-to-end SLAM, as measured by absolute trajectory error (ATE), without the need for additional sensors such as IMUs or wheel encoders. △ Less

Submitted 29 April, 2024; v1 submitted 19 November, 2023; originally announced November 2023.

arXiv:2311.11086 [pdf]

LightBTSeg: A lightweight breast tumor segmentation model using ultrasound images via dual-path joint knowledge distillation

Authors: Hongjiang Guo, Shengwen Wang, Hao Dang, Kangle Xiao, Yaru Yang, Wenpei Liu, Tongtong Liu, Yiying Wan

Abstract: The accurate segmentation of breast tumors is an important prerequisite for lesion detection, which has significant clinical value for breast tumor research. The mainstream deep learning-based methods have achieved a breakthrough. However, these high-performance segmentation methods are formidable to implement in clinical scenarios since they always embrace high computation complexity, massive par… ▽ More The accurate segmentation of breast tumors is an important prerequisite for lesion detection, which has significant clinical value for breast tumor research. The mainstream deep learning-based methods have achieved a breakthrough. However, these high-performance segmentation methods are formidable to implement in clinical scenarios since they always embrace high computation complexity, massive parameters, slow inference speed, and huge memory consumption. To tackle this problem, we propose LightBTSeg, a dual-path joint knowledge distillation framework, for lightweight breast tumor segmentation. Concretely, we design a double-teacher model to represent the fine-grained feature of breast ultrasound according to different semantic feature realignments of benign and malignant breast tumors. Specifically, we leverage the bottleneck architecture to reconstruct the original Attention U-Net. It is regarded as a lightweight student model named Simplified U-Net. Then, the prior knowledge of benign and malignant categories is utilized to design the teacher network combined dual-path joint knowledge distillation, which distills the knowledge from cumbersome benign and malignant teachers to a lightweight student model. Extensive experiments conducted on breast ultrasound images (Dataset BUSI) and Breast Ultrasound Dataset B (Dataset B) datasets demonstrate that LightBTSeg outperforms various counterparts. △ Less

Submitted 18 November, 2023; originally announced November 2023.

Comments: 7 pages, 7 figures, conference

arXiv:2311.02551 [pdf]

High-dimensional Bid Learning for Energy Storage Bidding in Energy Markets

Authors: **yu Liu, Hongye Guo, Qinghu Tang, En Lu, Qiuna Cai, Qixin Chen

Abstract: With the growing penetration of renewable energy resource, electricity market prices have exhibited greater volatility. Therefore, it is important for Energy Storage Systems(ESSs) to leverage the multidimensional nature of energy market bids to maximize profitability. However, current learning methods cannot fully utilize the high-dimensional price-quantity bids in the energy markets. To address t… ▽ More With the growing penetration of renewable energy resource, electricity market prices have exhibited greater volatility. Therefore, it is important for Energy Storage Systems(ESSs) to leverage the multidimensional nature of energy market bids to maximize profitability. However, current learning methods cannot fully utilize the high-dimensional price-quantity bids in the energy markets. To address this challenge, we modify the common reinforcement learning(RL) process by proposing a new bid representation method called Neural Network Embedded Bids (NNEBs). NNEBs refer to market bids that are represented by monotonic neural networks with discrete outputs. To achieve effective learning of NNEBs, we first learn a neural network as a strategic map** from the market price to ESS power output with RL. Then, we re-train the network with two training modifications to make the network output monotonic and discrete. Finally, the neural network is equivalently converted into a high-dimensional bid for bidding. We conducted experiments over real-world market datasets. Our studies show that the proposed method achieves 18% higher profit than the baseline and up to 78% profit of the optimal market bidder. △ Less

Submitted 4 November, 2023; originally announced November 2023.

Comments: 5 pages, 3 figures, Accepted by the 15th International Conference on Applied Energy (ICAE2023)

arXiv:2310.18529 [pdf, other]

FPM-INR: Fourier ptychographic microscopy image stack reconstruction using implicit neural representations

Authors: Haowen Zhou, Brandon Y. Feng, Haiyun Guo, Siyu Lin, Mingshu Liang, Christopher A. Metzler, Changhuei Yang

Abstract: Image stacks provide invaluable 3D information in various biological and pathological imaging applications. Fourier ptychographic microscopy (FPM) enables reconstructing high-resolution, wide field-of-view image stacks without z-stack scanning, thus significantly accelerating image acquisition. However, existing FPM methods take tens of minutes to reconstruct and gigabytes of memory to store a hig… ▽ More Image stacks provide invaluable 3D information in various biological and pathological imaging applications. Fourier ptychographic microscopy (FPM) enables reconstructing high-resolution, wide field-of-view image stacks without z-stack scanning, thus significantly accelerating image acquisition. However, existing FPM methods take tens of minutes to reconstruct and gigabytes of memory to store a high-resolution volumetric scene, impeding fast gigapixel-scale remote digital pathology. While deep learning approaches have been explored to address this challenge, existing methods poorly generalize to novel datasets and can produce unreliable hallucinations. This work presents FPM-INR, a compact and efficient framework that integrates physics-based optical models with implicit neural representations (INR) to represent and reconstruct FPM image stacks. FPM-INR is agnostic to system design or sample types and does not require external training data. In our demonstrated experiments, FPM-INR substantially outperforms traditional FPM algorithms with up to a 25-fold increase in speed and an 80-fold reduction in memory usage for continuous image stack representations. △ Less

Submitted 31 October, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

Comments: Project Page: https://hwzhou2020.github.io/FPM-INR-Web/

arXiv:2309.13602 [pdf, other]

6G Positioning and Sensing Through the Lens of Sustainability, Inclusiveness, and Trustworthiness

Authors: Henk Wymeersch, Hui Chen, Hao Guo, Musa Furkan Keskin, Bahare M. Khorsandi, Mohammad H. Moghaddam, Alejandro Ramirez, Kim Schindhelm, Athanasios Stavridis, Tommy Svensson, Vijaya Yajnanarayana

Abstract: 6G promises a paradigm shift in which positioning and sensing are inherently integrated, enhancing not only the communication performance but also enabling location- and context-aware services. Historically, positioning and sensing have been viewed through the lens of cost and performance trade-offs, implying an escalated demand for resources, such as radio, physical, and computational resources,… ▽ More 6G promises a paradigm shift in which positioning and sensing are inherently integrated, enhancing not only the communication performance but also enabling location- and context-aware services. Historically, positioning and sensing have been viewed through the lens of cost and performance trade-offs, implying an escalated demand for resources, such as radio, physical, and computational resources, for improved performance. However, 6G goes beyond this traditional perspective to encompass a set of broader values, namely sustainability, inclusiveness, and trustworthiness. From a joint industrial/academic perspective, this paper aims to shed light on these important value indicators and their relationship with the conventional key performance indicators in the context of positioning and sensing. △ Less

Submitted 14 February, 2024; v1 submitted 24 September, 2023; originally announced September 2023.

Comments: Submitted and under review for IEEE Wireless Communications

arXiv:2309.09088 [pdf, other]

Enhancing GAN-Based Vocoders with Contrastive Learning Under Data-limited Condition

Authors: Haoming Guo, Seth Z. Zhao, Jiachen Lian, Gopala Anumanchipalli, Gerald Friedland

Abstract: Vocoder models have recently achieved substantial progress in generating authentic audio comparable to human quality while significantly reducing memory requirement and inference time. However, these data-hungry generative models require large-scale audio data for learning good representations. In this paper, we apply contrastive learning methods in training the vocoder to improve the perceptual q… ▽ More Vocoder models have recently achieved substantial progress in generating authentic audio comparable to human quality while significantly reducing memory requirement and inference time. However, these data-hungry generative models require large-scale audio data for learning good representations. In this paper, we apply contrastive learning methods in training the vocoder to improve the perceptual quality of the vocoder without modifying its architecture or adding more data. We design an auxiliary task with mel-spectrogram contrastive learning to enhance the utterance-level quality of the vocoder model under data-limited conditions. We also extend the task to include waveforms to improve the multi-modality comprehension of the model and address the discriminator overfitting problem. We optimize the additional task simultaneously with GAN training objectives. Our results show that the tasks improve model performance substantially in data-limited settings. △ Less

Submitted 18 December, 2023; v1 submitted 16 September, 2023; originally announced September 2023.

arXiv:2309.06981 [pdf, other]

MASTERKEY: Practical Backdoor Attack Against Speaker Verification Systems

Authors: Hanqing Guo, Xun Chen, Junfeng Guo, Li Xiao, Qiben Yan

Abstract: Speaker Verification (SV) is widely deployed in mobile systems to authenticate legitimate users by using their voice traits. In this work, we propose a backdoor attack MASTERKEY, to compromise the SV models. Different from previous attacks, we focus on a real-world practical setting where the attacker possesses no knowledge of the intended victim. To design MASTERKEY, we investigate the limitation… ▽ More Speaker Verification (SV) is widely deployed in mobile systems to authenticate legitimate users by using their voice traits. In this work, we propose a backdoor attack MASTERKEY, to compromise the SV models. Different from previous attacks, we focus on a real-world practical setting where the attacker possesses no knowledge of the intended victim. To design MASTERKEY, we investigate the limitation of existing poisoning attacks against unseen targets. Then, we optimize a universal backdoor that is capable of attacking arbitrary targets. Next, we embed the speaker's characteristics and semantics information into the backdoor, making it imperceptible. Finally, we estimate the channel distortion and integrate it into the backdoor. We validate our attack on 6 popular SV models. Specifically, we poison a total of 53 models and use our trigger to attack 16,430 enrolled speakers, composed of 310 target speakers enrolled in 53 poisoned models. Our attack achieves 100% attack success rate with a 15% poison rate. By decreasing the poison rate to 3%, the attack success rate remains around 50%. We validate our attack in 3 real-world scenarios and successfully demonstrate the attack through both over-the-air and over-the-telephony-line scenarios. △ Less

Submitted 13 September, 2023; originally announced September 2023.

Comments: Accepted by Mobicom 2023

arXiv:2309.05276 [pdf, other]

Beamforming in Wireless Coded-Caching Systems

Authors: Sneha Madhusudan, Charitha Madapatha, Behrooz Makki, Hao Guo, Tommy Svensson

Abstract: Increased capacity in the access network poses capacity challenges on the transport network due to the aggregated traffic. However, there are spatial and time correlation in the user data demands that could potentially be utilized. To that end, we investigate a wireless transport network architecture that integrates beamforming and coded-caching strategies. Especially, our proposed design entails… ▽ More Increased capacity in the access network poses capacity challenges on the transport network due to the aggregated traffic. However, there are spatial and time correlation in the user data demands that could potentially be utilized. To that end, we investigate a wireless transport network architecture that integrates beamforming and coded-caching strategies. Especially, our proposed design entails a server with multiple antennas that broadcasts content to cache nodes responsible for serving users. Traditional caching methods face the limitation of relying on the individual memory with additional overhead. Hence, we develop an efficient genetic algorithm-based scheme for beam optimization in the coded-caching system. By exploiting the advantages of beamforming and coded-caching, the architecture achieves gains in terms of multicast opportunities, interference mitigation, and reduced peak backhaul traffic. A comparative analysis of this joint design with traditional, un-coded caching schemes is also conducted to assess the benefits of the proposed approach. Additionally, we examine the impact of various buffering and decoding methods on the performance of the coded-caching scheme. Our findings suggest that proper beamforming is useful in enhancing the effectiveness of the coded-caching technique, resulting in significant reduction in peak backhaul traffic. △ Less

Submitted 11 September, 2023; originally announced September 2023.

Comments: Submitted to IEEE Future Networks World Forum, 2023

arXiv:2309.00126 [pdf, other]

QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning

Authors: Haohan Guo, Fenglong Xie, Jiawen Kang, Yujia Xiao, Xixin Wu, Helen Meng

Abstract: This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower supervised data requirements via Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL) utilizing more unlabeled speech audio. This framework comprises two VQ-S3R learners: first, the principal learner aims to provide a generative Multi-Stage Multi-Codebook (MSMC) VQ-S3R via the… ▽ More This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower supervised data requirements via Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL) utilizing more unlabeled speech audio. This framework comprises two VQ-S3R learners: first, the principal learner aims to provide a generative Multi-Stage Multi-Codebook (MSMC) VQ-S3R via the MSMC-VQ-GAN combined with the contrastive S3RL, while decoding it back to the high-quality audio; then, the associate learner further abstracts the MSMC representation into a highly-compact VQ representation through a VQ-VAE. These two generative VQ-S3R learners provide profitable speech representations and pre-trained models for TTS, significantly improving synthesis quality with the lower requirement for supervised data. QS-TTS is evaluated comprehensively under various scenarios via subjective and objective tests in experiments. The results powerfully demonstrate the superior performance of QS-TTS, winning the highest MOS over supervised or semi-supervised baseline TTS approaches, especially in low-resource scenarios. Moreover, comparing various speech representations and transfer learning methods in TTS further validates the notable improvement of the proposed VQ-S3RL to TTS, showing the best audio quality and intelligibility metrics. The trend of slower decay in the synthesis quality of QS-TTS with decreasing supervised data further highlights its lower requirements for supervised data, indicating its great potential in low-resource scenarios. △ Less

Submitted 31 August, 2023; originally announced September 2023.

arXiv:2308.11154 [pdf, other]

doi 10.1109/CCNC49032.2021.9369456

Mobility-Aware Computation Offloading for Swarm Robotics using Deep Reinforcement Learning

Authors: Xiucheng Wang, Hongzhi Guo

Abstract: Swarm robotics is envisioned to automate a large number of dirty, dangerous, and dull tasks. Robots have limited energy, computation capability, and communication resources. Therefore, current swarm robotics have a small number of robots, which can only provide limited spatio-temporal information. In this paper, we propose to leverage the mobile edge computing to alleviate the computation burden.… ▽ More Swarm robotics is envisioned to automate a large number of dirty, dangerous, and dull tasks. Robots have limited energy, computation capability, and communication resources. Therefore, current swarm robotics have a small number of robots, which can only provide limited spatio-temporal information. In this paper, we propose to leverage the mobile edge computing to alleviate the computation burden. We develop an effective solution based on a mobility-aware deep reinforcement learning model at the edge server side for computing scheduling and resource. Our results show that the proposed approach can meet delay requirements and guarantee computation precision by using minimum robot energy. △ Less

Submitted 21 August, 2023; originally announced August 2023.

Journal ref: 2021 IEEE 18th Annual Consumer Communications & Networking Conference (CCNC)

arXiv:2308.04666 [pdf, other]

Speaker Recognition Using Isomorphic Graph Attention Network Based Pooling on Self-Supervised Representation

Authors: Zirui Ge, Xinzhou Xu, Haiyan Guo, Tingting Wang, Zhen Yang

Abstract: The emergence of self-supervised representation (i.e., wav2vec 2.0) allows speaker-recognition approaches to process spoken signals through foundation models built on speech data. Nevertheless, effective fusion on the representation requires further investigating, due to the inclusion of fixed or sub-optimal temporal pooling strategies. Despite of improved strategies considering graph learning and… ▽ More The emergence of self-supervised representation (i.e., wav2vec 2.0) allows speaker-recognition approaches to process spoken signals through foundation models built on speech data. Nevertheless, effective fusion on the representation requires further investigating, due to the inclusion of fixed or sub-optimal temporal pooling strategies. Despite of improved strategies considering graph learning and graph attention factors, non-injective aggregation still exists in the approaches, which may influence the performance for speaker recognition. In this regard, we propose a speaker recognition approach using Isomorphic Graph ATtention network (IsoGAT) on self-supervised representation. The proposed approach contains three modules of representation learning, graph attention, and aggregation, jointly considering learning on the self-supervised representation and the IsoGAT. Then, we perform experiments for speaker recognition tasks on VoxCeleb1\&2 datasets, with the corresponding experimental results demonstrating the recognition performance for the proposed approach, compared with existing pooling approaches on the self-supervised representation. △ Less

Submitted 23 February, 2024; v1 submitted 8 August, 2023; originally announced August 2023.

Comments: 9 pages, 4 figures

arXiv:2308.02494 [pdf, other]

doi 10.1109/TVCG.2023.3327194

Adaptively Placed Multi-Grid Scene Representation Networks for Large-Scale Data Visualization

Authors: Skylar Wolfgang Wurster, Tianyu Xiong, Han-Wei Shen, Hanqi Guo, Tom Peterka

Abstract: Scene representation networks (SRNs) have been recently proposed for compression and visualization of scientific data. However, state-of-the-art SRNs do not adapt the allocation of available network parameters to the complex features found in scientific data, leading to a loss in reconstruction quality. We address this shortcoming with an adaptively placed multi-grid SRN (APMGSRN) and propose a do… ▽ More Scene representation networks (SRNs) have been recently proposed for compression and visualization of scientific data. However, state-of-the-art SRNs do not adapt the allocation of available network parameters to the complex features found in scientific data, leading to a loss in reconstruction quality. We address this shortcoming with an adaptively placed multi-grid SRN (APMGSRN) and propose a domain decomposition training and inference technique for accelerated parallel training on multi-GPU systems. We also release an open-source neural volume rendering application that allows plug-and-play rendering with any PyTorch-based SRN. Our proposed APMGSRN architecture uses multiple spatially adaptive feature grids that learn where to be placed within the domain to dynamically allocate more neural network resources where error is high in the volume, improving state-of-the-art reconstruction accuracy of SRNs for scientific data without requiring expensive octree refining, pruning, and traversal like previous adaptive models. In our domain decomposition approach for representing large-scale data, we train an set of APMGSRNs in parallel on separate bricks of the volume to reduce training time while avoiding overhead necessary for an out-of-core solution for volumes too large to fit in GPU memory. After training, the lightweight SRNs are used for realtime neural volume rendering in our open-source renderer, where arbitrary view angles and transfer functions can be explored. A copy of this paper, all code, all models used in our experiments, and all supplemental materials and videos are available at https://github.com/skywolf829/APMGSRN. △ Less

Submitted 6 April, 2024; v1 submitted 16 July, 2023; originally announced August 2023.

Comments: Accepted to IEEE VIS 2023. https://www.computer.org/csdl/journal/tg/2024/01/10297599/1RyYguiNBLO

Journal ref: In IEEE Transactions on Visualization & Computer Graphics, vol. 30, no. 01, pp. 965-974, 2024

arXiv:2307.00393 [pdf, other]

Using joint training speaker encoder with consistency loss to achieve cross-lingual voice conversion and expressive voice conversion

Authors: Houjian Guo, Chaoran Liu, Carlos Toshinori Ishi, Hiroshi Ishiguro

Abstract: Voice conversion systems have made significant advancements in terms of naturalness and similarity in common voice conversion tasks. However, their performance in more complex tasks such as cross-lingual voice conversion and expressive voice conversion remains imperfect. In this study, we propose a novel approach that combines a jointly trained speaker encoder and content features extracted from t… ▽ More Voice conversion systems have made significant advancements in terms of naturalness and similarity in common voice conversion tasks. However, their performance in more complex tasks such as cross-lingual voice conversion and expressive voice conversion remains imperfect. In this study, we propose a novel approach that combines a jointly trained speaker encoder and content features extracted from the cross-lingual speech recognition model Whisper to achieve high-quality cross-lingual voice conversion. Additionally, we introduce a speaker consistency loss to the joint encoder, which improves the similarity between the converted speech and the reference speech. To further explore the capabilities of the joint speaker encoder, we use the phonetic posteriorgram as the content feature, which enables the model to effectively reproduce both the speaker characteristics and the emotional aspects of the reference speech. △ Less

Submitted 1 July, 2023; originally announced July 2023.

arXiv:2305.05736 [pdf, other]

doi 10.1145/3558482.3590189

VSMask: Defending Against Voice Synthesis Attack via Real-Time Predictive Perturbation

Authors: Yuanda Wang, Hanqing Guo, Guang**g Wang, Bocheng Chen, Qiben Yan

Abstract: Deep learning based voice synthesis technology generates artificial human-like speeches, which has been used in deepfakes or identity theft attacks. Existing defense mechanisms inject subtle adversarial perturbations into the raw speech audios to mislead the voice synthesis models. However, optimizing the adversarial perturbation not only consumes substantial computation time, but it also requires… ▽ More Deep learning based voice synthesis technology generates artificial human-like speeches, which has been used in deepfakes or identity theft attacks. Existing defense mechanisms inject subtle adversarial perturbations into the raw speech audios to mislead the voice synthesis models. However, optimizing the adversarial perturbation not only consumes substantial computation time, but it also requires the availability of entire speech. Therefore, they are not suitable for protecting live speech streams, such as voice messages or online meetings. In this paper, we propose VSMask, a real-time protection mechanism against voice synthesis attacks. Different from offline protection schemes, VSMask leverages a predictive neural network to forecast the most effective perturbation for the upcoming streaming speech. VSMask introduces a universal perturbation tailored for arbitrary speech input to shield a real-time speech in its entirety. To minimize the audio distortion within the protected speech, we implement a weight-based perturbation constraint to reduce the perceptibility of the added perturbation. We comprehensively evaluate VSMask protection performance under different scenarios. The experimental results indicate that VSMask can effectively defend against 3 popular voice synthesis models. None of the synthetic voice could deceive the speaker verification models or human ears with VSMask protection. In a physical world experiment, we demonstrate that VSMask successfully safeguards the real-time speech by injecting the perturbation over the air. △ Less

Submitted 9 May, 2023; originally announced May 2023.

arXiv:2304.06890 [pdf, other]

doi 10.1109/JSEN.2023.3253663

A Low-cost Through-metal Communication System for Sensors in Metallic Pipes

Authors: Hongzhi Guo, Marlin Prince, Javionn Ramsey, Jarvis Turner, Marcus Allen, Chevel Samuels, Jordan Atta Nuako

Abstract: Metallic pipes and other containers are widely used to store and transport toxic gases and liquids. Various sensors have been designed to monitor the environment inside metallic pipes and containers, such as pressure, liquid-level, and chemical sensors. Moreover, sensors are also used to inspect and detect pipe leakages. However, sensors are usually placed outside of metallic pipes and containers… ▽ More Metallic pipes and other containers are widely used to store and transport toxic gases and liquids. Various sensors have been designed to monitor the environment inside metallic pipes and containers, such as pressure, liquid-level, and chemical sensors. Moreover, sensors are also used to inspect and detect pipe leakages. However, sensors are usually placed outside of metallic pipes and containers and use ultrasound to monitor the internal unseen environment. This is mainly due to the fact that internal sensors cannot communicate with external data sinks without cables, but using cables can dramatically affect the metal-sealed structure. Wireless communication is desirable to communicate with internal sensors, but it experiences high attenuation losses since metal can block wireless signals due to its high conductivity. This paper develops a low-cost through-metal communication system prototype using off-the-shelf electronic components. The system is fully reconfigurable, and arbitrary modulation and coding schemes can be implemented. We design the transmit module which includes a signal processing microcontroller, an amplifier, and a transmit coil, and the receive module which includes a receive coil, an amplifier, and a microcontroller with demodulation algorithms and bit-error-rate (BER) calculations. The performance of the prototype is evaluated using various symbol rates, distances, and transmission power. The results show that the communication system can achieve a 500 bps data rate with 0.01 BER and 3.4 cm communication range when penetrating an Aluminum pipe with 7 mm thickness. △ Less

Submitted 13 April, 2023; originally announced April 2023.

Journal ref: IEEE Sensors Journal, 2023

arXiv:2303.10556 [pdf, ps, other]

The Graph feature fusion technique for speaker recognition based on wav2vec2.0 framework

Authors: Zirui Ge, Haiyan Guo, Zhen Yang

Abstract: Pre-trained wav2vec2.0 model has been proved its effectiveness for speaker recognition. However, current feature processing methods are focusing on classical pooling on the output features of the pre-trained wav2vec2.0 model, such as mean pooling, max pooling etc. That methods take the features as the independent and irrelevant units, ignoring the inter-relationship among all the features, and do… ▽ More Pre-trained wav2vec2.0 model has been proved its effectiveness for speaker recognition. However, current feature processing methods are focusing on classical pooling on the output features of the pre-trained wav2vec2.0 model, such as mean pooling, max pooling etc. That methods take the features as the independent and irrelevant units, ignoring the inter-relationship among all the features, and do not take the features as an overall representation of a speaker. Gated Recurrent Unit (GRU), as a feature fusion method, can also be considered as a complicated pooling technique, mainly focuses on the temporal information, which may show poor performance in some situations that the main information is not on the temporal dimension. In this paper, we investigate the graph neural network (GNN) as a backend processing module based on wav2vec2.0 framework to provide a solution for the mentioned matters. The GNN takes all the output features as the graph signal data and extracts the related graph structure information of features for speaker recognition. Specifically, we first give a simple proof that the GNN feature fusion method can outperform than the mean, max, random pooling methods and so on theoretically. Then, we model the output features of wav2vec2.0 as the vertices of a graph, and construct the graph adjacency matrix by graph attention network (GAT). Finally, we follow the message passing neural network (MPNN) to design our message function, vertex update function and readout function to transform the speaker features into the graph features. The experiments show our performance can provide a relative improvement compared to the baseline methods. Code is available at xxx. △ Less

Submitted 18 March, 2023; originally announced March 2023.

arXiv:2302.08296 [pdf, other]

QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion

Authors: Houjian Guo, Chaoran Liu, Carlos Toshinori Ishi, Hiroshi Ishiguro

Abstract: With the development of automatic speech recognition (ASR) and text-to-speech (TTS) technology, high-quality voice conversion (VC) can be achieved by extracting source content information and target speaker information to reconstruct waveforms. However, current methods still require improvement in terms of inference speed. In this study, we propose a lightweight VITS-based VC model that uses the H… ▽ More With the development of automatic speech recognition (ASR) and text-to-speech (TTS) technology, high-quality voice conversion (VC) can be achieved by extracting source content information and target speaker information to reconstruct waveforms. However, current methods still require improvement in terms of inference speed. In this study, we propose a lightweight VITS-based VC model that uses the HuBERT-Soft model to extract content information features without speaker information. Through subjective and objective experiments on synthesized speech, the proposed model demonstrates competitive results in terms of naturalness and similarity. Importantly, unlike the original VITS model, we use the inverse short-time Fourier transform (iSTFT) to replace the most computationally expensive part. Experimental results show that our model can generate samples at over 5000 kHz on the 3090 GPU and over 250 kHz on the i9-10900K CPU, achieving competitive speed for the same hardware configuration. △ Less

Submitted 23 February, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

arXiv:2212.10730 [pdf, other]

doi 10.3390/app13063729

Real-time Path Planning of Driver-less Mining Trains with Time-dependent Physical Constraints

Authors: Xiaojiang Ren, Hui Guo, Sheng Kai, Guoqiang Mao

Abstract: While the increased automation levels of production and operation equipment have led to improved productivity of mining activity in open pit mines, the capacity of mine transport system become a bottleneck. The optimization of mine transport system is of great practical significance to reduce the production and operation cost and improve the production and organizational efficiency of mines. In th… ▽ More While the increased automation levels of production and operation equipment have led to improved productivity of mining activity in open pit mines, the capacity of mine transport system become a bottleneck. The optimization of mine transport system is of great practical significance to reduce the production and operation cost and improve the production and organizational efficiency of mines. In this paper we first formulate a multi-objective optimisation problem for mine railway scheduling by introducing a set of mathematical constraints. As the problem is NP-hard, we then devise a Mixed Integer Programming based solution to solve this problem, and develop an online framework accordingly. We finally conduct test cases to evaluate the performance of the proposed solution. Experimental results demonstrate that the proposed solution is efficient and able to generate train schedule in a real-time manner. △ Less

Submitted 6 January, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

arXiv:2211.06974 [pdf]

A Comparison between Network-Controlled Repeaters and Reconfigurable Intelligent Surfaces

Authors: Hao Guo, Charitha Madapatha, Behrooz Makki, Boris Dortschy, Lei Bao, Magnus Åström, Tommy Svensson

Abstract: Network-controlled repeater (NCR) has been recently considered as a study-item in 3GPP Release 18, and the discussions are continuing in a work-item. In this paper, we introduce the concept of NCRs, as a possible low-complexity device to support for network densification and compare the performance of the NCRs with those achieved by reconfigurable intelligent surfaces (RISs). The results are prese… ▽ More Network-controlled repeater (NCR) has been recently considered as a study-item in 3GPP Release 18, and the discussions are continuing in a work-item. In this paper, we introduce the concept of NCRs, as a possible low-complexity device to support for network densification and compare the performance of the NCRs with those achieved by reconfigurable intelligent surfaces (RISs). The results are presented for the cases with different beamforming methods and hardware impairment models of the RIS. Moreover, we introduce the objectives of the 3GPP Release 18 NCR work-item and study the effect of different parameters on the performance of NCR-assisted networks. As we show, with a proper deployment, the presence of NCRs/RISs can improve the network performance considerably. △ Less

Submitted 13 November, 2022; originally announced November 2022.

Comments: 7 pages, 7 figures, submitted to potential IEEE publication

arXiv:2211.00272 [pdf, other]

RF-CHORD: Towards Deployable RFID Localization System for Logistics Network

Authors: Bo Liang, Purui Wang, Renjie Zhao, Heyu Guo, Pengyu Zhang, Junchen Guo, Shunmin Zhu, Hongqiang Harry Liu, Xinyu Zhang, Chenren Xu

Abstract: RFID localization is considered the key enabler of automating the process of inventory tracking and management for high-performance logistic network. A practical and deployable RFID localization system needs to meet reliability, throughput, and range requirements. This paper presents RF-Chord, the first RFID localization system that simultaneously meets all three requirements. RF-Chord features a… ▽ More RFID localization is considered the key enabler of automating the process of inventory tracking and management for high-performance logistic network. A practical and deployable RFID localization system needs to meet reliability, throughput, and range requirements. This paper presents RF-Chord, the first RFID localization system that simultaneously meets all three requirements. RF-Chord features a one-shot multisine-constructed wideband design that can process RF signal with a 200 MHz bandwidth in real-time to facilitate one-shot localization at scale. In addition, multiple SINR enhancement techniques are designed for range extension. On top of that, a kernel-layer-based near-field localization framework and a multipath-suppression algorithm are proposed to reduce the 99% long-tail errors. Our empirical results show that RF-Chord can localize more than 180 tags 6 m away from a reader within 1 second and with 99% long-tail error of 0.786 m, achieving a 0% miss reading rate and ~0.01% cross-reading rate in the warehouse and fresh food delivery store deployment. △ Less

Submitted 1 November, 2022; originally announced November 2022.

Comments: To be published in NSDI 2023

arXiv:2210.15131 [pdf, other]

Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations

Authors: Haohan Guo, Fenglong Xie, Xixin Wu, Hui Lu, Helen Meng

Abstract: This paper aims to enhance low-resource TTS by reducing training data requirements using compact speech representations. A Multi-Stage Multi-Codebook (MSMC) VQ-GAN is trained to learn the representation, MSMCR, and decode it to waveforms. Subsequently, we train the multi-stage predictor to predict MSMCRs from the text for TTS synthesis. Moreover, we optimize the training strategy by leveraging mor… ▽ More This paper aims to enhance low-resource TTS by reducing training data requirements using compact speech representations. A Multi-Stage Multi-Codebook (MSMC) VQ-GAN is trained to learn the representation, MSMCR, and decode it to waveforms. Subsequently, we train the multi-stage predictor to predict MSMCRs from the text for TTS synthesis. Moreover, we optimize the training strategy by leveraging more audio to learn MSMCRs better for low-resource languages. It selects audio from other languages using speaker similarity metric to augment the training set, and applies transfer learning to improve training quality. In MOS tests, the proposed system significantly outperforms FastSpeech and VITS in standard and low-resource scenarios, showing lower data requirements. The proposed training strategy effectively enhances MSMCRs on waveform reconstruction. It improves TTS performance further, which wins 77% votes in the preference test for the low-resource TTS with only 15 minutes of paired data. △ Less

Submitted 26 October, 2022; originally announced October 2022.

Comments: Submitted to ICASSP 2023

arXiv:2209.10887 [pdf, other]

A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS

Authors: Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng

Abstract: We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis. A vector-quantized, variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by down-sampling progressively in multiple stages into MSMC Representations (MSMCRs) with different time resolutions, and quantizing them with multiple VQ codebooks,… ▽ More We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis. A vector-quantized, variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by down-sampling progressively in multiple stages into MSMC Representations (MSMCRs) with different time resolutions, and quantizing them with multiple VQ codebooks, respectively. Multi-stage predictors are trained to map the input text sequence to MSMCRs progressively by minimizing a combined loss of the reconstruction Mean Square Error (MSE) and "triplet loss". In synthesis, the neural vocoder converts the predicted MSMCRs into final speech waveforms. The proposed approach is trained and tested with an English TTS database of 16 hours by a female speaker. The proposed TTS achieves an MOS score of 4.41, which outperforms the baseline with an MOS of 3.62. Compact versions of the proposed TTS with much less parameters can still preserve high MOS scores. Ablation studies show that both multiple stages and multiple codebooks are effective for achieving high TTS performance. △ Less

Submitted 22 September, 2022; originally announced September 2022.

arXiv:2209.10367 [pdf, other]

Electromagnetic Field Exposure Avoidance thanks to Non-Intended User Equipment and RIS

Authors: Hao Guo, Dinh-Thuy Phan-Huy, Tommy Svensson

Abstract: On the one hand, there is a growing demand for high throughput which can be satisfied thanks to the deployment of new networks using massive multiple-input multiple-output (MIMO) and beamforming. On the other hand, in some countries or cities, there is a demand for arbitrarily low electromagnetic field exposure (EMFE) of people not concerned by the ongoing communication, which slows down the deplo… ▽ More On the one hand, there is a growing demand for high throughput which can be satisfied thanks to the deployment of new networks using massive multiple-input multiple-output (MIMO) and beamforming. On the other hand, in some countries or cities, there is a demand for arbitrarily low electromagnetic field exposure (EMFE) of people not concerned by the ongoing communication, which slows down the deployment of new networks. Recently, it has been proposed to take the opportunity, when designing the future 6th generation (6G), to offer, in addition to higher throughput, a new type of service: arbitrarily low EMFE. Recent works have shown that a reconfigurable intelligent surface (RIS), jointly optimized with the base station (BS) beamforming can improve the received throughput at the desired location whilst reducing EMFE everywhere. In this paper, we introduce a new concept of a non-intended user (NIU). An NIU is a user of the network who requests low EMFE when he/she is not downloading/uploading data. An NIU lets his/her device, called NIU equipment (NIUE), exchange some control signaling and pilots with the network, to help the network avoid exposing NIU to waves that are transporting data for another user of the network: the intended user (IU), whose device is called IU equipment (IUE). Specifically, we propose several new schemes to maximize the IU throughput under an EMFE constraint at the NIU (in practice, an interference constraint at the NIUE). Several propagation scenarios are investigated. Analytical and numerical results show that proper power allocation and beam optimization can remarkably boost the EMFE-constrained system's performance with limited complexity and channel information. △ Less

Submitted 7 October, 2022; v1 submitted 21 September, 2022; originally announced September 2022.

Comments: 6 pages, 6 figures. Accepted in Globecom 2022 Workshop

arXiv:2208.01382 [pdf]

A New Probabilistic V-Net Model with Hierarchical Spatial Feature Transform for Efficient Abdominal Multi-Organ Segmentation

Authors: Minfeng Xu, Heng Guo, Jianfeng Zhang, Ke Yan, Le Lu

Abstract: Accurate and robust abdominal multi-organ segmentation from CT imaging of different modalities is a challenging task due to complex inter- and intra-organ shape and appearance variations among abdominal organs. In this paper, we propose a probabilistic multi-organ segmentation network with hierarchical spatial-wise feature modulation to capture flexible organ semantic variants and inject the learn… ▽ More Accurate and robust abdominal multi-organ segmentation from CT imaging of different modalities is a challenging task due to complex inter- and intra-organ shape and appearance variations among abdominal organs. In this paper, we propose a probabilistic multi-organ segmentation network with hierarchical spatial-wise feature modulation to capture flexible organ semantic variants and inject the learnt variants into different scales of feature maps for guiding segmentation. More specifically, we design an input decomposition module via a conditional variational auto-encoder to learn organ-specific distributions on the low dimensional latent space and model richer organ semantic variations that is conditioned on input images.Then by integrating these learned variations into the V-Net decoder hierarchically via spatial feature transformation, which has the ability to convert the variations into conditional Affine transformation parameters for spatial-wise feature maps modulating and guiding the fine-scale segmentation. The proposed method is trained on the publicly available AbdomenCT-1K dataset and evaluated on two other open datasets, i.e., 100 challenging/pathological testing patient cases from AbdomenCT-1K fully-supervised abdominal organ segmentation benchmark and 90 cases from TCIA+&BTCV dataset. Highly competitive or superior quantitative segmentation results have been achieved using these datasets for four abdominal organs of liver, kidney, spleen and pancreas with reported Dice scores improved by 7.3% for kidneys and 9.7% for pancreas, while being ~7 times faster than two strong baseline segmentation methods(nnUNet and CoTr). △ Less

Submitted 2 August, 2022; originally announced August 2022.

Comments: 12 pages, 6 figures

arXiv:2207.05848 [pdf, other]

NEC: Speaker Selective Cancellation via Neural Enhanced Ultrasound Shadowing

Authors: Hanqing Guo, Chenning Li, Lingkun Li, Zhichao Cao, Qiben Yan, Li Xiao

Abstract: In this paper, we propose NEC (Neural Enhanced Cancellation), a defense mechanism, which prevents unauthorized microphones from capturing a target speaker's voice. Compared with the existing scrambling-based audio cancellation approaches, NEC can selectively remove a target speaker's voice from a mixed speech without causing interference to others. Specifically, for a target speaker, we design a D… ▽ More In this paper, we propose NEC (Neural Enhanced Cancellation), a defense mechanism, which prevents unauthorized microphones from capturing a target speaker's voice. Compared with the existing scrambling-based audio cancellation approaches, NEC can selectively remove a target speaker's voice from a mixed speech without causing interference to others. Specifically, for a target speaker, we design a Deep Neural Network (DNN) model to extract high-level speaker-specific but utterance-independent vocal features from his/her reference audios. When the microphone is recording, the DNN generates a shadow sound to cancel the target voice in real-time. Moreover, we modulate the audible shadow sound onto an ultrasound frequency, making it inaudible for humans. By leveraging the non-linearity of the microphone circuit, the microphone can accurately decode the shadow sound for target voice cancellation. We implement and evaluate NEC comprehensively with 8 smartphone microphones in different settings. The results show that NEC effectively mutes the target speaker at a microphone without interfering with other users' normal conversations. △ Less

Submitted 12 July, 2022; originally announced July 2022.

Comments: 12 pages

arXiv:2207.01219 [pdf, other]

Masked Self-Supervision for Remaining Useful Lifetime Prediction in Machine Tools

Authors: Haoren Guo, Haiyue Zhu, Jiahui Wang, Vadakkepat Prahlad, Weng Khuen Ho, Tong Heng Lee

Abstract: Prediction of Remaining Useful Lifetime(RUL) in the modern manufacturing and automation workplace for machines and tools is essential in Industry 4.0. This is clearly evident as continuous tool wear, or worse, sudden machine breakdown will lead to various manufacturing failures which would clearly cause economic loss. With the availability of deep learning approaches, the great potential and prosp… ▽ More Prediction of Remaining Useful Lifetime(RUL) in the modern manufacturing and automation workplace for machines and tools is essential in Industry 4.0. This is clearly evident as continuous tool wear, or worse, sudden machine breakdown will lead to various manufacturing failures which would clearly cause economic loss. With the availability of deep learning approaches, the great potential and prospect of utilizing these for RUL prediction have resulted in several models which are designed driven by operation data of manufacturing machines. Current efforts in these which are based on fully-supervised models heavily rely on the data labeled with their RULs. However, the required RUL prediction data (i.e. the annotated and labeled data from faulty and/or degraded machines) can only be obtained after the machine breakdown occurs. The scarcity of broken machines in the modern manufacturing and automation workplace in real-world situations increases the difficulty of getting sufficient annotated and labeled data. In contrast, the data from healthy machines is much easier to be collected. Noting this challenge and the potential for improved effectiveness and applicability, we thus propose (and also fully develop) a method based on the idea of masked autoencoders which will utilize unlabeled data to do self-supervision. In thus the work here, a noteworthy masked self-supervised learning approach is developed and utilized. This is designed to seek to build a deep learning model for RUL prediction by utilizing unlabeled data. The experiments to verify the effectiveness of this development are implemented on the C-MAPSS datasets (which are collected from the data from the NASA turbofan engine). The results rather clearly show that our development and approach here perform better, in both accuracy and effectiveness, for RUL prediction when compared with approaches utilizing a fully-supervised model. △ Less

Submitted 4 July, 2022; originally announced July 2022.

arXiv:2207.01218 [pdf, other]

CAM/CAD Point Cloud Part Segmentation via Few-Shot Learning

Authors: Jiahui Wang, Haiyue Zhu, Haoren Guo, Abdullah Al Mamun, Vadakkepat Prahlad, Tong Heng Lee

Abstract: 3D part segmentation is an essential step in advanced CAM/CAD workflow. Precise 3D segmentation contributes to lower defective rate of work-pieces produced by the manufacturing equipment (such as computer controlled CNCs), thereby improving work efficiency and attaining the attendant economic benefits. A large class of existing works on 3D model segmentation are mostly based on fully-supervised le… ▽ More 3D part segmentation is an essential step in advanced CAM/CAD workflow. Precise 3D segmentation contributes to lower defective rate of work-pieces produced by the manufacturing equipment (such as computer controlled CNCs), thereby improving work efficiency and attaining the attendant economic benefits. A large class of existing works on 3D model segmentation are mostly based on fully-supervised learning, which trains the AI models with large, annotated datasets. However, the disadvantage is that the resulting models from the fully-supervised learning methodology are highly reliant on the completeness of the available dataset, and its generalization ability is relatively poor to new unknown segmentation types (i.e. further additional novel classes). In this work, we propose and develop a noteworthy few-shot learning-based approach for effective part segmentation in CAM/CAD; and this is designed to significantly enhance its generalization ability and flexibly adapt to new segmentation tasks by using only relatively rather few samples. As a result, it not only reduces the requirements for the usually unattainable and exhaustive completeness of supervision datasets, but also improves the flexibility for real-world applications. As further improvement and innovation, we additionally adopt the transform net and the center loss block in the network. These characteristics serve to improve the comprehension for 3D features of the various possible instances of the whole work-piece and ensure the close distribution of the same class in feature space. △ Less

Submitted 16 July, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

Comments: 7 pages, 5 figures

arXiv:2205.14496 [pdf, other]

SuperVoice: Text-Independent Speaker Verification Using Ultrasound Energy in Human Speech

Authors: Hanqing Guo, Qiben Yan, Nikolay Ivanov, Ying Zhu, Li Xiao, Eric J. Hunter

Abstract: Voice-activated systems are integrated into a variety of desktop, mobile, and Internet-of-Things (IoT) devices. However, voice spoofing attacks, such as impersonation and replay attacks, in which malicious attackers synthesize the voice of a victim or simply replay it, have brought growing security concerns. Existing speaker verification techniques distinguish individual speakers via the spectrogr… ▽ More Voice-activated systems are integrated into a variety of desktop, mobile, and Internet-of-Things (IoT) devices. However, voice spoofing attacks, such as impersonation and replay attacks, in which malicious attackers synthesize the voice of a victim or simply replay it, have brought growing security concerns. Existing speaker verification techniques distinguish individual speakers via the spectrographic features extracted from an audible frequency range of voice commands. However, they often have high error rates and/or long delays. In this paper, we explore a new direction of human voice research by scrutinizing the unique characteristics of human speech at the ultrasound frequency band. Our research indicates that the high-frequency ultrasound components (e.g. speech fricatives) from 20 to 48 kHz can significantly enhance the security and accuracy of speaker verification. We propose a speaker verification system, SUPERVOICE that uses a two-stream DNN architecture with a feature fusion mechanism to generate distinctive speaker models. To test the system, we create a speech dataset with 12 hours of audio (8,950 voice samples) from 127 participants. In addition, we create a second spoofed voice dataset to evaluate its security. In order to balance between controlled recordings and real-world applications, the audio recordings are collected from two quiet rooms by 8 different recording devices, including 7 smartphones and an ultrasound microphone. Our evaluation shows that SUPERVOICE achieves 0.58% equal error rate in the speaker verification task, it only takes 120 ms for testing an incoming utterance, outperforming all existing speaker verification systems. Moreover, within 91 ms processing time, SUPERVOICE achieves 0% equal error rate in detecting replay attacks launched by 5 different loudspeakers. △ Less

Submitted 28 May, 2022; originally announced May 2022.

arXiv:2203.14748 [pdf, ps, other]

Universal Graph Filter Design based on Butterworth, Chebyshev and Elliptic Functions

Authors: Zirui Ge, Haiyan Guo, Tingting Wang, Zhen Yang

Abstract: Graph filters are crucial tools in processing the spectrum of graph signals. In this paper, we propose to design universal IIR graph filters with low computational complexity by using three kinds of functions, which are Butterworth, Chebyshev, and Elliptic functions, respectively. Specifically, inspired by the classical analog filter design method, we first derive the zeros and poles of graph freq… ▽ More Graph filters are crucial tools in processing the spectrum of graph signals. In this paper, we propose to design universal IIR graph filters with low computational complexity by using three kinds of functions, which are Butterworth, Chebyshev, and Elliptic functions, respectively. Specifically, inspired by the classical analog filter design method, we first derive the zeros and poles of graph frequency responses. With these zeros and poles, we construct the conjugate graph filters to design the Butterworth high pass graph filter, Chebyshev high pass graph filter, and Elliptic high pass graph filter, respectively. On this basis, we further propose to construct a desired graph filter of low pass, band pass, and band stop by map** the parameters of the desired graph filter to those of the equivalent high pass graph filter. Furthermore, we propose to set the graph filter order given the maximum passband attenuation and the minimum stopband attenuation. Our numerical results show that the proposed graph filter design methods realize the desired frequency response more accurately than the autoregressive moving average (ARMA) graph filter design method, the linear least-squares fitting (LLS) based graph filter design method, and the Chebyshev FIR polynomial graph filter design method. △ Less

Submitted 28 March, 2022; originally announced March 2022.

Comments: 17 pages, 7 figures

MSC Class: 05C31 ACM Class: G.1.10

Showing 1–50 of 101 results for author: Guo, H