Search | arXiv e-print repository

FlocOff: Data Heterogeneity Resilient Federated Learning with Communication-Efficient Edge Offloading

Authors: Mulei Ma, Chenyu Gong, Liekang Zeng, Yang Yang, Liantao Wu

Abstract: Federated Learning (FL) has emerged as a fundamental learning paradigm to harness massive data scattered at geo-distributed edge devices in a privacy-preserving way. Given the heterogeneous deployment of edge devices, however, their data are usually Non-IID, introducing significant challenges to FL including degraded training accuracy, intensive communication costs, and high computing complexity. To address these challenges, traditional approaches typically utilize adaptive mechanisms, which may suffer from scalability issues, increased computational overhead, and limited adaptability to diverse edge environments. This paper instead leverages the observation that computation offloading involves inherent functionalities such as node matching and service correlation to achieve data reshaping, and proposes the Federated learning based on computing Offloading (FlocOff) framework to address data heterogeneity and resource-constrained challenges. Specifically, FlocOff formulates the FL process with Non-IID data in edge scenarios and derives a rigorous analysis of the impact of imbalanced data distribution. Based on this, FlocOff decouples the optimization into two steps, namely: (1) Minimizes the Kullback-Leibler (KL) divergence via Computation Offloading scheduling (MKL-CO); (2) Minimizes the Communication Cost through Resource Allocation (MCC-RA). Extensive experimental results demonstrate that the proposed FlocOff effectively improves model convergence and accuracy by 14.3%-32.7% while reducing data heterogeneity under various data distributions.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2404.06674 [pdf, other]

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing

Authors: Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, Mingbo Ma

Abstract: We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit undesirable timbre leakage. Our work proposes solutions for each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to meet a wide array of tasks without additional model finetuning. Audio samples are available at https://voiceshopai.github.io.

Submitted 11 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2404.04904 [pdf, other]

Cross-Domain Audio Deepfake Detection: Dataset and Analysis

Authors: Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang

Abstract: Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate our models' outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect the detection accuracy, necessitating further research.

Submitted 7 April, 2024; originally announced April 2024.

arXiv:2403.02039 [pdf, other]

A Frequency-Domain Approach for Enhanced Performance and Task Flexibility in Finite-Time ILC

Authors: Max van Haren, Kentaro Tsurumoto, Masahiro Mae, Lennart Blanken, Wataru Ohnishi, Tom Oomen

Abstract: Iterative learning control (ILC) is capable of improving the tracking performance of repetitive control systems by utilizing data from past iterations. The aim of this paper is to achieve both task flexibility, which is often achieved by ILC with basis functions, and the performance of frequency-domain ILC, with an intuitive design procedure. A cost function for norm-optimal ILC is determined that recovers frequency-domain ILC, and consequently, the feedforward signal is parameterized in terms of basis functions and frequency-domain ILC. The resulting method has the performance and design procedure of frequency-domain ILC and the task flexibility of basis-function ILC, which are complementary to each other. Validation on a benchmark example confirms the capabilities of the framework.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.12482 [pdf, other]

SECP: A Speech Enhancement-Based Curation Pipeline For Scalable Acquisition Of Clean Speech

Authors: Adam Sabra, Cyprian Wronka, Michelle Mao, Samer Hijazi

Abstract: As more speech technologies rely on a supervised deep learning approach with clean speech as the ground truth, a methodology to onboard said speech at scale is needed. However, this approach needs to minimize the dependency on human listening and annotation, only requiring a human-in-the-loop when needed. In this paper, we address this issue by outlining the Speech Enhancement-based Curation Pipeline (SECP), which serves as a framework to onboard clean speech. This clean speech can then train a speech enhancement model, which can further refine the original dataset and thus close the iterative loop. By running two iterative rounds, we observe that enhanced output used as ground truth does not degrade model performance according to $Δ_{PESQ}$, a metric used in this paper. We also show through comparative mean opinion score (CMOS) based subjective tests that the highest and lowest bound of refined data is perceptually better than the original data.

Submitted 19 February, 2024; originally announced February 2024.

Comments: Accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024

arXiv:2311.07613 [pdf]

A Physics-informed Machine Learning-based Control Method for Nonlinear Dynamic Systems with Highly Noisy Measurements

Authors: Mason Ma, Jiajie Wu, Chase Post, Tony Shi, Jingang Yi, Tony Schmitz, Hong Wang

Abstract: This study presents a physics-informed machine learning-based control method for nonlinear dynamic systems with highly noisy measurements. Existing data-driven control methods that use machine learning for system identification cannot effectively cope with highly noisy measurements, resulting in unstable control performance. To address this challenge, the present study extends current physics-informed machine learning capabilities for modeling nonlinear dynamics with control and integrates them into a model predictive control framework. To demonstrate the capability of the proposed method, we test and validate it with two noisy nonlinear dynamic systems: the chaotic Lorenz 3 system and a turning machine tool. Analysis of the results illustrates that the proposed method outperforms state-of-the-art benchmarks as measured by both modeling accuracy and control performance for nonlinear dynamic systems under high-noise conditions.

Submitted 11 November, 2023; originally announced November 2023.

arXiv:2311.00332 [pdf, other]

SDF4CHD: Generative Modeling of Cardiac Anatomies with Congenital Heart Defects

Authors: Fanwei Kong, Sascha Stocker, Perry S. Choi, Michael Ma, Daniel B. Ennis, Alison Marsden

Abstract: Congenital heart disease (CHD) encompasses a spectrum of cardiovascular structural abnormalities, often requiring customized treatment plans for individual patients. Computational modeling and analysis of these unique cardiac anatomies can improve diagnosis and treatment planning and may ultimately lead to improved outcomes. Deep learning (DL) methods have demonstrated the potential to enable efficient treatment planning by automating cardiac segmentation and mesh construction for patients with normal cardiac anatomies. However, CHDs are often rare, making it challenging to acquire sufficiently large patient cohorts for training such DL models. Generative modeling of cardiac anatomies has the potential to fill this gap via the generation of virtual cohorts; however, prior approaches were largely designed for normal anatomies and cannot readily capture the significant topological variations seen in CHD patients. Therefore, we propose a type- and shape-disentangled generative approach suitable to capture the wide spectrum of cardiac anatomies observed in different CHD types and synthesize differently shaped cardiac anatomies that preserve the unique topology for specific CHD types. Our DL approach represents generic whole heart anatomies with CHD type-specific abnormalities implicitly using signed distance fields (SDF) based on CHD type diagnosis, which conveniently captures divergent anatomical variations across different types and represents meaningful intermediate CHD states. To capture the shape-specific variations, we then learn invertible deformations to morph the learned CHD type-specific anatomies and reconstruct patient-specific shapes. Our approach has the potential to augment the image-segmentation pairs for rarer CHD types for cardiac segmentation and generate cohorts of CHD cardiac meshes for computational simulation.

Submitted 8 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

arXiv:2310.15407 [pdf, ps, other]

Finite-Time Adaptive Fuzzy Tracking Control for Nonlinear State Constrained Pure-Feedback Systems

Authors: Ju Wu, Tong Wang, Min Ma

Abstract: This paper investigates the finite-time adaptive fuzzy tracking control problem for a class of pure-feedback systems with full-state constraints. With the help of the Mean-Value Theorem, the pure-feedback nonlinear system is transformed into the strict-feedback case. By employing a finite-time-stable-like function and a state transformation for the output tracking error, the output tracking error converges to a predefined set in a fixed finite interval. To tackle the problem of state constraints, integral Barrier Lyapunov functions are utilized to guarantee that the state variables remain within the prescribed constraints with a feasibility check. Fuzzy logic systems are utilized to approximate the unknown nonlinear functions. In addition, all the signals in the closed-loop system are guaranteed to be semi-globally ultimately uniformly bounded. Finally, two simulation examples are given to show the effectiveness of the proposed control strategy.

Submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.08804 [pdf, other]

Spiking Semantic Communication for Feature Transmission with HARQ

Authors: Mengyang Wang, Jiahui Li, Mengyao Ma, Xiaopeng Fan

Abstract: In Collaborative Intelligence (CI), the Artificial Intelligence (AI) model is divided between the edge and the cloud, with intermediate features being sent from the edge to the cloud for inference. Several deep learning-based Semantic Communication (SC) models have been proposed to reduce feature transmission overhead and mitigate channel noise interference. Previous research has demonstrated that Spiking Neural Network (SNN)-based SC models exhibit greater robustness on digital channels compared to Deep Neural Network (DNN)-based SC models. However, the existing SNN-based SC models require fixed time steps, resulting in fixed transmission bandwidths that cannot be adaptively adjusted based on channel conditions. To address this issue, this paper introduces a novel SC model called SNN-SC-HARQ, which combines the SNN-based SC model with the Hybrid Automatic Repeat Request (HARQ) mechanism. SNN-SC-HARQ comprises an SNN-based SC model that supports the transmission of features at varying bandwidths, along with a policy model that determines the appropriate bandwidth. Experimental results show that SNN-SC-HARQ can dynamically adjust the bandwidth according to the channel conditions without performance loss.

Submitted 12 October, 2023; originally announced October 2023.

arXiv:2309.10567 [pdf, other]

Multimodal Modeling For Spoken Language Identification

Authors: Shikhar Bharadwaj, Min Ma, Shikhar Vashishth, Ankur Bapna, Sriram Ganapathy, Vera Axelrod, Siddharth Dalmia, Wei Han, Yu Zhang, Daan van Esch, Sandy Ritchie, Partha Talukdar, Jason Riesa

Abstract: Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however, in the case of video data there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition.

Submitted 19 September, 2023; originally announced September 2023.

arXiv:2308.01317 [pdf]

ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders

Authors: Shawn Xu, Lin Yang, Christopher Kelly, Marcin Sieniek, Timo Kohlberger, Martin Ma, Wei-Hung Weng, Atilla Kiraly, Sahar Kazemzadeh, Zakkai Melamed, Jungyeon Park, Patricia Strachan, Yun Liu, Chuck Lau, Preeti Singh, Christina Chen, Mozziyar Etemadi, Sreenivasa Raju Kalidindi, Yossi Matias, Katherine Chou, Greg S. Corrado, Shravya Shetty, Daniel Tse, Shruthi Prabhakara, Daniel Golden , et al. (3 additional authors not shown)

Abstract: In this work, we present an approach, which we call Embeddings for Language/Image-aligned X-Rays, or ELIXR, that leverages a language-aligned image encoder combined or grafted onto a fixed LLM, PaLM 2, to perform a broad range of chest X-ray tasks. We train this lightweight adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieved state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (~2,200 images) and 10% (~22,000 images) training data), and semantic search (0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them). Compared to existing data-efficient methods including supervised contrastive learning (SupCon), ELIXR required two orders of magnitude less data to reach similar performance. ELIXR also showed promise on CXR vision-language tasks, demonstrating overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.

Submitted 7 September, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

arXiv:2308.00393 [pdf, other]

A Survey of Time Series Anomaly Detection Methods in the AIOps Domain

Authors: Zhenyu Zhong, Qiliang Fan, Jiacheng Zhang, Minghua Ma, Shenglin Zhang, Yongqian Sun, Qingwei Lin, Yuzhi Zhang, Dan Pei

Abstract: Internet-based services have seen remarkable success, generating vast amounts of monitored key performance indicators (KPIs) as univariate or multivariate time series. Monitoring and analyzing these time series are crucial for researchers, service operators, and on-call engineers to detect outliers or anomalies indicating service failures or significant events. Numerous advanced anomaly detection methods have emerged to address availability and performance issues. This review offers a comprehensive overview of time series anomaly detection in Artificial Intelligence for IT operations (AIOps), which uses AI capabilities to automate and optimize operational workflows. Additionally, it explores future directions for real-world and next-generation time-series anomaly detection based on recent advancements.

Submitted 1 August, 2023; originally announced August 2023.

arXiv:2307.10982 [pdf, other]

MASR: Multi-label Aware Speech Representation

Authors: Anjali Raj, Shikhar Bharadwaj, Sriram Ganapathy, Min Ma, Shikhar Vashishth

Abstract: In recent years, speech representation learning has been constructed primarily as a self-supervised learning (SSL) task, using the raw audio signal alone, while ignoring the side-information that is often available for a given speech recording. In this paper, we propose MASR, a Multi-label Aware Speech Representation learning framework, which addresses the aforementioned limitations. MASR enables the inclusion of multiple external knowledge sources to enhance the utilization of meta-data information. The external knowledge sources are incorporated in the form of sample-level pair-wise similarity matrices that are useful in a hard-mining loss. A key advantage of the MASR framework is that it can be combined with any choice of SSL method. Using MASR representations, we perform evaluations on several downstream tasks such as language identification, speech recognition and other non-semantic tasks such as speaker and emotion recognition. In these experiments, we illustrate significant performance improvements for MASR over other established benchmarks. We perform a detailed analysis on the language identification task to provide insights on how the proposed loss function enables the representations to separate closely related languages.

Submitted 25 September, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: Accepted at ASRU 2023

arXiv:2307.04327 [pdf]

Legal Decision-making for Highway Automated Driving

Authors: Xiaohan Ma, Wenhao Yu, Chengxiang Zhao, Changjun Wang, Wenhui Zhou, Guangming Zhao, Mingyue Ma, Weida Wang, Lin Yang, Rui Mu, Hong Wang, Jun Li

Abstract: Compliance with traffic laws is a fundamental requirement for human drivers on the road, and autonomous vehicles must adhere to traffic laws as well. However, current autonomous vehicles prioritize safety and collision avoidance primarily in their decision-making and planning, which will lead to misunderstandings and distrust from human drivers and may even result in accidents in mixed traffic flow. Therefore, ensuring the compliance of the autonomous driving decision-making system is essential for ensuring the safety of autonomous driving and promoting the widespread adoption of autonomous driving technology. To this end, the paper proposes a trigger-based layered compliance decision-making framework. This framework utilizes the decision intent at the highest level as a signal to activate an online violation monitor that identifies the type of violation committed by the vehicle. Then, a four-layer architecture for compliance decision-making is employed to generate compliant trajectories. Using this system, autonomous vehicles can detect and correct potential violations in real-time, thereby enhancing safety and building public confidence in autonomous driving technology. Finally, the proposed method is evaluated on the DJI AD4CHE highway dataset under four typical highway scenarios: speed limit, following distance, overtaking, and lane-changing. The results indicate that the proposed method increases the vehicle's overall compliance rate from 13.85% to 84.46%, while reducing the proportion of active violations to 0%, demonstrating its effectiveness.

Submitted 9 July, 2023; originally announced July 2023.

Comments: 14 pages, 17 figures

arXiv:2306.17697 [pdf, ps, other]

Analysis of Oversampling in Uplink Massive MIMO-OFDM with Low-Resolution ADCs

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Italo Atzeni, Markku Juntti

Abstract: Low-resolution analog-to-digital converters (ADCs) have emerged as an efficient solution for massive multiple-input multiple-output (MIMO) systems to reap high data rates with reasonable power consumption and hardware complexity. In this paper, we analyze the performance of oversampling in uplink massive MIMO orthogonal frequency-division multiplexing (MIMO-OFDM) systems with low-resolution ADCs. Considering both the temporal and spatial correlation of the quantization distortion, we derive an approximate closed-form expression of an achievable sum rate, which reveals how the oversampling ratio (OSR), the ADC resolution, and the signal-to-noise ratio (SNR) jointly affect the system performance. In particular, we demonstrate that oversampling can effectively improve the sum rate by mitigating the impact of the quantization distortion, especially at high SNR and with very low ADC resolution. Furthermore, we show that the considered low-resolution massive MIMO-OFDM system can achieve the same performance as the unquantized one when both the SNR and the OSR are sufficiently high. Numerical simulations confirm our analysis.

Submitted 30 June, 2023; originally announced June 2023.

Comments: 5 pages, 5 figures, to appear in IEEE SPAWC 2023

arXiv:2306.10232

Multi-Task Offloading via Graph Neural Networks in Heterogeneous Multi-access Edge Computing

Authors: Mulei Ma

Abstract: In the rapidly evolving field of Heterogeneous Multi-access Edge Computing (HMEC), efficient task offloading plays a pivotal role in optimizing system throughput and resource utilization. However, existing task offloading methods often fall short of adequately modeling the dependency topology relationships between offloaded tasks, which limits their effectiveness in capturing the complex interdependencies of task features. To address this limitation, we propose a task offloading mechanism based on Graph Neural Networks (GNN). Our modeling approach takes into account factors such as task characteristics, network conditions, and available resources at the edge, and embeds these captured features into the graph structure. By utilizing GNNs, our mechanism can capture and analyze the intricate relationships between task features, enabling a more comprehensive understanding of the underlying dependency topology. Through extensive evaluations in heterogeneous networks, our proposed algorithm improves by 18.6%-53.8% over greedy and approximate algorithms in optimizing system throughput and resource utilization. Our experiments showcase the advantage of considering the intricate interplay of task features using GNN-based modeling.

Submitted 30 May, 2024; v1 submitted 16 June, 2023; originally announced June 2023.

Comments: Insufficient completion, there are some errors in the current version

arXiv:2306.04374 [pdf, other]

Label Aware Speech Representation Learning For Language Identification

Authors: Shikhar Vashishth, Shikhar Bharadwaj, Sriram Ganapathy, Ankur Bapna, Min Ma, Wei Han, Vera Axelrod, Partha Talukdar

Abstract: Speech representation learning approaches for non-semantic tasks such as language recognition have either explored supervised embedding extraction methods using a classifier model or self-supervised representation learning approaches using raw data. In this paper, we propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task. This framework, termed Label Aware Speech Representation (LASR) learning, uses a triplet based objective function to incorporate language labels along with the self-supervised loss function. The speech representations are further fine-tuned for the downstream task. The language recognition experiments are performed on two public datasets - FLEURS and Dhwani. In these experiments, we illustrate that the proposed LASR framework improves over the state-of-the-art systems on language identification. We also report an analysis of the robustness of the LASR approach to noisy/missing labels as well as its application to multi-lingual speech recognition tasks.

Submitted 7 June, 2023; originally announced June 2023.

Comments: Accepted at Interspeech 2023

arXiv:2305.15719 [pdf, other]

Efficient Neural Music Generation

Authors: Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang

Abstract: Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modelings. Yet, sampling with the MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for a real-time generation. Efficient music generation with a quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audios of state-of-the-art quality meanwhile reducing 95.7% or 99.6% forward passes in MusicLM, respectively, for sampling 10s or 30s music. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2303.12466 [pdf, ps, other]

doi 10.1109/ICC45041.2023

Beam Squint Analysis and Mitigation via Hybrid Beamforming Design in THz Communications

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Markku Juntti

Abstract: We investigate the beam squint effect in uniform planar arrays (UPAs) and propose an efficient hybrid beamforming (HBF) design to mitigate the beam squint in multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) systems operating at terahertz band. We first analyze the array gain and derive the closed-form beam squint ratio that characterizes the severity of the beam squint effect on UPAs. The effect is shown to be more severe with a higher fractional bandwidth, while it can be significantly mitigated when the shape of a UPA approaches a square. We then focus on the HBF design that maximizes the system spectral efficiency. The design problem is challenging due to the frequency-flat nature and hardware constraints of the analog beamformer. We overcome the challenges by proposing an efficient decoupling design in which the digital and analog beamformers admit closed-form solutions, which facilitate practical implementations. Numerical results validate our analysis and show that the proposed HBF design is robust to beam squint, and thus, it outperforms the state-of-the-art methods in wideband massive MIMO systems.

Submitted 22 March, 2023; originally announced March 2023.

Comments: 6 pages, 7 figures, to appear in IEEE ICC 2023

arXiv:2303.03470 [pdf, other]

Partial-Information, Longitudinal Cyber Attacks on LiDAR in Autonomous Vehicles

Authors: R. Spencer Hallyburton, Qingzhao Zhang, Z. Morley Mao, Miroslav Pajic

Abstract: What happens to an autonomous vehicle (AV) if its data are adversarially compromised? Prior security studies have addressed this question through mostly unrealistic threat models, with limited practical relevance, such as white-box adversarial learning or nanometer-scale laser aiming and spoofing. With growing evidence that cyber threats pose real, imminent danger to AVs and cyber-physical systems (CPS) in general, we present and evaluate a novel AV threat model: a cyber-level attacker capable of disrupting sensor data but lacking any situational awareness. We demonstrate that even though the attacker has minimal knowledge and only access to raw data from a single sensor (i.e., LiDAR), she can design several attacks that critically compromise perception and tracking in multi-sensor AVs. To mitigate vulnerabilities and advance secure architectures in AVs, we introduce two improvements for security-aware fusion: a probabilistic data-asymmetry monitor and a scalable track-to-track fusion of 3D LiDAR and monocular detections (T2T-3DLM); we demonstrate that the approaches significantly reduce attack effectiveness. To support objective safety and security evaluations in AVs, we release our security evaluation platform, AVsec, which is built on security-relevant metrics to benchmark AVs on gold-standard longitudinal AV datasets and AV simulators.

Submitted 8 December, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

arXiv:2303.01723 [pdf, other]

AI-Empowered Hybrid MIMO Beamforming

Authors: Nir Shlezinger, Mengyuan Ma, Ortal Lavi, Nhan Thanh Nguyen, Yonina C. Eldar, Markku Juntti

Abstract: Hybrid multiple-input multiple-output (MIMO) is an attractive technology for realizing extreme massive MIMO systems envisioned for future wireless communications in a scalable and power-efficient manner. However, the fact that hybrid MIMO systems implement part of their beamforming in analog and part in digital makes the optimization of their beampattern notably more challenging compared with conventional fully digital MIMO. Consequently, recent years have witnessed a growing interest in using data-aided artificial intelligence (AI) tools for hybrid beamforming design. This article reviews candidate strategies to leverage data to improve real-time hybrid beamforming design. We discuss the architectural constraints and characterize the core challenges associated with hybrid beamforming optimization. We then present how these challenges are treated via conventional optimization, and identify different AI-aided design approaches. These can be roughly divided into purely data-driven deep learning models and different forms of deep unfolding techniques for combining AI with classical optimization. We provide a systematic comparative study between existing approaches including both numerical evaluations and qualitative measures. We conclude by presenting future research opportunities associated with the incorporation of AI in hybrid MIMO systems.

Submitted 3 March, 2023; originally announced March 2023.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2302.12041 [pdf, other]

Deep Unfolding Hybrid Beamforming Designs for THz Massive MIMO Systems

Authors: Nhan Thanh Nguyen, Mengyuan Ma, Nir Shlezinger, Yonina C. Eldar, A. L. Swindlehurst, Markku Juntti

Abstract: Hybrid beamforming (HBF) is a key enabler for wideband terahertz (THz) massive multiple-input multiple-output (mMIMO) communications systems. A core challenge with designing HBF systems stems from the fact that their application often involves a non-convex, highly complex optimization of large dimensions. In this paper, we propose HBF schemes that leverage data to enable efficient designs for both the fully-connected HBF (FC-HBF) and dynamic sub-connected HBF (SC-HBF) architectures. We develop a deep unfolding framework based on factorizing the optimal fully digital beamformer into analog and digital terms and formulating two corresponding equivalent least squares (LS) problems. Then, the digital beamformer is obtained via a closed-form LS solution, while the analog beamformer is obtained via ManNet, a lightweight sparsely-connected deep neural network based on unfolding projected gradient descent. Incorporating ManNet into the developed deep unfolding framework leads to the ManNet-based FC-HBF scheme. We show that the proposed ManNet can also be applied to SC-HBF designs after determining the connections between the radio frequency chain and antennas. We further develop a simplified version of ManNet, referred to as subManNet, that directly produces the sparse analog precoder for SC-HBF architectures. Both networks are trained with an unsupervised training procedure. Numerical results verify that the proposed ManNet/subManNet-based HBF approaches outperform the conventional model-based and deep unfolded counterparts with very low complexity and a fast run time. For example, in a simulation with 128 transmit antennas, it attains a slightly higher spectral efficiency than the Riemannian manifold scheme while running over 1000 times faster and reducing complexity by a factor of more than six.

Submitted 23 February, 2023; originally announced February 2023.

Comments: This paper has been submitted to IEEE Transactions on Signal Processing

arXiv:2212.05751 [pdf, other]

Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network

Authors: Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuanzhe Chen, Mingbo Ma, Yuping Wang, Yuxuan Wang

Abstract: The goal of accent conversion (AC) is to convert the accent of speech into the target accent while preserving the content and speaker identity. AC enables a variety of applications, such as language learning, speech content creation, and data augmentation. Previous methods rely on reference utterances in the inference phase or are unable to preserve speaker identity. To address these issues, we propose a zero-shot reference-free accent conversion method, which is able to convert unseen speakers' utterances into a target accent. Pseudo Siamese Disentanglement Network (PSDN) is proposed to disentangle the accent from the content representation. Experimental results show that our model generates speech samples with much higher accentedness than the input and comparable naturalness, on two-way conversion including foreign-to-native and native-to-foreign.

Submitted 10 August, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: Accepted by INTERSPEECH 2023

arXiv:2210.06890 [pdf, ps, other]

Switch-based Hybrid Beamforming Transceiver Design for Wideband Communications with Beam Squint

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Markku Juntti

Abstract: Hybrid beamforming (HBF) transceiver architectures based on frequency-independent phase shifters (PS-HBF) are sensitive to the phases and physical directions with limited capability to compensate for the detrimental effects of the beam squint. Motivated by the fact that switches are phase-independent and more power/cost efficient than PSs, we consider the switch-based HBF (SW-HBF) for wideband large-scale multiple-input multiple-output systems in this paper. We first derive a closed-form expression of the beam squint ratio and compare the expected array gains of both SW-HBF and PS-HBF architectures. The results show that SW-HBF is more robust to the beam squint effect. We then focus on the SW-HBF designs to maximize the spectral efficiency (SE) in both single-user and multiuser systems, which are both non-convex mixed-integer problems. For the former, by combining the tabu search (TS) method and projected gradient ascent (PGA), we propose an efficient heuristic PGA-TS algorithm to design analog beamformers while the digital ones admit closed-form solutions. For the latter, we develop a two-step algorithm based on fractional programming and the PGA-TS method. Simulations show that the proposed SW-HBF schemes are efficient and can outperform PS-based HBF architectures in terms of both SE and energy efficiency in terahertz communication systems.

Submitted 20 November, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: 15 pages, 15 figures

arXiv:2210.06836 [pdf, other]

SNN-SC: A Spiking Semantic Communication Framework for Feature Transmission

Authors: Mengyang Wang, Jiahui Li, Mengyao Ma, Xiaopeng Fan, Yonghong Tian

Abstract: In Collaborative Intelligence (CI), Artificial Intelligence (AI) models are split between edge devices and cloud. Features extracted from input on edge devices are transmitted to the cloud for subsequent tasks. Extracting task-related and compact information is critical when transmission bandwidth is limited. In this paper, we propose a task-oriented Semantic Communication (SC) framework (SNN-SC) to address this problem. In SC, only important information for downstream tasks is extracted and transmitted. However, most of the existing SC works only transmit analog information on the AWGN channel and cannot be directly used for digital channels. SNN-SC fills this gap by using Spiking Neural Networks (SNNs) to extract and transmit semantic information. Since the outputs of SNNs are binary spikes, SNN-SC can be directly applied to digital channels. In SNN-SC, a new spiking neuron is proposed to help the cloud recover binary semantic information into informative floating-point features. Furthermore, we improve the performance of SNN-SC by maximizing the entropy of the semantic information. We evaluate the performance of SNN-SC on different collaborative classification models, digital channels, and bandwidths. The experimental results show that SNN-SC is more robust than the CNN-based SC framework and separate source and channel coding method on digital channels.

Submitted 17 April, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

arXiv:2210.06747 [pdf, other]

DCANet: Differential Convolution Attention Network for RGB-D Semantic Segmentation

Authors: Lizhi Bai, Jun Yang, Chunqi Tian, Yaoru Sun, Maoyu Mao, Yanjun Xu, Weirong Xu

Abstract: Combining RGB images and the corresponding depth maps in semantic segmentation has proven effective in the past few years. Existing RGB-D modal fusion methods either lack the non-linear feature fusion ability or treat both modal images equally, regardless of the intrinsic distribution gap or information loss. Here we find that depth maps are suitable to provide intrinsic fine-grained patterns of objects due to their local depth continuity, while RGB images effectively provide a global view. Based on this, we propose a pixel differential convolution attention (DCA) module to consider geometric information and local-range correlations for depth data. Furthermore, we extend DCA to ensemble differential convolution attention (EDCA) which propagates long-range contextual dependencies and seamlessly incorporates spatial distribution for RGB data. DCA and EDCA dynamically adjust convolutional weights by pixel difference to enable self-adaptation at local and long range, respectively. A two-branch network built with DCA and EDCA, called Differential Convolutional Network (DCANet), is proposed to fuse local and global information of two-modal data. Consequently, the individual advantages of RGB and depth data are emphasized. Our DCANet is shown to set a new state-of-the-art performance for RGB-D semantic segmentation on two challenging benchmark datasets, i.e., NYUDv2 and SUN-RGBD.

Submitted 13 October, 2022; originally announced October 2022.

arXiv:2208.02792 [pdf]

A Cooperative Perception Environment for Traffic Operations and Control

Authors: Hanlin Chen, Brian Liu, Xumiao Zhang, Feng Qian, Z. Morley Mao, Yiheng Feng

Abstract: Existing data collection methods for traffic operations and control usually rely on infrastructure-based loop detectors or probe vehicle trajectories. Connected and automated vehicles (CAVs) not only can report data about themselves but also can provide the status of all detected surrounding vehicles. Integration of perception data from multiple CAVs as well as infrastructure sensors (e.g., LiDAR) can provide richer information even under a very low penetration rate. This paper aims to develop a cooperative data collection system, which integrates LiDAR point cloud data from both infrastructure and CAVs to create a cooperative perception environment for various transportation applications. The state-of-the-art 3D detection models are applied to detect vehicles in the merged point cloud. We test the proposed cooperative perception environment with the max pressure adaptive signal control model in a co-simulation platform with CARLA and SUMO. Results show that very low penetration rates of CAV plus an infrastructure sensor are sufficient to achieve comparable performance with 30% or higher penetration rates of connected vehicles (CV). We also show the equivalent CV penetration rate (E-CVPR) under different CAV penetration rates to demonstrate the data collection efficiency of the cooperative perception environment.

Submitted 4 August, 2022; originally announced August 2022.

arXiv:2206.07008 [pdf, other]

doi 10.1109/LSP.2022.3184251

Constellation Design for Deep Joint Source-Channel Coding

Authors: Mengyang Wang, Jiahui Li, Mengyao Ma, Xiaopeng Fan

Abstract: Deep learning-based joint source-channel coding (JSCC) has shown excellent performance in image and feature transmission. However, the output values of the JSCC encoder are continuous, which makes the constellation of modulation complex and dense. It is hard and expensive to design radio frequency chains for transmitting such full-resolution constellation points. In this paper, two methods of mapping the full-resolution constellation to finite constellation are proposed for real system implementation. The constellation mapping results of the proposed methods correspond to regular constellation and irregular constellation, respectively. We apply the methods to existing deep JSCC models and evaluate them on AWGN channels with different signal-to-noise ratios (SNRs). Experimental results show that the proposed methods outperform the traditional uniform quadrature amplitude modulation (QAM) constellation mapping method by only adding a few additional parameters.

Submitted 7 June, 2022; originally announced June 2022.

arXiv:2205.12446 [pdf, other]

FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

Authors: Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna

Abstract: We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.

Submitted 24 May, 2022; originally announced May 2022.

arXiv:2205.03524 [pdf, other]

Dual Adversarial Adaptation for Cross-Device Real-World Image Super-Resolution

Authors: Xiaoqian Xu, Pengxu Wei, Weikai Chen, Mingzhi Mao, Liang Lin, Guanbin Li

Abstract: Due to the sophisticated imaging process, an identical scene captured by different cameras could exhibit distinct imaging patterns, introducing distinct proficiency among the super-resolution (SR) models trained on images from different devices. In this paper, we investigate a novel and practical task coded cross-device SR, which strives to adapt a real-world SR model trained on the paired images captured by one camera to low-resolution (LR) images captured by arbitrary target devices. The proposed task is highly challenging due to the absence of paired data from various imaging devices. To address this issue, we propose an unsupervised domain adaptation mechanism for real-world SR, named Dual ADversarial Adaptation (DADA), which only requires LR images in the target domain with available real paired data from a source camera. DADA employs the Domain-Invariant Attention (DIA) module to establish the basis of target model training even without HR supervision. Furthermore, the dual framework of DADA facilitates an Inter-domain Adversarial Adaptation (InterAA) in one branch for two LR input images from two domains, and an Intra-domain Adversarial Adaptation (IntraAA) in two branches for an LR input image. InterAA and IntraAA together improve the model transferability from the source domain to the target. We empirically conduct experiments under six Real to Real adaptation settings among three different cameras, and achieve superior performance compared with existing state-of-the-art approaches. We also evaluate the proposed DADA to address the adaptation to the video camera, which presents a promising research topic to promote the wide applications of real-world super-resolution. Our source code is publicly available at https://github.com/lonelyhope/DADA.git.

Submitted 6 May, 2022; originally announced May 2022.

arXiv:2205.03122 [pdf]

doi 10.1364/BOE.463057

Ultrathin, high-speed, all-optical photoacoustic endomicroscopy probe for guiding minimally invasive surgery

Authors: Tianrui Zhao, Truc Thuy Pham, Christian Baker, Michelle T. Ma, Sebastien Ourselin, Tom Vercauteren, Edward Zhang, Paul C. Beard, Wenfeng Xia

Abstract: Photoacoustic (PA) endoscopy has shown significant potential for clinical diagnosis and surgical guidance. Multimode fibres (MMFs) are becoming increasingly attractive for the development of miniature endoscopy probes owing to ultrathin size, low cost and diffraction-limited spatial resolution enabled by wavefront shaping. However, current MMF-based PA endomicroscopy probes are either limited by a bulky ultrasound detector or a low imaging speed which hindered their usability. In this work, we report the development of a highly miniaturised and high-speed PA endomicroscopy probe that is integrated within the cannula of a 20 gauge medical needle. This probe comprises a MMF for delivering the PA excitation light and a single-mode optical fibre with a plano-concave microresonator for ultrasound detection. Wavefront shaping with a digital micromirror device enabled rapid raster-scanning of a focused light spot at the distal end of the MMF for tissue interrogation. High-resolution PA imaging of mouse red blood cells covering an area 100 microns in diameter was achieved with the needle probe at ~3 frames per second. Mosaicing imaging was performed after fibre characterisation by translating the needle probe to enlarge the field-of-view in real-time. The developed ultrathin PA endomicroscopy probe is promising for guiding minimally invasive surgery by providing functional, molecular and microstructural information of tissue in real-time.

Submitted 6 May, 2022; originally announced May 2022.

arXiv:2203.09690 [pdf, other]

A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

Authors: He Bai, Renjie Zheng, Junkun Chen, Xintong Li, Mingbo Ma, Liang Huang

Abstract: Recently, speech representation learning has improved many speech-related tasks such as speech recognition, speech classification, and speech-to-text translation. However, all the above tasks are in the direction of speech understanding, but for the inverse direction, speech synthesis, the potential of representation learning is yet to be realized, due to the challenging nature of generating high-quality speech. To address this problem, we propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$^3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training. In this way, the pretrained model can generate high quality reconstructed spectrogram, which can be applied to the speech editing and unseen speaker TTS directly. Experiments show A$^3$T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.

Submitted 18 June, 2022; v1 submitted 17 March, 2022; originally announced March 2022.

Comments: Accepted by ICML 2022, 12 pages, 10 figures

arXiv:2111.01544 [pdf]

Comprehensive and Clinically Accurate Head and Neck Organs at Risk Delineation via Stratified Deep Learning: A Large-scale Multi-Institutional Study

Authors: Dazhou Guo, Jia Ge, Xianghua Ye, Senxiang Yan, Yi Xin, Yuchen Song, Bing-shen Huang, Tsung-Min Hung, Zhuotun Zhu, Ling Peng, Yan** Ren, Rui Liu, Gong Zhang, Mengyuan Mao, Xiaohua Chen, Zhongjie Lu, Wenxiang Li, Yuzhen Chen, Lingyun Huang, Jing Xiao, Adam P. Harrison, Le Lu, Chien-Yu Lin, Dakai Jin, Tsung-Ying Ho

Abstract: Accurate organ at risk (OAR) segmentation is critical to reduce the radiotherapy post-treatment complications. Consensus guidelines recommend a set of more than 40 OARs in the head and neck (H&N) region; however, due to the predictable prohibitive labor-cost of this task, most institutions choose a substantially simplified protocol by delineating a smaller subset of OARs and neglecting the dose distributions associated with other OARs. In this work we propose a novel, automated and highly effective stratified OAR segmentation (SOARS) system using deep learning to precisely delineate a comprehensive set of 42 H&N OARs. SOARS stratifies 42 OARs into anchor, mid-level, and small & hard subcategories, with specifically derived neural network architectures for each category by neural architecture search (NAS) principles. We built SOARS models using 176 training patients in an internal institution and independently evaluated on 1327 external patients across six different institutions. It consistently outperformed other state-of-the-art methods by at least 3-5% in Dice score for each institutional evaluation (up to 36% relative error reduction in other metrics). More importantly, extensive multi-user studies evidently demonstrated that 98% of the SOARS predictions need only very minor or no revisions for direct clinical acceptance (saving 90% of radiation oncologists' workload), and their segmentation and dosimetric accuracy are within or smaller than the inter-user variation. These findings confirmed the strong clinical applicability of SOARS for the OAR delineation process in H&N cancer radiotherapy workflows, with improved efficiency, comprehensiveness, and quality.

Submitted 1 November, 2021; originally announced November 2021.

arXiv:2110.15561 [pdf, ps, other]

Exposing Deepfake with Pixel-wise AR and PPG Correlation from Faint Signals

Authors: Maoyu Mao, Jun Yang

Abstract: Deepfake poses a serious threat to the reliability of judicial evidence and intellectual property protection. In spite of an urgent need for Deepfake identification, existing pixel-level detection methods are increasingly unable to resist the growing realism of fake videos and lack generalization. In this paper, we propose a scheme to expose Deepfake through faint signals hidden in face videos. This scheme extracts two types of minute information hidden between face pixels: photoplethysmography (PPG) features and auto-regressive (AR) features, which are used as the basis for forensics in the temporal and spatial domains, respectively. According to the principle of PPG, tracking the absorption of light by blood cells allows remote estimation of the temporal domains heart rate (HR) of face video, and irregular HR fluctuations can be seen as traces of tampering. On the other hand, AR coefficients are able to reflect the inter-pixel correlation, and can also reflect the traces of smoothing caused by up-sampling in the process of generating fake faces. Furthermore, the scheme combines asymmetric convolution block (ACBlock)-based improved densely connected networks (DenseNets) to achieve face video authenticity forensics. Its asymmetric convolutional structure enhances the robustness of network to the input feature image upside-down and left-right flipping, so that the sequence of feature stitching does not affect detection results. Simulation results show that our proposed scheme provides more accurate authenticity detection results on multiple deep forgery datasets and has better generalization compared to the benchmark strategy.

Submitted 29 October, 2021; originally announced October 2021.

arXiv:2110.09744 [pdf, ps, other]

doi 10.1109/TGRS.2022.3169228

Spectral Variability Augmented Sparse Unmixing of Hyperspectral Images

Authors: Ge Zhang, Shaohui Mei, Mingyang Ma, Yan Feng, Qian Du

Abstract: Spectral unmixing (SU) expresses the mixed pixels existed in hyperspectral images as the product of endmember and abundance, which has been widely used in hyperspectral imagery analysis. However, the influence of light, acquisition conditions and the inherent properties of materials, results in that the identified endmembers can vary spectrally within a given image (construed as spectral variability). To address this issue, recent methods usually use a priori obtained spectral library to represent multiple characteristic spectra of the same object, but few of them extracted the spectral variability explicitly. In this paper, a spectral variability augmented sparse unmixing model (SVASU) is proposed, in which the spectral variability is extracted for the first time. The variable spectra are divided into two parts of intrinsic spectrum and spectral variability for spectral reconstruction, and modeled synchronously in the SU model adding the regular terms restricting the sparsity of abundance and the generalization of the variability coefficient. It is noted that the spectral variability library and the intrinsic spectral library are all constructed from the In-situ observed image. Experimental results over both synthetic and real-world data sets demonstrate that the augmented decomposition by spectral variability significantly improves the unmixing performance than the decomposition only by spectral library, as well as compared to state-of-the-art algorithms.

Submitted 21 October, 2021; v1 submitted 19 October, 2021; originally announced October 2021.

arXiv:2110.06301 [pdf, ps, other]

Switch-based Hybrid Beamforming for Wideband Multi-carrier Communications

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Markku Juntti

Abstract: Switch-based hybrid beamforming (SW-HBF) architectures are promising for realizing massive multiple-input multiple-output (MIMO) communications systems because of their low cost and low power consumption. In this paper, we study the performance of SW-HBF in a wideband multi-carrier MIMO communication system considering the beam squint effect. We aim at designing the switch-based combiner that maximizes the system spectral efficiency (SE). However, the design problem is challenging because the analog combining matrix elements are binary variables. To overcome this, we propose tabu search-based (TS) SW-HBF schemes that can attain near-optimal performance with reasonable computational complexity. Furthermore, we compare the total power consumption and energy efficiency (EE) of the SW-HBF architecture to those of the phase-shifter-based hybrid beamforming (PS-HBF) architecture. Numerical simulations show that the proposed algorithms can efficiently find near-optimal solutions. Moreover, the SW-HBF scheme can significantly mitigate the beam squint effect and is less affected by the number of subcarriers than PS-HBF. It also provides improved SE and EE performance compared to PS-HBF schemes.

Submitted 21 November, 2021; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: 6 pages, 8 figures, to appear in the Proceedings of the 25th International ITG Workshop on Smart Antennas (WSA 2021)

arXiv:2109.13226 [pdf, other]

doi 10.1109/JSTSP.2022.3182537

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, et al. (1 additional author not shown)

Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitudes of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.

Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

arXiv:2108.06691 [pdf, ps, other]

Closed-Form Hybrid Beamforming Solution for Spectral Efficiency Upper Bound Maximization in mmWave MIMO-OFDM Systems

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Markku Juntti

Abstract: Hybrid beamforming is considered a key enabler to realize millimeter wave (mmWave) multiple-input multiple-output (MIMO) communications due to its capability of considerably reducing the number of costly and power-hungry radio frequency chains in the transceiver. However, in mmWave MIMO orthogonal frequency-division multiplexing (MIMO-OFDM) systems, hybrid beamforming design is challenging because the analog precoder and combiner are required to be shared across the whole employed bandwidth. In this paper, we propose closed-form solutions to the problem of designing the analog precoder/combiner in a mmWave MIMO-OFDM system by maximizing the upper bound of the spectral efficiency. The closed-form solutions facilitate the design of analog beamformers while guaranteeing state-of-art performance. Numerical results show that the proposed algorithm attains a slightly improved performance with much lower computational complexity compared to the considered benchmarks.

Submitted 24 August, 2021; v1 submitted 15 August, 2021; originally announced August 2021.

Comments: 5 pages, 5 figures, to appear in the proceedings of VTC2021-Fall

arXiv:2106.07976 [pdf, other]

doi 10.1145/3485730.3493444

Federated Learning for Internet of Things: A Federated Learning Framework for On-device Anomaly Data Detection

Authors: Tuo Zhang, Chaoyang He, Tianhao Ma, Lei Gao, Mark Ma, Salman Avestimehr

Abstract: Federated learning can be a promising solution for enabling IoT cybersecurity (i.e., anomaly detection in the IoT environment) while preserving data privacy and mitigating the high communication/storage overhead (e.g., high-frequency data from time-series sensors) of centralized over-the-cloud approaches. In this paper, to further push forward this direction with a comprehensive study in both algorithm and system design, we build FedIoT platform that contains FedDetect algorithm for on-device anomaly data detection and a system design for realistic evaluation of federated learning on IoT devices. Furthermore, the proposed FedDetect learning framework improves the performance by utilizing a local adaptive optimizer (e.g., Adam) and a cross-round learning rate scheduler. In a network of realistic IoT devices (Raspberry PI), we evaluate FedIoT platform and FedDetect algorithm in both model and system performance. Our results demonstrate the efficacy of federated learning in detecting a wider range of attack types occurred at multiple devices. The system efficiency analysis indicates that both end-to-end training time and memory cost are affordable and promising for resource-constrained IoT devices. The source code is publicly available at https://github.com/FedML-AI/FedIoT.

Submitted 18 October, 2021; v1 submitted 15 June, 2021; originally announced June 2021.

Journal ref: Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, November 2021, Pages 413-419

arXiv:2106.07098 [pdf, other]

Security Analysis of Camera-LiDAR Fusion Against Black-Box Attacks on Autonomous Vehicles

Authors: R. Spencer Hallyburton, Yupei Liu, Yulong Cao, Z. Morley Mao, Miroslav Pajic

Abstract: To enable safe and reliable decision-making, autonomous vehicles (AVs) feed sensor data to perception algorithms to understand the environment. Sensor fusion with multi-frame tracking is becoming increasingly popular for detecting 3D objects. Thus, in this work, we perform an analysis of camera-LiDAR fusion, in the AV context, under LiDAR spoofing attacks. Recently, LiDAR-only perception was shown vulnerable to LiDAR spoofing attacks; however, we demonstrate these attacks are not capable of disrupting camera-LiDAR fusion. We then define a novel, context-aware attack: frustum attack, and show that out of 8 widely used perception algorithms - across 3 architectures of LiDAR-only and 3 architectures of camera-LiDAR fusion - all are significantly vulnerable to the frustum attack. In addition, we demonstrate that the frustum attack is stealthy to existing defenses against LiDAR spoofing as it preserves consistencies between camera and LiDAR semantics. Finally, we show that the frustum attack can be exercised consistently over time to form stealthy longitudinal attack sequences, compromising the tracking module and creating adverse outcomes on end-to-end AV control.

Submitted 21 February, 2022; v1 submitted 13 June, 2021; originally announced June 2021.

arXiv:2106.06636 [pdf, other]

Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR

Authors: Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang

Abstract: Simultaneous speech-to-text translation is widely useful in many scenarios. The conventional cascaded approach uses a pipeline of streaming ASR followed by simultaneous MT, but suffers from error propagation and extra latency. To alleviate these issues, recent efforts attempt to directly translate the source speech into target text simultaneously, but this is much harder due to the combination of two separate tasks. We instead propose a new paradigm with the advantages of both cascaded and end-to-end approaches. The key idea is to use two separate, but synchronized, decoders on streaming ASR and direct speech-to-text translation (ST), respectively, and the intermediate results of ASR guide the decoding policy of (but is not fed as input to) ST. During training time, we use multitask learning to jointly learn these two tasks with a shared encoder. En-to-De and En-to-Es experiments on the MuSTC dataset demonstrate that our proposed technique achieves substantially better translation quality at similar levels of latency.

Submitted 11 June, 2021; originally announced June 2021.

Comments: accepted by Findings of ACL 2021

arXiv:2104.14830 [pdf, other]

Scaling End-to-End Models for Large-Scale Multilingual ASR

Authors: Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, Junwen Bai

Abstract: Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity. We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.6K to 53.5K hours. We adopt GShard [1] to efficiently scale up to 10B parameters. Empirically, we find that (1) scaling the number of model parameters is an effective way to solve the capacity bottleneck - our 500M-param model already outperforms monolingual baselines and scaling it to 1B and 10B brought further quality gains; (2) larger models are not only more data efficient, but also more efficient in terms of training cost as measured in TPU days - the 1B-param model reaches the same accuracy at 34% of training time as the 500M-param model; (3) given a fixed capacity budget, adding depth works better than width and large encoders do better than large decoders; (4) with continuous training, they can be adapted to new languages and domains.

Submitted 11 September, 2021; v1 submitted 30 April, 2021; originally announced April 2021.

Comments: ASRU 2021

arXiv:2104.04993 [pdf, other]

The DKU System Description for The Interspeech 2021 Auto-KWS Challenge

Authors: Yechen Wang, Yan Jia, Murong Ma, Zexin Cai, Ming Li

Abstract: This paper introduces the system submitted by the DKU-SMIIP team for the Auto-KWS 2021 Challenge. Our implementation consists of a two-stage keyword spotting system based on query-by-example spoken term detection and a speaker verification system. We employ two different detection algorithms in our proposed keyword spotting system. The first stage adopts subsequence dynamic time warping for template matching based on frame-level language-independent bottleneck feature and phoneme posterior probability. We use a sliding window template matching algorithm based on acoustic word embeddings to further verify the detection from the first stage. As a result, our KWS system achieves an average score of 0.61 on the feedback dataset, which outperforms the baseline1 system by 0.25.

Submitted 11 April, 2021; originally announced April 2021.

Comments: 5 pages, 1 figure, submitted to INTERSPEECH

arXiv:2102.03357 [pdf, other]

Machine Learning for Electronic Design Automation: A Survey

Authors: Guyue Huang, Jingbo Hu, Yifan He, Jialong Liu, Mingyuan Ma, Zhaoyang Shen, Juejian Wu, Yuanfan Xu, Hengrui Zhang, Kai Zhong, Xuefei Ning, Yuzhe Ma, Haoyu Yang, Bei Yu, Huazhong Yang, Yu Wang

Abstract: With the down-scaling of CMOS technology, the design complexity of very large-scale integrated (VLSI) is increasing. Although the application of machine learning (ML) techniques in electronic design automation (EDA) can trace its history back to the 90s, the recent breakthrough of ML and the increasing complexity of EDA tasks have aroused more interests in incorporating ML to solve EDA tasks. In this paper, we present a comprehensive review of existing ML for EDA studies, organized following the EDA hierarchy.

Submitted 8 March, 2021; v1 submitted 10 January, 2021; originally announced February 2021.

Comments: Accepted by TODAES. The first 10 authors are ordered alphabetically

arXiv:2102.00202 [pdf, other]

SNR-adaptive deep joint source-channel coding for wireless image transmission

Authors: Mingze Ding, Jiahui Li, Mengyao Ma, Xiaopeng Fan

Abstract: Considering the problem of joint source-channel coding (JSCC) for multi-user transmission of images over noisy channels, an autoencoder-based novel deep joint source-channel coding scheme is proposed in this paper. In the proposed JSCC scheme, the decoder can estimate the signal-to-noise ratio (SNR) and use it to adaptively decode the transmitted image. Experiments demonstrate that the proposed scheme achieves impressive results in adaptability for different SNRs and is robust to the decoder's estimation error of the SNR. To the best of our knowledge, this is the first deep JSCC scheme that focuses on the adaptability for different SNRs and can be applied to multi-user scenarios.

Submitted 2 February, 2021; v1 submitted 30 January, 2021; originally announced February 2021.

Comments: Accepted in IEEE ICASSP 2021

arXiv:2011.01460 [pdf, other]

Training Wake Word Detection with Synthesized Speech Data on Confusion Words

Authors: Yan Jia, Zexin Cai, Murong Ma, Zeqing Zhao, Xuyang Wang, Junjie Wang, Ming Li

Abstract: Confusing-words are commonly encountered in real-life keyword spotting applications, which causes severe degradation of performance due to complex spoken terms and various kinds of words that sound similar to the predefined keywords. To enhance the wake word detection system's robustness on such scenarios, we investigate two data augmentation setups for training end-to-end KWS systems. One is involving the synthesized data from a multi-speaker speech synthesis system, and the other augmentation is performed by adding random noise to the acoustic feature. Experimental results show that augmentations help improve the system's robustness. Moreover, by augmenting the training set with the synthetic data generated by the multi-speaker text-to-speech system, we achieve a significant improvement regarding confusing words scenario.

Submitted 2 November, 2020; originally announced November 2020.

Comments: Submitted to ICASSP 2021

arXiv:2011.00384 [pdf, other]

doi 10.1109/ICCPS48487.2020.00013

Predictive Monitoring with Logic-Calibrated Uncertainty for Cyber-Physical Systems

Authors: Meiyi Ma, John Stankovic, Ezio Bartocci, Lu Feng

Abstract: Predictive monitoring -- making predictions about future states and monitoring if the predicted states satisfy requirements -- offers a promising paradigm in supporting the decision making of Cyber-Physical Systems (CPS). Existing works of predictive monitoring mostly focus on monitoring individual predictions rather than sequential predictions. We develop a novel approach for monitoring sequential predictions generated from Bayesian Recurrent Neural Networks (RNNs) that can capture the inherent uncertainty in CPS, drawing on insights from our study of real-world CPS datasets. We propose a new logic named Signal Temporal Logic with Uncertainty (STL-U) to monitor a flowpipe containing an infinite set of uncertain sequences predicted by Bayesian RNNs. We define STL-U strong and weak satisfaction semantics based on if all or some sequences contained in a flowpipe satisfy the requirement. We also develop methods to compute the range of confidence levels under which a flowpipe is guaranteed to strongly (weakly) satisfy an STL-U formula. Furthermore, we develop novel criteria that leverage STL-U monitoring results to calibrate the uncertainty estimation in Bayesian RNNs. Finally, we evaluate the proposed approach via experiments with real-world datasets and a simulated smart city case study, which show very encouraging results of STL-U based predictive monitoring approach outperforming baselines.

Submitted 24 July, 2021; v1 submitted 31 October, 2020; originally announced November 2020.

Comments: This article appears as part of the ESWEEK-TECS special issue and was presented in the International Conference on Embedded Software (EMSOFT), 2021

Journal ref: In 2020 ACM/IEEE 11th International Conference on Cyber-Physical Systems (ICCPS) (pp. 51-62). IEEE

arXiv:2010.12096 [pdf, other]

Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data

Authors: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao

Abstract: Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which is then used to distill knowledge into streaming ASR models. This way, we scale the training of streaming models to up to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the word error rate (WER) of RNNT models not only on LibriSpeech but also on YouTube data in four languages. For example, in French, we are able to reduce the WER by 16.4% relatively to a baseline streaming model by leveraging a non-streaming teacher model trained on the same amount of labeled data as the baseline.

Submitted 21 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

arXiv:2010.04753 [pdf, other]

Impact Evaluation of Falsified Data Attacks on Connected Vehicle Based Traffic Signal Control

Authors: Shihong Ed Huang, Wai Wong, Yiheng Feng, Qi Alfred Chen, Z. Morley Mao, Henry X. Liu

Abstract: Connected vehicle (CV) technology enables data exchange between vehicles and transportation infrastructure and therefore has great potentials to improve current traffic signal control systems. However, this connectivity might also bring cyber security concerns. As the first step in investigating the cyber security of CV-based traffic signal control (CV-TSC) systems, potential cyber threats need to be identified and corresponding impact needs to be evaluated. In this paper, we aim to evaluate the impact of cyber attacks on CV-TSC systems by considering a realistic attack scenario in which the control logic of a CV-TSC system is unavailable to attackers. Our threat model presumes that an attacker may learn the control logic using a surrogate model. Based on the surrogate model, the attacker may launch falsified data attacks to influence signal control decisions. In the case study, we realistically evaluate the impact of falsified data attacks on an existing CV-TSC system (i.e., I-SIG).

Submitted 9 October, 2020; originally announced October 2020.

arXiv:2007.03724 [pdf, other]

Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data

Authors: Alireza Sadeghi, Gang Wang, Meng Ma, Georgios B. Giannakis

Abstract: Data used to train machine learning models can be adversarial--maliciously constructed by adversaries to fool the model. Challenge also arises by privacy, confidentiality, or due to legal constraints when data are geographically gathered and stored across multiple learners, some of which may hold even an "anonymized" or unreliable dataset. In this context, the distributionally robust optimization framework is considered for training a parametric model, both in centralized and federated learning settings. The objective is to endow the trained model with robustness against adversarially manipulated input data, or, distributional uncertainties, such as mismatches between training and testing data distributions, or among datasets stored at different workers. To this aim, the data distribution is assumed unknown, and lies within a Wasserstein ball centered around the empirical data distribution. This robust learning task entails an infinite-dimensional optimization problem, which is challenging. Leveraging a strong duality result, a surrogate is obtained, for which three stochastic primal-dual algorithms are developed: i) stochastic proximal gradient descent with an $ε$-accurate oracle, which invokes an oracle to solve the convex sub-problems; ii) stochastic proximal gradient descent-ascent, which approximates the solution of the convex sub-problems via a single gradient ascent step; and, iii) a distributionally robust federated learning algorithm, which solves the sub-problems locally at different workers where data are stored. Compared to the empirical risk minimization and federated learning methods, the proposed algorithms offer robustness with little computation overhead. Numerical tests using image datasets showcase the merits of the proposed algorithms under several existing adversarial attacks and distributional uncertainties.

Submitted 7 July, 2020; originally announced July 2020.

Comments: 14 pages, 5 figures

Showing 1–50 of 63 results for author: Ma, M