Search | arXiv e-print repository

Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer

Authors: Liming Wang, Yuan Gong, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James Glass

Abstract: Automatic prediction of amyotrophic lateral sclerosis (ALS) disease progression provides a more efficient and objective alternative to manual approaches. We propose ALS longitudinal speech transformer (ALST), a neural network-based automatic predictor of ALS disease progression from longitudinal speech recordings of ALS patients. By taking advantage of high-quality pretrained speech features and longitudinal information in the recordings, our best model achieves 91.0\% AUC, improving upon the previous best model by 5.6\% relative on the ALS TDI dataset. Careful analysis reveals that ALST is capable of fine-grained and interpretable predictions of ALS progression, especially for distinguishing between rarer and more severe cases. Code is publicly available.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.14931 [pdf, other]

Multi-beam Training for Near-field Communications in High-frequency Bands

Authors: Cong Zhou, Changsheng You, Zixuan Huang, Shuo Shi, Yi Gong, Chan-Byoung Chae, Kaibin Huang

Abstract: In this paper, we study efficient multi-beam training design for near-field communications to reduce the beam training overhead of conventional single-beam training methods. In particular, the array-division based multi-beam training method, which is widely used in far-field communications, cannot be directly applied to the near-field scenario, since different sub-arrays may observe different user angles and there exist coverage holes in the angular domain. To address these issues, we first devise a new near-field multi-beam codebook by sparsely activating a portion of antennas to form a sparse linear array (SLA), hence generating multiple beams simultaneously by effectively exploiting the near-field grating lobes. Next, a two-stage near-field beam training method is proposed, in which several candidate user locations are first identified based on multi-beam sweeping over time, followed by a second stage that determines the true user location with a small number of single-beam sweeps. Finally, numerical results show that our proposed multi-beam training method significantly reduces the beam training overhead of conventional single-beam training methods, while achieving comparable rate performance in data transmission.

Submitted 21 June, 2024; originally announced June 2024.

Comments: In this paper, a novel near-field multi-beam training scheme is proposed by sparsely activating a portion of antennas to form a sparse linear array

arXiv:2406.13205 [pdf]

Application of Computer Deep Learning Model in Diagnosis of Pulmonary Nodules

Authors: Yutian Yang, Hongjie Qiu, Yulu Gong, Xiaoyi Liu, Yang Lin, Muqing Li

Abstract: The 3D simulation model of the lung was established by using the reconstruction method. A computer aided pulmonary nodule detection model was constructed. The process iterates over the images to refine the lung nodule recognition model based on neural networks. It is integrated with 3D virtual modeling technology to improve the interactivity of the system, so as to achieve intelligent recognition of lung nodules. A 3D RCNN (Region-based Convolutional Neural Network) was utilized for feature extraction and nodule identification. The LUNA16 large sample database was used as the research dataset. FROC (Free-response Receiver Operating Characteristic) analysis was applied to evaluate the model, calculating sensitivity at various false positive rates to derive the average FROC. Compared with conventional diagnostic methods, the recognition rate was significantly improved. This technique facilitates the detection of pulmonary abnormalities at an initial phase, which holds immense value for the prompt diagnosis of lung malignancies.

Submitted 19 June, 2024; originally announced June 2024.

MSC Class: 68T10; 92C50

arXiv:2406.11158 [pdf, other]

Dynamic Modeling and Control for an Offshore Semisubmersible Floating Wind Turbine

Authors: Yingjie Gong, Qinmin Yang, Hua Geng, Wenchao Meng, Lin Wang

Abstract: Floating wind turbines (FWTs) hold significant potential for the exploitation of offshore renewable energy resources. Nevertheless, prior to the construction of FWTs, it is imperative to tackle several critical challenges, especially the issue of performance degradation under combined wind and wave loads. This study begins with the development of a simplified nonlinear dynamical model for a semi-submersible FWT. In particular, both the rotor dynamics and the finite rotations of the platform are considered in the presented modeling approach, thereby effectively capturing the complex interplay between the platform, tower, nacelle, and rotor under combined wind and wave loads. Subsequently, based on the developed FWT model, a novel adaptive nonlinear pitch controller is formulated with the goal of striking a trade-off between regulating power generation and reducing platform motion. Notably, the proposed control strategy adopts a continuous control approach, which circumvents the chattering phenomenon commonly associated with sliding mode control. Furthermore, the controller integrates an online approximator and a robust integral of the sign of the tracking error, facilitating real-time learning of unknown system dynamics while compensating for bounded disturbances. Finally, both the accuracy of the established nonlinear FWT model in predicting key dynamics and the superiority of the presented pitch controller are validated through comprehensive comparative studies.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.10082 [pdf, other]

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

Authors: Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

Abstract: Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a better speech-to-text decoder. The huge training data difference motivates us to adapt Whisper to handle video inputs. Inspired by Flamingo which injects visual features into language models, we propose Whisper-Flamingo which integrates visual features into the Whisper speech recognition and translation model with gated cross attention. Our audio-visual Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and En-X translation for 6 languages in noisy conditions. Moreover, Whisper-Flamingo is a versatile model and conducts all of these tasks using one set of parameters, while prior methods are trained separately on each language.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Interspeech 2024. Code https://github.com/roudimit/whisper-flamingo

arXiv:2405.05446 [pdf, other]

GDGS: Gradient Domain Gaussian Splatting for Sparse Representation of Radiance Fields

Authors: Yuanhao Gong

Abstract: The 3D Gaussian splatting methods are getting popular. However, they work directly on the signal, leading to a dense representation of the signal. Even with some techniques such as pruning or distillation, the results are still dense. In this paper, we propose to model the gradient of the original signal. The gradients are much sparser than the original signal. Therefore, the gradients use far fewer Gaussian splats, leading to more efficient storage and thus higher computational performance during both training and rendering. Thanks to the sparsity, during the view synthesis, only a small amount of pixels are needed, leading to much higher computational performance ($100\sim 1000\times$ faster). And the 2D image can be recovered from the gradients via solving a Poisson equation with linear computation complexity. Several experiments are performed to confirm the sparseness of the gradients and the computation performance of the proposed method. The method can be applied to various applications, such as human body modeling and indoor environment modeling.

Submitted 8 May, 2024; originally announced May 2024.

Comments: arXiv admin note: text overlap with arXiv:2404.09105
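
The gradient-domain idea in the GDGS abstract above can be illustrated with a one-dimensional analogue. This is a minimal sketch, not the paper's method: the signal values are made up, and in 1D the Poisson solve reduces to a running sum, whereas a 2D image would need an actual Poisson solver. It only shows why a piecewise-constant signal has a sparse gradient and that the signal is recoverable from it in linear time.

```python
from itertools import accumulate

# Hypothetical piecewise-constant 1D "image row" (values are illustrative only).
signal = [3, 3, 3, 3, 7, 7, 7, 2, 2, 2]

# Gradient representation: keep the first sample as a boundary value,
# then store differences between neighbouring samples.
gradient = [signal[0]] + [b - a for a, b in zip(signal, signal[1:])]

# The gradient is sparse: only the boundary value and the two jumps are nonzero.
nonzero = sum(1 for g in gradient if g != 0)

# In 1D, integrating the gradient (a running sum) recovers the signal in O(n).
recovered = list(accumulate(gradient))

print(gradient, nonzero, recovered == signal)
```

Ten dense samples are replaced by three nonzero entries here; the sparsity gain reported in the abstract depends on the scene and is not reproduced by this toy.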

arXiv:2404.19087 [pdf, other]

Deep Reinforcement Learning for Advanced Longitudinal Control and Collision Avoidance in High-Risk Driving Scenarios

Authors: Dianwei Chen, Yaobang Gong, Xianfeng Yang

Abstract: Existing Advanced Driver Assistance Systems primarily focus on the vehicle directly ahead, often overlooking potential risks from following vehicles. This oversight can lead to ineffective handling of high-risk situations, such as high-speed, closely spaced, multi-vehicle scenarios where emergency braking by one vehicle might trigger a pile-up collision. To overcome these limitations, this study introduces a novel deep reinforcement learning based algorithm for longitudinal control and collision avoidance. This proposed algorithm effectively considers the behavior of both leading and following vehicles. Its implementation in simulated high-risk scenarios, which involve emergency braking in dense traffic where traditional systems typically fail, has demonstrated the algorithm's ability to prevent potential pile-up collisions, including those involving heavy-duty vehicles.

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.09105 [pdf, other]

EGGS: Edge Guided Gaussian Splatting for Radiance Fields

Authors: Yuanhao Gong

Abstract: The Gaussian splatting methods are getting popular. However, their loss function only contains the $\ell_1$ norm and the structural similarity between the rendered and input images, without considering the edges in these images. It is well-known that the edges in an image provide important information. Therefore, in this paper, we propose an Edge Guided Gaussian Splatting (EGGS) method that leverages the edges in the input images. More specifically, we give the edge region a higher weight than the flat region. With such edge guidance, the resulting Gaussian particles focus more on the edges instead of the flat regions. Moreover, such edge guidance does not increase the computation cost during the training and rendering stage. The experiments confirm that such a simple edge-weighted loss function indeed yields an improvement of about $1\sim2$ dB on several different data sets. By simply plugging in the edge guidance, the proposed method can improve all Gaussian splatting methods in different scenarios, such as human head modeling, building 3D reconstruction, etc.

Submitted 22 April, 2024; v1 submitted 13 April, 2024; originally announced April 2024.
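
The edge-weighted loss described in the EGGS abstract above can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulation: the weight form `1 + edge_gain * edge_strength`, the forward-difference edge detector, and the names `edge_map`, `edge_weighted_l1`, and `edge_gain` are all hypothetical, and the structural-similarity term mentioned in the abstract is omitted.

```python
def edge_map(img):
    """Forward-difference edge strength (|dx| + |dy|) per pixel; an assumed detector."""
    h, w = len(img), len(img[0])
    return [[abs(img[r][min(c + 1, w - 1)] - img[r][c]) +
             abs(img[min(r + 1, h - 1)][c] - img[r][c])
             for c in range(w)] for r in range(h)]


def edge_weighted_l1(rendered, target, edge_gain=2.0):
    """Mean absolute error where pixels on target edges get a higher weight."""
    edges = edge_map(target)
    h, w = len(target), len(target[0])
    total = 0.0
    for r in range(h):
        for c in range(w):
            weight = 1.0 + edge_gain * edges[r][c]  # flat regions keep weight 1
            total += weight * abs(rendered[r][c] - target[r][c])
    return total / (h * w)


# Toy target with one vertical edge between columns 1 and 2.
target = [[0.0, 0.0, 1.0, 1.0],
          [0.0, 0.0, 1.0, 1.0]]

flat_error = [row[:] for row in target]
flat_error[0][0] = 0.1   # same-size error placed in a flat region
edge_error = [row[:] for row in target]
edge_error[0][1] = 0.1   # same-size error placed on the edge

loss_flat = edge_weighted_l1(flat_error, target)
loss_edge = edge_weighted_l1(edge_error, target)
print(loss_flat, loss_edge)
```

An identical per-pixel error costs more when it falls on an edge, which is the behaviour the abstract attributes to the edge guidance. The weights depend only on the input image, so they could be computed once before training.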

arXiv:2404.07121 [pdf, other]

Digital Over-the-Air Computation: Achieving High Reliability via Bit-Slicing

Authors: Jiawei Liu, Yi Gong, Kaibin Huang

Abstract: 6G mobile networks aim to realize ubiquitous intelligence at the network edge via distributed learning, sensing, and data analytics. Their common operation is to aggregate high-dimensional data, which causes a communication bottleneck that cannot be resolved using traditional orthogonal multi-access schemes. A promising solution, called over-the-air computation (AirComp), exploits channels' waveform superposition property to enable simultaneous access, thereby overcoming the bottleneck. Nevertheless, its reliance on uncoded linear analog modulation exposes data to perturbation by noise and interference. Hence, the traditional analog AirComp falls short of meeting the high-reliability requirement for 6G. Overcoming the limitation of analog AirComp motivates this work, which focuses on developing a framework for digital AirComp. The proposed framework features digital modulation of each data value, integrated with the bit-slicing technique to allocate its bits to multiple symbols, thereby increasing the AirComp reliability. To optimally detect the aggregated digital symbols, we derive the optimal maximum a posteriori detector that is shown to outperform the traditional maximum likelihood detector. Furthermore, a comparative performance analysis of digital AirComp with respect to its analog counterpart with repetition coding is conducted to quantify the practical signal-to-noise ratio (SNR) regime favoring the proposed scheme. On the other hand, digital AirComp is enhanced by further development to feature awareness of heterogeneous bit importance levels and its exploitation in channel adaptation. Lastly, simulation results demonstrate the achievability of substantial reliability improvement of digital AirComp over its analog counterpart given the same channel uses.

Submitted 10 April, 2024; originally announced April 2024.
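
The bit-slicing step named in the digital AirComp abstract above can be sketched in a few lines. This is a noise-free illustration under assumptions that are not from the paper: ideal superposition (the channel simply adds the transmitted slice values), non-negative integer data, and hypothetical helper names. The paper's modulation scheme and its maximum a posteriori detector are not modeled.

```python
def bit_slice(value, num_slices, bits_per_slice):
    """Split a non-negative integer into fixed-width slices, least significant first."""
    mask = (1 << bits_per_slice) - 1
    return [(value >> (k * bits_per_slice)) & mask for k in range(num_slices)]


def aggregate_over_the_air(device_values, num_slices, bits_per_slice):
    """Idealized channel: the k-th slices of all devices add up in one symbol slot."""
    slices = [bit_slice(v, num_slices, bits_per_slice) for v in device_values]
    return [sum(s[k] for s in slices) for k in range(num_slices)]


def reconstruct_sum(aggregated_slices, bits_per_slice):
    """Recombine the per-slot sums, weighting slot k by 2**(k * bits_per_slice)."""
    return sum(a << (k * bits_per_slice) for k, a in enumerate(aggregated_slices))


values = [200, 17, 255]  # hypothetical 8-bit values held by three devices
aggregated = aggregate_over_the_air(values, num_slices=4, bits_per_slice=2)
total = reconstruct_sum(aggregated, bits_per_slice=2)
print(aggregated, total)
```

Because slicing is linear in the value, the sum of all devices' values equals the weighted sum of the per-slot aggregates, so the receiver never needs the individual values. Each slot carries only a few bits per device, which is the property the abstract links to higher reliability.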

arXiv:2403.16212 [pdf, other]

Leveraging Deep Learning and Xception Architecture for High-Accuracy MRI Classification in Alzheimer Diagnosis

Authors: Shaojie Li, Haichen Qu, Xinqi Dong, Bo Dang, Hengyi Zang, Yulu Gong

Abstract: Exploring the application of deep learning technologies in the field of medical diagnostics, Magnetic Resonance Imaging (MRI) provides a unique perspective for observing and diagnosing complex neurodegenerative diseases such as Alzheimer Disease (AD). With advancements in deep learning, particularly in Convolutional Neural Networks (CNNs) and the Xception network architecture, we are now able to analyze and classify vast amounts of MRI data with unprecedented accuracy. The progress of this technology not only enhances our understanding of brain structural changes but also opens up new avenues for monitoring disease progression through non-invasive means and potentially allows for precise diagnosis in the early stages of the disease. This study aims to classify MRI images using deep learning models to identify different stages of Alzheimer Disease through a series of innovative data processing and model construction steps. Our experimental results show that the deep learning framework based on the Xception model achieved a 99.6% accuracy rate in the multi-class MRI image classification task, demonstrating its potential application value in assistive diagnosis. Future research will focus on expanding the dataset, improving model interpretability, and clinical validation to further promote the application of deep learning technology in the medical field, with the hope of bringing earlier diagnosis and more personalized treatment plans to Alzheimer Disease patients.

Submitted 24 March, 2024; originally announced March 2024.

arXiv:2403.14775 [pdf, ps, other]

RIS-Aided Cooperative Mobile Edge Computing: Computation Efficiency Maximization via Joint Uplink and Downlink Resource Allocation

Authors: Zhenrong Liu, Zongze Li, Yi Gong, Yik-Chung Wu

Abstract: In mobile edge computing (MEC) systems, the wireless channel condition is a critical factor affecting both the communication power consumption and computation rate of the offloading tasks. This paper exploits the idea of cooperative transmission and employs a reconfigurable intelligent surface (RIS) in MEC to improve the channel condition and maximize computation efficiency (CE). The resulting problem couples various wireless resources in both uplink and downlink, which calls for the joint design of the user association, receive/downlink beamforming vectors, transmit power of users, task partition strategies for local computing and offloading, and uplink/downlink phase shifts at the RIS. To tackle the challenges brought by the combinatorial optimization problem, the group sparsity structure of the beamforming vectors determined by user association is exploited. Furthermore, while the CE does not explicitly depend on the downlink phase shifts, instead of simply finding a feasible solution, we exploit the hidden relationship between them and convert this relationship into an explicit form for optimization. Then the resulting problem is solved via the alternating maximization framework, and the nonconvexity of each subproblem is handled individually. Simulation results show that cooperative transmission and RIS deployment can significantly improve the CE and demonstrate the importance of optimizing the downlink phase shifts with an explicit form.

Submitted 21 March, 2024; originally announced March 2024.

Comments: This paper has been accepted for publication in IEEE Transactions on Wireless Communications

arXiv:2403.14244 [pdf, other]

Isotropic Gaussian Splatting for Real-Time Radiance Field Rendering

Authors: Yuanhao Gong, Lantao Yu, Guanghui Yue

Abstract: The 3D Gaussian splatting method has drawn a lot of attention, thanks to its high performance in training and high quality of the rendered image. However, it uses anisotropic Gaussian kernels to represent the scene. Although such anisotropic kernels have advantages in representing the geometry, they lead to difficulties in terms of computation, such as splitting or merging two kernels. In this paper, we propose to use isotropic Gaussian kernels to avoid such difficulties in the computation, leading to a higher performance method. The experiments confirm that the proposed method is about 100X faster without losing the geometry representation accuracy. The proposed method can be applied in a large range of applications where the radiance field is needed, such as 3D reconstruction, view synthesis, and dynamic object modeling.

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2402.15939 [pdf]

Deep Separable Spatiotemporal Learning for Fast Dynamic Cardiac MRI

Authors: Zi Wang, Min Xiao, Yirong Zhou, Chengyan Wang, Naiming Wu, Yi Li, Yiwen Gong, Shufu Chang, Yinyin Chen, Liuhong Zhu, Jianjun Zhou, Congbo Cai, He Wang, Di Guo, Guang Yang, Xiaobo Qu

Abstract: Dynamic magnetic resonance imaging (MRI) plays an indispensable role in cardiac diagnosis. To enable fast imaging, the k-space data can be undersampled but the image reconstruction poses a great challenge of high-dimensional processing. This challenge necessitates extensive training data in many deep learning reconstruction methods. This work proposes a novel and efficient approach, leveraging a dimension-reduced separable learning scheme that excels even with highly limited training data. We further integrate it with spatiotemporal priors to develop a Deep Separable Spatiotemporal Learning network (DeepSSL), which unrolls an iteration process of a reconstruction model with both temporal low-rankness and spatial sparsity. Intermediate outputs are visualized to provide insights into the network's behavior and enhance its interpretability. Extensive results on cardiac cine datasets show that the proposed DeepSSL is superior to the state-of-the-art methods visually and quantitatively, while reducing the demand for training cases by up to 75%. Its preliminary adaptability to cardiac patients has been verified through a blind reader study by experienced radiologists and cardiologists. Additionally, DeepSSL improves accuracy on the downstream task of cardiac segmentation and shows robustness in prospective real-time cardiac MRI.

Submitted 24 February, 2024; originally announced February 2024.

Comments: 10 pages, 11 figures, 3 tables

arXiv:2402.06630 [pdf]

The Dilemma of Standardizing Indoor Photovoltaic Characterisation: Embracing Diversity for Powering the IoT

Authors: Zacharie Jehl Li-Kao, Kunal J. Tiwari, Sergio Giraldo, Marcel Placidi, Yuancai Gong, Arindam Basak, Taizo Kobayashi, Jon Major, Edgardo Saucedo

Abstract: In this viewpoint contribution, we argue that the emerging landscape of indoor photovoltaics poses unique challenges that transcend the capabilities of a singular standard, unlike what the community has become accustomed to with the success of the AM1.x standard for outdoor application. We aim at illustrating the pitfalls associated with a one-size-fits-all approach to standardisation, emphasising the necessity for a concerted and nuanced methodology tailored to the complexities of indoor energy utilisation, and particularly in the context of the various needs of the Internet of Things. Acknowledging the inherent variability in indoor illumination conditions, and using simple numerical modelling and real-life examples to illustrate how it influences the output of indoor cells, we advocate for a shift from conventional standards to comprehensive guidelines that will better accommodate and evaluate the diverse interplays between photovoltaic device, Internet of Things sensors, and illumination sources. Our proposed methodology is not merely a set of rules but a strategic framework for the community to build upon, inviting researchers and industry stakeholders to collaborate and establish a unified foundation for assessing the performance of photovoltaic devices indoors. By fostering a collective approach and steering clear of rigid standards, this viewpoint lays the groundwork for future studies to better assess the performance and usability of indoor photovoltaics, thus ensuring innovation, adaptability, and reliable analyses in this very fast evolving and increasingly relevant field.

Submitted 14 January, 2024; originally announced February 2024.

Comments: References are currently missing

arXiv:2401.15354 [pdf, other]

DeepGI: An Automated Approach for Gastrointestinal Tract Segmentation in MRI Scans

Authors: Ye Zhang, Yulu Gong, Dongji Cui, Xinrui Li, Xinyu Shen

Abstract: Gastrointestinal (GI) tract cancers pose a global health challenge, demanding precise radiotherapy planning for optimal treatment outcomes. This paper introduces a cutting-edge approach to automate the segmentation of GI tract regions in magnetic resonance imaging (MRI) scans. Leveraging advanced deep learning architectures, the proposed model integrates Inception-V4 for initial classification, UNet++ with a VGG19 encoder for 2.5D data, and Edge UNet for grayscale data segmentation. Meticulous data preprocessing, including innovative 2.5D processing, is employed to enhance adaptability, robustness, and accuracy. This work addresses the manual and time-consuming segmentation process in current radiotherapy planning, presenting a unified model that captures intricate anatomical details. The integration of diverse architectures, each specializing in unique aspects of the segmentation task, signifies a novel and comprehensive solution. This model emerges as an efficient and accurate tool for clinicians, marking a significant advancement in the field of GI tract image segmentation for radiotherapy planning.

Submitted 27 January, 2024; originally announced January 2024.

arXiv:2401.10345 [pdf, other]

Attack and Defense Analysis of Learned Image Compression

Authors: Tianyu Zhu, Heming Sun, Xiankui Xiong, Xuanpeng Zhu, Yong Gong, Minge Jing, Yibo Fan

Abstract: Learned image compression (LIC) has become increasingly popular in recent years thanks to its high efficiency and outstanding compression quality. Still, its robustness against inputs modified with specific noise cannot be ignored. White-box attacks such as FGSM and PGD use only gradients to compute adversarial images that mislead LIC models into outputting unexpected results. Our experiments compare the effects of different dimensions such as attack methods, models, qualities, and targets, concluding that in the worst case, there is a 61.55% decrease in PSNR or a 19.15 times increase in bpp under the PGD attack. To improve their robustness, we conduct adversarial training by adding adversarial images into the training datasets, which obtains a 95.52% decrease in the R-D cost of the most vulnerable LIC model. We further test the robustness of H.266, whose better performance on reconstruction quality extends its possibility to defend one-step or iterative adversarial attacks.

Submitted 27 March, 2024; v1 submitted 18 January, 2024; originally announced January 2024.
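
The one-step gradient attack (FGSM) named in the abstract above follows the well-known update x_adv = clip(x + epsilon * sign(grad_x loss)). The sketch below shows only that update on a toy example. The quadratic `toy_loss`, the pixel values, and `epsilon` are assumptions for illustration; against an LIC model the loss would instead be a distortion or bitrate term obtained by backpropagating through the codec, which is not modeled here.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(x, grad, epsilon, lo=0.0, hi=1.0):
    """One FGSM step: move each pixel by epsilon along the gradient sign, then clip."""
    return [min(hi, max(lo, xi + epsilon * sign(gi))) for xi, gi in zip(x, grad)]


def toy_loss(x, target):
    """Stand-in loss (assumption): half the squared error to a reference signal."""
    return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def toy_grad(x, target):
    """Analytic gradient of toy_loss with respect to x."""
    return [xi - ti for xi, ti in zip(x, target)]


x = [0.2, 0.5, 0.9]       # hypothetical input pixels in [0, 1]
target = [0.1, 0.6, 0.9]  # hypothetical reference used by the toy loss
epsilon = 0.05

x_adv = fgsm(x, toy_grad(x, target), epsilon)
print(toy_loss(x, target), toy_loss(x_adv, target))
```

The perturbation stays within `epsilon` per pixel yet increases the loss. PGD, also mentioned in the abstract, repeats such steps with a projection back into the allowed perturbation range.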

arXiv:2401.08887 [pdf, ps, other]

NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription

Authors: Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Pe`er, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka

Abstract: We introduce the first Natural Office Talkers in Settings of Far-field Audio Recordings ("NOTSOFAR-1") Challenge alongside datasets and baseline system. The challenge focuses on distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios, with single-channel and known-geometry multi-channel tracks, and serves as a launch platform for two new datasets: First, a benchmarking dataset of 315 meetings, averaging 6 minutes each, capturing a broad spectrum of real-world acoustic conditions and conversational dynamics. It is recorded across 30 conference rooms, featuring 4-8 attendees and a total of 35 unique speakers. Second, a 1000-hour simulated training dataset, synthesized with enhanced authenticity for real-world generalization, incorporating 15,000 real acoustic transfer functions. The tasks focus on single-device DASR, where multi-channel devices always share the same known geometry. This is aligned with common setups in actual conference rooms, and avoids technical complexities associated with multi-device tasks. It also allows for the development of geometry-specific solutions. The NOTSOFAR-1 Challenge aims to advance research in the field of distant conversational speech recognition, providing key resources to unlock the potential of data-driven methods, which we believe are currently constrained by the absence of comprehensive high-quality training and benchmarking datasets.

Submitted 16 January, 2024; originally announced January 2024.

Comments: preprint

arXiv:2312.14448 [pdf, other]

Quantum-Assisted Joint Caching and Power Allocation for Integrated Satellite-Terrestrial Networks

Authors: Yu Zhang, Yanmin Gong, Lei Fan, Yu Wang, Zhu Han, Yuanxiong Guo

Abstract: Low earth orbit (LEO) satellite networks can complement terrestrial networks for achieving global wireless coverage and improving delay-sensitive Internet services. This paper proposes an integrated satellite-terrestrial network (ISTN) architecture to provide ground users with seamless and reliable content delivery services. For optimal service provisioning in this architecture, we formulate an optimization model to maximize the network throughput by jointly optimizing content delivery policy, cache placement, and transmission power allocation. The resulting optimization model is a large-scale mixed-integer nonlinear program (MINLP) that is intractable for classical computer solvers. Inspired by quantum computing techniques, we propose a hybrid quantum-classical generalized Benders' decomposition (HQCGBD) algorithm to address this challenge. Specifically, we first exploit the generalized Benders' decomposition (GBD) to decompose the problem into a master problem and a subproblem and then leverage a state-of-the-art quantum annealer to solve the challenging master problem.

Submitted 22 December, 2023; originally announced December 2023.

arXiv:2312.13683 [pdf, other]

Joint Channel Estimation and Cooperative Localization for Near-Field Ultra-Massive MIMO

Authors: Ruoxiao Cao, Hengtao He, Xianghao Yu, Shenghui Song, Kaibin Huang, Jun Zhang, Yi Gong, Khaled B. Letaief

Abstract: The next-generation (6G) wireless networks are expected to provide not only seamless and high data-rate communications, but also ubiquitous sensing services. By providing vast spatial degrees of freedom (DoFs), ultra-massive multiple-input multiple-output (UM-MIMO) technology is a key enabler for both sensing and communications in 6G. However, the adoption of UM-MIMO leads to a shift from the far field to the near field in terms of the electromagnetic propagation, which poses novel challenges in system design. Specifically, near-field effects introduce highly non-linear spherical wave models that render existing designs based on plane wave assumptions ineffective. In this paper, we focus on two crucial tasks in sensing and communications, respectively, i.e., localization and channel estimation, and investigate their joint design by exploring the near-field propagation characteristics, achieving mutual benefits between two tasks. In addition, multiple base stations (BSs) are leveraged to collaboratively facilitate a cooperative localization framework. To address the joint channel estimation and cooperative localization problem for near-field UM-MIMO systems, we propose a variational Newtonized near-field channel estimation (VNNCE) algorithm and a Gaussian fusion cooperative localization (GFCL) algorithm. The VNNCE algorithm exploits the spatial DoFs provided by the near-field channel to obtain position-related soft information, while the GFCL algorithm fuses this soft information to achieve more accurate localization. Additionally, we introduce a joint architecture that seamlessly integrates channel estimation and cooperative localization.

Submitted 21 December, 2023; originally announced December 2023.

Comments: Submit to JSAC

arXiv:2311.09850 [pdf, other]

Semantic-Relay-Aided Text Transmission: Placement Optimization and Bandwidth Allocation

Authors: Tianyu Liu, Changsheng You, Zeyang Hu, Chenyu Wu, Yi Gong, Kaibin Huang

Abstract: Semantic communication has emerged as a promising technology to break the Shannon limit by extracting the meaning of source data and sending relevant semantic information only. However, some mobile devices may have limited computation and storage resources, which renders it difficult to deploy and implement the resource-demanding deep learning based semantic encoder/decoder. To tackle this challenge, we propose in this paper a new semantic relay (SemRelay), which is equipped with a semantic receiver for assisting text transmission from a resource-abundant base station (BS) to a resource-constrained mobile device. Specifically, the SemRelay first decodes the semantic information sent by the BS (with a semantic transmitter) and then forwards it to the user by adopting conventional bit transmission, hence effectively improving the text transmission efficiency. We formulate an optimization problem to maximize the achievable (effective) bit rate by jointly designing the SemRelay placement and bandwidth allocation. Although this problem is non-convex and generally difficult to solve, we propose an efficient penalty-based algorithm to obtain a high-quality suboptimal solution. Numerical results show the close-to-optimal performance of the proposed algorithm as well as significant rate performance gain of the proposed SemRelay over conventional decode-and-forward relay.

Submitted 16 November, 2023; originally announced November 2023.

Comments: 6 pages, 4 figures, accepted for IEEE Global Communication Conference (GLOBECOM) 2023 Workshop

arXiv:2310.01342 [pdf, other]

Near-field Integrated Sensing and Communication: Opportunities and Challenges

Authors: Jiayi Cong, Changsheng You, Jiapeng Li, Li Chen, Beixiong Zheng, Yuanwei Liu, Wen Wu, Yi Gong, Shi Jin, Rui Zhang

Abstract: With the extremely large-scale array (XL-array) deployed in future wireless systems, wireless communication and sensing are expected to operate in the radiative near-field region, which needs to be characterized by the spherical rather than planar wavefronts. Unlike most existing works that considered far-field integrated sensing and communication (ISAC), we study in this article the new near-field ISAC, which integrates both functions of sensing and communication in the near-field region. To this end, we first discuss the appealing advantages of near-field communication and sensing over their far-field counterparts, respectively. Then, we introduce three approaches for near-field ISAC, including joint near-field communication and sensing, sensing-assisted near-field communication, and communication-assisted near-field sensing. We discuss their individual research opportunities and new design issues, and propose promising solutions. Finally, several important directions in near-field ISAC are also highlighted to motivate future work.

Submitted 17 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: This work is submitted to IEEE for possible publication

arXiv:2309.14405 [pdf, other]

Joint Audio and Speech Understanding

Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass

Abstract: Humans are surrounded by audio signals that include both speech and non-speech sounds. The recognition and understanding of speech and non-speech audio events, along with a profound comprehension of the relationship between them, constitute fundamental cognitive capabilities. For the first time, we build a machine learning model, called LTU-AS, that has a conceptually similar universal audio perception and advanced reasoning ability. Specifically, by integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events - almost everything perceivable from audio signals.

Submitted 10 December, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: Accepted at ASRU 2023. Code, dataset, and pretrained models are at https://github.com/yuangongnd/ltu. Interactive demo at https://huggingface.co/spaces/yuangongfdu/ltu-2

arXiv:2309.11161 [pdf, other]

Beamforming Design for RIS-Aided THz Wideband Communication Systems

Authors: Yihang Jiang, Ziqin Zhou, Xiaoyang Li, Yi Gong

Abstract: Benefiting from tens of GHz of bandwidth, terahertz (THz) communications has become a promising technology for future 6G networks. However, the conventional hybrid beamforming architecture based on frequency-independent phase-shifters is not able to cope with the beam split effect (BSE) in THz massive multiple-input multiple-output (MIMO) systems. Despite some work introducing the frequency-dependent phase shifts via the time delay network to mitigate the beam splitting in THz wideband communications, the corresponding issue in reconfigurable intelligent surface (RIS)-aided communications has not been well investigated. In this paper, the BSE in THz massive MIMO is quantified by analyzing the array gain loss. A new beamforming architecture has been proposed to mitigate this effect under RIS-aided communications scenarios. Simulations are performed to evaluate the effectiveness of the proposed system architecture in combating the array gain loss.

Submitted 21 September, 2023; v1 submitted 20 September, 2023; originally announced September 2023.

arXiv:2309.07369 [pdf, other]

Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation

Authors: Shaoshi Ling, Guoli Ye, Rui Zhao, Yifan Gong

Abstract: The attention-based encoder-decoder (AED) speech recognition model has been widely successful in recent years. However, the joint optimization of the acoustic model and language model in an end-to-end manner has created challenges for text adaptation. In particular, effectively, quickly and inexpensively adapting text has become a primary concern for deploying AED systems in industry. To address this issue, we propose a novel model, the hybrid attention-based encoder-decoder (HAED) speech recognition model that preserves the modularity of conventional hybrid automatic speech recognition systems. Our HAED model separates the acoustic and language models, allowing for the use of conventional text-based language model adaptation techniques. We demonstrate that the proposed HAED model yields a 21% relative Word Error Rate (WER) improvement when out-of-domain text data is used for language model adaptation, with only a minor degradation in WER on a general test set compared with the conventional AED model.

Submitted 13 September, 2023; originally announced September 2023.

arXiv:2309.01875 [pdf, other]

Gradient Domain Diffusion Models for Image Synthesis

Authors: Yuanhao Gong

Abstract: Diffusion models are getting popular in generative image and video synthesis. However, due to the diffusion process, they require a large number of steps to converge. To tackle this issue, in this paper, we propose to perform the diffusion process in the gradient domain, where the convergence becomes faster. There are two reasons. First, thanks to the Poisson equation, the gradient domain is mathematically equivalent to the original image domain. Therefore, each diffusion step in the image domain has a unique corresponding gradient domain representation. Second, the gradient domain is much sparser than the image domain. As a result, gradient domain diffusion models converge faster. Several numerical experiments confirm that the gradient domain diffusion models are more efficient than the original diffusion models. The proposed method can be applied in a wide range of applications such as image processing, computer vision and machine learning tasks.

Submitted 4 September, 2023; originally announced September 2023.

arXiv:2308.10009 [pdf, other]

Realizing In-Memory Baseband Processing for Ultra-Fast and Energy-Efficient 6G

Authors: Qunsong Zeng, Jiawei Liu, Mingrui Jiang, Jun Lan, Yi Gong, Zhongrui Wang, Yida Li, Can Li, Jim Ignowski, Kaibin Huang

Abstract: To support emerging applications ranging from holographic communications to extended reality, next-generation mobile wireless communication systems require ultra-fast and energy-efficient baseband processors. Traditional complementary metal-oxide-semiconductor (CMOS)-based baseband processors face two challenges in transistor scaling and the von Neumann bottleneck. To address these challenges, in-memory computing-based baseband processors using resistive random-access memory (RRAM) present an attractive solution. In this paper, we propose and demonstrate RRAM-implemented in-memory baseband processing for the widely adopted multiple-input-multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) air interface. Its key feature is to execute the key operations, including discrete Fourier transform (DFT) and MIMO detection using linear minimum mean square error (L-MMSE) and zero forcing (ZF), in one step. In addition, an RRAM-based channel estimation module is proposed and discussed. By prototyping and simulations, we demonstrate the feasibility of an RRAM-based full-fledged communication system in hardware, and reveal it can outperform state-of-the-art baseband processors with a gain of 91.2$\times$ in latency and 671$\times$ in energy efficiency by large-scale simulations. Our results pave a potential pathway for RRAM-based in-memory computing to be implemented in the era of the sixth generation (6G) mobile communications.

Submitted 19 August, 2023; originally announced August 2023.

Comments: arXiv admin note: text overlap with arXiv:2205.03561

arXiv:2307.16332 [pdf]

Pre-training End-to-end ASR Models with Augmented Speech Samples Queried by Text

Authors: Eric Sun, Jinyu Li, Jian Xue, Yifan Gong

Abstract: In end-to-end automatic speech recognition system, one of the difficulties for language expansion is the limited paired speech and text training data. In this paper, we propose a novel method to generate augmented samples with unpaired speech feature segments and text data for model pre-training, which has the advantage of low cost without using additional speech data. When mixing 20,000 hours augmented speech data generated by our method with 12,500 hours original transcribed speech data for Italian Transformer transducer model pre-training, we achieve 8.7% relative word error rate reduction. The pre-trained model achieves similar performance as the model pre-trained with multilingual transcribed 75,000 hours raw speech data. When merging the augmented speech data with the multilingual data to pre-train a new model, we achieve even more relative word error rate reduction of 12.2% over the baseline, which further verifies the effectiveness of our method for speech data augmentation.

Submitted 30 July, 2023; originally announced July 2023.

arXiv:2307.08234 [pdf, other]

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

Authors: Shaoshi Ling, Yuxuan Hu, Shuangbei Qian, Guoli Ye, Yao Qian, Yifan Gong, Ed Lin, Michael Zeng

Abstract: Most end-to-end (E2E) speech recognition models are composed of encoder and decoder blocks that perform acoustic and language modeling functions. Pretrained large language models (LLMs) have the potential to improve the performance of E2E ASR. However, integrating a pretrained language model into an E2E speech recognition model has shown limited benefits due to the mismatches between text-based LLMs and those used in E2E ASR. In this paper, we explore an alternative approach by adapting pretrained LLMs to speech. Our experiments on fully-formatted E2E ASR transcription tasks across various domains demonstrate that our approach can effectively leverage the strengths of pretrained LLMs to produce more readable ASR transcriptions. Our model, which is based on pretrained large language models with either an encoder-decoder or decoder-only structure, surpasses strong ASR models such as Whisper in terms of recognition error rate, considering formats like punctuation and capitalization as well.

Submitted 2 August, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

arXiv:2307.03183 [pdf, other]

doi 10.21437/Interspeech.2023-2193

Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

Authors: Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass

Abstract: In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. With this finding, we build a unified audio tagging and speech recognition model Whisper-AT by freezing the backbone of Whisper, and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.

Submitted 6 July, 2023; originally announced July 2023.

Comments: Accepted at Interspeech 2023. Code at https://github.com/yuangongnd/whisper-at

Journal ref: Proceedings of Interspeech 2023

arXiv:2307.00307 [pdf, other]

Spatio-Temporal Classification of Lung Ventilation Patterns using 3D EIT Images: A General Approach for Individualized Lung Function Evaluation

Authors: Shuzhe Chen, Li Li, Zhichao Lin, Ke Zhang, Ying Gong, Lu Wang, Xu Wu, Maokun Li, Yuanlin Song, Fan Yang, Shenheng Xu

Abstract: The Pulmonary Function Test (PFT) is a widely utilized and rigorous classification test for lung function evaluation, serving as a comprehensive tool for lung diagnosis. Meanwhile, Electrical Impedance Tomography (EIT) is a rapidly advancing clinical technique that visualizes conductivity distribution induced by ventilation. EIT provides additional spatial and temporal information on lung ventilation beyond traditional PFT. However, relying solely on conventional isolated interpretations of PFT results and EIT images overlooks the continuous dynamic aspects of lung ventilation. This study aims to classify lung ventilation patterns by extracting spatial and temporal features from the 3D EIT image series. The study uses a Variational Autoencoder network with a MultiRes block to compress the spatial distribution in a 3D image into a one-dimensional vector. These vectors are then concatenated to create a feature map for the exhibition of temporal features. A simple convolutional neural network is used for classification. Data collected from 137 subjects were finally used for training. The model is validated by ten-fold and leave-one-out cross-validation first. The accuracy and sensitivity of normal ventilation mode are 0.95 and 1.00, and the f1-score is 0.94. Furthermore, we check the reliability and feasibility of the proposed pipeline by testing it on nine newly recruited subjects. Our results show that the pipeline correctly predicts the ventilation mode of 8 out of 9 subjects. The study demonstrates the potential of using image series for lung ventilation mode classification, providing a feasible method for patient prescreening and presenting an alternative form of PFT.

Submitted 1 July, 2023; originally announced July 2023.

arXiv:2305.10790 [pdf, other]

Listen, Think, and Understand

Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James Glass

Abstract: The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-defined, discrete sound label sets. In contrast, humans possess the ability to not only classify sounds into general categories, but also to listen to the finer details of the sounds, explain the reason for the predictions, think about what the sound infers, and understand the scene and what action needs to be taken, if any. Such capabilities beyond perception are not yet present in existing audio models. On the other hand, modern large language models (LLMs) exhibit emerging reasoning ability but they lack audio perception capabilities. Therefore, we ask the question: can we build a model that has both audio perception and a reasoning ability? In this paper, we propose a new audio foundation model, called LTU (Listen, Think, and Understand). To train LTU, we created a new OpenAQA-5M dataset consisting of 1.9 million closed-ended and 3.7 million open-ended, diverse (audio, question, answer) tuples, and have used an autoregressive training framework with a perception-to-understanding curriculum. LTU demonstrates strong performance and generalization ability on conventional audio tasks such as classification and captioning. More importantly, it exhibits emerging audio reasoning and comprehension abilities that are absent in existing audio models. To the best of our knowledge, LTU is one of the first multimodal large language models that focus on general audio (rather than just speech) understanding.

Submitted 19 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted at ICLR 2024. Code, dataset, and models are available at https://github.com/YuanGongND/ltu. The interactive demo is at https://huggingface.co/spaces/yuangongfdu/ltu

arXiv:2305.00428 [pdf, ps, other]

STAR-RIS-Aided Mobile Edge Computing: Computation Rate Maximization with Binary Amplitude Coefficients

Authors: Zhenrong Liu, Zongze Li, Miaowen Wen, Yi Gong, Yik-Chung Wu

Abstract: In this paper, simultaneously transmitting and reflecting (STAR) reconfigurable intelligent surface (RIS) is investigated in the multi-user mobile edge computing (MEC) system to improve the computation rate. Compared with traditional RIS-aided MEC, STAR-RIS extends the service coverage from half-space to full-space and provides new flexibility for improving the computation rate for end users. However, the STAR-RIS-aided MEC system design is a challenging problem due to the non-smooth and non-convex binary amplitude coefficients with coupled phase shifters. To fill this gap, this paper formulates a computation rate maximization problem via the joint design of the STAR-RIS phase shifts, reflection and transmission amplitude coefficients, the receive beamforming vectors, and energy partition strategies for local computing and offloading. To tackle the discontinuity caused by binary variables, we propose an efficient smoothing-based method to decrease convergence error, in contrast to the conventional penalty-based method, which brings many undesired stationary points and local optima. Furthermore, a fast iterative algorithm is proposed to obtain a stationary point for the joint optimization problem, with each subproblem solved by a low-complexity algorithm, making the proposed design scalable to a massive number of users and STAR-RIS elements. Simulation results validate the strength of the proposed smoothing-based method and show that the proposed fast iterative algorithm achieves a higher computation rate than the conventional method while saving the computation time by at least an order of magnitude. Moreover, the resultant STAR-RIS-aided MEC system significantly improves the computation rate compared to other baseline schemes with conventional reflect-only/transmit-only RIS.

Submitted 30 April, 2023; originally announced May 2023.

arXiv:2304.07512 [pdf, other]

Soft Label Coding for End-to-end Sound Source Localization With Ad-hoc Microphone Arrays

Authors: Linfeng Feng, Yijun Gong, Xiao-Lei Zhang

Abstract: Recently, an end-to-end two-dimensional sound source localization algorithm with ad-hoc microphone arrays formulates the sound source localization problem as a classification problem. The algorithm divides the target indoor space into a set of local areas, and predicts the local area where the speaker is located. However, the local areas are encoded by one-hot code, which may lose the connections between the local areas due to quantization errors. In this paper, we propose a new soft label coding method, named label smoothing, for the classification-based two-dimensional sound source localization with ad-hoc microphone arrays. The core idea is to take the geometric connection between the classes into the label coding process. The first one is named static soft label coding (SSLC), which modifies the one-hot codes into soft codes based on the distances between the local areas. Because SSLC is handcrafted and may not be optimal, the second one, named dynamic soft label coding (DSLC), further rectifies SSLC by learning the soft codes according to the statistics of the predictions produced by the classification-based localization model in the training stage. Experimental results show that the proposed methods can effectively improve the localization accuracy.
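The idea of static soft label coding described in this abstract, turning a one-hot class label into a soft code that depends on the distances between local areas, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the Gaussian distance kernel, the `sigma` parameter, the square grid layout, and the function names below are all assumptions made for illustration, not the authors' method.

```python
import math

def grid_centers(rows, cols, cell=1.0):
    """Centers of a rows x cols grid of local areas (hypothetical room layout)."""
    return [((c + 0.5) * cell, (r + 0.5) * cell)
            for r in range(rows) for c in range(cols)]

def static_soft_labels(centers, true_index, sigma=1.0):
    """Distance-aware soft label: areas closer to the true area get more mass.

    Assumption: a Gaussian kernel over Euclidean distance, normalized to sum to 1.
    """
    tx, ty = centers[true_index]
    weights = []
    for x, y in centers:
        dist_sq = (x - tx) ** 2 + (y - ty) ** 2
        weights.append(math.exp(-dist_sq / (2.0 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]

# 3 x 3 room grid; the speaker is in the middle area (index 4).
centers = grid_centers(3, 3)
soft = static_soft_labels(centers, true_index=4, sigma=0.8)
```

With this kind of code the true area keeps the largest probability, edge-adjacent areas receive more mass than diagonal ones, and a classifier trained against it is penalized less for near misses than for distant ones.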

Submitted 15 April, 2023; originally announced April 2023.

Comments: 4 pages, 2 figures, conference

arXiv:2303.10691 [pdf, other]

Multi-Channel Attentive Feature Fusion for Radio Frequency Fingerprinting

Authors: Yuan Zeng, Yi Gong, Jiawei Liu, Shangao Lin, Zidong Han, Ruoxiao Cao, Kaibin Huang, Khaled Ben Letaief

Abstract: Radio frequency fingerprinting (RFF) is a promising device authentication technique for securing the Internet of things. It exploits the intrinsic and unique hardware impairments of the transmitters for RF device identification. In real-world communication systems, hardware impairments across transmitters are subtle, which are difficult to model explicitly. Recently, due to the superior performance of deep learning (DL)-based classification models on real-world datasets, DL networks have been explored for RFF. Most existing DL-based RFF models use a single representation of radio signals as the input. Multi-channel input model can leverage information from different representations of radio signals and improve the identification accuracy of the RF fingerprint. In this work, we propose a novel multi-channel attentive feature fusion (McAFF) method for RFF. It utilizes multi-channel neural features extracted from multiple representations of radio signals, including IQ samples, carrier frequency offset, fast Fourier transform coefficients and short-time Fourier transform coefficients, for better RF fingerprint identification. The features extracted from different channels are fused adaptively using a shared attention module, where the weights of neural features from multiple channels are learned during training the McAFF model. In addition, we design a signal identification module using a convolution-based ResNeXt block to map the fused features to device identities. To evaluate the identification performance of the proposed method, we construct a WiFi dataset, named WFDI, using commercial WiFi end-devices as the transmitters and a Universal Software Radio Peripheral (USRP) as the receiver. ...

Submitted 23 June, 2023; v1 submitted 19 March, 2023; originally announced March 2023.

arXiv:2303.00786 [pdf]

Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

Authors: Eric Sun, **yu Li, Yuxuan Hu, Yimeng Zhu, Long Zhou, Jian Xue, Peidong Wang, Linquan Liu, Shujie Liu, Edward Lin, Yifan Gong

Abstract: We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring language identification (LID) input from users during inference. Our method incorporates a gating mechanism and LID loss, enabling transformer experts to learn language-specific information. By combining gated transformer experts with shared transformer layers, we construct multilingual transformer blocks and utilize linear experts to effectively regularize the joint network. The curriculum training scheme leverages LID to guide the gated experts in improving their respective language performance. Experimental results on a bilingual task involving English and Spanish demonstrate significant improvements, with average relative word error reductions of 12.5% and 7.3% compared to the baseline bilingual and monolingual models, respectively. Notably, our method achieves performance comparable to the upper-bound model trained and inferred with oracle LID. Extending our approach to trilingual, quadrilingual, and pentalingual models reveals similar advantages to those observed in the bilingual models, highlighting its ease of extension to multiple languages.

Submitted 7 July, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

arXiv:2302.08525 [pdf, ps, other]

Computation and Privacy Protection for Satellite-Ground Digital Twin Networks

Authors: Yongkang Gong, Haipeng Yao, Xiaonan Liu, Mehdi Bennis, Arumugam Nallanathan, Zhu Han

Abstract: Satellite-ground integrated digital twin networks (SGIDTNs) are regarded as innovative network architectures for reducing network congestion, enabling nearly-instant data mapping from the physical world to digital systems, and offering ubiquitous intelligence services to terrestrial users. However, the challenges, such as the pricing policy, the stochastic task arrivals, the time-varying satellite locations, mutual channel interference, and resource scheduling mechanisms between the users and cloud servers, are critical for improving quality of service in SGIDTNs. Hence, we establish a blockchain-aided Stackelberg game model for maximizing the pricing profits and network throughput in terms of minimizing overhead of privacy protection, thus performing computation offloading, decreasing channel interference, and improving privacy protection. Next, we propose a Lyapunov stability theory-based model-agnostic meta-learning aided multi-agent deep federated reinforcement learning (MAML-MADFRL) framework for optimizing the CPU cycle frequency, channel selection, task-offloading decision, block size, and cloud server price, which facilitates the integration of communication, computation, and block resources. Subsequently, the extensive performance analyses show that the proposed MAML-MADFRL algorithm can strengthen the privacy protection via the transaction verification mechanism, approach the optimal time average penalty, and fulfill the long-term average queue size via lower computational complexity.
Finally, our simulation results indicate that the proposed MAML-MADFRL learning framework is superior to the existing baseline methods in terms of network throughput, channel interference, cloud server profits, and privacy overhead.

Submitted 16 February, 2023; originally announced February 2023.

arXiv:2212.12214 [pdf, ps, other]

Trainable Proximal Gradient Descent Based Channel Estimation for mmWave Massive MIMO Systems

Authors: Peicong Zheng, Xuantao Lyu, Yi Gong

Abstract: In this letter, we address the problem of millimeter-wave (mmWave) channel estimation in massive MIMO communication systems. Leveraging the sparsity of the mmWave channel in the beamspace, we formulate the estimation problem as a sparse signal recovery problem. To this end, we propose a deep learning based trainable proximal gradient descent network (TPGD-Net). The TPGD-Net unfolds the iterative proximal gradient descent (PGD) algorithm into a layer-wise network, with the gradient descent step size set as a trainable parameter. Additionally, we replace the proximal operator in the PGD algorithm with a neural network that exploits data-driven prior channel information to perform the proximal operation implicitly. To further enhance the transfer of feature information across layers, we introduce the cross-layer feature attention fusion module into the TPGD-Net. Our simulation results on the Saleh-Valenzuela channel model and the DeepMIMO dataset demonstrate the superior performance of TPGD-Net compared to state-of-the-art mmWave channel estimators.

Submitted 6 March, 2023; v1 submitted 23 December, 2022; originally announced December 2022.

arXiv:2212.00697 [pdf, ps, other]

Simultaneously Transmitting and Reflecting RIS-Aided Mobile Edge Computing: Computation Rate Maximization

Authors: Zhenrong Liu, Zongze Li, Miaowen Wen, Yi Gong, Yik-Chung Wu

Abstract: In this paper, the novel simultaneously transmitting and reflecting (STAR) reconfigurable intelligent surface (RIS), which enables full-space coverage on users located on both sides of the surface, is investigated in the multi-user mobile edge computing (MEC) system. A computation rate maximization problem is formulated via the joint design of the STAR-RIS phase shifts, reflection and transmission amplitude coefficients, the receive beamforming vectors at the access point, and the users' energy partition strategies for local computing and offloading. Two operating protocols of STAR-RIS, namely energy splitting (ES) and mode switching (MS), are studied. Based on DC programming and semidefinite relaxation, an iterative algorithm is proposed for the ES protocol to solve the formulated non-convex problem. Furthermore, the proposed algorithm is extended to solve the non-convex, non-continuous MS problems with binary amplitude coefficients. Simulation results show that the resultant STAR-RIS-aided MEC system significantly improves the computation rate compared to the baseline scheme with conventional reflect-only/transmit-only RIS.

Submitted 1 December, 2022; originally announced December 2022.

arXiv:2210.10265 [pdf, other]

Deep Learning Based Stage-wise Two-dimensional Speaker Localization with Large Ad-hoc Microphone Arrays

Authors: Shupei Liu, Linfeng Feng, Yijun Gong, Chengdong Liang, Chen Zhang, Xiao-Lei Zhang, Xuelong Li

Abstract: While deep-learning-based speaker localization has shown advantages in challenging acoustic environments, it often yields only direction-of-arrival (DOA) cues rather than precise two-dimensional (2D) coordinates. To address this, we propose a novel deep-learning-based 2D speaker localization method leveraging ad-hoc microphone arrays, where an ad-hoc microphone array is composed of randomly distributed microphone nodes, each of which is equipped with a traditional array. Specifically, we first employ convolutional neural networks at each node to estimate speaker directions. Then, we integrate these DOA estimates using triangulation and clustering techniques to get 2D speaker locations. To further boost the estimation accuracy, we introduce a node selection algorithm that strategically filters the most reliable nodes. Extensive experiments on both simulated and real-world data demonstrate that our approach significantly outperforms conventional methods. The proposed node selection further refines performance. The real-world dataset used in the experiments, named Libri-adhoc-node10, is newly recorded and described for the first time in this paper; it is available online at https://github.com/Liu-sp/Libri-adhoc-nodes10.

Submitted 1 April, 2024; v1 submitted 18 October, 2022; originally announced October 2022.

arXiv:2210.08484 [pdf, other]

End-to-end Two-dimensional Sound Source Localization With Ad-hoc Microphone Arrays

Authors: Yijun Gong, Shupei Liu, Xiao-Lei Zhang

Abstract: Conventional sound source localization methods are mostly based on a single microphone array that consists of multiple microphones. They are usually formulated as the estimation of the direction of arrival problem. In this paper, we propose a deep-learning-based end-to-end sound source localization method with ad-hoc microphone arrays, where an ad-hoc microphone array is a set of randomly distributed microphone arrays that collaborate with each other. It can produce two-dimensional locations of speakers with only a single microphone per node. Specifically, we divide a targeted indoor space into multiple local areas. We encode each local area by a one-hot code, so the node and speaker locations can be represented by one-hot codes. Accordingly, the sound source localization problem is formulated as a classification task of recognizing the one-hot code of the speaker given the one-hot codes of the microphone nodes and their speech recordings. An end-to-end spatial-temporal deep model is designed for the classification problem. It utilizes a spatial-temporal attention architecture with a fusion layer inserted in the middle of the architecture, which is able to handle arbitrarily different numbers of microphone nodes during model training and testing. Experimental results show that the proposed method yields good performance in highly reverberant and noisy environments.

Submitted 16 October, 2022; originally announced October 2022.

Comments: 6 pages, 4 figures, conference

arXiv:2210.07839 [pdf, other]

Contrastive Audio-Visual Masked Autoencoder

Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

Abstract: In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task. Code and pretrained models are at https://github.com/yuangongnd/cav-mae.

Submitted 11 April, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

Comments: Accepted at ICLR 2023 as a notable top 25% paper. Code and pretrained models are at https://github.com/yuangongnd/cav-mae

arXiv:2208.00061 [pdf, other]

doi 10.1109/LSP.2022.3224688

UAVM: Towards Unifying Audio and Visual Models

Authors: Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass

Abstract: Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do not have.

Submitted 15 February, 2023; v1 submitted 29 July, 2022; originally announced August 2022.

Comments: Published in Signal Processing Letters. Code at https://github.com/YuanGongND/uavm

Journal ref: IEEE Signal Processing Letters, vol. 29, pp. 2437-2441, 2022

arXiv:2207.12577 [pdf, other]

Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution

Authors: Yushu Wu, Yifan Gong, Pu Zhao, Yanyu Li, Zheng Zhan, Wei Niu, Hao Tang, Minghai Qin, Bin Ren, Yanzhi Wang

Abstract: Deep learning-based super-resolution (SR) has gained tremendous popularity in recent years because of its high image quality performance and wide application scenarios. However, prior methods typically suffer from large amounts of computations and huge power consumption, causing difficulties for real-time inference, especially on resource-limited platforms such as mobile devices. To mitigate this, we propose a compiler-aware SR neural architecture search (NAS) framework that conducts depth search and per-layer width search with adaptive SR blocks. The inference speed is directly taken into the optimization along with the SR loss to derive SR models with high image quality while satisfying the real-time inference requirement. Instead of measuring the speed on mobile devices at each iteration during the search process, a speed model incorporated with compiler optimizations is leveraged to predict the inference latency of the SR block with various width configurations for faster convergence. With the proposed framework, we achieve real-time SR inference for implementing 720p resolution with competitive SR performance (in terms of PSNR and SSIM) on GPU/DSP of mobile platforms (Samsung Galaxy S21).

Submitted 25 July, 2022; originally announced July 2022.

arXiv:2206.13135 [pdf]

TALCS: An Open-Source Mandarin-English Code-Switching Corpus and a Speech Recognition Baseline

Authors: Chengfei Li, Shuhao Deng, Yao** Wang, Guang**g Wang, Yaguang Gong, Changbin Chen, **feng Bai

Abstract: This paper introduces the TALCS corpus, a new Mandarin-English code-switching speech recognition corpus suitable for training and evaluating code-switching speech recognition systems. The TALCS corpus is derived from real online one-to-one English teaching scenes in TAL education group and contains roughly 587 hours of speech sampled at 16 kHz. To the best of our knowledge, the TALCS corpus is the largest well-labeled Mandarin-English code-switching open source automatic speech recognition (ASR) dataset in the world. In this paper, we introduce the recording procedure in detail, including audio capturing devices and corpus environments. The TALCS corpus is freely available for download under a permissive license. Using the TALCS corpus, we conduct ASR experiments in two popular speech recognition toolkits, ESPnet and Wenet, to make a baseline system. The Mixture Error Rate (MER) performance of the two speech recognition toolkits is compared on the TALCS corpus. The experimental results imply that the quality of the audio recordings and transcriptions is promising and the baseline system is workable.

Submitted 27 June, 2022; originally announced June 2022.

Comments: accepted by INTERSPEECH 2022

arXiv:2205.03561 [pdf]

Realizing Ultra-Fast and Energy-Efficient Baseband Processing Using Analogue Resistive Switching Memory

Authors: Qunsong Zeng, Jiawei Liu, Jun Lan, Yi Gong, Zhongrui Wang, Yida Li, Kaibin Huang

Abstract: To support emerging applications ranging from holographic communications to extended reality, next-generation mobile wireless communication systems require ultra-fast and energy-efficient (UFEE) baseband processors. Traditional complementary metal-oxide-semiconductor (CMOS)-based baseband processors face two challenges in transistor scaling and the von Neumann bottleneck. To address these challenges, in-memory computing-based baseband processors using resistive random-access memory (RRAM) present an attractive solution. In this paper, we propose and demonstrate RRAM-based in-memory baseband processing for the widely adopted multiple-input-multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) air interface. Its key feature is to execute the key operations, including discrete Fourier transform (DFT) and MIMO detection using linear minimum mean square error (L-MMSE) and zero forcing (ZF), in one step. In addition, RRAM-based channel estimation as well as mapper/demapper modules are proposed. By prototyping and simulations, we demonstrate that the RRAM-based full-fledged communication system can significantly outperform its CMOS-based counterpart in terms of speed and energy efficiency by $10^3$ and $10^6$ times, respectively. The results pave a potential pathway for RRAM-based in-memory computing to be implemented in the era of the sixth generation (6G) mobile communications.

Submitted 7 May, 2022; originally announced May 2022.

arXiv:2205.03433 [pdf, other]

doi 10.1109/ICASSP43922.2022.9746828

Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition

Authors: Yuan Gong, ** Yu, James Glass

Abstract: Recognizing human non-speech vocalizations is an important task and has broad applications such as automatic sound transcription and health condition monitoring. However, existing datasets have a relatively small number of vocal sound samples or noisy labels. As a consequence, state-of-the-art audio event classification models may not perform well in detecting human vocal sounds. To support research on building robust and accurate vocal sound recognition, we have created a VocalSound dataset consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. Experiments show that the vocal sound recognition performance of a model can be significantly improved by 41.9% by adding VocalSound dataset to an existing dataset as training material. In addition, different from previous datasets, the VocalSound dataset contains meta information such as speaker age, gender, native language, country, and health condition.

Submitted 17 June, 2022; v1 submitted 6 May, 2022; originally announced May 2022.

Comments: Accepted at ICASSP 2022. Dataset and code at https://github.com/YuanGongND/vocalsound ; interactive Colab demo at https://colab.research.google.com/github/YuanGongND/vocalsound/blob/main/colab/VocalSound.ipynb

arXiv:2205.03432 [pdf, other]

doi 10.1109/ICASSP43922.2022.9746743

Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

Authors: Yuan Gong, Ziyi Chen, Iek-Heng Chu, Peng Chang, James Glass

Abstract: Automatic pronunciation assessment is an important technology to help self-directed language learners. While pronunciation quality has multiple aspects including accuracy, fluency, completeness, and prosody, previous efforts typically only model one aspect (e.g., accuracy) at one granularity (e.g., at the phoneme-level). In this work, we explore modeling multi-aspect pronunciation assessment at multiple granularities. Specifically, we train a Goodness Of Pronunciation feature-based Transformer (GOPT) with multi-task learning. Experiments show that GOPT achieves the best results on speechocean762 with a public automatic speech recognition (ASR) acoustic model trained on Librispeech.

Submitted 6 May, 2022; originally announced May 2022.

Comments: Accepted at ICASSP 2022. Code at https://github.com/YuanGongND/gopt ; interactive Colab demo at https://colab.research.google.com/github/YuanGongND/gopt/blob/master/colab/GOPT_GPU.ipynb

arXiv:2205.02194 [pdf, other]

Intelligent Reflecting Surface Aided Mobile Edge Computing With Binary Offloading: Energy Minimization for IoT Devices

Authors: Yizhen Yang, Yi Gong, Yik-Chung Wu

Abstract: Mobile edge computing (MEC) is envisioned as a promising technique to support computation-intensive and time-critical applications in the future Internet of Things (IoT) era. However, the uplink transmission performance will be highly impacted by the hostile wireless channel, the low bandwidth, and the low transmission power of IoT devices. Recently, intelligent reflecting surface (IRS) has drawn much attention because of its capability to control the wireless environments so as to enhance the spectrum and energy efficiencies of wireless communications. In this paper, we consider an IRS-aided multi-device MEC system where each IoT device follows the binary offloading policy, i.e., a task has to be computed as a whole either locally or remotely at the edge server. We aim to minimize the total energy consumption of devices by jointly optimizing the binary offloading modes, the CPU frequencies, the offloading powers, the offloading times and the IRS phase shifts for all devices. Two algorithms, which are greedy-based and penalty-based, are proposed to solve the challenging nonconvex and discontinuous problem. It is found that the penalty-based method has only linear complexity with respect to the number of devices, but it performs close to the greedy-based method, which has cubic complexity with respect to the number of devices. Furthermore, binary offloading via IRS indeed saves more energy compared to the case without IRS.

Submitted 4 May, 2022; originally announced May 2022.

arXiv:2204.08958 [pdf, other]

MANIQA: Multi-dimension Attention Network for No-Reference Image Quality Assessment

Authors: Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, Yujiu Yang

Abstract: No-Reference Image Quality Assessment (NR-IQA) aims to assess the perceptual quality of images in accordance with human subjective perception. Unfortunately, existing NR-IQA methods are far from meeting the needs of predicting accurate quality scores on GAN-based distortion images. To this end, we propose Multi-dimension Attention Network for no-reference Image Quality Assessment (MANIQA) to improve the performance on GAN-based distortion. We first extract features via ViT; then, to strengthen global and local interactions, we propose the Transposed Attention Block (TAB) and the Scale Swin Transformer Block (SSTB). These two modules apply attention mechanisms across the channel and spatial dimension, respectively. In this multi-dimensional manner, the modules cooperatively increase the interaction among different regions of images globally and locally. Finally, a dual branch structure for patch-weighted quality prediction is applied to predict the final score depending on the weight of each patch's score. Experimental results demonstrate that MANIQA outperforms state-of-the-art methods on four standard datasets (LIVE, TID2013, CSIQ, and KADID-10K) by a large margin. Besides, our method ranked first place in the final testing phase of the NTIRE 2022 Perceptual Image Quality Assessment Challenge Track 2: No-Reference. Codes and models are available at https://github.com/IIGROUP/MANIQA.

Submitted 20 April, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

arXiv:2203.07996 [pdf, other]

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Authors: Xichen Pan, Peiyu Chen, Yichen Gong, Helong Zhou, Xinbing Wang, Zhouhan Lin

Abstract: Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled data in multimodality is rather cost-demanding, especially for audio-visual speech recognition (AVSR). Thus it makes a lot of sense to make use of unlabelled unimodal data. On the other side, although the effectiveness of large-scale self-supervised learning is well established in both audio and visual modalities, how to integrate those pre-trained models into a multimodal scenario remains underexplored. In this work, we successfully leverage unimodal self-supervised learning to promote the multimodal AVSR. In particular, audio and visual front-ends are trained on large-scale unimodal datasets, then we integrate components of both front-ends into a larger multimodal framework which learns to recognize parallel audio-visual data into characters through a combination of CTC and seq2seq decoding. We show that both components inherited from unimodal self-supervised learning cooperate well, so that the multimodal framework yields competitive results through fine-tuning. Our model is experimentally validated on both word-level and sentence-level tasks. In particular, even without an external language model, our proposed model raises the state-of-the-art performances on the widely accepted Lip Reading Sentences 2 (LRS2) dataset by a large margin, with a relative improvement of 30%.

Submitted 26 March, 2022; v1 submitted 24 February, 2022; originally announced March 2022.

Comments: ACL2022 Main Conference

Showing 1–50 of 129 results for author: Gong, Y