Search | arXiv e-print repository

Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment

Authors: Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Daniel Povey, Sanjeev Khudanpur

Abstract: In the field of multi-channel, multi-speaker Automatic Speech Recognition (ASR), the task of discerning and accurately transcribing a target speaker's speech within background noise remains a formidable challenge. Traditional approaches often rely on microphone array configurations and the information of the target speaker's location or voiceprint. This study introduces the Solo Spatial Feature (S… ▽ More In the field of multi-channel, multi-speaker Automatic Speech Recognition (ASR), the task of discerning and accurately transcribing a target speaker's speech within background noise remains a formidable challenge. Traditional approaches often rely on microphone array configurations and the information of the target speaker's location or voiceprint. This study introduces the Solo Spatial Feature (Solo-SF), an innovative method that utilizes a target speaker's isolated speech segment to enhance ASR performance, thereby circumventing the need for conventional inputs like microphone array layouts. We explore effective strategies for selecting optimal solo segments, a crucial aspect for Solo-SF's success. Through evaluations conducted on the AliMeeting dataset and AISHELL-1 simulations, Solo-SF demonstrates superior performance over existing techniques, significantly lowering Character Error Rates (CER) in various test conditions. Our findings highlight Solo-SF's potential as an effective solution for addressing the complexities of multi-channel, multi-speaker ASR tasks. △ Less

Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted for presentation at Interspeech 2024

arXiv:2406.00492 [pdf, other]

SAM-VMNet: Deep Neural Networks For Coronary Angiography Vessel Segmentation

Authors: Xueying Zeng, Baixiang Huang, Yu Luo, Guangyu Wei, Songyan He, Yushuang Shao

Abstract: Coronary artery disease (CAD) is one of the most prevalent diseases in the cardiovascular field and one of the major contributors to death worldwide. Computed Tomography Angiography (CTA) images are regarded as the authoritative standard for the diagnosis of coronary artery disease, and by performing vessel segmentation and stenosis detection on CTA images, physicians are able to diagnose coronary… ▽ More Coronary artery disease (CAD) is one of the most prevalent diseases in the cardiovascular field and one of the major contributors to death worldwide. Computed Tomography Angiography (CTA) images are regarded as the authoritative standard for the diagnosis of coronary artery disease, and by performing vessel segmentation and stenosis detection on CTA images, physicians are able to diagnose coronary artery disease more accurately. In order to combine the advantages of both the base model and the domain-specific model, and to achieve high-precision and fully-automatic segmentation and detection with a limited number of training samples, we propose a novel architecture, SAM-VMNet, which combines the powerful feature extraction capability of MedSAM with the advantage of the linear complexity of the visual state-space model of VM-UNet, giving it faster inferences than Vision Transformer with faster inference speed and stronger data processing capability, achieving higher segmentation accuracy and stability for CTA images. Experimental results show that the SAM-VMNet architecture performs excellently in the CTA image segmentation task, with a segmentation accuracy of up to 98.32% and a sensitivity of up to 99.33%, which is significantly better than other existing models and has stronger domain adaptability. Comprehensive evaluation of the CTA image segmentation task shows that SAM-VMNet accurately extracts the vascular trunks and capillaries, demonstrating its great potential and wide range of application scenarios for the vascular segmentation task, and also laying a solid foundation for further stenosis detection. △ Less

Submitted 1 June, 2024; originally announced June 2024.

arXiv:2405.16889 [pdf]

Extraction of In-Phase and Quadrature Components by Time-Encoding Sampling

Authors: Y. H. Shao, S. Y. Chen, H. Z. Yang, F. Xi, H. Hong, Z. Liu

Abstract: Time encoding machine (TEM) is a biologically-inspired scheme to perform signal sampling using timing. In this paper, we study its application to the sampling of bandpass signals. We propose an integrate-and-fire TEM scheme by which the in-phase (I) and quadrature (Q) components are extracted through reconstruction. We design the TEM according to the signal bandwidth and amplitude instead of upper… ▽ More Time encoding machine (TEM) is a biologically-inspired scheme to perform signal sampling using timing. In this paper, we study its application to the sampling of bandpass signals. We propose an integrate-and-fire TEM scheme by which the in-phase (I) and quadrature (Q) components are extracted through reconstruction. We design the TEM according to the signal bandwidth and amplitude instead of upper-edge frequency and amplitude as in the case of bandlimited/lowpass signals. We show that the I and Q components can be perfectly reconstructed from the TEM measurements if the minimum firing rate is equal to the Landau's rate of the signal. For the reconstruction of I and Q components, we develop an alternating projection onto convex sets (POCS) algorithm in which two POCS algorithms are alternately iterated. For the algorithm analysis, we define a solution space of vector-valued signals and prove that the proposed reconstruction algorithm converges to the correct unique solution in the noiseless case. The proposed TEM can operate regardless of the center frequencies of the bandpass signals. This is quite different from traditional bandpass sampling, where the center frequency should be carefully allocated for Landau's rate and its variations have the negative effect on the sampling performance. In addition, the proposed TEM achieves certain reconstructed signal-to-noise-plus-distortion ratios for small firing rates in thermal noise, which is unavoidably present and will be aliased to the Nyquist band in the traditional sampling such that high sampling rates are required. We demonstrate the reconstruction performance and substantiate our claims via simulation experiments. △ Less

Submitted 27 May, 2024; originally announced May 2024.

Comments: 30 pages, 8 figures

arXiv:2405.11493 [pdf, other]

Point Cloud Compression with Implicit Neural Representations: A Unified Framework

Authors: Hongning Ruan, Yulin Shao, Qianqian Yang, Liang Zhao, Dusit Niyato

Abstract: Point clouds have become increasingly vital across various applications thanks to their ability to realistically depict 3D objects and scenes. Nevertheless, effectively compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we present a pioneering point cloud compression framework capable of handling both geometry and attribute components. Unlike… ▽ More Point clouds have become increasingly vital across various applications thanks to their ability to realistically depict 3D objects and scenes. Nevertheless, effectively compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we present a pioneering point cloud compression framework capable of handling both geometry and attribute components. Unlike traditional approaches and existing learning-based methods, our framework utilizes two coordinate-based neural networks to implicitly represent a voxelized point cloud. The first network generates the occupancy status of a voxel, while the second network determines the attributes of an occupied voxel. To tackle an immense number of voxels within the volumetric space, we partition the space into smaller cubes and focus solely on voxels within non-empty cubes. By feeding the coordinates of these voxels into the respective networks, we reconstruct the geometry and attribute components of the original point cloud. The neural network parameters are further quantized and compressed. Experimental results underscore the superior performance of our proposed method compared to the octree-based approach employed in the latest G-PCC standards. Moreover, our method exhibits high universality when contrasted with existing learning-based techniques. △ Less

Submitted 19 May, 2024; originally announced May 2024.

Comments: 6 Pages, 6 Figures, submitted to IEEE ICCC

arXiv:2405.09698 [pdf, other]

A Deep Joint Source-Channel Coding Scheme for Hybrid Mobile Multi-hop Networks

Authors: Chenghong Bian, Yulin Shao, Deniz Gündüz

Abstract: Efficient data transmission across mobile multi-hop networks that connect edge devices to core servers presents significant challenges, particularly due to the variability in link qualities between wireless and wired segments. This variability necessitates a robust transmission scheme that transcends the limitations of existing deep joint source-channel coding (DeepJSCC) strategies, which often st… ▽ More Efficient data transmission across mobile multi-hop networks that connect edge devices to core servers presents significant challenges, particularly due to the variability in link qualities between wireless and wired segments. This variability necessitates a robust transmission scheme that transcends the limitations of existing deep joint source-channel coding (DeepJSCC) strategies, which often struggle at the intersection of analog and digital methods. Addressing this need, this paper introduces a novel hybrid DeepJSCC framework, h-DJSCC, tailored for effective image transmission from edge devices through a network architecture that includes initial wireless transmission followed by multiple wired hops. Our approach harnesses the strengths of DeepJSCC for the initial, variable-quality wireless link to avoid the cliff effect inherent in purely digital schemes. For the subsequent wired hops, which feature more stable and high-capacity connections, we implement digital compression and forwarding techniques to prevent noise accumulation. This dual-mode strategy is adaptable even in scenarios with limited knowledge of the image distribution, enhancing the framework's robustness and utility. Extensive numerical simulations demonstrate that our hybrid solution outperforms traditional fully digital approaches by effectively managing transitions between different network segments and optimizing for variable signal-to-noise ratios (SNRs). We also introduce a fully adaptive h-DJSCC architecture capable of adjusting to different network conditions and achieving diverse rate-distortion objectives, thereby reducing the memory requirements on network nodes. △ Less

Submitted 15 May, 2024; originally announced May 2024.

Comments: Submitted to possible IEEE journal

arXiv:2405.09552 [pdf, other]

ODFormer: Semantic Fundus Image Segmentation Using Transformer for Optic Nerve Head Detection

Authors: Jiayi Wang, Yi-An Mao, Xiaoyu Ma, Sicen Guo, Yuting Shao, Xiao Lv, Wenting Han, Mark Christopher, Linda M. Zangwill, Yanlong Bi, Rui Fan

Abstract: Optic nerve head (ONH) detection has been a crucial area of study in ophthalmology for years. However, the significant discrepancy between fundus image datasets, each generated using a single type of fundus camera, poses challenges to the generalizability of ONH detection approaches developed based on semantic segmentation networks. Despite the numerous recent advancements in general-purpose seman… ▽ More Optic nerve head (ONH) detection has been a crucial area of study in ophthalmology for years. However, the significant discrepancy between fundus image datasets, each generated using a single type of fundus camera, poses challenges to the generalizability of ONH detection approaches developed based on semantic segmentation networks. Despite the numerous recent advancements in general-purpose semantic segmentation methods using convolutional neural networks (CNNs) and Transformers, there is currently a lack of benchmarks for these state-of-the-art (SoTA) networks specifically trained for ONH detection. Therefore, in this article, we make contributions from three key aspects: network design, the publication of a dataset, and the establishment of a comprehensive benchmark. Our newly developed ONH detection network, referred to as ODFormer, is based upon the Swin Transformer architecture and incorporates two novel components: a multi-scale context aggregator and a lightweight bidirectional feature recalibrator. Our published large-scale dataset, known as TongjiU-DROD, provides multi-resolution fundus images for each participant, captured using two distinct types of cameras. Our established benchmark involves three datasets: DRIONS-DB, DRISHTI-GS1, and TongjiU-DROD, created by researchers from different countries and containing fundus images captured from participants of diverse races and ages. Extensive experimental results demonstrate that our proposed ODFormer outperforms other state-of-the-art (SoTA) networks in terms of performance and generalizability. Our dataset and source code are publicly available at mias.group/ODFormer. △ Less

Submitted 2 June, 2024; v1 submitted 15 April, 2024; originally announced May 2024.

arXiv:2404.16376 [pdf, ps, other]

A Hypergraph Approach to Distributed Broadcast

Authors: Qi Cao, Yulin Shao, Fan Yang

Abstract: This paper explores the distributed broadcast problem within the context of network communications, a critical challenge in decentralized information dissemination. We put forth a novel hypergraph-based approach to address this issue, focusing on minimizing the number of broadcasts to ensure comprehensive data sharing among all network users. A key contribution of our work is the establishment of… ▽ More This paper explores the distributed broadcast problem within the context of network communications, a critical challenge in decentralized information dissemination. We put forth a novel hypergraph-based approach to address this issue, focusing on minimizing the number of broadcasts to ensure comprehensive data sharing among all network users. A key contribution of our work is the establishment of a general lower bound for the problem using the min-cut capacity of hypergraphs. Additionally, we present the distributed broadcast for quasi-trees (DBQT) algorithm tailored for the unique structure of quasi-trees, which is proven to be optimal. This paper advances both network communication strategies and hypergraph theory, with implications for a wide range of real-world applications, from vehicular and sensor networks to distributed storage systems. △ Less

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.09500 [pdf]

On-chip Real-time Hyperspectral Imager with Full CMOS Resolution Enabled by Massively Parallel Neural Network

Authors: Junren Wen, Haiqi Gao, Weiming Shi, Shuaibo Feng, Lingyun Hao, Yujie Liu, Liang Xu, Yuchuan Shao, Yueguang Zhang, Weidong Shen, Chenying Yang

Abstract: Traditional spectral imaging methods are constrained by the time-consuming scanning process, limiting the application in dynamic scenarios. One-shot spectral imaging based on reconstruction has been a hot research topic recently and the primary challenges still lie in both efficient fabrication techniques suitable for mass production and the high-speed, high-accuracy reconstruction algorithm for r… ▽ More Traditional spectral imaging methods are constrained by the time-consuming scanning process, limiting the application in dynamic scenarios. One-shot spectral imaging based on reconstruction has been a hot research topic recently and the primary challenges still lie in both efficient fabrication techniques suitable for mass production and the high-speed, high-accuracy reconstruction algorithm for real-time spectral imaging. In this study, we introduce an innovative on-chip real-time hyperspectral imager that leverages nanophotonic film spectral encoders and a Massively Parallel Network (MP-Net), featuring a 4 * 4 array of compact, all-dielectric film units for the micro-spectrometers. Each curved nanophotonic film unit uniquely modulates incident light across the underlying 3 * 3 CMOS image sensor (CIS) pixels, enabling a high spatial resolution equivalent to the full CMOS resolution. The implementation of MP-Net, specially designed to address variability in transmittance and manufacturing errors such as misalignment and non-uniformities in thin film deposition, can greatly increase the structural tolerance of the device and reduce the preparation requirement, further simplifying the manufacturing process. Tested in varied environments on both static and moving objects, the real-time hyperspectral imager demonstrates the robustness and high-fidelity spatial-spectral data capabilities across diverse scenarios. This on-chip hyperspectral imager represents a significant advancement in real-time, high-resolution spectral imaging, offering a versatile solution for applications ranging from environmental monitoring, remote sensing to consumer electronics. △ Less

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.00510 [pdf, other]

Denoising Low-dose Images Using Deep Learning of Time Series Images

Authors: Yang Shao, Toshie Yaguchi, Toshiaki Tanigaki

Abstract: Digital image devices have been widely applied in many fields, including scientific imaging, recognition of individuals, and remote sensing. As the application of these imaging technologies to autonomous driving and measurement, image noise generated when observation cannot be performed with a sufficient dose has become a major problem. Machine learning denoise technology is expected to be the sol… ▽ More Digital image devices have been widely applied in many fields, including scientific imaging, recognition of individuals, and remote sensing. As the application of these imaging technologies to autonomous driving and measurement, image noise generated when observation cannot be performed with a sufficient dose has become a major problem. Machine learning denoise technology is expected to be the solver of this problem, but there are the following problems. Here we report, artifacts generated by machine learning denoise in ultra-low dose observation using an in-situ observation video of an electron microscope as an example. And as a method to solve this problem, we propose a method to decompose a time series image into a 2D image of the spatial axis and time to perform machine learning denoise. Our method opens new avenues accurate and stable reconstruction of continuous high-resolution images from low-dose imaging in science, industry, and life. △ Less

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2403.13615 [pdf, other]

MIMO Channel as a Neural Function: Implicit Neural Representations for Extreme CSI Compression in Massive MIMO Systems

Authors: Haotian Wu, Maojun Zhang, Yulin Shao, Krystian Mikolajczyk, Deniz Gündüz

Abstract: Acquiring and utilizing accurate channel state information (CSI) can significantly improve transmission performance, thereby holding a crucial role in realizing the potential advantages of massive multiple-input multiple-output (MIMO) technology. Current prevailing CSI feedback approaches improve precision by employing advanced deep-learning methods to learn representative CSI features for a subse… ▽ More Acquiring and utilizing accurate channel state information (CSI) can significantly improve transmission performance, thereby holding a crucial role in realizing the potential advantages of massive multiple-input multiple-output (MIMO) technology. Current prevailing CSI feedback approaches improve precision by employing advanced deep-learning methods to learn representative CSI features for a subsequent compression process. Diverging from previous works, we treat the CSI compression problem in the context of implicit neural representations. Specifically, each CSI matrix is viewed as a neural function that maps the CSI coordinates (antenna number and subchannel) to the corresponding channel gains. Instead of transmitting the parameters of the implicit neural functions directly, we transmit modulations based on the CSI matrix derived through a meta-learning algorithm. Modulations are then applied to a shared base network to generate the elements of the CSI matrix. Modulations corresponding to the CSI matrix are quantized and entropy-coded to further reduce the communication bandwidth, thus achieving extreme CSI compression ratios. Numerical results show that our proposed approach achieves state-of-the-art performance and showcases flexibility in feedback strategies. △ Less

Submitted 20 March, 2024; originally announced March 2024.

MSC Class: 94A24 ACM Class: E.4

arXiv:2403.10613 [pdf, other]

Process-and-Forward: Deep Joint Source-Channel Coding Over Cooperative Relay Networks

Authors: Chenghong Bian, Yulin Shao, Haotian Wu, Emre Ozfatura, Deniz Gunduz

Abstract: This paper introduces an innovative deep joint source-channel coding (DeepJSCC) approach to image transmission over a cooperative relay channel. The relay either amplifies and forwards a scaled version of its received signal, referred to as DeepJSCC-AF, or leverages neural networks to extract relevant features about the source signal before forwarding it to the destination, which we call DeepJSCC-… ▽ More This paper introduces an innovative deep joint source-channel coding (DeepJSCC) approach to image transmission over a cooperative relay channel. The relay either amplifies and forwards a scaled version of its received signal, referred to as DeepJSCC-AF, or leverages neural networks to extract relevant features about the source signal before forwarding it to the destination, which we call DeepJSCC-PF (Process-and-Forward). In the full-duplex scheme, inspired by the block Markov coding (BMC) concept, we introduce a novel block transmission strategy built upon novel vision transformer architecture. In the proposed scheme, the source transmits information in blocks, and the relay updates its knowledge about the input signal after each block and generates its own signal to be conveyed to the destination. To enhance practicality, we introduce an adaptive transmission model, which allows a single trained DeepJSCC model to adapt seamlessly to various channel qualities, making it a versatile solution. Simulation results demonstrate the superior performance of our proposed DeepJSCC compared to the state-of-the-art BPG image compression algorithm, even when operating at the maximum achievable rate of conventional decode-and-forward and compress-and-forward protocols, for both half-duplex and full-duplex relay scenarios. △ Less

Submitted 15 March, 2024; originally announced March 2024.

Comments: Submitted for possible IEEE journal

arXiv:2403.00321 [pdf, other]

DEEP-IoT: Downlink-Enhanced Efficient-Power Internet of Things

Authors: Yulin Shao

Abstract: At the heart of the Internet of Things (IoT) -- a domain witnessing explosive growth -- the imperative for energy efficiency and the extension of device lifespans has never been more pressing. This paper presents DEEP-IoT, a revolutionary communication paradigm poised to redefine how IoT devices communicate. Through a pioneering "listen more, transmit less" strategy, DEEP-IoT challenges and transf… ▽ More At the heart of the Internet of Things (IoT) -- a domain witnessing explosive growth -- the imperative for energy efficiency and the extension of device lifespans has never been more pressing. This paper presents DEEP-IoT, a revolutionary communication paradigm poised to redefine how IoT devices communicate. Through a pioneering "listen more, transmit less" strategy, DEEP-IoT challenges and transforms the traditional transmitter (IoT devices)-centric communication model to one where the receiver (the access point) play a pivotal role, thereby cutting down energy use and boosting device longevity. We not only conceptualize DEEP-IoT but also actualize it by integrating deep learning-enhanced feedback channel codes within a narrow-band system. Simulation results show a significant enhancement in the operational lifespan of IoT cells -- surpassing traditional systems using Turbo and Polar codes by up to 52.71%. This leap signifies a paradigm shift in IoT communications, setting the stage for a future where IoT devices boast unprecedented efficiency and durability. △ Less

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2401.05182 [pdf, other]

Integrated Sensing and Communication with Reconfigurable Distributed Antenna and Reflecting Surface: Joint Beamforming and Mode Selection

Authors: **** Zhang, **tao Wang, Yulin Shao, Shaodan Ma

Abstract: This paper presents a new integrated sensing and communication (ISAC) framework, leveraging the recent advancements of reconfigurable distributed antenna and reflecting surface (RDARS). RDARS is a programmable surface structure comprising numerous elements, each of which can be flexibly configured to operate either in a reflection mode, resembling a passive reconfigurable intelligent surface (RIS)… ▽ More This paper presents a new integrated sensing and communication (ISAC) framework, leveraging the recent advancements of reconfigurable distributed antenna and reflecting surface (RDARS). RDARS is a programmable surface structure comprising numerous elements, each of which can be flexibly configured to operate either in a reflection mode, resembling a passive reconfigurable intelligent surface (RIS), or in a connected mode, functioning as a remote transmit or receive antenna. Our RDARS-aided ISAC framework effectively mitigates the adverse impact of multiplicative fading when compared to the passive RIS-aided ISAC, and reduces cost and energy consumption when compared to the active RIS-aided ISAC. Within our RDARS-aided ISAC framework, we consider a radar output signal-to-noise ratio (SNR) maximization problem under communication constraints to jointly optimize the active transmit beamforming matrix of the base station (BS), the reflection and mode selection matrices of RDARS, and the receive filter. To tackle the inherent non-convexity and the binary integer optimization introduced by the mode selection in this optimization challenge, we propose an efficient iterative algorithm with proved convergence based on majorization minimization (MM) and penalty-based methods.Numerical and simulation results demonstrate the superior performance of our new framework, and clearly verify substantial distribution, reflection as well as selection gains obtained by properly configuring the RDARS. △ Less

Submitted 10 January, 2024; originally announced January 2024.

Comments: 13 pages, 9 figures

arXiv:2401.00658 [pdf, other]

Point Cloud in the Air

Authors: Yulin Shao, Chenghong Bian, Li Yang, Qianqian Yang, Zhaoyang Zhang, Deniz Gunduz

Abstract: Acquisition and processing of point clouds (PCs) is a crucial enabler for many emerging applications reliant on 3D spatial data, such as robot navigation, autonomous vehicles, and augmented reality. In most scenarios, PCs acquired by remote sensors must be transmitted to an edge server for fusion, segmentation, or inference. Wireless transmission of PCs not only puts on increased burden on the alr… ▽ More Acquisition and processing of point clouds (PCs) is a crucial enabler for many emerging applications reliant on 3D spatial data, such as robot navigation, autonomous vehicles, and augmented reality. In most scenarios, PCs acquired by remote sensors must be transmitted to an edge server for fusion, segmentation, or inference. Wireless transmission of PCs not only puts on increased burden on the already congested wireless spectrum, but also confronts a unique set of challenges arising from the irregular and unstructured nature of PCs. In this paper, we meticulously delineate these challenges and offer a comprehensive examination of existing solutions while candidly acknowledging their inherent limitations. In response to these intricacies, we proffer four pragmatic solution frameworks, spanning advanced techniques, hybrid schemes, and distributed data aggregation approaches. In doing so, our goal is to chart a path toward efficient, reliable, and low-latency wireless PC transmission. △ Less

Submitted 31 December, 2023; originally announced January 2024.

arXiv:2311.14780 [pdf, other]

Wavelength-multiplexed Multi-mode EUV Reflection Ptychography based on Automatic-Differentiation

Authors: Yifeng Shao, Sven Weerdenburg, Jacob Seifert, H. Paul Urbach, Allard P. Mosk, Wim Coene

Abstract: Ptychographic extreme ultraviolet (EUV) diffractive imaging has emerged as a promising candidate for the next-generation metrology solutions in the semiconductor industry, as it can image wafer samples in reflection geometry at the nanoscale. This technique has surged attention recently, owing to the significant progress in high-harmonic generation (HHG) EUV sources and advancements in both hardwa… ▽ More Ptychographic extreme ultraviolet (EUV) diffractive imaging has emerged as a promising candidate for the next-generation metrology solutions in the semiconductor industry, as it can image wafer samples in reflection geometry at the nanoscale. This technique has surged attention recently, owing to the significant progress in high-harmonic generation (HHG) EUV sources and advancements in both hardware and software for computation. In this study, a novel algorithm is introduced and tested, which enables wavelength-multiplexed reconstruction that enhances the measurement throughput and introduces data diversity, allowing the accurate characterisation of sample structures. To tackle the inherent instabilities of the HHG source, a modal approach was adopted, which represents the cross-density function of the illumination by a series of mutually incoherent and independent spatial modes. The proposed algorithm was implemented on a mainstream machine learning platform, which leverages automatic differentiation to manage the drastic growth in model complexity and expedites the computation using GPU acceleration. By optimising over 200 million parameters, we demonstrate the algorithm's capacity to accommodate experimental uncertainties and achieve a resolution approaching the diffraction limit in reflection geometry. The reconstruction of wafer samples with 20-nm heigh patterned gold structures on a silicon substrate highlights our ability to handle complex physical interrelations involving a multitude of parameters. These results establish ptychography as an efficient and accurate metrology tool. △ Less

Submitted 24 November, 2023; originally announced November 2023.

arXiv:2311.07580 [pdf, other]

doi 10.1364/OE.513556

Noise-robust latent vector reconstruction in ptychography using deep generative models

Authors: Jacob Seifert, Yifeng Shao, Allard P. Mosk

Abstract: Computational imaging is increasingly vital for a broad spectrum of applications, ranging from biological to material sciences. This includes applications where the object is known and sufficiently sparse, allowing it to be described with a reduced number of parameters. When no explicit parameterization is available, a deep generative model can be trained to represent an object in a low-dimensiona… ▽ More Computational imaging is increasingly vital for a broad spectrum of applications, ranging from biological to material sciences. This includes applications where the object is known and sufficiently sparse, allowing it to be described with a reduced number of parameters. When no explicit parameterization is available, a deep generative model can be trained to represent an object in a low-dimensional latent space. In this paper, we harness this dimensionality reduction capability of autoencoders to search for the object solution within the latent space rather than the object space. We demonstrate a novel approach to ptychographic image reconstruction by integrating a deep generative model obtained from a pre-trained autoencoder within an Automatic Differentiation Ptychography (ADP) framework. This approach enables the retrieval of objects from highly ill-posed diffraction patterns, offering an effective method for noise-robust latent vector reconstruction in ptychography. Moreover, the map** into a low-dimensional latent space allows us to visualize the optimization landscape, which provides insight into the convexity and convergence behavior of the inverse problem. With this work, we aim to facilitate new applications for sparse computational imaging such as when low radiation doses or rapid reconstructions are essential. △ Less

Submitted 14 January, 2024; v1 submitted 18 October, 2023; originally announced November 2023.

Journal ref: Opt. Express 32(1), 1020-1033 (2024)

arXiv:2311.07028 [pdf, other]

A Hybrid Joint Source-Channel Coding Scheme for Mobile Multi-hop Networks

Authors: Chenghong Bian, Yulin Shao, Deniz Gunduz

Abstract: We propose a novel hybrid joint source-channel coding (JSCC) scheme for robust image transmission over multi-hop networks. In the considered scenario, a mobile user wants to deliver an image to its destination over a mobile cellular network. We assume a practical setting, where the links between the nodes belonging to the mobile core network are stable and of high quality, while the link between t… ▽ More We propose a novel hybrid joint source-channel coding (JSCC) scheme for robust image transmission over multi-hop networks. In the considered scenario, a mobile user wants to deliver an image to its destination over a mobile cellular network. We assume a practical setting, where the links between the nodes belonging to the mobile core network are stable and of high quality, while the link between the mobile user and the first node (e.g., the access point) is potentially time-varying with poorer quality. In recent years, neural network based JSCC schemes (called DeepJSCC) have emerged as promising solutions to overcome the limitations of separation-based fully digital schemes. However, relying on analog transmission, DeepJSCC suffers from noise accumulation over multi-hop networks. Moreover, most of the hops within the mobile core network may be high-capacity wireless connections, calling for digital approaches. To this end, we propose a hybrid solution, where DeepJSCC is adopted for the first hop, while the received signal at the first relay is digitally compressed and forwarded through the mobile core network. We show through numerical simulations that the proposed scheme is able to outperform both the fully analog and fully digital schemes. Thanks to DeepJSCC it can avoid the cliff effect over the first hop, while also avoiding noise forwarding over the mobile core network thank to digital transmission. We believe this work paves the way for the practical deployment of DeepJSCC solutions in 6G and future wireless networks. △ Less

Submitted 7 February, 2024; v1 submitted 12 November, 2023; originally announced November 2023.

Comments: Accepted to IEEE International Conference on Communications (ICC), 2024. Source code will be released soon

arXiv:2311.00146 [pdf, other]

RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios

Authors: Yiwen Shao, Shi-Xiong Zhang, Dong Yu

Abstract: Automatic speech recognition (ASR) on multi-talker recordings is challenging. Current methods using 3D spatial data from multi-channel audio and visual cues focus mainly on direct waves from the target speaker, overlooking reflection wave impacts, which hinders performance in reverberant environments. Our research introduces RIR-SF, a novel spatial feature based on room impulse response (RIR) that… ▽ More Automatic speech recognition (ASR) on multi-talker recordings is challenging. Current methods using 3D spatial data from multi-channel audio and visual cues focus mainly on direct waves from the target speaker, overlooking reflection wave impacts, which hinders performance in reverberant environments. Our research introduces RIR-SF, a novel spatial feature based on room impulse response (RIR) that leverages the speaker's position, room acoustics, and reflection dynamics. RIR-SF significantly outperforms traditional 3D spatial features, showing superior theoretical and empirical performance. We also propose an optimized all-neural multi-channel ASR framework for RIR-SF, achieving a relative 21.3\% reduction in CER for target speaker ASR in multi-channel settings. RIR-SF enhances recognition accuracy and demonstrates robustness in high-reverberation scenarios, overcoming the limitations of previous methods. △ Less

Submitted 11 June, 2024; v1 submitted 31 October, 2023; originally announced November 2023.

Comments: Accepted for presentation at Interspeech 2024

arXiv:2310.16367 [pdf, other]

UniX-Encoder: A Universal $X$-Channel Speech Encoder for Ad-Hoc Microphone Array Speech Processing

Authors: Zili Huang, Yiwen Shao, Shi-Xiong Zhang, Dong Yu

Abstract: The speech field is evolving to solve more challenging scenarios, such as multi-channel recordings with multiple simultaneous talkers. Given the many types of microphone setups out there, we present the UniX-Encoder. It's a universal encoder designed for multiple tasks, and worked with any microphone array, in both solo and multi-talker environments. Our research enhances previous multi-channel sp… ▽ More The speech field is evolving to solve more challenging scenarios, such as multi-channel recordings with multiple simultaneous talkers. Given the many types of microphone setups out there, we present the UniX-Encoder. It's a universal encoder designed for multiple tasks, and worked with any microphone array, in both solo and multi-talker environments. Our research enhances previous multi-channel speech processing efforts in four key areas: 1) Adaptability: Contrasting traditional models constrained to certain microphone array configurations, our encoder is universally compatible. 2) Multi-Task Capability: Beyond the single-task focus of previous systems, UniX-Encoder acts as a robust upstream model, adeptly extracting features for diverse tasks including ASR and speaker recognition. 3) Self-Supervised Training: The encoder is trained without requiring labeled multi-channel data. 4) End-to-End Integration: In contrast to models that first beamform then process single-channels, our encoder offers an end-to-end solution, bypassing explicit beamforming or separation. To validate its effectiveness, we tested the UniX-Encoder on a synthetic multi-channel dataset from the LibriSpeech corpus. Across tasks like speech recognition and speaker diarization, our encoder consistently outperformed combinations like the WavLM model with the BeamformIt frontend. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: Submitted to ICASSP 2024

arXiv:2310.03901 [pdf, other]

Challenges and Insights: Exploring 3D Spatial Features and Complex Networks on the MISP Dataset

Authors: Yiwen Shao

Abstract: Multi-channel multi-talker speech recognition presents formidable challenges in the realm of speech processing, marked by issues such as background noise, reverberation, and overlap** speech. Overcoming these complexities requires leveraging contextual cues to separate target speech from a cacophonous mix, enabling accurate recognition. Among these cues, the 3D spatial feature has emerged as a c… ▽ More Multi-channel multi-talker speech recognition presents formidable challenges in the realm of speech processing, marked by issues such as background noise, reverberation, and overlap** speech. Overcoming these complexities requires leveraging contextual cues to separate target speech from a cacophonous mix, enabling accurate recognition. Among these cues, the 3D spatial feature has emerged as a cutting-edge solution, particularly when equipped with spatial information about the target speaker. Its exceptional ability to discern the target speaker within mixed audio, often rendering intermediate processing redundant, paves the way for the direct training of "All-in-one" ASR models. These models have demonstrated commendable performance on both simulated and real-world data. In this paper, we extend this approach to the MISP dataset to further validate its efficacy. We delve into the challenges encountered and insights gained when applying 3D spatial features to MISP, while also exploring preliminary experiments involving the replacement of these features with more complex input and models. △ Less

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2309.00470 [pdf, other]

Deep Joint Source-Channel Coding for Adaptive Image Transmission over MIMO Channels

Authors: Haotian Wu, Yulin Shao, Chenghong Bian, Krystian Mikolajczyk, Deniz Gündüz

Abstract: This paper introduces a vision transformer (ViT)-based deep joint source and channel coding (DeepJSCC) scheme for wireless image transmission over multiple-input multiple-output (MIMO) channels, denoted as DeepJSCC-MIMO. We consider DeepJSCC-MIMO for adaptive image transmission in both open-loop and closed-loop MIMO systems. The novel DeepJSCC-MIMO architecture surpasses the classical separation-b… ▽ More This paper introduces a vision transformer (ViT)-based deep joint source and channel coding (DeepJSCC) scheme for wireless image transmission over multiple-input multiple-output (MIMO) channels, denoted as DeepJSCC-MIMO. We consider DeepJSCC-MIMO for adaptive image transmission in both open-loop and closed-loop MIMO systems. The novel DeepJSCC-MIMO architecture surpasses the classical separation-based benchmarks with robustness to channel estimation errors and showcases remarkable flexibility in adapting to diverse channel conditions and antenna numbers without requiring retraining. Specifically, by harnessing the self-attention mechanism of ViT, DeepJSCC-MIMO intelligently learns feature map** and power allocation strategies tailored to the unique characteristics of the source image and prevailing channel conditions. Extensive numerical experiments validate the significant improvements in transmission quality achieved by DeepJSCC-MIMO for both open-loop and closed-loop MIMO systems across a wide range of scenarios. Moreover, DeepJSCC-MIMO exhibits robustness to varying channel conditions, channel estimation errors, and different antenna numbers, making it an appealing solution for emerging semantic communication systems. △ Less

Submitted 7 May, 2024; v1 submitted 1 September, 2023; originally announced September 2023.

Comments: arXiv admin note: text overlap with arXiv:2210.15347

MSC Class: 94A24 ACM Class: E.4

arXiv:2308.02436 [pdf, other]

doi 10.1364/OL.502344

Maximum-likelihood estimation in ptychography in the presence of Poisson-Gaussian noise statistics

Authors: Jacob Seifert, Yifeng Shao, Rens van Dam, Dorian Bouchet, Tristan van Leeuwen, Allard P. Mosk

Abstract: Optical measurements often exhibit mixed Poisson-Gaussian noise statistics, which hampers image quality, particularly under low signal-to-noise ratio (SNR) conditions. Computational imaging falls short in such situations when solely Poissonian noise statistics are assumed. In response to this challenge, we define a loss function that explicitly incorporates this mixed noise nature. By using maximu… ▽ More Optical measurements often exhibit mixed Poisson-Gaussian noise statistics, which hampers image quality, particularly under low signal-to-noise ratio (SNR) conditions. Computational imaging falls short in such situations when solely Poissonian noise statistics are assumed. In response to this challenge, we define a loss function that explicitly incorporates this mixed noise nature. By using maximum-likelihood estimation, we devise a practical method to account for camera readout noise in gradient-based ptychography optimization. Our results, based on both experimental and numerical data, demonstrate that this approach outperforms the conventional one, enabling enhanced image reconstruction quality under challenging noise conditions through a straightforward methodological adjustment. △ Less

Submitted 11 October, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

Comments: Contains main and supplementary documents

Journal ref: Opt. Lett. 48, 6027-6030 (2023)

arXiv:2307.07319 [pdf, other]

The Power of Large Language Models for Wireless Communication System Development: A Case Study on FPGA Platforms

Authors: Yuyang Du, Soung Chang Liew, Kexin Chen, Yulin Shao

Abstract: Large language models (LLMs) have garnered significant attention across various research disciplines, including the wireless communication community. There have been several heated discussions on the intersection of LLMs and wireless technologies. While recent studies have demonstrated the ability of LLMs to generate hardware description language (HDL) code for simple computation tasks, develo**… ▽ More Large language models (LLMs) have garnered significant attention across various research disciplines, including the wireless communication community. There have been several heated discussions on the intersection of LLMs and wireless technologies. While recent studies have demonstrated the ability of LLMs to generate hardware description language (HDL) code for simple computation tasks, develo** wireless prototypes and products via HDL poses far greater challenges because of the more complex computation tasks involved. In this paper, we aim to address this challenge by investigating the role of LLMs in FPGA-based hardware development for advanced wireless signal processing. We begin by exploring LLM-assisted code refactoring, reuse, and validation, using an open-source software-defined radio (SDR) project as a case study. Through the case study, we find that an LLM assistant can potentially yield substantial productivity gains for researchers and developers. We then examine the feasibility of using LLMs to generate HDL code for advanced wireless signal processing, using the Fast Fourier Transform (FFT) algorithm as an example. This task presents two unique challenges: the scheduling of subtasks within the overall task and the multi-step thinking required to solve certain arithmetic problem within the task. To address these challenges, we employ in-context learning (ICL) and Chain-of-Thought (CoT) prompting techniques, culminating in the successful generation of a 64-point Verilog FFT module. Our results demonstrate the potential of LLMs for generalization and imitation, affirming their usefulness in writing HDL code for wireless communication systems. Overall, this work contributes to understanding the role of LLMs in wireless communication and motivates further exploration of their capabilities. △ Less

Submitted 7 November, 2023; v1 submitted 14 July, 2023; originally announced July 2023.

arXiv:2306.09101 [pdf, other]

Transformer-aided Wireless Image Transmission with Channel Feedback

Authors: Haotian Wu, Yulin Shao, Emre Ozfatura, Krystian Mikolajczyk, Deniz Gündüz

Abstract: This paper presents a novel wireless image transmission paradigm that can exploit feedback from the receiver, called DeepJSCC-ViT-f. We consider a block feedback channel model, where the transmitter receives noiseless/noisy channel output feedback after each block. The proposed scheme employs a single encoder to facilitate transmission over multiple blocks, refining the receiver's estimation at ea… ▽ More This paper presents a novel wireless image transmission paradigm that can exploit feedback from the receiver, called DeepJSCC-ViT-f. We consider a block feedback channel model, where the transmitter receives noiseless/noisy channel output feedback after each block. The proposed scheme employs a single encoder to facilitate transmission over multiple blocks, refining the receiver's estimation at each block. Specifically, the unified encoder of DeepJSCC-ViT-f can leverage the semantic information from the source image, and acquire channel state information and the decoder's current belief about the source image from the feedback signal to generate coded symbols at each block. Numerical experiments show that our DeepJSCC-ViT-f scheme achieves state-of-the-art transmission performance with robustness to noise in the feedback link. Additionally, DeepJSCC-ViT-f can adapt to the channel condition directly through feedback without the need for separate channel estimation. We further extend the scope of the DeepJSCC-ViT-f approach to include the broadcast channel, which enables the transmitter to generate broadcast codes in accordance with signal semantics and channel feedback from individual receivers. △ Less

Submitted 14 February, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

MSC Class: 94A24 ACM Class: E.4

arXiv:2306.09014 [pdf, other]

Geometric Wide-Angle Camera Calibration: A Review and Comparative Study

Authors: Jianzhu Huai, Yuan Zhuang, Yuxin Shao, Grzegorz Jozkow, Binliang Wang, Yijia He, Alper Yilmaz

Abstract: Wide-angle cameras are widely used in photogrammetry and autonomous systems which rely on the accurate metric measurements derived from images. To find the geometric relationship between incoming rays and image pixels, geometric camera calibration (GCC) has been actively developed. Aiming to provide practical calibration guidelines, this work surveys the existing GCC tools and evaluates the repres… ▽ More Wide-angle cameras are widely used in photogrammetry and autonomous systems which rely on the accurate metric measurements derived from images. To find the geometric relationship between incoming rays and image pixels, geometric camera calibration (GCC) has been actively developed. Aiming to provide practical calibration guidelines, this work surveys the existing GCC tools and evaluates the representative ones for wide-angle cameras. The survey covers camera models, calibration targets, and algorithms used in these tools, highlighting their properties and the trends in GCC development. The evaluation compares six target-based GCC tools, namely, BabelCalib, Basalt, Camodocal, Kalibr, the MATLAB calibrator, and the OpenCV-based ROS calibrator, with simulated and real data for wide-angle cameras described by four parametric projection models. These tests reveal the strengths and weaknesses of these camera models, as well as the repeatability of these GCC tools. In view of the survey and evaluation, future research directions of wide-angle GCC are also discussed. △ Less

Submitted 27 March, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: 18 pages, 12 figures

arXiv:2306.08730 [pdf, other]

Wireless Point Cloud Transmission

Authors: Chenghong Bian, Yulin Shao, Deniz Gunduz

Abstract: 3D point cloud is a three-dimensional data format generated by LiDARs and depth sensors, and is being increasingly used in a large variety of applications. This paper presents a novel solution called SEmantic Point cloud Transmission (SEPT), for the transmission of point clouds over wireless channels with limited bandwidth. At the transmitter, SEPT encodes the point cloud via an iterative downsamp… ▽ More 3D point cloud is a three-dimensional data format generated by LiDARs and depth sensors, and is being increasingly used in a large variety of applications. This paper presents a novel solution called SEmantic Point cloud Transmission (SEPT), for the transmission of point clouds over wireless channels with limited bandwidth. At the transmitter, SEPT encodes the point cloud via an iterative downsampling and feature extraction process. At the receiver, SEPT reconstructs the point cloud with latent reconstruction and offset-based upsampling. Extensive numerical experiments confirm that SEPT significantly outperforms the standard approach with octree-based compression followed by channel coding. Compared with a more advanced benchmark that utilizes state-of-the-art deep learning-based compression techniques, SEPT achieves comparable performance while eliminating the cliff and leveling effects. Thanks to its improved performance and robustness against channel variations, we believe that SEPT can be instrumental in collaborative sensing and inference applications among robots and vehicles, particularly in the low-latency and high-mobility scenarios. △ Less

Submitted 14 June, 2023; originally announced June 2023.

Comments: 7 pages

arXiv:2305.13161 [pdf, ps, other]

DeepJSCC-l++: Robust and Bandwidth-Adaptive Wireless Image Transmission

Authors: Chenghong Bian, Yulin Shao, Deniz Gunduz

Abstract: This paper presents a novel vision transformer (ViT) based deep joint source channel coding (DeepJSCC) scheme, dubbed DeepJSCC-l++, which can be adaptive to multiple target bandwidth ratios as well as different channel signal-to-noise ratios (SNRs) using a single model. To achieve this, we train the proposed DeepJSCC-l++ model with different bandwidth ratios and SNRs, which are fed to the model as… ▽ More This paper presents a novel vision transformer (ViT) based deep joint source channel coding (DeepJSCC) scheme, dubbed DeepJSCC-l++, which can be adaptive to multiple target bandwidth ratios as well as different channel signal-to-noise ratios (SNRs) using a single model. To achieve this, we train the proposed DeepJSCC-l++ model with different bandwidth ratios and SNRs, which are fed to the model as side information. The reconstruction losses corresponding to different bandwidth ratios are calculated, and a new training methodology is proposed, which dynamically assigns different weights to the losses of different bandwidth ratios according to their individual reconstruction qualities. Shifted window (Swin) transformer, is adopted as the backbone for our DeepJSCC-l++ model. Through extensive simulations it is shown that the proposed DeepJSCC-l++ and successive refinement models can adapt to different bandwidth ratios and channel SNRs with marginal performance loss compared to the separately trained models. We also observe the proposed schemes can outperform the digital baseline, which concatenates the BPG compression with capacity-achieving channel code. △ Less

Submitted 30 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: Accepted to IEEE Global Communications Conference 2023. Code available at https://github.com/aprilbian/deepjscc-lplusplus

arXiv:2305.11651 [pdf, other]

Channel Cycle Time: A New Measure of Short-term Fairness

Authors: Pengfei Shen, Yulin Shao, Haoyuan Pan, Lu Lu, Yonina C. Eldar

Abstract: This paper puts forth a new metric, dubbed channel cycle time (CCT), to measure the short-term fairness of communication networks. CCT characterizes the average duration between two consecutive successful transmissions of a user, during which all other users successfully accessed the channel at least once. In contrast to existing short-term fairness measures, CCT provides more comprehensive insigh… ▽ More This paper puts forth a new metric, dubbed channel cycle time (CCT), to measure the short-term fairness of communication networks. CCT characterizes the average duration between two consecutive successful transmissions of a user, during which all other users successfully accessed the channel at least once. In contrast to existing short-term fairness measures, CCT provides more comprehensive insight into the transient dynamics of communication networks, with a particular focus on users' delays and jitter. To validate the efficacy of our approach, we analytically characterize the CCTs for two classical communication protocols: slotted Aloha and CSMA/CA. The analysis demonstrates that CSMA/CA exhibits superior short-term fairness over slotted Aloha. Beyond its role as a measurement metric, CCT has broader implications as a guiding principle for the design of future communication networks by emphasizing factors like fairness, delay, and jitter in short-term behaviors. △ Less

Submitted 14 October, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

arXiv:2305.04395 [pdf, other]

Optical Integrated Sensing and Communication

Authors: Runxin Zhang, Yulin Shao, Menghan Li, Lu Lu

Abstract: This paper explores a new paradigm of optical integrated sensing and communication (O-ISAC). Our investigation reveals that optical communication and optical sensing are two inherently complementary technologies. On the one hand, optical communication provides the necessary illumination for optical sensing. On the other hand, optical sensing provides environmental information for optical communica… ▽ More This paper explores a new paradigm of optical integrated sensing and communication (O-ISAC). Our investigation reveals that optical communication and optical sensing are two inherently complementary technologies. On the one hand, optical communication provides the necessary illumination for optical sensing. On the other hand, optical sensing provides environmental information for optical communication. These insights form the foundation of a directionless integrated system, which constitutes the first phase of O-ISAC. We further put forth the concept of optical beamforming using the collimating lens, whereby the light emitted by optical sources is concentrated onto the target device. This greatly improves communication rate and sensing accuracy, thanks to remarkably increased light intensity. Simulation results confirm the significant performance gains of our O-ISAC system over a separated sensing and communication system. With the collimating lens, the light intensity arrived at the target object is increased from 1.09% to 78.06%. The sensing accuracy and communication BER are improved by 62.06dB and 65.52dB, respectively. △ Less

Submitted 23 May, 2023; v1 submitted 7 May, 2023; originally announced May 2023.

arXiv:2212.06596 [pdf, other]

Broadband Digital Over-the-Air Computation for Wireless Federated Edge Learning

Authors: Lizhao You, Xinbo Zhao, Rui Cao, Yulin Shao, Liqun Fu

Abstract: This paper presents the first orthogonal frequency-division multiplexing(OFDM)-based digital over-the-air computation (AirComp) system for wireless federated edge learning, where multiple edge devices transmit model data simultaneously using non-orthogonal OFDM subcarriers, and the edge server aggregates data directly from the superimposed signal. Existing analog AirComp systems often assume perfe… ▽ More This paper presents the first orthogonal frequency-division multiplexing(OFDM)-based digital over-the-air computation (AirComp) system for wireless federated edge learning, where multiple edge devices transmit model data simultaneously using non-orthogonal OFDM subcarriers, and the edge server aggregates data directly from the superimposed signal. Existing analog AirComp systems often assume perfect phase alignment via channel precoding and utilize uncoded analog transmission for model aggregation. In contrast, our digital AirComp system leverages digital modulation and channel codes to overcome phase asynchrony, thereby achieving accurate model aggregation for phase-asynchronous multi-user OFDM systems. To realize a digital AirComp system, we develop a medium access control (MAC) protocol that allows simultaneous transmissions from different users using non-orthogonal OFDM subcarriers, and put forth joint channel decoding and aggregation decoders tailored for convolutional and LDPC codes. To verify the proposed system design, we build a digital AirComp prototype on the USRP software-defined radio platform, and demonstrate a real-time LDPC-coded AirComp system with up to four users. Trace-driven simulation results on test accuracy versus SNR show that: 1) analog AirComp is sensitive to phase asynchrony in practical multi-user OFDM systems, and the test accuracy performance fails to improve even at high SNRs; 2) our digital AirComp system outperforms two analog AirComp systems at all SNRs, and approaches the optimal performance when SNR $\geq$ 6 dB for two-user LDPC-coded AirComp, demonstrating the advantage of digital AirComp in phase-asynchronous multi-user OFDM systems. △ Less

Submitted 5 July, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

Comments: 20 pages. arXiv admin note: text overlap with arXiv:2111.10508

arXiv:2211.06705 [pdf, ps, other]

Deep Joint Source-Channel Coding Over Cooperative Relay Networks

Authors: Chenghong Bian, Yulin Shao, Haotian Wu, Deniz Gunduz

Abstract: This paper presents a novel deep joint source-channel coding (DeepJSCC) scheme for image transmission over a half-duplex cooperative relay channel. Specifically, we apply DeepJSCC to two basic modes of cooperative communications, namely amplify-and-forward (AF) and decode-and-forward (DF). In DeepJSCC-AF, the relay simply amplifies and forwards its received signal. In DeepJSCC-DF, on the other han… ▽ More This paper presents a novel deep joint source-channel coding (DeepJSCC) scheme for image transmission over a half-duplex cooperative relay channel. Specifically, we apply DeepJSCC to two basic modes of cooperative communications, namely amplify-and-forward (AF) and decode-and-forward (DF). In DeepJSCC-AF, the relay simply amplifies and forwards its received signal. In DeepJSCC-DF, on the other hand, the relay first reconstructs the transmitted image and then re-encodes it before forwarding. Considering the excessive computation overhead of DeepJSCC-DF for recovering the image at the relay, we propose an alternative scheme, called DeepJSCC-PF, in which the relay processes and forwards its received signal without necessarily recovering the image. Simulation results show that the proposed DeepJSCC-AF, DF, and PF schemes are superior to the digital baselines with BPG compression with polar codes and provides a graceful performance degradation with deteriorating channel quality. Further investigation shows that the PSNR gain of DeepJSCC-DF/PF over DeepJSCC-AF improves as the channel condition between the source and relay improves. Moreover, DeepJSCC-PF scheme achieves a similar performance to DeepJSCC-DF with lower computational complexity. △ Less

Submitted 18 March, 2024; v1 submitted 12 November, 2022; originally announced November 2022.

Comments: Accepted to IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN) 2024, code available via this link https://github.com/aprilbian/Relay_JSCC

arXiv:2211.01730 [pdf, other]

Feedback is Good, Active Feedback is Better: Block Attention Active Feedback Codes

Authors: Emre Ozfatura, Yulin Shao, Amin Ghazanfari, Alberto Perotti, Branislav Popovic, Deniz Gunduz

Abstract: Deep neural network (DNN)-assisted channel coding designs, such as low-complexity neural decoders for existing codes, or end-to-end neural-network-based auto-encoder designs are gaining interest recently due to their improved performance and flexibility; particularly for communication scenarios in which high-performing structured code designs do not exist. Communication in the presence of feedback… ▽ More Deep neural network (DNN)-assisted channel coding designs, such as low-complexity neural decoders for existing codes, or end-to-end neural-network-based auto-encoder designs are gaining interest recently due to their improved performance and flexibility; particularly for communication scenarios in which high-performing structured code designs do not exist. Communication in the presence of feedback is one such communication scenario, and practical code design for feedback channels has remained an open challenge in coding theory for many decades. Recently, DNN-based designs have shown impressive results in exploiting feedback. In particular, generalized block attention feedback (GBAF) codes, which utilizes the popular transformer architecture, achieved significant improvement in terms of the block error rate (BLER) performance. However, previous works have focused mainly on passive feedback, where the transmitter observes a noisy version of the signal at the receiver. In this work, we show that GBAF codes can also be used for channels with active feedback. We implement a pair of transformer architectures, at the transmitter and the receiver, which interact with each other sequentially, and achieve a new state-of-the-art BLER performance, especially in the low SNR regime. △ Less

Submitted 3 November, 2022; originally announced November 2022.

arXiv:2210.16985 [pdf, other]

Space-time design for deep joint source channel coding of images Over MIMO channels

Authors: Chenghong Bian, Yulin Shao, Haotian Wu, Deniz Gunduz

Abstract: We propose novel deep joint source-channel coding (DeepJSCC) algorithms for wireless image transmission over multi-input multi-output (MIMO) Rayleigh fading channels, when channel state information (CSI) is available only at the receiver. We consider two different schemes; one exploiting the spatial diversity and the other exploiting the spatial multiplexing gain of the MIMO channel, respectively.… ▽ More We propose novel deep joint source-channel coding (DeepJSCC) algorithms for wireless image transmission over multi-input multi-output (MIMO) Rayleigh fading channels, when channel state information (CSI) is available only at the receiver. We consider two different schemes; one exploiting the spatial diversity and the other exploiting the spatial multiplexing gain of the MIMO channel, respectively. For the former, we utilize an orthogonal space-time block code (OSTBC) to achieve full diversity and increase the robustness against channel variations. In the latter, we directly map the input to the antennas, where the additional degree of freedom can be used to send more information about the source signal. Simulation results show that the diversity scheme outperforms the multiplexing scheme for lower signal-to-noise ratio (SNR) values and a smaller number of receive antennas at the AP. When the number of transmit antennas is greater than two, however, the full-diversity scheme becomes less beneficial. We also show that both the diversity and multiplexing schemes can achieve comparable performance with the state-of-the-art BPG algorithm delivered at the instantaneous capacity of the MIMO channel, which serves as an upper bound on the performance of separation-based practical systems. △ Less

Submitted 20 June, 2023; v1 submitted 30 October, 2022; originally announced October 2022.

Comments: Accepted to SPAWC 2023, 5 pages

arXiv:2210.15347 [pdf, other]

Vision Transformer for Adaptive Image Transmission over MIMO Channels

Authors: Haotian Wu, Yulin Shao, Chenghong Bian, Krystian Mikolajczyk, Deniz Gündüz

Abstract: This paper presents a vision transformer (ViT) based joint source and channel coding (JSCC) scheme for wireless image transmission over multiple-input multiple-output (MIMO) systems, called ViT-MIMO. The proposed ViT-MIMO architecture, in addition to outperforming separation-based benchmarks, can flexibly adapt to different channel conditions without requiring retraining. Specifically, exploiting… ▽ More This paper presents a vision transformer (ViT) based joint source and channel coding (JSCC) scheme for wireless image transmission over multiple-input multiple-output (MIMO) systems, called ViT-MIMO. The proposed ViT-MIMO architecture, in addition to outperforming separation-based benchmarks, can flexibly adapt to different channel conditions without requiring retraining. Specifically, exploiting the self-attention mechanism of the ViT enables the proposed ViT-MIMO model to adaptively learn the feature map** and power allocation based on the source image and channel conditions. Numerical experiments show that ViT-MIMO can significantly improve the transmission quality cross a large variety of scenarios, including varying channel conditions, making it an attractive solution for emerging semantic communication systems. △ Less

Submitted 27 October, 2022; originally announced October 2022.

MSC Class: 94A24 ACM Class: E.4

arXiv:2208.08342 [pdf, other]

Semantic Communications with Discrete-time Analog Transmission: A PAPR Perspective

Authors: Yulin Shao, Deniz Gunduz

Abstract: Recent progress in deep learning (DL)-based joint source-channel coding (DeepJSCC) has led to a new paradigm of semantic communications. Two salient features of DeepJSCC-based semantic communications are the exploitation of semantic-aware features directly from the source signal, and the discrete-time analog transmission (DTAT) of these features. Compared with traditional digital communications, s… ▽ More Recent progress in deep learning (DL)-based joint source-channel coding (DeepJSCC) has led to a new paradigm of semantic communications. Two salient features of DeepJSCC-based semantic communications are the exploitation of semantic-aware features directly from the source signal, and the discrete-time analog transmission (DTAT) of these features. Compared with traditional digital communications, semantic communications with DeepJSCC provide superior reconstruction performance at the receiver and graceful degradation with diminishing channel quality, but also exhibit a large peak-to-average power ratio (PAPR) in the transmitted signal. An open question has been whether the gains of DeepJSCC come from the additional freedom brought by the high-PAPR continuous-amplitude signal. In this paper, we address this question by exploring three PAPR reduction techniques in the application of image transmission. We confirm that the superior image reconstruction performance of DeepJSCC-based semantic communications can be retained while the transmitted PAPR is suppressed to an acceptable level. This observation is an important step towards the implementation of DeepJSCC in practical semantic communication systems. △ Less

Submitted 29 December, 2022; v1 submitted 17 August, 2022; originally announced August 2022.

Comments: Keywords: semantic communication, DeepJSCC, discrete-time analog transmission, PAPR

arXiv:2207.14631 [pdf, other]

Phase Code Discovery for Pulse Compression Radar: A Genetic Algorithm Approach

Authors: Xinyan Xie, Runxin Zhang, Yulin Shao, Lu Lu

Abstract: Discovering sequences with desired properties has long been an interesting intellectual pursuit. In pulse compression radar (PCR), discovering phase codes with low aperiodic autocorrelations is essential for a good estimation performance. The design of phase code, however, is mathematically non-trivial as the aperiodic autocorrelation properties of a sequence are intractable to characterize. In th… ▽ More Discovering sequences with desired properties has long been an interesting intellectual pursuit. In pulse compression radar (PCR), discovering phase codes with low aperiodic autocorrelations is essential for a good estimation performance. The design of phase code, however, is mathematically non-trivial as the aperiodic autocorrelation properties of a sequence are intractable to characterize. In this paper, we put forth a genetic algorithm (GA) approach to discover new phase codes for PCR with the mismatched filter (MMF) receiver. The developed GA, dubbed GASeq, discovers better phase codes than the state of the art. At a code length of 59, the sequence discovered by GASeq achieves a signal-to-clutter ratio (SCR) of 50.84, while the best-known sequence has an SCR of 45.16. In addition, the efficiency and scalability of GASeq enable us to search phase codes with a longer code length, which thwarts existing deep learning-based approaches. At a code length of 100, the best phase code discovered by GASeq exhibit an SCR of 63.23. △ Less

Submitted 29 July, 2022; originally announced July 2022.

Comments: Keywords: Genetic algorithm, pulse compression radar, phase code, mismatched receiver, signal-to-clutter ratio

arXiv:2207.06309 [pdf, other]

Dynamic gNodeB Sleep Control for Energy-Conserving 5G Radio Access Network

Authors: Pengfei Shen, Yulin Shao, Qi Cao, Lu Lu

Abstract: 5G radio access network (RAN) is consuming much more energy than legacy RAN due to the denser deployments of gNodeBs (gNBs) and higher single-gNB power consumption. In an effort to achieve an energy-conserving RAN, this paper develops a dynamic on-off switching paradigm, where the ON/OFF states of gNBs can be dynamically configured according to the evolvements of the associated users. We formulate… ▽ More 5G radio access network (RAN) is consuming much more energy than legacy RAN due to the denser deployments of gNodeBs (gNBs) and higher single-gNB power consumption. In an effort to achieve an energy-conserving RAN, this paper develops a dynamic on-off switching paradigm, where the ON/OFF states of gNBs can be dynamically configured according to the evolvements of the associated users. We formulate the dynamic sleep control for a cluster of gNBs as a Markov decision process (MDP) and analyze various switching policies to reduce the energy expenditure. The optimal policy of the MDP that minimizes the energy expenditure can be derived from dynamic programming, but the computation is expensive. To circumvent this issue, this paper puts forth a greedy policy and an index policy for gNB sleep control. When there is no constraint on the number of gNBs that can be turned off, we prove the dual-threshold structure of the greedy policy and analyze its connections with the optimal policy. Inspired by the dual-threshold structure and Whittle index, we develop an index policy by decoupling the original MDP into multiple one-dimensional MDPs -- the indexability of the decoupled MDP is proven and an algorithm to compute the index is proposed. Extensive simulation results verify that the index policy exhibits close-to-optimal performance in terms of the energy expenditure of the gNB cluster. As far as the computational complexity is concerned, on the other hand, the index policy is much more efficient than the optimal policy, which is computationally prohibitive when the number of gNBs is large. △ Less

Submitted 13 July, 2022; originally announced July 2022.

Comments: Keywords: Base station sleep control, 5G, radio access network, Markov decision process, greedy policy, index policy

arXiv:2207.03605 [pdf, other]

Learning-based Autonomous Channel Access in the Presence of Hidden Terminals

Authors: Yulin Shao, Yucheng Cai, Taotao Wang, Ziyang Guo, Peng Liu, Jiajun Luo, Deniz Gunduz

Abstract: We consider the problem of autonomous channel access (AutoCA), where a group of terminals tries to discover a communication strategy with an access point (AP) via a common wireless channel in a distributed fashion. Due to the irregular topology and the limited communication range of terminals, a practical challenge for AutoCA is the hidden terminal problem, which is notorious in wireless networks… ▽ More We consider the problem of autonomous channel access (AutoCA), where a group of terminals tries to discover a communication strategy with an access point (AP) via a common wireless channel in a distributed fashion. Due to the irregular topology and the limited communication range of terminals, a practical challenge for AutoCA is the hidden terminal problem, which is notorious in wireless networks for deteriorating the throughput and delay performances. To meet the challenge, this paper presents a new multi-agent deep reinforcement learning paradigm, dubbed MADRL-HT, tailored for AutoCA in the presence of hidden terminals. MADRL-HT exploits topological insights and transforms the observation space of each terminal into a scalable form independent of the number of terminals. To compensate for the partial observability, we put forth a look-back mechanism such that the terminals can infer behaviors of their hidden terminals from the carrier sensed channel states as well as feedback from the AP. A window-based global reward function is proposed, whereby the terminals are instructed to maximize the system throughput while balancing the terminals' transmission opportunities over the course of learning. Extensive numerical experiments verified the superior performance of our solution benchmarked against the legacy carrier-sense multiple access with collision avoidance (CSMA/CA) protocol. △ Less

Submitted 2 December, 2022; v1 submitted 7 July, 2022; originally announced July 2022.

Comments: Keywords: multiple channel access, hidden terminal, multi-agent deep reinforcement learning, Wi-Fi, proximal policy optimization

arXiv:2206.09457 [pdf, other]

All you need is feedback: Communication with block attention feedback codes

Authors: Emre Ozfatura, Yulin Shao, Alberto Perotti, Branislav Popovic, Deniz Gunduz

Abstract: Deep learning based channel code designs have recently gained interest as an alternative to conventional coding algorithms, particularly for channels for which existing codes do not provide effective solutions. Communication over a feedback channel is one such problem, for which promising results have recently been obtained by employing various deep learning architectures. In this paper, we introd… ▽ More Deep learning based channel code designs have recently gained interest as an alternative to conventional coding algorithms, particularly for channels for which existing codes do not provide effective solutions. Communication over a feedback channel is one such problem, for which promising results have recently been obtained by employing various deep learning architectures. In this paper, we introduce a novel learning-aided code design for feedback channels, called generalized block attention feedback (GBAF) codes, which i) employs a modular architecture that can be implemented using different neural network architectures; ii) provides order-of-magnitude improvements in the probability of error compared to existing designs; and iii) can transmit at desired code rates. △ Less

Submitted 5 October, 2022; v1 submitted 19 June, 2022; originally announced June 2022.

arXiv:2206.08864 [pdf, other]

Avoid Overfitting User Specific Information in Federated Keyword Spotting

Authors: Xin-Chun Li, **-Lin Tang, Shaoming Song, Bingshuai Li, Yinchuan Li, Yunfeng Shao, Le Gan, De-Chuan Zhan

Abstract: Keyword spotting (KWS) aims to discriminate a specific wake-up word from other signals precisely and efficiently for different users. Recent works utilize various deep networks to train KWS models with all users' speech data centralized without considering data privacy. Federated KWS (FedKWS) could serve as a solution without directly sharing users' data. However, the small amount of data, differe… ▽ More Keyword spotting (KWS) aims to discriminate a specific wake-up word from other signals precisely and efficiently for different users. Recent works utilize various deep networks to train KWS models with all users' speech data centralized without considering data privacy. Federated KWS (FedKWS) could serve as a solution without directly sharing users' data. However, the small amount of data, different user habits, and various accents could lead to fatal problems, e.g., overfitting or weight divergence. Hence, we propose several strategies to encourage the model not to overfit user-specific information in FedKWS. Specifically, we first propose an adversarial learning strategy, which updates the downloaded global model against an overfitted local model and explicitly encourages the global model to capture user-invariant information. Furthermore, we propose an adaptive local training strategy, letting clients with more training data and more uniform class distributions undertake more local update steps. Equivalently, this strategy could weaken the negative impacts of those users whose data is less qualified. Our proposed FedKWS-UI could explicitly and implicitly learn user-invariant information in FedKWS. Abundant experimental results on federated Google Speech Commands verify the effectiveness of FedKWS-UI. △ Less

Submitted 17 June, 2022; originally announced June 2022.

Comments: Accepted by Interspeech 2022

arXiv:2205.02417 [pdf, ps, other]

doi 10.1109/LWC.2022.3204837

Channel-Adaptive Wireless Image Transmission with OFDM

Authors: Haotian Wu, Yulin Shao, Krystian Mikolajczyk, Deniz Gündüz

Abstract: We present a learning-based channel-adaptive joint source and channel coding (CA-JSCC) scheme for wireless image transmission over multipath fading channels. The proposed method is an end-to-end autoencoder architecture with a dual-attention mechanism employing orthogonal frequency division multiplexing (OFDM) transmission. Unlike the previous works, our approach is adaptive to channel-gain and no… ▽ More We present a learning-based channel-adaptive joint source and channel coding (CA-JSCC) scheme for wireless image transmission over multipath fading channels. The proposed method is an end-to-end autoencoder architecture with a dual-attention mechanism employing orthogonal frequency division multiplexing (OFDM) transmission. Unlike the previous works, our approach is adaptive to channel-gain and noise-power variations by exploiting the estimated channel state information (CSI). Specifically, with the proposed dual-attention mechanism, our model can learn to map the features and allocate transmission-power resources judiciously based on the estimated CSI. Extensive numerical experiments verify that CA-JSCC achieves state-of-the-art performance among existing JSCC schemes. In addition, CA-JSCC is robust to varying channel conditions and can better exploit the limited channel resources by transmitting critical features over better subchannels. △ Less

Submitted 8 September, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

Comments: IEEE Wireless Communications Letters

MSC Class: 94A24 ACM Class: E.4

arXiv:2204.03851 [pdf, other]

Defense against Adversarial Attacks on Hybrid Speech Recognition using Joint Adversarial Fine-tuning with Denoiser

Authors: Sonal Joshi, Saurabh Kataria, Yiwen Shao, Piotr Zelasko, Jesus Villalba, Sanjeev Khudanpur, Najim Dehak

Abstract: Adversarial attacks are a threat to automatic speech recognition (ASR) systems, and it becomes imperative to propose defenses to protect them. In this paper, we perform experiments to show that K2 conformer hybrid ASR is strongly affected by white-box adversarial attacks. We propose three defenses--denoiser pre-processor, adversarially fine-tuning ASR model, and adversarially fine-tuning joint mod… ▽ More Adversarial attacks are a threat to automatic speech recognition (ASR) systems, and it becomes imperative to propose defenses to protect them. In this paper, we perform experiments to show that K2 conformer hybrid ASR is strongly affected by white-box adversarial attacks. We propose three defenses--denoiser pre-processor, adversarially fine-tuning ASR model, and adversarially fine-tuning joint model of ASR and denoiser. Our evaluation shows denoiser pre-processor (trained on offline adversarial examples) fails to defend against adaptive white-box attacks. However, adversarially fine-tuning the denoiser using a tandem model of denoiser and ASR offers more robustness. We evaluate two variants of this defense--one updating parameters of both models and the second kee** ASR frozen. The joint model offers a mean absolute decrease of 19.3\% ground truth (GT) WER with reference to baseline against fast gradient sign method (FGSM) attacks with different $L_\infty$ norms. The joint model with frozen ASR parameters gives the best defense against projected gradient descent (PGD) with 7 iterations, yielding a mean absolute increase of 22.3\% GT WER with reference to baseline; and against PGD with 500 iterations, yielding a mean absolute decrease of 45.08\% GT WER and an increase of 68.05\% adversarial target WER. △ Less

Submitted 8 April, 2022; originally announced April 2022.

Comments: Submitted to Interspeech 2022

arXiv:2203.09932 [pdf, other]

Efficient FFT Computation in IFDMA Transceivers

Authors: Yuyang Du, Soung Chang Liew, Yulin Shao

Abstract: Interleaved Frequency Division Multiple Access (IFDMA) has the salient advantage of lower Peak-to-Average Power Ratio (PAPR) than its competitors like Orthogonal FDMA (OFDMA). A recent research effort put forth a new IFDMA transceiver design significantly less complex than conventional IFDMA transceivers. The new IFDMA transceiver design reduces the complexity by exploiting a certain correspondenc… ▽ More Interleaved Frequency Division Multiple Access (IFDMA) has the salient advantage of lower Peak-to-Average Power Ratio (PAPR) than its competitors like Orthogonal FDMA (OFDMA). A recent research effort put forth a new IFDMA transceiver design significantly less complex than conventional IFDMA transceivers. The new IFDMA transceiver design reduces the complexity by exploiting a certain correspondence between the IFDMA signal processing and the Cooley-Tukey IFFT/FFT algorithmic structure so that IFDMA streams can be inserted/extracted at different stages of an IFFT/FFT module according to the sizes of the streams. Although the prior work has laid down the theoretical foundation for the new IFDMA transceiver's structure, the practical realization of the transceiver on specific hardware with resource constraints has not been carefully investigated. This paper is an attempt to fill the gap. Specifically, this paper puts forth a heuristic algorithm called multi-priority scheduling (MPS) to schedule the execution of the butterfly computations in the IFDMA transceiver with the constraint of a limited number of hardware processors. The resulting FFT computation, referred to as MPS-FFT, has a much lower computation time than conventional FFT computation when applied to the IFDMA signal processing. Importantly, we derive a lower bound for the optimal IFDMA FFT computation time to benchmark MPS-FFT. Our experimental results indicate that when the number of hardware processors is a power of two: 1) MPS-FFT has near-optimal computation time; 2) MPS-FFT incurs less than 44.13\% of the computation time of the conventional pipelined FFT. △ Less

Submitted 5 March, 2022; originally announced March 2022.

arXiv:2203.01429

SMTNet: Hierarchical cavitation intensity recognition based on sub-main transfer network

Authors: Yu Sha, Johannes Faber, Shui** Gou, Bo Liu, Wei Li, Stefan Schramm, Horst Stoecker, Thomas Steckenreiter, Domagoj Vnucec, Nadine Wetzstein, Andreas Widl, Kai Zhou

Abstract: With the rapid development of smart manufacturing, data-driven machinery health management has been of growing attention. In situations where some classes are more difficult to be distinguished compared to others and where classes might be organised in a hierarchy of categories, current DL methods can not work well. In this study, a novel hierarchical cavitation intensity recognition framework usi… ▽ More With the rapid development of smart manufacturing, data-driven machinery health management has been of growing attention. In situations where some classes are more difficult to be distinguished compared to others and where classes might be organised in a hierarchy of categories, current DL methods can not work well. In this study, a novel hierarchical cavitation intensity recognition framework using Sub-Main Transfer Network, termed SMTNet, is proposed to classify acoustic signals of valve cavitation. SMTNet model outputs multiple predictions ordered from coarse to fine along a network corresponding to a hierarchy of target cavitation states. Firstly, a data augmentation method based on Sliding Window with Fast Fourier Transform (Swin-FFT) is developed to solve few-shot problem. Secondly, a 1-D double hierarchical residual block (1-D DHRB) is presented to capture sensitive features of the frequency domain valve acoustic signals. Thirdly, hierarchical multi-label tree is proposed to assist the embedding of the semantic structure of target cavitation states into SMTNet. Fourthly, experience filtering mechanism is proposed to fully learn a prior knowledge of cavitation detection model. Finally, SMTNet has been evaluated on two cavitation datasets without noise (Dataset 1 and Dataset 2), and one cavitation dataset with real noise (Dataset 3) provided by SAMSON AG (Frankfurt). The prediction accurcies of SMTNet for cavitation intensity recognition are as high as 95.32%, 97.16% and 100%, respectively. At the same time, the testing accuracies of SMTNet for cavitation detection are as high as 97.02%, 97.64% and 100%. In addition, SMTNet has also been tested for different frequencies of samples and has achieved excellent results of the highest frequency of samples of mobile phones. △ Less

Submitted 12 July, 2023; v1 submitted 1 March, 2022; originally announced March 2022.

Comments: we need update this paper

arXiv:2203.01118 [pdf, other]

doi 10.1016/j.engappai.2022.104904

A multi-task learning for cavitation detection and cavitation intensity recognition of valve acoustic signals

Authors: Yu Sha, Johannes Faber, Shui** Gou, Bo Liu, Wei Li, Stefan Schramm, Horst Stoecker, Thomas Steckenreiter, Domagoj Vnucec, Nadine Wetzstein, Andreas Widl, Kai Zhou

Abstract: With the rapid development of smart manufacturing, data-driven machinery health management has received a growing attention. As one of the most popular methods in machinery health management, deep learning (DL) has achieved remarkable successes. However, due to the issues of limited samples and poor separability of different cavitation states of acoustic signals, which greatly hinder the eventual… ▽ More With the rapid development of smart manufacturing, data-driven machinery health management has received a growing attention. As one of the most popular methods in machinery health management, deep learning (DL) has achieved remarkable successes. However, due to the issues of limited samples and poor separability of different cavitation states of acoustic signals, which greatly hinder the eventual performance of DL modes for cavitation intensity recognition and cavitation detection. In this work, a novel multi-task learning framework for simultaneous cavitation detection and cavitation intensity recognition framework using 1-D double hierarchical residual networks (1-D DHRN) is proposed for analyzing valves acoustic signals. Firstly, a data augmentation method based on sliding window with fast Fourier transform (Swin-FFT) is developed to alleviate the small-sample issue confronted in this study. Secondly, a 1-D double hierarchical residual block (1-D DHRB) is constructed to capture sensitive features from the frequency domain acoustic signals of valve. Then, a new structure of 1-D DHRN is proposed. Finally, the devised 1-D DHRN is evaluated on two datasets of valve acoustic signals without noise (Dataset 1 and Dataset 2) and one dataset of valve acoustic signals with realistic surrounding noise (Dataset 3) provided by SAMSON AG (Frankfurt). Our method has achieved state-of-the-art results. The prediction accurcies of 1-D DHRN for cavitation intensitys recognition are as high as 93.75%, 94.31% and 100%, which indicates that 1-D DHRN outperforms other DL models and conventional methods. At the same time, the testing accuracies of 1-D DHRN for cavitation detection are as high as 97.02%, 97.64% and 100%. In addition, 1-D DHRN has also been tested for different frequencies of samples and shows excellent results for frequency of samples that mobile phones can accommodate. △ Less

Submitted 20 April, 2022; v1 submitted 1 March, 2022; originally announced March 2022.

Comments: arXiv admin note: text overlap with arXiv:2202.13226

Journal ref: Engineering Applications of Artificial Intelligence, 113 (2022), 104904

arXiv:2202.13245 [pdf, other]

doi 10.1145/3534678.3539133

Regional-Local Adversarially Learned One-Class Classifier Anomalous Sound Detection in Global Long-Term Space

Authors: Yu Sha, Johannes Faber, Shui** Gou, Bo Liu, Wei Li, Stefan Schramm, Horst Stoecker, Thomas Steckenreiter, Domagoj Vnucec, Nadine Wetzstein, Andreas Widl, Kai Zhou

Abstract: Anomalous sound detection (ASD) is one of the most significant tasks of mechanical equipment monitoring and maintaining in complex industrial systems. In practice, it is vital to precisely identify abnormal status of the working mechanical system, which can further facilitate the failure troubleshooting. In this paper, we propose a multi-pattern adversarial learning one-class classification framew… ▽ More Anomalous sound detection (ASD) is one of the most significant tasks of mechanical equipment monitoring and maintaining in complex industrial systems. In practice, it is vital to precisely identify abnormal status of the working mechanical system, which can further facilitate the failure troubleshooting. In this paper, we propose a multi-pattern adversarial learning one-class classification framework, which allows us to use both the generator and the discriminator of an adversarial model for efficient ASD. The core idea is learning to reconstruct the normal patterns of acoustic data through two different patterns of auto-encoding generators, which succeeds in extending the fundamental role of a discriminator from identifying real and fake data to distinguishing between regional and local pattern reconstructions. Furthermore, we present a global filter layer for long-term interactions in the frequency domain space, which directly learns from the original data without introducing any human priors. Extensive experiments performed on four real-world datasets from different industrial domains (three cavitation datasets provided by SAMSON AG, and one existing publicly) for anomaly detection show superior results, and outperform recent state-of-the-art ASD methods. △ Less

Submitted 26 February, 2022; originally announced February 2022.

Journal ref: KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2022

arXiv:2202.13226 [pdf, other]

doi 10.1016/j.measurement.2022.110897

An acoustic signal cavitation detection framework based on XGBoost with adaptive selection feature engineering

Authors: Yu Sha, Johannes Faber, Shui** Gou, Bo Liu, Wei Li, Stefan Schramm, Horst Stoecker, Thomas Steckenreiter, Domagoj Vnucec, Nadine Wetzstein, Andreas Widl, Kai Zhou

Abstract: Valves are widely used in industrial and domestic pipeline systems. However, during their operation, they may suffer from the occurrence of the cavitation, which can cause loud noise, vibration and damage to the internal components of the valve. Therefore, monitoring the flow status inside valves is significantly beneficial to prevent the additional cost induced by cavitation. In this paper, a nov… ▽ More Valves are widely used in industrial and domestic pipeline systems. However, during their operation, they may suffer from the occurrence of the cavitation, which can cause loud noise, vibration and damage to the internal components of the valve. Therefore, monitoring the flow status inside valves is significantly beneficial to prevent the additional cost induced by cavitation. In this paper, a novel acoustic signal cavitation detection framework--based on XGBoost with adaptive selection feature engineering--is proposed. Firstly, a data augmentation method with non-overlap** sliding window (NOSW) is developed to solve small-sample problem involved in this study. Then, the each segmented piece of time-domain acoustic signal is transformed by fast Fourier transform (FFT) and its statistical features are extracted to be the input to the adaptive selection feature engineering (ASFE) procedure, where the adaptive feature aggregation and feature crosses are performed. Finally, with the selected features the XGBoost algorithm is trained for cavitation detection and tested on valve acoustic signal data provided by Samson AG (Frankfurt). Our method has achieved state-of-the-art results. The prediction performance on the binary classification (cavitation and no-cavitation) and the four-class classification (cavitation choked flow, constant cavitation, incipient cavitation and no-cavitation) are satisfactory and outperform the traditional XGBoost by 4.67% and 11.11% increase of the accuracy. △ Less

Submitted 1 March, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

Journal ref: Measurement 192 (2022), 110897

arXiv:2202.03433 [pdf, other]

A Coarse-to-fine Morphological Approach With Knowledge-based Rules and Self-adapting Correction for Lung Nodules Segmentation

Authors: Xinliang Fu, Jiayin Zheng, Juanyun Mai, Yanbo Shao, Minghao Wang, Linyu Li, Zhaoqi Diao, Yulong Chen, Jianyu Xiao, Jian You, Airu Yin, Yang Yang, Xiangcheng Qiu, **sheng Tao, Bo Wang, Hua Ji

Abstract: The segmentation module which precisely outlines the nodules is a crucial step in a computer-aided diagnosis(CAD) system. The most challenging part of such a module is how to achieve high accuracy of the segmentation, especially for the juxtapleural, non-solid and small nodules. In this research, we present a coarse-to-fine methodology that greatly improves the thresholding method performance with… ▽ More The segmentation module which precisely outlines the nodules is a crucial step in a computer-aided diagnosis(CAD) system. The most challenging part of such a module is how to achieve high accuracy of the segmentation, especially for the juxtapleural, non-solid and small nodules. In this research, we present a coarse-to-fine methodology that greatly improves the thresholding method performance with a novel self-adapting correction algorithm and effectively removes noisy pixels with well-defined knowledge-based principles. Compared with recent strong morphological baselines, our algorithm, by combining dataset features, achieves state-of-the-art performance on both the public LIDC-IDRI dataset (DSC 0.699) and our private LC015 dataset (DSC 0.760) which closely approaches the SOTA deep learning-based models' performances. Furthermore, unlike most available morphological methods that can only segment the isolated and well-circumscribed nodules accurately, the precision of our method is totally independent of the nodule type or diameter, proving its applicability and generality. △ Less

Submitted 7 February, 2022; originally announced February 2022.

arXiv:2201.13392

MHSnet: Multi-head and Spatial Attention Network with False-Positive Reduction for Pulmonary Nodules Detection

Authors: Juanyun Mai, Minghao Wang, Jiayin Zheng, Yanbo Shao, Zhaoqi Diao, Xinliang Fu, Yulong Chen, Jianyu Xiao, Jian You, Airu Yin, Yang Yang, Xiangcheng Qiu, **sheng Tao, Bo Wang, Hua Ji

Abstract: The mortality of lung cancer has ranked high among cancers for many years. Early detection of lung cancer is critical for disease prevention, cure, and mortality rate reduction. However, existing detection methods on pulmonary nodules introduce an excessive number of false positive proposals in order to achieve high sensitivity, which is not practical in clinical situations. In this paper, we prop… ▽ More The mortality of lung cancer has ranked high among cancers for many years. Early detection of lung cancer is critical for disease prevention, cure, and mortality rate reduction. However, existing detection methods on pulmonary nodules introduce an excessive number of false positive proposals in order to achieve high sensitivity, which is not practical in clinical situations. In this paper, we propose the multi-head detection and spatial squeeze-and-attention network, MHSnet, to detect pulmonary nodules, in order to aid doctors in the early diagnosis of lung cancers. Specifically, we first introduce multi-head detectors and skip connections to customize for the variety of nodules in sizes, shapes and types and capture multi-scale features. Then, we implement a spatial attention module to enable the network to focus on different regions differently inspired by how experienced clinicians screen CT images, which results in fewer false positive proposals. Lastly, we present a lightweight but effective false positive reduction module with the Linear Regression model to cut down the number of false positive proposals, without any constraints on the front network. Extensive experimental results compared with the state-of-the-art models have shown the superiority of the MHSnet in terms of the average FROC, sensitivity and especially false discovery rate (2.98% and 2.18% improvement in terms of average FROC and sensitivity, 5.62% and 28.33% decrease in terms of false discovery rate and average candidates per scan). The false positive reduction module significantly decreases the average number of candidates generated per scan by 68.11% and the false discovery rate by 13.48%, which is promising to reduce distracted proposals for the downstream tasks based on the detection results. △ Less

Submitted 12 May, 2022; v1 submitted 31 January, 2022; originally announced January 2022.

Comments: We have to revise the experiment results and conclusions

arXiv:2112.01766 [pdf, other]

Unsupervised Low-Light Image Enhancement via Histogram Equalization Prior

Authors: Feng Zhang, Yuanjie Shao, Yishi Sun, Kai Zhu, Changxin Gao, Nong Sang

Abstract: Deep learning-based methods for low-light image enhancement typically require enormous paired training data, which are impractical to capture in real-world scenarios. Recently, unsupervised approaches have been explored to eliminate the reliance on paired training data. However, they perform erratically in diverse real-world scenarios due to the absence of priors. To address this issue, we propose… ▽ More Deep learning-based methods for low-light image enhancement typically require enormous paired training data, which are impractical to capture in real-world scenarios. Recently, unsupervised approaches have been explored to eliminate the reliance on paired training data. However, they perform erratically in diverse real-world scenarios due to the absence of priors. To address this issue, we propose an unsupervised low-light image enhancement method based on an effective prior termed histogram equalization prior (HEP). Our work is inspired by the interesting observation that the feature maps of histogram equalization enhanced image and the ground truth are similar. Specifically, we formulate the HEP to provide abundant texture and luminance information. Embedded into a Light Up Module (LUM), it helps to decompose the low-light images into illumination and reflectance maps, and the reflectance maps can be regarded as restored images. However, the derivation based on Retinex theory reveals that the reflectance maps are contaminated by noise. We introduce a Noise Disentanglement Module (NDM) to disentangle the noise and content in the reflectance maps with the reliable aid of unpaired clean images. Guided by the histogram equalization prior and noise disentanglement, our method can recover finer details and is more capable to suppress noise in real-world low-light scenarios. Extensive experiments demonstrate that our method performs favorably against the state-of-the-art unsupervised low-light enhancement algorithms and even matches the state-of-the-art supervised algorithms. △ Less

Submitted 3 December, 2021; originally announced December 2021.

Comments: submitted to IEEE Transactions on Image Processing

Showing 1–50 of 78 results for author: Sha, Y