Electrical Engineering and Systems Science
- [1] arXiv:2405.19336 [pdf, ps, other]
Title: Image-based retrieval of all-day cloud physical parameters for FY4A/AGRI and its application over the Tibetan Plateau
Authors: Zhijun Zhao (1, 2), Feng Zhang (1, 2), Wenwen Li (1), **gwei Li (1, 2) ((1) CMA-FDU Joint Laboratory of Marine Meteorology, Department of Atmospheric and Oceanic Sciences, Institutes of Atmospheric Sciences, Fudan University, China, (2) Key Laboratory for Information Science of Electromagnetic Waves, Ministry of Education, School of Information Science and Technology, Fudan University, China)
Subjects: Signal Processing (eess.SP)
Satellite remote sensing serves as a crucial means to acquire cloud physical parameters. However, existing official cloud products derived from the advanced geostationary radiation imager (AGRI) onboard the Fengyun-4A geostationary satellite suffer from limitations in computational precision and efficiency. In this study, an image-based transfer learning model (ITLM) was developed to realize all-day and high-precision retrieval of cloud physical parameters using AGRI thermal infrared measurements and auxiliary data. Combining the observation advantages of geostationary and polar-orbiting satellites, ITLM was pre-trained and transfer-trained with official cloud products from the advanced Himawari imager (AHI) and the Moderate Resolution Imaging Spectroradiometer (MODIS), respectively. Taking official MODIS products as the benchmarks, ITLM achieved an overall accuracy of 79.93% for identifying cloud phase and root mean squared errors of 1.85 km, 6.72 μm, and 12.79 for estimating cloud top height, cloud effective radius, and cloud optical thickness, respectively, outperforming the precision of official AGRI and AHI products. Compared to the pixel-based random forest model, ITLM utilized the spatial information of clouds to significantly improve the retrieval performance and achieve more than a 6-fold increase in speed for a single full-disk retrieval. Moreover, the AGRI ITLM products with spatiotemporal continuity and high precision were used to accurately describe the spatial distribution characteristics of cloud fractions and cloud properties over the Tibetan Plateau (TP) during both daytime and nighttime, and for the first time provide insights into the diurnal variation of cloud cover and cloud properties for total clouds and deep convective clouds across different seasons.
- [2] arXiv:2405.19338 [pdf, ps, other]
Title: Accurate Patient Alignment without Unnecessary Imaging Dose via Synthesizing Patient-specific 3D CT Images from 2D kV Images
Authors: Yuzhen Ding, Jason M. Holmes, Hongying Feng, Baoxin Li, Lisa A. McGee, Jean-Claude M. Rwigema, Sujay A. Vora, Daniel J. Ma, Robert L. Foote, Samir H. Patel, Wei Liu
Comments: 17 pages, 8 figures and tables
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
In radiotherapy, 2D orthogonally projected kV images are used for patient alignment when 3D on-board imaging (OBI) is unavailable. However, tumor visibility is constrained by the projection of the patient's anatomy onto a 2D plane, potentially leading to substantial setup errors. In treatment rooms with 3D-OBI such as cone beam CT (CBCT), the field of view (FOV) of CBCT is limited and the imaging dose is unnecessarily high, making it unfavorable for pediatric patients. A solution to this dilemma is to reconstruct 3D CT from kV images obtained at the treatment position. Here, we propose a dual-model framework built with hierarchical ViT blocks. Unlike a proof-of-concept approach, our framework takes kV images as the sole input and can synthesize accurate, full-size 3D CT in real time (within milliseconds). We demonstrate the feasibility of the proposed approach on 10 patients with head and neck (H&N) cancer using image quality (MAE: <45 HU), dosimetric accuracy (Gamma passing rate (2%/2mm/10%) >97%) and patient position uncertainty (shift error: <0.4 mm). The proposed framework can generate accurate 3D CT faithfully mirroring real-time patient position, thus significantly improving patient setup accuracy, keeping imaging dose to a minimum, and maintaining treatment veracity.
- [3] arXiv:2405.19340 [pdf, ps, other]
Title: Obtaining physical layer data of latest generation networks for investigating adversary attacks
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
The field of machine learning is developing rapidly and is being used in various fields of science and technology. Accordingly, machine learning can be used to optimize the functions of latest generation data networks such as 5G and 6G, including functions at a lower level. A feature of the use of machine learning in the radio path for targeted radiation generation in modern ultra-massive MIMO, reconfigurable intelligent interfaces and other technologies is the complex acquisition and processing of data from the physical layer. Additionally, adversarial measures that manipulate the behaviour of intelligent machine learning models are becoming a major concern, as many machine learning models are sensitive to incorrect input data. To obtain data on attacks directly from processing service information, a simulation model is proposed that works in conjunction with machine learning applications.
- [4] arXiv:2405.19341 [pdf, ps, html, other]
Title: Spatial Impulse Response Analysis and Ensemble Learning for Efficient Precision Level Sensing
Authors: Berkay Cetkin, Lejla Begic Fazlic, Kristof Ueding, Rüdiger Machhamer, Achim Guldner, Lars Creutz, Stefan Naumann, Guido Dartmann
Subjects: Signal Processing (eess.SP)
In this paper, we propose an innovative method for determining the fill level of containers, such as trash cans, addressing a critical aspect of waste management. The method combines spatial impulse response analysis with machine learning techniques, offering a unique and effective approach for sound-based classification that can be extended to various domains beyond waste management. By employing a buzzer-generated sine sweep signal, we create a distinctive signature specific to the fill level of the waste container. This signature is then interpreted by a specially developed ensemble learning algorithm. Our approach achieves a classification accuracy of over 90% when implemented locally on a development board, eliminating the need to delegate complex classification tasks to external entities. Using low-cost and energy-efficient hardware components, our method offers a cost-effective approach that contributes to sustainable and efficient waste management practices, providing a reliable and locally deployable solution.
- [5] arXiv:2405.19345 [pdf, ps, html, other]
Title: Review of Deep Representation Learning Techniques for Brain-Computer Interfaces and Recommendations
Comments: Submitted to: Journal of Neural Engineering (JNE)
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
In the field of brain-computer interfaces (BCIs), the potential for leveraging deep learning techniques for representing electroencephalogram (EEG) signals has gained substantial interest. This review synthesizes empirical findings from a collection of articles using deep representation learning techniques for BCI decoding, to provide a comprehensive analysis of the current state-of-the-art. Each article was scrutinized based on three criteria: (1) the deep representation learning technique employed, (2) the underlying motivation for its utilization, and (3) the approaches adopted for characterizing the learned representations. Among the 81 articles finally reviewed in depth, our analysis reveals a predominance of autoencoders, used in 31 articles. We identified 13 studies employing self-supervised learning (SSL) techniques, among which ten were published in 2022 or later, attesting to the relative youth of the field. However, for the time being, none of these have led to standard foundation models that are picked up by the BCI community. Likewise, only a few studies have introspected their learned representations. We observed that the motivation in most studies for using representation learning techniques is to solve transfer learning tasks, but we also found more specific motivations such as to learn robustness or invariances, as an algorithmic bridge, or finally to uncover the structure of the data. Given the potential of foundation models to effectively tackle these challenges, we advocate for a continued dedication to the advancement of foundation models specifically designed for EEG signal decoding by using SSL techniques. We also underline the imperative of establishing specialized benchmarks and datasets to facilitate the development and continuous improvement of such foundation models.
- [6] arXiv:2405.19346 [pdf, ps, html, other]
Title: Subject-Adaptive Transfer Learning Using Resting State EEG Signals for Cross-Subject EEG Motor Imagery Classification
Comments: Early Accepted at MICCAI 2024
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Electroencephalography (EEG) motor imagery (MI) classification is a fundamental yet challenging task due to the variation of signals between individuals, i.e., inter-subject variability. Previous approaches try to mitigate this using task-specific (TS) EEG signals from the target subject in training. However, recording TS EEG signals requires time and limits its applicability in various fields. In contrast, resting state (RS) EEG signals are a viable alternative due to ease of acquisition with rich subject information. In this paper, we propose a novel subject-adaptive transfer learning strategy that utilizes RS EEG signals to adapt models on unseen subject data. Specifically, we disentangle extracted features into task- and subject-dependent features and use them to calibrate RS EEG signals for obtaining task information while preserving subject characteristics. The calibrated signals are then used to adapt the model to the target subject, enabling the model to simulate processing TS EEG signals of the target subject. The proposed method achieves state-of-the-art accuracy on three public benchmarks, demonstrating the effectiveness of our method in cross-subject EEG MI classification. Our findings highlight the potential of leveraging RS EEG signals to advance practical brain-computer interface systems.
- [7] arXiv:2405.19347 [pdf, ps, html, other]
Title: Near-Field Spot Beamfocusing: A Correlation-Aware Transfer Learning Approach
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
3D spot beamfocusing (SBF), in contrast to conventional angular-domain beamforming, concentrates radiating power within a very small volume in both the radial and angular domains in the near-field zone. Recently, channel-state-information (CSI)-independent machine learning (ML)-based approaches have been developed for effective SBF using extremely-large-scale programmable metasurfaces (ELPMs). These methods involve dividing the ELPMs into subarrays and independently training them with Deep Reinforcement Learning to jointly focus the beam at the Desired Focal Point (DFP). This paper explores near-field SBF using ELPMs, addressing challenges associated with lengthy training times resulting from independent training of subarrays. To achieve a faster CSI-independent solution, inspired by the correlation between the beamfocusing matrices of the subarrays, we leverage transfer learning techniques. First, we introduce a novel similarity criterion based on the Phase Distribution Image of subarray apertures. Then we devise a subarray policy propagation scheme that transfers the knowledge from trained to untrained subarrays. We further enhance learning by introducing Quasi-Liquid-Layers as a revised version of the adaptive policy reuse technique. We show through simulations that the proposed scheme improves the training speed by about 5 times. Furthermore, for dynamic DFP management, we devise a DFP policy blending process, which increases the convergence rate up to 8-fold.
- [8] arXiv:2405.19348 [pdf, ps, other]
Title: NERULA: A Dual-Pathway Self-Supervised Learning Framework for Electrocardiogram Signal Analysis
Comments: Paper in review
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Electrocardiogram (ECG) signals are critical for diagnosing heart conditions and capturing detailed cardiac patterns. As wearable single-lead ECG devices become more common, efficient analysis methods are essential. We present NERULA (Non-contrastive ECG and Reconstruction Unsupervised Learning Algorithm), a self-supervised framework designed for single-lead ECG signals. NERULA's dual-pathway architecture combines ECG reconstruction and non-contrastive learning to extract detailed cardiac features. Our 50% masking strategy, using both masked and inverse-masked signals, enhances model robustness against real-world incomplete or corrupted data. The non-contrastive pathway aligns representations of masked and inverse-masked signals, while the reconstruction pathway comprehends and reconstructs missing features. We show that combining generative and discriminative paths into the training spectrum leads to better results by outperforming state-of-the-art self-supervised learning benchmarks in various tasks, demonstrating superior performance in ECG analysis, including arrhythmia classification, gender classification, age regression, and human activity recognition. NERULA's dual-pathway design offers a robust, efficient solution for comprehensive ECG signal interpretation.
- [9] arXiv:2405.19349 [pdf, ps, other]
Title: Beyond Isolated Frames: Enhancing Sensor-Based Human Activity Recognition through Intra- and Inter-Frame Attention
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Human Activity Recognition (HAR) has become increasingly popular with ubiquitous computing, driven by the popularity of wearable sensors in fields like healthcare and sports. While Convolutional Neural Networks (ConvNets) have significantly contributed to HAR, they often adopt a frame-by-frame analysis, concentrating on individual frames and potentially overlooking the broader temporal dynamics inherent in human activities. To address this, we propose the intra- and inter-frame attention model. This model captures both the nuances within individual frames and the broader contextual relationships across multiple frames, offering a comprehensive perspective on sequential data. We further enrich the temporal understanding by proposing a novel time-sequential batch learning strategy. This learning strategy preserves the chronological sequence of time-series data within each batch, ensuring the continuity and integrity of temporal patterns in sensor-based HAR.
- [10] arXiv:2405.19351 [pdf, ps, other]
Title: Resonate-and-Fire Spiking Neurons for Target Detection and Hand Gesture Recognition: A Hybrid Approach
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Hand gesture recognition using radar often relies on computationally expensive fast Fourier transforms. This paper proposes an alternative approach that bypasses fast Fourier transforms using resonate-and-fire neurons. These neurons directly detect the hand in the time-domain signal, eliminating the need for fast Fourier transforms to retrieve range information. Following detection, a simple Goertzel algorithm is employed to extract five key features, eliminating the need for a second fast Fourier transform. These features are then fed into a recurrent neural network, achieving an accuracy of 98.21% for classifying five gestures. The proposed approach demonstrates competitive performance with reduced complexity compared to traditional methods.
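As an illustration of why a Goertzel step can stand in for a full fast Fourier transform, the following is a minimal sketch of the textbook Goertzel recurrence for a single DFT bin. It is not the authors' implementation; the function name `goertzel` and its arguments are hypothetical, and the abstract does not say which bins or features are extracted.

```python
import cmath
import math


def goertzel(samples, k):
    """Return the DFT coefficient X[k] of `samples` via the Goertzel recurrence.

    Hypothetical helper for illustration: it evaluates one frequency bin in
    O(N) operations instead of computing all N bins with an FFT.
    """
    n = len(samples)
    omega = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(omega)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        # Second-order recurrence: s[n] = x[n] + 2*cos(omega)*s[n-1] - s[n-2]
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Convert the final filter state into the complex DFT bin.
    return cmath.exp(1j * omega) * s_prev - s_prev2
```

When only a handful of bins are needed, calling such a routine once per bin avoids computing the full spectrum, which is the usual motivation for preferring Goertzel over an FFT.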
- [11] arXiv:2405.19356 [pdf, ps, html, other]
Title: An LSTM Feature Imitation Network for Hand Movement Recognition from sEMG Signals
Comments: This work has been submitted to RA-L, and under review
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Surface Electromyography (sEMG) is a non-invasive signal that is used in the recognition of hand movement patterns, the diagnosis of diseases, and the robust control of prostheses. Despite the remarkable success of recent end-to-end Deep Learning approaches, they are still limited by the need for large amounts of labeled data. To alleviate the requirement for big data, researchers utilize Feature Engineering, which involves decomposing the sEMG signal into several spatial, temporal, and frequency features. In this paper, we propose utilizing a feature-imitating network (FIN) for closed-form temporal feature learning over a 300 ms signal window on Ninapro DB2, and applying it to the task of recognizing 17 hand movements. We implement a lightweight LSTM-FIN network to imitate four standard temporal features (entropy, root mean square, variance, simple square integral). We then explore transfer learning capabilities by applying the pre-trained LSTM-FIN for tuning to a downstream hand movement recognition task. We observed that the LSTM network can achieve up to 99% R² accuracy in feature reconstruction and 80% accuracy in hand movement recognition. Our results also showed that the model can be robustly applied for both within- and cross-subject movement recognition, as well as simulated low-latency environments. Overall, our work demonstrates the potential of the FIN modeling paradigm in data-scarce scenarios for sEMG signal processing.
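For readers unfamiliar with the closed-form temporal features named above, the following is a minimal sketch of three of them (root mean square, variance, simple square integral) computed over one signal window. The function names are hypothetical and the definitions are the common textbook ones, which may differ from those used in the paper; in particular the variance normalization is an assumption, and entropy is omitted because the abstract does not say which entropy variant is meant.

```python
import math


def root_mean_square(window):
    """Square root of the mean squared amplitude of the window."""
    return math.sqrt(sum(x * x for x in window) / len(window))


def variance(window):
    """Population variance (divides by N).

    Assumption: some sEMG references divide by N - 1 instead.
    """
    mean = sum(window) / len(window)
    return sum((x - mean) ** 2 for x in window) / len(window)


def simple_square_integral(window):
    """Sum of squared amplitudes, i.e. the signal energy in the window."""
    return sum(x * x for x in window)


def temporal_features(window):
    """Collect the three features into a dict (hypothetical helper)."""
    return {
        "rms": root_mean_square(window),
        "var": variance(window),
        "ssi": simple_square_integral(window),
    }
```

A feature-imitating network would be trained to reproduce values like these from the raw window, so functions of this kind serve as the regression targets rather than as part of the model.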
- [12] arXiv:2405.19359 [pdf, ps, html, other]
Title: Modally Reduced Representation Learning of Multi-Lead ECG Signals through Simultaneous Alignment and Reconstruction
Comments: Accepted as a Workshop Paper at TS4H@ICLR2024
Journal-ref: ICLR 2024 Workshop on Learning from Time Series For Health
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Electrocardiogram (ECG) signals, profiling the electrical activities of the heart, are used for a plethora of diagnostic applications. However, ECG systems require multiple leads or channels of signals to capture the complete view of the cardiac system, which limits their application in smartwatches and wearables. In this work, we propose a modally reduced representation learning method for ECG signals that is capable of generating channel-agnostic, unified representations for ECG signals. Through joint optimization of reconstruction and alignment, we ensure that the embeddings of the different channels contain an amalgamation of the overall information across channels while also retaining their specific information. On an independent test dataset, we generated highly correlated channel embeddings from different ECG channels, leading to a moderate approximation of the 12-lead signals from a single-channel embedding. Our generated embeddings can work as competent features for ECG signals for downstream tasks.
- [13] arXiv:2405.19363 [pdf, ps, html, other]
Title: Medformer: A Multi-Granularity Patching Transformer for Medical Time-Series Classification
Comments: 20 pages (14 pages main paper + 6 pages supplementary materials)
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Medical time series data, such as Electroencephalography (EEG) and Electrocardiography (ECG), play a crucial role in healthcare, such as diagnosing brain and heart diseases. Existing methods for medical time series classification primarily rely on handcrafted biomarkers extraction and CNN-based models, with limited exploration of transformers tailored for medical time series. In this paper, we introduce Medformer, a multi-granularity patching transformer tailored specifically for medical time series classification. Our method incorporates three novel mechanisms to leverage the unique characteristics of medical time series: cross-channel patching to leverage inter-channel correlations, multi-granularity embedding for capturing features at different scales, and two-stage (intra- and inter-granularity) multi-granularity self-attention for learning features and correlations within and among granularities. We conduct extensive experiments on five public datasets under both subject-dependent and challenging subject-independent setups. Results demonstrate Medformer's superiority over 10 baselines, achieving top averaged ranking across five datasets on all six evaluation metrics. These findings underscore the significant impact of our method on healthcare applications, such as diagnosing Myocardial Infarction, Alzheimer's, and Parkinson's disease. We release the source code at this https URL.
- [14] arXiv:2405.19366 [pdf, ps, html, other]
Title: ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological Text
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
The use of deep learning in electrocardiogram (ECG) analysis has advanced the accuracy and efficiency of cardiac healthcare diagnostics. By leveraging the capabilities of deep learning in semantic understanding, especially in feature extraction and representation learning, this study introduces a new multimodal contrastive pretraining framework that aims to improve the quality and robustness of learned representations of 12-lead ECG signals. Our framework comprises two key components: Cardio Query Assistant (CQA) and ECG Semantics Integrator (ESI). CQA integrates a retrieval-augmented generation (RAG) pipeline to leverage large language models (LLMs) and external medical knowledge to generate detailed textual descriptions of ECGs. The generated text is enriched with information about demographics and waveform patterns. ESI integrates both contrastive and captioning loss to pretrain ECG encoders for enhanced representations. We validate our approach through various downstream tasks, including arrhythmia detection and ECG-based subject identification. Our experimental results demonstrate substantial improvements over strong baselines in these tasks. These baselines encompass supervised and self-supervised learning methods, as well as prior multimodal pretraining approaches.
- [15] arXiv:2405.19373 [pdf, ps, html, other]
Title: Multi-modal Mood Reader: Pre-trained Model Empowers Cross-Subject Emotion Recognition
Comments: Accepted by International Conference on Neural Computing for Advanced Applications, 2024
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Emotion recognition based on Electroencephalography (EEG) has gained significant attention and diversified development in fields such as neural signal processing and affective computing. However, the unique brain anatomy of individuals leads to non-negligible natural differences in EEG signals across subjects, posing challenges for cross-subject emotion recognition. While recent studies have attempted to address these issues, they still face limitations in practical effectiveness and model framework unity. Current methods often struggle to capture the complex spatial-temporal dynamics of EEG signals and fail to effectively integrate multimodal information, resulting in suboptimal performance and limited generalizability across subjects. To overcome these limitations, we develop a pre-trained model based Multimodal Mood Reader for cross-subject emotion recognition that utilizes masked brain signal modeling and an interlinked spatial-temporal attention mechanism. The model learns universal latent representations of EEG signals through pre-training on a large-scale dataset, and employs the interlinked spatial-temporal attention mechanism to process Differential Entropy (DE) features extracted from EEG data. Subsequently, a multi-level fusion layer is proposed to integrate the discriminative features, maximizing the advantages of features across different dimensions and modalities. Extensive experiments on public datasets demonstrate Mood Reader's superior performance in cross-subject emotion recognition tasks, outperforming state-of-the-art methods. Additionally, the model is dissected from an attention perspective, providing qualitative analysis of emotion-related brain areas and offering valuable insights for affective research in neural signal processing.
- [16] arXiv:2405.19481 [pdf, ps, html, other]
Title: Integrated Communication and Imaging: Design, Analysis, and Performances of COSMIC Waveforms
Subjects: Signal Processing (eess.SP)
This paper proposes a novel waveform design method named COSMIC (Connectivity-Oriented Sensing Method for Imaging and Communication). These waveforms are engineered to convey communication symbols while adhering to an extended orthogonality condition, enabling their use in generating radio images of the environment. A Multiple-Input Multiple-Output (MIMO) Radar-Communication (RadCom) device transmits COSMIC waveforms from each antenna simultaneously within the same time window and frequency band, indicating that orthogonality is not achieved by space, time, or frequency multiplexing. Indeed, orthogonality among the waveforms is achieved by leveraging the degrees of freedom provided by the assumption that the field of view is limited or significantly smaller than the transmitted signals' length. The RadCom device receives and processes the echoes from an infinite number of infinitesimal scatterers within its field of view, constructing an electromagnetic image of the environment. Concurrently, these waveforms can also carry information to other connected network entities. This work provides the algebraic concepts used to generate COSMIC waveforms. Moreover, an opportunistic optimization of the imaging and communication efficiency is discussed. Simulation results demonstrate that COSMIC waveforms enable accurate environmental imaging while maintaining acceptable communication performances.
- [17] arXiv:2405.19489 [pdf, ps, other]
Title: Optimising RF linear Amplifier for maximum efficiency and linearity
Subjects: Signal Processing (eess.SP)
A method is presented for increasing the efficiency of a radio frequency (RF) amplifier that employs laterally diffused metal oxide semiconductor (LDMOS) transistors and is coupled to an RF exciter, depending on the emission mode of the modulated RF input signals generated by the exciter. If the exciter output signal is of a type where the modulated RF signals do not have a continuously varying envelope, the LDMOS transistors in the RF amplifier are biased with a fixed quiescent drain current and a fixed drain voltage supply so that they operate in compression. If the exciter output signal is of a type where the modulated RF signals do have a continuously varying envelope, the LDMOS transistors in the RF amplifier are biased for linear operation.
- [18] arXiv:2405.19492 [pdf, ps, other]
Title: TotalSegmentator MRI: Sequence-Independent Segmentation of 59 Anatomical Structures in MR images
Authors: Tugba Akinci D'Antonoli, Lucas K. Berger, Ashraya K. Indrakanti, Nathan Vishwanathan, Jakob Weiß, Matthias Jung, Zeynep Berkarda, Alexander Rau, Marco Reisert, Thomas Küstner, Alexandra Walter, Elmar M. Merkle, Martin Segeroth, Joshy Cyriac, Shan Yang, Jakob Wasserthal
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Purpose: To develop an open-source and easy-to-use segmentation model that can automatically and robustly segment most major anatomical structures in MR images independently of the MR sequence.
Materials and Methods: In this study we extended the capabilities of TotalSegmentator to MR images. 298 MR scans and 227 CT scans were used to segment 59 anatomical structures (20 organs, 18 bones, 11 muscles, 7 vessels, 3 tissue types) relevant for use cases such as organ volumetry, disease characterization, and surgical planning. The MR and CT images were randomly sampled from routine clinical studies and thus represent a real-world dataset (different ages, pathologies, scanners, body parts, sequences, contrasts, echo times, repetition times, field strengths, slice thicknesses and sites). We trained an nnU-Net segmentation algorithm on this dataset and calculated Dice similarity coefficients (Dice) to evaluate the model's performance.
Results: The model showed a Dice score of 0.824 (CI: 0.801, 0.842) on the test set, which included a wide range of clinical data with major pathologies. The model significantly outperformed two other publicly available segmentation models (Dice score, 0.824 versus 0.762; p<0.001 and 0.762 versus 0.542; p<0.001). On the CT image test set of the original TotalSegmentator paper it almost matches the performance of the original TotalSegmentator (Dice score, 0.960 versus 0.970; p<0.001).
Conclusion: Our proposed model extends the capabilities of TotalSegmentator to MR images. The annotated dataset (this https URL) and open-source toolkit (this https URL) are publicly available.
- [19] arXiv:2405.19497 [pdf, ps, html, other]
Title: Gaussian Flow Bridges for Audio Domain Transfer with Unpaired Data
Comments: Submitted to IWAENC 2024
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Audio domain transfer is the process of modifying audio signals to match characteristics of a different domain, while retaining the original content. This paper investigates the potential of Gaussian Flow Bridges, an emerging approach in generative modeling, for this problem. The presented framework addresses the transport problem across different distributions of audio signals through the implementation of a series of two deterministic probability flows. The proposed framework facilitates manipulation of the target distribution properties through a continuous control variable, which defines a certain aspect of the target domain. Notably, this approach does not rely on paired examples for training. To address identified challenges in keeping the speech content consistent, we recommend a training strategy that incorporates chunk-based minibatch Optimal Transport couplings of data samples and noise. Comparing our unsupervised method with established baselines, we find competitive performance in tasks of reverberation and distortion manipulation. Despite encountering limitations, the intriguing results obtained in this study underscore potential for further exploration.
- [20] arXiv:2405.19516 [pdf, ps, html, other]
Title: Enabling Visual Recognition at Radio Frequency
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
This paper introduces PanoRadar, a novel RF imaging system that brings RF resolution close to that of LiDAR, while providing resilience against conditions challenging for optical signals. Our LiDAR-comparable 3D imaging results enable, for the first time, a variety of visual recognition tasks at radio frequency, including surface normal estimation, semantic segmentation, and object detection. PanoRadar utilizes a rotating single-chip mmWave radar, along with a combination of novel signal processing and machine learning algorithms, to create high-resolution 3D images of the surroundings. Our system accurately estimates robot motion, allowing for coherent imaging through a dense grid of synthetic antennas. It also exploits the high azimuth resolution to enhance elevation resolution using learning-based methods. Furthermore, PanoRadar tackles 3D learning via 2D convolutions and addresses challenges due to the unique characteristics of RF signals. Our results demonstrate PanoRadar's robust performance across 12 buildings.
- [21] arXiv:2405.19527 [pdf, ps, other]
Title: Flexible Agent-based Modeling Framework to Evaluate Integrated Microtransit and Fixed-route Transit Designs: Mode Choice, Supernetworks, and Fleet Simulation
Comments: 49 pages, 25 figures, 8 tables; Submitted To: Transportation Research Part C: Emerging Technologies on May 1st, 2024
Subjects: Systems and Control (eess.SY)
The integration of traditional fixed-route transit (FRT) and more flexible microtransit has been touted as a means of improving mobility and access to opportunity, increasing transit ridership, and promoting environmental sustainability. To help evaluate integrated FRT and microtransit public transit (PT) system (henceforth "integrated fixed-flex PT system") designs, we propose a high-fidelity modeling framework that provides reliable estimates for a wide range of (i) performance metrics and (ii) integrated fixed-flex PT system designs. We formulate the mode choice equilibrium problem as a fixed-point problem wherein microtransit demand is a function of microtransit performance, and microtransit performance depends on microtransit demand. We propose a detailed agent-based simulation modeling framework that includes (i) a binary logit mode choice model (private auto vs. transit), (ii) a supernetwork-based model and pathfinding algorithm for multi-modal transit path choice where the supernetwork includes pedestrian, FRT, and microtransit layers, and (iii) a detailed mobility-on-demand fleet simulator called FleetPy to model the supply-demand dynamics of the microtransit service. In this paper, we illustrate the capabilities of the modeling framework by analyzing integrated fixed-flex PT system designs that vary the following design parameters: FRT frequencies and microtransit fleet size, service region structure, virtual stop coverage, and operating hours. We include case studies in downtown San Diego and Lemon Grove, California. The computational results show that the proposed modeling framework converges to a mode choice equilibrium. Moreover, the scenario results imply that introducing a new microtransit service decreases FRT ridership and requires additional subsidies, but it significantly increases job accessibility and slightly reduces total VMT.
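The fixed-point structure described in this abstract (demand depends on performance, performance depends on demand) can be illustrated with a toy iteration. The sketch below is not the authors' framework: the utility coefficients, the linear wait-time model standing in for a fleet simulator, and the damping factor are all hypothetical values chosen only to make the loop concrete.

```python
import math


def transit_share(wait_min, asc_transit=-0.5, beta_wait=-0.1):
    """Binary logit probability of choosing transit over private auto.

    The utility gap is a hypothetical linear function of wait time.
    """
    utility_gap = asc_transit + beta_wait * wait_min
    return 1.0 / (1.0 + math.exp(-utility_gap))


def microtransit_wait(demand, fleet_size=10, base_wait=3.0, delay_per_load=0.2):
    """Toy supply model: wait time grows with riders per vehicle.

    This stands in for a fleet simulator and is an assumption of the sketch.
    """
    return base_wait + delay_per_load * demand / fleet_size


def solve_mode_choice_equilibrium(population=1000, damping=0.5, tol=1e-10, max_iter=500):
    """Damped fixed-point iteration on the transit mode share."""
    share = 0.5
    for iteration in range(1, max_iter + 1):
        wait = microtransit_wait(share * population)
        implied_share = transit_share(wait)
        if abs(implied_share - share) < tol:
            return share, iteration
        # Move only part of the way toward the implied share to avoid oscillation.
        share += damping * (implied_share - share)
    raise RuntimeError("mode choice iteration did not converge")


if __name__ == "__main__":
    share, iterations = solve_mode_choice_equilibrium()
    print(f"equilibrium transit share {share:.4f} after {iterations} iterations")
```

At the returned share, the demand implied by the wait time equals the demand that produced that wait time, which is the equilibrium condition the abstract refers to.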
- [22] arXiv:2405.19542 [pdf, ps, html, other]
-
Title: Anatomical Region Recognition and Real-time Bone Tracking Methods by Dynamically Decoding A-Mode Ultrasound Signals
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Robotics (cs.RO)
Accurate bone tracking is crucial for kinematic analysis in orthopedic surgery and prosthetic robotics. Traditional methods (e.g., skin markers) are subject to soft tissue artifacts, and the bone pins used in surgery introduce the risk of additional trauma and infection. For electromyography (EMG), its inability to directly measure joint angles requires complex algorithms for kinematic estimation. To address these issues, A-mode ultrasound-based tracking has been proposed as a non-invasive and safe alternative. However, this approach suffers from limited accuracy in peak detection when processing received ultrasound signals. To build a precise and real-time bone tracking approach, this paper introduces a deep learning-based method for anatomical region recognition and bone tracking using A-mode ultrasound signals, specifically focused on the knee joint. The algorithm is capable of simultaneously performing bone tracking and identifying the anatomical region where the A-mode ultrasound transducer is placed. It uses full connections between all encoding and decoding layers of the cascaded U-Nets to focus only on the signal region that is most likely to contain the bone peak, thus pinpointing the exact location of the peak and classifying the anatomical region of the signal. The experiment showed a 97% accuracy in the classification of the anatomical regions and a precision of around 0.5$\pm$1 mm under dynamic tracking conditions for various anatomical areas surrounding the knee joint. In general, this approach shows great potential beyond the traditional method, in terms of the accuracy achieved and the recognition of the anatomical region where the ultrasound transducer has been attached as an additional functionality.
- [23] arXiv:2405.19639 [pdf, ps, other]
-
Title: Generalized BER Performance Analysis for SIC-based Uplink NOMA Systems
Subjects: Signal Processing (eess.SP)
Non-orthogonal multiple access (NOMA) is widely recognized for its spectral and energy efficiency, which allows more users to share the network resources more effectively. This paper provides a generalized bit error rate (BER) performance analysis of successive interference cancellation (SIC)-based uplink NOMA systems under Rayleigh fading channels, taking into account error propagation resulting from SIC imperfections. Exact closed-form BER expressions are initially derived for scenarios with 2 and 3 users using quadrature phase shift keying (QPSK) modulation. These expressions are then generalized to encompass any arbitrary rectangular/square M-ary quadrature amplitude modulation (M-QAM) order, number of NOMA users, and number of BS antennas. Additionally, by utilizing the derived closed-form BER expressions, a simple and practically feasible power allocation (PA) technique is devised to minimize the sum bit error rate of the users and optimize the SIC-based NOMA detection at the base-station (BS). The derived closed-form expressions are corroborated through Monte Carlo simulations. It is demonstrated that these expressions can be effective for optimal uplink PA to ensure optimized SIC detection that mitigates error floors. It is also shown that significant performance improvements are achieved regardless of the users' decoding order, making uplink SIC-based NOMA a viable approach.
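To make the idea of corroborating a closed-form BER expression by Monte Carlo simulation concrete, the sketch below uses the textbook single-user case: Gray-coded QPSK with coherent detection over flat Rayleigh fading. This is a baseline illustration only, not the multi-user SIC expressions derived in the paper, and the function names are hypothetical.

```python
import math
import random


def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def ber_qpsk_rayleigh_closed_form(snr_per_bit):
    """Average BER of coherent QPSK in Rayleigh fading (linear average Eb/N0)."""
    return 0.5 * (1.0 - math.sqrt(snr_per_bit / (1.0 + snr_per_bit)))


def ber_qpsk_rayleigh_monte_carlo(snr_per_bit, num_draws=100_000, seed=0):
    """Average the conditional AWGN BER over random channel power draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        # |h|^2 is exponentially distributed for unit-power Rayleigh fading.
        channel_power = rng.expovariate(1.0)
        total += q_function(math.sqrt(2.0 * snr_per_bit * channel_power))
    return total / num_draws


if __name__ == "__main__":
    snr = 10.0  # 10 dB average Eb/N0, expressed as a linear ratio
    print(ber_qpsk_rayleigh_closed_form(snr), ber_qpsk_rayleigh_monte_carlo(snr))
```

The simulation averages the conditional error probability rather than counting individual bit errors, which reduces variance while still checking the closed form.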
- [24] arXiv:2405.19645 [pdf, ps, html, other]
-
Title: A Landmark-aware Network for Automated Cobb Angle Estimation Using X-ray Images
Subjects: Image and Video Processing (eess.IV)
Automated Cobb angle estimation based on X-ray images plays an important role in scoliosis diagnosis, treatment, and progression surveillance. The inadequate feature extraction and the noise in X-ray images are the main difficulties of automated Cobb angle estimation, and it is challenging to ensure that the calculated Cobb angle meets clinical requirements. To address these problems, we propose a Landmark-aware Network named LaNet with three components, Feature Robustness Enhancement Module (FREM), Landmark-aware Objective Function (LOF), and Cobb Angle Calculation Method (CACM), for automated Cobb angle estimation in this paper. To enhance feature extraction, FREM is designed to explore geometric and semantic constraints among landmarks, thus geometric and semantic correlations between landmarks are globally modeled, and robust landmark-based features are extracted. Furthermore, to mitigate the effect of background noise on landmark localization, LOF is proposed to focus more on the foreground near the landmarks and ignore irrelevant background pixels by exploiting category prior information of landmarks. In addition, we also advance CACM to locate the bending segments first and then calculate the Cobb angle within the bending segment, which facilitates the calculation of the clinical standardized Cobb angle. The experiment results on the AASCE dataset demonstrate that our proposed LaNet can significantly improve the Cobb angle estimation performance and outperform other state-of-the-art methods.
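For readers unfamiliar with the quantity being estimated, the Cobb angle is the angle between the endplate lines of the two end vertebrae of a curve. The sketch below computes that angle from four landmark points. It shows only the underlying geometry, not the paper's CACM (which first locates the bending segments), and it assumes each endplate is given as a (left, right) pair of (x, y) image coordinates.

```python
import math


def endplate_tilt(left, right):
    """Tilt (radians) of the line through two endplate landmarks, ordered left to right."""
    return math.atan2(right[1] - left[1], right[0] - left[0])


def cobb_angle_deg(upper_endplate, lower_endplate):
    """Angle in degrees between the upper and lower endplate lines."""
    upper_tilt = endplate_tilt(*upper_endplate)
    lower_tilt = endplate_tilt(*lower_endplate)
    return abs(math.degrees(upper_tilt - lower_tilt))


if __name__ == "__main__":
    # Hypothetical landmark coordinates in pixels.
    upper = ((0.0, 0.0), (10.0, 2.0))
    lower = ((0.0, 20.0), (10.0, 17.0))
    print(f"Cobb angle: {cobb_angle_deg(upper, lower):.2f} degrees")
```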
- [25] arXiv:2405.19665 [pdf, ps, other]
-
Title: A novel fault localization with data refinement for hydroelectric units
Authors: Jialong Huang, Junlin Song, Penglong Lian, Mengjie Gan, Zhiheng Su, Benhao Wang, Wenji Zhu, Xiaomin Pu, Jianxiao Zou, Shicai Fan
Comments: 6 pages, 4 figures, Conference on Decision and Control (CDC)
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Because fault samples are scarce and the data from hydroelectric units are non-linear and non-smooth, most traditional fault localization methods for hydroelectric units struggle to localize faults accurately. To address these problems, a fault localization method based on sparse autoencoder (SAE)-generative adversarial network (GAN)-wavelet noise reduction (WNR)-manifold-boosted deep learning (SG-WMBDL) is proposed for hydroelectric units. To overcome the data scarcity, an SAE is embedded into the GAN to generate more high-quality samples in the data generation module. Because the signals have non-linear and non-smooth characteristics, an improved WNR, which combines soft and hard thresholding, and locally linear embedding (LLE) are used in the data preprocessing module to reduce noise and effectively capture local features. In addition, to seek higher performance, a novel Adaptive Boost (AdaBoost) scheme combined with multiple deep learning models is proposed to achieve accurate fault localization. The experimental results show that SG-WMBDL can locate faults in hydroelectric units from a small number of fault samples with non-linear and non-smooth characteristics, with higher precision and accuracy than other state-of-the-art methods, which verifies the effectiveness and practicality of the proposed method.
- [26] arXiv:2405.19685 [pdf, ps, other]
-
Title: Identifying Functional Brain Networks of Spatiotemporal Wide-Field Calcium Imaging Data via a Long Short-Term Memory Autoencoder
Authors: Xiaohui Zhang, Eric C Landsness, Lindsey M Brier, Wei Chen, Michelle J. Tang, Hanyang Miao, Jin-Moo Lee, Mark A. Anastasio, Joseph P. Culver
Subjects: Image and Video Processing (eess.IV)
Wide-field calcium imaging (WFCI) that records neural calcium dynamics allows for identification of functional brain networks (FBNs) in mice that express genetically encoded calcium indicators. Estimating FBNs from WFCI data is commonly achieved by use of seed-based correlation (SBC) analysis and independent component analysis (ICA). These two methods are conceptually distinct and each possesses limitations. Recent success of unsupervised representation learning in neuroimage analysis motivates the investigation of such methods to identify FBNs. In this work, a novel approach, referred to as LSTM-AER, is proposed in which a long short-term memory (LSTM) autoencoder (AE) is employed to learn spatial-temporal latent embeddings from WFCI data, followed by an ordinary least squares regression (R) to estimate FBNs. The goal of this study is to elucidate and illustrate, qualitatively and quantitatively, the FBNs identified by use of the LSTM-AER method and compare them to those from traditional SBC and ICA. It was observed that spatial FBN maps produced from LSTM-AER resembled those derived by SBC and ICA while better accounting for intra-subject variation, data from a single hemisphere, shorter epoch lengths and tunable number of latent components. The results demonstrate the potential of unsupervised deep learning-based approaches to identifying and mapping FBNs.
- [27] arXiv:2405.19845 [pdf, ps, html, other]
-
Title: Assessing the impact of weather-induced uncertainties in large-scale electricity systems
Subjects: Systems and Control (eess.SY)
The future energy system will largely depend on volatile renewable energy sources and temperature-dependent loads, which makes the weather a central influencing factor. This article presents a novel approach for simulating weather scenarios for robust large-scale power system analysis. By applying different signal analysis methods, historical weather data is decomposed into its spectral components, processed appropriately, and then used to generate random, self-consistent weather data. In this process, any weather parameters of different locations can be considered, while their respective dependencies are mapped. The added value is demonstrated by coupling with a state-of-the-art large-scale energy system model for Europe. It is shown that the integrated consideration of different weather influences allows a quantification of the range of fluctuation of various parameters - such as the feed-in of wind and solar power - and thus provides the basis for future resilient grid planning approaches.
- [28] arXiv:2405.19858 [pdf, ps, html, other]
-
Title: Position Error Bound for Cooperative Sensing in MIMO-OFDM Networks
Comments: 6 pages
Subjects: Signal Processing (eess.SP)
This paper investigates the fundamental limits of target position estimation accuracy of joint sensing and communication (JSC) networks comprising several monostatic base stations (BSs) that cooperate to localize targets. Specifically, each BS adopts a multiple-input multiple-output (MIMO)-orthogonal frequency division multiplexing (OFDM) scheme with a multi-beam radiation pattern to partition power between communication and sensing tasks. Building on prior works, we derive a general framework to evaluate the positioning accuracy of a target in networks with an arbitrary number of cooperating BSs and arbitrary geometrical configurations using Fisher information. Numerical results demonstrate the benefits of cooperation between BSs in improving target localization accuracy and provide insights into the relationships between various system parameters, which may aid in designing JSC networks.
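A position error bound (PEB) is commonly computed as the square root of the trace of the inverse Fisher information matrix, with the information from independent sensors adding up. The sketch below illustrates this in 2D under a deliberately simplified measurement model that is an assumption of this example, not the paper's MIMO-OFDM derivation: each BS contributes an independent range measurement and an independent bearing measurement with hypothetical standard deviations.

```python
import math


def bs_fisher_information(bs, target, sigma_range=1.0, sigma_bearing=0.05):
    """2x2 position FIM from one BS with independent range and bearing measurements."""
    dx, dy = target[0] - bs[0], target[1] - bs[1]
    distance = math.hypot(dx, dy)
    ux, uy = dx / distance, dy / distance   # radial unit vector
    vx, vy = -uy, ux                        # tangential unit vector
    w_range = 1.0 / sigma_range ** 2
    w_bearing = 1.0 / (distance * sigma_bearing) ** 2
    return [
        [w_range * ux * ux + w_bearing * vx * vx, w_range * ux * uy + w_bearing * vx * vy],
        [w_range * ux * uy + w_bearing * vx * vy, w_range * uy * uy + w_bearing * vy * vy],
    ]


def position_error_bound(base_stations, target):
    """PEB = sqrt(trace(J^-1)), where J sums the FIMs of all cooperating BSs."""
    j00 = j01 = j11 = 0.0
    for bs in base_stations:
        fim = bs_fisher_information(bs, target)
        j00 += fim[0][0]
        j01 += fim[0][1]
        j11 += fim[1][1]
    determinant = j00 * j11 - j01 * j01
    # For a 2x2 matrix, trace(J^-1) = trace(J) / det(J).
    return math.sqrt((j00 + j11) / determinant)


if __name__ == "__main__":
    target = (100.0, 0.0)
    print(position_error_bound([(0.0, 0.0)], target))
    print(position_error_bound([(0.0, 0.0), (100.0, -100.0)], target))
```

With these numbers, adding a second BS that views the target from an orthogonal direction lowers the bound, since its accurate range measurement covers the direction in which the first BS only has a coarse bearing.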
- [29] arXiv:2405.19889 [pdf, ps, html, other]
-
Title: Deep Joint Semantic Coding and Beamforming for Near-Space Airship-Borne Massive MIMO Network
Comments: Major Revision by IEEE JSAC
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG); Multimedia (cs.MM)
Near-space airship-borne communication network is recognized to be an indispensable component of the future integrated ground-air-space network thanks to airships' advantage of long-term residency at stratospheric altitudes, but it urgently needs reliable and efficient Airship-to-X link. To improve the transmission efficiency and capacity, this paper proposes to integrate semantic communication with massive multiple-input multiple-output (MIMO) technology. Specifically, we propose a deep joint semantic coding and beamforming (JSCBF) scheme for airship-based massive MIMO image transmission network in space, in which semantics from both source and channel are fused to jointly design the semantic coding and physical layer beamforming. First, we design two semantic extraction networks to extract semantics from image source and channel state information, respectively. Then, we propose a semantic fusion network that can fuse these semantics into complex-valued semantic features for subsequent physical-layer transmission. To efficiently transmit the fused semantic features at the physical layer, we then propose the hybrid data and model-driven semantic-aware beamforming networks. At the receiver, a semantic decoding network is designed to reconstruct the transmitted images. Finally, we perform end-to-end deep learning to jointly train all the modules, using the image reconstruction quality at the receivers as a metric. The proposed deep JSCBF scheme fully combines the efficient source compressibility and robust error correction capability of semantic communication with the high spectral efficiency of massive MIMO, achieving a significant performance improvement over existing approaches.
- [30] arXiv:2405.19925 [pdf, ps, html, other]
-
Title: Integrated Sensing and Communications Framework for 6G Networks
Authors: Hongliang Luo, Tengyu Zhang, Chuanbin Zhao, Yucong Wang, Bo Lin, Yuhua Jiang, Dongqi Luo, Feifei Gao
Subjects: Signal Processing (eess.SP)
In this paper, we propose a novel integrated sensing and communications (ISAC) framework for the sixth generation (6G) mobile networks, in which we decompose the real physical world into static environment, dynamic targets, and various object materials. The ubiquitous static environment occupies the vast majority of the physical world, for which we design static environment reconstruction (SER) scheme to obtain the layout and point cloud information of static buildings. The dynamic targets floating in static environments create the spatiotemporal transition of the physical world, for which we design comprehensive dynamic target sensing (DTS) scheme to detect, estimate, track, image and recognize the dynamic targets in real-time. The object materials enrich the electromagnetic laws of the physical world, for which we develop object material recognition (OMR) scheme to estimate the electromagnetic coefficient of the objects. Besides, to integrate these sensing functions into existing communications systems, we discuss the interference issues and corresponding solutions for ISAC cellular networks. Furthermore, we develop an ISAC hardware prototype platform that can reconstruct the environmental maps and sense the dynamic targets while maintaining communications services. With all these designs, the proposed ISAC framework can support multifarious emerging applications, such as digital twins, low altitude economy, internet of vehicles, marine management, deformation monitoring, etc.
- [31] arXiv:2405.19944 [pdf, ps, other]
-
Title: Discrete-Time I&I Adaptive Interconnection and Damping Passivity-Based Control for Nonlinearly Parameterized Port-Controlled Hamiltonian Systems
Comments: 31 pages, 9 figures
Subjects: Systems and Control (eess.SY)
In this paper, a discrete-time I&I-based adaptive IDA-PBC controller is constructed for uncertain nonlinearly parameterized port-controlled Hamiltonian (PCH) systems, where the parameter uncertainties are assumed to be in the energy function. A proper formulation for the uncertain system dynamics is established where the uncertainties appear in nonlinearly parameterized form in the gradient of the Hamiltonian function. The adaptive IDA-PBC controller is constructed considering this formulation. For the adaptation mechanism of the IDA-PBC controller, a discrete-time parameter estimator is derived based on the immersion and invariance (I&I) approach. A structure for a free design function in the I&I-based estimator is proposed including some other free design functions. If these free design functions are selected to satisfy some conditions, derived in this paper, the Lyapunov asymptotic stability of the estimator dynamics is guaranteed. Besides, assuming these conditions are satisfied, local asymptotic stability of the closed-loop system in a sufficiently large set is shown. The proposed method is applied to two physical system examples and the performance of the adaptive controller is tested by simulation. It is demonstrated that the performance of the certain IDA-PBC controller is maintained by the adaptive IDA-PBC controller successfully.
- [32] arXiv:2405.20052 [pdf, ps, html, other]
-
Title: A Hardware-Efficient EMG Decoder with an Attractor-based Neural Network for Next-Generation Hand Prostheses
Authors: Mohammad Kalbasi, MohammadAli Shaeri, Vincent Alexandre Mendez, Solaiman Shokur, Silvestro Micera, Mahsa Shoaran
Comments: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Advancements in neural engineering have enabled the development of Robotic Prosthetic Hands (RPHs) aimed at restoring hand functionality. Current commercial RPHs offer limited control through basic on/off commands. Recent progress in machine learning enables finger movement decoding with higher degrees of freedom, yet the high computational complexity of such models limits their application in portable devices. Future RPH designs must balance portability, low power consumption, and high decoding accuracy to be practical for individuals with disabilities. To this end, we introduce a novel attractor-based neural network to realize on-chip movement decoding for next-generation portable RPHs. The proposed architecture comprises an encoder, an attention layer, an attractor network, and a refinement regressor. We tested our model on four healthy subjects and achieved a decoding accuracy of 80.6$\pm$3.3%. Our proposed model is over 120 and 50 times more compact compared to state-of-the-art LSTM and CNN models, respectively, with comparable (or superior) decoding accuracy. Therefore, it exhibits minimal hardware complexity and can be effectively integrated as a System-on-Chip.
- [33] arXiv:2405.20055 [pdf, ps, html, other]
-
Title: Hypergraph-Aided Task-Resource Matching for Maximizing Value of Task Completion in Collaborative IoT Systems
Comments: This paper has been published in IEEE Transactions on Mobile Computing, May 2024
Subjects: Systems and Control (eess.SY)
With the growing scale and intrinsic heterogeneity of Internet of Things (IoT) systems, distributed device collaboration becomes essential for effective task completion by dynamically utilizing limited communication and computing resources. However, the separated design and situation-agnostic operation of computing, communication and application layers create a fundamental challenge for rapid task-resource matching, which further deteriorates the overall task completion effectiveness. To overcome this challenge, we utilize the hypergraph as a new tool to vertically unify computing, communication, and task aspects of IoT systems for an effective matching by accurately capturing the relationships between tasks and communication and computing resources. Specifically, a state-of-the-art task-resource matching hypergraph (TRM-hypergraph) model is proposed in this paper, which is used to effectively transform the process of allocating complex heterogeneous resources to convoluted tasks into a hypergraph matching problem. Taking into account computational complexity and storage, a game-theoretic hypergraph matching algorithm is proposed by treating the hypergraph matching problem as a non-cooperative multi-player clustering game. Numerical results demonstrate that the proposed TRM-hypergraph model achieves superior performance in matching of tasks and resources compared with baseline algorithms.
- [34] arXiv:2405.20064 [pdf, ps, html, other]
-
Title: 1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem
Authors: Mingjie Chen, Hezhao Zhang, Yuanchao Li, Jiachen Luo, Wen Wu, Ziyang Ma, Peter Bell, Catherine Lai, Joshua Reiss, Lin Wang, Philip C. Woodland, Xie Chen, Huy Phan, Thomas Hain
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, resulting in those sometimes being ignored or frequently misclassified. Previous work has utilised class weighted loss for training, but problems remain as it sometimes causes over-fitting for minor classes or under-fitting for major classes. This paper presents the system developed by a multi-site team for the participation in the Odyssey 2024 Emotion Recognition Challenge Track-1. The challenge data has the aforementioned properties and therefore the presented systems aimed to tackle these issues, by introducing focal loss in optimisation when applying class weighted loss. Specifically, the focal loss is further weighted by prior-based class weights. Experimental results show that combining these two approaches brings better overall performance, by sacrificing performance on major classes. The system further employs a majority voting strategy to combine the outputs of an ensemble of 7 models. The models are trained independently, using different acoustic features and loss functions - with the aim to have different properties for different data. Hence these models show different performance preferences on major classes and minor classes. The ensemble system output obtained the best performance in the challenge, ranking top-1 among 68 submissions. It also outperformed all single models in our set. On the Odyssey 2024 Emotion Recognition Challenge Task-1 data the system obtained a Macro-F1 score of 35.69% and an accuracy of 37.32%.
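The two ingredients named in this abstract, a focal loss weighted by prior-based class weights and majority voting over an ensemble, can be sketched for a single utterance as follows. The inverse-prior weighting, the focusing parameter value, and the tie-breaking rule are assumptions of this sketch and not necessarily the configuration used by the challenge system.

```python
import math
from collections import Counter


def class_weights_from_priors(priors):
    """Inverse-prior class weights, normalized to have mean 1 (an assumed choice)."""
    inverse = [1.0 / p for p in priors]
    mean = sum(inverse) / len(inverse)
    return [w / mean for w in inverse]


def weighted_focal_loss(probs, target, weights, gamma=2.0):
    """Focal loss for one sample: -w_c * (1 - p_c)^gamma * log(p_c)."""
    p = probs[target]
    return -weights[target] * (1.0 - p) ** gamma * math.log(p)


def majority_vote(predictions):
    """Most frequent label; ties go to the label that appears first in the list."""
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    weights = class_weights_from_priors([0.8, 0.2])
    print(weights)
    print(weighted_focal_loss([0.9, 0.1], 0, weights))   # easy majority-class sample
    print(weighted_focal_loss([0.9, 0.1], 1, weights))   # hard minority-class sample
    print(majority_vote(["happy", "sad", "happy"]))
```

Setting `gamma=0` recovers the ordinary class-weighted cross-entropy, which makes the role of the focusing term easy to check.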
- [35] arXiv:2405.20068 [pdf, ps, html, other]
-
Title: An Efficient Network with Novel Quantization Designed for Massive MIMO CSI Feedback
Subjects: Signal Processing (eess.SP)
The efficacy of massive multiple-input multiple-output (MIMO) techniques heavily relies on the accuracy of channel state information (CSI) in frequency division duplexing (FDD) systems. Many works focus on CSI compression and quantization methods to enhance CSI reconstruction accuracy with lower feedback overhead. In this letter, we propose CsiConformer, a novel CSI feedback network that combines convolutional operations and self-attention mechanisms to improve CSI feedback accuracy. Additionally, a new quantization module is developed to improve encoding efficiency. Experiment results show that CsiConformer outperforms previous state-of-the-art networks, achieving an average accuracy improvement of 17.67% with lower computational overhead.
- [36] arXiv:2405.20100 [pdf, ps, html, other]
-
Title: Dynamic Slack Bus
Subjects: Systems and Control (eess.SY)
This letter proposes a general dynamic formulation of slack bus. With this aim, the angle constraint imposed by the slack bus is redefined as a set of differential equations and an energy source. The existence and role of the transient component of this source is also discussed in the letter. Based on this framework, the letter shows that the swing equations of synchronous machines can be interpreted as distributed, dynamic, multi-variable, local slack buses. Other relevant cases, including primary and secondary frequency regulation, passive loads as well as grid following and grid forming converters are discussed.
- [37] arXiv:2405.20107 [pdf, ps, html, other]
-
Title: A Perspective on the Impact of Group Delay Dispersion in Future Terahertz Wireless Systems
Comments: 7 pages, 4 figures, 2 tables. This work has been submitted to the IEEE for possible publication
Subjects: Signal Processing (eess.SP)
This article discusses the challenges and opportunities of managing group delay dispersion (GDD) and its relation to the performance standards of future sixth-generation (6G) wireless communication systems utilizing terahertz frequency waves. The unique susceptibilities of 6G systems to GDD are described, along with a quantitative description of the sources of GDD, including multipath, rough surface scattering, intelligent reflecting surfaces, and propagation through the atmosphere. An experimental case-study is presented that confirms previous models quantifying the impact of atmospheric GDD. Several GDD manipulation strategies are presented illustrating their hindered effectiveness in the 6G context. Conversely, some benefits of leveraging GDD to enhance 6G systems, such as improved security and simplified hardware, are also discussed. Finally, a perspective on using photonic GDD control devices is provided, revealing quantitative benefits that may unburden existing equalization schemes. The article argues that GDD will uniquely and significantly impact some 6G systems, but that its careful consideration along with new mitigation strategies, including photonic devices, will help optimize system performance. The conclusion provides a perspective to guide future research in this area.
- [38] arXiv:2405.20122 [pdf, ps, html, other]
-
Title: Distributed MIMO Precoding with Routing Constraints in Segmented Fronthaul
Comments: This is the accepted version of a paper published in 2023 IEEE 34th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). The final version is available at this https URL
Journal-ref: PIMRC, Toronto, ON, Canada, 2023, pp. 1-6
Subjects: Signal Processing (eess.SP)
Distributed Multiple-Input and Multiple-Output (D-MIMO) is envisioned to play a significant role in future wireless communication systems as an effective means to improve coverage and capacity. In this paper, we have studied the impact of a practical two-level data routing scheme on radio performance in a downlink D-MIMO scenario with segmented fronthaul. At the first level, a Distributed Unit (DU) is connected to the Aggregating Radio Units (ARUs) that behave as cluster heads for the selected serving RU groups. At the second level, the selected ARUs connect with the additional serving RUs. At each route discovery level, RUs and/or ARUs share information with each other. The aim of the proposed framework is to efficiently select serving RUs and ARUs so that the practical data routing impact for each User Equipment (UE) connection is minimal. The resulting post-routing Signal-to-Interference plus Noise Ratio (SINR) among all UEs is analyzed after the routing constraints have been applied. The results show that limited fronthaul segment capacity causes connection failures with the serving RUs of individual UEs, especially when long routing path lengths are required. Depending on whether the failures occur at the first or the second routing level, a UE may be dropped or its SINR may be reduced. To minimize the DU-ARU connection failures, the capacity of the segments closest to the DU is set to twice that of the remaining segments. When the number of active co-scheduled UEs is kept low enough, practical segment capacities suffice to achieve a zero UE dropping rate. Besides, the proper choice of maximum path length setting should take into account segment capacity and its utilization due to the relation between the two.
- [39] arXiv:2405.20157 [pdf, ps, other]
-
Title: A Multiband T-Shaped Antenna Array for 6G Mobile Communication
Authors: Sunday Achimugu, Abraham Usman Usman, Suleiman Zubair, Michael David, Abdulkadir Olayinka Abdulbaki, Hassan Musa Abdullahi
Subjects: Signal Processing (eess.SP)
The paradigm shift in the use cases of wireless communication necessitates a move toward higher data rates, larger bandwidths, and intelligent reconfiguration in 6G. This paper presents a novel double T-shaped antenna array that operates from 4 GHz to 16 GHz for 6G mobile communication. The antenna consists of a rectangular microstrip with a fractal T-shaped slot, cut at the rear of the microstrip to provide an air gap for an improved radiation pattern.
- [40] arXiv:2405.20168 [pdf, ps, html, other]
-
Title: Enhancing Battlefield Awareness: An Aerial RIS-assisted ISAC System with Deep Reinforcement Learning
Subjects: Systems and Control (eess.SY)
This paper considers a joint communication and sensing technique for enhancing situational awareness in practical battlefield scenarios. In particular, we propose an aerial reconfigurable intelligent surface (ARIS)-assisted integrated sensing and communication (ISAC) system consisting of a single access point (AP), an ARIS, multiple users, and a sensing target. With deep reinforcement learning (DRL), we jointly optimize the transmit beamforming of the AP, the RIS phase shifts, and the trajectory of the ARIS under signal-to-interference-noise ratio (SINR) constraints. Numerical results demonstrate that the proposed technique outperforms the conventional benchmark schemes by suppressing the self-interference and clutter echo signals or optimizing the RIS phase shifts.
- [41] arXiv:2405.20178 [pdf, ps, other]
-
Title: Non-intrusive data-driven model order reduction for circuits based on Hammerstein architectures
Comments: 13 pages, 13 figures; submitted to IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
We demonstrate that data-driven system identification techniques can provide a basis for effective, non-intrusive model order reduction (MOR) for common circuits that are key building blocks in microelectronics. Our approach is motivated by the practical operation of these circuits and utilizes a canonical Hammerstein architecture. To demonstrate the approach we develop a parsimonious Hammerstein model for a non-linear CMOS differential amplifier. We train this model on a combination of direct current (DC) and transient Spice (Xyce) circuit simulation data using a novel sequential strategy to identify the static nonlinear and linear dynamical parts of the model. Simulation results show that the Hammerstein model is an effective surrogate for the differential amplifier circuit that accurately and efficiently reproduces its behavior over a wide range of operating points and input frequencies.
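A Hammerstein model is a static nonlinearity followed by a linear dynamical block. The sketch below shows that structure in discrete time with a tanh nonlinearity and a first-order low-pass filter. Both choices, and every parameter value, are hypothetical placeholders and not the model identified in the paper.

```python
import math


def static_nonlinearity(u, gain=1.0, scale=0.1):
    """Saturating static block (tanh is an assumed stand-in for the amplifier curve)."""
    return gain * math.tanh(u / scale)


def simulate_hammerstein(inputs, pole=0.9):
    """Pass each input through the static block, then a unity-DC-gain first-order filter."""
    state = 0.0
    outputs = []
    for u in inputs:
        w = static_nonlinearity(u)
        state = pole * state + (1.0 - pole) * w
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    step_response = simulate_hammerstein([0.05] * 500)
    print(step_response[0], step_response[-1], static_nonlinearity(0.05))
```

Because the linear block here has unity DC gain, the steady-state output for a constant input equals the static nonlinearity evaluated at that input. This separation is one reason DC data can be informative about the static block while transient data is informative about the linear dynamics.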
- [42] arXiv:2405.20199 [pdf, ps, html, other]
-
Title: Generation planning and operation under power stability constraints: A Hydro-Quebec use caseAlexandre Besner, Alexandre Blondin Massé, Abderrahman Bani, Mouad Morabit, Luc Charest, David Ialongo, Simon Couture-Gagnon, Julien FournierComments: 10 pages, 3 figures, 13 tables, 1 algorithmSubjects: Systems and Control (eess.SY)
Hydro-Quebec (HQ) is a vertically integrated utility that produces, transmits, and distributes most of the electricity in the province of Quebec. The power grid it operates has a particular architecture created by large hydroelectric dams located far north and the extensive 735kV transmission grid that allows the generated power to reach the majority of the load located thousands of kilometers away in the southern region of Quebec. The specificity of the grid has led HQ to develop monitoring tools responsible for generating so-called stability limits. Those stability limits take into account several nonlinear phenomena such as angular stability, frequency stability, or voltage stability. Since generation planning and operation tools rely mostly on mixed integer linear programming formulation, HQ had to adapt its tools to integrate stability limits into them. This paper presents the challenges it faced, especially considering its reserve monitoring tool and unit commitment tool.
- [43] arXiv:2405.20219 [pdf, ps, html, other]
-
Title: System Identification for Lithium-Ion Batteries with Nonlinear Coupled Electro-Thermal Dynamics via Bayesian OptimizationComments: 2024 American Control Conference (ACC)Subjects: Systems and Control (eess.SY)
Essential to various practical applications of lithium-ion batteries is the availability of accurate equivalent circuit models. This paper presents a new coupled electro-thermal model for batteries and studies how to extract it from data. We consider the problem of maximum likelihood parameter estimation, which, however, is nontrivial to solve as the model is nonlinear in both its dynamics and measurement. We propose to leverage the Bayesian optimization approach, owing to its machine learning-driven capability in handling complex optimization problems and searching for global optima. To enhance the parameter search efficiency, we dynamically narrow and refine the search space in Bayesian optimization. The proposed system identification approach can efficiently determine the parameters of the coupled electro-thermal model. It is amenable to practical implementation, with few requirements on the experiment, data types, and optimization setups, and well applicable to many other battery models.
- [44] arXiv:2405.20260 [pdf, ps, html, other]
-
Title: Ancillary Services Provision by Cross-Voltage-Level Power Flow Control using Flexibility RegionsComments: preprintSubjects: Systems and Control (eess.SY)
The large-scale integration of distributed renewable energy sources into the electricity grid requires the investigation of new methods to ensure stability. For example, Active Distribution Networks (ADNs) can be used at (sub-) transmission levels for emergency operation, provided robust and efficient control is available. This paper investigates the use of Feasible Operating Regions (FORs) and Flexibility Regions (FRs) for Cross-Voltage-Level Power Flow Control (CPFC). The enhancement of network stability due to the provision of ancillary services is illustrated, as is the need for strengthened cooperation between Transmission (TSOs) and Distribution System Operators (DSOs). Optimal power flow methods are considered, focusing on computational advances through PieceWise Linearization (PWL) and convex relaxation techniques aiming to speed up runtime while keeping high accuracy. To illustrate the algorithms' benefits and drawbacks, they are analyzed using exemplary medium voltage grids.
- [45] arXiv:2405.20261 [pdf, ps, html, other]
-
Title: Speed Profile Definition for GLOSA Implementation on Buses Based on Statistical Analysis of Experimental DataSubjects: Systems and Control (eess.SY)
Intelligent Transportation Systems (ITS) are driving increasing interest in, and development of, eco-driving systems. In this framework, this paper presents a method to define speed profiles specifically designed for Green Light Optimal Speed Advisory (GLOSA) systems on buses. GLOSA aims to optimize traffic flow by providing vehicles with real-time speed recommendations synchronized with traffic signal timings. Leveraging statistical analysis of experimental data collected from an urban bus, the study develops a methodology to extract meaningful insights into bus behaviour and traffic dynamics. The proposed approach considers road topology, scheduled bus stops, and signal timings to define simple yet suitable speed profiles that account for the peculiarities of the motion of a bus in an urban scenario. Through extensive data collection, robust statistical data are obtained, allowing the definition of vehicle motion profiles for effectively developing and implementing GLOSA systems. This research contributes to the advancement of Intelligent Transportation Systems by providing realistic data and practical insights for optimizing bus operations in urban environments.
New submissions for Friday, 31 May 2024 (showing 45 of 45 entries)
- [46] arXiv:2405.19342 (cross-list from cs.SD) [pdf, ps, html, other]
-
Title: Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice AssistantsChloé Sekkat, Fanny Leroy, Salima Mdhaffar, Blake Perry Smith, Yannick Estève, Joseph Dureau, Alice CouckeSubjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Recent works demonstrate that voice assistants do not perform equally well for everyone, but research on demographic robustness of speech technologies is still scarce. This is mainly due to the rarity of large datasets with controlled demographic tags. This paper introduces the Sonos Voice Control Bias Assessment Dataset, an open dataset composed of voice assistant requests for North American English in the music domain (1,038 speakers, 166 hours, 170k audio samples, with 9,040 unique labelled transcripts) with a controlled demographic diversity (gender, age, dialectal region and ethnicity). We also release a statistical demographic bias assessment methodology, at the univariate and multivariate levels, tailored to this specific use case and leveraging spoken language understanding metrics rather than transcription accuracy, which we believe is a better proxy for user experience. To demonstrate the capabilities of this dataset and statistical method to detect demographic bias, we consider a pair of state-of-the-art Automatic Speech Recognition and Spoken Language Understanding models. Results show statistically significant differences in performance across age, dialectal region and ethnicity. Multivariate tests are crucial to shed light on mixed effects between dialectal region, gender and age.
- [47] arXiv:2405.19343 (cross-list from cs.SD) [pdf, ps, html, other]
-
Title: Luganda Speech Intent Recognition for IoT ApplicationsComments: Presented as a conference paper at ICLR 2024/AfricaNLPSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
The advent of Internet of Things (IoT) technology has generated massive interest in voice-controlled smart homes. While many voice-controlled smart home systems are designed to understand and support widely spoken languages like English, speakers of low-resource languages like Luganda may need more support. This research project aimed to develop a Luganda speech intent classification system for IoT applications to integrate local languages into smart home environments. The project uses hardware components such as Raspberry Pi, Wio Terminal, and ESP32 nodes as microcontrollers. The Raspberry Pi processes Luganda voice commands, the Wio Terminal is a display device, and the ESP32 nodes control the IoT devices. The ultimate objective of this work was to enable voice control using Luganda, which was accomplished through a natural language processing (NLP) model deployed on the Raspberry Pi. The NLP model utilized Mel Frequency Cepstral Coefficients (MFCCs) as acoustic features and a Convolutional Neural Network (Conv2D) architecture for speech intent classification. A dataset of Luganda voice commands was curated for this purpose and this has been made open-source. This work addresses the localization challenges and linguistic diversity in IoT applications by incorporating Luganda voice commands, enabling users to interact with smart home devices without English proficiency, especially in regions where local languages are predominant.
- [48] arXiv:2405.19380 (cross-list from stat.ML) [pdf, ps, other]
-
Title: Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ RegretComments: 61 pages, 6 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
We propose an approximate Thompson sampling algorithm that learns linear quadratic regulators (LQR) with an improved Bayesian regret bound of $O(\sqrt{T})$. Our method leverages Langevin dynamics with a meticulously designed preconditioner as well as a simple excitation mechanism. We show that the excitation signal induces the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process. Moreover, we identify nontrivial concentration properties of the approximate posteriors generated by our algorithm. These properties enable us to bound the moments of the system state and attain an $O(\sqrt{T})$ regret bound without the unrealistic restrictive assumptions on parameter sets that are often used in the literature.
- [49] arXiv:2405.19426 (cross-list from cs.CL) [pdf, ps, html, other]
-
Title: Deep Learning for Assessment of Oral Reading FluencySubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Reading fluency assessment is a critical component of literacy programmes, serving to guide and monitor early education interventions. Given the resource-intensive nature of the exercise when conducted by teachers, the development of automatic tools that can operate on audio recordings of oral reading is attractive as an objective and highly scalable solution. Multiple complex aspects such as accuracy, rate and expressiveness underlie human judgements of reading fluency. In this work, we investigate end-to-end modeling on a training dataset of children's audio recordings of story texts labeled by human experts. The pre-trained wav2vec2.0 model is adopted due to its potential to alleviate the challenges from the limited amount of labeled data. We report the performance of a number of system variations on the relevant measures, and also probe the learned embeddings for lexical and acoustic-prosodic features known to be important to the perception of reading fluency.
- [50] arXiv:2405.19450 (cross-list from cs.CV) [pdf, ps, html, other]
-
Title: FourierMamba: Fourier Learning Integration with State Space Models for Image DerainingSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Image deraining aims to remove rain streaks from rainy images and restore clear backgrounds. Recent research employing the Fourier transform has proved effective for image deraining, because it acts as an effective frequency prior for capturing rain streaks. However, although low and high frequencies in images are interdependent, these Fourier-based methods rarely exploit the correlation between different frequencies to couple their learning procedures, limiting the full utilization of frequency information for image deraining. Meanwhile, the recently emerged Mamba technique has shown effectiveness and efficiency in modeling correlation in various domains (e.g., spatial, temporal), and we argue that introducing Mamba into the unexplored Fourier space to correlate different frequencies would help improve image deraining. This motivates us to propose a new framework termed FourierMamba, which performs image deraining with Mamba in the Fourier space. Owing to the unique arrangement of frequency orders in Fourier space, the core of FourierMamba lies in the scanning encoding of different frequencies, where the low-to-high frequency ordering differs between the spatial dimension (not arranged along an axis) and the channel dimension (arranged along an axis). Therefore, we design FourierMamba to correlate Fourier space information in the spatial and channel dimensions with distinct designs. Specifically, in the spatial-dimension Fourier space, we introduce zigzag coding to scan the frequencies and rearrange them from low to high, thereby correlating the frequencies in order; in the channel-dimension Fourier space, where the frequencies are already ordered along the axis, we directly use Mamba to perform frequency correlation and improve the channel information representation.
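To make the idea of a zigzag scan concrete, here is a minimal sketch of the classic JPEG-style zigzag traversal, which flattens a 2D grid so that entries near the (0, 0) corner come first. It assumes a grid whose lowest frequency sits at index (0, 0) and whose frequency grows with row + column (as in a DCT block); the scan actually used by FourierMamba may differ, and the function names are hypothetical.

```python
def zigzag_indices(rows, cols):
    """Return (row, col) pairs in JPEG-style zigzag order, starting at (0, 0)."""
    order = []
    for s in range(rows + cols - 1):
        # All cells on anti-diagonal s satisfy row + col == s.
        diagonal = [(r, s - r) for r in range(rows) if 0 <= s - r < cols]
        # Alternate the traversal direction on successive anti-diagonals.
        if s % 2 == 0:
            diagonal.reverse()
        order.extend(diagonal)
    return order


def zigzag_flatten(grid):
    """Flatten a 2D list into a 1D sequence following the zigzag order."""
    rows, cols = len(grid), len(grid[0])
    return [grid[r][c] for r, c in zigzag_indices(rows, cols)]


sequence = zigzag_flatten([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# sequence == [1, 2, 4, 7, 5, 3, 6, 8, 9]
```

The resulting 1D sequence is roughly ordered from low to high frequency, which is the kind of input a sequence model can scan in order.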
- [51] arXiv:2405.19513 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Decentralized Optimization in Time-Varying Networks with Arbitrary DelaysComments: arXiv admin note: text overlap with arXiv:2401.11344Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
We consider a decentralized optimization problem for networks affected by communication delays. Examples of such networks include collaborative machine learning, sensor networks, and multi-agent systems. To mimic communication delays, we add virtual non-computing nodes to the network, resulting in directed graphs. This motivates investigating decentralized optimization solutions on directed graphs. Existing solutions assume nodes know their out-degrees, resulting in limited applicability. To overcome this limitation, we introduce a novel gossip-based algorithm, called DT-GO, that does not need to know the out-degrees. The algorithm is applicable in general directed networks, for example networks with delays or limited acknowledgment capabilities. We derive convergence rates for both convex and non-convex objectives, showing that our algorithm achieves the same complexity order as centralized Stochastic Gradient Descent. In other words, the effects of the graph topology and delays are confined to higher-order terms. Additionally, we extend our analysis to accommodate time-varying network topologies. Numerical simulations are provided to support our theoretical findings.
- [52] arXiv:2405.19546 (cross-list from physics.ao-ph) [pdf, ps, html, other]
-
Title: Convex Optimization of Initial Perturbations toward Quantitative Weather ControlComments: submitted to Geophysical Research LettersSubjects: Atmospheric and Oceanic Physics (physics.ao-ph); Systems and Control (eess.SY); Optimization and Control (math.OC)
We propose a convex optimization approach to determine perturbations in the initial conditions of a weather phenomenon as control inputs for quantitative weather control. We first construct a sensitivity matrix of outputs, such as accumulated precipitation, to the initial conditions, such as temperature and humidity, through sensitivity analysis of a numerical weather prediction model. We then solve a convex optimization problem to find optimal perturbations in the initial conditions to realize the desired spatial distribution of the targeting outputs. We implement the proposed method in a benchmark of a warm bubble experiment and show that it realizes desired spatial distributions of accumulated precipitation, such as a reference distribution and the reduced maximum value.
- [53] arXiv:2405.19653 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: SysCaps: Language Interfaces for Simulation Surrogates of Complex SystemsComments: 17 pages. Under reviewSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Systems and Control (eess.SY)
Data-driven simulation surrogates help computational scientists study complex systems. They can also help inform impactful policy decisions. We introduce a learning framework for surrogate modeling where language is used to interface with the underlying system being simulated. We call a language description of a system a "system caption", or SysCap. To address the lack of datasets of paired natural language SysCaps and simulation runs, we use large language models (LLMs) to synthesize high-quality captions. Using our framework, we train multimodal text and timeseries regression models for two real-world simulators of complex energy systems. Our experiments demonstrate the feasibility of designing language interfaces for real-world surrogate models at comparable accuracy to standard baselines. We qualitatively and quantitatively show that SysCaps unlock text-prompt-style surrogate modeling and new generalization abilities beyond what was previously possible. We will release the generated SysCaps datasets and our code to support follow-on studies.
- [54] arXiv:2405.19659 (cross-list from cs.CV) [pdf, ps, html, other]
-
Title: CSANet: Channel Spatial Attention Network for Robust 3D Face Alignment and ReconstructionComments: 10 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Our project proposes an end-to-end 3D face alignment and reconstruction network. The backbone of our model is built from bottleneck structures using depth-wise separable convolution. We integrate the Coordinate Attention mechanism and Spatial Group-wise Enhancement to extract more representative features. For a more stable training process and better convergence, we jointly use Wing loss and the Weighted Parameter Distance Cost to learn parameters for the 3D Morphable model and 3D vertices. Our proposed model outperforms all baseline models both quantitatively and qualitatively.
- [55] arXiv:2405.19771 (cross-list from cs.NI) [pdf, ps, html, other]
-
Title: Data Service Maximization in Integrated Terrestrial-Non-Terrestrial 6G Networks: A Deep Reinforcement Learning ApproachComments: 5 pages, 4 figuresSubjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Integrating terrestrial and non-terrestrial networks has emerged as a promising paradigm to fulfill the constantly growing demand for connectivity, low transmission delay, and quality of service (QoS). This integration brings together the strengths of both, such as the reliability of terrestrial networks and the broad coverage and service continuity of non-terrestrial networks like low earth orbit (LEO) satellites. In this work, we study a data service maximization problem in an integrated terrestrial-non-terrestrial network (I-TNT) where the ground base stations (GBSs) and LEO satellites cooperatively serve the coexisting aerial users (AUs) and ground users (GUs). Then, by considering the spectrum scarcity, interference, and QoS requirements of the users, we jointly optimize the user association, AUE's trajectory, and power allocation. To tackle the formulated mixed-integer non-convex problem, we decompose it into two subproblems: 1) a user association problem and 2) a trajectory and power allocation problem. Since the user association problem is a binary integer programming problem, we use the standard convex optimization method to solve it. Meanwhile, the trajectory and power allocation problem is solved by the deep deterministic policy gradient (DDPG) method to cope with the problem's non-convexity and dynamic network environments. Then, the two subproblems are alternately solved by the proposed iterative algorithm. Extensive simulations are conducted to evaluate the performance of the proposed framework against baselines from the existing literature.
- [56] arXiv:2405.19796 (cross-list from cs.SD) [pdf, ps, html, other]
-
Title: Explainable Attribute-Based Speaker VerificationSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
This paper proposes a fully explainable approach to speaker verification (SV), a task that fundamentally relies on individual speaker characteristics. The opaque use of speaker attributes in current SV systems raises concerns of trust. Addressing this, we propose an attribute-based explainable SV system that identifies speakers by comparing personal attributes such as gender, nationality, and age extracted automatically from voice recordings. We believe this approach better aligns with human reasoning, making it more understandable than traditional methods. Evaluated on the Voxceleb1 test set, the best performance of our system is comparable with the ground truth established when using all correct attributes, proving its efficacy. Whilst our approach sacrifices some performance compared to non-explainable methods, we believe that it moves us closer to the goal of transparent, interpretable AI and lays the groundwork for future enhancements through attribute expansion.
- [57] arXiv:2405.20045 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Iterative Learning Control of Fast, Nonlinear, Oscillatory Dynamics (Preprint)Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
The sudden onset of deleterious and oscillatory dynamics (often called instabilities) is a known challenge in many fluid, plasma, and aerospace systems. These dynamics are difficult to address because they are nonlinear, chaotic, and often too fast for active control schemes. In this work, we develop an alternative active control system using an iterative, trajectory-optimization and parameter-tuning approach based on Iterative Learning Control (ILC), Time-Lagged Phase Portraits (TLPP) and Gaussian Process Regression (GPR). The novelty of this approach is that it can control a system's dynamics despite the controller being much slower than the dynamics. We demonstrate this controller on the Lorenz system of equations where it iteratively adjusts (tunes) the system's input parameters to successfully reproduce a desired oscillatory trajectory or state. Additionally, we investigate the system's dynamical sensitivity to its control parameters, identify continuous and bounded regions of desired dynamical trajectories, and demonstrate that the controller is robust to missing information and uncontrollable parameters as long as certain requirements are met. The controller presented in this work provides a framework for low-speed control for a variety of fast, nonlinear systems that may aid in instability suppression and mitigation.
- [58] arXiv:2405.20059 (cross-list from cs.SD) [pdf, ps, html, other]
-
Title: Spectral Mapping of Singing Voices: U-Net-Assisted Vocal SegmentationSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Separating vocal elements from musical tracks is a longstanding challenge in audio signal processing. This study tackles the distinct separation of vocal components from musical spectrograms. We employ the Short Time Fourier Transform (STFT) to convert audio waveforms into detailed frequency-time spectrograms, utilizing the benchmark MUSDB18 dataset for music separation. Subsequently, we implement a UNet neural network to segment the spectrogram image, aiming to delineate and extract singing voice components accurately. We achieved noteworthy results in audio source separation using our U-Net-based models. The combination of frequency-axis normalization with Min/Max scaling and the Mean Absolute Error (MAE) loss function achieved the highest Source-to-Distortion Ratio (SDR) of 7.1 dB, indicating a high level of accuracy in preserving the quality of the original signal during separation. This setup also recorded impressive Source-to-Interference Ratio (SIR) and Source-to-Artifact Ratio (SAR) scores of 25.2 dB and 7.2 dB, respectively. These values significantly outperformed other configurations, particularly those using Quantile-based normalization or a Mean Squared Error (MSE) loss function. Our source code, model weights, and demo material can be found at the project's GitHub repository: this https URL
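The best-performing configuration above pairs Min/Max scaling along the frequency axis with an MAE loss. A minimal sketch of both pieces follows, under the assumption that "frequency-axis normalization" means scaling each frequency bin independently over time; the abstract does not spell this out, so the interpretation, the function names, and the list-of-rows layout are all hypothetical.

```python
def minmax_normalize_per_frequency(spectrogram):
    """Scale each frequency row of a magnitude spectrogram to [0, 1].

    spectrogram: list of rows, one row per frequency bin, one value per time frame.
    Assumed interpretation of frequency-axis Min/Max scaling, not the authors' code.
    """
    normalized = []
    for row in spectrogram:
        lo, hi = min(row), max(row)
        # Guard against constant rows, which would otherwise divide by zero.
        scale = (hi - lo) if hi > lo else 1.0
        normalized.append([(x - lo) / scale for x in row])
    return normalized


def mean_absolute_error(pred, target):
    """Mean absolute error over all time-frequency cells."""
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += abs(p - t)
            count += 1
    return total / count


normalized = minmax_normalize_per_frequency([[0.0, 2.0, 4.0], [3.0, 3.0, 3.0]])
loss = mean_absolute_error(normalized, [[0.0, 0.5, 0.5], [0.0, 0.0, 0.0]])
```

Compared with a squared-error loss, MAE penalizes large deviations less heavily, which may explain part of its reported advantage here, although the abstract does not state the reason.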
- [59] arXiv:2405.20073 (cross-list from cs.IT) [pdf, ps, html, other]
-
Title: Power Allocation for Cell-Free Massive MIMO ISAC Systems with OTFS SignalComments: This work is submitted to IEEE for possible publicationSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Applying integrated sensing and communication (ISAC) to a cell-free massive multiple-input multiple-output (CF mMIMO) architecture has attracted increasing attention. This approach equips CF mMIMO networks with sensing capabilities and resolves the problem of unreliable service at cell edges in conventional cellular networks. However, existing studies on CF-ISAC systems have focused on the application of traditional integrated signals. To address this limitation, this study explores the employment of the orthogonal time frequency space (OTFS) signal as a representative of innovative signals in the CF-ISAC system, and the system's overall performance is optimized and evaluated. A universal downlink spectral efficiency (SE) expression is derived regarding multi-antenna access points (APs) and optional sensing beams. To streamline the analysis and optimization of the CF-ISAC system with the OTFS signal, we introduce a lower bound on the achievable SE that is applicable to OTFS-signal-based systems. Based on this, a power allocation algorithm is proposed to maximize the minimum communication signal-to-interference-plus-noise ratio (SINR) of users while guaranteeing a specified sensing SINR value and meeting the per-AP power constraints. The results demonstrate the tightness of the proposed lower bound and the efficiency of the proposed algorithm. Finally, the superiority of using the OTFS signals is verified by a 13-fold expansion of the SE performance gap over the application of orthogonal frequency division multiplexing signals. These findings could guide the future deployment of the CF-ISAC systems, particularly in the field of millimeter waves with a large bandwidth.
- [60] arXiv:2405.20101 (cross-list from cs.SD) [pdf, ps, html, other]
-
Title: Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech InpaintingSubjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Most speech self-supervised learning (SSL) models are trained with a pretext task that consists of predicting missing parts of the input signal, either future segments (causal prediction) or segments masked anywhere within the input (non-causal prediction). Learned speech representations can then be efficiently transferred to downstream tasks (e.g., automatic speech or speaker recognition). In the present study, we investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context, i.e., fulfilling a downstream task that is very similar to the pretext task. To that end, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder. In particular, we propose two solutions to match the HuBERT output with the HiFiGAN input, by freezing one and fine-tuning the other, and vice versa. Performance of both approaches was assessed in single- and multi-speaker settings, for both informed and blind inpainting configurations (i.e., the position of the mask is known or unknown, respectively), with different objective metrics and a perceptual evaluation. Results show that while both solutions correctly reconstruct signal portions up to 200 ms (and even 400 ms in some cases), fine-tuning the SSL encoder provides a more accurate signal reconstruction in the single-speaker setting, whereas freezing it (and training the neural vocoder instead) is a better strategy when dealing with multi-speaker data.
- [61] arXiv:2405.20118 (cross-list from cs.RO) [pdf, ps, html, other]
-
Title: Assistance-Seeking in Human-Supervised Autonomy: Role of Trust and Secondary Task Engagement (Extended Version)Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Using a dual-task paradigm, we explore how robot actions, performance, and the introduction of a secondary task influence human trust and engagement. In our study, a human supervisor simultaneously engages in a target-tracking task while supervising a mobile manipulator performing an object collection task. The robot can either autonomously collect the object or ask for human assistance. The human supervisor also has the choice to rely upon or interrupt the robot. Using data from initial experiments, we model the dynamics of human trust and engagement using a linear dynamical system (LDS). Furthermore, we develop a human action model to define the probability of human reliance on the robot. Our model suggests that participants are more likely to interrupt the robot when their trust and engagement are low during high-complexity collection tasks. Using Model Predictive Control (MPC), we design an optimal assistance-seeking policy. Evaluation experiments demonstrate the superior performance of the MPC policy over the baseline policy for most participants.
- [62] arXiv:2405.20161 (cross-list from cs.CV) [pdf, ps, html, other]
-
Title: Landslide mapping from Sentinel-2 imagery through change detectionComments: to be published in IEEE IGARSS 2024 conference proceedingsSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Landslides are one of the most critical and destructive geohazards. Widespread development of human activities and settlements, combined with the effects of climate change on weather, is resulting in a sharp increase in the frequency and destructive power of landslides, making them a major threat to human life and the economy. In this paper, we explore methodologies to automatically map newly occurred landslides using Sentinel-2 imagery. All approaches presented are framed as a bi-temporal change detection problem, requiring only a pair of Sentinel-2 images, taken respectively before and after a landslide-triggering event. Furthermore, we introduce a novel deep learning architecture for fusing Sentinel-2 bi-temporal image pairs with Digital Elevation Model (DEM) data, showcasing its promising performance relative to other change detection models in the literature. As a parallel task, we address limitations in existing datasets by creating a novel geodatabase, which includes manually validated open-access landslide inventories over heterogeneous ecoregions of the world. We release both code and dataset with an open-source license.
- [63] arXiv:2405.20172 (cross-list from cs.SD) [pdf, ps, html, other]
-
Title: Iterative Feature Boosting for Explainable Speech Emotion RecognitionComments: Published in: 2023 International Conference on Machine Learning and Applications (ICMLA)Journal-ref: 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 2023, pp. 543-549Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
In speech emotion recognition (SER), using predefined features without considering their practical importance may lead to high-dimensional datasets that include redundant and irrelevant information. Consequently, high-dimensional learning often results in decreasing model accuracy while increasing computational complexity. Our work underlines the importance of carefully considering and analyzing features in order to build efficient SER systems. We present a new supervised SER method based on an efficient feature engineering approach. We pay particular attention to the explainability of results to evaluate feature relevance and refine feature sets. This is performed iteratively through a feature evaluation loop, using Shapley values to boost feature selection and improve overall framework performance. Our approach thus balances model performance and transparency. The proposed method outperforms human-level performance (HLP) and state-of-the-art machine learning methods in emotion recognition on the TESS dataset.
- [64] arXiv:2405.20209 (cross-list from math.OC) [pdf, ps, html, other]
-
Title: Lasso-based state estimation for cyber-physical systems under sensor attacksComments: © 2024 the authors. This work has been accepted to IFAC for publication under a Creative Commons Licence CC-BY-NC-NDSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
The development of algorithms for secure state estimation in vulnerable cyber-physical systems has been gaining attention in recent years. A consolidated assumption is that an adversary can tamper with a relatively small number of sensors. In the literature, block-sparsity methods exploit this prior information to recover the attack locations and the state of the system.
In this paper, we propose an alternative, Lasso-based approach and we analyse its effectiveness. In particular, we theoretically derive conditions that guarantee successful attack/state recovery, independently of established time sparsity patterns. Furthermore, we develop a sparse state observer, by starting from the iterative soft thresholding algorithm for Lasso, to perform online estimation.
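As a rough illustration of the iterative soft thresholding algorithm mentioned above, the sketch below solves a generic Lasso problem, min 0.5*||A x - y||^2 + lam*||x||_1, in plain Python. It is not the authors' sparse observer: the function names `soft_threshold` and `ista`, the matrix `A`, and the parameters `lam` and `step` are assumptions made for this example, and the step size is assumed to be no larger than the reciprocal of the largest eigenvalue of A^T A.

```python
def soft_threshold(v, tau):
    """Soft-thresholding operator: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def ista(A, y, lam, step, iters=500):
    """Hypothetical helper: minimise 0.5*||A x - y||^2 + lam*||x||_1 with ISTA.

    A is a list of rows, y a list of measurements. The step size is assumed
    to satisfy step <= 1 / lambda_max(A^T A) so that the iteration converges.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual r = A x - y
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient of the smooth least-squares term: g = A^T r
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by element-wise soft thresholding
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x


if __name__ == "__main__":
    # Toy problem with an identity measurement matrix (illustrative values only).
    A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    y = [3.0, 0.2, -1.0]
    print(ista(A, y, lam=0.5, step=1.0))
```

With an identity matrix the minimiser is simply the soft-thresholded measurement vector, so small entries of `y` are set exactly to zero; this is the sparsity-promoting behaviour that Lasso-based estimators rely on.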
Through several numerical experiments, we compare the proposed methods to the state-of-the-art algorithms.
- [65] arXiv:2405.20279 (cross-list from cs.CV) [pdf, ps, html, other]
-
Title: CV-VAE: A Compatible Video VAE for Latent Generative Video ModelsComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAE), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. The temporal compression is simply realized by uniform frame sampling, which results in unsmooth motion between consecutive frames. Currently, the research community lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. Moreover, since current diffusion-based approaches are often implemented using pre-trained text-to-image (T2I) models, directly training a video VAE without considering the compatibility with existing T2I models will result in a latent space gap between them, and bridging this gap will take huge computational resources for training, even with the T2I models as initialization. To address this issue, we propose a method for training a video VAE of latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE, e.g., the image VAE of Stable Diffusion (SD). The compatibility is achieved by the proposed novel latent space regularization, which involves formulating a regularization loss using the image VAE. Benefiting from the latent space compatibility, video models can be trained seamlessly from pre-trained T2I or video models in a truly spatio-temporally compressed latent space, rather than simply sampling video frames at equal intervals. With our CV-VAE, existing video models can generate four times more frames with minimal finetuning. Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE.
- [66] arXiv:2405.20336 (cross-list from cs.CV) [pdf, ps, html, other]
-
Title: RapVerse: Coherent Vocals and Whole-Body Motions Generations from TextJiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang GanComments: Project website: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information, and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. The project page is available for research purposes at this https URL.
Cross submissions for Friday, 31 May 2024 (showing 21 of 21 entries)
- [67] arXiv:2210.16299 (replaced) [pdf, ps, other]
-
Title: Nonuniqueness and Convergence to Equivalent Solutions in Observer-based Inverse Reinforcement LearningSubjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
A key challenge in solving the deterministic inverse reinforcement learning (IRL) problem online and in real-time is the existence of multiple solutions. Nonuniqueness necessitates the study of the notion of equivalent solutions, i.e., solutions that result in a different cost functional but the same feedback matrix, and of convergence to such solutions. While offline algorithms that result in convergence to equivalent solutions have been developed in the literature, online, real-time techniques that address nonuniqueness are not available. In this paper, a regularized history stack observer that converges to approximately equivalent solutions of the IRL problem is developed. Novel data-richness conditions are developed to facilitate the analysis, and simulation results are provided to demonstrate the effectiveness of the developed technique.
- [68] arXiv:2212.09664 (replaced) [pdf, ps, html, other]
-
Title: Fast Low Rank column-wise Compressive Sensing for Accelerated Dynamic MRIComments: 16 Pages (Including Appendix), 9 Figures, 3 Tables. arXiv admin note: substantial text overlap with arXiv:2206.13618Journal-ref: in IEEE Transactions on Computational Imaging, vol. 9, pp. 409-424, 2023Subjects: Image and Video Processing (eess.IV)
This work develops a novel set of algorithms, alternating Gradient Descent (GD) and minimization for MRI (altGDmin-MRI1 and altGDmin-MRI2), for accelerated dynamic MRI by assuming an approximate low-rank (LR) model on the matrix formed by the vectorized images of the sequence. The LR model itself is well-known in the MRI literature; our contribution is the novel GD-based algorithms which are much faster, memory efficient, and general compared with existing work; and careful use of a 3-level hierarchical LR model. By general, we mean that, with a single choice of parameters, our method provides accurate reconstructions for multiple accelerated dynamic MRI applications, multiple sampling rates and sampling schemes.
We show that our methods outperform many of the popular existing approaches while also being faster than all of them, on average. This claim is based on comparisons on 8 different retrospectively undersampled multi-coil dynamic MRI applications, sampled using either 1D Cartesian or 2D pseudo-radial undersampling, at multiple sampling rates. Evaluations on some prospectively undersampled datasets are also provided. Our second contribution is a mini-batch subspace tracking extension that can process new measurements and return reconstructions within a short delay after they arrive. The recovery algorithm itself is also faster than its batch counterpart.
- [69] arXiv:2305.13910 (replaced) [pdf, ps, html, other]
-
Title: Experimental Assessment of Misalignment Effects in Terahertz CommunicationsComments: 6 pages, 6 figures, conference paperSubjects: Signal Processing (eess.SP)
Terahertz (THz) frequencies are important for next generation wireless systems due to the advantages in terms of large available bandwidths. On the other hand, the limited range due to high attenuation in these frequencies can be overcome via densely installed heterogeneous networks also utilizing UAVs in a three-dimensional hyperspace. Yet THz communications rely on precise beam alignment, which, if not handled properly, results in low signal strength at the receiver, an effect that impacts THz signals more than conventional ones. This work focuses on the importance of precise alignment in THz communication systems. The significant effect of proper alignment is validated through comprehensive measurements conducted with a state-of-the-art measurement setup, which enables accurate data collection between 240 GHz and 300 GHz at varying angles and distances in an anechoic chamber eliminating reflections. By analyzing the channel frequency and impulse responses of these extensive and particular measurements, this study provides the first quantifiable results in terms of measuring the effects of beam misalignment in THz frequencies.
- [70] arXiv:2306.00530 (replaced) [pdf, ps, other]
-
Title: CL-MRI: Self-Supervised Contrastive Learning to Improve the Accuracy of Undersampled MRI ReconstructionSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
In Magnetic Resonance Imaging (MRI), image acquisitions are often undersampled in the measurement domain to accelerate the scanning process, at the expense of image quality. However, image quality is a crucial factor that influences the accuracy of clinical diagnosis; hence, high-quality image reconstruction from undersampled measurements has been a key area of research. Recently, deep learning (DL) methods have emerged as the state-of-the-art for MRI reconstruction, typically involving deep neural networks to transform undersampled MRI images into high-quality MRI images through data-driven processes. Nevertheless, there is clear and significant room for improvement in undersampled DL MRI reconstruction to meet the high standards required for clinical diagnosis, in terms of eliminating aliasing artifacts and reducing image noise. In this paper, we introduce a self-supervised pretraining procedure using contrastive learning to improve the accuracy of undersampled DL MRI reconstruction. We use contrastive learning to transform the MRI image representations into a latent space that maximizes mutual information among different undersampled representations and optimizes the information content at the input of the downstream DL reconstruction models. Our experiments demonstrate improved reconstruction accuracy across a range of acceleration factors and datasets, both quantitatively and qualitatively. Furthermore, our extended experiments validate the proposed framework's robustness under adversarial conditions, such as measurement noise, different k-space sampling patterns, and pathological abnormalities, and also prove the transfer learning capabilities on MRI datasets with completely different anatomy. Additionally, we conducted experiments to visualize and analyze the properties of the proposed MRI contrastive learning latent space.
- [71] arXiv:2306.04821 (replaced) [pdf, ps, html, other]
-
Title: AI-based Identification of Most Critical Cyberattacks in Industrial SystemsSubjects: Systems and Control (eess.SY)
Modern industrial systems face a growing threat from sophisticated cyberattacks that can cause significant operational disruptions. This work presents a novel methodology for identification of the most critical cyberattacks that may disrupt the operation of such a system. Application of the proposed framework can enable the design and development of advanced cybersecurity solutions for a wide range of industrial applications. Attacks are assessed taking into direct consideration how they impact the system operation as measured by a defined Key Performance Indicator (KPI). A simulation model (SM) of the industrial process is employed for calculation of the KPI based on operating conditions. This SM is augmented with a layer of information describing the communication network topology, connected devices, and potential actions an adversary can take based on each device or network link. Each possible action is associated with an abstract measure of effort, which is interpreted as a cost. It is assumed that the adversary has a corresponding budget that constrains the selection of the sequence of actions defining the progression of the attack. A dynamical system comprising a set of states associated with the cyberattack (cyber-states) and transition logic for updating their values is also proposed. The resulting augmented simulation model (ASM) is then employed in an artificial intelligence-based sequential decision-making optimization to yield the most critical cyberattack scenarios as measured by their impact on the defined KPI. The methodology is successfully tested based on an electrical power distribution system use case.
- [72] arXiv:2309.07131 (replaced) [pdf, ps, html, other]
-
Title: Wideband High Gain Metasurface-Based 4T4R MIMO antenna with Highly Isolated Ports for Sub-6 GHz 5G ApplicationsComments: 20 pages, 15 figures, and 3 TablesSubjects: Signal Processing (eess.SP)
This study presents the design of four $178\times178$ $\mathrm{mm}^{2}$ wideband, high gain, highly efficient metasurface-based 4T4R MIMO (Multiple-Input Multiple-Output) antennas with highly isolated ports, covering the middle and a portion of the upper bands of the sub-6 GHz 5G frequency spectrum for 5G-based systems, such as IoT (Internet of Things) applications, vehicular communications (e.g., rooftop antennas of cars or trains), and smart industries (e.g., farms and factories). The radiating elements of these antennas use the aperture-coupled feeding technique with a dumbbell-shaped slot, a truncated square patch with two U-shaped slots, and a metasurface layer. The proposed MIMO structures place four identical radiating elements like a $2\times2$ matrix with $90^\circ$ successive rotations to produce orthogonal electromagnetic waves, improving the isolation between ports. Six-millimeter spaces are added between these elements, and two vertical and horizontal strip slots are carved on the ground as the decoupling structure to decrease the mutual coupling. Simulation results show that Antenna$_{1}$, Antenna$_{2}$, and Antenna$_{3}$ achieve gain values of 6.2 to 9.4 dBi, 8.2 to 11.6 dBi, and 6.2 to 9.5 dBi, isolation below -35, -25, and -33 dB, and almost 10 dB diversity gain from 2.8 to 4.7 GHz, 2.8 to 4.5 GHz, and 2.7 to 4.9 GHz, respectively. As a prototype, Antenna$_{4}$ is manufactured, and measurements are performed. It achieves 6.28 to 10.45 dBi gain values, below -23 dB isolation, and 0.001 envelope correlation coefficient over 2.7 to 4.3 GHz. The results confirm that the proposed MIMO antennas are compatible with the 5G essential requisites.
- [73] arXiv:2402.04395 (replaced) [pdf, ps, other]
-
Title: Auto-Encoder Optimized PAM IM/DD Transceivers for Amplified Fiber LinksAmir Omidi, Mai Banawan, Erwan Weckenmann, Benoit Paquin, Alireza Geravand, Zibo Zheng, Wei Shi, Ming Zeng, Leslie A. RuschComments: 9 pages and 13 figuresSubjects: Signal Processing (eess.SP)
We examine pulse amplitude modulation (PAM) for intensity modulation and direct detection systems. Using a straightforward, mixed noise model, we optimize the constellations with an autoencoder-based neural network (NN), improving the required signal-to-noise ratio by 4 dB for amplified spontaneous emission (ASE)-limited PAM4 and PAM8, without increasing system complexity. Performance can also be improved in an O-band wavelength division multiplexing system with semiconductor optical amplifier amplification and chromatic dispersion. We show via simulation that for such a system operating at 53 Gbaud, we can extend the reach of PAM4 by 10-25 km with an optimized constellation and an NN decoder. We present an experimental validation of 4 dB improvement of an ASE-limited PAM4 at 60 Gbaud using an optimized constellation and an NN decoder.
- [74] arXiv:2402.06875 (replaced) [pdf, ps, html, other]
-
Title: Disentangled Latent Energy-Based Style Translation: An Image-Level Structural MRI Harmonization FrameworkSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Brain magnetic resonance imaging (MRI) has been extensively employed across clinical and research fields, but often exhibits sensitivity to site effects arising from non-biological variations such as differences in field strength and scanner vendors. Numerous retrospective MRI harmonization techniques have demonstrated encouraging outcomes in reducing the site effects at the image level. However, existing methods generally suffer from high computational requirements and limited generalizability, restricting their applicability to unseen MRIs. In this paper, we design a novel disentangled latent energy-based style translation (DLEST) framework for unpaired image-level MRI harmonization, consisting of (a) site-invariant image generation (SIG), (b) site-specific style translation (SST), and (c) site-specific MRI synthesis (SMS). Specifically, the SIG employs a latent autoencoder to encode MRIs into a low-dimensional latent space and reconstruct MRIs from latent codes. The SST utilizes an energy-based model to comprehend the global latent distribution of a target domain and translate source latent codes toward the target domain, while SMS enables MRI synthesis with a target-specific style. By disentangling image generation and style translation in latent space, the DLEST can achieve efficient style translation. Our model was trained on T1-weighted MRIs from a public dataset (with 3,984 subjects across 58 acquisition sites/settings) and validated on an independent dataset (with 9 traveling subjects scanned in 11 sites/settings) in four tasks: histogram and feature visualization, site classification, brain tissue segmentation, and site-specific structural MRI synthesis. Qualitative and quantitative results demonstrate the superiority of our method over several state-of-the-art methods.
- [75] arXiv:2403.02565 (replaced) [pdf, ps, html, other]
-
Title: Deep Cooperation in ISAC System: Resource, Node and Infrastructure PerspectivesComments: 8 pages and 6 figures, Accepted by IEEE Internet of Things MagazineSubjects: Signal Processing (eess.SP)
With the emerging Integrated Sensing and Communication (ISAC) technique, exploiting the mobile communication system with multi-domain resources, multiple network elements, and large-scale infrastructures to realize cooperative sensing is a crucial approach to satisfying the requirements of high-accuracy and large-scale sensing in IoE. In this article, deep cooperation in ISAC systems is investigated from three perspectives. In the microscopic perspective, namely, within a single node, the sensing information carried by time-frequency-space-code domain resources is processed with operations such as phase compensation and coherent accumulation, thereby improving the sensing accuracy. In the mesoscopic perspective, the sensing accuracy could be improved through the cooperation of multiple nodes. We explore various multi-node cooperative sensing scenarios and present the corresponding challenges and future research trends. In the macroscopic perspective, the massive number of infrastructures from the same operator or different operators could perform cooperative sensing to extend the sensing coverage and improve the sensing continuity. We investigate network architecture, target tracking methods, and the large-scale sensing assisted digital twin construction. Simulation results demonstrate the superiority of multi-node and multi-resource cooperative sensing over single-resource or single-node sensing. This article may provide a deep and comprehensive view of cooperative sensing in ISAC systems to enhance the performance of sensing, supporting the applications of IoE.
- [76] arXiv:2403.05955 (replaced) [pdf, ps, html, other]
-
Title: IOI: Invisible One-Iteration Adversarial Attack on No-Reference Image- and Video-Quality MetricsComments: Accepted to ICML 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
No-reference image- and video-quality metrics are widely used in video processing benchmarks. The robustness of learning-based metrics under video attacks has not been widely studied. In addition to being successful, attacks that can be employed in video processing benchmarks must be fast and imperceptible. This paper introduces an Invisible One-Iteration (IOI) adversarial attack on no-reference image- and video-quality metrics. We compared our method with eight prior approaches using image and video datasets via objective and subjective tests. Our method exhibited superior visual quality across various attacked metric architectures while maintaining comparable attack success and speed. We made the code available on GitHub: this https URL.
- [77] arXiv:2403.11883 (replaced) [pdf, ps, html, other]
-
Title: Data-Enabled Predictive Iterative ControlSubjects: Systems and Control (eess.SY)
This work introduces the Data-Enabled Predictive iteRative Control (DeePRC) algorithm, a direct data-driven approach for iterative LTI systems. The DeePRC learns from previous iterations to improve its performance and achieves the optimal cost. By utilizing a tube-based variation of the DeePRC scheme, we propose a two-stage approach that enables safe active exploration using a left-kernel-based input disturbance design. This method generates informative trajectories to enrich the historical data, which extends the maximum achievable prediction horizon and leads to faster iteration convergence. In addition, we present an end-to-end formulation of the two-stage approach, integrating the disturbance design procedure into the planning phase. We showcase the effectiveness of the proposed algorithms on a numerical experiment.
- [78] arXiv:2403.17902 (replaced) [pdf, ps, html, other]
-
Title: Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space ModelsComments: 12 pages, 7 figures, under reviewSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The landscape of computational building blocks of efficient image restoration architectures is dominated by a combination of convolutional processing and various attention mechanisms. However, convolutional filters, while efficient, are inherently local and therefore struggle with modeling long-range dependencies in images. In contrast, attention excels at capturing global interactions between arbitrary image regions, but suffers from a quadratic cost in image dimension. In this work, we propose Serpent, an efficient architecture for high-resolution image restoration that combines recent advances in state space models (SSMs) with multi-scale signal processing in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field with a favorable linear scaling in input size. We propose a novel hierarchical architecture, inspired by traditional signal processing principles, that converts the input image into a collection of sequences and processes them in a multi-scale fashion. Our experimental results demonstrate that Serpent can achieve reconstruction quality on par with state-of-the-art techniques, while requiring orders of magnitude less compute (up to a $150$-fold reduction in FLOPS) and a factor of up to $5\times$ less GPU memory while maintaining a compact model size. The efficiency gains achieved by Serpent are especially notable at high image resolutions.
- [79] arXiv:2404.09385 (replaced) [pdf, ps, html, other]
-
Title: A Large-Scale Evaluation of Speech Foundation ModelsShu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi LeeComments: The extended journal version for SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The Arxiv version is preferredSubjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Signal Processing (eess.SP)
The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows for result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models, the correctness of the weighted-sum benchmarking protocol and the statistical significance and robustness of the benchmark.
- [80] arXiv:2405.14472 (replaced) [pdf, ps, html, other]
-
Title: SolNet: Open-source deep learning models for photovoltaic power forecasting across the globeComments: 24 pages, 5 figuresSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Deep learning models have gained increasing prominence in recent years in the field of solar photovoltaic (PV) forecasting. One drawback of these models is that they require a lot of high-quality data to perform well. This is often infeasible in practice, due to poor measurement infrastructure in legacy systems and the rapid build-up of new solar systems across the world. This paper proposes SolNet: a novel, general-purpose, multivariate solar power forecaster, which addresses these challenges by using a two-step forecasting pipeline which incorporates transfer learning from abundant synthetic data generated from PVGIS, before fine-tuning on observational data. Using actual production data from hundreds of sites in the Netherlands, Australia and Belgium, we show that SolNet improves forecasting performance over data-scarce settings as well as baseline models. We find transfer learning benefits to be the strongest when only limited observational data is available. At the same time we provide several guidelines and considerations for transfer learning practitioners, as our results show that weather data, seasonal patterns, amount of synthetic data and possible mis-specification in source location, can have a major impact on the results. The SolNet models created in this way are applicable for any land-based solar photovoltaic system across the planet where simulated and observed data can be combined to obtain improved forecasting capabilities.
- [81] arXiv:2405.18775 (replaced) [pdf, ps, html, other]
-
Title: Synchronization Scheme based on Pilot Sharing in Cell-Free Massive MIMO SystemsQihao Peng, Hong Ren, Zhendong Peng, Cunhua Pan, Maged Elkashlan, Dongming Wang, Jiangzhou Wang, Xiaohu YouComments: Submitted to IEEE Journal for posSubjects: Signal Processing (eess.SP)
This paper analyzes the impact of a pilot-sharing scheme on synchronization performance in a scenario where several slave access points (APs) with uncertain carrier frequency offsets (CFOs) and timing offsets (TOs) share a common pilot sequence. First, the Cramer-Rao bound (CRB) with pilot contamination is derived for pilot-pairing estimation. Furthermore, a maximum likelihood algorithm is presented to estimate the CFO and TO among the pairing APs. Then, to minimize the sum of CRBs, we devise a synchronization strategy based on a pilot-sharing scheme by jointly optimizing the cluster classification, synchronization overhead, and pilot-sharing scheme, while simultaneously considering the overhead and each AP's synchronization requirements. To solve this NP-hard problem, we simplify it into two sub-problems, namely the cluster classification problem and the pilot-sharing problem. To strike a balance between synchronization performance and overhead, we first classify the clusters by using the K-means algorithm, and propose a criterion to find a good set of master APs. Then, the pilot-sharing scheme is obtained by using the swap-matching operations. Simulation results validate the accuracy of our derivations and demonstrate the effectiveness of the proposed scheme over the benchmark schemes.
- [82] arXiv:2206.01312 (replaced) [pdf, ps, other]
-
Title: Optimization of Energy-Constrained IRS-NOMA Using a Complex Circle Manifold ApproachSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
This work investigates the performance of intelligent reflective surface (IRS)-assisted uplink non-orthogonal multiple access (NOMA) in energy-constrained networks. Specifically, we formulate and solve two optimization problems; the first aims at minimizing the sum of users' transmit power, while the second targets maximizing the system-level energy efficiency (EE). The two problems are solved by jointly optimizing the users' transmit powers and the beamforming coefficients at the IRS subject to the users' individual uplink rate and transmit power constraints. A novel and low-complexity algorithm is developed to optimize the IRS beamforming coefficients by optimizing the objective function over a complex circle manifold (CCM). To efficiently optimize the IRS phase shifts over the manifold, the optimization problem is reformulated into a feasibility expansion problem, which is reduced to a max-min signal-to-interference-plus-noise ratio (SINR) problem. Then, with the aid of a smoothing technique, the exact penalty method is applied to transform the problem from constrained to unconstrained. The proposed solution is compared against three semi-definite programming (SDP)-based benchmarks, which are semi-definite relaxation (SDR), SDP-difference of convex (SDP-DC) and sequential rank-one constraint relaxation (SROCR). The results show that the manifold algorithm provides better performance than the SDP-based benchmarks, and at a much lower computational complexity for both the transmit power minimization and EE maximization problems. The results also reveal that IRS-NOMA is only superior to orthogonal multiple access (OMA) when the users' target achievable rate requirements are relatively high.
- [83] arXiv:2305.07926 (replaced) [pdf, ps, other]
-
Title: Characteristic time of transient response of solid oxide cells (SOCs) to changes in voltage/current: from theory to applications
Journal-ref: Nat Commun 15, 4587 (2024)
Subjects: Fluid Dynamics (physics.flu-dyn); Systems and Control (eess.SY)
The intermittency of solar and wind power can be addressed by integrating them with Solid Oxide Cells (SOCs). This study delves into the transient characteristics of SOCs and their dependence on dynamic heat and mass transfer processes. Non-dimensional analysis was used to identify influential parameters, followed by a 3-D numerical simulation-based parametric analysis to examine the dynamic gaseous and thermal responses of SOCs with varying dimensions, material properties, and operating conditions. For the first time, we proposed characteristic times to describe the relationship between SOC transients and multiple parameters. These characteristic times represent the overall heat and mass transfer rates in SOCs. Their effectiveness was validated against literature and demonstrated potential in characterizing the transient characteristics of other electrochemical cells. In addition, two examples are provided to illustrate how the characteristic times facilitate SOC design and control at minimal computational cost.
- [84] arXiv:2305.15595 (replaced) [pdf, ps, html, other]
-
Title: Time-Varying Convex Optimization: A Contraction and Equilibrium Tracking Approach
Subjects: Optimization and Control (math.OC); Signal Processing (eess.SP); Systems and Control (eess.SY)
In this article, we provide a novel and broadly applicable contraction-theoretic approach to continuous-time time-varying convex optimization. For any parameter-dependent contracting dynamics, we show that the tracking error is asymptotically proportional to the rate of change of the parameter, with a proportionality constant upper bounded by the Lipschitz constant with which the parameter appears, divided by the square of the contraction rate of the dynamics. We additionally establish that any parameter-dependent contracting dynamics can be augmented with a feedforward prediction term to ensure that the tracking error converges to zero exponentially quickly. To apply these results to time-varying convex optimization problems, we establish the strong infinitesimal contractivity of dynamics solving three canonical problems, namely monotone inclusions, linear equality-constrained problems, and composite minimization problems. For each of these problems, we prove the sharpest-known rates of contraction and provide explicit tracking error bounds between solution trajectories and minimizing trajectories. We validate our theoretical results on three numerical examples, including an application to control-barrier function based controller design.
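The stated bound can be checked on a scalar toy system. The sketch below assumes the dynamics x' = -c*x + ell*theta(t) with a ramp parameter theta(t) = v*t, so the contraction rate is c, the Lipschitz constant in the parameter is ell, and the equilibrium is x*(theta) = ell*theta/c. The system, the numerical values, and the function name `simulate` are illustrative assumptions, not taken from the paper.

```python
def simulate(c, ell, v, feedforward, T=20.0, dt=1e-3):
    """Forward-Euler simulation of x' = -c*x + ell*theta(t), theta(t) = v*t.

    Returns the final tracking error |x(T) - x*(theta(T))|.
    """
    x = 1.0
    steps = int(T / dt)
    for k in range(steps):
        theta = v * k * dt
        dx = -c * x + ell * theta
        if feedforward:
            # Feedforward term: time derivative of the equilibrium ell*theta/c.
            dx += (ell / c) * v
        x += dt * dx
    theta_end = v * steps * dt
    return abs(x - ell * theta_end / c)

c, ell, v = 2.0, 3.0, 0.5              # hypothetical rate, Lipschitz constant, drift
bound = ell * v / c**2                  # (Lipschitz constant / rate^2) * |theta'|
err_plain = simulate(c, ell, v, feedforward=False)
err_ff = simulate(c, ell, v, feedforward=True)
print(f"bound {bound:.4f}, plain error {err_plain:.4f}, feedforward error {err_ff:.2e}")
```

In this linear example the plain dynamics settle at exactly the bound, while the feedforward-augmented dynamics drive the error toward zero, mirroring the two claims in the abstract.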
- [85] arXiv:2305.17217 (replaced) [pdf, ps, html, other]
-
Title: Tactile-based Exploration, Mapping and Navigation with Collision-Resilient Aerial Vehicles
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
In this article, we introduce novel tactile-based motion primitives termed "tactile-traversal", "tactile-turning" and "ricocheting" for unmanned aerial vehicles (UAVs). These primitives enable contact-rich UAV missions such as tactile-based exploration, mapping, and collision-inclusive navigation. We begin by introducing XPLORER, a passive deformable UAV that sustains collisions and establishes smooth contacts by exploiting its spring-augmented chassis. Next, an improved and fast converging external force estimation algorithm is proposed to detect contacts/collisions. We also design three distinct reaction controllers for (i) static-wrench application, (ii) disturbance rejection, and (iii) collision recovery. Finally, the three new tactile-based motion primitives are proposed by leveraging the reactions obtained from deploying these controllers to interact with surroundings. We showcase the effectiveness of these primitives to facilitate efficient exploration and rapid navigation in unknown environments by capitalizing on collisions and contacts.
- [86] arXiv:2306.10232 (replaced) [pdf, ps, other]
-
Title: Multi-Task Offloading via Graph Neural Networks in Heterogeneous Multi-access Edge Computing
Comments: Insufficient completion, there are some errors in the current version
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
In the rapidly evolving field of Heterogeneous Multi-access Edge Computing (HMEC), efficient task offloading plays a pivotal role in optimizing system throughput and resource utilization. However, existing task offloading methods often fall short of adequately modeling the dependency topology relationships between offloaded tasks, which limits their effectiveness in capturing the complex interdependencies of task features. To address this limitation, we propose a task offloading mechanism based on Graph Neural Networks (GNNs). Our modeling approach takes into account factors such as task characteristics, network conditions, and available resources at the edge, and embeds these captured features into the graph structure. By utilizing GNNs, our mechanism can capture and analyze the intricate relationships between task features, enabling a more comprehensive understanding of the underlying dependency topology. Through extensive evaluations in heterogeneous networks, our proposed algorithm improves by 18.6%-53.8% over greedy and approximate algorithms in optimizing system throughput and resource utilization. Our experiments showcase the advantage of considering the intricate interplay of task features using GNN-based modeling.
- [87] arXiv:2309.11656 (replaced) [pdf, ps, html, other]
-
Title: Real-to-Sim Deformable Object Manipulation: Optimizing Physics Models with Residual Mappings for Robotic Surgery
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Accurate deformable object manipulation (DOM) is essential for achieving autonomy in robotic surgery, where soft tissues are being displaced, stretched, and dissected. Many DOM methods can be powered by simulation, which ensures realistic deformation by adhering to the governing physical constraints and allowing for model prediction and control. However, real soft objects in robotic surgery, such as membranes and soft tissues, have complex, anisotropic physical parameters that a simulation with simple initialization from cameras may not fully capture. To use the simulation techniques in real surgical tasks, the "real-to-sim" gap needs to be properly compensated. In this work, we propose an online, adaptive parameter tuning approach for simulation optimization that (1) bridges the real-to-sim gap between a physics simulation and observations obtained from 3D perception by estimating a residual mapping and (2) optimizes its stiffness parameters online. Our method ensures a small residual gap between the simulation and observation and improves the simulation's predictive capabilities. The effectiveness of the proposed mechanism is evaluated in the manipulation of both a thin-shell and volumetric tissue, representative of most tissue scenarios. This work contributes to the advancement of simulation-based deformable tissue manipulation and holds potential for improving surgical autonomy.
- [88] arXiv:2309.15405 (replaced) [pdf, ps, other]
-
Title: Teach and Repeat Navigation: A Robust Control Approach
Comments: Accepted to IEEE International Conference on Robotics and Automation 2024 (ICRA2024)
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Robot navigation requires an autonomy pipeline that is robust to environmental changes and effective in varying conditions. Teach and Repeat (T&R) navigation has shown high performance in autonomous repeated tasks under challenging circumstances, but research within T&R has predominantly focused on motion planning as opposed to motion control. In this paper, we propose a novel T&R system based on a robust motion control technique for a skid-steering mobile robot using sliding-mode control that effectively handles uncertainties that are particularly pronounced in the T&R task, where sensor noises, parametric uncertainties, and wheel-terrain interaction are common challenges. We first theoretically demonstrate that the proposed T&R system is globally stable and robust while considering the uncertainties of the closed-loop system. When deployed on a Clearpath Jackal robot, we then show the global stability of the proposed system in both indoor and outdoor environments covering different terrains, outperforming previous state-of-the-art methods in terms of mean average trajectory error and stability in these challenging environments. This paper makes an important step towards long-term autonomous T&R navigation with ensured safety guarantees.
- [89] arXiv:2311.09655 (replaced) [pdf, ps, html, other]
-
Title: Multi-View Spectrogram Transformer for Respiratory Sound Classification
Comments: The paper was published at ICASSP 2024
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Deep neural networks have been applied to audio spectrograms for respiratory sound classification. Existing models often treat the spectrogram as a synthetic image while overlooking its physical characteristics. In this paper, a Multi-View Spectrogram Transformer (MVST) is proposed to embed different views of time-frequency characteristics into the vision transformer. Specifically, the proposed MVST splits the mel-spectrogram into different sized patches, representing the multi-view acoustic elements of a respiratory sound. These patches and positional embeddings are then fed into transformer encoders to extract the attentional information among patches through a self-attention mechanism. Finally, a gated fusion scheme is designed to automatically weigh the multi-view features to highlight the best one in a specific scenario. Experimental results on the ICBHI dataset demonstrate that the proposed MVST significantly outperforms state-of-the-art methods for classifying respiratory sounds.
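To make the multi-view patching step concrete, here is a minimal sketch that cuts one spectrogram into non-overlapping patches of several sizes. The spectrogram shape, the three patch sizes, and the helper name `split_into_patches` are hypothetical choices for illustration; the abstract does not state the sizes used in MVST.

```python
import numpy as np

def split_into_patches(spec, patch_f, patch_t):
    """Split a (freq, time) array into non-overlapping, flattened patches."""
    n_f, n_t = spec.shape[0] // patch_f, spec.shape[1] // patch_t
    spec = spec[: n_f * patch_f, : n_t * patch_t]      # drop any ragged border
    blocks = spec.reshape(n_f, patch_f, n_t, patch_t).transpose(0, 2, 1, 3)
    return blocks.reshape(n_f * n_t, patch_f * patch_t)

# Stand-in for a mel-spectrogram with 64 mel bins and 256 frames.
mel = np.arange(64 * 256, dtype=float).reshape(64, 256)

# Each patch shape yields a different time-frequency "view" of the same input.
patch_sizes = [(16, 16), (32, 8), (8, 32)]
views = {size: split_into_patches(mel, *size) for size in patch_sizes}
for size, patches in views.items():
    print(size, patches.shape)
```

Tall patches cover a wide frequency span over a short time window and wide patches do the opposite, which is one way to read "multi-view" here. Each set of flattened patches would then be linearly embedded before entering a transformer encoder.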
- [90] arXiv:2312.11329 (replaced) [pdf, ps, html, other]
-
Title: Convergence guarantees for adaptive model predictive control with kinky inference
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
We analyze the convergence properties of a robust adaptive model predictive control algorithm used to control an unknown nonlinear system. We show that by employing a standard quadratic stabilizing cost function, and by recursively updating the nominal model through kinky inference, the resulting controller ensures convergence of the true system to the origin, despite the presence of model uncertainty. We illustrate our theoretical findings through a numerical simulation.
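For readers unfamiliar with kinky inference, the sketch below shows its basic noise-free form as it is commonly stated: given samples of an unknown function and an assumed Lipschitz constant, the prediction is the midpoint between the tightest upper and lower Lipschitz envelopes through the data. The test function, the sample grid, and the function name `kinky_inference` are assumptions for this example, and the model predictive control loop from the abstract is not reproduced.

```python
import numpy as np

def kinky_inference(x_query, x_data, y_data, lipschitz):
    """Lipschitz-interpolation predictor for scalar inputs.

    Returns the midpoint prediction and the half-width of the uncertainty band.
    """
    dist = np.abs(x_query[:, None] - x_data[None, :])
    upper = np.min(y_data[None, :] + lipschitz * dist, axis=1)   # ceiling
    lower = np.max(y_data[None, :] - lipschitz * dist, axis=1)   # floor
    return 0.5 * (upper + lower), 0.5 * (upper - lower)

# Hypothetical unknown dynamics: sin(x), which is 1-Lipschitz.
x_data = np.linspace(0.0, 2.0 * np.pi, 15)
y_data = np.sin(x_data)
x_query = np.linspace(0.0, 2.0 * np.pi, 200)

pred, half_width = kinky_inference(x_query, x_data, y_data, lipschitz=1.0)
max_err = np.max(np.abs(pred - np.sin(x_query)))
print(f"max prediction error {max_err:.4f}, max half-width {np.max(half_width):.4f}")
```

Because the true function lies between the two envelopes whenever the assumed Lipschitz constant is valid, the half-width is a guaranteed error bound, and it shrinks as more data is collected. That shrinking bound is what makes this kind of predictor a natural fit for recursively updating a nominal model.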
- [91] arXiv:2312.14264 (replaced) [pdf, ps, other]
-
Title: Experimental demonstration of magnetic tunnel junction-based computational random-access memory
Yang Lv, Brandon R. Zink, Robert P. Bloom, Hüsrev Cılasun, Pravin Khanal, Salonik Resch, Zamshed Chowdhury, Ali Habiboglu, Weigang Wang, Sachin S. Sapatnekar, Ulya Karpuzcu, Jian-Ping Wang
Subjects: Emerging Technologies (cs.ET); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Systems and Control (eess.SY)
The conventional computing paradigm struggles to fulfill the rapidly growing demands from emerging applications, especially those for machine intelligence, because much of the power and energy is consumed by constant data transfers between logic and memory modules. A new paradigm, called "computational random-access memory (CRAM)", has emerged to address this fundamental limitation. CRAM performs logic operations directly using the memory cells themselves, without the data ever leaving the memory. The energy and performance benefits of CRAM for both conventional and emerging applications have been well established by prior numerical studies. However, an experimental demonstration and study of CRAM that evaluates its computation accuracy, a realistic and application-critical metric for its technological feasibility and competitiveness, has been lacking. In this work, a CRAM array based on magnetic tunnel junctions (MTJs) is experimentally demonstrated. First, basic memory operations as well as 2-, 3-, and 5-input logic operations are studied. Then, a 1-bit full adder with two different designs is demonstrated. Based on the experimental results, a suite of models has been developed to characterize the accuracy of CRAM computation. Scalar addition, multiplication, and matrix multiplication, which are essential building blocks for many conventional and machine intelligence applications, are evaluated and show promising accuracy performance. With the confirmation of MTJ-based CRAM's accuracy, there is a strong case that this technology will have a significant impact on power- and energy-demanding applications of machine intelligence.
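The link between 3- and 5-input logic operations and a 1-bit full adder can be illustrated with majority gates. The sketch below uses a well-known construction in which the carry is a 3-input majority and the sum is a 5-input majority that takes the inverted carry twice. This is an assumption chosen for illustration; the abstract does not specify which gates its two adder designs use.

```python
from itertools import product

def maj(*bits):
    """Majority gate: 1 if more than half of the input bits are 1."""
    return int(sum(bits) > len(bits) // 2)

def full_adder(a, b, cin):
    """1-bit full adder built from one 3-input and one 5-input majority gate."""
    cout = maj(a, b, cin)
    not_cout = 1 - cout
    s = maj(a, b, cin, not_cout, not_cout)
    return s, cout

# Exhaustive truth table over all eight input combinations.
truth_table = {bits: full_adder(*bits) for bits in product((0, 1), repeat=3)}
for (a, b, cin), (s, cout) in truth_table.items():
    print(f"a={a} b={b} cin={cin} -> sum={s} cout={cout}")
```

Since an n-bit adder chains n of these cells, any per-gate error rate compounds along the carry path, which is one reason gate-level accuracy matters for the scalar and matrix operations mentioned above.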
- [92] arXiv:2402.12786 (replaced) [pdf, ps, html, other]
-
Title: Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations
Comments: Accepted by ACL 2024
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
In spoken dialogue, even if two current turns are the same sentence, their responses might still differ when they are spoken in different styles. The spoken styles, containing paralinguistic and prosodic information, mark the most significant difference between the text and speech modalities. When used to model spoken dialogue, text-only LLMs cannot give different responses based on the speaking style of the current turn. In this paper, we focus on enabling LLMs to listen to the speaking styles and respond properly. Our goal is to teach the LLM that "even if the sentences are identical, if they are spoken in different styles, their corresponding responses might be different". Since there is no suitable dataset for achieving this goal, we collect a speech-to-speech dataset, StyleTalk, with the following desired characteristics: when two current speeches have the same content but are spoken in different styles, their responses will be different. To teach LLMs to understand and respond properly to the speaking styles, we propose the Spoken-LLM framework that can model the linguistic content and the speaking styles. We train Spoken-LLM using the StyleTalk dataset and devise a two-stage training pipeline to help the Spoken-LLM better learn the speaking styles. Based on extensive experiments, we show that Spoken-LLM outperforms text-only baselines and prior speech LLM methods.
- [93] arXiv:2403.06439 (replaced) [pdf, ps, html, other]
-
Title: Wide-Field, High-Resolution Reconstruction in Computational Multi-Aperture Miniscope Using a Fourier Neural Network
Subjects: Optics (physics.optics); Image and Video Processing (eess.IV)
Traditional fluorescence microscopy is constrained by inherent trade-offs among resolution, field-of-view, and system complexity. To navigate these challenges, we introduce a simple and low-cost computational multi-aperture miniature microscope, utilizing a microlens array for single-shot wide-field, high-resolution imaging. Addressing the challenges posed by extensive view multiplexing and non-local, shift-variant aberrations in this device, we present SV-FourierNet, a novel multi-channel Fourier neural network. SV-FourierNet facilitates high-resolution image reconstruction across the entire imaging field through its learned global receptive field. We establish a close relationship between the physical spatially-varying point-spread functions and the network's learned effective receptive field. This ensures that SV-FourierNet has effectively encapsulated the spatially-varying aberrations in our system, and learned a physically meaningful function for image reconstruction. Training of SV-FourierNet is conducted entirely on a physics-based simulator. We showcase wide-field, high-resolution video reconstructions on colonies of freely moving C. elegans and imaging of a mouse brain section. Our computational multi-aperture miniature microscope, augmented with SV-FourierNet, represents a major advancement in computational microscopy and may find broad applications in biomedical research and other fields requiring compact microscopy solutions.
- [94] arXiv:2404.04870 (replaced) [pdf, ps, html, other]
-
Title: Signal-noise separation using unsupervised reservoir computing
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Chaotic Dynamics (nlin.CD)
Removing noise from a signal without knowing the characteristics of the noise is a challenging task. This paper introduces a signal-noise separation method based on time series prediction. We use Reservoir Computing (RC) to extract the maximum portion of "predictable information" from a given signal. Reproducing the deterministic component of the signal using RC, we estimate the noise distribution from the difference between the original signal and the reconstructed one. The method is based on a machine learning approach and requires no prior knowledge of either the deterministic signal or the noise distribution. It provides a way to identify whether the noise is additive or multiplicative and to estimate the signal-to-noise ratio (SNR) indirectly. The method works successfully for various combinations of signals and noise, including chaotic signals and highly oscillating sinusoidal signals corrupted by non-Gaussian additive/multiplicative noise. The separation performance is robust and notably strong for signals with heavy noise, even for those with negative SNR.
- [95] arXiv:2405.06289 (replaced) [pdf, ps, html, other]
-
Title: Look Once to Hear: Target Speech Hearing with Noisy Examples
Comments: Best paper honorable mention at CHI 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
In crowded settings, the human brain can focus on speech from a target speaker, given prior knowledge of how they sound. We introduce a novel intelligent hearable system that achieves this capability, enabling target speech hearing that ignores all interfering speech and noise and retains only the target speaker. A naive approach is to require a clean speech example to enroll the target speaker. This, however, is not well aligned with the hearable application domain, since obtaining a clean example is challenging in real-world scenarios, creating a unique user interface problem. We present the first enrollment interface where the wearer looks at the target speaker for a few seconds to capture a single, short, highly noisy, binaural example of the target speaker. This noisy example is used for enrollment and subsequent speech extraction in the presence of interfering speakers and noise. Our system achieves a signal quality improvement of 7.01 dB using less than 5 seconds of noisy enrollment audio and can process 8 ms audio chunks in 6.24 ms on an embedded CPU. Our user studies demonstrate generalization to real-world static and mobile speakers in previously unseen indoor and outdoor multipath environments. Finally, our enrollment interface for noisy examples does not cause performance degradation compared to clean examples, while being convenient and user-friendly. More broadly, this paper takes an important step towards enhancing human auditory perception with artificial intelligence. We provide code and data at: this https URL.
- [96] arXiv:2405.12031 (replaced) [pdf, ps, html, other]
-
Title: Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification
Comments: 8 pages, 2 figures, 3 tables; added github link
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Transformer-based architectures for speaker verification typically require more training data than ECAPA-TDNN. Therefore, recent work has generally been trained on VoxCeleb1&2. We propose a backbone network based on self-attention, which can achieve competitive results when trained on VoxCeleb2 alone. The network alternates between neighborhood attention and global attention to capture local and global features, then aggregates features of different hierarchical levels, and finally performs attentive statistics pooling. Additionally, we employ a progressive channel fusion strategy to expand the receptive field in the channel dimension as the network deepens. We trained the proposed PCF-NAT model on VoxCeleb2 and evaluated it on VoxCeleb1 and the validation sets of VoxSRC. The EER and minDCF of the shallow PCF-NAT are on average more than 20% lower than those of similarly sized ECAPA-TDNN. Deep PCF-NAT achieves an EER lower than 0.5% on VoxCeleb1-O. The code and models are publicly available at this https URL.
- [97] arXiv:2405.16090 (replaced) [pdf, ps, html, other]
-
Title: EEG-DBNet: A Dual-Branch Network for Temporal-Spectral Decoding in Motor-Imagery Brain-Computer Interfaces
Subjects: Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Motor imagery electroencephalogram (EEG)-based brain-computer interfaces (BCIs) offer significant advantages for individuals with restricted limb mobility. However, challenges such as low signal-to-noise ratio and limited spatial resolution impede accurate feature extraction from EEG signals, thereby affecting the classification accuracy of different actions. To address these challenges, this study proposes an end-to-end dual-branch network (EEG-DBNet) that decodes the temporal and spectral sequences of EEG signals in parallel through two distinct network branches. Each branch comprises a local convolutional block and a global convolutional block. The local convolutional block transforms the source signal from the temporal-spatial domain to the temporal-spectral domain. By varying the number of filters and convolution kernel sizes, the local convolutional blocks in different branches adjust the length of their respective dimension sequences. Different types of pooling layers are then employed to emphasize the features of various dimension sequences, setting the stage for subsequent global feature extraction. The global convolution block splits and reconstructs the feature of the signal sequence processed by the local convolution block in the same branch and further extracts features through the dilated causal convolutional neural networks. Finally, the outputs from the two branches are concatenated, and signal classification is completed via a fully connected layer. Our proposed method achieves classification accuracies of 85.84% and 91.60% on the BCI Competition 4-2a and BCI Competition 4-2b datasets, respectively, surpassing existing state-of-the-art models. The source code is available at this https URL.
- [98] arXiv:2405.16470 (replaced) [pdf, ps, html, other]
-
Title: Image Deraining with Frequency-Enhanced State Space Model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Removing rain artifacts in images is recognized as a significant issue. In this field, deep learning-based approaches, such as convolutional neural networks (CNNs) and Transformers, have succeeded. Recently, State Space Models (SSMs) have exhibited superior performance across various tasks in both natural language processing and image processing due to their ability to model long-range dependencies. This study introduces SSM to rain removal and proposes a Deraining Frequency-Enhanced State Space Model (DFSSM). To effectively remove rain streaks, which produce high-intensity frequency components in specific directions, we employ frequency domain processing concurrently with SSM. Additionally, we develop a novel mixed-scale gated-convolutional block, which uses convolutions with multiple kernel sizes to capture various scale degradations effectively and integrates a gating mechanism to manage the flow of information. Finally, experiments on synthetic and real-world rainy image datasets show that our method surpasses state-of-the-art methods.
- [99] arXiv:2405.19228 (replaced) [pdf, ps, other]
-
Title: Motor Imagery Task Alters Dynamics of Human Body Posture
Subjects: Neurons and Cognition (q-bio.NC); Signal Processing (eess.SP)
Motor Imagery (MI) is gaining traction in both rehabilitation and sports settings, but its immediate influence on human postural control is not yet clearly understood. The focus of this study is to examine the effects of MI on the dynamics of the Center of Pressure (COP), a crucial metric for evaluating postural stability. In the experiment, thirty healthy young adults participated in four different scenarios: normal standing with both open and closed eyes, and kinesthetic motor imagery focused on mediolateral (ML) and anteroposterior (AP) sway movements. A mathematical model was developed to characterize the nonlinear dynamics of the COP and to assess the impact of MI on these dynamics. Our results show a statistically significant increase (p-value<0.05) in variables such as COP path length and Long-Range Correlation (LRC) during MI compared to the closed-eye and normal standing conditions. These observations align well with psycho-neuromuscular theory, which suggests that imagining a specific movement activates neural pathways, consequently affecting postural control. This study presents compelling evidence that motor imagery not only has a quantifiable impact on COP dynamics but also that changes in the Center of Pressure (COP) are directionally consistent with the imagined movements. This finding holds significant implications for the field of rehabilitation science, suggesting that motor imagery could be strategically utilized to induce targeted postural adjustments. Nonetheless, additional research is required to fully understand the complex mechanisms that underlie this relationship and to corroborate these results across a more diverse set of populations.
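One of the reported variables, COP path length, is conventionally the total distance travelled by the center of pressure over a recording. The sketch below computes it from mediolateral and anteroposterior coordinate series. The function name `cop_path_length` and the three-sample trajectory are hypothetical, and the study's exact preprocessing (filtering, sampling rate, normalization) is not stated in the abstract.

```python
import numpy as np

def cop_path_length(ml, ap):
    """Total COP path length: sum of distances between consecutive samples."""
    steps = np.hypot(np.diff(ml), np.diff(ap))
    return float(np.sum(steps))

# Hypothetical three-sample trajectory (arbitrary length units).
ml = np.array([0.0, 3.0, 3.0])   # mediolateral coordinate
ap = np.array([0.0, 4.0, 0.0])   # anteroposterior coordinate
length = cop_path_length(ml, ap)
print(length)   # segments of length 5 and 4
```

A longer path over the same recording duration indicates more sway, which is how an increase in this variable during motor imagery would be read.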