Search | arXiv e-print repository

arXiv:2405.15831 [pdf, other]

Transmission Interface Power Flow Adjustment: A Deep Reinforcement Learning Approach based on Multi-task Attribution Map

Authors: Shunyu Liu, Wei Luo, Yanzhen Zhou, Kaixuan Chen, Quan Zhang, Huating Xu, Qinglai Guo, Mingli Song

Abstract: Transmission interface power flow adjustment is a critical measure to ensure the security and economy operation of power systems. However, conventional model-based adjustment schemes are limited by the increasing variations and uncertainties occur in power systems, where the adjustment problems of different transmission interfaces are often treated as several independent tasks, ignoring their coupling relationship and even leading to conflict decisions. In this paper, we introduce a novel data-driven deep reinforcement learning (DRL) approach, to handle multiple power flow adjustment tasks jointly instead of learning each task from scratch. At the heart of the proposed method is a multi-task attribution map (MAM), which enables the DRL agent to explicitly attribute each transmission interface task to different power system nodes with task-adaptive attention weights. Based on this MAM, the agent can further provide effective strategies to solve the multi-task adjustment problem with a near-optimal operation cost. Simulation results on the IEEE 118-bus system, a realistic 300-bus system in China, and a very large European system with 9241 buses demonstrate that the proposed method significantly improves the performance compared with several baseline methods, and exhibits high interpretability with the learnable MAM.

Submitted 24 May, 2024; originally announced May 2024.

Comments: Accepted by IEEE Transactions on Power Systems

arXiv:2404.01620 [pdf]

Voice EHR: Introducing Multimodal Audio Data for Health

Authors: James Anibal, Hannah Huth, Ming Li, Lindsey Hazen, Yen Minh Lam, Hang Nguyen, Phuc Hong, Michael Kleinman, Shelley Ost, Christopher Jackson, Laura Sprabery, Cheran Elangovan, Balaji Krishnaiah, Lee Akst, Ioan Lina, Iqbal Elyazar, Lenny Ekwati, Stefan Jansen, Richard Nduwayezu, Charisse Garcia, Jeffrey Plum, Jacqueline Brenner, Miranda Song, Emily Ricotta, David Clifton, et al. (3 additional authors not shown)

Abstract: Large AI models trained on audio data may have the potential to rapidly classify patients, enhancing medical decision-making and potentially improving outcomes through early detection. Existing technologies depend on limited datasets using expensive recording equipment in high-income, English-speaking countries. This challenges deployment in resource-constrained, high-volume settings where audio data may have a profound impact. This report introduces a novel data type and a corresponding collection system that captures health data through guided questions using only a mobile/web application. This application ultimately results in an audio electronic health record (voice EHR) which may contain complex biomarkers of health from conventional voice/respiratory features, speech patterns, and language with semantic meaning - compensating for the typical limitations of unimodal clinical datasets. This report introduces a consortium of partners for global work, presents the application used for data collection, and showcases the potential of informative voice EHR to advance the scalability and diversity of audio AI.

Submitted 1 June, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 19 pages, 2 figures, 7 tables

arXiv:2401.12987 [pdf, other]

TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation

Authors: Taeyang Yun, Hyunkuk Lim, Jeonghwan Lee, Min Song

Abstract: Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests. The emotions in a conversation can be identified by the representations from various modalities, such as audio, visual, and text. However, due to the weak contribution of non-verbal modalities to recognize emotions, multimodal ERC has always been considered a challenging task. In this paper, we propose Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students, thereby optimizing the efficacy of the weak modalities. We then combine multimodal features using a shifting fusion approach in which student networks support the teacher. TelME achieves state-of-the-art performance in MELD, a multi-speaker conversation dataset for ERC. Finally, we demonstrate the effectiveness of our components through additional experiments.

Submitted 31 March, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: NAACL 2024 main conference

arXiv:2401.11902 [pdf, other]

A Training-Free Defense Framework for Robust Learned Image Compression

Authors: Myungseo Song, Jinyoung Choi, Bohyung Han

Abstract: We study the robustness of learned image compression models against adversarial attacks and present a training-free defense technique based on simple image transform functions. Recent learned image compression models are vulnerable to adversarial attacks that result in poor compression rate, low reconstruction quality, or weird artifacts. To address the limitations, we propose a simple but effective two-way compression algorithm with random input transforms, which is conveniently applicable to existing image compression models. Unlike the naïve approaches, our approach preserves the original rate-distortion performance of the models on clean images. Moreover, the proposed algorithm requires no additional training or modification of existing models, making it more practical. We demonstrate the effectiveness of the proposed techniques through extensive experiments under multiple compression models, evaluation metrics, and attack scenarios.

Submitted 22 January, 2024; originally announced January 2024.

Comments: 10 pages and 14 figures

arXiv:2401.02771 [pdf, other]

Powerformer: A Section-adaptive Transformer for Power Flow Adjustment

Authors: Kaixuan Chen, Wei Luo, Shunyu Liu, Yaoquan Wei, Yihe Zhou, Yunpeng Qing, Quan Zhang, Jie Song, Mingli Song

Abstract: In this paper, we present a novel transformer architecture tailored for learning robust power system state representations, which strives to optimize power dispatch for the power flow adjustment across different transmission sections. Specifically, our proposed approach, named Powerformer, develops a dedicated section-adaptive attention mechanism, separating itself from the self-attention used in conventional transformers. This mechanism effectively integrates power system states with transmission section information, which facilitates the development of robust state representations. Furthermore, by considering the graph topology of power system and the electrical attributes of bus nodes, we introduce two customized strategies to further enhance the expressiveness: graph neural network propagation and multi-factor attention mechanism. Extensive evaluations are conducted on three power system scenarios, including the IEEE 118-bus system, a realistic 300-bus system in China, and a large-scale European system with 9241 buses, where Powerformer demonstrates its superior performance over several baseline methods.

Submitted 30 January, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

Comments: 8 figures

arXiv:2312.03490 [pdf, other]

PneumoLLM: Harnessing the Power of Large Language Model for Pneumoconiosis Diagnosis

Authors: Meiyue Song, Zhihua Yu, Jiaxin Wang, Jiarui Wang, Yuting Lu, Baicun Li, Xiaoxu Wang, Qinghua Huang, Zhijun Li, Nikolaos I. Kanellakis, Jiangfeng Liu, Jing Wang, Binglu Wang, Juntao Yang

Abstract: The conventional pretraining-and-finetuning paradigm, while effective for common diseases with ample data, faces challenges in diagnosing data-scarce occupational diseases like pneumoconiosis. Recently, large language models (LLMs) have exhibits unprecedented ability when conducting multiple tasks in dialogue, bringing opportunities to diagnosis. A common strategy might involve using adapter layers for vision-language alignment and diagnosis in a dialogic manner. Yet, this approach often requires optimization of extensive learnable parameters in the text branch and the dialogue head, potentially diminishing the LLMs' efficacy, especially with limited training data. In our work, we innovate by eliminating the text branch and substituting the dialogue head with a classification head. This approach presents a more effective method for harnessing LLMs in diagnosis with fewer learnable parameters. Furthermore, to balance the retention of detailed image information with progression towards accurate diagnosis, we introduce the contextual multi-token engine. This engine is specialized in adaptively generating diagnostic tokens. Additionally, we propose the information emitter module, which unidirectionally emits information from image tokens to diagnosis tokens. Comprehensive experiments validate the superiority of our methods and the effectiveness of proposed modules. Our codes can be found at https://github.com/CodeMonsterPHD/PneumoLLM/tree/main.

Submitted 28 June, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: Medical Image Analysis

arXiv:2311.14295 [pdf, ps, other]

Exploiting Active RIS in NOMA Networks with Hardware Impairments

Authors: Xinwei Yue, Meiqi Song, Chongjun Ouyang, Yuanwei Liu, Tian Li, Tianwei Hou

Abstract: Active reconfigurable intelligent surface (ARIS) is a promising way to compensate for multiplicative fading attenuation by amplifying and reflecting event signals to selected users. This paper investigates the performance of ARIS assisted non-orthogonal multiple access (NOMA) networks over cascaded Nakagami-m fading channels. The effects of hardware impairments (HIS) and reflection coefficients on ARIS-NOMA networks with imperfect successive interference cancellation (ipSIC) and perfect successive interference cancellation (pSIC) are considered. More specifically, we develop new precise and asymptotic expressions of outage probability and ergodic data rate with ipSIC/pSIC for ARIS-NOMA-HIS networks. According to the approximated analyses, the diversity orders and multiplexing gains for couple of non-orthogonal users are attained in detail. Additionally, the energy efficiency of ARIS-NOMA-HIS networks is surveyed in delay-limited and delay-tolerant transmission schemes. The simulation findings are presented to demonstrate that: i) The outage behaviors and ergodic data rates of ARIS-NOMA-HIS networks precede that of ARIS aided orthogonal multiple access (OMA) and passive reconfigurable intelligent surface (PRIS) aided OMA; ii) As the reflection coefficient of ARIS increases, ARIS-NOMA-HIS networks have the ability to provide the strengthened outage performance; and iii) ARIS-NOMA-HIS networks are more energy efficient than ARIS/PRIS-OMA networks and conventional cooperative schemes.

Submitted 12 January, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

arXiv:2311.10463 [pdf, other]

Correlation-Distance Graph Learning for Treatment Response Prediction from rs-fMRI

Authors: Xiatian Zhang, Sisi Zheng, Hubert P. H. Shum, Haozheng Zhang, Nan Song, Mingkang Song, Hongxiao Jia

Abstract: Resting-state fMRI (rs-fMRI) functional connectivity (FC) analysis provides valuable insights into the relationships between different brain regions and their potential implications for neurological or psychiatric disorders. However, specific design efforts to predict treatment response from rs-fMRI remain limited due to difficulties in understanding the current brain state and the underlying mechanisms driving the observed patterns, which limited the clinical application of rs-fMRI. To overcome that, we propose a graph learning framework that captures comprehensive features by integrating both correlation and distance-based similarity measures under a contrastive loss. This approach results in a more expressive framework that captures brain dynamic features at different scales and enables more accurate prediction of treatment response. Our experiments on the chronic pain and depersonalization disorder datasets demonstrate that our proposed method outperforms current methods in different scenarios. To the best of our knowledge, we are the first to explore the integration of distance-based and correlation-based neural similarity into graph learning for treatment response prediction.

Submitted 17 November, 2023; originally announced November 2023.

Comments: Proceedings of the 2023 International Conference on Neural Information Processing (ICONIP)

arXiv:2308.06285 [pdf, other]

An Integrated Visual Analytics System for Studying Clinical Carotid Artery Plaques

Authors: Chaoqing Xu, Zhentao Zheng, Yiting Fu, Baofeng Chang, Legao Chen, Minghui Wu, Mingli Song, Jinsong Jiang

Abstract: Carotid artery plaques can cause arterial vascular diseases such as stroke and myocardial infarction, posing a severe threat to human life. However, the current clinical examination mainly relies on a direct assessment by physicians of patients' clinical indicators and medical images, lacking an integrated visualization tool for analyzing the influencing factors and composition of carotid artery plaques. We have designed an intelligent carotid artery plaque visual analysis system for vascular surgery experts to comprehensively analyze the clinical physiological and imaging indicators of carotid artery diseases. The system mainly includes two functions: First, it displays the correlation between carotid artery plaque and various factors through a series of information visualization methods and integrates the analysis of patient physiological indicator data. Second, it enhances the interface guidance analysis of the inherent correlation between the components of carotid artery plaque through machine learning and displays the spatial distribution of the plaque on medical images. Additionally, we conducted two case studies on carotid artery plaques using real data obtained from a hospital, and the results indicate that our designed carotid analysis system can effectively provide clinical diagnosis and treatment guidance for vascular surgeons.

Submitted 8 August, 2023; originally announced August 2023.

arXiv:2306.02913 [pdf, other]

Decentralized SGD and Average-direction SAM are Asymptotically Equivalent

Authors: Tongtian Zhu, Fengxiang He, Kaixuan Chen, Mingli Song, Dacheng Tao

Abstract: Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction Sharpness-aware minimization (SAM) algorithm under general non-convex non-$β$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios. The code is available at https://github.com/Raiden-Zhu/ICML-2023-DSGD-and-SAM.

Submitted 9 November, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

Comments: 40th International Conference on Machine Learning (ICML 2023)

arXiv:2303.15669 [pdf, other]

Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages

Authors: Seongyeon Park, Myungseo Song, Bohyung Kim, Tae-Hyun Oh

Abstract: Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting such large-scale transcribed data is expensive. This paper proposes an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data. With our pre-training, we can remarkably reduce the amount of paired transcribed data required to train the model for the target downstream TTS task. The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones, which may allow the model to learn proper temporal assignment relation between input and output sequences. In addition, we propose a data augmentation method that further improves the data efficiency in fine-tuning. We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios, achieving outstanding performance compared to competing methods. The code and audio samples are available at: https://github.com/cnaigithub/SpeechDewar**

Submitted 27 March, 2023; originally announced March 2023.

Comments: ICASSP 2023

arXiv:2303.09199 [pdf, other]

A Generative Model for Digital Camera Noise Synthesis

Authors: Mingyang Song, Yang Zhang, Tunç O. Aydın, Elham Amin Mansour, Christopher Schroers

Abstract: Noise synthesis is a challenging low-level vision task aiming to generate realistic noise given a clean image along with the camera settings. To this end, we propose an effective generative model which utilizes clean features as guidance followed by noise injections into the network. Specifically, our generator follows a UNet-like structure with skip connections but without downsampling and upsampling layers. Firstly, we extract deep features from a clean image as the guidance and concatenate a Gaussian noise map to the transition point between the encoder and decoder as the noise source. Secondly, we propose noise synthesis blocks in the decoder in each of which we inject Gaussian noise to model the noise characteristics. Thirdly, we propose to utilize an additional Style Loss and demonstrate that this allows better noise characteristics supervision in the generator. Through a number of new experiments, we evaluate the temporal variance and the spatial correlation of the generated noise which we hope can provide meaningful insights for future works. Finally, we show that our proposed approach outperforms existing methods for synthesizing camera noise.

Submitted 13 June, 2024; v1 submitted 16 March, 2023; originally announced March 2023.

arXiv:2212.11486 [pdf, other]

Over-the-Air Federated Learning with Enhanced Privacy

Authors: Xiaochan Xue, Moh Khalid Hasan, Shucheng Yu, Laxima Niure Kandel, Min Song

Abstract: Federated learning (FL) has emerged as a promising learning paradigm in which only local model parameters (gradients) are shared. Private user data never leaves the local devices thus preserving data privacy. However, recent research has shown that even when local data is never shared by a user, exchanging model parameters without protection can also leak private information. Moreover, in wireless systems, the frequent transmission of model parameters can cause tremendous bandwidth consumption and network congestion when the model is large. To address this problem, we propose a new FL framework with efficient over-the-air parameter aggregation and strong privacy protection of both user data and models. We achieve this by introducing pairwise cancellable random artificial noises (PCR-ANs) on end devices. As compared to existing over-the-air computation (AirComp) based FL schemes, our design provides stronger privacy protection. We analytically show the secrecy capacity and the convergence rate of the proposed wireless FL aggregation algorithm.

Submitted 22 December, 2022; originally announced December 2022.

Comments: 6 pages

arXiv:2209.07384 [pdf, other]

Self-Supervised Attention Networks and Uncertainty Loss Weighting for Multi-Task Emotion Recognition on Vocal Bursts

Authors: Vincent Karas, Andreas Triantafyllopoulos, Meishu Song, Björn W. Schuller

Abstract: Vocal bursts play an important role in communicating affect, making them valuable for improving speech emotion recognition. Here, we present our approach for classifying vocal bursts and predicting their emotional significance in the ACII Affective Vocal Burst Workshop & Challenge 2022 (A-VB). We use a large self-supervised audio model as shared feature extractor and compare multiple architectures built on classifier chains and attention networks, combined with uncertainty loss weighting strategies. Our approach surpasses the challenge baseline by a wide margin on all four tasks.

Submitted 27 September, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

Comments: 4 pages, 1 figure, accepted at The 2022 ACII Affective Vocal Burst Workshop & Challenge (A-VB)

arXiv:2208.10922 [pdf, other]

StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

Authors: Dongchan Min, Minyoung Song, Eunji Ko, Sung Ju Hwang

Abstract: We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflects the given audio. This is made possible with several newly devised components: 1) A contrastive lip-sync discriminator for accurate lip synchronization, 2) A conditional sequential variational autoencoder that learns the latent motion space disentangled from the lip movements, such that we can independently manipulate the motions and lip movements while preserving the identity. 3) An auto-regressive prior augmented with normalizing flow to learn a complex audio-to-motion multi-modal latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way when another motion source video is given but also in a completely audio-driven manner by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model is able to synthesize talking head videos with impressive perceptual quality which are accurately lip-synced with the input audios, largely outperforming state-of-the-art baselines.

Submitted 15 March, 2024; v1 submitted 23 August, 2022; originally announced August 2022.

arXiv:2206.13390 [pdf, other]

A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!

Authors: Chenglizhao Chen, Mengke Song, Wenfeng Song, Li Guo, Muwei Jian

Abstract: Video saliency detection (VSD) aims at fast locating the most attractive objects/things/patterns in a given video clip. Existing VSD-related works have mainly relied on the visual system but paid less attention to the audio aspect, while, actually, our audio system is the most vital complementary part to our visual system. Also, audio-visual saliency detection (AVSD), one of the most representative research topics for mimicking human perceptual mechanisms, is currently in its infancy, and none of the existing survey papers have touched on it, especially from the perspective of saliency detection. Thus, the ultimate goal of this paper is to provide an extensive review to bridge the gap between audio-visual fusion and saliency detection. In addition, as another highlight of this review, we have provided a deep insight into key factors which could directly determine the performances of AVSD deep models, and we claim that the audio-visual consistency degree (AVC) -- a long-overlooked issue, can directly influence the effectiveness of using audio to benefit its visual counterpart when performing saliency detection. Moreover, in order to make the AVC issue more practical and valuable for future followers, we have newly equipped almost all existing publicly available AVSD datasets with additional frame-wise AVC labels. Based on these upgraded datasets, we have conducted extensive quantitative evaluations to ground our claim on the importance of AVC in the AVSD task. In a word, both our ideas and new sets serve as a convenient platform with preliminaries and guidelines, all of which are very potential to facilitate future works in promoting state-of-the-art (SOTA) performance further.

Submitted 20 June, 2022; originally announced June 2022.

arXiv:2206.11049 [pdf, other]

Dynamic Restrained Uncertainty Weighting Loss for Multitask Learning of Vocal Expression

Authors: Meishu Song, Zijiang Yang, Andreas Triantafyllopoulos, Xin Jing, Vincent Karas, Xie Jiangjian, Zixing Zhang, Yamamoto Yoshiharu, Bjoern W. Schuller

Abstract: We propose a novel Dynamic Restrained Uncertainty Weighting Loss to experimentally handle the problem of balancing the contributions of multiple tasks on the ICML ExVo 2022 Challenge. The multitask aims to recognize expressed emotions and demographic traits from vocal bursts jointly. Our strategy combines the advantages of Uncertainty Weight and Dynamic Weight Average, by extending weights with a restraint term to make the learning process more explainable. We use a lightweight multi-exit CNN architecture to implement our proposed loss approach. The experimental H-Mean score (0.394) shows a substantial improvement over the baseline H-Mean score (0.335).

Submitted 27 June, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

Comments: 5 pages

arXiv:2206.11045 [pdf, other]

COVYT: Introducing the Coronavirus YouTube and TikTok speech dataset featuring the same speakers with and without infection

Authors: Andreas Triantafyllopoulos, Anastasia Semertzidou, Meishu Song, Florian B. Pokorny, Björn W. Schuller

Abstract: More than two years after its outbreak, the COVID-19 pandemic continues to plague medical systems around the world, putting a strain on scarce resources, and claiming human lives. From the very beginning, various AI-based COVID-19 detection and monitoring tools have been pursued in an attempt to stem the tide of infections through timely diagnosis. In particular, computer audition has been suggested as a non-invasive, cost-efficient, and eco-friendly alternative for detecting COVID-19 infections through vocal sounds. However, like all AI methods, also computer audition is heavily dependent on the quantity and quality of available data, and large-scale COVID-19 sound datasets are difficult to acquire -- amongst other reasons -- due to the sensitive nature of such data. To that end, we introduce the COVYT dataset -- a novel COVID-19 dataset collected from public sources containing more than 8 hours of speech from 65 speakers. As compared to other existing COVID-19 sound datasets, the unique feature of the COVYT dataset is that it comprises both COVID-19 positive and negative samples from all 65 speakers. We analyse the acoustic manifestation of COVID-19 on the basis of these perfectly speaker characteristic balanced `in-the-wild' data using interpretable audio descriptors, and investigate several classification scenarios that shed light into proper partitioning strategies for a fair speech-based COVID-19 detection.

Submitted 20 June, 2022; originally announced June 2022.

arXiv:2206.09142 [pdf, other]

Redundancy Reduction Twins Network: A Training framework for Multi-output Emotion Regression

Authors: Xin Jing, Meishu Song, Andreas Triantafyllopoulos, Zijiang Yang, Björn W. Schuller

Abstract: In this paper, we propose the Redundancy Reduction Twins Network (RRTN), a redundancy reduction training framework that minimizes redundancy by measuring the cross-correlation matrix between the outputs of the same network fed with distorted versions of a sample and bringing it as close to the identity matrix as possible. RRTN also applies a new loss function, the Barlow Twins loss function, to help maximize the similarity of representations obtained from different distorted versions of a sample. However, as the distribution of losses can cause performance fluctuations in the network, we also propose the use of a Restrained Uncertainty Weight Loss (RUWL) or joint training to identify the best weights for the loss function. Our best approach on CNN14 with the proposed methodology obtains a CCC over emotion regression of 0.678 on the ExVo Multi-task dev set, a 4.8% increase over a vanilla CNN 14 CCC of 0.647, which achieves a significant difference at the 95% confidence interval (2-tailed).

Submitted 28 June, 2022; v1 submitted 18 June, 2022; originally announced June 2022.

Comments: 5 pages, accepted by ICML ExVo workshop

arXiv:2206.06680 [pdf, other]

Exploring speaker enrolment for few-shot personalisation in emotional vocalisation prediction

Authors: Andreas Triantafyllopoulos, Meishu Song, Zijiang Yang, Xin Jing, Björn W. Schuller

Abstract: In this work, we explore a novel few-shot personalisation architecture for emotional vocalisation prediction. The core contribution is an `enrolment' encoder which utilises two unlabelled samples of the target speaker to adjust the output of the emotion encoder; the adjustment is based on dot-product attention, thus effectively functioning as a form of `soft' feature selection. The emotion and enrolment encoders are based on two standard audio architectures: CNN14 and CNN10. The two encoders are further guided to forget or learn auxiliary emotion and/or speaker information. Our best approach achieves a CCC of $.650$ on the ExVo Few-Shot dev set, a $2.5\%$ increase over our baseline CNN14 CCC of $.634$.

Submitted 20 June, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: Proceedings of the ICML Expressive Vocalizations Workshop and Competition held in conjunction with the $\mathit{39}^{th}$ International Conference on Machine Learning, Copyright 2022 by the author(s)

arXiv:2206.02705 [pdf]

Human Behavior Recognition Method Based on CEEMD-ES Radar Selection

Authors: Zhaolin Zhang, Mingqi Song, Wugang Meng, Yuhan Liu, Fengcong Li, Xiang Feng, Yinan Zhao

Abstract: In recent years, the millimeter-wave radar to identify human behavior has been widely used in medical, security, and other fields. When multiple radars are performing detection tasks, the validity of the features contained in each radar is difficult to guarantee. In addition, processing multiple radar data also requires a lot of time and computational cost. The Complementary Ensemble Empirical Mode Decomposition-Energy Slice (CEEMD-ES) multistatic radar selection method is proposed to solve these problems. First, this method decomposes and reconstructs the radar signal according to the difference in the reflected echo frequency between the limbs and the trunk of the human body. Then, the radar is selected according to the difference between the ratio of echo energy of limbs and trunk and the theoretical value. The time domain, frequency domain and various entropy features of the selected radar are extracted. Finally, the Extreme Learning Machine (ELM) recognition model of the ReLU core is established. Experiments show that this method can effectively select the radar, and the recognition rate of three kinds of human actions is 98.53%.

Submitted 6 June, 2022; originally announced June 2022.

Comments: 4 pages, 5 figures

arXiv:2205.06576 [pdf, other]

Distribution-Aware Graph Representation Learning for Transient Stability Assessment of Power System

Authors: Kaixuan Chen, Shunyu Liu, Na Yu, Rong Yan, Quan Zhang, Jie Song, Zunlei Feng, Mingli Song

Abstract: The real-time transient stability assessment (TSA) plays a critical role in the secure operation of the power system. Although the classic numerical integration method, i.e., time-domain simulation (TDS), has been widely used in industry practice, it is inevitably trapped in a high computational complexity due to the high latitude sophistication of the power system. In this work, a data-driven power system estimation method is proposed to quickly predict the stability of the power system before TDS reaches the end of simulating time windows, which can reduce the average simulation time of stability assessment without loss of accuracy. As the topology of the power system is in the form of graph structure, graph neural network based representation learning is naturally suitable for learning the status of the power system. Motivated by observing the distribution information of crucial active power and reactive power on the power system's bus nodes, we thus propose a distribution-aware learning (DAL) module to explore an informative graph representation vector for describing the status of a power system. Then, TSA is re-defined as a binary classification task, and the stability of the system is determined directly from the resulting graph representation without numerical integration. Finally, we apply our method to the online TSA task. The case studies on the IEEE 39-bus system and Polish 2383-bus system demonstrate the effectiveness of our proposed method.

Submitted 12 May, 2022; originally announced May 2022.

Comments: 8 pages, 6 figures, 4 tables

arXiv:2203.17012 [pdf, other]

A Temporal-oriented Broadcast ResNet for COVID-19 Detection

Authors: Xin Jing, Shuo Liu, Emilia Parada-Cabaleiro, Andreas Triantafyllopoulos, Meishu Song, Zijiang Yang, Björn W. Schuller

Abstract: Detecting COVID-19 from audio signals, such as breathing and coughing, can be used as a fast and efficient pre-testing method to reduce the virus transmission. Due to the promising results of deep learning networks in modelling time sequences, and since applications to rapidly identify COVID in-the-wild should require low computational effort, we present a temporal-oriented broadcasting residual learning method that achieves efficient computation and high accuracy with a small model size. Based on the EfficientNet architecture, our novel network, named Temporal-oriented ResNet (TorNet), consists of a broadcasting learning block, i.e. the Alternating Broadcast (AB) Block, which contains several Broadcast Residual Blocks (BC ResBlocks) and a convolution layer. With the AB Block, the network obtains useful audio-temporal features and higher level embeddings effectively with much less computation than Recurrent Neural Networks (RNNs), typically used to model temporal information. TorNet achieves 72.2% Unweighted Average Recall (UAR) on the INTERSPEECH 2021 Computational Paralinguistics Challenge COVID-19 cough Sub-Challenge, by this showing competitive results with a higher computational efficiency than other state-of-the-art alternatives.

Submitted 31 March, 2022; originally announced March 2022.

Comments: 5 pages, submitted to Interspeech 2022

arXiv:2112.12386 [pdf, other]

KFWC: A Knowledge-Driven Deep Learning Model for Fine-grained Classification of Wet-AMD

Authors: Haihong E, Jiawen He, Tianyi Hu, Lifei Wang, Lifei Yuan, Ruru Zhang, Meina Song

Abstract: Automated diagnosis using deep neural networks can help ophthalmologists detect the blinding eye disease wet Age-related Macular Degeneration (AMD). Wet-AMD has two similar subtypes, Neovascular AMD and Polypoidal Choroidal Vessels (PCV). However, due to the difficulty in data collection and the similarity between images, most studies have only achieved the coarse-grained classification of wet-AMD rather than a finer-grained one of wet-AMD subtypes. To solve this issue, in this paper we propose a Knowledge-driven Fine-grained Wet-AMD Classification Model (KFWC), to classify fine-grained diseases with insufficient data. With the introduction of a priori knowledge of 10 lesion signs of input images into the KFWC, we aim to accelerate the KFWC by means of multi-label classification pre-training, to locate the decisive image features in the fine-grained disease classification task and therefore achieve better classification. Simultaneously, the KFWC can also provide good interpretability and effectively alleviate the pressure of data collection and annotation in the field of fine-grained disease classification for wet-AMD. The experiments demonstrate the effectiveness of the KFWC which reaches 99.71% in AU-ROC scores, and its considerable improvements over the data-driven w/o Knowledge and ophthalmologists, with the rates of 6.69% over the strongest baseline and 4.14% over ophthalmologists.

Submitted 23 December, 2021; originally announced December 2021.

arXiv:2109.02104 [pdf, other]

doi 10.1109/JIOT.2022.3155773

Machine Learning-Based 3D Channel Modeling for U2V mmWave Communications

Authors: Kai Mao, Qiuming Zhu, Maozhong Song, Hanpeng Li, Benzhe Ning, Boyu Hua, Wei Fan

Abstract: Unmanned aerial vehicle (UAV) millimeter wave (mmWave) technologies can provide flexible link and high data rate for future communication networks. By considering the new features of three-dimensional (3D) scattering space, 3D velocity, 3D antenna array, and especially 3D rotations, a machine learning (ML) integrated UAV-to-Vehicle (U2V) mmWave channel model is proposed. Meanwhile, a ML-based network for channel parameter calculation and generation is developed. The deterministic parameters are calculated based on the simplified geometry information, while the random ones are generated by the back propagation based neural network (BPNN) and generative adversarial network (GAN), where the training data set is obtained from massive ray-tracing (RT) simulations. Moreover, theoretical expressions of channel statistical properties, i.e., power delay profile (PDP), autocorrelation function (ACF), Doppler power spectrum density (DPSD), and cross-correlation function (CCF) are derived and analyzed. Finally, the U2V mmWave channel is generated under a typical urban scenario at 28 GHz. The generated PDP and DPSD show good agreement with RT-based results, which validates the effectiveness of proposed method. Moreover, the impact of 3D rotations, which has rarely been reported in previous works, can be observed in the generated CCF and ACF, which are also consistent with the theoretical and measurement results.

Submitted 5 September, 2021; originally announced September 2021.

Comments: IEEE Internet of Things Journal, early access, Mar. 2022

Journal ref: in IEEE Internet of Things Journal, vol. 9, no. 18, pp. 17592-17607, 15 Sept. 2022

arXiv:2108.09551 [pdf, other]

Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform

Authors: Myungseo Song, Jinyoung Choi, Bohyung Han

Abstract: We propose a versatile deep image compression network based on Spatial Feature Transform (SFT arXiv:1804.02815), which takes a source image and a corresponding quality map as inputs and produces a compressed image with variable rates. Our model covers a wide range of compression rates using a single model, which is controlled by arbitrary pixel-wise quality maps. In addition, the proposed framework allows us to perform task-aware image compressions for various tasks, e.g., classification, by efficiently estimating optimized quality maps specific to target tasks for our encoding network. This is even possible with a pretrained network without learning separate models for individual tasks. Our algorithm achieves outstanding rate-distortion trade-off compared to the approaches based on multiple models that are optimized separately for several different target rates. At the same level of compression, the proposed approach successfully improves performance on image classification and text region quality preservation via task-aware quality map estimation without additional model training. The code is available at the project website: https://github.com/micmic123/QmapCompression

Submitted 21 August, 2021; originally announced August 2021.

Comments: ICCV 2021

arXiv:2104.07116 [pdf, other]

doi 10.1109/VTC2020-Fall49728.2020.9348592

Meteorologically Introduced Impacts on Aerial Channels and UAV Communications

Authors: Mengan Song, Yiming Huo, Tao Lu, Xiaodai Dong, Zhonghua Liang

Abstract: As 5G wireless systems and networks are now being globally commercialized and deployed, more diversified application scenarios are emerging, quickly reshaping our societies and paving the road to the beyond 5G (6G) era when terahertz (THz) and unmanned aerial vehicle (UAV) communications may play critical roles. In this paper, aerial channel models under multiple meteorological conditions such as rain, fog and snow, have been investigated at frequencies of interest (from 2 GHz to 900 GHz) for UAV communications. Furthermore, the link budget and the received signal-to-noise ratio (SNR) performance under the existing air-to-ground (A2G) channel models are studied with antenna(s) system considered. The relationship between the 3D coverage radius and UAV altitude under the influence of multiple weather (MW) conditions is simulated. Numerical results show that medium rain has the most effects on the UAV's coverage for UAV communications at millimeter wave (mmWave) bands, while snow has the largest impacts at near THz bands. In addition, when the frequency increases, the corresponding increase in the number of antennas can effectively compensate for the propagation loss introduced by weather factors, while its form factor and weight can be kept to maintain the UAV's payload.

Submitted 14 April, 2021; originally announced April 2021.

Comments: 5 pages, 7 figures, accepted by IEEE VTC2020-FALL

arXiv:2104.03540 [pdf, other]

doi 10.1109/TVT.2022.3174404

Map-based Channel Modeling and Generation for U2V mmWave Communication

Authors: Qiuming Zhu, Kai Mao, Maozhong Song, Xiaomin Chen, Boyu Hua, Weizhi Zhong, Xijuan Ye

Abstract: Unmanned aerial vehicle (UAV) aided millimeter wave (mmWave) technologies have a promising prospect in the future communication networks. By considering the factors of three-dimensional (3D) scattering space, 3D trajectory, and 3D antenna array, a non-stationary channel model for UAV-to-vehicle (U2V) mmWave communications is proposed. The computation and generation methods of channel parameters including inter-path and intra-path are analyzed in detail. The inter-path parameters are calculated in a deterministic way, while the parameters of intra-path rays are generated in a stochastic way. The statistical properties are obtained by using a Gaussian mixture model (GMM) on the massive ray tracing (RT) data. Then, a modified method of equal areas (MMEA) is developed to generate the random intra-path variables. Meanwhile, to reduce the complexity of RT method, the 3D propagation space is reconstructed based on the user-defined digital map. The simulated and analyzed results show that the proposed model and generation method can reproduce non-stationary U2V channels in accord with U2V scenarios. The generated statistical properties are consistent with the theoretical and measured ones as well.

Submitted 8 April, 2021; originally announced April 2021.

Journal ref: in IEEE Transactions on Vehicular Technology, vol. 71, no. 8, pp. 8004-8015, Aug. 2022

arXiv:2103.00430 [pdf, other]

Training Generative Adversarial Networks in One Stage

Authors: Chengchao Shen, Youtan Yin, Xinchao Wang, Xubin Li, Jie Song, Mingli Song

Abstract: Generative Adversarial Networks (GANs) have demonstrated unprecedented success in various image generation tasks. The encouraging results, however, come at the price of a cumbersome training process, during which the generator and discriminator are alternately updated in two stages. In this paper, we investigate a general training scheme that enables training GANs efficiently in only one stage. Based on the adversarial losses of the generator and discriminator, we categorize GANs into two classes, Symmetric GANs and Asymmetric GANs, and introduce a novel gradient decomposition method to unify the two, allowing us to train both classes in one stage and hence alleviate the training effort. We also computationally analyze the efficiency of the proposed method, and empirically demonstrate that, the proposed method yields a solid $1.5\times$ acceleration across various datasets and network architectures. Furthermore, we show that the proposed method is readily applicable to other adversarial-training scenarios, such as data-free knowledge distillation. The code is available at https://github.com/zju-vipa/OSGAN.

Submitted 16 June, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

Comments: Accepted to CVPR 2021

arXiv:2012.02859 [pdf, other]

Idle speed control with low-complexity offset-free explicit model predictive control in presence of system delay

Authors: Sang Hwan Son, Se-Kyu Oh, Byung Jun Park, Min Jun Song, Jong Min Lee

Abstract: The requirement for continual improvement of idle speed control (ISC) performance is increasing due to the stringent regulation on emission and fuel economy these days. In this regard, a low-complexity offset-free explicit model predictive control (EMPC) with constraint horizon is designed to regulate the idle speed under unmeasured disturbance in presence of system delay with rigorous formulation. Particularly, we developed a high-fidelity 4-stroke gasoline-direct injected spark-ignited engine model based on first-principles and test vehicle driving data, and designed a model predictive ISC system. To handle the delay from intake to torque production, we constructed a control-oriented model with delay augmentation. To reject the influence of torque loss, we implemented the offset-free MPC scheme with disturbance model and estimator. Moreover, to deal with the limited capacity assigned for the controller in the engine control unit and the short sampling instant of the engine system, we formulated a low-complexity multiparametric quadratic program with constraint horizon in presence of system delay in state and input variables, and obtained an explicit solution map. To demonstrate the performance of the designed controller, a series of closed-loop simulations were performed. The developed explicit controller showed proper ISC performance in presence of torque loss and system delay.

Submitted 13 December, 2020; v1 submitted 4 December, 2020; originally announced December 2020.

arXiv:2005.00096 [pdf, other]

An Early Study on Intelligent Analysis of Speech under COVID-19: Severity, Sleep Quality, Fatigue, and Anxiety

Authors: Jing Han, Kun Qian, Meishu Song, Zijiang Yang, Zhao Ren, Shuo Liu, Juan Liu, Huaiyuan Zheng, Wei Ji, Tomoya Koike, Xiao Li, Zixing Zhang, Yoshiharu Yamamoto, Björn W. Schuller

Abstract: The COVID-19 outbreak was announced as a global pandemic by the World Health Organisation in March 2020 and has affected a growing number of people in the past few weeks. In this context, advanced artificial intelligence techniques are brought to the fore in responding to fight against and reduce the impact of this global health crisis. In this study, we focus on developing some potential use-cases of intelligent speech analysis for COVID-19 diagnosed patients. In particular, by analysing speech recordings from these patients, we construct audio-only-based models to automatically categorise the health state of patients from four aspects, including the severity of illness, sleep quality, fatigue, and anxiety. For this purpose, two established acoustic feature sets and support vector machines are utilised. Our experiments show that an average accuracy of .69 obtained estimating the severity of illness, which is derived from the number of days in hospitalisation. We hope that this study can foster an extremely fast, low-cost, and convenient way to automatically detect the COVID-19 disease.

Submitted 14 May, 2020; v1 submitted 30 April, 2020; originally announced May 2020.

arXiv:1911.11502 [pdf, other]

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

Authors: Ya Zhao, Rui Xu, Xinchao Wang, Peng Hou, Haihong Tang, Mingli Song

Abstract: Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading, unfortunately, remains inferior to the one of its counterpart speech recognition, due to the ambiguous nature of its actuations that makes it challenging to extract discriminant features from the lip movement videos. In this paper, we propose a new method, termed as Lip by Speech (LIBS), of which the goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that the features extracted from speech recognizers may provide complementary and discriminant clues, which are formidable to be obtained from the subtle movements of the lips, and consequently facilitate the training of lip readers. This is achieved, specifically, by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we utilize an efficacious alignment scheme to handle the inconsistent lengths of the audios and videos, as well as an innovative filtering strategy to refine the speech recognizer's prediction. The proposed method achieves the new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by a margin of 7.66% and 2.75% in character error rate, respectively.

Submitted 26 November, 2019; originally announced November 2019.

Comments: AAAI 2020

arXiv:1907.00390 [pdf, other]

A Novel Bi-directional Interrelated Model for Joint Intent Detection and Slot Filling

Authors: Haihong E, Peiqing Niu, Zhongfu Chen, Meina Song

Abstract: A spoken language understanding (SLU) system includes two main tasks, slot filling (SF) and intent detection (ID). The joint model for the two tasks is becoming a tendency in SLU. But the bi-directional interrelated connections between the intent and slots are not established in the existing joint models. In this paper, we propose a novel bi-directional interrelated model for joint intent detection and slot filling. We introduce an SF-ID network to establish direct connections for the two tasks to help them promote each other mutually. Besides, we design an entirely new iteration mechanism inside the SF-ID network to enhance the bi-directional interrelated connections. The experimental results show that the relative improvement in the sentence-level semantic frame accuracy of our model is 3.79% and 5.42% on ATIS and Snips datasets, respectively, compared to the state-of-the-art model.

Submitted 30 June, 2019; originally announced July 2019.

Comments: Accepted paper of ACL 2019 (short paper) with 5 pages

arXiv:1807.04931 [pdf, ps, other]

doi 10.1109/CoDIT.2019.8820652

Convexity Analysis of Optimization Framework of Attitude Determination from Vector Observations

Authors: Jin Wu, Zebo Zhou, Min Song

Abstract: In the past several years, there have been several representative attitude determination methods developed using derivative-based optimization algorithms. Optimization techniques, e.g. the gradient-descent algorithm (GDA), Gauss-Newton algorithm (GNA), and Levenberg-Marquardt algorithm (LMA), suffer from local optima in real engineering practice. A brief discussion on the convexity of this problem was presented recently \cite{Ahmed2012}, stating that the problem is neither convex nor concave. In this paper, we give analytic proofs on this problem. The results reveal that the target loss function is convex in the common practice of quaternion normalization, which leads to non-existence of local optimum.

Submitted 13 July, 2018; originally announced July 2018.

Journal ref: IEEE CODIT 2019

arXiv:1803.07713 [pdf, ps, other]

Robust Beamforming for SWIPT System with Chance Constraints

Authors: Yinglei Teng, Wanxin Zhao, Mei Yan, Yong Zhang, Mei Song

Abstract: The robust beamforming problem in multiple-input single-output (MISO) downlink networks of simultaneous wireless information and power transfer (SWIPT) is studied in this paper. Adopting the time switching fashion to perform energy harvesting and information decoding respectively, we aim at maximizing the sum rate under imperfect channel state information (CSI) and the chance constraints of users' harvested energy. In view of the fact that the constraints for minimal harvested energy are not necessary to meet from time to time, this paper adopts a chance constraint to model them and uses the Bernstein inequality to transform it into deterministic constraints equivalently. Recognizing the maximum sum rate problem of imperfect CSI as a nonconvex problem, we transform it into finding the expectation of minimum mean square error (MMSE) equivalently in this paper, and an alternative optimization (AO) algorithm is proposed to decompose the optimization problem into two sub-problems: the transmit beamformer design and the division of switching time. The simulation results show the performance gains compared to non-robust state-of-the-art schemes.

Submitted 20 March, 2018; originally announced March 2018.

Comments: 6 pages, 5 figures, to appear in IEEE ICC 2018, May 20-24

arXiv:1802.09448 [pdf, other]

doi 10.1109/JIOT.2018.2813429

DTER: Schedule Optimal RF Energy Request and Harvest for Internet of Things

Authors: Yu Luo, Lina Pu, Yanxiao Zhao, Guodong Wang, Min Song

Abstract: We propose a new energy harvesting strategy that uses a dedicated energy source (ES) to optimally replenish energy for radio frequency (RF) energy harvesting powered Internet of Things. Specifically, we develop a two-step dual tunnel energy requesting (DTER) strategy that minimizes the energy consumption on both the energy harvesting device and the ES. Besides the causality and capacity constraints that are investigated in the existing approaches, DTER also takes into account the overhead issue and the nonlinear charge characteristics of an energy storage component to make the proposed strategy practical. Both offline and online scenarios are considered in the second step of DTER. To solve the nonlinear optimization problem of the offline scenario, we convert the design of offline optimal energy requesting problem into a classic shortest path problem and thus a global optimal solution can be obtained through dynamic programming (DP) algorithms. The online suboptimal transmission strategy is developed as well. Simulation study verifies that the online strategy can achieve almost the same energy efficiency as the global optimal solution in the long term.

Submitted 20 February, 2018; originally announced February 2018.

arXiv:1802.07101 [pdf, other]

Stroke Controllable Fast Style Transfer with Adaptive Receptive Fields

Authors: Yongcheng Jing, Yang Liu, Yezhou Yang, Zunlei Feng, Yizhou Yu, Dacheng Tao, Mingli Song

Abstract: The Fast Style Transfer methods have been recently proposed to transfer a photograph to an artistic style in real-time. This task involves controlling the stroke size in the stylized results, which remains an open challenge. In this paper, we present a stroke controllable style transfer network that can achieve continuous and spatial stroke size control. By analyzing the factors that influence the stroke size, we propose to explicitly account for the receptive field and the style image scales. We propose a StrokePyramid module to endow the network with adaptive receptive fields, and two training strategies to achieve faster convergence and augment new stroke sizes upon a trained model respectively. By combining the proposed runtime control strategies, our network can achieve continuous changes in stroke sizes and produce distinct stroke sizes in different spatial regions within the same output image.

Submitted 18 October, 2018; v1 submitted 20 February, 2018; originally announced February 2018.

Comments: Accepted by ECCV 2018. Supplementary material: https://yongchengjing.com/pdf/strokeControllable_supp.pdf

arXiv:1705.04058 [pdf, other]

Neural Style Transfer: A Review

Authors: Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, Mingli Song

Abstract: The seminal work of Gatys et al. demonstrated the power of Convolutional Neural Networks (CNNs) in creating artistic imagery by separating and recombining image content and style. This process of using CNNs to render a content image in different styles is referred to as Neural Style Transfer (NST). Since then, NST has become a trending topic both in academic literature and industrial applications. It is receiving increasing attention and a variety of approaches are proposed to either improve or extend the original NST algorithm. In this paper, we aim to provide a comprehensive overview of the current progress towards NST. We first propose a taxonomy of current algorithms in the field of NST. Then, we present several evaluation methods and compare different NST algorithms both qualitatively and quantitatively. The review concludes with a discussion of various applications of NST and open problems for future research. A list of papers discussed in this review, corresponding codes, pre-trained models and more comparison results are publicly available at https://github.com/ycjing/Neural-Style-Transfer-Papers.

Submitted 30 October, 2018; v1 submitted 11 May, 2017; originally announced May 2017.

Comments: Project page: https://github.com/ycjing/Neural-Style-Transfer-Papers

Showing 1–38 of 38 results for author: Song, M