-
SPIRONet: Spatial-Frequency Learning and Topological Channel Interaction Network for Vessel Segmentation
Authors:
De-Xing Huang,
Xiao-Hu Zhou,
Xiao-Liang Xie,
Shi-Qi Liu,
Shuang-Yi Wang,
Zhen-Qiu Feng,
Mei-Jiang Gui,
Hao Li,
Tian-Yu Xiang,
Bo-Xian Yao,
Zeng-Guang Hou
Abstract:
Automatic vessel segmentation is paramount for developing next-generation interventional navigation systems. However, current approaches suffer from suboptimal segmentation performances due to significant challenges in intraoperative images (i.e., low signal-to-noise ratio, small or slender vessels, and strong interference). In this paper, a novel spatial-frequency learning and topological channel interaction network (SPIRONet) is proposed to address the above issues. Specifically, dual encoders are utilized to comprehensively capture local spatial and global frequency vessel features. Then, a cross-attention fusion module is introduced to effectively fuse spatial and frequency features, thereby enhancing feature discriminability. Furthermore, a topological channel interaction module is designed to filter out task-irrelevant responses based on graph neural networks. Extensive experimental results on several challenging datasets (CADSA, CAXF, DCA1, and XCAD) demonstrate state-of-the-art performances of our method. Moreover, the inference speed of SPIRONet is 21 FPS with a 512x512 input size, surpassing clinical real-time requirements (6~12 FPS). These promising outcomes indicate SPIRONet's potential for integration into vascular interventional navigation systems. Code is available at https://github.com/Dxhuang-CASIA/SPIRONet.
Submitted 28 June, 2024;
originally announced June 2024.
-
MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing
Authors:
Yu-Fen Huang,
Nikki Moran,
Simon Coleman,
Jon Kelly,
Shun-Hwa Wei,
Po-Yin Chen,
Yun-Hsin Huang,
Tsung-** Chen,
Yu-Chia Kuo,
Yu-Chi Wei,
Chih-Hsuan Li,
Da-Yu Huang,
Hsuan-Kai Kao,
Ting-Wei Lin,
Li Su
Abstract:
In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamic, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the usage of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrase, and expressive contents from audio, video and motion data, and the generation of musicians' body motion from given music audio. The dataset and codes are available alongside this publication (https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset).
Submitted 10 June, 2024;
originally announced June 2024.
-
Research on Tumors Segmentation based on Image Enhancement Method
Authors:
Danyi Huang,
Ziang Liu,
Yizhou Li
Abstract:
One of the most effective ways to treat liver cancer is to perform precise liver resection surgery, the key step of which includes precise digital image segmentation of the liver and its tumor. However, traditional liver parenchymal segmentation techniques often face several challenges in performing liver segmentation: lack of precision, slow processing speed, and computational burden. These shortcomings limit the efficiency of surgical planning and execution. This work first describes in detail a new image enhancement algorithm that enhances the key features of an image by adaptively adjusting the contrast and brightness of the image. Then, a deep learning-based segmentation network is introduced, which is specially trained on the enhanced images to optimize the detection accuracy of tumor regions. In addition, multi-scale analysis techniques are incorporated into the study, allowing the model to analyze images at different resolutions to capture more nuanced tumor features. The study uses the 3Dircadb dataset to test the effectiveness of the proposed method. The experimental results show that, compared with the traditional image segmentation method, the new method using image enhancement technology significantly improves the accuracy and recall rate of tumor identification.
Submitted 7 June, 2024;
originally announced June 2024.
-
A Hybrid Deep Learning Classification of Perimetric Glaucoma Using Peripapillary Nerve Fiber Layer Reflectance and Other OCT Parameters from Three Anatomy Regions
Authors:
Ou Tan,
David S. Greenfield,
Brian A. Francis,
Rohit Varma,
Joel S. Schuman,
David Huang,
Dongseok Choi
Abstract:
Precis: A hybrid deep-learning model combines NFL reflectance and other OCT parameters to improve glaucoma diagnosis. Objective: To investigate if a deep learning model could be used to combine nerve fiber layer (NFL) reflectance and other OCT parameters for glaucoma diagnosis. Patients and Methods: This is a prospective observational study of 106 normal subjects and 164 perimetric glaucoma (PG) patients. Peripapillary NFL reflectance map, NFL thickness map, optic head analysis of disc, and macular ganglion cell complex thickness were obtained using spectral domain OCT. A hybrid deep learning model combined a fully connected network (FCN) and a convolution neural network (CNN) to develop and combine those OCT maps and parameters to distinguish normal and PG eyes. Two deep learning models were compared based on whether the NFL reflectance map was used as part of the input or not. Results: The hybrid deep learning model with reflectance achieved 0.909 sensitivity at 99% specificity and 0.926 at 95%. The overall accuracy was 0.948 with 0.893 sensitivity and 1.000 specificity, and the AROC was 0.979, which is significantly better than the logistic regression models (p < 0.001). The second best model is the hybrid deep learning model w/o reflectance, which also had significantly higher AROC than logistic regression models (p < 0.001). Logistic regression with reflectance model had slightly higher AROC or sensitivity than the other logistic regression model without reflectance (p = 0.024). Conclusions: Hybrid deep learning model significantly improved the diagnostic accuracy, with or without NFL reflectance. Hybrid deep learning model, combining reflectance/NFL thickness/GCC thickness/ONH parameter, may be a practical model for glaucoma screening purposes.
Submitted 5 June, 2024;
originally announced June 2024.
-
Invisible Needle Detection in Ultrasound: Leveraging Mechanism-Induced Vibration
Authors:
Chenyang Li,
Dianye Huang,
Angelos Karlas,
Nassir Navab,
Zhongliang Jiang
Abstract:
In clinical applications that involve ultrasound-guided intervention, the visibility of the needle can be severely impeded due to steep insertion and strong distractors such as speckle noise and anatomical occlusion. To address this challenge, we propose VibNet, a learning-based framework tailored to enhance the robustness and accuracy of needle detection in ultrasound images, even when the target becomes invisible to the naked eye. Inspired by Eulerian Video Magnification techniques, we utilize an external step motor to induce low-amplitude periodic motion on the needle. These subtle vibrations offer the potential to generate robust frequency features for detecting the motion patterns around the needle. To robustly and precisely detect the needle leveraging these vibrations, VibNet integrates learning-based Short-Time-Fourier-Transform and Hough-Transform modules to achieve successive sub-goals, including motion feature extraction in the spatiotemporal space, frequency feature aggregation, and needle detection in the Hough space. Based on the results obtained on distinct ex vivo porcine and bovine tissue samples, the proposed algorithm exhibits superior detection performance with efficient computation and generalization capability.
Submitted 21 March, 2024;
originally announced March 2024.
-
MOSformer: Momentum encoder-based inter-slice fusion transformer for medical image segmentation
Authors:
De-Xing Huang,
Xiao-Hu Zhou,
Xiao-Liang Xie,
Shi-Qi Liu,
Zhen-Qiu Feng,
Mei-Jiang Gui,
Hao Li,
Tian-Yu Xiang,
Xiu-Ling Liu,
Zeng-Guang Hou
Abstract:
Medical image segmentation takes an important position in various clinical applications. Deep learning has emerged as the predominant solution for automated segmentation of volumetric medical images. 2.5D-based segmentation models bridge computational efficiency of 2D-based models and spatial perception capabilities of 3D-based models. However, prevailing 2.5D-based models often treat each slice equally, failing to effectively learn and exploit inter-slice information, resulting in suboptimal segmentation performances. In this paper, a novel Momentum encoder-based inter-slice fusion transformer (MOSformer) is proposed to overcome this issue by leveraging inter-slice information at multi-scale feature maps extracted by different encoders. Specifically, dual encoders are employed to enhance feature distinguishability among different slices. One of the encoders is moving-averaged to maintain the consistency of slice representations. Moreover, an IF-Swin transformer module is developed to fuse inter-slice multi-scale features. The MOSformer is evaluated on three benchmark datasets (Synapse, ACDC, and AMOS), establishing a new state-of-the-art with 85.63%, 92.19%, and 85.43% of DSC, respectively. These promising results indicate its competitiveness in medical image segmentation. Codes and models of MOSformer will be made publicly available upon acceptance.
Submitted 22 January, 2024;
originally announced January 2024.
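The MOSformer abstract above states that one of the two encoders is "moving-averaged". A minimal sketch of such a momentum (exponential moving average) update is shown below. The function name `momentum_update`, the flat parameter lists, and the coefficient `m` are illustrative assumptions; the abstract does not give the actual update rule or its hyperparameters.

```python
from typing import List


def momentum_update(online: List[float], averaged: List[float], m: float = 0.99) -> List[float]:
    """Return m * averaged + (1 - m) * online, element by element.

    `online` stands for parameters trained by gradient descent, `averaged`
    for the slowly moving copy. Both names and the default m are assumptions.
    """
    if len(online) != len(averaged):
        raise ValueError("parameter lists must have the same length")
    return [m * a + (1.0 - m) * o for o, a in zip(online, averaged)]


# Toy usage: the averaged copy drifts toward the online parameters.
online_params = [1.0, 2.0]
averaged_params = [0.0, 0.0]
for _ in range(3):
    averaged_params = momentum_update(online_params, averaged_params, m=0.5)
print(averaged_params)
```

A value of `m` close to 1 makes the averaged encoder change slowly, which is one common way to keep representations consistent across training steps.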
-
Underwater motions analysis and control of a coupling-tiltable unmanned aerial-aquatic quadrotor
Authors:
Dongyue Huang,
Chenggang Wang,
Minghao Dou,
Xuchen Liu,
Zixuan Liu,
Biao Wang,
Ben M. Chen
Abstract:
This paper proposes a method for analyzing a series of potential motions in a coupling-tiltable aerial-aquatic quadrotor based on its nonlinear dynamics. Some characteristics and constraints derived by this method are specified as Singular Thrust Tilt Angles (STTAs), which are utilized to generate motions including planar motions. A switch-based control scheme addresses issues of control direction uncertainty inherent to the mechanical structure by incorporating a saturated Nussbaum function. A high-fidelity simulation environment incorporating a comprehensive hydrodynamic model is built based on a Hardware-In-The-Loop (HITL) setup with Gazebo and a flight control board. The experiments validate the effectiveness of the absolute and quasi planar motions, which cannot be achieved by conventional quadrotors, and demonstrate stable performance when the pitch or roll angle is activated in the auxiliary control channel.
Submitted 12 December, 2023;
originally announced December 2023.
-
Deep Multimodal Fusion for Surgical Feedback Classification
Authors:
Rafal Kocielnik,
Elyssa Y. Wong,
Timothy N. Chu,
Lydia Lin,
De-An Huang,
Jiayun Wang,
Anima Anandkumar,
Andrew J. Hung
Abstract:
Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvements in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., through visual cues like pointing to anatomic elements). In this work, we leverage a clinically-validated five-category classification of surgical feedback: "Anatomic", "Technical", "Procedural", "Praise" and "Visual Aid". We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. Our automated classification of surgical feedback achieves AUCs ranging from 71.5 to 77.6 with the fusion improving performance by 3.1%. We also show that high-quality manual transcriptions of feedback audio from experts improve AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvements. Empirically, we find that the Staged training strategy, with first pre-training each modality separately and then training them jointly, is more effective than training different modalities altogether. We also present intuitive findings on the importance of modalities for different feedback categories. This work offers an important first look at the feasibility of automated classification of real-world live surgical feedback based on text, audio, and video modalities.
Submitted 5 December, 2023;
originally announced December 2023.
-
Efficient Segmentation with Texture in Ore Images Based on Box-supervised Approach
Authors:
Guodong Sun,
Delong Huang,
Yuting Peng,
Le Cheng,
Bo Wu,
Yang Zhang
Abstract:
Image segmentation methods have been utilized to determine the particle size distribution of crushed ores. Due to the complex working environment, high-powered computing equipment is difficult to deploy. At the same time, the ore distribution is stacked, and it is difficult to identify the complete features. To address this issue, an effective box-supervised technique with texture features is provided for ore image segmentation that can identify complete and independent ores. Firstly, a ghost feature pyramid network (Ghost-FPN) is proposed to process the features obtained from the backbone to reduce redundant semantic information and computation generated by complex networks. Then, an optimized detection head is proposed to obtain the feature to maintain accuracy. Finally, Lab color space (Lab) and local binary patterns (LBP) texture features are combined to form a fusion feature similarity-based loss function to improve accuracy while incurring no loss. Experiments on MS COCO have shown that the proposed fusion features are also worth studying on other types of datasets. Extensive experimental results demonstrate the effectiveness of the proposed method, which achieves over 50 frames per second with a small model size of 21.6 MB. Meanwhile, the method maintains a high level of accuracy compared with the state-of-the-art approaches on ore image dataset. The source code is available at https://github.com/MVME-HBUT/OREINST.
Submitted 10 November, 2023;
originally announced November 2023.
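The ore segmentation abstract above relies on local binary pattern (LBP) texture features. As a rough illustration of what an LBP code is, the sketch below computes the basic 8-neighbor code for one interior pixel. The bit ordering (clockwise from the top-left neighbor) and the `>=` comparison are assumed conventions, and the helper name `lbp_code` is hypothetical; how the paper combines LBP with Lab color similarity in its loss is not reproduced here.

```python
from typing import List, Sequence

# Neighbor offsets, clockwise starting at the top-left pixel (assumed ordering).
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img: Sequence[Sequence[int]], r: int, c: int) -> int:
    """Basic 3x3 LBP code of the interior pixel at row r, column c.

    Each neighbor contributes one bit: 1 if it is >= the center value.
    """
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(_OFFSETS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


patch: List[List[int]] = [
    [5, 9, 1],
    [4, 6, 7],
    [2, 3, 8],
]
# Neighbors 9, 7 and 8 are >= 6, setting bits 1, 3 and 4: 2 + 8 + 16.
print(lbp_code(patch, 1, 1))
```

Because the code depends only on the ordering of intensities around a pixel, it is unchanged by monotonic brightness shifts, which is one reason LBP is a popular lightweight texture descriptor.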
-
Non-line-of-sight reconstruction via structure sparsity regularization
Authors:
Duolan Huang,
Quan Chen,
Zhun Wei,
Rui Chen
Abstract:
Non-line-of-sight (NLOS) imaging allows for the imaging of objects around a corner, which enables potential applications in various fields such as autonomous driving, robotic vision, medical imaging, security monitoring, etc. However, the quality of reconstruction is challenged by low signal-noise-ratio (SNR) measurements. In this study, we present a regularization method, referred to as structure sparsity (SS) regularization, for denoising in NLOS reconstruction. By exploiting the prior knowledge of structure sparseness, we incorporate nuclear norm penalization into the cost function of directional light-cone transform (DLCT) model for NLOS imaging system. This incorporation effectively integrates the neighborhood information associated with the directional albedo, thereby facilitating the denoising process. Subsequently, the reconstruction is achieved by optimizing a directional albedo model with SS regularization using fast iterative shrinkage-thresholding algorithm. Notably, the robust reconstruction of occluded objects is observed. Through comprehensive evaluations conducted on both synthetic and experimental datasets, we demonstrate that the proposed approach yields high-quality reconstructions, surpassing the state-of-the-art reconstruction algorithms, especially in scenarios involving short exposure and low SNR measurements.
Submitted 4 August, 2023;
originally announced August 2023.
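The NLOS abstract above combines a nuclear norm penalty with the fast iterative shrinkage-thresholding algorithm (FISTA). For orientation, the standard ingredients of that combination can be written as follows. The notation ($\mathcal{A}$, $b$, $\lambda$, $X$, $Z$) is assumed for illustration and is not taken from the paper, whose DLCT cost function may differ in detail.

```latex
% Generic nuclear-norm-regularized least squares (assumed notation):
\begin{align}
  \min_{X}\; f(X) + \lambda \lVert X \rVert_*,
  \qquad f(X) = \tfrac{1}{2}\,\lVert \mathcal{A}(X) - b \rVert_2^2 .
\end{align}

% Proximal operator of the nuclear norm (singular value thresholding),
% where Y = U \operatorname{diag}(\sigma_1, \dots, \sigma_r) V^\top is an SVD of Y:
\begin{align}
  \operatorname{prox}_{\tau \lVert \cdot \rVert_*}(Y)
  = U \operatorname{diag}\bigl(\max(\sigma_i - \tau,\, 0)\bigr) V^\top .
\end{align}

% FISTA with step size 1/L, where L is a Lipschitz constant of \nabla f,
% initialized with t_1 = 1 and Z_1 = X_0:
\begin{align}
  X_k &= \operatorname{prox}_{(\lambda/L) \lVert \cdot \rVert_*}
         \Bigl( Z_k - \tfrac{1}{L} \nabla f(Z_k) \Bigr), \\
  t_{k+1} &= \frac{1 + \sqrt{1 + 4 t_k^2}}{2}, \\
  Z_{k+1} &= X_k + \frac{t_k - 1}{t_{k+1}} \,\bigl( X_k - X_{k-1} \bigr).
\end{align}
```

Thresholding the singular values shrinks small ones to zero, which favors low-rank structure and suppresses noise-dominated components.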
-
Learning Koopman Operators with Control Using Bi-level Optimization
Authors:
Daning Huang,
Muhammad Bayu Prasetyo,
Yin Yu,
Junyi Geng
Abstract:
The accurate modeling and control of nonlinear dynamical effects are crucial for numerous robotic systems. The Koopman formalism emerges as a valuable tool for linear control design in nonlinear systems within unknown environments. However, it still remains a challenging task to learn the Koopman operator with control from data, and in particular, the simultaneous identification of the Koopman linear dynamics and the mapping between the physical and Koopman states. Conventionally, the simultaneous learning of the dynamics and mapping is achieved via single-level optimization based on one-step or multi-step discrete-time predictions, but the learned model may lack model robustness, training efficiency, and/or long-term predictive accuracy. This paper presents a bi-level optimization framework that jointly learns the Koopman embedding mapping and Koopman dynamics with exact long-term dynamical constraints. Our formulation allows back-propagation in standard learning framework and the use of state-of-the-art optimizers, yielding more accurate and stable system prediction in long-time horizon over various applications compared to conventional methods.
Submitted 5 November, 2023; v1 submitted 11 July, 2023;
originally announced July 2023.
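The Koopman abstract above refers to linear dynamics in a lifted state and to a mapping between physical and Koopman states. A common discrete-time way of writing this setup is sketched below. The symbols $\varphi$, $\psi$, $A$, and $B$ are assumed notation, and the bi-level formulation proposed in the paper is not reproduced.

```latex
% Lifting, linear dynamics with control, and reconstruction (assumed notation):
\begin{align}
  z_k &= \varphi(x_k), \\
  z_{k+1} &= A z_k + B u_k, \\
  \hat{x}_k &= \psi(z_k).
\end{align}

% Unrolling the linear dynamics gives the k-step prediction in closed form:
\begin{align}
  z_k = A^k z_0 + \sum_{j=0}^{k-1} A^{\,k-1-j} B u_j .
\end{align}
```

The closed-form unrolled prediction is what makes long-horizon accuracy sensitive to the learned $A$: errors in its eigenvalues compound with every power of $A$.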
-
Enhancing Building Semantic Segmentation Accuracy with Super Resolution and Deep Learning: Investigating the Impact of Spatial Resolution on Various Datasets
Authors:
Zhiling Guo,
Xiaodan Shi,
Haoran Zhang,
Dou Huang,
Xiaoya Song,
**yue Yan,
Ryosuke Shibasaki
Abstract:
The development of remote sensing and deep learning techniques has enabled building semantic segmentation with high accuracy and efficiency. Despite their success in different tasks, discussion of the impact of spatial resolution on deep learning based building semantic segmentation remains inadequate, which makes choosing a more cost-effective data source a big challenge. To address this issue, in this study we resample remote sensing images from three study areas to multiple spatial resolutions by super-resolution and down-sampling. After that, two representative deep learning architectures, UNet and FPN, are selected for model training and testing. The experimental results obtained from three cities with two deep learning models indicate that the spatial resolution greatly influences building segmentation results, with better cost-effectiveness at around 0.3 m, which we believe will be an important insight for data selection and preparation.
Submitted 9 July, 2023;
originally announced July 2023.
-
Motion Magnification in Robotic Sonography: Enabling Pulsation-Aware Artery Segmentation
Authors:
Dianye Huang,
Yuan Bi,
Nassir Navab,
Zhongliang Jiang
Abstract:
Ultrasound (US) imaging is widely used for diagnosing and monitoring arterial diseases, mainly due to the advantages of being non-invasive, radiation-free, and real-time. In order to provide additional information to assist clinicians in diagnosis, the tubular structures are often segmented from US images. To improve the artery segmentation accuracy and stability during scans, this work presents a novel pulsation-assisted segmentation neural network (PAS-NN) by explicitly taking advantage of the cardiac-induced motions. Motion magnification techniques are employed to amplify the subtle motion within the frequency band of interest to extract the pulsation signals from sequential US images. The extracted real-time pulsation information can help to locate the arteries on cross-section US images; therefore, we explicitly integrated the pulsation into the proposed PAS-NN as attention guidance. Notably, a robotic arm is necessary to provide stable movement during US imaging, since magnifying the target motions from US images captured along a scan path is not feasible manually due to hand tremor. To validate the proposed robotic US system for imaging arteries, experiments are carried out on volunteers' carotid and radial arteries. The results demonstrate that the PAS-NN achieves results comparable to the state of the art on the carotid artery and can effectively improve the segmentation performance for small vessels (radial artery).
Submitted 7 July, 2023;
originally announced July 2023.
-
Deep Learning Methods for Device Identification Using Symbols Trace Plot
Authors:
Da Huang,
Akram Al-Hourani,
Kandeepan Sithamparanathan,
Wayne S. T. Rowe
Abstract:
Device authentication is a crucial aspect of any communication system. Recently, the physical-layer approach of radio frequency (RF) fingerprinting has gained increased interest as it provides an extra layer of security without requiring additional components. In this work, we propose an RF fingerprinting based transmitter authentication approach, density trace plot (DTP), to exploit device-identifiable fingerprints. By considering IQ imbalance solely as the feature source, DTP can efficiently extract device-identifiable fingerprints from symbol transition trajectories and density center drifts. In total, three DTP modalities based on constellation, eye and phase traces are respectively generated and tested against three deep learning classifiers: the 2D-CNN, 2D-CNN+biLSTM and 3D-CNN. The feasibility of these DTP and classifier pairs is verified using a practical dataset collected from the ADALM-PLUTO software-defined radios (SDRs).
Submitted 11 February, 2024; v1 submitted 28 June, 2023;
originally announced June 2023.
-
Temporal Aware Mixed Attention-based Convolution and Transformer Network (MACTN) for EEG Emotion Recognition
Authors:
Xiaopeng Si,
Dong Huang,
Yulin Sun,
Dong Ming
Abstract:
Emotion recognition plays a crucial role in human-computer interaction, and electroencephalography (EEG) is advantageous for reflecting human emotional states. In this study, we propose MACTN, a hierarchical hybrid model for jointly modeling local and global temporal information. The model is inspired by neuroscience research on the temporal dynamics of emotions. MACTN extracts local emotional features through a convolutional neural network (CNN) and integrates sparse global emotional features through a transformer. Moreover, we employ channel attention mechanisms to identify the most task-relevant channels. Through extensive experimentation on two publicly available datasets, namely THU-EP and DEAP, our proposed method, MACTN, consistently achieves superior classification accuracy and F1 scores compared to other existing methods in most experimental settings. Furthermore, ablation studies have shown that the integration of both self-attention mechanisms and channel attention mechanisms leads to improved classification performance. Finally, an earlier version of this method, which shares the same ideas, won the Emotional BCI Competition's final championship in the 2022 World Robot Contest.
Submitted 18 May, 2023;
originally announced May 2023.
-
Accelerated Nonconvex ADMM with Self-Adaptive Penalty for Rank-Constrained Model Identification
Authors:
Qingyuan Liu,
Zhengchao Huang,
Hao Ye,
Dexian Huang,
Chao Shang
Abstract:
The alternating direction method of multipliers (ADMM) has been widely adopted in low-rank approximation and low-order model identification tasks; however, the performance of nonconvex ADMM is highly reliant on the choice of penalty parameter. To accelerate ADMM for solving rank-constrained identification problems, this paper proposes a new self-adaptive strategy for automatic penalty update. Guided by first-order analysis of the increment of the augmented Lagrangian, the self-adaptive penalty updating enables effective and balanced minimization of both primal and dual residuals and thus ensures a stable convergence. Moreover, improved efficiency can be obtained within the Anderson acceleration scheme. Numerical examples show that the proposed strategy significantly accelerates the convergence of nonconvex ADMM while alleviating the critical reliance on tedious tuning of penalty parameters.
Submitted 8 September, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
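The ADMM abstract above concerns automatic penalty updates that balance primal and dual residuals. The paper's own strategy, derived from first-order analysis of the augmented Lagrangian, is not described in enough detail here to reproduce. As a point of reference only, the sketch below shows the widely used residual-balancing heuristic instead; the function name `balance_penalty` and the defaults `mu=10` and `tau=2` are conventional choices assumed for illustration.

```python
def balance_penalty(rho: float, primal_res: float, dual_res: float,
                    mu: float = 10.0, tau: float = 2.0) -> float:
    """Classic residual-balancing update of the ADMM penalty parameter.

    This is the textbook heuristic, NOT the self-adaptive rule proposed in
    the paper above. `primal_res` and `dual_res` are residual norms.
    """
    if primal_res > mu * dual_res:
        # Primal residual dominates: penalize constraint violation more.
        return rho * tau
    if dual_res > mu * primal_res:
        # Dual residual dominates: relax the penalty.
        return rho / tau
    return rho


print(balance_penalty(1.0, primal_res=100.0, dual_res=1.0))
print(balance_penalty(1.0, primal_res=1.0, dual_res=100.0))
print(balance_penalty(1.0, primal_res=1.0, dual_res=1.0))
```

When ADMM is implemented with a scaled dual variable, that variable has to be rescaled whenever the penalty changes, which is easy to overlook in practice.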
-
SB-VQA: A Stack-Based Video Quality Assessment Framework for Video Enhancement
Authors:
Ding-Jiun Huang,
Yu-Ting Kao,
Tieh-Hung Chuang,
Ya-Chun Tsai,
**g-Kai Lou,
Shuen-Huei Guan
Abstract:
In recent years, several video quality assessment (VQA) methods have been developed, achieving high performance. However, these methods were not specifically trained for enhanced videos, which limits their ability to predict video quality accurately based on human subjective perception. To address this issue, we propose a stack-based framework for VQA that outperforms existing state-of-the-art methods on VDPVE, a dataset consisting of enhanced videos. In addition to proposing the VQA framework for enhanced videos, we also investigate its application on professionally generated content (PGC). To address copyright issues with premium content, we create the PGCVQ dataset, which consists of videos from YouTube. We evaluate our proposed approach and state-of-the-art methods on PGCVQ, and provide new insights on the results. Our experiments demonstrate that existing VQA algorithms can be applied to PGC videos, and we find that VQA performance for PGC videos can be improved by considering the plot of a play, which highlights the importance of video semantic understanding.
Submitted 15 May, 2023;
originally announced May 2023.
-
TJ-FlyingFish: Design and Implementation of an Aerial-Aquatic Quadrotor with Tiltable Propulsion Units
Authors:
Xuchen Liu,
Minghao Dou,
Dongyue Huang,
Biao Wang,
Jinqiang Cui,
Qinyuan Ren,
Lihua Dou,
Zhi Gao,
Jie Chen,
Ben M. Chen
Abstract:
Aerial-aquatic vehicles are capable of moving in the two most dominant fluids, making them promising for a wide range of applications. We propose a prototype with special designs for propulsion and thruster configuration to cope with the vast differences in the fluid properties of water and air. For propulsion, the operating range is switched for the different mediums by the dual-speed propulsion unit, providing sufficient thrust and also ensuring output efficiency. For thruster configuration, thrust vectoring is realized by the rotation of the propulsion unit around the mount arm, thus enhancing the underwater maneuverability. This paper presents a quadrotor prototype of this concept, along with its design details and practical realization.
Submitted 6 February, 2023; v1 submitted 28 January, 2023;
originally announced January 2023.
-
Interpretable Diabetic Retinopathy Diagnosis based on Biomarker Activation Map
Authors:
Pengxiao Zang,
Tristan T. Hormel,
Jie Wang,
Yukun Guo,
Steven T. Bailey,
Christina J. Flaxel,
David Huang,
Thomas S. Hwang,
Yali Jia
Abstract:
Deep learning classifiers provide the most accurate means of automatically diagnosing diabetic retinopathy (DR) based on optical coherence tomography (OCT) and its angiography (OCTA). The power of these models is attributable in part to the inclusion of hidden layers that provide the complexity required to achieve a desired task. However, hidden layers also render algorithm outputs difficult to interpret. Here we introduce a novel biomarker activation map (BAM) framework based on generative adversarial learning that allows clinicians to verify and understand classifiers' decision-making. A data set including 456 macular scans was graded as non-referable or referable DR based on current clinical standards. A DR classifier that was used to evaluate our BAM was first trained based on this data set. The BAM generation framework was designed by combining two U-shaped generators to provide meaningful interpretability to this classifier. The main generator was trained to take referable scans as input and produce an output that would be classified by the classifier as non-referable. The BAM is then constructed as the difference image between the output and input of the main generator. To ensure that the BAM only highlights classifier-utilized biomarkers, an assistant generator was trained to do the opposite, producing scans that would be classified as referable by the classifier from non-referable scans. The generated BAMs highlighted known pathologic features including nonperfusion area and retinal fluid. A fully interpretable classifier based on these highlights could help clinicians better utilize and verify automated DR diagnosis.
Submitted 26 June, 2023; v1 submitted 12 December, 2022;
originally announced December 2022.
-
Data-Driven Network Neuroscience: On Data Collection and Benchmark
Authors:
Jiaxing Xu,
Yunhan Yang,
David Tse Jung Huang,
Sophi Shilpa Gururajapathy,
Yiping Ke,
Miao Qiao,
Alan Wang,
Haribalan Kumar,
Josh McGeown,
Eryn Kwon
Abstract:
This paper presents a comprehensive and quality collection of functional human brain network data for potential research in the intersection of neuroscience, machine learning, and graph analytics. Anatomical and functional MRI images have been used to understand the functional connectivity of the human brain and are particularly important in identifying underlying neurodegenerative conditions such as Alzheimer's, Parkinson's, and Autism. Recently, the study of the brain in the form of brain networks using machine learning and graph analytics has become increasingly popular, especially to predict the early onset of these conditions. A brain network, represented as a graph, retains rich structural and positional information that traditional examination methods are unable to capture. However, the lack of publicly accessible brain network data prevents researchers from data-driven explorations. One of the main difficulties lies in the complicated domain-specific preprocessing steps and the exhaustive computation required to convert the data from MRI images into brain networks. We bridge this gap by collecting a large amount of MRI images from public databases and a private source, working with domain experts to make sensible design choices, and preprocessing the MRI images to produce a collection of brain network datasets. The datasets originate from 6 different sources, cover 4 brain conditions, and consist of a total of 2,702 subjects. We test our graph datasets on 12 machine learning models to provide baselines and validate the data quality on a recent graph analysis model. To lower the barrier to entry and promote the research in this interdisciplinary field, we release our brain network data and complete preprocessing details including codes at https://doi.org/10.17608/k6.auckland.21397377 and https://github.com/brainnetuoa/data_driven_network_neuroscience.
Submitted 29 October, 2023; v1 submitted 10 November, 2022;
originally announced November 2022.
-
Racial Disparities in Pulse Oximetry Cannot Be Fixed With Race-Based Correction
Authors:
Neal Patwari,
Di Huang,
Kiki Bonetta-Misteli
Abstract:
Studies have shown pulse oximeter measurements of blood oxygenation have statistical bias that is a function of race, which results in higher rates of occult hypoxemia, i.e., missed detection of dangerously low oxygenation, in patients of color. This paper further characterizes the statistical distribution of pulse ox measurements, showing they also have a higher variance for patients racialized as Black, compared to those racialized as white. We show that no single race-based correction factor will provide equal performance in the detection of hypoxemia. The results have implications for racially equitable pulse oximetry.
Submitted 10 October, 2022;
originally announced October 2022.
-
Optimal control of dielectric elastomer actuated multibody dynamical systems
Authors:
Dengpeng Huang,
Sigrid Leyendecker
Abstract:
In this work, a simulation model for the optimal control of dielectric elastomer actuated flexible multibody dynamics systems is presented. The Dielectric Elastomer Actuator (DEA) behaves like a flexible artificial muscle in soft robotics. It is modeled as an electromechanically coupled geometrically exact beam, where the electric charges serve as control variables. The DEA-beam is integrated as an actuator into multibody systems consisting of rigid and flexible components. The model also represents contact interaction via unilateral constraints between the beam actuator and e.g. a rigid body during the grasping process of a soft robot. Specifically for the DEA, a work conjugated electric displacement and strain-like electric variables are derived for the Cosserat beam. With a mathematically concise and physically representative formulation, a reduced free energy function is developed for the beam-DEA. In the optimal control problem, an objective function is minimized while the dynamic balance equations for the multibody system have to be fulfilled together with the complementarity conditions for the contact and boundary conditions. The optimal control problem is solved via a direct transcription method, transforming it into a constrained nonlinear optimization problem. The beam is first semidiscretized with 1D finite elements and then the multibody dynamics is temporally discretized with a variational integrator leading to the discrete Euler-Lagrange equations, which are further reduced with the null space projection. The discrete Euler-Lagrange equations and the boundary conditions serve as equality constraints, whereas the contact constraints are treated as inequality constraints in the optimization of the discretized objective. The effectiveness of the developed model is demonstrated by three numerical examples, including a cantilever beam, a soft robotic worm and a soft grasper.
Submitted 13 July, 2022;
originally announced July 2022.
-
A Robust Deep Learning Enabled Semantic Communication System for Text
Authors:
Xiang Peng,
Zhijin Qin,
Danlan Huang,
Xiaoming Tao,
Jianhua Lu,
Guangyi Liu,
Chengkang Pan
Abstract:
With the advent of the 6G era, the concept of semantic communication has attracted increasing attention. Compared with conventional communication systems, semantic communication systems are not only affected by physical noise existing in the wireless communication environment, e.g., additive white Gaussian noise, but also by semantic noise due to the source and the nature of deep learning-based systems. In this paper, we elaborate on the mechanism of semantic noise. In particular, we categorize semantic noise into two categories: literal semantic noise and adversarial semantic noise. The former is caused by written errors or expression ambiguity, while the latter is caused by perturbations or attacks added to the embedding layer via the semantic channel. To prevent semantic noise from influencing semantic communication systems, we present a robust deep learning enabled semantic communication system (R-DeepSC) that leverages a calibrated self-attention mechanism and adversarial training to tackle semantic noise. Compared with baseline models that only consider physical noise for text transmission, the proposed R-DeepSC achieves remarkable performance in dealing with semantic noise under different signal-to-noise ratios.
Submitted 6 June, 2022;
originally announced June 2022.
-
Modularized Bilinear Koopman Operator for Modeling and Predicting Transients of Microgrids
Authors:
Xinyuan Jiang,
Yan Li,
Daning Huang
Abstract:
Modularized Koopman Bilinear Form (M-KBF) is presented to model and predict the transient dynamics of microgrids in the presence of disturbances. As a scalable data-driven approach, M-KBF divides the identification and prediction of the high-dimensional nonlinear system into the individual study of subsystems, thus alleviating the difficulty of intensively handling high volume data and overcoming the curse of dimensionality. For each subsystem, Koopman bilinear form is applied to efficiently identify its model by developing eigenfunctions via the extended dynamic mode decomposition method with an eigenvalue-based order truncation. Extensive tests show that M-KBF can provide accurate transient dynamics prediction for the nonlinear microgrids and verify the plug-and-play modeling and prediction function, which offers a potent tool for identifying high-dimensional systems. The modularity feature of M-KBF enables the provision of fast and precise prediction for the microgrid operation and control, paving the way towards online applications.
Submitted 17 May, 2022; v1 submitted 6 May, 2022;
originally announced May 2022.
-
PIDGeuN: Graph Neural Network-Enabled Transient Dynamics Prediction of Networked Microgrids Through Full-Field Measurement
Authors:
Yin Yu,
Xinyuan Jiang,
Daning Huang,
Yan Li
Abstract:
A Physics-Informed Dynamic Graph Neural Network (PIDGeuN) is presented to accurately, efficiently and robustly predict the nonlinear transient dynamics of microgrids in the presence of disturbances. The graph-based architecture of PIDGeuN provides a natural representation of the microgrid topology. Using only the state information that is practically measurable, PIDGeuN employs a time delay embedding formulation to fully reproduce the system dynamics, avoiding the dependency of conventional methods on internal dynamic states such as controllers. Based on a judiciously designed message passing mechanism, the PIDGeuN incorporates two physics-informed techniques to improve its prediction performance, including a physics-data-infusion approach to determining the inter-dependencies between buses, and a loss term to respect the known physical law of the power system, i.e., the Kirchhoff's law, to ensure the feasibility of the model prediction. Extensive tests show that PIDGeuN can provide accurate and robust prediction of transient dynamics for nonlinear microgrids over a long-term time period. Therefore, the PIDGeuN offers a potent tool for the modeling of large scale networked microgrids (NMs), with potential applications to predictive or preventive control in real time applications for the stable and resilient operations of NMs.
Submitted 18 April, 2022;
originally announced April 2022.
-
You Can Wash Better: Daily Handwashing Assessment with Smartwatches
Authors:
Fei Wang,
Xilei Wu,
Xin Wang,
Jianlei Chi,
Jingang Shi,
Dong Huang
Abstract:
We propose UWash, an intelligent solution upon smartwatches, to assess handwashing for the purpose of raising users' awareness and cultivating habits in high-quality handwashing. UWash can identify the onset/offset of handwashing, measure the duration of each gesture, and score each gesture as well as the entire procedure in accordance with the WHO guidelines. Technically, we address the task of handwashing assessment as the semantic segmentation problem in computer vision, and propose a lightweight UNet-like network, only 496KBits, to achieve it effectively. Experiments over 51 subjects show that UWash achieves an accuracy of 92.27% on sample-wise handwashing gesture recognition, <0.5 seconds error in onset/offset detection, and <5 out of 100 points error in scoring in the user-dependent setting, while remaining promising in the cross-user evaluation and in the cross-user-cross-location evaluation.
Submitted 9 December, 2021;
originally announced December 2021.
-
Breaking the Dilemma of Medical Image-to-image Translation
Authors:
Lingke Kong,
Chenyu Lian,
Detian Huang,
Zhenjiang Li,
Yanle Hu,
Qichao Zhou
Abstract:
Supervised Pix2Pix and unsupervised Cycle-consistency are two modes that dominate the field of medical image-to-image translation. However, neither mode is ideal. The Pix2Pix mode has excellent performance, but it requires paired and well pixel-wise aligned images, which may not always be achievable due to respiratory motion or anatomy change between the times that paired images are acquired. The Cycle-consistency mode is less stringent with training data and works well on unpaired or misaligned images, but its performance may not be optimal. In order to break the dilemma of the existing modes, we propose a new unsupervised mode called RegGAN for medical image-to-image translation. It is based on the theory of "loss-correction". In RegGAN, the misaligned target images are considered as noisy labels and the generator is trained with an additional registration network to fit the misaligned noise distribution adaptively. The goal is to search for the common optimal solution to both image-to-image translation and registration tasks. We incorporated RegGAN into a few state-of-the-art image-to-image translation methods and demonstrated that RegGAN could be easily combined with these methods to improve their performances. For example, a simple CycleGAN in our mode surpasses the latest NICEGAN even while using fewer network parameters. Based on our results, RegGAN outperformed both Pix2Pix on aligned data and Cycle-consistency on misaligned or unpaired data. RegGAN is insensitive to noise, which makes it a better choice for a wide range of scenarios, especially for medical image-to-image translation tasks in which well pixel-wise aligned data are not available.
Submitted 10 November, 2021; v1 submitted 12 October, 2021;
originally announced October 2021.
-
Sensoring and Application of Multimodal Data for the Detection of Freezing of Gait in Parkinson's Disease
Authors:
Wei Zhang,
Debin Huang,
Hantao Li,
Lipeng Wang,
Yanzhao Wei,
Kang Pan,
Lin Ma,
Huanhuan Feng,
Jing Pan,
Yuzhu Guo
Abstract:
The accurate and reliable detection or prediction of freezing of gait (FOG) is important for fall prevention in Parkinson's Disease (PD) and for studying the physiological transitions during the occurrence of FOG. Integrating both commercial and self-designed sensors, a protocol has been designed to acquire multimodal physical and physiological information during FOG, including gait acceleration (ACC), electroencephalogram (EEG), electromyogram (EMG), and skin conductance (SC). Two tasks were designed to trigger FOG, including gait initiation failure and FOG during walking. A total of 12 PD patients completed the experiments and produced 3 hours and 42 minutes of valid data. The FOG episodes were labeled by two qualified physicians. Each unimodal data stream and their combinations were used to detect FOG. Results showed that multimodal data benefit the detection of FOG. Among unimodal data, EEG had better discriminative ability than ACC and EMG. However, the acquisition of EEG is more complicated. Multimodal motional and electrophysiological data can also be used to study the physiological transition process during the occurrence of FOG and provide personalised interventions.
Submitted 8 October, 2021;
originally announced October 2021.
-
Graph Fourier Transform based Audio Zero-watermarking
Authors:
Longting Xu,
Daiyu Huang,
Syed Faham Ali Zaidi,
Abdul Rauf,
Rohan Kumar Das
Abstract:
The frequent exchange of multimedia information in the present era projects an increasing demand for copyright protection. In this work, we propose a novel audio zero-watermarking technology based on graph Fourier transform for enhancing the robustness with respect to copyright protection. In this approach, the combined shift operator is used to construct the graph signal, upon which the graph Fourier analysis is performed. The selected maximum absolute graph Fourier coefficients representing the characteristics of the audio segment are then encoded into a feature binary sequence using K-means algorithm. Finally, the resultant feature binary sequence is XOR-ed with the watermark binary sequence to realize the embedding of the zero-watermarking. The experimental studies show that the proposed approach performs more effectively in resisting common or synchronization attacks than the existing state-of-the-art methods.
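The final XOR step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the example bit sequences, and the assumption that the feature bits have already been derived (for instance, from the graph Fourier coefficients via K-means) are all hypothetical and chosen only for illustration.

```python
from typing import List


def xor_bits(a: List[int], b: List[int]) -> List[int]:
    """Element-wise XOR of two equal-length binary sequences."""
    if len(a) != len(b):
        raise ValueError("bit sequences must have equal length")
    return [x ^ y for x, y in zip(a, b)]


def embed_zero_watermark(feature_bits: List[int], watermark_bits: List[int]) -> List[int]:
    """Combine audio-derived feature bits with the watermark.

    The audio itself is left unmodified; only the resulting key is stored
    (hypothetical naming, the abstract does not name this value).
    """
    return xor_bits(feature_bits, watermark_bits)


def extract_watermark(feature_bits: List[int], key_bits: List[int]) -> List[int]:
    """Recover the watermark by XOR-ing the stored key with feature bits
    re-derived from the (possibly attacked) audio."""
    return xor_bits(feature_bits, key_bits)


# Hypothetical example data: in practice the feature bits would come from
# the audio analysis stage, which is not reproduced here.
feature = [1, 0, 1, 1, 0, 0, 1, 0]
watermark = [0, 1, 1, 0, 1, 0, 0, 1]

key = embed_zero_watermark(feature, watermark)
recovered = extract_watermark(feature, key)
```

Because XOR is its own inverse, the watermark is recovered exactly whenever the re-derived feature bits match the originals; any feature bits flipped by an attack flip the corresponding watermark bits, so robustness depends on how stable the feature extraction is.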
Submitted 16 September, 2021;
originally announced September 2021.
-
Intelligent Decision Method for Main Control Parameters of Tunnel Boring Machine based on Multi-Objective Optimization of Excavation Efficiency and Cost
Authors:
Bin Liu,
Yaxu Wang,
Guangzu Zhao,
Bin Yang,
Ruirui Wang,
Dexiang Huang,
Bin Xiang
Abstract:
Timely and reasonable matching of the control parameters and geological conditions of the rock mass in tunnel excavation is crucial for hard rock tunnel boring machines (TBMs). Therefore, this paper proposes an intelligent decision method for the main control parameters of the TBM based on the multi-objective optimization of excavation efficiency and cost. The main objectives of this method are to obtain the most important parameters of the rock mass and machine, determine the optimization objective, and establish the objective function. In this study, muck information was included as an important parameter in the traditional rock mass and machine parameter database. The rock-machine interaction model was established through an improved neural network algorithm. Using 250 sets of data collected in the field, the validity of the rock-machine interaction relationship model was verified. Then, taking the cost as the optimization objective, the cost calculation model related to tunneling and the cutter was obtained. Subsequently, combined with rock-machine interaction model, the objective function of control parameter optimization based on cost was established. Finally, a tunneling test was carried out at the engineering site, and the main TBM control parameters (thrust and torque) after the optimization decision were used to excavate the test section. Compared with the values in the section where the TBM operators relied on experience, the average penetration rate of the TBM increased by 11.10%, and the average cutter life increased by 15.62%. The results indicate that this method can play an effective role in TBM tunneling in the test section.
Submitted 28 April, 2021;
originally announced April 2021.
-
DC-Assisted Stabilization of Internal Oscillations for Improved Symbol Transitions in a Direct Antenna Modulation Transmitter
Authors:
Danyang Huang,
Kurt Schab,
Joseph Dusenbury,
Brandon Sluss,
Jacob Adams
Abstract:
Internal oscillations in switched antenna transmitters cause undesirable fluctuations of the stored energy in the system, reducing the effectiveness of time-varying broadbanding methods, such as energy-synchronous direct antenna modulation. To mitigate these parasitic oscillations, a modified direct antenna modulation system with an auxiliary DC source is introduced to stabilize energy storage on the antenna. A detailed circuit model for a direct antenna modulation system is used to identify the origin of the oscillations and to justify the selection of the DC source. Measured phase shift keyed waveforms transmitted by the modified system show significant increases in signal fidelity, including a 10-20 dB reduction in error vector magnitude compared to a time-invariant system. Comparison to an equivalent, scalable time-invariant antenna suggests that the switched transmitter behaves as though it has 2-3 times lower radiation Q-factor and 20% higher radiation efficiency.
Submitted 20 August, 2021; v1 submitted 26 February, 2021;
originally announced March 2021.
-
IEEE SLT 2021 Alpha-mini Speech Challenge: Open Datasets, Tracks, Rules and Baselines
Authors:
Yihui Fu,
Zhuoyuan Yao,
Weipeng He,
Jian Wu,
Xiong Wang,
Zhanheng Yang,
Shimin Zhang,
Lei Xie,
Dongyan Huang,
Hui Bu,
Petr Motlicek,
Jean-Marc Odobez
Abstract:
The IEEE Spoken Language Technology Workshop (SLT) 2021 Alpha-mini Speech Challenge (ASC) is intended to improve research on keyword spotting (KWS) and sound source location (SSL) on humanoid robots. Many publications report significant improvements in deep learning based KWS and SSL on open source datasets in recent years. For deep learning model training, it is necessary to expand the data coverage to improve the robustness of the model. Thus, simulating multi-channel noisy and reverberant data from single-channel speech, noise, echo and room impulse response (RIR) is widely adopted. However, this approach may generate a mismatch between simulated data and recorded data in real application scenarios, especially echo data. In this challenge, we open source a sizable speech, keyword, echo and noise corpus for promoting data-driven methods, particularly deep-learning approaches on KWS and SSL. We also choose Alpha-mini, a humanoid robot produced by UBTECH equipped with a built-in four-microphone array on its head, to record development and evaluation sets under the actual Alpha-mini robot application scenario, including noise as well as echo and mechanical noise generated by the robot itself for model evaluation. Furthermore, we illustrate the rules, evaluation methods and baselines for researchers to quickly assess their achievements and optimize their models.
Submitted 14 November, 2020; v1 submitted 4 November, 2020;
originally announced November 2020.
-
Generating Visually Aligned Sound from Videos
Authors:
Peihao Chen,
Yang Zhang,
Mingkui Tan,
Hongdong Xiao,
Deng Huang,
Chuang Gan
Abstract:
We focus on the task of generating sound from natural videos, and the sound should be both temporally and content-wise aligned with visual signals. This task is extremely challenging because some sounds generated \emph{outside} a camera can not be inferred from video content. The model may be forced to learn an incorrect map** between visual content and these irrelevant sounds. To address this c…
▽ More
We focus on the task of generating sound from natural videos, and the sound should be both temporally and content-wise aligned with visual signals. This task is extremely challenging because some sounds generated \emph{outside} a camera can not be inferred from video content. The model may be forced to learn an incorrect map** between visual content and these irrelevant sounds. To address this challenge, we propose a framework named REGNET. In this framework, we first extract appearance and motion features from video frames to better distinguish the object that emits sound from complex background information. We then introduce an innovative audio forwarding regularizer that directly considers the real sound as input and outputs bottlenecked sound features. Using both visual and bottlenecked sound features for sound prediction during training provides stronger supervision for the sound prediction. The audio forwarding regularizer can control the irrelevant sound component and thus prevent the model from learning an incorrect map** between video frames and sound emitted by the object that is out of the screen. During testing, the audio forwarding regularizer is removed to ensure that REGNET can produce purely aligned sound only from visual features. Extensive evaluations based on Amazon Mechanical Turk demonstrate that our method significantly improves both temporal and content-wise alignment. Remarkably, our generated sound can fool the human with a 68.12% success rate. Code and pre-trained models are publicly available at https://github.com/PeihaoChen/regnet
Submitted 14 July, 2020;
originally announced August 2020.
-
Foley Music: Learning to Generate Music from Videos
Authors:
Chuang Gan,
Deng Huang,
Peihao Chen,
Joshua B. Tenenbaum,
Antonio Torralba
Abstract:
In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. We first identify two key intermediate representations for a successful video-to-music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements. The MIDI events can then be converted to realistic music using an off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our models on videos containing a variety of music performances. Experimental results show that our model outperforms several existing systems in generating music that is pleasant to listen to. More importantly, the MIDI representations are fully interpretable and transparent, thus enabling us to perform music editing flexibly. We encourage the readers to watch the demo video with audio turned on to experience the results.
Submitted 21 July, 2020;
originally announced July 2020.
-
Focal Loss Analysis of Nerve Fiber Layer Reflectance for Glaucoma Diagnosis
Authors:
Ou Tan,
Liang Liu,
Qisheng You,
Jie Wang,
Aiyin Chen,
Eliesa Ing,
John C. Morrison,
Yali Jia,
David Huang
Abstract:
Purpose: To evaluate nerve fiber layer (NFL) reflectance for glaucoma diagnosis. Methods: Participants were imaged with 4.5×4.5-mm volumetric disc scans using spectral-domain optical coherence tomography (OCT). The normalized NFL reflectance map was processed by an azimuthal filter to reduce directional reflectance bias due to variation of beam incidence angle. The peripapillary area of the map was divided into 160 superpixels. Average reflectance was the mean of superpixel reflectance. Low-reflectance superpixels were identified as those with NFL reflectance below the 5th-percentile normative cutoff. Focal reflectance loss was measured by summing the loss in low-reflectance superpixels. Results: Thirty-five normal, 30 pre-perimetric glaucoma (PPG), and 35 perimetric glaucoma (PG) participants were enrolled. Azimuthal filtering improved the repeatability of the normalized NFL reflectance, as measured by the pooled superpixel standard deviation (SD), from 0.73 to 0.57 dB (p<0.001, paired t-test) and reduced the population SD from 2.14 to 1.78 dB (p<0.001, t-test). Most glaucomatous reflectance maps showed characteristic patterns of contiguous wedge or diffuse defects. Focal NFL reflectance loss had significantly higher diagnostic sensitivity than the best NFL thickness parameter (overall, inferior, or focal loss volume): 53% vs. 23% (p=0.027) in PPG eyes and 100% vs. 80% (p=0.023) in PG eyes, with the specificity fixed at 99%. Conclusions: Azimuthal filtering reduces the variability of NFL reflectance measurements. Focal NFL reflectance loss has excellent glaucoma diagnostic accuracy compared to the standard NFL thickness parameters. The reflectance map may be useful for localizing NFL defects.
Submitted 24 June, 2020;
originally announced June 2020.
-
Music Gesture for Visual Sound Separation
Authors:
Chuang Gan,
Deng Huang,
Hang Zhao,
Joshua B. Tenenbaum,
Antonio Torralba
Abstract:
Recent deep learning approaches have achieved impressive performance on visual sound separation tasks. However, these approaches are mostly built on appearance and optical-flow-like motion feature representations, which have limited ability to find the correlations between audio signals and visual points, especially when separating multiple instruments of the same type, such as multiple violins in a scene. To address this, we propose "Music Gesture," a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music. We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals. Experimental results on three music performance datasets show: 1) strong improvements on benchmark metrics for hetero-musical separation tasks (i.e., different instruments); 2) a new ability for effective homo-musical separation for piano, flute, and trumpet duets, which to the best of our knowledge has never been achieved with alternative methods. Project page: http://music-gesture.csail.mit.edu.
Submitted 20 April, 2020;
originally announced April 2020.
-
DWM: A Decomposable Winograd Method for Convolution Acceleration
Authors:
Di Huang,
Xishan Zhang,
Rui Zhang,
Tian Zhi,
Deyuan He,
Jiaming Guo,
Chang Liu,
Qi Guo,
Zidong Du,
Shaoli Liu,
Tianshi Chen,
Yunji Chen
Abstract:
Winograd's minimal filtering algorithm has been widely used in Convolutional Neural Networks (CNNs) to reduce the number of multiplications for faster processing. However, it is only effective on convolutions with a kernel size of 3x3 and a stride of 1, because it suffers from significantly increased FLOPs and numerical accuracy problems for kernel sizes larger than 3x3 and fails on convolutions with a stride larger than 1. In this paper, we propose a novel Decomposable Winograd Method (DWM), which extends the original Winograd's minimal filtering algorithm beyond this limitation to a wide range of general convolutions. DWM decomposes kernels with a large size or large stride into several small kernels with a stride of 1, to which the Winograd method is then applied, so that DWM can reduce the number of multiplications while keeping the numerical accuracy. It enables fast exploration of larger kernel sizes and larger stride values in CNNs for high performance and accuracy, and even the potential for new CNNs. Compared with the original Winograd, the proposed DWM is able to support all kinds of convolutions with a speedup of ~2, without affecting the numerical accuracy.
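The decomposition idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it is one-dimensional, it covers only the large-kernel case (not large strides), and it omits the Winograd transform that DWM would apply to each small kernel. The function names `correlate1d` and `decomposed_correlate1d` and the chunk size `part` are hypothetical, chosen only for this example. The sketch shows that a 5-tap stride-1 kernel can be split into smaller stride-1 kernels whose partial outputs sum to the original result.

```python
def correlate1d(signal, kernel):
    """Valid-mode 1D correlation with stride 1 (what CNNs call convolution)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def decomposed_correlate1d(signal, kernel, part=3):
    """Split a large kernel into chunks of at most `part` taps, run each
    chunk as a small stride-1 correlation on a shifted view of the input,
    and sum the partial outputs. (Assumption: `part` = 3 mirrors the small
    kernel size that Winograd handles well; no Winograd transform is used.)"""
    out_len = len(signal) - len(kernel) + 1
    total = [0] * out_len
    for offset in range(0, len(kernel), part):
        chunk = kernel[offset:offset + part]
        # The slice of the input that this chunk touches across all outputs.
        view = signal[offset:offset + out_len + len(chunk) - 1]
        partial = correlate1d(view, chunk)
        total = [t + p for t, p in zip(total, partial)]
    return total


signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1, 2, 1]          # 5 taps, split into chunks of 3 and 2
direct = correlate1d(signal, kernel)
decomposed = decomposed_correlate1d(signal, kernel)
print(direct, decomposed)          # both lists are identical
```

Because the example uses integer inputs, the two results match exactly; with floating-point inputs the summation order differs, so they would agree only up to rounding.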
Submitted 2 February, 2020;
originally announced February 2020.
-
High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram
Authors:
Leyuan Sheng,
Dong-Yan Huang,
Evgeniy N. Pavlovskiy
Abstract:
In speech synthesis and speech enhancement systems, mel-spectrograms need to be precise acoustic representations. However, the generated spectrograms are over-smoothed and therefore cannot produce high-quality synthesized speech. Inspired by image-to-image translation, we address this problem by using a learning-based post filter combining Pix2PixHD and ResUnet to reconstruct the mel-spectrograms together with super-resolution. From the resulting super-resolution spectrogram networks, we can generate enhanced spectrograms to produce high-quality synthesized speech. Our proposed model achieves improved mean opinion scores (MOS) of 3.71 and 4.01 over baseline results of 3.29 and 3.84 when using the Griffin-Lim and WaveNet vocoders, respectively.
Submitted 2 December, 2019;
originally announced December 2019.
-
Enhancement of Underwater Images with Statistical Model of Background Light and Optimization of Transmission Map
Authors:
Wei Song,
Yan Wang,
Dongmei Huang,
Antonio Liotta,
Cristian Perra
Abstract:
Underwater images often have severe quality degradation and distortion due to light absorption and scattering in the water medium. A hazed image formation model is widely used to restore the image quality. It depends on two optical parameters: the background light (BL) and the transmission map (TM). Underwater images can also be enhanced by color and contrast correction from the perspective of image processing. In this paper, we propose an effective underwater image enhancement method that combines underwater image restoration and color correction. Firstly, a manually annotated background lights (MABLs) database is developed. With reference to the relationship between MABLs and the histogram distributions of various underwater images, robust statistical models for BL estimation are provided. Next, the TM of the R channel is roughly estimated based on the new underwater dark channel prior via the statistics of clear and high-resolution underwater images; then a scene depth map based on the underwater light attenuation prior and an adjusted reversed saturation map are applied to compensate and modify the coarse TM of the R channel. Next, the TMs of the G-B channels are estimated based on the difference of attenuation ratios between the R channel and the G-B channels. Finally, to improve the color and contrast of the restored image with a natural appearance, a variation of white balance is introduced as post-processing. To guide the priorities of underwater image enhancement, extensive evaluations are conducted to discuss the impacts of the key parameters, including BL and TM, and the importance of the color correction. Comparisons with other state-of-the-art methods demonstrate that our proposed underwater image enhancement method can achieve higher accuracy of estimated BLs, less computation time, superior performance, and more valuable information retention.
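The abstract above names the hazed image formation model without writing it out. In its commonly used form, each colour channel of the observed image is I = J * t + B * (1 - t), where J is the scene radiance, t the transmission, and B the background light. The sketch below inverts that relation for one channel once B and t are known. It is an illustration only: the function name `restore_channel` and the lower bound `t_min` are assumptions made for this example, and the paper's statistical BL models and TM estimation steps are not reproduced.

```python
def restore_channel(observed, background_light, transmission, t_min=0.1):
    """Invert I = J * t + B * (1 - t) for a single colour channel.

    observed, transmission: equal-length lists of floats in [0, 1].
    background_light: float in [0, 1] for this channel.
    t_min: assumed lower bound on t, used to avoid amplifying noise
           where the estimated transmission is close to zero.
    """
    restored = []
    for i_c, t in zip(observed, transmission):
        t = max(t, t_min)
        j_c = (i_c - background_light) / t + background_light
        restored.append(min(1.0, max(0.0, j_c)))  # clamp to the valid range
    return restored


# Round trip: synthesise a degraded channel from known values, then restore it.
scene = [0.2, 0.8]
t_map = [0.5, 0.25]
b_light = 0.6
degraded = [j * t + b_light * (1 - t) for j, t in zip(scene, t_map)]
recovered = restore_channel(degraded, b_light, t_map)
print(recovered)  # close to the original scene values, up to rounding
```

The round trip recovers the scene only because B and t are exact here; in practice both are estimates, which is why the accuracy of the BL and TM estimation matters so much for the restored image.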
Submitted 19 June, 2019;
originally announced June 2019.
-
Temporal Unet: Sample Level Human Action Recognition using WiFi
Authors:
Fei Wang,
Yunpeng Song,
Jimuyang Zhang,
Jinsong Han,
Dong Huang
Abstract:
Human actions cause WiFi signal distortion, which has been widely explored for action recognition tasks such as fall detection for the elderly, hand sign language recognition, and keystroke estimation. To the best of our knowledge, past work recognizes human action by categorizing one complete distortion series into one action, which we term series-level action recognition. In this paper, we introduce a much more fine-grained and challenging action recognition task into the WiFi sensing domain, i.e., sample-level action recognition. In this task, every WiFi distortion sample in the whole series should be categorized into one action, which is a critical technique in precise action localization, continuous action segmentation, and real-time action recognition. To achieve WiFi-based sample-level action recognition, we thoroughly analyze approaches in image-based semantic segmentation as well as in video-based frame-level action recognition, and then propose a simple yet efficient deep convolutional neural network, i.e., Temporal Unet. Experimental results show that Temporal Unet performs well on this novel task. Code has been made publicly available at https://github.com/geekfeiw/WiSLAR.
Submitted 19 April, 2019;
originally announced April 2019.