Light-weight Retinal Layer Segmentation with Global Reasoning

Authors: Xiang He, Weiye Song, Yiming Wang, Fabio Poiesi, Ji Yi, Manishi Desai, Quanqing Xu, Kongzheng Yang, Yi Wan

Abstract: Automatic retinal layer segmentation with medical images, such as optical coherence tomography (OCT) images, serves as an important tool for diagnosing ophthalmic diseases. However, it is challenging to achieve accurate segmentation due to the low contrast and blood flow noise present in the images. In addition, the algorithm should be light-weight to be deployed for practical clinical applications. Therefore, it is desirable to design a light-weight network with high performance for retinal layer segmentation. In this paper, we propose LightReSeg for retinal layer segmentation, which can be applied to OCT images. Specifically, our approach follows an encoder-decoder structure, where the encoder part employs multi-scale feature extraction and a Transformer block to fully exploit the semantic information of feature maps at all scales and to give the features better global reasoning capabilities, while for the decoder part, we design a multi-scale asymmetric attention (MAA) module to preserve the semantic information at each encoder scale. Experiments show that our approach, with only 3.3M parameters, achieves better segmentation performance than the current state-of-the-art method TransUnet (105.7M parameters) on both our collected dataset and two other public datasets.

Submitted 25 April, 2024; originally announced April 2024.

Comments: IEEE Transactions on Instrumentation & Measurement

arXiv:2403.05808

Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution

Authors: Junxiong Lin, Yan Wang, Zeng Tao, Boyang Wang, Qing Zhao, Haorang Wang, Xuan Tong, Xinji Mai, Yuxuan Lin, Wei Song, Jiawen Yu, Shaoqi Yan, Wenqiang Zhang

Abstract: Pre-trained diffusion models utilized for image generation encapsulate a substantial reservoir of a priori knowledge pertaining to intricate textures. Leveraging this a priori knowledge for image super-resolution is a compelling avenue. Nonetheless, prevailing diffusion-based methods overlook the constraints imposed by degradation information on the diffusion process. Furthermore, these methods fail to consider the spatial variability inherent in the estimated blur kernel, stemming from factors such as motion jitter and out-of-focus elements in open-environment scenarios. This oversight causes the super-resolved image to deviate notably from reality. To address these concerns, we introduce a framework known as Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution (SSR). Within the SSR framework, we propose a Spatially Variant Kernel Refinement (SVKR) module. SVKR estimates a Depth-Informed Kernel, which takes the depth information into account and is spatially variant. Additionally, SVKR enhances the accuracy of depth information acquired from LR images, allowing for mutual enhancement between the depth map and blur kernel estimates. Finally, we introduce the Adaptive Multi-Modal Fusion (AMF) module to align the information from three modalities: low-resolution images, depth maps, and blur kernels. This alignment can constrain the diffusion model to generate more authentic SR results. Quantitative and qualitative experiments affirm the superiority of our approach, while ablation experiments corroborate the effectiveness of the modules we have proposed.

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2403.05136

DeRO: Dead Reckoning Based on Radar Odometry With Accelerometers Aided for Robot Localization

Authors: Hoang Viet Do, Yong Hun Kim, Joo Han Lee, Min Ho Lee, ** Woo Song

Abstract: In this paper, we propose a radar odometry structure that directly utilizes radar velocity measurements for dead reckoning while maintaining its ability to update estimations within the Kalman filter framework. Specifically, we employ the Doppler velocity obtained by a 4D Frequency Modulated Continuous Wave (FMCW) radar in conjunction with gyroscope data to calculate poses. This approach helps mitigate high drift resulting from accelerometer biases and double integration. Instead, tilt angles measured by gravitational force are utilized alongside relative distance measurements from radar scan matching for the filter's measurement update. Additionally, to further enhance the system's accuracy, we estimate and compensate for the radar velocity scale factor. The performance of the proposed method is verified through five real-world open-source datasets. The results demonstrate that our approach reduces position error by 47% and rotation error by 52% on average compared to the state-of-the-art radar-inertial fusion method in terms of absolute trajectory error.

Submitted 8 March, 2024; originally announced March 2024.

Comments: 9 pages, 5 figures, 1 table, conference

ACM Class: I.2.9
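
As a rough illustration of the dead-reckoning idea in the DeRO abstract above, the sketch below propagates a planar pose by integrating a body-frame velocity (such as a Doppler radar might supply) together with a gyroscope yaw rate. It is a simplified 2D toy, not the paper's filter: the function and parameter names (`dead_reckon`, `v_body`, `yaw_rate`) are assumptions made here, and the Kalman measurement update, tilt angles, scan matching, and scale-factor estimation are all omitted.

```python
import math

def dead_reckon(pose, v_body, yaw_rate, dt):
    """Propagate a planar pose (x, y, yaw) by one time step.

    v_body   -- (vx, vy) velocity in the body frame, e.g. from radar Doppler
    yaw_rate -- angular rate about the vertical axis from a gyroscope (rad/s)
    dt       -- time step in seconds
    """
    x, y, yaw = pose
    vx, vy = v_body
    # Rotate the body-frame velocity into the world frame.
    wx = math.cos(yaw) * vx - math.sin(yaw) * vy
    wy = math.sin(yaw) * vx + math.cos(yaw) * vy
    # Velocity is integrated once; no accelerometer double integration is involved.
    return (x + wx * dt, y + wy * dt, yaw + yaw_rate * dt)

# Hypothetical input: 1 m/s straight ahead for 1 s at 100 Hz, no rotation.
pose = (0.0, 0.0, 0.0)
for _ in range(100):
    pose = dead_reckon(pose, (1.0, 0.0), 0.0, 0.01)
```

Because position comes from a single integration of measured velocity, an accelerometer bias never enters the position estimate, which is the drift source the abstract says the method avoids.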

arXiv:2401.05850

Contrastive Loss Based Frame-wise Feature disentanglement for Polyphonic Sound Event Detection

Authors: Yadong Guan, Jiqing Han, Hongwei Song, Wenjie Song, Guibin Zheng, Tieran Zheng, Yongjun He

Abstract: Overlapping sound events are ubiquitous in real-world environments, but existing end-to-end sound event detection (SED) methods still struggle to detect them effectively. A critical reason is that these methods represent overlapping events using shared and entangled frame-wise features, which degrades the feature discrimination. To solve the problem, we propose a disentangled feature learning framework to learn a category-specific representation. Specifically, we employ different projectors to learn the frame-wise features for each category. To ensure that these features do not contain information from other categories, we maximize the common information between frame-wise features within the same category and propose a frame-wise contrastive loss. In addition, considering that the labeled data used by the proposed method is limited, we propose a semi-supervised frame-wise contrastive loss that can leverage large amounts of unlabeled data to achieve feature disentanglement. The experimental results demonstrate the effectiveness of our method.

Submitted 11 January, 2024; originally announced January 2024.

Comments: accepted by icassp2024

arXiv:2401.03352

Dynamic and Memory-efficient Shape Based Methodologies for User Type Identification in Smart Grid Applications

Authors: Rui Yuan, S. Ali Pourmousavi, Wen L. Soong, Jon A. R. Liisberg

Abstract: Detecting behind-the-meter (BTM) equipment and major appliances at the residential level and tracking their changes in real time is important for aggregators and traditional electricity utilities. In our previous work, we developed a systematic solution called IRMAC to identify residential users' BTM equipment and applications from their imported energy data. As a part of IRMAC, a Similarity Profile (SP) was proposed for dimensionality reduction and extracting self-join similarity from the end users' daily electricity usage data. The proposed SP calculation, however, was computationally expensive and required a significant amount of memory at the user's end. To realise the benefits of edge computing, in this paper, we propose and assess three computationally-efficient updating solutions, namely additive, fixed-memory, and codebook-based updating methods. Extensive simulation studies are carried out using real PV users' data to evaluate the performance of the proposed methods in identifying PV users, tracking changes in real time, and examining memory usage. We found that the codebook-based solution reduces the required memory by more than 30% without compromising the performance of extracting users' features. When the end users' data storage and computation speed are a concern, the fixed-memory method outperforms the others. In terms of tracking the changes, different variations of the fixed-memory method show various inertia levels, making them suitable for different applications.

Submitted 6 January, 2024; originally announced January 2024.

arXiv:2312.04398

Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning

Authors: Yongqi Dong, Xingmin Lu, Ruohan Li, Wei Song, Bart van Arem, Haneen Farah

Abstract: The burgeoning navigation services using digital maps provide great convenience to drivers. Nevertheless, the presence of anomalies in lane rendering map images occasionally introduces potential hazards, as such anomalies can be misleading to human drivers and consequently contribute to unsafe driving conditions. In response to this concern and to accurately and effectively detect the anomalies, this paper transforms lane rendering image anomaly detection into a classification problem and proposes a four-phase pipeline consisting of data pre-processing, self-supervised pre-training with the masked image modeling (MiM) method, customized fine-tuning using cross-entropy based loss with label smoothing, and post-processing to tackle it, leveraging state-of-the-art deep learning techniques, especially those involving Transformer models. Various experiments verify the effectiveness of the proposed pipeline. Results indicate that the proposed pipeline exhibits superior performance in lane rendering image anomaly detection, and notably, the self-supervised pre-training with MiM can greatly enhance the detection accuracy while significantly reducing the total training time. For instance, employing the Swin Transformer with Uniform Masking as self-supervised pre-training (Swin-Trans-UM) yielded a higher accuracy of 94.77% and an improved Area Under The Curve (AUC) score of 0.9743, compared with the pure Swin Transformer without pre-training (Swin-Trans), which had an accuracy of 94.01% and an AUC of 0.9498. The fine-tuning epochs were dramatically reduced to 41 from the original 280. In conclusion, the proposed pipeline, with its incorporation of self-supervised pre-training using MiM and other advanced deep learning techniques, emerges as a robust solution for enhancing the accuracy and efficiency of lane rendering image anomaly detection in digital navigation systems.

Submitted 29 May, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: 22 pages, 6 figures, accepted by the 103rd Transportation Research Board (TRB) Annual Meeting, under review by Transportation Research Record: Journal of the Transportation Research Board
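
The fine-tuning phase described in the abstract above uses a cross-entropy based loss with label smoothing. The sketch below shows the standard form of that loss for a single sample, in plain Python. The smoothing factor `eps=0.1` and the function name are assumptions made for illustration; the abstract does not state the value or the exact variant the authors used.

```python
import math

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy between softmax(logits) and a label-smoothed target.

    logits -- list of raw class scores for one sample
    target -- index of the true class
    eps    -- smoothing factor (assumed value; 0 gives plain cross-entropy)
    """
    k = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    # Smoothed target: (1 - eps) on the true class plus eps spread uniformly over all k classes.
    q = [eps / k + (1.0 - eps) * (1.0 if i == target else 0.0) for i in range(k)]
    return -sum(qi * lp for qi, lp in zip(q, log_probs))

# Hypothetical two-class example (normal vs. anomalous image).
confident = smoothed_cross_entropy([5.0, 0.0], target=0, eps=0.1)
plain = smoothed_cross_entropy([5.0, 0.0], target=0, eps=0.0)
```

With smoothing enabled, a very confident correct prediction still incurs a small loss (`confident > plain`), which discourages over-confident outputs.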

arXiv:2310.12399

A New Time Series Similarity Measure and Its Smart Grid Applications

Authors: Rui Yuan, S. Ali Pourmousavi, Wen L. Soong, Andrew J. Black, Jon A. R. Liisberg, Julian Lemos-Vinasco

Abstract: Many smart grid applications involve data mining, clustering, classification, identification, and anomaly detection, among others. These applications primarily depend on the measurement of similarity, which is the distance between different time series or subsequences of a time series. The commonly used time series distance measures, namely Euclidean Distance (ED) and Dynamic Time Warping (DTW), do not quantify the flexible nature of electricity usage data in terms of temporal dynamics. As a result, there is a need for a new distance measure that can quantify both the amplitude and temporal changes of electricity time series for smart grid applications, e.g., demand response and load profiling. This paper introduces a novel distance measure to compare electricity usage patterns. The method consists of two phases that quantify the effort required to reshape one time series into another, considering both amplitude and temporal changes. The proposed method is evaluated against ED and DTW using real-world data in three smart grid applications. Overall, the proposed measure outperforms ED and DTW in accurately identifying the best load scheduling strategy, anomalous days with irregular electricity usage, and determining electricity users' behind-the-meter (BTM) equipment.

Submitted 18 October, 2023; originally announced October 2023.

Comments: 7 pages, 6 figures conference
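
To make the limitation of ED and DTW mentioned in the abstract above concrete, the sketch below compares the two baseline measures on a load peak that is shifted by one time step. It implements only textbook ED and DTW (with an absolute-difference local cost, an assumption made here); the paper's proposed measure is not reproduced because the abstract does not specify it. The series values are made up.

```python
import math

def euclidean(a, b):
    """Point-by-point Euclidean distance between equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Classic dynamic time warping distance with |x - y| as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = minimal accumulated cost aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# Hypothetical daily profiles: the same peak, shifted by one step.
base = [0, 0, 5, 0, 0, 0]
shifted = [0, 0, 0, 5, 0, 0]

ed_value = euclidean(base, shifted)   # penalizes the shift as two full amplitude errors
dtw_value = dtw(base, shifted)        # warps the peak into alignment at zero cost
```

ED treats the shifted peak as a large amplitude mismatch, while DTW reports no difference at all, so neither value expresses how far the load moved in time.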

arXiv:2303.12123

Oral-3Dv2: 3D Oral Reconstruction from Panoramic X-Ray Imaging with Implicit Neural Representation

Authors: Weinan Song, Haoxin Zheng, Dezhan Tu, Chengwen Liang, Lei He

Abstract: 3D reconstruction of medical imaging from 2D images has become an increasingly interesting topic with the development of deep learning models in recent years. Previous studies in 3D reconstruction from limited X-ray images mainly rely on learning from paired 2D and 3D images, where the reconstruction quality relies on the scale and variation of collected data. This has brought significant challenges in the collection of training data, as only a tiny fraction of patients take two types of radiation examinations in the same period. Although simulation from higher-dimension images could solve this problem, the variance between real and simulated data could bring great uncertainty at the same time. In oral reconstruction, the situation becomes more challenging as only a single panoramic X-ray image is available, where models need to infer the curved shape by prior individual knowledge. To overcome these limitations, we propose Oral-3Dv2 to solve this cross-dimension translation problem in dental healthcare by learning solely on projection information, i.e., the projection image and trajectory of the X-ray tube. Our model learns to represent the 3D oral structure in an implicit way by mapping 2D coordinates into density values of voxels in the 3D space. To improve efficiency and effectiveness, we utilize a multi-head model that predicts a bunch of voxel values in 3D space simultaneously from a 2D coordinate in the axial plane and the dynamic sampling strategy to refine details of the density distribution in the reconstruction result. Extensive experiments in simulated and real data show that our model significantly outperforms existing state-of-the-art models without learning from paired images or prior individual knowledge. To the best of our knowledge, this is the first work of a non-adversarial-learning-based model in 3D radiology reconstruction from a single panoramic X-ray image.

Submitted 3 September, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

arXiv:2302.11806

PLU-Net: Extraction of multi-scale feature fusion

Authors: Weihu Song

Abstract: Deep learning algorithms have achieved remarkable results in medical image segmentation in recent years. However, these networks are unable to handle image boundaries and details well despite their enormous numbers of parameters, resulting in poor segmentation results. To address the issue, we develop atrous spatial pyramid pooling (ASPP) and combine it with the Squeeze-and-Excitation block (SE block) to present the PS module, which employs a broader, multi-scale receptive field at the network's bottom to obtain more detailed semantic information. We also propose the Local Guided block (LG block) and combine it with the SE block to form the LS block, which can obtain more abundant local features in the feature map, so that more edge information can be retained in each down-sampling process, thereby improving the performance of boundary segmentation. We propose PLU-Net, which integrates our PS module and LS block into U-Net. We test PLU-Net on three benchmark datasets, and the results show that it achieves better performance on medical semantic segmentation tasks with fewer parameters and FLOPs.

Submitted 23 February, 2023; originally announced February 2023.

Comments: 11 pages, 9 figures

arXiv:2302.10412

Non-pooling Network for medical image segmentation

Authors: Weihu Song, Heng Yu

Abstract: Existing studies tend to focus on model modifications and integration for higher accuracy, which improve performance but also carry huge computational costs, resulting in longer detection times. In medical imaging, time is extremely critical. At present, most semantic segmentation models have an encoder-decoder structure or a double-branch structure. Their repeated use of pooling together with high-level semantic information extraction causes information loss, even though reverse pooling or a similar operation is used to compensate for the loss from pooling. In addition, we notice that the visual attention mechanism has superior performance on a variety of tasks. Given this, this paper proposes the non-pooling network (NPNet): the non-pooling design reduces the loss of information, and the attention enhancement module (AM) effectively increases the weight of useful information. The method greatly reduces the number of parameters and the computation cost through its shallow neural network structure. We evaluate the semantic segmentation performance of our NPNet on three benchmark datasets against multiple current state-of-the-art (SOTA) models, and the results show that our NPNet achieves SOTA performance, with an excellent balance between accuracy and speed.

Submitted 20 February, 2023; originally announced February 2023.

Comments: 8 pages, 5 figures

arXiv:2302.06381

Self-supervised phase unwrapping in fringe projection profilometry

Authors: Xiaomin Gao, Wanzhong Song, Chunqian Tan, Junzhe Lei

Abstract: Fast-speed and high-accuracy three-dimensional (3D) shape measurement has been the goal all along in fringe projection profilometry (FPP). The dual-frequency temporal phase unwrapping method (DF-TPU) is one of the prominent technologies to achieve this goal. However, the period number of the high-frequency pattern of existing DF-TPU approaches is usually limited by the inevitable phase errors, setting a limit to measurement accuracy. Deep-learning-based phase unwrapping methods for single-camera FPP usually require labeled data for training. In this letter, a novel self-supervised phase unwrapping method for single-camera FPP systems is proposed. The trained network can retrieve the absolute fringe order from one phase map of 64 periods and outperform DF-TPU approaches in terms of depth accuracy. Experimental results demonstrate the validity of the proposed method on real scenes with motion blur, isolated objects, low reflectivity, and phase discontinuity.

Submitted 30 May, 2023; v1 submitted 13 February, 2023; originally announced February 2023.
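
For context on the DF-TPU baseline named in the abstract above, the sketch below shows the textbook dual-frequency fringe-order computation for a single pixel. It assumes the low-frequency pattern has exactly one period across the field, so its phase is already absolute; the function name and the sample values are hypothetical. This is the classical baseline only, not the self-supervised network the letter proposes.

```python
import math

def unwrap_dual_frequency(phi_high, phi_low, periods_high):
    """Recover the absolute high-frequency phase at one pixel.

    phi_high     -- wrapped high-frequency phase in [0, 2*pi)
    phi_low      -- unit-frequency phase (assumed to be absolute already)
    periods_high -- number of fringe periods in the high-frequency pattern
    """
    # Scale the low-frequency phase up to predict the absolute high-frequency
    # phase, then round the gap to the nearest whole number of periods.
    k = round((periods_high * phi_low - phi_high) / (2 * math.pi))
    return phi_high + 2 * math.pi * k, k

# Hypothetical pixel whose true absolute phase is 10.3 periods of a 64-period pattern.
true_phase = 2 * math.pi * 10.3
phi_high = math.fmod(true_phase, 2 * math.pi)
phi_low = true_phase / 64
absolute_phase, fringe_order = unwrap_dual_frequency(phi_high, phi_low, 64)
```

Any error in `phi_low` is multiplied by `periods_high` before rounding, so the fringe order is only correct while that scaled error stays below half a period. This is why phase errors cap the usable period number, as the abstract notes.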

arXiv:2211.06170

MaskedSpeech: Context-aware Speech Synthesis with Masking Strategy

Authors: Ya-Jie Zhang, Wei Song, Yanghao Yue, Zhengchen Zhang, Youzheng Wu, Xiaodong He

Abstract: Humans often speak in a continuous manner which leads to coherent and consistent prosody properties across neighboring utterances. However, most state-of-the-art speech synthesis systems only consider the information within each sentence and ignore the contextual semantic and acoustic features. This makes it inadequate to generate high-quality paragraph-level speech which requires high expressiveness and naturalness. To synthesize natural and expressive speech for a paragraph, a context-aware speech synthesis system named MaskedSpeech is proposed in this paper, which considers both contextual semantic and acoustic features. Inspired by the masking strategy in the speech editing research, the acoustic features of the current sentence are masked out and concatenated with those of contextual speech, and further used as additional model input. The phoneme encoder takes the concatenated phoneme sequence from neighboring sentences as input and learns fine-grained semantic information from contextual text. Furthermore, cross-utterance coarse-grained semantic features are employed to improve the prosody generation. The model is trained to reconstruct the masked acoustic features with the augmentation of both the contextual semantic and acoustic features. Experimental results demonstrate that the proposed MaskedSpeech outperformed the baseline system significantly in terms of naturalness and expressiveness.

Submitted 18 May, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

Comments: Accepted in Interspeech 2023

arXiv:2211.00996

Singing Voice Synthesis with Vibrato Modeling and Latent Energy Representation

Authors: Yingjie Song, Wei Song, Wei Zhang, Zhengchen Zhang, Dan Zeng, Zhi Liu, Yang Yu

Abstract: This paper proposes an expressive singing voice synthesis system by introducing explicit vibrato modeling and latent energy representation. Vibrato is essential to the naturalness of synthesized sound, due to the inherent characteristics of human singing. Hence, a deep learning-based vibrato model is introduced in this paper to control the vibrato's likeliness, rate, depth and phase in singing, where the vibrato likeliness represents the existence probability of vibrato and helps improve the singing voice's naturalness. However, existing singing corpora contain no annotated labels for vibrato likeliness, so we adopt a novel labeling method to label the vibrato likeliness automatically. Meanwhile, the power spectrogram of audio contains rich information that can improve the expressiveness of singing. An autoencoder-based latent energy bottleneck feature is proposed for expressive singing voice synthesis. Experimental results on the open dataset NUS48E show that both the vibrato modeling and the latent energy representation could significantly improve the expressiveness of singing voice. The audio samples are shown in the demo website.

Submitted 2 November, 2022; originally announced November 2022.

arXiv:2211.00967

Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement

Authors: Wei Song, Yanghao Yue, Ya-jie Zhang, Zhengchen Zhang, Youzheng Wu, Xiaodong He

Abstract: Disentanglement of a speaker's timbre and style is very important for style transfer in multi-speaker multi-style text-to-speech (TTS) scenarios. With the disentanglement of timbres and styles, TTS systems could synthesize expressive speech for a given speaker with any style which has been seen in the training corpus. However, there are still some shortcomings with the current research on timbre and style disentanglement. Current methods either require single-speaker multi-style recordings, which are difficult and expensive to collect, or use a complex network and complicated training method, which makes it difficult to reproduce and control the style transfer behavior. To improve the disentanglement effectiveness of timbres and styles, and to remove the reliance on single-speaker multi-style corpus, a simple but effective timbre and style disentanglement method is proposed in this paper. The FastSpeech2 network is employed as the backbone network, with explicit duration, pitch, and energy trajectory to represent the style. Each speaker's data is considered as a separate and isolated style, then a speaker embedding and a style embedding are added to the FastSpeech2 network to learn disentangled representations. Utterance-level pitch and energy normalization are utilized to improve the decoupling effect. Experimental results demonstrate that the proposed model could synthesize speech with any style seen during training with high style similarity while maintaining very high speaker similarity.

Submitted 22 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.
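
The abstract above mentions utterance-level pitch and energy normalization as an aid to decoupling. One common way to do this is a per-utterance z-score, sketched below. Whether the authors use this exact form is an assumption here, and the function name and F0 values are made up for illustration.

```python
from statistics import mean, pstdev

def normalize_utterance(values):
    """Z-score a pitch (or energy) contour using only this utterance's statistics."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A flat contour carries no shape information after centering.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical F0 contours (Hz): the same rising shape in a low and a high register.
f0_low = [100.0, 110.0, 120.0]
f0_high = [200.0, 220.0, 240.0]

norm_low = normalize_utterance(f0_low)
norm_high = normalize_utterance(f0_high)
```

After normalization both contours become identical, so the speaker-dependent register is removed while the shape of the contour, which carries the style, is kept.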

arXiv:2208.03524

doi 10.2139/ssrn.4253498

Deep Learning-enabled Spatial Phase Unwrapping for 3D Measurement

Authors: Xiaolong Luo, Wanzhong Song, Songlin Bai, Yu Li, Zhihe Zhao

Abstract: In terms of 3D imaging speed and system cost, the single-camera system projecting single-frequency patterns is the ideal option among all proposed Fringe Projection Profilometry (FPP) systems. This system necessitates a robust spatial phase unwrapping (SPU) algorithm. However, robust SPU remains a challenge in complex scenes. Quality-guided SPU algorithms need more efficient ways to identify the unreliable points in phase maps before unwrapping. End-to-end deep learning SPU methods face generality and interpretability problems. This paper proposes a hybrid method combining deep learning and traditional path-following for robust SPU in FPP. This hybrid SPU scheme demonstrates better robustness than traditional quality-guided SPU methods, better interpretability than end-to-end deep learning schemes, and generality on unseen data. Experiments on a real dataset covering multiple illumination conditions and multiple FPP systems differing in image resolution, number of fringes, fringe direction, and optical wavelength verify the effectiveness of the proposed method.

Submitted 6 August, 2022; originally announced August 2022.

Comments: 26 pages

ACM Class: I.4.5

Journal ref: Optics & Laser Technology, 163 (2023) 109340

arXiv:2207.13434

doi 10.1117/12.2643881

End-To-End Audiovisual Feature Fusion for Active Speaker Detection

Authors: Fiseha B. Tesema, Zheyuan Lin, Shiqiang Zhu, Wei Song, Jason Gu, Hong Wu

Abstract: Active speaker detection plays a vital role in human-machine interaction. Recently, a few end-to-end audiovisual frameworks emerged. However, the inference time of these models has not been explored, and they are not applicable to real-time applications due to their complexity and large input size. In addition, they explored a similar feature extraction strategy that employs the ConvNet on audio and visual inputs. This work presents a novel two-stream end-to-end framework fusing features extracted from images via VGG-M with raw Mel Frequency Cepstrum Coefficients features extracted from the audio waveform. The network has two BiGRU layers attached to each stream to handle each stream's temporal dynamics before fusion. After fusion, one BiGRU layer is attached to model the joint temporal dynamics. The experimental results on the AVA-ActiveSpeaker dataset indicate that our new feature extraction strategy shows more robustness to noisy signals and better inference time than models that employed ConvNet on both modalities. The proposed model predicts within 44.41 ms, which is fast enough for real-time applications. Our best-performing model attained 88.929% accuracy, nearly the same detection result as state-of-the-art work.

Submitted 27 July, 2022; originally announced July 2022.

Comments: To appear on the proceeding of the Fourteenth International Conference on Digital Image Processing (ICDIP 2022), May 20-23, Wuhan, China, 8 pages, 3 figures

Journal ref: Proceedings Volume 12342, Fourteenth International Conference on Digital Image Processing (ICDIP 2022); 123422A (2022)

arXiv:2206.13390 [pdf, other]

A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!

Authors: Chenglizhao Chen, Mengke Song, Wenfeng Song, Li Guo, Muwei Jian

Abstract: Video saliency detection (VSD) aims at fast locating the most attractive objects/things/patterns in a given video clip. Existing VSD-related works have mainly relied on the visual system but paid less attention to the audio aspect, while, actually, our audio system is the most vital complementary part to our visual system. Also, audio-visual saliency detection (AVSD), one of the most representative research topics for mimicking human perceptual mechanisms, is currently in its infancy, and none of the existing survey papers have touched on it, especially from the perspective of saliency detection. Thus, the ultimate goal of this paper is to provide an extensive review to bridge the gap between audio-visual fusion and saliency detection. In addition, as another highlight of this review, we have provided a deep insight into key factors which could directly determine the performances of AVSD deep models, and we claim that the audio-visual consistency degree (AVC) -- a long-overlooked issue, can directly influence the effectiveness of using audio to benefit its visual counterpart when performing saliency detection. Moreover, in order to make the AVC issue more practical and valuable for future followers, we have newly equipped almost all existing publicly available AVSD datasets with additional frame-wise AVC labels. Based on these upgraded datasets, we have conducted extensive quantitative evaluations to ground our claim on the importance of AVC in the AVSD task. In a word, both our ideas and new sets serve as a convenient platform with preliminaries and guidelines, all of which are very potential to facilitate future works in promoting state-of-the-art (SOTA) performance further.

Submitted 20 June, 2022; originally announced June 2022.

arXiv:2205.00511

An Early Fault Detection Method of Rotating Machines Based on Multiple Feature Fusion with Stacking Architecture

Authors: Wenbin Song, Di Wu, Weiming Shen, Benoit Boulet

Abstract: Early fault detection (EFD) of rotating machines is important to decrease the maintenance cost and improve the mechanical system stability. One of the key points of EFD is developing a generic model to extract robust and discriminative features from different equipment for early fault detection. Most existing EFD methods focus on learning fault representation by one type of feature. However, a combination of multiple features can capture a more comprehensive representation of system state. In this paper, we propose an EFD method based on multiple feature fusion with stacking architecture (M2FSA). The proposed method can extract generic and discriminative features to detect early faults by combining time domain (TD), frequency domain (FD), and time-frequency domain (TFD) features. In order to unify the dimensions of the different domain features, Stacked Denoising Autoencoder (SDAE) is utilized to learn deep features in three domains. The architecture of the proposed M2FSA consists of two layers. The first layer contains three base models, whose corresponding inputs are different deep features. The outputs of the first layer are concatenated to generate the input to the second layer, which consists of a meta model. The proposed method is tested on three bearing datasets. The results demonstrate that the proposed method is better than existing methods both in sensibility and reliability.

Submitted 28 February, 2023; v1 submitted 1 May, 2022; originally announced May 2022.

Comments: The results require to be updated

arXiv:2203.06363 [pdf, other]

MDT-Net: Multi-domain Transfer by Perceptual Supervision for Unpaired Images in OCT Scan

Authors: Weinan Song, Gaurav Fotedar, Nima Tajbakhsh, Ziheng Zhou, Lei He, Xiaowei Ding

Abstract: Deep learning models tend to underperform in the presence of domain shifts. Domain transfer has recently emerged as a promising approach wherein images exhibiting a domain shift are transformed into other domains for augmentation or adaptation. However, with the absence of paired and annotated images, models merely learned by adversarial loss and cycle consistency loss could result in poor consistency of anatomy structures during the translation. Additionally, the complexity of learning multi-domain transfer could significantly increase with the number of target domains and source images. In this paper, we propose a multi-domain transfer network, named MDT-Net, to address the limitations above through perceptual supervision. Specifically, our model consists of a single encoder-decoder network and multiple domain-specific transfer modules to disentangle feature representations of the anatomy content and domain variance. Owing to this architecture, the model could significantly reduce the complexity when the translation is conducted among multiple domains. To demonstrate the performance of our method, we evaluate our model qualitatively and quantitatively on RETOUCH, an OCT dataset comprising scans from three different scanner devices (domains). Furthermore, we take the transfer results as additional training data for fluid segmentation to prove the advantage of our model indirectly, i.e., in the task of data adaptation and augmentation. Experimental results show that our method could bring universal improvement in these segmentation tasks, which demonstrates the effectiveness and efficiency of MDT-Net in multi-domain transfer.

Submitted 25 October, 2022; v1 submitted 12 March, 2022; originally announced March 2022.

arXiv:2202.03648 [pdf, ps, other]

Energy Efficiency and Delay Tradeoff in an MEC-Enabled Mobile IoT Network

Authors: Han Hu, Weiwei Song, Qun Wang, Rose Qingyang Hu, Hongbo Zhu

Abstract: Mobile Edge Computing (MEC) has recently emerged as a promising technology in the 5G era. It is deemed an effective paradigm to support computation-intensive and delay critical applications even at energy-constrained and computation-limited Internet of Things (IoT) devices. To effectively exploit the performance benefits enabled by MEC, it is imperative to jointly allocate radio and computational resources by considering non-stationary computation demands, user mobility, and wireless fading channels. This paper aims to study the tradeoff between energy efficiency (EE) and service delay for multi-user multi-server MEC-enabled IoT systems when provisioning offloading services in a user mobility scenario. Particularly, we formulate a stochastic optimization problem with the objective of minimizing the long-term average network EE with the constraints of the task queue stability, peak transmit power, maximum CPU-cycle frequency, and maximum user number. To tackle the problem, we propose an online offloading and resource allocation algorithm by transforming the original problem into several individual subproblems in each time slot based on Lyapunov optimization theory, which are then solved by convex decomposition and submodular methods. Theoretical analysis proves that the proposed algorithm can achieve a $[O(1/V), O(V)]$ tradeoff between EE and service delay. Simulation results verify the theoretical analysis and demonstrate our proposed algorithm can offer much better EE-delay performance in task offloading challenges, compared to several baselines.

Submitted 8 February, 2022; originally announced February 2022.

arXiv:2201.02656 [pdf, other]

GPU-Net: Lightweight U-Net with more diverse features

Authors: Heng Yu, Di Fan, Weihu Song

Abstract: Image segmentation is an important task in the medical image field and many convolutional neural networks (CNNs) based methods have been proposed, among which U-Net and its variants show promising performance. In this paper, we propose GP-module and GPU-Net based on U-Net, which can learn more diverse features by introducing Ghost module and atrous spatial pyramid pooling (ASPP). Our method achieves better performance with more than 4 times fewer parameters and 2 times fewer FLOPs, which provides a new potential direction for future research. Our plug-and-play module can also be applied to existing segmentation methods to further improve their performance.

Submitted 7 January, 2022; originally announced January 2022.

arXiv:2109.13732 [pdf, ps, other]

doi 10.1016/j.engappai.2022.105588

IRMAC: Interpretable Refined Motifs in Binary Classification for Smart Grid Applications

Authors: Rui Yuan, S. Ali Pourmousavi, Wen L. Soong, Giang Nguyen, Jon A. R. Liisberg

Abstract: Modern power systems are experiencing the challenge of high uncertainty with the increasing penetration of renewable energy resources and the electrification of heating systems. In this paradigm shift, understanding electricity users' demand is of utmost value to retailers, aggregators, and policymakers. However, behind-the-meter (BTM) equipment and appliances at the household level are unknown to the other stakeholders mainly due to privacy concerns and tight regulations. In this paper, we seek to identify residential consumers based on their BTM equipment, mainly rooftop photovoltaic (PV) systems and electric heating, using imported/purchased energy data from utility meters. To solve this problem with an interpretable, fast, secure, and maintainable solution, we propose an integrated method called Interpretable Refined Motifs And binary Classification (IRMAC). The proposed method comprises a novel shape-based pattern extraction technique, called Refined Motif (RM) discovery, and a single-neuron classifier. The first part extracts a sub-pattern from the long time series considering the frequency of occurrences, average dissimilarity, and time dynamics while emphasising specific times with annotated distances. The second part identifies users' types with linear complexity while preserving the transparency of the algorithms. With the real data from Australia and Denmark, the proposed method is tested and verified in identifying PV owners and electrical heating system users.

Submitted 14 November, 2022; v1 submitted 22 September, 2021; originally announced September 2021.

Comments: 22 pages, 13 figures

Journal ref: Engineering Applications of Artificial Intelligence (2022) 105588

arXiv:2106.15842 [pdf, other]

doi 10.1109/TIM.2022.3160561

Dual Aspect Self-Attention based on Transformer for Remaining Useful Life Prediction

Authors: Zhizheng Zhang, Wen Song, Qiqiang Li

Abstract: Remaining useful life prediction (RUL) is one of the key technologies of condition-based maintenance, which is important to maintain the reliability and safety of industrial equipments. Massive industrial measurement data has effectively improved the performance of the data-driven based RUL prediction method. While deep learning has achieved great success in RUL prediction, existing methods have difficulties in processing long sequences and extracting information from the sensor and time step aspects. In this paper, we propose Dual Aspect Self-attention based on Transformer (DAST), a novel deep RUL prediction method, which is an encoder-decoder structure purely based on self-attention without any RNN/CNN module. DAST consists of two encoders, which work in parallel to simultaneously extract features of different sensors and time steps. Solely based on self-attention, the DAST encoders are more effective in processing long data sequences, and are capable of adaptively learning to focus on more important parts of input. Moreover, the parallel feature extraction design avoids mutual influence of information from two aspects. Experiments on two widely used turbofan engines datasets show that our method significantly outperforms the state-of-the-art RUL prediction methods.

Submitted 20 April, 2022; v1 submitted 30 June, 2021; originally announced June 2021.

arXiv:2103.09180 [pdf, ps, other]

Mobility-Aware Offloading and Resource Allocation in MEC-Enabled IoT Networks

Authors: Han Hu, Weiwei Song, Qun Wang, Fuhui Zhou, Rose Qingyang Hu

Abstract: Mobile edge computing (MEC)-enabled Internet of Things (IoT) networks have been deemed a promising paradigm to support massive energy-constrained and computation-limited IoT devices. IoT with mobility has found tremendous new services in the 5G era and the forthcoming 6G eras such as autonomous driving and vehicular communications. However, mobility of IoT devices has not been studied in the sufficient level in the existing works. In this paper, the offloading decision and resource allocation problem is studied with mobility consideration. The long-term average sum service cost of all the mobile IoT devices (MIDs) is minimized by jointly optimizing the CPU-cycle frequencies, the transmit power, and the user association vector of MIDs. An online mobility-aware offloading and resource allocation (OMORA) algorithm is proposed based on Lyapunov optimization and Semi-Definite Programming (SDP). Simulation results demonstrate that our proposed scheme can balance the system service cost and the delay performance, and outperforms other offloading benchmark methods in terms of the system service cost.

Submitted 16 March, 2021; originally announced March 2021.

arXiv:2011.05161 [pdf, other]

Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis

Authors: Guanghui Xu, Wei Song, Zhengchen Zhang, Chao Zhang, Xiaodong He, Bowen Zhou

Abstract: Despite prosody is related to the linguistic information up to the discourse structure, most text-to-speech (TTS) systems only take into account that within each sentence, which makes it challenging when converting a paragraph of texts into natural and expressive speech. In this paper, we propose to use the text embeddings of the neighboring sentences to improve the prosody generation for each utterance of a paragraph in an end-to-end fashion without using any explicit prosody features. More specifically, cross-utterance (CU) context vectors, which are produced by an additional CU encoder based on the sentence embeddings extracted by a pre-trained BERT model, are used to augment the input of the Tacotron2 decoder. Two types of BERT embeddings are investigated, which leads to the use of different CU encoder structures. Experimental results on a Mandarin audiobook dataset and the LJ-Speech English audiobook dataset demonstrate the use of CU information can improve the naturalness and expressiveness of the synthesized speech. Subjective listening testing shows most of the participants prefer the voice generated using the CU encoder over that generated using standard Tacotron2. It is also found that the prosody can be controlled indirectly by changing the neighbouring sentences.

Submitted 6 November, 2020; originally announced November 2020.

Comments: 5 pages, 4 figures

arXiv:2008.04147 [pdf, ps, other]

Knowledge Distillation-aided End-to-End Learning for Linear Precoding in Multiuser MIMO Downlink Systems with Finite-Rate Feedback

Authors: Kyeongbo Kong, Woo-Jin Song, Moonsik Min

Abstract: We propose a deep learning-based channel estimation, quantization, feedback, and precoding method for downlink multiuser multiple-input and multiple-output systems. In the proposed system, channel estimation and quantization for limited feedback are handled by a receiver deep neural network (DNN). Precoder selection is handled by a transmitter DNN. To emulate the traditional channel quantization, a binarization layer is adopted at each receiver DNN, and the binarization layer is also used to enable end-to-end learning. However, this can lead to inaccurate gradients, which can trap the receiver DNNs at a poor local minimum during training. To address this, we consider knowledge distillation, in which the existing DNNs are jointly trained with an auxiliary transmitter DNN. The use of an auxiliary DNN as a teacher network allows the receiver DNNs to additionally exploit lossless gradients, which is useful in avoiding a poor local minimum. For the same number of feedback bits, our DNN-based precoding scheme can achieve a higher downlink rate compared to conventional linear precoding with codebook-based limited feedback.

Submitted 22 March, 2021; v1 submitted 10 August, 2020; originally announced August 2020.

Comments: 6 pages, 4 figures, submitted to IEEE Transactions on Vehicular Technology

arXiv:2003.10661 [pdf]

doi 10.1121/10.0001125

Training a U-Net based on a random mode-coupling matrix model to recover acoustic interference striations

Authors: Xiaolei Li, Wenhua Song, Dazhi Gao, Wei Gao, Haozhong Wan

Abstract: A U-Net is trained to recover acoustic interference striations (AISs) from distorted ones. A random mode-coupling matrix model is introduced to generate a large number of training data quickly, which are used to train the U-Net. The performance of AIS recovery of the U-Net is tested in range-dependent waveguides with nonlinear internal waves (NLIWs). Although the random mode-coupling matrix model is not an accurate physical model, the test results show that the U-Net successfully recovers AISs under different signal-to-noise ratios (SNRs) and different amplitudes and widths of NLIWs for different shapes.

Submitted 24 March, 2020; originally announced March 2020.

arXiv:2003.08413 [pdf, other]

Oral-3D: Reconstructing the 3D Bone Structure of Oral Cavity from 2D Panoramic X-ray

Authors: Weinan Song, Yuan Liang, Jiawei Yang, Kun Wang, Lei He

Abstract: Panoramic X-ray (PX) provides a 2D picture of the patient's mouth in a panoramic view to help dentists observe the invisible disease inside the gum. However, it provides limited 2D information compared with cone-beam computed tomography (CBCT), another dental imaging method that generates a 3D picture of the oral cavity but with more radiation dose and a higher price. Consequently, it is of great interest to reconstruct the 3D structure from a 2D X-ray image, which can greatly explore the application of X-ray imaging in dental surgeries. In this paper, we propose a framework, named Oral-3D, to reconstruct the 3D oral cavity from a single PX image and prior information of the dental arch. Specifically, we first train a generative model to learn the cross-dimension transformation from 2D to 3D. Then we restore the shape of the oral cavity with a deformation module with the dental arch curve, which can be obtained simply by taking a photo of the patient's mouth. To be noted, Oral-3D can restore both the density of bony tissues and the curved mandible surface. Experimental results show that Oral-3D can efficiently and effectively reconstruct the 3D oral structure and show critical information in clinical applications, e.g., tooth pulling and dental implants. To the best of our knowledge, we are the first to explore this domain transformation problem between these two imaging methods.

Submitted 8 January, 2021; v1 submitted 18 March, 2020; originally announced March 2020.

arXiv:2002.08406 [pdf, other]

T-Net: Learning Feature Representation with Task-specific Supervision for Biomedical Image Analysis

Authors: Weinan Song, Yuan Liang, Jiawei Yang, Kun Wang, Lei He

Abstract: The encoder-decoder network is widely used to learn deep feature representations from pixel-wise annotations in biomedical image analysis. Under this structure, the performance profoundly relies on the effectiveness of feature extraction achieved by the encoding network. However, few models have considered adapting the attention of the feature extractor even in different kinds of tasks. In this paper, we propose a novel training strategy by adapting the attention of the feature extractor according to different tasks for effective representation learning. Specifically, the framework, named T-Net, consists of an encoding network supervised by task-specific attention maps and a posterior network that takes in the learned features to predict the corresponding results. The attention map is obtained by the transformation from pixel-wise annotations according to the specific task, which is used as the supervision to regularize the feature extractor to focus on different locations of the recognition object. To show the effectiveness of our method, we evaluate T-Net on two different tasks, i.e., segmentation and localization. Extensive results on three public datasets (BraTS-17, MoNuSeg and IDRiD) have indicated the effectiveness and efficiency of our proposed supervision method, especially over the conventional encoding-decoding network.

Submitted 9 January, 2021; v1 submitted 19 February, 2020; originally announced February 2020.

arXiv:1910.12861 [pdf, other]

doi 10.1109/TGRS.2019.2907932

Deep Learning for Hyperspectral Image Classification: An Overview

Authors: Shutao Li, Weiwei Song, Leyuan Fang, Yushi Chen, Pedram Ghamisi, Jón Atli Benediktsson

Abstract: Hyperspectral image (HSI) classification has become a hot topic in the field of remote sensing. In general, the complex characteristics of hyperspectral data make the accurate classification of such data challenging for traditional machine learning methods. In addition, hyperspectral imaging often deals with an inherently nonlinear relation between the captured spectral information and the corresponding materials. In recent years, deep learning has been recognized as a powerful feature-extraction tool to effectively address nonlinear problems and widely used in a number of image processing tasks. Motivated by those successful applications, deep learning has also been introduced to classify HSIs and demonstrated good performance. This survey paper presents a systematic review of deep learning-based HSI classification literatures and compares several strategies for this topic. Specifically, we first summarize the main challenges of HSI classification which cannot be effectively overcome by traditional machine learning methods, and also introduce the advantages of deep learning to handle these problems. Then, we build a framework which divides the corresponding works into spectral-feature networks, spatial-feature networks, and spectral-spatial-feature networks to systematically review the recent achievements in deep learning-based HSI classification. In addition, considering the fact that available training samples in the remote sensing field are usually very limited and training deep networks require a large number of samples, we include some strategies to improve classification performance, which can provide some guidelines for future studies on this topic. Finally, several representative deep learning-based classification methods are conducted on real HSIs in our experiments.

Submitted 26 October, 2019; originally announced October 2019.

Journal ref: IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 9, pp. 6690-6709, Sep. 2019

arXiv:1909.04614 [pdf, other]

Deep Hashing Learning for Visual and Semantic Retrieval of Remote Sensing Images

Authors: Weiwei Song, Shutao Li, Jon Atli Benediktsson

Abstract: Driven by the urgent demand for managing remote sensing big data, large-scale remote sensing image retrieval (RSIR) attracts increasing attention in the remote sensing field. In general, existing retrieval methods can be regarded as visual-based retrieval approaches which search and return a set of similar images from a database to a given query image. Although retrieval methods have achieved great success, there is still a question that needs to be responded to: Can we obtain the accurate semantic labels of the returned similar images to further help analyzing and processing imagery? Inspired by the above question, in this paper, we redefine the image retrieval problem as visual and semantic retrieval of images. Specifically, we propose a novel deep hashing convolutional neural network (DHCNN) to simultaneously retrieve the similar images and classify their semantic labels in a unified framework. In more detail, a convolutional neural network (CNN) is used to extract high-dimensional deep features. Then, a hash layer is perfectly inserted into the network to transfer the deep features into compact hash codes. In addition, a fully connected layer with a softmax function is performed on hash layer to generate class distribution. Finally, a loss function is elaborately designed to simultaneously consider the label loss of each image and similarity loss of pairs of images. Experimental results on two remote sensing datasets demonstrate that the proposed method achieves the state-of-art retrieval and classification performance.

Submitted 10 September, 2019; originally announced September 2019.

arXiv:1907.03246 [pdf]

An Experimental-based Review of Image Enhancement and Image Restoration Methods for Underwater Imaging

Authors: Yan Wang, Wei Song, Giancarlo Fortino, Lizhe Qi, Wenqiang Zhang, Antonio Liotta

Abstract: Underwater images play a key role in ocean exploration, but often suffer from severe quality degradation due to light absorption and scattering in water medium. Although major breakthroughs have been made recently in the general area of image enhancement and restoration, the applicability of new methods for improving the quality of underwater images has not specifically been captured. In this paper, we review the image enhancement and restoration methods that tackle typical underwater image impairments, including some extreme degradations and distortions. Firstly, we introduce the key causes of quality reduction in underwater images, in terms of the underwater image formation model (IFM). Then, we review underwater restoration methods, considering both the IFM-free and the IFM-based approaches. Next, we present an experimental-based comparative evaluation of state-of-the-art IFM-free and IFM-based methods, considering also the prior-based parameter estimation algorithms of the IFM-based methods, using both subjective and objective analysis (the used code is freely available at https://github.com/wangyanckxx/Single-Underwater-Image-Enhancement-and-Color-Restoration). Starting from this study, we pinpoint the key shortcomings of existing methods, drawing recommendations for future research in this area. Our review of underwater image enhancement and restoration provides researchers with the necessary background to appreciate challenges and opportunities in this important field.

Submitted 7 July, 2019; originally announced July 2019.

Comments: 19

arXiv:1906.08673 [pdf]

Enhancement of Underwater Images with Statistical Model of Background Light and Optimization of Transmission Map

Authors: Wei Song, Yan Wang, Dongmei Huang, Antonio Liotta, Cristian Perra

Abstract: Underwater images often have severe quality degradation and distortion due to light absorption and scattering in the water medium. A hazed image formation model is widely used to restore the image quality. It depends on two optical parameters: the background light and the transmission map. Underwater images can also be enhanced by color and contrast correction from the perspective of image processing. In this paper, we propose an effective underwater image enhancement method for underwater images in composition of underwater image restoration and color correction. Firstly, a manually annotated background lights (MABLs) database is developed. With reference to the relationship between MABLs and the histogram distributions of various underwater images, robust statistical models of BLs estimation are provided. Next, the TM of R channel is roughly estimated based on the new underwater dark channel prior via the statistic of clear and high resolution underwater images, then a scene depth map based on the underwater light attenuation prior and an adjusted reversed saturation map are applied to compensate and modify the coarse TM of R channel. Next, TMs of G-B channels are estimated based on the difference of attenuation ratios between R channel and G-B channels. Finally, to improve the color and contrast of the restored image with a natural appearance, a variation of white balance is introduced as post-processing.
In order to guide the priority of underwater image enhancement, sufficient evaluations are conducted to discuss the impacts of the key parameters including BL and TM, and the importance of the color correction. Comparisons with other state-of-the-art methods demonstrate that our proposed underwater image enhancement method can achieve higher accuracy of estimated BLs, less computation time, more superior performance, and more valuable information retention.

Submitted 19 June, 2019; originally announced June 2019.

Comments: 17 pages

arXiv:1904.06063 [pdf, other]

Building a mixed-lingual neural TTS system with only monolingual data

Authors: Liumeng Xue, Wei Song, Guanghui Xu, Lei Xie, Zhizheng Wu

Abstract: When deploying a Chinese neural text-to-speech (TTS) synthesis system, one of the challenges is to synthesize Chinese utterances with English phrases or words embedded. This paper looks into the problem in the encoder-decoder framework when only monolingual data from a target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance and naturalness. We start the investigation with an Average Voice Model which is built from multi-speaker monolingual data, i.e. Mandarin and English data. On the basis of that, we look into speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility and study the choice of data for model training. We report the findings and discuss the challenges to build a mixed-lingual TTS system with only monolingual data.

Submitted 22 August, 2019; v1 submitted 12 April, 2019; originally announced April 2019.

Comments: To appear in INTERSPEECH 2019

arXiv:1810.12271 [pdf, other]

Toward Creating Subsurface Camera

Authors: WenZhan Song, Fangyu Li, Maria Valero, Liang Zhao

Abstract: In this article, the framework and architecture of Subsurface Camera (SAMERA) is envisioned and described for the first time. A SAMERA is a geophysical sensor network that senses and processes geophysical sensor signals, and computes a 3D subsurface image in-situ in real-time. The basic mechanism is: geophysical waves propagating/reflected/refracted through subsurface enter a network of geophysical sensors, where a 2D or 3D image is computed and recorded; a control software may be connected to this network to allow view of the 2D/3D image and adjustment of settings such as resolution, filter, regularization and other algorithm parameters. System prototypes based on seismic imaging have been designed. SAMERA technology is envisioned as a game changer to transform many subsurface survey and monitoring applications, including oil/gas exploration and production, subsurface infrastructures and homeland security, wastewater and CO2 sequestration, earthquake and volcano hazard monitoring. The system prototypes for seismic imaging have been built. Creating SAMERA requires an interdisciplinary collaboration and transformation of sensor networks, signal processing, distributed computing, and geophysical imaging.

Submitted 29 October, 2018; originally announced October 2018.

Comments: 15 pages, 7 figures

arXiv:1808.09087 [pdf, other]

TRINITY: Coordinated Performance, Energy and Temperature Management in 3D Processor-Memory Stacks

Authors: Karthik Rao, William Song, Yorai Wardi, Sudhakar Yalamanchili

Abstract: The consistent demand for better performance has led to innovations at hardware and microarchitectural levels. 3D stacking of memory and logic dies delivers an order of magnitude improvement in available memory bandwidth. The price paid, however, is tight thermal constraints. In this paper, we study the complex multiphysics interactions between performance, energy and temperature. Using a cache coherent multicore processor cycle level simulator coupled with power and thermal estimation tools, we investigate the interactions between (a) thermal behaviors, (b) compute and memory microarchitecture, and (c) application workloads. The key insights from this exploration reveal the need to manage performance, energy and temperature in a coordinated fashion. Furthermore, we identify the concept of "effective heat capacity", i.e. the heat generated beyond which no further gains in performance are observed with increases in voltage-frequency of the compute logic. Subsequently, a real-time, numerical optimization based, application agnostic controller (TRINITY) is developed which intelligently manages the three parameters of interest. We observe up to $30\%$ improvement in Energy Delay$^2$ Product and up to $8$ Kelvin lower core temperatures as compared to fixed frequencies. Compared to the \texttt{ondemand} Linux CPU DVFS governor, for similar energy efficiency, TRINITY keeps the cores cooler by $6$ Kelvin which increases the lifetime reliability by up to 59\%.

Submitted 9 September, 2018; v1 submitted 27 August, 2018; originally announced August 2018.

Showing 1–36 of 36 results for author: Song, W