-
FakeSound: Deepfake General Audio Detection
Authors:
Zeyu Xie,
Baihan Li,
Xuenan Xu,
Zheng Liang,
Kai Yu,
Mengyue Wu
Abstract:
With the advancement of audio generation, generative models can produce highly realistic audio. However, the proliferation of deepfake general audio can have negative consequences. Therefore, we propose a new task, deepfake general audio detection, which aims to identify whether audio content is manipulated and to locate deepfake regions. Leveraging an automated manipulation pipeline, a dataset named FakeSound for deepfake general audio detection is proposed, and samples can be viewed on the website https://FakeSoundData.github.io. The average binary accuracy of humans on all test sets is consistently below 0.6, which indicates the difficulty humans face in discerning deepfake audio and affirms the efficacy of the FakeSound dataset. A deepfake detection model utilizing a general audio pre-trained model is proposed as a benchmark system. Experimental results demonstrate that the performance of the proposed model surpasses that of both state-of-the-art deepfake speech detection systems and human testers.
Submitted 12 June, 2024;
originally announced June 2024.
-
Intelligent Text-Conditioned Music Generation
Authors:
Zhouyao Xie,
Nikhil Yadala,
Xinyi Chen,
Jing Xi Liu
Abstract:
CLIP (Contrastive Language-Image Pre-Training) is a multimodal neural network trained on (text, image) pairs to predict the most relevant text caption given an image. It has been used extensively in image generation by connecting its output with a generative model such as VQGAN, with the most notable example being OpenAI's DALL-E 2. In this project, we apply a similar approach to bridge the gap between natural language and music. Our model is split into two steps: first, we train a CLIP-like model on pairs of text and music with a contrastive loss to align a piece of music with its most probable text caption. Then, we combine the alignment model with a music decoder to generate music. To the best of our knowledge, this is the first attempt at text-conditioned deep music generation. Our experiments show that it is possible to train the text-music alignment model using contrastive loss and train a decoder to generate music from text prompts.
Submitted 2 June, 2024;
originally announced June 2024.
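Note: the contrastive alignment step described in this abstract can be illustrated with a minimal sketch. It assumes a symmetric, CLIP-style cross-entropy over cosine similarities; the function names, the temperature value, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def cosine(u, v):
    # Cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(text_emb, music_emb, temperature=0.07):
    # text_emb[i] and music_emb[i] are assumed to be a matching pair.
    n = len(text_emb)
    logits = [[cosine(t, m) / temperature for m in music_emb] for t in text_emb]
    # Text-to-music direction: each text should pick its own music clip.
    t2m = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Music-to-text direction: same computation on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    m2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (t2m + m2t)

# Toy embeddings: correctly paired batches should score a lower loss.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the loss is small when each text embedding is closest to its own music embedding and large when the pairs are swapped, which is the behaviour a text-music alignment model would be trained toward.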
-
Analog Beamforming Enabled Multicasting: Finite-Alphabet Inputs and Statistical CSI
Authors:
Yanjun Wu,
Zhong Xie,
Zhuochen Xie,
Chongjun Ouyang,
Xuwen Liang
Abstract:
The average multicast rate (AMR) is analyzed in a multicast channel utilizing analog beamforming with finite-alphabet inputs, considering statistical channel state information (CSI). New expressions for the AMR are derived for non-cooperative and cooperative multicasting scenarios. Asymptotic analyses are conducted in the high signal-to-noise ratio regime to derive the array gain and diversity order. It is proved that the analog beamformer influences the AMR through its array gain, leading to the proposal of efficient beamforming algorithms aimed at maximizing the array gain to enhance the AMR.
Submitted 22 May, 2024;
originally announced May 2024.
-
Accel-NASBench: Sustainable Benchmarking for Accelerator-Aware NAS
Authors:
Afzal Ahmad,
Linfeng Du,
Zhiyao Xie,
Wei Zhang
Abstract:
One of the primary challenges impeding the progress of Neural Architecture Search (NAS) is its extensive reliance on exorbitant computational resources. NAS benchmarks aim to simulate runs of NAS experiments at zero cost, remediating the need for extensive compute. However, existing NAS benchmarks use synthetic datasets and model proxies that make simplified assumptions about the characteristics of these datasets and models, leading to unrealistic evaluations. We present a technique that allows searching for training proxies that reduce the cost of benchmark construction by significant margins, making it possible to construct realistic NAS benchmarks for large-scale datasets. Using this technique, we construct an open-source bi-objective NAS benchmark for the ImageNet2012 dataset combined with the on-device performance of accelerators, including GPUs, TPUs, and FPGAs. Through extensive experimentation with various NAS optimizers and hardware platforms, we show that the benchmark is accurate and allows searching for state-of-the-art hardware-aware models at zero cost.
Submitted 18 June, 2024; v1 submitted 9 April, 2024;
originally announced April 2024.
-
Grad-CAMO: Learning Interpretable Single-Cell Morphological Profiles from 3D Cell Painting Images
Authors:
Vivek Gopalakrishnan,
Jingzhe Ma,
Zhiyong Xie
Abstract:
Despite their black-box nature, deep learning models are extensively used in image-based drug discovery to extract feature vectors from single cells in microscopy images. To better understand how these networks perform representation learning, we employ visual explainability techniques (e.g., Grad-CAM). Our analyses reveal several mechanisms by which supervised models cheat, exploiting biologically irrelevant pixels, such as noise in the background, when extracting morphological features from images. This raises doubts regarding the fidelity of learned single-cell representations and their relevance when investigating downstream biological questions. To address this misalignment between researcher expectations and machine behavior, we introduce Grad-CAMO, a novel single-cell interpretability score for supervised feature extractors. Grad-CAMO measures the proportion of a model's attention that is concentrated on the cell of interest versus the background. This metric can be assessed per-cell or averaged across a validation set, offering a tool to audit individual feature vectors or guide the improved design of deep learning architectures. Importantly, Grad-CAMO seamlessly integrates into existing workflows, requiring no dataset or model modifications, and is compatible with both 2D and 3D Cell Painting data. Additional results are available at https://github.com/eigenvivek/Grad-CAMO.
Submitted 26 March, 2024;
originally announced March 2024.
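Note: the idea of scoring how much of a model's attention falls on the cell can be sketched as a simple ratio. This is a hedged 2D illustration only; the function name, the clipping of negative saliency values, and the toy inputs are assumptions, and the exact definition used by Grad-CAMO may differ.

```python
def attention_on_cell(saliency, cell_mask):
    # saliency: 2D list of floats (for example, a Grad-CAM heatmap).
    # cell_mask: 2D list of 0/1 flags marking pixels of the cell of interest.
    total = 0.0
    inside = 0.0
    for s_row, m_row in zip(saliency, cell_mask):
        for s, m in zip(s_row, m_row):
            s = max(s, 0.0)  # assumption: only positive attention counts
            total += s
            if m:
                inside += s
    if total == 0.0:
        return 0.0  # no attention anywhere, so nothing lies on the cell
    return inside / total

# Toy example: one unit of attention on the cell, three on the background.
score = attention_on_cell([[0.0, 1.0], [3.0, 0.0]], [[0, 1], [0, 0]])
```

A score near 1 would mean the attention map sits almost entirely on the cell, while a score near 0 would suggest the model is relying on background pixels. Averaging the per-cell scores over a validation set gives the aggregate view the abstract mentions.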
-
Direct and Indirect Hydrogen Storage: Dynamics and Interactions in the Transition to a Renewable Energy Based System for Europe
Authors:
Zhiyuan Xie,
Gorm Bruun Andresen
Abstract:
To move towards a low-carbon society by 2050, understanding the intricate dynamics of energy systems is critical. Our study examines these interactions through the lens of hydrogen storage, dividing it into 'direct' and 'indirect' hydrogen storage. Direct hydrogen storage involves electrolysis-produced hydrogen being stored before use, while indirect storage first transforms hydrogen into gas via the Sabatier process for later energy distribution. First, we use the PyPSA-Eur-Sec-30-path model to capture the interactions within the energy system. The model is an hour-level, one-node-per-country system that encompasses a range of energy transformation technologies, outlining a pathway for Europe to reduce carbon emissions by 95 percent by 2050 compared to 1990, with updates every 5 years. Subsequently, we employ both quantitative and qualitative approaches to thoroughly analyze these complex relationships. Our research indicates that during the European green transition, cross-country flow of electricity will play an important role in Europe's rapid decarbonization stage before the large-scale introduction of energy storage. Under the paper's cost assumptions, fuel cells are not considered a viable option. This research further identifies the significant impact of natural resource variability on the local energy mix, highlighting indirect hydrogen storage as a common solution due to its better economic performance and active fluctuation pattern. Specifically, indirect hydrogen storage will contribute at least 60 percent of hydrogen storage benefits, reaching 100 percent in Italy. Moreover, its fluctuation pattern will change with the local energy structure, which distinguishes it from the unchanged patterns of direct hydrogen storage and battery storage.
Submitted 22 March, 2024;
originally announced March 2024.
-
A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds
Authors:
Xuenan Xu,
Xiaohang Xu,
Zeyu Xie,
Pingyue Zhang,
Mengyue Wu,
Kai Yu
Abstract:
Recently, there has been an increasing focus on audio-text cross-modal learning. However, most of the existing audio-text datasets contain only simple descriptions of sound events. Compared with classification labels, the advantages of such descriptions are significantly limited. In this paper, we first analyze the detailed information that human descriptions of audio may contain beyond sound event labels. Based on the analysis, we propose an automatic pipeline for curating audio-text pairs with rich details. Leveraging the property that sounds can be mixed and concatenated in the time domain, we control details in four aspects: temporal relationship, loudness, speaker identity, and occurrence number, in simulating audio mixtures. Corresponding details are transformed into captions by large language models. Audio-text pairs with rich details in text descriptions are thereby obtained. We validate the effectiveness of our pipeline with a small amount of simulated data, demonstrating that the simulated data enables models to learn detailed audio captioning.
Submitted 7 March, 2024;
originally announced March 2024.
-
Enhancing Audio Generation Diversity with Visual Information
Authors:
Zeyu Xie,
Baihan Li,
Xuenan Xu,
Mengyue Wu,
Kai Yu
Abstract:
Audio and sound generation has garnered significant attention in recent years, with a primary focus on improving the quality of generated audios. However, there has been limited research on enhancing the diversity of generated audio, particularly when it comes to audio generation within specific categories. Current models tend to produce homogeneous audio samples within a category. This work aims to address this limitation by improving the diversity of generated audio with visual information. We propose a clustering-based method, leveraging visual information to guide the model in generating distinct audio content within each category. Results on seven categories indicate that extra visual input can largely enhance audio generation diversity. Audio samples are available at https://zeyuxie29.github.io/DiverseAudioGeneration.
Submitted 2 March, 2024;
originally announced March 2024.
-
Phonetic and Lexical Discovery of a Canine Language using HuBERT
Authors:
Xingyuan Li,
Sinong Wang,
Zeyu Xie,
Mengyue Wu,
Kenny Q. Zhu
Abstract:
This paper explores potential communication patterns within dog vocalizations and transcends the barriers of traditional linguistic analysis, which relies heavily on human a priori knowledge and limited datasets to find sound units in dog vocalizations. We present a self-supervised approach with HuBERT, enabling the accurate classification of phoneme labels and the identification of vocal patterns that suggest a rudimentary vocabulary within dog vocalizations. Our findings indicate significant acoustic consistency in the identified canine vocabulary, which covers the entirety of observed dog vocalization sequences. We further develop a web-based dog vocalization labeling system. This system can highlight phoneme n-grams that are present in the vocabulary in dog audio uploaded by users.
Submitted 24 February, 2024;
originally announced February 2024.
-
Cortical Surface Diffusion Generative Models
Authors:
Zhenshan Xie,
Simon Dahan,
Logan Z. J. Williams,
M. Jorge Cardoso,
Emma C. Robinson
Abstract:
Cortical surface analysis has gained increased prominence, given its potential implications for neurological and developmental disorders. Traditional vision diffusion models, while effective in generating natural images, present limitations in capturing intricate development patterns in neuroimaging due to limited datasets. This is particularly true for generating cortical surfaces where individual variability in cortical morphology is high, leading to an urgent need for better methods to model brain development and the diverse variability inherent across different individuals. In this work, we propose a novel diffusion model for the generation of cortical surface metrics, using modified surface vision transformers as the principal architecture. We validate our method on the developing Human Connectome Project (dHCP); the results suggest that our model demonstrates superior performance in capturing the intricate details of evolving cortical surfaces. Furthermore, our model can generate high-quality realistic samples of cortical surfaces conditioned on postmenstrual age (PMA) at scan.
Submitted 7 February, 2024;
originally announced February 2024.
-
PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation
Authors:
Zhaozhi Xie,
Bochen Guan,
Weihao Jiang,
Muyang Yi,
Yue Ding,
Hongtao Lu,
Lei Zhang
Abstract:
The Segment Anything Model (SAM) has exhibited outstanding performance in various image segmentation tasks. Despite being trained with over a billion masks, SAM faces challenges in mask prediction quality in numerous scenarios, especially in real-world contexts. In this paper, we introduce a novel prompt-driven adapter into SAM, namely Prompt Adapter Segment Anything Model (PA-SAM), aiming to enhance the segmentation mask quality of the original SAM. By exclusively training the prompt adapter, PA-SAM extracts detailed information from images and optimizes the mask decoder feature at both sparse and dense prompt levels, improving the segmentation performance of SAM to produce high-quality masks. Experimental results demonstrate that our PA-SAM outperforms other SAM-based methods in high-quality, zero-shot, and open-set segmentation. We're making the source code and models available at https://github.com/xzz2/pa-sam.
Submitted 23 January, 2024;
originally announced January 2024.
-
SonicVisionLM: Playing Sound with Vision Language Models
Authors:
Zhifeng Xie,
Shengye Yu,
Qile He,
Mengtian Li
Abstract:
There has been a growing interest in the task of generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing methods for video-sound generation attempt to directly create sound from visual representations, which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper, we present SonicVisionLM, a novel framework aimed at generating a wide range of sound effects by leveraging vision-language models (VLMs). Instead of generating audio directly from video, we use the capabilities of powerful VLMs. When provided with a silent video, our approach first identifies events within the video using a VLM to suggest possible sounds that match the video content. This shift in approach transforms the challenging task of aligning image and audio into the better-studied sub-problems of aligning image-to-text and text-to-audio through the popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects and developed a time-controlled audio adapter. Our approach surpasses current state-of-the-art methods for converting video to audio, enhancing synchronization with the visuals and improving alignment between audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/
Submitted 3 April, 2024; v1 submitted 9 January, 2024;
originally announced January 2024.
-
Sensing Mutual Information with Random Signals in Gaussian Channels
Authors:
Lei Xie,
Fan Liu,
Zhanyuan Xie,
Zheng Jiang,
Shenghui Song
Abstract:
Sensing performance is typically evaluated by classical metrics, such as Cramer-Rao bound and signal-to-clutter-plus-noise ratio. The recent development of the integrated sensing and communication (ISAC) framework motivated the efforts to unify the metric for sensing and communication, where researchers have proposed to utilize mutual information (MI) to measure the sensing performance with deterministic signals. However, the need to communicate in ISAC systems necessitates the use of random signals for sensing applications and the closed-form evaluation for the sensing mutual information (SMI) with random signals is not yet available in the literature. This paper investigates the achievable performance and precoder design for sensing applications with random signals. For that purpose, we first derive the closed-form expression for the SMI with random signals by utilizing random matrix theory. The result reveals some interesting physical insights regarding the relation between the SMI with deterministic and random signals. The derived SMI is then utilized to optimize the precoder by leveraging a manifold-based optimization approach. The effectiveness of the proposed methods is validated by simulation results.
Submitted 13 November, 2023;
originally announced November 2023.
-
Space-Air-Ground Integrated Network (SAGIN): A Survey
Authors:
Jiming Chen,
Han Zhang,
Zhe Xie
Abstract:
Since existing mobile communication networks may not be able to meet the low latency and high-efficiency requirements of emerging technologies and applications, novel network architectures need to be investigated to support these new requirements. As a new network architecture that integrates satellite systems, air networks and ground communication, Space-Air-Ground Integrated Network (SAGIN) has attracted extensive attention in recent years. This paper summarizes the recent research work on SAGIN from several aspects, with the basic information of SAGIN first introduced, followed by the physical characteristics. Then the drive and prospects of the current SAGIN architecture in supporting new requirements are analyzed in depth. On this basis, the requirements and challenges are analyzed. Finally, the paper summarizes the existing solutions and outlines future research directions.
Submitted 27 July, 2023;
originally announced July 2023.
-
Super-resolution imaging through a multimode fiber: the physical upsampling of speckle-driven
Authors:
Chuncheng Zhang,
Tingting Liu,
Zhihua Xie,
Yu Wang,
Tong Liu,
Qian Chen,
Xiubao Sui
Abstract:
Following recent advancements in multimode fiber (MMF), miniaturization of imaging endoscopes has proven crucial for minimally invasive surgery in vivo. Recent progress enabled by super-resolution imaging methods with a data-driven deep learning (DL) framework has balanced the relationship between the core size and resolution. However, most DL approaches pay little attention to the physical properties of the speckle, which are crucial for reconciling the relationship between the magnification of super-resolution imaging and the reconstruction quality. In this paper, we find that the interferometric process of speckle formation is an essential basis for creating DL models with super-resolution imaging. It physically realizes the upsampling of low-resolution (LR) images and enhances the perceptual capabilities of the models. The finding experimentally validates the role played by speckle-driven physical upsampling, which effectively compensates for the information missing in data-driven approaches. Experimentally, we overcome the poor reconstruction quality at high magnification by feeding the model a speckle pattern of the same size as the high-resolution (HR) image. The guidance our research offers for endoscopic imaging may accelerate the further development of minimally invasive surgery.
Submitted 11 July, 2023;
originally announced July 2023.
-
Ultrasonic Image's Annotation Removal: A Self-supervised Noise2Noise Approach
Authors:
Yuanheng Zhang,
Nan Jiang,
Zhaoheng Xie,
Junying Cao,
Yueyang Teng
Abstract:
Accurately annotated ultrasonic images are vital components of a high-quality medical report. Hospitals often have strict guidelines on the types of annotations that should appear on imaging results. However, manually inspecting these images can be a cumbersome task. While a neural network could potentially automate the process, training such a model typically requires a dataset of paired input and target images, which in turn involves significant human labour. This study introduces an automated approach for detecting annotations in images. This is achieved by treating the annotations as noise, creating a self-supervised pretext task and using a model trained under the Noise2Noise scheme to restore the image to a clean state. We tested a variety of model structures on the denoising task against different types of annotation, including body marker annotation, radial line annotation, etc. Our results demonstrate that most models trained under the Noise2Noise scheme outperformed their counterparts trained with noisy-clean data pairs. The customized U-Net yielded the best outcome on the body marker annotation dataset, with high scores on segmentation precision and reconstruction similarity. We released our code at https://github.com/GrandArth/UltrasonicImage-N2N-Approach.
Submitted 9 July, 2023;
originally announced July 2023.
-
Multiscale Progressive Text Prompt Network for Medical Image Segmentation
Authors:
Xianjun Han,
Qianqian Chen,
Zhaoyang Xie,
Xuejun Li,
Hongyu Yang
Abstract:
The accurate segmentation of medical images is a crucial step in obtaining reliable morphological statistics. However, training a deep neural network for this task requires a large amount of labeled data to ensure high-accuracy results. To address this issue, we propose using progressive text prompts as prior knowledge to guide the segmentation process. Our model consists of two stages. In the first stage, we perform contrastive learning on natural images to pretrain a powerful prior prompt encoder (PPE). This PPE leverages text prior prompts to generate multimodality features. In the second stage, medical image and text prior prompts are sent into the PPE inherited from the first stage to achieve the downstream medical image segmentation task. A multiscale feature fusion block (MSFF) combines the features from the PPE to produce multiscale multimodality features. These two progressive features not only bridge the semantic gap but also improve prediction accuracy. Finally, an UpAttention block refines the predicted results by merging the image and text features. This design provides a simple and accurate way to leverage multiscale progressive text prior prompts for medical image segmentation. Compared with using only images, our model achieves high-quality results with low data annotation costs. Moreover, our model not only has excellent reliability and validity on medical images but also performs well on natural images. The experimental results on different image datasets demonstrate that our model is effective and robust for image segmentation.
Submitted 30 June, 2023;
originally announced July 2023.
-
Improving Audio Caption Fluency with Automatic Error Correction
Authors:
Hanxue Zhang,
Zeyu Xie,
Xuenan Xu,
Mengyue Wu,
Kai Yu
Abstract:
Automated audio captioning (AAC) is an important cross-modality translation task, aiming at generating descriptions for audio clips. However, captions generated by previous AAC models have faced "false-repetition" errors due to the training objective. In such scenarios, we propose a new task of AAC error correction and hope to reduce such errors by post-processing AAC outputs. To tackle this problem, we use observation-based rules to corrupt captions without errors, for pseudo grammatically-erroneous sentence generation. One pair of corrupted and clean sentences can thus be used for training. We train a neural network-based model on the synthetic error dataset and apply the model to correct real errors in AAC outputs. Results on two benchmark datasets indicate that our approach significantly improves fluency while maintaining semantic information.
Submitted 16 June, 2023;
originally announced June 2023.
-
Enhance Temporal Relations in Audio Captioning with Sound Event Detection
Authors:
Zeyu Xie,
Xuenan Xu,
Mengyue Wu,
Kai Yu
Abstract:
Automated audio captioning aims at generating natural language descriptions for given audio clips, not only detecting and classifying sounds, but also summarizing the relationships between audio events. Recent research advances in audio captioning have introduced additional guidance to improve the accuracy of audio events in generated sentences. However, temporal relations between audio events have received little attention while revealing complex relations is a key component in summarizing audio content. Therefore, this paper aims to better capture temporal relationships in caption generation with sound event detection (SED), a task that locates events' timestamps. We investigate the best approach to integrate temporal information in a captioning model and propose a temporal tag system to transform the timestamps into comprehensible relations. Results evaluated by the proposed temporal metrics suggest that great improvement is achieved in terms of temporal relation generation.
Submitted 2 June, 2023;
originally announced June 2023.
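One way to picture how SED timestamps could be turned into relations is the sketch below. The tag names ("single", "simultaneous", "sequential", "mixed") and the overlap rule are assumptions made for illustration; the abstract above does not define the paper's actual temporal tag system.

```python
from itertools import combinations


def pair_relation(a, b):
    """Relate two (label, onset, offset) events: overlapping -> 'simultaneous', else 'sequential'."""
    _, a_on, a_off = a
    _, b_on, b_off = b
    return "simultaneous" if a_on < b_off and b_on < a_off else "sequential"


def clip_tag(events):
    """Collapse all pairwise relations into one clip-level tag (hypothetical scheme)."""
    relations = {pair_relation(a, b) for a, b in combinations(events, 2)}
    if not relations:
        return "single"  # zero or one event: nothing to relate
    if len(relations) == 1:
        return relations.pop()
    return "mixed"


# Hypothetical SED output: (event label, onset in seconds, offset in seconds).
events = [("dog_bark", 0.0, 2.0), ("car_horn", 1.5, 3.0), ("speech", 4.0, 6.0)]
print(clip_tag(events))
```

A tag of this kind could then be supplied to a captioning model as an extra input, which is one possible reading of "integrating temporal information".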
-
Performance Analysis for Near-Field MIMO: Discrete and Continuous Aperture Antennas
Authors:
Ziyi Xie,
Yuanwei Liu,
Jiaqi Xu,
Xuanli Wu,
Arumugam Nallanathan
Abstract:
Performance analysis is carried out in a near-field multiple-input multiple-output (MIMO) system for both discrete and continuous aperture antennas. The effective degrees of freedom (EDoF) is first derived. It is shown that near-field MIMO systems have a higher EDoF than free-space far-field ones. Additionally, the near-field EDoF further depends on the communication distance. Based on the derived EDoF, closed-form expressions of channel capacity with a fixed distance are obtained. As a further advance, with randomly deployed receivers, ergodic capacity is derived. Simulation results reveal that near-field MIMO has an enhanced multiplexing gain even under line-of-sight transmissions. In addition, the performance of discrete MIMO converges to that of continuous aperture MIMO.
Submitted 1 October, 2023; v1 submitted 12 April, 2023;
originally announced April 2023.
-
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
Authors:
Xuenan Xu,
Zhiling Zhang,
Zelin Zhou,
**yue Zhang,
Zeyu Xie,
Mengyue Wu,
Kenny Q. Zhu
Abstract:
Compared with ample visual-text pre-training research, few works explore audio-text pre-training, mostly due to the lack of sufficient parallel audio-text data. Most existing methods incorporate the visual modality as a pivot for audio-text pre-training, which inevitably induces data noise. In this paper, we propose to utilize audio captioning to generate text directly from audio, without the aid of the visual modality so that potential noise from modality mismatch is eliminated. Furthermore, we propose caption generation under the guidance of AudioSet tags, leading to more accurate captions. With the above two improvements, we curate high-quality, large-scale parallel audio-text data, based on which we perform audio-text pre-training. We comprehensively demonstrate the performance of the pre-trained model on a series of downstream audio-related tasks, including single-modality tasks like audio classification and tagging, as well as cross-modal tasks consisting of audio-text retrieval and audio-based text generation. Experimental results indicate that our approach achieves state-of-the-art zero-shot classification performance on most datasets, suggesting the effectiveness of our synthetic data. The audio encoder also serves as an efficient pattern recognition model by fine-tuning it on audio-related tasks. Synthetic data and pre-trained models are available online.
Submitted 5 March, 2024; v1 submitted 14 March, 2023;
originally announced March 2023.
-
Towards more precise automatic analysis: a comprehensive survey of deep learning-based multi-organ segmentation
Authors:
Xiaoyu Liu,
Linhao Qu,
Ziyue Xie,
Jiayue Zhao,
Yonghong Shi,
Zhijian Song
Abstract:
Accurate segmentation of multiple organs of the head, neck, chest, and abdomen from medical images is an essential step in computer-aided diagnosis, surgical navigation, and radiation therapy. In the past few years, with a data-driven feature extraction approach and end-to-end training, automatic deep learning-based multi-organ segmentation methods have far outperformed traditional methods and become a new research topic. This review systematically summarizes the latest research in this field. For the first time, from the perspective of full and imperfect annotation, we comprehensively compile 161 studies on deep learning-based multi-organ segmentation in multiple regions such as the head and neck, chest, and abdomen, containing a total of 214 related references. For methods based on full annotation, we summarize the existing work from four aspects: network architecture, network dimension, network dedicated modules, and network loss function. For methods based on imperfect annotation, we summarize the existing work from two aspects: weak annotation-based methods and semi annotation-based methods. We also summarize frequently used datasets for multi-organ segmentation and discuss new challenges and new research trends in this field.
Submitted 2 March, 2023; v1 submitted 28 February, 2023;
originally announced March 2023.
-
High speed free-space optical communication using standard fiber communication component without optical amplification
Authors:
Yao Zhang,
Hua-Ying Liu,
Xiaoyi Liu,
Peng Xu,
Xiang Dong,
Pengfei Fan,
Xiaohui Tian,
Hua Yu,
Dong Pan,
Zhijun Yin,
Guilu Long,
Shi-Ning Zhu,
Zhenda Xie
Abstract:
Free-space optical communication (FSO) can achieve fast, secure and license-free communication without the need for physical cables, making it a cost-effective, energy-efficient and flexible solution when a fiber connection is unavailable. To establish FSO connections on demand, it is essential to build portable FSO devices with compact structure and light weight. Here, we develop a miniaturized FSO system and realize 9.16 Gbps FSO between two nodes that are 1 km apart, using a commercial single-mode-fiber-coupled optical transceiver module without optical amplification. Using our 4-stage acquisition, pointing and tracking (APT) systems, the tracking error is within 3 μrad, resulting in an average link loss of 13.7 dB, which is the key to this high-bandwidth FSO demonstration without optical amplification. Our FSO link has been tested up to 4 km, with a link loss of 18 dB that is limited by the foggy weather during the test. Longer FSO distances can be expected with better weather conditions and optical amplification. With a single FSO device weighing only 9.5 kg, this result opens up broad applications of field-deployable high-speed wireless communication.
Submitted 16 April, 2023; v1 submitted 27 February, 2023;
originally announced February 2023.
-
A Fault Location Method Based on Electromagnetic Transient Convolution Considering Frequency-Dependent Parameters and Lossy Ground
Authors:
Guanbo Wang,
Chijie Zhuang,
Jun Deng,
Zhicheng Xie
Abstract:
As the capacity of power systems grows, the need for quick and precise short-circuit fault location becomes increasingly vital for ensuring the safe and continuous supply of power. In this paper, we propose a fault location method that utilizes electromagnetic transient convolution (EMTC). We assess the performance of a naive EMTC implementation in multi-phase power lines by using frequency-dependent parameters in real fault simulation, while using constant parameters in pre-calculation. Our results show that the location error increases as the distance between the fault location and the measurement location increases. Therefore, we adopt the aerial mode transients after phase-mode transformation to perform the convolution, which reduces the influence of frequency-dependence and ground loss. We conduct numerical experiments in a 3-phase 100-km transmission line, a radial distribution network and IEEE 9-bus system under different fault conditions. Our results show that the proposed method achieves tolerable location errors and operates efficiently through direct convolution of the real fault-generated transient signals and the pre-stored calculated transient signals.
Submitted 31 December, 2023; v1 submitted 29 December, 2022;
originally announced December 2022.
-
Piston sensing for sparse aperture systems via all-optical diffractive neural network
Authors:
Xiafei Ma,
Zongliang Xie,
Haotong Ma,
Ge Ren
Abstract:
Realizing real-time piston correction is a crucial issue in sparse aperture imaging. This paper introduces an optical diffractive neural network-based piston sensing method, which can achieve light-speed sensing. By using detectable intensity to represent pistons, the proposed method is capable of converting the complex amplitude distribution of the imaging optical field into piston values directly. Unlike an electrical neural network, this intensity representation enables the method to obtain the predicted pistons without image acquisition or electrical processing. The simulations demonstrate the feasibility of the method for a point source, and high accuracies are achieved for both monochromatic light and broadband light. This method can greatly improve the real-time performance of piston sensing and contribute to the development of sparse aperture systems.
Submitted 29 June, 2023; v1 submitted 13 December, 2022;
originally announced December 2022.
-
Is the Envelope Beneficial to Non-Orthogonal Multiple Access?
Authors:
Ziyi Xie,
Wenqiang Yi,
Xuanli Wu,
Yuanwei Liu,
Arumugam Nallanathan
Abstract:
Non-orthogonal multiple access (NOMA) is capable of serving different numbers of users in the same time-frequency resource element, and this feature can be leveraged to carry additional information. In the orthogonal frequency division multiplexing (OFDM) system, we propose a novel enhanced NOMA scheme, called NOMA with informative envelope (NOMA-IE), to explore the flexibility of the envelope of NOMA signals. In this scheme, data bits are conveyed by the quantified signal envelope in addition to classic signal constellations. The subcarrier activation patterns of different users are jointly decided by the envelope former. At the receiver, successive interference cancellation (SIC) is employed, and we also introduce the envelope detection coefficient to eliminate the error floor. Theoretical expressions of spectral efficiency and energy efficiency are provided for the NOMA-IE. Then, considering the binary phase shift keying modulation, we derive the asymptotic bit error rate for the two-subcarrier OFDM subblock. Afterwards, the expressions are extended to the four-subcarrier case. The analytical results reveal that the imperfect SIC and the index error are the main factors degrading the error performance. The numerical results demonstrate the superiority of the NOMA-IE over the OFDM and OFDM-NOMA, especially in the high signal-to-noise ratio (SNR) regime.
Submitted 24 October, 2022;
originally announced October 2022.
-
Fast optical refocusing through multimode fiber bend using Cake-Cutting Hadamard encoding algorithm to improve robustness
Authors:
Chuncheng Zhang,
Zheyi Yao,
Zhengyue Qin,
Guohua Gu,
Qian Chen,
Zhihua Xie,
Guodong Liu,
Xiubao Sui
Abstract:
Multimode fibers (MMFs) offer the advantages of high resolution and miniaturization over single-mode fibers in the field of optical imaging. However, MMF imaging is susceptible to fiber perturbations that can lead to secondary spatial distortions in the transmitted image. Perturbations include random disturbances in the fiber as well as environmental noise. Here, we exploit the fast focusing capability of the Cake-Cutting Hadamard coding algorithm to counteract the effects of perturbations and improve the system's robustness. Simulation shows that it can approach the theoretical enhancement at 2000 measurements. Experimental results show that the algorithm can help the system refocus in a short time when MMFs are perturbed. This research will further contribute to using multimode fibers in medicine, communication, and detection.
Submitted 27 July, 2022;
originally announced July 2022.
-
Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning
Authors:
Xuenan Xu,
Zeyu Xie,
Mengyue Wu,
Kai Yu
Abstract:
Automated audio captioning (AAC), a task that mimics human perception as well as innovatively links audio processing and natural language processing, has seen much progress over the last few years. AAC requires recognizing contents such as the environment, sound events and the temporal relationships between sound events and describing these elements with a fluent sentence. Currently, an encoder-decoder-based deep learning framework is the standard approach to tackle this problem. Plenty of works have proposed novel network architectures and training schemes, including extra guidance, reinforcement learning, audio-text self-supervised learning and diverse or controllable captioning. Effective data augmentation techniques, especially those based on large language models, are explored. Benchmark datasets and AAC-oriented evaluation metrics also accelerate the improvement of this field. This paper situates itself as a comprehensive survey covering the comparison between AAC and its related tasks, the existing deep learning techniques, datasets, and the evaluation metrics in AAC, with insights provided to guide potential future research directions.
Submitted 15 November, 2023; v1 submitted 11 May, 2022;
originally announced May 2022.
-
GlacierNet2: A Hybrid Multi-Model Learning Architecture for Alpine Glacier Mapping
Authors:
Zhiyuan Xie,
Umesh K. Haritashya,
Vijayan K. Asari,
Michael P. Bishop,
Jeffrey S. Kargel,
Theus H. Aspiras
Abstract:
In recent decades, climate change has significantly affected glacier dynamics, resulting in mass loss and an increased risk of glacier-related hazards including supraglacial and proglacial lake development, as well as catastrophic outburst flooding. Rapidly changing conditions dictate the need for continuous and detailed observations and analysis of climate-glacier dynamics. Thematic and quantitative information regarding glacier geometry is fundamental for understanding climate forcing and the sensitivity of glaciers to climate change; however, accurately mapping debris-cover glaciers (DCGs) is notoriously difficult based upon the use of spectral information and conventional machine-learning techniques. The objective of this research is to improve upon an earlier proposed deep-learning-based approach, GlacierNet, which was developed to exploit a convolutional neural-network segmentation model to accurately outline regional DCG ablation zones. Specifically, we developed an enhanced GlacierNet2 architecture that incorporates multiple models, automatic post-processing, and basin-level hydrological flow techniques to improve the mapping of DCGs such that it includes both the ablation and accumulation zones. Experimental evaluations demonstrate that GlacierNet2 improves the estimation of the ablation zone and allows a high level of intersection over union (IOU: 0.8839) score. The proposed architecture provides complete glacier (both accumulation and ablation zone) outlines at regional scales, with an overall IOU score of 0.8619. This is a crucial first step in automating complete glacier mapping that can be used for accurate glacier modeling or mass-balance analysis.
Submitted 29 July, 2022; v1 submitted 6 April, 2022;
originally announced April 2022.
-
The Dark Side: Security Concerns in Machine Learning for EDA
Authors:
Zhiyao Xie,
**gyu Pan,
Chen-Chia Chang,
Yiran Chen
Abstract:
The growing IC complexity has led to a compelling need for design efficiency improvement through new electronic design automation (EDA) methodologies. In recent years, many unprecedentedly efficient EDA methods have been enabled by machine learning (ML) techniques. While ML demonstrates great potential in circuit design, its dark side, namely security problems, is seldom discussed. This paper gives a comprehensive and impartial summary of all security concerns we have observed in ML for EDA. Many of them are hidden or neglected by practitioners in this field. In this paper, we first provide our taxonomy to define four major types of security concerns, then we analyze different application scenarios and special properties in ML for EDA. After that, we present our detailed analysis of each security concern with experiments.
Submitted 20 March, 2022;
originally announced March 2022.
-
RA V-Net: Deep learning network for automated liver segmentation
Authors:
Zhiqi Lee,
Sumin Qi,
Chongchong Fan,
Ziwei Xie
Abstract:
Accurate segmentation of the liver is a prerequisite for the diagnosis of disease. Automated segmentation is an important application of computer-aided detection and diagnosis of liver disease. In recent years, automated processing of medical images has achieved breakthroughs. However, the low contrast of abdominal CT scan images and the complexity of liver morphology make accurate automatic segmentation challenging. In this paper, we propose RA V-Net, which is an improved medical image automatic segmentation model based on U-Net. It has the following three main innovations. First, the CofRes Module (Composite Original Feature Residual Module) is proposed, which uses more complex convolution layers and skip connections to obtain a higher level of image feature extraction capability and prevent gradient disappearance or explosion. Second, the AR Module (Attention Recovery Module) is proposed to reduce the computational effort of the model. In addition, the spatial features between the data pixels of the encoding and decoding modules are sensed by adjusting the channels and LSTM convolution, so that the image features are effectively retained. Third, the CA Module (Channel Attention Module) is introduced, which is used to extract relevant channels with dependencies and strengthen them by matrix dot product, while weakening irrelevant channels without dependencies, thereby achieving channel attention. The attention mechanism provided by LSTM convolution and the CA Module strongly supports the performance of the neural network. The evaluation metrics of U-Net are accuracy: 0.9862, precision: 0.9118, DSC: 0.8547, JSC: 0.82. The evaluation metrics of RA V-Net are accuracy: 0.9968, precision: 0.9597, DSC: 0.9654, JSC: 0.9414. The most representative metric for the segmentation effect is DSC, which improves by 0.1107 over U-Net, and JSC improves by 0.1214.
Submitted 15 December, 2021; v1 submitted 15 December, 2021;
originally announced December 2021.
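For readers unfamiliar with the two overlap metrics quoted in the RA V-Net abstract, the sketch below computes the Dice similarity coefficient (DSC) and the Jaccard similarity coefficient (JSC) on a toy pair of binary masks, and re-derives the reported improvements from the quoted scores. The masks are made-up illustration data, and the empty-mask convention is an assumption; this is not the authors' evaluation code.

```python
def dice_and_jaccard(pred, target):
    """Return (DSC, JSC) for two equal-length binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    if union == 0:  # assumed convention: two empty masks agree perfectly
        return 1.0, 1.0
    dsc = 2 * intersection / (pred_sum + target_sum)
    jsc = intersection / union
    return dsc, jsc


# Toy flattened masks (illustration only).
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dsc, jsc = dice_and_jaccard(pred, target)
print(f"DSC={dsc:.4f}, JSC={jsc:.4f}")

# Differences between the scores quoted in the abstract.
dsc_gain = round(0.9654 - 0.8547, 4)
jsc_gain = round(0.9414 - 0.82, 4)
print(dsc_gain, jsc_gain)
```

The two subtractions reproduce the 0.1107 and 0.1214 improvements stated in the abstract.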
-
Can Audio Captions Be Evaluated with Image Caption Metrics?
Authors:
Zelin Zhou,
Zhiling Zhang,
Xuenan Xu,
Zeyu Xie,
Mengyue Wu,
Kenny Q. Zhu
Abstract:
Automated audio captioning aims at generating textual descriptions for an audio clip. To evaluate the quality of generated audio captions, previous works directly adopt image captioning metrics like SPICE and CIDEr, without justifying their suitability in this new domain, which may mislead the development of advanced models. This problem is still unstudied due to the lack of human judgment datasets on caption quality. Therefore, we first construct two evaluation benchmarks, AudioCaps-Eval and Clotho-Eval. They are established with pairwise comparison instead of absolute rating to achieve better inter-annotator agreement. Current metrics are found to correlate poorly with human annotations on these datasets. To overcome their limitations, we propose a metric named FENSE, where we combine the strength of Sentence-BERT in capturing similarity, and a novel Error Detector to penalize erroneous sentences for robustness. On the newly established benchmarks, FENSE outperforms current metrics by 14-25% accuracy. Code, data and web demo are available at: https://github.com/blmoistawinde/fense
Submitted 27 January, 2022; v1 submitted 9 October, 2021;
originally announced October 2021.
-
Modeling and Coverage Analysis for RIS-aided NOMA Transmissions in Heterogeneous Networks
Authors:
Ziyi Xie,
Wenqiang Yi,
Xuanli Wu,
Yuanwei Liu,
Arumugam Nallanathan
Abstract:
Reconfigurable intelligent surface (RIS) has been regarded as a promising tool to strengthen the quality of signal transmissions in non-orthogonal multiple access (NOMA) networks. This article introduces a heterogeneous network (HetNet) structure into RIS-aided NOMA multi-cell networks. A practical user equipment (UE) association scheme for maximizing the average received power is adopted. To evaluate system performance, we provide a stochastic geometry based analytical framework, where the locations of RISs, base stations (BSs), and UEs are modeled as homogeneous Poisson point processes (PPPs). Based on this framework, we first derive the closed-form probability density function (PDF) to characterize the distribution of the reflective links created by RISs. Then, both the exact expressions and upper/lower bounds of UE association probability are calculated. Lastly, the analytical expressions of the signal-to-interference-plus-noise-ratio (SINR) and rate coverage probability are deduced. Additionally, to investigate the impact of RISs on system coverage, the asymptotic expressions of two coverage probabilities are derived. The theoretical results show that RIS length is not the decisive factor for coverage improvement. Numerical results demonstrate that the proposed RIS HetNet structure brings significant enhancement in rate coverage. Moreover, there exists an optimal combination of RISs and BSs deployment densities to maximize coverage probability.
Submitted 27 April, 2021;
originally announced April 2021.
-
Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning
Authors:
Xuenan Xu,
Heinrich Dinkel,
Mengyue Wu,
Zeyu Xie,
Kai Yu
Abstract:
Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. Multitudinous concepts are described in an audio caption, ranging from local information such as sound events to global information like acoustic scenery. Currently, the mainstream paradigm for AAC is the end-to-end encoder-decoder architecture, expecting the encoder to learn all levels of concepts embedded in the audio automatically. This paper first proposes a topic model for audio descriptions, comprehensively analyzing the hierarchical audio topics that are commonly covered. We then explore a transfer learning scheme to access local and global information. Two source tasks are identified to represent local and global information respectively: Audio Tagging (AT) and Acoustic Scene Classification (ASC). Experiments are conducted on the AAC benchmark datasets Clotho and AudioCaps, showing a vast increase in all eight metrics with topic transfer learning. Further, it is discovered that local information and abstract representation learning are more crucial to AAC than global information and temporal relationship learning.
Submitted 22 February, 2021;
originally announced February 2021.
-
Fast IR Drop Estimation with Machine Learning
Authors:
Zhiyao Xie,
Hai Li,
Xiaoqing Xu,
Jiang Hu,
Yiran Chen
Abstract:
The IR drop constraint is a fundamental requirement enforced in almost all chip designs. However, its evaluation takes a long time, and mitigation techniques for fixing violations may require numerous iterations. As such, fast and accurate IR drop prediction becomes critical for reducing design turnaround time. Recently, machine learning (ML) techniques have been actively studied for fast IR drop estimation due to their promise and success in many fields. These studies target various design stages with different emphases, and accordingly, different ML algorithms are adopted and customized. This paper provides a review of the latest progress in ML-based IR drop estimation techniques. It also serves as a vehicle for discussing some general challenges faced by ML applications in electronic design automation (EDA), and demonstrating how to integrate ML models with conventional techniques for better efficiency of EDA tools.
Submitted 26 November, 2020;
originally announced November 2020.
-
Smart-PGSim: Using Neural Network to Accelerate AC-OPF Power Grid Simulation
Authors:
Wenqian Dong,
Zhen Xie,
Gokcen Kestor,
Dong Li
Abstract:
The optimal power flow (OPF) problem is one of the most important optimization problems for the operation of the power grid. It calculates the optimum scheduling of the committed generation units. In this paper, we develop a neural network approach to the problem of accelerating the alternating current optimal power flow (AC-OPF) by generating an intelligent initial solution. The high quality of the initial solution and the guidance of other outputs generated by the neural network enable faster convergence to the solution without losing the optimality of the final solution as computed by traditional methods. Smart-PGSim generates a novel multitask-learning neural network model to accelerate the AC-OPF simulation. Smart-PGSim also imposes the physical constraints of the simulation on the neural network automatically. Smart-PGSim brings an average of 49.2% performance improvement (up to 91%), computed over 10,000 problem simulations, with respect to the original AC-OPF implementation, without losing the optimality of the final solution.
Submitted 26 August, 2020;
originally announced August 2020.
-
Joint Bandwidth Allocation and Path Selection in WANs with Path Cardinality Constraints
Authors:
**xin Wang,
Fan Zhang,
Zhonglin Xie,
Gong Zhang,
Zaiwen Wen
Abstract:
In this paper, we study a joint bandwidth allocation and path selection problem by solving a multi-objective minimization problem under path cardinality constraints, namely MOPC. Our problem formulation captures various types of objectives, including proportional fairness, the total completion time, and the worst-case link utilization ratio. Such an optimization problem is very challenging since it is highly non-convex. Almost all existing works deal with such a problem by using relaxation techniques to transform it into a convex optimization problem. In contrast, we provide a novel solution framework based on the classic alternating direction method of multipliers (ADMM) approach for solving this problem. Our proposed algorithm is simple and easy to implement. Each step of our algorithm consists of either finding the maximal root of a single cubic equation, which is guaranteed to have at least one positive solution, or solving a one-dimensional convex subproblem in a fixed interval. Under some mild assumptions, we prove that any limit point of the sequence generated by our proposed algorithm is a stationary point. Extensive numerical simulations are performed to demonstrate the advantages of our algorithm compared with various baselines.
Submitted 10 August, 2020;
originally announced August 2020.
-
Neural Video Coding using Multiscale Motion Compensation and Spatiotemporal Context Model
Authors:
Haojie Liu,
Ming Lu,
Zhan Ma,
Fan Wang,
Zhihuang Xie,
Xun Cao,
Yao Wang
Abstract:
Over the past two decades, traditional block-based video coding has made remarkable progress and spawned a series of well-known standards such as MPEG-4, H.264/AVC and H.265/HEVC. On the other hand, deep neural networks (DNNs) have shown their powerful capacity for visual content understanding, feature extraction and compact representation. Some previous works have explored the learnt video coding algorithms in an end-to-end manner, which show the great potential compared with traditional methods. In this paper, we propose an end-to-end deep neural video coding framework (NVC), which uses variational autoencoders (VAEs) with joint spatial and temporal prior aggregation (PA) to exploit the correlations in intra-frame pixels, inter-frame motions and inter-frame compensation residuals, respectively. Novel features of NVC include: 1) To estimate and compensate motion over a large range of magnitudes, we propose an unsupervised multiscale motion compensation network (MS-MCN) together with a pyramid decoder in the VAE for coding motion features that generates multiscale flow fields, 2) we design a novel adaptive spatiotemporal context model for efficient entropy coding for motion information, 3) we adopt nonlocal attention modules (NLAM) at the bottlenecks of the VAEs for implicit adaptive feature extraction and activation, leveraging its high transformation capacity and unequal weighting with joint global and local information, and 4) we introduce multi-module optimization and a multi-frame training strategy to minimize the temporal error propagation among P-frames. NVC is evaluated for the low-delay causal settings and compared with H.265/HEVC, H.264/AVC and the other learnt video compression methods following the common test conditions, demonstrating consistent gains across all popular test sequences for both PSNR and MS-SSIM distortion metrics.
Submitted 9 July, 2020;
originally announced July 2020.
-
Hyperspectral Image Classification Based on Adaptive Sparse Deep Network
Authors:
**gwen Yan,
Zixin Xie,
**gyao Chen,
Yinan Liu,
Lei Liu
Abstract:
Sparse models are widely used in hyperspectral image classification. However, different sparsity and regularization parameters have a great influence on the classification results. In this paper, a novel adaptive sparse deep network based on a deep architecture is proposed, which can construct the optimal sparse representation and regularization parameters through a deep network. First, a data flow graph is designed to represent each update iteration of the Alternating Direction Method of Multipliers (ADMM) algorithm. The forward network and back-propagation network are then derived, and all parameters are updated by gradient descent in back-propagation. On this basis, we propose an Adaptive Sparse Deep Network. Compared with several traditional classifiers and other algorithms for sparse models, experimental results indicate that our method achieves great improvement in HSI classification.
Submitted 21 October, 2019;
originally announced October 2019.
-
Translation position extracting in incoherent Fourier ptychography
Authors:
Zongliang Xie,
Haotong Ma,
Yihan Luo,
Bo Qi,
Ge Ren
Abstract:
Incoherent Fourier ptychography (IFP) is a newly developed super-resolution method in which accurate knowledge of translation positions is essential for image reconstruction. To remove this requirement, we propose a preprocessing algorithm, termed translation position extracting (TPE), that extracts the translation positions of the structured light directly from the raw images of IFP. TPE mainly involves two steps. First, the speckle parts mixed in the acquired intensities, in which the illumination motion is encoded, are isolated by intensity averaging and division. Then the cross-correlations of the speckle dataset are computed to determine the shift positions. TPE-IFP improves on the previous IFP by removing the requirement for prior knowledge of translation positions. Its effectiveness is demonstrated by obtaining high-quality super-resolution images in the absence of location information in both simulations and experiments. By further relaxing the practical conditions, the proposed TPE may accelerate the applications of IFP. Moreover, as a preprocessing approach, TPE might also contribute to the estimation of pattern positions for similar speckle-based imaging techniques.
Submitted 17 October, 2019;
originally announced October 2019.
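The cross-correlation step mentioned in the TPE abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the speckle frames are already isolated, that the motion is a pure integer-pixel circular shift, and the function name `estimate_shift` is hypothetical.

```python
import numpy as np

def estimate_shift(reference, shifted):
    """Return the (row, col) circular shift that maps `reference` onto `shifted`.

    Assumption: both inputs are 2-D speckle frames of equal shape that differ
    only by an integer-pixel circular translation.
    """
    # Cross-correlation computed in the Fourier domain:
    # corr = IFFT( FFT(shifted) * conj(FFT(reference)) )
    spectrum = np.fft.fft2(shifted) * np.conj(np.fft.fft2(reference))
    corr = np.fft.ifft2(spectrum).real
    # The correlation peak sits at the translation (modulo the frame size).
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peak indices beyond half the frame size correspond to negative shifts.
    return tuple(int(p) if p <= n // 2 else int(p) - n
                 for p, n in zip(peak, corr.shape))

# Synthetic check: translate a random speckle-like frame by a known amount.
rng = np.random.default_rng(0)
speckle = rng.random((64, 64))
moved = np.roll(speckle, shift=(5, -3), axis=(0, 1))
estimated = estimate_shift(speckle, moved)
```

Real data would contain noise and sub-pixel motion, so a practical version would likely need peak interpolation; that refinement is outside this sketch.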
-
Physical-Layer Network Coding: An Efficient Technique for Wireless Communications
Authors:
**** Chen,
Zhaopeng Xie,
Yi Fang,
Zhifeng Chen,
Shahid Mumtaz,
Joel J. P. C. Rodrigues
Abstract:
As a subfield of network coding, physical-layer network coding (PNC) can effectively enhance the throughput of wireless networks by mapping superimposed signals at the receiver to other forms of user messages. Over the past twenty years, PNC has received significant research attention and has been widely studied in various communication scenarios, e.g., two-way relay communications (TWRC), nonorthogonal multiple access (NOMA) in 5G networks, random access networks, etc. To ensure network reliability, channel-coded PNC is proposed and related communication techniques are investigated, such as the design of channel codes, low-complexity decoding, and cross-layer design. In this article, we briefly review the variants of channel-coded PNC wireless communications with the aim of inspiring future research activities in this area. We also put forth open research problems along with a few selected research directions under PNC-aided frameworks.
Submitted 23 July, 2019; v1 submitted 21 July, 2019;
originally announced July 2019.
-
Fashion Editing with Adversarial Parsing Learning
Authors:
Haoye Dong,
Xiaodan Liang,
Yixuan Zhang,
Xujie Zhang,
Zhenyu Xie,
Bowen Wu,
Ziqi Zhang,
Xiaohui Shen,
Jian Yin
Abstract:
Interactive fashion image manipulation, which enables users to edit images with sketches and color strokes, is an interesting research problem with great application value. Existing works often treat it as a general inpainting task and do not fully leverage the semantic structural information in fashion images. Moreover, they directly utilize conventional convolution and normalization layers to restore the incomplete image, which tends to wash away the sketch and color information. In this paper, we propose a novel Fashion Editing Generative Adversarial Network (FE-GAN), which is capable of manipulating fashion images by free-form sketches and sparse color strokes. FE-GAN consists of two modules: 1) a free-form parsing network that learns to control the human parsing generation by manipulating sketch and color; 2) a parsing-aware inpainting network that renders detailed textures with semantic guidance from the human parsing map. A new attention normalization layer is further applied at multiple scales in the decoder of the inpainting network to enhance the quality of the synthesized image. Extensive experiments on high-resolution fashion image datasets demonstrate that the proposed method significantly outperforms the state-of-the-art methods on image manipulation.
Submitted 28 September, 2019; v1 submitted 3 June, 2019;
originally announced June 2019.
-
Known-plaintext attack and ciphertext-only attack for encrypted single-pixel imaging
Authors:
Shuming Jiao,
Yang Gao,
Ting Lei,
Zhenwei Xie,
Xiaocong Yuan
Abstract:
In many previous works, a single-pixel imaging (SPI) system is constructed as an optical image encryption system. Unauthorized users are not able to reconstruct the plaintext image from the ciphertext intensity sequence without knowing the illumination pattern key. However, little cryptanalysis of encrypted SPI has been carried out in the past. In this work, we propose, for the first time, a known-plaintext attack scheme and a ciphertext-only attack scheme against an encrypted SPI system. The known-plaintext attack is implemented by interchanging the roles of illumination patterns and object images in the SPI model. The ciphertext-only attack is implemented based on the statistical features of single-pixel intensity values. The two schemes can crack encrypted SPI systems and successfully recover the key containing the correct illumination patterns.
Submitted 31 May, 2019;
originally announced May 2019.
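The idea of interchanging the roles of patterns and images in the known-plaintext attack above can be sketched under a simple linear model. This is an illustration, not the paper's procedure: it assumes noiseless intensities of the form intensity = (pattern · image), the same key reused for every plaintext, and at least as many known plaintexts as pixels. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_patterns, n_known = 64, 40, 80   # illustrative sizes (8x8 images)

# Secret key: one illumination pattern per row (unknown to the attacker).
key = rng.random((n_patterns, n_pixels))

# Known plaintext images (flattened) and the ciphertext intensities they yield.
plaintexts = rng.random((n_known, n_pixels))
ciphertexts = plaintexts @ key.T              # shape (n_known, n_patterns)

def recover_key(plaintexts, ciphertexts):
    """Solve plaintexts @ key.T = ciphertexts for the key by least squares.

    The known images act as the 'measurement matrix' and each illumination
    pattern is the unknown, which is the role swap described in the abstract.
    """
    solution, *_ = np.linalg.lstsq(plaintexts, ciphertexts, rcond=None)
    return solution.T                         # shape (n_patterns, n_pixels)

recovered = recover_key(plaintexts, ciphertexts)
max_error = float(np.max(np.abs(recovered - key)))
```

With fewer known plaintexts than pixels the system is underdetermined, and some additional prior on the patterns would be needed.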
-
Multiple-image encryption and hiding with an optical diffractive neural network
Authors:
Yang Gao,
Shuming Jiao,
Juncheng Fang,
Ting Lei,
Zhenwei Xie,
Xiaocong Yuan
Abstract:
A cascaded phase-only mask architecture (or an optical diffractive neural network) can be employed for different optical information processing tasks such as pattern recognition, orbital angular momentum (OAM) mode conversion, image salience detection and image encryption. However, for optical encryption and watermarking applications, such a system usually cannot process multiple pairs of input images and output images simultaneously. In our proposed scheme, multiple input images can be simultaneously fed to an optical diffractive neural network (DNN) system, and each corresponding output image is displayed in a non-overlapping sub-region of the output imaging plane. Each input image undergoes a different optical transform in an independent channel within the same system. The multiple cascaded phase masks in the system can be effectively optimized by a wavefront matching algorithm. Similar to recent optical pattern recognition and mode conversion works, the orthogonality property is employed to design a multiplexed DNN.
Submitted 10 February, 2020; v1 submitted 21 February, 2019;
originally announced February 2019.