Search | arXiv e-print repository

Enhancing Expressiveness in Dance Generation via Integrating Frequency and Music Style Information

Authors: Qiaochu Huang, Xu He, Boshi Tang, Haolin Zhuang, Liyang Chen, Shuochen Gao, Zhiyong Wu, Haozhi Huang, Helen Meng

Abstract: Dance generation, as a branch of human motion generation, has attracted increasing attention. Recently, a few works attempt to enhance dance expressiveness, which includes genre matching, beat alignment, and dance dynamics, from certain aspects. However, the enhancement is quite limited as they lack comprehensive consideration of the aforementioned three factors. In this paper, we propose Expressi… ▽ More Dance generation, as a branch of human motion generation, has attracted increasing attention. Recently, a few works attempt to enhance dance expressiveness, which includes genre matching, beat alignment, and dance dynamics, from certain aspects. However, the enhancement is quite limited as they lack comprehensive consideration of the aforementioned three factors. In this paper, we propose ExpressiveBailando, a novel dance generation method designed to generate expressive dances, concurrently taking all three factors into account. Specifically, we mitigate the issue of speed homogenization by incorporating frequency information into VQ-VAE, thus improving dance dynamics. Additionally, we integrate music style information by extracting genre- and beat-related features with a pre-trained music model, hence achieving improvements in the other two factors. Extensive experimental results demonstrate that our proposed method can generate dances with high expressiveness and outperforms existing methods both qualitatively and quantitatively. △ Less

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2311.04769 [pdf]

An attention-based deep learning network for predicting Platinum resistance in ovarian cancer

Authors: Haoming Zhuang, Beibei Li, **gtong Ma, Patrice Monkam, Shouliang Qi, Wei Qian, Dianning He

Abstract: Background: Ovarian cancer is among the three most frequent gynecologic cancers globally. High-grade serous ovarian cancer (HGSOC) is the most common and aggressive histological type. Guided treatment for HGSOC typically involves platinum-based combination chemotherapy, necessitating an assessment of whether the patient is platinum-resistant. The purpose of this study is to propose a deep learning… ▽ More Background: Ovarian cancer is among the three most frequent gynecologic cancers globally. High-grade serous ovarian cancer (HGSOC) is the most common and aggressive histological type. Guided treatment for HGSOC typically involves platinum-based combination chemotherapy, necessitating an assessment of whether the patient is platinum-resistant. The purpose of this study is to propose a deep learning-based method to determine whether a patient is platinum-resistant using multimodal positron emission tomography/computed tomography (PET/CT) images. Methods: 289 patients with HGSOC were included in this study. An end-to-end SE-SPP-DenseNet model was built by adding Squeeze-Excitation Block (SE Block) and Spatial Pyramid Pooling Layer (SPPLayer) to Dense Convolutional Network (DenseNet). Multimodal data from PET/CT images of the regions of interest (ROI) were used to predict platinum resistance in patients. Results: Through five-fold cross-validation, SE-SPP-DenseNet achieved a high accuracy rate and an area under the curve (AUC) in predicting platinum resistance in patients, which were 92.6% and 0.93, respectively. The importance of incorporating SE Block and SPPLayer into the deep learning model, and considering multimodal data was substantiated by carrying out ablation studies and experiments with single modality data. Conclusions: The obtained classification results indicate that our proposed deep learning framework performs better in predicting platinum resistance in patients, which can help gynecologists make better treatment decisions. Keywords: PET/CT, CNN, SE Block, SPP Layer, Platinum resistance, Ovarian cancer △ Less

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2305.11094 [pdf, other]

QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation

Authors: Sicheng Yang, Zhiyong Wu, Minglei Li, Zhensong Zhang, Lei Hao, Weihong Bao, Haolin Zhuang

Abstract: Speech-driven gesture generation is highly challenging due to the random jitters of human motion. In addition, there is an inherent asynchronous relationship between human speech and gestures. To tackle these challenges, we introduce a novel quantization-based and phase-guided motion-matching framework. Specifically, we first present a gesture VQ-VAE module to learn a codebook to summarize meaning… ▽ More Speech-driven gesture generation is highly challenging due to the random jitters of human motion. In addition, there is an inherent asynchronous relationship between human speech and gestures. To tackle these challenges, we introduce a novel quantization-based and phase-guided motion-matching framework. Specifically, we first present a gesture VQ-VAE module to learn a codebook to summarize meaningful gesture units. With each code representing a unique gesture, random jittering problems are alleviated effectively. We then use Levenshtein distance to align diverse gestures with different speech. Levenshtein distance based on audio quantization as a similarity metric of corresponding speech of gestures helps match more appropriate gestures with speech, and solves the alignment problem of speech and gestures well. Moreover, we introduce phase to guide the optimal gesture matching based on the semantics of context or rhythm of audio. Phase guides when text-based or speech-based gestures should be performed to make the generated gestures more natural. Extensive experiments show that our method outperforms recent approaches on speech-driven gesture generation. Our code, database, pre-trained models, and demos are available at https://github.com/YoungSeng/QPGesture. △ Less

Submitted 18 May, 2023; originally announced May 2023.

Comments: 15 pages, 12 figures, CVPR 2023 Highlight

arXiv:2304.12704 [pdf, other]

GTN-Bailando: Genre Consistent Long-Term 3D Dance Generation based on Pre-trained Genre Token Network

Authors: Haolin Zhuang, Shun Lei, Long Xiao, Weiqin Li, Liyang Chen, Sicheng Yang, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Music-driven 3D dance generation has become an intensive research topic in recent years with great potential for real-world applications. Most existing methods lack the consideration of genre, which results in genre inconsistency in the generated dance movements. In addition, the correlation between the dance genre and the music has not been investigated. To address these issues, we propose a genr… ▽ More Music-driven 3D dance generation has become an intensive research topic in recent years with great potential for real-world applications. Most existing methods lack the consideration of genre, which results in genre inconsistency in the generated dance movements. In addition, the correlation between the dance genre and the music has not been investigated. To address these issues, we propose a genre-consistent dance generation framework, GTN-Bailando. First, we propose the Genre Token Network (GTN), which infers the genre from music to enhance the genre consistency of long-term dance generation. Second, to improve the generalization capability of the model, the strategy of pre-training and fine-tuning is adopted.Experimental results on the AIST++ dataset show that the proposed dance generation framework outperforms state-of-the-art methods in terms of motion quality and genre consistency. △ Less

Submitted 25 April, 2023; originally announced April 2023.

Comments: Accepted by ICASSP2023.Demo page: https://im1eon.github.io/ICASSP23-GTNB-DG/

arXiv:2304.08990 [pdf, other]

A Comparison of Image Denoising Methods

Authors: Zhaoming Kong, Fangxi Deng, Haomin Zhuang, Jun Yu, Lifang He, Xiaowei Yang

Abstract: The advancement of imaging devices and countless images generated everyday pose an increasingly high demand on image denoising, which still remains a challenging task in terms of both effectiveness and efficiency. To improve denoising quality, numerous denoising techniques and approaches have been proposed in the past decades, including different transforms, regularization terms, algebraic represe… ▽ More The advancement of imaging devices and countless images generated everyday pose an increasingly high demand on image denoising, which still remains a challenging task in terms of both effectiveness and efficiency. To improve denoising quality, numerous denoising techniques and approaches have been proposed in the past decades, including different transforms, regularization terms, algebraic representations and especially advanced deep neural network (DNN) architectures. Despite their sophistication, many methods may fail to achieve desirable results for simultaneous noise removal and fine detail preservation. In this paper, to investigate the applicability of existing denoising techniques, we compare a variety of denoising methods on both synthetic and real-world datasets for different applications. We also introduce a new dataset for benchmarking, and the evaluations are performed from four different perspectives including quantitative metrics, visual effects, human ratings and computational cost. Our experiments demonstrate: (i) the effectiveness and efficiency of representative traditional denoisers for various denoising tasks, (ii) a simple matrix-based algorithm may be able to produce similar results compared with its tensor counterparts, and (iii) the notable achievements of DNN models, which exhibit impressive generalization ability and show state-of-the-art performance on various datasets. In spite of the progress in recent years, we discuss shortcomings and possible extensions of existing techniques. Datasets, code and results are made publicly available and will be continuously updated at https://github.com/ZhaomingKong/Denoising-Comparison. △ Less

Submitted 9 May, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

Comments: In this paper, we intend to collect and compare various denoising methods to investigate their effectiveness, efficiency, applicability and generalization ability with both synthetic and real-world experiments. arXiv admin note: substantial text overlap with arXiv:2011.03462

arXiv:2212.01042 [pdf, other]

doi 10.1109/SP46214.2022.9833716

AccEar: Accelerometer Acoustic Eavesdrop** with Unconstrained Vocabulary

Authors: Pengfei Hu, Hui Zhuang, Panneer Selvam Santhalingamy, Riccardo Spolaor, Parth Pathaky, Guoming Zhang, Xiuzhen Cheng

Abstract: With the increasing popularity of voice-based applications, acoustic eavesdrop** has become a serious threat to users' privacy. While on smartphones the access to microphones needs an explicit user permission, acoustic eavesdrop** attacks can rely on motion sensors (such as accelerometer and gyroscope), which access is unrestricted. However, previous instances of such attacks can only recogniz… ▽ More With the increasing popularity of voice-based applications, acoustic eavesdrop** has become a serious threat to users' privacy. While on smartphones the access to microphones needs an explicit user permission, acoustic eavesdrop** attacks can rely on motion sensors (such as accelerometer and gyroscope), which access is unrestricted. However, previous instances of such attacks can only recognize a limited set of pre-trained words or phrases. In this paper, we present AccEar, an accelerometerbased acoustic eavesdrop** attack that can reconstruct any audio played on the smartphone's loudspeaker with unconstrained vocabulary. We show that an attacker can employ a conditional Generative Adversarial Network (cGAN) to reconstruct highfidelity audio from low-frequency accelerometer signals. The presented cGAN model learns to recreate high-frequency components of the user's voice from low-frequency accelerometer signals through spectrogram enhancement. We assess the feasibility and effectiveness of AccEar attack in a thorough set of experiments using audio from 16 public personalities. As shown by the results in both objective and subjective evaluations, AccEar successfully reconstructs user speeches from accelerometer signals in different scenarios including varying sampling rate, audio volume, device model, etc. △ Less

Submitted 2 December, 2022; originally announced December 2022.

Comments: 2022 IEEE Symposium on Security and Privacy (SP)

Journal ref: 2022 IEEE Symposium on Security and Privacy (SP)

arXiv:2208.08757 [pdf, other]

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

Authors: SiCheng Yang, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu, Aolan Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang, Helen Meng

Abstract: One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic. Existing works generally disentangle timbre, while information about pitch, rhythm and content is still mixed together. To perform one-shot VC effectively with further disentangling these speech components, we employ random resampling for pitch and content encoder and use the va… ▽ More One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic. Existing works generally disentangle timbre, while information about pitch, rhythm and content is still mixed together. To perform one-shot VC effectively with further disentangling these speech components, we employ random resampling for pitch and content encoder and use the variational contrastive log-ratio upper bound of mutual information and gradient reversal layer based adversarial mutual information learning to ensure the different parts of the latent space containing only the desired disentangled representation during training. Experiments on the VCTK dataset show the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intellgibility. In addition, we can transfer characteristics of one-shot VC on timbre, pitch and rhythm separately by speech representation disentanglement. Our code, pre-trained models and demo are available at https://im1eon.github.io/IS2022-SRDVC/. △ Less

Submitted 18 August, 2022; originally announced August 2022.

Comments: 5 pages,5 figures,INTERSPEECH 2022

arXiv:2206.01599 [pdf, other]

A Deep-Learning Usability Expansion Model of Ocean Observations

Authors: Ali Muhamed Ali, Hanqi Zhuang, Yu Huang, Ali K. Ibrahim, Ali Salem Altaher, Laurent Chérubin

Abstract: Today's ocean numerical prediction skills depend on the availability of in-situ and remote ocean observations at the time of the predictions only. Because observations are scarce and discontinuous in time and space, numerical models are often unable to accurately model and predict real ocean dynamics, leading to a lack of fulfillment of a range of services that require reliable predictions at vari… ▽ More Today's ocean numerical prediction skills depend on the availability of in-situ and remote ocean observations at the time of the predictions only. Because observations are scarce and discontinuous in time and space, numerical models are often unable to accurately model and predict real ocean dynamics, leading to a lack of fulfillment of a range of services that require reliable predictions at various temporal and spatial scales. The process of constraining free numerical models with observations is known as data assimilation. The primary objective is to minimize the misfit of model states with the observations while respecting the rules of physics. The caveat of this approach is that measurements are used only once, at the time of the prediction. The information contained in the history of the measurements and its role in the determinism of the prediction is, therefore, not accounted for. Consequently, historical measurement cannot be used in real-time forecasting systems. The research presented in this paper provides a novel approach rooted in artificial intelligence to expand the usability of observations made before the time of the prediction. Our approach is based on the re-purpose of an existing deep learning model, called U-Net, designed specifically for image segmentation analysis in the biomedical field. U-Net is used here to create a Transform Model that retains the temporal and spatial evolution of the differences between model and observations to produce a correction in the form of regression weights that evolves spatially and temporally with the model both forward and backward in time, beyond the observation period. Using virtual observations, we show that the usability of the observation can be extended up to a one year prior or post observations. △ Less

Submitted 3 June, 2022; originally announced June 2022.

Comments: 34 pages, 14 figurs, one table

arXiv:2008.01798 [pdf, other]

doi 10.3389/frai.2021.780271

Physics-informed Tensor-train ConvLSTM for Volumetric Velocity Forecasting of Loop Current

Authors: Yu Huang, Yufei Tang, Hanqi Zhuang, James VanZwieten, Laurent Cherubin

Abstract: According to the National Academies, a weekly forecast of velocity, vertical structure, and duration of the Loop Current (LC) and its eddies is critical for understanding the oceanography and ecosystem, and for mitigating outcomes of anthropogenic and natural disasters in the Gulf of Mexico (GoM). However, this forecast is a challenging problem since the LC behaviour is dominated by long-range spa… ▽ More According to the National Academies, a weekly forecast of velocity, vertical structure, and duration of the Loop Current (LC) and its eddies is critical for understanding the oceanography and ecosystem, and for mitigating outcomes of anthropogenic and natural disasters in the Gulf of Mexico (GoM). However, this forecast is a challenging problem since the LC behaviour is dominated by long-range spatial connections across multiple timescales. In this paper, we extend spatiotemporal predictive learning, showing its effectiveness beyond video prediction, to a 4D model, i.e., a novel Physics-informed Tensor-train ConvLSTM (PITT-ConvLSTM) for temporal sequences of 3D geospatial data forecasting. Specifically, we propose 1) a novel 4D higher-order recurrent neural network with empirical orthogonal function analysis to capture the hidden uncorrelated patterns of each hierarchy, 2) a convolutional tensor-train decomposition to capture higher-order space-time correlations, and 3) to incorporate prior physic knowledge that is provided from domain experts by informing the learning in latent space. The advantage of our proposed method is clear: constrained by physical laws, it simultaneously learns good representations for frame dependencies (both short-term and long-term high-level dependency) and inter-hierarchical relations within each time frame. Experiments on geospatial data collected from the GoM demonstrate that PITT-ConvLSTM outperforms the state-of-the-art methods in forecasting the volumetric velocity of the LC and its eddies for a period of over one week. △ Less

Submitted 18 December, 2021; v1 submitted 4 August, 2020; originally announced August 2020.

Comments: 10 pages, 5 figures

Journal ref: Front. Artif. Intell. 4(2021)197

arXiv:2007.09478 [pdf]

Classification of Diabetic Retinopathy via Fundus Photography: Utilization of Deep Learning Approaches to Speed up Disease Detection

Authors: Hangwei Zhuang, Nabil Ettehadi

Abstract: In this paper, we propose two distinct solutions to the problem of Diabetic Retinopathy (DR) classification. In the first approach, we introduce a shallow neural network architecture. This model performs well on classification of the most frequent classes while fails at classifying the less frequent ones. In the second approach, we use transfer learning to re-train the last modified layer of a ver… ▽ More In this paper, we propose two distinct solutions to the problem of Diabetic Retinopathy (DR) classification. In the first approach, we introduce a shallow neural network architecture. This model performs well on classification of the most frequent classes while fails at classifying the less frequent ones. In the second approach, we use transfer learning to re-train the last modified layer of a very deep neural network to improve the generalization ability of the model to the less frequent classes. Our results demonstrate superior abilities of transfer learning in DR classification of less frequent classes compared to the shallow neural network. △ Less

Submitted 18 July, 2020; originally announced July 2020.

Comments: 6 pages, 9 figures

ACM Class: I.4.6; I.4.9

arXiv:2006.10159 [pdf, other]

doi 10.1038/s42256-021-00356-5

Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors

Authors: Claudionor N. Coelho Jr., Aki Kuusela, Shan Li, Hao Zhuang, Thea Aarrestad, Vladimir Loncar, Jennifer Ngadiuba, Maurizio Pierini, Adrian Alan Pol, Sioni Summers

Abstract: Although the quest for more accurate solutions is pushing deep learning research towards larger and more complex algorithms, edge devices demand efficient inference and therefore reduction in model size, latency and energy consumption. One technique to limit model size is quantization, which implies using fewer bits to represent weights and biases. Such an approach usually results in a decline in… ▽ More Although the quest for more accurate solutions is pushing deep learning research towards larger and more complex algorithms, edge devices demand efficient inference and therefore reduction in model size, latency and energy consumption. One technique to limit model size is quantization, which implies using fewer bits to represent weights and biases. Such an approach usually results in a decline in performance. Here, we introduce a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip. With a per-layer, per-parameter type automatic quantization procedure, sampling from a wide range of quantizers, model energy consumption and size are minimized while high accuracy is maintained. This is crucial for the event selection procedure in proton-proton collisions at the CERN Large Hadron Collider, where resources are strictly limited and a latency of ${\mathcal O}(1)~μ$s is required. Nanosecond inference and a resource consumption reduced by a factor of 50 when implemented on field-programmable gate array hardware are achieved. △ Less

Submitted 21 June, 2021; v1 submitted 15 June, 2020; originally announced June 2020.

Journal ref: Nature Machine Intelligence, Volume 3 (2021)

arXiv:2005.08356 [pdf]

North Atlantic Right Whales Up-call Detection Using Multimodel Deep Learning

Authors: Ali K Ibrahim, Hanqi Zhuang, Laurent M. Ch'erubin, Nurgun Erdol, Gregory O Corry-Crowe, Ali Muhamed Ali

Abstract: A new method for North Atlantic Right Whales (NARW) up-call detection using Multimodel Deep Learning (MMDL) is presented in this paper. In this approach, signals from passive acoustic sensors are first converted to spectrogram and scalogram images, which are time-frequency representations of the signals. These images are in turn used to train an MMDL detec-tor, consisting of Convolutional Neural N… ▽ More A new method for North Atlantic Right Whales (NARW) up-call detection using Multimodel Deep Learning (MMDL) is presented in this paper. In this approach, signals from passive acoustic sensors are first converted to spectrogram and scalogram images, which are time-frequency representations of the signals. These images are in turn used to train an MMDL detec-tor, consisting of Convolutional Neural Networks (CNNs) and Stacked Auto Encoders (SAEs). Our experimental studies revealed that CNNs work better with spectrograms and SAEs with sca-lograms. Therefore in our experimental design, the CNNs are trained by using spectrogram im-ages, and the SAEs are trained by using scalogram images. A fusion mechanism is used to fuse the results from individual neural networks. In this paper, the results obtained from the MMDL detector are compared with those obtained from conventional machine learning algorithms trained with handcraft features. It is shown that the performance of the MMDL detector is sig-nificantly better than those of the representative conventional machine learning methods in terms of up-call detection rate, non-up-call detection rate, and false alarm rate. △ Less

Submitted 17 May, 2020; originally announced May 2020.

arXiv:2001.00170 [pdf]

Residual Block-based Multi-Label Classification and Localization Network with Integral Regression for Vertebrae Labeling

Authors: Chunli Qin, Demin Yao, Han Zhuang, Hui Wang, Yonghong Shi, Zhijian Song

Abstract: Accurate identification and localization of the vertebrae in CT scans is a critical and standard preprocessing step for clinical spinal diagnosis and treatment. Existing methods are mainly based on the integration of multiple neural networks, and most of them use the Gaussian heat map to locate the vertebrae's centroid. However, the process of obtaining the vertebrae's centroid coordinates using h… ▽ More Accurate identification and localization of the vertebrae in CT scans is a critical and standard preprocessing step for clinical spinal diagnosis and treatment. Existing methods are mainly based on the integration of multiple neural networks, and most of them use the Gaussian heat map to locate the vertebrae's centroid. However, the process of obtaining the vertebrae's centroid coordinates using heat maps is non-differentiable, so it is impossible to train the network to label the vertebrae directly. Therefore, for end-to-end differential training of vertebra coordinates on CT scans, a robust and accurate automatic vertebral labeling algorithm is proposed in this study. Firstly, a novel residual-based multi-label classification and localization network is developed, which can capture multi-scale features, but also utilize the residual module and skip connection to fuse the multi-level features. Secondly, to solve the problem that the process of finding coordinates is non-differentiable and the spatial structure is not destructible, integral regression module is used in the localization network. It combines the advantages of heat map representation and direct regression coordinates to achieve end-to-end training, and can be compatible with any key point detection methods of medical image based on heat map. Finally, multi-label classification of vertebrae is carried out, which use bidirectional long short term memory (Bi-LSTM) to enhance the learning of long contextual information to improve the classification performance. The proposed method is evaluated on a challenging dataset and the results are significantly better than the state-of-the-art methods (mean localization error <3mm). △ Less

Submitted 1 January, 2020; originally announced January 2020.

Comments: 10 pages with 9 figures

arXiv:0908.1273 [pdf, other]

doi 10.1109/TIT.2011.2178152

A General Class of Throughput Optimal Routing Policies in Multi-hop Wireless Networks

Authors: Mohammad Naghshvar, Hairuo Zhuang, Tara Javidi

Abstract: This paper considers the problem of throughput optimal routing/scheduling in a multi-hop constrained queueing network with random connectivity whose special case includes opportunistic multi-hop wireless networks and input-queued switch fabrics. The main challenge in the design of throughput optimal routing policies is closely related to identifying appropriate and universal Lyapunov functions wit… ▽ More This paper considers the problem of throughput optimal routing/scheduling in a multi-hop constrained queueing network with random connectivity whose special case includes opportunistic multi-hop wireless networks and input-queued switch fabrics. The main challenge in the design of throughput optimal routing policies is closely related to identifying appropriate and universal Lyapunov functions with negative expected drift. The few well-known throughput optimal policies in the literature are constructed using simple quadratic or exponential Lyapunov functions of the queue backlogs and as such they seek to balance the queue backlogs across network independent of the topology. By considering a class of continuous, differentiable, and piece-wise quadratic Lyapunov functions, this paper provides a large class of throughput optimal routing policies. The proposed class of Lyapunov functions allow for the routing policy to control the traffic along short paths for a large portion of state-space while ensuring a negative expected drift. This structure enables the design of a large class of routing policies. In particular, and in addition to recovering the throughput optimality of the well known backpressure routing policy, an opportunistic routing policy with congestion diversity is proved to be throughput optimal. △ Less

Submitted 10 March, 2011; v1 submitted 10 August, 2009; originally announced August 2009.

Comments: 31 pages (one column), 8 figures, (revision submitted to IEEE Transactions on Information Theory)

MSC Class: 34D20

Showing 1–14 of 14 results for author: Zhuang, H