Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Authors: Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria

Abstract: Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on a large set of datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt. Our hypothesis is focusing on how these aspects of audio generation could improve audio generation performance in the presence of limited data. As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic- and manual-evaluation metrics.

Submitted 16 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Comments: https://github.com/declare-lab/tango

arXiv:2401.11095 [pdf, other]

doi 10.1145/3643834.3661556

SoundShift: Exploring Sound Manipulations for Accessible Mixed-Reality Awareness

Authors: Ruei-Che Chang, Chia-Sheng Hung, Bing-Yu Chen, Dhruv Jain, Anhong Guo

Abstract: Mixed-reality (MR) soundscapes blend real-world sound with virtual audio from hearing devices, presenting intricate auditory information that is hard to discern and differentiate. This is particularly challenging for blind or visually impaired individuals, who rely on sounds and descriptions in their everyday lives. To understand how complex audio information is consumed, we analyzed online forum posts within the blind community, identifying prevailing challenges, needs, and desired solutions. We synthesized the results and propose SoundShift for increasing MR sound awareness, which includes six sound manipulations: Transparency Shift, Envelope Shift, Position Shift, Style Shift, Time Shift, and Sound Append. To evaluate the effectiveness of SoundShift, we conducted a user study with 18 blind participants across three simulated MR scenarios, where participants identified specific sounds within intricate soundscapes. We found that SoundShift increased MR sound awareness and minimized cognitive load. Finally, we developed three real-world example applications to demonstrate the practicality of SoundShift.

Submitted 26 May, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

Comments: DIS 2024

arXiv:2305.05139

Temporal Convolution Network Based Onset Detection and Query by Humming System Design

Authors: Yu Cheng Hung, Jian-Jiun Ding

Abstract: Onsets are a key factor to split audio into several notes. In this paper, we ensemble multiple temporal convolution network (TCN) based model and utilize a restricted frequency range spectrogram to achieve more robust onset detection. Different from the present onset detection of QBH system which is only available in a clean scenario, our proposal of onset detection and speech enhancement can prevent noise from affecting onset detection function (ODF). Compared to the CNN model which exploits spatial features of the spectrogram, the TCN model exploits both spatial and temporal features of the spectrogram. As the usage of QBH in noisy scenarios, we apply the TCN-based speech enhancement as a preprocessor of QBH. With the combinations of TCN-based speech enhancement and onset detection, simulations show that the proposal can enable the QBH system in both noisy and clean circumstances with short response time.

Submitted 7 June, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

Comments: This paper has been withdrawn by the author due to a crucial definition of probability threshold and several grammar and vocabulary mistakes

arXiv:2305.03982 [pdf]

Pitch Estimation by Denoising Preprocessor and Hybrid Estimation Model

Authors: Yu Cheng Hung, ** Hung Chen, Jian Jiun Ding

Abstract: Pitch estimation is to estimate the fundamental frequency and the midi number and plays a critical role in music signal analysis and vocal signal processing. In this work, we proposed a new architecture based on a learning-based enhancement preprocessor and a combination of several traditional and deep learning pitch estimation methods to achieve better pitch estimation performance in both noisy and clean scenarios. We test 17 different types of noise and 4 SNRdb noise levels. The results show that the proposed pitch estimation can perform better in both noisy and clean scenarios with short response time.

Submitted 6 May, 2023; originally announced May 2023.

Comments: From ICCE-Taiwan

arXiv:2211.14986 [pdf]

An Unpaired Cross-modality Segmentation Framework Using Data Augmentation and Hybrid Convolutional Networks for Segmenting Vestibular Schwannoma and Cochlea

Authors: Yuzhou Zhuang, Hong Liu, Enmin Song, Coskun Cetinkaya, Chih-Cheng Hung

Abstract: The crossMoDA challenge aims to automatically segment the vestibular schwannoma (VS) tumor and cochlea regions of unlabeled high-resolution T2 scans by leveraging labeled contrast-enhanced T1 scans. The 2022 edition extends the segmentation task by including multi-institutional scans. In this work, we proposed an unpaired cross-modality segmentation framework using data augmentation and hybrid convolutional networks. Considering heterogeneous distributions and various image sizes for multi-institutional scans, we apply the min-max normalization for scaling the intensities of all scans between -1 and 1, and use the voxel size resampling and center cropping to obtain fixed-size sub-volumes for training. We adopt two data augmentation methods for effectively learning the semantic information and generating realistic target domain scans: generative and online data augmentation. For generative data augmentation, we use CUT and CycleGAN to generate two groups of realistic T2 volumes with different details and appearances for supervised segmentation training. For online data augmentation, we design a random tumor signal reducing method for simulating the heterogeneity of VS tumor signals. Furthermore, we utilize an advanced hybrid convolutional network with multi-dimensional convolutions to adaptively learn sparse inter-slice information and dense intra-slice information for accurate volumetric segmentation of VS tumor and cochlea regions in anisotropic scans. On the crossMoDA2022 validation dataset, our method produces promising results and achieves the mean DSC values of 72.47% and 76.48% and ASSD values of 3.42 mm and 0.53 mm for VS tumor and cochlea regions, respectively.

Submitted 27 November, 2022; originally announced November 2022.

Comments: Accepted by BrainLes MICCAI proceedings

arXiv:2011.05755 [pdf, other]

Cryo-RALib -- a modular library for accelerating alignment in cryo-EM

Authors: Szu-Chi Chung, Cheng-Yu Hung, Huei-Lun Siao, Hung-Yi Wu, Wei-Hau Chang, I-** Tu

Abstract: Thanks to automated cryo-EM and GPU-accelerated processing, single-particle cryo-EM has become a rapid structure determination method that permits capture of dynamical structures of molecules in solution, which has been recently demonstrated by the determination of COVID-19 spike protein in March, shortly after its breakout in late January 2020. This rapidity is critical for vaccine development in response to emerging pandemic. This explains why a 2D classification approach based on multi-reference alignment (MRA) is not as popular as the Bayesian-based approach despite that the former has advantage in differentiating structural variations under low signal-to-noise ratio. This is perhaps because that MRA is a time-consuming process and a modular GPU-acceleration library for MRA is lacking. Here, we introduce a library called Cryo-RALib that expands the functionality of CUDA library used by GPU ISAC. It contains a GPU-accelerated MRA routine for accelerating MRA-based classification algorithms. In addition, we connect the cryo-EM image analysis with the python data science stack so as to make it easier for users to perform data analysis and visualization. Benchmarking on the TaiWan Computing Cloud (TWCC) container shows that our implementation can accelerate the computation by one order of magnitude. The library is available at https://github.com/phonchi/Cryo-RAlib.

Submitted 25 February, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

arXiv:2003.12175 [pdf, other]

Incremental Learning Algorithm for Sound Event Detection

Authors: Eunjeong Koh, Fatemeh Saki, Yinyi Guo, Cheng-Yu Hung, Erik Visser

Abstract: This paper presents a new learning strategy for the Sound Event Detection (SED) system to tackle the issues of i) knowledge migration from a pre-trained model to a new target model and ii) learning new sound events without forgetting the previously learned ones without re-training from scratch. In order to migrate the previously learned knowledge from the source model to the target one, a neural adapter is employed on the top of the source model. The source model and the target model are merged via this neural adapter layer. The neural adapter layer facilitates the target model to learn new sound events with minimal training data and maintaining the performance of the previously learned sound events similar to the source model. Our extensive analysis on the DCASE16 and US-SED dataset reveals the effectiveness of the proposed method in transferring knowledge between source and target models without introducing any performance degradation on the previously learned sound events while obtaining a competitive detection performance on the newly learned sound events.

Submitted 26 March, 2020; originally announced March 2020.

Comments: IEEE ICME 2020 Camera Ready Version

Journal ref: IEEE ICME 2020

arXiv:1905.08413 [pdf]

Dual-branch residual network for lung nodule segmentation

Authors: Haichao Cao, Hong Liu, Enmin Song, Chih-Cheng Hung, Guangzhi Ma, Xiangyang Xu, Renchao **, Jianguo Lu

Abstract: An accurate segmentation of lung nodules in computed tomography (CT) images is critical to lung cancer analysis and diagnosis. However, due to the variety of lung nodules and the similarity of visual characteristics between nodules and their surroundings, a robust segmentation of nodules becomes a challenging problem. In this study, we propose the Dual-branch Residual Network (DB-ResNet) which is a data-driven model. Our approach integrates two new schemes to improve the generalization capability of the model: 1) the proposed model can simultaneously capture multi-view and multi-scale features of different nodules in CT images; 2) we combine the features of the intensity and the convolution neural networks (CNN). We propose a pooling method, called the central intensity-pooling layer (CIP), to extract the intensity features of the center voxel of the block, and then use the CNN to obtain the convolutional features of the center voxel of the block. In addition, we designed a weighted sampling strategy based on the boundary of nodules for the selection of those voxels using the weighting score, to increase the accuracy of the model. The proposed method has been extensively evaluated on the LIDC dataset containing 986 nodules. Experimental results show that the DB-ResNet achieves superior segmentation performance with an average dice score of 82.74% on the dataset. Moreover, we compared our results with those of four radiologists on the same dataset. The comparison showed that our average dice score was 0.49% higher than that of human experts. This proves that our proposed method is as good as the experienced radiologist.

Submitted 20 May, 2019; originally announced May 2019.

Comments: 24 pages, 6 figures

arXiv:1905.03445 [pdf]

Two-Stage Convolutional Neural Network Architecture for Lung Nodule Detection

Authors: Haichao Cao, Hong Liu, Enmin Song, Guangzhi Ma, Xiangyang Xu, Renchao **, Tengying Liu, Chih-Cheng Hung

Abstract: Early detection of lung cancer is an effective way to improve the survival rate of patients. It is a critical step to have accurate detection of lung nodules in computed tomography (CT) images for the diagnosis of lung cancer. However, due to the heterogeneity of the lung nodules and the complexity of the surrounding environment, robust nodule detection has been a challenging task. In this study, we propose a two-stage convolutional neural network (TSCNN) architecture for lung nodule detection. The CNN architecture in the first stage is based on the improved UNet segmentation network to establish an initial detection of lung nodules. Simultaneously, in order to obtain a high recall rate without introducing excessive false positive nodules, we propose a novel sampling strategy, and use the offline hard mining idea for training and prediction according to the proposed cascaded prediction method. The CNN architecture in the second stage is based on the proposed dual pooling structure, which is built into three 3D CNN classification networks for false positive reduction. Since the network training requires a significant amount of training data, we adopt a data augmentation method based on random mask. Furthermore, we have improved the generalization ability of the false positive reduction model by means of ensemble learning. The proposed method has been experimentally verified on the LUNA dataset. Experimental results show that the proposed TSCNN architecture can obtain competitive detection performance.

Submitted 9 May, 2019; originally announced May 2019.

Comments: 29 pages, 10 figures

arXiv:1903.07164 [pdf, ps, other]

Linearly Constrained Smoothing Group Sparsity Solvers in Off-grid Model

Authors: Cheng-Yu Hung, Mostafa Kaveh

Abstract: In compressed sensing, the sensing matrix is assumed perfectly known. However, there exists perturbation in the sensing matrix in reality due to sensor offsets or noise disturbance. Directions-of-arrival (DoA) estimation with off-grid effect satisfies this situation, and can be formulated into a (non)convex optimization problem with linear inequalities constraints, which can be solved by the interior point method (using the CVX tools), but at a large computational cost. In this work, in order to design efficient algorithms, we consider various alternative formulations, such as unconstrained formulation, primal-dual formulation, or conic formulation to develop group-sparsity promoted solvers. First, the consensus alternating direction method of multipliers (C-ADMM) is applied. Then, iterative algorithms for the BPDN formulation is proposed by combining the Nesterov smoothing technique with accelerated proximal gradient method, and the convergence analysis of the method is conducted as well. We also developed a variant of EGT (Excessive Gap Technique)-based primal-dual method to systematically reduce the smoothing parameter sequentially. Finally, we propose algorithms for quadratically constrained L2-L1 mixed norm minimization problem by using the smoothed dual conic optimization (SDCO) and continuation technique. The performance of accuracy and convergence for all the proposed methods are demonstrated in the numerical simulations.

Submitted 3 June, 2019; v1 submitted 17 March, 2019; originally announced March 2019.

arXiv:1903.07158 [pdf, ps, other]

Joint Block Low Rank and Sparse Matrix Recovery in Array Self-Calibration Off-Grid DoA Estimation

Authors: Cheng-Yu Hung, Mostafa Kaveh

Abstract: This letter addresses the estimation of directions-of-arrival (DoA) by a sensor array using a sparse model in the presence of array calibration errors and off-grid directions. The received signal utilizes previously used models for unknown errors in calibration and structured linear representation of the off-grid effect. A convex optimization problem is formulated with an objective function to promote two-layer joint block-sparsity with its second-order cone programming (SOCP) representation. The performance of the proposed method is demonstrated by numerical simulations and compared with the Cramer-Rao Bound (CRB), and several previously proposed methods.

Submitted 3 June, 2019; v1 submitted 17 March, 2019; originally announced March 2019.

arXiv:1712.05890 [pdf, ps, other]

Low Rank Matrix Recovery for Joint Array Self-Calibration and Sparse Model DoA Estimation

Authors: Cheng-Yu Hung, Mostafa Kaveh

Abstract: In this work, combined calibration and DoA estimation is approached as an extension of the formulation for the Single Measurement Vector (SMV) model of self-calibration to the Multiple Measurement Model (MMV) case. By taking advantage of multiple snapshots, a modified nuclear norm minimization problem is proposed to recover a low-rank larger dimension matrix. We also give the definition of a linear operator for the MMV model, and give its corresponding matrix representation to generate a variant of a convex optimization problem. In order to mitigate the computational complexity of the approach, singular value decomposition (SVD) is applied to reduce the problem size. The performance of the proposed methods are demonstrated by numerical simulations.

Submitted 15 December, 2017; originally announced December 2017.

Showing 1–12 of 12 results for author: Hung, C