-
DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability
Authors:
Hyun Joon Park,
** Sob Kim,
Wooseok Shin,
Sung Won Han
Abstract:
Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but there are limitations in obtaining well-represented styles and improving model generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Based on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech. Key innovations include the differentiation of styles into time-invariant and time-variant categories for effective style extraction, as well as the design of encoders and adapters with high generalization ability. In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS. DEX-TTS yields outstanding performance in both objective and subjective evaluations on English multi-speaker and emotional multi-speaker datasets, without relying on pre-training strategies. Lastly, comparison results for general TTS on a single-speaker dataset verify the effectiveness of our enhanced diffusion backbone. Demos are available here.
Submitted 27 June, 2024;
originally announced June 2024.
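The DEX-TTS abstract names an "overlapping patchify" strategy for its DiT backbone but gives no details. As a rough illustration only, the sketch below contrasts overlapping and non-overlapping patch extraction on a mel-spectrogram. The `(freq, time)` layout, square patches, and the `patch`/`stride` values are assumptions rather than DEX-TTS settings, and the paper's convolution-frequency patch embedding is not reproduced.

```python
import numpy as np

def overlapping_patchify(mel: np.ndarray, patch: int, stride: int) -> np.ndarray:
    """Split a (freq, time) spectrogram into flattened patch x patch tiles.

    With stride < patch, neighbouring tiles share (patch - stride) rows/columns;
    a standard non-overlapping ViT/DiT patchify corresponds to stride == patch.
    """
    n_freq, n_time = mel.shape
    patches = []
    for f in range(0, n_freq - patch + 1, stride):
        for t in range(0, n_time - patch + 1, stride):
            patches.append(mel[f:f + patch, t:t + patch].reshape(-1))
    return np.stack(patches)  # shape: (num_patches, patch * patch)
```

For an 80 x 100 spectrogram, `patch=4, stride=2` gives 1911 overlapping tokens versus 500 with `stride=4`, so overlap trades a longer token sequence for context shared between neighbouring patches.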
-
LLM-driven Multimodal Target Volume Contouring in Radiation Oncology
Authors:
Yu** Oh,
Sangjoon Park,
Hwa Kyung Byun,
Yeona Cho,
Ik Jae Lee,
** Sung Kim,
Jong Chul Ye
Abstract:
Target volume contouring for radiation therapy is considered significantly more challenging than normal organ segmentation because it requires both image-based and text-based clinical information. Inspired by recent advances in large language models (LLMs) that can facilitate the integration of textual information and images, we present a novel LLM-driven multimodal AI, namely LLMSeg, that utilizes clinical text information and is applicable to the challenging task of target volume contouring for radiation therapy, and we validate it in the context of breast cancer radiation therapy target volume contouring. Using external validation and data-insufficient environments, attributes that are highly conducive to real-world applications, we demonstrate that the proposed model exhibits markedly improved performance compared to conventional unimodal AI models, particularly robust generalization performance and data efficiency. To the best of our knowledge, this is the first LLM-driven multimodal AI model that integrates clinical text information into target volume delineation for radiation oncology.
Submitted 15 April, 2024; v1 submitted 3 November, 2023;
originally announced November 2023.
-
K-SMPC: Koopman Operator-Based Stochastic Model Predictive Control for Enhanced Lateral Control of Autonomous Vehicles
Authors:
** Sung Kim,
Ying Shuai Quan,
Chung Choo Chung
Abstract:
This paper proposes Koopman operator-based Stochastic Model Predictive Control (K-SMPC) for enhanced lateral control of autonomous vehicles. The Koopman operator is a linear map representing nonlinear dynamics in an infinite-dimensional space. Thus, we use the Koopman operator to represent the nonlinear dynamics of a vehicle in dynamic lane-keeping situations. The Extended Dynamic Mode Decomposition (EDMD) method is adopted to approximate the Koopman operator in a finite-dimensional space for practical implementation. We consider the modeling error of the Koopman operator approximated by EDMD. Then, we design K-SMPC to handle the Koopman modeling error, treating the error as a probabilistic signal. The recursive feasibility of the proposed method is investigated with an explicit first-step state constraint by computing the robust control invariant set. A high-fidelity vehicle simulator, CarSim, is used to validate the proposed method in a comparative study. The results confirm that the proposed method outperforms other methods in tracking performance. Furthermore, the proposed method is observed to satisfy the given constraints and to be recursively feasible.
Submitted 9 December, 2023; v1 submitted 16 October, 2023;
originally announced October 2023.
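EDMD, as named in the K-SMPC abstract, fits a finite matrix K so that lifted snapshot pairs satisfy psi(x_{k+1}) ≈ K psi(x_k) in the least-squares sense. The minimal sketch below assumes a hypothetical dictionary `lift` (the 2-D state plus its quadratic monomials); it is not the paper's vehicle model or basis. The fit residual is the kind of modeling error the abstract says K-SMPC treats as a probabilistic signal.

```python
import numpy as np

def lift(x: np.ndarray) -> np.ndarray:
    """Hypothetical observable dictionary: state plus quadratic monomials."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x1 * x2, x2 * x2])

def edmd(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Least-squares Koopman approximation K with lift(Y[k]) ≈ K @ lift(X[k]).

    X, Y: (n_samples, 2) snapshot pairs where Y[k] is the successor of X[k].
    """
    PsiX = np.array([lift(x) for x in X])  # (m, N) lifted current states
    PsiY = np.array([lift(y) for y in Y])  # (m, N) lifted successor states
    # Solve PsiX @ K.T ≈ PsiY column by column in the least-squares sense
    KT, *_ = np.linalg.lstsq(PsiX, PsiY, rcond=None)
    return KT.T
```

For a linear test system the lifted dynamics close exactly in this dictionary, so the top-left 2 x 2 block of K recovers the system matrix; for nonlinear data the residual `PsiY - PsiX @ K.T` is generally nonzero.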
-
Uncertainty Quantification of Autoencoder-based Koopman Operator
Authors:
** Sung Kim,
Ying Shuai Quan,
Chung Choo Chung
Abstract:
This paper proposes a method for uncertainty quantification of an autoencoder-based Koopman operator. The main challenge in using the Koopman operator is designing the basis functions that lift the state. To this end, this paper builds an autoencoder that automatically searches for the optimal lifting basis functions under a given loss function. We approximate the Koopman operator in a finite-dimensional space with the autoencoder, but the approximated Koopman operator carries approximation uncertainty. To address this, we compute a robust positively invariant set for the approximated Koopman operator that accounts for the approximation error. Then, the decoder of the autoencoder is analyzed by robustness certification against the approximation error using the Lipschitz constant in the reconstruction phase. The forced Van der Pol model is used to show the validity of the proposed method. The numerical simulation results confirm that the trajectory of the true state stays within the uncertainty set centered at the reconstructed state.
Submitted 17 September, 2023;
originally announced September 2023.
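The abstract above certifies the decoder's robustness through its Lipschitz constant. For a feed-forward ReLU network, a standard (conservative) upper bound is the product of the layers' spectral norms, because ReLU is 1-Lipschitz. The sketch assumes a hypothetical ReLU MLP decoder with arbitrary weights; it is not the paper's network or certification procedure.

```python
import numpy as np

def relu(v: np.ndarray) -> np.ndarray:
    return np.maximum(v, 0.0)

def decoder(z, weights, biases):
    """Hypothetical MLP decoder: ReLU on hidden layers, linear output layer."""
    h = z
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

def lipschitz_upper_bound(weights) -> float:
    """Product of spectral norms (largest singular values) of all layers.

    Valid for the 2-norm because each ReLU is 1-Lipschitz; usually loose.
    """
    L = 1.0
    for W in weights:
        L *= np.linalg.norm(W, 2)
    return float(L)
```

If the latent approximation error is known to stay inside a 2-norm ball of radius r, the reconstructed-state error is then at most `lipschitz_upper_bound(weights) * r`.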
-
RNN Controller for Lane-Keeping Systems with Robustness and Safety Verification
Authors:
Ying Shuai Quan,
** Sung Kim,
Chung Choo Chung
Abstract:
This paper proposes a Recurrent Neural Network (RNN) controller for lane-keeping systems that effectively handles model uncertainties and disturbances. First, quadratic constraints cover the nonlinearities introduced by the RNN controller, and the linear fractional transformation method models the dynamics of system uncertainties. Second, we prove the robust stability of the lane-keeping system in the presence of uncertain vehicle speed using a linear matrix inequality. Then, we define a reachable set for the lane-keeping system. Finally, to confirm the safety of the lane-keeping system with a tracking error bound, we formulate a semidefinite program to approximate the outer set of the reachable set. Numerical experiments demonstrate that this approach certifies that the RNN controller is stabilizing and validates its safety on an untrained dataset with road curvatures not seen during training.
Submitted 15 September, 2023;
originally announced September 2023.
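The RNN abstract proves robust stability over an uncertain speed range with a linear matrix inequality. Searching for the Lyapunov matrix P normally requires an SDP solver, which is not used here; this numpy-only sketch instead illustrates the underlying condition by solving a Lyapunov equation for one toy model and checking whether the same P certifies other models. The matrices are illustrative values, and the paper's quadratic constraints for the RNN nonlinearities are omitted.

```python
import numpy as np

def solve_lyapunov(A: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Solve A^T P + P A = -Q via the Kronecker (column-major vec) form."""
    n = A.shape[0]
    I = np.eye(n)
    # vec(A^T P) = (I kron A^T) vec(P); vec(P A) = (A^T kron I) vec(P)
    M = np.kron(I, A.T) + np.kron(A.T, I)
    p = np.linalg.solve(M, -Q.reshape(-1, order="F"))
    P = p.reshape(n, n, order="F")
    return 0.5 * (P + P.T)  # symmetrize against round-off

def is_neg_def(S: np.ndarray, tol: float = 1e-9) -> bool:
    return bool(np.max(np.linalg.eigvalsh(0.5 * (S + S.T))) < -tol)

def common_lyapunov_check(vertices, P: np.ndarray) -> bool:
    """True if P > 0 and A_i^T P + P A_i < 0 for every vertex model A_i."""
    if np.min(np.linalg.eigvalsh(P)) <= 0:
        return False
    return all(is_neg_def(A.T @ P + P @ A) for A in vertices)
```

If A(theta) depends affinely on a scalar parameter theta over an interval, then A^T P + P A is affine in theta, so negative definiteness at the two endpoint models implies it for every theta in between.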
-
Deep Learning-based Synthetic High-Resolution In-Depth Imaging Using an Attachable Dual-element Endoscopic Ultrasound Probe
Authors:
Hah Min Lew,
Jae Seong Kim,
Moon Hwan Lee,
Jaegeun Park,
Sangyeon Youn,
Hee Man Kim,
Jihun Kim,
Jae Youn Hwang
Abstract:
Endoscopic ultrasound (EUS) imaging has a trade-off between resolution and penetration depth. Considering the in-vivo characteristics of human organs, it is necessary to provide clinicians with appropriate hardware specifications for precise diagnosis. Recently, super-resolution (SR) ultrasound imaging studies, including the SR task in deep learning fields, have been reported for enhancing ultrasound images. However, most of those studies did not consider the nature of ultrasound imaging; rather, they applied conventional SR techniques based on downsampling of ultrasound images. In this study, we propose a novel deep learning-based high-resolution in-depth imaging probe capable of offering low- and high-frequency ultrasound image pairs. We developed an attachable dual-element EUS probe with customized low- and high-frequency ultrasound transducers under small hardware constraints. We also designed a special geared structure so that both transducers image the same plane. The proposed system was evaluated with a wire phantom and a tissue-mimicking phantom. After the evaluation, 442 ultrasound image pairs were acquired from the tissue-mimicking phantom. We then applied several deep learning models to obtain synthetic high-resolution in-depth images, demonstrating the feasibility of our approach for unmet clinical needs. Furthermore, we quantitatively and qualitatively analyzed the results to find a suitable deep learning model for our task. The results demonstrate that our proposed dual-element EUS probe with an image-to-image translation network has the potential to provide synthetic high-frequency ultrasound images deep inside tissues.
Submitted 13 September, 2023;
originally announced September 2023.
-
Reachable Set-based Path Planning for Automated Vertical Parking System
Authors:
In Hyuk Oh,
Ju Won Seo,
** Sung Kim,
Chung Choo Chung
Abstract:
This paper proposes a local path planning method with a reachable set for Automated Vertical Parking Systems (APS). First, given a parking lot layout with a goal position, we define an intermediate pose from which the APS can accomplish reverse parking in a single maneuver, i.e., without changing gears. Then, we introduce a reachable set, i.e., the set of grid points of all possible intermediate poses. Once the APS approaches the goal position, it must select an intermediate pose in the reachable set. A minimization problem was formulated and solved to choose the intermediate pose. We tested various scenarios with different parking lot conditions. We used the Hybrid-A* algorithm for global path planning to move the vehicle from the starting pose to the intermediate pose and utilized clothoid-based local path planning to move from the intermediate pose to the goal pose. Additionally, we designed a controller to follow the generated path and validated its tracking performance. The root mean square (RMS) tracking error was confirmed to be bounded within 0.06 m for the lateral position and within 0.01 rad for the orientation.
Submitted 11 August, 2023;
originally announced August 2023.
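The parking abstract relies on clothoid segments for the final local path and reports an RMS tracking error. A clothoid is a curve whose curvature changes linearly with arc length, which lets steering change smoothly. The sketch below samples such a curve by numerical integration and computes an RMS error; the parameter names and the midpoint-rule integration are illustrative choices, not the paper's planner.

```python
import math

def clothoid_points(x0, y0, theta0, kappa0, sharpness, length, n=1000):
    """Sample a clothoid with curvature kappa(s) = kappa0 + sharpness * s.

    Heading: theta(s) = theta0 + kappa0*s + 0.5*sharpness*s**2; position is
    the integral of (cos theta, sin theta) over arc length (midpoint rule).
    """
    ds = length / n
    x, y = x0, y0
    pts = [(x, y)]
    for i in range(n):
        s_mid = (i + 0.5) * ds  # arc length at the middle of this step
        theta = theta0 + kappa0 * s_mid + 0.5 * sharpness * s_mid ** 2
        x += math.cos(theta) * ds
        y += math.sin(theta) * ds
        pts.append((x, y))
    return pts

def rms(errors):
    """Root mean square of a sequence of tracking errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

Setting `sharpness=0` reduces the clothoid to a circular arc (or a straight line when `kappa0=0` as well), which is a quick sanity check on the integration.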
-
Classification Method of Road Surface Condition and Type with LiDAR Using Spatiotemporal Information
Authors:
Ju Won Seo,
** Sung Kim,
Chung Choo Chung
Abstract:
This paper proposes a spatiotemporal architecture with a deep neural network (DNN) for classifying road surface conditions and types using LiDAR. LiDAR is known to provide information on reflectivity and the number of point clouds that depends on the road surface. Thus, this paper utilizes this information to classify the road surface. We divided the front road area into four subregions. First, we constructed feature vectors using each subregion's reflectivity, number of point clouds, and in-vehicle information. Second, the DNN classifies road surface conditions and types for each subregion. Finally, the DNN output feeds into a spatiotemporal process that makes the final classification, reflecting the vehicle speed and the class probabilities given by the softmax functions of the DNN output layer. To validate the effectiveness of the proposed method, we performed a comparative study with five other algorithms. With the proposed DNN, we obtained the highest accuracies, 98.0% and 98.6%, for the two subregions nearest the vehicle. In addition, we implemented the proposed method on the Jetson TX2 board to confirm that it is applicable in real time.
Submitted 11 August, 2023;
originally announced August 2023.
-
AD-YOLO: You Look Only Once in Training Multiple Sound Event Localization and Detection
Authors:
** Sob Kim,
Hyun Joon Park,
Wooseok Shin,
Sung Won Han
Abstract:
Sound event localization and detection (SELD) combines the identification of sound events with their corresponding directions of arrival (DOA). Recently, event-oriented track output formats have been adopted to solve this problem; however, they still generalize poorly to real-world problems in an unknown polyphony environment. To address this issue, we propose an angular-distance-based multiple SELD (AD-YOLO), an adaptation of the "You Only Look Once" algorithm for SELD. The AD-YOLO format allows the model to learn sound occurrences in a location-sensitive manner by assigning class responsibility to DOA predictions. Hence, the format enables the model to handle the polyphony problem regardless of the number of sound overlaps. We evaluated AD-YOLO on the DCASE 2020-2022 challenge Task 3 datasets using four SELD objective metrics. The experimental results show that AD-YOLO achieved outstanding overall performance and was also robust in class-homogeneous polyphony environments.
Submitted 10 May, 2023; v1 submitted 27 March, 2023;
originally announced March 2023.
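AD-YOLO assigns class responsibility to DOA predictions by angular distance. The sketch below shows the basic ingredients under stated assumptions: DOAs as unit vectors built from azimuth/elevation, the great-circle angle between them, and a hypothetical "nearest slot is responsible" rule over a few fixed prediction slots. The slot layout and the rule are illustrative stand-ins, not the paper's exact assignment scheme.

```python
import numpy as np

def doa_vector(azimuth_deg: float, elevation_deg: float) -> np.ndarray:
    """Unit DOA vector from azimuth/elevation given in degrees."""
    a, e = np.radians(azimuth_deg), np.radians(elevation_deg)
    return np.array([np.cos(e) * np.cos(a), np.cos(e) * np.sin(a), np.sin(e)])

def angular_distance(u, v) -> float:
    """Angle in radians between two DOA vectors (normalized internally)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards round-off

def responsible_slot(slot_doas, target_doa) -> int:
    """Hypothetical rule: the slot angularly closest to the ground-truth
    event becomes responsible for predicting that event's class."""
    dists = [angular_distance(d, target_doa) for d in slot_doas]
    return int(np.argmin(dists))
```

Because responsibility is decided by direction rather than by a fixed track index, two overlapping events of the same class can land on different slots when they arrive from different directions.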
-
TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion
Authors:
Hyun Joon Park,
Seok Woo Yang,
** Sob Kim,
Wooseok Shin,
Sung Won Han
Abstract:
Voice Conversion (VC) must be achieved while maintaining the content of the source speech and representing the characteristics of the target speaker. Existing methods do not satisfy both aspects simultaneously, and their conversion outputs suffer from a trade-off between maintaining source content and target characteristics. In this study, we propose Triple Adaptive Attention Normalization VC (TriAAN-VC), comprising an encoder-decoder and an attention-based adaptive normalization block, which can be applied to non-parallel any-to-any VC. The proposed adaptive normalization block extracts target speaker representations and achieves conversion while minimizing the loss of source content with a siamese loss. We evaluated TriAAN-VC on the VCTK dataset in terms of source content preservation and target speaker similarity. Experimental results for one-shot VC suggest that TriAAN-VC achieves state-of-the-art performance while mitigating the trade-off problem encountered in existing VC methods.
Submitted 15 March, 2023;
originally announced March 2023.
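TriAAN-VC's block is an attention-based variant of adaptive normalization. For orientation only, the sketch below implements plain adaptive instance normalization (AdaIN): source features are normalized per channel over time, then rescaled with the target speaker's per-channel statistics. TriAAN-VC's attention-weighted statistics are not reproduced, and the `(channels, time)` layout is an assumption.

```python
import numpy as np

def adain(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Plain AdaIN over the time axis.

    content: (channels, T_src) source-speech features.
    style:   (channels, T_tgt) target-speaker features (lengths may differ).
    """
    c_mean = content.mean(axis=1, keepdims=True)
    c_std = content.std(axis=1, keepdims=True) + eps  # eps avoids division by zero
    s_mean = style.mean(axis=1, keepdims=True)
    s_std = style.std(axis=1, keepdims=True)
    # Remove source channel statistics, then impose the target's statistics
    return (content - c_mean) / c_std * s_std + s_mean
```

Only first- and second-order channel statistics of the target are transferred, so the temporal structure (and thus much of the linguistic content) of the source is kept.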
-
Multi-View Attention Transfer for Efficient Speech Enhancement
Authors:
Wooseok Shin,
Hyun Joon Park,
** Sob Kim,
Byung Hoon Lee,
Sung Won Han
Abstract:
Recent deep learning models have achieved high performance in speech enhancement; however, it is still challenging to obtain a fast and low-complexity model without significant performance degradation. Previous knowledge distillation studies on speech enhancement could not solve this problem because their output distillation methods do not fit the speech enhancement task in some aspects. In this study, we propose multi-view attention transfer (MV-AT), a feature-based distillation method, to obtain efficient speech enhancement models in the time domain. Based on the multi-view feature extraction model, MV-AT transfers multi-view knowledge of the teacher network to the student network without additional parameters. The experimental results show that the proposed method consistently improved the performance of student models of various sizes on the Valentini and deep noise suppression (DNS) datasets. MANNER-S-8.1GF with our proposed method, a lightweight model for efficient deployment, achieved 15.4x and 4.71x fewer parameters and floating-point operations (FLOPs), respectively, than the baseline model with similar performance.
Submitted 30 October, 2022; v1 submitted 22 August, 2022;
originally announced August 2022.
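MV-AT is a feature-based attention transfer applied to multi-view features. As a generic illustration of attention transfer (in the style of activation-based attention maps, not MV-AT itself), the sketch below collapses each `(channels, time)` feature map into a unit-norm temporal attention vector and compares student and teacher maps with an L2 distance. Because only the time length must match, students with fewer channels need no projection layer, which mirrors the "no additional parameters" property claimed above. Layer pairing and `p=2` are assumptions.

```python
import numpy as np

def attention_map(features: np.ndarray, p: int = 2) -> np.ndarray:
    """Collapse (channels, time) features into a unit-norm temporal map."""
    a = np.sum(np.abs(features) ** p, axis=0)  # (time,) energy per frame
    return a / (np.linalg.norm(a) + 1e-12)

def attention_transfer_loss(student_feats, teacher_feats, p: int = 2) -> float:
    """Mean L2 distance between paired student/teacher attention maps.

    Channel counts may differ between networks; only time lengths must match.
    """
    losses = [np.linalg.norm(attention_map(s, p) - attention_map(t, p))
              for s, t in zip(student_feats, teacher_feats)]
    return float(np.mean(losses))
```

The normalization makes the loss insensitive to the overall activation scale, so the student is pushed to match where the teacher's energy is concentrated in time rather than its raw magnitudes.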
-
MANNER: Multi-view Attention Network for Noise Erasure
Authors:
Hyun Joon Park,
Byung Ha Kang,
Wooseok Shin,
** Sob Kim,
Sung Won Han
Abstract:
In the field of speech enhancement, time-domain methods have difficulty achieving both high performance and efficiency. Recently, dual-path models have been adopted to represent long sequential features, but they still have limited representations and poor memory efficiency. In this study, we propose the Multi-view Attention Network for Noise ERasure (MANNER), consisting of a convolutional encoder-decoder with a multi-view attention block, applied to time-domain signals. MANNER efficiently extracts three different representations from noisy speech and estimates high-quality clean speech. We evaluated MANNER on the VoiceBank-DEMAND dataset using five objective speech quality metrics. Experimental results show that MANNER achieves state-of-the-art performance while efficiently processing noisy speech.
Submitted 4 March, 2022;
originally announced March 2022.
-
Robust Control for Lane Keeping System Using Linear Parameter Varying Approach with Scheduling Variables Reduction
Authors:
Ying Shuai Quan,
** Sung Kim,
Chung Choo Chung
Abstract:
This paper presents a robust controller using a Linear Parameter Varying (LPV) model of the lane-keeping system with parameter reduction. Both varying vehicle speed and roll motion on a curved road influence lateral vehicle model parameters such as tire cornering stiffness. Thus, we use the LPV technique to account for parameter variations in the vehicle dynamics. However, multiple varying parameters lead to a large number of scheduling variables and cause massive computational complexity. To reduce this complexity, Principal Component Analysis (PCA)-based parameter reduction is performed to obtain a reduced model with a tighter convex set. We designed the LPV robust feedback controller using the reduced model by solving a set of Linear Matrix Inequalities (LMIs). The effectiveness of the proposed system is validated with full vehicle dynamics from CarSim on an interchange road. The simulation confirms that the proposed method substantially reduces the lateral offset error compared with other controllers based on Linear Time-Invariant (LTI) systems.
Submitted 4 May, 2021; v1 submitted 3 May, 2021;
originally announced May 2021.
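The LPV abstract reduces the number of scheduling variables with PCA. The sketch below shows the general idea under assumptions: sampled parameter vectors (hypothetical, e.g., speed-dependent terms) are projected onto their leading principal directions, and a bounding box in the reduced coordinates defines the polytope vertices at which LMIs would be imposed. With p original parameters the box has 2^p vertices; with k < p components it has only 2^k.

```python
from itertools import product

import numpy as np

def pca_reduce(samples: np.ndarray, n_components: int):
    """PCA via SVD of centered samples (N, p).

    Returns the mean, principal directions (p, k), reduced coordinates (N, k),
    and the fraction of variance explained by the kept components.
    """
    mean = samples.mean(axis=0)
    centered = samples - mean
    _, svals, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components].T
    reduced = centered @ basis
    explained = float((svals[:n_components] ** 2).sum() / (svals ** 2).sum())
    return mean, basis, reduced, explained

def box_vertices(reduced: np.ndarray):
    """All 2^k corners of the axis-aligned box enclosing the reduced samples."""
    lo, hi = reduced.min(axis=0), reduced.max(axis=0)
    return [np.array(v) for v in product(*zip(lo, hi))]
```

A parameter vector is then approximated as `mean + basis @ rho`, so a controller scheduled on the reduced coordinates `rho` only needs to be verified at the reduced box's corners.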
-
Intentional Deep Overfit Learning (IDOL): A Novel Deep Learning Strategy for Adaptive Radiation Therapy
Authors:
Jaehee Chun,
Justin C. Park,
Sven Olberg,
You Zhang,
Dan Nguyen,
**g Wang,
** Sung Kim,
Steve Jiang
Abstract:
In this study, we propose a tailored deep learning (DL) framework for patient-specific performance that leverages the behavior of a model intentionally overfitted to a patient-specific training dataset augmented from the prior information available in an adaptive radiation therapy (ART) workflow, an approach we term Intentional Deep Overfit Learning (IDOL). Implementing the IDOL framework in any radiotherapy task consists of two training stages: 1) training a generalized model on a diverse training dataset of N patients, just as in the conventional DL approach, and 2) intentionally overfitting this general model to a small training dataset specific to the patient of interest (N+1), generated through perturbations and augmentations of the available task- and patient-specific prior information, to establish a personalized IDOL model. The IDOL framework itself is task-agnostic and thus widely applicable to many components of the ART workflow, three of which we use as a proof of concept here: the auto-contouring task on re-planning CTs for traditional ART, the MRI super-resolution (SR) task for MRI-guided ART, and the synthetic CT (sCT) reconstruction task for MRI-only ART. In the re-planning CT auto-contouring task, the accuracy measured by the Dice similarity coefficient improves from 0.847 with the general model to 0.935 with the IDOL model. In the case of MRI SR, the mean absolute error (MAE) is improved by 40% using the IDOL framework over the conventional model. Finally, in the sCT reconstruction task, the MAE is reduced from 68 to 22 HU by utilizing the IDOL framework.
Submitted 22 April, 2021;
originally announced April 2021.
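The IDOL abstract reports results with two standard metrics: the Dice similarity coefficient for contours and the mean absolute error (MAE) for SR and sCT. The sketch below gives textbook definitions on binary masks and intensity arrays; treating two empty masks as Dice 1.0 is a convention assumed here, and the paper's evaluation pipeline is not reproduced.

```python
import numpy as np

def dice(pred, target) -> float:
    """Dice similarity coefficient 2|A and B| / (|A| + |B|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention: both masks empty means perfect agreement
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

def mae(a, b) -> float:
    """Mean absolute error, e.g., in HU for synthetic CT versus reference CT."""
    return float(np.mean(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))))
```

As a worked check on the numbers above, the sCT MAE drop from 68 to 22 HU is a relative reduction of (68 - 22) / 68 ≈ 68%.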
-
Continuous body 3-D reconstruction of limbless animals
Authors:
Qiyuan Fu,
Thomas W. Mitchel,
** Seob Kim,
Gregory S. Chirikjian,
Chen Li
Abstract:
Limbless animals such as snakes, limbless lizards, worms, eels, and lampreys move their slender, long bodies in three dimensions to traverse diverse environments. Accurately quantifying the 3-D shape and motion of their continuous bodies is important for understanding body-environment interactions in complex terrain, but this is difficult to achieve (especially for local orientation and rotation). Here, we describe an interpolation method to quantify continuous body 3-D position and orientation. We simplify the body as an elastic rod and apply a backbone optimization method to interpolate the continuous body shape between end constraints imposed by tracked markers. Despite over-simplifying the biomechanics, our method achieves higher interpolation accuracy (~50% less error) in both 3-D position and orientation than the widely used cubic B-spline interpolation method. Beyond the snakes traversing large obstacles demonstrated here, our method applies to other long, slender, limbless animals and continuum robots. We provide code and demo files for easy application of our method.
Submitted 8 March, 2021;
originally announced March 2021.
-
Lateral oscillation and body compliance help snakes and snake robots stably traverse large, smooth obstacles
Authors:
Qiyuan Fu,
Sean W. Gart,
Thomas W. Mitchel,
** Seob Kim,
Gregory S. Chirikjian,
Chen Li
Abstract:
Snakes can move through almost any terrain. Similarly, snake robots hold promise as a versatile platform to traverse complex environments like earthquake rubble. Unlike snake locomotion on flat surfaces, which is inherently stable, when snakes traverse complex terrain by deforming their body out of plane, it becomes challenging to maintain stability. Here, we review our recent progress in understanding how snakes and snake robots traverse large, smooth obstacles that lack anchor points for gripping or bracing. First, we discovered that the generalist variable kingsnake combines lateral oscillation and cantilevering. Regardless of step height and surface friction, the overall gait is preserved. Next, to quantify the static stability of the snake, we developed a method to interpolate the continuous body in three dimensions (both position and orientation) between discrete tracked markers. By analyzing the base of support using the interpolated continuous body 3-D kinematics, we discovered that the snake maintained perfect stability during traversal, even on the most challenging low-friction, high step. Finally, we applied this gait to a snake robot and systematically tested its performance traversing large steps with variable heights to further understand stability principles. The robot rapidly and stably traversed steps nearly as high as a third of its body length. As step height increased, the robot rolled more frequently, to the extent of flipping over, reducing traversal probability. The absence of such failure in the snake, with its compliant body, inspired us to add body compliance to the robot. With better surface contact, the compliant-body robot suffered less roll instability and traversed high steps with higher probability, without sacrificing traversal speed. Our robot traversed large step-like obstacles more rapidly than most previous snake robots, approaching the speed of the animal.
Submitted 30 March, 2020;
originally announced March 2020.
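The snake abstract assesses static stability by analyzing the base of support. A common formulation, assumed here rather than taken from the paper, is that a body is statically stable when the vertical projection of its center of mass lies inside the convex hull of its ground contact points. The sketch computes that hull with Andrew's monotone chain and returns a signed margin (positive inside, negative outside); contact coordinates are illustrative.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def stability_margin(contacts_xy, com_xy) -> float:
    """Signed distance from the projected COM to the nearest support-polygon
    edge line: positive inside (statically stable), negative outside."""
    hull = convex_hull(contacts_xy)
    if len(hull) < 3:
        return -np.inf  # collinear or too few contacts: no area of support
    c = np.asarray(com_xy, dtype=float)
    margin = np.inf
    for i in range(len(hull)):
        a = np.asarray(hull[i], dtype=float)
        b = np.asarray(hull[(i + 1) % len(hull)], dtype=float)
        edge = b - a
        # For a counter-clockwise polygon, the inward normal is edge rotated +90 deg
        normal = np.array([-edge[1], edge[0]]) / np.linalg.norm(edge)
        margin = min(margin, float(np.dot(c - a, normal)))
    return margin
```

Tracking this margin along a traversal gives a single number per frame whose sign indicates whether the current contacts alone could support the body without tipping.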