-
Dual-sided Peltier Elements for Rapid Thermal Feedback in Wearables
Authors:
Seongjun Kang,
Gwangbin Kim,
Seokhyun Hwang,
Jeongju Park,
Ahmed Elsharkawy,
SeungJun Kim
Abstract:
This paper introduces a motor-driven Peltier device designed to deliver immediate thermal sensations within extended reality (XR) environments. The system incorporates eight motor-driven Peltier elements, facilitating swift transitions between warm and cool sensations by rotating preheated or cooled elements to opposite sides. A multi-layer structure, comprising aluminum and silicone layers, ensures user comfort and safety while maintaining optimal temperatures for thermal stimuli. Time-temperature characteristic analysis demonstrates the system's ability to provide warm and cool sensations efficiently, with a dual-sided lifetime of up to 206 seconds at a 2V input. Our system design is adaptable to various body parts and can be synchronized with corresponding visual stimuli to enhance the immersive sensation of virtual object interaction and information delivery.
Submitted 20 May, 2024;
originally announced May 2024.
-
WaveDH: Wavelet Sub-bands Guided ConvNet for Efficient Image Dehazing
Authors:
Seongmin Hwang,
Daeyoung Han,
Cheolkon Jung,
Moongu Jeon
Abstract:
The surge in interest regarding image dehazing has led to notable advancements in deep learning-based single image dehazing approaches, exhibiting impressive performance in recent studies. Despite these strides, many existing methods fall short in meeting the efficiency demands of practical applications. In this paper, we introduce WaveDH, a novel and compact ConvNet designed to address this efficiency gap in image dehazing. Our WaveDH leverages wavelet sub-bands for guided up-and-downsampling and frequency-aware feature refinement. The key idea lies in utilizing wavelet decomposition to extract low-and-high frequency components from feature levels, allowing for faster processing while upholding high-quality reconstruction. The downsampling block employs a novel squeeze-and-attention scheme to optimize the feature downsampling process in a structurally compact manner through wavelet domain learning, preserving discriminative features while discarding noise components. In our upsampling block, we introduce a dual-upsample and fusion mechanism to enhance high-frequency component awareness, aiding in the reconstruction of high-frequency details. Departing from conventional dehazing methods that treat low-and-high frequency components equally, our feature refinement block strategically processes features with a frequency-aware approach. By employing a coarse-to-fine methodology, it not only refines the details at frequency levels but also significantly optimizes computational costs. The refinement is performed in a maximum 8x downsampled feature space, striking a favorable efficiency-vs-accuracy trade-off. Extensive experiments demonstrate that our method, WaveDH, outperforms many state-of-the-art methods on several image dehazing benchmarks with significantly reduced computational costs. Our code is available at https://github.com/AwesomeHwang/WaveDH.
Submitted 1 April, 2024;
originally announced April 2024.
-
Control Barrier Functions for Linear Continuous-Time Input-Delay Systems with Limited-Horizon Previewable Disturbances
Authors:
Tarun Pati,
Seunghoon Hwang,
Sze Zheng Yong
Abstract:
Cyber-physical and autonomous systems are often equipped with mechanisms that provide predictions/projections of future disturbances, e.g., road curvatures, commonly referred to as preview or lookahead, but this preview information is typically not leveraged in the context of deriving control barrier functions (CBFs) for safety. This paper proposes a novel limited preview control barrier function (LPrev-CBF) that avoids both ends of the spectrum, where on one end, the standard CBF approach treats the (previewable) disturbances simply as worst-case adversarial signals and on the other end, a recent Prev-CBF approach assumes that the disturbances are previewable and known for the entire future. Moreover, our approach applies to input-delay systems and has recursive feasibility guarantees since we explicitly take input constraints/bounds into consideration. Thus, our approach provides strong safety guarantees in a less conservative manner than standard CBF approaches while considering a more realistic setting with limited preview and input delays.
Submitted 7 March, 2024;
originally announced March 2024.
-
CongNaMul: A Dataset for Advanced Image Processing of Soybean Sprouts
Authors:
Byunghyun Ban,
Donghun Ryu,
Su-won Hwang
Abstract:
We present 'CongNaMul', a comprehensive dataset designed for various tasks in soybean sprouts image analysis. The CongNaMul dataset is curated to facilitate tasks such as image classification, semantic segmentation, decomposition, and measurement of length and weight. The classification task provides four classes to determine the quality of soybean sprouts: normal, broken, spotted, and broken and spotted, for the development of AI-aided automatic quality inspection technology. For semantic segmentation, images with varying complexity, from single sprout images to images with multiple sprouts, along with human-labelled mask images, are included. The label has 4 different classes: background, head, body, tail. The dataset also provides images and masks for the image decomposition task, including two separate sprout images and their combined form. Lastly, 5 physical features of sprouts (head length, body length, body thickness, tail length, weight) are provided for image-based measurement tasks. This dataset is expected to be a valuable resource for a wide range of research and applications in the advanced analysis of images of soybean sprouts. Also, we hope that this dataset can assist researchers studying classification, semantic segmentation, decomposition, and physical feature measurement in other industrial fields, in evaluating their models. The dataset is available at the authors' repository. (https://bhban.kr/data)
Submitted 30 August, 2023; v1 submitted 29 August, 2023;
originally announced August 2023.
-
ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models
Authors:
Minki Kang,
Wooseok Han,
Sung Ju Hwang,
Eunho Yang
Abstract:
Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional TTS for seen speakers during training, without consideration of the generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are at https://ZET-Speech.github.io/ZET-Speech-Demo/.
Submitted 23 May, 2023;
originally announced May 2023.
-
Evidence-empowered Transfer Learning for Alzheimer's Disease
Authors:
Kai Tzu-iunn Ong,
Hana Kim,
Minjin Kim,
Jinseong Jang,
Beomseok Sohn,
Yoon Seong Choi,
Dosik Hwang,
Seong Jae Hwang,
Jinyoung Yeo
Abstract:
Transfer learning has been widely utilized to mitigate the data scarcity problem in the field of Alzheimer's disease (AD). Conventional transfer learning relies on re-using models trained on AD-irrelevant tasks such as natural image classification. However, it often leads to negative transfer due to the discrepancy between the non-medical source and target medical domains. To address this, we present evidence-empowered transfer learning for AD diagnosis. Unlike conventional approaches, we leverage an AD-relevant auxiliary task, namely morphological change prediction, without requiring additional MRI data. In this auxiliary task, the diagnosis model learns the evidential and transferable knowledge from morphological features in MRI scans. Experimental results demonstrate that our framework is not only effective in improving detection performance regardless of model capacity, but also more data-efficient and faithful.
Submitted 17 April, 2023; v1 submitted 2 March, 2023;
originally announced March 2023.
-
Source-free Subject Adaptation for EEG-based Visual Recognition
Authors:
Pilhyeon Lee,
Seogkyu Jeon,
Sunhee Hwang,
Minjung Shin,
Hyeran Byun
Abstract:
This paper focuses on subject adaptation for EEG-based visual recognition. It aims at building a visual stimuli recognition system customized for the target subject whose EEG samples are limited, by transferring knowledge from abundant data of source subjects. Existing approaches consider the scenario that samples of source subjects are accessible during training. However, it is often infeasible and problematic to access personal biological data like EEG signals due to privacy issues. In this paper, we introduce a novel and practical problem setup, namely source-free subject adaptation, where the source subject data are unavailable and only the pre-trained model parameters are provided for subject adaptation. To tackle this challenging problem, we propose classifier-based data generation to simulate EEG samples from source subjects using classifier responses. Using the generated samples and target subject data, we perform subject-independent feature learning to exploit the common knowledge shared across different subjects. Notably, our framework is generalizable and can adopt any subject-independent learning method. In the experiments on the EEG-ImageNet40 benchmark, our model brings consistent improvements regardless of the choice of subject-independent learning. Also, our method shows promising performance, recording top-1 test accuracy of 74.6% under the 5-shot setting even without relying on source data. Our code can be found at https://github.com/DeepBCI/Deep-BCI/tree/master/1_Intelligent_BCI/Source_Free_Subject_Adaptation_for_EEG.
Submitted 20 January, 2023;
originally announced January 2023.
-
Interpretable Diabetic Retinopathy Diagnosis based on Biomarker Activation Map
Authors:
Pengxiao Zang,
Tristan T. Hormel,
Jie Wang,
Yukun Guo,
Steven T. Bailey,
Christina J. Flaxel,
David Huang,
Thomas S. Hwang,
Yali Jia
Abstract:
Deep learning classifiers provide the most accurate means of automatically diagnosing diabetic retinopathy (DR) based on optical coherence tomography (OCT) and its angiography (OCTA). The power of these models is attributable in part to the inclusion of hidden layers that provide the complexity required to achieve a desired task. However, hidden layers also render algorithm outputs difficult to interpret. Here we introduce a novel biomarker activation map (BAM) framework based on generative adversarial learning that allows clinicians to verify and understand classifiers' decision-making. A data set including 456 macular scans was graded as non-referable or referable DR based on current clinical standards. A DR classifier that was used to evaluate our BAM was first trained based on this data set. The BAM generation framework was designed by combining two U-shaped generators to provide meaningful interpretability to this classifier. The main generator was trained to take referable scans as input and produce an output that would be classified by the classifier as non-referable. The BAM is then constructed as the difference image between the output and input of the main generator. To ensure that the BAM only highlights classifier-utilized biomarkers, an assistant generator was trained to do the opposite, producing scans that would be classified as referable by the classifier from non-referable scans. The generated BAMs highlighted known pathologic features including nonperfusion area and retinal fluid. A fully interpretable classifier based on these highlights could help clinicians better utilize and verify automated DR diagnosis.
Submitted 26 June, 2023; v1 submitted 12 December, 2022;
originally announced December 2022.
-
Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models
Authors:
Minki Kang,
Dongchan Min,
Sung Ju Hwang
Abstract:
There has been significant progress in Text-To-Speech (TTS) synthesis technology in recent years, thanks to the advancement in neural generative modeling. However, existing methods for any-speaker adaptive TTS have achieved unsatisfactory performance, due to their suboptimal accuracy in mimicking the target speakers' styles. In this work, we present Grad-StyleSpeech, an any-speaker adaptive TTS framework based on a diffusion model that can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech. Grad-StyleSpeech significantly outperforms recent speaker-adaptive TTS baselines on English benchmarks. Audio samples are available at https://nardien.github.io/grad-stylespeech-demo.
Submitted 13 March, 2023; v1 submitted 17 November, 2022;
originally announced November 2022.
-
StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation
Authors:
Dongchan Min,
Minyoung Song,
Eunji Ko,
Sung Ju Hwang
Abstract:
We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflect the given audio. This is made possible with several newly devised components: 1) a contrastive lip-sync discriminator for accurate lip synchronization; 2) a conditional sequential variational autoencoder that learns the latent motion space disentangled from the lip movements, such that we can independently manipulate the motions and lip movements while preserving the identity; and 3) an auto-regressive prior augmented with normalizing flow to learn a complex audio-to-motion multi-modal latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way when another motion source video is given but also in a completely audio-driven manner by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model is able to synthesize talking head videos with impressive perceptual quality that are accurately lip-synced with the input audio, largely outperforming state-of-the-art baselines.
Submitted 15 March, 2024; v1 submitted 23 August, 2022;
originally announced August 2022.
-
A Codebook Design for FD-MIMO Systems with Multi-Panel Array
Authors:
Zhilin Fu,
Sangwon Hwang,
Jihwan Moon,
Haibao Ren,
Inkyu Lee
Abstract:
In this work, we study codebook designs for full-dimension multiple-input multiple-output (FD-MIMO) systems with a multi-panel array (MPA). We propose novel codebooks which allow precise beam structures for MPA FD-MIMO systems by investigating the physical properties and alignments of the panels. We specifically exploit the characteristic that a group of antennas in a vertical direction exhibit more correlation than those in a horizontal direction. This enables an economical use of feedback bits while constructing finer beams compared to conventional codebooks. The codebook is further improved by dynamically allocating the feedback bits on multiple parts such as beam amplitude and co-phasing coefficients using reinforcement learning. The numerical results confirm the effectiveness of the proposed approach in terms of both performance and computational complexity.
Submitted 9 August, 2022;
originally announced August 2022.
-
VCT: A Video Compression Transformer
Authors:
Fabian Mentzer,
George Toderici,
David Minnen,
Sung-Jin Hwang,
Sergi Caelles,
Mario Lucic,
Eirikur Agustsson
Abstract:
We show how transformers can be used to vastly simplify neural video compression. Previous methods have been relying on an increasing number of architectural biases and priors, including motion prediction and warping operations, resulting in complex models. Instead, we independently map input frames to representations and use a transformer to model their dependencies, letting it predict the distribution of future representations given the past. The resulting video compression transformer outperforms previous methods on standard video compression data sets. Experiments on synthetic data show that our model learns to handle complex motion patterns such as panning, blurring and fading purely from data. Our approach is easy to implement, and we release code to facilitate future research.
Submitted 12 October, 2022; v1 submitted 15 June, 2022;
originally announced June 2022.
-
AI-based automated Meibomian gland segmentation, classification and reflection correction in infrared Meibography
Authors:
Ripon Kumar Saha,
A. M. Mahmud Chowdhury,
Kyung-Sun Na,
Gyu Deok Hwang,
Youngsub Eom,
Jaeyoung Kim,
Hae-Gon Jeon,
Ho Sik Hwang,
Euiheon Chung
Abstract:
Purpose: Develop a deep learning-based automated method to segment meibomian glands (MG) and eyelids, quantitatively analyze the MG area and MG ratio, estimate the meiboscore, and remove specular reflections from infrared images. Methods: A total of 1600 meibography images were captured in a clinical setting. 1000 images were precisely annotated with multiple revisions by investigators and graded 6 times by meibomian gland dysfunction (MGD) experts. Two deep learning (DL) models were trained separately to segment areas of the MG and eyelid. Those segmentations were used to estimate MG ratio and meiboscores using a classification-based DL model. A generative adversarial network was implemented to remove specular reflections from original images. Results: The mean ratio of MG calculated by investigator annotation and DL segmentation was consistent: 26.23% vs. 25.12% in the upper eyelids and 32.34% vs. 32.29% in the lower eyelids, respectively. Our DL model achieved 73.01% accuracy for meiboscore classification on the validation set and 59.17% accuracy when tested on images from an independent center, compared to 53.44% validation accuracy by MGD experts. The DL-based approach successfully removes reflection from the original MG images without affecting meiboscore grading. Conclusions: DL with infrared meibography provides a fully automated, fast quantitative evaluation of MG morphology (MG Segmentation, MG area, MG ratio, and meiboscore) which are sufficiently accurate for diagnosing dry eye disease. Also, the DL removes specular reflection from images to be used by ophthalmologists for distraction-free assessment.
Submitted 31 May, 2022;
originally announced May 2022.
-
Inter-subject Contrastive Learning for Subject Adaptive EEG-based Visual Recognition
Authors:
Pilhyeon Lee,
Sunhee Hwang,
Jewook Lee,
Minjung Shin,
Seogkyu Jeon,
Hyeran Byun
Abstract:
This paper tackles the problem of subject adaptive EEG-based visual recognition. Its goal is to accurately predict the categories of visual stimuli based on EEG signals with only a handful of samples for the target subject during training. The key challenge is how to appropriately transfer the knowledge obtained from abundant data of source subjects to the subject of interest. To this end, we introduce a novel method that allows for learning subject-independent representation by increasing the similarity of features sharing the same class but coming from different subjects. With the dedicated sampling principle, our model effectively captures the common knowledge shared across different subjects, thereby achieving promising performance for the target subject even under harsh problem settings with limited data. Specifically, on the EEG-ImageNet40 benchmark, our model records the top-1 / top-3 test accuracy of 72.6% / 91.6% when using only five EEG samples per class for the target subject. Our code is available at https://github.com/DeepBCI/Deep-BCI/tree/master/1_Intelligent_BCI/Inter_Subject_Contrastive_Learning_for_EEG.
Submitted 6 February, 2022;
originally announced February 2022.
-
LVAC: Learned Volumetric Attribute Compression for Point Clouds using Coordinate Based Networks
Authors:
Berivan Isik,
Philip A. Chou,
Sung Jin Hwang,
Nick Johnston,
George Toderici
Abstract:
We consider the attributes of a point cloud as samples of a vector-valued volumetric function at discrete positions. To compress the attributes given the positions, we compress the parameters of the volumetric function. We model the volumetric function by tiling space into blocks, and representing the function over each block by shifts of a coordinate-based, or implicit, neural network. Inputs to the network include both spatial coordinates and a latent vector per block. We represent the latent vectors using coefficients of the region-adaptive hierarchical transform (RAHT) used in the MPEG geometry-based point cloud codec G-PCC. The coefficients, which are highly compressible, are rate-distortion optimized by back-propagation through a rate-distortion Lagrangian loss in an auto-decoder configuration. The result outperforms RAHT by 2-4 dB. This is the first work to compress volumetric functions represented by local coordinate-based neural networks. As such, we expect it to be applicable beyond point clouds, for example to compression of high-resolution neural radiance fields.
Submitted 17 November, 2021;
originally announced November 2021.
-
Design and Implementation of 5.8GHz RF Wireless Power Transfer System
Authors:
Je Hyeon Park,
Nguyen Minh Tran,
Sa Il Hwang,
Dong In Kim,
Kae Won Choi
Abstract:
In this paper, we present a 5.8 GHz radio-frequency (RF) wireless power transfer (WPT) system that consists of 64 transmit antennas and 16 receive antennas. Unlike the inductive or resonant coupling-based near-field WPT, RF WPT has a great advantage in powering low-power internet of things (IoT) devices with its capability of long-range wireless power transfer. We also propose a beam scanning algorithm that can effectively transfer the power regardless of whether the receiver is located in the radiative near-field zone or the far-field zone. The proposed beam scanning algorithm is verified with a real-life WPT testbed that we implemented. Through experiments, we confirm that the implemented 5.8 GHz RF WPT system is able to transfer 3.67 mW at a distance of 25 meters with the proposed beam scanning algorithm. Moreover, the results show that the proposed algorithm can effectively cover the radiative near-field region, unlike the conventional scanning schemes, which are designed under the assumption of far-field WPT.
Submitted 6 October, 2021;
originally announced October 2021.
-
Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation
Authors:
Dongchan Min,
Dong Bok Lee,
Eunho Yang,
Sung Ju Hwang
Abstract:
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech from only a few short audio samples of the given speaker. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice with a single short-duration (1-3 sec) speech audio, significantly outperforming baselines.
Submitted 16 June, 2021; v1 submitted 6 June, 2021;
originally announced June 2021.
-
Multi-Domain Learning by Meta-Learning: Taking Optimal Steps in Multi-Domain Loss Landscapes by Inner-Loop Learning
Authors:
Anthony Sicilia,
Xingchen Zhao,
Davneet Minhas,
Erin O'Connor,
Howard Aizenstein,
William Klunk,
Dana Tudorascu,
Seong Jae Hwang
Abstract:
We consider a model-agnostic solution to the problem of Multi-Domain Learning (MDL) for multi-modal applications. Many existing MDL techniques are model-dependent solutions which explicitly require nontrivial architectural changes to construct domain-specific modules. Thus, properly applying these MDL techniques for new problems with well-established models, e.g. U-Net for semantic segmentation, may demand various low-level implementation efforts. In this paper, given emerging multi-modal data (e.g., various structural neuroimaging modalities), we aim to enable MDL purely algorithmically so that widely used neural networks can trivially achieve MDL in a model-independent manner. To this end, we consider a weighted loss function and extend it to an effective procedure by employing techniques from the recently active area of learning-to-learn (meta-learning). Specifically, we take inner-loop gradient steps to dynamically estimate posterior distributions over the hyperparameters of our loss function. Thus, our method is model-agnostic, requiring no additional model parameters and no network architecture changes; instead, only a few efficient algorithmic modifications are needed to improve performance in MDL. We demonstrate our solution to a fitting problem in medical imaging, specifically, in the automatic segmentation of white matter hyperintensity (WMH). We look at two neuroimaging modalities (T1-MR and FLAIR) with complementary information fitting for our problem.
Submitted 25 February, 2021;
originally announced February 2021.
-
Online Graph Completion: Multivariate Signal Recovery in Computer Vision
Authors:
Won Hwa Kim,
Mona Jalal,
Seongjae Hwang,
Sterling C. Johnson,
Vikas Singh
Abstract:
The adoption of "human-in-the-loop" paradigms in computer vision and machine learning is leading to various applications where the actual data acquisition (e.g., human supervision) and the underlying inference algorithms are closely intertwined. While classical work in active learning provides effective solutions when the learning module involves classification and regression tasks, many practical issues such as partially observed measurements, financial constraints and even additional distributional or structural aspects of the data typically fall outside the scope of this treatment. For instance, with sequential acquisition of partial measurements of data that manifest as a matrix (or tensor), novel strategies for completion (or collaborative filtering) of the remaining entries have only been studied recently. Motivated by vision problems where we seek to annotate a large dataset of images via a crowdsourced platform or, alternatively, complement results from a state-of-the-art object detector using human feedback, we study the "completion" problem defined on graphs, where requests for additional measurements must be made sequentially. We design the optimization model in the Fourier domain of the graph, describing how ideas based on adaptive submodularity provide algorithms that work well in practice. On a large set of images collected from Imgur, we see promising results on images that are otherwise difficult to categorize. We also show applications to an experimental design problem in neuroimaging.
Submitted 11 August, 2020;
originally announced August 2020.
-
Nonlinear Transform Coding
Authors:
Johannes Ballé,
Philip A. Chou,
David Minnen,
Saurabh Singh,
Nick Johnston,
Eirikur Agustsson,
Sung Jin Hwang,
George Toderici
Abstract:
We review a class of methods that can be collected under the name nonlinear transform coding (NTC), which over the past few years have become competitive with the best linear transform codecs for images, and have superseded them in terms of rate-distortion performance under established perceptual quality metrics such as MS-SSIM. We assess the empirical rate-distortion performance of NTC with the help of simple example sources, for which the optimal performance of a vector quantizer is easier to estimate than with natural data sources. To this end, we introduce a novel variant of entropy-constrained vector quantization. We provide an analysis of various forms of stochastic optimization techniques for NTC models; review architectures of transforms based on artificial neural networks, as well as learned entropy models; and provide a direct comparison of a number of methods to parameterize the rate-distortion trade-off of nonlinear transforms, introducing a simplified one.
Submitted 23 October, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.
-
Implementation of Symbol Timing Recovery for Estimation of Clock Skew
Authors:
S. M. Usman Hashmi,
Muntazir Hussain,
Fahad Bin Muslim,
Kashif Inayat,
Seong Oun Hwang
Abstract:
Time synchronization in any distributed network can be achieved by using application layer protocols for time correction. The time synchronization method proposed in this article uses symbol timing recovery at the physical layer to correct the application layer clock. This cross-layer methodology reduces the number of message exchanges needed by the application layer for time synchronization, thus saving energy. The precision of the skew estimate can be increased by using multiple message exchanges. Examination of the cross-layer strategy, including simulation results, experimental outcomes and mathematical analysis, demonstrates that the clock skew at the physical layer is the same as that at the application layer, which is in fact the skew of the hardware clock within the node.
Submitted 25 June, 2020;
originally announced June 2020.
-
DcardNet: Diabetic Retinopathy Classification at Multiple Levels Based on Structural and Angiographic Optical Coherence Tomography
Authors:
Pengxiao Zang,
Liqin Gao,
Tristan T. Hormel,
Jie Wang,
Qisheng You,
Thomas S. Hwang,
Yali Jia
Abstract:
Objective: Optical coherence tomography (OCT) and its angiography (OCTA) have several advantages for the early detection and diagnosis of diabetic retinopathy (DR). However, automated, complete DR classification frameworks based on both OCT and OCTA data have not been proposed. In this study, a convolutional neural network (CNN) based method is proposed to fulfill a DR classification framework using en face OCT and OCTA. Methods: A densely and continuously connected neural network with adaptive rate dropout (DcardNet) is designed for the DR classification. In addition, adaptive label smoothing was proposed and used to suppress overfitting. Three separate classification levels are generated for each case based on the International Clinical Diabetic Retinopathy scale. At the highest level the network classifies scans as referable or non-referable for DR. The second level classifies the eye as non-DR, non-proliferative DR (NPDR), or proliferative DR (PDR). The last level classifies the case as no DR, mild and moderate NPDR, severe NPDR, and PDR. Results: We used 10-fold cross-validation with 10% of the data to assess the network's performance. The overall classification accuracies of the three levels were 95.7%, 85.0%, and 71.0%, respectively. Conclusion/Significance: A reliable, sensitive and specific automated classification framework for referral to an ophthalmologist can be a key technology for reducing vision loss related to DR.
Submitted 24 September, 2020; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Automated segmentation of retinal fluid volumes from structural and angiographic optical coherence tomography using deep learning
Authors:
Yukun Guo,
Tristan T. Hormel,
Honglian Xiong,
Jie Wang,
Thomas S. Hwang,
Yali Jia
Abstract:
Purpose: We proposed a deep convolutional neural network (CNN), named Retinal Fluid Segmentation Network (ReF-Net) to segment volumetric retinal fluid on optical coherence tomography (OCT) volume. Methods: 3 x 3-mm OCT scans were acquired on one eye by a 70-kHz OCT commercial AngioVue system (RTVue-XR; Optovue, Inc.) from 51 participants in a clinical diabetic retinopathy (DR) study (45 with retinal edema and 6 healthy controls). A CNN with U-Net-like architecture was constructed to detect and segment the retinal fluid. Cross-sectional OCT and angiography (OCTA) scans were used for training and testing ReF-Net. The effect of including OCTA data for retinal fluid segmentation was investigated in this study. Volumetric retinal fluid can be constructed using the output of ReF-Net. Area-under-Receiver-Operating-Characteristic-curve (AROC), intersection-over-union (IoU), and F1-score were calculated to evaluate the performance of ReF-Net. Results: ReF-Net shows high accuracy (F1 = 0.864 +/- 0.084) in retinal fluid segmentation. The performance can be further improved (F1 = 0.892 +/- 0.038) by including information from both OCTA and structural OCT. ReF-Net also shows strong robustness to shadow artifacts. Volumetric retinal fluid can provide more comprehensive information than the 2D area, whether cross-sectional or en face projections. Conclusions: A deep-learning-based method can accurately segment retinal fluid volumetrically on OCT/OCTA scans with strong robustness to shadow artifacts. OCTA data can improve retinal fluid segmentation. Volumetric representations of retinal fluid are superior to 2D projections. Translational Relevance: Using a deep learning method to segment retinal fluid volumetrically has the potential to improve the diagnostic accuracy of diabetic macular edema by OCT systems.
Submitted 3 June, 2020;
originally announced June 2020.
-
High-resolution wide-field OCT angiography with a self-navigation method to correct microsaccades and blinks
Authors:
Xiang Wei,
Tristan T. Hormel,
Yukun Guo,
Thomas S. Hwang,
Yali Jia
Abstract:
In this study, we demonstrate a novel self-navigated motion correction method that suppresses eye motion and blinking artifacts on wide-field optical coherence tomographic angiography (OCTA) without requiring any hardware modification. Highly efficient GPU-based, real-time OCTA image acquisition and processing software was developed to detect eye motion artifacts. The algorithm includes an instantaneous motion index that evaluates the strength of motion artifact on en face OCTA images. Areas with suprathreshold motion and eye blinking artifacts are automatically rescanned in real-time. Both healthy eyes and eyes with diabetic retinopathy were imaged, and the self-navigated motion correction performance was demonstrated.
Submitted 21 May, 2020; v1 submitted 9 April, 2020;
originally announced April 2020.
-
Inspector Gadget: A Data Programming-based Labeling System for Industrial Images
Authors:
Geon Heo,
Yuji Roh,
Seonghyeon Hwang,
Dayun Lee,
Steven Euijong Whang
Abstract:
As machine learning for images becomes democratized in the Software 2.0 era, one of the serious bottlenecks is securing enough labeled data for training. This problem is especially critical in a manufacturing setting where smart factories rely on machine learning for product quality control by analyzing industrial images. Such images are typically large and may only need to be partially analyzed where only a small portion is problematic (e.g., identifying defects on a surface). Since manually labeling these images is expensive, weak supervision is an attractive alternative where the idea is to generate weak labels that are not perfect, but can be produced at scale. Data programming is a recent paradigm in this category that uses human knowledge in the form of labeling functions and combines them into a generative model. Data programming has been successful in applications based on text or structured data and can also be applied to images, usually if one can find a way to convert them into structured data. In this work, we expand the horizon of data programming by directly applying it to images without this conversion, which is a common scenario for industrial applications. We propose Inspector Gadget, an image labeling system that combines crowdsourcing, data augmentation, and data programming to produce weak labels at scale for image classification. We perform experiments on real industrial image datasets and show that Inspector Gadget obtains better performance than other weak-labeling techniques: Snuba, GOGGLES, and self-learning baselines using convolutional neural networks (CNNs) without pre-training.
Submitted 21 August, 2020; v1 submitted 7 April, 2020;
originally announced April 2020.
-
Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
Authors:
Seong Min Kye,
Youngmoon Jung,
Hae Beom Lee,
Sung Ju Hwang,
Hoirin Kim
Abstract:
In practical settings, a speaker recognition system needs to identify a speaker given a short utterance, while the enrollment utterance may be relatively long. However, existing speaker recognition models perform poorly with such short utterances. To solve this problem, we introduce a meta-learning framework for imbalance length pairs. Specifically, we use a Prototypical Network and train it with a support set of long utterances and a query set of short utterances of varying lengths. Further, since optimizing only for the classes in the given episode may be insufficient for learning discriminative embeddings for unseen classes, we additionally enforce the model to classify both the support and the query set against the entire set of classes in the training set. By combining these two learning schemes, our model outperforms existing state-of-the-art speaker verification models learned with a standard supervised learning framework on short utterances (1-2 seconds) on the VoxCeleb datasets. We also validate our proposed model for unseen speaker identification, on which it also achieves significant performance gains over the existing approaches. The code is available at https://github.com/seongmin-kye/meta-SR.
Submitted 10 August, 2020; v1 submitted 6 April, 2020;
originally announced April 2020.
-
Semi-Relaxed Quantization with DropBits: Training Low-Bit Neural Networks via Bit-wise Regularization
Authors:
Jung Hyun Lee,
Jihun Yun,
Sung Ju Hwang,
Eunho Yang
Abstract:
Network quantization, which aims to reduce the bit-lengths of the network weights and activations, has emerged as one of the key ingredients to reduce the size of neural networks for their deployment to resource-limited devices. In order to overcome the nature of transforming continuous activations and weights to discrete ones, a recent study called Relaxed Quantization (RQ) [Louizos et al. 2019] successfully employs the popular Gumbel-Softmax, which allows this transformation with efficient gradient-based optimization. However, RQ with this Gumbel-Softmax relaxation still suffers from a bias-variance trade-off depending on the temperature parameter of Gumbel-Softmax. To resolve the issue, we propose a novel method, Semi-Relaxed Quantization (SRQ), that uses a multi-class straight-through estimator to effectively reduce the bias and variance, along with a new regularization technique, DropBits, that replaces dropout regularization to randomly drop the bits instead of neurons to further reduce the bias of the multi-class straight-through estimator in SRQ. As a natural extension of DropBits, we further introduce a way of learning heterogeneous quantization levels to find the proper bit-length for each layer using DropBits. We experimentally validate our method on various benchmark datasets and network architectures, and also support the quantized lottery ticket hypothesis: learning heterogeneous quantization levels outperforms the case using the same but fixed quantization levels from scratch.
Submitted 7 September, 2021; v1 submitted 29 November, 2019;
originally announced November 2019.
-
Attention-Aware Linear Depthwise Convolution for Single Image Super-Resolution
Authors:
Seongmin Hwang,
Gwanghuyn Yu,
Cheolkon Jung,
Jinyoung Kim
Abstract:
Although deep convolutional neural networks (CNNs) have obtained outstanding performance in image super-resolution (SR), their computational cost increases geometrically as CNN models get deeper and wider. Meanwhile, the features of intermediate layers are treated equally across the channel, thus hindering the representational capability of CNNs. In this paper, we propose an attention-aware linear depthwise network, named ALDNet, to address these problems for single image SR. Specifically, linear depthwise convolution allows CNN-based SR models to preserve useful information for reconstructing a super-resolved image while reducing computational burden. Furthermore, we design an attention-aware branch that enhances the representation ability of depthwise convolution layers by making full use of depthwise filter interdependency. Experiments on publicly available benchmark datasets show that ALDNet achieves superior performance to traditional depthwise separable convolutions in terms of quantitative measurements and visual quality.
Submitted 29 November, 2019; v1 submitted 7 August, 2019;
originally announced August 2019.
-
Variational image compression with a scale hyperprior
Authors:
Johannes Ballé,
David Minnen,
Saurabh Singh,
Sung Jin Hwang,
Nick Johnston
Abstract:
We describe an end-to-end trainable model for image compression based on variational autoencoders. The model incorporates a hyperprior to effectively capture spatial dependencies in the latent representation. This hyperprior relates to side information, a concept universal to virtually all modern image codecs, but largely unexplored in image compression using artificial neural networks (ANNs). Unlike existing autoencoder compression methods, our model trains a complex prior jointly with the underlying autoencoder. We demonstrate that this model leads to state-of-the-art image compression when measuring visual quality using the popular MS-SSIM index, and yields rate-distortion performance surpassing published ANN-based methods when evaluated using a more traditional metric based on squared error (PSNR). Furthermore, we provide a qualitative comparison of models trained for different distortion metrics.
Submitted 1 May, 2018; v1 submitted 31 January, 2018;
originally announced February 2018.
-
Advanced Satellite-based Frequency Transfer at the 10^{-16} Level
Authors:
M. Fujieda,
S-H. Yang,
T. Gotoh,
S-W. Hwang,
H. Hachisu,
H. Kim,
Y. K. Lee,
R. Tabuchi,
T. Ido,
W-K. Lee,
M-S. Heo,
C. Y. Park,
D-H. Yu,
G. Petit
Abstract:
Advanced satellite-based frequency transfers by TWCP and IPPP have been performed between NICT and KRISS. We confirm that the disagreement between them is less than 1x10^{-16} at an averaging time of several days. Additionally, an intercontinental frequency ratio measurement of Sr and Yb optical lattice clocks was directly performed by TWCP. We achieved an uncertainty at the mid-10^{-16} level after a total measurement time of 12 hours. The frequency ratio was consistent with the recently reported values within the uncertainty.
Submitted 6 October, 2017;
originally announced October 2017.
-
Decompositions of two player games: potential, zero-sum, and stable games
Authors:
Sung-Ha Hwang,
Luc Rey-Bellet
Abstract:
We introduce several methods of decomposition for two player normal form games. Viewing the set of all games as a vector space, we exhibit explicit orthonormal bases for the subspaces of potential games, zero-sum games, and their orthogonal complements, which we call anti-potential games and anti-zero-sum games, respectively. Perhaps surprisingly, every anti-potential game comes either from the Rock-Paper-Scissors type games (in the case of symmetric games) or from the Matching Pennies type games (in the case of asymmetric games). Using these decompositions, we prove old (and some new) cycle criteria for potential and zero-sum games (as orthogonality relations between subspaces). We illustrate the usefulness of our decomposition by (a) analyzing the generalized Rock-Paper-Scissors game, (b) completely characterizing the set of all null-stable games, (c) providing a large class of strict stable games, (d) relating the game decomposition to the decomposition of vector fields for the replicator equations, (e) constructing Lyapunov functions for some replicator dynamics, and (f) constructing Zeeman games, that is, games with an interior asymptotically stable Nash equilibrium and a pure strategy ESS.
Submitted 18 July, 2011; v1 submitted 17 June, 2011;
originally announced June 2011.