Search | arXiv e-print repository

Self Training and Ensembling Frequency Dependent Networks with Coarse Prediction Pooling and Sound Event Bounding Boxes

Authors: Hyeonuk Nam, Deokki Min, Seungdeok Choi, Inhan Choi, Yong-Hwa Park

Abstract: To tackle sound event detection (SED) task, we propose frequency dependent networks (FreDNets), which heavily leverage frequency-dependent methods. We apply frequency war** and FilterAugment, which are frequency-dependent data augmentation methods. The model architecture consists of 3 branches: audio teacher-student transformer (ATST) branch, BEATs branch and CNN branch including either partial… ▽ More To tackle sound event detection (SED) task, we propose frequency dependent networks (FreDNets), which heavily leverage frequency-dependent methods. We apply frequency war** and FilterAugment, which are frequency-dependent data augmentation methods. The model architecture consists of 3 branches: audio teacher-student transformer (ATST) branch, BEATs branch and CNN branch including either partial dilated frequency dynamic convolution (PDFD) or squeeze-and-Excitation (SE) with time-frame frequency-wise SE (tfwSE). To train MAESTRO labels with coarse temporal resolution, we apply max pooling on prediction for the MAESTRO dataset. Using best ensemble model, we apply self training to obtain pseudo label from DESED weak set, DESED unlabeled set and AudioSet. AudioSet labels are filtered to focus on high-confidence pseudo labels and AudioSet pseudo labels are used to train on DESED labels only. We used change-detection-based sound event bounding boxes (cSEBBs) as post processing for ensemble models on self training and submission models. △ Less

Submitted 22 June, 2024; originally announced June 2024.

Comments: DCASE 2024 Challenge Task 4 technical report

arXiv:2406.05472 [pdf, other]

A Novel Generative AI-Based Framework for Anomaly Detection in Multicast Messages in Smart Grid Communications

Authors: Aydin Zaboli, Seong Lok Choi, Tai-** Song, Junho Hong

Abstract: Cybersecurity breaches in digital substations can pose significant challenges to the stability and reliability of power system operations. To address these challenges, defense and mitigation techniques are required. Identifying and detecting anomalies in information and communication technology (ICT) is crucial to ensure secure device interactions within digital substations. This paper proposes a… ▽ More Cybersecurity breaches in digital substations can pose significant challenges to the stability and reliability of power system operations. To address these challenges, defense and mitigation techniques are required. Identifying and detecting anomalies in information and communication technology (ICT) is crucial to ensure secure device interactions within digital substations. This paper proposes a task-oriented dialogue (ToD) system for anomaly detection (AD) in datasets of multicast messages e.g., generic object oriented substation event (GOOSE) and sampled value (SV) in digital substations using large language models (LLMs). This model has a lower potential error and better scalability and adaptability than a process that considers the cybersecurity guidelines recommended by humans, known as the human-in-the-loop (HITL) process. Also, this methodology significantly reduces the effort required when addressing new cyber threats or anomalies compared with machine learning (ML) techniques, since it leaves the models complexity and precision unaffected and offers a faster implementation. These findings present a comparative assessment, conducted utilizing standard and advanced performance evaluation metrics for the proposed AD framework and the HITL process. To generate and extract datasets of IEC 61850 communications, a hardware-in-the-loop (HIL) testbed was employed. △ Less

Submitted 8 June, 2024; originally announced June 2024.

Comments: 10 pages, 10 figures, Submitted to IEEE Transactions on Information Forensics and Security

arXiv:2405.01591 [pdf, other]

Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model

Authors: Seonhee Cho, Choonghan Kim, Jiho Lee, Chetan Chilkunda, Su** Choi, Joo Heung Yoon

Abstract: Recent advancements in Large Multimodal Models (LMMs) have attracted interest in their generalization capability with only a few samples in the prompt. This progress is particularly relevant to the medical domain, where the quality and sensitivity of data pose unique challenges for model training and application. However, the dependency on high-quality data for effective in-context learning raises… ▽ More Recent advancements in Large Multimodal Models (LMMs) have attracted interest in their generalization capability with only a few samples in the prompt. This progress is particularly relevant to the medical domain, where the quality and sensitivity of data pose unique challenges for model training and application. However, the dependency on high-quality data for effective in-context learning raises questions about the feasibility of these models when encountering with the inevitable variations and errors inherent in real-world medical data. In this paper, we introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain Large Language Model (LLM) to process multimodal data via image descriptions. MID-M achieves a comparable or superior performance to task-specific fine-tuned LMMs and other general-domain ones, without the extensive domain-specific training or pre-training on multimodal data, with significantly fewer parameters. This highlights the potential of leveraging general-domain LLMs for domain-specific tasks and offers a sustainable and cost-effective alternative to traditional LMM developments. Moreover, the robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications. △ Less

Submitted 29 April, 2024; originally announced May 2024.

Comments: Under review

arXiv:2402.17127 [pdf, other]

Experimental Study: Enhancing Voice Spoofing Detection Models with wav2vec 2.0

Authors: Taein Kang, Soyul Han, Sunmook Choi, Jae** Seo, Sanghyeok Chung, Seungeun Lee, Seungsang Oh, Il-Youp Kwak

Abstract: Conventional spoofing detection systems have heavily relied on the use of handcrafted features derived from speech data. However, a notable shift has recently emerged towards the direct utilization of raw speech waveforms, as demonstrated by methods like SincNet filters. This shift underscores the demand for more sophisticated audio sample features. Moreover, the success of deep learning models, p… ▽ More Conventional spoofing detection systems have heavily relied on the use of handcrafted features derived from speech data. However, a notable shift has recently emerged towards the direct utilization of raw speech waveforms, as demonstrated by methods like SincNet filters. This shift underscores the demand for more sophisticated audio sample features. Moreover, the success of deep learning models, particularly those utilizing large pretrained wav2vec 2.0 as a featurization front-end, highlights the importance of refined feature encoders. In response, this research assessed the representational capability of wav2vec 2.0 as an audio feature extractor, modifying the size of its pretrained Transformer layers through two key adjustments: (1) selecting a subset of layers starting from the leftmost one and (2) fine-tuning a portion of the selected layers from the rightmost one. We complemented this analysis with five spoofing detection back-end models, with a primary focus on AASIST, enabling us to pinpoint the optimal configuration for the selection and fine-tuning process. In contrast to conventional handcrafted features, our investigation identified several spoofing detection systems that achieve state-of-the-art performance in the ASVspoof 2019 LA dataset. This comprehensive exploration offers valuable insights into feature selection strategies, advancing the field of spoofing detection. △ Less

Submitted 26 February, 2024; originally announced February 2024.

Comments: 5 pages

MSC Class: 00A71 ACM Class: I.2.6

arXiv:2401.13498 [pdf, other]

Expressive Acoustic Guitar Sound Synthesis with an Instrument-Specific Input Representation and Diffusion Outpainting

Authors: Hounsu Kim, Soonbeom Choi, Juhan Nam

Abstract: Synthesizing performing guitar sound is a highly challenging task due to the polyphony and high variability in expression. Recently, deep generative models have shown promising results in synthesizing expressive polyphonic instrument sounds from music scores, often using a generic MIDI input. In this work, we propose an expressive acoustic guitar sound synthesis model with a customized input repre… ▽ More Synthesizing performing guitar sound is a highly challenging task due to the polyphony and high variability in expression. Recently, deep generative models have shown promising results in synthesizing expressive polyphonic instrument sounds from music scores, often using a generic MIDI input. In this work, we propose an expressive acoustic guitar sound synthesis model with a customized input representation to the instrument, which we call guitarroll. We implement the proposed approach using diffusion-based outpainting which can generate audio with long-term consistency. To overcome the lack of MIDI/audio-paired datasets, we used not only an existing guitar dataset but also collected data from a high quality sample-based guitar synthesizer. Through quantitative and qualitative evaluations, we show that our proposed model has higher audio quality than the baseline model and generates more realistic timbre sounds than the previous leading work. △ Less

Submitted 24 January, 2024; originally announced January 2024.

Comments: Accepted to ICASSP 2024

arXiv:2401.12473 [pdf, other]

Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor

Authors: Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhong-Qiu Wang, Shinji Watanabe

Abstract: We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations… ▽ More We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations of these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding by feature-wise linear modulation conditioning, creating a speaker dimension. The mixture embedding, conditioned with speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the previous best reported in the literature, achieving 24.0 and 23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2 and 3mix respectively, with a single model trained to separate 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability at counting sources and separating mixtures with up to 5 speakers. △ Less

Submitted 22 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures, accepted by ICASSP 2024

arXiv:2311.18287 [pdf, other]

Dispersed Structured Light for Hyperspectral 3D Imaging

Authors: Suhyun Shin, Seokjun Choi, Felix Heide, Seung-Hwan Baek

Abstract: Hyperspectral 3D imaging aims to acquire both depth and spectral information of a scene. However, existing methods are either prohibitively expensive and bulky or compromise on spectral and depth accuracy. In this work, we present Dispersed Structured Light (DSL), a cost-effective and compact method for accurate hyperspectral 3D imaging. DSL modifies a traditional projector-camera system by placin… ▽ More Hyperspectral 3D imaging aims to acquire both depth and spectral information of a scene. However, existing methods are either prohibitively expensive and bulky or compromise on spectral and depth accuracy. In this work, we present Dispersed Structured Light (DSL), a cost-effective and compact method for accurate hyperspectral 3D imaging. DSL modifies a traditional projector-camera system by placing a sub-millimeter thick diffraction grating film front of the projector. The grating disperses structured light based on light wavelength. To utilize the dispersed structured light, we devise a model for dispersive projection image formation and a per-pixel hyperspectral 3D reconstruction method. We validate DSL by instantiating a compact experimental prototype. DSL achieves spectral accuracy of 18.8nm full-width half-maximum (FWHM) and depth error of 1mm. We demonstrate that DSL outperforms prior work on practical hyperspectral 3D imaging. DSL promises accurate and practical hyperspectral 3D imaging for diverse application domains, including computer vision and graphics, cultural heritage, geology, and biology. △ Less

Submitted 25 March, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

arXiv:2311.05462 [pdf, other]

ChatGPT and Other Large Language Models for Cybersecurity of Smart Grid Applications

Authors: Aydin Zaboli, Seong Lok Choi, Tai-** Song, Junho Hong

Abstract: Cybersecurity breaches targeting electrical substations constitute a significant threat to the integrity of the power grid, necessitating comprehensive defense and mitigation strategies. Any anomaly in information and communication technology (ICT) should be detected for secure communications between devices in digital substations. This paper proposes large language models (LLM), e.g., ChatGPT, fo… ▽ More Cybersecurity breaches targeting electrical substations constitute a significant threat to the integrity of the power grid, necessitating comprehensive defense and mitigation strategies. Any anomaly in information and communication technology (ICT) should be detected for secure communications between devices in digital substations. This paper proposes large language models (LLM), e.g., ChatGPT, for the cybersecurity of IEC 61850-based digital substation communications. Multicast messages such as generic object oriented system event (GOOSE) and sampled value (SV) are used for case studies. The proposed LLM-based cybersecurity framework includes, for the first time, data pre-processing of communication systems and human-in-the-loop (HITL) training (considering the cybersecurity guidelines recommended by humans). The results show a comparative analysis of detected anomaly data carried out based on the performance evaluation metrics for different LLMs. A hardware-in-the-loop (HIL) testbed is used to generate and extract dataset of IEC 61850 communications. △ Less

Submitted 25 February, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

Comments: 5 pages, 2 figures, Accepted, 2024 IEEE Power & Energy Society General Meeting (PESGM), Seattle, WA, USA

arXiv:2311.00332 [pdf, other]

SDF4CHD: Generative Modeling of Cardiac Anatomies with Congenital Heart Defects

Authors: Fanwei Kong, Sascha Stocker, Perry S. Choi, Michael Ma, Daniel B. Ennis, Alison Marsden

Abstract: Congenital heart disease (CHD) encompasses a spectrum of cardiovascular structural abnormalities, often requiring customized treatment plans for individual patients. Computational modeling and analysis of these unique cardiac anatomies can improve diagnosis and treatment planning and may ultimately lead to improved outcomes. Deep learning (DL) methods have demonstrated the potential to enable effi… ▽ More Congenital heart disease (CHD) encompasses a spectrum of cardiovascular structural abnormalities, often requiring customized treatment plans for individual patients. Computational modeling and analysis of these unique cardiac anatomies can improve diagnosis and treatment planning and may ultimately lead to improved outcomes. Deep learning (DL) methods have demonstrated the potential to enable efficient treatment planning by automating cardiac segmentation and mesh construction for patients with normal cardiac anatomies. However, CHDs are often rare, making it challenging to acquire sufficiently large patient cohorts for training such DL models. Generative modeling of cardiac anatomies has the potential to fill this gap via the generation of virtual cohorts; however, prior approaches were largely designed for normal anatomies and cannot readily capture the significant topological variations seen in CHD patients. Therefore, we propose a type- and shape-disentangled generative approach suitable to capture the wide spectrum of cardiac anatomies observed in different CHD types and synthesize differently shaped cardiac anatomies that preserve the unique topology for specific CHD types. Our DL approach represents generic whole heart anatomies with CHD type-specific abnormalities implicitly using signed distance fields (SDF) based on CHD type diagnosis, which conveniently captures divergent anatomical variations across different types and represents meaningful intermediate CHD states. To capture the shape-specific variations, we then learn invertible deformations to morph the learned CHD type-specific anatomies and reconstruct patient-specific shapes. Our approach has the potential to augment the image-segmentation pairs for rarer CHD types for cardiac segmentation and generate cohorts of CHD cardiac meshes for computational simulation. △ Less

Submitted 8 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

arXiv:2310.10633 [pdf, other]

Telescope imaging beyond the Rayleigh limit in extremely low SNR

Authors: Hyunsoo Choi, Seungman Choi, Peter Menart, Angshuman Deka, Zubin Jacob

Abstract: The Rayleigh limit and low Signal-to-Noise Ratio (SNR) scenarios pose significant limitations to optical imaging systems used in remote sensing, infrared thermal imaging, and space domain awareness. In this study, we introduce a Stochastic Sub-Rayleigh Imaging (SSRI) algorithm to localize point objects and estimate their positions, brightnesses, and number in low SNR conditions, even below the Ray… ▽ More The Rayleigh limit and low Signal-to-Noise Ratio (SNR) scenarios pose significant limitations to optical imaging systems used in remote sensing, infrared thermal imaging, and space domain awareness. In this study, we introduce a Stochastic Sub-Rayleigh Imaging (SSRI) algorithm to localize point objects and estimate their positions, brightnesses, and number in low SNR conditions, even below the Rayleigh limit. Our algorithm adopts a maximum likelihood approach and exploits the Poisson distribution of incoming photons to overcome the Rayleigh limit in low SNR conditions. In our experimental validation, which closely mirrors practical scenarios, we focus on conditions with closely spaced sources within the sub-Rayleigh limit (0.49-1.00R) and weak signals (SNR less than 1.2). We use the Jaccard index and Jaccard efficiency as a figure of merit to quantify imaging performance in the sub-Rayleigh region. Our approach consistently outperforms established algorithms such as Richardson-Lucy and CLEAN by 4X in the low SNR, sub-Rayleigh regime. Our SSRI algorithm allows existing telescope-based optical/infrared imaging systems to overcome the extreme limit of sub-Rayleigh, low SNR source distributions, potentially impacting a wide range of fields, including passive thermal imaging, remote sensing, and space domain awareness. △ Less

Submitted 17 January, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: 18 pages, 5 figures

arXiv:2310.06364 [pdf, other]

Noisy-ArcMix: Additive Noisy Angular Margin Loss Combined With Mixup Anomalous Sound Detection

Authors: Soonhyeon Choi, Jung-Woo Choi

Abstract: Unsupervised anomalous sound detection (ASD) aims to identify anomalous sounds by learning the features of normal operational sounds and sensing their deviations. Recent approaches have focused on the self-supervised task utilizing the classification of normal data, and advanced models have shown that securing representation space for anomalous data is important through representation learning yie… ▽ More Unsupervised anomalous sound detection (ASD) aims to identify anomalous sounds by learning the features of normal operational sounds and sensing their deviations. Recent approaches have focused on the self-supervised task utilizing the classification of normal data, and advanced models have shown that securing representation space for anomalous data is important through representation learning yielding compact intra-class and well-separated intra-class distributions. However, we show that conventional approaches often fail to ensure sufficient intra-class compactness and exhibit angular disparity between samples and their corresponding centers. In this paper, we propose a training technique aimed at ensuring intra-class compactness and increasing the angle gap between normal and abnormal samples. Furthermore, we present an architecture that extracts features for important temporal regions, enabling the model to learn which time frames should be emphasized or suppressed. Experimental results demonstrate that the proposed method achieves the best performance giving 0.90%, 0.83%, and 2.16% improvement in terms of AUC, pAUC, and mAUC, respectively, compared to the state-of-the-art method on DCASE 2020 Challenge Task2 dataset. △ Less

Submitted 10 October, 2023; originally announced October 2023.

Comments: Submitted to ICASSP 2024

arXiv:2309.07937 [pdf, other]

Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

Authors: Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe

Abstract: We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech syn… ▽ More We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with improvements in both speech intelligibility from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over the single-task counterpart. Further, VoxtLM is trained with publicly available data and training recipes and model checkpoints are open-sourced to make fully reproducible work. △ Less

Submitted 24 January, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

arXiv:2308.02133 [pdf, other]

NeuralEQ: Neural-Network-Based Equalizer for High-Speed Wireline Communication

Authors: Hanseok Kim, Jae Hyung Ju, Hyun Seok Choi, Hyeri Roh, Woo-Seok Choi

Abstract: With the growing demand for high-bandwidth applications like video streaming and cloud services, the data transfer rates required for wireline communication keeps increasing, making the channel loss a major obstacle in achieving low bit error rate (BER). Equalization techniques such as feed-forward equalizer (FFE) and decision feedback equalizer (DFE) are commonly used to compensate for channel lo… ▽ More With the growing demand for high-bandwidth applications like video streaming and cloud services, the data transfer rates required for wireline communication keeps increasing, making the channel loss a major obstacle in achieving low bit error rate (BER). Equalization techniques such as feed-forward equalizer (FFE) and decision feedback equalizer (DFE) are commonly used to compensate for channel loss in wireline communication, but they have limitations in terms of noise boosting and timing constraints. On the other hand, the forward-backward algorithm can achieve better BER performance, but its high complexity makes it impractical for wireline communication. In this work, we propose a novel neural network, NeuralEQ, that effectively mimics the forward-backward algorithm and performs better than FFE and DFE while reducing complexity of the forward-backward algorithm. Performance of NeuralEQ is verified through simulations using real channels. △ Less

Submitted 4 August, 2023; originally announced August 2023.

arXiv:2306.14310 [pdf, other]

Addressing Cold Start Problem for End-to-end Automatic Speech Scoring

Authors: Jungbae Park, Seungtaek Choi

Abstract: Integrating automatic speech scoring/assessment systems has become a critical aspect of second-language speaking education. With self-supervised learning advancements, end-to-end speech scoring approaches have exhibited promising results. However, this study highlights the significant decrease in the performance of speech scoring systems in new question contexts, thereby identifying this as a cold… ▽ More Integrating automatic speech scoring/assessment systems has become a critical aspect of second-language speaking education. With self-supervised learning advancements, end-to-end speech scoring approaches have exhibited promising results. However, this study highlights the significant decrease in the performance of speech scoring systems in new question contexts, thereby identifying this as a cold start problem in terms of items. With the finding of cold-start phenomena, this paper seeks to alleviate the problem by following methods: 1) prompt embeddings, 2) question context embeddings using BERT or CLIP models, and 3) choice of the pretrained acoustic model. Experiments are conducted on TOEIC speaking test datasets collected from English-as-a-second-language (ESL) learners rated by professional TOEIC speaking evaluators. The results demonstrate that the proposed framework not only exhibits robustness in a cold-start environment but also outperforms the baselines for known content. △ Less

Submitted 25 June, 2023; originally announced June 2023.

Comments: Accepted at Interspeech 2023, 4 pages, 1 page for reference

arXiv:2306.06340 [pdf, other]

ECGBERT: Understanding Hidden Language of ECGs with Self-Supervised Representation Learning

Authors: Seokmin Choi, Sajad Mousavi, Phillip Si, Haben G. Yhdego, Fatemeh Khadem, Fatemeh Afghah

Abstract: In the medical field, current ECG signal analysis approaches rely on supervised deep neural networks trained for specific tasks that require substantial amounts of labeled data. However, our paper introduces ECGBERT, a self-supervised representation learning approach that unlocks the underlying language of ECGs. By unsupervised pre-training of the model, we mitigate challenges posed by the lack of… ▽ More In the medical field, current ECG signal analysis approaches rely on supervised deep neural networks trained for specific tasks that require substantial amounts of labeled data. However, our paper introduces ECGBERT, a self-supervised representation learning approach that unlocks the underlying language of ECGs. By unsupervised pre-training of the model, we mitigate challenges posed by the lack of well-labeled and curated medical data. ECGBERT, inspired by advances in the area of natural language processing and large language models, can be fine-tuned with minimal additional layers for various ECG-based problems. Through four tasks, including Atrial Fibrillation arrhythmia detection, heartbeat classification, sleep apnea detection, and user authentication, we demonstrate ECGBERT's potential to achieve state-of-the-art results on a wide variety of tasks. △ Less

Submitted 10 June, 2023; originally announced June 2023.

arXiv:2305.18739 [pdf, other]

doi 10.1109/ICASSP49357.2023.10095881

An empirical study on speech restoration guided by self supervised speech representation

Authors: Jaeuk Byun, Youna Ji, Soo Whan Chung, Soyeon Choe, Min Seok Choi

Abstract: Enhancing speech quality is an indispensable yet difficult task as it is often complicated by a range of degradation factors. In addition to additive noise, reverberation, clip**, and speech attenuation can all adversely affect speech quality. Speech restoration aims to recover speech components from these distortions. This paper focuses on exploring the impact of self-supervised speech represen… ▽ More Enhancing speech quality is an indispensable yet difficult task as it is often complicated by a range of degradation factors. In addition to additive noise, reverberation, clip**, and speech attenuation can all adversely affect speech quality. Speech restoration aims to recover speech components from these distortions. This paper focuses on exploring the impact of self-supervised speech representation learning on the speech restoration task. Specifically, we employ speech representation in various speech restoration networks and evaluate their performance under complicated distortion scenarios. Our experiments demonstrate that the contextual information provided by the self-supervised speech representation can enhance speech restoration performance in various distortion scenarios, while also increasing robustness against the duration of speech attenuation and mismatched test conditions. △ Less

Submitted 30 May, 2023; originally announced May 2023.

Comments: To be presented at ICASSP 2023

arXiv:2305.08878 [pdf, other]

Learning to Learn Unlearned Feature for Brain Tumor Segmentation

Authors: Seungyub Han, Yeongmo Kim, Seokhyeon Ha, Jungwoo Lee, Seunghong Choi

Abstract: We propose a fine-tuning algorithm for brain tumor segmentation that needs only a few data samples and helps networks not to forget the original tasks. Our approach is based on active learning and meta-learning. One of the difficulties in medical image segmentation is the lack of datasets with proper annotations, because it requires doctors to tag reliable annotation and there are many variants of… ▽ More We propose a fine-tuning algorithm for brain tumor segmentation that needs only a few data samples and helps networks not to forget the original tasks. Our approach is based on active learning and meta-learning. One of the difficulties in medical image segmentation is the lack of datasets with proper annotations, because it requires doctors to tag reliable annotation and there are many variants of a disease, such as glioma and brain metastasis, which are the different types of brain tumor and have different structural features in MR images. Therefore, it is impossible to produce the large-scale medical image datasets for all types of diseases. In this paper, we show a transfer learning method from high grade glioma to brain metastasis, and demonstrate that the proposed algorithm achieves balanced parameters for both glioma and brain metastasis domains within a few steps. △ Less

Submitted 13 May, 2023; originally announced May 2023.

Comments: Medical Imaging Meets NeurIPS 2018

arXiv:2304.08707 [pdf, other]

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling

Authors: Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

Abstract: We propose FSB-LSTM, a novel long short-term memory (LSTM) based architecture that integrates full- and sub-band (FSB) modeling, for single- and multi-channel speech enhancement in the short-time Fourier transform (STFT) domain. The model maintains an information highway to flow an over-complete input representation through multiple FSB-LSTM modules. Each FSB-LSTM module consists of a full-band bl… ▽ More We propose FSB-LSTM, a novel long short-term memory (LSTM) based architecture that integrates full- and sub-band (FSB) modeling, for single- and multi-channel speech enhancement in the short-time Fourier transform (STFT) domain. The model maintains an information highway to flow an over-complete input representation through multiple FSB-LSTM modules. Each FSB-LSTM module consists of a full-band block to model spectro-temporal patterns at all frequencies and a sub-band block to model patterns within each sub-band, where each of the two blocks takes a down-sampled representation as input and returns an up-sampled discriminative representation to be added to the block input via a residual connection. The model is designed to have a low algorithmic complexity, a small run-time buffer and a very low algorithmic latency, at the same time producing a strong enhancement performance on a noisy-reverberant speech enhancement task even if the hop size is as low as $2$ ms. △ Less

Submitted 17 April, 2023; originally announced April 2023.

Comments: in ICASSP 2023

arXiv:2304.02389 [pdf, other]

DRAC: Diabetic Retinopathy Analysis Challenge with Ultra-Wide Optical Coherence Tomography Angiography Images

Authors: Bo Qian, Hao Chen, Xiangning Wang, Haoxuan Che, Gitaek Kwon, Jaeyoung Kim, Sung** Choi, Seoyoung Shin, Felix Krause, Markus Unterdechler, Junlin Hou, Rui Feng, Yihao Li, Mostafa El Habib Daho, Qiang Wu, ** Zhang, Xiaokang Yang, Yiyu Cai, Wei** Jia, Huating Li, Bin Sheng

Abstract: Computer-assisted automatic analysis of diabetic retinopathy (DR) is of great importance in reducing the risks of vision loss and even blindness. Ultra-wide optical coherence tomography angiography (UW-OCTA) is a non-invasive and safe imaging modality in DR diagnosis system, but there is a lack of publicly available benchmarks for model development and evaluation. To promote further research and s… ▽ More Computer-assisted automatic analysis of diabetic retinopathy (DR) is of great importance in reducing the risks of vision loss and even blindness. Ultra-wide optical coherence tomography angiography (UW-OCTA) is a non-invasive and safe imaging modality in DR diagnosis system, but there is a lack of publicly available benchmarks for model development and evaluation. To promote further research and scientific benchmarking for diabetic retinopathy analysis using UW-OCTA images, we organized a challenge named "DRAC - Diabetic Retinopathy Analysis Challenge" in conjunction with the 25th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2022). The challenge consists of three tasks: segmentation of DR lesions, image quality assessment and DR grading. The scientific community responded positively to the challenge, with 11, 12, and 13 teams from geographically diverse institutes submitting different solutions in these three tasks, respectively. This paper presents a summary and analysis of the top-performing solutions and results for each task of the challenge. The obtained results from top algorithms indicate the importance of data augmentation, model architecture and ensemble of networks in improving the performance of deep learning models. These findings have the potential to enable new developments in diabetic retinopathy analysis. The challenge remains open for post-challenge registrations and submissions for benchmarking future methodology developments. △ Less

Submitted 5 April, 2023; originally announced April 2023.

arXiv:2304.00471 [pdf, other]

A Unified Compression Framework for Efficient Speech-Driven Talking-Face Generation

Authors: Bo-Kyeong Kim, Jaemin Kang, Daeun Seo, Hancheol Park, Shinkook Choi, Hyoung-Kyu Song, Hyungshin Kim, Sungsu Lim

Abstract: Virtual humans have gained considerable attention in numerous industries, e.g., entertainment and e-commerce. As a core technology, synthesizing photorealistic face frames from target speech and facial identity has been actively studied with generative adversarial networks. Despite remarkable results of modern talking-face generation models, they often entail high computational burdens, which limi… ▽ More Virtual humans have gained considerable attention in numerous industries, e.g., entertainment and e-commerce. As a core technology, synthesizing photorealistic face frames from target speech and facial identity has been actively studied with generative adversarial networks. Despite remarkable results of modern talking-face generation models, they often entail high computational burdens, which limit their efficient deployment. This study aims to develop a lightweight model for speech-driven talking-face synthesis. We build a compact generator by removing the residual blocks and reducing the channel width from Wav2Lip, a popular talking-face generator. We also present a knowledge distillation scheme to stably yet effectively train the small-capacity generator without adversarial learning. We reduce the number of parameters and MACs by 28$\times$ while retaining the performance of the original model. Moreover, to alleviate a severe performance drop when converting the whole generator to INT8 precision, we adopt a selective quantization method that uses FP16 for the quantization-sensitive layers and INT8 for the other layers. Using this mixed precision, we achieve up to a 19$\times$ speedup on edge GPUs without noticeably compromising the generation quality. △ Less

Submitted 28 April, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

Comments: MLSys Workshop on On-Device Intelligence, 2023; Demo: https://huggingface.co/spaces/nota-ai/compressed_wav2lip

arXiv:2303.16511 [pdf, other]

Joint unsupervised and supervised learning for context-aware language identification

Authors: **seok Park, Hyung Yong Kim, Jihwan Park, Byeong-Yeol Kim, Shukjae Choi, Yunkyu Lim

Abstract: Language identification (LID) recognizes the language of a spoken utterance automatically. According to recent studies, LID models trained with an automatic speech recognition (ASR) task perform better than those trained with a LID task only. However, we need additional text labels to train the model to recognize speech, and acquiring the text labels is a cost high. In order to overcome this probl… ▽ More Language identification (LID) recognizes the language of a spoken utterance automatically. According to recent studies, LID models trained with an automatic speech recognition (ASR) task perform better than those trained with a LID task only. However, we need additional text labels to train the model to recognize speech, and acquiring the text labels is a cost high. In order to overcome this problem, we propose context-aware language identification using a combination of unsupervised and supervised learning without any text labels. The proposed method learns the context of speech through masked language modeling (MLM) loss and simultaneously trains to determine the language of the utterance with supervised learning loss. The proposed joint learning was found to reduce the error rate by 15.6% compared to the same structure model trained by supervised-only learning on a subset of the VoxLingua107 dataset consisting of sub-three-second utterances in 11 languages. △ Less

Submitted 14 April, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.07592 [pdf, other]

Lightweight feature encoder for wake-up word detection based on self-supervised speech representation

Authors: Hyungjun Lim, Younggwan Kim, Kiho Yeom, Eunjoo Seo, Hoodong Lee, Stanley Jungkyu Choi, Honglak Lee

Abstract: Self-supervised learning method that provides generalized speech representations has recently received increasing attention. Wav2vec 2.0 is the most famous example, showing remarkable performance in numerous downstream speech processing tasks. Despite its success, it is challenging to use it directly for wake-up word detection on mobile devices due to its expensive computational cost. In this work… ▽ More Self-supervised learning method that provides generalized speech representations has recently received increasing attention. Wav2vec 2.0 is the most famous example, showing remarkable performance in numerous downstream speech processing tasks. Despite its success, it is challenging to use it directly for wake-up word detection on mobile devices due to its expensive computational cost. In this work, we propose LiteFEW, a lightweight feature encoder for wake-up word detection that preserves the inherent ability of wav2vec 2.0 with a minimum scale. In the method, the knowledge of the pre-trained wav2vec 2.0 is compressed by introducing an auto-encoder-based dimensionality reduction technique and distilled to LiteFEW. Experimental results on the open-source "Hey Snips" dataset show that the proposed method applied to various model structures significantly improves the performance, achieving over 20% of relative improvements with only 64k parameters. △ Less

Submitted 13 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.01105 [pdf, other]

Evidence-empowered Transfer Learning for Alzheimer's Disease

Authors: Kai Tzu-iunn Ong, Hana Kim, Min** Kim, **seong Jang, Beomseok Sohn, Yoon Seong Choi, Dosik Hwang, Seong Jae Hwang, **young Yeo

Abstract: Transfer learning has been widely utilized to mitigate the data scarcity problem in the field of Alzheimer's disease (AD). Conventional transfer learning relies on re-using models trained on AD-irrelevant tasks such as natural image classification. However, it often leads to negative transfer due to the discrepancy between the non-medical source and target medical domains. To address this, we pres… ▽ More Transfer learning has been widely utilized to mitigate the data scarcity problem in the field of Alzheimer's disease (AD). Conventional transfer learning relies on re-using models trained on AD-irrelevant tasks such as natural image classification. However, it often leads to negative transfer due to the discrepancy between the non-medical source and target medical domains. To address this, we present evidence-empowered transfer learning for AD diagnosis. Unlike conventional approaches, we leverage an AD-relevant auxiliary task, namely morphological change prediction, without requiring additional MRI data. In this auxiliary task, the diagnosis model learns the evidential and transferable knowledge from morphological features in MRI scans. Experimental results demonstrate that our framework is not only effective in improving detection performance regardless of model capacity, but also more data-efficient and faithful. △ Less

Submitted 17 April, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

Comments: Accepted to IEEE International Symposium on Biomedical Imaging (ISBI) 2023. The authorship was changed from co-first authors to a single first author, which was authorized by the adviser/corresponding author **young Yeo (Apr 18th, 2023)

arXiv:2211.15948 [pdf, other]

Neural Vocoder Feature Estimation for Dry Singing Voice Separation

Authors: Jaekwon Im, Soonbeom Choi, Sangeon Yong, Juhan Nam

Abstract: Singing voice separation (SVS) is a task that separates singing voice audio from its mixture with instrumental audio. Previous SVS studies have mainly employed the spectrogram masking method which requires a large dimensionality in predicting the binary masks. In addition, they focused on extracting a vocal stem that retains the wet sound with the reverberation effect. This result may hinder the r… ▽ More Singing voice separation (SVS) is a task that separates singing voice audio from its mixture with instrumental audio. Previous SVS studies have mainly employed the spectrogram masking method which requires a large dimensionality in predicting the binary masks. In addition, they focused on extracting a vocal stem that retains the wet sound with the reverberation effect. This result may hinder the reusability of the isolated singing voice. This paper addresses the issues by predicting mel-spectrogram of dry singing voices from the mixed audio as neural vocoder features and synthesizing the singing voice waveforms from the neural vocoder. We experimented with two separation methods. One is predicting binary masks in the mel-spectrogram domain and the other is directly predicting the mel-spectrogram. Furthermore, we add a singing voice detector to identify the singing voice segments over time more explicitly. We measured the model performance in terms of audio, dereverberation, separation, and overall quality. The results show that our proposed model outperforms state-of-the-art singing voice separation models in both objective and subjective evaluation except the audio quality. △ Less

Submitted 29 November, 2022; originally announced November 2022.

Comments: 6 pages, 4 figures

Journal ref: 14th Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2022

arXiv:2211.12433 [pdf, other]

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

Authors: Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

Abstract: We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral map**, where the real and imaginary (RI)… ▽ More We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral map**, where the real and imaginary (RI) components of input signals are stacked as features to predict target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation and dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on monaural reverberant speaker separation using the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-microphone complex spectral map**, and integrate it into a two-DNN system with a beamformer in between (named as MISO-BF-MISO in earlier studies), where the beamformer proposed in this paper is a novel multi-frame Wiener filter computed based on the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and noisy-reverberant speech enhancement. State-of-the-art performance is obtained on a dereverberation dataset and on the dataset of the recent L3DAS22 multi-channel speech enhancement challenge. △ Less

Submitted 4 August, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: In IEEE/ACM Transactions on Audio, Speech, and Language Processing. A sound demo is available at https://zqwang7.github.io/demos/TF-GridNet-demo/index.html, and the code is available at https://github.com/espnet/espnet/pull/5395

arXiv:2210.09135 [pdf, other]

Gated Recurrent Unit for Video Denoising

Authors: Kai Guo, Seungwon Choi, Jongseong Choi

Abstract: Current video denoising methods perform temporal fusion by designing convolutional neural networks (CNN) or combine spatial denoising with temporal fusion into basic recurrent neural networks (RNNs). However, there have not yet been works which adapt gated recurrent unit (GRU) mechanisms for video denoising. In this letter, we propose a new video denoising model based on GRU, namely GRU-VD. First,… ▽ More Current video denoising methods perform temporal fusion by designing convolutional neural networks (CNN) or combine spatial denoising with temporal fusion into basic recurrent neural networks (RNNs). However, there have not yet been works which adapt gated recurrent unit (GRU) mechanisms for video denoising. In this letter, we propose a new video denoising model based on GRU, namely GRU-VD. First, the reset gate is employed to mark the content related to the current frame in the previous frame output. Then the hidden activation works as an initial spatial-temporal denoising with the help from the marked relevant content. Finally, the update gate recursively fuses the initial denoised result with previous frame output to further increase accuracy. To handle various light conditions adaptively, the noise standard deviation of the current frame is also fed to these three modules. A weighted loss is adopted to regulate initial denoising and final fusion at the same time. The experimental results show that the GRU-VD network not only can achieve better quality than state of the arts objectively and subjectively, but also can obtain satisfied subjective quality on real video. △ Less

Submitted 17 October, 2022; originally announced October 2022.

Comments: 5 pages, 5 figures

MSC Class: 62H35; 68U10 ACM Class: I.4.4

arXiv:2209.03952 [pdf, other]

TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation

Authors: Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

Abstract: We propose TF-GridNet, a novel multi-path deep neural network (DNN) operating in the time-frequency (T-F) domain, for monaural talker-independent speaker separation in anechoic conditions. The model stacks several multi-path blocks, each consisting of an intra-frame spectral module, a sub-band temporal module, and a full-band self-attention module, to leverage local and global spectro-temporal inf… ▽ More We propose TF-GridNet, a novel multi-path deep neural network (DNN) operating in the time-frequency (T-F) domain, for monaural talker-independent speaker separation in anechoic conditions. The model stacks several multi-path blocks, each consisting of an intra-frame spectral module, a sub-band temporal module, and a full-band self-attention module, to leverage local and global spectro-temporal information for separation. The model is trained to perform complex spectral map**, where the real and imaginary (RI) components of the input mixture are stacked as input features to predict target RI components. Besides using the scale-invariant signal-to-distortion ratio (SI-SDR) loss for model training, we include a novel loss term to encourage separated sources to add up to the input mixture. Without using dynamic mixing, we obtain 23.4 dB SI-SDR improvement (SI-SDRi) on the WSJ0-2mix dataset, outperforming the previous best by a large margin. △ Less

Submitted 15 March, 2023; v1 submitted 8 September, 2022; originally announced September 2022.

Comments: in IEEE ICASSP 2023

arXiv:2208.03718 [pdf, other]

doi 10.1016/j.tre.2023.103213

Optimal Parking Planning for Shared Autonomous Vehicles

Authors: Seong** Choi, **woo Lee

Abstract: Parking is a crucial element of the driving experience in urban transportation systems. Especially in the coming era of Shared Autonomous Vehicles (SAVs), parking operations in urban transportation networks will inevitably change. Parking stations will serve as storage places for unused vehicles and depots that control the level-of-service of SAVs. This study presents an Analytical Parking Plannin… ▽ More Parking is a crucial element of the driving experience in urban transportation systems. Especially in the coming era of Shared Autonomous Vehicles (SAVs), parking operations in urban transportation networks will inevitably change. Parking stations will serve as storage places for unused vehicles and depots that control the level-of-service of SAVs. This study presents an Analytical Parking Planning Model (APPM) for the SAV environment to provide broader insights into parking planning decisions. Two specific planning scenarios are considered for the APPM: (i) Single-zone APPM (S-APPM), which considers the target area as a single homogeneous zone, and (ii) Two-zone APPM (T-APPM), which considers the target area as two different zones, such as city center and suburban area. S-APPM offers a closed-form solution to find the optimal density of parking stations and parking spaces and the optimal number of SAV fleets, which is beneficial for understanding the explicit relationship between planning decisions and the given environments, including demand density and cost factors. In addition, to incorporate different macroscopic characteristics across two zones, T-APPM accounts for inter- and intra-zonal passenger trips and the relocation of vehicles. We conduct a case study to demonstrate the proposed method with the actual data collected in Seoul Metropolitan Area, South Korea. Sensitivity analyses with respect to cost factors are performed to provide decision-makers with further insights. Also, we find that the optimal densities of parking stations and spaces in the target area are much lower than the current situations. △ Less

Submitted 7 August, 2022; originally announced August 2022.

Comments: 27 pages, 9 figures, 9 tables

arXiv:2206.12059 [pdf]

Data Augmentation and Squeeze-and-Excitation Network on Multiple Dimension for Sound Event Localization and Detection in Real Scenes

Authors: Byeong-Yun Ko, Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Seung-Deok Choi, Yong-Hwa Park

Abstract: Performance of sound event localization and detection (SELD) in real scenes is limited by small size of SELD dataset, due to difficulty in obtaining sufficient amount of realistic multi-channel audio data recordings with accurate label. We used two main strategies to solve problems arising from the small real SELD dataset. First, we applied various data augmentation methods on all data dimensions:… ▽ More Performance of sound event localization and detection (SELD) in real scenes is limited by small size of SELD dataset, due to difficulty in obtaining sufficient amount of realistic multi-channel audio data recordings with accurate label. We used two main strategies to solve problems arising from the small real SELD dataset. First, we applied various data augmentation methods on all data dimensions: channel, frequency and time. We also propose original data augmentation method named Moderate Mixup in order to simulate situations where noise floor or interfering events exist. Second, we applied Squeeze-and-Excitation block on channel and frequency dimensions to efficiently extract feature characteristics. Result of our trained models on the STARSS22 test dataset achieved the best ER, F1, LE, and LR of 0.53, 49.8%, 16.0deg., and 56.2% respectively. △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: Technical Report submitted for DCASE2022 Challenge Task3

arXiv:2206.11645 [pdf, ps, other]

Frequency Dependent Sound Event Detection for DCASE 2022 Challenge Task 4

Authors: Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Byeong-Yun Ko, Seung-Deok Choi, Yong-Hwa Park

Abstract: While many deep learning methods on other domains have been applied to sound event detection (SED), differences between original domains of the methods and SED have not been appropriately considered so far. As SED uses audio data with two dimensions (time and frequency) for input, thorough comprehension on these two dimensions is essential for application of methods from other domains on SED. Prev… ▽ More While many deep learning methods on other domains have been applied to sound event detection (SED), differences between original domains of the methods and SED have not been appropriately considered so far. As SED uses audio data with two dimensions (time and frequency) for input, thorough comprehension on these two dimensions is essential for application of methods from other domains on SED. Previous works proved that methods those address on frequency dimension are especially powerful in SED. By applying FilterAugment and frequency dynamic convolution those are frequency dependent methods proposed to enhance SED performance, our submitted models achieved best PSDS1 of 0.4704 and best PSDS2 of 0.8224. △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: Technical Reprot submitted for DCASE2022 Challenge Task4

arXiv:2206.03612 [pdf, other]

Predictive Modeling of Charge Levels for Battery Electric Vehicles using CNN EfficientNet and IGTD Algorithm

Authors: Seongwoo Choi, Chongzhou Fang, David Haddad, Minsung Kim

Abstract: Convolutional Neural Networks (CNN) have been a good solution for understanding a vast image dataset. As the increased number of battery-equipped electric vehicles is flourishing globally, there has been much research on understanding which charge levels electric vehicle drivers would choose to charge their vehicles to get to their destination without any prevention. We implemented deep learning a… ▽ More Convolutional Neural Networks (CNN) have been a good solution for understanding a vast image dataset. As the increased number of battery-equipped electric vehicles is flourishing globally, there has been much research on understanding which charge levels electric vehicle drivers would choose to charge their vehicles to get to their destination without any prevention. We implemented deep learning approaches to analyze the tabular datasets to understand their state of charge and which charge levels they would choose. In addition, we implemented the Image Generator for Tabular Dataset algorithm to utilize tabular datasets as image datasets to train convolutional neural networks. Also, we integrated other CNN architecture such as EfficientNet to prove that CNN is a great learner for reading information from images that were converted from the tabular dataset, and able to predict charge levels for battery-equipped electric vehicles. We also evaluated several optimization methods to enhance the learning rate of the models and examined further analysis on improving the model architecture. △ Less

Submitted 7 June, 2022; originally announced June 2022.

arXiv:2204.10836 [pdf, other]

doi 10.1038/s41467-022-33407-5

Federated Learning Enables Big Data for Rare Cancer Boundary Detection

Authors: Sarthak Pati, Ujjwal Baid, Brandon Edwards, Micah Sheller, Shih-Han Wang, G Anthony Reina, Patrick Foley, Alexey Gruzdev, Deepthi Karkada, Christos Davatzikos, Chiharu Sako, Satyam Ghodasara, Michel Bilello, Suyash Mohan, Philipp Vollmuth, Gianluca Brugnara, Chandrakanth J Preetha, Felix Sahm, Klaus Maier-Hein, Maximilian Zenk, Martin Bendszus, Wolfgang Wick, Evan Calabrese, Jeffrey Rudie, Javier Villanueva-Meyer , et al. (254 additional authors not shown)

Abstract: Although machine learning (ML) has shown promise in numerous domains, there are concerns about generalizability to out-of-sample data. This is currently addressed by centrally sharing ample, and importantly diverse, data from multiple sites. However, such centralization is challenging to scale (or even not feasible) due to various limitations. Federated ML (FL) provides an alternative to train acc… ▽ More Although machine learning (ML) has shown promise in numerous domains, there are concerns about generalizability to out-of-sample data. This is currently addressed by centrally sharing ample, and importantly diverse, data from multiple sites. However, such centralization is challenging to scale (or even not feasible) due to various limitations. Federated ML (FL) provides an alternative to train accurate and generalizable ML models, by only sharing numerical model updates. Here we present findings from the largest FL study to-date, involving data from 71 healthcare institutions across 6 continents, to generate an automatic tumor boundary detector for the rare disease of glioblastoma, utilizing the largest dataset of such patients ever used in the literature (25,256 MRI scans from 6,314 patients). We demonstrate a 33% improvement over a publicly trained model to delineate the surgically targetable tumor, and 23% improvement over the tumor's entire extent. We anticipate our study to: 1) enable more studies in healthcare informed by large and diverse data, ensuring meaningful results for rare diseases and underrepresented populations, 2) facilitate further quantitative analyses for glioblastoma via performance optimization of our consensus model for eventual public release, and 3) demonstrate the effectiveness of FL at such scale and task complexity as a paradigm shift for multi-site collaborations, alleviating the need for data sharing. △ Less

Submitted 25 April, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

Comments: federated learning, deep learning, convolutional neural network, segmentation, brain tumor, glioma, glioblastoma, FeTS, BraTS

arXiv:2204.03778 [pdf, other]

Mitigating Mismatch Compression in Differential Local Field Potentials

Authors: Vineet Tiruvadi, Sam James, Bryan Howell, Mosadoluwa Obatusin, Andrea Crowell, Patricio Riva-Posse, Ki Sueng Choi, Allison Waters, Robert E. Gross, Cameron C. McIntyre, Helen S. Mayberg, Robert Butera

Abstract: Bidirectional deep brain stimulation (bdDBS) devices capable of recording differential local field potentials (dLFP) enable neural recordings alongside clinical therapy. Efforts to identify objective signals of various brain disorders, or disease readouts, are challenging in dLFP, especially during active DBS. In this report we identified, characterized, and mitigated a major source of distortion… ▽ More Bidirectional deep brain stimulation (bdDBS) devices capable of recording differential local field potentials (dLFP) enable neural recordings alongside clinical therapy. Efforts to identify objective signals of various brain disorders, or disease readouts, are challenging in dLFP, especially during active DBS. In this report we identified, characterized, and mitigated a major source of distortion in dLFP that we introduce as mismatch compression (MC). MC occurs secondary to impedance mismatches across the dLFP channel resulting in incomplete rejection of artifacts and downstream amplifier gain compression. Using in silico and in vitro models we demonstrate that MC accounts for impedance-related distortions sensitive to DBS amplitude. We then use these models to develop and validate a mitigation strategy for MC that is provided as an opensource library for more reliable oscillatory disease readouts. △ Less

Submitted 7 April, 2022; originally announced April 2022.

Comments: 9 pages, 9 figures

arXiv:2203.10047 [pdf, ps, other]

High-Density Coding Scheme for SWIPT Systems

Authors: Dongheon Lee, Gyuyeol Kong, Jang-Won Lee, Sooyong Choi

Abstract: In this study, a novel coding scheme called highdensity coding based on high-density codebooks using a genetic local search algorithm is proposed. The high-density codebook maximizes the energy transfer capability by maximizing the ratio of 1 in the codebook while satisfying the conditions of a codeword with length n, a codebook with 2k codewords, and a minimum Hamming distance of the codebook of… ▽ More In this study, a novel coding scheme called highdensity coding based on high-density codebooks using a genetic local search algorithm is proposed. The high-density codebook maximizes the energy transfer capability by maximizing the ratio of 1 in the codebook while satisfying the conditions of a codeword with length n, a codebook with 2k codewords, and a minimum Hamming distance of the codebook of d. Furthermore, the proposed high-density codebook provides a trade-off between the throughput and harvested energy with respect to n, k, and d. The block error rate performances of the designed highdensity codebooks are derived theoretically and compared with the simulation results. The simulation results indicate that as d and k decrease, the throughput decreases by a maximum of 10% and 40%, whereas the harvested energy per time increases by a maximum of 40% and 100%, respectively. When n increases, the throughput decreases by a maximum of 30%, while the harvested energy per time increases by a maximum of 110%. With the proposed high-density coding scheme, the throughput and harvested energy at the user can be controlled adaptively according to the system requirements. △ Less

Submitted 18 March, 2022; originally announced March 2022.

arXiv:2203.03166 [pdf]

HRTF measurement for accurate sound localization cues

Authors: Gyeong-Tae Lee, Sang-Min Choi, Byeong-Yun Ko, Yong-Hwa Park

Abstract: A new database of head-related transfer functions (HRTFs) for accurate sound source localization is presented through precise measurement and post-processing in terms of improved frequency bandwidth and causality of head-related impulse responses (HRIRs) for accurate spectral cue (SC) and interaural time difference (ITD), respectively. The improvement effects of the proposed methods on binaural so… ▽ More A new database of head-related transfer functions (HRTFs) for accurate sound source localization is presented through precise measurement and post-processing in terms of improved frequency bandwidth and causality of head-related impulse responses (HRIRs) for accurate spectral cue (SC) and interaural time difference (ITD), respectively. The improvement effects of the proposed methods on binaural sound localization cues were investigated. To achieve sufficient frequency bandwidth with a single source, a one-way sealed speaker module was designed to obtain wide band frequency response based on electro-acoustics, whereas most existing HRTF databases rely on a two-way vented loudspeaker that has multiple sources. The origin transfer function at the head center was obtained by the proposed measurement scheme using a 0 degree on-axis microphone to ensure accurate spectral cue pattern of HRTFs, whereas in the previous measurements with a 90 degree off-axis microphone, the magnitude response of the origin transfer function fluctuated and decreased with increasing frequency, causing erroneous SCs of HRTFs. To prevent discontinuity of ITD due to non-causality of ipsilateral HRTFs, obtained HRIRs were circularly shifted by time delay considering the head radius of the measurement subject. Finally, various sound localization cues such as ITD, interaural level difference (ILD), SC, and horizontal plane directivity (HPD) were derived from the presented HRTFs, and improvements on binaural sound localization cues were examined. As a result, accurate SC patterns of HRTFs were confirmed through the proposed measurement scheme using the 0 degree on-axis microphone, and continuous ITD patterns were obtained due to the non-causality compensation. Source codes and presented HRTF database are available to relevant research groups at GitHub (https://github.com/han-saram/HRTF-HATS-KAIST). △ Less

Submitted 5 April, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

Comments: 39 pages, 27 figures, and 1 table

arXiv:2202.04328 [pdf, other]

CAU_KU team's submission to ADD 2022 Challenge task 1: Low-quality fake audio detection through frequency feature masking

Authors: Il-Youp Kwak, Sunmook Choi, Jonghoon Yang, Yerin Lee, Seungsang Oh

Abstract: This technical report describes Chung-Ang University and Korea University (CAU_KU) team's model participating in the Audio Deep Synthesis Detection (ADD) 2022 Challenge, track 1: Low-quality fake audio detection. For track 1, we propose a frequency feature masking (FFM) augmentation technique to deal with a low-quality audio environment. %detection that spectrogram-based models can be applied. We… ▽ More This technical report describes Chung-Ang University and Korea University (CAU_KU) team's model participating in the Audio Deep Synthesis Detection (ADD) 2022 Challenge, track 1: Low-quality fake audio detection. For track 1, we propose a frequency feature masking (FFM) augmentation technique to deal with a low-quality audio environment. %detection that spectrogram-based models can be applied. We applied FFM and mixup augmentation on five spectrogram-based deep neural network architectures that performed well for spoofing detection using mel-spectrogram and constant Q transform (CQT) features. Our best submission achieved 23.8% of EER ranked 3rd on track 1. △ Less

Submitted 9 February, 2022; originally announced February 2022.

arXiv:2201.06735 [pdf]

AI Augmented Digital Metal Component

Authors: Eunhyeok Seo, Hyokyung Sung, Hayeol Kim, Taekyeong Kim, Sangeun Park, Min Sik Lee, Seung Ki Moon, Jung Gi Kim, Hayoung Chung, Seong-Kyum Choi, Ji-hun Yu, Kyung Tae Kim, Seong ** Park, Namhun Kim, Im Doo Jung

Abstract: The aim of this work is to propose a new paradigm that imparts intelligence to metal parts with the fusion of metal additive manufacturing and artificial intelligence (AI). Our digital metal part classifies the status with real time data processing with convolutional neural network (CNN). The training data for the CNN is collected from a strain gauge embedded in metal parts by laser powder bed fus… ▽ More The aim of this work is to propose a new paradigm that imparts intelligence to metal parts with the fusion of metal additive manufacturing and artificial intelligence (AI). Our digital metal part classifies the status with real time data processing with convolutional neural network (CNN). The training data for the CNN is collected from a strain gauge embedded in metal parts by laser powder bed fusion process. We implement this approach using additive manufacturing, demonstrate a self-cognitive metal part recognizing partial screw loosening, malfunctioning, and external impacting object. The results indicate that metal part can recognize subtle change of multiple fixation state under repetitive compression with 89.1% accuracy with test sets. The proposed strategy showed promising potential in contributing to the hyper-connectivity for next generation of digital metal based mechanical systems △ Less

Submitted 17 January, 2022; originally announced January 2022.

Comments: 46 pages

arXiv:2112.02896 [pdf, other]

Tunable Image Quality Control of 3-D Ultrasound using Switchable CycleGAN

Authors: Jaeyoung Huh, Shujaat Khan, Sung** Choi, Dongkuk Shin, Eun Sun Lee, Jong Chul Ye

Abstract: In contrast to 2-D ultrasound (US) for uniaxial plane imaging, a 3-D US imaging system can visualize a volume along three axial planes. This allows for a full view of the anatomy, which is useful for gynecological (GYN) and obstetrical (OB) applications. Unfortunately, the 3-D US has an inherent limitation in resolution compared to the 2-D US. In the case of 3-D US with a 3-D mechanical probe, for… ▽ More In contrast to 2-D ultrasound (US) for uniaxial plane imaging, a 3-D US imaging system can visualize a volume along three axial planes. This allows for a full view of the anatomy, which is useful for gynecological (GYN) and obstetrical (OB) applications. Unfortunately, the 3-D US has an inherent limitation in resolution compared to the 2-D US. In the case of 3-D US with a 3-D mechanical probe, for example, the image quality is comparable along the beam direction, but significant deterioration in image quality is often observed in the other two axial image planes. To address this, here we propose a novel unsupervised deep learning approach to improve 3-D US image quality. In particular, using {\em unmatched} high-quality 2-D US images as a reference, we trained a recently proposed switchable CycleGAN architecture so that every map** plane in 3-D US can learn the image quality of 2-D US images. Thanks to the switchable architecture, our network can also provide real-time control of image enhancement level based on user preference, which is ideal for a user-centric scanner setup. Extensive experiments with clinical evaluation confirm that our method offers significantly improved image quality as well user-friendly flexibility. △ Less

Submitted 6 December, 2021; originally announced December 2021.

arXiv:2110.06546 [pdf, other]

A Melody-Unsupervision Model for Singing Voice Synthesis

Authors: Soonbeom Choi, Juhan Nam

Abstract: Recent studies in singing voice synthesis have achieved high-quality results leveraging advances in text-to-speech models based on deep neural networks. One of the main issues in training singing voice synthesis models is that they require melody and lyric labels to be temporally aligned with audio data. The temporal alignment is a time-exhausting manual work in preparing for the training data. To… ▽ More Recent studies in singing voice synthesis have achieved high-quality results leveraging advances in text-to-speech models based on deep neural networks. One of the main issues in training singing voice synthesis models is that they require melody and lyric labels to be temporally aligned with audio data. The temporal alignment is a time-exhausting manual work in preparing for the training data. To address the issue, we propose a melody-unsupervision model that requires only audio-and-lyrics pairs without temporal alignment in training time but generates singing voice audio given a melody and lyrics input in inference time. The proposed model is composed of a phoneme classifier and a singing voice generator jointly trained in an end-to-end manner. The model can be fine-tuned by adjusting the amount of supervision with temporally aligned melody labels. Through experiments in melody-unsupervision and semi-supervision settings, we compare the audio quality of synthesized singing voice. We also show that the proposed model is capable of being trained with speech audio and text labels but can generate singing voice in inference time. △ Less

Submitted 14 April, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: ICASSP 2022

arXiv:2109.12253 [pdf]

Development of Safety Monitoring System of Connected and Automated Vehicles considering the Trade-off between Communication Efficiency and Data Reliability

Authors: Sehyun Tak, Seong** Choi

Abstract: The safety of urban transportation systems is considered a public health issue worldwide, and many researchers have contributed to improving it. Connected automated vehicles (CAVs) and cooperative intelligent transportation systems (C-ITSs) are considered solutions to ensure urban transportation systems' safety using various sensors and communication devices. However, it is found difficult to depl… ▽ More The safety of urban transportation systems is considered a public health issue worldwide, and many researchers have contributed to improving it. Connected automated vehicles (CAVs) and cooperative intelligent transportation systems (C-ITSs) are considered solutions to ensure urban transportation systems' safety using various sensors and communication devices. However, it is found difficult to deploy the C-ITS framework in South Korea, because CAVs produce a massive amount of data every minute but it cannot be transmitted via existing communication network. Thus, raw data must be sampled to reduce the size of the data over communication network and transmitted to the server for further processing. On the other hand, the sampled data must be highly accurate to ensure the safety of different agents in C-ITS. Thus, in this study, we designed and developed a C-ITS architecture and data flow, including messages and protocols for the safety monitoring system of CAVs, and determined the optimal sampling interval for data transmission while considering the trade-off between communication efficiency and reliability of safety performance indicators. Three safety performance indicators were introduced: severe deceleration, lateral position variance, and inverse time to collision. A field test is conducted to collect data from various sensors installed in the CAV, determining the optimal sampling interval. Kolmogorov-Smirnov test is conducted to ensure statistical consistency between sampled and raw datasets. The effects of the sampling interval on message delay, data accuracy, and communication efficiency were analyzed. △ Less

Submitted 30 September, 2021; v1 submitted 24 September, 2021; originally announced September 2021.

Comments: 13 pages, 16 figures, 1 table

arXiv:2108.10147 [pdf, other]

Spatio-Temporal Split Learning for Privacy-Preserving Medical Platforms: Case Studies with COVID-19 CT, X-Ray, and Cholesterol Data

Authors: Yoo Jeong Ha, Minjae Yoo, Gusang Lee, Soyi Jung, Sae Won Choi, Joongheon Kim, Seehwan Yoo

Abstract: Machine learning requires a large volume of sample data, especially when it is used in high-accuracy medical applications. However, patient records are one of the most sensitive private information that is not usually shared among institutes. This paper presents spatio-temporal split learning, a distributed deep neural network framework, which is a turning point in allowing collaboration among pri… ▽ More Machine learning requires a large volume of sample data, especially when it is used in high-accuracy medical applications. However, patient records are one of the most sensitive private information that is not usually shared among institutes. This paper presents spatio-temporal split learning, a distributed deep neural network framework, which is a turning point in allowing collaboration among privacy-sensitive organizations. Our spatio-temporal split learning presents how distributed machine learning can be efficiently conducted with minimal privacy concerns. The proposed split learning consists of a number of clients and a centralized server. Each client has only has one hidden layer, which acts as the privacy-preserving layer, and the centralized server comprises the other hidden layers and the output layer. Since the centralized server does not need to access the training data and trains the deep neural network with parameters received from the privacy-preserving layer, privacy of original data is guaranteed. We have coined the term, spatio-temporal split learning, as multiple clients are spatially distributed to cover diverse datasets from different participants, and we can temporally split the learning process, detaching the privacy preserving layer from the rest of the learning process to minimize privacy breaches. This paper shows how we can analyze the medical data whilst ensuring privacy using our proposed multi-site spatio-temporal split learning algorithm on Coronavirus Disease-19 (COVID-19) chest Computed Tomography (CT) scans, MUsculoskeletal RAdiographs (MURA) X-ray images, and cholesterol levels. △ Less

Submitted 20 August, 2021; originally announced August 2021.

arXiv:2108.05543

Development of Simulation-based Lane Change Control System for Autonomous Vehicles

Authors: Seong** Choi

Abstract: Originally, the decision and control of the lane change of the vehicle were on the human driver. In previous studies, the decision-making of lane-changing of the human drivers was mainly used to increase the individual's benefit. However, the lane-changing behavior of these human drivers can sometimes have a bad influence on the overall traffic flow. As technology for autonomous vehicles develop,… ▽ More Originally, the decision and control of the lane change of the vehicle were on the human driver. In previous studies, the decision-making of lane-changing of the human drivers was mainly used to increase the individual's benefit. However, the lane-changing behavior of these human drivers can sometimes have a bad influence on the overall traffic flow. As technology for autonomous vehicles develop, lane changing action as well as lane changing decision making fall within the control category of autonomous vehicles. However, since many of the current lane-changing decision algorithms of autonomous vehicles are based on the human driver model, it is hard to know the potential traffic impact of such lane change. Therefore, in this study, we focused on the decision-making of lane change considering traffic flow, and accordingly, we study the lane change control system considering the whole traffic flow. In this research, the lane change control system predicts the future traffic situation through the cell transition model, one of the most popular macroscopic traffic simulation models, and determines the change probability of each lane that minimizes the total time delay through the genetic algorithm. The lane change control system then conveys the change probability to this vehicle. In the macroscopic simulation result, the proposed control system reduced the overall travel time delay. The proposed system is applied to microscopic traffic simulation, the oversaturated freeway traffic flow algorithm (OFFA), to evaluate the potential performance when it is applied to the actual traffic system. In the traffic flow-density, the maximum traffic flow has been shown to be increased, and the points in the congestion area have also been greatly reduced. Overall, the time required for individual vehicles was reduced. △ Less

Submitted 27 August, 2021; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: bitmapped submission

arXiv:2108.01812 [pdf, other]

Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation

Authors: Seongmin Park, Dongchan Shin, Sangyoun Paik, Subong Choi, Alena Kazakova, Jihwa Lee

Abstract: Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing. While error detection systems often take advantage of statistical language archetypes captured by LMs, at times the pretrained knowledge can hinder error detection performance. For instance, presence of speech disfluencies might confuse the post-processin… ▽ More Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing. While error detection systems often take advantage of statistical language archetypes captured by LMs, at times the pretrained knowledge can hinder error detection performance. For instance, presence of speech disfluencies might confuse the post-processing system into tagging disfluent but accurate transcriptions as ASR errors. Such confusion occurs because both error detection and disfluency detection tasks attempt to identify tokens at statistically unlikely positions. This paper proposes a scheme to improve existing LM-based ASR error detection systems, both in terms of detection scores and resilience to such distracting auxiliary tasks. Our approach adopts the popular mixup method in text feature space and can be utilized with any black-box ASR output. To demonstrate the effectiveness of our method, we conduct post-processing experiments with both traditional and end-to-end ASR systems (both for English and Korean languages) with 5 different speech corpora. We find that our method improves both ASR error detection F 1 scores and reduces the number of correctly transcribed disfluencies wrongly detected as ASR errors. Finally, we suggest methods to utilize resulting LMs directly in semi-supervised ASR training. △ Less

Submitted 3 August, 2021; originally announced August 2021.

arXiv:2107.13260 [pdf]

doi 10.1016/j.eswa.2022.117811

Deep learning based cough detection camera using enhanced features

Authors: Gyeong-Tae Lee, Hyeonuk Nam, Seong-Hu Kim, Sang-Min Choi, Youngkey Kim, Yong-Hwa Park

Abstract: Coughing is a typical symptom of COVID-19. To detect and localize coughing sounds remotely, a convolutional neural network (CNN) based deep learning model was developed in this work and integrated with a sound camera for the visualization of the cough sounds. The cough detection model is a binary classifier of which the input is a two second acoustic feature and the output is one of two inferences… ▽ More Coughing is a typical symptom of COVID-19. To detect and localize coughing sounds remotely, a convolutional neural network (CNN) based deep learning model was developed in this work and integrated with a sound camera for the visualization of the cough sounds. The cough detection model is a binary classifier of which the input is a two second acoustic feature and the output is one of two inferences (Cough or Others). Data augmentation was performed on the collected audio files to alleviate class imbalance and reflect various background noises in practical environments. For effective featuring of the cough sound, conventional features such as spectrograms, mel-scaled spectrograms, and mel-frequency cepstral coefficients (MFCC) were reinforced by utilizing their velocity (V) and acceleration (A) maps in this work. VGGNet, GoogLeNet, and ResNet were simplified to binary classifiers, and were named V-net, G-net, and R-net, respectively. To find the best combination of features and networks, training was performed for a total of 39 cases and the performance was confirmed using the test F1 score. Finally, a test F1 score of 91.9% (test accuracy of 97.2%) was achieved from G-net with the MFCC-V-A feature (named Spectroflow), an acoustic feature effective for use in cough detection. The trained cough detection model was integrated with a sound camera (i.e., one that visualizes sound sources using a beamforming microphone array). In a pilot test, the cough detection camera detected coughing sounds with an F1 score of 90.0% (accuracy of 96.0%), and the cough location in the camera image was tracked in real time. △ Less

Submitted 24 May, 2022; v1 submitted 28 July, 2021; originally announced July 2021.

Comments: 28 pages, 20 figures, and 14 tables

Journal ref: Expert Systems With Applications, Vol. 206, No. 15, pp. 1-20, 2022

arXiv:2107.05004 [pdf, other]

Designing a Robust Carrier Frequency Offset Estimation Scheme for Meeting Target Decoding Performance in an OFDM System

Authors: Minkyeong Jeong, Sang-Won Choi, Juyeop Kim

Abstract: In a target communication system, a delicately designed frequency offset estimation scheme is required to meet certain decoding performance. In this paper, we proposed at wo-step estimation scheme, coarse and residual, with different value of an time interval parameter. A result of RF conduction test shows that the proposed method has an 1dB gain of SNR compared to coarse-only estimator. A result… ▽ More In a target communication system, a delicately designed frequency offset estimation scheme is required to meet certain decoding performance. In this paper, we proposed at wo-step estimation scheme, coarse and residual, with different value of an time interval parameter. A result of RF conduction test shows that the proposed method has an 1dB gain of SNR compared to coarse-only estimator. A result of the commercial test also indicates the proposed method outperforms coarse-only estimator especially in low SNR condition. △ Less

Submitted 19 July, 2021; v1 submitted 11 July, 2021; originally announced July 2021.

Comments: 22 pages, 13 figures

arXiv:2107.04526 [pdf, ps, other]

A Dual-Connection based Handover Scheme for Ultra-Dense Millimeter-Wave Cellular Networks

Authors: Seongjoon Kang, Siyoung Choi, Goodsol Lee, Saewoong Bahk

Abstract: Mobile users in an ultra-dense millimeter-wave cellular network experience handover events more frequently than in conventional networks, which results in increased service interruption time and performance degradation due to blockages. Multi-connectivity has been proposed to resolve this, and it also extends the coverage of millimeter-wave communications. In this paper, we propose a dual-connecti… ▽ More Mobile users in an ultra-dense millimeter-wave cellular network experience handover events more frequently than in conventional networks, which results in increased service interruption time and performance degradation due to blockages. Multi-connectivity has been proposed to resolve this, and it also extends the coverage of millimeter-wave communications. In this paper, we propose a dual-connection based handover scheme for mobile UEs in an environment where they are connected simultaneously with two millimeter-wave cells to overcome frequent handover problems. This scheme allows a mobile UE to choose its serving link between the two mmWave connections according to the measured SINRs and then the corresponding base stations may forward duplicate packets to the UE. We compare our dual-connection based scheme with a conventional single-connection based scheme through ns-3 simulation. The simulation results show that the proposed scheme significantly reduces handover rate and delay. Therefore, we argue that the dual-connection based scheme helps mobile users achieve performance goals they require in ultra-dense cellular environments. △ Less

Submitted 9 July, 2021; originally announced July 2021.

arXiv:2107.03649 [pdf]

Heavily Augmented Sound Event Detection utilizing Weak Predictions

Authors: Hyeonuk Nam, Byeong-Yun Ko, Gyeong-Tae Lee, Seong-Hu Kim, Won-Ho Jung, Sang-Min Choi, Yong-Hwa Park

Abstract: The performances of Sound Event Detection (SED) systems are greatly limited by the difficulty in generating large strongly labeled dataset. In this work, we used two main approaches to overcome the lack of strongly labeled data. First, we applied heavy data augmentation on input features. Data augmentation methods used include not only conventional methods used in speech/audio domains but also our… ▽ More The performances of Sound Event Detection (SED) systems are greatly limited by the difficulty in generating large strongly labeled dataset. In this work, we used two main approaches to overcome the lack of strongly labeled data. First, we applied heavy data augmentation on input features. Data augmentation methods used include not only conventional methods used in speech/audio domains but also our proposed method named FilterAugment. Second, we propose two methods to utilize weak predictions to enhance weakly supervised SED performance. As a result, we obtained the best PSDS1 of 0.4336 and best PSDS2 of 0.8161 on the DESED real validation dataset. This work is submitted to DCASE 2021 Task4 and is ranked on the 3rd place. Code availa-ble: https://github.com/frednam93/FilterAugSED. △ Less

Submitted 14 September, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

Comments: Won 3rd place on IEEE DCASE 2021 Task 4

arXiv:2104.10964 [pdf, other]

doi 10.1109/TSP.2021.3111705

Tracking Cells and their Lineages via Labeled Random Finite Sets

Authors: Tran Thien Dat Nguyen, Ba-Ngu Vo, Ba-Tuong Vo, Du Yong Kim, Yu Suk Choi

Abstract: Determining the trajectories of cells and their lineages or ancestries in live-cell experiments are fundamental to the understanding of how cells behave and divide. This paper proposes novel online algorithms for jointly tracking and resolving lineages of an unknown and time-varying number of cells from time-lapse video data. Our approach involves modeling the cell ensemble as a labeled random fin… ▽ More Determining the trajectories of cells and their lineages or ancestries in live-cell experiments are fundamental to the understanding of how cells behave and divide. This paper proposes novel online algorithms for jointly tracking and resolving lineages of an unknown and time-varying number of cells from time-lapse video data. Our approach involves modeling the cell ensemble as a labeled random finite set with labels representing cell identities and lineages. A spawning model is developed to take into account cell lineages and changes in cell appearance prior to division. We then derive analytic filters to propagate multi-object distributions that contain information on the current cell ensemble including their lineages. We also develop numerical implementations of the resulting multi-object filters. Experiments using simulation, synthetic cell migration video, and real time-lapse sequence, are presented to demonstrate the capability of the solutions. △ Less

Submitted 27 October, 2021; v1 submitted 22 April, 2021; originally announced April 2021.

arXiv:2103.00006 [pdf, other]

Towards Synthesizing Twelve-Lead Electrocardiograms from Two Asynchronous Leads

Authors: Yong-Yeon Jo, Young Sang Choi, Jong-Hwan Jang, Joon-Myoung Kwon

Abstract: The electrocardiogram (ECG) records electrical signals in a non-invasive way to observe the condition of the heart, typically looking at the heart from 12 different directions. Several types of the cardiac disease are diagnosed by using 12-lead ECGs Recently, various wearable devices have enabled immediate access to the ECG without the use of wieldy equipment. However, they only provide ECGs with… ▽ More The electrocardiogram (ECG) records electrical signals in a non-invasive way to observe the condition of the heart, typically looking at the heart from 12 different directions. Several types of the cardiac disease are diagnosed by using 12-lead ECGs Recently, various wearable devices have enabled immediate access to the ECG without the use of wieldy equipment. However, they only provide ECGs with a couple of leads. This results in an inaccurate diagnosis of cardiac disease due to lacking of required leads. We propose a deep generative model for ECG synthesis from two asynchronous leads to ten leads. It first represents a heart condition referring to two leads, and then generates ten leads based on the represented heart condition. Both the rhythm and amplitude of leads generated resemble those of the original ones, while the technique removes noise and the baseline wander appearing in the original leads. As a data augmentation method, our model improves the classification performance of models compared with models using ECGs with only one or two leads. △ Less

Submitted 25 June, 2024; v1 submitted 28 February, 2021; originally announced March 2021.

arXiv:2102.13228 [pdf]

A High-Throughput Multi-Mode LDPC Decoder for 5G NR

Authors: Sina Pourjabar, Gwan S. Choi

Abstract: This paper presents a partially parallel low-density parity-check (LDPC) decoder designed for the 5G New Radio (NR) standard. The design is using a multi-block parallel architecture with a flooding schedule. The decoder can support any code rates and code lengths up to the lifting size Zmax= 96. To compensate for the dropped throughput associated with the smaller Z values, the design can double an… ▽ More This paper presents a partially parallel low-density parity-check (LDPC) decoder designed for the 5G New Radio (NR) standard. The design is using a multi-block parallel architecture with a flooding schedule. The decoder can support any code rates and code lengths up to the lifting size Zmax= 96. To compensate for the dropped throughput associated with the smaller Z values, the design can double and quadruple its parallelism when lifting sizes Z<= 48 and Z<= 24 are selected respectively. Therefore, the decoder can process up to eight frames and restore the throughput to the maximum. To simplify the design's architecture, a new variable node for decoding the extended parity bits present in the lower code rates is proposed. The FPGA implementation of the decoder results in a throughput of 2.1 Gbps decoding the 11/12 code rate. Additionally, the synthesized decoder using the 28 nm TSMC technology, achieves a maximum clock frequency of 526 MHz and a throughput of 13.46 Gbps. The core decoder occupies 1.03 mm2, and the power consumption is 229 mW. △ Less

Submitted 10 March, 2021; v1 submitted 25 February, 2021; originally announced February 2021.

Comments: More explanation added to section II.B. Fig. 3.(b) Revised. Typos corrected

Showing 1–50 of 73 results for author: Choi, S