-
Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation
Authors:
Tsun-An Hsieh,
Heeyoul Choi,
Minje Kim
Abstract:
Recent studies highlight the potential of textual modalities in conditioning the speech separation model's inference process. However, regularization-based methods remain underexplored despite their advantages of not requiring auxiliary text data during the test time. To address this gap, we introduce a timed text-based regularization (TTR) method that uses language model-derived semantics to improve speech separation models. Our approach involves two steps. We begin with two pretrained audio and language models, WavLM and BERT, respectively. Then, a Transformer-based audio summarizer is learned to align the audio and word embeddings and to minimize their gap. The summarizer Transformer, incorporated as a regularizer, promotes the separated sources' alignment with the semantics from the timed text. Experimental results show that the proposed TTR method consistently improves the various objective metrics of the separation results over the unregularized baselines.
Submitted 12 June, 2024;
originally announced June 2024.
-
PAAM: A Framework for Coordinated and Priority-Driven Accelerator Management in ROS 2
Authors:
Daniel Enright,
Yecheng Xiang,
Hyunjong Choi,
Hyoseung Kim
Abstract:
This paper proposes a Priority-driven Accelerator Access Management (PAAM) framework for multi-process robotic applications built on top of the Robot Operating System (ROS) 2 middleware platform. The framework addresses the issue of predictable execution of time- and safety-critical callback chains that require hardware accelerators such as GPUs and TPUs. PAAM provides a standalone ROS executor that acts as an accelerator resource server, arbitrating accelerator access requests from all other callbacks at the application layer. This approach enables coordinated and priority-driven accelerator access management in multi-process robotic systems. The framework design is directly applicable to all types of accelerators and enables granular control over how specific chains access accelerators, making it possible to achieve predictable real-time support for accelerators used by safety-critical callback chains without making changes to underlying accelerator device drivers. The paper shows that PAAM also offers a theoretical analysis that can upper bound the worst-case response time of safety-critical callback chains that necessitate accelerator access. This paper also demonstrates that complex robotic systems with extensive accelerator usage that are integrated with PAAM may achieve up to a 91% reduction in end-to-end response time of their critical callback chains.
Submitted 9 April, 2024;
originally announced April 2024.
-
Creating Aesthetic Sonifications on the Web with SIREN
Authors:
Tristan Peng,
Hongchan Choi,
Jonathan Berger
Abstract:
SIREN is a flexible, extensible, and customizable web-based general-purpose interface for auditory data display (sonification). Designed as a digital audio workstation for sonification, SIREN uses synthesizers written in JavaScript with the Web Audio API to facilitate intuitive mapping of data to auditory parameters for a wide range of purposes.
This paper explores the breadth of sound synthesis techniques supported by SIREN, and details the structure and definition of a SIREN synthesizer module. The paper proposes further development that will increase SIREN's utility.
Submitted 28 March, 2024;
originally announced March 2024.
-
WMMSE-Based Rate Maximization for RIS-Assisted MU-MIMO Systems
Authors:
Hyuckjin Choi,
A. Lee Swindlehurst,
Junil Choi
Abstract:
Reconfigurable intelligent surface (RIS) technology, given its ability to favorably modify wireless communication environments, will play a pivotal role in the evolution of future communication systems. This paper proposes rate maximization techniques for both single-user and multiuser MIMO systems, based on the well-known weighted minimum mean square error (WMMSE) criterion. Using a suitable weight matrix, the WMMSE algorithm tackles an equivalent weighted mean square error (WMSE) minimization problem to achieve the sum-rate maximization. By considering a more practical RIS system model that employs a tensor-based representation enforced by the electromagnetic behavior exhibited by the RIS panel, we detail both the sum-rate maximizing and WMSE minimizing strategies for RIS phase shift optimization by deriving the closed-form gradient of the WMSE and the sum-rate with respect to the RIS phase shift vector. Our simulations reveal that the proposed rate maximization technique, rooted in the WMMSE algorithm, exhibits superior performance when compared to other benchmarks.
Submitted 19 March, 2024;
originally announced March 2024.
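The equivalence between sum-rate maximization and WMSE minimization that the abstract relies on can be written compactly. What follows is the standard single-user form, offered as a sketch rather than the paper's own derivation. The notation is assumed here and does not come from the abstract: precoder $\mathbf{V}$, receiver $\mathbf{U}$, weight matrix $\mathbf{W}$, channel $\mathbf{H}$, noise covariance $\mathbf{R}_n$, and $d$ unit-power data streams $\mathbf{s}$. The dependence of $\mathbf{H}$ on the RIS phase shift vector is omitted.

```latex
\begin{align}
\mathbf{E}(\mathbf{U},\mathbf{V}) &= \mathbb{E}\!\left[(\mathbf{U}^{H}\mathbf{y}-\mathbf{s})(\mathbf{U}^{H}\mathbf{y}-\mathbf{s})^{H}\right],
  \qquad \mathbf{y}=\mathbf{H}\mathbf{V}\mathbf{s}+\mathbf{n},\\
\mathbf{E}_{\mathrm{mmse}} &= \left(\mathbf{I}+\mathbf{V}^{H}\mathbf{H}^{H}\mathbf{R}_{n}^{-1}\mathbf{H}\mathbf{V}\right)^{-1}
  \quad \text{(MSE matrix under the MMSE receiver)},\\
R &= \log\det\!\left(\mathbf{E}_{\mathrm{mmse}}^{-1}\right),\\
\min_{\mathbf{W}\succ 0}\ \operatorname{Tr}\!\left(\mathbf{W}\mathbf{E}_{\mathrm{mmse}}\right)-\log\det\mathbf{W} &= d - R,
  \qquad \text{attained at } \mathbf{W}^{\star}=\mathbf{E}_{\mathrm{mmse}}^{-1}.
\end{align}
```

Because $d$ is a constant, minimizing the weighted MSE objective with the optimal weight is equivalent to maximizing the rate $R$, which is why the two problems can be solved interchangeably.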
-
A Deep Reinforcement Learning Approach for Autonomous Reconfigurable Intelligent Surfaces
Authors:
Hyuckjin Choi,
Ly V. Nguyen,
Junil Choi,
A. Lee Swindlehurst
Abstract:
A reconfigurable intelligent surface (RIS) is a prospective wireless technology that enhances wireless channel quality. An RIS is often equipped with a passive array of elements and provides cost- and power-efficient solutions for coverage extension of wireless communication systems. Without any radio frequency (RF) chains or computing resources, however, the RIS requires control information to be sent to it from an external unit, e.g., a base station (BS). The control information can be delivered by wired or wireless channels, and the BS must be aware of the RIS and the RIS-related channel conditions in order to effectively configure its behavior. Recent works have introduced hybrid RIS structures possessing a few active elements that can sense and digitally process received data. Here, we propose the operation of an entirely autonomous RIS that operates without a control link between the RIS and BS. Using a few sensing elements, the autonomous RIS employs a deep Q network (DQN) based on reinforcement learning in order to enhance the sum rate of the network. Our results illustrate the potential of deploying autonomous RISs in wireless networks with essentially no network overhead.
Submitted 19 March, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Variable-Rate Learned Image Compression with Multi-Objective Optimization and Quantization-Reconstruction Offsets
Authors:
Fatih Kamisli,
Fabien Racape,
Hyomin Choi
Abstract:
Achieving successful variable bitrate compression with computationally simple algorithms from a single end-to-end learned image or video compression model remains a challenge. Many approaches have been proposed, including conditional auto-encoders, channel-adaptive gains for the latent tensor, or uniformly quantizing all elements of the latent tensor. This paper follows the traditional approach of varying a single quantization step size to perform uniform quantization of all latent tensor elements. However, three modifications are proposed to improve the variable rate compression performance. First, multi-objective optimization is used for (post) training. Second, a quantization-reconstruction offset is introduced into the quantization operation. Third, variable rate quantization is also used for the hyper latent. All these modifications can be made on a pre-trained single-rate compression model by performing post training. The algorithms are implemented in three well-known image compression models, and the achieved variable rate compression results indicate negligible or minimal compression performance loss compared to training multiple models. (Codes will be shared at https://github.com/InterDigitalInc/CompressAI)
Submitted 29 February, 2024;
originally announced February 2024.
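Uniform quantization with a single step size and a reconstruction offset, as described in the abstract above, can be illustrated with a minimal sketch. The abstract does not state the exact form of the offset, so this assumes one common variant in which each nonzero reconstruction point is pulled toward zero by a fraction of the step. The function names `quantize` and `dequantize` and the parameter `offset` are hypothetical and are not taken from the paper or from CompressAI.

```python
import numpy as np


def quantize(latent, step):
    """Uniform quantization: map each latent element to an integer index."""
    return np.round(latent / step).astype(np.int64)


def dequantize(indices, step, offset=0.0):
    """Reconstruct latent values from integer indices.

    Assumption (not from the abstract): the offset moves every nonzero
    reconstruction point toward zero by `offset * step`. With offset=0.0
    this is plain uniform reconstruction.
    """
    return (indices - offset * np.sign(indices)) * step


# Varying `step` changes the rate: a larger step yields fewer distinct indices.
latent = np.array([-1.3, 0.1, 0.6, 2.4])
indices = quantize(latent, step=0.5)
plain = dequantize(indices, step=0.5)
shifted = dequantize(indices, step=0.5, offset=0.1)
```

In a learned codec the offset would typically be a trained quantity rather than a fixed constant; the fixed value here only shows where it enters the computation.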
-
Multi-Agent Based Transfer Learning for Data-Driven Air Traffic Applications
Authors:
Chuhao Deng,
Hong-Cheol Choi,
Hyunsang Park,
Inseok Hwang
Abstract:
Research in developing data-driven models for Air Traffic Management (ATM) has gained tremendous interest in recent years. However, data-driven models are known to have long training times and require large datasets to achieve good performance. To address the two issues, this paper proposes a Multi-Agent Bidirectional Encoder Representations from Transformers (MA-BERT) model that fully considers the multi-agent characteristic of the ATM system and learns air traffic controllers' decisions, and a pre-training and fine-tuning transfer learning framework. By pre-training the MA-BERT on a large dataset from a major airport and then fine-tuning it to other airports and specific air traffic applications, a large amount of the total training time can be saved. In addition, for newly adopted procedures and constructed airports where no historical data is available, this paper shows that the pre-trained MA-BERT can achieve high performance by updating regularly with little data. The proposed transfer learning framework and MA-BERT are tested with the automatic dependent surveillance-broadcast data recorded in 3 airports in South Korea in 2019.
Submitted 23 January, 2024;
originally announced January 2024.
-
A Unified Multi-Phase CT Synthesis and Classification Framework for Kidney Cancer Diagnosis with Incomplete Data
Authors:
Kwang-Hyun Uhm,
Seung-Won Jung,
Moon Hyung Choi,
Sung-Hoo Hong,
Sung-Jea Ko
Abstract:
Multi-phase CT is widely adopted for the diagnosis of kidney cancer due to the complementary information among phases. However, the complete set of multi-phase CT is often not available in practical clinical applications. In recent years, there have been some studies to generate the missing modality image from the available data. Nevertheless, the generated images are not guaranteed to be effective for the diagnosis task. In this paper, we propose a unified framework for kidney cancer diagnosis with incomplete multi-phase CT, which simultaneously recovers missing CT images and classifies cancer subtypes using the completed set of images. The advantage of our framework is that it encourages a synthesis model to explicitly learn to generate missing CT phases that are helpful for classifying cancer subtypes. We further incorporate a lesion segmentation network into our framework to exploit lesion-level features for effective cancer classification in the whole CT volumes. The proposed framework is based on fully 3D convolutional neural networks to jointly optimize both synthesis and classification of 3D CT volumes. Extensive experiments on both in-house and external datasets demonstrate the effectiveness of our framework for diagnosis with incomplete data compared with state-of-the-art baselines. In particular, cancer subtype classification using the CT data completed by our method achieves higher performance than classification using the given incomplete data.
Submitted 9 December, 2023;
originally announced December 2023.
-
ProsDectNet: Bridging the Gap in Prostate Cancer Detection via Transrectal B-mode Ultrasound Imaging
Authors:
Sulaiman Vesal,
Indrani Bhattacharya,
Hassan Jahanandish,
Xinran Li,
Zachary Kornberg,
Steve Ran Zhou,
Elijah Richard Sommer,
Moon Hyung Choi,
Richard E. Fan,
Geoffrey A. Sonn,
Mirabela Rusu
Abstract:
Interpreting traditional B-mode ultrasound images can be challenging due to image artifacts (e.g., shadowing, speckle), leading to low sensitivity and limited diagnostic accuracy. While Magnetic Resonance Imaging (MRI) has been proposed as a solution, it is expensive and not widely available. Furthermore, most biopsies are guided by Transrectal Ultrasound (TRUS) alone and can miss up to 52% of cancers, highlighting the need for improved targeting. To address this issue, we propose ProsDectNet, a multi-task deep learning approach that localizes prostate cancer on B-mode ultrasound. Our model is pre-trained using radiologist-labeled data and fine-tuned using biopsy-confirmed labels. ProsDectNet includes a lesion detection and patch classification head, with uncertainty minimization using entropy to improve model performance and reduce false positive predictions. We trained and validated ProsDectNet using a cohort of 289 patients who underwent MRI-TRUS fusion targeted biopsy. We then tested our approach on a group of 41 patients and found that ProsDectNet outperformed the average expert clinician in detecting prostate cancer on B-mode ultrasound images, achieving a patient-level ROC-AUC of 82%, a sensitivity of 74%, and a specificity of 67%. Our results demonstrate that ProsDectNet has the potential to be used as a computer-aided diagnosis system to improve targeted biopsy and treatment planning.
Submitted 8 December, 2023;
originally announced December 2023.
-
HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
Authors:
Sang-Hoon Lee,
Ha-Yeong Choi,
Seung-Bin Kim,
Seong-Whan Lee
Abstract:
Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis. However, such models require large-scale data and possess the same limitations as previous autoregressive speech models, including slow inference speed and lack of robustness. This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks could significantly improve the robustness and expressiveness of the synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios. For text-to-speech, we adopt the text-to-vec framework, which generates a self-supervised speech representation and an F0 representation based on text representations and prosody prompts. Then, HierSpeech++ generates speech from the generated vector, F0, and voice prompt. We further introduce a highly efficient speech super-resolution framework from 16 kHz to 48 kHz. The experimental results demonstrated that the hierarchical variational autoencoder could be a strong zero-shot speech synthesizer given that it outperforms LLM-based and diffusion-based models. Moreover, we achieved the first human-level quality zero-shot speech synthesis. Audio samples and source code are available at https://github.com/sh-lee-prml/HierSpeechpp.
Submitted 27 November, 2023; v1 submitted 21 November, 2023;
originally announced November 2023.
-
Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation
Authors:
Ha-Yeong Choi,
Sang-Hoon Lee,
Seong-Whan Lee
Abstract:
Although voice conversion (VC) systems have shown a remarkable ability to transfer voice style, existing methods still have an inaccurate pitch and low speaker adaptation quality. To address these challenges, we introduce Diff-HierVC, a hierarchical VC system based on two diffusion models. We first introduce DiffPitch, which can effectively generate F0 with the target voice style. Subsequently, the generated F0 is fed to DiffVoice to convert the speech with a target voice style. Furthermore, using the source-filter encoder, we disentangle the speech and use the converted Mel-spectrogram as a data-driven prior in DiffVoice to improve the voice style transfer capacity. Finally, by using the masked prior in diffusion models, our model can improve the speaker adaptation quality. Experimental results verify the superiority of our model in pitch generation and voice style transfer performance, and our model also achieves a CER of 0.83% and EER of 3.29% in zero-shot VC scenarios.
Submitted 8 November, 2023;
originally announced November 2023.
-
Yet Another Generative Model For Room Impulse Response Estimation
Authors:
Sungho Lee,
Hyeong-Seok Choi,
Kyogu Lee
Abstract:
Recent neural room impulse response (RIR) estimators typically comprise an encoder for reference audio analysis and a generator for RIR synthesis. Especially, it is the performance of the generator that directly influences the overall estimation quality. In this context, we explore an alternate generator architecture for improved performance. We first train an autoencoder with residual quantization to learn a discrete latent token space, where each token represents a small time-frequency patch of the RIR. Then, we cast the RIR estimation problem as a reference-conditioned autoregressive token generation task, employing transformer variants that operate across frequency, time, and quantization depth axes. This way, we address the standard blind estimation task and additional acoustic matching problem, which aims to find an RIR that matches the source signal to the target signal's reverberation characteristics. Experimental results show that our system is preferable to other baselines across various evaluation metrics.
Submitted 5 November, 2023;
originally announced November 2023.
-
Label Space Partition Selection for Multi-Object Tracking Using Two-Layer Partitioning
Authors:
Ji Youn Lee,
Changbeom Shim,
Hoa Van Nguyen,
Tran Thien Dat Nguyen,
Hyunjin Choi,
Youngho Kim
Abstract:
Estimating the trajectories of multiple objects poses a significant challenge due to data association ambiguity, which leads to a substantial increase in computational requirements. To address such problems, a divide-and-conquer approach has been employed with parallel computation. In this strategy, distinguished objects that have unique labels are grouped based on their statistical dependencies, that is, the intersection of predicted measurements. Several geometric approaches have been used for label grouping since finding all intersected label pairs is clearly infeasible for large-scale tracking problems. This paper proposes an efficient implementation of label grouping for the label-partitioned generalized labeled multi-Bernoulli filter framework using a secondary partitioning technique. This allows for parallel computation in the label graph indexing step, avoiding generating and eliminating duplicate comparisons. Additionally, we compare the performance of the proposed technique with several efficient spatial searching algorithms. The results demonstrate the superior performance of the proposed approach on large-scale data sets, enabling scalable trajectory estimation.
Submitted 22 October, 2023;
originally announced October 2023.
-
Telescope imaging beyond the Rayleigh limit in extremely low SNR
Authors:
Hyunsoo Choi,
Seungman Choi,
Peter Menart,
Angshuman Deka,
Zubin Jacob
Abstract:
The Rayleigh limit and low Signal-to-Noise Ratio (SNR) scenarios pose significant limitations to optical imaging systems used in remote sensing, infrared thermal imaging, and space domain awareness. In this study, we introduce a Stochastic Sub-Rayleigh Imaging (SSRI) algorithm to localize point objects and estimate their positions, brightnesses, and number in low SNR conditions, even below the Rayleigh limit. Our algorithm adopts a maximum likelihood approach and exploits the Poisson distribution of incoming photons to overcome the Rayleigh limit in low SNR conditions. In our experimental validation, which closely mirrors practical scenarios, we focus on conditions with closely spaced sources within the sub-Rayleigh limit (0.49-1.00R) and weak signals (SNR less than 1.2). We use the Jaccard index and Jaccard efficiency as a figure of merit to quantify imaging performance in the sub-Rayleigh region. Our approach consistently outperforms established algorithms such as Richardson-Lucy and CLEAN by 4X in the low SNR, sub-Rayleigh regime. Our SSRI algorithm allows existing telescope-based optical/infrared imaging systems to overcome the extreme limit of sub-Rayleigh, low SNR source distributions, potentially impacting a wide range of fields, including passive thermal imaging, remote sensing, and space domain awareness.
Submitted 17 January, 2024; v1 submitted 16 October, 2023;
originally announced October 2023.
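The Jaccard index used above as a figure of merit for localization can be sketched as follows. This is an illustration under stated assumptions rather than the authors' evaluation code: sources are reduced to 1-D positions, an estimate counts as a true positive when it lies within a hypothetical `tolerance` of a still-unmatched true source, and matching is greedy. The abstract does not specify the matching rule or define Jaccard efficiency, so neither is reproduced here.

```python
def jaccard_index(true_positions, estimated_positions, tolerance):
    """Jaccard index TP / (TP + FP + FN) for point-source localization.

    Assumptions: 1-D positions, greedy nearest-neighbour matching, and a
    fixed matching tolerance. Each estimate can match at most one source.
    """
    unmatched = list(estimated_positions)
    true_positives = 0
    for source in true_positions:
        candidates = [e for e in unmatched if abs(e - source) <= tolerance]
        if candidates:
            nearest = min(candidates, key=lambda e: abs(e - source))
            unmatched.remove(nearest)
            true_positives += 1
    false_negatives = len(true_positions) - true_positives
    false_positives = len(unmatched)
    total = true_positives + false_positives + false_negatives
    # Convention chosen here: an empty scene with no estimates scores 1.0.
    return true_positives / total if total else 1.0


# One source found, one missed, one spurious estimate.
score = jaccard_index([0.0, 1.0], [0.05, 2.0], tolerance=0.1)
```

The index penalizes both missed sources and spurious detections, which makes it suitable when the number of sources is itself being estimated.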
-
AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion
Authors:
Haeyun Choi,
Jio Gim,
Yuho Lee,
Youngin Kim,
Young-Joo Suh
Abstract:
This paper proposes a simple and robust zero-shot voice conversion system with a cycle structure and mel-spectrogram pre-processing. Previous works suffer from information loss and poor synthesis quality due to their reliance on a carefully designed bottleneck structure. Moreover, models relying solely on self-reconstruction loss struggled with reproducing different speakers' voices. To address these issues, we suggested a cycle-consistency loss that considers conversion back and forth between target and source speakers. Additionally, stacked random-shuffled mel-spectrograms and a label smoothing method are utilized during speaker encoder training to extract a time-independent global speaker representation from speech, which is the key to a zero-shot conversion. Our model outperforms existing state-of-the-art results in both subjective and objective evaluations. Furthermore, it facilitates cross-lingual voice conversions and enhances the quality of synthesized speech.
Submitted 10 October, 2023;
originally announced October 2023.
-
Multimodal Contrastive Learning with Hard Negative Sampling for Human Activity Recognition
Authors:
Hyeongju Choi,
Apoorva Beedu,
Irfan Essa
Abstract:
Human Activity Recognition (HAR) systems have been extensively studied by the vision and ubiquitous computing communities due to their practical applications in daily life, such as smart homes, surveillance, and health monitoring.
Typically, this process is supervised in nature and the development of such systems requires access to large quantities of annotated data.
However, the higher costs and challenges associated with obtaining good quality annotations have rendered the application of self-supervised methods an attractive option and contrastive learning comprises one such method.
However, a major component of successful contrastive learning is the selection of good positive and negative samples.
Although positive samples are directly obtainable, sampling good negative samples remains a challenge.
As human activities can be recorded by several modalities like camera and IMU sensors, we propose a hard negative sampling method for multimodal HAR with a hard negative sampling loss for skeleton and IMU data pairs.
We exploit hard negatives that have different labels from the anchor but are projected nearby in the latent space using an adjustable concentration parameter.
Through extensive experiments on two benchmark datasets, UTD-MHAD and MMAct, we demonstrate the robustness of our approach for learning strong feature representations for HAR tasks, including in the limited data setting.
We further show that our model outperforms all other state-of-the-art methods for UTD-MHAD dataset, and self-supervised methods for MMAct: Cross session, even when uni-modal data are used during downstream activity recognition.
Submitted 3 September, 2023;
originally announced September 2023.
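The role of an adjustable concentration parameter in hard negative sampling can be shown with a small sketch. It follows the general importance-weighting idea from the hard-negative contrastive learning literature and is not the loss defined in the paper. The function name, the parameters `beta` and `temperature`, and the assumption that all embeddings are unit-normalized vectors are illustrative choices made here.

```python
import numpy as np


def hard_negative_contrastive_loss(anchor, positive, negatives,
                                   beta=1.0, temperature=0.1):
    """Contrastive loss with similarity-weighted (hard) negatives.

    anchor, positive: arrays of shape (d,); negatives: array of shape (n, d).
    Assumption: all vectors are unit-normalized, so dot products are cosine
    similarities. `beta` is the concentration parameter: beta=0 weights all
    negatives equally (standard InfoNCE), while larger beta puts more weight
    on negatives that lie close to the anchor.
    """
    positive_term = np.exp(anchor @ positive / temperature)
    negative_sims = negatives @ anchor
    negative_terms = np.exp(negative_sims / temperature)
    # Importance weights: softmax over beta-scaled similarities.
    weights = np.exp(beta * negative_sims)
    weights = weights / weights.sum()
    # Rescale so that uniform weights recover the plain sum over negatives.
    weighted_negatives = len(negatives) * np.sum(weights * negative_terms)
    return -np.log(positive_term / (positive_term + weighted_negatives))


anchor = np.array([1.0, 0.0])
positive = np.array([0.8, 0.6])
negatives = np.array([[0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]])
uniform_loss = hard_negative_contrastive_loss(anchor, positive, negatives, beta=0.0)
hard_loss = hard_negative_contrastive_loss(anchor, positive, negatives, beta=2.0)
```

In the multimodal setting described above, the anchor and the other samples would come from different modalities (for example skeleton and IMU embeddings); the sketch leaves that pairing to the caller.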
-
NeuralEQ: Neural-Network-Based Equalizer for High-Speed Wireline Communication
Authors:
Hanseok Kim,
Jae Hyung Ju,
Hyun Seok Choi,
Hyeri Roh,
Woo-Seok Choi
Abstract:
With the growing demand for high-bandwidth applications like video streaming and cloud services, the data transfer rates required for wireline communication keep increasing, making channel loss a major obstacle in achieving a low bit error rate (BER). Equalization techniques such as the feed-forward equalizer (FFE) and decision feedback equalizer (DFE) are commonly used to compensate for channel loss in wireline communication, but they have limitations in terms of noise boosting and timing constraints. On the other hand, the forward-backward algorithm can achieve better BER performance, but its high complexity makes it impractical for wireline communication. In this work, we propose a novel neural network, NeuralEQ, that effectively mimics the forward-backward algorithm and performs better than FFE and DFE while reducing the complexity of the forward-backward algorithm. The performance of NeuralEQ is verified through simulations using real channels.
Submitted 4 August, 2023;
originally announced August 2023.
-
HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer
Authors:
Sang-Hoon Lee,
Ha-Yeong Choi,
Hyung-Seok Oh,
Seong-Whan Lee
Abstract:
Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at https://hiervst.github.io/.
△ Less
Submitted 30 July, 2023;
originally announced July 2023.
-
DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion
Authors:
Ha-Yeong Choi,
Sang-Hoon Lee,
Seong-Whan Lee
Abstract:
Diffusion-based generative models have exhibited powerful generative performance in recent years. However, as many attributes exist in the data distribution and owing to several limitations of sharing the model parameters across all levels of the generation process, it remains challenging to control specific styles for each attribute. To address the above problem, this paper presents decoupled denoising diffusion models (DDDMs) with disentangled representations, which can control the style for each attribute in generative models. We apply DDDMs to voice conversion (VC) tasks to address the challenges of disentangling and controlling each speech attribute (e.g., linguistic information, intonation, and timbre). First, we use a self-supervised representation to disentangle the speech representation. Subsequently, the DDDMs are applied to resynthesize the speech from the disentangled representations for denoising with respect to each attribute. Moreover, we also propose the prior mixup for robust voice style transfer, which uses the converted representation of the mixed style as a prior distribution for the diffusion models. The experimental results reveal that our method outperforms publicly available VC models. Furthermore, we show that our method provides robust generative performance regardless of the model size. Audio samples are available at https://hayeong0.github.io/DDDM-VC-demo/.
Submitted 25 May, 2023;
originally announced May 2023.
-
Darwin: A DRAM-based Multi-level Processing-in-Memory Architecture for Data Analytics
Authors:
Donghyuk Kim,
Jae-Young Kim,
Wontak Han,
Jongsoon Won,
Haerang Choi,
Yongkee Kwon,
Joo-Young Kim
Abstract:
Processing-in-memory (PIM) architecture is an inherent match for data analytics applications, but we observe major challenges to address when accelerating them using PIM. In this paper, we propose Darwin, a practical LRDIMM-based multi-level PIM architecture for data analytics, which fully exploits the internal bandwidth of DRAM using the bank-, bank group-, chip-, and rank-level parallelisms. Considering the properties of data analytics operators and DRAM's area constraints, Darwin maximizes the internal data bandwidth by placing the PIM processing units, buffers, and control circuits across the hierarchy of DRAM. More specifically, it introduces the bank processing unit for each bank in which a single instruction multiple data (SIMD) unit handles regular data analytics operators and bank group processing unit for each bank group to handle workload imbalance in the condition-oriented data analytics operators. Furthermore, Darwin supports a novel PIM instruction architecture that concatenates instructions for multiple thread executions on bank group processing entities, addressing the command bottleneck by enabling separate control of up to 512 different in-memory processing units simultaneously. We build a cycle-accurate simulation framework to evaluate Darwin with various DRAM configurations, optimization schemes and workloads. Darwin achieves up to 14.7x speedup over the non-optimized version. Finally, the proposed Darwin architecture achieves 4.0x-43.9x higher throughput and 85.7% lower energy consumption than the baseline CPU system (Intel Xeon Gold 6226 + 4 channels of DDR4-2933). Compared to the state-of-the-art PIM, Darwin achieves up to 7.5x and 7.1x in the basic query operators and TPC-H queries, respectively. Darwin is based on the latest GDDR6 and requires only 5.6% area overhead, suggesting a promising PIM solution for the future main memory system.
Submitted 23 May, 2023;
originally announced May 2023.
-
WiThRay: A Versatile Ray-Tracing Simulator for Smart Wireless Environments
Authors:
Hyuck** Choi,
Jaehoon Chung,
Jaeky Oh,
George C. Alexandropoulos,
Junil Choi
Abstract:
This paper presents the development and evaluation of WiThRay, a new wireless three-dimensional ray-tracing (RT) simulator. RT-based simulators are widely used for generating realistic channel data by combining RT methodology to get signal trajectories and electromagnetic (EM) equations, resulting in generalized channel impulse responses (CIRs). This paper first provides a comprehensive comparison of the methodologies of existing RT-based simulators. We then introduce WiThRay, which can evaluate the performance of various wireless communication techniques such as channel estimation/tracking, beamforming, and localization in realistic EM wave propagation. WiThRay implements its own RT methodology, the bypassing on edge (BE) algorithm, which follows Fermat's principle and has low computational complexity. The scattering ray calibration in WiThRay also provides a precise solution in the analysis of EM propagation. Different from most of the previous RT-based simulators, WiThRay incorporates reconfigurable intelligent surfaces (RIS), which will be a key component of future wireless communications. We thoroughly show that the channel data from WiThRay match sufficiently well with the fundamental theory of wireless channels. The virtue of WiThRay lies in its feature of not making any assumption about the channel, like being slow/fast fading or frequency selective. A realistic wireless environment, which can be conveniently simulated via WiThRay, naturally defines the physical properties of the wireless channels. WiThRay is open to the public, and anyone can exploit this versatile simulator to develop and test their communications and signal processing techniques.
Submitted 22 April, 2023;
originally announced April 2023.
-
Cross-domain Denoising for Low-dose Multi-frame Spiral Computed Tomography
Authors:
Yucheng Lu,
Zhixin Xu,
Moon Hyung Choi,
Jimin Kim,
Seung-Won Jung
Abstract:
Computed tomography (CT) has been used worldwide as a non-invasive test to assist in diagnosis. However, the ionizing nature of X-ray exposure raises concerns about potential health risks such as cancer. The desire for lower radiation doses has driven researchers to improve reconstruction quality. Although previous studies on low-dose computed tomography (LDCT) denoising have demonstrated the effectiveness of learning-based methods, most were developed on simulated data. However, the real-world scenario differs significantly from the simulation domain, especially when using the multi-slice spiral scanner geometry. This paper proposes a two-stage method for the commercially available multi-slice spiral CT scanners that better exploits the complete reconstruction pipeline for LDCT denoising across different domains. Our approach makes good use of the high redundancy of multi-slice projections and the volumetric reconstructions while leveraging the over-smoothing problem in conventional cascaded frameworks caused by aggressive denoising. The dedicated design also provides a more explicit interpretation of the data flow. Extensive experiments on various datasets showed that the proposed method could remove up to 70% of noise without compromising spatial resolution, and subjective evaluations by two experienced radiologists further supported its superior performance against state-of-the-art methods in clinical practice.
Submitted 28 June, 2024; v1 submitted 21 April, 2023;
originally announced April 2023.
-
Generative AI for Rapid Diffusion MRI with Improved Image Quality, Reliability and Generalizability
Authors:
Amir Sadikov,
Xinlei Pan,
Hannah Choi,
Lanya T. Cai,
Pratik Mukherjee
Abstract:
Diffusion MRI is a non-invasive, in-vivo biomedical imaging method for mapping tissue microstructure. Applications include structural connectivity imaging of the human brain and detecting microstructural neural changes. However, acquiring high signal-to-noise ratio dMRI datasets with high angular and spatial resolution requires prohibitively long scan times, limiting usage in many important clinical settings, especially for children, the elderly, and in acute neurological disorders that may require conscious sedation or general anesthesia. We employ a Swin UNEt Transformers model, trained on augmented Human Connectome Project data and conditioned on registered T1 scans, to perform generalized denoising of dMRI. We also qualitatively demonstrate super-resolution with artificially downsampled HCP data in normal adult volunteers. Remarkably, Swin UNETR can be fine-tuned for an out-of-domain dataset with a single example scan, as we demonstrate on dMRI of children with neurodevelopmental disorders and of adults with acute evolving traumatic brain injury, each cohort scanned on different models of scanners with different imaging protocols at different sites. We exceed current state-of-the-art denoising methods in accuracy and test-retest reliability of rapid diffusion tensor imaging requiring only 90 seconds of scan time. Applied to tissue microstructural modeling of dMRI, Swin UNETR denoising achieves dramatic improvements over the state-of-the-art for test-retest reliability of intracellular volume fraction and free water fraction measurements and can remove heavy-tail noise, improving biophysical modeling fidelity. Swin UNETR enables rapid diffusion MRI with unprecedented accuracy and reliability, especially for probing biological tissues for scientific and clinical applications. The code and model are publicly available at https://github.com/ucsfncl/dmri-swin.
Submitted 6 October, 2023; v1 submitted 9 March, 2023;
originally announced March 2023.
-
AIROGS: Artificial Intelligence for RObust Glaucoma Screening Challenge
Authors:
Coen de Vente,
Koenraad A. Vermeer,
Nicolas Jaccard,
He Wang,
Hongyi Sun,
Firas Khader,
Daniel Truhn,
Temirgali Aimyshev,
Yerkebulan Zhanibekuly,
Tien-Dung Le,
Adrian Galdran,
Miguel Ángel González Ballester,
Gustavo Carneiro,
Devika R G,
Hrishikesh P S,
Densen Puthussery,
Hong Liu,
Zekang Yang,
Satoshi Kondo,
Satoshi Kasai,
Edward Wang,
Ashritha Durvasula,
Jónathan Heras,
Miguel Ángel Zapata,
Teresa Araújo
, et al. (11 additional authors not shown)
Abstract:
The early detection of glaucoma is essential in preventing visual impairment. Artificial intelligence (AI) can be used to analyze color fundus photographs (CFPs) in a cost-effective manner, making glaucoma screening more accessible. While AI models for glaucoma screening from CFPs have shown promising results in laboratory settings, their performance decreases significantly in real-world scenarios due to the presence of out-of-distribution and low-quality images. To address this issue, we propose the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) challenge. This challenge includes a large dataset of around 113,000 images from about 60,000 patients and 500 different screening centers, and encourages the development of algorithms that are robust to ungradable and unexpected input data. We evaluated solutions from 14 teams in this paper, and found that the best teams performed similarly to a set of 20 expert ophthalmologists and optometrists. The highest-scoring team achieved an area under the receiver operating characteristic curve of 0.99 (95% CI: 0.98-0.99) for detecting ungradable images on-the-fly. Additionally, many of the algorithms showed robust performance when tested on three other publicly available datasets. These results demonstrate the feasibility of robust AI-enabled glaucoma screening.
Submitted 10 February, 2023; v1 submitted 3 February, 2023;
originally announced February 2023.
-
Learned Disentangled Latent Representations for Scalable Image Coding for Humans and Machines
Authors:
Ezgi Ozyilkan,
Mateen Ulhaq,
Hyomin Choi,
Fabien Racape
Abstract:
As an increasing amount of image and video content will be analyzed by machines, there is demand for a new codec paradigm that is capable of compressing visual input primarily for the purpose of computer vision inference, while secondarily supporting input reconstruction. In this work, we propose a learned compression architecture that can be used to build such a codec. We introduce a novel variational formulation that explicitly takes feature data relevant to the desired inference task as input at the encoder side. As such, our learned scalable image codec encodes and transmits two disentangled latent representations for object detection and input reconstruction. We note that compared to relevant benchmarks, our proposed scheme yields a more compact latent representation that is specialized for the inference task. Our experiments show that our proposed system achieves a bit rate savings of 40.6% on the primary object detection task compared to the current state-of-the-art, albeit with some degradation in performance for the secondary input reconstruction task.
Submitted 10 January, 2023;
originally announced January 2023.
-
Frequency-aware Learned Image Compression for Quality Scalability
Authors:
Hyomin Choi,
Fabien Racape,
Shahab Hamidi-Rad,
Mateen Ulhaq,
Simon Feltman
Abstract:
Spatial frequency analysis and transforms serve a central role in most engineered image and video lossy codecs, but are rarely employed in neural network (NN)-based approaches. We propose a novel NN-based image coding framework that utilizes forward wavelet transforms to decompose the input signal by spatial frequency. Our encoder generates separate bitstreams for each latent representation of low and high frequencies. This enables our decoder to selectively decode bitstreams in a quality-scalable manner. Hence, the decoder can produce an enhanced image by using an enhancement bitstream in addition to the base bitstream. Furthermore, our method is able to enhance only a specific region of interest (ROI) by using a corresponding part of the enhancement latent representation. Our experiments demonstrate that the proposed method shows competitive rate-distortion performance compared to several non-scalable image codecs. We also showcase the effectiveness of our two-level quality scalability, as well as its practicality in ROI quality enhancement.
Submitted 3 January, 2023;
originally announced January 2023.
-
Towards trustworthy phoneme boundary detection with autoregressive model and improved evaluation metric
Authors:
Hyeongju Kim,
Hyeong-Seok Choi
Abstract:
Phoneme boundary detection has been studied due to its central role in various speech applications. In this work, we point out that this task needs to be addressed not only algorithmically, but also in terms of the evaluation metric. To this end, we first propose a state-of-the-art phoneme boundary detector that operates in an autoregressive manner, dubbed SuperSeg. Experiments on the TIMIT and Buckeye corpora demonstrate that SuperSeg identifies phoneme boundaries by a significant margin compared to existing models. Furthermore, we note that there is a limitation in the popular evaluation metric, R-value, and propose new evaluation metrics that prevent each boundary from contributing to evaluation multiple times. The proposed metrics reveal the weaknesses of non-autoregressive baselines and establish a reliable criterion suited to evaluating phoneme boundary detection.
Submitted 13 December, 2022;
originally announced December 2022.
-
NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis
Authors:
Hyeong-Seok Choi,
**hyeok Yang,
Juheon Lee,
Hyeongju Kim
Abstract:
Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, most of the voice synthesis models still require a large number of audio data paired with annotated labels (e.g., text transcription and music score) for training. To this end, we propose a unified framework of synthesizing and manipulating voice signals from analysis features, dubbed NANSY++. The backbone network of NANSY++ is trained in a self-supervised manner that does not require any annotations paired with audio. After training the backbone network, we efficiently tackle four voice applications - i.e. voice conversion, text-to-speech, singing voice synthesis, and voice designing - by partially modeling the analysis features required for each task. Extensive experiments show that the proposed framework offers competitive advantages such as controllability, data efficiency, and fast training convergence, while providing high quality synthesis. Audio samples: tinyurl.com/8tnsy3uc.
Submitted 17 November, 2022;
originally announced November 2022.
-
An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space
Authors:
Jihwan Lee,
Jae-Sung Bae,
Seongkyu Mun,
Hee** Choi,
Joun Yeop Lee,
Hoon-Young Cho,
Chanwoo Kim
Abstract:
With the recent developments in cross-lingual Text-to-Speech (TTS) systems, L2 (second-language, or foreign) accent problems arise. Moreover, running a subjective evaluation for such cross-lingual TTS systems is troublesome. The vowel space analysis, which is often utilized to explore various aspects of language including L2 accents, is a great alternative analysis tool. In this study, we apply the vowel space analysis method to explore L2 accents of cross-lingual TTS systems. Through the vowel space analysis, we observe the following three points: a) a parallel architecture (Glow-TTS) is less L2-accented than an auto-regressive one (Tacotron); b) L2 accents are more dominant in non-shared vowels in a language pair; and c) L2 accents of cross-lingual TTS systems share some phenomena with those of human L2 learners. Our findings imply that it is necessary for TTS systems to handle each language pair differently, depending on their linguistic characteristics such as non-shared vowels. They also hint that we can further incorporate linguistic knowledge in developing cross-lingual TTS systems.
Submitted 6 November, 2022;
originally announced November 2022.
-
A Learning-Based Estimation and Control Framework for Contact-Intensive Tight-Tolerance Tasks
Authors:
Bukun Son,
Hyelim Choi,
Jaemin Yoon,
Dongjun Lee
Abstract:
We present a two-stage framework that integrates a learning-based estimator and a controller, designed to address contact-intensive tasks. The estimator leverages a Bayesian particle filter with a mixture density network (MDN) structure, effectively handling multi-modal issues arising from contact information. The controller combines a self-supervised and reinforcement learning (RL) approach, strategically dividing the low-level admittance controller's parameters into labelable and non-labelable categories, which are then trained accordingly. To further enhance accuracy and generalization performance, a transformer model is incorporated into the self-supervised learning component. The proposed framework is evaluated on the bolting task using an accurate real-time simulator and successfully transferred to an experimental environment. More visualization results are available on our project website: https://sites.google.com/view/2stagecitt
Submitted 1 August, 2023; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Computing Forward Reachable Sets for Nonlinear Adaptive Multirotor Controllers
Authors:
Juyeop Han,
Han-Lim Choi
Abstract:
In multirotor systems, guaranteeing safety while considering unknown disturbances is essential for robust trajectory planning. The forward reachable set (FRS), the set of feasible states subject to bounded disturbances, can be utilized to identify robust and collision-free trajectories by checking the intersections with obstacles. However, in many cases, the FRS is not calculated in real time and is too conservative to be used in actual applications. In this paper, we address these issues by introducing a nonlinear disturbance observer (NDOB) and an adaptive controller to the multirotor system. We express the FRS of the closed-loop multirotor system with an adaptive controller in augmented state space using Hamilton-Jacobi reachability analysis. Then, we derive a closed-form expression that over-approximates the FRS as an ellipsoid, allowing for real-time computation. By compensating for disturbances with the adaptive controller, our over-approximated FRS can be smaller than other ellipsoidal over-approximations. Numerical examples validate the computational efficiency and the smaller scale of our proposed FRS.
Submitted 6 March, 2023; v1 submitted 16 September, 2022;
originally announced September 2022.
-
On the Physical Layer Security of Visible Light Communications Empowered by Gold Nanoparticles
Authors:
Geonho Han,
Hyuck** Choi,
Ryeong Myeong Kim,
Ki Tae Nam,
Junil Choi,
Theodoros A. Tsiftsis
Abstract:
Visible light is a proper spectrum for secure wireless communications because of its high directivity and impermeability in indoor scenarios. However, if an eavesdropper is located very close to a legitimate receiver, secure communications become highly risky. In this paper, to further increase the level of security of visible light communication (VLC) and increase its resilience against malicious attacks, we propose to capitalize on the recently synthesized gold nanoparticles (GNPs) with chiroptical properties for circularly polarized light, resulting in phase retardation that interacts with the linear polarizer angle. GNP plates made by judiciously stacking many GNPs perform as physical secret keys. Transmitters send both the intended symbol and artificial noise to exploit the channel variation effect by the GNP plates, which is highly effective when an eavesdropper is located close to the legitimate receiver. A new VLC channel model is first developed by representing the effect of GNP plates and linear polarizers in the circular polarization domain. Based on the new channel model, the angles of linear polarizers at the transmitters and legitimate receiver are optimized considering the effect of GNP plates to increase the secrecy rate in wiretapping scenarios. Simulations verify that when the transmitters are equipped with GNP plates, even if the eavesdropper is located right next to the legitimate receiver, insightful results on the physical layer security metrics are gained as follows: 1) the secrecy rate is significantly improved and 2) the symbol error rate gap between the legitimate receiver and eavesdropper becomes much larger due to the chiroptical properties of GNP plates.
Submitted 7 June, 2024; v1 submitted 12 August, 2022;
originally announced August 2022.
-
Scalable Video Coding for Humans and Machines
Authors:
Hyomin Choi,
Ivan V. Bajić
Abstract:
Video content is watched not only by humans, but increasingly also by machines. For example, machine learning models analyze surveillance video for security and traffic monitoring, search through YouTube videos for inappropriate content, and so on. In this paper, we propose a scalable video coding framework that supports machine vision (specifically, object detection) through its base layer bitstream and human vision via its enhancement layer bitstream. The proposed framework includes components from both conventional and Deep Neural Network (DNN)-based video coding. The results show that on object detection, the proposed framework achieves 13-19% bit savings compared to state-of-the-art video codecs, while remaining competitive in terms of MS-SSIM on the human vision task.
Submitted 4 August, 2022;
originally announced August 2022.
-
Joint Image Compression and Denoising via Latent-Space Scalability
Authors:
Saeed Ranjbar Alvar,
Mateen Ulhaq,
Hyomin Choi,
Ivan V. Bajić
Abstract:
When it comes to image compression in digital cameras, denoising is traditionally performed prior to compression. However, there are applications where image noise may be necessary to demonstrate the trustworthiness of the image, such as court evidence and image forensics. This means that noise itself needs to be coded, in addition to the clean image itself. In this paper, we present a learning-based image compression framework where image denoising and compression are performed jointly. The latent space of the image codec is organized in a scalable manner such that the clean image can be decoded from a subset of the latent space (the base layer), while the noisy image is decoded from the full latent space at a higher rate. Using a subset of the latent space for the denoised image allows denoising to be carried out at a lower rate. Besides providing a scalable representation of the noisy input image, performing denoising jointly with compression makes intuitive sense because noise is hard to compress; hence, compressibility is one of the criteria that may help distinguish noise from the signal. The proposed codec is compared against established compression and denoising benchmarks, and the experiments reveal considerable bitrate savings compared to a cascade combination of a state-of-the-art codec and a state-of-the-art denoiser.
Submitted 4 September, 2022; v1 submitted 3 May, 2022;
originally announced May 2022.
-
Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder
Authors:
Juheon Lee,
Hyeong-Seok Choi,
Kyogu Lee
Abstract:
This paper proposes a controllable singing voice synthesis system capable of generating expressive singing voice with two novel methodologies. First, a local style token module, which predicts frame-level style tokens from an input pitch and text sequence, is proposed to allow the singing voice system to control musical expression often unspecified in sheet music (e.g., breathing and intensity). Second, we propose a dual-path pitch encoder with a choice of two different pitch inputs: MIDI pitch sequence or f0 contour. Because the initial generation of a singing voice is usually executed by taking a MIDI pitch sequence, one can later extract an f0 contour from the generated singing voice and modify the f0 contour to a finer level as desired. Through quantitative and qualitative evaluations, we confirmed that the proposed model could control various musical expressions while not sacrificing the sound quality of the singing voice synthesis system.
Submitted 7 April, 2022;
originally announced April 2022.
-
Into-TTS : Intonation Template Based Prosody Control System
Authors:
Jihwan Lee,
Joun Yeop Lee,
Heejin Choi,
Seongkyu Mun,
Sangjun Park,
Jae-Sung Bae,
Chanwoo Kim
Abstract:
Intonations play an important role in delivering the intention of a speaker. However, current end-to-end TTS systems often fail to model proper intonations. To alleviate this problem, we propose a novel, intuitive method to synthesize speech in different intonations using predefined intonation templates. Prior to TTS model training, speech data are grouped into intonation templates in an unsupervised manner. Two proposed modules are added to the end-to-end TTS framework: an intonation predictor and an intonation encoder. The intonation predictor recommends a suitable intonation template to the given text. The intonation encoder, attached to the text encoder output, synthesizes speech abiding the requested intonation template. Main contributions of our paper are: (a) an easy-to-use intonation control system covering a wide range of users; (b) better performance in wrapping speech in a requested intonation with improved objective and subjective evaluation; and (c) incorporating a pre-trained language model for intonation modelling. Audio samples are available at https://srtts.github.io/IntoTTS.
Submitted 6 November, 2022; v1 submitted 4 April, 2022;
originally announced April 2022.
-
Distributed goal assignment strategy for improving leader-following formation control performance
Authors:
Yun Ho Choi,
Doik Kim
Abstract:
This paper investigates a distributed goal assignment problem in leader-following formation control of second-order multi-agent systems. It is assumed that each agent can communicate with nearby agents within the communication range and the leader information is only available to a subset of agents. Compared with existing formation control schemes addressing the goal assignment issue, the main contribution of this paper is to construct a novel distributed assignment strategy allotting appropriate goal positions of agents in the leader-following formation control framework. Based on the rigorous analysis using the Lyapunov stability theory, the enhancement of the control performance is proved via the proposed assignment strategy. To demonstrate the effectiveness of our theoretical results, two examples including multiple quadrotors are simulated.
Submitted 2 March, 2022;
originally announced March 2022.
-
Red Light, Green Light Game of Multi-Robot Systems with Safety Barrier Certificates
Authors:
Yun Ho Choi,
Doik Kim
Abstract:
In this paper, we propose the safety barrier certificates for uncertain multi-robot systems playing red light, green light game. According to the rule of the game, the robots are allowed to move forward after a doll shouts 'green light' and must stop when it shouts 'red light'. Following this rule, a two-mode nominal controller is designed where one mode is for moving forward and the other one is for slowing down and being motionless. Then, multiple exponential control barrier functions (ECBFs) are developed to handle safety constraints for limited playground, collision avoidance, and saturation of the velocity. While designing the nominal controller and ECBFs, an estimated braking time and robust inequality constraints are derived to deal with the system uncertainty. Consequently, a controller guaranteeing safety barrier certificates of each robot has been formulated by a quadratic programming with the nominal controller and the robust inequality constraints. Finally, red light, green light game is simulated to validate the proposed safety-critical control system.
Submitted 28 February, 2022;
originally announced February 2022.
-
Data-Driven Optimal Control via Linear Transfer Operators: A Convex Approach
Authors:
Joseph Moyalan,
Hyungjin Choi,
Yongxin Chen,
Umesh Vaidya
Abstract:
This paper is concerned with data-driven optimal control of nonlinear systems. We present a convex formulation to the optimal control problem (OCP) with a discounted cost function. We consider OCP with both positive and negative discount factor. The convex approach relies on lifting nonlinear system dynamics in the space of densities using the linear Perron-Frobenius (P-F) operator. This lifting leads to an infinite-dimensional convex optimization formulation of the optimal control problem. The data-driven approximation of the optimization problem relies on the approximation of the Koopman operator using the polynomial basis function. We write the approximate finite-dimensional optimization problem as a polynomial optimization which is then solved efficiently using a sum-of-squares-based optimization framework. Simulation results are presented to demonstrate the efficacy of the developed data-driven optimal control framework.
Submitted 3 February, 2022;
originally announced February 2022.
-
SFU-HW-Tracks-v1: Object Tracking Dataset on Raw Video Sequences
Authors:
Takehiro Tanaka,
Hyomin Choi,
Ivan V. Bajić
Abstract:
We present a dataset that contains object annotations with unique object identities (IDs) for the High Efficiency Video Coding (HEVC) v1 Common Test Conditions (CTC) sequences. Ground-truth annotations for 13 sequences were prepared and released as the dataset called SFU-HW-Tracks-v1. For each video frame, ground truth annotations include object class ID, object ID, and bounding box location and its dimensions. The dataset can be used to evaluate object tracking performance on uncompressed video sequences and study the relationship between video compression and object tracking.
Submitted 30 December, 2021;
originally announced December 2021.
-
Stacked U-Nets with Self-Assisted Priors Towards Robust Correction of Rigid Motion Artifact in Brain MRI
Authors:
Mohammed A. Al-masni,
Seul Lee,
Jaeuk Yi,
Sewook Kim,
Sung-Min Gho,
Young Hun Choi,
Dong-Hyun Kim
Abstract:
In this paper, we develop an efficient retrospective deep learning method called stacked U-Nets with self-assisted priors to address the problem of rigid motion artifacts in MRI. The proposed work exploits the usage of additional knowledge priors from the corrupted images themselves without the need for additional contrast data. The proposed network learns missed structural details through sharing auxiliary information from the contiguous slices of the same distorted subject. We further design a refinement stacked U-Nets that facilitates preserving of the image spatial details and hence improves the pixel-to-pixel dependency. To perform network training, simulation of MRI motion artifacts is inevitable. We present an intensive analysis using various types of image priors: the proposed self-assisted priors and priors from other image contrast of the same subject. The experimental analysis proves the effectiveness and feasibility of our self-assisted priors since it does not require any further data scans.
Submitted 11 November, 2021;
originally announced November 2021.
-
Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations
Authors:
Hyeong-Seok Choi,
Juheon Lee,
Wansoo Kim,
Jie Hwan Lee,
Hoon Heo,
Kyogu Lee
Abstract:
We present a neural analysis and synthesis (NANSY) framework that can manipulate voice, pitch, and speed of an arbitrary speech signal. Most of the previous works have focused on using information bottleneck to disentangle analysis features for controllable synthesis, which usually results in poor reconstruction quality. We address this issue by proposing a novel training strategy based on information perturbation. The idea is to perturb information in the original input signal (e.g., formant, pitch, and frequency response), thereby letting synthesis networks selectively take essential attributes to reconstruct the input signal. Because NANSY does not need any bottleneck structures, it enjoys both high reconstruction quality and controllability. Furthermore, NANSY does not require any labels associated with speech data such as text and speaker information, but rather uses a new set of analysis features, i.e., wav2vec feature and newly proposed pitch feature, Yingram, which allows for fully self-supervised training. Taking advantage of fully self-supervised training, NANSY can be easily extended to a multilingual setting by simply training it with a multilingual dataset. The experiments show that NANSY can achieve significant improvement in performance in several applications such as zero-shot voice conversion, pitch shift, and time-scale modification.
Submitted 28 October, 2021; v1 submitted 27 October, 2021;
originally announced October 2021.
-
Scalable Image Coding for Humans and Machines
Authors:
Hyomin Choi,
Ivan V. Bajić
Abstract:
At present, and increasingly so in the future, much of the captured visual content will not be seen by humans. Instead, it will be used for automated machine vision analytics and may require occasional human viewing. Examples of such applications include traffic monitoring, visual surveillance, autonomous navigation, and industrial machine vision. To address such requirements, we develop an end-to-end learned image codec whose latent space is designed to support scalability from simpler to more complicated tasks. The simplest task is assigned to a subset of the latent space (the base layer), while more complicated tasks make use of additional subsets of the latent space, i.e., both the base and enhancement layer(s). For the experiments, we establish a 2-layer and a 3-layer model, each of which offers input reconstruction for human vision, plus machine vision task(s), and compare them with relevant benchmarks. The experiments show that our scalable codecs offer 37%-80% bitrate savings on machine vision tasks compared to best alternatives, while being comparable to state-of-the-art image codecs in terms of input reconstruction.
Submitted 13 January, 2022; v1 submitted 18 July, 2021;
originally announced July 2021.
-
Algorithm Unrolling for Massive Access via Deep Neural Network with Theoretical Guarantee
Authors:
Yandong Shi,
Hayoung Choi,
Yuanming Shi,
Yong Zhou
Abstract:
Massive access is a critical design challenge of Internet of Things (IoT) networks. In this paper, we consider the grant-free uplink transmission of an IoT network with a multiple-antenna base station (BS) and a large number of single-antenna IoT devices. Taking into account the sporadic nature of IoT devices, we formulate the joint activity detection and channel estimation (JADCE) problem as a group-sparse matrix estimation problem. This problem can be solved by applying the existing compressed sensing techniques, which however either suffer from high computational complexities or lack of algorithm robustness. To this end, we propose a novel algorithm unrolling framework based on the deep neural network to simultaneously achieve low computational complexity and high robustness for solving the JADCE problem. Specifically, we map the original iterative shrinkage thresholding algorithm (ISTA) into an unrolled recurrent neural network (RNN), thereby improving the convergence rate and computational efficiency through end-to-end training. Moreover, the proposed algorithm unrolling approach inherits the structure and domain knowledge of the ISTA, thereby maintaining the algorithm robustness, which can handle non-Gaussian preamble sequence matrix in massive access. With rigorous theoretical analysis, we further simplify the unrolled network structure by reducing the redundant training parameters. Furthermore, we prove that the simplified unrolled deep neural network structures enjoy a linear convergence rate. Extensive simulations based on various preamble signatures show that the proposed unrolled networks outperform the existing methods in terms of the convergence rate, robustness and estimation accuracy.
Submitted 19 June, 2021;
originally announced June 2021.
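The ISTA iteration that the abstract above maps into an unrolled network can be sketched as follows. This is a minimal, untrained NumPy illustration, not the paper's method: it assumes a real-valued model Y = AX + noise with a row-sparse X, and the function names, the regularization weight `lam`, and the fixed step size are all hypothetical choices made here. The unrolled network described in the abstract instead learns its parameters through end-to-end training.

```python
import numpy as np

def group_soft_threshold(X, tau):
    """Row-wise shrinkage: reduce each row's l2 norm by tau, zeroing short rows."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * X

def ista_group_sparse(A, Y, lam, n_iter=200):
    """Plain ISTA for min_X 0.5*||Y - A X||_F^2 + lam * sum_i ||X[i, :]||_2.

    A, Y, lam and n_iter are illustrative inputs (assumptions of this sketch).
    """
    # Step size 1/L, where L = ||A||_2^2 is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    X = np.zeros((A.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = A.T @ (A @ X - Y)                                 # gradient of the data term
        X = group_soft_threshold(X - step * grad, step * lam)    # proximal step
    return X
```

Each loop iteration corresponds to one layer of an unrolled network; rows of the estimate that are shrunk to zero are read as inactive devices under the group-sparse formulation.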
-
Differentiable Artificial Reverberation
Authors:
Sungho Lee,
Hyeong-Seok Choi,
Kyogu Lee
Abstract:
Artificial reverberation (AR) models play a central role in various audio applications. Therefore, estimating the AR model parameters (ARPs) of a reference reverberation is a crucial task. Although a few recent deep-learning-based approaches have shown promising performance, their non-end-to-end training scheme prevents them from fully exploiting the potential of deep neural networks. This motivates the introduction of differentiable artificial reverberation (DAR) models, allowing loss gradients to be back-propagated end-to-end. However, implementing the AR models with their difference equations "as is" in the deep learning framework severely bottlenecks the training speed when executed with a parallel processor like GPU due to their infinite impulse response (IIR) components. We tackle this problem by replacing the IIR filters with finite impulse response (FIR) approximations with the frequency-sampling method. Using this technique, we implement three DAR models -- differentiable Filtered Velvet Noise (FVN), Advanced Filtered Velvet Noise (AFVN), and Delay Network (DN). For each AR model, we train its ARP estimation networks for analysis-synthesis (RIR-to-ARP) and blind estimation (reverberant-speech-to-ARP) task in an end-to-end manner with its DAR model counterpart. Experiment results show that the proposed method achieves consistent performance improvement over the non-end-to-end approaches in both objective metrics and subjective listening test results.
Submitted 20 July, 2022; v1 submitted 28 May, 2021;
originally announced May 2021.
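The frequency-sampling idea mentioned in the abstract above can be illustrated with a short sketch: evaluate the IIR transfer function at uniformly spaced frequencies and take an inverse FFT to obtain FIR taps. This is a generic NumPy illustration under assumptions made here (the function name, the coefficient convention with `b` and `a` as polynomial coefficients in z^-1, and the use of NumPy rather than a differentiable framework); it is not taken from the paper's implementation.

```python
import numpy as np

def freq_sampled_fir(b, a, n_taps):
    """Approximate the IIR filter B(z)/A(z) by an n_taps-long FIR filter.

    The frequency response is sampled at n_taps uniform points on the unit
    circle and inverted with an FFT. The result is the time-aliased impulse
    response, so n_taps should be long enough for the response to decay.
    """
    B = np.fft.rfft(b, n_taps)   # numerator sampled on the unit circle
    A = np.fft.rfft(a, n_taps)   # denominator sampled on the unit circle
    return np.fft.irfft(B / A, n_taps)

# Usage: a one-pole filter y[n] = x[n] + 0.5 * y[n-1]
h = freq_sampled_fir([1.0], [1.0, -0.5], 64)
```

Because the whole computation is a pair of FFTs and an element-wise division, it avoids the sample-by-sample recursion of the difference equation, which is the training bottleneck the abstract refers to.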
-
Deep Neural Networks and End-to-End Learning for Audio Compression
Authors:
Daniela N. Rim,
Inseon Jang,
Heeyoul Choi
Abstract:
Recent achievements in end-to-end deep learning have encouraged the exploration of tasks dealing with highly structured data with unified deep network models. Having such models for compressing audio signals has been challenging since it requires discrete representations that are not easy to train with end-to-end backpropagation. In this paper, we present an end-to-end deep learning approach that combines recurrent neural networks (RNNs) within the training strategy of variational autoencoders (VAEs) with a binary representation of the latent space. We apply a reparametrization trick for the Bernoulli distribution for the discrete representations, which allows smooth backpropagation. In addition, our approach allows the separation of the encoder and decoder, which is necessary for compression tasks. To our best knowledge, this is the first end-to-end learning for a single audio compression model with RNNs, and our model achieves a Signal to Distortion Ratio (SDR) of 20.54.
Submitted 13 July, 2021; v1 submitted 25 May, 2021;
originally announced May 2021.
-
Latent-space scalability for multi-task collaborative intelligence
Authors:
Hyomin Choi,
Ivan V. Bajić
Abstract:
We investigate latent-space scalability for multi-task collaborative intelligence, where one of the tasks is object detection and the other is input reconstruction. In our proposed approach, part of the latent space can be selectively decoded to support object detection while the remainder can be decoded when input reconstruction is needed. Such an approach allows reduced computational resources when only object detection is required, and this can be achieved without reconstructing input pixels. By varying the scaling factors of various terms in the training loss function, the system can be trained to achieve various trade-offs between object detection accuracy and input reconstruction quality. Experiments are conducted to demonstrate the adjustable system performance on the two tasks compared to the relevant benchmarks.
Submitted 20 May, 2021;
originally announced May 2021.
-
Lightweight Compression of Intermediate Neural Network Features for Collaborative Intelligence
Authors:
Robert A. Cohen,
Hyomin Choi,
Ivan V. Bajić
Abstract:
In collaborative intelligence applications, part of a deep neural network (DNN) is deployed on a lightweight device such as a mobile phone or edge device, and the remaining portion of the DNN is processed where more computing resources are available, such as in the cloud. This paper presents a novel lightweight compression technique designed specifically to quantize and compress the features output by the intermediate layer of a split DNN, without requiring any retraining of the network weights. Mathematical models for estimating the clipping and quantization error of ReLU and leaky-ReLU activations at this intermediate layer are developed and used to compute optimal clipping ranges for coarse quantization. We also present a modified entropy-constrained design algorithm for quantizing clipped activations. When applied to popular object-detection and classification DNNs, we were able to compress the 32-bit floating point intermediate activations down to 0.6 to 0.8 bits, while keeping the loss in accuracy to less than 1%. When compared to HEVC, we found that the lightweight codec consistently provided better inference accuracy, by up to 1.3%. The performance and simplicity of this lightweight compression technique makes it an attractive option for coding an intermediate layer of a split neural network for edge/cloud applications.
Submitted 14 May, 2021;
originally announced May 2021.
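The clip-then-quantize step that the abstract above builds on can be sketched as follows. This is a simplified illustration with a uniform quantizer and a hand-picked clipping value; the paper derives its clipping ranges from error models and uses an entropy-constrained quantizer design, neither of which is reproduced here. The function names and the parameters `c_max` and `n_levels` are hypothetical.

```python
import numpy as np

def clip_and_quantize(x, c_max, n_levels):
    """Clip activations to [0, c_max] and map them to n_levels uniform indices."""
    clipped = np.clip(x, 0.0, c_max)
    step = c_max / (n_levels - 1)          # spacing between reconstruction levels
    return np.rint(clipped / step).astype(np.int32)

def dequantize(idx, c_max, n_levels):
    """Map integer indices back to reconstruction values in [0, c_max]."""
    step = c_max / (n_levels - 1)
    return idx * step

# Usage: ReLU-like activations coarsely quantized to 3 levels
acts = np.array([-1.0, 0.0, 0.4, 1.0, 5.0])
idx = clip_and_quantize(acts, c_max=2.0, n_levels=3)
rec = dequantize(idx, c_max=2.0, n_levels=3)
```

The integer indices would then be passed to an entropy coder; with few levels and many zero-valued activations, the average rate can fall well below one bit per element.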
-
Lightweight compression of neural network feature tensors for collaborative intelligence
Authors:
Robert A. Cohen,
Hyomin Choi,
Ivan V. Bajić
Abstract:
In collaborative intelligence applications, part of a deep neural network (DNN) is deployed on a relatively low-complexity device such as a mobile phone or edge device, and the remainder of the DNN is processed where more computing resources are available, such as in the cloud. This paper presents a novel lightweight compression technique designed specifically to code the activations of a split DNN layer, while having a low complexity suitable for edge devices and not requiring any retraining. We also present a modified entropy-constrained quantizer design algorithm optimized for clipped activations. When applied to popular object-detection and classification DNNs, we were able to compress the 32-bit floating point activations down to 0.6 to 0.8 bits, while keeping the loss in accuracy to less than 1%. When compared to HEVC, we found that the lightweight codec consistently provided better inference accuracy, by up to 1.3%. The performance and simplicity of this lightweight compression technique makes it an attractive option for coding a layer's activations in split neural networks for edge/cloud applications.
Submitted 12 May, 2021;
originally announced May 2021.
-
Room adaptive conditioning method for sound event classification in reverberant environments
Authors:
Jaejun Lee,
Donmoon Lee,
Hyeong-Seok Choi,
Kyogu Lee
Abstract:
Ensuring performance robustness for a variety of situations that can occur in real-world environments is one of the challenging tasks in sound event classification. One of the unpredictable and detrimental factors in performance, especially in indoor environments, is reverberation. To alleviate this problem, we propose a conditioning method that provides room impulse response (RIR) information to help the network become less sensitive to environmental information and focus on classifying the desired sound. Experimental results show that the proposed method successfully reduced performance degradation caused by the reverberation of the room. In particular, our proposed method works even with similar RIR that can be inferred from the room type rather than the exact one, which has the advantage of potentially being used in real-world applications.
Submitted 21 April, 2021;
originally announced April 2021.