Search | arXiv e-print repository

DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture

Abstract: The joint-embedding predictive architecture (JEPA) recently has shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, which reduces discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA, specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we consider a set of semantically similar neighboring patches as a target of a masked patch. To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches having semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate features of neighboring patches as the masked targets. Consequently, DMT-JEPA demonstrates strong discriminative power, offering benefits across a diverse spectrum of downstream tasks. Through extensive experiments, we demonstrate its effectiveness across various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection tasks. Code is available at: https://github.com/DMTJEPA/DMTJEPA.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2401.13936 [pdf, ps, other]

Learning-based sensing and computing decision for data freshness in edge computing-enabled networks

Authors: Sinwoong Yun, Dongsun Kim, Chanwon Park, Jemin Lee

Abstract: As the demand for artificial intelligence (AI)-based applications increases, the freshness of sensed data becomes crucial in wireless sensor networks. Since those applications require a large amount of computation for processing the sensed data, it is essential to offload the computation load to the edge computing (EC) server. In this paper, we propose sensing and computing decision (SCD) algorithms for data freshness in EC-enabled wireless sensor networks. We define the η-coverage probability as the probability of maintaining fresh data for more than an η ratio of the network, where the spatial-temporal correlation of information is considered. We then propose the probability-based SCD for the single pre-charged sensor case and provide the optimal point after deriving the η-coverage probability. We also propose the reinforcement learning (RL)-based SCD, which trains the SCD policy of sensors for both the single pre-charged and multiple energy harvesting (EH) sensor cases so that each sensor makes a real-time decision based on its observation. Our simulation results verify the performance of the proposed algorithms under various environment settings, and show that the RL-based SCD algorithm achieves higher performance than baseline algorithms for both the single pre-charged sensor and multiple EH sensor cases.

Submitted 24 January, 2024; originally announced January 2024.

Comments: 15 pages

arXiv:2311.00822 [pdf, other]

Synthesis and verification of robust-adaptive safe controllers

Authors: Simin Liu, Kai S. Yun, John M. Dolan, Changliu Liu

Abstract: Safe control with guarantees generally requires the system model to be known. It is far more challenging to handle systems with uncertain parameters. In this paper, we propose a generic algorithm that can synthesize and verify safe controllers for systems with constant, unknown parameters. In particular, we use robust-adaptive control barrier functions (raCBFs) to achieve safety. We develop new theories and techniques using sum-of-squares that enable us to pose synthesis and verification as a series of convex optimization problems. In our experiments, we show that our algorithms are general and scalable, applying them to three different polynomial systems of up to moderate size (7D). Our raCBFs are currently the most effective way to guarantee safety for uncertain systems, achieving 100% safety and up to 55% performance improvement over a robust baseline.

Submitted 2 April, 2024; v1 submitted 1 November, 2023; originally announced November 2023.

Comments: First 2 authors contributed equally

arXiv:2305.14032 [pdf, other]

doi 10.21437/Interspeech.2023-1426

Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification

Authors: Sangmin Bae, June-Woo Kim, Won-Yang Cho, Hyerim Baek, Soyoun Son, Byungjo Lee, Changwan Ha, Kyongpil Tae, Sungnyun Kim, Se-Young Yun

Abstract: Respiratory sound contains crucial information for the early diagnosis of fatal lung diseases. Since the COVID-19 pandemic, there has been a growing interest in contact-free medical care based on electronic stethoscopes. To this end, cutting-edge deep learning models have been developed to diagnose lung diseases; however, it is still challenging due to the scarcity of medical data. In this study, we demonstrate that a model pretrained on large-scale visual and audio datasets can be generalized to the respiratory sound classification task. In addition, we introduce a straightforward Patch-Mix augmentation, which randomly mixes patches between different samples, with Audio Spectrogram Transformer (AST). We further propose a novel and effective Patch-Mix Contrastive Learning to distinguish the mixed representations in the latent space. Our method achieves state-of-the-art performance on the ICBHI dataset, outperforming the prior leading score by an improvement of 4.08%.

Submitted 22 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: INTERSPEECH 2023, Code URL: https://github.com/raymin0223/patch-mix_contrastive_learning
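The Patch-Mix augmentation described in the abstract above ("randomly mixes patches between different samples") can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `patch_mix`, the representation of a sample as a plain list of patches, and the `ratio` parameter are all hypothetical, and the contrastive objective is not shown.

```python
import random


def patch_mix(patches_a, patches_b, ratio, rng):
    """Replace a random subset of patches in sample A with the patches
    at the same positions in sample B (hypothetical helper).

    Returns the mixed patch list and the sorted indices taken from B.
    """
    if len(patches_a) != len(patches_b):
        raise ValueError("both samples must have the same number of patches")
    n = len(patches_a)
    num_swapped = int(round(ratio * n))
    # Choose which patch positions come from sample B.
    swapped = sorted(rng.sample(range(n), num_swapped))
    mixed = list(patches_a)
    for idx in swapped:
        mixed[idx] = patches_b[idx]
    return mixed, swapped


# Toy usage: 8 "patches" per sample, labelled by origin for readability.
rng = random.Random(0)
sample_a = [("a", i) for i in range(8)]
sample_b = [("b", i) for i in range(8)]
mixed, swapped = patch_mix(sample_a, sample_b, ratio=0.25, rng=rng)
```

In this sketch the fraction of swapped patches (`len(swapped) / len(mixed)`) plays the role of a mixing ratio that a downstream loss could use to weight the two source labels; whether and how the paper does this is not stated in the abstract.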

arXiv:2305.11685 [pdf, other]

doi 10.21437/Interspeech.2023-1329

Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation

Authors: Kangwook Jang, Sungnyun Kim, Se-Young Yun, Hoirin Kim

Abstract: Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks. However, the huge number of parameters in speech SSL models necessitates compression to a more compact model for wider usage in academia or small companies. In this study, we suggest reusing attention maps across the Transformer layers, so as to remove key and query parameters while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the student model's speech representation quality. We extend the distillation loss to utilize both masked and unmasked speech frames to fully leverage the teacher model's high-quality representation. Our universal compression strategy yields a student model that achieves a phoneme error rate (PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB benchmark.

Submitted 26 October, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

Comments: Proceedings of Interspeech 2023. Code URL: https://github.com/sungnyun/ARMHuBERT

arXiv:2302.08779 [pdf, other]

On the convergence result of the gradient-push algorithm on directed graphs with constant stepsize

Authors: Woocheol Choi, Doheon Kim, Seok-Bae Yun

Abstract: The gradient-push algorithm has been widely used for decentralized optimization problems when the connectivity network is a directed graph. This paper shows that the gradient-push algorithm with stepsize $α>0$ converges exponentially fast to an $O(α)$-neighborhood of the optimizer under the assumption that each cost is smooth and the total cost is strongly convex. Numerical experiments are provided to support the theoretical convergence results.

Submitted 17 February, 2023; originally announced February 2023.

MSC Class: 90C25; 68Q25
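For readers unfamiliar with gradient-push, the sketch below shows the commonly cited push-sum form of the iteration with a constant stepsize on a small directed graph. It is an illustration under assumptions, not the paper's code: the three-node graph, the quadratic local costs f_i(x) = (x - b_i)^2 / 2, and all names are hypothetical choices made here. Each node splits its state equally among its out-neighbors (a column-stochastic mixing rule), and the ratio z_i = w_i / y_i corrects the imbalance of the directed graph.

```python
def gradient_push(out_neighbors, grads, alpha, steps):
    """Gradient-push (push-sum gradient descent) with constant stepsize.

    out_neighbors[j] lists the nodes that node j sends to (self included).
    grads[i] is the gradient of node i's local cost.
    Returns the de-biased estimates z_i after `steps` iterations.
    """
    n = len(out_neighbors)
    x = [0.0] * n  # push-sum numerators
    y = [1.0] * n  # push-sum weights
    z = [0.0] * n  # de-biased estimates
    for _ in range(steps):
        w = [0.0] * n
        y_new = [0.0] * n
        for j, targets in enumerate(out_neighbors):
            share = 1.0 / len(targets)  # column-stochastic weights
            for i in targets:
                w[i] += share * x[j]
                y_new[i] += share * y[j]
        y = y_new
        z = [w[i] / y[i] for i in range(n)]
        x = [w[i] - alpha * grads[i](z[i]) for i in range(n)]
    return z


# Directed, unbalanced, strongly connected graph with self-loops.
out_neighbors = [[0, 1, 2], [1, 2], [2, 0]]
targets_b = [1.0, 2.0, 6.0]
# Local costs f_i(x) = 0.5 * (x - b_i)^2, so the global minimizer is mean(b) = 3.
grads = [lambda v, b=b: v - b for b in targets_b]
z = gradient_push(out_neighbors, grads, alpha=0.01, steps=5000)
```

With a constant stepsize the nodes do not agree exactly; each z_i settles near the minimizer of the total cost, with a gap that shrinks as alpha is reduced, which is the kind of O(α)-neighborhood behavior the abstract refers to.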

arXiv:2211.13920 [pdf, other]

Secure Power Control for Downlink Cell-Free Massive MIMO With Passive Eavesdroppers

Authors: Junguk Park, Sangseok Yun, Jeongseok Ha

Abstract: This work studies secure communications for a cell-free massive multiple-input multiple-output (CF-mMIMO) network which is attacked by multiple passive eavesdroppers overhearing communications between access points (APs) and users in the network. It will be revealed that the distributed APs in CF-mMIMO allow not only legitimate users but also eavesdroppers to reap the diversity gain, which seriously degrades secrecy performance. Motivated by this, this work proposes an artificial noise (AN)-aided secure power control scheme for CF-mMIMO under passive eavesdropping, aiming to achieve a higher secrecy rate and/or guarantee security. In particular, it will be demonstrated that a careful use of the AN signal in the power control is especially important to improve the secrecy performance. The performance of the proposed power control scheme is evaluated and compared with various power control schemes via numerical experiments, which clearly show that the proposed power control scheme outperforms all the competing schemes.

Submitted 25 November, 2022; originally announced November 2022.

Comments: 5 pages, 3 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2206.13700 [pdf, other]

Domain Agnostic Few-shot Learning for Speaker Verification

Authors: Seunghan Yang, Debasmit Das, Janghoon Cho, Hyoungwoo Park, Sungrack Yun

Abstract: Deep learning models for verification systems often fail to generalize to new users and new environments, even though they learn highly discriminative features. To address this problem, we propose a few-shot domain generalization framework that learns to tackle distribution shift for new users and new domains. Our framework consists of domain-specific and domain-aggregation networks, which are the experts on specific and combined domains, respectively. By using these networks, we generate episodes that mimic the presence of both novel users and novel domains in the training phase to eventually produce better generalization. To save memory, we reduce the number of domain-specific networks by clustering similar domains together. Upon extensive evaluation on artificially generated noise domains, we can explicitly show generalization ability of our framework. In addition, we apply our proposed methods to the existing competitive architecture on the standard benchmark, which shows further performance improvements.

Submitted 27 June, 2022; originally announced June 2022.

Comments: Proceedings of INTERSPEECH 2022

arXiv:2206.07651 [pdf]

Fault Diagnosis of Inter-turn Short Circuit in Permanent Magnet Synchronous Motors with Current Signal Imaging and Unsupervised Learning

Authors: W. Jung, S. H. Yun, Y. S. Lim, S. Cheong, J. Bae, Y. H. Park

Abstract: This paper proposes machine-independent feature engineering for winding inter-turn short circuit faults that uses electrical current signals. The electrical current signal collected from a permanent magnet synchronous motor (PMSM) is subject to different environmental and operational conditions. To solve these problems, a robust current signal imaging method and a deep learning-based feature extraction method are developed. The overall procedure includes the following three key steps: (1) transforming a time-series current signal into a two-dimensional image, (2) extracting features using convolutional neural networks, and (3) calculating a health indicator using the Mahalanobis distance. Transformation of the time-series signal is based on recurrence plots (RP). The proposed RP method builds on feature engineering that provides the dominant fault feature representations in a robust way. The proposed RP is designed to maximize the features of the inter-turn short fault and minimize the effect of noise from systems with various capacities. To demonstrate the validity of the proposed method, two case studies are conducted using an artificial fault-seeded testbed with two different capacities of motor. By calculating the feature using only the electrical current signal of the motor, without the parameters related to the capacity of the motor, the proposed feature can be applied to motors with different capacities while maintaining the same performance.

Submitted 9 June, 2022; originally announced June 2022.

Comments: submitted to IECON 2022
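Step (1) of the procedure in the abstract above turns a time series into a two-dimensional image via a recurrence plot. The sketch below shows only the generic, textbook thresholded recurrence plot for a scalar signal; the paper's modified RP is not described in enough detail in the abstract to reproduce, and the function name, the threshold `eps`, and the synthetic sine signal are assumptions made here for illustration.

```python
import math


def recurrence_plot(signal, eps):
    """Binary recurrence matrix: entry (i, j) is 1 when samples i and j
    are within eps of each other, else 0 (generic definition)."""
    return [[1 if abs(a - b) <= eps else 0 for b in signal] for a in signal]


# Synthetic stand-in for a current signal: a sine with a period of 8 samples.
signal = [math.sin(2 * math.pi * k / 8) for k in range(16)]
rp = recurrence_plot(signal, eps=0.1)

# Print the matrix as a crude image; periodic signals give diagonal stripes.
for row in rp:
    print("".join("#" if v else "." for v in row))
```

The resulting matrix is symmetric with a filled main diagonal, and a periodic input produces stripes parallel to that diagonal, which is the kind of structure a convolutional network could then pick up in step (2).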

arXiv:2202.03571 [pdf, other]

Metal Artifact Reduction with Intra-Oral Scan Data for 3D Low Dose Maxillofacial CBCT Modeling

Authors: Chang Min Hyun, Taigyntuya Bayaraa, Hye Sun Yun, Tae Jun Jang, Hyoung Suk Park, Jin Keun Seo

Abstract: Low-dose dental cone beam computed tomography (CBCT) has been increasingly used for maxillofacial modeling. However, the presence of metallic inserts, such as implants, crowns, and dental filling, causes severe streaking and shading artifacts in a CBCT image and loss of the morphological structures of the teeth, which consequently prevents accurate segmentation of bones. A two-stage metal artifact reduction method is proposed for accurate 3D low-dose maxillofacial CBCT modeling, where a key idea is to utilize explicit tooth shape prior information from intra-oral scan data whose acquisition does not require any extra radiation exposure. In the first stage, an image-to-image deep learning network is employed to mitigate metal-related artifacts. To improve the learning ability, the proposed network is designed to take advantage of the intra-oral scan data as side-inputs and perform multi-task learning of auxiliary tooth segmentation. In the second stage, a 3D maxillofacial model is constructed by segmenting the bones from the dental CBCT image corrected in the first stage. For accurate bone segmentation, weighted thresholding is applied, wherein the weighting region is determined depending on the geometry of the intra-oral scan data. Because acquiring a paired training dataset of metal-artifact-free and metal artifact-affected dental CBCT images is challenging in clinical practice, an automatic method of generating a realistic dataset according to the CBCT physics model is introduced. Numerical simulations and clinical experiments show the feasibility of the proposed method, which takes advantage of tooth surface information from intra-oral scan data in 3D low dose maxillofacial CBCT modeling.

Submitted 7 February, 2022; originally announced February 2022.

arXiv:2112.01784 [pdf, other]

doi 10.1016/j.media.2024.103096

Fully automatic integration of dental CBCT images and full-arch intraoral impressions with stitching error correction via individual tooth segmentation and identification

Authors: Tae Jun Jang, Hye Sun Yun, Chang Min Hyun, Jong-Eun Kim, Sang-Hwy Lee, Jin Keun Seo

Abstract: We present a fully automated method of integrating intraoral scan (IOS) and dental cone-beam computerized tomography (CBCT) images into one image by complementing each image's weaknesses. Dental CBCT alone may not be able to delineate precise details of the tooth surface due to limited image resolution and various CBCT artifacts, including metal-induced artifacts. IOS is very accurate for the scanning of narrow areas, but it produces cumulative stitching errors during full-arch scanning. The proposed method is intended not only to compensate for the low quality of CBCT-derived tooth surfaces with IOS, but also to correct the cumulative stitching errors of IOS across the entire dental arch. Moreover, the integration provides both the gingival structure of IOS and the tooth roots of CBCT in one image. The proposed fully automated method consists of four parts: (i) individual tooth segmentation and identification module for IOS data (TSIM-IOS); (ii) individual tooth segmentation and identification module for CBCT data (TSIM-CBCT); (iii) global-to-local tooth registration between IOS and CBCT; and (iv) stitching error correction of full-arch IOS. The experimental results show that the proposed method achieved landmark and surface distance errors of 112.4 $μ$m and 301.7 $μ$m, respectively.

Submitted 2 March, 2023; v1 submitted 3 December, 2021; originally announced December 2021.

arXiv:2104.11849 [pdf, other]

Do All MobileNets Quantize Poorly? Gaining Insights into the Effect of Quantization on Depthwise Separable Convolutional Networks Through the Eyes of Multi-scale Distributional Dynamics

Authors: Stone Yun, Alexander Wong

Abstract: As the "Mobile AI" revolution continues to grow, so does the need to understand the behaviour of edge-deployed deep neural networks. In particular, MobileNets are the go-to family of deep convolutional neural networks (CNN) for mobile. However, they often have significant accuracy degradation under post-training quantization. While studies have introduced quantization-aware training and other methods to tackle this challenge, there is limited understanding of why MobileNets (and potentially depthwise-separable CNNs (DWSCNN) in general) quantize so poorly compared to other CNN architectures. Motivated to gain deeper insights into this phenomenon, we take a different strategy and study the multi-scale distributional dynamics of MobileNet-V1, a set of smaller DWSCNNs, and regular CNNs. Specifically, we investigate the impact of quantization on the weight and activation distributional dynamics as information propagates from layer to layer, as well as overall changes in distributional dynamics at the network level. This fine-grained analysis revealed significant dynamic range fluctuations and a "distributional mismatch" between channelwise and layerwise distributions in DWSCNNs that lead to increasing quantized degradation and distributional shift during information propagation. Furthermore, analysis of the activation quantization errors shows that there is greater quantization error accumulation in DWSCNNs compared to regular CNNs. The hope is that such insights can lead to innovative strategies for reducing such distributional dynamics changes and improve post-training quantization for mobile.

Submitted 23 April, 2021; originally announced April 2021.

Comments: Accepted for publication in Mobile AI (MAI) Workshop 2021 at CVPR

arXiv:2101.05205 [pdf, other]

Automated 3D cephalometric landmark identification using computerized tomography

Authors: Hye Sun Yun, Chang Min Hyun, Seong Hyeon Baek, Sang-Hwy Lee, Jin Keun Seo

Abstract: Identification of 3D cephalometric landmarks that serve as a proxy for the shape of the human skull is the fundamental step in cephalometric analysis. Since manual landmarking from 3D computed tomography (CT) images is a cumbersome task even for trained experts, an automatic 3D landmark detection system is in great need. Recently, automatic landmarking of 2D cephalograms using deep learning (DL) has achieved great success, but 3D landmarking for more than 80 landmarks has not yet reached a satisfactory level, because of factors hindering machine learning such as the high dimensionality of the input data and the limited amount of training data due to ethical restrictions on the use of medical data. This paper presents a semi-supervised DL method for 3D landmarking that takes advantage of an anonymized landmark dataset with the paired CT data removed. The proposed method first detects a small number of easy-to-find reference landmarks, then uses them to provide a rough estimation of all the landmarks by utilizing the low-dimensional representation learned by a variational autoencoder (VAE). The anonymized landmark dataset is used for training the VAE. Finally, coarse-to-fine detection is applied to the small bounding box provided by the rough estimation, using separate strategies suitable for the mandible and cranium. For mandibular landmarks, a patch-based 3D CNN is applied to the segmented image of the mandible (separated from the maxilla), in order to capture 3D morphological features of the mandible associated with the landmarks. We detect 6 landmarks around the condyle all at once, instead of one by one, because they are closely related to each other. For cranial landmarks, we again use the VAE-based latent representation for more accurate annotation. In our experiment, the proposed method achieved an averaged 3D point-to-point error of 2.91 mm for 90 landmarks with only 15 paired training data.

Submitted 16 December, 2020; originally announced January 2021.

arXiv:1910.06790 [pdf, other]

Weakly Labeled Sound Event Detection Using Tri-training and Adversarial Learning

Authors: Hyoungwoo Park, Sungrack Yun, Jungyun Eum, Janghoon Cho, Kyuwoong Hwang

Abstract: This paper considers a semi-supervised learning framework for weakly labeled polyphonic sound event detection problems for the DCASE 2019 challenge's task4 by combining tri-training and adversarial learning. The goal of task4 is to detect onsets and offsets of multiple sound events in a single audio clip. The entire dataset consists of synthetic data with strong labels (sound event labels with boundaries) and real data that is either weakly labeled (sound event labels) or unlabeled. Given this dataset, we apply tri-training, where two different classifiers are used to obtain pseudo labels on the weakly labeled and unlabeled datasets, and the final classifier is trained using the strongly labeled dataset and the weakly labeled and unlabeled datasets with pseudo labels. Also, we apply adversarial learning to reduce the domain gap between the real and synthetic datasets. We evaluated our learning framework using the validation set of the task4 dataset, and in the experiments, our learning framework shows a considerable performance improvement over the baseline model.

Submitted 14 October, 2019; originally announced October 2019.

Comments: 5 pages, DCASE 2019 Workshop

arXiv:1910.06784 [pdf, other]

Acoustic Scene Classification Based on a Large-margin Factorized CNN

Authors: Janghoon Cho, Sungrack Yun, Hyoungwoo Park, Jungyun Eum, Kyuwoong Hwang

Abstract: In this paper, we present an acoustic scene classification framework based on a large-margin factorized convolutional neural network (CNN). We adopt the factorized CNN to learn the patterns in the time-frequency domain by factorizing the 2D kernel into two separate 1D kernels. The factorized kernel learns the main components of two patterns: the long-term ambient and short-term event sounds, which are the key patterns of audio scene classification. In training our model, we consider a loss function based on triplet sampling such that distances between samples of the same audio scene from different environments are minimized while distances between samples of different audio scenes are maximized. With this loss function, the samples from the same audio scene are clustered independently of the environment, and thus we can get a classifier with better generalization ability in an unseen environment. We evaluated our audio scene classification framework using the dataset of the DCASE challenge 2019 task1A. Experimental results show that the proposed algorithm improves the performance of the baseline network and reduces the number of parameters to one third. Furthermore, the performance gain is higher on unseen data, which shows that the proposed algorithm has better generalization ability.

Submitted 14 October, 2019; originally announced October 2019.

Comments: 5 pages, DCASE 2019 Workshop

arXiv:1908.02612 [pdf, ps, other]

An End-to-End Text-independent Speaker Verification Framework with a Keyword Adversarial Network

Authors: Sungrack Yun, Janghoon Cho, Jungyun Eum, Wonil Chang, Kyuwoong Hwang

Abstract: This paper presents an end-to-end text-independent speaker verification framework by jointly considering the speaker embedding (SE) network and automatic speech recognition (ASR) network. The SE network learns to output an embedding vector which distinguishes the speaker characteristics of the input utterance, while the ASR network learns to recognize the phonetic context of the input. In training our speaker verification framework, we consider both the triplet loss minimization and adversarial gradient of the ASR network to obtain more discriminative and text-independent speaker embedding vectors. With the triplet loss, the distances between the embedding vectors of the same speaker are minimized while those of different speakers are maximized. Also, with the adversarial gradient of the ASR network, the text-dependency of the speaker embedding vector can be reduced. In the experiments, we evaluated our speaker verification framework using the LibriSpeech and CHiME 2013 datasets, and the evaluation results show that our speaker verification framework achieves a lower equal error rate and better text-independency compared to the other approaches.

Submitted 6 August, 2019; originally announced August 2019.

Comments: To appear in INTERSPEECH 2019
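The triplet loss mentioned in the abstract above pulls embeddings of the same speaker together and pushes embeddings of different speakers apart. The sketch below shows the standard margin-based form with squared Euclidean distance. The margin value, the toy two-dimensional embeddings, and the choice of distance are assumptions for illustration; the abstract does not state which variant the authors use.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin):
    """Standard margin-based triplet loss:
    max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy embeddings: anchor and positive come from the same (hypothetical) speaker.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
far_negative = [3.0, 0.0]   # already well separated, so no penalty
near_negative = [1.0, 0.0]  # as close as the positive, so it is penalized

loss_far = triplet_loss(anchor, positive, far_negative, margin=1.0)
loss_near = triplet_loss(anchor, positive, near_negative, margin=1.0)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so training effort concentrates on triplets that still violate that separation.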

arXiv:1906.06579 [pdf, other]

EXTD: Extremely Tiny Face Detector via Iterative Filter Reuse

Authors: YoungJoon Yoo, Dongyoon Han, Sangdoo Yun

Abstract: In this paper, we propose a new multi-scale face detector having an extremely tiny number of parameters (EXTD), less than 0.1 million, while achieving performance comparable to deep heavy detectors. While existing multi-scale face detectors extract feature maps with different scales from a single backbone network, our method generates the feature maps by iteratively reusing a shared lightweight and shallow backbone network. This iterative sharing of the backbone network significantly reduces the number of parameters, and also provides the abstract image semantics captured from the higher stage of the network layers to the lower-level feature map. The proposed idea is employed by various model architectures and evaluated by extensive experiments. From the experiments on the WIDER FACE dataset, we show that the proposed face detector can handle faces with various scales and conditions, and achieves performance comparable to more massive face detectors that are tens to a few hundred times heavier in model size and floating-point operations.

Submitted 23 June, 2019; v1 submitted 15 June, 2019; originally announced June 2019.

arXiv:1810.11520 [pdf, other]

Spectrogram-channels u-net: a source separation model viewing each channel as the spectrogram of each source

Authors: Jaehoon Oh, Duyeon Kim, Se-Young Yun

Abstract: Sound source separation has attracted attention from Music Information Retrieval (MIR) researchers, since it is related to many MIR tasks such as automatic lyric transcription, singer identification, and voice conversion. In this paper, we propose an intuitive spectrogram-based model for source separation by adapting U-Net. We call it Spectrogram-Channels U-Net, which means each channel of the output corresponds to the spectrogram of a separated source itself. The proposed model can be used not only for singing voice separation but also for multi-instrument separation by changing only the number of output channels. In addition, we propose a loss function that balances volumes between different sources. Finally, we achieve state-of-the-art performance on both separation tasks.

Submitted 30 October, 2018; v1 submitted 26 October, 2018; originally announced October 2018.

Comments: 3 figures

Showing 1–18 of 18 results for author: Yun, S