Search | arXiv e-print repository

f-GAN: A frequency-domain-constrained generative adversarial network for PPG to ECG synthesis

Authors: Nathan C. L. Kong, Dae Lee, Huyen Do, Dae Hoon Park, Cong Xu, Hongda Mao, Jonathan Chung

Abstract: Electrocardiograms (ECGs) and photoplethysmograms (PPGs) are generally used to monitor an individual's cardiovascular health. In clinical settings, ECGs and fingertip PPGs are the main signals used for assessing cardiovascular health, but the equipment necessary for their collection precludes their use in daily monitoring. Although PPGs obtained from wrist-worn devices are susceptible to noise due to motion, they have been widely used to continuously monitor cardiovascular health because of their convenience. Therefore, we would like to combine the ease with which PPGs can be collected with the information that ECGs provide about cardiovascular health by developing models to synthesize ECG signals from paired PPG signals. We tackled this problem using generative adversarial networks (GANs) and found that models trained using the original GAN formulations can be successfully used to synthesize ECG signals from which heart rate can be extracted using standard signal processing pipelines. Incorporating a frequency-domain constraint to model training improved the stability of model performance and also the performance on heart rate estimation.

Submitted 15 May, 2024; originally announced June 2024.

arXiv:2308.07788 [pdf, ps, other]

GIST-AiTeR Speaker Diarization System for VoxCeleb Speaker Recognition Challenge (VoxSRC) 2023

Authors: Dongkeon Park, Ji Won Kim, Kang Ryeol Kim, Do Hyun Lee, Hong Kook Kim

Abstract: This report describes the submission system by the GIST-AiTeR team for the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23) Track 4. Our submission system focuses on implementing diverse speaker diarization (SD) techniques, including ResNet293 and MFA-Conformer with different combinations of segment and hop length. Then, those models are combined into an ensemble model. The ResNet293 and MFA-Conformer models exhibited the diarization error rates (DERs) of 3.65% and 3.83% on VAL46, respectively. The submitted ensemble model provided a DER of 3.50% on VAL46, and consequently, it achieved a DER of 4.88% on the VoxSRC-23 test set.

Submitted 25 August, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

Comments: VoxSRC 2023 Track4

arXiv:2307.10667 [pdf, other]

Efficient Unified Demosaicing for Bayer and Non-Bayer Patterned Image Sensors

Authors: Haechang Lee, Dongwon Park, Wongi Jeong, Kijeong Kim, Hyunwoo Je, Dongil Ryu, Se Young Chun

Abstract: As the physical size of recent CMOS image sensors (CIS) gets smaller, the latest mobile cameras are adopting unique non-Bayer color filter array (CFA) patterns (e.g., Quad, Nona, QxQ), which consist of homogeneous color units with adjacent pixels. These non-Bayer sensors are superior to conventional Bayer CFA thanks to their changeable pixel-bin sizes for different light conditions but may introduce visual artifacts during demosaicing due to their inherent pixel pattern structures and sensor hardware characteristics. Previous demosaicing methods have primarily focused on Bayer CFA, necessitating distinct reconstruction methods for non-Bayer patterned CIS with various CFA modes under different lighting conditions. In this work, we propose an efficient unified demosaicing method that can be applied to both conventional Bayer RAW and various non-Bayer CFAs' RAW data in different operation modes. Our Knowledge Learning-based demosaicing model for Adaptive Patterns, namely KLAP, utilizes CFA-adaptive filters for only 1% key filters in the network for each CFA, but still manages to effectively demosaic all the CFAs, yielding comparable performance to the large-scale models. Furthermore, by employing meta-learning during inference (KLAP-M), our model is able to eliminate unknown sensor-generic artifacts in real RAW data, effectively bridging the gap between synthetic images and real sensor RAW. Our KLAP and KLAP-M methods achieved state-of-the-art demosaicing performance in both synthetic and real RAW data of Bayer and non-Bayer CFAs.

Submitted 20 July, 2023; originally announced July 2023.

arXiv:2306.08133 [pdf, ps, other]

Large-scale Language Model Rescoring on Long-form Data

Authors: Tongzhou Chen, Cyril Allauzen, Yinghui Huang, Daniel Park, David Rybach, W. Ronny Huang, Rodrigo Cabrera, Kartik Audhkhasi, Bhuvana Ramabhadran, Pedro J. Moreno, Michael Riley

Abstract: In this work, we study the impact of Large-scale Language Models (LLM) on Automated Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form ASR. We demonstrate up to 8% relative reduction in Word Error Rate (WER) on US English (en-us) and code-switched Indian English (en-in) long-form ASR test sets and a reduction of up to 30% relative on Salient Term Error Rate (STER) over a strong first-pass baseline that uses a maximum-entropy based language model. Improved lattice processing that results in a lattice with a proper (non-tree) digraph topology and carrying context from the 1-best hypothesis of the previous segment(s) results in significant wins in rescoring with LLMs. We also find that the gains in performance from the combination of LLMs trained on vast quantities of available data (such as C4) and conventional neural LMs are additive, and that the combination significantly outperforms a strong first-pass baseline with a maximum entropy LM. Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Submitted 5 September, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

Comments: 5 pages, accepted in ICASSP 2023

Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:2303.01037 [pdf, other]

Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Authors: Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, et al. (2 additional authors not shown)

Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.

Submitted 24 September, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

Comments: 20 pages, 7 figures, 8 tables

arXiv:2302.03917 [pdf, other]

Noise2Music: Text-conditioned Music Generation with Diffusion Models

Authors: Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, Wei Han

Abstract: We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one using a spectrogram and the other using audio with lower fidelity. We find that the generated audio is not only able to faithfully reflect key elements of the text prompt such as genre, tempo, instruments, mood, and era, but goes beyond to ground fine-grained semantics of the prompt. Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models. Generated examples: https://google-research.github.io/noise2music

Submitted 6 March, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

Comments: 15 pages

arXiv:2212.05936 [pdf]

Encoder-Decoder Network with Guided Transmission Map: Architecture

Authors: Le-Anh Tran, Dong-Chul Park

Abstract: An insight into the architecture of the Encoder-Decoder Network with Guided Transmission Map (EDN-GTM), a novel and effective single image dehazing scheme, is presented in this paper. The EDN-GTM takes a conventional RGB hazy image in conjunction with the corresponding transmission map estimated by the dark channel prior (DCP) approach as inputs of the network. The EDN-GTM adopts an enhanced structure of U-Net developed for dehazing tasks and the resulting EDN-GTM has shown state-of-the-art performances on benchmark dehazing datasets in terms of PSNR and SSIM metrics. In order to give an in-depth understanding of the well-designed architecture which largely contributes to the success of the EDN-GTM, extensive experiments and analysis from selecting the core structure of the scheme to investigating advanced network designs are presented in this paper.

Submitted 31 March, 2023; v1 submitted 7 December, 2022; originally announced December 2022.

Comments: 3 pages, 2 figures, ASPAI 2022

arXiv:2211.05910 [pdf, other]

Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 challenge: Report

Authors: Andrey Ignatov, Radu Timofte, Maurizio Denna, Abdel Younes, Ganzorig Gankhuyag, **gang Huh, Myeong Kyun Kim, Kihwan Yoon, Hyeon-Cheol Moon, Seungho Lee, Yoonsik Choe, **woo Jeong, Sungjei Kim, Maciej Smyl, Tomasz Latkowski, Pawel Kubik, Michal Sokolski, Yujie Ma, Jiahao Chao, Zhou Zhou, Hongfan Gao, Zhengfeng Yang, Zhenbing Zeng, Zhengyang Zhuge, Chenghua Li, et al. (71 additional authors not shown)

Abstract: Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose the participants to design an efficient quantized image super-resolution solution that can demonstrate a real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to do a high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 60 FPS rate when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.

Submitted 7 November, 2022; originally announced November 2022.

Comments: arXiv admin note: text overlap with arXiv:2105.07825, arXiv:2105.08826, arXiv:2211.04470, arXiv:2211.03885, arXiv:2211.05256

arXiv:2211.04470 [pdf, other]

Efficient Single-Image Depth Estimation on Mobile Devices, Mobile AI & AIM 2022 Challenge: Report

Authors: Andrey Ignatov, Grigory Malivenko, Radu Timofte, Lukasz Treszczotko, Xin Chang, Piotr Ksiazek, Michal Lopuszynski, Maciej Pioro, Rafal Rudnicki, Maciej Smyl, Yujie Ma, Zhenyu Li, Zehui Chen, Jialei Xu, Xianming Liu, Junjun Jiang, XueChao Shi, Difan Xu, Yanan Li, Xiaotao Wang, Lei Lei, Ziyu Zhang, Yicheng Wang, Zilong Huang, Guozhong Luo, et al. (14 additional authors not shown)

Abstract: Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is very crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth estimation solutions that can show a real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with the ZED stereo camera capable of generating depth maps for objects located at up to 50 meters. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA resolution depth maps at up to 27 FPS while achieving high fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile devices; their detailed description is provided in this paper.

Submitted 7 November, 2022; originally announced November 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2105.08630, arXiv:2211.03885; text overlap with arXiv:2105.08819, arXiv:2105.08826, arXiv:2105.08629, arXiv:2105.07809, arXiv:2105.07825

arXiv:2210.10879 [pdf, other]

G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR

Authors: Gary Wang, Ekin D. Cubuk, Andrew Rosenberg, Shuyang Cheng, Ron J. Weiss, Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le, Daniel S. Park

Abstract: Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training. However, even as so much of the ASR training process has become automated and more "end-to-end", the data augmentation policy (what augmentation functions to use, and how to apply them) remains hand-crafted. We present Graph-Augment, a technique to define the augmentation space as directed acyclic graphs (DAGs) and search over this space to optimize the augmentation policy itself. We show that given the same computational budget, policies produced by G-Augment are able to perform better than SpecAugment policies obtained by random search on fine-tuning tasks on CHiME-6 and AMI. G-Augment is also able to establish a new state-of-the-art ASR performance on the CHiME-6 evaluation set (30.7% WER). We further demonstrate that G-Augment policies show better transfer properties across warm-start to cold-start training and model size compared to random-searched SpecAugment policies.

Submitted 24 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

Comments: 6 pages, accepted at SLT 2022. Updated with copyright

arXiv:2209.10357 [pdf, other]

GIST-AiTeR System for the Diarization Task of the 2022 VoxCeleb Speaker Recognition Challenge

Authors: Dongkeon Park, Yechan Yu, Kyeong Wan Park, Ji Won Kim, Hong Kook Kim

Abstract: This report describes the submission system of the GIST-AiTeR team at the 2022 VoxCeleb Speaker Recognition Challenge (VoxSRC) Track 4. Our system mainly includes speech enhancement, voice activity detection, multi-scaled speaker embedding, probabilistic linear discriminant analysis-based speaker clustering, and overlapped speech detection models. We first construct four different diarization systems according to different model combinations with the best experimental efforts. Our final submission is an ensemble system of all the four systems and achieves a diarization error rate of 5.12% on the challenge evaluation set, ranked third at the diarization track of the challenge.

Submitted 6 October, 2022; v1 submitted 21 September, 2022; originally announced September 2022.

Comments: 2022 VoxSRC Track4

arXiv:2209.09217 [pdf, other]

WiForceSticker: Batteryless, Thin Sticker-like Flexible Force Sensor

Authors: Agrim Gupta, Daegue Park, Shayaun Bashar, Cedric Girerd, Tania Morimoto, Dinesh Bharadia

Abstract: Any two objects in contact with each other exert a force that could be simply due to gravity or mechanical contact, such as a robotic arm gripping an object or even the contact between two bones at our knee joints. The ability to naturally measure and monitor these contact forces allows a plethora of applications from warehouse management (detect faulty packages based on weights) to robotics (making a robotic arm's grip as sensitive as human skin) and healthcare (knee-implants). It is challenging to design a ubiquitous force sensor that can be used naturally for all these applications. First, the sensor should be small enough to fit in narrow spaces. Next, we don't want to lay cumbersome cables to read the force values from the sensors. Finally, we need to have a battery-free design to meet the in-vivo applications. We develop WiForceSticker, a wireless, battery-free, sticker-like force sensor that can be ubiquitously deployed on any surface, such as all warehouse packages, robotic arms, and knee joints. WiForceSticker first designs a tiny 4 mm × 2 mm × 0.4 mm capacitative sensor design equipped with a 10 mm × 10 mm antenna designed on a flexible PCB substrate. Secondly, it introduces a new mechanism to transduce the force information on ambient RF radiations that can be read by a remotely located reader wirelessly without requiring any battery or active components at the force sensor, by interfacing the sensors with COTS RFID systems. The sensor can detect forces in the range of 0-6 N with sensing accuracy of <0.5 N across multiple testing environments and evaluated with over 10,000 varying force level presses on the sensor. We also showcase two application case studies with our designed sensors, weighing warehouse packages and sensing forces applied by bone joints.

Submitted 19 September, 2022; originally announced September 2022.

arXiv:2208.07552 [pdf]

Coil2Coil: Self-supervised MR image denoising using phased-array coil images

Authors: Juhyung Park, Dongwon Park, Hyeong-Geol Shin, Eun-Jung Choi, Hongjun An, Minjun Kim, Dongmyung Shin, Se Young Chun, Jongho Lee

Abstract: Denoising of magnetic resonance images is beneficial in improving the quality of low signal-to-noise ratio images. Recently, denoising using deep neural networks has demonstrated promising results. Most of these networks, however, utilize supervised learning, which requires large training images of noise-corrupted and clean image pairs. Obtaining training images, particularly clean images, is expensive and time-consuming. Hence, methods such as Noise2Noise (N2N) that require only pairs of noise-corrupted images have been developed to reduce the burden of obtaining training datasets. In this study, we propose a new self-supervised denoising method, Coil2Coil (C2C), that does not require the acquisition of clean images or paired noise-corrupted images for training. Instead, the method utilizes multichannel data from phased-array coils to generate training images. First, it divides and combines multichannel coil images into two images, one for input and the other for label. Then, they are processed to impose noise independence and sensitivity normalization such that they can be used for the training images of N2N. For inference, the method inputs a coil-combined image (e.g., DICOM image), enabling a wide application of the method. When evaluated using synthetic noise-added images, C2C shows the best performance against several self-supervised methods, reporting comparable outcomes to supervised methods. When testing the DICOM images, C2C successfully denoised real noise without showing structure-dependent residuals in the error maps. Because of the significant advantage of not requiring additional scans for clean or paired images, the method can be easily utilized for various clinical applications.

Submitted 16 August, 2022; originally announced August 2022.

Comments: 9 pages, 5 figures

arXiv:2208.06056 [pdf, other]

doi 10.1121/10.0018139

Approximate Extraction of Late-Time Returns via Morphological Component Analysis

Authors: Geoff Goehle, Benjamin Cowen, Thomas E. Blanford, J. Daniel Park, Daniel C. Brown

Abstract: A fundamental challenge in acoustic data processing is to separate a measured time series into relevant phenomenological components. A given measurement is typically assumed to be an additive mixture of myriad signals plus noise whose separation forms an ill-posed inverse problem. In the setting of sensing elastic objects using active sonar, we wish to separate the early-time returns (e.g., returns from the object's exterior geometry) from late-time returns caused by elastic or compressional wave coupling. Under the framework of Morphological Component Analysis (MCA), we compare two separation models using the short-duration and long-duration responses as a proxy for early-time and late-time returns. Results are computed for Stanton's elastic cylinder model as well as on experimental data taken from an in-Air circular Synthetic Aperture Sonar (AirSAS) system, whose separated time series are formed into imagery. We find that MCA can be used to separate early and late-time responses in both cases without the use of time-gating. The separation process is demonstrated to be robust to noise and compatible with AirSAS image reconstruction. The best separation results are obtained with a flexible, but computationally intensive, frame based signal model, while a faster Fourier Transform based method is shown to have competitive performance.

Submitted 11 August, 2022; originally announced August 2022.

Comments: 18 pages, 17 figures

arXiv:2205.04821 [pdf, other]

Self-supervised regression learning using domain knowledge: Applications to improving self-supervised denoising in imaging

Authors: Il Yong Chun, Dongwon Park, Xuehang Zheng, Se Young Chun, Yong Long

Abstract: Regression that predicts continuous quantity is a central part of applications using computational imaging and computer vision technologies. Yet, studying and understanding self-supervised learning for regression tasks - except for a particular regression task, image denoising - have lagged behind. This paper proposes a general self-supervised regression learning (SSRL) framework that enables learning regression neural networks with only input data (but without ground-truth target data), by using a designable pseudo-predictor that encapsulates domain knowledge of a specific application. The paper underlines the importance of using domain knowledge by showing that under different settings, the better pseudo-predictor can lead properties of SSRL closer to those of ordinary supervised learning. Numerical experiments for low-dose computational tomography denoising and camera image denoising demonstrate that proposed SSRL significantly improves the denoising quality over several existing self-supervised denoising methods.

Submitted 10 May, 2022; originally announced May 2022.

Comments: 17 pages, 16 figures, 2 tables, submitted to IEEE T-IP

arXiv:2204.11669 [pdf]

doi 10.1038/s41746-023-00859-y

Deep-learning-enabled Brain Hemodynamic Mapping Using Resting-state fMRI

Authors: Xirui Hou, Pengfei Guo, Puyang Wang, Peiying Liu, Doris D. M. Lin, Hongli Fan, Yang Li, Zhiliang Wei, Zixuan Lin, Dengrong Jiang, ** **, Catherine Kelly, Jay J. Pillai, Judy Huang, Marco C. Pinho, Binu P. Thomas, Babu G. Welch, Denise C. Park, Vishal M. Patel, Argye E. Hillis, Hanzhang Lu

Abstract: Cerebrovascular disease is a leading cause of death globally. Prevention and early intervention are known to be the most effective forms of its management. Non-invasive imaging methods hold great promise for early stratification, but at present lack the sensitivity for personalized prognosis. Resting-state functional magnetic resonance imaging (rs-fMRI), a powerful tool previously used for mapping neural activity, is available in most hospitals. Here we show that rs-fMRI can be used to map cerebral hemodynamic function and delineate impairment. By exploiting time variations in breathing pattern during rs-fMRI, deep learning enables reproducible mapping of cerebrovascular reactivity (CVR) and bolus arrival time (BAT) of the human brain using resting-state CO2 fluctuations as a natural 'contrast media'. The deep-learning network was trained with CVR and BAT maps obtained with a reference method of CO2-inhalation MRI, which included data from young and older healthy subjects and patients with Moyamoya disease and brain tumors. We demonstrate the performance of deep-learning cerebrovascular mapping in the detection of vascular abnormalities, evaluation of revascularization effects, and vascular alterations in normal aging. In addition, cerebrovascular maps obtained with the proposed method exhibited excellent reproducibility in both healthy volunteers and stroke patients. Deep-learning resting-state vascular imaging has the potential to become a useful tool in clinical cerebrovascular imaging.

Submitted 25 April, 2022; originally announced April 2022.

Journal ref: npj Digital Medicine (2023) 116

arXiv:2204.08418 [pdf, ps, other]

Enveloped Sinusoid Parseval Frames

Authors: Geoff Goehle, Benjamin Cowen, J. Daniel Park, Daniel C. Brown

Abstract: This paper presents a method of constructing Parseval frames from any collection of complex envelopes. The resulting Enveloped Sinusoid Parseval (ESP) frames can represent a wide variety of signal types as specified by their physical morphology. Since the ESP frame retains its Parseval property even when generated from a variety of envelopes, it is compatible with large scale and iterative optimization algorithms. ESP frames are constructed by applying time-shifted enveloping functions to the discrete Fourier Transform basis, and in this way are similar to the short-time Fourier Transform. This work provides examples of ESP frame generation for both synthetic and experimentally measured signals. Furthermore, the frame's compatibility with distributed sparse optimization frameworks is demonstrated, and efficient implementation details are provided. Numerical experiments on acoustics data reveal that the flexibility of this method allows it to be simultaneously competitive with the STFT in time-frequency processing and also with Prony's Method for time-constant parameter estimation, surpassing the shortcomings of each individual technique.

Submitted 18 April, 2022; originally announced April 2022.

arXiv:2112.12296 [pdf, other]

Sub-Chain Beam for mmWave Devices: A Trade-off between Power Saving and Beam Correspondence

Authors: Jianhua Mo, Daehee Park, Boon Loong Ng, Vutha Va, Anum Ali, Chonghwa Seo, Jianzhong Charlie Zhang

Abstract: Beam correspondence, or downlink-uplink (DL-UL) beam reciprocity, refers to the assumption that the best beams in the DL are also the best beams in the UL. This is an important assumption that allows the existing beam management framework in 5G to rely heavily on DL beam sweeping and avoid UL beam sweeping: UL beams are inferred from the measurements of the DL reference signals. Beam correspondence holds when the radio configurations are symmetric in the DL and UL. However, as mmWave technology matures, the DL and the UL face different constraints often breaking the beam correspondence. For example, power constraints may require a UE to activate only a portion of its antenna array for UL transmission, while still activating the full array for DL reception. Meanwhile, if the UL beam with sub-array, named as sub-chain beam in this paper, has a similar radiation pattern as the DL beam, the beam correspondence can still hold. This paper proposes methods for sub-chain beam codebook design to achieve a trade-off between the power saving and beam correspondence.

Submitted 22 December, 2021; originally announced December 2021.

Comments: 6 pages, 7 figures, accepted by Asilomar conference 2021

arXiv:2111.09051 [pdf]

Implementation of Noise-Shaped Signaling System through Software-Defined Radio

Authors: Junsung Choi, Dongryul Park, Suil Kim, Seungyoung Ahn

Abstract: With the development of electromagnetic weapons, Electronic Warfare (EW) has been rising as the future form of war. Especially in wireless communications, high-security defense systems, such as Low Probability of Detection (LPD), Low Probability of Interception (LPI), or Low Probability of Exploitation (LPE) communication algorithms, are studied to prevent the loss of military forces. One of the LPD, LPI, and LPE communication algorithms, physical-layer security, has been discussed and studied. We propose a noise signaling system, a type of physical-layer security, which modifies conventionally modulated I/Q data into a noise-like shape. To demonstrate the feasibility of a realistic implementation, we use Software-Defined Radio (SDR). Since there are certain limitations of hardware, we present the limitations, requirements, and preferences of practical implementation of the noise signaling system, and the proposed system is ring-shaped signaling. We present the ring-shaped signaling system algorithm, SDR implementation methodology, and performance evaluations of the system by the metrics of Bit Error Rate (BER) and Probability of Modulation Identification (PMI), which we obtain with a Convolutional Neural Network (CNN) algorithm. We conclude that the ring-shaped signaling system can perform a high LPI/LPE communication function because the eavesdropper cannot obtain the correct modulation scheme information, and the performance can vary by the configurations of the I/Q data modifying factors.

Submitted 17 November, 2021; originally announced November 2021.

arXiv:2110.07116 [pdf, other]

Auxiliary Loss of Transformer with Residual Connection for End-to-End Speaker Diarization

Authors: Yechan Yu, Dongkeon Park, Hong Kook Kim

Abstract: End-to-end neural diarization (EEND) with self-attention directly predicts speaker labels from inputs and enables the handling of overlapped speech. Although the EEND outperforms clustering-based speaker diarization (SD), it cannot be further improved by simply increasing the number of encoder blocks because the last encoder block is dominantly supervised compared with lower blocks. This paper proposes a new residual auxiliary EEND (RX-EEND) learning architecture for transformers to enforce the lower encoder blocks to learn more accurately. The auxiliary loss is applied to the output of each encoder block, including the last encoder block. The effect of auxiliary loss on the learning of the encoder blocks can be further increased by adding a residual connection between the encoder blocks of the EEND. Performance evaluation and ablation study reveal that the auxiliary loss in the proposed RX-EEND provides relative reductions in the diarization error rate (DER) by 50.3% and 21.0% on the simulated and CALLHOME (CH) datasets, respectively, compared with self-attentive EEND (SA-EEND). Furthermore, the residual connection used in RX-EEND further relatively reduces the DER by 8.1% for CH dataset.

Submitted 26 September, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: Submitted to ICASSP 2022, equal contribution from first two authors

arXiv:2110.04621 [pdf, other]

doi 10.1109/ICASSP43922.2022.9747197

Universal Paralinguistic Speech Representations Using Self-Supervised Conformers

Authors: Joel Shor, Aren Jansen, Wei Han, Daniel Park, Yu Zhang

Abstract: Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2 second context-windows achieve 96% of the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near optimal performance on all tasks.

Submitted 13 December, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

Journal ref: ICASSP 2022-2022 IEEE

arXiv:2109.13226 [pdf, other]

doi 10.1109/JSTSP.2022.3182537

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, et al. (1 additional author not shown)

Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.

Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

arXiv:2103.12789 [pdf, other]

doi 10.1364/AO.425281

Single pixel structured imaging through fog

Authors: Mark Bashkansky, Samuel D. Park, John Reintjes

Abstract: We describe the application of structured imaging with a single pixel camera to imaging through fog. We demonstrate the use of a high-pass filter on the detected bucket signals to suppress the effects of temporal variations of fog density and enable an effective reconstruction of the image. A quantitative analysis and comparison of several high-pass filters are demonstrated for the application. Both computational ghost imaging and compressive sensing techniques were used for image reconstruction and compressive sensing was observed to give a higher reconstructed image quality.

Submitted 23 March, 2021; originally announced March 2021.

arXiv:2011.06110 [pdf, other]

Efficient Knowledge Distillation for RNN-Transducer Models

Authors: Sankaran Panchapagesan, Daniel S. Park, Chung-Cheng Chiu, Yuan Shangguan, Qiao Liang, Alexander Gruenstein

Abstract: Knowledge Distillation is an effective method of transferring knowledge from a large model to a smaller model. Distillation can be viewed as a type of model compression, and has played an important role for on-device ASR applications. In this paper, we develop a distillation method for RNN-Transducer (RNN-T) models, a popular end-to-end neural network architecture for streaming speech recognition. Our proposed distillation loss is simple and efficient, and uses only the "y" and "blank" posterior probabilities from the RNN-T output probability lattice. We study the effectiveness of the proposed approach in improving the accuracy of sparse RNN-T models obtained by gradually pruning a larger uncompressed model, which also serves as the teacher during distillation. With distillation of 60% and 90% sparse multi-domain RNN-T models, we obtain WER reductions of 4.3% and 12.1% respectively, on a noisy FarField eval set. We also present results of experiments on LibriSpeech, where the introduction of the distillation loss yields a 4.8% relative WER reduction on the test-other dataset for a small Conformer model.

Submitted 11 November, 2020; originally announced November 2020.

Comments: 5 pages, 1 figure, 2 tables; submitted to ICASSP 2021

arXiv:2010.10504 [pdf, other]

Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

Authors: Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, Yonghui Wu

Abstract: We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training. By doing so, we are able to achieve word-error-rates (WERs) 1.4%/2.6% on the LibriSpeech test/test-other sets against the current state-of-the-art WERs 1.7%/3.3%.

Submitted 20 July, 2022; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: 11 pages, 3 figures, 5 tables. Accepted to NeurIPS SAS 2020 Workshop; v2: minor errors corrected

arXiv:2007.01950 [pdf]

doi 10.1002/mrm.28415

Ultra-high spatial resolution BOLD fMRI in humans using combined segmented-accelerated VFA-FLEET with a recursive RF pulse design

Authors: Avery J. L. Berman, William A. Grissom, Thomas Witzel, Shahin Nasr, Daniel J. Park, Kawin Setsompop, Jonathan R. Polimeni

Abstract: Purpose: To alleviate the spatial encoding limitations of single-shot EPI by developing multi-shot segmented EPI for ultra-high-resolution fMRI with reduced ghosting artifacts from subject motion and respiration. Methods: Segmented EPI can reduce readout duration and reduce acceleration factors, however, the time elapsed between segment acquisitions (on the order of seconds) can result in intermittent ghosting, limiting its use for fMRI. Here, "FLEET" segment ordering--where segments are looped over before slices--was combined with a variable flip angle progression (VFA-FLEET) to improve inter-segment fidelity and maximize signal for fMRI. Scaling a sinc pulse's flip angle for each segment (VFA-FLEET-Sinc) produced inconsistent slice profiles and ghosting, therefore, a recursive Shinnar-Le Roux (SLR) RF pulse design was developed (VFA-FLEET-SLR) to generate unique pulses for every segment that together produce consistent slice profiles and signals. Results: The temporal stability of VFA-FLEET-SLR was compared against conventional-segmented EPI and VFA-FLEET-Sinc at 3 T and 7 T. VFA-FLEET-SLR showed reductions in both intermittent and stable ghosting compared to conventional-segmented and VFA-FLEET-Sinc, resulting in improved image quality with a minor trade-off in temporal SNR. Combining VFA-FLEET-SLR with acceleration, we achieved a 0.6-mm isotropic acquisition at 7 T--without zoomed imaging or partial Fourier--demonstrating reliable detection of BOLD responses to a visual stimulus. To counteract the increased repetition time from segmentation, simultaneous multi-slice VFA-FLEET-SLR was demonstrated using RF-encoded controlled aliasing. Conclusions: VFA-FLEET with a recursive RF pulse design supports acquisitions with low levels of artifact and spatial blur, enabling fMRI at previously inaccessible spatial resolutions with a "full-brain" field of view.

Submitted 3 July, 2020; originally announced July 2020.

Comments: 51 pages (including supplement), 8 main figures, 6 supporting figures. For supporting videos (8), please visit https://github.com/aveberman/vfa-fleet. Note: this work has been accepted for publication at Magnetic Resonance in Medicine

arXiv:2005.09629 [pdf, other]

doi 10.21437/Interspeech.2020-1470

Improved Noisy Student Training for Automatic Speech Recognition

Authors: Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, Quoc V. Le

Abstract: Recently, a semi-supervised learning method known as "noisy student training" has been shown to improve image classification performance of deep networks significantly. Noisy student training is an iterative self-training method that leverages augmentation to improve network performance. In this work, we adapt and improve noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method. We find effective methods to filter, balance and augment the data generated in between self-training iterations. By doing so, we are able to obtain word error rates (WERs) 4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h subset of LibriSpeech as the supervised set and the rest (860h) as the unlabeled set. Furthermore, we are able to achieve WERs 1.7%/3.4% on the clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h (4.74%/12.20%) and LibriSpeech (1.9%/4.1%).

Submitted 29 October, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

Comments: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference added

Journal ref: Proc. Interspeech 2020, 2817-2821

arXiv:2002.09847 [pdf, other]

Unsupervised Denoising for Satellite Imagery using Wavelet Subband CycleGAN

Authors: Joonyoung Song, Jae-Heon Jeong, Dae-Soon Park, Hyun-Ho Kim, Doo-Chun Seo, Jong Chul Ye

Abstract: Multi-spectral satellite imaging sensors acquire various spectral band images such as red (R), green (G), blue (B), near-infrared (N), etc. Thanks to the unique spectroscopic property of each spectral band with respect to the objects on the ground, multi-spectral satellite imagery can be used for various geological survey applications. Unfortunately, image artifacts from imaging sensor noise often affect the quality of scenes and have negative impacts on the applications of satellite imagery. Recently, deep learning approaches have been extensively explored for the removal of noise in satellite imagery. Most deep learning denoising methods, however, follow a supervised learning scheme, which requires matched noisy image and clean image pairs that are difficult to collect in real situations. In this paper, we propose a novel unsupervised multispectral denoising method for satellite imagery using wavelet subband cycle-consistent adversarial network (WavCycleGAN). The proposed method is based on an unsupervised learning scheme using adversarial loss and cycle-consistency loss to overcome the lack of paired data. Moreover, in contrast to the standard image domain cycleGAN, we introduce a wavelet subband domain learning scheme for effective denoising without sacrificing high frequency components such as edges and detail information. Experimental results for the removal of vertical stripe and wave noise in satellite imaging sensors demonstrate that the proposed method effectively removes noise and preserves important high frequency features of satellite images.

Submitted 23 February, 2020; originally announced February 2020.

arXiv:1912.05533 [pdf, ps, other]

SpecAugment on Large Scale Datasets

Authors: Daniel S. Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, Quoc V. Le, Yonghui Wu

Abstract: Recently, SpecAugment, an augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances, has been shown to be highly effective in enhancing the performance of end-to-end networks on public datasets. In this paper, we demonstrate its effectiveness on tasks with large scale datasets by investigating its application to the Google Multidomain Dataset (Narayanan et al., 2018). We achieve improvement across all test domains by mixing raw training data augmented with SpecAugment and noise-perturbed training data when training the acoustic model. We also introduce a modification of SpecAugment that adapts the time mask size and/or multiplicity depending on the length of the utterance, which can potentially benefit large scale tasks. By using adaptive masking, we are able to further improve the performance of the Listen, Attend and Spell model on LibriSpeech to 2.2% WER on test-clean and 5.2% WER on test-other.

Submitted 11 December, 2019; originally announced December 2019.

Comments: 5 pages, 3 tables; submitted to ICASSP 2020

arXiv:1911.07410 [pdf, other]

Multi-Temporal Recurrent Neural Networks For Progressive Non-Uniform Single Image Deblurring With Incremental Temporal Training

Authors: Dongwon Park, Dong Un Kang, Jisoo Kim, Se Young Chun

Abstract: Multi-scale (MS) approaches have been widely investigated for blind single image / video deblurring that sequentially recovers deblurred images in low spatial scale first and then in high spatial scale later with the output of lower scales. MS approaches have been effective especially for severe blurs induced by large motions in high spatial scale since those can be seen as small blurs in low spatial scale. In this work, we investigate an alternative approach to MS, called the multi-temporal (MT) approach, for non-uniform single image deblurring. We propose incremental temporal training with a constructed MT level dataset from a time-resolved dataset, develop novel MT-RNNs with recurrent feature maps, and investigate progressive single image deblurring over iterations. Our proposed MT methods outperform state-of-the-art MS methods on the GoPro dataset in PSNR with the smallest number of parameters.

Submitted 17 November, 2019; originally announced November 2019.

Comments: 10 pages, 8 figures, 6 tables, work in progress

arXiv:1910.14211 [pdf]

Accelerated spin-echo fMRI using Multisection Excitation by Simultaneous Spin-echo Interleaving (MESSI) with complex-encoded generalized SLIce Dithered Enhanced Resolution (cgSlider) Simultaneous Multi-Slice Echo-Planar Imaging

Authors: SoHyun Han, Congyu Liao, Mary Kate Manhard, Daniel Joseph Park, Berkin Bilgic, Merlin J. Fair, Fuyixue Wang, Anna I. Blazejewska, William A. Grissom, Jonathan R. Polimeni, Kawin Setsompop

Abstract: Spin-echo functional MRI (SE-fMRI) has the potential to improve spatial specificity when compared to gradient-echo fMRI. However, high spatiotemporal resolution SE-fMRI with large slice-coverage is challenging as SE-fMRI requires a long echo time (TE) to generate blood oxygenation level-dependent (BOLD) contrast, leading to long repetition times (TR). The aim of this work is to develop an acquisition method that enhances the slice-coverage of SE-fMRI at high spatiotemporal resolution. An acquisition scheme was developed entitled Multisection Excitation by Simultaneous Spin-echo Interleaving (MESSI) with complex-encoded generalized SLIce Dithered Enhanced Resolution (cgSlider). MESSI utilizes the dead-time during the long TE by interleaving the excitation and readout of two slices to enable 2x slice-acceleration, while cgSlider utilizes the stable temporal background phase in SE-fMRI to encode and decode two adjacent slices simultaneously with a phase-constrained reconstruction method. The proposed cgSlider-MESSI was also combined with Simultaneous Multi-Slice (SMS) to achieve further slice-acceleration. This combined approach was used to achieve 1.5mm isotropic whole-brain SE-fMRI with a temporal resolution of 1.5s and was evaluated using sensory stimulation and breath-hold tasks at 3T. Compared to conventional SE-SMS, cgSlider-MESSI-SMS provides a four-fold increase in slice-coverage for the same TR, with comparable temporal signal-to-noise ratio. Corresponding fMRI activations from cgSlider-MESSI-SMS for both fMRI tasks were consistent with those from conventional SE-SMS. Overall, cgSlider-MESSI-SMS achieved a 32x encoding-acceleration by combining R_inplane x MB x cgSlider x MESSI = 4 x 2 x 2 x 2. High-quality, high-resolution whole-brain SE-fMRI was acquired at a short TR using cgSlider-MESSI-SMS.

Submitted 30 October, 2019; originally announced October 2019.

Comments: 38 pages, 9 figures, ISMRM2019 #1165

arXiv:1909.11915 [pdf]

Unsupervised Image Translation using Adversarial Networks for Improved Plant Disease Recognition

Authors: Haseeb Nazki, Sook Yoon, Alvaro Fuentes, Dong Sun Park

Abstract: Acquisition of data in task-specific applications of machine learning like plant disease recognition is a costly endeavor owing to the requirements of professional human diligence and time constraints. In this paper, we present a simple pipeline that uses GANs in an unsupervised image translation environment to improve learning with respect to the data distribution in a plant disease dataset, reducing the partiality introduced by acute class imbalance and hence shifting the classification decision boundary towards better performance. The empirical analysis of our method is demonstrated on a limited dataset of 2789 tomato plant disease images, highly corrupted with an imbalance in the 9 disease categories. First, we extend the state of the art for the GAN-based image-to-image translation method by enhancing the perceptual quality of the generated images and preserving the semantics. We introduce AR-GAN, where in addition to the adversarial loss, our synthetic image generator optimizes on Activation Reconstruction loss (ARL) function that optimizes feature activations against the natural image. We present visually more compelling synthetic images in comparison to most prominent existing models and evaluate the performance of our GAN framework in terms of various datasets and metrics. Second, we evaluate the performance of a baseline convolutional neural network classifier for improved recognition using the resulting synthetic samples to augment our training set and compare it with the classical data augmentation scheme. We observe a significant improvement in classification accuracy (+5.2%) using generated synthetic samples as compared to (+0.8%) increase using classic augmentation in an equal class distribution environment.

Submitted 26 September, 2019; originally announced September 2019.

Comments: 20 pages, 11 figures, 3 tables, article under review

arXiv:1907.06834 [pdf, other]

Noise Removal of FTIR Hyperspectral Images via MMSE

Authors: Chang Sik Lee, Hyeong Geun Yu, Dong Jo Park, Dong Eui Chang, Hyunwoo Nam, Byeong Hwang Park

Abstract: Fourier transform infrared (FTIR) hyperspectral imaging systems are deployed in various fields where spectral information is exploited. Chemical warfare agent (CWA) detection is one such field, and it requires a fast and accurate process from the measurement to the visualization of detection results, including noise removal. A general concern of existing noise removal algorithms is a trade-off between time and performance. This paper suggests a minimum mean square error (MMSE) approach as an efficient noise removal algorithm for FTIR hyperspectral images. The experimental result shows that the MMSE estimator takes less time to achieve performance comparable to the existing algorithms.

Submitted 29 December, 2019; v1 submitted 16 July, 2019; originally announced July 2019.

arXiv:1904.08779 [pdf, other]

doi 10.21437/Interspeech.2019-2680

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Authors: Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le

Abstract: We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.

Submitted 3 December, 2019; v1 submitted 18 April, 2019; originally announced April 2019.

Comments: 5 pages, 3 figures, 6 tables; v3: references added

Journal ref: Proc. Interspeech 2019, 2613-2617

arXiv:1902.06562 [pdf, other]

doi 10.1016/j.bspc.2020.102037

Intra- and Inter-epoch Temporal Context Network (IITNet) Using Sub-epoch Features for Automatic Sleep Scoring on Raw Single-channel EEG

Authors: Hogeon Seo, Seunghyeok Back, Seongju Lee, Deokhwan Park, Tae Kim, Kyoobin Lee

Abstract: A deep learning model, named IITNet, is proposed to learn intra- and inter-epoch temporal contexts from raw single-channel EEG for automatic sleep scoring. To classify the sleep stage from half-minute EEG, called an epoch, sleep experts investigate sleep-related events and consider the transition rules between the found events. Similarly, IITNet extracts representative features at a sub-epoch level by a residual neural network and captures intra- and inter-epoch temporal contexts from the sequence of the features via bidirectional LSTM. The performance was investigated for three datasets as the sequence length (L) increased from one to ten. IITNet achieved performance comparable with other state-of-the-art results. The best accuracy, MF1, and Cohen's kappa ($κ$) were 83.9%, 77.6%, 0.78 for SleepEDF (L=10), 86.5%, 80.7%, 0.80 for MASS (L=9), and 86.7%, 79.8%, 0.81 for SHHS (L=10), respectively. Even when using only four epochs, the performance was still comparable. Compared to using a single epoch, on average, accuracy and MF1 increased by 2.48%p and 4.90%p and F1 of N1, N2, and REM increased by 16.1%p, 1.50%p, and 6.42%p, respectively. Above four epochs, the performance improvement was not significant. The results support that considering the latest two-minute raw single-channel EEG can be a reasonable choice for sleep scoring via deep neural networks with efficiency and reliability. Furthermore, the experiments with the baselines showed that introducing intra-epoch temporal context learning with a deep residual network contributes to the improvement in the overall performance and has a positive synergy effect with the inter-epoch temporal context learning.

Submitted 10 June, 2020; v1 submitted 18 February, 2019; originally announced February 2019.

Comments: First three authors contributed equally to this work; Accepted manuscript for Biomedical Signal Processing and Control (BSPC); 12 pages, 6 figures;

Showing 1–35 of 35 results for author: Park, D