Search | arXiv e-print repository

On the Fundamental Tradeoff of Joint Communication and Quickest Change Detection

Abstract: In this work, we take the initiative in studying the fundamental tradeoff between communication and quickest change detection (QCD) under an integrated sensing and communication setting. We formally establish a joint communication and sensing problem for quickest change detection. Then, by utilizing constant subblock-composition codes and a modified QuSum detection rule, which we call subblock QuS… ▽ More In this work, we take the initiative in studying the fundamental tradeoff between communication and quickest change detection (QCD) under an integrated sensing and communication setting. We formally establish a joint communication and sensing problem for quickest change detection. Then, by utilizing constant subblock-composition codes and a modified QuSum detection rule, which we call subblock QuSum (SQS), we provide an inner bound on the fundamental tradeoff between communication rate and change point detection delay in the asymptotic regime of vanishing false alarm rate. We further provide a partial converse that matches our inner bound for a certain class of codes. This implies that the SQS detection strategy is asymptotically optimal for our codes as the false alarm rate constraint vanishes. We also present some canonical examples of the tradeoff region for a binary channel, a scalar Gaussian channel, and a MIMO Gaussian channel. △ Less

Submitted 23 January, 2024; originally announced January 2024.

arXiv:2312.07826 [pdf]

Integrated Path Tracking with DYC and MPC using LSTM Based Tire Force Estimator for Four-wheel Independent Steering and Driving Vehicle

Authors: Sung** Lim, Bilal Sadiq, Yongsik **, Sangho Lee, Gyeungho Choi, Kanghyun Nam, Yongseob Lim

Abstract: Active collision avoidance system plays a crucial role in ensuring the lateral safety of autonomous vehicles, and it is primarily related to path planning and tracking control algorithms. In particular, the direct yaw-moment control (DYC) system can significantly improve the lateral stability of a vehicle in environments with sudden changes in road conditions. In order to apply the DYC algorithm,… ▽ More Active collision avoidance system plays a crucial role in ensuring the lateral safety of autonomous vehicles, and it is primarily related to path planning and tracking control algorithms. In particular, the direct yaw-moment control (DYC) system can significantly improve the lateral stability of a vehicle in environments with sudden changes in road conditions. In order to apply the DYC algorithm, it is very important to accurately consider the properties of tire forces with complex nonlinearity for control to ensure the lateral stability of the vehicle. In this study, longitudinal and lateral tire forces for safety path tracking were simultaneously estimated using a long short-term memory (LSTM) neural network based estimator. Furthermore, to improve path tracking performance in case of sudden changes in road conditions, a system has been developed by combining 4-wheel independent steering (4WIS) model predictive control (MPC) and 4-wheel independent drive (4WID) direct yaw-moment control (DYC). The estimation performance of the extended Kalman filter (EKF), which are commonly used for tire force estimation, was compared. In addition, the estimated longitudinal and lateral tire forces of each wheel were applied to the proposed system, and system verification was performed through simulation using a vehicle dynamics simulator. Consequently, the proposed method, the integrated path tracking algorithm with DYC and MPC using the LSTM based estimator, was validated to significantly improve the vehicle stability in suddenly changing road conditions. △ Less

Submitted 12 December, 2023; originally announced December 2023.

arXiv:2312.05528 [pdf, other]

Exploring 3D U-Net Training Configurations and Post-Processing Strategies for the MICCAI 2023 Kidney and Tumor Segmentation Challenge

Authors: Kwang-Hyun Uhm, Hyunjun Cho, Zhixin Xu, Seohoon Lim, Seung-Won Jung, Sung-Hoo Hong, Sung-Jea Ko

Abstract: In 2023, it is estimated that 81,800 kidney cancer cases will be newly diagnosed, and 14,890 people will die from this cancer in the United States. Preoperative dynamic contrast-enhanced abdominal computed tomography (CT) is often used for detecting lesions. However, there exists inter-observer variability due to subtle differences in the imaging features of kidney and kidney tumors. In this paper… ▽ More In 2023, it is estimated that 81,800 kidney cancer cases will be newly diagnosed, and 14,890 people will die from this cancer in the United States. Preoperative dynamic contrast-enhanced abdominal computed tomography (CT) is often used for detecting lesions. However, there exists inter-observer variability due to subtle differences in the imaging features of kidney and kidney tumors. In this paper, we explore various 3D U-Net training configurations and effective post-processing strategies for accurate segmentation of kidneys, cysts, and kidney tumors in CT images. We validated our model on the dataset of the 2023 Kidney and Kidney Tumor Segmentation (KiTS23) challenge. Our method took second place in the final ranking of the KiTS23 challenge on unseen test data with an average Dice score of 0.820 and an average Surface Dice of 0.712. △ Less

Submitted 9 December, 2023; originally announced December 2023.

Comments: MICCAI 2023, KITS 2023 challenge 2nd place

arXiv:2304.00471 [pdf, other]

A Unified Compression Framework for Efficient Speech-Driven Talking-Face Generation

Authors: Bo-Kyeong Kim, Jaemin Kang, Daeun Seo, Hancheol Park, Shinkook Choi, Hyoung-Kyu Song, Hyungshin Kim, Sungsu Lim

Abstract: Virtual humans have gained considerable attention in numerous industries, e.g., entertainment and e-commerce. As a core technology, synthesizing photorealistic face frames from target speech and facial identity has been actively studied with generative adversarial networks. Despite remarkable results of modern talking-face generation models, they often entail high computational burdens, which limi… ▽ More Virtual humans have gained considerable attention in numerous industries, e.g., entertainment and e-commerce. As a core technology, synthesizing photorealistic face frames from target speech and facial identity has been actively studied with generative adversarial networks. Despite remarkable results of modern talking-face generation models, they often entail high computational burdens, which limit their efficient deployment. This study aims to develop a lightweight model for speech-driven talking-face synthesis. We build a compact generator by removing the residual blocks and reducing the channel width from Wav2Lip, a popular talking-face generator. We also present a knowledge distillation scheme to stably yet effectively train the small-capacity generator without adversarial learning. We reduce the number of parameters and MACs by 28$\times$ while retaining the performance of the original model. Moreover, to alleviate a severe performance drop when converting the whole generator to INT8 precision, we adopt a selective quantization method that uses FP16 for the quantization-sensitive layers and INT8 for the other layers. Using this mixed precision, we achieve up to a 19$\times$ speedup on edge GPUs without noticeably compromising the generation quality. △ Less

Submitted 28 April, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

Comments: MLSys Workshop on On-Device Intelligence, 2023; Demo: https://huggingface.co/spaces/nota-ai/compressed_wav2lip

arXiv:2303.00795 [pdf, other]

Improved Segmentation of Deep Sulci in Cortical Gray Matter Using a Deep Learning Framework Incorporating Laplace's Equation

Authors: Sadhana Ravikumar, Ranjit Ittyerah, Sydney Lim, Long Xie, Sandhitsu Das, Pulkit Khandelwal, Laura E. M. Wisse, Madigan L. Bedard, John L. Robinson, Terry Schuck, Murray Grossman, John Q. Trojanowski, Edward B. Lee, M. Dylan Tisdall, Karthik Prabhakaran, John A. Detre, David J. Irwin, Winifred Trotman, Gabor Mizsei, Emilio Artacho-Pérula, Maria Mercedes Iñiguez de Onzono Martin, Maria del Mar Arroyo Jiménez, Monica Muñoz, Francisco Javier Molina Romero, Maria del Pilar Marcos Rabal , et al. (7 additional authors not shown)

Abstract: When develo** tools for automated cortical segmentation, the ability to produce topologically correct segmentations is important in order to compute geometrically valid morphometry measures. In practice, accurate cortical segmentation is challenged by image artifacts and the highly convoluted anatomy of the cortex itself. To address this, we propose a novel deep learning-based cortical segmentat… ▽ More When develo** tools for automated cortical segmentation, the ability to produce topologically correct segmentations is important in order to compute geometrically valid morphometry measures. In practice, accurate cortical segmentation is challenged by image artifacts and the highly convoluted anatomy of the cortex itself. To address this, we propose a novel deep learning-based cortical segmentation method in which prior knowledge about the geometry of the cortex is incorporated into the network during the training process. We design a loss function which uses the theory of Laplace's equation applied to the cortex to locally penalize unresolved boundaries between tightly folded sulci. Using an ex vivo MRI dataset of human medial temporal lobe specimens, we demonstrate that our approach outperforms baseline segmentation networks, both quantitatively and qualitatively. △ Less

Submitted 3 March, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

Comments: Accepted at the 28th biennial international conference on Information Processing in Medical Imaging (IPMI 2023)

arXiv:2206.07651 [pdf]

Fault Diagnosis of Inter-turn Short Circuit in Permanent Magnet Synchronous Motors with Current Signal Imaging and Unsupervised Learning

Authors: W. Jung, S. H. Yun, Y. S. Lim, S. Cheong, J. Bae, Y. H. Park

Abstract: This paper proposes machine-independent feature engineering for winding inter-turn short circuit fault that uses electrical current signals. Electrical current signal collected from permanent magnet synchronous motor (PMSM) is subjected to different environmental and operational conditions. To solve these problems, robust current signal imaging method and deep learning-based feature extraction met… ▽ More This paper proposes machine-independent feature engineering for winding inter-turn short circuit fault that uses electrical current signals. Electrical current signal collected from permanent magnet synchronous motor (PMSM) is subjected to different environmental and operational conditions. To solve these problems, robust current signal imaging method and deep learning-based feature extraction method are developed. The overall procedure includes the following three key steps: (1) transformation of a time-series current signal to two-dimensional image, (2) extracting features using convolutional neural networks, and (3) calculating a health indicator using Mahalanobis distance. Transformation of the time-series signal is based on recurrence plots (RP). The proposed RP method develops from feature engineering that provides the dominant fault feature representations in a robust way. The proposed RP is designed that maximizes the features of inter-turn short fault and minimizes the effect of noise from systems with various capacities. To demonstrate the validity of the proposed method, two case studies are conducted using an artificial fault seeded testbed with two different capacities of motor. By calculating the feature using only the electrical current signal of the motor without the parameters related to the capacity of the motor, the proposed feature can be applied to motors with different capacities while maintaining the same performance. △ Less

Submitted 9 June, 2022; originally announced June 2022.

Comments: submitted to IECON 2022

arXiv:2206.07515 [pdf]

A Deep Learning Network for the Classification of Intracardiac Electrograms in Atrial Tachycardia

Authors: Zerui Chen, Sonia Xhyn Teo, Andrie Ochtman, Shier Nee Saw, Nicholas Cheng, Eric Tien Siang Lim, Murphy Lyu, Hwee Kuan Lee

Abstract: A key technology enabling the success of catheter ablation treatment for atrial tachycardia is activation map**, which relies on manual local activation time (LAT) annotation of all acquired intracardiac electrogram (EGM) signals. This is a time-consuming and error-prone procedure, due to the difficulty in identifying the signal activation peaks for fractionated signals. This work presents a Dee… ▽ More A key technology enabling the success of catheter ablation treatment for atrial tachycardia is activation map**, which relies on manual local activation time (LAT) annotation of all acquired intracardiac electrogram (EGM) signals. This is a time-consuming and error-prone procedure, due to the difficulty in identifying the signal activation peaks for fractionated signals. This work presents a Deep Learning approach for the automated classification of EGM signals into three different types: normal, abnormal, and unclassified, which forms part of the LAT annotation pipeline, and contributes towards bypassing the need for manual annotations of the LAT. The Deep Learning network, the CNN-LSTM model, is a hybrid network architecture which combines convolutional neural network (CNN) layers with long short-term memory (LSTM) layers. 1452 EGM signals from a total of 9 patients undergoing clinically-indicated 3D cardiac map** were used for the training, validation and testing of our models. From our findings, the CNN-LSTM model achieved an accuracy of 81% for the balanced dataset. For comparison, we separately developed a rule-based Decision Trees model which attained an accuracy of 67% for the same balanced dataset. Our work elucidates that analysing the EGM signals using a set of explicitly specified rules as proposed by the Decision Trees model is not suitable as EGM signals are complex. The CNN-LSTM model, on the other hand, has the ability to learn the complex, intrinsic features within the signals and identify useful features to differentiate the EGM signals. △ Less

Submitted 2 June, 2022; originally announced June 2022.

Comments: 34 pages, 10 figures

ACM Class: J.3

arXiv:2203.13072 [pdf, other]

Multitask Emotion Recognition Model with Knowledge Distillation and Task Discriminator

Authors: Euiseok Jeong, Geesung Oh, Sejoon Lim

Abstract: Due to the collection of big data and the development of deep learning, research to predict human emotions in the wild is being actively conducted. We designed a multi-task model using ABAW dataset to predict valence-arousal, expression, and action unit through audio data and face images at in real world. We trained model from the incomplete label by applying the knowledge distillation technique.… ▽ More Due to the collection of big data and the development of deep learning, research to predict human emotions in the wild is being actively conducted. We designed a multi-task model using ABAW dataset to predict valence-arousal, expression, and action unit through audio data and face images at in real world. We trained model from the incomplete label by applying the knowledge distillation technique. The teacher model was trained as a supervised learning method, and the student model was trained by using the output of the teacher model as a soft label. As a result we achieved 2.40 in Multi Task Learning task validation dataset. △ Less

Submitted 24 March, 2022; originally announced March 2022.

arXiv:2112.02164 [pdf, other]

doi 10.1002/mp.15777

Bridging the gap between prostate radiology and pathology through machine learning

Authors: Indrani Bhattacharya, David S. Lim, Han Lin Aung, Xingchen Liu, Arun Seetharaman, Christian A. Kunder, Wei Shao, Simon J. C. Soerensen, Richard E. Fan, Pejman Ghanouni, Katherine J. To'o, James D. Brooks, Geoffrey A. Sonn, Mirabela Rusu

Abstract: Prostate cancer is the second deadliest cancer for American men. While Magnetic Resonance Imaging (MRI) is increasingly used to guide targeted biopsies for prostate cancer diagnosis, its utility remains limited due to high rates of false positives and false negatives as well as low inter-reader agreements. Machine learning methods to detect and localize cancer on prostate MRI can help standardize… ▽ More Prostate cancer is the second deadliest cancer for American men. While Magnetic Resonance Imaging (MRI) is increasingly used to guide targeted biopsies for prostate cancer diagnosis, its utility remains limited due to high rates of false positives and false negatives as well as low inter-reader agreements. Machine learning methods to detect and localize cancer on prostate MRI can help standardize radiologist interpretations. However, existing machine learning methods vary not only in model architecture, but also in the ground truth labeling strategies used for model training. In this study, we compare different labeling strategies, namely, pathology-confirmed radiologist labels, pathologist labels on whole-mount histopathology images, and lesion-level and pixel-level digital pathologist labels (previously validated deep learning algorithm on histopathology images to predict pixel-level Gleason patterns) on whole-mount histopathology images. We analyse the effects these labels have on the performance of the trained machine learning models. Our experiments show that (1) radiologist labels and models trained with them can miss cancers, or underestimate cancer extent, (2) digital pathologist labels and models trained with them have high concordance with pathologist labels, and (3) models trained with digital pathologist labels achieve the best performance in prostate cancer detection in two different cohorts with different disease distributions, irrespective of the model architecture used. Digital pathologist labels can reduce challenges associated with human annotations, including labor, time, inter- and intra-reader variability, and can help bridge the gap between prostate radiology and pathology by enabling the training of reliable machine learning models to detect and localize prostate cancer on MRI. △ Less

Submitted 3 December, 2021; originally announced December 2021.

Comments: Indrani Bhattacharya and David S. Lim contributed equally as first authors. Geoffrey A. Sonn and Mirabela Rusu contributed equally as senior authors

arXiv:2110.13903 [pdf, other]

NeRV: Neural Representations for Videos

Authors: Hao Chen, Bo He, Hanyu Wang, Yixuan Ren, Ser-Nam Lim, Abhinav Shrivastava

Abstract: We propose a novel neural representation for videos (NeRV) which encodes videos in neural networks. Unlike conventional representations that treat videos as frame sequences, we represent videos as neural networks taking frame index as input. Given a frame index, NeRV outputs the corresponding RGB image. Video encoding in NeRV is simply fitting a neural network to video frames and decoding process… ▽ More We propose a novel neural representation for videos (NeRV) which encodes videos in neural networks. Unlike conventional representations that treat videos as frame sequences, we represent videos as neural networks taking frame index as input. Given a frame index, NeRV outputs the corresponding RGB image. Video encoding in NeRV is simply fitting a neural network to video frames and decoding process is a simple feedforward operation. As an image-wise implicit representation, NeRV output the whole image and shows great efficiency compared to pixel-wise implicit representation, improving the encoding speed by 25x to 70x, the decoding speed by 38x to 132x, while achieving better video quality. With such a representation, we can treat videos as neural networks, simplifying several video-related tasks. For example, conventional video compression methods are restricted by a long and complex pipeline, specifically designed for the task. In contrast, with NeRV, we can use any neural network compression method as a proxy for video compression, and achieve comparable performance to traditional frame-based video compression approaches (H.264, HEVC \etc). Besides compression, we demonstrate the generalization of NeRV for video denoising. The source code and pre-trained model can be found at https://github.com/haochen-rye/NeRV.git. △ Less

Submitted 26 October, 2021; originally announced October 2021.

Comments: To appear at NeurIPS 2021

arXiv:2110.07711 [pdf, other]

Gray Matter Segmentation in Ultra High Resolution 7 Tesla ex vivo T2w MRI of Human Brain Hemispheres

Authors: Pulkit Khandelwal, Shokufeh Sadaghiani, Michael Tran Duong, Sadhana Ravikumar, Sydney Lim, Sanaz Arezoumandan, Claire Peterson, Eunice Chung, Madigan Bedard, Noah Capp, Ranjit Ittyerah, Elyse Migdal, Grace Choi, Emily Kopp, Bridget Loja, Eusha Hasan, Jiacheng Li, Karthik Prabhakaran, Gabor Mizsei, Marianna Gabrielyan, Theresa Schuck, John Robinson, Daniel Ohm, Edward Lee, John Q. Trojanowski , et al. (8 additional authors not shown)

Abstract: Ex vivo MRI of the brain provides remarkable advantages over in vivo MRI for visualizing and characterizing detailed neuroanatomy. However, automated cortical segmentation methods in ex vivo MRI are not well developed, primarily due to limited availability of labeled datasets, and heterogeneity in scanner hardware and acquisition protocols. In this work, we present a high resolution 7 Tesla datase… ▽ More Ex vivo MRI of the brain provides remarkable advantages over in vivo MRI for visualizing and characterizing detailed neuroanatomy. However, automated cortical segmentation methods in ex vivo MRI are not well developed, primarily due to limited availability of labeled datasets, and heterogeneity in scanner hardware and acquisition protocols. In this work, we present a high resolution 7 Tesla dataset of 32 ex vivo human brain specimens. We benchmark the cortical mantle segmentation performance of nine neural network architectures, trained and evaluated using manually-segmented 3D patches sampled from specific cortical regions, and show excellent generalizing capabilities across whole brain hemispheres in different specimens, and also on unseen images acquired at different magnetic field strength and imaging sequences. Finally, we provide cortical thickness measurements across key regions in 3D ex vivo human brain images. Our code and processed datasets are publicly available at https://github.com/Pulkit-Khandelwal/picsl-ex-vivo-segmentation. △ Less

Submitted 3 March, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

Comments: Ex vivo analysis framework (work in progress 2022 at the University of Pennsylvania)

arXiv:2106.10432 [pdf]

doi 10.33093/jetap.2021.3.1

Neural Network Facial Authentication for Public Electric Vehicle Charging Station

Authors: Muhamad Amin Husni Abdul Haris, Sin Liang Lim

Abstract: This study is to investigate and compare the facial recognition accuracy performance of Dlib ResNet against a K-Nearest Neighbour (KNN) classifier. Particularly when used against a dataset from an Asian ethnicity as Dlib ResNet was reported to have an accuracy deficiency when it comes to Asian faces. The comparisons are both implemented on the facial vectors extracted using the Histogram of Orient… ▽ More This study is to investigate and compare the facial recognition accuracy performance of Dlib ResNet against a K-Nearest Neighbour (KNN) classifier. Particularly when used against a dataset from an Asian ethnicity as Dlib ResNet was reported to have an accuracy deficiency when it comes to Asian faces. The comparisons are both implemented on the facial vectors extracted using the Histogram of Oriented Gradients (HOG) method and use the same dataset for a fair comparison. Authentication of a user by facial recognition in an electric vehicle (EV) charging station demonstrates a practical use case for such an authentication system. △ Less

Submitted 19 June, 2021; originally announced June 2021.

Journal ref: JETAP Vol.3 No.1 (2021) 17-21

arXiv:2103.10892 [pdf]

Deep Label Fusion: A 3D End-to-End Hybrid Multi-Atlas Segmentation and Deep Learning Pipeline

Authors: Long Xie, Laura E. M. Wisse, Jiancong Wang, Sadhana Ravikumar, Trevor Glenn, Anica Luther, Sydney Lim, David A. Wolk, Paul A. Yushkevich

Abstract: Deep learning (DL) is the state-of-the-art methodology in various medical image segmentation tasks. However, it requires relatively large amounts of manually labeled training data, which may be infeasible to generate in some applications. In addition, DL methods have relatively poor generalizability to out-of-sample data. Multi-atlas segmentation (MAS), on the other hand, has promising performance… ▽ More Deep learning (DL) is the state-of-the-art methodology in various medical image segmentation tasks. However, it requires relatively large amounts of manually labeled training data, which may be infeasible to generate in some applications. In addition, DL methods have relatively poor generalizability to out-of-sample data. Multi-atlas segmentation (MAS), on the other hand, has promising performance using limited amounts of training data and good generalizability. A hybrid method that integrates the high accuracy of DL and good generalizability of MAS is highly desired and could play an important role in segmentation problems where manually labeled data is hard to generate. Most of the prior work focuses on improving single components of MAS using DL rather than directly optimizing the final segmentation accuracy via an end-to-end pipeline. Only one study explored this idea in binary segmentation of 2D images, but it remains unknown whether it generalizes well to multi-class 3D segmentation problems. In this study, we propose a 3D end-to-end hybrid pipeline, named deep label fusion (DLF), that takes advantage of the strengths of MAS and DL. Experimental results demonstrate that DLF yields significant improvements over conventional label fusion methods and U-Net, a direct DL approach, in the context of segmenting medial temporal lobe subregions using 3T T1-weighted and T2-weighted MRI. Further, when applied to an unseen similar dataset acquired in 7T, DLF maintains its superior performance, which demonstrates its good generalizability. △ Less

Submitted 19 March, 2021; originally announced March 2021.

Comments: 12 pages paper accepted by the international conference of Information Processing in Medical Imaging (IPMI) 2021

arXiv:2103.03054 [pdf, other]

doi 10.1109/ACCESS.2022.3226784

An Open-Source Low-Cost Mobile Robot System with an RGB-D Camera and Efficient Real-Time Navigation Algorithm

Authors: Taekyung Kim, Seunghyun Lim, Gwanjun Shin, Geonhee Sim, Dongwon Yun

Abstract: Currently, mobile robots are develo** rapidly and are finding numerous applications in the industry. However, several problems remain related to their practical use, such as the need for expensive hardware and high power consumption levels. In this study, we build a low-cost indoor mobile robot platform that does not include a LiDAR or a GPU. Then, we design an autonomous navigation architecture… ▽ More Currently, mobile robots are develo** rapidly and are finding numerous applications in the industry. However, several problems remain related to their practical use, such as the need for expensive hardware and high power consumption levels. In this study, we build a low-cost indoor mobile robot platform that does not include a LiDAR or a GPU. Then, we design an autonomous navigation architecture that guarantees real-time performance on our platform with an RGB-D camera and a low-end off-the-shelf single board computer. The overall system includes SLAM, global path planning, ground segmentation, and motion planning. The proposed ground segmentation approach extracts a traversability map from raw depth images for the safe driving of low-body mobile robots. We apply both rule-based and learning-based navigation policies using the traversability map. Running sensor data processing and other autonomous driving components simultaneously, our navigation policies perform rapidly at a refresh rate of 18 Hz for control command, whereas other systems have slower refresh rates. Our methods show better performances than current state-of-the-art navigation approaches within limited computation resources as shown in 3D simulation tests. In addition, we demonstrate the applicability of our mobile robot system through successful autonomous driving in an indoor environment. △ Less

Submitted 13 December, 2022; v1 submitted 4 March, 2021; originally announced March 2021.

Comments: Accepted to IEEE Access 2022. Project Github: https://github.com/shinkansan/2019-UGRP-DPoom Video: https://youtu.be/Li3-RlO28lk

Journal ref: IEEE Access, vol. 10, pp. 127871-127881, 2022

arXiv:2102.11906 [pdf, other]

Handling Background Noise in Neural Speech Generation

Authors: Tom Denton, Alejandro Luebs, Felicia S. C. Lim, Andrew Storus, Hengchin Yeh, W. Bastiaan Kleijn, Jan Skoglund

Abstract: Recent advances in neural-network based generative modeling of speech has shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing its use in practical applications. In this paper we examine the reason and discuss methods to overcome this issue. Placing a denoising preprocessing… ▽ More Recent advances in neural-network based generative modeling of speech has shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing its use in practical applications. In this paper we examine the reason and discuss methods to overcome this issue. Placing a denoising preprocessing stage when extracting features and target clean speech during training is shown to be the best performing strategy. △ Less

Submitted 23 February, 2021; originally announced February 2021.

Comments: 5 pages, 3 figures, presented at the Asilomar Conference on Signals, Systems, and Computers 2020

arXiv:2102.09785 [pdf, ps, other]

Deep Learning-based Beam Tracking for Millimeter-wave Communications under Mobility

Authors: Sun Hong Lim, Sunwoo Kim, Byonghyo Shim, Jun Won Choi

Abstract: In this paper, we propose a deep learning-based beam tracking method for millimeter-wave (mmWave)communications. Beam tracking is employed for transmitting the known symbols using the sounding beams and tracking time-varying channels to maintain a reliable communication link. When the pose of a user equipment (UE) device varies rapidly, the mmWave channels also tend to vary fast, which hinders sea… ▽ More In this paper, we propose a deep learning-based beam tracking method for millimeter-wave (mmWave)communications. Beam tracking is employed for transmitting the known symbols using the sounding beams and tracking time-varying channels to maintain a reliable communication link. When the pose of a user equipment (UE) device varies rapidly, the mmWave channels also tend to vary fast, which hinders seamless communication. Thus, models that can capture temporal behavior of mmWave channels caused by the motion of the device are required, to cope with this problem. Accordingly, we employa deep neural network to analyze the temporal structure and patterns underlying in the time-varying channels and the signals acquired by inertial sensors. We propose a model based on long short termmemory (LSTM) that predicts the distribution of the future channel behavior based on a sequence of input signals available at the UE. This channel distribution is used to 1) control the sounding beams adaptively for the future channel state and 2) update the channel estimate through the measurement update step under a sequential Bayesian estimation framework. Our experimental results demonstrate that the proposed method achieves a significant performance gain over the conventional beam tracking methods under various mobility scenarios. △ Less

Submitted 1 December, 2022; v1 submitted 19 February, 2021; originally announced February 2021.

Comments: 23 pages, 8 figures

arXiv:2102.09660 [pdf, other]

Generative Speech Coding with Predictive Variance Regularization

Authors: W. Bastiaan Kleijn, Andrew Storus, Michael Chinen, Tom Denton, Felicia S. C. Lim, Alejandro Luebs, Jan Skoglund, Hengchin Yeh

Abstract: The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the in… ▽ More The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We introduce predictive-variance regularization to reduce the sensitivity to outliers, resulting in a significant increase in performance. We show that noise reduction to remove unwanted signals can significantly increase performance. We provide extensive subjective performance evaluations that show that our system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity. △ Less

Submitted 18 February, 2021; originally announced February 2021.

MSC Class: 94 ACM Class: I.m

arXiv:2102.00201 [pdf, other]

Melon Playlist Dataset: a public dataset for audio-based playlist generation and music tagging

Authors: Andres Ferraro, Yuntae Kim, Soohyeon Lee, Biho Kim, Namjun Jo, Semi Lim, Suyon Lim, Jungtaek Jang, Sehwan Kim, Xavier Serra, Dmitry Bogdanov

Abstract: One of the main limitations in the field of audio signal processing is the lack of large public datasets with audio representations and high-quality annotations due to restrictions of copyrighted commercial music. We present Melon Playlist Dataset, a public dataset of mel-spectrograms for 649,091tracks and 148,826 associated playlists annotated by 30,652 different tags. All the data is gathered fr… ▽ More One of the main limitations in the field of audio signal processing is the lack of large public datasets with audio representations and high-quality annotations due to restrictions of copyrighted commercial music. We present Melon Playlist Dataset, a public dataset of mel-spectrograms for 649,091tracks and 148,826 associated playlists annotated by 30,652 different tags. All the data is gathered from Melon, a popular Korean streaming service. The dataset is suitable for music information retrieval tasks, in particular, auto-tagging and automatic playlist continuation. Even though the latter can be addressed by collaborative filtering approaches, audio provides opportunities for research on track suggestions and building systems resistant to the cold-start problem, for which we provide a baseline. Moreover, the playlists and the annotations included in the Melon Playlist Dataset make it suitable for metric learning and representation learning. △ Less

Submitted 30 January, 2021; originally announced February 2021.

Comments: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing

arXiv:2008.07742 [pdf, other]

UDC 2020 Challenge on Image Restoration of Under-Display Camera: Methods and Results

Authors: Yuqian Zhou, Michael Kwan, Kyle Tolentino, Neil Emerton, Sehoon Lim, Tim Large, Lijiang Fu, Zhihong Pan, Baopu Li, Qirui Yang, Yihao Liu, Jigang Tang, Tao Ku, Shibin Ma, Bingnan Hu, Jiarong Wang, Densen Puthussery, Hrishikesh P S, Melvin Kuriakose, Jiji C V, Varun Sundar, Sumanth Hegde, Divya Kothandaraman, Kaushik Mitra, Akashdeep Jassal , et al. (20 additional authors not shown)

Abstract: This paper is the report of the first Under-Display Camera (UDC) image restoration challenge in conjunction with the RLQ workshop at ECCV 2020. The challenge is based on a newly-collected database of Under-Display Camera. The challenge tracks correspond to two types of display: a 4k Transparent OLED (T-OLED) and a phone Pentile OLED (P-OLED). Along with about 150 teams registered the challenge, ei… ▽ More This paper is the report of the first Under-Display Camera (UDC) image restoration challenge in conjunction with the RLQ workshop at ECCV 2020. The challenge is based on a newly-collected database of Under-Display Camera. The challenge tracks correspond to two types of display: a 4k Transparent OLED (T-OLED) and a phone Pentile OLED (P-OLED). Along with about 150 teams registered the challenge, eight and nine teams submitted the results during the testing phase for each track. The results in the paper are state-of-the-art restoration performance of Under-Display Camera Restoration. Datasets and paper are available at https://yzhouas.github.io/projects/UDC/udc.html. △ Less

Submitted 18 August, 2020; originally announced August 2020.

Comments: 15 pages

arXiv:2007.01261 [pdf, other]

Curriculum Manager for Source Selection in Multi-Source Domain Adaptation

Authors: Luyu Yang, Yogesh Balaji, Ser-Nam Lim, Abhinav Shrivastava

Abstract: The performance of Multi-Source Unsupervised Domain Adaptation depends significantly on the effectiveness of transfer from labeled source domain samples. In this paper, we proposed an adversarial agent that learns a dynamic curriculum for source samples, called Curriculum Manager for Source Selection (CMSS). The Curriculum Manager, an independent network module, constantly updates the curriculum d… ▽ More The performance of Multi-Source Unsupervised Domain Adaptation depends significantly on the effectiveness of transfer from labeled source domain samples. In this paper, we proposed an adversarial agent that learns a dynamic curriculum for source samples, called Curriculum Manager for Source Selection (CMSS). The Curriculum Manager, an independent network module, constantly updates the curriculum during training, and iteratively learns which domains or samples are best suited for aligning to the target. The intuition behind this is to force the Curriculum Manager to constantly re-measure the transferability of latent domains over time to adversarially raise the error rate of the domain discriminator. CMSS does not require any knowledge of the domain labels, yet it outperforms other methods on four well-known benchmarks by significant margins. We also provide interpretable results that shed light on the proposed method. △ Less

Submitted 2 July, 2020; originally announced July 2020.

arXiv:2004.14491 [pdf, other]

Detecting Deep-Fake Videos from Appearance and Behavior

Authors: Shruti Agarwal, Tarek El-Gaaly, Hany Farid, Ser-Nam Lim

Abstract: Synthetically-generated audios and videos -- so-called deep fakes -- continue to capture the imagination of the computer-graphics and computer-vision communities. At the same time, the democratization of access to technology that can create sophisticated manipulated video of anybody saying anything continues to be of concern because of its power to disrupt democratic elections, commit small to lar… ▽ More Synthetically-generated audios and videos -- so-called deep fakes -- continue to capture the imagination of the computer-graphics and computer-vision communities. At the same time, the democratization of access to technology that can create sophisticated manipulated video of anybody saying anything continues to be of concern because of its power to disrupt democratic elections, commit small to large-scale fraud, fuel dis-information campaigns, and create non-consensual pornography. We describe a biometric-based forensic technique for detecting face-swap deep fakes. This technique combines a static biometric based on facial recognition with a temporal, behavioral biometric based on facial expressions and head movements, where the behavioral embedding is learned using a CNN with a metric-learning objective function. We show the efficacy of this approach across several large-scale video datasets, as well as in-the-wild deep fakes. △ Less

Submitted 29 April, 2020; originally announced April 2020.

Journal ref: IEEE Workshop on Image Forensics and Security, 2020

arXiv:2004.09584 [pdf, other]

ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric

Authors: Michael Chinen, Felicia S. C. Lim, Jan Skoglund, Nikita Gureev, Feargus O'Gorman, Andrew Hines

Abstract: Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively,) provides improvements upon previous versions, in terms of both design and usage. As an open source C++ library or binary with permissive licensing, ViSQOL can now be deployed beyond the research context into production… ▽ More Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively,) provides improvements upon previous versions, in terms of both design and usage. As an open source C++ library or binary with permissive licensing, ViSQOL can now be deployed beyond the research context into production usage. The feedback from internal production teams at Google has helped to improve this new release, and serves to show cases where it is most applicable, as well as to highlight limitations. The new model is benchmarked against real-world data for evaluation purposes. The trends and direction of future work is discussed. △ Less

Submitted 20 April, 2020; originally announced April 2020.

Comments: 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX)

arXiv:2004.09320 [pdf, other]

Quantization Guided JPEG Artifact Correction

Authors: Max Ehrlich, Larry Davis, Ser-Nam Lim, Abhinav Shrivastava

Abstract: The JPEG image compression algorithm is the most popular method of image compression because of its ability for large compression ratios. However, to achieve such high compression, information is lost. For aggressive quantization settings, this leads to a noticeable reduction in image quality. Artifact correction has been studied in the context of deep neural networks for some time, but the curren… ▽ More The JPEG image compression algorithm is the most popular method of image compression because of its ability for large compression ratios. However, to achieve such high compression, information is lost. For aggressive quantization settings, this leads to a noticeable reduction in image quality. Artifact correction has been studied in the context of deep neural networks for some time, but the current state-of-the-art methods require a different model to be trained for each quality setting, greatly limiting their practical application. We solve this problem by creating a novel architecture which is parameterized by the JPEG files quantization matrix. This allows our single model to achieve state-of-the-art performance over models trained for specific quality settings. △ Less

Submitted 16 July, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

Comments: Published in the proceedings of ECCV 2020, please see our released code and models at https://gitlab.com/Queuecumber/quantization-guided-ac

arXiv:2003.12015 [pdf, other]

doi 10.1109/JSTQE.2020.2982990

Photonic convolutional neural networks using integrated diffractive optics

Authors: Jun Rong Ong, Chin Chun Ooi, Thomas Y. L. Ang, Soon Thor Lim, Ching Eng Png

Abstract: With recent rapid advances in photonic integrated circuits, it has been demonstrated that programmable photonic chips can be used to implement artificial neural networks. Convolutional neural networks (CNN) are a class of deep learning methods that have been highly successful in applications such as image classification and speech processing. We present an architecture to implement a photonic CNN… ▽ More With recent rapid advances in photonic integrated circuits, it has been demonstrated that programmable photonic chips can be used to implement artificial neural networks. Convolutional neural networks (CNN) are a class of deep learning methods that have been highly successful in applications such as image classification and speech processing. We present an architecture to implement a photonic CNN using the Fourier transform property of integrated star couplers. We show, in computer simulation, high accuracy image classification using the MNIST dataset. We also model component imperfections in photonic CNN and show that the performance degradation can be recovered in a programmable chip. Our proposed architecture provides a large reduction in physical footprint compared to current implementations as it utilizes the natural advantages of optics and hence offers a scalable pathway towards integrated photonic deep learning processors. △ Less

Submitted 17 February, 2020; originally announced March 2020.

Comments: 9 pages, 6 figures

arXiv:2003.06464 [pdf, other]

LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition

Authors: Ramyad Hadidi, Bahar Asgari, Jiashen Cao, Younmin Bae, Da Eun Shim, Hyojong Kim, Sung-Kyu Lim, Michael S. Ryoo, Hyesoon Kim

Abstract: Deep neural networks (DNNs) have inspired new studies in myriad edge applications with robots, autonomous agents, and Internet-of-things (IoT) devices. However, performing inference of DNNs in the edge is still a severe challenge, mainly because of the contradiction between the intensive resource requirements of DNNs and the tight resource availability in several edge domains. Further, as communic… ▽ More Deep neural networks (DNNs) have inspired new studies in myriad edge applications with robots, autonomous agents, and Internet-of-things (IoT) devices. However, performing inference of DNNs in the edge is still a severe challenge, mainly because of the contradiction between the intensive resource requirements of DNNs and the tight resource availability in several edge domains. Further, as communication is costly, taking advantage of other available edge devices by using data- or model-parallelism methods is not an effective solution. To benefit from available compute resources with low communication overhead, we propose the first DNN parallelization method for reducing the communication overhead in a distributed system. We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches. LCP offers close-to-minimum communication overhead with better distribution and parallelization opportunities while significantly reducing memory footprint and computation compared to data- and model-parallelism methods. We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards. We also evaluate the performance of LCP models on a customized hardware (tailored for low latency) implemented on a small edge FPGA and as a 16mW 0.107mm2 ASIC @7nm chip. LCP models achieve a maximum and average speedups of 56x and 7x, compared to the originals, which could be improved by up to an average speedup of 33x by incorporating common optimizations such as pruning and quantization. △ Less

Submitted 17 November, 2020; v1 submitted 13 March, 2020; originally announced March 2020.

arXiv:1912.11781 [pdf, ps, other]

Multi-Source Direction-of-Arrival Estimation Using Improved Estimation Consistency Method

Authors: Rohith Mars, Hiroyuki Ehara, Srikanth Nagisetty, Chong Soon Lim

Abstract: We address the problem of estimating direction-of-arrivals (DOAs) for multiple acoustic sources in a reverberant environment using a spherical microphone array. It is well-known that multi-source DOA estimation is challenging in the presence of room reverberation, environmental noise and overlap** sources. In this work, we introduce multiple schemes to improve the robustness of estimation consis… ▽ More We address the problem of estimating direction-of-arrivals (DOAs) for multiple acoustic sources in a reverberant environment using a spherical microphone array. It is well-known that multi-source DOA estimation is challenging in the presence of room reverberation, environmental noise and overlap** sources. In this work, we introduce multiple schemes to improve the robustness of estimation consistency (EC) approach in reverberant and noisy conditions through redefined and modified parametric weights. Simulation results show that our proposed methods achieve superior performance compared to the existing EC approach, especially when the sources are spatially close in a reverberant environment. △ Less

Submitted 26 December, 2019; originally announced December 2019.

arXiv:1910.13122 [pdf]

doi 10.3390/su11205791

Algorithmic decision-making in AVs: Understanding ethical and technical concerns for smart cities

Authors: Hazel Si Min Lim, Araz Taeihagh

Abstract: Autonomous Vehicles (AVs) are increasingly embraced around the world to advance smart mobility and more broadly, smart, and sustainable cities. Algorithms form the basis of decision-making in AVs, allowing them to perform driving tasks autonomously, efficiently, and more safely than human drivers and offering various economic, social, and environmental benefits. However, algorithmic decision-makin… ▽ More Autonomous Vehicles (AVs) are increasingly embraced around the world to advance smart mobility and more broadly, smart, and sustainable cities. Algorithms form the basis of decision-making in AVs, allowing them to perform driving tasks autonomously, efficiently, and more safely than human drivers and offering various economic, social, and environmental benefits. However, algorithmic decision-making in AVs can also introduce new issues that create new safety risks and perpetuate discrimination. We identify bias, ethics, and perverse incentives as key ethical issues in the AV algorithms' decision-making that can create new safety risks and discriminatory outcomes. Technical issues in the AVs' perception, decision-making and control algorithms, limitations of existing AV testing and verification methods, and cybersecurity vulnerabilities can also undermine the performance of the AV system. This article investigates the ethical and technical concerns surrounding algorithmic decision-making in AVs by exploring how driving decisions can perpetuate discrimination and create new safety risks for the public. We discuss steps taken to address these issues, highlight the existing research gaps and the need to mitigate these issues through the design of AV's algorithms and of policies and regulations to fully realise AVs' benefits for smart and sustainable cities. △ Less

Submitted 29 October, 2019; originally announced October 2019.

Journal ref: Sustainability, 2019, 11(20), 5791

arXiv:1910.06464 [pdf, other]

doi 10.1109/ICASSP.2019.8683277

Low Bit-Rate Speech Coding with VQ-VAE and a WaveNet Decoder

Authors: Cristina Gârbacea, Aäron van den Oord, Yazhe Li, Felicia S C Lim, Alejandro Luebs, Oriol Vinyals, Thomas C Walters

Abstract: In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction… ▽ More In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality. A prosody-transparent and speaker-independent model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits perceptual quality which is around halfway between the MELP codec at 2.4 kbps and AMR-WB codec at 23.05 kbps. In addition, when training on high-quality recorded speech with the test speaker included in the training set, a model coding speech at 1.6 kbps produces output of similar perceptual quality to that generated by AMR-WB at 23.05 kbps. △ Less

Submitted 14 October, 2019; originally announced October 2019.

Comments: ICASSP 2019

Journal ref: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 735-739. IEEE, 2019

arXiv:1909.04776 [pdf, other]

Generative Speech Enhancement Based on Cloned Networks

Authors: Michael Chinen, W. Bastiaan Kleijn, Felicia S. C. Lim, Jan Skoglund

Abstract: We propose to implement speech enhancement by the regeneration of clean speech from a salient representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the cl… ▽ More We propose to implement speech enhancement by the regeneration of clean speech from a salient representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the clones to be similar for these different input signals, we train a feature extractor network that is robust to noise. At inference, the salient features form the input to a WaveNet network that generates a natural and clean speech signal with the same attributes as the ground-truth clean signal. As the signal becomes noisier, our system produces natural sounding errors that stay on the speech manifold, in place of traditional artifacts found in other systems. Our experiments confirm that our generative enhancement system provides state-of-the-art enhancement performance within the generative class of enhancers according to a MUSHRA-like test. The clones based system matches or outperforms the other systems at each input signal-to-noise (SNR) range with statistical significance. △ Less

Submitted 10 September, 2019; originally announced September 2019.

Comments: Accepted WASPAA 2019

arXiv:1908.09414 [pdf, other]

CycleGAN with a Blur Kernel for Deconvolution Microscopy: Optimal Transport Geometry

Authors: Sungjun Lim, Hyoungjun Park, Sang-Eun Lee, Sunghoe Chang, Jong Chul Ye

Abstract: Deconvolution microscopy has been extensively used to improve the resolution of the wide-field fluorescent microscopy, but the performance of classical approaches critically depends on the accuracy of a model and optimization algorithms. Recently, the convolutional neural network (CNN) approaches have been studied as a fast and high performance alternative. Unfortunately, the CNN approaches usuall… ▽ More Deconvolution microscopy has been extensively used to improve the resolution of the wide-field fluorescent microscopy, but the performance of classical approaches critically depends on the accuracy of a model and optimization algorithms. Recently, the convolutional neural network (CNN) approaches have been studied as a fast and high performance alternative. Unfortunately, the CNN approaches usually require matched high resolution images for supervised training. In this paper, we present a novel unsupervised cycle-consistent generative adversarial network (cycleGAN) with a linear blur kernel, which can be used for both blind- and non-blind image deconvolution. In contrast to the conventional cycleGAN approaches that require two deep generators, the proposed cycleGAN approach needs only a single deep generator and a linear blur kernel, which significantly improves the robustness and efficiency of network training. We show that the proposed architecture is indeed a dual formulation of an optimal transport problem that uses a special form of the penalized least squares cost as a transport cost. Experimental results using simulated and real experimental data confirm the efficacy of the algorithm. △ Less

Submitted 8 July, 2020; v1 submitted 25 August, 2019; originally announced August 2019.

Comments: This paper is accepted for IEEE Trans. Computational Imaging

arXiv:1908.07045 [pdf, other]

Salient Speech Representations Based on Cloned Networks

Authors: W. Bastiaan Kleijn, Felicia S. C. Lim, Michael Chinen, Jan Skoglund

Abstract: We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning for generative networks. We extract salient features by jointly training a set of clones of an encoder network. Each network clone receiv… ▽ More We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning for generative networks. We extract salient features by jointly training a set of clones of an encoder network. Each network clone receives as input a different signal from a set of equivalent signals. The objective function encourages the network clones to map their input into a set of features that is identical across the clones. It additionally encourages feature independence and, optionally, reconstruction of a desired target signal by a decoder. As an application, we train a system that extracts a time-sequence of feature vectors of speech and uses it as a conditioning of a WaveNet generative system, facilitating both coding and enhancement. △ Less

Submitted 19 August, 2019; originally announced August 2019.

Comments: Interspeech 2019

arXiv:1906.05956 [pdf, other]

doi 10.1007/978-3-030-32248-9_25

Scalable Neural Architecture Search for 3D Medical Image Segmentation

Authors: Sungwoong Kim, Ildoo Kim, Sungbin Lim, Woonhyuk Baek, Chiheon Kim, Hyungjoo Cho, Boogeon Yoon, Taesup Kim

Abstract: In this paper, a neural architecture search (NAS) framework is proposed for 3D medical image segmentation, to automatically optimize a neural architecture from a large design space. Our NAS framework searches the structure of each layer including neural connectivities and operation types in both of the encoder and decoder. Since optimizing over a large discrete architecture space is difficult due… ▽ More In this paper, a neural architecture search (NAS) framework is proposed for 3D medical image segmentation, to automatically optimize a neural architecture from a large design space. Our NAS framework searches the structure of each layer including neural connectivities and operation types in both of the encoder and decoder. Since optimizing over a large discrete architecture space is difficult due to high-resolution 3D medical images, a novel stochastic sampling algorithm based on a continuous relaxation is also proposed for scalable gradient based optimization. On the 3D medical image segmentation tasks with a benchmark dataset, an automatically designed architecture by the proposed NAS framework outperforms the human-designed 3D U-Net, and moreover this optimized architecture is well suited to be transferred for different tasks. △ Less

Submitted 13 June, 2019; originally announced June 2019.

Comments: 9 pages, 3 figures

arXiv:1712.01120 [pdf, other]

Wavenet based low rate speech coding

Authors: W. Bastiaan Kleijn, Felicia S. C. Lim, Alejandro Luebs, Jan Skoglund, Florian Stimberg, Quan Wang, Thomas C. Walters

Abstract: Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative m… ▽ More Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the speech produced by the system is able to additionally perform implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener, even when that speaker has not been used during the training of the generative model. △ Less

Submitted 1 December, 2017; originally announced December 2017.

Comments: 5 pages, 2 figures

Showing 1–33 of 33 results for author: Lim, S