doi 10.1145/3643834.3661556

SoundShift: Exploring Sound Manipulations for Accessible Mixed-Reality Awareness

Authors: Ruei-Che Chang, Chia-Sheng Hung, Bing-Yu Chen, Dhruv Jain, Anhong Guo

Abstract: Mixed-reality (MR) soundscapes blend real-world sound with virtual audio from hearing devices, presenting intricate auditory information that is hard to discern and differentiate. This is particularly challenging for blind or visually impaired individuals, who rely on sounds and descriptions in their everyday lives. To understand how complex audio information is consumed, we analyzed online forum posts within the blind community, identifying prevailing challenges, needs, and desired solutions. We synthesized the results and propose SoundShift for increasing MR sound awareness, which includes six sound manipulations: Transparency Shift, Envelope Shift, Position Shift, Style Shift, Time Shift, and Sound Append. To evaluate the effectiveness of SoundShift, we conducted a user study with 18 blind participants across three simulated MR scenarios, where participants identified specific sounds within intricate soundscapes. We found that SoundShift increased MR sound awareness and minimized cognitive load. Finally, we developed three real-world example applications to demonstrate the practicality of SoundShift.

Submitted 26 May, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

Comments: DIS 2024

arXiv:2311.18168 [pdf, other]

Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

Authors: Karren D. Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, Oncel Tuzel

Abstract: We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D facial motions that accompany speech in the real world. Importantly, the relationship between speech and facial motion is one-to-many, containing both inter-speaker and intra-speaker variations and necessitating a probabilistic approach. In this paper, we identify and address key challenges that have so far limited the development of probabilistic models: lack of datasets and metrics that are suitable for training and evaluating them, as well as the difficulty of designing a model that generates diverse results while remaining faithful to a strong conditioning signal such as speech. We first propose large-scale benchmark datasets and metrics suitable for probabilistic modeling. Then, we demonstrate a probabilistic model that achieves both diversity and fidelity to speech, outperforming other methods across the proposed benchmarks. Finally, we showcase useful applications of probabilistic models trained on these large-scale datasets: we can generate diverse speech-driven 3D facial motion that matches unseen speaker styles extracted from reference clips; and our synthetic meshes can be used to improve the performance of downstream audio-visual models.

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2310.15130 [pdf, other]

Novel-View Acoustic Synthesis from 3D Reconstructed Rooms

Authors: Byeongjoo Ahn, Karren Yang, Brian Hamilton, Jonathan Sheaffer, Anurag Ranjan, Miguel Sarabia, Oncel Tuzel, Jen-Hao Rick Chang

Abstract: We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis. Given audio recordings from 2-4 microphones and the 3D geometry and material of a scene containing multiple unknown sound sources, we estimate the sound anywhere in the scene. We identify the main challenges of novel-view acoustic synthesis as sound source localization, separation, and dereverberation. While naively training an end-to-end network fails to produce high-quality results, we show that incorporating room impulse responses (RIRs) derived from 3D reconstructed rooms enables the same network to jointly tackle these tasks. Our method outperforms existing methods designed for the individual tasks, demonstrating its effectiveness at utilizing 3D visual information. In a simulated study on the Matterport3D-NVAS dataset, our model achieves near-perfect accuracy on source localization, a PSNR of 26.44 dB and a SDR of 14.23 dB for source separation and dereverberation, resulting in a PSNR of 25.55 dB and a SDR of 14.20 dB on novel-view acoustic synthesis. Code, pretrained model, and video results are available on the project webpage (https://github.com/apple/ml-nvas3d).

Submitted 23 October, 2023; originally announced October 2023.
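
Note on the SDR figures quoted in the abstract above: a signal-to-distortion ratio compares the energy of a reference signal with the energy of the error between the reference and an estimate, expressed in decibels. The following is a minimal sketch of the plain textbook definition, not the evaluation code of the paper, which may use a different SDR variant. The function name `sdr_db` and the sample values are hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Plain signal-to-distortion ratio in dB: 10*log10(||s||^2 / ||s - s_hat||^2).

    `reference` and `estimate` are equal-length sequences of samples.
    This is the basic definition only; scale-invariant or BSS-Eval
    variants are computed differently.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    if signal_energy == 0:
        raise ValueError("reference signal has zero energy")
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# Hypothetical example: the estimate is the reference scaled by 0.9,
# so the error carries 1% of the reference energy.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
print(round(sdr_db(reference, estimate), 2))
```

Under this definition, every 10 dB of SDR corresponds to a tenfold reduction in error energy relative to the reference.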

arXiv:2309.10707 [pdf, other]

Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models

Authors: Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Raviteja Vemulapalli, Jen-Hao Rick Chang, Karren Yang, Gautam Varma Mantena, Oncel Tuzel

Abstract: While Automatic Speech Recognition (ASR) systems are widely used in many real-world applications, they often do not generalize well to new domains and need to be finetuned on data from these domains. However, target-domain data usually are not readily available in many scenarios. In this paper, we propose a new strategy for adapting ASR models to new target domains without any text or speech from those domains. To accomplish this, we propose a novel data synthesis pipeline that uses a Large Language Model (LLM) to generate a target domain text corpus, and a state-of-the-art controllable speech synthesis model to generate the corresponding speech. We propose a simple yet effective in-context instruction finetuning strategy to increase the effectiveness of LLM in generating text corpora for new domains. Experiments on the SLURP dataset show that the proposed method achieves an average relative word error rate improvement of 28% on unseen target domains without any performance drop in source domains.

Submitted 18 September, 2023; originally announced September 2023.
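
Note on the metric in the abstract above: a relative word error rate (WER) improvement is the reduction in WER divided by the baseline WER. The sketch below shows one common way to compute WER (word-level edit distance divided by reference length) and the relative improvement. It is an illustration only, not the paper's evaluation code; the function names, the example sentences, and the WER values 0.25 and 0.18 are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + substitution_cost,   # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


def relative_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline error removed by adaptation."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# One substituted word out of four reference words.
print(word_error_rate("turn on the light", "turn off the light"))
# Hypothetical WERs before and after adaptation.
print(round(relative_improvement(0.25, 0.18), 2))
```

With these made-up numbers, dropping from 25% to 18% WER is a 7-point absolute reduction but a 28% relative one, which is why the two kinds of figures should not be compared directly.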

arXiv:2308.10790 [pdf]

Extraction of Text from Optic Nerve Optical Coherence Tomography Reports

Authors: Iyad Majid, Youchen Victor Zhang, Robert Chang, Sophia Y. Wang

Abstract: Purpose: The purpose of this study was to develop and evaluate rule-based algorithms to enhance the extraction of text data, including retinal nerve fiber layer (RNFL) values and other ganglion cell count (GCC) data, from Zeiss Cirrus optical coherence tomography (OCT) scan reports. Methods: DICOM files that contained encapsulated PDF reports with RNFL or Ganglion Cell in their document titles were identified from a clinical imaging repository at a single academic ophthalmic center. PDF reports were then converted into image files and processed using the PaddleOCR Python package for optical character recognition. Rule-based algorithms were designed and iteratively optimized for improved performance in extracting RNFL and GCC data. Evaluation of the algorithms was conducted through manual review of a set of RNFL and GCC reports. Results: The developed algorithms demonstrated high precision in extracting data from both RNFL and GCC scans. Precision was slightly better for the right eye in RNFL extraction (OD: 0.9803 vs. OS: 0.9046), and for the left eye in GCC extraction (OD: 0.9567 vs. OS: 0.9677). Some values presented more challenges in extraction, particularly clock hours 5 and 6 for RNFL thickness, and signal strength for GCC. Conclusions: A customized optical character recognition algorithm can identify numeric results from optical coherence scan reports with high precision. Automated processing of PDF reports can greatly reduce the time to extract OCT results on a large scale.

Submitted 21 August, 2023; originally announced August 2023.

arXiv:2308.03027 [pdf, other]

Causal Disentanglement Hidden Markov Model for Fault Diagnosis

Authors: Rihao Chang, Yongtao Ma, Weizhi Nie, Jie Nie, An-an Liu

Abstract: In modern industries, fault diagnosis has been widely applied with the goal of realizing predictive maintenance. The key issue for the fault diagnosis system is to extract representative characteristics of the fault signal and then accurately predict the fault type. In this paper, we propose a Causal Disentanglement Hidden Markov model (CDHM) to learn the causality in the bearing fault mechanism and thus, capture their characteristics to achieve a more robust representation. Specifically, we make full use of the time-series data and progressively disentangle the vibration signal into fault-relevant and fault-irrelevant factors. The ELBO is reformulated to optimize the learning of the causal disentanglement Markov model. Moreover, to expand the scope of the application, we adopt unsupervised domain adaptation to transfer the learned disentangled representations to other working environments. Experiments were conducted on the CWRU dataset and IMS dataset. Relevant results validate the superiority of the proposed method.

Submitted 6 August, 2023; originally announced August 2023.

arXiv:2303.14885 [pdf, other]

Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis

Authors: Karren Yang, Ting-Yao Hu, Jen-Hao Rick Chang, Hema Swetha Koppula, Oncel Tuzel

Abstract: Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we ask two fundamental questions about this strategy: when is synthetic data effective for personalization, and why is it effective in those cases? To address the first question, we adapt a state-of-the-art automatic speech recognition (ASR) model to target speakers from four benchmark datasets representative of different speaker types. We show that ASR personalization with synthetic data is effective in all cases, but particularly when (i) the target speaker is underrepresented in the global data, and (ii) the capacity of the global model is limited. To address the second question of why personalized synthetic data is effective, we use controllable speech synthesis to generate speech with varied styles and content. Surprisingly, we find that the text content of the synthetic data, rather than style, is important for speaker adaptation. These results lead us to propose a data selection strategy for ASR personalization based on speech content.

Submitted 26 March, 2023; originally announced March 2023.

Comments: ICASSP 2023

arXiv:2303.05745 [pdf, other]

Multi-site, Multi-domain Airway Tree Modeling (ATM'22): A Public Benchmark for Pulmonary Airway Segmentation

Authors: Minghui Zhang, Yangqian Wu, Hanxiao Zhang, Yulei Qin, Hao Zheng, Wen Tang, Corey Arnold, Chenhao Pei, Pengxin Yu, Yang Nan, Guang Yang, Simon Walsh, Dominic C. Marshall, Matthieu Komorowski, Puyang Wang, Dazhou Guo, Dakai Jin, Ya'nan Wu, Shuiqing Zhao, Runsheng Chang, Boyu Zhang, Xing Lv, Abdul Qayyum, Moona Mazher, Qi Su, et al. (11 additional authors not shown)

Abstract: Open international challenges are becoming the de facto standard for assessing computer vision and image analysis algorithms. In recent years, new methods have extended the reach of pulmonary airway segmentation that is closer to the limit of image resolution. Since EXACT'09 pulmonary airway segmentation, limited effort has been directed to quantitative comparison of newly emerged algorithms driven by the maturity of deep learning based approaches and clinical drive for resolving finer details of distal airways for early intervention of pulmonary diseases. Thus far, public annotated datasets are extremely limited, hindering the development of data-driven methods and detailed performance evaluation of new algorithms. To provide a benchmark for the medical imaging community, we organized the Multi-site, Multi-domain Airway Tree Modeling (ATM'22), which was held as an official challenge event during the MICCAI 2022 conference. ATM'22 provides large-scale CT scans with detailed pulmonary airway annotation, including 500 CT scans (300 for training, 50 for validation, and 150 for testing). The dataset was collected from different sites and it further included a portion of noisy COVID-19 CTs with ground-glass opacity and consolidation. Twenty-three teams participated in the entire phase of the challenge and the algorithms for the top ten teams are reviewed in this paper. Quantitative and qualitative results revealed that deep learning models embedded with the topological continuity enhancement achieved superior performance in general. ATM'22 challenge holds as an open-call design, the training data and the gold standard evaluation are available upon successful registration via its homepage.

Submitted 27 June, 2023; v1 submitted 10 March, 2023; originally announced March 2023.

Comments: 32 pages, 16 figures. Homepage: https://atm22.grand-challenge.org/. Submitted

arXiv:2212.07651 [pdf, other]

Two-stage Contextual Transformer-based Convolutional Neural Network for Airway Extraction from CT Images

Authors: Yanan Wu, Shuiqing Zhao, Shouliang Qi, Jie Feng, Haowen Pang, Runsheng Chang, Long Bai, Mengqi Li, Shuyue Xia, Wei Qian, Hongliang Ren

Abstract: Accurate airway extraction from computed tomography (CT) images is a critical step for planning navigation bronchoscopy and quantitative assessment of airway-related chronic obstructive pulmonary disease (COPD). The existing methods are challenging to sufficiently segment the airway, especially the high-generation airway, with the constraint of the limited label and cannot meet the clinical use in COPD. We propose a novel two-stage 3D contextual transformer-based U-Net for airway segmentation using CT images. The method consists of two stages, performing initial and refined airway segmentation. The two-stage model shares the same subnetwork with different airway masks as input. Contextual transformer block is performed both in the encoder and decoder path of the subnetwork to finish high-quality airway segmentation effectively. In the first stage, the total airway mask and CT images are provided to the subnetwork, and the intrapulmonary airway mask and corresponding CT scans to the subnetwork in the second stage. Then the predictions of the two-stage method are merged as the final prediction. Extensive experiments were performed on in-house and multiple public datasets. Quantitative and qualitative analysis demonstrate that our proposed method extracted much more branches and lengths of the tree while accomplishing state-of-the-art airway segmentation performance. The code is available at https://github.com/zhaozsq/airway_segmentation.

Submitted 15 December, 2022; originally announced December 2022.

arXiv:2212.02057 [pdf, other]

DA-CIL: Towards Domain Adaptive Class-Incremental 3D Object Detection

Authors: Ziyuan Zhao, Mingxi Xu, Peisheng Qian, Ramanpreet Singh Pahwa, Richard Chang

Abstract: Deep learning has achieved notable success in 3D object detection with the advent of large-scale point cloud datasets. However, severe performance degradation in the past trained classes, i.e., catastrophic forgetting, still remains a critical issue for real-world deployment when the number of classes is unknown or may vary. Moreover, existing 3D class-incremental detection methods are developed for the single-domain scenario, which fail when encountering domain shift caused by different datasets, varying environments, etc. In this paper, we identify the unexplored yet valuable scenario, i.e., class-incremental learning under domain shift, and propose a novel 3D domain adaptive class-incremental object detection framework, DA-CIL, in which we design a novel dual-domain copy-paste augmentation method to construct multiple augmented domains for diversifying training distributions, thereby facilitating gradual domain adaptation. Then, multi-level consistency is explored to facilitate dual-teacher knowledge distillation from different domains for domain adaptive class-incremental learning. Extensive experiments on various datasets demonstrate the effectiveness of the proposed method over baselines in the domain adaptive class-incremental learning scenario.

Submitted 5 December, 2022; originally announced December 2022.

Comments: Accepted by the 33rd British Machine Vision Conference (BMVC 2022)

Journal ref: 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022. BMVA Press, 2022. URL https://bmvc2022.mpi-inf.mpg.de/0916.pdf

arXiv:2208.08894 [pdf]

EEG Machine Learning for Analysis of Mild Traumatic Brain Injury: A survey

Authors: Weiqing Gu, Ryan Chang, Bohan Yang

Abstract: Mild Traumatic Brain Injury (mTBI) is a common brain injury and affects a diverse group of people: soldiers, constructors, athletes, drivers, children, elders, and nearly everyone. Thus, having a well-established, fast, cheap, and accurate classification method is crucial for the well-being of people around the globe. Luckily, using Machine Learning (ML) on electroencephalography (EEG) data shows promising results. This survey analyzed the most cutting-edge articles from 2017 to the present. The articles were searched from the Google Scholar database and went through an elimination process based on our criteria. We reviewed, summarized, and compared the fourteen most cutting-edge machine learning research papers for predicting and classifying mTBI in terms of 1) EEG data types, 2) data preprocessing methods, 3) machine learning feature representations, 4) feature extraction methods, and 5) machine learning classifiers and predictions. The most common EEG data type was human resting-state EEG, with most studies using filters to clean the data. The power spectral, especially alpha and theta power, was the most prevalent feature. The other non-power spectral features, such as entropy, also show their great potential. The Fourier transform is the most common feature extraction method while using neural networks as automatic feature extraction generally returns a high accuracy result. Lastly, Support Vector Machine (SVM) was our survey's most common ML classifier due to its lower computational complexity and solid mathematical theoretical basis. The purpose of this study was to collect and explore a sparsely populated sector of ML, and we hope that our survey has shined some light on the inherent trends, advantages, disadvantages, and preferences of the current state of machine learning-based EEG analysis for mTBI.

Submitted 10 August, 2022; originally announced August 2022.

Comments: 27 pages

arXiv:2208.01632 [pdf, ps, other]

Sensor Deployment and Link Analysis in Satellite IoT Systems for Wildfire Detection

Authors: How-Hang Liu, Ronald Y. Chang, Yi-Ying Chen, I-Kang Fu, H. Vincent Poor

Abstract: Climate change has been identified as one of the most critical threats to human civilization and sustainability. Wildfires, which produce huge amounts of carbon emission, are both drivers and results of climate change. An early and timely wildfire detection system can constrain fires to short and small ones and yield significant carbon reduction. In this paper, we propose to use ground sensor deployment and satellite Internet of Things (IoT) technologies for wildfire detection by taking advantage of satellites' ubiquitous global coverage. We first develop an optimal IoT sensor placement strategy based on fire ignition and detection models. Then, we analyze the uplink satellite communication budget and the bandwidth required for wildfire detection under the narrowband IoT (NB-IoT) radio interface. Finally, we conduct simulations on the California wildfire database and quantify the potential economical benefits by factoring in carbon emission reductions and sensor/bandwidth costs.

Submitted 5 August, 2022; v1 submitted 2 August, 2022; originally announced August 2022.

Comments: IEEE Global Communications Conference (GLOBECOM) 2022
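
Note on the uplink budget analysis mentioned in the abstract above: a link budget adds transmit power and antenna gains, subtracts path loss, and compares the result with the receiver noise floor. The sketch below uses the standard free-space path loss formula and the thermal noise density of about -174 dBm/Hz at 290 K. It is a generic illustration, not the analysis from the paper; every function name and every numeric input (600 km range, 2000 MHz carrier, 23 dBm transmit power, 30 dBi satellite antenna gain, 180 kHz bandwidth) is an assumption chosen for the example.

```python
import math


def free_space_path_loss_db(distance_km: float, frequency_mhz: float) -> float:
    """Free-space path loss in dB for distance in km and frequency in MHz."""
    return 20.0 * math.log10(distance_km) + 20.0 * math.log10(frequency_mhz) + 32.44


def received_power_dbm(tx_power_dbm: float, tx_gain_dbi: float, rx_gain_dbi: float,
                       distance_km: float, frequency_mhz: float,
                       extra_losses_db: float = 0.0) -> float:
    """Received power = transmit power + gains - path loss - other losses."""
    path_loss = free_space_path_loss_db(distance_km, frequency_mhz)
    return tx_power_dbm + tx_gain_dbi + rx_gain_dbi - path_loss - extra_losses_db


def noise_power_dbm(bandwidth_hz: float, noise_figure_db: float = 0.0) -> float:
    """Thermal noise floor: -174 dBm/Hz at 290 K, scaled by bandwidth and noise figure."""
    return -174.0 + 10.0 * math.log10(bandwidth_hz) + noise_figure_db


# Hypothetical uplink from a ground sensor to a low-Earth-orbit satellite.
rx_dbm = received_power_dbm(tx_power_dbm=23.0, tx_gain_dbi=0.0, rx_gain_dbi=30.0,
                            distance_km=600.0, frequency_mhz=2000.0)
snr_db = rx_dbm - noise_power_dbm(bandwidth_hz=180e3)
print(f"received power: {rx_dbm:.1f} dBm, SNR: {snr_db:.1f} dB")
```

A real budget would also account for atmospheric loss, polarization mismatch, receiver noise figure, and the slant range at low elevation angles, which can all be folded into `extra_losses_db` and `noise_figure_db` in this sketch.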

arXiv:2202.01946 [pdf, ps, other]

Unsupervised Learning Based Hybrid Beamforming with Low-Resolution Phase Shifters for MU-MIMO Systems

Authors: Chia-Ho Kuo, Hsin-Yuan Chang, Ronald Y. Chang, Wei-Ho Chung

Abstract: Millimeter wave (mmWave) is a key technology for fifth-generation (5G) and beyond communications. Hybrid beamforming has been proposed for large-scale antenna systems in mmWave communications. Existing hybrid beamforming designs based on infinite-resolution phase shifters (PSs) are impractical due to hardware cost and power consumption. In this paper, we propose an unsupervised-learning-based scheme to jointly design the analog precoder and combiner with low-resolution PSs for multiuser multiple-input multiple-output (MU-MIMO) systems. We transform the analog precoder and combiner design problem into a phase classification problem and propose a generic neural network architecture, termed the phase classification network (PCNet), capable of producing solutions of various PS resolutions. Simulation results demonstrate the superior sum-rate and complexity performance of the proposed scheme, as compared to state-of-the-art hybrid beamforming designs for the most commonly used low-resolution PS configurations.

Submitted 3 February, 2022; originally announced February 2022.

Comments: IEEE International Conference on Communications (ICC) 2022
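
Note on the phase classification idea in the abstract above: a b-bit phase shifter can realize only 2^b discrete phases, so choosing the phase of each analog beamforming entry amounts to picking one of 2^b classes. The sketch below shows only that discrete structure, using nearest-level quantization of a given phase. It is not PCNet, which according to the abstract learns the choice with a neural network. The function names and the 1/sqrt(N) amplitude normalization are assumptions made for the illustration.

```python
import cmath
import math


def phase_codebook(bits: int) -> list:
    """The 2**bits phases (in radians) that a `bits`-bit phase shifter can realize."""
    levels = 2 ** bits
    return [2.0 * math.pi * k / levels for k in range(levels)]


def classify_phase(phase: float, bits: int) -> int:
    """Index of the codebook phase nearest to `phase`, with wrap-around at 2*pi."""
    levels = 2 ** bits
    step = 2.0 * math.pi / levels
    return round((phase % (2.0 * math.pi)) / step) % levels


def quantized_weight(phase: float, bits: int, num_antennas: int) -> complex:
    """Constant-modulus beamforming entry built from the selected phase class.

    The 1/sqrt(num_antennas) scaling is a common normalization and is an
    assumption here, not something stated in the abstract.
    """
    selected = phase_codebook(bits)[classify_phase(phase, bits)]
    return cmath.exp(1j * selected) / math.sqrt(num_antennas)


# A 2-bit shifter offers phases 0, pi/2, pi, 3*pi/2.
for phase in (0.1, math.pi / 2 + 0.1, 2 * math.pi - 0.1):
    print(round(phase, 3), "-> class", classify_phase(phase, bits=2))
```

Treating the output as a class index rather than a continuous angle is what lets one architecture cover several resolutions: only the number of classes changes with the bit width.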

arXiv:2201.12656 [pdf, ps, other]

Few-Shot Transfer Learning for Device-Free Fingerprinting Indoor Localization

Authors: Bing-Jia Chen, Ronald Y. Chang

Abstract: Device-free wireless indoor localization is an essential technology for the Internet of Things (IoT), and fingerprint-based methods are widely used. A common challenge to fingerprint-based methods is data collection and labeling. This paper proposes a few-shot transfer learning system that uses only a small amount of labeled data from the current environment and reuses a large amount of existing labeled data previously collected in other environments, thereby significantly reducing the data collection and labeling cost for localization in each new environment. The core method lies in graph neural network (GNN) based few-shot transfer learning and its modifications. Experimental results conducted on real-world environments show that the proposed system achieves comparable performance to a convolutional neural network (CNN) model, with 40 times fewer labeled data.

Submitted 29 January, 2022; originally announced January 2022.

Comments: IEEE International Conference on Communications (ICC) 2022

arXiv:2110.11479 [pdf, other]

Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition

Authors: Ting-Yao Hu, Mohammadreza Armandpour, Ashish Shrivastava, Jen-Hao Rick Chang, Hema Koppula, Oncel Tuzel

Abstract: With recent advances in speech synthesis, synthetic data is becoming a viable alternative to real data for training speech recognition models. However, machine learning with synthetic data is not trivial due to the gap between the synthetic and the real data distributions. Synthetic datasets may contain artifacts that do not exist in real data such as structured noise, content errors, or unrealistic speaking styles. Moreover, the synthesis process may introduce a bias due to uneven sampling of the data manifold. We propose two novel techniques during training to mitigate the problems due to the distribution gap: (i) a rejection sampling algorithm and (ii) using separate batch normalization statistics for the real and the synthetic samples. We show that these methods significantly improve the training of speech recognition models using synthetic data. We evaluate the proposed approach on keyword detection and Automatic Speech Recognition (ASR) tasks, and observe up to 18% and 13% relative error reduction, respectively, compared to naively using the synthetic data.

Submitted 21 October, 2021; originally announced October 2021.

arXiv:2110.02891 [pdf, other]

Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models

Authors: Jen-Hao Rick Chang, Ashish Shrivastava, Hema Swetha Koppula, Xiaoshuai Zhang, Oncel Tuzel

Abstract: Controllable generative sequence models with the capability to extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting written handwriting, and generating missing training samples for downstream recognition tasks. However, under an unsupervised-style setting, typical training algorithms for controllable sequence generative models suffer from the training-inference mismatch, where the same sample is used as content and style input during training but unpaired samples are given during inference. In this paper, we tackle the training-inference mismatch encountered during unsupervised learning of controllable generative sequence models. The proposed method is simple yet effective, where we use a style transformation module to transfer target style information into an unrelated style input. This method enables training using unpaired content and style samples and thereby mitigates the training-inference mismatch. We apply style equalization to text-to-speech and text-to-handwriting synthesis on three datasets. We conduct thorough evaluation, including both quantitative and qualitative user studies. Our results show that by mitigating the training-inference mismatch with the proposed style equalization, we achieve style replication scores comparable to real data in our user studies.

Submitted 30 June, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

Comments: ICML 2022

arXiv:2109.10505 [pdf, ps, other]

Sensor-Based Satellite IoT for Early Wildfire Detection

Authors: How-Hang Liu, Ronald Y. Chang, Yi-Ying Chen, I-Kang Fu

Abstract: Frequent and severe wildfires have been observed lately on a global scale. Wildfires not only threaten lives and properties, but also pose negative environmental impacts that transcend national boundaries (e.g., greenhouse gas emission and global warming). Thus, early wildfire detection with timely feedback is much needed. We propose to use the emerging beyond fifth-generation (B5G) and sixth-generation (6G) satellite Internet of Things (IoT) communication technology to enable massive sensor deployment for wildfire detection. We propose wildfire and carbon emission models that take into account real environmental data, including wind speed, soil wetness, and biomass, to simulate the fire spreading process and quantify the fire burning areas, carbon emissions, and economic benefits of the proposed system against the backdrop of recent California wildfires. We also conduct a satellite IoT feasibility check by analyzing the satellite link budget. Future research directions to further illustrate the promise of the proposed system are discussed.

Submitted 21 September, 2021; originally announced September 2021.

Comments: To appear in IEEE GLOBECOM 2021 Workshops

arXiv:2109.09267 [pdf, ps, other]

Intelligent Reflecting Surfaces and Classical Relays: Coexistence and Co-Design

Authors: Te-Yi Kan, Ronald Y. Chang, Feng-Tsun Chien

Abstract: This paper investigates a multiuser downlink communication system with a coexisting intelligent reflecting surface (IRS) and a classical half-duplex decode-and-forward (DF) relay. In this system, the IRS and the DF relay interact with each other and assist transmission simultaneously. In particular, active beamforming at the base station (BS) and at the DF relay, and passive beamforming at the IRS, are jointly designed to maximize the sum-rate of all users. The sum-rate maximization problem is nonconvex due to the coupled beamforming vectors. We propose an alternating optimization (AO) based algorithm to tackle this complex co-design problem. Numerical validation and discussion on the superiority of the coexistence system and the tradeoffs therein are presented.

Submitted 19 September, 2021; originally announced September 2021.

Comments: To appear in IEEE GLOBECOM 2021 Workshops

arXiv:2102.00178 [pdf, other]

Deep Reinforcement Learning Aided Monte Carlo Tree Search for MIMO Detection

Authors: Tz-Wei Mo, Ronald Y. Chang, Te-Yi Kan

Abstract: This paper proposes a novel multiple-input multiple-output (MIMO) symbol detector that incorporates a deep reinforcement learning (DRL) agent into the Monte Carlo tree search (MCTS) detection algorithm. We first describe how the MCTS algorithm, used in many decision-making problems, is applied to the MIMO detection problem. Then, we introduce a self-designed deep reinforcement learning agent, consisting of a policy value network and a state value network, which is trained to detect MIMO symbols. The outputs of the trained networks are adopted into a modified MCTS detection algorithm to provide useful node statistics and facilitate an enhanced tree search process. The resulting scheme, termed the DRL-MCTS detector, demonstrates significant improvements over the original MCTS detection algorithm and exhibits favorable performance compared to other existing linear and DNN-based detection methods under varying channel conditions.

Submitted 30 January, 2021; originally announced February 2021.

arXiv:2008.07111 [pdf, other]

Semi-Supervised Learning with GANs for Device-Free Fingerprinting Indoor Localization

Authors: Kevin M. Chen, Ronald Y. Chang

Abstract: Device-free wireless indoor localization is a key enabling technology for the Internet of Things (IoT). Fingerprint-based indoor localization techniques are a commonly used solution. This paper proposes a semi-supervised, generative adversarial network (GAN)-based device-free fingerprinting indoor localization system. The proposed system uses a small amount of labeled data and a large amount of unlabeled data (i.e., semi-supervised), thus considerably reducing the expensive data labeling effort. Experimental results show that, as compared to the state-of-the-art supervised scheme, the proposed semi-supervised system achieves comparable performance with equal, sufficient amount of labeled data, and significantly superior performance with equal, highly limited amount of labeled data. Besides, the proposed semi-supervised system retains its performance over a broad range of the amount of labeled data. The interactions between the generator, discriminator, and classifier models of the proposed GAN-based system are visually examined and discussed. A mathematical description of the proposed system is also presented.

Submitted 17 August, 2020; originally announced August 2020.

Comments: Accepted at IEEE GLOBECOM 2020

arXiv:2005.00946 [pdf, other]

Towards Occlusion-Aware Multifocal Displays

Authors: Jen-Hao Rick Chang, Anat Levin, B. V. K. Vijaya Kumar, Aswin C. Sankaranarayanan

Abstract: The human visual system uses numerous cues for depth perception, including disparity, accommodation, motion parallax and occlusion. It is incumbent upon virtual-reality displays to satisfy these cues to provide an immersive user experience. Multifocal displays, one of the classic approaches to satisfy the accommodation cue, place virtual content at multiple focal planes, each at a different depth. However, the content on focal planes close to the eye does not occlude that farther away; this deteriorates the occlusion cue as well as reduces contrast at depth discontinuities due to leakage of the defocus blur. This paper enables occlusion-aware multifocal displays using a novel ConeTilt operator that provides an additional degree of freedom -- tilting the light cone emitted at each pixel of the display panel. We show that, for scenes with relatively simple occlusion configurations, tilting the light cones provides the same effect as physical occlusion. We demonstrate that ConeTilt can be easily implemented by a phase-only spatial light modulator. Using a lab prototype, we show results that demonstrate the presence of occlusion cues and the increased contrast of the display at depth edges.

Submitted 2 May, 2020; originally announced May 2020.

Comments: SIGGRAPH 2020

arXiv:1910.06302 [pdf, other]

Finding New Diagnostic Information for Detecting Glaucoma using Neural Networks

Authors: Erfan Noury, Suria S. Mannil, Robert T. Chang, An Ran Ran, Carol Y. Cheung, Suman S. Thapa, Harsha L. Rao, Srilakshmi Dasari, Mohammed Riyazuddin, Dolly Chang, Sriharsha Nagaraj, Clement C. Tham, Reza Zadeh

Abstract: We describe a new approach to automated Glaucoma detection in 3D Spectral Domain Optical Coherence Tomography (OCT) optic nerve scans. First, we gathered a unique and diverse multi-ethnic dataset of OCT scans consisting of glaucoma and non-glaucomatous cases obtained from four tertiary care eye hospitals located in four different countries. Using this longitudinal data, we achieved state-of-the-art results for automatically detecting Glaucoma from a single raw OCT using a 3D Deep Learning system. These results are close to human doctors in a variety of settings across heterogeneous datasets and scanning environments. To verify correctness and interpretability of the automated categorization, we used saliency maps to find areas of focus for the model. Matching human doctor behavior, the model predictions indeed correlated with the conventional diagnostic parameters in the OCT printouts, such as the retinal nerve fiber layer. We further used our model to find new areas in the 3D data that are presently not being identified as a diagnostic parameter to detect glaucoma by human doctors. Namely, we found that the Lamina Cribrosa (LC) region can be a valuable source of helpful diagnostic information previously unavailable to doctors during routine clinical care because it lacks a quantitative printout. Our model provides such volumetric quantification of this region. We found that even when a majority of the RNFL is removed, the LC region can distinguish glaucoma. This is clinically relevant in high myopes, when the RNFL is already reduced, and thus the LC region may help differentiate glaucoma in this confounding situation. We further generalize this approach to create a new algorithm called DiagFind that provides a recipe for finding new diagnostic information in medical imagery that may have been previously unusable by doctors.

Submitted 2 September, 2020; v1 submitted 14 October, 2019; originally announced October 2019.

Comments: 28 pages, 12 figures, 15 tables, title changed, new authors added

arXiv:1812.11031 [pdf, other]

Distributed Multi-Stream Beamforming in MIMO Multi-Relay Interference Networks

Authors: Cenk M. Yetis, Ronald Y. Chang

Abstract: In this paper, multi-stream transmission in interference networks aided by multiple amplify-and-forward (AF) relays in the presence of direct links is considered. The objective is to minimize the sum power of transmitters and relays by beamforming optimization under the stream signal-to-interference-plus-noise-ratio (SINR) constraints. For transmit beamforming optimization, the problem is a well-known non-convex quadratically constrained quadratic program (QCQP) that is NP-hard to solve. After semi-definite relaxation (SDR), the problem can be optimally solved via the alternating direction method of multipliers (ADMM) algorithm for distributed implementation. Analytical and extensive numerical analyses demonstrate that the proposed ADMM solution converges to the optimal centralized solution. The convergence rate, computational complexity, and message exchange load of the proposed algorithm outperform the existing solutions. Furthermore, by SINR approximation at the relay side, distributed joint transmit and relay beamforming optimization is also proposed that further improves the total power saving at the cost of increased complexity.

Submitted 14 December, 2018; originally announced December 2018.

Comments: 18 pages, 10 figures, and 4 tables. This paper is to appear in IEEE Access

Showing 1–23 of 23 results for author: Chang, R