Search | arXiv e-print repository

MoME: Mixture of Multimodal Experts for Cancer Survival Prediction

Authors: Conghao Xiong, Hao Chen, Hao Zheng, Dong Wei, Yefeng Zheng, Joseph J. Y. Sung, Irwin King

Abstract: Survival analysis, as a challenging task, requires integrating Whole Slide Images (WSIs) and genomic data for comprehensive decision-making. There are two main challenges in this task: significant heterogeneity and complex inter- and intra-modal interactions between the two modalities. Previous approaches utilize co-attention methods, which fuse features from both modalities only once after separate encoding. However, these approaches are insufficient for modeling the complex task due to the heterogeneous nature between the modalities. To address these issues, we propose a Biased Progressive Encoding (BPE) paradigm, performing encoding and fusion simultaneously. This paradigm uses one modality as a reference when encoding the other. It enables deep fusion of the modalities through multiple alternating iterations, progressively reducing the cross-modal disparities and facilitating complementary interactions. Besides modality heterogeneity, survival analysis involves various biomarkers from WSIs, genomics, and their combinations. The critical biomarkers may exist in different modalities under individual variations, necessitating flexible adaptation of the models to specific scenarios. Therefore, we further propose a Mixture of Multimodal Experts (MoME) layer to dynamically select tailored experts in each stage of the BPE paradigm. Experts incorporate reference information from another modality to varying degrees, enabling a balanced or biased focus on different modalities during the encoding process. Extensive experimental results demonstrate the superior performance of our method on various datasets, including TCGA-BLCA, TCGA-UCEC and TCGA-LUAD. Code is available at https://github.com/BearCleverProud/MoME.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 8 + 1/2 pages, early accepted to MICCAI2024

arXiv:2401.13850 [pdf, other]

PADTHAI-MM: A Principled Approach for Designing Trustable, Human-centered AI systems using the MAST Methodology

Authors: Nayoung Kim, Myke C. Cohen, Yang Ba, Anna Pan, Shawaiz Bhatti, Pouria Salehi, James Sung, Erik Blasch, Michelle V. Mancenido, Erin K. Chiou

Abstract: Designing for AI trustworthiness is challenging, with a lack of practical guidance despite extensive literature on trust. The Multisource AI Scorecard Table (MAST), a checklist rating system, addresses this gap in designing and evaluating AI-enabled decision support systems. We propose the Principled Approach for Designing Trustable Human-centered AI systems using MAST Methodology (PADTHAI-MM), a nine-step framework that we demonstrate through the iterative design of a text analysis platform called the REporting Assistant for Defense and Intelligence Tasks (READIT). We designed two versions of READIT, high-MAST including AI context and explanations, and low-MAST resembling a "black box" type system. Participant feedback and state-of-the-art AI knowledge were integrated in the design process, leading to a redesigned prototype tested by participants in an intelligence reporting task. Results show that MAST-guided design can improve trust perceptions, and that MAST criteria can be linked to performance, process, and purpose information, providing a practical and theory-informed basis for AI system design.

Submitted 24 January, 2024; originally announced January 2024.

arXiv:2311.18040 [pdf, other]

Evaluating Trustworthiness of AI-Enabled Decision Support Systems: Validation of the Multisource AI Scorecard Table (MAST)

Authors: Pouria Salehi, Yang Ba, Nayoung Kim, Ahmadreza Mosallanezhad, Anna Pan, Myke C. Cohen, Yixuan Wang, Jieqiong Zhao, Shawaiz Bhatti, James Sung, Erik Blasch, Michelle V. Mancenido, Erin K. Chiou

Abstract: The Multisource AI Scorecard Table (MAST) is a checklist tool based on analytic tradecraft standards to inform the design and evaluation of trustworthy AI systems. In this study, we evaluate whether MAST is associated with people's trust perceptions in AI-enabled decision support systems (AI-DSSs). Evaluating trust in AI-DSSs poses challenges to researchers and practitioners. These challenges include identifying the components, capabilities, and potential of these systems, many of which are based on the complex deep learning algorithms that drive DSS performance and preclude complete manual inspection. We developed two interactive, AI-DSS test environments using the MAST criteria. One emulated an identity verification task in security screening, and another emulated a text summarization system to aid in an investigative reporting task. Each test environment had one version designed to match low-MAST ratings, and another designed to match high-MAST ratings, with the hypothesis that MAST ratings would be positively related to the trust ratings of these systems. A total of 177 subject matter experts were recruited to interact with and evaluate these systems. Results generally show higher MAST ratings for the high-MAST conditions compared to the low-MAST groups, and that measures of trust perception are highly correlated with the MAST ratings. We conclude that MAST can be a useful tool for designing and evaluating systems that will engender high trust perceptions, including AI-DSS that may be used to support visual screening and text summarization tasks. However, higher MAST ratings may not translate to higher joint performance.

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.02107 [pdf]

Generative Artificial Intelligence in Healthcare: Ethical Considerations and Assessment Checklist

Authors: Yilin Ning, Salinelat Teixayavong, Yuqing Shang, Julian Savulescu, Vaishaanth Nagaraj, Di Miao, Mayli Mertens, Daniel Shu Wei Ting, Jasmine Chiat Ling Ong, Mingxuan Liu, Jiuwen Cao, Michael Dunn, Roger Vaughan, Marcus Eng Hock Ong, Joseph Jao-Yiu Sung, Eric J Topol, Nan Liu

Abstract: The widespread use of ChatGPT and other emerging technology powered by generative artificial intelligence (GenAI) has drawn much attention to potential ethical issues, especially in high-stakes applications such as healthcare, but ethical discussions are yet to translate into operationalisable solutions. Furthermore, ongoing ethical discussions often neglect other types of GenAI that have been used to synthesise data (e.g., images) for research and practical purposes, which resolved some ethical issues and exposed others. We conduct a scoping review of ethical discussions on GenAI in healthcare to comprehensively analyse gaps in the current research, and further propose to reduce the gaps by developing a checklist for comprehensive assessment and transparent documentation of ethical discussions in GenAI research. The checklist can be readily integrated into the current peer review and publication system to enhance GenAI research, and may be used for ethics-related disclosures for GenAI-powered products, healthcare applications of such products and beyond.

Submitted 23 February, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

arXiv:2309.00208 [pdf, other]

Large Language Models for Semantic Monitoring of Corporate Disclosures: A Case Study on Korea's Top 50 KOSPI Companies

Authors: Junwon Sung, Woo** Heo, Yunkyung Byun, Youngsam Kim

Abstract: In the rapidly advancing domain of artificial intelligence, state-of-the-art language models such as OpenAI's GPT-3.5-turbo and GPT-4 offer unprecedented opportunities for automating complex tasks. This research paper delves into the capabilities of these models for semantically analyzing corporate disclosures in the Korean context, specifically for timely disclosure. The study focuses on the top 50 publicly traded companies listed on the Korean KOSPI, based on market capitalization, and scrutinizes their monthly disclosure summaries over a period of 17 months. Each summary was assigned a sentiment rating on a scale ranging from 1 (very negative) to 5 (very positive). To gauge the effectiveness of the language models, their sentiment ratings were compared with those generated by human experts. Our findings reveal a notable performance disparity between GPT-3.5-turbo and GPT-4, with the latter demonstrating significant accuracy in human evaluation tests. The Spearman correlation coefficient was registered at 0.61, while the simple concordance rate was recorded at 0.82. This research contributes valuable insights into the evaluative characteristics of GPT models, thereby laying the groundwork for future innovations in the field of automated semantic monitoring.

Submitted 31 August, 2023; originally announced September 2023.
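
The abstract above reports two agreement measures between model and human sentiment ratings: a Spearman correlation and a "simple concordance rate". The following is a minimal sketch of how such measures can be computed on 1-to-5 ratings. It assumes the concordance rate means the fraction of exact rating matches, which the abstract does not define; the function names and the sample ratings are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every item tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def concordance_rate(x: Sequence[int], y: Sequence[int]) -> float:
    """ASSUMPTION: 'simple concordance' = share of items with identical ratings."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


if __name__ == "__main__":
    # Hypothetical ratings, for illustration only.
    model_ratings = [3, 4, 2, 5, 3, 1]
    human_ratings = [3, 4, 3, 5, 2, 1]
    print("spearman:", spearman(model_ratings, human_ratings))
    print("concordance:", concordance_rate(model_ratings, human_ratings))
```

Ranking with averaged ties matters here because a five-point scale produces many tied ratings, and the shortcut formula based on squared rank differences is only exact when there are no ties.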

arXiv:2307.07130 [pdf, other]

Digital Health Discussion Through Articles Published Until the Year 2021: A Digital Topic Modeling Approach

Authors: Junhyoun Sung, Hyungsook Kim

Abstract: The digital health industry has grown in popularity since the 2010s, but there has been limited analysis of the topics discussed in the field across academic disciplines. This study aims to analyze the research trends of digital health-related articles published on the Web of Science until 2021, in order to understand the concentration, scope, and characteristics of the research. 15,950 digital health-related papers from the top 10 academic fields were analyzed using the Web of Science. The papers were grouped into three domains: public health, medicine, and electrical engineering and computer science (EECS). Two time periods (2012-2016 and 2017-2021) were compared using Latent Dirichlet Allocation (LDA) for topic modeling. The number of topics was determined based on coherence score, and topic compositions were compared using a homogeneity test. The number of optimal topics varied across domains and time periods. For public health, the first and second halves had 13 and 19 topics, respectively. Medicine had 14 and 25 topics, and EECS had 7 and 21 topics. Text analysis revealed shared topics among the domains, but with variations in composition. The homogeneity test confirmed significant differences between the groups (adjusted p-value<0.05). Six dominant themes emerged, including journal article methodology, information technology, medical issues, population demographics, social phenomena, and healthcare. Digital health research is expanding and evolving, particularly in relation to Covid-19, where topics such as depression and mental disorders, education, and physical activity have gained prominence. There was no bias in topic composition among the three domains, but other fields like kinesiology or psychology could contribute to future digital health research. Exploring expanded topics that reflect people's needs for digital health over time will be crucial.

Submitted 18 September, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

Comments: 13 pages, 5 figures

arXiv:2303.05780 [pdf, other]

Knowledge Transfer via Multi-Head Feature Adaptation for Whole Slide Image Classification

Authors: Conghao Xiong, Yi Lin, Hao Chen, Joseph Sung, Irwin King

Abstract: Transferring prior knowledge from a source domain to the same or similar target domain can greatly enhance the performance of models on the target domain. However, it is challenging to directly leverage the knowledge from the source domain due to task discrepancy and domain shift. To bridge the gaps between different tasks and domains, we propose a Multi-Head Feature Adaptation module, which projects features in the source feature space to a new space that is more similar to the target space. Knowledge transfer is particularly important in Whole Slide Image (WSI) classification since the number of WSIs in one dataset might be too small to achieve satisfactory performance. Therefore, WSI classification is an ideal testbed for our method, and we adapt multiple knowledge transfer methods for WSI classification. The experimental results show that models with knowledge transfer outperform models that are trained from scratch by a large margin regardless of the number of WSIs in the datasets, and our method achieves state-of-the-art performances among other knowledge transfer methods on multiple datasets, including TCGA-RCC, TCGA-NSCLC, and Camelyon16 datasets.

Submitted 10 March, 2023; originally announced March 2023.

arXiv:2301.08125 [pdf, other]

Diagnose Like a Pathologist: Transformer-Enabled Hierarchical Attention-Guided Multiple Instance Learning for Whole Slide Image Classification

Authors: Conghao Xiong, Hao Chen, Joseph J. Y. Sung, Irwin King

Abstract: Multiple Instance Learning (MIL) and transformers are increasingly popular in histopathology Whole Slide Image (WSI) classification. However, unlike human pathologists who selectively observe specific regions of histopathology tissues under different magnifications, most methods do not incorporate multiple resolutions of the WSIs, hierarchically and attentively, thereby leading to a loss of focus on the WSIs and information from other resolutions. To resolve this issue, we propose a Hierarchical Attention-Guided Multiple Instance Learning framework to fully exploit the WSIs. This framework can dynamically and attentively discover the discriminative regions across multiple resolutions of the WSIs. Within this framework, an Integrated Attention Transformer is proposed to further enhance the performance of the transformer and obtain a more holistic WSI (bag) representation. This transformer consists of multiple Integrated Attention Modules, each of which is the combination of a transformer layer and an aggregation module that produces a bag representation based on every instance representation in that bag. The experimental results show that our method achieved state-of-the-art performances on multiple datasets, including Camelyon16, TCGA-RCC, TCGA-NSCLC, and an in-house IMGC dataset. The code is available at https://github.com/BearCleverProud/HAG-MIL.

Submitted 16 July, 2023; v1 submitted 19 January, 2023; originally announced January 2023.

Comments: Accepted to IJCAI2023
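
The abstract above mentions an aggregation module that produces a bag representation from every instance representation in the bag. As a rough illustration of that general idea only, here is a sketch of generic attention-based MIL pooling, where each instance gets a softmax weight and the bag vector is the weighted sum. This is not the paper's Integrated Attention Transformer; the linear scoring vector `score_weights` and the toy features are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    instances: Sequence[Sequence[float]],
    score_weights: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool instance features into one bag vector.

    Each instance is scored with a dot product against `score_weights`
    (a hypothetical learned parameter), the scores are normalised with
    softmax, and the bag vector is the attention-weighted sum.
    """
    scores = [sum(f * w for f, w in zip(inst, score_weights)) for inst in instances]
    attention = softmax(scores)
    dim = len(instances[0])
    bag = [
        sum(a * inst[d] for a, inst in zip(attention, instances))
        for d in range(dim)
    ]
    return bag, attention


if __name__ == "__main__":
    # Three toy patch features of dimension 2 (illustrative values only).
    patches = [[1.0, 0.0], [3.0, 2.0], [0.5, 0.5]]
    bag_vector, weights = attention_pool(patches, score_weights=[1.0, 1.0])
    print("attention:", weights)
    print("bag:", bag_vector)
```

With an all-zero scoring vector every instance receives the same weight, so the pooling reduces to a plain mean over the bag; a learned scoring function lets a few discriminative patches dominate instead.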

arXiv:2301.01449 [pdf, other]

Building Coverage Estimation with Low-resolution Remote Sensing Imagery

Authors: Enci Liu, Chenlin Meng, Matthew Kolodner, Eun Jee Sung, Sihang Chen, Marshall Burke, David Lobell, Stefano Ermon

Abstract: Building coverage statistics provide crucial insights into the urbanization, infrastructure, and poverty level of a region, facilitating efforts towards alleviating poverty, building sustainable cities, and allocating infrastructure investments and public service provision. Global mapping of buildings has been made more efficient with the incorporation of deep learning models into the pipeline. However, these models typically rely on high-resolution satellite imagery, which is expensive to collect and infrequently updated. As a result, building coverage data are not updated in a timely manner, especially in developing regions where the built environment is changing quickly. In this paper, we propose a method for estimating building coverage using only publicly available low-resolution satellite imagery that is more frequently updated. We show that having a multi-node quantile regression layer greatly improves the model's spatial and temporal generalization. Our model achieves a coefficient of determination ($R^2$) as high as 0.968 on predicting building coverage in regions of different levels of development around the world. We demonstrate that the proposed model accurately predicts the building coverage from raw input images and generalizes well to unseen countries and continents, suggesting the possibility of estimating global building coverage using only low-resolution remote sensing data.

Submitted 4 January, 2023; v1 submitted 4 January, 2023; originally announced January 2023.
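
The abstract above reports a coefficient of determination ($R^2$) and refers to a quantile regression layer. The sketch below shows the standard definition of $R^2$ together with the pinball loss that is commonly used to train quantile regressors. Whether the paper's multi-node layer is trained with exactly this loss is an assumption; the function names and sample values are hypothetical.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float) -> float:
    """Mean pinball (quantile) loss for quantile level q in (0, 1).

    Under-prediction is penalised by q and over-prediction by (1 - q),
    so minimising it pushes predictions towards the q-th quantile.
    ASSUMPTION: this is the generic quantile loss, not a detail taken
    from the paper.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


if __name__ == "__main__":
    # Hypothetical building-coverage fractions, for illustration only.
    observed = [0.10, 0.35, 0.60, 0.80]
    predicted = [0.12, 0.30, 0.65, 0.78]
    print("R^2:", r_squared(observed, predicted))
    for level in (0.1, 0.5, 0.9):
        print("pinball loss at q =", level, ":", pinball_loss(observed, predicted, level))
```

Predicting several quantile levels at once gives a coarse picture of the predictive distribution rather than a single point estimate.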

arXiv:2211.16307 [pdf, other]

doi 10.1016/j.specom.2022.11.006

Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Authors: Nikolaos Ellinas, Myrsini Christidou, Alexandra Vioni, June Sig Sung, Aimilios Chalamandaris, Pirros Tsiakoulis, Paris Mastorocostas

Abstract: In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.

Submitted 29 November, 2022; originally announced November 2022.

Comments: Final published version available at: Speech Communication. arXiv admin note: substantial text overlap with arXiv:2111.10168

arXiv:2211.01327 [pdf, other]

Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis

Authors: Konstantinos Klapsas, Karolos Nikitaras, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract: A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal which are then modeled by a prior distribution during inference. In this paper, we compare different prior architectures on the task of predicting phoneme-level prosodic representations extracted with an unsupervised FVAE model. We use both subjective and objective metrics to show that normalizing flow-based prior networks can result in more expressive speech at the cost of a slight drop in quality. Furthermore, we show that the synthesized speech has higher variability, for a given text, due to the nature of normalizing flows. We also propose a Dynamical VAE model that can generate higher quality speech, although with decreased expressiveness and variability compared to the flow-based models.

Submitted 2 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2023

arXiv:2211.00523 [pdf, other]

Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis

Authors: Karolos Nikitaras, Konstantinos Klapsas, Nikolaos Ellinas, Georgia Maniati, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract: This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate at the corresponding level. We show that the fine-grained latent space also captures coarse-grained information, which is more evident as the dimension of latent space increases in order to capture diverse prosodic representations. Therefore, a trade-off arises between the diversity of the token-level and utterance-level representations and their disentanglement. We alleviate this issue by first capturing rich speech attributes into a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations in order to predict the phoneme-level posterior latents extracted during the previous step. Both qualitative and quantitative evaluations are used to demonstrate the effectiveness of the proposed approach. Audio samples are available on our demo page.

Submitted 1 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2023

arXiv:2211.00342 [pdf, other]

doi 10.1109/ICASSP49357.2023.10096255

Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features

Authors: Alexandra Vioni, Georgia Maniati, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract: Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet that use spectral features as input, and SSL-MOS that relies on a pretrained self-supervised learning model that directly uses the speech signal as input. In modern high-quality neural TTS systems, prosodic appropriateness with regard to the spoken content is a decisive factor for speech naturalness. For this reason, we propose to include prosodic and linguistic features as additional inputs in MOS prediction systems, and evaluate their impact on the prediction outcome. We consider phoneme-level F0 and duration features as prosodic inputs, as well as Tacotron encoder outputs, POS tags and BERT embeddings as higher-level linguistic inputs. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations. Results show that the proposed additional features are beneficial in the MOS prediction task, by improving the predicted MOS scores' correlation with the ground truths, both at utterance-level and system-level predictions.

Submitted 7 May, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: Proceedings of ICASSP 2023

arXiv:2210.17264

Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

Authors: Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Georgia Maniati, Panos Kakoulidis, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract: This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) which aims to preserve the target language's pronunciation regardless of the original speaker's language. The model used is based on a non-attentive Tacotron architecture, where the decoder has been replaced with a normalizing flow network conditioned on the speaker identity, allowing both TTS and voice conversion (VC) to be performed by the same model due to the inherent linguistic content and speaker identity disentanglement. When used in a cross-lingual setting, acoustic features are initially produced with a native speaker of the target language and then voice conversion is applied by the same model in order to convert these features to the target speaker's voice. We verify through objective and subjective evaluations that our method can have benefits compared to baseline cross-lingual synthesis. By including speakers averaging 7.5 minutes of speech, we also present positive results on low-resource scenarios.

Submitted 27 February, 2024; v1 submitted 31 October, 2022; originally announced October 2022.

Comments: Fundamental changes to the model described and experimental procedure

arXiv:2206.10878 [pdf, other]

Feature Re-calibration based Multiple Instance Learning for Whole Slide Image Classification

Authors: Philip Chikontwe, Soo Jeong Nam, Heounjeong Go, Meejeong Kim, Hyun Jung Sung, Sang Hyun Park

Abstract: Whole slide image (WSI) classification is a fundamental task for the diagnosis and treatment of diseases, but curation of accurate labels is time-consuming and limits the application of fully-supervised methods. To address this, multiple instance learning (MIL) is a popular method that poses classification as a weakly supervised learning task with slide-level labels only. While current MIL methods apply variants of the attention mechanism to re-weight instance features with stronger models, scant attention is paid to the properties of the data distribution. In this work, we propose to re-calibrate the distribution of a WSI bag (instances) by using the statistics of the max-instance (critical) feature. We assume that in binary MIL, positive bags have larger feature magnitudes than negatives, thus we can enforce the model to maximize the discrepancy between bags with a metric feature loss that models positive bags as out-of-distribution. To achieve this, unlike existing MIL methods that use single-batch training modes, we propose balanced-batch sampling to effectively use the feature loss, i.e., (+/-) bags simultaneously. Further, we employ a position encoding module (PEM) to model spatial/morphological information, and perform pooling by multi-head self-attention (PSMA) with a Transformer encoder. Experimental results on existing benchmark datasets show our approach is effective and improves over state-of-the-art MIL methods.

Submitted 21 July, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

Comments: MICCAI 2022

arXiv:2204.05070 [pdf, other]

doi 10.21437/Interspeech.2022-10765

Fine-grained Noise Control for Multispeaker Speech Synthesis

Authors: Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract: A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre from any residual factors, such as recording conditions and background noise. This paper proposes unsupervised, interpretable and fine-grained noise and prosody modeling. We incorporate adversarial training, representation bottleneck and utterance-to-frame modeling in order to learn frame-level noise representations. To the same end, we perform fine-grained prosody modeling via a Fully Hierarchical Variational AutoEncoder (FVAE) which additionally results in more expressive speech synthesis.

Submitted 27 October, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2204.04127 [pdf, other]

doi 10.21437/Interspeech.2022-10446

Karaoker: Alignment-free singing voice synthesis with speech training data

Authors: Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris

Abstract: Existing singing voice synthesis models (SVS) are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthesizes singing voice and transfers style following a multi-dimensional template extracted from a source waveform of an unseen singer/speaker. The model is jointly conditioned with a single deep convolutional encoder on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. In addition to multitasking, we also employ a Wasserstein GAN training scheme as well as new losses on the acoustic model's output to further refine the quality of the model.

Submitted 31 August, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2204.03421 [pdf, ps, other]

Self-supervised learning for robust voice cloning

Authors: Konstantinos Klapsas, Nikolaos Ellinas, Karolos Nikitaras, Georgios Vamvoukakis, Panos Kakoulidis, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract: Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to aid the resulting features to capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without utilizing additional speaker features. This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice. Subjective and objective evaluations are used to validate the proposed model, as well as the robustness to the acoustic conditions of the target utterance.

Submitted 2 November, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2204.03040 [pdf, other]

doi 10.21437/Interspeech.2022-10922

SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

Authors: Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract: In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset which is a common benchmark for building neural acoustic models and vocoders. Utterances are generated from 200 TTS systems including vanilla neural acoustic models as well as models which allow prosodic variations. An LPCNet vocoder is used for all systems, so that the samples' variation depends only on the acoustic models. The synthesized utterances provide balanced and adequate domain and length coverage. We collect MOS naturalness evaluations on 3 English Amazon Mechanical Turk locales and share practices leading to reliable crowdsourced annotations for this task. We provide baseline results of state-of-the-art MOS prediction models on the SOMOS dataset and show the limitations that such models face when assigned to evaluate TTS utterances.

Submitted 24 August, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2203.14416 [pdf, other]

Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge

Authors: Sangjun Park, Kihyun Choo, Joohyung Lee, Anton V. Porov, Konstantin Osipov, June Sig Sung

Abstract: Text-to-Speech (TTS) services that run on edge devices have many advantages compared to cloud TTS, e.g., latency and privacy issues. However, neural vocoders with a low complexity and small model footprint inevitably generate annoying sounds. This study proposes a Bunched LPCNet2, an improved LPCNet architecture that provides highly efficient performance in high-quality for cloud servers and in a low-complexity for low-resource edge devices. Single logistic distribution achieves computational efficiency, and insightful tricks reduce the model footprint while maintaining speech quality. A DualRate architecture, which generates a lower sampling rate from a prosody model, is also proposed to reduce maintenance costs. The experiments demonstrate that Bunched LPCNet2 generates satisfactory speech quality with a model footprint of 1.1MB while operating faster than real-time on a RPi 3B. Our audio samples are available at https://srtts.github.io/bunchedLPCNet2.

Submitted 30 June, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

Comments: Interspeech 2022

arXiv:2111.10177 [pdf, other]

doi 10.1109/ICASSP39728.2021.9413604

Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis

Authors: Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract: This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.

Submitted 19 November, 2021; originally announced November 2021.

Comments: Proceedings of ICASSP 2021

arXiv:2111.10173 [pdf, other]

doi 10.1007/978-3-030-87802-3_31

Word-Level Style Control for Expressive, Non-attentive Speech Synthesis

Authors: Konstantinos Klapsas, Nikolaos Ellinas, June Sig Sung, Hyoungmin Park, Spyros Raptis

Abstract: This paper presents an expressive speech synthesis architecture for modeling and controlling the speaking style at a word level. It attempts to learn word-level stylistic and prosodic representations of the speech data, with the aid of two encoders. The first one models style by finding a combination of style tokens for each word given the acoustic features, and the second outputs a word-level sequence conditioned only on the phonetic information in order to disentangle it from the style information. The two encoder outputs are aligned and concatenated with the phoneme encoder outputs and then decoded with a Non-Attentive Tacotron model. An extra prior encoder is used to predict the style tokens autoregressively, in order for the model to be able to run without a reference utterance. We find that the resulting model gives both word-level and global control over the style, as well as prosody transfer capabilities.

Submitted 19 November, 2021; originally announced November 2021.

Comments: Proceedings of SPECOM 2021

arXiv:2111.10168 [pdf, other]

doi 10.1007/978-3-030-87802-3_11

Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

Authors: Myrsini Christidou, Alexandra Vioni, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Panos Kakoulidis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract: This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup, which is based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control range and coverage. More specifically we employ data augmentation, F0 normalization, balanced clustering for duration, and speaker-independent prosodic clustering. These modifications enable fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. The model is also fine-tuned to unseen speakers with limited amounts of data and it is shown to maintain its prosody control capabilities, verifying that the speaker-independent prosodic clustering is effective. Experimental results verify that the model maintains high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.

Submitted 19 November, 2021; originally announced November 2021.

Comments: Proceedings of SPECOM 2021

arXiv:2111.09146 [pdf, other]

doi 10.21437/SSW.2021-21

Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

Authors: Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, Georgia Maniati, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis, Aimilios Chalamandaris

Abstract: In this paper, a text-to-rapping/singing system is introduced, which can be adapted to any speaker's voice. It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data and which provides prosody control at the phoneme level. Dataset augmentation and additional prosody manipulation based on traditional DSP algorithms are also investigated. The neural TTS model is fine-tuned to an unseen speaker's limited recordings, allowing rapping/singing synthesis with the target speaker's voice. The detailed pipeline of the system is described, which includes the extraction of the target pitch and duration values from an a capella song and their conversion into target speaker's valid range of notes before synthesis. An additional stage of prosodic manipulation of the output via WSOLA is also investigated for better matching the target duration values. The synthesized utterances can be mixed with an instrumental accompaniment track to produce a complete song. The proposed system is evaluated via subjective listening tests as well as in comparison to an available alternate system which also aims to produce synthetic singing voice from read-only training data. Results show that the proposed approach can produce high quality rapping/singing voice with increased naturalness.

Submitted 17 November, 2021; originally announced November 2021.

Comments: Proceedings of 11th ISCA Speech Synthesis Workshop (SSW 11)

arXiv:2111.09075 [pdf, ps, other]

doi 10.21437/Interspeech.2021-327

Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

Authors: Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract: The idea of using phonological features instead of phonemes as input to sequence-to-sequence TTS has been recently proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless uttering of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages, with the goal of achieving cross-lingual speaker adaptation. We first experiment with the effect of language phonological similarity on cross-lingual TTS of several source-target language combinations. Subsequently, we fine-tune the model with very limited data of a new speaker's voice in either a seen or an unseen language, and achieve synthetic speech of equal quality, while preserving the target speaker's identity. With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature. In the extreme case of only 2 available adaptation utterances, we find that our model behaves as a few-shot learner, as the performance is similar in both the seen and unseen adaptation language scenarios.

Submitted 17 November, 2021; originally announced November 2021.

Comments: Proceedings of INTERSPEECH 2021

arXiv:2111.09052 [pdf, other]

doi 10.21437/Interspeech.2020-2464

High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

Authors: Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis

Abstract: This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by using a recently proposed purely location-based attention mechanism, suitable for arbitrary sentence length generation. During inference, the decoder is unrolled and acoustic feature generation is performed in a streaming manner, allowing for a nearly constant latency which is independent from the sentence length. Experimental results show that the acoustic model can produce feature sequences with minimal latency about 31 times faster than real-time on a computer CPU and 6.5 times on a mobile CPU, enabling it to meet the conditions required for real-time applications on both devices. The full end-to-end system can generate almost natural quality speech, which is verified by listening tests.

Submitted 17 November, 2021; originally announced November 2021.

Comments: Proceedings of INTERSPEECH 2020

arXiv:2111.07072 [pdf, other]

Factorial Convolution Neural Networks

Authors: Jaemo Sung, Eun-Sung Jung

Abstract: In recent years, GoogleNet has garnered substantial attention as one of the base convolutional neural networks (CNNs) to extract visual features for object detection. However, it experiences challenges of contaminated deep features when concatenating elements with different properties. Also, since GoogleNet is not an entirely lightweight CNN, it still has many execution overheads to apply to a resource-starved application domain. Therefore, a new CNN, FactorNet, has been proposed to overcome these functional challenges. The FactorNet CNN is composed of multiple independent sub CNNs to encode different aspects of the deep visual features and has far fewer execution overheads in terms of weight parameters and floating-point operations. Incorporating FactorNet into the Faster-RCNN framework proved that FactorNet gives better accuracy at a minimum and produces additional speedup over GoogleNet throughout the KITTI object detection benchmark data set in a real-time object detection system.

Submitted 13 November, 2021; originally announced November 2021.

arXiv:2104.12845 [pdf, other]

Multi-Output Random Forest Regression to Emulate the Earliest Stages of Planet Formation

Authors: Kevin Hoffman, Jae Yoon Sung, André Zazzera

Abstract: In the current paradigm of planet formation research, it is believed that the first step to forming massive bodies (such as asteroids and planets) requires that small interstellar dust grains floating through space collide with each other and grow to larger sizes. The initial formation of these pebbles is governed by an integro-differential equation known as the Smoluchowski coagulation equation, to which analytical solutions are intractable for all but the simplest possible scenarios. While brute-force methods of approximation have been developed, they are computationally costly, currently making it infeasible to simulate this process including other physical processes relevant to planet formation, and across the very large range of scales on which it occurs. In this paper, we take a machine learning approach to designing a system for a much faster approximation. We develop a multi-output random forest regression model trained on brute-force simulation data to approximate distributions of dust particle sizes in protoplanetary disks at different points in time. The performance of our random forest model is measured against the existing brute-force models, which are the standard for realistic simulations. Results indicate that the random forest model can generate highly accurate predictions relative to the brute-force simulation results, with an $R^{2}$ of 0.97, and do so significantly faster than brute-force methods.

Submitted 26 April, 2021; originally announced April 2021.

arXiv:2103.14776 [pdf, other]

doi 10.1109/TASLP.2021.3129353

Scalable and Efficient Neural Speech Coding: A Hybrid Design

Authors: Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beak, Minje Kim

Abstract: We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as a neural waveform codec (NWC) during its feedforward routine. The proposed NWC also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model components to NWC, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models are with a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where each NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can scale down to cover lower bitrates as well, for which it employs linear predictive coding (LPC) module as its first autoencoder. The hybrid design integrates LPC and NWC by redefining LPC's quantization as a differentiable process, making the system training an end-to-end manner. The decoder of proposed system is with either one NWC (0.12 million parameters) in low to medium bitrate ranges (12 to 20 kbps) or two NWCs in the high bitrate (32 kbps). Although the decoding complexity is not yet as low as that of conventional speech codecs, it is significantly reduced from that of other neural speech coders, such as a WaveNet-based vocoder. For wide-band speech coding quality, our system yields comparable or superior performance to AMR-WB and Opus on TIMIT test utterances at low and medium bitrates. The proposed system can scale up to higher bitrates to achieve near transparent performance.

Submitted 27 November, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP), 2021 (Accepted for publication)

arXiv:2102.03985 [pdf]

Multisource AI Scorecard Table for System Evaluation

Authors: Erik Blasch, James Sung, Tao Nguyen

Abstract: The paper describes a Multisource AI Scorecard Table (MAST) that provides the developer and user of an artificial intelligence (AI)/machine learning (ML) system with a standard checklist focused on the principles of good analysis adopted by the intelligence community (IC) to help promote the development of more understandable systems and engender trust in AI outputs. Such a scorecard enables a transparent, consistent, and meaningful understanding of AI tools applied for commercial and government use. A standard is built on compliance and agreement through policy, which requires buy-in from the stakeholders. While consistency for testing might only exist across a standard data set, the community requires discussion on verification and validation approaches which can lead to interpretability, explainability, and proper use. The paper explores how the analytic tradecraft standards outlined in Intelligence Community Directive (ICD) 203 can provide a framework for assessing the performance of an AI system supporting various operational needs. These include sourcing, uncertainty, consistency, accuracy, and visualization. Three use cases are presented as notional examples that support security for comparative analysis.

Submitted 7 February, 2021; originally announced February 2021.

Comments: Presented at AAAI FSS-20: Artificial Intelligence in Government and Public Sector, Washington, DC, USA

arXiv:2101.00054 [pdf, other]

doi 10.1109/LSP.2020.3039765

Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding

Authors: Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, Minje Kim

Abstract: Conventional audio coding technologies commonly leverage human perception of sound, or psychoacoustics, to reduce the bitrate while preserving the perceptual quality of the decoded audio signals. For neural audio codecs, however, the objective nature of the loss function usually leads to suboptimal sound quality as well as high run-time complexity due to the large model size. In this work, we present a psychoacoustic calibration scheme to re-define the loss functions of neural audio coding systems so that it can decode signals more perceptually similar to the reference, yet with a much lower model complexity. The proposed loss function incorporates the global masking threshold, allowing the reconstruction error that corresponds to inaudible artifacts. Experimental results show that the proposed model outperforms the baseline neural codec twice as large and consuming 23.4% more bits per second. With the proposed method, a lightweight neural codec, with only 0.9 million parameters, performs near-transparent audio coding comparable with the commercial MPEG-1 Audio Layer III codec at 112 kbps.

Submitted 31 December, 2020; originally announced January 2021.

Journal ref: IEEE Signal Processing Letters, vol. 27, pp. 2159-2163, 2020

arXiv:2005.00919 [pdf, other]

Compressed-Sensing based Beam Detection in 5G NR Initial Access

Authors: Junmo Sung, Brian L. Evans

Abstract: To support millimeter wave (mmWave) frequency bands in cellular communications, both the base station and the mobile platform utilize large antenna arrays to steer narrow beams towards each other to compensate for the path loss and improve communication performance. The time-frequency resource allocated for initial access, however, is limited, which gives rise to the need for efficient approaches for beam detection. For hybrid analog-digital beamforming (HB) architectures, which are used to reduce power consumption, we propose a compressed sensing (CS) based approach for 5G initial access beam detection that is for a HB architecture and that is compliant with the 3GPP standard. The CS-based approach is compared with the exhaustive search in terms of beam detection accuracy and is shown by simulation to outperform it. Up to 256 antennas are considered, and the importance of a careful codebook design is reaffirmed.

Submitted 2 May, 2020; originally announced May 2020.

Comments: 5 pages, 6 figures, SPAWC2020

arXiv:2002.05604 [pdf, other]

Efficient And Scalable Neural Residual Waveform Coding With Collaborative Quantization

Authors: Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, Minje Kim

Abstract: Scalability and efficiency are desired in neural speech codecs, which support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC into a neural network, but bridges the computational capacity of advanced neural network models and traditional, yet efficient and domain-specific digital signal processing methods in an integrated manner. We demonstrate that CQ achieves much higher quality than its predecessor at 9 kbps with even lower model complexity. We also show that CQ can scale up to 24 kbps where it outperforms AMR-WB and Opus. As a neural waveform codec, CQ models have fewer than 1 million parameters, significantly fewer than many other generative models.

Submitted 13 February, 2020; originally announced February 2020.

Comments: Accepted in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 4-8, 2020

arXiv:1911.05727 [pdf]

Artificial Intelligence Strategies for National Security and Safety Standards

Authors: Erik Blasch, James Sung, Tao Nguyen, Chandra P. Daniel, Alisa P. Mason

Abstract: Recent advances in artificial intelligence (AI) have led to an explosion of multimedia applications (e.g., computer vision (CV) and natural language processing (NLP)) for different domains such as commercial, industrial, and intelligence. In particular, the use of AI applications in a national security environment is often problematic because the opaque nature of the systems leads to an inability for a human to understand how the results came about. A reliance on 'black boxes' to generate predictions and inform decisions is potentially disastrous. This paper explores how the application of standards during each stage of the development of an AI system deployed and used in a national security environment would help enable trust. Specifically, we focus on the standards outlined in Intelligence Community Directive 203 (Analytic Standards) to subject machine outputs to the same rigorous standards as analysis performed by humans.

Submitted 3 November, 2019; originally announced November 2019.

Comments: Presented at AAAI FSS-19: Artificial Intelligence in Government and Public Sector, Arlington, Virginia, USA

arXiv:1909.09861 [pdf, other]

Hybrid Beamformer Codebook Design and Ordering for Compressive mmWave Channel Estimation

Authors: Junmo Sung, Brian L. Evans

Abstract: In millimeter wave (mmWave) communication systems, beamforming with large antenna arrays is critical to overcome high path losses. Separating all-digital beamforming into analog and digital stages can provide the large reduction in power consumption and small loss in spectral efficiency needed for practical implementations. Developing algorithms with this favorable tradeoff is challenging due to the additional degrees of freedom in the analog stage and its accompanying hardware constraints. In hybrid beamforming systems, for example, channel estimation algorithms do not directly observe the channels, face a high channel count, and operate at low SNR before transmit-receive beam alignment. Since mmWave channels are sparse in time and beam domains, many compressed sensing (CS) channel estimation algorithms have been developed that randomly configure the analog beamformers, digital beamformers, and/or pilot symbols. In this paper, we propose to design deterministic beamformers and pilot symbols for open-loop channel estimation. We use CS approaches that rely on low coherence for their recovery guarantees, and hence seek to minimize the mutual coherence of the compressed sensing matrix. We also propose a precoder column ordering to design the pilot symbols. Simulation results show that our beamformer designs reduce channel estimation error over competing methods.

Submitted 21 September, 2019; originally announced September 2019.

arXiv:1909.09858 [pdf, other]

Versatile Compressive mmWave Hybrid Beamformer Codebook Design Framework

Authors: Junmo Sung, Brian L. Evans

Abstract: Hybrid beamforming (HB) architectures are attractive for wireless communication systems with large antenna arrays because the analog beamforming stage can significantly reduce the number of RF transceivers and hence power consumption. In HB systems, channel estimation (CE) becomes challenging due to indirect access by the baseband processing to the communication channels and due to low SNR before beam alignment. Compressed sensing (CS) based algorithms have been adopted to address these challenges by leveraging the sparse nature of millimeter wave multi-input multi-output (mmWave MIMO) channels. In many CS algorithms for narrowband CE, the hybrid beamformers are randomly configured which does not always yield the low-coherence sensing matrices desirable for those CS algorithms whose recovery guarantees rely on coherence. In this paper, we propose a versatile deterministic HB codebook design framework for CS algorithms with coherence-based recovery guarantees to enhance CE accuracy. Simulation results show that the proposed design can obtain lower channel estimation error and higher spectral efficiency compared with random codebook for phase-shifter-, switch-, and lens-based HB architectures.

Submitted 21 September, 2019; originally announced September 2019.

arXiv:1907.05415 [pdf, other]

Learning to learn with quantum neural networks via classical neural networks

Authors: Guillaume Verdon, Michael Broughton, Jarrod R. McClean, Kevin J. Sung, Ryan Babbush, Zhang Jiang, Hartmut Neven, Masoud Mohseni

Abstract: Quantum Neural Networks (QNNs) are a promising variational learning paradigm with applications to near-term quantum processors; however, they still face some significant challenges. One such challenge is finding good parameter initialization heuristics that ensure rapid and consistent convergence to local minima of the parameterized quantum circuit landscape. In this work, we train classical neural networks to assist in the quantum learning process, also known as meta-learning, to rapidly find approximate optima in the parameter landscape for several classes of quantum variational algorithms. Specifically, we train classical recurrent neural networks to find approximately optimal parameters within a small number of queries of the cost function for the Quantum Approximate Optimization Algorithm (QAOA) for MaxCut, QAOA for Sherrington-Kirkpatrick Ising model, and for a Variational Quantum Eigensolver for the Hubbard model. By initializing other optimizers at parameter values suggested by the classical neural network, we demonstrate a significant improvement in the total number of optimization iterations required to reach a given accuracy. We further demonstrate that the optimization strategies learned by the neural network generalize well across a range of problem instance sizes. This opens up the possibility of training on small, classically simulatable problem instances, in order to initialize larger, classically intractably simulatable problem instances on quantum devices, thereby significantly reducing the number of required quantum-classical optimization iterations.

Submitted 11 July, 2019; originally announced July 2019.

Comments: 12 pages, 4 figures

arXiv:1907.00482 [pdf, other]

Base Station Antenna Selection for Low-Resolution ADC Systems

Authors: Jinseok Choi, Junmo Sung, Narayan Prasad, Xiao-Feng Qi, Brian L. Evans, Alan Gatherer

Abstract: This paper investigates antenna selection at a base station with large antenna arrays and low-resolution analog-to-digital converters. For downlink transmit antenna selection for narrowband channels, we show (1) a selection criterion that maximizes sum rate with zero-forcing precoding equivalent to that of a perfect quantization system; (2) maximum sum rate increases with number of selected antennas; (3) derivation of the sum rate loss function from using a subset of antennas; and (4) unlike high-resolution converter systems, sum rate loss reaches a maximum at a point of total transmit power and decreases beyond that point to converge to zero. For wideband orthogonal-frequency-division-multiplexing (OFDM) systems, our results hold when entire subcarriers share a common subset of antennas. For uplink receive antenna selection for narrowband channels, we (1) generalize a greedy antenna selection criterion to capture tradeoffs between channel gain and quantization error; (2) propose a quantization-aware fast antenna selection algorithm using the criterion; and (3) derive a lower bound on sum rate achieved by the proposed algorithm based on submodular functions. For wideband OFDM systems, we extend our algorithm and derive a lower bound on its sum rate. Simulation results validate theoretical analyses and show increases in sum rate over conventional algorithms.

Submitted 30 June, 2019; originally announced July 2019.

Comments: Submitted to IEEE Transactions on Communications

arXiv:1906.07769 [pdf, other]

Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding

Authors: Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim

Abstract: Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier with each module reconstructing the residual from its preceding modules. CMRL differs from other DNN-based speech codecs, in that rather than modeling speech compression problem in a single large neural network, it optimizes a series of less-complicated modules in a two-phase training scheme. The proposed method shows better objective performance than AMR-WB and the state-of-the-art DNN-based speech codec with a similar network architecture. As an end-to-end model, it takes raw PCM signals as an input, but is also compatible with linear predictive coding (LPC), showing better subjective quality at high bitrates than AMR-WB and OPUS. The gain is achieved by using only 0.9 million trainable parameters, a significantly less complex architecture than the other DNN-based codecs in the literature.

Submitted 13 September, 2019; v1 submitted 18 June, 2019; originally announced June 2019.

Comments: Accepted for publication in INTERSPEECH 2019

Journal ref: Published in Interspeech 2019

arXiv:1805.02838 [pdf, other]

A Memory Network Approach for Story-based Temporal Summarization of 360° Videos

Authors: Sangho Lee, Jinyoung Sung, Youngjae Yu, Gunhee Kim

Abstract: We address the problem of story-based temporal summarization of long 360° videos. We propose a novel memory network model named Past-Future Memory Network (PFMN), in which we first compute the scores of 81 normal field of view (NFOV) region proposals cropped from the input 360° video, and then recover a latent, collective summary using the network with two external memories that store the embeddings of previously selected subshots and future candidate subshots. Our major contributions are two-fold. First, our work is the first to address story-based temporal summarization of 360° videos. Second, our model is the first attempt to leverage memory networks for video summarization tasks. For evaluation, we perform three sets of experiments. First, we investigate the view selection capability of our model on the Pano2Vid dataset. Second, we evaluate the temporal summarization with a newly collected 360° video dataset. Finally, we evaluate our model's performance in another domain, with the image-based storytelling VIST dataset. We verify that our model achieves state-of-the-art performance on all the tasks.

Submitted 18 June, 2018; v1 submitted 8 May, 2018; originally announced May 2018.

Comments: Accepted paper at CVPR 2018

arXiv:1801.09846 [pdf, other]

Antenna Selection for Large-Scale MIMO Systems with Low-Resolution ADCs

Authors: Jinseok Choi, Junmo Sung, Brian L. Evans, Alan Gatherer

Abstract: One way to reduce the power consumption in large-scale multiple-input multiple-output (MIMO) systems is to employ low-resolution analog-to-digital converters (ADCs). In this paper, we investigate antenna selection for large-scale MIMO receivers with low-resolution ADCs, thereby providing more flexibility in resolution and number of ADCs. To incorporate quantization effects, we generalize an existing objective function for a greedy capacity-maximization antenna selection approach. The derived objective function offers an opportunity to select an antenna with the best tradeoff between the additional channel gain and increase in quantization error. Using the generalized objective function, we propose an antenna selection algorithm based on a conventional antenna selection algorithm without an increase in overall complexity. Simulation results show that the proposed algorithm outperforms the conventional algorithm in achievable capacity for the same number of antennas.

Submitted 20 April, 2019; v1 submitted 29 January, 2018; originally announced January 2018.

Comments: IEEE International Conference on Acoustics, Speech and Signal Processing 2018

arXiv:1801.09774 [pdf, other]

On Psychoacoustically Weighted Cost Functions Towards Resource-Efficient Deep Neural Networks for Speech Denoising

Authors: Kai Zhen, Aswin Sivaraman, Jongmo Sung, Minje Kim

Abstract: We present a psychoacoustically enhanced cost function to balance network complexity and perceptual performance of deep neural networks for speech denoising. While training the network, we utilize perceptual weights added to the ordinary mean-squared error to emphasize contribution from frequency bins which are most audible while ignoring error from inaudible bins. To generate the weights, we employ psychoacoustic models to compute the global masking threshold from the clean speech spectra. We then evaluate the speech denoising performance of our perceptually guided neural network by using both objective and perceptual sound quality metrics, testing on various network structures ranging from shallow and narrow ones to deep and wide ones. The experimental results showcase our method as a valid approach for infusing perceptual significance to deep neural network operations. In particular, the more perceptually sensible enhancement in performance seen by simple neural network topologies proves that the proposed method can lead to resource-efficient speech denoising implementations in small devices without degrading the perceived signal fidelity.

Submitted 29 January, 2018; originally announced January 2018.

Comments: 5 pages, 4 figures

arXiv:1712.02018 [pdf, other]

ADC Bit Optimization for Spectrum- and Energy-Efficient Millimeter Wave Communications

Authors: Jinseok Choi, Junmo Sung, Brian L. Evans, Alan Gatherer

Abstract: A spectrum- and energy-efficient system is essential for millimeter wave communication systems that require large antenna arrays with power-demanding ADCs. We propose an ADC bit allocation (BA) algorithm that solves a minimum mean squared quantization error problem under a power constraint. Unlike existing BA methods that only consider an ADC power constraint, the proposed algorithm regards total receiver power constraint for a hybrid analog-digital beamforming architecture. The major challenge is the non-linearities in the minimization problem. To address this issue, we first convert the problem into a convex optimization problem through real number relaxation and substitution of ADC resolution switching power with constant average switching power. Then, we derive a closed-form solution by fixing the number of activated radio frequency (RF) chains M. Leveraging the solution, the binary search finds the optimal M and its corresponding optimal solution. We also provide an off-line training and modeling approach to estimate the average switching power. Simulation results validate the spectral and energy efficiency of the proposed algorithm. In particular, existing state-of-the-art digital beamformers can be used in the system in conjunction with the BA algorithm as it makes the quantization error negligible in the low-resolution regime.

Submitted 5 December, 2017; originally announced December 2017.

Comments: Accepted to Globecom 2017 Singapore

arXiv:1710.10673 [pdf, other]

Narrowband Channel Estimation for Hybrid Beamforming Millimeter Wave Communication Systems with One-bit Quantization

Authors: Junmo Sung, Jinseok Choi, Brian L. Evans

Abstract: Millimeter wave (mmWave) spectrum has drawn attention due to its tremendous available bandwidth. The high propagation losses in the mmWave bands necessitate beamforming with a large number of antennas. Traditionally each antenna is paired with a high-speed analog-to-digital converter (ADC), which results in high power consumption. A hybrid beamforming architecture and one-bit resolution ADCs have been proposed to reduce power consumption. However, analog beamforming and one-bit quantization make channel estimation more challenging. In this paper, we propose a narrowband channel estimation algorithm for mmWave communication systems with one-bit ADCs and hybrid beamforming based on generalized approximate message passing (GAMP). We show through simulation that 1) GAMP variants with one-bit ADCs have better performance than do least-squares estimation methods without quantization, 2) the proposed one-bit GAMP algorithm achieves the lowest estimation error among the GAMP variants, and 3) exploiting more frames and RF chains enhances the channel estimation performance.

Submitted 29 May, 2018; v1 submitted 29 October, 2017; originally announced October 2017.

Comments: 5 pages, 5 figures, accepted to ICASSP 2018

arXiv:1710.10669 [pdf, other]

Wideband Channel Estimation for Hybrid Beamforming Millimeter Wave Communication Systems with Low-Resolution ADCs

Authors: Junmo Sung, Jinseok Choi, Brian L. Evans

Abstract: A potential tremendous spectrum resource makes millimeter wave (mmWave) communications a promising technology. High power consumption due to a large number of antennas and analog-to-digital converters (ADCs) for beamforming to overcome the large propagation losses is problematic in practice. As a hybrid beamforming architecture and low-resolution ADCs are considered to reduce power consumption, estimation of mmWave channels becomes challenging. We evaluate several channel estimation algorithms for wideband mmWave systems with hybrid beamforming and low-resolution ADCs. Through simulation, we show that 1) infinite bit ADCs with least-squares estimation have worse channel estimation performance than do one-bit ADCs with orthogonal matching pursuit (OMP) in an SNR range of interest, 2) three- and four-bit quantizers can achieve channel estimation performance close to the unquantized case when using OMP, 3) a receiver with a single RF chain can yield better estimates than that with four RF chains if enough frames are exploited, and 4) for one-bit ADCs, exploitation of higher transmit power and more frames for performance enhancement adversely affects estimation performance after a certain point.

Submitted 29 October, 2017; originally announced October 2017.

Comments: 6 pages, 8 figures, submitted to ICC 2018

arXiv:1707.06519 [pdf, other]

Language Transfer of Audio Word2Vec: Learning Audio Segment Representations without Target Language Data

Authors: Chia-Hao Shen, Janet Y. Sung, Hung-Yi Lee

Abstract: Audio Word2Vec offers vector representations of fixed dimensionality for variable-length audio segments using Sequence-to-sequence Autoencoder (SA). These vector representations are shown to describe the sequential phonetic structures of the audio segments to a good degree, with real world applications such as query-by-example Spoken Term Detection (STD). This paper examines the capability of language transfer of Audio Word2Vec. We train SA from one language (source language) and use it to extract the vector representation of the audio segments of another language (target language). We found that SA can still catch phonetic structure from the audio segments of the target language if the source and target languages are similar. In query-by-example STD, we obtain the vector representations from the SA learned from a large amount of source language data, and found that they surpass the representations from naive encoder and SA directly learned from a small amount of target language data. The result shows that it is possible to learn Audio Word2Vec model from high-resource languages and use it on low-resource languages. This further expands the usability of Audio Word2Vec.

Submitted 19 July, 2017; originally announced July 2017.

Comments: arXiv admin note: text overlap with arXiv:1603.00982

arXiv:1705.06243 [pdf, other]

Learning to Represent Haptic Feedback for Partially-Observable Tasks

Authors: Jaeyong Sung, J. Kenneth Salisbury, Ashutosh Saxena

Abstract: The sense of touch, being the earliest sensory system to develop in a human body [1], plays a critical part in our daily interaction with the environment. In order to successfully complete a task, many manipulation interactions require incorporating haptic feedback. However, manually designing a feedback mechanism can be extremely challenging. In this work, we consider manipulation tasks that need to incorporate tactile sensor feedback in order to modify a provided nominal plan. To incorporate partial observation, we present a new framework that models the task as a partially observable Markov decision process (POMDP) and learns an appropriate representation of haptic feedback which can serve as the state for a POMDP model. The model, which is parametrized by deep recurrent neural networks, utilizes variational Bayes methods to optimize the approximate posterior. Finally, we build on deep Q-learning to be able to select the optimal action in each state without access to a simulator. We test our model on a PR2 robot for multiple tasks of turning a knob until it clicks.

Submitted 17 May, 2017; originally announced May 2017.

Comments: IEEE International Conference on Robotics and Automation (ICRA), 2017

arXiv:1609.04135 [pdf, other]

Real-time testbed for diversity in powerline and wireless smart grid communications

Authors: Junmo Sung, Brian L. Evans

Abstract: Two-way communication is a key feature in a smart grid. It is enabled by either powerline communication or wireless communication technologies. Utilizing both technologies can potentially enhance communication reliability, and many diversity combining schemes have been proposed. In this paper, we propose a flexible real-time testbed to evaluate diversity combining schemes over physical channels. The testbed provides essential parts of physical layers on which both powerline and wireless communications operate. The contributions of this paper are 1) design and implementation of a real-time testbed for diversity of simultaneous powerline and wireless communications, 2) release of the setup information and complete source code for the testbed, and 3) performance evaluation of maximal ratio combining (MRC) on the testbed. As initial results, we show that performance of MRC from measurements obtained on the testbed over physical channels is very close to that in simulations in various test cases under a controlled laboratory environment.

Submitted 29 October, 2017; v1 submitted 14 September, 2016; originally announced September 2016.

Comments: 6 pages, 5 figures, submitted to ICC 2018

arXiv:1601.02705 [pdf, other]

Robobarista: Learning to Manipulate Novel Objects via Deep Multimodal Embedding

Authors: Jaeyong Sung, Seok Hyun Jin, Ian Lenz, Ashutosh Saxena

Abstract: There is a large variety of objects and appliances in human environments, such as stoves, coffee dispensers, juice extractors, and so on. It is challenging for a roboticist to program a robot for each of these object types and for each of their instantiations. In this work, we present a novel approach to manipulation planning based on the idea that many household objects share similarly-operated object parts. We formulate the manipulation planning as a structured prediction problem and learn to transfer manipulation strategy across different objects by embedding point-cloud, natural language, and manipulation trajectory data into a shared embedding space using a deep neural network. In order to learn semantically meaningful spaces throughout our network, we introduce a method for pre-training its lower layers for multimodal feature embedding and a method for fine-tuning this embedding space using a loss-based margin. In order to collect a large number of manipulation demonstrations for different objects, we develop a new crowd-sourcing platform called Robobarista. We test our model on our dataset consisting of 116 objects and appliances with 249 parts along with 250 language instructions, for which there are 1225 crowd-sourced manipulation demonstrations. We further show that our robot with our model can even prepare a cup of latte with appliances it has never seen before.

Submitted 11 January, 2016; originally announced January 2016.

Comments: Journal Version

arXiv:1509.07831

Deep Multimodal Embedding: Manipulating Novel Objects with Point-clouds, Language and Trajectories

Authors: Jaeyong Sung, Ian Lenz, Ashutosh Saxena

Abstract: A robot operating in a real-world environment needs to perform reasoning over a variety of sensor modalities such as vision, language and motion trajectories. However, it is extremely challenging to manually design features relating such disparate modalities. In this work, we introduce an algorithm that learns to embed point-cloud, natural language, and manipulation trajectory data into a shared embedding space with a deep neural network. To learn semantically meaningful spaces throughout our network, we use a loss-based margin to bring embeddings of relevant pairs closer together while driving less-relevant cases from different modalities further apart. We use this both to pre-train its lower layers and fine-tune our final embedding space, leading to a more robust representation. We test our algorithm on the task of manipulating novel objects and appliances based on prior experience with other objects. On a large dataset, we achieve significant improvements in both accuracy and inference time over the previous state of the art. We also perform end-to-end experiments on a PR2 robot utilizing our learned embedding space.

Submitted 17 May, 2017; v1 submitted 25 September, 2015; originally announced September 2015.

Comments: IEEE International Conference on Robotics and Automation (ICRA), 2017
