-
Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment
Authors:
Heejin Do,
Wonjun Lee,
Gary Geunbae Lee
Abstract:
In automated pronunciation assessment, recent emphasis progressively lies on evaluating multiple aspects to provide enriched feedback. However, acquiring multi-aspect-score labeled data for non-native language learners' speech poses challenges; moreover, it often leads to score-imbalanced distributions. In this paper, we propose two Acoustic Feature Mixup strategies, linearly and non-linearly interpolating with the in-batch averaged feature, to address data scarcity and score-label imbalances. Primarily using goodness-of-pronunciation as an acoustic feature, we tailor mixup designs to suit pronunciation assessment. Further, we integrate fine-grained error-rate features by comparing speech recognition results with the original answer phonemes, giving direct hints for mispronunciation. Effective mixing of the acoustic features notably enhances overall scoring performances on the speechocean762 dataset, and detailed analysis highlights our potential to predict unseen distortions.
Submitted 21 June, 2024;
originally announced June 2024.
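A minimal sketch of the linear variant described in the abstract above, assuming each utterance is represented by a fixed-length feature vector (for example, goodness-of-pronunciation values) and that the mixing weight `lam` is supplied by the caller. The function names, the plain-list representation, and the choice to leave score labels untouched are illustrative assumptions, not the authors' implementation; the non-linear variant is not shown.

```python
def batch_mean(features):
    """Element-wise mean over a batch of equal-length feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(row[j] for row in features) / n for j in range(dim)]


def linear_mixup(features, lam):
    """Interpolate every feature vector with the in-batch averaged feature.

    lam = 1.0 returns the original features; lam = 0.0 returns the batch mean.
    """
    mean = batch_mean(features)
    return [
        [lam * x + (1.0 - lam) * m for x, m in zip(row, mean)]
        for row in features
    ]


# Hypothetical two-utterance batch with two-dimensional features.
batch = [[0.0, 2.0], [2.0, 4.0]]
mixed = linear_mixup(batch, lam=0.5)
```

With `lam=0.5`, each vector moves halfway toward the batch mean `[1.0, 3.0]`.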
-
CONMOD: Controllable Neural Frame-based Modulation Effects
Authors:
Gyubin Lee,
Hounsu Kim,
Junwon Lee,
Juhan Nam
Abstract:
Deep learning models have seen widespread use in modelling LFO-driven audio effects, such as phaser and flanger. Although existing neural architectures exhibit high-quality emulation of individual effects, they do not possess the capability to manipulate the output via control parameters. To address this issue, we introduce Controllable Neural Frame-based Modulation Effects (CONMOD), a single black-box model which emulates various LFO-driven effects in a frame-wise manner, offering control over LFO frequency and feedback parameters. Additionally, the model is capable of learning the continuous embedding space of two distinct phaser effects, enabling us to steer between effects and achieve creative outputs. Our model outperforms previous work while possessing both controllability and universality, presenting opportunities to enhance creativity in modern LFO-driven audio effects.
Submitted 19 June, 2024;
originally announced June 2024.
-
Predicting the risk of early-stage breast cancer recurrence using H&E-stained tissue images
Authors:
Geongyu Lee,
Joonho Lee,
Tae-Yeong Kwak,
Sun Woo Kim,
Youngmee Kwon,
Chungyeul Kim,
Hyeyoon Chang
Abstract:
Accurate prediction of the likelihood of recurrence is important in the selection of postoperative treatment for patients with early-stage breast cancer. In this study, we investigated whether deep learning algorithms can predict patients' risk of recurrence by analyzing the pathology images of their cancer histology. A total of 125 hematoxylin and eosin-stained breast cancer whole slide images labeled with the risk prediction via genomics assays were used, and we obtained sensitivity of 0.857, 0.746, and 0.529 for predicting low, intermediate, and high risk, respectively, and specificity of 0.816, 0.803, and 0.972. When compared to the expert pathologist's regional histology grade information, a Pearson's correlation coefficient of 0.61 was obtained. When we inspected the trained model using class activation maps, we found that it considered tubule formation and mitotic rate when predicting different risk groups.
Submitted 10 June, 2024;
originally announced June 2024.
-
Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation
Authors:
Yejin Jeon,
Yunsu Kim,
Gary Geunbae Lee
Abstract:
Contemporary neural speech synthesis models have indeed demonstrated remarkable proficiency in synthetic speech generation as they have attained a level of quality comparable to that of human-produced speech. Nevertheless, it is important to note that these achievements have predominantly been verified within the context of high-resource languages such as English. Furthermore, the Tacotron and FastSpeech variants show substantial pausing errors when applied to the Korean language, which affects speech perception and naturalness. In order to address the aforementioned issues, we propose a novel framework that incorporates comprehensive modeling of both syntactic and acoustic cues that are associated with pausing patterns. Remarkably, our framework possesses the capability to consistently generate natural speech even for considerably more extended and intricate out-of-domain (OOD) sentences, despite its training on short audio clips. Architectural design choices are validated through comparisons with baseline models and ablation studies using subjective and objective metrics, thus confirming model performance.
Submitted 3 April, 2024;
originally announced April 2024.
-
Multi-Level Attention Aggregation for Language-Agnostic Speaker Replication
Authors:
Yejin Jeon,
Gary Geunbae Lee
Abstract:
This paper explores the task of language-agnostic speaker replication, a novel endeavor that seeks to replicate a speaker's voice irrespective of the language they are speaking. Towards this end, we introduce a multi-level attention aggregation approach that systematically probes and amplifies various speaker-specific attributes in a hierarchical manner. Through rigorous evaluations across a wide range of scenarios including seen and unseen speakers conversing in seen and unseen lingua, we establish that our proposed model is able to achieve substantial speaker similarity, and is able to generalize to out-of-domain (OOD) cases.
Submitted 3 April, 2024; v1 submitted 6 March, 2024;
originally announced March 2024.
-
Using digital twins for managing change in complex projects
Authors:
Jennifer Whyte,
Ranjith Soman,
Rafael Sacks,
Neda Mohammadi,
Nader Naderpajouh,
Wei-Ting Hong,
Ghang Lee
Abstract:
Complex systems are not entirely decomposable, hence interdependences arise at the interfaces in complex projects. When changes occur, significant risks arise at these interfaces as it is hard to identify, manage and visualise the systemic consequences of changes. Particularly problematic are the interfaces in which there are multiple interdependencies, which occur where the boundaries between design components, contracts and organisation coincide, such as between design disciplines. In this paper, we propose an approach to digital twin-based interface management, through an underpinning state-of-the-art review of the existing technical literature and a small pilot to identify the characteristics of future data-driven solutions. We set out an approach to digital twin-based interface management and an agenda for research on advanced methodologies for managing change in complex projects. This agenda includes the need to integrate work on identifying systems interfaces, change propagation and visualisation, and the potential to significantly extend the limitations of existing solutions by using developments in the digital twin, such as linked data, semantic enrichment, network analyses, natural language processing (NLP)-enhanced ontology and machine learning.
Submitted 30 May, 2024; v1 submitted 31 January, 2024;
originally announced February 2024.
-
Locality enhanced dynamic biasing and sampling strategies for contextual ASR
Authors:
Md Asif Jalal,
Pablo Peso Parada,
George Pavlidis,
Vasileios Moschopoulos,
Karthikeyan Saravanan,
Chrysovalantis-Giorgos Kontoulis,
Jisi Zhang,
Anastasios Drosou,
Gil Ho Lee,
Jungin Lee,
Seokyeong Jung
Abstract:
Automatic Speech Recognition (ASR) still faces challenges when recognizing time-variant rare phrases. Contextual biasing (CB) modules bias the ASR model towards such contextually relevant phrases. During training, a list of biasing phrases is selected from a large pool of phrases following a sampling strategy. In this work, we first analyse different sampling strategies to provide insights into the training of CB for ASR, using correlation plots between the bias embeddings at various training stages. Second, we introduce a neighbourhood attention (NA) that localizes self-attention (SA) to the nearest neighbouring frames to further refine the CB output. The results show that the proposed approach provides on average a 25.84% relative WER improvement on LibriSpeech sets and rare-word evaluation compared to the baseline.
Submitted 23 January, 2024;
originally announced January 2024.
-
Consistency Based Unsupervised Self-training For ASR Personalisation
Authors:
Jisi Zhang,
Vandana Rajan,
Haaris Mehmood,
David Tuckey,
Pablo Peso Parada,
Md Asif Jalal,
Karthikeyan Saravanan,
Gil Ho Lee,
Jungin Lee,
Seokyeong Jung
Abstract:
On-device Automatic Speech Recognition (ASR) models trained on speech data of a large population might underperform for individuals unseen during training. This is due to a domain shift between user data and the original training data, which differ in the user's speaking characteristics and environmental acoustic conditions. ASR personalisation is a solution that aims to exploit user data to improve model robustness. The majority of ASR personalisation methods assume labelled user data for supervision. Personalisation without any labelled data is challenging due to limited data size and poor quality of recorded audio samples. This work addresses unsupervised personalisation by developing a novel consistency-based training method via pseudo-labelling. Our method achieves a relative Word Error Rate Reduction (WERR) of 17.3% on unlabelled training data and 8.1% on held-out data compared to a pre-trained model, and outperforms the current state-of-the-art methods.
Submitted 22 January, 2024;
originally announced January 2024.
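For readers unfamiliar with the metric, a relative Word Error Rate Reduction is conventionally computed as the drop in WER divided by the baseline WER. The sketch below illustrates that arithmetic only; the function name and the WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction: (baseline - new) / baseline, as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline WER of 10.0% reduced to 8.27%
# corresponds to a relative reduction of about 17.3%.
werr = relative_wer_reduction(10.0, 8.27)
```

A relative figure can look large even when the absolute change is small, so it is usually read alongside the baseline WER.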
-
Joint Downlink and Uplink Optimization for RIS-Aided FDD MIMO Communication Systems
Authors:
Gyoseung Lee,
Hyeongtaek Lee,
Donghwan Kim,
Jaehoon Chung,
A. Lee Swindlehurst,
Junil Choi
Abstract:
This paper investigates reconfigurable intelligent surface (RIS)-aided frequency division duplexing (FDD) communication systems. Since the downlink and uplink signals are simultaneously transmitted in FDD, the phase shifts at the RIS should be designed to support both transmissions. Considering a single-user multiple-input multiple-output system, we formulate a weighted sum-rate maximization problem to jointly maximize the downlink and uplink system performance. To tackle the non-convex optimization problem, we adopt an alternating optimization (AO) algorithm, in which two phase shift optimization techniques are developed to handle the unit-modulus constraints induced by the reflection coefficients at the RIS. The first technique exploits the manifold optimization-based algorithm, while the second uses a lower-complexity AO approach. Numerical results verify that the proposed techniques rapidly converge to local optima and significantly improve the overall system performance compared to existing benchmark schemes.
Submitted 21 January, 2024;
originally announced January 2024.
-
Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations
Authors:
Yejin Jeon,
Yunsu Kim,
Gary Geunbae Lee
Abstract:
Zero-shot multi-speaker TTS aims to synthesize speech with the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, encounter limitations in adapting to new speakers in out-of-domain settings, primarily due to inadequate speaker disentanglement and content leakage. To overcome these constraints, we propose an innovative negation feature learning paradigm that models decoupled speaker attributes as deviations from the complete audio representation by utilizing the subtraction operation. By eliminating superfluous content information from the speaker representation, our negation scheme not only mitigates content leakage, thereby enhancing synthesis robustness, but also improves speaker fidelity. In addition, to facilitate the learning of diverse speaker attributes, we leverage multi-stream Transformers, which retain multiple hypotheses and instigate a training paradigm akin to ensemble learning. To unify these hypotheses and realize the final speaker representation, we employ attention pooling. Finally, in light of the imperative to generate target text utterances in the desired voice, we adopt adaptive layer normalizations to effectively fuse the previously generated speaker representation with the target text representations, as opposed to mere concatenation of the text and audio modalities. Extensive experiments and validations substantiate the efficacy of our proposed approach in preserving and harnessing speaker-specific attributes vis-à-vis alternative baseline models.
Submitted 5 March, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
-
Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation
Authors:
Wonjun Lee,
Gary Geunbae Lee,
Yunsu Kim
Abstract:
This research optimizes two-pass cross-lingual transfer learning in low-resource languages by enhancing phoneme recognition and phoneme-to-grapheme translation models. Our approach optimizes these two stages to improve speech recognition across languages. We optimize phoneme vocabulary coverage by merging phonemes based on shared articulatory characteristics, thus improving recognition accuracy. Additionally, we introduce a global phoneme noise generator for realistic ASR noise during phoneme-to-grapheme training to reduce error propagation. Experiments on the CommonVoice 12.0 dataset show significant reductions in Word Error Rate (WER) for low-resource languages, highlighting the effectiveness of our approach. This research contributes to the advancements of two-pass ASR systems in low-resource languages, offering the potential for improved cross-lingual transfer learning.
Submitted 6 December, 2023;
originally announced December 2023.
-
Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking
Authors:
Jihyun Lee,
Yejin Jeon,
Wonjun Lee,
Yunsu Kim,
Gary Geunbae Lee
Abstract:
Dialogue state tracking (DST) plays a crucial role in extracting information in task-oriented dialogue systems. However, previous research is limited to textual modalities, primarily due to the shortage of authentic human audio datasets. We address this by investigating synthetic audio data for audio-based DST. To this end, we develop cascading and end-to-end models, train them with our synthetic audio dataset, and test them on actual human speech data. To facilitate evaluation tailored to audio modalities, we introduce a novel PhonemeF1 to capture pronunciation similarity. Experimental results showed that models trained solely on synthetic datasets can generalize their performance to human voice data. By eliminating the dependency on human speech data collection, these insights pave the way for significant practical advancements in audio-based DST. Data and code are available at https://github.com/JihyunLee1/E2E-DST.
Submitted 4 December, 2023;
originally announced December 2023.
-
Unlocking the capabilities of explainable fewshot learning in remote sensing
Authors:
Gao Yu Lee,
Tanmoy Dam,
Md Meftahul Ferdaus,
Daniel Puiu Poenar,
Vu N Duong
Abstract:
Recent advancements have significantly improved the efficiency and effectiveness of deep learning methods for image-based remote sensing tasks. However, the requirement for large amounts of labeled data can limit the applicability of deep neural networks to existing remote sensing datasets. To overcome this challenge, fewshot learning has emerged as a valuable approach for enabling learning with limited data. While previous research has evaluated the effectiveness of fewshot learning methods on satellite-based datasets, little attention has been paid to exploring the applications of these methods to datasets obtained from UAVs, which are increasingly used in remote sensing studies. In this review, we provide an up-to-date overview of both existing and newly proposed fewshot classification techniques, along with appropriate datasets that are used for both satellite-based and UAV-based data. Our systematic approach demonstrates that fewshot learning can effectively adapt to the broader and more diverse perspectives that UAV-based platforms can provide. We also evaluate some SOTA fewshot approaches on a UAV disaster scene classification dataset, yielding promising results. We emphasize the importance of integrating XAI techniques like attention maps and prototype analysis to increase the transparency, accountability, and trustworthiness of fewshot models for remote sensing. Key challenges and future research directions are identified, including tailored fewshot methods for UAVs, extending to unseen tasks like segmentation, and developing optimized XAI techniques suited for fewshot remote sensing problems. This review aims to provide researchers and practitioners with an improved understanding of fewshot learning's capabilities and limitations in remote sensing, while highlighting open problems to guide future progress in efficient, reliable, and interpretable fewshot methods.
Submitted 12 October, 2023;
originally announced October 2023.
-
Revolutionizing Space Health (Swin-FSR): Advancing Super-Resolution of Fundus Images for SANS Visual Assessment Technology
Authors:
Khondker Fariha Hossain,
Sharif Amit Kamran,
Joshua Ong,
Andrew G. Lee,
Alireza Tavakkoli
Abstract:
The rapid accessibility of portable and affordable retinal imaging devices has made early differential diagnosis easier. For example, color funduscopy imaging is readily available in remote villages, which can help to identify diseases like age-related macular degeneration (AMD), glaucoma, or pathological myopia (PM). On the other hand, astronauts at the International Space Station utilize this camera for identifying spaceflight-associated neuro-ocular syndrome (SANS). However, due to the unavailability of experts in these locations, the data has to be transferred to an urban healthcare facility (AMD and glaucoma) or a terrestrial station (e.g., SANS) for more precise disease identification. Moreover, due to low bandwidth limits, the imaging data has to be compressed for transfer between these two places. Different super-resolution algorithms have been proposed throughout the years to address this. Furthermore, with the advent of deep learning, the field has advanced so much that x2 and x4 compressed images can be decompressed to their original form without losing spatial information. In this paper, we introduce a novel model called Swin-FSR that utilizes Swin Transformer with spatial and depth-wise attention for fundus image super-resolution. Our architecture achieves peak signal-to-noise ratio (PSNR) of 47.89, 49.00 and 45.32 on three public datasets, namely iChallenge-AMD, iChallenge-PM, and G1020. Additionally, we tested the model's effectiveness on a privately held dataset for SANS provided by NASA and achieved comparable results against previous architectures.
Submitted 11 August, 2023;
originally announced August 2023.
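As background for the PSNR figures quoted above, PSNR is conventionally defined as 10 * log10(MAX^2 / MSE) in decibels, where MAX is the largest possible pixel value and MSE is the mean squared error between the reference and reconstructed images. The sketch below shows that standard formula only; the function name, the default `max_value`, and the example values are assumptions, and the abstract does not state the pixel range used in the paper.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive; PSNR is unbounded for identical images")
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical example: images scaled to [0, 1] with an MSE of 0.01.
example_db = psnr(0.01, max_value=1.0)
```

Higher values mean the reconstruction is closer to the reference, and each additional 10 dB corresponds to a tenfold reduction in MSE.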
-
The Multi-modality Cell Segmentation Challenge: Towards Universal Solutions
Authors:
Jun Ma,
Ronald Xie,
Shamini Ayyadhury,
Cheng Ge,
Anubha Gupta,
Ritu Gupta,
Song Gu,
Yao Zhang,
Gihun Lee,
Joonkee Kim,
Wei Lou,
Haofeng Li,
Eric Upschulte,
Timo Dickscheid,
José Guilherme de Almeida,
Yixin Wang,
Lin Han,
Xin Yang,
Marco Labagnara,
Vojislav Gligorovski,
Maxime Scheder,
Sahand Jamal Rahi,
Carly Kempster,
Alice Pollitt,
Leon Espinosa
, et al. (15 additional authors not shown)
Abstract:
Cell segmentation is a critical step for quantitative single-cell analysis in microscopy images. Existing cell segmentation methods are often tailored to specific modalities or require manual interventions to specify hyper-parameters in different experimental settings. Here, we present a multi-modality cell segmentation benchmark, comprising over 1500 labeled images derived from more than 50 diverse biological experiments. The top participants developed a Transformer-based deep-learning algorithm that not only exceeds existing methods but can also be applied to diverse microscopy images across imaging platforms and tissue types without manual parameter adjustments. This benchmark and the improved algorithm offer promising avenues for more accurate and versatile cell analysis in microscopy imaging.
Submitted 1 April, 2024; v1 submitted 10 August, 2023;
originally announced August 2023.
-
ECG-QA: A Comprehensive Question Answering Dataset Combined With Electrocardiogram
Authors:
Jungwoo Oh,
Gyubok Lee,
Seongsu Bae,
Joon-myoung Kwon,
Edward Choi
Abstract:
Question answering (QA) in the field of healthcare has received much attention due to significant advancements in natural language processing. However, existing healthcare QA datasets primarily focus on medical images, clinical notes, or structured electronic health record tables. This leaves the vast potential of combining electrocardiogram (ECG) data with these systems largely untapped. To address this gap, we present ECG-QA, the first QA dataset specifically designed for ECG analysis. The dataset comprises a total of 70 question templates that cover a wide range of clinically relevant ECG topics, each validated by an ECG expert to ensure their clinical utility. As a result, our dataset includes diverse ECG interpretation questions, including those that require a comparative analysis of two different ECGs. In addition, we have conducted numerous experiments to provide valuable insights for future research directions. We believe that ECG-QA will serve as a valuable resource for the development of intelligent QA systems capable of assisting clinicians in ECG interpretations. Dataset URL: https://github.com/Jwoo5/ecg-qa
Submitted 10 October, 2023; v1 submitted 21 June, 2023;
originally announced June 2023.
-
Score-based Source Separation with Applications to Digital Communication Signals
Authors:
Tejas Jayashankar,
Gary C. F. Lee,
Alejandro Lancho,
Amir Weiss,
Yury Polyanskiy,
Gregory W. Wornell
Abstract:
We propose a new method for separating superimposed sources using diffusion-based generative models. Our method relies only on separately trained statistical priors of independent sources to establish a new objective function guided by maximum a posteriori estimation with an $α$-posterior, across multiple levels of Gaussian smoothing. Motivated by applications in radio-frequency (RF) systems, we are interested in sources with underlying discrete nature and the recovery of encoded bits from a signal of interest, as measured by the bit error rate (BER). Experimental results with RF mixtures demonstrate that our method results in a BER reduction of 95% over classical and existing learning-based methods. Our analysis demonstrates that our proposed method yields solutions that asymptotically approach the modes of an underlying discrete distribution. Furthermore, our method can be viewed as a multi-source extension to the recently proposed score distillation sampling scheme, shedding additional light on its use beyond conditional sampling. The project webpage is available at https://alpha-rgs.github.io
Submitted 17 January, 2024; v1 submitted 26 June, 2023;
originally announced June 2023.
-
Score-balanced Loss for Multi-aspect Pronunciation Assessment
Authors:
Heejin Do,
Yunsu Kim,
Gary Geunbae Lee
Abstract:
With rapid technological growth, automatic pronunciation assessment has transitioned toward systems that evaluate pronunciation in various aspects, such as fluency and stress. However, despite the highly imbalanced score labels within each aspect, existing studies have rarely tackled the data imbalance problem. In this paper, we suggest a novel loss function, score-balanced loss, to address the problem caused by uneven data, such as bias toward the majority scores. As a re-weighting approach, we assign higher costs when the predicted score is of the minority class, thus, guiding the model to gain positive feedback for sparse score prediction. Specifically, we design two weighting factors by leveraging the concept of an effective number of samples and using the ranks of scores. We evaluate our method on the speechocean762 dataset, which has noticeably imbalanced scores for several aspects. Improved results particularly on such uneven aspects prove the effectiveness of our method.
Submitted 26 May, 2023;
originally announced May 2023.
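The "effective number of samples" referred to in the abstract above is commonly defined, for a class with n samples, as E_n = (1 - beta^n) / (1 - beta) with beta in [0, 1). The sketch below shows one common way to turn it into re-weighting factors, by weighting each score class inversely to its effective number. The abstract does not give the paper's exact weighting factors, so the function names, the value of `beta`, the normalisation, and the score counts are all assumptions for illustration.

```python
def effective_number(n, beta):
    """Effective number of samples for a class with n samples."""
    return (1.0 - beta ** n) / (1.0 - beta)


def inverse_effective_weights(counts, beta=0.999):
    """Per-class weights proportional to 1 / effective_number.

    Weights are rescaled to sum to the number of classes, so a perfectly
    balanced dataset gives every class a weight of 1.0.
    """
    raw = {score: 1.0 / effective_number(n, beta) for score, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {score: w * scale for score, w in raw.items()}


# Hypothetical score histogram: score 10 is common, score 3 is rare.
weights = inverse_effective_weights({10: 1000, 3: 10})
```

In this example the rare score receives the larger weight, which matches the re-weighting intent described in the abstract.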
-
WATT-EffNet: A Lightweight and Accurate Model for Classifying Aerial Disaster Images
Authors:
Gao Yu Lee,
Tanmoy Dam,
Md Meftahul Ferdaus,
Daniel Puiu Poenar,
Vu N. Duong
Abstract:
Incorporating deep learning (DL) classification models into unmanned aerial vehicles (UAVs) can significantly augment search-and-rescue operations and disaster management efforts. In such critical situations, the UAV's ability to promptly comprehend the crisis and optimally utilize its limited power and processing resources to narrow down search areas is crucial. Therefore, developing an efficient and lightweight method for scene classification is of utmost importance. However, current approaches tend to prioritize accuracy on benchmark datasets at the expense of computational efficiency. To address this shortcoming, we introduce the Wider ATTENTION EfficientNet (WATT-EffNet), a novel method that achieves higher accuracy with a more lightweight architecture compared to the baseline EfficientNet. The WATT-EffNet leverages width-wise incremental feature modules and attention mechanisms over width-wise features to ensure the network structure remains lightweight. We evaluate our method on a UAV-based aerial disaster image classification dataset and demonstrate that it outperforms the baseline by up to 15 times in terms of classification accuracy and 38.3% in terms of computing efficiency as measured by floating-point operations (FLOPs). Additionally, we conduct an ablation study to investigate the effect of varying the width of WATT-EffNet on accuracy and computational efficiency. Our code is available at https://github.com/TanmDL/WATT-EffNet.
Submitted 1 May, 2023; v1 submitted 21 April, 2023;
originally announced April 2023.
-
Complexity reduction for resilient state estimation of uniformly observable nonlinear systems
Authors:
Junsoo Kim,
Jin Gyu Lee,
Henrik Sandberg,
Karl H. Johansson
Abstract:
A resilient state estimation scheme for uniformly observable nonlinear systems, based on a method for local identification of sensor attacks, is presented. The estimation problem is combinatorial in nature, and so many methods require substantial computational and storage resources as the number of sensors increases. To reduce the complexity, the proposed method performs the attack identification with local subsets of the measurements, not with the set of all measurements. A condition for nonlinear attack identification is introduced as a relaxed version of the existing redundant observability condition. It is shown that attack identification can be performed even when the state cannot be recovered from the measurements. As a result, although a portion of the measurements is compromised, the compromised measurements can be locally identified and excluded from the state estimation, and thus the true state can be recovered. Simulation results demonstrate the effectiveness of the proposed scheme.
Submitted 18 April, 2023;
originally announced April 2023.
-
On Neural Architectures for Deep Learning-based Source Separation of Co-Channel OFDM Signals
Authors:
Gary C. F. Lee,
Amir Weiss,
Alejandro Lancho,
Yury Polyanskiy,
Gregory W. Wornell
Abstract:
We study the single-channel source separation problem involving orthogonal frequency-division multiplexing (OFDM) signals, which are ubiquitous in many modern-day digital communication systems. Related efforts have been pursued in monaural source separation, where state-of-the-art neural architectures have been adopted to train an end-to-end separator for audio signals (as 1-dimensional time series). In this work, through a prototype problem based on the OFDM source model, we assess -- and question -- the efficacy of using audio-oriented neural architectures in separating signals based on features pertinent to communication waveforms. Perhaps surprisingly, we demonstrate that in some configurations, where perfect separation is theoretically attainable, these audio-oriented neural architectures perform poorly in separating co-channel OFDM waveforms. Yet, we propose critical domain-informed modifications to the network parameterization, based on insights from OFDM structures, that can confer about 30 dB improvement in performance.
Submitted 15 March, 2023; v1 submitted 11 March, 2023;
originally announced March 2023.
-
Channel Estimation for Reconfigurable Intelligent Surface with a few Active Elements
Authors:
Gyoseung Lee,
Hyeongtaek Lee,
Jaeky Oh,
Jaehoon Chung,
Junil Choi
Abstract:
In this paper, a channel estimation technique for reconfigurable intelligent surface (RIS)-aided multi-user multiple-input single-output communication systems is proposed. By deploying a small number of active elements at the RIS, the RIS can receive and process the training signals. Through the partial channel state information (CSI) obtained from the active elements, the overall training overhead to estimate the entire channel can be dramatically reduced. To minimize the estimation complexity, the proposed technique is based on the linear combination of partial CSI, which only requires linear matrix operations. By exploiting the spatial correlation among the RIS elements, proper weights for the linear combination and normalization factors are developed. Numerical results show that the proposed technique outperforms other schemes using the active elements at the RIS in terms of the normalized mean squared error when the number of active elements is small, which is necessary to maintain the low cost and power consumption of RIS.
Submitted 8 February, 2023;
originally announced February 2023.
-
SurgT challenge: Benchmark of Soft-Tissue Trackers for Robotic Surgery
Authors:
Joao Cartucho,
Alistair Weld,
Samyakh Tukra,
Haozheng Xu,
Hiroki Matsuzaki,
Taiyo Ishikawa,
Minjun Kwon,
Yong Eun Jang,
Kwang-Ju Kim,
Gwang Lee,
Bizhe Bai,
Lueder Kahrs,
Lars Boecking,
Simeon Allmendinger,
Leopold Muller,
Yitong Zhang,
Yueming Jin,
Sophia Bano,
Francisco Vasconcelos,
Wolfgang Reiter,
Jonas Hajek,
Bruno Silva,
Estevao Lima,
Joao L. Vilaca,
Sandro Queiros
, et al. (1 additional authors not shown)
Abstract:
This paper introduces the "SurgT: Surgical Tracking" challenge, which was organised in conjunction with MICCAI 2022. There were two purposes for the creation of this challenge: (1) the establishment of the first standardised benchmark for the research community to assess soft-tissue trackers; and (2) to encourage the development of unsupervised deep learning methods, given the lack of annotated data in surgery. A dataset of 157 stereo endoscopic videos from 20 clinical cases, along with stereo camera calibration parameters, has been provided. Participants were assigned the task of developing algorithms to track the movement of soft tissues, represented by bounding boxes, in stereo endoscopic videos. At the end of the challenge, the developed methods were assessed on a previously hidden test subset. This assessment uses benchmarking metrics that were purposely developed for this challenge, to verify the efficacy of unsupervised deep learning algorithms in tracking soft tissue. The metric used for ranking the methods was the Expected Average Overlap (EAO) score, which measures the average overlap between a tracker's and the ground truth bounding boxes. Coming first in the challenge was the deep learning submission by ICVS-2Ai with a superior EAO score of 0.617. This method employs ARFlow to estimate unsupervised dense optical flow from cropped images, using photometric and regularization losses. Second, Jmees, with an EAO of 0.583, uses deep learning for surgical tool segmentation on top of a non-deep learning baseline method: CSRT. CSRT by itself scores a similar EAO of 0.563. The results from this challenge show that currently, non-deep learning methods are still competitive. The dataset and benchmarking tool created for this challenge have been made publicly available at https://surgt.grand-challenge.org/.
Submitted 30 August, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Hierarchical Pronunciation Assessment with Multi-Aspect Attention
Authors:
Heejin Do,
Yunsu Kim,
Gary Geunbae Lee
Abstract:
Automatic pronunciation assessment is a major component of a computer-assisted pronunciation training system. To provide in-depth feedback, scoring pronunciation at various levels of granularity such as phoneme, word, and utterance, with diverse aspects such as accuracy, fluency, and completeness, is essential. However, existing multi-aspect multi-granularity methods simultaneously predict all aspects at all granularity levels; therefore, they have difficulty in capturing the linguistic hierarchy of phoneme, word, and utterance. This limitation further leads to neglecting intimate cross-aspect relations at the same linguistic unit. In this paper, we propose a Hierarchical Pronunciation Assessment with Multi-aspect Attention (HiPAMA) model, which hierarchically represents the granularity levels to directly capture their linguistic structures and introduces multi-aspect attention that reflects associations across aspects at the same level to create more connotative representations. By obtaining relational information from both the granularity- and aspect-side, HiPAMA can take full advantage of multi-task learning. Remarkable improvements in the experimental results on the speechocean762 dataset demonstrate the robustness of HiPAMA, particularly in the difficult-to-assess aspects.
Submitted 26 May, 2023; v1 submitted 15 November, 2022;
originally announced November 2022.
-
Hi,KIA: A Speech Emotion Recognition Dataset for Wake-Up Words
Authors:
Taesu Kim,
SeungHeon Doh,
Gyunpyo Lee,
Hyungseok Jeon,
Juhan Nam,
Hyeon-Jeong Suk
Abstract:
A wake-up word (WUW) is a short sentence used to activate a speech recognition system to receive the user's speech input. WUW utterances include not only the lexical information for waking up the system but also non-lexical information such as speaker identity or emotion. In particular, recognizing the user's emotional state may enrich the voice communication. However, there are few datasets in which the emotional state of the WUW utterances is labeled. In this paper, we introduce Hi, KIA, a new WUW dataset consisting of 488 Korean-accented emotional utterances collected from four male and four female speakers, in which each utterance is labeled with one of four emotional states: anger, happy, sad, or neutral. We present the step-by-step procedure to build the dataset, covering scenario selection, post-processing, and human validation for label agreement. We also provide two classification models for WUW speech emotion recognition using the dataset. One is based on traditional hand-crafted features and the other is a transfer-learning approach using a pre-trained neural network. These classification models could be used as benchmarks in further research.
Submitted 7 November, 2022;
originally announced November 2022.
-
Massive MIMO Evolution Towards 3GPP Release 18
Authors:
Huangping Jin,
Kunpeng Liu,
Gilwon Lee,
Emad J. Farag,
Min Zhang,
Dalin Zhu,
Leiming Zhang,
Eko Onggosanusi,
Mansoor Shafi,
Harsh Tataria
Abstract:
Since the introduction of fifth-generation new radio (5G-NR) in Third Generation Partnership Project (3GPP) Release 15, swift progress has been made to evolve 5G, with 3GPP Release 18 emerging. A critical aspect is the design of massive multiple-input multiple-output (MIMO) technology. In this line, this paper makes several important contributions: We provide a comprehensive overview of the evolution of standardized massive MIMO features from 3GPP Release 15 to 17 for both time/frequency-division duplex operation across bands FR-1 and FR-2. We analyze the progress on channel state information (CSI) frameworks and beam management frameworks, and present enhancements for uplink CSI. We shed light on emerging 3GPP Release 18 problems requiring imminent attention. These include advanced codebook design and sounding reference signal design for coherent joint transmission (CJT) with multiple transmission/reception points (multi-TRPs). We discuss advancements in uplink demodulation reference signal design, enhancements for mobility to provide accurate CSI estimates, and a unified transmission configuration indicator framework tailored for FR-2 bands. For each concept, we provide system-level simulation results to highlight their performance benefits. Via field trials in an outdoor environment at Shanghai Jiaotong University, we demonstrate the gains of multi-TRP CJT relative to single TRP at 3.7 GHz.
Submitted 15 October, 2022;
originally announced October 2022.
-
A Design Method of Distributed Algorithms via Discrete-time Blended Dynamics Theorem
Authors:
Jeong Woo Kim,
Jin Gyu Lee,
Donggil Lee,
Hyungbo Shim
Abstract:
We develop a discrete-time version of the blended dynamics theorem for use in designing distributed computation algorithms. The blended dynamics theorem makes it possible to predict the behavior of heterogeneous multi-agent systems. Therefore, once we obtain a blended dynamics for a particular computational task, a design idea for the node dynamics of individual heterogeneous agents readily follows. In the continuous-time case, prediction by blended dynamics was enabled by high coupling gain among neighboring agents. In the discrete-time case, we propose an equivalent action, which we call multi-step coupling in this paper. Compared to the continuous-time case, the blended dynamics can have more variety depending on the coupling matrix. This benefit is demonstrated with three applications: distributed estimation of network size, distributed computation of the PageRank, and distributed computation of the degree sequence of a graph, which correspond to the coupling by doubly-stochastic, column-stochastic, and row-stochastic matrices, respectively.
Submitted 11 October, 2022;
originally announced October 2022.
-
Fine-tuning Wav2vec for Vocal-burst Emotion Recognition
Authors:
Dang-Khanh Nguyen,
Sudarshan Pant,
Ngoc-Huynh Ho,
Guee-Sang Lee,
Soo-Huyng Kim,
Hyung-Jeong Yang
Abstract:
The ACII Affective Vocal Bursts (A-VB) competition introduces a new topic in affective computing, which is understanding emotional expression using the non-verbal sound of humans. We are familiar with emotion recognition via verbal vocal or facial expressions. However, vocal bursts such as laughs, cries, and sighs are not exploited even though they are very informative for behavior analysis. The A-VB competition comprises four tasks that explore non-verbal information in different spaces. This technical report describes the method and the results of the SclabCNU Team for the tasks of the challenge. We achieved promising results compared to the baseline model provided by the organizers.
Submitted 1 October, 2022;
originally announced October 2022.
-
Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst
Authors:
Dang-Linh Trinh,
Minh-Cong Vo,
Guee-Sang Lee
Abstract:
This technical report presents our emotion recognition pipeline for the high-dimensional emotion task (A-VB High) in the ACII Affective Vocal Bursts (A-VB) 2022 Workshop & Competition. Our proposed method contains three stages. First, we extract the latent features from the raw audio signal and its Mel-spectrogram by self-supervised learning methods. Then, the features from the raw signal are fed to the self-relation attention and temporal awareness (SA-TA) module for learning the valuable information between these latent features. Finally, we concatenate all the features and utilize a fully-connected layer to predict each emotion's score. In empirical experiments, our proposed method achieves a mean concordance correlation coefficient (CCC) of 0.7295 on the test set, compared to 0.5686 for the baseline model. The code of our method is available at https://github.com/linhtd812/A-VB2022.
Submitted 26 September, 2022; v1 submitted 15 September, 2022;
originally announced September 2022.
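For readers unfamiliar with the metric reported in the abstract above, the concordance correlation coefficient between predictions x and references y is commonly defined as CCC = 2 cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2). The sketch below is a plain-Python illustration of that standard definition using population (divide-by-n) statistics; the function name is hypothetical, and it is not taken from the authors' repository.

```python
def concordance_cc(pred, gold):
    """Concordance correlation coefficient with population (divide-by-n) statistics.

    Hypothetical helper for illustration; assumes equal-length, non-constant inputs.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    # Unlike Pearson correlation, the squared mean difference penalizes a constant offset.
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)
```

Unlike Pearson correlation, CCC drops below 1 when predictions are perfectly correlated with the references but shifted by a constant, which is why it is often preferred for scoring continuous emotion ratings.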
-
Data-Driven Blind Synchronization and Interference Rejection for Digital Communication Signals
Authors:
Alejandro Lancho,
Amir Weiss,
Gary C. F. Lee,
Jennifer Tang,
Yuheng Bu,
Yury Polyanskiy,
Gregory W. Wornell
Abstract:
We study the potential of data-driven deep learning methods for separation of two communication signals from an observation of their mixture. In particular, we assume knowledge on the generation process of one of the signals, dubbed signal of interest (SOI), and no knowledge on the generation process of the second signal, referred to as interference. This form of the single-channel source separation problem is also referred to as interference rejection. We show that capturing high-resolution temporal structures (nonstationarities), which enables accurate synchronization to both the SOI and the interference, leads to substantial performance gains. With this key insight, we propose a domain-informed neural network (NN) design that is able to improve upon both "off-the-shelf" NNs and classical detection and interference rejection methods, as demonstrated in our simulations. Our findings highlight the key role communication-specific domain knowledge plays in the development of data-driven approaches that hold the promise of unprecedented gains.
Submitted 11 September, 2022;
originally announced September 2022.
-
Open-loop contraction design
Authors:
Jin Gyu Lee,
Thiago B. Burghi,
Rodolphe Sepulchre
Abstract:
Given a non-contracting trajectory of a nonlinear system, we consider the question of designing an input perturbation that makes the perturbed trajectory contracting. This paper stresses the analogy of this question with the classical question of feedback stabilization. In particular, it is shown that the existence of an output variable that ensures contraction of the inverse system facilitates the design of a contracting input perturbation. We illustrate the relevance of this question in parameter estimation.
Submitted 9 September, 2022;
originally announced September 2022.
-
Exploiting Temporal Structures of Cyclostationary Signals for Data-Driven Single-Channel Source Separation
Authors:
Gary C. F. Lee,
Amir Weiss,
Alejandro Lancho,
Jennifer Tang,
Yuheng Bu,
Yury Polyanskiy,
Gregory W. Wornell
Abstract:
We study the problem of single-channel source separation (SCSS), and focus on cyclostationary signals, which are particularly suitable in a variety of application domains. Unlike classical SCSS approaches, we consider a setting where only examples of the sources are available rather than their models, inspiring a data-driven approach. For source models with underlying cyclostationary Gaussian constituents, we establish a lower bound on the attainable mean squared error (MSE) for any separation method, model-based or data-driven. Our analysis further reveals the operation for optimal separation and the associated implementation challenges. As a computationally attractive alternative, we propose a deep learning approach using a U-Net architecture, which is competitive with the minimum MSE estimator. We demonstrate in simulation that, with suitable domain-informed architectural choices, our U-Net method can approach the optimal performance with substantially reduced computational burden.
Submitted 22 August, 2022;
originally announced August 2022.
-
Rapid and robust synchronization via weak synaptic coupling Extended arXiv version
Authors:
Jin Gyu Lee,
Rodolphe Sepulchre
Abstract:
This paper examines how weak synaptic coupling can achieve rapid synchronization in heterogeneous networks. The assumptions aim at capturing the key mathematical properties that make this possible for biophysical networks. In particular, the combination of nodal excitability and synaptic coupling is shown to be essential to the phenomenon.
Submitted 17 October, 2023; v1 submitted 18 July, 2022;
originally announced July 2022.
-
A Multi-stage Framework with Mean Subspace Computation and Recursive Feedback for Online Unsupervised Domain Adaptation
Authors:
Jihoon Moon,
Debasmit Das,
C. S. George Lee
Abstract:
In this paper, we address the Online Unsupervised Domain Adaptation (OUDA) problem and propose a novel multi-stage framework to solve real-world situations when the target data are unlabeled and arriving online sequentially in batches. To project the data from the source and the target domains to a common subspace and manipulate the projected data in real-time, our proposed framework institutes a novel method, called an Incremental Computation of Mean-Subspace (ICMS) technique, which computes an approximation of the mean-target subspace on a Grassmann manifold and is proven to be a close approximation to the Karcher mean. Furthermore, the transformation matrix computed from the mean-target subspace is applied to the next target data in the recursive-feedback stage, aligning the target data closer to the source domain. The computation of the transformation matrix and the prediction of the next-target subspace leverage the performance of the recursive-feedback stage by considering the cumulative temporal dependency among the flow of the target subspace on the Grassmann manifold. The labels of the transformed target data are predicted by the pre-trained source classifier, then the classifier is updated by the transformed data and predicted labels. Extensive experiments on six datasets were conducted to investigate in depth the effect and contribution of each stage in our proposed framework and its performance over previous approaches in terms of classification accuracy and computational speed. In addition, the experiments on traditional manifold-based learning models and neural-network-based learning models demonstrated the applicability of our proposed framework for various types of learning models.
Submitted 23 June, 2022;
originally announced July 2022.
-
Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis
Authors:
Tae-Woo Kim,
Min-Su Kang,
Gyeong-Hoon Lee
Abstract:
Recently, deep learning-based generative models have been introduced to generate singing voices. One approach is to predict the parametric vocoder features consisting of explicit speech parameters. This approach has the advantage that the meaning of each feature is explicitly distinguished. Another approach is to predict mel-spectrograms for a neural vocoder. However, parametric vocoders have limitations of voice quality and the mel-spectrogram features are difficult to model because the timbre and pitch information are entangled. In this study, we propose a singing voice synthesis model with multi-task learning to use both approaches -- acoustic features for a parametric vocoder and mel-spectrograms for a neural vocoder. By using the parametric vocoder features as auxiliary features, the proposed model can efficiently disentangle and control the timbre and pitch components of the mel-spectrogram. Moreover, a generative adversarial network framework is applied to improve the quality of singing voices in a multi-singer model. Experimental results demonstrate that our proposed model can generate more natural singing voices than the single-task models, while performing better than the conventional parametric vocoder-based model.
Submitted 13 June, 2024; v1 submitted 23 June, 2022;
originally announced June 2022.
-
HRTF measurement for accurate sound localization cues
Authors:
Gyeong-Tae Lee,
Sang-Min Choi,
Byeong-Yun Ko,
Yong-Hwa Park
Abstract:
A new database of head-related transfer functions (HRTFs) for accurate sound source localization is presented through precise measurement and post-processing in terms of improved frequency bandwidth and causality of head-related impulse responses (HRIRs) for accurate spectral cue (SC) and interaural time difference (ITD), respectively. The improvement effects of the proposed methods on binaural sound localization cues were investigated. To achieve sufficient frequency bandwidth with a single source, a one-way sealed speaker module was designed to obtain wide band frequency response based on electro-acoustics, whereas most existing HRTF databases rely on a two-way vented loudspeaker that has multiple sources. The origin transfer function at the head center was obtained by the proposed measurement scheme using a 0 degree on-axis microphone to ensure accurate spectral cue pattern of HRTFs, whereas in the previous measurements with a 90 degree off-axis microphone, the magnitude response of the origin transfer function fluctuated and decreased with increasing frequency, causing erroneous SCs of HRTFs. To prevent discontinuity of ITD due to non-causality of ipsilateral HRTFs, obtained HRIRs were circularly shifted by time delay considering the head radius of the measurement subject. Finally, various sound localization cues such as ITD, interaural level difference (ILD), SC, and horizontal plane directivity (HPD) were derived from the presented HRTFs, and improvements on binaural sound localization cues were examined. As a result, accurate SC patterns of HRTFs were confirmed through the proposed measurement scheme using the 0 degree on-axis microphone, and continuous ITD patterns were obtained due to the non-causality compensation. Source codes and presented HRTF database are available to relevant research groups at GitHub (https://github.com/han-saram/HRTF-HATS-KAIST).
Submitted 5 April, 2022; v1 submitted 7 March, 2022;
originally announced March 2022.
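The non-causality compensation mentioned in the abstract above (circularly shifting each HRIR by a delay tied to the head radius) can be illustrated with a minimal sketch. The delay rule used here (travel time of sound over one head radius), the function names, and the default speed of sound are assumptions for illustration only; the abstract does not state the exact formula the authors used.

```python
def delay_samples(head_radius_m, sample_rate_hz, speed_of_sound=343.0):
    """Samples needed for sound to travel one head radius (assumed rule)."""
    return round(head_radius_m / speed_of_sound * sample_rate_hz)


def circular_shift(hrir, n):
    """Rotate the impulse response right by n samples, wrapping the tail
    around to the front so that no samples are lost."""
    if not hrir:
        return []
    n %= len(hrir)
    return hrir[-n:] + hrir[:-n] if n else list(hrir)


# Hypothetical toy HRIR whose onset sits at the end of the buffer,
# as happens when a non-causal response wraps around.
toy_hrir = [0.0, 0.0, 0.0, 0.2, 1.0]
shifted = circular_shift(toy_hrir, 2)
```

Because the shift is circular, the energy that sat at the end of the buffer moves to the beginning instead of being truncated, which is why a rotation rather than zero-padding is used in this sketch.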
-
Feasibility Study of Multi-Site Split Learning for Privacy-Preserving Medical Systems under Data Imbalance Constraints in COVID-19, X-Ray, and Cholesterol Dataset
Authors:
Yoo Jeong Ha,
Gusang Lee,
Minjae Yoo,
Soyi Jung,
Seehwan Yoo,
Joongheon Kim
Abstract:
It seems as though progressively more people are in the race to upload content, data, and information online, and hospitals have not neglected this trend either. Hospitals are now at the forefront of multi-site medical data sharing, which provides groundbreaking advancements in the way health records are shared and patients are diagnosed. Sharing of medical data is essential in modern medical research. Yet, as with all data sharing technology, the challenge is to balance improved treatment with protecting patients' personal information. This paper provides a novel split learning algorithm, for which we coin the term "multi-site split learning", which enables a secure transfer of medical data between multiple hospitals without fear of exposing personal data contained in patient records. It also explores the effects of varying the number of end-systems and the ratio of data imbalance on the deep learning performance. A guideline for the optimal configuration of split learning that ensures privacy of patient data whilst achieving performance is empirically given. We argue the benefits of our multi-site split learning algorithm, especially regarding the privacy-preserving factor, using CT scans of COVID-19 patients, X-ray bone scans, and cholesterol level medical data.
Submitted 20 February, 2022;
originally announced February 2022.
-
Node-wise monotone barrier coupling law for formation control
Authors:
Jin Gyu Lee,
Cyrus Mostajeran,
Graham Van Goffrier
Abstract:
We study a node-wise monotone barrier coupling law, motivated by the synaptic coupling of neural central pattern generators. It is illustrated that this coupling imitates the desirable properties of neural central pattern generators. In particular, the coupling law 1) allows us to assign multiple central patterns on the circle and 2) allows for rapid switching between different patterns via simple `kicks'. In the end, we achieve full control by partitioning the state space by utilizing a barrier effect and assigning a unique steady-state behavior to each element of the resulting partition. We analyze the global behavior and study the viability of the design.
Submitted 1 February, 2024; v1 submitted 6 February, 2022;
originally announced February 2022.
-
Graph attentive feature aggregation for text-independent speaker verification
Authors:
Hye-jin Shim,
Jungwoo Heo,
Jae-han Park,
Ga-hui Lee,
Ha-Jin Yu
Abstract:
The objective of this paper is to combine multiple frame-level features into a single utterance-level representation while considering pairwise relationships. For this purpose, we propose a novel graph attentive feature aggregation module by interpreting each frame-level feature as a node of a graph. The inter-relationship between all possible pairs of features, typically exploited indirectly, can be directly modeled using a graph. The module comprises a graph attention layer and a graph pooling layer followed by a readout operation. The graph attention layer first models the non-Euclidean data manifold between different nodes. Then, the graph pooling layer discards less informative nodes considering the significance of the nodes. Finally, the readout operation combines the remaining nodes into a single representation. We employ two recent systems, SE-ResNet and RawNet2, with different input features and architectures and demonstrate that the proposed feature aggregation module consistently shows a relative improvement of over 10% compared to the baseline.
△ Less
Submitted 22 December, 2021;
originally announced December 2021.
-
Edge-wise funnel output synchronization of heterogeneous agents with relative degree one
Authors:
Jin Gyu Lee,
Thomas Berger,
Stephan Trenn,
Hyungbo Shim
Abstract:
When a group of heterogeneous node dynamics are diffusively coupled with a high coupling gain, the group exhibits a collective emergent behavior which is governed by a simple algebraic average of the node dynamics called the blended dynamics. This finding has been utilized for designing heterogeneous multi-agent systems by building the desired blended dynamics first and then splitting it into the node dynamics. However, to compute the magnitude of the coupling gain, each agent needs to know global information such as the number of participating nodes, the graph structure, and so on, which prevents a fully decentralized design of the node dynamics in conjunction with the coupling laws. To resolve this issue, the idea of funnel control, which is a method for adaptive gain selection, can be exploited for a node-wise coupling, but the price to pay is that the collective emergent behavior is no longer governed by a simple average of the node dynamics. Our analysis reveals that this drawback can be avoided by an edge-wise design premise, which is the idea that we present in this paper. As a result, we gain benefits such as a fully decentralized design without global information, collective emergent behavior being governed by the blended dynamics, and plug-and-play operation based on edge-wise handshaking between two nodes.
Submitted 16 January, 2023; v1 submitted 11 October, 2021;
originally announced October 2021.
-
Spatio-Temporal Split Learning for Privacy-Preserving Medical Platforms: Case Studies with COVID-19 CT, X-Ray, and Cholesterol Data
Authors:
Yoo Jeong Ha,
Minjae Yoo,
Gusang Lee,
Soyi Jung,
Sae Won Choi,
Joongheon Kim,
Seehwan Yoo
Abstract:
Machine learning requires a large volume of sample data, especially when it is used in high-accuracy medical applications. However, patient records are among the most sensitive private information and are not usually shared among institutes. This paper presents spatio-temporal split learning, a distributed deep neural network framework, which is a turning point in allowing collaboration among privacy-sensitive organizations. Our spatio-temporal split learning presents how distributed machine learning can be efficiently conducted with minimal privacy concerns. The proposed split learning consists of a number of clients and a centralized server. Each client has only one hidden layer, which acts as the privacy-preserving layer, and the centralized server comprises the other hidden layers and the output layer. Since the centralized server does not need to access the training data and trains the deep neural network with parameters received from the privacy-preserving layer, privacy of original data is guaranteed. We have coined the term spatio-temporal split learning because multiple clients are spatially distributed to cover diverse datasets from different participants, and we can temporally split the learning process, detaching the privacy-preserving layer from the rest of the learning process to minimize privacy breaches. This paper shows how we can analyze the medical data whilst ensuring privacy using our proposed multi-site spatio-temporal split learning algorithm on Coronavirus Disease-19 (COVID-19) chest Computed Tomography (CT) scans, MUsculoskeletal RAdiographs (MURA) X-ray images, and cholesterol levels.
Submitted 20 August, 2021;
originally announced August 2021.
-
Deep learning based cough detection camera using enhanced features
Authors:
Gyeong-Tae Lee,
Hyeonuk Nam,
Seong-Hu Kim,
Sang-Min Choi,
Youngkey Kim,
Yong-Hwa Park
Abstract:
Coughing is a typical symptom of COVID-19. To detect and localize coughing sounds remotely, a convolutional neural network (CNN) based deep learning model was developed in this work and integrated with a sound camera for the visualization of the cough sounds. The cough detection model is a binary classifier of which the input is a two second acoustic feature and the output is one of two inferences (Cough or Others). Data augmentation was performed on the collected audio files to alleviate class imbalance and reflect various background noises in practical environments. For effective featuring of the cough sound, conventional features such as spectrograms, mel-scaled spectrograms, and mel-frequency cepstral coefficients (MFCC) were reinforced by utilizing their velocity (V) and acceleration (A) maps in this work. VGGNet, GoogLeNet, and ResNet were simplified to binary classifiers, and were named V-net, G-net, and R-net, respectively. To find the best combination of features and networks, training was performed for a total of 39 cases and the performance was confirmed using the test F1 score. Finally, a test F1 score of 91.9% (test accuracy of 97.2%) was achieved from G-net with the MFCC-V-A feature (named Spectroflow), an acoustic feature effective for use in cough detection. The trained cough detection model was integrated with a sound camera (i.e., one that visualizes sound sources using a beamforming microphone array). In a pilot test, the cough detection camera detected coughing sounds with an F1 score of 90.0% (accuracy of 96.0%), and the cough location in the camera image was tracked in real time.
Submitted 24 May, 2022; v1 submitted 28 July, 2021;
originally announced July 2021.
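The abstract above reinforces each acoustic feature with its velocity (V) and acceleration (A) maps. One common way to build such maps is to difference the feature along the time axis once and then twice; the sketch below takes that route. The simple first-order difference, the boundary handling, and the function names are assumptions for illustration, and the paper may use a different delta formula.

```python
def time_difference(feature_map):
    """First-order difference along time for a [frequency][time] map.

    The first frame is differenced against itself (giving 0), so the
    output has the same shape as the input.
    """
    return [
        [row[t] - row[max(t - 1, 0)] for t in range(len(row))]
        for row in feature_map
    ]


def stack_with_dynamics(feature_map):
    """Return [static, velocity, acceleration] as three channels."""
    velocity = time_difference(feature_map)
    acceleration = time_difference(velocity)
    return [feature_map, velocity, acceleration]


# Hypothetical single-band feature over four frames.
static = [[1.0, 3.0, 6.0, 10.0]]
channels = stack_with_dynamics(static)
```

Stacking the three maps as channels gives a CNN access to how the spectrum changes over time, not only to its instantaneous shape.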
-
A Dual-Connection based Handover Scheme for Ultra-Dense Millimeter-Wave Cellular Networks
Authors:
Seongjoon Kang,
Siyoung Choi,
Goodsol Lee,
Saewoong Bahk
Abstract:
Mobile users in an ultra-dense millimeter-wave cellular network experience handover events more frequently than in conventional networks, which results in increased service interruption time and performance degradation due to blockages. Multi-connectivity has been proposed to resolve this, and it also extends the coverage of millimeter-wave communications. In this paper, we propose a dual-connection based handover scheme for mobile UEs in an environment where they are connected simultaneously with two millimeter-wave cells to overcome frequent handover problems. This scheme allows a mobile UE to choose its serving link between the two mmWave connections according to the measured SINRs and then the corresponding base stations may forward duplicate packets to the UE. We compare our dual-connection based scheme with a conventional single-connection based scheme through ns-3 simulation. The simulation results show that the proposed scheme significantly reduces handover rate and delay. Therefore, we argue that the dual-connection based scheme helps mobile users achieve performance goals they require in ultra-dense cellular environments.
Submitted 9 July, 2021;
originally announced July 2021.
-
Heavily Augmented Sound Event Detection utilizing Weak Predictions
Authors:
Hyeonuk Nam,
Byeong-Yun Ko,
Gyeong-Tae Lee,
Seong-Hu Kim,
Won-Ho Jung,
Sang-Min Choi,
Yong-Hwa Park
Abstract:
The performance of Sound Event Detection (SED) systems is greatly limited by the difficulty of generating large strongly labeled datasets. In this work, we used two main approaches to overcome the lack of strongly labeled data. First, we applied heavy data augmentation on input features. Data augmentation methods used include not only conventional methods used in speech/audio domains but also our proposed method named FilterAugment. Second, we propose two methods to utilize weak predictions to enhance weakly supervised SED performance. As a result, we obtained the best PSDS1 of 0.4336 and best PSDS2 of 0.8161 on the DESED real validation dataset. This work was submitted to DCASE 2021 Task 4 and ranked in 3rd place. Code available: https://github.com/frednam93/FilterAugSED.
Submitted 14 September, 2021; v1 submitted 8 July, 2021;
originally announced July 2021.
-
N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement
Authors:
Gyeong-Hoon Lee,
Tae-Woo Kim,
Hanbin Bae,
Min-Ji Lee,
Young-Ik Kim,
Hoon-Young Cho
Abstract:
Recently, end-to-end Korean singing voice systems have been designed to generate realistic singing voices. However, these systems still suffer from a lack of robustness in terms of pronunciation accuracy. In this paper, we propose N-Singer, a non-autoregressive Korean singing voice system, to synthesize accurate and pronounced Korean singing voices in parallel. N-Singer consists of a Transformer-based mel-generator, a convolutional network-based postnet, and voicing-aware discriminators. It can contribute in the following ways. First, for accurate pronunciation, N-Singer separately models linguistic and pitch information without other acoustic features. Second, to achieve improved mel-spectrograms, N-Singer uses a combination of Transformer-based modules and convolutional network-based modules. Third, in adversarial training, voicing-aware conditional discriminators are used to capture the harmonic features of voiced segments and noise components of unvoiced segments. The experimental results prove that N-Singer can synthesize a natural singing voice in parallel with a more accurate pronunciation than the baseline model.
Submitted 21 February, 2022; v1 submitted 29 June, 2021;
originally announced June 2021.
-
Networks obtained by Implicit-Explicit Method: Discrete-time distributed median solver
Authors:
Jin Gyu Lee
Abstract:
With the aim of making the consensus algorithm robust to outliers, consensus on the median value has recently attracted some attention. It is applicable, for instance, to constructing a resilient distributed state estimator. Meanwhile, most of the existing works consider continuous-time algorithms and use high-gain and discontinuous vector fields. This raises the problem that smaller time steps are needed and chattering occurs when the algorithms are discretized by an explicit method for practical use. Thus, in this paper, we highlight that these issues vanish when we instead utilize the Implicit-Explicit Method, for a broader class of networks designed by the blended dynamics approach. In particular, for undirected and connected graphs, we propose a discrete-time distributed median solver that does not suffer from chattering. We also verify by simulation that it requires fewer iterations to arrive at a steady state.
Submitted 17 June, 2021;
originally announced June 2021.
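The chattering issue raised in the abstract above can be seen in a one-dimensional toy problem. The sketch below is not the paper's distributed median solver: it is a single scalar state driven by the discontinuous field dx/dt = -sign(x - target), discretized once with an explicit Euler step and once with an implicit Euler step (treating sign as set-valued at zero). All names and the chosen numbers are assumptions for illustration.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def explicit_step(x, target, h):
    """Explicit Euler: evaluate the discontinuous field at the current state."""
    return x - h * sign(x - target)


def implicit_step(x, target, h):
    """Implicit Euler: solve x_next = x - h * s with s in Sign(x_next - target).

    If the state is within h of the target, the only consistent solution
    is x_next = target; otherwise the state moves by a full step.
    """
    d = x - target
    if abs(d) <= h:
        return target
    return x - h * sign(d)


def simulate(step, x0, target, h, n_steps):
    """Apply `step` repeatedly and return the visited states."""
    xs, x = [], x0
    for _ in range(n_steps):
        x = step(x, target, h)
        xs.append(x)
    return xs


explicit_traj = simulate(explicit_step, 1.0, 0.25, 0.5, 6)
implicit_traj = simulate(implicit_step, 1.0, 0.25, 0.5, 6)
```

With these values the explicit iteration ends up alternating between 0.5 and 0.0 without ever settling at 0.25, while the implicit iteration reaches 0.25 on its second step and stays there. Shrinking the step size reduces the explicit oscillation only at the cost of more iterations, which is the trade-off the abstract describes.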
-
Estimation of Closest In-Path Vehicle (CIPV) by Low-Channel LiDAR and Camera Sensor Fusion for Autonomous Vehicle
Authors:
Hyunjin Bae,
Gu Lee,
Jaeseung Yang,
Gwanjun Shin,
Yongseob Lim,
Gyeungho Choi
Abstract:
In autonomous driving, using a variety of sensors to recognize preceding vehicles at middle and long distances is helpful for improving driving performance and developing various functions. However, if only LiDAR or camera is used in the recognition stage, it is difficult to obtain necessary data due to the limitations of each sensor. In this paper, we propose a method of converting the tracking data of vision into bird's eye view (BEV) coordinates using an equation that projects LiDAR points onto an image, and a method of fusion between LiDAR and vision tracked data. The newly proposed method proved effective in detecting the closest in-path vehicle (CIPV) in various situations. In addition, in experiments following the EuroNCAP autonomous emergency braking (AEB) test protocol using the fusion result, AEB performance improved through better recognition performance compared with using only LiDAR. The performance of the proposed method was demonstrated through actual vehicle tests in various scenarios. Consequently, we are convinced that the newly proposed sensor fusion method significantly improves the ACC function in autonomous maneuvering. We expect that this improvement in perception performance will contribute to improving the overall stability of ACC.
Submitted 25 March, 2021;
originally announced March 2021.
-
Design of heterogeneous multi-agent system for distributed computation
Authors:
Jin Gyu Lee,
Hyungbo Shim
Abstract:
A group behavior of a heterogeneous multi-agent system is studied which obeys an "average of individual vector fields" under strong couplings among the agents. Under stability of the averaged dynamics (not asking stability of individual agents), the behavior of the heterogeneous multi-agent system can be estimated by the solution to the averaged dynamics. The idea that follows is to "design" each individual agent's dynamics such that the averaged dynamics performs the desired task. A few applications are discussed including estimation of the number of agents in a network, distributed least-squares or median solver, distributed optimization, distributed state estimation, and robust synchronization of coupled oscillators. Since stability of the averaged dynamics makes the initial conditions forgotten as time goes on, these algorithms are initialization-free and suitable for plug-and-play operation. Finally, nonlinear couplings are also considered, which potentially asserts that enforced synchronization gives rise to an emergent behavior of a heterogeneous multi-agent system.
Submitted 19 September, 2021; v1 submitted 31 December, 2020;
originally announced January 2021.
-
Synchronization with prescribed transient behavior: Heterogeneous multi-agent systems under funnel coupling Extended arXiv version
Authors:
Jin Gyu Lee,
Stephan Trenn,
Hyungbo Shim
Abstract:
In this paper, we introduce a nonlinear time-varying coupling law, which can be designed in a fully decentralized manner and achieves approximate synchronization with arbitrary precision, under only mild assumptions on the individual vector fields and the underlying (undirected) graph structure. The proposed coupling law is motivated by the so-called funnel control method studied in adaptive control under the observation that arbitrary precision synchronization can be achieved for heterogeneous multi-agent systems by a high-gain coupling; consequently we call our novel synchronization method `(node-wise) funnel coupling.' By adjusting the conventional proof technique in the funnel control study, we are even able to obtain asymptotic synchronization with the same funnel coupling law. Moreover, the emergent collective behavior that arises for a heterogeneous multi-agent system when enforcing arbitrary precision synchronization by the proposed funnel coupling law, is analyzed in this paper. In particular, we introduce a single scalar dynamics called `emergent dynamics' which describes the emergent synchronized behavior of the multi-agent system under funnel coupling. Characterization of the emergent dynamics is important because, for instance, one can design the emergent dynamics first such that the solution trajectory behaves as desired, and then, provide a design guideline to each agent so that the constructed vector fields yield the desired emergent dynamics. We illustrate this idea via the example of a distributed median solver based on funnel coupling.
Submitted 11 October, 2021; v1 submitted 28 December, 2020;
originally announced December 2020.
-
Adaptive Charging Networks: A Framework for Smart Electric Vehicle Charging
Authors:
Zachary J. Lee,
George Lee,
Ted Lee,
Cheng Jin,
Rand Lee,
Zhi Low,
Daniel Chang,
Christine Ortega,
Steven H. Low
Abstract:
We describe the architecture and algorithms of the Adaptive Charging Network (ACN), which was first deployed on the Caltech campus in early 2016 and is currently operating at over 100 other sites in the United States. The architecture enables real-time monitoring and control and supports electric vehicle (EV) charging at scale. The ACN adopts a flexible Adaptive Scheduling Algorithm based on convex optimization and model predictive control and allows for significant over-subscription of electrical infrastructure. We describe some of the practical challenges in real-world charging systems, including unbalanced three-phase infrastructure, non-ideal battery charging behavior, and quantized control signals. We demonstrate how the Adaptive Scheduling Algorithm handles these challenges, and compare its performance against baseline algorithms from the deadline scheduling literature using real workloads recorded from the Caltech ACN and accurate system models. We find that in these realistic settings, our scheduling algorithm can improve operator profit by 3.4 times over uncontrolled charging and consistently outperforms baseline algorithms when delivering energy in highly congested systems.
Submitted 4 December, 2020;
originally announced December 2020.