Search | arXiv e-print repository

Dual-sided Peltier Elements for Rapid Thermal Feedback in Wearables

Authors: Seongjun Kang, Gwangbin Kim, Seokhyun Hwang, Jeongju Park, Ahmed Elsharkawy, SeungJun Kim

Abstract: This paper introduces a motor-driven Peltier device designed to deliver immediate thermal sensations within extended reality (XR) environments. The system incorporates eight motor-driven Peltier elements, facilitating swift transitions between warm and cool sensations by rotating preheated or cooled elements to opposite sides. A multi-layer structure, comprising aluminum and silicone layers, ensures user comfort and safety while maintaining optimal temperatures for thermal stimuli. Time-temperature characteristic analysis demonstrates the system's ability to provide warm and cool sensations efficiently, with a dual-sided lifetime of up to 206 seconds at a 2V input. Our system design is adaptable to various body parts and can be synchronized with corresponding visual stimuli to enhance the immersive sensation of virtual object interaction and information delivery.

Submitted 20 May, 2024; originally announced May 2024.

Comments: 3 pages, 4 figures, ICRA Wearable Workshop 2024 - 1st Workshop on Advancing Wearable Devices and Applications through Novel Design, Sensing, Actuation, and AI

arXiv:2403.14126 [pdf, other]

Sub-Nyquist Sampling OFDM Radar With a Time-Frequency Phase-Coded Waveform

Authors: Seonghyeon Kang, Kawon Han, Songcheol Hong

Abstract: This paper presents a time-frequency phase-coded sub-Nyquist sampling orthogonal frequency division multiplexing (PC-SNS-OFDM) radar system to reduce the analog-to-digital converter (ADC) sampling rate without any additional hardware or signal processing. The proposed radar divides the transmitted OFDM signal into multiple sub-bands along the frequency axis and provides orthogonality to these sub-bands by multiplying phase codes in both the time and frequency domains. Although the sampling rate is reduced by the factor of the number of sub-bands, the sub-bands above the sampling rate are folded into the lowest one due to aliasing. In the process of restoring the signals in folded sub-bands to those in full signal bands, the proposed PC-SNS-OFDM radar effectively eliminates symbol-mismatch noise while introducing trade-offs in the range and Doppler ambiguities. The utilization of phase codes in both the frequency and time domains provides flexible control of the range and Doppler ambiguities. It also improves the signal-to-noise ratio (SNR) of detected targets compared to an earlier sub-Nyquist sampling OFDM radar system. This is validated with simulations and experiments under various sub-Nyquist sampling rates.

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2402.16153 [pdf, other]

ChatMusician: Understanding and Generating Music Intrinsically with LLM

Authors: Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, Shangda Wu, Tianhao Shen, Ge Zhang, Yuhang Wu, Cong Liu, Ziya Zhou, Ziyang Ma, Liumeng Xue, Ziyu Wang, Qin Liu, Tianyu Zheng, Yizhi Li, Yinghao Ma, Yiming Liang, Xiaowei Chi, Ruibo Liu, Zili Wang, Pengfei Li, Jingcheng Wu, Chenghua Lin, Qifeng Liu, et al. (10 additional authors not shown)

Abstract: While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc, surpassing GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 on zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B token music-language corpora MusicPile, the collected MusicTheoryBench, code, model and demo in GitHub.

Submitted 25 February, 2024; originally announced February 2024.

Comments: GitHub: https://shanghaicannon.github.io/ChatMusician/

arXiv:2402.03517 [pdf, other]

Spatially Consistent Air-to-Ground Channel Modeling via Generative Neural Networks

Authors: Amedeo Giuliani, Rasoul Nikbakht, Giovanni Geraci, Seongjoon Kang, Angel Lozano, Sundeep Rangan

Abstract: This article proposes a generative neural network architecture for spatially consistent air-to-ground channel modeling. The approach considers the trajectories of uncrewed aerial vehicles along typical urban paths, capturing spatial dependencies within received signal strength (RSS) sequences from multiple cellular base stations (gNBs). Through the incorporation of conditioning data, the model accurately discriminates between gNBs and drives the correlation matrix distance between real and generated sequences to minimal values. This enables evaluating performance and mobility management metrics with spatially (and by extension temporally) consistent RSS values, rather than independent snapshots. For some tasks underpinned by these metrics, say handovers, consistency is essential.

Submitted 5 February, 2024; originally announced February 2024.

Comments: To appear in IEEE Wireless Communications Letters

arXiv:2401.13276 [pdf, other]

SCNet: Sparse Compression Network for Music Source Separation

Authors: Weinan Tong, Jiaxu Zhu, Jun Chen, Shiyin Kang, Tao Jiang, Yang Li, Zhiyong Wu, Helen Meng

Abstract: Deep learning-based methods have made significant achievements in music source separation. However, obtaining good results while maintaining a low model complexity remains challenging in super wide-band music source separation. Previous works either overlook the differences in subbands or inadequately address the problem of information loss when generating subband features. In this paper, we propose SCNet, a novel frequency-domain network to explicitly split the spectrogram of the mixture into several subbands and introduce a sparsity-based encoder to model different frequency bands. We use a higher compression ratio on subbands with less information to improve the information density and focus on modeling subbands with more information. In this way, the separation performance can be significantly improved using lower computational consumption. Experiment results show that the proposed model achieves a signal to distortion ratio (SDR) of 9.0 dB on the MUSDB18-HQ dataset without using extra data, which outperforms state-of-the-art methods. Specifically, SCNet's CPU inference time is only 48% of HT Demucs, one of the previous state-of-the-art models.

Submitted 24 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP 2024

arXiv:2401.07532 [pdf, other]

Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long Multi-track Symbolic Music Generation

Authors: Zhiwei Lin, Jun Chen, Boshi Tang, Binzhu Sha, Jing Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu, Helen Meng

Abstract: Variational Autoencoders (VAEs) constitute a crucial component of neural symbolic music generation, among which some works have yielded outstanding results and attracted considerable attention. Nevertheless, previous VAEs still encounter issues with overly long feature sequences and generated results lack contextual coherence, thus the challenge of modeling long multi-track symbolic music still remains unaddressed. To this end, we propose Multi-view MidiVAE, as one of the pioneers in VAE methods that effectively model and generate long multi-track symbolic music. The Multi-view MidiVAE utilizes the two-dimensional (2-D) representation, OctupleMIDI, to capture relationships among notes while reducing the feature sequences length. Moreover, we focus on instrumental characteristics and harmony as well as global and local information about the musical composition by employing a hybrid variational encoding-decoding strategy to integrate both Track- and Bar-view MidiVAE features. Objective and subjective experimental results on the CocoChorales dataset demonstrate that, compared to the baseline, Multi-view MidiVAE exhibits significant improvements in terms of modeling long multi-track symbolic music.

Submitted 15 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP 2024

arXiv:2312.06637 [pdf, other]

A Geometry-based Stochastic Wireless Channel Model using Channel Images

Authors: Seongjoon Kang

Abstract: Due to the high complexity of geometry-deterministic wireless channel modeling and the difficulty in its implementation, geometry-based stochastic channel modeling (GBSM) approaches have been used to evaluate wireless systems. This paper introduces a new method to model any GBSM by training a generative neural network using images formed by channel parameters. In this work, we obtain channel parameters from the ray-tracing simulation in a specific area and process them in the form of images to train any generative model. Through a case study, we confirm that the use of channel images completes the training of the generative model within 10 epochs. In addition, we show that the trained generative model based on channel images faithfully represents the distributions of the original data under the assigned conditions. Therefore, we argue that the proposed data-to-image methods will facilitate the modeling of GBSM using any generative neural network under general wireless conditions.

Submitted 21 June, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2311.12965 [pdf, other]

Terrestrial-Satellite Spectrum Sharing in the Upper Mid-Band with Interference Nulling

Authors: Seongjoon Kang, Giovanni Geraci, Marco Mezzavilla, Sundeep Rangan

Abstract: The growing demand for broader bandwidth in cellular networks has turned the upper mid-band (7-24 GHz) into a focal point for expansion. However, the integration of terrestrial cellular and incumbent satellite services, particularly in the 12 GHz band, poses significant interference challenges. This paper investigates the interference dynamics in terrestrial-satellite coexistence scenarios and introduces a novel beamforming approach that leverages available ephemeris data for dynamic interference mitigation. By establishing spatial radiation nulls directed towards visible satellites, our technique ensures the protection of satellite uplink communications without markedly compromising terrestrial downlink quality. Through a practical case study, we demonstrate that our approach maintains the satellite uplink signal-to-noise ratio (SNR) degradation under 0.1 dB and incurs only a negligible SNR penalty for the terrestrial downlink. Our findings offer a promising pathway for efficient spectrum sharing in the upper mid-band, fostering a concurrent enhancement in both terrestrial and satellite network capacity.

Submitted 6 March, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

arXiv:2310.19264 [pdf, other]

Sound of Story: Multi-modal Storytelling with Audio

Authors: Jaeyeon Bae, Seokhoon Jeong, Seokun Kang, Namgi Han, Jae-Yon Lee, Hyounghun Kim, Taehwan Kim

Abstract: Storytelling is multi-modal in the real world. When one tells a story, one may use all of the visualizations and sounds along with the story itself. However, prior studies on storytelling datasets and tasks have paid little attention to sound even though sound also conveys meaningful semantics of the story. Therefore, we propose to extend story understanding and telling areas by establishing a new component called "background sound" which is story context-based audio without any linguistic information. For this purpose, we introduce a new dataset, called "Sound of Story (SoS)", which has paired image and text sequences with corresponding sound or background music for a story. To the best of our knowledge, this is the largest well-curated dataset for storytelling with sound. Our SoS dataset consists of 27,354 stories with 19.6 images per story and 984 hours of speech-decoupled audio such as background music and other sounds. As benchmark tasks for storytelling with sound and the dataset, we propose retrieval tasks between modalities, and audio generation tasks from image-text sequences, introducing strong baselines for them. We believe the proposed dataset and tasks may shed light on the multi-modal understanding of storytelling in terms of sound. Downloading the dataset and baseline codes for each task will be released in the link: https://github.com/Sosdatasets/SoS_Dataset.

Submitted 30 October, 2023; originally announced October 2023.

Comments: Findings of EMNLP 2023, project: https://github.com/Sosdatasets/SoS_Dataset/

arXiv:2310.04010 [pdf, other]

Excision And Recovery: Visual Defect Obfuscation Based Self-Supervised Anomaly Detection Strategy

Authors: YeongHyeon Park, Sungho Kang, Myung Jin Kim, Yeonho Lee, Hyeong Seok Kim, Juneho Yi

Abstract: Due to scarcity of anomaly situations in the early manufacturing stage, an unsupervised anomaly detection (UAD) approach is widely adopted which only uses normal samples for training. This approach is based on the assumption that the trained UAD model will accurately reconstruct normal patterns but struggles with unseen anomalous patterns. To enhance the UAD performance, reconstruction-by-inpainting based methods have recently been investigated, especially on the masking strategy of suspected defective regions. However, there are still issues to overcome: 1) time-consuming inference due to multiple masking, 2) output inconsistency by random masking strategy, and 3) inaccurate reconstruction of normal patterns when the masked area is large. Motivated by this, we propose a novel reconstruction-by-inpainting method, dubbed Excision And Recovery (EAR), that features single deterministic masking based on the ImageNet pre-trained DINO-ViT and visual obfuscation for hint-providing. Experimental results on the MVTec AD dataset show that deterministic masking by pre-trained attention effectively cuts out suspected defective regions and resolves the aforementioned issues 1 and 2. Also, hint-providing by mosaicing proves to enhance the UAD performance more than emptying those regions by binary masking, thereby overcoming issue 3. Our approach achieves a high UAD performance without any change of the neural network structure. Thus, we suggest that EAR be adopted in various manufacturing industries as a practically deployable solution.

Submitted 9 November, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: 10 pages, 5 figures, 5 tables

arXiv:2309.13077 [pdf, other]

A Differentiable Framework for End-to-End Learning of Hybrid Structured Compression

Authors: Moonjung Eo, Suhyun Kang, Wonjong Rhee

Abstract: Filter pruning and low-rank decomposition are two of the foundational techniques for structured compression. Although recent efforts have explored hybrid approaches aiming to integrate the advantages of both techniques, their performance gains have been modest at best. In this study, we develop a Differentiable Framework (DF) that can express filter selection, rank selection, and budget constraint into a single analytical formulation. Within the framework, we introduce DML-S for filter selection, integrating scheduling into existing mask learning techniques. Additionally, we present DTL-S for rank selection, utilizing a singular value thresholding operator. The framework with DML-S and DTL-S offers a hybrid structured compression methodology that facilitates end-to-end learning through gradient-based optimization. Experimental results demonstrate the efficacy of DF, surpassing state-of-the-art structured compression methods. Our work establishes a robust and versatile avenue for advancing structured compression techniques.

Submitted 20 September, 2023; originally announced September 2023.

Comments: 11 pages, 5 figures, 6 tables

arXiv:2309.11977 [pdf, other]

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Authors: Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng

Abstract: Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.

Submitted 9 April, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

Comments: Accepted by ICASSP 2024

arXiv:2309.03038 [pdf, other]

doi 10.1109/OJCOMS.2024.3373368

Cellular Wireless Networks in the Upper Mid-Band

Authors: Seongjoon Kang, Marco Mezzavilla, Sundeep Rangan, Arjuna Madanayake, Satheesh Bojja Venkatakrishnan, Gregory Hellbourg, Monisha Ghosh, Hamed Rahmani, Aditya Dhananjay

Abstract: The upper mid-band - roughly from 7 to 24 GHz - has attracted considerable recent interest for new cellular services. This frequency range has vastly more spectrum than the highly congested bands below 7 GHz while offering more favorable propagation and coverage than the millimeter wave (mmWave) frequencies. The upper mid-band can thus provide a powerful and complementary frequency range to balance coverage and capacity. Realizing the full potential of these bands, however, will require fundamental changes to the design of cellular systems. Most importantly, spectrum will likely need to be shared with incumbents including communication satellites, military RADAR, and radio astronomy. Also, the upper mid-band is simply a vast frequency range. Due to this wide bandwidth, combined with the directional nature of transmission and intermittent occupancy of incumbents, cellular systems will need to be agile to sense and intelligently use large spatial and frequency degrees of freedom. This paper attempts to provide an initial assessment of the feasibility and potential gains of wideband cellular systems operating in the upper mid-band. The study includes: (1) a system study to assess potential gains of multi-band systems in a representative dense urban environment and illustrate the value of wide band system with dynamic frequency selectivity; (2) an evaluation of potential cross interference between satellites and terrestrial cellular services and interference nulling to reduce that interference; and (3) design and evaluation of a compact multi-band antenna array structure. Leveraging these preliminary results, we identify potential future research directions to realize next-generation systems in these frequencies.

Submitted 6 March, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

Comments: 18 pages

arXiv:2309.01615 [pdf]

A balanced Memristor-CMOS ternary logic family and its application

Authors: Xiao-Yuan Wang, Jia-Wei Zhou, Chuan-Tao Dong, Xin-Hui Chen, Sanjoy Kumar Nandi, Robert G. Elliman, Sung-Mo Kang, Herbert Ho-Ching Iu

Abstract: The design of balanced ternary digital logic circuits based on memristors and conventional CMOS devices is proposed. First, balanced ternary minimum gate TMIN, maximum gate TMAX and ternary inverters are systematically designed and verified by simulation, and then logic circuits such as ternary encoders, decoders and multiplexers are designed on this basis. Two different schemes are then used to realize the design of functional combinational logic circuits such as a balanced ternary half adder, multiplier, and numerical comparator. Finally, we report a series of comparisons and analyses of the two design schemes, which provide a reference for subsequent research and development of three-valued logic circuits.

Submitted 4 September, 2023; originally announced September 2023.

Comments: 15 pages, 30 figures

arXiv:2308.16836 [pdf, other]

Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information

Authors: Shaohuan Zhou, Shun Lei, Weiya You, Deyi Tuo, Yuren You, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses bidirectional encoder representation from Transformers (BERT) derived semantic embeddings to improve the expressiveness of the synthesized singing voice. Based on the main architecture of recently proposed VISinger, we put forward several specific designs for expressive singing voice synthesis. First, different from the previous SVS models, we use text representation of lyrics extracted from pre-trained BERT as additional input to the model. The representation contains information about semantics of the lyrics, which could help SVS system produce more expressive and natural voice. Second, we further introduce an energy predictor to stabilize the synthesized voice and model the wider range of energy variations that also contribute to the expressiveness of singing voice. Last but not the least, to attenuate the off-key issues, the pitch predictor is re-designed to predict the real to note pitch ratio. Both objective and subjective experimental results indicate that the proposed SVS system can produce singing voice with higher-quality outperforming VISinger.

Submitted 31 August, 2023; originally announced August 2023.

arXiv:2308.16593 [pdf, other]

Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis

Authors: Weiqin Li, Shun Lei, Qiaochu Huang, Yixuan Zhou, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: The spontaneous behavior that often occurs in conversations makes speech more human-like compared to reading-style. However, synthesizing spontaneous-style speech is challenging due to the lack of high-quality spontaneous datasets and the high cost of labeling spontaneous behavior. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels. In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behaviors labels in speech. Moreover, a linguistic-aware encoder is used to model the relationship between each sentence in the conversation. Experimental results indicate that our proposed method achieves superior expressive speech synthesis performance with the ability to model spontaneous behavior in spontaneous-style speech and predict reasonable spontaneous behavior from text.

Submitted 31 August, 2023; originally announced August 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2308.16577 [pdf, other]

doi 10.21437/Interspeech.2022-131

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

Authors: Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu, Helen Meng

Abstract: For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) plays an important role in producing natural and intelligible speech. Although inter-utterance linguistic information can influence the speech interpretation of the target utterance, previous works on PSP mainly focus on utilizing intra-utterance linguistic information of the current utterance only. This work proposes to use inter-utterance linguistic information to improve the performance of PSP. Multi-level contextual information, which includes both inter-utterance and intra-utterance linguistic information, is extracted by a hierarchical encoder from character level, utterance level and discourse level of the input text. Then a multi-task learning (MTL) decoder predicts prosodic boundaries from multi-level contextual information. Objective evaluation results on two datasets show that our method achieves better F1 scores in predicting prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH). It demonstrates the effectiveness of using multi-level contextual information for PSP. Subjective preference tests also indicate that the naturalness of synthesized speech is improved.

Submitted 31 August, 2023; originally announced August 2023.

Comments: Accepted by Interspeech 2022

arXiv:2308.14595 [pdf, other]

Neural Network Training Strategy to Enhance Anomaly Detection Performance: A Perspective on Reconstruction Loss Amplification

Authors: YeongHyeon Park, Sungho Kang, Myung ** Kim, Hyeonho Jeong, Hyunkyu Park, Hyeong Seok Kim, Juneho Yi

Abstract: Unsupervised anomaly detection (UAD) is a widely adopted approach in industry due to rare anomaly occurrences and data imbalance. A desirable characteristic of a UAD model is a contained generalization ability, which excels in the reconstruction of seen normal patterns but struggles with unseen anomalies. Recent studies have sought to contain the generalization capability of their UAD models in reconstruction from different perspectives, such as design of neural network (NN) structure and training strategy. In contrast, we note that containment of generalization ability in reconstruction can also be obtained simply from a steep-shaped loss landscape. Motivated by this, we propose a loss landscape sharpening method by amplifying the reconstruction loss, dubbed Loss AMPlification (LAMP). LAMP deforms the loss landscape into a steep shape so the reconstruction error on unseen anomalies becomes greater. Accordingly, the anomaly detection performance is improved without any change of the NN architecture. Our findings suggest that LAMP can be easily applied to any reconstruction error metrics in UAD settings where the reconstruction model is trained with anomaly-free samples only.

Submitted 28 August, 2023; originally announced August 2023.

Comments: 5 pages, 4 figures, 2 tables

arXiv:2308.02076 [pdf, other]

doi 10.1109/TRS.2023.3333430

Sub-Nyquist Sampling OFDM Radar

Authors: Kawon Han, SeongHyeon Kang, Songcheol Hong

Abstract: In this paper, we propose a sub-Nyquist sampling (SNS) orthogonal frequency-division multiplexing (OFDM) radar system capable of reducing the analog-to-digital converter (ADC) sampling rate in OFDM radar without any additional manipulations of its hardware and waveform. To this end, the proposed system utilizes the ADC sampling rate of B/L to sample the received baseband signal with a bandwidth of B, where L is a positive proper divisor of the number of subcarriers. This divides the baseband signal into L sub-bands, folding into a sub-Nyquist frequency band due to aliasing. By leveraging known modulation symbols of the transmitted signal, the folded signal can be unfolded to the full-band signal. This allows an estimation of target ranges with the range resolution of the full signal bandwidth B without the degradation of the maximum unambiguous range. During the signal-unfolding process, the signals from other sub-bands remain as symbol-mismatch noise (SMN), which significantly degrades the signal-to-noise ratio (SNR) of the detected targets. It also causes weaker targets to be submerged under the noise in range profiles. To resolve this, a symbol-mismatch noise cancellation (SMNC) technique is also proposed, which reconstructs the interfering signals from the other sub-bands using the detected targets and subtracts them from the unfolded signal. As a result, the proposed sub-Nyquist sampling OFDM radar and corresponding signal processing technique enable a reduction in the ADC sampling rate by the ratio of L while incurring only a 10 log10 L increase in the noise due to noise folding. This is validated through simulations and measurements with various sub-sampling ratios.

Submitted 3 August, 2023; originally announced August 2023.

Comments: 12 pages, 13 figures

Journal ref: IEEE Transactions on Radar Systems, vol. 1, pp. 669-680, 2023
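
The trade-off stated in the abstract above (ADC sampling rate reduced to B/L at the cost of a 10 log10 L noise increase) can be tabulated with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function name `sns_tradeoff` and the 1 GHz bandwidth are hypothetical, and the sketch does not check that L is a proper divisor of the number of subcarriers, as the abstract requires.

```python
import math


def sns_tradeoff(bandwidth_hz: float, num_subbands: int) -> tuple[float, float]:
    """Return (ADC sampling rate in Hz, noise increase in dB) for L sub-bands.

    Hypothetical helper: it only evaluates the two quantities named in the
    abstract, B/L and 10*log10(L).
    """
    if num_subbands < 1:
        raise ValueError("num_subbands (L) must be a positive integer")
    adc_rate_hz = bandwidth_hz / num_subbands            # sampling rate B/L
    noise_increase_db = 10.0 * math.log10(num_subbands)  # noise folding penalty
    return adc_rate_hz, noise_increase_db


if __name__ == "__main__":
    B = 1e9  # assumed signal bandwidth of 1 GHz, for illustration only
    for L in (1, 2, 4, 8):
        rate, penalty = sns_tradeoff(B, L)
        print(f"L={L}: ADC rate {rate / 1e6:.1f} MHz, noise +{penalty:.2f} dB")
```

Each doubling of L halves the required ADC rate and adds roughly 3 dB of folded noise, which is the cost the abstract describes as the only penalty remaining after symbol-mismatch noise cancellation.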

arXiv:2307.16012 [pdf, other]

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Authors: Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Xixin Wu, Shiyin Kang, Helen Meng

Abstract: Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting the style embeddings at one single scale from the information within the current sentence. However, context information in neighboring sentences and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis, to capture and predict styles at different levels from a wider range of context rather than a sentence. Two sub-modules, including multi-scale style extractor and multi-scale style predictor, are trained together with a FastSpeech 2 based acoustic model. The predictor is designed to explore the hierarchical context information by considering structural relationships in context and predict style embeddings at global-level, sentence-level and subword-level. The extractor extracts multi-scale style embedding from the ground-truth speech and explicitly guides the style prediction. Evaluations on both in-domain and out-of-domain audiobook datasets demonstrate that the proposed method significantly outperforms the three baselines. In addition, we conduct the analysis of the context information and multi-scale style representations that have never been discussed before.

Submitted 29 July, 2023; originally announced July 2023.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2304.12704 [pdf, other]

GTN-Bailando: Genre Consistent Long-Term 3D Dance Generation based on Pre-trained Genre Token Network

Authors: Haolin Zhuang, Shun Lei, Long Xiao, Weiqin Li, Liyang Chen, Sicheng Yang, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Music-driven 3D dance generation has become an intensive research topic in recent years with great potential for real-world applications. Most existing methods lack the consideration of genre, which results in genre inconsistency in the generated dance movements. In addition, the correlation between the dance genre and the music has not been investigated. To address these issues, we propose a genre-consistent dance generation framework, GTN-Bailando. First, we propose the Genre Token Network (GTN), which infers the genre from music to enhance the genre consistency of long-term dance generation. Second, to improve the generalization capability of the model, the strategy of pre-training and fine-tuning is adopted. Experimental results on the AIST++ dataset show that the proposed dance generation framework outperforms state-of-the-art methods in terms of motion quality and genre consistency.

Submitted 25 April, 2023; originally announced April 2023.

Comments: Accepted by ICASSP 2023. Demo page: https://im1eon.github.io/ICASSP23-GTNB-DG/

arXiv:2304.09607 [pdf, other]

CB-Conformer: Contextual biasing Conformer for biased word recognition

Authors: Yaoxun Xu, Baiji Liu, Qiaochu Huang, Xingchen Song, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Due to the mismatch between the source and target domains, how to better utilize the biased word information to improve the performance of the automatic speech recognition model in the target domain becomes a hot research topic. Previous approaches either decode with a fixed external language model or introduce a sizeable biasing module, which leads to poor adaptability and slow inference. In this work, we propose CB-Conformer to improve biased word recognition by introducing the Contextual Biasing Module and the Self-Adaptive Language Model to vanilla Conformer. The Contextual Biasing Module combines audio fragments and contextual information, with only 0.2% model parameters of the original Conformer. The Self-Adaptive Language Model modifies the internal weights of biased words based on their recall and precision, resulting in a greater focus on biased words and more successful integration with the automatic speech recognition model than the standard fixed language model. In addition, we construct and release an open-source Mandarin biased-word dataset based on WenetSpeech. Experiments indicate that our proposed method brings a 15.34% character error rate reduction, a 14.13% biased word recall increase, and a 6.80% biased word F1-score increase compared with the base Conformer.

Submitted 25 April, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

arXiv:2304.06359 [pdf, other]

Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis

Authors: Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Recent advances in text-to-speech have significantly improved the expressiveness of synthesized speech. However, it is still challenging to generate speech with contextually appropriate and coherent speaking style for multi-sentence text in audiobooks. In this paper, we propose a context-aware coherent speaking style prediction method for audiobook speech synthesis. To predict the style embedding of the current utterance, a hierarchical transformer-based context-aware style predictor with a mixture attention mask is designed, considering both text-side context information and speech-side style information of previous speeches. Based on this, we can generate long-form speech with coherent style and prosody sentence by sentence. Objective and subjective evaluations on a Mandarin audiobook dataset demonstrate that our proposed model can generate speech with more expressive and coherent speaking style than baselines, for both single-sentence and multi-sentence test.

Submitted 13 April, 2023; originally announced April 2023.

Comments: Accepted by ICASSP 2023

arXiv:2304.03295 [pdf, other]

Automatic Detection of Reactions to Music via Earable Sensing

Authors: Euihyoek Lee, Chulhong Min, Jeaseung Lee, ** Yu, Seungwoo Kang

Abstract: We present GrooveMeter, a novel system that automatically detects vocal and motion reactions to music via earable sensing and supports music engagement-aware applications. To this end, we use smart earbuds as sensing devices, which are already widely used for music listening, and devise reaction detection techniques by leveraging an inertial measurement unit (IMU) and a microphone on earbuds. To explore reactions in daily music-listening situations, we collect the first dataset of its kind, MusicReactionSet, containing 926 minutes of IMU and audio data from 30 participants. With the dataset, we discover a set of unique challenges in detecting music listening reactions accurately and robustly using audio and motion sensing. We devise sophisticated processing pipelines to make reaction detection accurate and efficient. We present a comprehensive evaluation to examine the performance of reaction detection and system cost. It shows that GrooveMeter achieves macro F1 scores of 0.89 for vocal reaction and 0.81 for motion reaction with leave-one-subject-out cross-validation. More importantly, GrooveMeter shows higher accuracy and robustness compared to alternative methods. We also show that our filtering approach reduces 50% or more of the energy overhead. Finally, we demonstrate the potential use cases through a case study.

Submitted 6 April, 2023; originally announced April 2023.

arXiv:2303.10081 [pdf, other]

Verification and Synthesis of Robust Control Barrier Functions: Multilevel Polynomial Optimization and Semidefinite Relaxation

Authors: Shucheng Kang, Yuxiao Chen, Heng Yang, Marco Pavone

Abstract: We study the problem of verification and synthesis of robust control barrier functions (CBF) for control-affine polynomial systems with bounded additive uncertainty and convex polynomial constraints on the control. We first formulate robust CBF verification and synthesis as multilevel polynomial optimization problems (POP), where verification optimizes -- in three levels -- the uncertainty, control, and state, while synthesis additionally optimizes the parameter of a chosen parametric CBF candidate. We then show that, by invoking the KKT conditions of the inner optimizations over uncertainty and control, the verification problem can be simplified as a single-level POP and the synthesis problem reduces to a min-max POP. This reduction leads to multilevel semidefinite relaxations. For the verification problem, we apply Lasserre's hierarchy of moment relaxations. For the synthesis problem, we draw connections to existing relaxation techniques for robust min-max POP, which first use sum-of-squares programming to find increasingly tight polynomial lower bounds to the unknown value function of the verification POP, and then call Lasserre's hierarchy again to maximize the lower bounds. Both semidefinite relaxations guarantee asymptotic global convergence to optimality. We provide an in-depth study of our framework on the controlled Van der Pol Oscillator, both with and without additive uncertainty.

Submitted 21 July, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Comments: Accepted to IEEE Conference on Decision and Control (CDC) 2023

arXiv:2303.07206 [pdf]

Toward A Dynamic Comfort Model for Human-Building Interaction in Grid-Interactive Efficient Buildings: Supported by Field Data

Authors: SungKu Kang, Kunind Sharma, Maharshi Pathak, Emily Casavant, Katherine Bassett, Misha Pavel, David Fannon, Michael Kane

Abstract: Controlling building electric loads could alleviate the increasing grid strain caused by the adoption of renewables and electrification. However, current approaches that automatically set back thermostats on the hottest day compromise their efficacy by neglecting human-building interaction (HBI). This study aims to define challenges and opportunities for developing engineering models of HBI to be used in the design of controls for grid-interactive efficient buildings (GEBs). Building system data and measured and just-in-time surveyed psychophysiological data were collected from 41 participants in 20 homes from April-September. ASHRAE Standard 55 thermal comfort models for building design were evaluated with these data. Increased error bias was observed with increasing spatiotemporal temperature variations. This is unsurprising, considering these models neglect such variance, but it calls into question their suitability for GEBs controlling thermostat setpoints, given the observed 4°F intra-home spatial temperature variation. The results highlight opportunities for reducing these biases in GEBs through a paradigm shift to modeling discomfort instead of comfort, increasing use of low-cost sensors, and models that account for the observed dynamic occupant behavior: of the thermostat setpoint overrides made within 140 minutes of a previous setpoint change, 95% of small changes ( 2°F) were made within 120 minutes, while 95% of larger changes ( 10°F) were made within only 70 minutes.

Submitted 10 March, 2023; originally announced March 2023.

Comments: 17 pages, 11 figures

arXiv:2301.06200 [pdf, other]

Efficiently Computing Sparse Fourier Transforms of $q$-ary Functions

Authors: Yigit Efe Erginbas, Justin Singh Kang, Amirali Aghazadeh, Kannan Ramchandran

Abstract: Fourier transformations of pseudo-Boolean functions are popular tools for analyzing functions of binary sequences. Real-world functions often have structures that manifest in a sparse Fourier transform, and previous works have shown that under the assumption of sparsity the transform can be computed efficiently. But what if we want to compute the Fourier transform of functions defined over a $q$-ary alphabet? These types of functions arise naturally in many areas including biology. A typical workaround is to encode the $q$-ary sequence in binary; however, this approach is computationally inefficient and fundamentally incompatible with the existing sparse Fourier transform techniques. Herein, we develop a sparse Fourier transform algorithm specifically for $q$-ary functions of length $n$ sequences, dubbed $q$-SFT, which provably computes an $S$-sparse transform with vanishing error as $q^n \rightarrow \infty$ in $O(Sn)$ function evaluations and $O(S n^2 \log q)$ computations, where $S = q^{nδ}$ for some $δ < 1$. Under certain assumptions, we show that for fixed $q$, a robust version of $q$-SFT has a sample complexity of $O(Sn^2)$ and a computational complexity of $O(Sn^3)$ with the same asymptotic guarantees. We present numerical simulations on synthetic and real-world RNA data, demonstrating the scalability of $q$-SFT to massively high dimensional $q$-ary functions.

Submitted 15 January, 2023; originally announced January 2023.

Comments: 29 pages, 3 figures
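
For orientation, the object that the abstract above computes sparsely can be written down by brute force. The sketch below is not $q$-SFT and is not taken from the paper: it assumes the standard discrete Fourier transform over $\mathbb{Z}_q^n$ with a $1/q^n$ normalization (the abstract does not state the convention), and the function name `qary_fourier_transform` is hypothetical. It needs all $q^n$ function evaluations and about $q^{2n}$ operations, which is the cost a sparse algorithm is meant to avoid.

```python
import cmath
import itertools


def qary_fourier_transform(f, q: int, n: int) -> dict:
    """Brute-force Fourier transform of f: {0..q-1}^n -> complex.

    Assumed convention: F[k] = q^(-n) * sum_x f(x) * omega^(-<k, x>),
    with omega = exp(2*pi*i/q) and the inner product taken modulo q.
    """
    omega = cmath.exp(2j * cmath.pi / q)
    points = list(itertools.product(range(q), repeat=n))
    values = {x: f(x) for x in points}  # all q^n function evaluations
    coeffs = {}
    for k in points:
        total = 0j
        for x in points:
            inner = sum(ki * xi for ki, xi in zip(k, x)) % q
            total += values[x] * omega ** (-inner)
        coeffs[k] = total / len(points)
    return coeffs


if __name__ == "__main__":
    q, n, k0 = 3, 2, (1, 2)
    omega = cmath.exp(2j * cmath.pi / q)

    # A single Fourier character: its transform should be 1-sparse.
    def character(x):
        return omega ** (sum(a * b for a, b in zip(k0, x)) % q)

    spectrum = qary_fourier_transform(character, q, n)
    support = [k for k, v in spectrum.items() if abs(v) > 1e-9]
    print(support)
```

A function that is a single character has exactly one nonzero coefficient under this convention, which is the simplest instance of the sparsity the abstract exploits.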

arXiv:2209.08964 [pdf, other]

Coexistence of UAVs and Terrestrial Users in Millimeter-Wave Urban Networks

Authors: Seongjoon Kang, Marco Mezzavilla, Angel Lozano, Giovanni Geraci, Sundeep Rangan, Vasilii Semkin, William Xia, Giuseppe Loianno

Abstract: 5G millimeter-wave (mmWave) cellular networks are in the early phase of commercial deployments and present a unique opportunity for robust, high-data-rate communication to unmanned aerial vehicles (UAVs). A fundamental question is whether and how mmWave networks designed for terrestrial users should be modified to serve UAVs. The paper invokes realistic cell layouts, antenna patterns, and channel models trained from extensive ray tracing data to assess the performance of various network alternatives. Importantly, the study considers the addition of dedicated uptilted rooftop-mounted cells for aerial coverage, as well as novel spectrum sharing modes between terrestrial and aerial network operators. The effects of power control and of multiuser multiple-input multiple-output are also studied.

Submitted 20 September, 2022; v1 submitted 19 September, 2022; originally announced September 2022.

arXiv:2208.13454 [pdf, other]

Minimum Input Design for Direct Data-driven Property Identification of Unknown Linear Systems

Authors: Shubo Kang, Keyou You

Abstract: In a direct data-driven approach, this paper studies the property identification (ID) problem to analyze whether an unknown linear system has a property of interest, e.g., stabilizability and structural properties. In sharp contrast to the model-based analysis, we approach it by directly using the input and state feedback data of the unknown system. Via a new concept of sufficient richness of input sectional data, we first establish the necessary and sufficient condition for the minimum input design to excite the system for property ID. Specifically, the input sectional data is sufficiently rich for property ID if and only if it spans a linear subspace that contains a property dependent minimum linear subspace, any basis of which can also be easily used to form the minimum excitation input. Interestingly, we show that many structural properties can be identified with the minimum input that is however unable to identify the explicit system model. Overall, our results rigorously quantify the advantages of the direct data-driven analysis over the model-based analysis for linear systems in terms of data efficiency.

Submitted 29 August, 2022; originally announced August 2022.

arXiv:2207.00934 [pdf, other]

Wireless Channel Prediction in Partially Observed Environments

Authors: Mingsheng Yin, Yaqi Hu, Tommy Azzino, Seongjoon Kang, Marco Mezzavilla, Sundeep Rangan

Abstract: Site-specific radio frequency (RF) propagation prediction increasingly relies on models built from visual data such as cameras and LIDAR sensors. When operating in dynamic settings, the environment may only be partially observed. This paper introduces a method to extract statistical channel models, given partial observations of the surrounding environment. We propose a simple heuristic algorithm that performs ray tracing on the partial environment and then uses machine-learning trained predictors to estimate the channel and its uncertainty from features extracted from the partial ray tracing results. It is shown that the proposed method can interpolate between fully statistical models when no partial information is available and fully deterministic models when the environment is completely observed. The method can also capture the degree of uncertainty of the propagation predictions depending on the amount of region that has been explored. The methodology is demonstrated in a robotic navigation application simulated on a set of indoor maps with detailed models constructed using state-of-the-art navigation, simultaneous localization and mapping (SLAM), and computer vision methods.

Submitted 2 July, 2022; originally announced July 2022.

arXiv:2204.02743 [pdf, other]

Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Authors: Shun Lei, Yixuan Zhou, Liyang Chen, Jiankun Hu, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Previous works on expressive speech synthesis focus on modelling the mono-scale style embedding from the current sentence or context, but the multi-scale nature of speaking style in human speech is neglected. In this paper, we propose a multi-scale speaking style modelling method to capture and predict multi-scale speaking style for improving the naturalness and expressiveness of synthetic speech. A multi-scale extractor is proposed to extract speaking style embeddings at three different levels from the ground-truth speech, and explicitly guide the training of a multi-scale style predictor based on hierarchical context information. Both objective and subjective evaluations on a Mandarin audiobooks dataset demonstrate that our proposed method can significantly improve the naturalness and expressiveness of the synthesized speech.

Submitted 5 July, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

Comments: Accepted by INTERSPEECH 2022

arXiv:2203.12813 [pdf, other]

Disentangleing Content and Fine-grained Prosody Information via Hybrid ASR Bottleneck Features for Voice Conversion

Authors: Xintao Zhao, Feng Liu, Changhe Song, Zhiyong Wu, Shiyin Kang, Deyi Tuo, Helen Meng

Abstract: Non-parallel data voice conversion (VC) has achieved considerable breakthroughs recently through introducing bottleneck features (BNFs) extracted by the automatic speech recognition (ASR) model. However, the selection of BNFs has a significant impact on the VC result. For example, when extracting BNFs from ASR trained with Cross Entropy loss (CE-BNFs) and feeding them into a neural network to train a VC system, the timbre similarity of converted speech is significantly degraded. If BNFs are extracted from ASR trained using Connectionist Temporal Classification loss (CTC-BNFs), the naturalness of the converted speech may decrease. This phenomenon is caused by the difference in the information contained in BNFs. In this paper, we propose an any-to-one VC method using hybrid bottleneck features extracted from CTC-BNFs and CE-BNFs to complement each other's advantages. Gradient reversal layer and instance normalization were used to extract prosody information from CE-BNFs and content information from CTC-BNFs. Auto-regressive decoder and Hifi-GAN vocoder were used to generate high-quality waveform. Experimental results show that our proposed method achieves higher similarity, naturalness, and quality than the baseline method and reveals the differences between the information contained in CE-BNFs and CTC-BNFs as well as the influence they have on the converted speech.

Submitted 23 March, 2022; originally announced March 2022.

Comments: Accepted by ICASSP 2022

arXiv:2203.12201 [pdf, other]

Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Authors: Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Previous works on expressive speech synthesis mainly focus on the current sentence. The context in adjacent sentences is neglected, resulting in inflexible speaking style for the same text, which lacks speech variations. In this paper, we propose a hierarchical framework to model speaking style from context. A hierarchical context encoder is proposed to explore a wider range of contextual information considering structural relationship in context, including inter-phrase and inter-sentence relations. Moreover, to encourage this encoder to learn style representation better, we introduce a novel training strategy with knowledge distillation, which provides the target for encoder training. Both objective and subjective evaluations on a Mandarin lecture dataset demonstrate that the proposed method can significantly improve the naturalness and expressiveness of the synthesized speech.

Submitted 6 April, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

Comments: Accepted by ICASSP 2022

arXiv:2203.12188 [pdf, other]

FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement

Authors: Jun Chen, Zilin Wang, Deyi Tuo, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: The previously proposed FullSubNet has achieved outstanding performance in the Deep Noise Suppression (DNS) Challenge and attracted much attention. However, it still encounters issues such as input-output mismatch and coarse processing for frequency bands. In this paper, we propose an extended single-channel real-time speech enhancement framework called FullSubNet+ with the following significant improvements. First, we design a lightweight multi-scale time sensitive channel attention (MulCA) module which adopts multi-scale convolution and channel attention mechanism to help the network focus on more discriminative frequency bands for noise reduction. Then, to make full use of the phase information in noisy speech, our model takes all the magnitude, real and imaginary spectrograms as inputs. Moreover, by replacing the long short-term memory (LSTM) layers in the original full-band model with stacked temporal convolutional network (TCN) blocks, we design a more efficient full-band module called full-band extractor. The experimental results on the DNS Challenge dataset show the superior performance of our FullSubNet+, which reaches the state-of-the-art (SOTA) performance and outperforms other existing speech enhancement approaches.

Submitted 26 March, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

Comments: Accepted by ICASSP 2022

arXiv:2203.05125 [pdf, ps, other]

A Lifted $\ell_1 $ Framework for Sparse Recovery

Authors: Yaghoub Rahimi, Sung Ha Kang, Yifei Lou

Abstract: Motivated by re-weighted $\ell_1$ approaches for sparse recovery, we propose a lifted $\ell_1$ (LL1) regularization which is a generalized form of several popular regularizations in the literature. By exploring such connections, we discover there are two types of lifting functions which can guarantee that the proposed approach is equivalent to the $\ell_0$ minimization. Computationally, we design an efficient algorithm via the alternating direction method of multiplier (ADMM) and establish the convergence for an unconstrained formulation. Experimental results are presented to demonstrate how this generalization improves sparse recovery over the state-of-the-art.

Submitted 12 May, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: 24 pages

MSC Class: 65K10; 49N45; 65F50; 90C90; 49M20

arXiv:2201.00229 [pdf, other]

Understanding Energy Efficiency and Interference Tolerance in Millimeter Wave Receivers

Authors: Panagiotis Skrimponis, Seongjoon Kang, Abbas Khalili, Wonho Lee, Navid Hosseinzadeh, Marco Mezzavilla, Elza Erkip, Mark J. W. Rodwell, James F. Buckwalter, Sundeep Rangan

Abstract: Power consumption is a key challenge in millimeter wave (mmWave) receiver front-ends, due to the need to support high dimensional antenna arrays at wide bandwidths. Recently, there has been considerable work in developing low-power front-ends, often based on low-resolution ADCs and low-power mixers. A critical but less studied consequence of such designs is the relatively low-dynamic range which in turn exposes the receiver to adjacent carrier interference and blockers. This paper provides a general mathematical framework for analyzing the performance of mmWave front-ends in the presence of out-of-band interference. The goal is to elucidate the fundamental trade-off of power consumption, interference tolerance and in-band performance. The analysis is combined with detailed network simulations in cellular systems with multiple carriers, as well as detailed circuit simulations of key components at 140 GHz. The analysis reveals critical bottlenecks for low-power interference robustness and suggests design enhancements for use in practical systems.

Submitted 1 January, 2022; originally announced January 2022.

Comments: Appeared at the Asilomar Conference on Signals, Systems, and Computers 2021

arXiv:2110.03396 [pdf, other]

AnoSeg: Anomaly Segmentation Network Using Self-Supervised Learning

Authors: Jouwon Song, Kyeongbo Kong, Ye-In Park, Seong-Gyun Kim, Suk-Ju Kang

Abstract: Anomaly segmentation, which localizes defective areas, is an important component in large-scale industrial manufacturing. However, most recent researches have focused on anomaly detection. This paper proposes a novel anomaly segmentation network (AnoSeg) that can directly generate an accurate anomaly map using self-supervised learning. For highly accurate anomaly segmentation, the proposed AnoSeg considers three novel techniques: Anomaly data generation based on hard augmentation, self-supervised learning with pixel-wise and adversarial losses, and coordinate channel concatenation. First, to generate synthetic anomaly images and reference masks for normal data, the proposed method uses hard augmentation to change the normal sample distribution. Then, the proposed AnoSeg is trained in a self-supervised learning manner from the synthetic anomaly data and normal data. Finally, the coordinate channel, which represents the pixel location information, is concatenated to an input of AnoSeg to consider the positional relationship of each pixel in the image. The estimated anomaly map can also be utilized to improve the performance of anomaly detection. Our experiments show that the proposed method outperforms the state-of-the-art anomaly detection and anomaly segmentation methods for the MVTec AD dataset. In addition, we compared the proposed method with the existing methods through the intersection over union (IoU) metric commonly used in segmentation tasks and demonstrated the superiority of our method for anomaly segmentation.

Submitted 7 October, 2021; originally announced October 2021.

Comments: 10 pages, 17 figures
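
Note on the IoU metric named in the AnoSeg abstract above: intersection over union compares a predicted binary mask with a reference mask by dividing the number of pixels both mark as anomalous by the number of pixels either marks as anomalous. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `iou`, the nested-list mask format, and the choice to return 1.0 when both masks are empty are assumptions made for illustration.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two equally sized binary masks.

    Masks are nested lists (rows of 0/1 or False/True values).
    Returning 1.0 for two empty masks is an illustrative convention.
    """
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            p, t = bool(p), bool(t)
            intersection += p and t  # pixel marked in both masks
            union += p or t          # pixel marked in at least one mask
    return 1.0 if union == 0 else intersection / union


# Hypothetical 2x2 masks: one shared pixel, two pixels marked overall.
predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [0, 0]]
score = iou(predicted, reference)
```

Here `score` is 0.5: one overlapping pixel over a union of two.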

arXiv:2107.04526 [pdf, ps, other]

A Dual-Connection based Handover Scheme for Ultra-Dense Millimeter-Wave Cellular Networks

Authors: Seongjoon Kang, Siyoung Choi, Goodsol Lee, Saewoong Bahk

Abstract: Mobile users in an ultra-dense millimeter-wave cellular network experience handover events more frequently than in conventional networks, which results in increased service interruption time and performance degradation due to blockages. Multi-connectivity has been proposed to resolve this, and it also extends the coverage of millimeter-wave communications. In this paper, we propose a dual-connection based handover scheme for mobile UEs in an environment where they are connected simultaneously with two millimeter-wave cells to overcome frequent handover problems. This scheme allows a mobile UE to choose its serving link between the two mmWave connections according to the measured SINRs and then the corresponding base stations may forward duplicate packets to the UE. We compare our dual-connection based scheme with a conventional single-connection based scheme through ns-3 simulation. The simulation results show that the proposed scheme significantly reduces handover rate and delay. Therefore, we argue that the dual-connection based scheme helps mobile users achieve performance goals they require in ultra-dense cellular environments.

Submitted 9 July, 2021; originally announced July 2021.

arXiv:2107.03298 [pdf, other]

VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

Authors: Hui Lu, Zhiyong Wu, Xixin Wu, Xu Li, Shiyin Kang, Xunying Liu, Helen Meng

Abstract: This paper describes a variational auto-encoder based non-autoregressive text-to-speech (VAENAR-TTS) model. The autoregressive TTS (AR-TTS) models based on the sequence-to-sequence architecture can generate high-quality speech, but their sequential decoding process can be time-consuming. Recently, non-autoregressive TTS (NAR-TTS) models have been shown to be more efficient with the parallel decoding process. However, these NAR-TTS models rely on phoneme-level durations to generate a hard alignment between the text and the spectrogram. Obtaining duration labels, either through forced alignment or knowledge distillation, is cumbersome. Furthermore, hard alignment based on phoneme expansion can degrade the naturalness of the synthesized speech. In contrast, the proposed model of VAENAR-TTS is an end-to-end approach that does not require phoneme-level durations. The VAENAR-TTS model does not contain recurrent structures and is completely non-autoregressive in both the training and inference phases. Based on the VAE architecture, the alignment information is encoded in the latent variable, and attention-based soft alignment between the text and the latent variable is used in the decoder to reconstruct the spectrogram. Experiments show that VAENAR-TTS achieves state-of-the-art synthesis quality, while the synthesis speed is comparable with other NAR-TTS models.

Submitted 7 July, 2021; originally announced July 2021.

arXiv:2104.04600 [pdf, ps, other]

Millimeter-Wave UAV Coverage in Urban Environments

Authors: Seongjoon Kang, Marco Mezzavilla, Angel Lozano, Giovanni Geraci, William Xia, Sundeep Rangan, Vasilii Semkin, Giuseppe Loianno

Abstract: With growing interest in mmWave connectivity for UAVs, a basic question is whether networks intended for terrestrial users can provide sufficient aerial coverage as well. To assess this possibility, the paper proposes a novel evaluation methodology using generative models trained on detailed ray tracing data. These models capture complex propagation characteristics and can be readily combined with antenna and beamforming assumptions. Extensive simulation using these models indicate that standard (street-level and downtilted) base stations at typical microcellular densities can indeed provide satisfactory UAV coverage. Interestingly, the coverage is possible via a conjunction of antenna sidelobes and strong reflections. With sparser deployments, the coverage is only guaranteed at progressively higher altitudes. Additional dedicated (rooftop-mounted and uptilted) base stations strengthen the coverage provided that their density is comparable to that of the standard deployment, and would be instrumental for sparse deployments of the latter.

Submitted 19 May, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

arXiv:2103.17149 [pdf, other]

Lightweight UAV-based Measurement System for Air-to-Ground Channels at 28 GHz

Authors: Vasilii Semkin, Seongjoon Kang, Jaakko Haarla, William Xia, Ismo Huhtinen, Giovanni Geraci, Angel Lozano, Giuseppe Loianno, Marco Mezzavilla, Sundeep Rangan

Abstract: Wireless communication at millimeter wave frequencies has attracted considerable attention for the delivery of high-bit-rate connectivity to unmanned aerial vehicles (UAVs). However, conducting the channel measurements necessary to assess communication at these frequencies has been challenging due to the severe payload and power restrictions in commercial UAVs. This work presents a novel lightweight (approximately 1.3 kg) channel measurement system at 28 GHz installed on a commercially available UAV. A ground transmitter equipped with a horn antenna conveys sounding signals to a UAV equipped with a lightweight spectrum analyzer. We demonstrate that the measurements can be highly influenced by the antenna pattern as shaped by the UAV's frame. A calibration procedure is presented to correct for the resulting angular variations in antenna gain. The measurement setup is then validated on real flights from an airstrip at distances in excess of 300 m.

Submitted 31 March, 2021; originally announced March 2021.

arXiv:2102.06536 [pdf, other]

CrossStack: A 3-D Reconfigurable RRAM Crossbar Inference Engine

Authors: Jason K. Eshraghian, Kyoungrok Cho, Sung Mo Kang

Abstract: Deep neural network inference accelerators are rapidly growing in importance as we turn to massively parallelized processing beyond GPUs and ASICs. The dominant operation in feedforward inference is the multiply-and-accumulate process, where each column in a crossbar generates the current response of a single neuron. As a result, memristor crossbar arrays parallelize inference and image processing tasks very efficiently. In this brief, we present a 3-D active memristor crossbar array 'CrossStack', which adopts stacked pairs of Al/TiO2/TiO2-x/Al devices with common middle electrodes. By designing CMOS-memristor hybrid cells used in the layout of the array, CrossStack can operate in one of two user-configurable modes as a reconfigurable inference engine: 1) expansion mode and 2) deep-net mode. In expansion mode, the resolution of the network is doubled by increasing the number of inputs for a given chip area, reducing IR drop by 22%. In deep-net mode, inference speed per-10-bit convolution is improved by 29% by simultaneously using one TiO2/TiO2-x layer for read processes, and the other for write processes. We experimentally verify both modes on our $10\times10\times2$ array.

Submitted 7 February, 2021; originally announced February 2021.

Comments: 5 pages, 4 figures
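
Note on the multiply-and-accumulate operation named in the CrossStack abstract above: in an idealized crossbar, applying voltage V_i to row i across a device of conductance G_ij yields a current V_i * G_ij by Ohm's law, and the currents sharing column j add up by Kirchhoff's current law, so each column reads out I_j = sum_i V_i * G_ij. The sketch below illustrates only that textbook relation; it ignores IR drop and device non-idealities and does not model the CrossStack hardware or its two modes. The function name and the example values are hypothetical.

```python
def crossbar_column_currents(voltages, conductances):
    """Ideal crossbar readout: I_j = sum_i V_i * G_ij.

    voltages:     list of row input voltages (V), length n_rows
    conductances: n_rows x n_cols nested list of conductances (S)
    Returns one summed current (A) per column.
    """
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]


# Hypothetical 2x2 array: two input rows, two output columns ("neurons").
row_voltages = [1.0, 2.0]
device_conductances = [[1.0, 2.0],
                       [3.0, 4.0]]
currents = crossbar_column_currents(row_voltages, device_conductances)
```

With these values, `currents` is `[7.0, 10.0]`, which is the same result as a vector-matrix product computed in a single read step.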

arXiv:2102.00184 [pdf, other]

Adversarially learning disentangled speech representations for robust multi-factor voice conversion

Authors: Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Factorizing speech as disentangled speech representations is vital to achieve highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC only factorize speech as speaker and content, lacking controllability on other prosody-related factors. State-of-the-art speech representation learning methods for more speech factors are using primary disentangle algorithms such as random resampling and ad-hoc bottleneck layer size adjustment, which however is hard to ensure robust speech representation disentanglement. To increase the robustness of highly controllable style transfer on multiple factors in VC, we propose a disentangled speech representation learning framework based on adversarial learning. Four speech representations characterizing content, timbre, rhythm and pitch are extracted, and further disentangled by an adversarial Mask-And-Predict (MAP) network inspired by BERT. The adversarial network is used to minimize the correlations between the speech representations, by randomly masking and predicting one of the representations from the others. Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors by increasing the speech quality MOS from 2.79 to 3.30 and decreasing the MCD from 3.89 to 3.58.

Submitted 20 August, 2021; v1 submitted 30 January, 2021; originally announced February 2021.

arXiv:2010.06423 [pdf]

A comprehensive protocol for manual segmentation of the human claustrum and its sub-regions using high-resolution MRI

Authors: Seung Suk Kang, Joseph Bodenheimer, Tracey Butler

Abstract: The claustrum (Cl) is a thin grey matter structure located in the center of each brain hemisphere. Cl has been hypothesized as a central hub of the brain for multisensory/sensorimotor integration, consciousness, and attention. Accumulating evidence has suggested that Cl might be important in the development of severe neurological and psychiatric symptoms including epileptic seizures and psychosis. However, the specifics of the roles of Cl in human epilepsy and psychosis are largely unknown, primarily due to methodological limitations related to the thin morphology of Cl that is challenging to delineate accurately using conventional methods. The goal of this work is to develop noninvasive multimodal neuroimaging methods to delineate Cl anatomy by utilizing a large healthy adult high resolution (0.7mm3) T1-weighted MRI collected as part of the Washington University-Minnesota Consortium Human Connectome Project (WU-Minn HCP). We developed a comprehensive manual segmentation protocol to delineate Cl based on a cellular level brain atlas. The protocol involves detailed guidelines to delineate the three subregions of Cl, including the dorsal, ventral, and temporal Cl that can be parcellated based on a geometric method. As demonstrated in a representative result, Cl is large in its anterior-posterior, and the dorsal-ventral extent. Also, the volume is comparable to that of the amygdala. It is required to assess the reliability of the protocol so that it can be used for future anatomical studies of neuropsychiatric disorders, including epilepsy and schizophrenia.

Submitted 13 October, 2020; originally announced October 2020.

Comments: 15 pages, 6 figures

arXiv:2010.01810 [pdf, other]

Painting Outside as Inside: Edge Guided Image Outpainting via Bidirectional Rearrangement with Progressive Step Learning

Authors: Kyunghun Kim, Yeohun Yun, Keon-Woo Kang, Kyeongbo Kong, Siyeong Lee, Suk-Ju Kang

Abstract: Image outpainting is a very intriguing problem as the outside of a given image can be continuously filled by considering as the context of the image. This task has two main challenges. The first is to maintain the spatial consistency in contents of generated regions and the original input. The second is to generate a high-quality large image with a small amount of adjacent information. Conventional image outpainting methods generate inconsistent, blurry, and repeated pixels. To alleviate the difficulty of an outpainting problem, we propose a novel image outpainting method using bidirectional boundary region rearrangement. We rearrange the image to benefit from the image inpainting task by reflecting more directional information. The bidirectional boundary region rearrangement enables the generation of the missing region using bidirectional information similar to that of the image inpainting task, thereby generating the higher quality than the conventional methods using unidirectional information. Moreover, we use the edge map generator that considers images as original input with structural information and hallucinates the edges of unknown regions to generate the image. Our proposed method is compared with other state-of-the-art outpainting and inpainting methods both qualitatively and quantitatively. We further compared and evaluated them using BRISQUE, one of the No-Reference image quality assessment (IQA) metrics, to evaluate the naturalness of the output. The experimental results demonstrate that our method outperforms other methods and generates new images with 360° panoramic characteristics.

Submitted 9 November, 2020; v1 submitted 5 October, 2020; originally announced October 2020.

Comments: Paper accepted in WACV 2021

arXiv:2006.15833 [pdf, other]

End-to-End Differentiable Learning to HDR Image Synthesis for Multi-exposure Images

Authors: Jung Hee Kim, Siyeong Lee, Suk-Ju Kang

Abstract: Recently, high dynamic range (HDR) image reconstruction based on the multiple exposure stack from a given single exposure utilizes a deep learning framework to generate high-quality HDR images. These conventional networks focus on the exposure transfer task to reconstruct the multi-exposure stack. Therefore, they often fail to fuse the multi-exposure stack into a perceptually pleasant HDR image as the inversion artifacts occur. We tackle the problem in stack reconstruction-based methods by proposing a novel framework with a fully differentiable high dynamic range imaging (HDRI) process. By explicitly using the loss, which compares the network's output with the ground truth HDR image, our framework enables a neural network that generates the multiple exposure stack for HDRI to train stably. In other words, our differentiable HDR synthesis layer helps the deep neural network to train to create multi-exposure stacks while reflecting the precise correlations between multi-exposure images in the HDRI process. In addition, our network uses the image decomposition and the recursive process to facilitate the exposure transfer task and to adaptively respond to recursion frequency. The experimental results show that the proposed network outperforms the state-of-the-art quantitative and qualitative results in terms of both the exposure transfer tasks and the whole HDRI process.

Submitted 18 December, 2020; v1 submitted 29 June, 2020; originally announced June 2020.

arXiv:2006.11610 [pdf, other]

Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams

Authors: Huirong Huang, Zhiyong Wu, Shiyin Kang, Dongyang Dai, Jia Jia, Tianxiao Fu, Deyi Tuo, Guangzhi Lei, Peng Liu, Dan Su, Dong Yu, Helen Meng

Abstract: Generating 3D speech-driven talking head has received more and more attention in recent years. Recent approaches mainly have following limitations: 1) most speaker-independent methods need handcrafted features that are time-consuming to design or unreliable; 2) there is no convincing method to support multilingual or mixlingual speech as input. In this work, we propose a novel approach using phonetic posteriorgrams (PPG). In this way, our method doesn't need hand-crafted features and is more robust to noise compared to recent approaches. Furthermore, our method can support multilingual speech as input by building a universal phoneme space. As far as we know, our model is the first to support multilingual/mixlingual speech as input with convincing results. Objective and subjective experiments have shown that our model can generate high quality animations given speech from unseen languages or speakers and be robust to noise.

Submitted 20 June, 2020; originally announced June 2020.

Comments: 5 pages, 5 figures

arXiv:2005.09178 [pdf, other]

Transferring Source Style in Non-Parallel Voice Conversion

Authors: Songxiang Liu, Yuewen Cao, Shiyin Kang, Na Hu, Xunying Liu, Dan Su, Dong Yu, Helen Meng

Abstract: Voice conversion (VC) techniques aim to modify speaker identity of an utterance while preserving the underlying linguistic information. Most VC approaches ignore modeling of the speaking style (e.g. emotion and emphasis), which may contain the factors intentionally added by the speaker and should be retained during conversion. This study proposes a sequence-to-sequence based non-parallel VC approach, which has the capability of transferring the speaking style from the source speech to the converted speech by explicitly modeling. Objective evaluation and subjective listening tests show superiority of the proposed VC approach in terms of speech naturalness and speaker similarity of the converted speech. Experiments are also conducted to show the source-style transferability of the proposed approach.

Submitted 18 May, 2020; originally announced May 2020.

Comments: 5 pages, 8 figures, submitted to INTERSPEECH 2020

arXiv:2003.09839 [pdf, other]

A New Update Rule of RLSEKF-based Joint-estimation Filters for Real-time SOH SOC Identification

Authors: Kwangrae Kim, Minho Kim, Suwon Kang, Jungwook Yu, Jungsoo Kim, Huiyong Chun, Soohee Han

Abstract: In order to accurately estimate the SOC and SOH of a lithium-ion battery used in an electric vehicle (EV), we propose an Adaptive Diagonal Forgetting Factor Recursive Least Square (ADFF-RLS) for accurate battery parameter estimation. ADFF-RLS includes two new proposals in the existing DFF-RLS. The first is an excitation tag that changes the behavior of the DFF-RLS and the EKF according to the dynamics of the input data. The second is auto-tuning that automatically finds the optimal value of RLS forgetting factor based on condition number (CN). Based on this, we proposed a joint estimation algorithm of ADFF-RLS and Extended Kalman Filter (EKF). To verify the accuracy of the proposed algorithm, we used experimental data of hybrid pattern battery cells mixed with dynamic and static patterns. In addition, we added a current measurement error that occurs when measuring at EV, and realized data that is closer to actual environment. This data was applied to two conventional estimation algorithms (Coulomb counting, Single EKF), two joint estimation algorithms (RLS & EKF, DFF-RLS & EKF) and ADFF-RLS & EKF. As a result, the proposed algorithm showed higher SOC and SOH estimation accuracy in various driving patterns and actual EV driving environment than previous studies.

Submitted 22 March, 2020; originally announced March 2020.
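
Note on Coulomb counting, one of the conventional baselines named in the abstract above: it estimates state of charge (SOC) by integrating the measured current over time and dividing by the cell capacity. The sketch below shows that textbook baseline only; it is not the ADFF-RLS or EKF method the paper proposes. The function name, the sign convention (positive current means discharge), the clamping to [0, 1], and all numeric values are assumptions chosen for illustration.

```python
def coulomb_count(soc0, currents_a, dt_s, capacity_ah):
    """Discrete Coulomb counting: SOC[k+1] = SOC[k] - I[k] * dt / Q.

    soc0:        initial state of charge, as a fraction in [0, 1]
    currents_a:  measured current samples in amperes (positive = discharge,
                 an assumed convention)
    dt_s:        sampling interval in seconds
    capacity_ah: cell capacity in ampere-hours
    Returns the SOC after each sample.
    """
    capacity_as = capacity_ah * 3600.0  # ampere-hours to ampere-seconds
    soc = soc0
    trace = []
    for current in currents_a:
        soc -= current * dt_s / capacity_as
        soc = min(1.0, max(0.0, soc))  # keep the estimate physically valid
        trace.append(soc)
    return trace


# Hypothetical case: a 2 Ah cell discharged at 1 A for one hour.
true_trace = coulomb_count(1.0, [1.0] * 3600, 1.0, 2.0)
# Same run with a constant +0.1 A sensor bias (hypothetical measurement error).
biased_trace = coulomb_count(1.0, [1.1] * 3600, 1.0, 2.0)
```

The unbiased run ends near an SOC of 0.5 and the biased run near 0.45. Because the method integrates whatever the sensor reports, a constant current measurement error accumulates into a growing SOC error, which is one reason such a baseline is commonly compared against filter-based estimators.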

arXiv:2003.04917 [pdf, other]

doi 10.1109/TMECH.2021.3058851

A Fractional-Order Normalized Bouc-Wen Model for Piezoelectric Hysteresis Nonlinearity

Authors: Shengzheng Kang, Hongtao Wu, Yao Li, Xiaolong Yang, Jiafeng Yao

Abstract: This paper presents a new fractional-order normalized Bouc-Wen (BW) (FONBW) model to describe the asymmetric and rate-dependent hysteresis nonlinearity of piezoelectric actuators (PEAs). In view of the fact that the classical BW (CBW) model is only efficient for the symmetric and rate-independent hysteresis description, the FONBW model is devoted to characterizing the asymmetric and rate-dependent behaviors of the hysteresis in PEAs by adopting an Nth-order polynomial input function and two fractional operators, respectively. Different from the traditional modified BW models, the proposed FONBW model also eliminates the redundancy of parameters in the CBW model via the normalization processing. By this way, the developed FONBW model has a relatively simple mathematic expression with fewer parameters to simultaneously characterize the asymmetric and rate-dependent hysteresis behaviors of PEAs. Model parameters are identified by the self-adaptive differential evolution algorithm. To validate the effectiveness of the proposed model, a series of model verification and inverse-multiplicative-structure-based feedforward control experiments are carried out on a PEA system. Results show that the proposed model is superior to the CBW model and traditional modified BW model in modeling accuracy and hysteresis compensation.

Submitted 20 November, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

Comments: 9 pages, 10 figures, submitted to TMech; 10 pages, 11 figures, add two subsections in Section IV; modify Tables I and III, and Figures 9 and 10

Journal ref: IEEE/ASME Transactions on Mechatronics, 2021

Showing 1–50 of 63 results for author: Kang, S