A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons

Authors: Tzu-Yun Hung, Jui-Te Wu, Yu-Chia Kuo, Yo-Wei Hsiao, Ting-Wei Lin, Li Su

Abstract: Expressive music synthesis (EMS) for violin performance is a challenging task due to the disagreement among music performers in the interpretation of expressive musical terms (EMTs), scarcity of labeled recordings, and limited generalization ability of the synthesis model. These challenges create trade-offs between model effectiveness, diversity of generated results, and controllability of the synthesis system, making it essential to conduct a comparative study on EMS model design. This paper explores two violin EMS approaches. The end-to-end approach is a modification of a state-of-the-art text-to-speech generator. The parameter-controlled approach is based on a simple parameter sampling process that can render note lengths and other parameters compatible with MIDI-DDSP. We study these two approaches (in total, three model variants) through objective and subjective experiments and discuss several key issues of EMS based on the results.

Submitted 26 June, 2024; originally announced June 2024.

Comments: 15 pages, 2 figures, 3 tables

arXiv:2406.06375 [pdf, other]

doi 10.1109/TASLP.2024.3407529

MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing

Authors: Yu-Fen Huang, Nikki Moran, Simon Coleman, Jon Kelly, Shun-Hwa Wei, Po-Yin Chen, Yun-Hsin Huang, Tsung-Ping Chen, Yu-Chia Kuo, Yu-Chi Wei, Chih-Hsuan Li, Da-Yu Huang, Hsuan-Kai Kao, Ting-Wei Lin, Li Su

Abstract: In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamic, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the usage of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrase, and expressive contents from audio, video and motion data, and the generation of musicians' body motion from given music audio. The dataset and codes are available alongside this publication (https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset).

Submitted 10 June, 2024; originally announced June 2024.

Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024. 14 pages, 7 figures. Dataset is available on: https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset/tree/main and https://zenodo.org/records/11393449

arXiv:2405.09022 [pdf, other]

doi 10.1109/JIOT.2024.3413687

Multi-Objective Optimization-based Transmit Beamforming for Multi-Target and Multi-User MIMO-ISAC Systems

Authors: Chunwei Meng, Zhiqing Wei, Dingyou Ma, Wanli Ni, Liyan Su, Zhiyong Feng

Abstract: Integrated sensing and communication (ISAC) is an enabling technology for the sixth-generation mobile communications, which equips the wireless communication networks with sensing capabilities. In this paper, we investigate transmit beamforming design for multiple-input and multiple-output (MIMO)-ISAC systems in scenarios with multiple radar targets and communication users. A general form of multi-target sensing mutual information (MI) is derived, along with its upper bound, which can be interpreted as the sum of individual single-target sensing MI. Additionally, this upper bound can be achieved by suppressing the cross-correlation among reflected signals from different targets, which aligns with the principles of adaptive MIMO radar. Then, we propose a multi-objective optimization framework based on the signal-to-interference-plus-noise ratio of each user and the tight upper bound of sensing MI, introducing the Pareto boundary to characterize the achievable communication-sensing performance boundary of the proposed ISAC system. To achieve the Pareto boundary, the max-min system utility function method is employed, while considering the fairness between communication users and radar targets. Subsequently, the bisection search method is employed to find a specific Pareto optimal solution by solving a series of convex feasible problems. Finally, simulation results validate that the proposed method achieves a better tradeoff between multi-user communication and multi-target sensing performance. Additionally, utilizing the tight upper bound of sensing MI as a performance metric can enhance the multi-target resolution capability and angle estimation accuracy.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2403.07390 [pdf, other]

Learning Correction Errors via Frequency-Self Attention for Blind Image Super-Resolution

Authors: Haochen Sun, Yan Yuan, Lijuan Su, Haotian Shao

Abstract: Previous approaches for blind image super-resolution (SR) have relied on degradation estimation to restore high-resolution (HR) images from their low-resolution (LR) counterparts. However, accurate degradation estimation poses significant challenges. The SR model's incompatibility with degradation estimation methods, particularly the Correction Filter, may significantly impair performance as a result of correction errors. In this paper, we introduce a novel blind SR approach that focuses on Learning Correction Errors (LCE). Our method employs a lightweight Corrector to obtain a corrected low-resolution (CLR) image. Subsequently, within an SR network, we jointly optimize SR performance by utilizing both the original LR image and the frequency learning of the CLR image. Additionally, we propose a new Frequency-Self Attention block (FSAB) that enhances the global information utilization ability of Transformer. This block integrates both self-attention and frequency spatial attention mechanisms. Extensive ablation and comparison experiments conducted across various settings demonstrate the superiority of our method in terms of visual quality and accuracy. Our approach effectively addresses the challenges associated with degradation estimation and correction errors, paving the way for more accurate blind image SR.

Submitted 12 March, 2024; originally announced March 2024.

Comments: 16 pages

arXiv:2312.17156 [pdf, other]

BEAST: Online Joint Beat and Downbeat Tracking Based on Streaming Transformer

Authors: Chih-Cheng Chang, Li Su

Abstract: Many deep learning models have achieved dominant performance on the offline beat tracking task. However, online beat tracking, in which only the past and present input features are available, still remains challenging. In this paper, we propose BEAt tracking Streaming Transformer (BEAST), an online joint beat and downbeat tracking system based on the streaming Transformer. To deal with online scenarios, BEAST applies contextual block processing in the Transformer encoder. Moreover, we adopt relative positional encoding in the attention layer of the streaming Transformer encoder to capture relative timing position which is critically important information in music. Carrying out beat and downbeat experiments on benchmark datasets for a low latency scenario with maximum latency under 50 ms, BEAST achieves an F1-measure of 80.04% in beat and 46.78% in downbeat, which is a substantial improvement of about 5 percentage points over the state-of-the-art online beat tracking model.

Submitted 23 April, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP 2024

arXiv:2311.12488 [pdf, other]

Adapting pretrained speech model for Mandarin lyrics transcription and alignment

Authors: Jun-You Wang, Chon-In Leong, Yu-Chen Lin, Li Su, Jyh-Shing Roger Jang

Abstract: The tasks of automatic lyrics transcription and lyrics alignment have witnessed significant performance improvements in the past few years. However, most of the previous works only focus on English, for which large-scale datasets are available. In this paper, we address lyrics transcription and alignment of polyphonic Mandarin pop music in a low-resource setting. To deal with the data scarcity issue, we adapt the pretrained Whisper model and fine-tune it on a monophonic Mandarin singing dataset. With the use of data augmentation and a source separation model, results show that the proposed method achieves a character error rate of less than 18% on a Mandarin polyphonic dataset for lyrics transcription, and a mean absolute error of 0.071 seconds for lyrics alignment. Our results demonstrate the potential of adapting a pretrained speech model for lyrics transcription and alignment in low-resource scenarios.

Submitted 21 November, 2023; originally announced November 2023.

Comments: Accepted by ASRU 2023

arXiv:2310.19198 [pdf]

Enhancing Motor Imagery Decoding in Brain Computer Interfaces using Riemann Tangent Space Mapping and Cross Frequency Coupling

Authors: Xiong Xiong, Li Su, **guo Huang, Guixia Kang

Abstract: Objective: Motor Imagery (MI) serves as a crucial experimental paradigm within the realm of Brain Computer Interfaces (BCIs), aiming to decode motor intentions from electroencephalogram (EEG) signals. Method: Drawing inspiration from Riemannian geometry and Cross-Frequency Coupling (CFC), this paper introduces a novel approach termed Riemann Tangent Space Mapping using Dichotomous Filter Bank with Convolutional Neural Network (DFBRTS) to enhance the representation quality and decoding capability pertaining to MI features. DFBRTS first filters EEG signals through a Dichotomous Filter Bank, structured in the fashion of a complete binary tree. Subsequently, it employs Riemann Tangent Space Mapping to extract salient EEG signal features within each sub-band. Finally, a lightweight convolutional neural network is employed for further feature extraction and classification, operating under the joint supervision of cross-entropy and center loss. To validate the efficacy, extensive experiments were conducted using DFBRTS on two well-established benchmark datasets: the BCI competition IV 2a (BCIC-IV-2a) dataset and the OpenBMI dataset. The performance of DFBRTS was benchmarked against several state-of-the-art MI decoding methods, alongside other Riemannian geometry-based MI decoding approaches. Results: DFBRTS significantly outperforms other MI decoding algorithms on both datasets, achieving a remarkable classification accuracy of 78.16% for four-class and 71.58% for two-class hold-out classification, as compared to the existing benchmarks.

Submitted 29 October, 2023; originally announced October 2023.

Comments: 22 pages, 7 figures

arXiv:2307.06634 [pdf, ps, other]

Coherent Compensation based ISAC Signal Processing for Long-range Sensing

Authors: Lin Wang, Zhiqing Wei, Liyan Su, Zhiyong Feng, Huici Wu, Dongsheng Xue

Abstract: Integrated sensing and communication (ISAC) will greatly enhance the efficiency of physical resource utilization. The design of ISAC signal based on the orthogonal frequency division multiplex (OFDM) signal is the mainstream. However, when detecting the long-range target, the delay of echo signal exceeds CP duration, which will result in inter-symbol interference (ISI) and inter-carrier interference (ICI), limiting the sensing range. Facing the above problem, we propose to increase useful signal power through coherent compensation and improve the signal to interference plus noise power ratio (SINR) of each OFDM block. Compared with the traditional 2D-FFT algorithm, the improvement of SINR of range-doppler map (RDM) is verified by simulation, which will expand the sensing range.

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2305.20003 [pdf]

A Novel Black Box Process Quality Optimization Approach based on Hit Rate

Authors: Yang Yang, Jian Wu, Xiangman Song, Derun Wu, Lijie Su, Lixin Tang

Abstract: Hit rate is a key performance metric in predicting process product quality in integrated industrial processes. It represents the percentage of products accepted by downstream processes within a controlled range of quality. However, optimizing hit rate is a non-convex and challenging problem. To address this issue, we propose a data-driven quasi-convex approach that combines factorial hidden Markov models, multitask elastic net, and quasi-convex optimization. Our approach converts the original non-convex problem into a set of convex feasible problems, achieving an optimal hit rate. We verify the convex optimization property and quasi-convex frontier through Monte Carlo simulations and real-world experiments in steel production. Results demonstrate that our approach outperforms classical models, improving hit rates by at least 41.11% and 31.01% on two real datasets. Furthermore, the quasi-convex frontier provides a reference explanation and visualization for the deterioration of solutions obtained by conventional models.

Submitted 2 June, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

arXiv:2305.19956 [pdf, other]

doi 10.1016/j.compmedimag.2024.102326

MicroSegNet: A Deep Learning Approach for Prostate Segmentation on Micro-Ultrasound Images

Authors: Hongxu Jiang, Muhammad Imran, Preethika Muralidharan, Anjali Patel, Jake Pensa, Muxuan Liang, Tarik Benidir, Joseph R. Grajo, Jason P. Joseph, Russell Terry, John Michael DiBianco, Li-Ming Su, Yuyin Zhou, Wayne G. Brisbane, Wei Shao

Abstract: Micro-ultrasound (micro-US) is a novel 29-MHz ultrasound technique that provides 3-4 times higher resolution than traditional ultrasound, potentially enabling low-cost, accurate diagnosis of prostate cancer. Accurate prostate segmentation is crucial for prostate volume measurement, cancer diagnosis, prostate biopsy, and treatment planning. However, prostate segmentation on micro-US is challenging due to artifacts and indistinct borders between the prostate, bladder, and urethra in the midline. This paper presents MicroSegNet, a multi-scale annotation-guided transformer UNet model designed specifically to tackle these challenges. During the training process, MicroSegNet focuses more on regions that are hard to segment (hard regions), characterized by discrepancies between expert and non-expert annotations. We achieve this by proposing an annotation-guided binary cross entropy (AG-BCE) loss that assigns a larger weight to prediction errors in hard regions and a lower weight to prediction errors in easy regions. The AG-BCE loss was seamlessly integrated into the training process through the utilization of multi-scale deep supervision, enabling MicroSegNet to capture global contextual dependencies and local information at various scales. We trained our model using micro-US images from 55 patients, followed by evaluation on 20 patients. Our MicroSegNet model achieved a Dice coefficient of 0.939 and a Hausdorff distance of 2.02 mm, outperforming several state-of-the-art segmentation methods, as well as three human annotators with different experience levels. Our code is publicly available at https://github.com/mirthAI/MicroSegNet and our dataset is publicly available at https://zenodo.org/records/10475293.

Submitted 25 January, 2024; v1 submitted 31 May, 2023; originally announced May 2023.

Journal ref: Computerized Medical Imaging and Graphics (2024): 102326

arXiv:2305.19939 [pdf, other]

Image Registration of In Vivo Micro-Ultrasound and Ex Vivo Pseudo-Whole Mount Histopathology Images of the Prostate: A Proof-of-Concept Study

Authors: Muhammad Imran, Brianna Nguyen, Jake Pensa, Sara M. Falzarano, Anthony E. Sisk, Muxuan Liang, John Michael DiBianco, Li-Ming Su, Yuyin Zhou, Wayne G. Brisbane, Wei Shao

Abstract: Early diagnosis of prostate cancer significantly improves a patient's 5-year survival rate. Biopsy of small prostate cancers is improved with image-guided biopsy. MRI-ultrasound fusion-guided biopsy is sensitive to smaller tumors but is underutilized due to the high cost of MRI and fusion equipment. Micro-ultrasound (micro-US), a novel high-resolution ultrasound technology, provides a cost-effective alternative to MRI while delivering comparable diagnostic accuracy. However, the interpretation of micro-US is challenging due to subtle gray scale changes indicating cancer vs normal tissue. This challenge can be addressed by training urologists with a large dataset of micro-US images containing the ground truth cancer outlines. Such a dataset can be mapped from surgical specimens (histopathology) onto micro-US images via image registration. In this paper, we present a semi-automated pipeline for registering in vivo micro-US images with ex vivo whole-mount histopathology images. Our pipeline begins with the reconstruction of pseudo-whole-mount histopathology images and a 3-dimensional (3D) micro-US volume. Each pseudo-whole-mount histopathology image is then registered with the corresponding axial micro-US slice using a two-stage approach that estimates an affine transformation followed by a deformable transformation. We evaluated our registration pipeline using micro-US and histopathology images from 18 patients who underwent radical prostatectomy. The results showed a Dice coefficient of 0.94 and a landmark error of 2.7 mm, indicating the accuracy of our registration pipeline. This proof-of-concept study demonstrates the feasibility of accurately aligning micro-US and histopathology images. To promote transparency and collaboration in research, we will make our code and dataset publicly available.

Submitted 16 June, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

arXiv:2305.19023 [pdf, other]

Steady-state analysis of networked epidemic models

Authors: Sei Zhen Khong, Lanlan Su

Abstract: Compartmental epidemic models with dynamics that evolve over a graph network have gained considerable importance in recent years but analysis of these models is in general difficult due to their complexity. In this paper, we develop two positive feedback frameworks that are applicable to the study of steady-state values in a wide range of compartmental epidemic models, including both group and networked processes. In the case of a group (resp. networked) model, we show that the convergence limit of the susceptible proportion of the population (resp. the susceptible proportion in at least one of the subgroups) is upper bounded by the reciprocal of the basic reproduction number (BRN) of the model. The BRN, when it is greater than unity, thus demonstrates the level of penetration into a subpopulation by the disease. Both non-strict and strict bounds on the convergence limits are derived and shown to correspond to substantially distinct scenarios in the epidemic processes, one in the presence of the endemic state and another without. Formulae for calculating the limits are provided in the latter case. We apply the developed framework to examining various group and networked epidemic models commonly seen in the literature to verify the validity of our conclusions.

Submitted 30 May, 2023; originally announced May 2023.

arXiv:2304.05917 [pdf, other]

A Phoneme-Informed Neural Network Model for Note-Level Singing Transcription

Authors: Sangeon Yong, Li Su, Juhan Nam

Abstract: Note-level automatic music transcription is one of the most representative music information retrieval (MIR) tasks and has been studied for various instruments to understand music. However, due to the lack of high-quality labeled data, transcription of many instruments is still a challenging task. In particular, in the case of singing, it is difficult to find accurate notes due to its expressiveness in pitch, timbre, and dynamics. In this paper, we propose a method of finding note onsets of singing voice more accurately by leveraging the linguistic characteristics of singing, which are not seen in other instruments. The proposed model uses mel-scaled spectrogram and phonetic posteriorgram (PPG), a frame-wise likelihood of phoneme, as an input of the onset detection network while PPG is generated by the pre-trained network with singing and speech data. To verify how linguistic features affect onset detection, we compare the evaluation results through the dataset with different languages and divide onset types for detailed analysis. Our approach substantially improves the performance of singing transcription and therefore emphasizes the importance of linguistic features in singing analysis.

Submitted 12 April, 2023; originally announced April 2023.

Comments: Accepted at ICASSP 2023

arXiv:2206.01945 [pdf, other]

On the exponential convergence of input-output signals of nonlinear feedback systems

Authors: Lanlan Su, Di Zhao, Sei Zhen Khong

Abstract: This note studies the exponential convergence of input-output signals of discrete-time nonlinear systems composed of a feedback interconnection of a linear time-invariant system and a nonlinear uncertainty. Both the open-loop subsystems are allowed to be unbounded. Integral-quadratic-constraint-based conditions are proposed for these uncertain feedback systems, including the Lurye type, to exhibit the property that the endogenous input-output signals enjoy an exponential convergence rate for all initial conditions of the linear time-invariant subsystem. The conditions are established via a combination of tools, including integral quadratic constraints, directed gap, and exponential weightings.

Submitted 12 June, 2024; v1 submitted 4 June, 2022; originally announced June 2022.

Comments: This paper has been submitted to IEEE Transactions on Automatic Control

arXiv:2112.07456 [pdf, other]

On the Necessity and Sufficiency of Discrete-Time O'Shea-Zames-Falb Multipliers

Authors: Lanlan Su, Peter Seiler, Joaquin Carrasco, Sei Zhen Khong

Abstract: This paper considers the robust stability of a discrete-time Lurye system consisting of the feedback interconnection between a linear system and a bounded and monotone nonlinearity. It has been conjectured that the existence of a suitable linear time-invariant (LTI) O'Shea-Zames-Falb multiplier is not only sufficient but also necessary. Roughly speaking, a successful proof of the conjecture would require: (a) a conic parameterization of a set of multipliers that describes exactly the set of nonlinearities, (b) a lossless S-procedure to show that the non-existence of a multiplier implies that the Lurye system is not uniformly robustly stable over the set of nonlinearities, and (c) the existence of a multiplier in the set of multipliers used in (a) implies the existence of an LTI multiplier. We investigate these three steps, showing the current bottlenecks for proving this conjecture. In addition, we provide an extension of the class of multipliers which may be used to disprove the conjecture.

Submitted 14 December, 2021; originally announced December 2021.

Comments: 25 Pages

arXiv:2110.12855 [pdf, other]

doi 10.1145/3474085.3475529

Actions Speak Louder than Listening: Evaluating Music Style Transfer based on Editing Experience

Authors: Wei-Tsung Lu, Meng-Hsuan Wu, Yuh-Ming Chiu, Li Su

Abstract: The subjective evaluation of music generation techniques has been mostly done with questionnaire-based listening tests while ignoring the perspectives from music composition, arrangement, and soundtrack editing. In this paper, we propose an editing test to evaluate users' editing experience of music generation models in a systematic way. To do this, we design a new music style transfer model combining the non-chronological inference architecture, autoregressive models and the Transformer, which serves as an improvement from the baseline model on the same style transfer task. Then, we compare the performance of the two models with a conventional listening test and the proposed editing test, in which the quality of generated samples is assessed by the amount of effort (e.g., the number of required keyboard and mouse actions) spent by users to polish a music clip. Results on two target styles indicate that the improvement over the baseline model can be reflected by the editing test quantitatively. Also, the editing test provides profound insights which are not accessible from usual listening tests. The major contribution of this paper is the systematic presentation of the editing test and the corresponding insights, while the proposed music style transfer model based on state-of-the-art neural networks represents another contribution.

Submitted 25 October, 2021; originally announced October 2021.

Comments: 9 pages, Proceedings of the 29th ACM International Conference on Multimedia

arXiv:2107.04954 [pdf, other]

ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data

Authors: Kin Wai Cheuk, Dorien Herremans, Li Su

Abstract: Most of the current supervised automatic music transcription (AMT) models lack the ability to generalize. This means that they have trouble transcribing real-world music recordings from diverse musical genres that are not presented in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which solves this issue by leveraging the huge amount of available unlabelled music recordings. The proposed ReconVAT uses reconstruction loss and virtual adversarial training. When combined with existing U-net models for AMT, ReconVAT achieves competitive results on common benchmark datasets such as MAPS and MusicNet. For example, in the few-shot setting for the string part version of MusicNet, ReconVAT achieves F1-scores of 61.0% and 41.6% for the note-wise and note-with-offset-wise metrics respectively, which translates into an improvement of 22.2% and 62.5% compared to the supervised baseline model. Our proposed framework also demonstrates the potential of continual learning on new data, which could be useful in real-world applications whereby new data is constantly available.

Submitted 29 July, 2021; v1 submitted 10 July, 2021; originally announced July 2021.

Comments: Accepted in ACMMM 21. Camera ready version

arXiv:2106.00497 [pdf, ps, other]

Omnizart: A General Toolbox for Automatic Music Transcription

Authors: Yu-Te Wu, Yin-Jyun Luo, Tsung-Ping Chen, I-Chieh Wei, Jui-Yang Hsu, Yi-Chin Chuang, Li Su

Abstract: We present and release Omnizart, a new Python library that provides a streamlined solution to automatic music transcription (AMT). Omnizart encompasses modules that construct the life-cycle of deep learning-based AMT, and is designed for ease of use with a compact command-line interface. To the best of our knowledge, Omnizart is the first transcription toolkit which offers models covering a wide class of instruments ranging from solo, instrument ensembles, percussion instruments to vocal, as well as models for chord recognition and beat/downbeat tracking, two music information retrieval (MIR) tasks highly related to AMT.

Submitted 1 June, 2021; originally announced June 2021.

arXiv:2011.10947 [pdf, other]

Who is in Control? Practical Physical Layer Attack and Defense for mmWave based Sensing in Autonomous Vehicles

Authors: Zhi Sun, Sarankumar Balakrishnan, Lu Su, Arupjyoti Bhuyan, Pu Wang, Chunming Qiao

Abstract: With the wide bandwidths in millimeter wave (mmWave) frequency band that results in unprecedented accuracy, mmWave sensing has become vital for many applications, especially in autonomous vehicles (AVs). In addition, mmWave sensing has superior reliability compared to other sensing counterparts such as camera and LiDAR, which is essential for safety-critical driving. Therefore, it is critical to understand the security vulnerabilities and improve the security and reliability of mmWave sensing in AVs. To this end, we perform the end-to-end security analysis of a mmWave-based sensing system in AVs, by designing and implementing practical physical layer attack and defense strategies in a state-of-the-art mmWave testbed and an AV testbed in real-world settings. Various strategies are developed to take control of the victim AV by spoofing its mmWave sensing module, including adding fake obstacles at arbitrary locations and faking the locations of existing obstacles. Five real-world attack scenarios are constructed to spoof the victim AV and force it to make dangerous driving decisions leading to a fatal crash. Field experiments are conducted to study the impact of the various attack scenarios using a Lincoln MKZ-based AV testbed, which validate that the attacker can indeed assume control of the victim AV to compromise its security and safety. To defend against the attacks, we design and implement a challenge-response authentication scheme and an RF fingerprinting scheme to reliably detect the aforementioned spoofing attacks.

Submitted 22 November, 2020; originally announced November 2020.

arXiv:2010.12196 [pdf, other]

Toward Expressive Singing Voice Correction: On Perceptual Validity of Evaluation Metrics for Vocal Melody Extraction

Authors: Yin-Jyun Luo, Yuen-Jen Lin, Li Su

Abstract: Singing voice correction (SVC) is an appealing application for amateur singers. Commercial products automate SVC by snapping pitch contours to equal-tempered scales, which could lead to deadpan modifications. Together with the neglect of rhythmic errors, extensive manual corrections are still necessary. In this paper, we present a streamlined system to automate expressive SVC for both pitch and rhythmic errors. Particularly, we extend a previous work by integrating advanced techniques for singing voice separation (SVS) and vocal melody extraction. SVC is achieved by temporally aligning the source-target pair, followed by replacing pitch and rhythm of the source with those of the target. We evaluate the framework by a comparative study for melody extraction which involves both subjective and objective evaluations, whereby we investigate perceptual validity of the standard metrics through the lens of SVC. The results suggest that the high pitch accuracy obtained by the metrics does not signify good perceptual scores.

Submitted 23 October, 2020; originally announced October 2020.

Comments: Submitted to ICASSP 2021

arXiv:2009.13574 [pdf, other]

Robust Monotonic Convergent Iterative Learning Control Design: an LMI-based Method

Authors: Lanlan Su

Abstract: This work investigates robust monotonic convergent iterative learning control (ILC) for uncertain linear systems in both time and frequency domains, and the ILC algorithm optimizing the convergence speed in terms of $l_{2}$ norm of error signals is derived. Firstly, it is shown that the robust monotonic convergence of the ILC system can be established equivalently by the positive definiteness of a matrix polynomial over some set. Then, a necessary and sufficient condition in the form of sum of squares (SOS) for the positive definiteness is proposed, which is amenable to the feasibility of linear matrix inequalities (LMIs). Based on such a condition, the optimal ILC algorithm that maximizes the convergence speed is obtained by solving a set of convex optimization problems. Moreover, the order of the learning function can be chosen arbitrarily so that the designers have the flexibility to decide the complexity of the learning algorithm.

Submitted 15 January, 2021; v1 submitted 28 September, 2020; originally announced September 2020.

arXiv:2009.13571 [pdf, other]

On the Necessity and Sufficiency of the Zames-Falb Multipliers for Bounded Operators

Authors: Sei Zhen Khong, Lanlan Su

Abstract: This paper analyzes the robust feedback stability of a single-input-single-output stable linear time-invariant (LTI) system against four different classes of nonlinear systems using the Zames-Falb multipliers. The contribution is fourfold. Firstly, we present a generalised S-procedure lossless theorem that involves a countably infinite number of quadratic forms. Secondly, we identify a class of uncertain systems over which the robust feedback stability implies the existence of an appropriate Zames-Falb multiplier based on the generalised S-procedure lossless theorem. Meanwhile, we show that the existence of such a Zames-Falb multiplier is sufficient for the robust feedback stability over a smaller class of uncertain systems. Thirdly, when restricted to be static (a.k.a. memoryless), the second class of systems coincides with the class of slope-restricted monotone nonlinearities, and the classical result of using the Zames-Falb multipliers to ensure feedback stability is recovered. Lastly, when restricted to be LTI, the second class is demonstrated to be a subset of the third, and the existence of a Zames-Falb multiplier is shown to be sufficient but not necessary for the robust feedback stability.

Submitted 18 August, 2021; v1 submitted 28 September, 2020; originally announced September 2020.

arXiv:2009.08015 [pdf, other]

doi 10.1145/3394171.3413848

Temporally Guided Music-to-Body-Movement Generation

Authors: Hsuan-Kai Kao, Li Su

Abstract: This paper presents a neural network model to generate virtual violinist's 3-D skeleton movements from music audio. Improved from the conventional recurrent neural network models for generating 2-D skeleton data in previous works, the proposed model incorporates an encoder-decoder architecture, as well as the self-attention mechanism to model the complicated dynamics in body movement sequences. To facilitate the optimization of self-attention model, beat tracking is applied to determine effective sizes and boundaries of the training examples. The decoder is accompanied with a refining network and a bowing attack inference mechanism to emphasize the right-hand behavior and bowing attack timing. Both objective and subjective evaluations reveal that the proposed model outperforms the state-of-the-art methods. To the best of our knowledge, this work represents the first attempt to generate 3-D violinists' body movements considering key features in musical body movement.

Submitted 16 September, 2020; originally announced September 2020.

arXiv:2008.06358 [pdf, other]

Semi-supervised learning using teacher-student models for vocal melody extraction

Authors: Sangeun Kum, Jing-Hua Lin, Li Su, Juhan Nam

Abstract: The lack of labeled data is a major obstacle in many music information retrieval tasks such as melody extraction, where labeling is extremely laborious or costly. Semi-supervised learning (SSL) provides a solution to alleviate the issue by leveraging a large amount of unlabeled data. In this paper, we propose an SSL method using teacher-student models for vocal melody extraction. The teacher model is pre-trained with labeled data and guides the student model to make identical predictions given unlabeled input in a self-training setting. We examine three setups of teacher-student models with different data augmentation schemes and loss functions. Also, considering the scarcity of labeled data in the test phase, we artificially generate large-scale testing data with pitch labels from unlabeled data using an analysis-synthesis method. The results show that the SSL method significantly increases the performance against supervised learning only and the improvement depends on the teacher-student models, the size of unlabeled data, the number of self-training iterations, and other training details. We also find that it is essential to ensure that the unlabeled audio has vocal parts. Finally, we show that the proposed SSL method enables a baseline convolutional recurrent neural network model to achieve performance comparable to state-of-the-arts.

Submitted 14 August, 2020; originally announced August 2020.

Comments: 8 pages, 5 figures, accepted for the 21st International Society for Music Information Retrieval Conference (ISMIR 2020)

arXiv:2006.03633 [pdf, other]

doi 10.1109/IPSN48710.2020.00-25

Road Grade Estimation Using Crowd-Sourced Smartphone Data

Authors: Abhishek Gupta, Shaohan Hu, Weida Zhong, Adel Sadek, Lu Su, Chunming Qiao

Abstract: Estimates of road grade/slope can add another dimension of information to existing 2D digital road maps. Integration of road grade information will widen the scope of digital map's applications, which is primarily used for navigation, by enabling driving safety and efficiency applications such as Advanced Driver Assistance Systems (ADAS), eco-driving, etc. The huge scale and dynamic nature of road networks make sensing road grade a challenging task. Traditional methods oftentimes suffer from limited scalability and update frequency, as well as poor sensing accuracy. To overcome these problems, we propose a cost-effective and scalable road grade estimation framework using sensor data from smartphones. Based on our understanding of the error characteristics of smartphone sensors, we intelligently combine data from accelerometer, gyroscope and vehicle speed data from OBD-II/smartphone's GPS to estimate road grade. To improve accuracy and robustness of the system, the estimations of road grade from multiple sources/vehicles are crowd-sourced to compensate for the effects of varying quality of sensor data from different sources. Extensive experimental evaluation on a test route of ~9km demonstrates the superior performance of our proposed method, achieving $5\times$ improvement on road grade estimation accuracy over baselines, with 90\% of errors below 0.3$^\circ$.

Submitted 5 June, 2020; originally announced June 2020.

Comments: Proceedings of 19th ACM/IEEE Conference on Information Processing in Sensor Networks (IPSN'20)

arXiv:2002.01788 [pdf]

Learning Enabled Dense Space-division Multiplexing through a Single Multimode Fibre

Authors: Pengfei Fan, Michael Ruddlesden, Yufei Wang, Luming Zhao, Chao Lu, Lei Su

Abstract: Space-division multiplexing is a promising technology in optical fibre communication to improve the transmission capacity of a single optical fibre. However, the number of channels that can be multiplexed is limited by the crosstalks between channels, and the multiplexing is only applied to few-mode or multi-core fibres. Here, we propose a high-spatial-density channel multiplexing framework employing deep learning for standard multimode fibres (MMF). We present a proof-of-concept experimental system, consisting of a single light source, a single digital-micromirror-device modulator, a single detection camera, and a deep convolutional neural network (CNN) to demonstrate up to 400-channel simultaneous data transmission with accuracy close to 100% over MMFs of different types, diameters and lengths. A novel scalable semi-supervised learning model is proposed to adapt the CNN to the time-varying MMF information channels in real-time, to overcome the environmental changes such as temperature variations and vibrations, and to reconstruct the input data from complex crosstalks among hundreds of channels. This deep-learning based approach is promising to maximize the use of the spatial dimension of MMFs, and to break the present number-of-channel limit in space-division multiplexing for future high-capacity MMF transmission data links.

Submitted 5 February, 2020; originally announced February 2020.

arXiv:1908.01654 [pdf, other]

doi 10.1109/TAC.2019.2945038

Analysis of Two-Dimensional Feedback Systems over Networks Using Dissipativity

Authors: Yang Yan, Lanlan Su, Vijay Gupta, Panos Antsaklis

Abstract: This paper investigates the closed-loop $\mathcal{L}_2$ stability of two-dimensional (2-D) feedback systems across a digital communication network by introducing the tool of dissipativity. First, sampling of a continuous 2-D system is considered and an analytical characterization of the $QSR$-dissipativity of the sampled system is presented. Next, the input-feedforward output-feedback passivity (IF-OFP), a simplified form of $QSR$-dissipativity, is utilized to study the framework of feedback interconnection of two 2-D systems over networks. Then, the effects of signal quantization in communication links on dissipativity degradation of the 2-D feedback quantized system are analyzed. Additionally, an event-triggered mechanism is developed for 2-D networked control systems while maintaining $\mathcal{L}_2$ stability of the closed-loop system. In the end, an illustrative example is provided.

Submitted 5 August, 2019; originally announced August 2019.

Comments: 13 pages, 7 figures

arXiv:1907.13024 [pdf, other]

Stabilization of Linear Systems Across a Time-Varying AWGN Fading Channel

Authors: Lanlan Su, Vijay Gupta, Graziano Chesi

Abstract: This technical note investigates the minimum average transmit power required for mean-square stabilization of a discrete-time linear process across a time-varying additive white Gaussian noise (AWGN) fading channel that is presented between the sensor and the controller. We assume channel state information at both the transmitter and the receiver, and allow the transmit power to vary with the channel state to obtain the minimum required average transmit power via optimal power adaptation. We consider both the case of independent and identically distributed fading and fading subject to a Markov chain. Based on the proposed necessary and sufficient conditions for mean-square stabilization, we show that the minimum average transmit power to ensure stabilizability can be obtained by solving a geometric program.

Submitted 31 July, 2019; v1 submitted 30 July, 2019; originally announced July 2019.

Comments: 6 pages, 2 figures

arXiv:1907.13003 [pdf, other]

Distributed Resource Allocation over Time-varying Balanced Digraphs with Discrete-time Communication

Authors: Lanlan Su, Mengmou Li, Vijay Gupta, Graziano Chesi

Abstract: This work is concerned with the problem of distributed resource allocation in continuous-time setting but with discrete-time communication over infinitely jointly connected and balanced digraphs. We provide a passivity-based perspective for the continuous-time algorithm, based on which an intermittent communication scheme is developed. Particularly, a periodic communication scheme is first derived through analyzing the passivity degradation over output sampling of the distributed dynamics at each node. Then, an asynchronous distributed event-triggered scheme is further developed. The sampled-based event-triggered communication scheme is exempt from Zeno behavior as the minimum inter-event time is lower bounded by the sampling period. The parameters in the proposed algorithm rely only on local information of each individual node, which can be designed in a truly distributed fashion.

Submitted 15 January, 2021; v1 submitted 30 July, 2019; originally announced July 2019.

Comments: 12 pages, 7 figures

arXiv:1907.12988 [pdf, other]

Feedback Passivation of Linear Systems with Fixed-Structured Controllers

Authors: Lanlan Su, Vijay Gupta, Panos Antsaklis

Abstract: This paper addresses the problem of designing an optimal output feedback controller with a specified controller structure for linear time-invariant (LTI) systems to maximize the passivity level for the closed-loop system, in both continuous-time (CT) and discrete-time (DT). Specifically, the set of controllers under consideration is linearly parameterized with constrained parameters. Both input feedforward passivity (IFP) and output feedback passivity (OFP) indices are used to capture the level of passivity. Given a set of stabilizing controllers, a necessary and sufficient condition is proposed for the existence of such fixed-structured output feedback controllers that can passivate the closed-loop system. Moreover, it is shown that the condition can be used to obtain the controller that maximizes the IFP or the OFP index by solving a convex optimization problem.

Submitted 30 July, 2019; originally announced July 2019.

Comments: 8 pages, 1 figure

arXiv:1902.00539 [pdf, other]

Multi-layered Cepstrum for Instantaneous Frequency Estimation

Authors: Chin-Yun Yu, Li Su

Abstract: We propose the multi-layered cepstrum (MLC) method to estimate multiple fundamental frequencies (MF0) of a signal under challenging contamination such as high-pass filter noise. Taking the operation of cepstrum (i.e., Fourier transform, filtering, and nonlinear activation) recursively, MLC is shown as an efficient method to enhance MF0 saliency in a step-by-step manner. Evaluation on a real-world polyphonic music dataset under both normal and low-fidelity conditions demonstrates the potential of MLC.

Submitted 1 February, 2019; originally announced February 2019.

Comments: In 2018 6th IEEE Global Conference on Signal and Information Processing

arXiv:1811.12214 [pdf, other]

Play as You Like: Timbre-enhanced Multi-modal Music Style Transfer

Authors: Chien-Yu Lu, Min-Xin Xue, Chia-Che Chang, Che-Rung Lee, Li Su

Abstract: Style transfer of polyphonic music recordings is a challenging task when considering the modeling of diverse, imaginative, and reasonable music pieces in the style different from their original one. To achieve this, learning stable multi-modal representations for both domain-variant (i.e., style) and domain-invariant (i.e., content) information of music in an unsupervised manner is critical. In this paper, we propose an unsupervised music style transfer method without the need for parallel data. Besides, to characterize the multi-modal distribution of music pieces, we employ the Multi-modal Unsupervised Image-to-Image Translation (MUNIT) framework in the proposed system. This allows one to generate diverse outputs from the learned latent distributions representing contents and styles. Moreover, to better capture the granularity of sound, such as the perceptual dimensions of timbre and the nuance in instrument-specific performance, cognitively plausible features including mel-frequency cepstral coefficients (MFCC), spectral difference, and spectral envelope, are combined with the widely-used mel-spectrogram into a timbre-enhanced multi-channel input representation. The Relativistic average Generative Adversarial Networks (RaGAN) is also utilized to achieve fast convergence and high stability. We conduct experiments on bilateral style transfer tasks among three different genres, namely piano solo, guitar solo, and string quartet. Results demonstrate the advantages of the proposed method in music style transfer with improved sound quality and in allowing users to manipulate the output.

Submitted 28 November, 2018; originally announced November 2018.

arXiv:1810.12947 [pdf, other]

A Streamlined Encoder/Decoder Architecture for Melody Extraction

Authors: Tsung-Han Hsieh, Li Su, Yi-Hsuan Yang

Abstract: Melody extraction in polyphonic musical audio is important for music signal processing. In this paper, we propose a novel streamlined encoder/decoder network that is designed for the task. We make two technical contributions. First, drawing inspiration from a state-of-the-art model for semantic pixel-wise segmentation, we pass through the pooling indices between pooling and un-pooling layers to localize the melody in frequency. We can achieve results close to the state-of-the-art with much fewer convolutional layers and simpler convolution modules. Second, we propose a way to use the bottleneck layer of the network to estimate the existence of a melody line for each time frame, and make it possible to use a simple argmax function instead of ad-hoc thresholding to get the final estimation of the melody line. Our experiments on both vocal melody extraction and general melody extraction validate the effectiveness of the proposed model.

Submitted 18 February, 2019; v1 submitted 30 October, 2018; originally announced October 2018.

Comments: This is a pre-print version of an ICASSP 2019 paper

arXiv:1810.12764 [pdf]

Single-shot image retrieval through a multimode fiber using a genetic algorithm

Authors: Michael Ruddlesden, Jinshuai Zhang, Tianrui Zhao, Wen Wang, Lei Su

Abstract: In this letter, we present a genetic algorithm-based approach for image retrieval through a multimode fiber in a reference-less system. Due to mode interference, when an image is illuminated at one side of a multimode fiber, the transmitted light forms a noise-like speckle pattern at the other end. With the use of a prior-measured transmission matrix of the fiber, a speckle pattern is calculated using a random input mask. By optimizing the input mask to achieve a high correlation coefficient of experimental and calculated patterns, the input mask is optimized into an image with high similarity to the original image.

Submitted 26 October, 2018; originally announced October 2018.

arXiv:1810.10086 [pdf, ps, other]

Finite-time Guarantees for Byzantine-Resilient Distributed State Estimation with Noisy Measurements

Authors: Lili Su, Shahin Shahrampour

Abstract: This work considers resilient, cooperative state estimation in unreliable multi-agent networks. A network of agents aims to collaboratively estimate the value of an unknown vector parameter, while an {\em unknown} subset of agents suffer Byzantine faults. Faulty agents malfunction arbitrarily and may send out {\em highly unstructured} messages to other agents in the network. As opposed to fault-free networks, reaching agreement in the presence of Byzantine faults is far from trivial. In this paper, we propose a computationally-efficient algorithm that is provably robust to Byzantine faults. At each iteration of the algorithm, a good agent (1) performs a gradient descent update based on noisy local measurements, (2) exchanges its update with other agents in its neighborhood, and (3) robustly aggregates the received messages using coordinate-wise trimmed means. Under mild technical assumptions, we establish that good agents learn the true parameter asymptotically in almost sure sense. We further complement our analysis by proving (high probability) {\em finite-time} convergence rate, encapsulating network characteristics.

Submitted 16 October, 2018; originally announced October 2018.

arXiv:1809.06970 [pdf, other]

doi 10.1145/3274783.3274840

FastDeepIoT: Towards Understanding and Optimizing Neural Network Execution Time on Mobile and Embedded Devices

Authors: Shuochao Yao, Yiran Zhao, Huajie Shao, Shengzhong Liu, Dongxin Liu, Lu Su, Tarek Abdelzaher

Abstract: Deep neural networks show great potential as solutions to many sensing application problems, but their excessive resource demand slows down execution time, posing a serious impediment to deployment on low-end devices. To address this challenge, recent literature focused on compressing neural network size to improve performance. We show that changing neural network size does not proportionally affect performance attributes of interest, such as execution time. Rather, extreme run-time nonlinearities exist over the network configuration space. Hence, we propose a novel framework, called FastDeepIoT, that uncovers the non-linear relation between neural network structure and execution time, then exploits that understanding to find network configurations that significantly improve the trade-off between execution time and accuracy on mobile and embedded devices. FastDeepIoT makes two key contributions. First, FastDeepIoT automatically learns an accurate and highly interpretable execution time model for deep neural networks on the target device. This is done without prior knowledge of either the hardware specifications or the detailed implementation of the used deep learning library. Second, FastDeepIoT informs a compression algorithm how to minimize execution time on the profiled device without impacting accuracy. We evaluate FastDeepIoT using three different sensing-related tasks on two mobile devices: Nexus 5 and Galaxy Nexus. FastDeepIoT further reduces the neural network execution time by 48% to 78% and energy consumption by 37% to 69% compared with the state-of-the-art compression algorithms.

Submitted 18 September, 2018; originally announced September 2018.

Comments: Accepted by SenSys '18

arXiv:1804.09202 [pdf, other]

Vocal melody extraction using patch-based CNN

Authors: Li Su

Abstract: A patch-based convolutional neural network (CNN) model presented in this paper for vocal melody extraction in polyphonic music is inspired by object detection in image processing. The input of the model is a novel time-frequency representation which enhances the pitch contours and suppresses the harmonic components of a signal. This succinct data representation and the patch-based CNN model enable an efficient training process with limited labeled data. Experiments on various datasets show excellent speed and competitive accuracy compared to other deep learning approaches.

Submitted 24 April, 2018; originally announced April 2018.

Journal ref: Proc. Int. Conf. Acoustic, Speech and Signal Processing (ICASSP), 2018

arXiv:1711.08600 [pdf, other]

Singing voice correction using canonical time warping

Authors: Yin-Jyun Luo, Ming-Tso Chen, Tai-Shih Chi, Li Su

Abstract: Expressive singing voice correction is an appealing but challenging problem. A robust time-warping algorithm which synchronizes two singing recordings can provide a promising solution. We thereby propose to address the problem by canonical time warping (CTW) which aligns amateur singing recordings to professional ones. A new pitch contour is generated given the alignment information, and a pitch-corrected singing is synthesized back through the vocoder. The objective evaluation shows that CTW is robust against pitch-shifting and time-stretching effects, and the subjective test demonstrates that CTW prevails over the other methods including DTW and the commercial auto-tuning software. Finally, we demonstrate the applicability of the proposed method in a practical, real-world scenario.

Submitted 23 November, 2017; originally announced November 2017.

arXiv:1705.03955 [pdf]

VehSense: Slippery Road Detection Using Smartphones

Authors: Yunfei Hou, Abhishek Gupta, Tong Guan, Shaohan Hu, Lu Su, Chunming Qiao

Abstract: This paper investigates a new application of vehicular sensing: detecting and reporting slippery road conditions. We describe a system and associated algorithm to monitor vehicle skidding events using smartphones and OBD-II (On-Board Diagnostics) adaptors. This system, which we call VehSense, gathers data from smartphone inertial sensors and vehicle wheel speed sensors, and processes the data to monitor slippery road conditions in real-time. Specifically, two speed readings are collected: 1) ground speed, which is estimated by vehicle acceleration and rotation, and 2) wheel speed, which is retrieved from the OBD-II interface. The mismatch between these two speeds is used to infer a skidding event. Without tapping into vehicle manufacturers' proprietary data (e.g., antilock braking system), VehSense is compatible with most passenger vehicles, and thus can be easily deployed. We evaluate our system on snow-covered roads in Buffalo, and show that it can detect vehicle skidding effectively.

Submitted 10 May, 2017; originally announced May 2017.

Comments: 2017 IEEE 85th Vehicular Technology Conference (VTC2017-Spring)

Showing 1–39 of 39 results for author: Su, L