-
Monte-Carlo Integration Based Multiple-Scattering Channel Modeling for Ultraviolet Communications in Turbulent Atmosphere
Authors:
Renzhi Yuan,
Xinyi Chu,
Tao Shan,
Mugen Peng
Abstract:
Modeling of multiple-scattering channels in atmospheric turbulence is essential for the performance analysis of long-distance non-line-of-sight (NLOS) ultraviolet (UV) communications. Existing works on the turbulent channel modeling for NLOS UV communications either ignored the turbulence-induced scattering effect or erroneously estimated the turbulent fluctuation effect, resulting in a contradiction with reported experiments. In this paper, we establish a comprehensive multiple-scattering turbulent channel model for NLOS UV communications considering both the turbulence-induced scattering effect and the turbulent fluctuation effect. We first derive the turbulent scattering coefficient and turbulent phase scattering function based on the Booker-Gordon turbulent power spectral density model. Then an improved estimation method is proposed for both the turbulent fluctuation and the turbulent fading coefficient based on the Monte-Carlo integration approach. Numerical results demonstrate that the turbulence-induced scattering effect can always be ignored for typical UV communication scenarios. Besides, the turbulent fluctuation will increase as either the communication distance, the elevation angle, or the divergence angle increases, which is compatible with existing experimental results. Moreover, we find that the probability density of the equivalent turbulent fading for multiple-scattering turbulent channels can be approximated as a Gaussian distribution.
Submitted 6 June, 2024;
originally announced June 2024.
-
ComposerX: Multi-Agent Symbolic Music Composition with LLMs
Authors:
Qixin Deng,
Qikai Yang,
Ruibin Yuan,
Yipeng Huang,
Yi Wang,
Xubo Liu,
Zeyue Tian,
Jiahao Pan,
Ge Zhang,
Hanfeng Lin,
Yizhi Li,
Yinghao Ma,
Jie Fu,
Chenghua Lin,
Emmanouil Benetos,
Wenwu Wang,
Guangyu Xia,
Wei Xue,
Yike Guo
Abstract:
Music composition represents the creative side of humanity, and itself is a complex task that requires abilities to understand and generate information with long dependency and harmony constraints. While demonstrating impressive capabilities in STEM subjects, current LLMs easily fail in this task, generating ill-written music even when equipped with modern techniques like In-Context-Learning and Chain-of-Thoughts. To further explore and enhance LLMs' potential in music composition by leveraging their reasoning ability and the large knowledge base in music history and theory, we propose ComposerX, an agent-based symbolic music generation framework. We find that applying a multi-agent approach significantly improves the music composition quality of GPT-4. The results demonstrate that ComposerX is capable of producing coherent polyphonic music compositions with captivating melodies, while adhering to user instructions.
Submitted 30 April, 2024; v1 submitted 28 April, 2024;
originally announced April 2024.
-
MuPT: A Generative Symbolic Music Pretrained Transformer
Authors:
Xingwei Qu,
Yuelin Bai,
Yinghao Ma,
Ziya Zhou,
Ka Man Lo,
Jiaheng Liu,
Ruibin Yuan,
Lejun Min,
Xueling Liu,
Tianyu Zhang,
Xinrun Du,
Shuyue Guo,
Yiming Liang,
Yizhi Li,
Shangda Wu,
Junting Zhou,
Tianyu Zheng,
Ziyang Ma,
Fengze Han,
Wei Xue,
Gus Xia,
Emmanouil Benetos,
Xiang Yue,
Chenghua Lin,
Xu Tan
, et al. (4 additional authors not shown)
Abstract:
In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.
Submitted 10 April, 2024; v1 submitted 9 April, 2024;
originally announced April 2024.
-
Modeling Analog Dynamic Range Compressors using Deep Learning and State-space Models
Authors:
Hanzhi Yin,
Gang Cheng,
Christian J. Steinmetz,
Ruibin Yuan,
Richard M. Stern,
Roger B. Dannenberg
Abstract:
We describe a novel approach for developing realistic digital models of dynamic range compressors for digital audio production by analyzing their analog prototypes. While realistic digital dynamic compressors are potentially useful for many applications, the design process is challenging because the compressors operate nonlinearly over long time scales. Our approach is based on the structured state space sequence model (S4), as implementing the state-space model (SSM) has proven to be efficient at learning long-range dependencies and is promising for modeling dynamic range compressors. We present in this paper a deep learning model with S4 layers to model the Teletronix LA-2A analog dynamic range compressor. The model is causal, executes efficiently in real time, and achieves roughly the same quality as previous deep-learning models but with fewer parameters.
Submitted 24 March, 2024;
originally announced March 2024.
-
Advancing COVID-19 Detection in 3D CT Scans
Authors:
Qingqiu Li,
Runtian Yuan,
Junlin Hou,
Jilan Xu,
Yuejie Zhang,
Rui Feng,
Hao Chen
Abstract:
To make a more accurate diagnosis of COVID-19, we propose a straightforward yet effective model. Firstly, we analyse the characteristics of 3D CT scans and remove the non-lung parts, facilitating the model to focus on lesion-related areas and reducing computational cost. We use ResNeSt50 as the strong feature extractor, initializing it with pretrained weights which have COVID-19-specific prior knowledge. Our model achieves a Macro F1 Score of 0.94 on the validation set of the 4th COV19D Competition Challenge I, surpassing the baseline by 16%. This indicates its effectiveness in distinguishing between COVID-19 and non-COVID-19 cases, making it a robust method for COVID-19 detection.
Submitted 18 March, 2024;
originally announced March 2024.
-
Domain Adaptation Using Pseudo Labels for COVID-19 Detection
Authors:
Runtian Yuan,
Qingqiu Li,
Junlin Hou,
Jilan Xu,
Yuejie Zhang,
Rui Feng,
Hao Chen
Abstract:
In response to the need for rapid and accurate COVID-19 diagnosis during the global pandemic, we present a two-stage framework that leverages pseudo labels for domain adaptation to enhance the detection of COVID-19 from CT scans. By utilizing annotated data from one domain and non-annotated data from another, the model overcomes the challenge of data scarcity and variability, common in emergent health crises. The innovative approach of generating pseudo labels enables the model to iteratively refine its learning process, thereby improving its accuracy and adaptability across different hospitals and medical centres. Experimental results on the COV19-CT-DB database showcase the model's potential to achieve high diagnostic precision, significantly contributing to efficient patient management and alleviating the strain on healthcare systems. Our method achieves a Macro F1 Score of 0.92 on the validation set of the COVID-19 domain adaptation challenge.
Submitted 18 March, 2024;
originally announced March 2024.
-
ChatMusician: Understanding and Generating Music Intrinsically with LLM
Authors:
Ruibin Yuan,
Hanfeng Lin,
Yi Wang,
Zeyue Tian,
Shangda Wu,
Tianhao Shen,
Ge Zhang,
Yuhang Wu,
Cong Liu,
Ziya Zhou,
Ziyang Ma,
Liumeng Xue,
Ziyu Wang,
Qin Liu,
Tianyu Zheng,
Yizhi Li,
Yinghao Ma,
Yiming Liang,
Xiaowei Chi,
Ruibo Liu,
Zili Wang,
Pengfei Li,
Jingcheng Wu,
Chenghua Lin,
Qifeng Liu
, et al. (10 additional authors not shown)
Abstract:
While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B token music-language corpora MusicPile, the collected MusicTheoryBench, code, model and demo on GitHub.
Submitted 25 February, 2024;
originally announced February 2024.
-
Joint Beam Direction Control and Radio Resource Allocation in Dynamic Multi-beam LEO Satellite Networks
Authors:
Shuo Yuan,
Yaohua Sun,
Mugen Peng,
Renzhi Yuan
Abstract:
Multi-beam low earth orbit (LEO) satellites are emerging as key components in beyond 5G and 6G to provide global coverage and high data rate. To fully unleash the potential of LEO satellite communication, resource management plays a key role. However, the uneven distribution of users, the coupling of multi-dimensional resources, complex inter-beam interference, and time-varying network topologies all impose significant challenges on effective communication resource management. In this paper, we study the joint optimization of beam direction and the allocation of spectrum, time, and power resource in a dynamic multi-beam LEO satellite network. The objective is to improve long-term user sum data rate while taking user fairness into account. Since the concerned resource management problem is mixed-integer non-convex programming, the problem is decomposed into three subproblems, namely beam direction control and time slot allocation, user subchannel assignment, and beam power allocation. Then, these subproblems are solved iteratively by leveraging matching with externalities and successive convex approximation, and the proposed algorithms are analyzed in terms of stability, convergence, and complexity. Extensive simulations are conducted, and the results demonstrate that our proposal can improve the number of served users by up to two times and the sum user data rate by up to 68%, compared to baseline schemes.
Submitted 17 January, 2024;
originally announced January 2024.
-
Dynamic and Memory-efficient Shape Based Methodologies for User Type Identification in Smart Grid Applications
Authors:
Rui Yuan,
S. Ali Pourmousavi,
Wen L. Soong,
Jon A. R. Liisberg
Abstract:
Detecting behind-the-meter (BTM) equipment and major appliances at the residential level and tracking their changes in real time is important for aggregators and traditional electricity utilities. In our previous work, we developed a systematic solution called IRMAC to identify residential users' BTM equipment and applications from their imported energy data. As a part of IRMAC, a Similarity Profile (SP) was proposed for dimensionality reduction and extracting self-join similarity from the end users' daily electricity usage data. The proposed SP calculation, however, was computationally expensive and required a significant amount of memory at the user's end. To realise the benefits of edge computing, in this paper, we propose and assess three computationally efficient updating solutions, namely additive, fixed-memory, and codebook-based updating methods. Extensive simulation studies are carried out using real PV users' data to evaluate the performance of the proposed methods in identifying PV users, tracking changes in real time, and examining memory usage. We found that the codebook-based solution reduces the required memory by more than 30% without compromising the performance of extracting users' features. When the end users' data storage and computation speed are concerned, the fixed-memory method outperforms the others. In terms of tracking the changes, different variations of the fixed-memory method show various inertia levels, making them suitable for different applications.
Submitted 6 January, 2024;
originally announced January 2024.
-
A New Time Series Similarity Measure and Its Smart Grid Applications
Authors:
Rui Yuan,
S. Ali Pourmousavi,
Wen L. Soong,
Andrew J. Black,
Jon A. R. Liisberg,
Julian Lemos-Vinasco
Abstract:
Many smart grid applications involve data mining, clustering, classification, identification, and anomaly detection, among others. These applications primarily depend on the measurement of similarity, which is the distance between different time series or subsequences of a time series. The commonly used time series distance measures, namely Euclidean Distance (ED) and Dynamic Time Warping (DTW), do not quantify the flexible nature of electricity usage data in terms of temporal dynamics. As a result, there is a need for a new distance measure that can quantify both the amplitude and temporal changes of electricity time series for smart grid applications, e.g., demand response and load profiling. This paper introduces a novel distance measure to compare electricity usage patterns. The method consists of two phases that quantify the effort required to reshape one time series into another, considering both amplitude and temporal changes. The proposed method is evaluated against ED and DTW using real-world data in three smart grid applications. Overall, the proposed measure outperforms ED and DTW in accurately identifying the best load scheduling strategy, anomalous days with irregular electricity usage, and determining electricity users' behind-the-meter (BTM) equipment.
Submitted 18 October, 2023;
originally announced October 2023.
-
A Variational Auto-Encoder Enabled Multi-Band Channel Prediction Scheme for Indoor Localization
Authors:
Ruihao Yuan,
Kaixuan Huang,
Pan Yang,
Shunqing Zhang
Abstract:
Indoor localization is in increasing demand for various cutting-edge technologies, such as virtual/augmented reality and smart homes. Traditional model-based localization suffers from significant computational overhead, so fingerprint localization, which needs lower computation cost after the fingerprint database is built, is getting increasing attention. However, the accuracy of indoor localization is limited by the complicated indoor environment, which brings multipath signal refraction. In this paper, we provide a scheme to improve the accuracy of indoor fingerprint localization from the frequency domain by predicting the channel state information (CSI) values from another transmitting channel and splicing the multi-band information together to get more precise localization results. We test our proposed scheme on COST 2100 simulation data and real-time orthogonal frequency division multiplexing (OFDM) WiFi data collected from an office scenario.
Submitted 19 September, 2023;
originally announced September 2023.
-
Modelling Irrational Behaviour of Residential End Users using Non-Stationary Gaussian Processes
Authors:
Nam Trong Dinh,
Sahand Karimi-Arpanahi,
Rui Yuan,
S. Ali Pourmousavi,
Mingyu Guo,
Jon A. R. Liisberg,
Julian Lemos-Vinasco
Abstract:
Demand response (DR) plays a critical role in ensuring efficient electricity consumption and optimal use of network assets. Yet, existing DR models often overlook a crucial element, the irrational behaviour of electricity end users. In this work, we propose a price-responsive model that incorporates key aspects of end-user irrationality, specifically loss aversion, time inconsistency, and bounded rationality. To this end, we first develop a framework that uses Multiple Seasonal-Trend decomposition using Loess (MSTL) and non-stationary Gaussian processes to model the randomness in the electricity consumption by residential consumers. The impact of this model is then evaluated through a community battery storage (CBS) business model. Additionally, we apply a chance-constrained optimisation model for CBS operation that deals with the unpredictability of the end-user irrationality. Our simulations using real-world data show that the proposed DR model provides a more realistic estimate of end-user price-responsive behaviour when considering irrationality. Compared to a deterministic model that cannot fully take into account the irrational behaviour of end users, the chance-constrained CBS operation model yields an additional 19% revenue. Lastly, the business model reduces the electricity costs of solar end users by 11%.
Submitted 26 March, 2024; v1 submitted 16 September, 2023;
originally announced September 2023.
-
On the Effectiveness of Speech Self-supervised Learning for Music
Authors:
Yinghao Ma,
Ruibin Yuan,
Yizhi Li,
Ge Zhang,
Xingran Chen,
Hanzhi Yin,
Chenghua Lin,
Emmanouil Benetos,
Anton Ragni,
Norbert Gyenge,
Ruibo Liu,
Gus Xia,
Roger Dannenberg,
Yike Guo,
Jie Fu
Abstract:
Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) still remains largely unexplored. While previous SSL models pre-trained on music recordings may have been mostly closed-sourced, recent speech models such as wav2vec2.0 have shown promise in music modelling. Nevertheless, research exploring the effectiveness of applying speech SSL models to music recordings has been limited. We explore the music adaptation of SSL with two distinctive speech-related models, data2vec1.0 and HuBERT, and refer to them as music2vec and musicHuBERT, respectively. We train 12 SSL models with 95M parameters under various pre-training configurations and systematically evaluate their performance on 13 different MIR tasks. Our findings suggest that training with music data can generally improve performance on MIR tasks, even when models are trained using paradigms designed for speech. However, we identify the limitations of such existing speech-oriented designs, especially in modelling polyphonic information. Based on the experimental results, empirical suggestions are also given for designing future musical SSL strategies and paradigms.
Submitted 11 July, 2023;
originally announced July 2023.
-
LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT
Authors:
Le Zhuo,
Ruibin Yuan,
Jiahao Pan,
Yinghao Ma,
Yizhi Li,
Ge Zhang,
Si Liu,
Roger Dannenberg,
Jie Fu,
Chenghua Lin,
Emmanouil Benetos,
Wenhu Chen,
Wei Xue,
Yike Guo
Abstract:
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.
Submitted 21 November, 2023; v1 submitted 29 June, 2023;
originally announced June 2023.
-
MARBLE: Music Audio Representation Benchmark for Universal Evaluation
Authors:
Ruibin Yuan,
Yinghao Ma,
Yizhi Li,
Ge Zhang,
Xingran Chen,
Hanzhi Yin,
Le Zhuo,
Yiqi Liu,
Jiawen Huang,
Zeyue Tian,
Binyue Deng,
Ningzhi Wang,
Chenghua Lin,
Emmanouil Benetos,
Anton Ragni,
Norbert Gyenge,
Roger Dannenberg,
Wenhu Chen,
Gus Xia,
Wei Xue,
Si Liu,
Shi Wang,
Ruibo Liu,
Yike Guo,
Jie Fu
Abstract:
In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines. Besides, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues on datasets. Results suggest that recently proposed large-scale pre-trained musical language models perform the best in most tasks, with room for further improvement. The leaderboard and toolkit repository are published at https://marble-bm.shef.ac.uk to promote future music AI research.
Submitted 23 November, 2023; v1 submitted 18 June, 2023;
originally announced June 2023.
-
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
Authors:
Yizhi Li,
Ruibin Yuan,
Ge Zhang,
Yinghao Ma,
Xingran Chen,
Hanzhi Yin,
Chenghao Xiao,
Chenghua Lin,
Anton Ragni,
Emmanouil Benetos,
Norbert Gyenge,
Roger Dannenberg,
Ruibo Liu,
Wenhu Chen,
Gus Xia,
Yemin Shi,
Wenhao Huang,
Zili Wang,
Yike Guo,
Jie Fu
Abstract:
Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is partially due to the distinctive challenges associated with modelling musical knowledge, particularly tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. In our exploration, we identified an effective combination of teacher models, which outperforms conventional speech and audio approaches in terms of performance. This combination includes an acoustic teacher based on Residual Vector Quantisation - Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide range of settings to overcome the instability in acoustic language model pre-training, which allows our designed paradigm to scale from 95M to 330M parameters. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
Submitted 22 April, 2024; v1 submitted 31 May, 2023;
originally announced June 2023.
-
MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning
Authors:
Yizhi Li,
Ruibin Yuan,
Ge Zhang,
Yinghao Ma,
Chenghua Lin,
Xingran Chen,
Anton Ragni,
Hanzhi Yin,
Zhijie Hu,
Haoyu He,
Emmanouil Benetos,
Norbert Gyenge,
Ruibo Liu,
Jie Fu
Abstract:
The deep learning community has witnessed an exponentially growing interest in self-supervised learning (SSL). However, it still remains unexplored how to build a framework for learning useful representations of raw music waveforms in a self-supervised manner. In this work, we design Music2Vec, a framework exploring different SSL algorithmic components and tricks for music audio recordings. Our model achieves comparable results to the state-of-the-art (SOTA) music SSL model Jukebox, despite being significantly smaller, with fewer than 2% of the latter's parameters. The model will be released on Hugging Face (please refer to: https://huggingface.co/m-a-p/music2vec-v1).
Submitted 5 December, 2022;
originally announced December 2022.
-
Inconsistency Ranking-based Noisy Label Detection for High-quality Data
Authors:
Ruibin Yuan,
Hanzhi Yin,
Yi Wang,
Yifan He,
Yushi Ye,
Lei Zhang,
Zhizheng Wu
Abstract:
The success of deep learning requires massive, high-quality annotated data. However, the size and the quality of a dataset usually trade off against each other in practice, as data collection and cleaning are expensive and time-consuming. In real-world applications, especially those using crowdsourced datasets, it is important to exclude noisy labels. To address this, this paper proposes an automatic noisy label detection (NLD) technique with inconsistency ranking for high-quality data. We apply this technique to the automatic speaker verification (ASV) task as a proof of concept. We investigate both inter-class and intra-class inconsistency ranking and compare several metric learning loss functions under different noise settings. Experimental results confirm that the proposed solution can improve both the efficiency and the effectiveness of cleaning large-scale speaker recognition datasets.
Submitted 15 June, 2023; v1 submitted 30 November, 2022;
originally announced December 2022.
-
DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion
Authors:
Ruibin Yuan,
Yuxuan Wu,
Jacob Li,
Jaxter Kim
Abstract:
The widespread adoption of speech-based online services raises security and privacy concerns regarding the data that they use and share. If the data were compromised, attackers could exploit user speech to bypass speaker verification systems or even impersonate users. To mitigate this, we propose DeID-VC, a speaker de-identification system that converts a real speaker to pseudo speakers, thus removing or obfuscating the speaker-dependent attributes from a spoken voice. The key components of DeID-VC include a Variational Autoencoder (VAE) based Pseudo Speaker Generator (PSG) and a voice conversion Autoencoder (AE) under zero-shot settings. With the help of PSG, DeID-VC can assign unique pseudo speakers at the speaker level or even at the utterance level. In addition, two novel learning objectives are added to bridge the gap between training and inference in zero-shot voice conversion. We present our experimental results with word error rate (WER) and equal error rate (EER), along with three subjective metrics to evaluate the generated output of DeID-VC. The results show that our method substantially improves intelligibility (WER 10% lower) and de-identification effectiveness (EER 5% higher) compared to our baseline. Code and listening demo: https://github.com/a43992899/DeID-VC
Submitted 9 September, 2022;
originally announced September 2022.
-
Optimal activity and battery scheduling algorithm using load and solar generation forecast
Authors:
Rui Yuan,
Nam Trong Dinh,
Yogesh Pipada,
S. Ali Pourmousavi
Abstract:
In this report, we describe a technical approach to the solar PV and demand forecasting and optimal scheduling problem posed by the IEEE-CIS 3rd technical challenge on predict + optimize for activity and battery scheduling. Using the historical data provided by the organizers, a simple pre-processing approach with a rolling window was used to detect and replace invalid data points. After filling the missing values, advanced time-series forecasting techniques, namely tree-based methods and refined motif discovery, were employed to predict the baseload consumption of six different buildings together with the power production of their associated solar PV panels. An optimization problem is then formulated that uses the predicted values and the wholesale electricity prices to create a timetable for a set of activities, including the scheduling of lecture theatres and battery charging and discharging operations, for one month ahead. The valley-filling optimization was done across all the buildings with the objective of minimizing the total energy cost and achieving net-zero imported power from the grid.
Submitted 6 December, 2021;
originally announced December 2021.
-
IRMAC: Interpretable Refined Motifs in Binary Classification for Smart Grid Applications
Authors:
Rui Yuan,
S. Ali Pourmousavi,
Wen L. Soong,
Giang Nguyen,
Jon A. R. Liisberg
Abstract:
Modern power systems face high uncertainty due to the increasing penetration of renewable energy resources and the electrification of heating systems. In this paradigm shift, understanding electricity users' demand is of utmost value to retailers, aggregators, and policymakers. However, behind-the-meter (BTM) equipment and appliances at the household level are unknown to the other stakeholders, mainly due to privacy concerns and tight regulations. In this paper, we seek to identify residential consumers based on their BTM equipment, mainly rooftop photovoltaic (PV) systems and electric heating, using imported/purchased energy data from utility meters. To solve this problem with an interpretable, fast, secure, and maintainable solution, we propose an integrated method called Interpretable Refined Motifs And binary Classification (IRMAC). The proposed method comprises a novel shape-based pattern extraction technique, called Refined Motif (RM) discovery, and a single-neuron classifier. The first part extracts a sub-pattern from the long time series considering the frequency of occurrences, average dissimilarity, and time dynamics while emphasising specific times with annotated distances. The second part identifies users' types with linear complexity while preserving the transparency of the algorithms. Using real data from Australia and Denmark, the proposed method is tested and verified in identifying PV owners and electrical heating system users.
Submitted 14 November, 2022; v1 submitted 22 September, 2021;
originally announced September 2021.
-
Quantum Discrimination of Two Noisy Displaced Number States
Authors:
Renzhi Yuan,
Julian Cheng
Abstract:
The quantum discrimination of two non-coherent states has drawn much attention recently. In this letter, we first consider the quantum discrimination of two noiseless displaced number states. Then we derive the Fock representation of noisy displaced number states and address the problem of discriminating between two noisy displaced number states. We further prove that the optimal quantum discrimination of two noisy displaced number states can be achieved by the Kennedy receiver with threshold detection. Simulation results verify the theoretical derivations and show that the error probability of on-off keying modulation using a displaced number state is significantly less than that of on-off keying modulation using a coherent state with the same average energy.
Submitted 9 December, 2020;
originally announced December 2020.
-
Diverse Melody Generation from Chinese Lyrics via Mutual Information Maximization
Authors:
Ruibin Yuan,
Ge Zhang,
Anqiao Yang,
Xinyue Zhang
Abstract:
In this paper, we propose to adapt the method of mutual information maximization to the task of Chinese lyrics conditioned melody generation to improve the generation quality and diversity. We employ scheduled sampling and force decoding techniques to improve the alignment between lyrics and melodies. With our method, which we call Diverse Melody Generation (DMG), a sequence-to-sequence model learns to generate diverse melodies heavily depending on the input style ids, while keeping the tonality and improving the alignment. The experimental results of subjective tests show that DMG can generate more pleasing and coherent tunes than baseline methods.
Submitted 7 December, 2020;
originally announced December 2020.
-
Free-Space Optical Communication Using Non-mode-Selective Photonic Lantern Based Coherent Receiver
Authors:
Bo Zhang,
Renzhi Yuan,
Jianfeng Sun,
Julian Cheng,
Mohamed-Slim Alouini
Abstract:
A free-space optical communication system using a non-mode-selective photonic lantern (PL) based coherent receiver is studied. Based on the simulation of photon distribution, the power distribution at the single-mode fiber end of the PL is quantitatively described as a truncated Gaussian distribution over a simplex. The signal-to-noise ratios (SNRs) for the communication system using the PL based receiver are analyzed using different combining techniques, including selection combining (SC), equal-gain combining (EGC), and maximal-ratio combining (MRC). The integral solution, series lower bound solution, and asymptotic solution are presented for the bit-error rate (BER) of the PL based receiver, single-mode fiber receiver, and multimode fiber receiver over Gamma-Gamma atmospheric turbulence channels. We demonstrate that the power distribution of the PL has no effect on the SNR and BER performance of the PL based receiver when MRC is used, and that it has only limited influence when EGC is used. However, the power distribution of the PL can greatly affect the BER performance when SC is used. In addition, the SNR gains of the PL based receiver using EGC over the single-mode fiber receiver and multimode fiber receiver are numerically studied under different imperfect device parameters, and the scope of application of the communication system is further provided.
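The qualitative claim about SC, EGC, and MRC can be illustrated with the textbook combining formulas. The sketch below is not the paper's model: it assumes equal noise power on every branch, a fixed total SNR split across branches by power fractions that sum to 1, and hypothetical names (`combine_snr`, `fractions`, `total_snr`) chosen only for illustration.

```python
import math

def combine_snr(fractions, total_snr):
    """Return (SC, EGC, MRC) output SNRs for the given branch power fractions.

    Assumptions (not from the paper): equal noise power per branch, and
    branch SNR_i = fractions[i] * total_snr with sum(fractions) == 1.
    """
    branch = [f * total_snr for f in fractions]
    n = len(branch)
    sc = max(branch)                                   # pick the strongest branch
    egc = sum(math.sqrt(s) for s in branch) ** 2 / n   # co-phase, equal weights
    mrc = sum(branch)                                  # weights matched to amplitude
    return sc, egc, mrc

# Same total power, two different splits over four branches.
uniform = combine_snr([0.25, 0.25, 0.25, 0.25], 10.0)
skewed = combine_snr([0.7, 0.1, 0.1, 0.1], 10.0)

print("uniform (SC, EGC, MRC):", uniform)
print("skewed  (SC, EGC, MRC):", skewed)
```

Under these assumptions the MRC output depends only on the total power, EGC changes moderately with the split, and SC tracks the strongest branch, which matches the ordering of sensitivities described in the abstract.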
Submitted 24 November, 2020; v1 submitted 3 July, 2020;
originally announced July 2020.