Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Xu, Tianze; Li, Jiajun; Chen, Xuesong; Yao, Yinrui; Liu, Shuchang

arXiv:2405.02801v1 (cs)

[Submitted on 5 May 2024 (this version), latest version 7 May 2024 (v2)]

Abstract: In recent years, AI-Generated Content (AIGC) has advanced rapidly, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose Mozart's Touch, a multi-modal music generation framework. It can generate music aligned with cross-modal inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation problem between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations on the proposed model, and the results indicate that our model surpasses the performance of current state-of-the-art models. Our code and examples are available at: this https URL
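
The three-module design described in the abstract can be pictured as a chain of text-passing stages. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Pipeline` class, the three callable types, and the stub functions are all hypothetical names introduced here for illustration, and the stubs merely stand in for pre-trained models that would be used as-is, without training or fine-tuning.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical stage signatures (assumptions, not the paper's API).
Captioner = Callable[[bytes], str]       # stage 1: image/video -> descriptive text
Bridge = Callable[[str], str]            # stage 2: caption -> music-oriented prompt
MusicGenerator = Callable[[str], bytes]  # stage 3: prompt -> audio bytes


@dataclass
class Pipeline:
    caption: Captioner
    bridge: Bridge
    generate: MusicGenerator

    def run(self, media: bytes) -> Tuple[str, str, bytes]:
        """Chain the three stages and return the intermediate text as well,
        so every prompt passed between stages can be inspected."""
        caption = self.caption(media)
        prompt = self.bridge(caption)
        audio = self.generate(prompt)
        return caption, prompt, audio


# Stub stages standing in for frozen pre-trained models.
def stub_caption(media: bytes) -> str:
    return "a sunset over a calm sea"


def stub_bridge(caption: str) -> str:
    # A real bridge would ask an LLM to rewrite the visual caption
    # as a description of music; here it is a fixed template.
    return f"Calm, warm ambient music evoking: {caption}"


def stub_generate(prompt: str) -> bytes:
    return prompt.encode("utf-8")  # placeholder for synthesized audio


pipeline = Pipeline(stub_caption, stub_bridge, stub_generate)
caption, prompt, audio = pipeline.run(b"\x00")
print(caption)
print(prompt)
```

Because every hand-off between stages is plain text, the intermediate caption and prompt can be logged and read directly, which is one way to picture the interpretability the abstract attributes to prompt-based bridging.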

Comments:	7 pages, 2 figures, submitted to ACM MM 2024
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2405.02801 [cs.SD]
	(or arXiv:2405.02801v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2405.02801

Submission history

From: Tianze Xu
[v1] Sun, 5 May 2024 03:15:52 UTC (5,225 KB)
[v2] Tue, 7 May 2024 09:55:39 UTC (5,225 KB)
