Subtractive Training for Music Stem Insertion using Latent Diffusion Models
1 APPENDIX
Text Prompt
The prompt used to generate the edit instructions from the music captions is as follows:
"You are a professional music annotator that has been hired to write text prompts in a large scale music dataset to train a text-to-music generative AI model. You’re one of the best in the industry for this job and have stellar reviews from other music labelling companies that have hired you in the past. Your captions should be truthful and accurate. The input and output of the text-to-music generative AI model are as follows.\n Input: The audio file of the background music + a text prompt describing what kind of instrument should be added to the given background music and in what style.\n Output: The audio file of the music modified with the addition of the instrument in accordance with the content of the text prompt.\n The following text explains a music piece that you are going to be working with now, so imagine you are listening this music: {caption}\n Then imagine this music without {stem}. Write a single sentence text prompt input for instructing a text to music generative AI model to generate this music, when this music without {stem} was given as a background music input. So this is not removing. You want to add {stem} to the {stem}-less version, while persuming the {stem}-less version is given. You should not say like \"remove a {stem} track\". Do not add or make up any extra information about music in the edit prompt other than the given explantion. Use information only from the explantion of the music. For the text prompt instructions you generate, you have to focus on explaining the style of the instrument we are adding; You do not have to explain the style of the background music if it is unnecessary for explaining the instrument we are adding. Make sure the edit instruction doesnt have more than 7 words. When creating the edit instruction, use the action word ’{action_word}’. You don’t need to mention that the music is drum-less, although it is optional. 
If you mention the the notion that the background track has no drums, you are able to mention it without using the phrase ’{stem}-less’. Do not use the word ’energetic’ to describe the {action_word}. If possible, try to include the genre of the song in your description of the drums."
Here, {stem} is the stem we want to subtract from our data, which in this case is "drums", and {caption} is the LPMusicCaps-generated caption of the song. The {action_word} comes from the set . The action word was chosen uniformly at random and was included to create linguistic variety in the instructions used as training input for the model.
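The placeholder substitution described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: PROMPT_TEMPLATE is an abbreviated stand-in for the full prompt, ACTION_WORDS is a hypothetical placeholder list (the actual set is not reproduced here), the example caption is invented, and build_prompt is an illustrative name.

```python
import random

# Abbreviated stand-in for the full prompt; only the placeholders matter here.
PROMPT_TEMPLATE = (
    "The following text explains a music piece: {caption}\n"
    "Then imagine this music without {stem}. "
    "When creating the edit instruction, use the action word '{action_word}'."
)

# Hypothetical action words, used for illustration only.
ACTION_WORDS = ["add", "insert", "include"]


def build_prompt(caption, stem, action_words, rng):
    """Fill the template, drawing the action word uniformly at random."""
    action_word = rng.choice(action_words)  # uniform over the list
    prompt = PROMPT_TEMPLATE.format(
        caption=caption, stem=stem, action_word=action_word
    )
    return prompt, action_word


# Invented example caption; a seeded generator keeps the draw reproducible.
rng = random.Random(0)
prompt, action_word = build_prompt(
    "A mellow jazz trio piece with brushed drums.", "drums", ACTION_WORDS, rng
)
print(prompt)
```

Passing an explicit random.Random instance rather than using the module-level generator makes it straightforward to reproduce which action word was paired with which caption.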