\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{ismir,amsmath,cite,url}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{color}
\usepackage{subcaption}
\usepackage{lineno}
\usepackage{hyperref}

\linenumbers
\title{Subtractive Training for Music Stem Insertion using Latent Diffusion Models}

\threeauthors

First Author Affiliation1 \\ [email protected] Second Author Retain these fake authors in\\submission to preserve the formatting Third Author Affiliation3 \\ [email protected]

\sloppy

\section{Appendix}

\subsection{Text Prompt}

The following prompt was used to generate the edit instructions from the music captions:

    "You are a professional music
    annotator that has been hired to
    write text prompts in a large
    scale music dataset to train a
    text-to-music generative AI
    model.  You’re one of the best in
    the industry for this job and
    have stellar reviews from other
    music labelling companies that
    have hired you in the past. Your
    captions should be truthful and
    accurate. The input and output of
    the text-to-music generative AI
    model are as follows.\n
    Input: The audio file of the
    background music + a text prompt
    describing what kind of
    instrument should be added to the
    given background music and in
    what style.\n
    Output: The audio file of the
    music modified with the addition
    of the instrument in accordance
    with the content of the text
    prompt.\n
    The following text explains a
    music piece that you are going to
    be working with now, so imagine
    you are listening this music:
    {caption}\n
    Then imagine this music without
    {stem}. Write a single sentence
    text prompt input for instructing
    a text to music generative AI
    model to generate this music,
    when this music without {stem}
    was given as a background music
    input. So this is not removing.
    You want to add {stem} to the
    {stem}-less version, while
    persuming the {stem}-less version
    is given. You should not say like
    \"remove a {stem} track\". Do not
    add or make up any extra
    information about music in the
    edit prompt other than the given
    explantion. Use information only
    from the explantion of the music.
    For the text prompt instructions
    you generate, you have to focus
    on explaining the style of the
    instrument we are adding; You do
    not have to explain the style of
    the background music if it is
    unnecessary for explaining the
    instrument we are adding. Make
    sure
    the edit instruction doesnt have
    more than 7 words. When creating
    the
    edit instruction, use the action
    word ’{action_word}’. You don’t
    need to mention that the music is
    drum-less, although it is
    optional. If  you mention the the
    notion that the background track
    has no drums, you are able to
    mention it without using the
    phrase ’{stem}-less’. Do not use
    the word ’energetic’ to describe
    the {action_word}. If possible,
    try to include the genre of the
    song in your description of the
    drums."

Here, \texttt{stem} corresponds to the stem we want to subtract from our data, which in this case is ``drums''. \texttt{caption} corresponds to the LPMusicCaps-generated caption of the song. The \texttt{action\_word} comes from the set $\{\text{`Insert'}, \text{`Add'}, \text{`Generate'}, \text{`Enhance'}, \text{`Put'}, \text{`Augment'}\}$. The action word was chosen uniformly at random and was included to create linguistic variety in the instructions used as training input for the model.
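
The placeholder substitution described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name \texttt{build\_prompt}, the \texttt{rng} parameter, and the heavily abbreviated \texttt{PROMPT\_TEMPLATE} (a stand-in for the full prompt text) are assumptions made for illustration only. The sketch shows one way to draw the action word uniformly at random and fill the three placeholders.

```python
import random

# The six action words listed in this appendix.
ACTION_WORDS = ["Insert", "Add", "Generate", "Enhance", "Put", "Augment"]

# Hypothetical, abbreviated stand-in for the full prompt text; only the
# three placeholders ({caption}, {stem}, {action_word}) are kept.
PROMPT_TEMPLATE = (
    "The following text explains a music piece: {caption}\n"
    "Then imagine this music without {stem}. "
    "When creating the edit instruction, use the action word '{action_word}'."
)


def build_prompt(caption, stem="drums", rng=None):
    """Fill the template, drawing the action word uniformly at random.

    `rng` is an optional random.Random instance (an assumption added here
    so that the draw can be made reproducible).
    """
    if rng is None:
        rng = random
    # random.choice samples each element with equal probability.
    action_word = rng.choice(ACTION_WORDS)
    prompt = PROMPT_TEMPLATE.format(
        caption=caption, stem=stem, action_word=action_word
    )
    return prompt, action_word
```

Calling \texttt{build\_prompt(caption)} once per song would give each caption its own independently drawn action word, which is what produces the linguistic variety in the resulting instructions.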