Generating Symbolic Music From Natural Language Prompts Using an LLM-Enhanced Dataset

Author: Weihan Xu; Julian McAuley; Taylor Berg-Kirkpatrick; Shlomo Dubnov; Hao-Wen Dong

Publisher: Zenodo

DOI: 10.5281/zenodo.17706377

Source: https://zenodo.org/records/17706377/files/000026.pdf

GENERATING SYMBOLIC MUSIC FROM NATURAL LANGUAGE
PROMPTS USING AN LLM-ENHANCED DATASET
Weihan Xu¹  Julian McAuley²  Taylor Berg-Kirkpatrick²
Shlomo Dubnov²  Hao-Wen Dong³
¹Duke University  ²UC San Diego  ³University of Michigan
ABSTRACT
Recent years have seen many audio-domain text-to-music generation models that rely on large amounts of text-audio pairs for training. However, symbolic-domain controllable music generation has lagged behind, partly due to the lack of a large-scale symbolic music dataset with extensive metadata and captions. In this work, we present MetaScore, a new dataset consisting of 963K musical scores paired with rich metadata, including free-form user-annotated tags, collected from an online music forum. To approach text-to-music generation, we employ a pretrained large language model (LLM) to generate pseudo-natural language captions for music from its metadata tags. With the LLM-enhanced MetaScore, we train a text-conditioned music generation model that learns to generate symbolic music from the pseudo captions, allowing control of instruments, genre, composer, complexity and other free-form music descriptors. In addition, we train a tag-conditioned system that supports a predefined set of tags available in MetaScore. Our experimental results show that both the proposed text-to-music and tags-to-music models outperform a baseline text-to-music model in a listening test. While a concurrent work, Text2MIDI [1], also supports free-form text input, our models achieve comparable performance. Moreover, the text-to-music system offers a more natural interface than the tags-to-music model, as it allows users to provide free-form natural language prompts.
1 Introduction
Recent work has been investigating the potential of conditional music generation with state-of-the-art machine learning models. In particular, we have seen major progress in audio-domain controllable music generation [2,3], largely thanks to the vast amount of text-audio pairs for training. Unlike audio-domain music generation, symbolic music generation systems generate music in editable formats that can be further completed by the users, making it easier for musicians to integrate such systems into their creative workflow. However, symbolic-domain controllable music generation has been hindered by the lack of a large, public symbolic music dataset with rich metadata. In this paper, we intend to build a natural language based symbolic music generation system with our new public dataset MetaScore. MetaScore contains 963K musical scores collected from the MuseScore forum,¹ paired with extensive metadata such as genre, composer, complexity, time signature, key signature, tempo and user interaction statistics (e.g., number of views, likes and comments).² To approach text-to-music generation, we further enhance the MetaScore dataset by completing missing genre metadata using a machine learning-based genre tagging algorithm, and we leverage large language models to convert the metadata into natural language captions.

© Weihan Xu, Julian McAuley, Taylor Berg-Kirkpatrick, Shlomo Dubnov and Hao-Wen Dong. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Weihan Xu, Julian McAuley, Taylor Berg-Kirkpatrick, Shlomo Dubnov and Hao-Wen Dong, “Generating Symbolic Music from Natural Language Prompts using an LLM-Enhanced Dataset”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

Figure 1. Leveraging the LLM-enhanced MetaScore dataset, our proposed MetaScore Transformer (MST) model generates symbolic music using natural language prompts with difficulty, genre, instrument and composer controls. The symbolic music outputs allow the user to further edit and complete the composition.
Enabled by the metadata provided in MetaScore, we explore text-conditioned music generation with controllable attributes such as instrument, genre, composer, complexity, and other free-form musical descriptors expressed in natural language. With the LLM-enhanced dataset, we train a transformer-based text-to-music model using a pretrained large language model to encode the input text prompts (see Figure 1). In addition, we train a transformer-based tags-to-music model by prepending the input tags to our proposed music representation. Leveraging the LLM-generated captions for training, the proposed text-to-music model achieves competitive performance against the tag-based model while offering a natural language-based interface that allows free-form text inputs.

¹ https://musescore.com/
² We note a concurrent work, MidiCaps [4], which contains 168K MIDI files annotated with genre, emotion tags, and LLM-generated captions. However, it does not include composer names, complexity levels, user statistics, or free-form user-annotated tags (see Table 1).

                                                             |------------------- Metadata --------------------|
Dataset             Format  Genre      Samples  Multitrack   Genre  Composer  Emotion  Complexity  UIS‡  Caption
LMD [5]             MIDI    Misc.      176,581  ✓            ×†     ×†        ×        ×           ×     ×
MetaMIDI [6]        MIDI    Misc.      437K     ✓            ✓      ✓         ×        ×           ×     ×
WikiMusicText [7]   ABC     Misc.      1,010∗   ×            ✓      ✓         ×        ×           ×     ✓
EMOPIA [8]          MIDI    Pop        1,087    ×            ✓      ×         ✓        ×           ×     ×
MAESTRO [9]         MIDI    Classical  1,276    ×            ✓      ✓         ×        ×           ×     ×
MidiCaps [4]        MIDI    Misc.      168K     ✓            ✓      ×         ✓        ×           ×     ✓§
MetaScore           XML     Misc.      1.27M    ✓            ✓      ✓         ×        ✓           ✓     ✓§
∗ Only a small subset of WikiMusicText is publicly available at https://huggingface.co/datasets/sander-wood/wikimusictext
† Available through error-prone mapping to the Million Song Dataset [5,10]
‡ User interaction statistics
§ LLM-generated (see Section 3.4)
Table 1. Comparison of commonly used publicly available symbolic music datasets.

                                                                                           |------------- Controls --------------|
Model             Model size  Public training data  Open source  Supports drums  Supports free text prompts  Instrument  Genre  Composer  Complexity
FIGARO [11]       88.30M      ✓                     ✓            ×               ×                           ✓           ×      ×         ×
MuseCoco [12]     203M        ×                     ✓            ✓               ×                           ✓           ✓      ✓         ×
BART-based [13]   139M        ✓                     ✓            ×               ✓                           ✓           ✓      ×         ×
MST-Tags          87.36M      ✓                     ✓            ✓               ×                           ✓           ✓      ✓         ✓
MST-Text          87.44M      ✓                     ✓            ✓               ✓                           ✓∗          ✓∗     ✓∗        ✓∗
∗ These can be achieved by free-form text prompts.
Table 2. Comparison of controllable music generation systems.
To evaluate our proposed models, we compare them with an open-source text-to-symbolic music system [13] and a concurrent work [1] in subjective listening studies. In these studies, we demonstrate that our proposed models outperform the baseline model in terms of coherence, arrangement, adherence, and overall quality, while achieving performance comparable to that of the concurrent work [1]. Our contributions can be summarized as follows:
• We present a new publicly available dataset with musical scores paired with rich metadata and LLM-generated natural language captions.
• We train two new models for tag- and text-based controllable symbolic music generation that support instrument, genre, composer and complexity controls.
The MetaScore dataset, codebase, and audio samples can be found on our website.³
³ https://wx83.github.io/MetaScore_Official/

2 Related Work
2.1 Symbolic Music Datasets
We compare commonly used symbolic music datasets in Table 1. WikiMusicText [7] pairs music with genre, composer, and captions. However, its publicly released version is small, and the musical scores are in ABC notation, which does not support multitrack music natively. Although MetaMIDI [6] comprises around 437K multitrack music pieces in MIDI format, it only includes genre and composer information, lacking natural language captions, which are important for training text-to-music generation models. Although EMOPIA [8] provides emotion information, it is small and contains only pop music. Although MidiCaps [4] contains captions, it does not contain user-annotated free-form tags that are crucial for free-form text-to-music generation. In this work, we present a new, large multitrack and multi-genre symbolic music dataset with rich metadata, including genre, composer, complexity, key signature, time signature, tempo, user interaction statistics, free-form user-annotated tags and pseudo captions.
2.2 Controllable Symbolic Music Generation
Controllable symbolic music generation includes attribute-based music generation, free-form text-to-music generation and music infilling. We compare our model with existing controllable music generation systems in Table 2. EMOPIA [8] is designed to generate music that aligns with specific emotional states, defined within the valence-arousal plane. This psychological model categorizes emotions by valence, indicating their positivity or negativity, and arousal, which measures their intensity from calm to excited. FIGARO [11] can generate samples based on a fine-grained description of the characteristics of the desired music. MuseCoco [12] first classifies a fixed set of predefined musical attributes using multiple classification heads and then employs an attribute-to-music model to generate symbolic music. Recent work on music infilling [14] can condition generation on attributes including instrument type, musical style, note density, polyphony level, and note duration. In this work, we explore free-form text-conditioned music generation with LLM-generated natural language music captions. Additionally, we present a tag-conditioned music generation model that can generate music based on four conditions: genre, instrument, complexity and composer.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 2. Statistics of the metadata available in MetaScore-Raw. Note that not all songs include complete metadata. We derive exact keys by merging user annotations with the MusicXML key element (fifths + mode). C major and A minor are underrepresented because we include only pieces with explicitly specified keys.
3 Dataset
3.1 Dataset Collection
We scraped our dataset from MuseScore,¹ an open-source and free music notation software. To evaluate the quality of our dataset, we leverage the rating entry in MetaScore-Raw as an indicator. This rating, which ranges from 1 to 5 (with 5 being the highest), serves as a structured measure of perceived music quality. To demonstrate that even lower-rated entries can still be suitable for use, we randomly selected 10 samples from three categories: low/missed ratings (below 3 or not rated), mid-range ratings (3–4), and high ratings (above 4). We show qualitative examples on our demo page.³ These examples illustrate the overall usability and diversity of the dataset across different rating levels.
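The rating-based grouping described above can be sketched as follows. This is a minimal illustration, assuming each entry is a dict with a hypothetical `rating` field (`None` when unrated) and that up to 10 entries are drawn per category; the function names and the fixed seed are illustrative and not taken from the paper.

```python
import random

def rating_bucket(rating):
    """Map a 1-5 rating (or None when unrated) to one of three categories."""
    if rating is None or rating < 3:
        return "low_or_missing"   # below 3 or not rated
    if rating <= 4:
        return "mid"              # 3 to 4 inclusive
    return "high"                 # above 4

def sample_per_bucket(entries, k=10, seed=0):
    """Draw up to k random entries from each rating category (assumed scheme)."""
    rng = random.Random(seed)
    buckets = {"low_or_missing": [], "mid": [], "high": []}
    for entry in entries:
        buckets[rating_bucket(entry.get("rating"))].append(entry)
    return {name: rng.sample(items, min(k, len(items)))
            for name, items in buckets.items()}
```

How a rating of exactly 3 or 4 is assigned is an assumption here; the text only gives the ranges "below 3", "3–4" and "above 4".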
3.2 Collecting and Preprocessing the Dataset
We collect 963K songs paired with musical scores and metadata from the MuseScore forum. We will refer to this original dataset as MetaScore-Raw. MetaScore-Raw contains extensive metadata such as genre, composer, complexity, key signature, time signature, tempo, user interaction statistics (e.g., number of views, likes and comments) and free-form user-annotated tags. We provide statistics of the metadata in Figure 2, and we note that not all songs come with complete metadata.

Type                        Dataset           Adherence↑
Ground truth genre tags     MetaScore-Genre   3.11 ± 0.49
Auto-generated genre tags   MetaScore-Plus∗   3.05 ± 0.54
LLM-generated captions      MetaScore-Plus    3.23 ± 0.49
∗ We only include songs with auto-generated genre tags here.
Table 3. Subjective evaluation results on tags/text-music adherence of the dataset, measured on a Likert scale of 1 to 5. We report the mean values and 95% confidence intervals.

From the raw MSCZ files, we extract key signature, time signature, tempo and musical instrumentation. We only retain those instruments that are compatible with the General MIDI standard. Regarding composers, we first filter the composer tags to ensure they are formatted as human names and convert them to lowercase. We also standardize the names of well-known musicians to their full names; for instance, “mozart” is changed to “wolfgang amadeus mozart.”
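The composer clean-up step can be illustrated with a short sketch. The name pattern and the alias table below are assumptions made for illustration; only the “mozart” example comes from the text, and the paper does not specify how "formatted as human names" is checked.

```python
import re

# Hypothetical alias table; only the "mozart" entry is taken from the text.
FULL_NAMES = {"mozart": "wolfgang amadeus mozart"}

# Assumed heuristic: letters separated by spaces, periods, apostrophes or hyphens.
NAME_PATTERN = re.compile(r"^[^\W\d_]+(?:[ .'\-]+[^\W\d_]+)*\.?$")

def normalize_composer(tag):
    """Lowercase a composer tag, drop it if it does not look like a name,
    and expand known short names to full names."""
    name = " ".join(tag.strip().lower().split())
    if not name or not NAME_PATTERN.match(name):
        return None
    return FULL_NAMES.get(name, name)
```

For example, `normalize_composer("  Mozart ")` yields the full name, while a tag such as `"track 01"` is rejected because it contains digits.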
3.3 Inferring Missing Genre Tags in MetaScore-Raw
While MetaScore-Raw provides rich metadata information, we notice that not all songs come with complete metadata. For example, only 181K (18.8%) out of 963K songs in MetaScore-Raw contain genre metadata. As genre is one of the most intuitive ways for a user to control the style of a music generation system, we want to complete the genre information for songs without a genre label in MetaScore-Raw. Therefore, we train a genre tagger that is based on the Multitrack Music Transformer (MMT) [15], where we remove the causal mask used for autoregressive modeling and append a multi-label classification layer. We select the threshold of the multi-label classification layer for each class based on the F1 score on the validation set.
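Per-class threshold selection of this kind can be sketched as below. This is an illustrative grid search under assumptions: the candidate grid (0.05 to 0.95 in steps of 0.05) and the function names are hypothetical, and the paper does not state how candidate thresholds were enumerated.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from parallel lists of truthy/falsy labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_thresholds(val_probs, val_labels, candidates=None):
    """For each class, pick the candidate threshold with the best validation F1."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # assumed grid
    n_classes = len(val_labels[0])
    thresholds = []
    for c in range(n_classes):
        truth = [row[c] for row in val_labels]
        scores = [row[c] for row in val_probs]
        best = max(candidates,
                   key=lambda th: f1_score(truth, [s >= th for s in scores]))
        thresholds.append(best)
    return thresholds
```

Choosing one threshold per class, rather than a single global cutoff, lets rare genres use a lower decision boundary than frequent ones.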
To infer genre labels for music pieces lacking such tags, we adopt a data-driven approach by training a genre tagger on MetaScore-Genre. The genre tagger is based on the Multitrack Music Transformer (MMT) [15], where we remove the causal mask used for autoregressive modeling and append a multi-label classification layer. MMT represents a music piece as a sequence of events $x = (x_1, \ldots, x_n)$, where each event $x_i$ comprises six attributes: type, beat, position, pitch, duration, and instrument. To create the input sequence, we extract tokens from the start, middle, and end sections of each music piece, selecting 341 tokens from each to form a concatenated sequence of 1,023 tokens. Instrument information, important for genre identification, is also incorporated as a prefix condition to these sequences.
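The start/middle/end sampling can be sketched as follows. This is a minimal sketch under assumptions: the middle segment is taken to be centered, pieces shorter than 1,023 tokens are passed through unchanged, and the `instrument_` prefix token format is hypothetical; the paper does not specify these details.

```python
def sample_segments(tokens, segment_len=341):
    """Concatenate segment_len tokens from the start, middle and end of a piece."""
    n = len(tokens)
    if n <= 3 * segment_len:
        return list(tokens)  # assumed handling of short pieces
    mid_start = (n - segment_len) // 2  # assumed: centered middle segment
    return (list(tokens[:segment_len])
            + list(tokens[mid_start:mid_start + segment_len])
            + list(tokens[-segment_len:]))

def build_tagger_input(tokens, instruments, segment_len=341):
    """Prepend instrument tokens as a prefix condition to the sampled segments."""
    prefix = [f"instrument_{name}" for name in sorted(set(instruments))]
    return prefix + sample_segments(tokens, segment_len)
```

With the default segment length, any piece longer than 1,023 tokens yields exactly 3 × 341 = 1,023 event tokens plus the instrument prefix.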
During training, genres with scarce presence or high ambiguity, such as “darkwave” and “experimental,” are excluded to avoid noise. We extract notes from the MSCZ files using MusPy [15]. When we extract notes, we exclude broken files containing negative pitches. Moreover, to enhance generalizability, we include an additional 22,000 samples from the LMD dataset [5], all tagged with genres. Due to the rarity or ambiguity of certain genres, we merge specific genre types from these two datasets into 8 classes.⁴ For the training process, we allocate 90% of the samples to training, with 5% each reserved for validation and testing. We select the threshold of the multi-label classification layer for each class based on the performance on the validation set.
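The merging of fine-grained genres into eight classes can be written as a lookup table. The grouping below follows the eight-class definition given in the paper's footnote; the multi-hot encoding and the choice to silently ignore unlisted genres (such as the excluded “darkwave”) are assumptions for illustration.

```python
# Eight merged genre classes, as listed in the paper's footnote.
GENRE_GROUPS = {
    "Classical & traditional": ["classical", "religious", "new age"],
    "Soundtrack & stage": ["soundtrack", "comedy"],
    "Rock & Metal": ["pop", "rock", "metal"],
    "Folk & country": ["folk", "country"],
    "Urban": ["hip hop", "r&b", "funk&soul"],
    "Electronic & dance": ["electronic", "disco"],
    "World": ["world music", "reggae&ska"],
    "Jazz & blues": ["jazz", "blues"],
}
CLASS_NAMES = list(GENRE_GROUPS)
GENRE_TO_CLASS = {genre: idx
                  for idx, members in enumerate(GENRE_GROUPS.values())
                  for genre in members}

def multi_hot(raw_genres):
    """Encode raw genre tags as an 8-dimensional multi-label target."""
    target = [0] * len(CLASS_NAMES)
    for genre in raw_genres:
        idx = GENRE_TO_CLASS.get(genre.strip().lower())
        if idx is not None:  # unlisted genres are ignored (assumption)
            target[idx] = 1
    return target
```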
To evaluate the performance of genre tagging, we first compute the precision, recall and F1 score on the test set, where we achieve a micro-averaged precision of 61.94, recall of 63.03, and F1 score of 62.48. In addition, we conduct a subjective listening test to compare the quality of the auto-generated genre tags with the user-annotated tags in MetaScore-Raw. The 22 participants are instructed to answer the following question on a Likert scale of 1 to 5: “How well do you think this piece of music aligns with the following genre?”. From Table 3, we can see that the auto-generated genre tags in MetaScore-Plus achieve a lower tags-music adherence compared to the ground truth tags in MetaScore-Genre (defined in Section 4), but the difference did not reach statistical significance in our setup.
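For reference, micro-averaging pools true positives, false positives and false negatives over all classes before computing the scores. The sketch below shows this computation on multi-label lists and also checks that the reported F1 score is the harmonic mean of the reported precision and recall; the function name is illustrative.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 for multi-label 0/1 matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Consistency check on the reported numbers (in percent).
reported_p, reported_r = 61.94, 63.03
harmonic_mean = 2 * reported_p * reported_r / (reported_p + reported_r)
```

Rounded to two decimals, `harmonic_mean` gives 62.48, which matches the F1 score stated above.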
3.4 Generating Pseudo Captions using LLMs
To enable text-based downstream tasks (e.g., music captioning and text-to-music generation), we leverage large language models to convert the metadata into natural language captions. We follow LP-MusicCaps [16] and CLAP [17] and adopt an in-context learning-based approach [18] using a pretrained large language model. We form the input prompt string by combining genre, composer, complexity, time signature, key signature, tempo, and free-form user-specified tags.⁵

⁴ We define the eight genres as follows: “Classical & traditional”: classical, religious, new age; “Soundtrack & stage”: soundtrack, comedy; “Rock & Metal”: pop, rock, metal; “Folk & country”: folk, country; “Urban”: hip hop, r&b, funk&soul; “Electronic & dance”: electronic, disco; “World”: world music, reggae&ska; “Jazz & blues”: jazz, blues.
⁵ We have an old version of LLM-generated captions from tags. MST-Text was trained on an old version of LLM-generated captions in which

As shown on the demo page,³ we provide five examples of input-output pairs to facilitate in-context learning with Bloom [19], where the examples are used to provide guidance for the LLM to capture the one-to-many mapping between the input tags and natural language captions. We generate the pseudo captions using the Hugging Face API [20]. We exclude non-English and corrupted captions generated by Bloom [19] and truncate the output sequence to a maximum of 32 tokens.
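The prompt construction and truncation steps can be sketched as below. This is a minimal illustration under assumptions: the `Tags:`/`Caption:` template, the field names, and whitespace tokenization are all hypothetical stand-ins, since the paper does not give the exact prompt format or tokenizer, and no LLM call is made here.

```python
def format_tags(metadata):
    """Join the available metadata fields into one input string."""
    fields = ("genre", "composer", "complexity", "time_signature",
              "key_signature", "tempo", "tags")
    parts = []
    for field in fields:
        value = metadata.get(field)
        if value:
            if isinstance(value, (list, tuple)):
                value = ", ".join(str(v) for v in value)
            parts.append(f"{field}: {value}")
    return "; ".join(parts)

def build_prompt(examples, metadata):
    """Concatenate few-shot (tags, caption) pairs, then the new query."""
    blocks = [f"Tags: {format_tags(tags)}\nCaption: {caption}"
              for tags, caption in examples]
    blocks.append(f"Tags: {format_tags(metadata)}\nCaption:")
    return "\n\n".join(blocks)

def truncate_caption(caption, max_tokens=32):
    """Keep at most max_tokens whitespace-separated tokens
    (a stand-in for the model tokenizer)."""
    return " ".join(caption.split()[:max_tokens])
```

In the setup described above, `examples` would hold the five in-context input-output pairs, and the completed prompt would be sent to the LLM to produce the caption.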
4 Versions of MetaScore
We will release the following three versions of MetaScore:
• MetaScore-Raw (963K): The raw MuseScore files and metadata scraped from the MuseScore forum, as well as the corresponding MusicXML files for future research.
• MetaScore-Genre (181K): A subset of MetaScore-Raw containing files with user-annotated genres. Additionally, we discard any songs composed by a composer that has fewer than 100 compositions in MetaScore-Raw. We also provide LLM-generated captions based on information extracted from the metadata in MetaScore-Genre.
• MetaScore-Plus (963K): MetaScore-Raw where missing genre tags are completed by the trained genre tagger described in Section 3.3. We also provide LLM-generated captions based on information extracted from the metadata in MetaScore-Plus.
Due to copyright concerns, we will publicly release music scores and metadata from MetaScore-Plus that are in the public domain (228K) or licensed under a Creative Commons license (46K). The rest of the dataset will be provided upon request for research purposes.
⁵ (continued) captions are generated from genre, instrument, complexity, copyright, and free-form descriptors with in-context learning.

5 Method
We represent a music piece as a one-dimensional array of integers using an event-based representation adapted from REMI+ [11] and MMT [15]. REMI+ [11] represents notes with six consecutive tokens encoding note position, pitch, velocity, duration, instrument and time-signature information. However, it cannot provide control over tags such as genre, composer and complexity. MMT [15] represents a sequence of six-dimensional events, with each event $x_i$ encoded as a tuple of variables $(x_i^{\text{type}}, x_i^{\text{beat}}, x_i^{\text{position}}, x_i^{\text{pitch}}, x_i^{\text{duration}}, x_i^{\text{instrument}})$. However, MMT cannot model the interdependencies within these fields for a specific note as it predicts the six fields in parallel. In this work, we adapt the REMI+ representation [11] to provide controls over genre, instrument, composer and complexity, while preserving the expressiveness offered by REMI+ [11]. Similar to MMT [15], we decompose note-on events into beat and position to reduce the size of the vocabulary and to help the model learn the rhythmic structure of music. In addition, we exclude the “tempo” and “chord” events as such information is sometimes unavailable in our dataset. Following REMI+ [11], we use beat, position, instrument, pitch and duration events to represent musical notes for non-drum tracks. We represent drum notes as beat, position, instrument, drum_pitch.
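The note-level event layout can be illustrated with a small encoder. This is a sketch under assumptions: the dict-based note format, the string token names, and the resolution of 12 time steps per beat are hypothetical choices made for illustration and are not specified in the paper.

```python
def encode_note(note, resolution=12):
    """Turn one note into beat/position/instrument/pitch(/duration) events.

    `note` is a dict with an absolute onset `time` in time steps;
    `resolution` (time steps per beat) is an assumed value.
    """
    beat, position = divmod(note["time"], resolution)
    events = [f"beat_{beat}",
              f"position_{position}",
              f"instrument_{note['instrument']}"]
    if note.get("is_drum"):
        # Drum notes carry no duration event in this representation.
        events.append(f"drum_pitch_{note['pitch']}")
    else:
        events += [f"pitch_{note['pitch']}", f"duration_{note['duration']}"]
    return events

def encode_piece(notes, resolution=12):
    """Encode notes in onset order into one flat event sequence."""
    ordered = sorted(notes, key=lambda n: n["time"])
    return [event for note in ordered
            for event in encode_note(note, resolution)]
```

Splitting the onset into a beat index and a within-beat position keeps the vocabulary small, since positions only range over one beat rather than over a whole bar or piece.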
To enable free-form text controls, for each music piece with text, we use a pretrained sentence transformer [21] (specifically, the “all-MiniLM-L6-v2” version [22]) to extract the text embedding. Then we add a linear layer to project the text embedding to the input token embedding space, where the projected text embedding is added to the previously generated token embedding along with the positional encoding. Then we feed the encoded sequence into a decoder-only linear transformer. We will refer to this model as MetaScore Transformer-Text (MST-Text).
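The conditioning arithmetic described above can be sketched in plain Python. This is a toy illustration only: the dimensions are tiny, the weights are hand-written rather than learned, and the function names are hypothetical; an actual model would use a deep learning framework and the embedding produced by the sentence transformer.

```python
def linear(x, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def condition_inputs(token_embs, pos_embs, text_emb, weight, bias):
    """Add the projected text embedding to every token + positional embedding."""
    projected = linear(text_emb, weight, bias)
    return [[tok + pos + cond for tok, pos, cond in zip(tok_vec, pos_vec, projected)]
            for tok_vec, pos_vec in zip(token_embs, pos_embs)]
```

Because the same projected vector is added at every position, the text condition is visible to the decoder at each generation step without lengthening the input sequence.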
Additionally, we train a tag-conditioned music generation model. To enable tag-based controls, we prepend the input tags to our proposed music representation. We introduce four tag events, including tag_genre, tag_composer, tag_complexity and tag_instrument, to specify conditions. Further, we use the standardized composer names to limit the vocabulary size, and we keep only 47 composers that have more than 100 training samples. We use a tag_{missing_tag}_None event for music pieces that do not contain all four tags. In addition to these data tokens, we have six special structural events: The start-of-song event signals the onset of a song, leading into a sequence marked by start-of-genre, start-of-composer, start-of-complexity, start-of-instrument events, each followed by their respective tag lists, with start-of-notes concluding the tag lists and end-of-song indicating the completion of the song. To facilitate controllability in the model, we prepend these control tokens at the start of the data representation. The control tokens include genre, composer, complexity and instruments. Then we feed the sequence with these prepended tags into a decoder-only linear transformer, which capitalizes on the autoregressive nature of the transformer model, enabling the integration of these tokens during the inference process. We will refer to this model as MetaScore Transformer-Tags (MST-Tags).
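The tag-prefixed sequence layout can be sketched as follows. The exact token spellings (for example `tag_genre_jazz`) and the dict-based tag input are assumptions for illustration; the ordering of the structural events follows the description above.

```python
TAG_FIELDS = ("genre", "composer", "complexity", "instrument")

def build_tag_sequence(tags, note_events):
    """Prepend structural and tag events to the note events of one piece."""
    seq = ["start-of-song"]
    for field in TAG_FIELDS:
        seq.append(f"start-of-{field}")
        values = tags.get(field) or []
        if values:
            seq += [f"tag_{field}_{value}" for value in values]
        else:
            # Placeholder event for a missing tag.
            seq.append(f"tag_{field}_None")
    seq.append("start-of-notes")
    seq += list(note_events)
    seq.append("end-of-song")
    return seq
```

At inference time, a user-specified prefix up to `start-of-notes` would be fed to the decoder, which then continues the sequence autoregressively with note events.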
6 Experiments and Results
6.1 Baselines
We compare our model with two text-to-symbolic music generation approaches. The first is a BART-based model [13] trained on a paired text and symbolic dataset using ABC notation, with evaluation presented in Section 6.2 and Section 6.3. The second is a concurrent approach, Text2MIDI [1], which directly generates MIDI files from natural language prompts, with evaluation presented in Section 6.4. MuseCoco [12] first classifies a fixed set of predefined musical attributes using multiple classification heads and then employs an attribute-to-music model to generate symbolic music. This approach does not support free-form natural language inputs for symbolic music generation. The BART-based model [13] leverages pretrained language models for generating symbolic music in ABC notation. To ensure a fair comparison, we generate music using the BART-based model [13] via the Hugging Face API [22] and then convert the ABC outputs to multitrack MIDI using the Melobytes tool [23].

Model             Pitch class entropy  Scale consistency  Groove consistency
MST-Tags-Small    2.88 ± 0.08          0.89 ± 0.02        0.92 ± 0.01
MST-Tags          2.93 ± 0.07          0.89 ± 0.02        0.90 ± 0.01
BART-based [13]   2.54 ± 0.06          0.99 ± 0.00        1.00 ± 0.00
MST-Text          2.70 ± 0.06          0.95 ± 0.01        0.92 ± 0.01
Ground truth      2.67 ± 0.06          0.95 ± 0.01        0.92 ± 0.01
Table 4. Objective evaluation results on music quality with conditions from the MST test set. We report the mean values and 95% confidence intervals.
6.2 Objective Evaluations
Following [15,24,25], we assess the quality of generated music using pitch class entropy, scale consistency, and groove consistency, where values close to the ground truth indicate better performance. To ensure a fair comparison, we randomly sampled 100 conditions from the MST test set and generated corresponding music for evaluation. As reported in Table 4, we find that MST-Text most closely matches the ground truth in terms of pitch class entropy and scale consistency, while MST-Tags-Small and MST-Text perform similarly on groove consistency. Additionally, our proposed MST-Text outperforms the BART-based [13] model across all three metrics.
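Two of these metrics can be computed directly from a list of MIDI pitch numbers. The sketch below uses the common definitions (Shannon entropy in bits of the 12-bin pitch class histogram, and the best in-scale note rate over all major and natural minor scales); it is an illustrative reimplementation, not the evaluation code used in the paper, and groove consistency is omitted because it also needs per-measure onset information.

```python
import math
from collections import Counter

MAJOR = (0, 2, 4, 5, 7, 9, 11)
MINOR = (0, 2, 3, 5, 7, 8, 10)  # natural minor

def pitch_class_entropy(pitches):
    """Shannon entropy (base 2) of the pitch class distribution."""
    if not pitches:
        raise ValueError("pitches must not be empty")
    counts = Counter(p % 12 for p in pitches)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def scale_consistency(pitches):
    """Largest fraction of notes that fit any single major or minor scale."""
    if not pitches:
        raise ValueError("pitches must not be empty")
    best = 0.0
    for root in range(12):
        for scale in (MAJOR, MINOR):
            allowed = {(root + step) % 12 for step in scale}
            rate = sum(1 for p in pitches if p % 12 in allowed) / len(pitches)
            best = max(best, rate)
    return best
```

Under these definitions the entropy is bounded by log2(12) ≈ 3.58 bits, and a piece that stays within one diatonic scale has a scale consistency of 1.0.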
6.3 Subjective Evaluation
We conduct a subjective test where 22 participants are instructed to evaluate five songs under each scenario. Out of the 22 participants, 19 people have experience in playing instruments, with two being professional musicians. We ask the participants to evaluate the audio samples in terms of coherence, arrangement, adherence and overall quality on a Likert scale of 1 to 5.
We report the subjective evaluation results in Table 5. When contrasting MST-Tags-Small with MST-Tags, we observe that MST-Tags achieves better performance in coherence and arrangement, but we see a decrease in adherence, possibly due to the incorporation of some auto-generated tags. This comparison illustrates the trade-off between employing a smaller, high-quality dataset (MetaScore-Genre) versus a larger yet noisy dataset (MetaScore-Plus). However, comparing the overall quality score of MST-Tags-Small and MST-Tags, we see that training with a larger dataset leads to an increase in the overall quality of music generation.
For text-conditioned music generation, MST-Text outperforms the BART-based [13] approach in terms of coherence, arrangement, adherence, and overall quality. We observe that the text-conditioned system MST-Text has a lower adherence than MST-Tags-Small and MST-Tags. This implies that text-to-music generation is a more challenging task than tag-to-music generation, as a text-to-music generation system needs to learn to interpret the free-form text inputs.

Model             Model size  Training samples  Coherence↑   Arrangement↑  Adherence↑   Overall quality↑
MST-Tags-Small    87.36M      150K              3.87 ± 0.36  3.98 ± 0.38   3.86 ± 0.38  3.57 ± 0.37
MST-Tags          87.36M      901K              4.01 ± 0.37  4.06 ± 0.39   3.60 ± 0.49  3.66 ± 0.45
BART-based [13]   139M        283K              3.86 ± 0.30  3.63 ± 0.39   2.81 ± 0.50  3.29 ± 0.42
MST-Text          87.44M      560K              3.93 ± 0.28  3.88 ± 0.33   3.35 ± 0.44  3.69 ± 0.33
Table 5. Subjective evaluation results on a Likert scale of 1 to 5.

Model           Model size  Prompts  CLAP score↑  Coherence(%)↑  Arrangement(%)↑  Adherence(%)↑  Overall quality(%)↑
Text2MIDI [1]   159M        M        0.23         0              40               40             40
Text2MIDI [1]   159M        T        0.20         40             50               80             60
Text2MIDI [1]   159M        M+T      0.22         20             45               60             50
MST-Text        87.44M      M        0.36         100            60               60             60
MST-Text        87.44M      T        0.13         60             50               20             40
MST-Text        87.44M      M+T      0.24         80             55               40             50
Table 6. Comparison of MST-Text and Text2MIDI [1] on three prompt sets: 1) M: five prompts from our test set, 2) T: five prompts from the Text2MIDI [1] test set, and 3) M+T: the union of these two prompt sets. We report the winning rates in a subjective A/B listening test described in Section 6.4.
Overall, the tag-conditioned systems, MST-Tags and MST-Tags-Small, demonstrate strong performance across multiple dimensions, including coherence, arrangement, adherence to prompts, and overall music quality. These results highlight the high quality of our constructed dataset. Notably, MST-Text achieves the highest score in overall quality, indicating that our text-conditioned model performs on par with the tag-conditioned variants. This underscores the effectiveness of our approach, which leverages a large language model to generate natural language captions, enabling end-to-end training of high-quality text-to-music generation.
6.4 Comparison to Text2MIDI
In this section, we compare our proposed MST-Text with a concurrent work, Text2MIDI [1], that also supports text-to-symbolic music generation. For a fair comparison, we create three test sets: 1) five text prompts randomly selected from our test set, 2) five text prompts randomly selected from the Text2MIDI test set, and 3) the union of the previous two test sets (i.e., five prompts from each test set).
Objective Evaluation. We compare the music quality of our MST-Text and Text2MIDI [1] using pitch class entropy, scale consistency, and groove consistency, and report the results in Table 7. In addition, to assess the alignment between text and symbolic music, we report the average semantic similarity computed with CLAP [17]. We find that MST-Text better matches the ground truth in terms of pitch class entropy and scale consistency, while Text2MIDI better matches the ground truth in terms of groove consistency. Additionally, MST-Text shows better alignment performance when prompts are taken from the MST-Text test set and the joint test set.

Model           Pitch class entropy  Scale consistency  Groove consistency
Text2MIDI [1]   2.44 ± 0.19          0.89 ± 0.03        0.94 ± 0.01
MST-Text        2.65 ± 0.08          0.96 ± 0.02        0.92 ± 0.01
Ground truth    2.71 ± 0.07          0.96 ± 0.02        0.94 ± 0.01
Table 7. Objective evaluation results on music quality on the joint test set. We report the mean values and 95% confidence intervals.

Subjective Evaluation. We compare our model MST-Text with Text2MIDI [1] via an A/B test on coherence, arrangement, adherence, and overall quality. Eleven participants (9 with musical experience, including one professional) evaluated 5 prompts from our test set and 5 from the Text2MIDI test set, with results summarized in Table 6. Our experiments show that MST-Text produces more coherent results and achieves equal or superior arrangement performance compared to Text2MIDI [1]. However, when it comes to adherence, Text2MIDI outperforms MST-Text on the Text2MIDI test set and the joint test set, while MST-Text performs better on the MST-Text test set. Overall, both models deliver comparable quality.
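As a quick sanity check on the A/B results, the winning rate on the joint set (M+T) should equal the mean of the rates on the two equally sized prompt sets. The sketch below verifies this for the values copied from Table 6; the helper names are illustrative, and how a per-prompt winner is decided is not specified in the paper and is not modeled here.

```python
# Winning rates (%) copied from Table 6: metric -> (M, T, M+T).
TEXT2MIDI = {"coherence": (0, 40, 20), "arrangement": (40, 50, 45),
             "adherence": (40, 80, 60), "overall": (40, 60, 50)}
MST_TEXT = {"coherence": (100, 60, 80), "arrangement": (60, 50, 55),
            "adherence": (60, 20, 40), "overall": (60, 40, 50)}

def pooled_rate(rate_m, rate_t):
    """Winning rate on the union of two equally sized prompt sets."""
    return (rate_m + rate_t) / 2

def is_consistent(table):
    """Check that every M+T entry is the mean of its M and T entries."""
    return all(pooled_rate(m, t) == joint for m, t, joint in table.values())
```

Both tables pass this check, which supports reading the M+T column as a pooled result over the ten prompts rather than a separate experiment.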
7 Conclusion
In this paper, we have introduced MetaScore, a new publicly available dataset containing rich metadata and LLM-generated captions. We also present a new music generation model that can generate symbolic music from free-form text, allowing controls over instruments, genre, composer and complexity, among other music descriptors. In addition, the LLM-generated pseudo captions contain information provided in free-form user-annotated tags, which can pose a challenge to systems that adopt a predefined set of tags [12]. Our objective and subjective evaluation results show the effectiveness of the proposed tags-to-music and text-to-music models. Our proposed text-to-music model outperforms a baseline text-to-music model [13] and achieves comparable performance with a concurrent work [1]. In addition, the proposed text-to-symbolic music generation model trained with LLM-generated pseudo captions achieves competitive performance against the proposed tags-to-music model trained using only the ground truth tags.
8 Ethics Statement
We note that MetaScore contains a large amount of copyrighted content. This raises concerns about potential misuse of this dataset that can lead to severe copyright infringements. To minimize such risks, we will release only the scores in the public domain; those not in the public domain will be shared upon request and only for research purposes. Further, music generation systems built upon the MetaScore dataset may infringe the copyright held by the content creators, and thus we must be careful about adopting these systems in commercial applications. However, we would like to point out that when used properly and with caution, a text-to-music generation system can also make a positive impact on society by enabling new opportunities and interfaces for music creation, as demonstrated in [26]. Given the editable nature of symbolic music, we hope our proposed text-to-symbolic music models will open up new pathways towards human-AI music co-creation.
9 References
[1] K. Bhandari, A. Roy, K. Wang, G. Puri, S. Colton, and D. Herremans, “Text2midi: Generating symbolic music from captions,” 2024. [Online]. Available: https://arxiv.org/abs/2412.16526
[2] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan, “Music controlnet: Multiple time-varying controls for music generation,” in arXiv preprint arXiv:2311.07069, 2023.
[3] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan, “DITTO: Diffusion inference-time T-optimization for music generation,” in arXiv preprint arXiv:2401.12179, 2024.
[4] J. Melechovsky, A. Roy, and D. Herremans, “Midicaps: A large-scale midi dataset with text captions,” 2024. [Online]. Available: https://arxiv.org/abs/2406.02255
[5] C. Raffel, “Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching,” Ph.D. dissertation, Columbia University, 2016.
[6] J. Ens and P. Pasquier, “Building the MetaMIDI dataset: Linking symbolic and audio musical data,” in Proc. ISMIR, 2021.
[7] S. Wu, D. Yu, X. Tan, and M. Sun, “Clamp: Contrastive language-music pre-training for cross-modal symbolic music information retrieval,” in Proc. ISMIR, 2023.
[8] H.-T. Hung, J. Ching, S. Doh, N. Kim, J. Nam, and Y.-H. Yang, “EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation,” in Proc. ISMIR, 2021.
[9] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” in Proc. ICLR, 2019.
[10] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere, “The million song dataset,” in Proc. ISMIR, 2011.
[11] D. von Rütte, L. Biggio, Y. Kilcher, and T. Hofmann, “FIGARO: Controllable music generation using learned and expert features,” in Proc. ICLR, 2023.
[12] P. Lu, X. Xu, C. Kang, B. Yu, C. Xing, X. Tan, and J. Bian, “MuseCoco: Generating symbolic music from text,” in arXiv preprint arXiv:2306.00110, 2023.
[13] S. Wu and M. Sun, “Exploring the efficacy of pre-trained checkpoints in text-to-music generation task,” in arXiv preprint arXiv:2211.11216, 2022.
[14] P. Pasquier, J. Ens, N. Fradet, P. Triana, D. Rizzotti, J.-B. Rolland, and M. Safi, “Midi-gpt: A controllable generative model for computer-assisted multitrack music composition,” 2025. [Online]. Available: https://arxiv.org/abs/2501.17011
[15] H.-W. Dong, K. Chen, S. Dubnov, J. McAuley, and T. Berg-Kirkpatrick, “Multitrack music transformer,” in Proc. ICASSP, 2023.
[16] S. Doh, K. Choi, J. Lee, and J. Nam, “LP-MusicCaps: LLM-based pseudo music captioning,” in Proc. ISMIR, 2023.
[17] Y. Wu, K. Chen, T. Zhang, Y. Hui, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in Proc. ICASSP, 2023.
[18] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, and Z. Sui, “A survey on in-context learning,” in arXiv preprint arXiv:2301.00234, 2023.
[19] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, et al., “Bloom: A 176b-parameter open-access multilingual language model,” in arXiv preprint arXiv:2211.05100, 2022.
[20] BigScience, T. L. Scao, A. Fan, C. Wolf, T. Rush, S. Biderman, G. B. Black, S. Curtis, D. Elbayad, T. Gao et al., “BLOOM: A 176b-parameter open-access multilingual language model,” https://huggingface.co/bigscience/bloom, 2022, accessed: 2025-03-25.
[21] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using siamese bert-networks,” in Proc. EMNLP, 2019.
[22] N. Reimers and I. Gurevych, “all-minilm-l6-v2: A lightweight sentence embedding model,” https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2, 2021, accessed: 2025-03-25.
[23] “Melobytes ABC to MIDI Converter,” https://melobytes.com/en/app/abc2midi, accessed: 2025-03-25.
[24] O. Mogren, “C-RNN-GAN: Continuous recurrent neural networks with adversarial training,” in Proc. NeurIPS Workshop on Constructive Machine Learning, 2016.
[25] S.-L. Wu and Y.-H. Yang, “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” in Proc. ISMIR, 2020.
[26] C.-Z. A. Huang, H. V. Koops, E. Newton-Rex, M. Dinculescu, and C. J. Cai, “Ai song contest: Human-ai co-creation in songwriting,” 2020. [Online]. Available: https://arxiv.org/abs/2010.05388