
Generating Symbolic Music From Natural Language Prompts Using an LLM-Enhanced Dataset

Author: Weihan Xu; Julian McAuley; Taylor Berg-Kirkpatrick; Shlomo Dubnov; Hao-Wen Dong
Publisher: Zenodo
DOI: 10.5281/zenodo.17706377
Source: https://zenodo.org/records/17706377/files/000026.pdf
GENERATING SYMBOLIC MUSIC FROM NATURAL LANGUAGE
PROMPTS USING AN LLM-ENHANCED DATASET
Weihan Xu¹, Julian McAuley², Taylor Berg-Kirkpatrick², Shlomo Dubnov², Hao-Wen Dong³
¹Duke University, ²UC San Diego, ³University of Michigan
ABSTRACT
Recent years have seen many audio-domain text-to-music generation models that rely on large amounts of text-audio pairs for training. However, symbolic-domain controllable music generation has lagged behind partly due to the lack of a large-scale symbolic music dataset with extensive metadata and captions. In this work, we present MetaScore, a new dataset consisting of 963K musical scores paired with rich metadata, including free-form user-annotated tags, collected from an online music forum. To approach text-to-music generation, we employ a pretrained large language model (LLM) to generate pseudo-natural language captions for music from its metadata tags. With the LLM-enhanced MetaScore, we train a text-conditioned music generation model that learns to generate symbolic music from the pseudo captions, allowing control of instruments, genre, composer, complexity and other free-form music descriptors. In addition, we train a tag-conditioned system that supports a predefined set of tags available in MetaScore. Our experimental results show that both the proposed text-to-music and tags-to-music models outperform a baseline text-to-music model in a listening test. While a concurrent work Text2MIDI [1] also supports free-form text input, our models achieve comparable performance. Moreover, the text-to-music system offers a more natural interface than the tags-to-music model, as it allows users to provide free-form natural language prompts.
1 Introduction
Recent work has been investigating the potential of conditional music generation with state-of-the-art machine learning models. In particular, we have seen major progress in audio-domain controllable music generation [2,3], largely thanks to the vast amount of text-audio pairs for training. Unlike audio-domain music generation, symbolic music generation systems generate music in editable formats that can be further completed by the users, making it easier for musicians to integrate such systems into their creative workflow. However, symbolic-domain controllable music generation has been hindered by the lack of a large, public symbolic music dataset with rich metadata. In this paper, we intend to build a natural language based symbolic music generation system with our new public dataset MetaScore. MetaScore contains 963K musical scores collected from the MuseScore forum¹, paired with extensive metadata such as genre, composer, complexity, time signature, key signature, tempo and user interaction statistics (e.g., number of views, likes and comments).² In order to approach text-to-music generation, we further enhance the MetaScore dataset by completing missing genre metadata using a machine learning-based genre tagging algorithm, and we leverage large language models to convert the metadata into natural language captions.

Enabled by the metadata provided in MetaScore, we explore text-conditioned music generation with controllable attributes such as instrument, genre, composer, complexity, and other free-form musical descriptors expressed in natural language. With the LLM-enhanced dataset, we train a transformer-based text-to-music model using a pretrained large language model to encode the input text prompts (see Figure 1). In addition, we train a transformer-based tags-to-music model by prepending the input tags to our proposed music representation. Leveraging the LLM-generated captions for training, the proposed text-to-music model achieves competitive performance against the tag-based model while offering a natural language-based interface that allows free-form text inputs.

Figure 1. Leveraging the LLM-enhanced MetaScore dataset, our proposed MetaScore Transformer (MST) model generates symbolic music using natural language prompts with difficulty, genre, instrument and composer controls. The symbolic music outputs allow the user to further edit and complete the composition.

| Dataset | Format | Genre | Samples | Multitrack | Metadata: Genre | Metadata: Composer | Metadata: Emotion | Metadata: Complexity | Metadata: UIS‡ | Metadata: Caption |
|---|---|---|---|---|---|---|---|---|---|---|
| LMD [5] | MIDI | Misc. | 176,581 | ✓ | ׆ | ׆ | × | × | × | × |
| MetaMIDI [6] | MIDI | Misc. | 437K | ✓ | ✓ | ✓ | × | × | × | × |
| WikiMusicText [7] | ABC | Misc. | 1,010∗ | × | ✓ | ✓ | × | × | × | ✓ |
| EMOPIA [8] | MIDI | Pop | 1,087 | × | ✓ | × | ✓ | × | × | × |
| MAESTRO [9] | MIDI | Classical | 1,276 | × | ✓ | ✓ | × | × | × | × |
| MidiCaps [4] | MIDI | Misc. | 168K | ✓ | ✓ | × | ✓ | × | × | ✓§ |
| MetaScore | XML | Misc. | 1.27M | ✓ | ✓ | ✓ | × | ✓ | ✓ | ✓§ |

∗ Only a small subset of WikiMusicText is publicly available at https://huggingface.co/datasets/sander-wood/wikimusictext
† Available through error-prone mapping to Million Song Dataset [5,10]
‡ User interaction statistics
§ LLM-generated (see Section 3.4)
Table 1. Comparison of commonly used publicly available symbolic music datasets.

| Model | Model size | Public training data | Open source | Supports drums | Supports free text prompts | Control: Instrument | Control: Genre | Control: Composer | Control: Complexity |
|---|---|---|---|---|---|---|---|---|---|
| FIGARO [11] | 88.30M | ✓ | ✓ | × | × | ✓ | × | × | × |
| MuseCoco [12] | 203M | × | ✓ | ✓ | × | ✓ | ✓ | ✓ | × |
| BART-based [13] | 139M | ✓ | ✓ | × | ✓ | ✓ | ✓ | × | × |
| MST-Tags | 87.36M | ✓ | ✓ | ✓ | × | ✓ | ✓ | ✓ | ✓ |
| MST-Text | 87.44M | ✓ | ✓ | ✓ | ✓ | ✓∗ | ✓∗ | ✓∗ | ✓∗ |

∗ These can be achieved by free-form text prompts.
Table 2. Comparison of controllable music generation systems.

¹ https://musescore.com/
² We note a concurrent work, MidiCaps [4], which contains 168K MIDI files annotated with genre, emotion tags, and LLM-generated captions. However, it does not include composer names, complexity levels, user statistics, and free-form user-annotated tags (see Table 1).

© Weihan Xu, Julian McAuley, Taylor Berg-Kirkpatrick, Shlomo Dubnov and Hao-Wen Dong. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Weihan Xu, Julian McAuley, Taylor Berg-Kirkpatrick, Shlomo Dubnov and Hao-Wen Dong, “Generating Symbolic Music from Natural Language Prompts using an LLM-Enhanced Dataset”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

To evaluate our proposed models, we compare them with an open-source text-to-symbolic music system [13] and a concurrent work [1] in subjective listening studies. In these studies, we demonstrate that our proposed models outperform the baseline model in terms of coherence, arrangement, adherence, and overall quality, while achieving performance comparable to that of the concurrent work [1]. Our contributions can be summarized as follows:
• We present a new publicly available dataset with musical scores paired with rich metadata and LLM-generated natural language captions.
• We train two new models for tag- and text-based controllable symbolic music generation that support instrument, genre, composer and complexity controls.
The MetaScore dataset, codebase, and audio samples can be found on our website.³

³ https://wx83.github.io/MetaScore_Official/

2 Related Work
2.1 Symbolic Music Datasets
We compare commonly used symbolic music datasets in Table 1. WikiMusicText [7] pairs music with genre, composer, and captions. However, its publicly released version is small, and the musical scores are in ABC notation, which does not support multitrack music natively. Although MetaMIDI [6] comprises around 437K multitrack music pieces in MIDI format, it only includes genre and composer information, lacking natural language captions, which are important for training text-to-music generation models. Although EMOPIA [8] provides emotion information, it is small and contains only pop music. Although MidiCaps [4] contains captions, it does not contain user-annotated free-form tags that are crucial for free-form text-to-music generation. In this work, we present a new, large multitrack and multi-genre symbolic music dataset with rich metadata, including genre, composer, complexity, key signature, time signature, tempo, user interaction statistics, free-form user-annotated tags and pseudo captions.

2.2 Controllable Symbolic Music Generation
Controllable symbolic music generation includes attribute-based music generation, free-form text-to-music generation and music infilling. We compare our model with existing controllable music generation systems in Table 2. EMOPIA [8] is designed to generate music that aligns with specific emotional states, defined within the valence-arousal plane. This psychological model categorizes emotions by valence, indicating their positivity or negativity, and arousal, which measures their intensity from calm to excited. FIGARO [11] can generate samples based on a fine-grained description of the characteristics of the desired music. MuseCoco [12] first classifies a fixed set of predefined musical attributes using multiple classification heads and then employs an attribute-to-music model to generate symbolic music. Recent work on music infilling [14] can condition generation on attributes including instrument type, musical style, note density, polyphony level, and note duration. In this work, we explore free-form text-conditioned music generation with LLM-generated natural language music captions. Additionally, we present a tag-conditioned music generation model that can generate music based on four conditions: genre, instrument, complexity and composer.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 2. Statistics of the metadata available in MetaScore-Raw. Note that not all songs include complete metadata. We derive exact keys by merging user annotations with the MusicXML key element (fifths + mode). C major and A minor are underrepresented because we include only pieces with explicitly specified keys.

3 Dataset
3.1 Dataset Collection
We scraped our dataset from MuseScore,¹ an open-source and free music notation software. To evaluate the quality of our dataset, we leverage the rating entry in MetaScore-Raw as an indicator. This rating, which ranges from 1 to 5 (with 5 being the highest), serves as a structured measure of perceived music quality. To demonstrate that even lower-rated entries can still be suitable for use, we randomly selected 10 samples from three categories: low/missed ratings (below 3 or not rated), mid-range ratings (3–4), and high ratings (above 4). We show qualitative examples on our demo page.³ These examples illustrate the overall usability and diversity of the dataset across different rating levels.
3.2 Collecting and Preprocessing the Dataset
We collect 963K songs paired with musical scores and metadata from the MuseScore forum. We will refer to this original dataset as MetaScore-Raw. MetaScore-Raw contains extensive metadata such as genre, composer, complexity, key signature, time signature, tempo, user interaction statistics (e.g., number of views, likes and comments) and free-form user-annotated tags. We provide statistics of the metadata in Figure 2, and we note that not all songs come with complete metadata.

| Type | Dataset | Adherence↑ |
|---|---|---|
| Ground truth genre tags | MetaScore-Genre | 3.11 ± 0.49 |
| Auto-generated genre tags | MetaScore-Plus∗ | 3.05 ± 0.54 |
| LLM-generated captions | MetaScore-Plus | 3.23 ± 0.49 |

∗ We only include songs with auto-generated genre tags here.
Table 3. Subjective evaluation results on tags/text-music adherence of the dataset, measured on a Likert scale of 1 to 5. We report the mean values and 95% confidence intervals.

From the raw MSCZ files, we extract key signature, time signature, tempo and musical instrumentation. We only retain those instruments that are compatible with the General MIDI standard. Regarding composers, we first filter the composer tags to ensure they are formatted as human names and convert them to lowercase. We also standardize the names of well-known musicians to their full names; for instance, “mozart” is changed to “wolfgang amadeus mozart.”
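The composer-tag cleanup described in Section 3.2 can be sketched as follows. This is only an illustration: the paper does not specify how "formatted as human names" is checked, so the regular expression `NAME_PATTERN` is an assumption, and `FULL_NAMES` contains only the single example given in the text.

```python
import re

# Only the "mozart" entry comes from the paper; a real table would hold more names.
FULL_NAMES = {"mozart": "wolfgang amadeus mozart"}

# Assumed heuristic for "looks like a human name": words made of letters,
# separated by spaces, dots, apostrophes or hyphens (e.g. "j. s. bach").
NAME_PATTERN = re.compile(r"^[^\W\d_]+(?:[ .'\-]+[^\W\d_]+)*\.?$")


def clean_composer(tag):
    """Return a lowercase, standardized composer name, or None if rejected."""
    name = " ".join(tag.split()).lower()  # collapse whitespace, lowercase
    if not name or not NAME_PATTERN.match(name):
        return None
    return FULL_NAMES.get(name, name)


if __name__ == "__main__":
    for raw in ["  Mozart ", "Ludwig van Beethoven", "user_12345"]:
        print(repr(raw), "->", clean_composer(raw))
```

Tags that fail the name check are dropped rather than repaired, which keeps the composer vocabulary small at the cost of discarding some valid but unusually formatted names.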
3.3 Inferring Missing Genre Tags in MetaScore-Raw
While MetaScore-Raw provides rich metadata information, we notice that not all songs come with complete metadata. For example, only 181K (18.8%) out of 963K songs in MetaScore-Raw contain genre metadata. As genre is one of the most intuitive ways for a user to control the style of a music generation system, we want to complete the genre information for songs without a genre label in MetaScore-Raw. Therefore, we train a genre tagger that is based on the Multitrack Music Transformer (MMT) [15], where we remove the causal mask used for autoregressive modeling and append a multi-label classification layer. We select the threshold of the multi-label classification layer for each class based on the F1 score on the validation set.
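A minimal sketch of per-class threshold selection by validation F1 is given below. The candidate grid, the tie-breaking rule (keep the first best threshold) and the function names are assumptions made for illustration; the paper only states that thresholds are chosen per class from the F1 score on the validation set.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for two equal-length sequences of 0/1 values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_thresholds(probs, labels, candidates):
    """Pick, for every class, the candidate threshold with the best F1.

    probs[i][c]  : predicted probability of class c for validation sample i
    labels[i][c] : 1 if sample i carries class c, else 0
    """
    n_classes = len(probs[0])
    thresholds = []
    for c in range(n_classes):
        class_probs = [row[c] for row in probs]
        class_labels = [row[c] for row in labels]
        best_t, best_f1 = candidates[0], -1.0
        for t in candidates:
            preds = [int(p >= t) for p in class_probs]
            score = f1_score(class_labels, preds)
            if score > best_f1:  # ties keep the earlier (lower) threshold
                best_t, best_f1 = t, score
        thresholds.append(best_t)
    return thresholds


if __name__ == "__main__":
    val_probs = [[0.9, 0.6], [0.6, 0.7], [0.4, 0.8], [0.1, 0.3]]
    val_labels = [[1, 0], [1, 1], [0, 1], [0, 0]]
    print(select_thresholds(val_probs, val_labels, [0.3, 0.5, 0.7]))
```

Choosing a separate threshold per class lets rare genres use a lower cut-off than frequent ones, which a single global threshold cannot do.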
To infer genre labels for music pieces lacking such tags, we adopt a data-driven approach by training a genre tagger on MetaScore-Genre. The genre tagger is based on the Multitrack Music Transformer (MMT) [15], where we remove the causal mask used for autoregressive modeling and append a multi-label classification layer. MMT represents a music piece as a sequence of events $\mathbf{x} = (x_1, \dots, x_n)$, where each event $x_i$ comprises six attributes: type, beat, position, pitch, duration, and instrument. To create the input sequence, we extract tokens from the start, middle, and end sections of each music piece, selecting 341 tokens from each to form a concatenated sequence of 1,023 tokens. Instrument information, important for genre identification, is also incorporated as a prefix condition to these sequences.
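The start/middle/end sampling can be sketched as below. The exact placement of the middle window and the handling of pieces shorter than 1,023 tokens are not stated in the paper, so centring the middle window and returning short pieces unchanged are assumptions.

```python
def sample_segments(tokens, segment_len=341):
    """Concatenate the first, middle and last `segment_len` tokens of a piece.

    Pieces with at most 3 * segment_len tokens are returned unchanged
    (an assumption; the paper does not describe this case).
    """
    n = len(tokens)
    if n <= 3 * segment_len:
        return list(tokens)
    start = tokens[:segment_len]
    mid_start = (n - segment_len) // 2  # centre the middle window
    middle = tokens[mid_start:mid_start + segment_len]
    end = tokens[n - segment_len:]
    return list(start) + list(middle) + list(end)


if __name__ == "__main__":
    piece = list(range(5000))  # stand-in for an event-token sequence
    sampled = sample_segments(piece)
    print(len(sampled))  # 3 * 341 tokens
```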
During training, genres with scarce presence or high ambiguity, such as “darkwave” and “experimental,” are excluded to avoid noise. We extract notes from the MSCZ files using MusPy [15]. When we extract notes, we exclude broken files containing negative pitches. Moreover, to enhance generalizability, we include an additional 22,000 samples from the LMD dataset [5], all tagged with genres. Due to the rarity or ambiguity of certain genres, we merge specific genre types from these two datasets into 8 classes.⁴ For the training process, we allocate 90% of the samples to training, with 5% each reserved for validation and testing. We select the threshold of the multi-label classification layer for each class based on the performance on the validation set.
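The merge into eight classes can be expressed as a lookup table. The class names and members below are taken from footnote 4; the lowercase normalization and the choice to silently drop unmapped tags (such as the excluded "darkwave") are assumptions for this sketch.

```python
# Eight merged genre classes and their member genres, as listed in footnote 4.
MERGED_GENRES = {
    "Classical & traditional": ["classical", "religious", "new age"],
    "Soundtrack & stage": ["soundtrack", "comedy"],
    "Rock & Metal": ["pop", "rock", "metal"],
    "Folk & country": ["folk", "country"],
    "Urban": ["hip hop", "r&b", "funk&soul"],
    "Electronic & dance": ["electronic", "disco"],
    "World": ["world music", "reggae&ska"],
    "Jazz & blues": ["jazz", "blues"],
}

# Reverse index: raw genre tag -> merged class.
RAW_TO_MERGED = {
    raw: merged for merged, raws in MERGED_GENRES.items() for raw in raws
}


def merge_genres(raw_tags):
    """Map raw genre tags to merged classes, keeping first-seen order."""
    merged = []
    for tag in raw_tags:
        cls = RAW_TO_MERGED.get(tag.strip().lower())
        if cls is not None and cls not in merged:
            merged.append(cls)  # unmapped tags are dropped
    return merged


if __name__ == "__main__":
    print(merge_genres(["Rock", "pop", "darkwave", "jazz"]))
```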
To evaluate the performance of genre tagging, we first compute the precision, recall and F1 score on the test set, where we achieve a micro-averaged precision of 61.94, recall of 63.03, and F1 score of 62.48. In addition, we conduct a subjective listening test to compare the quality of the auto-generated genre tags with the user-annotated tags in MetaScore-Raw. The 22 participants are instructed to answer the following question on a Likert scale of 1 to 5: “How well do you think this piece of music aligns with the following genre?”. From Table 3, we can see that the auto-generated genre tags in MetaScore-Plus achieve a lower tags-music adherence compared to the ground truth tags in MetaScore-Genre (defined in Section 4), but the difference did not reach statistical significance in our setup.
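For reference, micro-averaging pools true positives, false positives and false negatives over all classes before computing the scores. The sketch below uses toy labels (not data from the paper) and also checks that the reported F1 is the harmonic mean of the reported precision and recall.

```python
def micro_prf(labels, preds):
    """Micro-averaged precision, recall and F1 for multi-label 0/1 matrices."""
    tp = fp = fn = 0
    for y_row, p_row in zip(labels, preds):
        for y, p in zip(y_row, p_row):
            if y and p:
                tp += 1
            elif p:
                fp += 1
            elif y:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def harmonic_mean(p, r):
    """F1 expressed directly in terms of precision and recall."""
    return 2 * p * r / (p + r)


if __name__ == "__main__":
    # Toy example: 2 samples, 3 classes.
    print(micro_prf([[1, 0, 1], [0, 1, 0]], [[1, 1, 0], [0, 1, 0]]))
    # Consistency check on the values reported in the text.
    print(round(harmonic_mean(61.94, 63.03), 2))
```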
3.4 Generating Pseudo Captions using LLMs
To enable text-based downstream tasks (e.g., music captioning and text-to-music generation), we leverage large language models to convert the metadata into natural language captions. We follow LP-MusicCaps [16] and CLAP [17] and adopt an in-context learning-based approach [18] using a pretrained large language model. We form the input prompt string by combining genre, composer, complexity, time signature, key signature, tempo, and free-form user-specified tags.⁵ As shown on the demo page,³ we provide five examples of input-output pairs to facilitate in-context learning with Bloom [19], where the examples are used to provide guidance for the LLM to capture the one-to-many mapping between the input tags and natural language captions. We generate the pseudo captions using the Hugging Face API [20]. We exclude non-English and corrupted captions generated by Bloom [19] and truncate the output sequence to a maximum of 32 tokens.

⁴ We define the eight genres as follows: “Classical & traditional”: classical, religious, new age; “Soundtrack & stage”: soundtrack, comedy; “Rock & Metal”: pop, rock, metal; “Folk & country”: folk, country; “Urban”: hip hop, r&b, funk&soul; “Electronic & dance”: electronic, disco; “World”: world music, reggae&ska; “Jazz & blues”: jazz, blues.
⁵ We have an old version of LLM-generated captions from tags. MST-Text was trained on an old version of LLM-generated captions in which captions are generated from genre, instrument, complexity, copyright, and free-form descriptors with in-context learning.

4 Versions of MetaScore
We will release the following three versions of MetaScore:
• MetaScore-Raw (963K): The raw MuseScore files and metadata scraped from the MuseScore forum, as well as the corresponding MusicXML files for future research.
• MetaScore-Genre (181K): A subset of MetaScore-Raw containing files with user-annotated genres. Additionally, we discard any songs composed by a composer that has fewer than 100 compositions in MetaScore-Raw. We also provide LLM-generated captions based on information extracted from the metadata in MetaScore-Genre.
• MetaScore-Plus (963K): MetaScore-Raw where missing genre tags are completed by the trained genre tagger described in Section 3.3. We also provide LLM-generated captions based on information extracted from the metadata in MetaScore-Plus.
Due to copyright concerns, we will publicly release music scores and metadata from MetaScore-Plus that are in the public domain (228K) or licensed under a Creative Commons license (46K). The rest of the dataset will be provided upon request for research purposes.

5 Method
We represent a music piece as a one-dimensional array of integers using an event-based representation adapted from REMI+ [11] and MMT [15]. REMI+ [11] represents notes with six consecutive tokens encoding note position, pitch, velocity, duration, instrument and time-signature information. However, it cannot provide control over tags such as genre, composer and complexity. MMT [15] represents a music piece as a sequence of six-dimensional events, with each event $x_i$ encoded as a tuple of variables $(x^{\text{type}}, x^{\text{beat}}, x^{\text{position}}, x^{\text{pitch}}, x^{\text{duration}}, x^{\text{instrument}})$. However, MMT cannot model the interdependencies within these fields for a specific note as it predicts the six fields in parallel. In this work, we adapt the REMI+ representation [11] to provide controls over genre, instrument, composer and complexity, while preserving the expressiveness offered by REMI+ [11]. Similar to MMT [15], we decompose note-on events into beat and position to reduce the size of the vocabulary and to help the model learn the rhythmic structure of music. In addition, we exclude the “tempo” and “chord” events as such information is sometimes unavailable in our dataset. Following REMI+ [11], we use beat, position, instrument, pitch and duration events to represent musical notes for non-drum tracks. We represent drum notes as beat, position, instrument, drum_pitch.
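The note encoding described above can be illustrated with a small sketch. The token spellings (`beat_2`, `pitch_60`, and so on), the `Note` container and the resolution of 12 positions per beat are hypothetical choices for this example, not the paper's actual vocabulary.

```python
from dataclasses import dataclass


@dataclass
class Note:
    time: int          # onset in time steps from the start of the piece
    pitch: int         # MIDI pitch number
    duration: int      # length in time steps
    instrument: str    # instrument name
    is_drum: bool = False


def encode_note(note, resolution=12):
    """Turn one note into event tokens.

    Non-drum notes: beat, position, instrument, pitch, duration.
    Drum notes:     beat, position, instrument, drum_pitch.
    `resolution` (time steps per beat) is an assumed value.
    """
    beat, position = divmod(note.time, resolution)
    tokens = [
        f"beat_{beat}",
        f"position_{position}",
        f"instrument_{note.instrument}",
    ]
    if note.is_drum:
        tokens.append(f"drum_pitch_{note.pitch}")
    else:
        tokens += [f"pitch_{note.pitch}", f"duration_{note.duration}"]
    return tokens


if __name__ == "__main__":
    print(encode_note(Note(30, 60, 12, "piano")))
    print(encode_note(Note(12, 36, 1, "drums", is_drum=True)))
```

Splitting the onset into beat and position keeps the vocabulary small: the model needs one token per beat index and one per within-beat position instead of one token per absolute time step.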
To enable free-form text controls, for each music piece with text, we use a pretrained sentence transformer [21] (specifically, the “all-MiniLM-L6-v2” version [22]) to extract the text embedding. Then we add a linear layer to project the text embedding to the input token embedding space, where the projected text embedding is added to the previously generated token embedding along with the positional encoding. Then we feed the encoded sequence into a decoder-only linear transformer. We will refer to this model as MetaScore Transformer-Text (MST-Text).
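The conditioning step, projecting the sentence embedding and adding it to every token embedding together with the positional encoding, can be sketched in plain Python. The tiny dimensions, weights and function names below are hypothetical and only illustrate the arithmetic; in practice the text embedding from all-MiniLM-L6-v2 has 384 dimensions and the projection is a learned layer inside the model.

```python
def linear(x, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


def condition_inputs(token_embs, pos_encs, text_emb, weight, bias):
    """Add the projected text embedding to each token + positional embedding."""
    projected = linear(text_emb, weight, bias)  # text dim -> model dim
    return [
        [t + p + c for t, p, c in zip(tok, pos, projected)]
        for tok, pos in zip(token_embs, pos_encs)
    ]


if __name__ == "__main__":
    text_emb = [1.0, 2.0, 3.0]               # toy 3-d "sentence embedding"
    weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # toy 3 -> 2 projection
    bias = [0.5, -1.0]
    token_embs = [[0.0, 0.0], [1.0, 1.0]]    # two tokens, model dim 2
    pos_encs = [[1.0, 2.0], [3.0, 4.0]]
    print(condition_inputs(token_embs, pos_encs, text_emb, weight, bias))
```

Because the same projected vector is added at every position, the text condition is visible to the decoder at each generation step without lengthening the input sequence.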
Additionally, we train a tag-conditioned music generation model. To enable tag-based controls, we prepend the input tags to our proposed music representation. We introduce four tag events, including tag_genre, tag_composer, tag_complexity and tag_instrument, to specify conditions. Further, we use the standardized composer names to limit the vocabulary size, and we keep only 47 composers that have more than 100 training samples. We use a tag_{missing_tag}_None event for music pieces that do not contain all four tags. In addition to these data tokens, we have six special structural events: The start-of-song event signals the onset of a song, leading into a sequence marked by start-of-genre, start-of-composer, start-of-complexity, start-of-instrument events, each followed by their respective tag lists, with start-of-notes concluding the tag lists and end-of-song indicating the completion of the song. To facilitate controllability in the model, we prepend these control tokens at the start of the data representation. The control tokens include genre, composer, complexity and instruments. Then we feed the sequence with these prepended tags into a decoder-only linear transformer, which capitalizes on the autoregressive nature of the transformer model, enabling the integration of these tokens during the inference process. We will refer to this model as MetaScore Transformer-Tags (MST-Tags).
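The layout of the tag prefix can be sketched as follows. The event names follow the description above, but their exact string spelling, the `build_sequence` helper and the dictionary input format are assumptions for illustration.

```python
TAG_FIELDS = ("genre", "composer", "complexity", "instrument")


def build_sequence(tags, note_tokens):
    """Prepend structural and tag events to the note tokens of one piece.

    `tags` maps a field name to a list of values; a missing or empty field
    is encoded with a tag_<field>_None event.
    """
    seq = ["start-of-song"]
    for field in TAG_FIELDS:
        seq.append(f"start-of-{field}")
        values = tags.get(field) or []
        if values:
            seq += [f"tag_{field}_{v}" for v in values]
        else:
            seq.append(f"tag_{field}_None")
    seq.append("start-of-notes")
    seq += list(note_tokens)
    seq.append("end-of-song")
    return seq


if __name__ == "__main__":
    tags = {"genre": ["jazz"], "instrument": ["piano", "bass"]}
    print(build_sequence(tags, ["beat_0", "position_0"]))
```

At inference time the same prefix, up to and including start-of-notes, would be supplied as the prompt, and the decoder continues the sequence with note events.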
6 Experiments and Results
6.1 Baselines
We compare our model with two text-to-symbolic music generation approaches. The first is a BART-based model [13] trained on a paired text and symbolic dataset using ABC notation, with evaluation presented in Section 6.2 and Section 6.3. The second is a concurrent approach, Text2MIDI [1], which directly generates MIDI files from natural language prompts, with evaluation presented in Section 6.4. MuseCoco [12] first classifies a fixed set of predefined musical attributes using multiple classification heads and then employs an attribute-to-music model to generate symbolic music. This approach does not support free-form natural language inputs for symbolic music generation. The BART-based model [13] leverages pretrained language models for generating symbolic music in ABC notation. To ensure a fair comparison, we generate music using the BART-based model [13] via the Hugging Face API [22] and then convert the ABC outputs to multitrack MIDI using the Melobytes tool [23].

| Model | Pitch class entropy | Scale consistency | Groove consistency |
|---|---|---|---|
| MST-Tags-Small | 2.88 ± 0.08 | 0.89 ± 0.02 | 0.92 ± 0.01 |
| MST-Tags | 2.93 ± 0.07 | 0.89 ± 0.02 | 0.90 ± 0.01 |
| BART-based [13] | 2.54 ± 0.06 | 0.99 ± 0.00 | 1.00 ± 0.00 |
| MST-Text | 2.70 ± 0.06 | 0.95 ± 0.01 | 0.92 ± 0.01 |
| Ground truth | 2.67 ± 0.06 | 0.95 ± 0.01 | 0.92 ± 0.01 |

Table 4. Objective evaluation results on music quality with conditions from the MST test set. We report the mean values and 95% confidence intervals.
6.2 Objective Evaluations
Following [15,24,25], we assess the quality of generated music using pitch class entropy, scale consistency, and groove consistency, where values closer to the ground truth indicate better performance. To ensure a fair comparison, we randomly sampled 100 conditions from the MST test set and generated corresponding music for evaluation. As reported in Table 4, we find that MST-Text most closely matches the ground truth in terms of pitch class entropy and scale consistency, while MST-Tags-Small and MST-Text perform similarly on groove consistency. Additionally, our proposed MST-Text outperforms the BART-based [13] model across all three metrics.
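Two of these metrics are simple enough to sketch directly. The definitions below follow the commonly used formulations (entropy in bits of the 12-bin pitch class histogram, and the best pitch-in-scale rate over all major and natural minor scales); the paper does not spell out its implementation, so treat this as an assumed reading. Groove consistency is omitted because it needs bar-level onset patterns.

```python
import math
from collections import Counter

MAJOR = (0, 2, 4, 5, 7, 9, 11)   # semitone offsets of a major scale
MINOR = (0, 2, 3, 5, 7, 8, 10)   # semitone offsets of a natural minor scale


def pitch_class_entropy(pitches):
    """Shannon entropy (base 2) of the pitch class histogram.

    Assumes `pitches` is a non-empty list of MIDI pitch numbers.
    """
    counts = Counter(p % 12 for p in pitches)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def scale_consistency(pitches):
    """Largest fraction of notes that fit a single major or minor scale."""
    best = 0.0
    for root in range(12):
        for scale in (MAJOR, MINOR):
            allowed = {(root + step) % 12 for step in scale}
            rate = sum(1 for p in pitches if p % 12 in allowed) / len(pitches)
            best = max(best, rate)
    return best


if __name__ == "__main__":
    c_major = [60, 62, 64, 65, 67, 69, 71]
    print(pitch_class_entropy(c_major), scale_consistency(c_major))
```

A piece that uses all twelve pitch classes equally reaches the maximum entropy of log2(12) ≈ 3.58, while its scale consistency drops to 7/12, which is why the two metrics tend to move in opposite directions in Table 4.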
6.3 Subjective Evaluation
We conduct a subjective test where 22 participants are instructed to evaluate five songs under each scenario. Out of the 22 participants, 19 people have experience in playing instruments, with two being professional musicians. We ask the participants to evaluate the audio samples in terms of coherence, arrangement, adherence and overall quality on a Likert scale of 1 to 5.

We report the subjective evaluation results in Table 5. When contrasting MST-Tags-Small with MST-Tags, we observe that MST-Tags achieves better performance in coherence and arrangement, but we see a decrease in adherence, possibly due to the incorporation of some auto-generated tags. This comparison illustrates the trade-off between employing a smaller, high-quality dataset (MetaScore-Genre) versus a larger yet noisy dataset (MetaScore-Plus). However, comparing the overall quality score of MST-Tags-Small and MST-Tags, we see that training with a larger dataset leads to an increase in the overall quality of music generation.

For text-conditioned music generation, MST-Text outperforms the BART-based [13] approach in terms of coherence, arrangement, adherence, and overall quality. We observe that the text-conditioned system MST-Text has a lower adherence than MST-Tags-Small and MST-Tags. This implies that text-to-music generation is a more challenging task than tag-to-music generation, as a text-to-music generation system needs to learn to interpret the free-form text inputs.

Overall, the tag-conditioned systems, MST-Tags and MST-Tags-Small, demonstrate strong performance across multiple dimensions, including coherence, arrangement, adherence to prompts, and overall music quality. These results highlight the high quality of our constructed dataset. Notably, MST-Text achieves the highest score in overall quality, indicating that our text-conditioned model performs on par with the tag-conditioned variants. This underscores the effectiveness of our approach, which leverages a large language model to generate natural language captions, enabling end-to-end training for high-quality text-to-music generation.

| Model | Model size | Training samples | Coherence↑ | Arrangement↑ | Adherence↑ | Overall quality↑ |
|---|---|---|---|---|---|---|
| MST-Tags-Small | 87.36M | 150K | 3.87 ± 0.36 | 3.98 ± 0.38 | 3.86 ± 0.38 | 3.57 ± 0.37 |
| MST-Tags | 87.36M | 901K | 4.01 ± 0.37 | 4.06 ± 0.39 | 3.60 ± 0.49 | 3.66 ± 0.45 |
| BART-based [13] | 139M | 283K | 3.86 ± 0.30 | 3.63 ± 0.39 | 2.81 ± 0.50 | 3.29 ± 0.42 |
| MST-Text | 87.44M | 560K | 3.93 ± 0.28 | 3.88 ± 0.33 | 3.35 ± 0.44 | 3.69 ± 0.33 |

Table 5. Subjective evaluation results on a Likert scale of 1 to 5.

| Metric | Prompt set | Text2MIDI [1] (159M) | MST-Text (87.44M) |
|---|---|---|---|
| CLAP score↑ | M | 0.23 | 0.36 |
| CLAP score↑ | T | 0.20 | 0.13 |
| CLAP score↑ | M+T | 0.22 | 0.24 |
| Coherence (%)↑ | M | 0 | 100 |
| Coherence (%)↑ | T | 40 | 60 |
| Coherence (%)↑ | M+T | 20 | 80 |
| Arrangement (%)↑ | M | 40 | 60 |
| Arrangement (%)↑ | T | 50 | 50 |
| Arrangement (%)↑ | M+T | 45 | 55 |
| Adherence (%)↑ | M | 40 | 60 |
| Adherence (%)↑ | T | 80 | 20 |
| Adherence (%)↑ | M+T | 60 | 40 |
| Overall quality (%)↑ | M | 40 | 60 |
| Overall quality (%)↑ | T | 60 | 40 |
| Overall quality (%)↑ | M+T | 50 | 50 |

Table 6. Comparison of MST-Text and Text2MIDI [1] on three prompt sets: 1) M: five prompts from our test set, 2) T: five prompts from the Text2MIDI [1] test set, and 3) M+T: the union of these two prompt sets. We report the winning rates in a subjective A/B listening test described in Section 6.4.
6.4 Comparison to Text2MIDI
In this section, we compare our proposed MST-Text with a concurrent work Text2MIDI [1] that also supports text-to-symbolic music generation. For a fair comparison, we create three test sets: 1) five text prompts randomly selected from our test set, 2) five text prompts randomly selected from the Text2MIDI test set, and 3) the union of the previous two test sets (i.e., five prompts from each test set).

Objective Evaluation. We compare music quality of our MST-Text and Text2MIDI [1] using pitch class entropy, scale consistency, and groove consistency, and report the results in Table 7. In addition, to assess the alignment between text and symbolic music, we report the average semantic similarity computed with CLAP [17]. We find that MST-Text better matches the ground truth in terms of pitch class entropy and scale consistency, while Text2MIDI better matches the ground truth in terms of groove consistency. Additionally, MST-Text shows better alignment performance when prompts are taken from the MST-Text test set and the joint test set.
| Model | Pitch class entropy | Scale consistency | Groove consistency |
|---|---|---|---|
| Text2MIDI [1] | 2.44 ± 0.19 | 0.89 ± 0.03 | 0.94 ± 0.01 |
| MST-Text | 2.65 ± 0.08 | 0.96 ± 0.02 | 0.92 ± 0.01 |
| Ground truth | 2.71 ± 0.07 | 0.96 ± 0.02 | 0.94 ± 0.01 |

Table 7. Objective evaluation results on music quality for the joint test set. We report the mean values and 95% confidence intervals.

Subjective Evaluation. We compare our model MST-Text with Text2MIDI [1] via an A/B test on coherence, arrangement, adherence, and overall quality. Eleven participants (9 with musical experience, including one professional) evaluated 5 prompts from our test set and 5 from the Text2MIDI test set, with results summarized in Table 6. Our experiments show that MST-Text produces more coherent results and achieves equal or superior arrangement performance compared to Text2MIDI [1]. However, when it comes to adherence, Text2MIDI outperforms MST-Text on the Text2MIDI test set and the joint test set, while MST-Text performs better on the MST-Text test set. Overall, both models deliver comparable quality.
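As a small consistency check on Table 6, the joint (M+T) winning rate should equal the prompt-count-weighted mean of the M and T rates. The sketch below assumes that each prompt contributes exactly one win or loss per criterion, which the paper does not state explicitly; the helper name `pooled_win_rate` is hypothetical.

```python
def pooled_win_rate(rate_m, rate_t, n_m=5, n_t=5):
    """Combine two winning rates (in %) measured on n_m and n_t prompts."""
    wins = rate_m / 100 * n_m + rate_t / 100 * n_t
    return 100 * wins / (n_m + n_t)


if __name__ == "__main__":
    # MST-Text coherence: 100% on M and 60% on T.
    print(pooled_win_rate(100, 60))
    # Text2MIDI adherence: 40% on M and 80% on T.
    print(pooled_win_rate(40, 80))
```

With five prompts in each set the pooled value is simply the average of the two rates, which matches the M+T columns reported for the winning-rate criteria.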
7 Conclusion
In this paper, we have introduced MetaScore, a new publicly available dataset containing rich metadata and LLM-generated captions. We also present a new music generation model that can generate symbolic music from free-form text, allowing controls over instruments, genre, composer, complexity, among other music descriptors. In addition, the LLM-generated pseudo captions contain information provided in free-form user-annotated tags, which can pose a challenge to systems that adopt a predefined set of tags [12]. Our objective and subjective evaluation results show the effectiveness of the proposed tags-to-music and text-to-music models. Our proposed text-to-music model outperforms a baseline text-to-music model [13] and achieves comparable performance with a concurrent work [1]. In addition, the proposed text-to-symbolic music generation model trained with LLM-generated pseudo captions achieves competitive performance against the proposed tags-to-music model trained using only the ground truth tags.
8 Ethics Statement
We note that MetaScore contains many copyrighted contents. This raises concerns of potential misuse of this dataset that can lead to severe copyright infringements. To minimize such risks, we will release those in the public domain only, and those not in the public domain will be shared upon request and only used for research purposes. Further, music generation systems built upon the MetaScore dataset may infringe the copyright held by the content creators, and thus we must be careful about adopting these systems in commercial applications. However, we would like to point out that when used properly and with caution, a text-to-music generation system can also make a positive impact on society by enabling new opportunities and interfaces for music creation, as demonstrated in [26]. Given the editable nature of symbolic music, we hope our proposed text-to-symbolic music models will open up new pathways towards human-AI music co-creation.
9 References
[1] K. Bhandari, A. Roy, K. Wang, G. Puri, S. Colton, and D. Herremans, “Text2midi: Generating symbolic music from captions,” 2024. [Online]. Available: https://arxiv.org/abs/2412.16526
[2] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan, “Music controlnet: Multiple time-varying controls for music generation,” in arXiv preprint arXiv:2311.07069, 2023.
[3] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan, “DITTO: Diffusion inference-time T-optimization for music generation,” in arXiv preprint arXiv:2401.12179, 2024.
[4] J. Melechovsky, A. Roy, and D. Herremans, “Midicaps: A large-scale midi dataset with text captions,” 2024. [Online]. Available: https://arxiv.org/abs/2406.02255
[5] C. Raffel, “Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching,” Ph.D. dissertation, Columbia University, 2016.
[6] J. Ens and P. Pasquier, “Building the MetaMIDI dataset: Linking symbolic and audio musical data,” in Proc. ISMIR, 2021.
[7] S. Wu, D. Yu, X. Tan, and M. Sun, “Clamp: Contrastive language-music pre-training for cross-modal symbolic music information retrieval,” in Proc. ISMIR, 2023.
[8] H.-T. Hung, J. Ching, S. Doh, N. Kim, J. Nam, and Y.-H. Yang, “EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation,” in Proc. ISMIR, 2021.
[9] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” in Proc. ICLR, 2019.
[10] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere, “The million song dataset,” in Proc. ISMIR, 2011.
[11] D. von Rütte, L. Biggio, Y. Kilcher, and T. Hofmann, “FIGARO: Controllable music generation using learned and expert features,” in Proc. ICLR, 2023.
[12] P. Lu, X. Xu, C. Kang, B. Yu, C. Xing, X. Tan, and J. Bian, “MuseCoco: Generating symbolic music from text,” in arXiv preprint arXiv:2306.00110, 2023.
[13] S. Wu and M. Sun, “Exploring the efficacy of pre-trained checkpoints in text-to-music generation task,” in arXiv preprint arXiv:2211.11216, 2022.
[14] P. Pasquier, J. Ens, N. Fradet, P. Triana, D. Rizzotti, J.-B. Rolland, and M. Safi, “Midi-gpt: A controllable generative model for computer-assisted multitrack music composition,” 2025. [Online]. Available: https://arxiv.org/abs/2501.17011
[15] H.-W. Dong, K. Chen, S. Dubnov, J. McAuley, and T. Berg-Kirkpatrick, “Multitrack music transformer,” in Proc. ICASSP, 2023.
[16] S. Doh, K. Choi, J. Lee, and J. Nam, “LP-MusicCaps: LLM-based pseudo music captioning,” in Proc. ISMIR, 2023.
[17] Y. Wu, K. Chen, T. Zhang, Y. Hui, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in Proc. ICASSP, 2023.
[18] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, and Z. Sui, “A survey on in-context learning,” in arXiv preprint arXiv:2301.00234, 2023.
[19] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, et al., “Bloom: A 176b-parameter open-access multilingual language model,” in arXiv preprint arXiv:2211.05100, 2022.
[20] BigScience, T. L. Scao, A. Fan, C. Wolf, T. Rush, S. Biderman, G. B. Black, S. Curtis, D. Elbayad, T. Gao et al., “BLOOM: A 176b-parameter open-access multilingual language model,” https://huggingface.co/bigscience/bloom, 2022, accessed: 2025-03-25.
[21] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using siamese bert-networks,” in Proc. EMNLP, 2019.
[22] N. Reimers and I. Gurevych, “all-minilm-l6-v2: A lightweight sentence embedding model,” https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2, 2021, accessed: 2025-03-25.
[23] “Melobytes ABC to MIDI Converter,” https://melobytes.com/en/app/abc2midi, accessed: 2025-03-25.
[24] O. Mogren, “C-RNN-GAN: Continuous recurrent neural networks with adversarial training,” in Proc. NeurIPS Workshop on Constructive Machine Learning, 2016.
[25] S.-L. Wu and Y.-H. Yang, “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” in Proc. ISMIR, 2020.
[26] C.-Z. A. Huang, H. V. Koops, E. Newton-Rex, M. Dinculescu, and C. J. Cai, “Ai song contest: Human-ai co-creation in songwriting,” 2020. [Online]. Available: https://arxiv.org/abs/2010.05388