VERSATILE SYMBOLIC MUSIC-FOR-MUSIC MODELING VIA
FUNCTION ALIGNMENT
Junyan Jiang 1,2, Daniel Chin 1,2, Liwei Lin 1,2, Xuanjie Liu 2, Gus Xia 1,2
1 NYU Shanghai, 2 Music X Lab, MBZUAI
{jj2731, daniel.chin, ll4270, gxia}@nyu.edu, [email protected]
ABSTRACT
Many music AI models learn a map between music content and
human-defined labels. However, many annotations, such as chords,
can be naturally expressed within the music modality itself, e.g.,
as sequences of symbolic notes. This observation enables both
understanding tasks (e.g., chord recognition) and conditional
generation tasks (e.g., chord-conditioned melody generation) to be
unified under a music-for-music sequence modeling paradigm. In
this work, we propose parameter-efficient solutions to a variety
of symbolic music-for-music tasks. The high-level idea is that
(1) we utilize a pretrained Language Model (LM) for both the
reference and the target sequence and (2) we link these two LMs
via a lightweight adapter. Experiments show that our method
achieves superior performance among different tasks such as chord
recognition, melody generation, and drum track generation. All
demos, code and model weights are publicly available 1.
1. INTRODUCTION
Many foundational tasks in music AI, such as music information
retrieval (MIR) and conditional music generation, have
traditionally been formulated as mappings between music and
labels: either from music to task-specific annotations (e.g.,
chord recognition), or from descriptive conditions to music (e.g.,
chord-conditioned melody generation). While these tasks have long
been treated separately, a key observation is that in many cases,
the “labels” themselves can also be represented in the same music
modality, for example, as note sequences. This suggests a unifying
perspective: a wide range of MIR and generation tasks can be
reformulated as sequence-to-sequence problems within the music
domain. We refer to this formulation as music-for-music modeling.
To achieve versatile music-for-music modeling in a
sample-efficient way, we apply knowledge transfer of pretrained
foundational Language Models (LMs) using a light-parameterized
adapter. As illustrated in Fig. 1(a)-(b), many existing methods
such as probing [1–3] and prefix tuning [4–6] transfer knowledge
of foundation models to downstream tasks by adapting them to new
input or output, but the knowledge resides in only one language:
either the LM of source $\mathbf{x}$ or the LM of target
$\mathbf{y}$. In contrast, our method distills knowledge from both
LMs via aligning them in a layer-wise manner, as shown in
Fig. 1(c).

1 https://github.com/music-x-lab/midi-function-alignment

© J. Jiang, D. Chin, L. Lin, X. Liu and G. Xia. Licensed under a
Creative Commons Attribution 4.0 International License (CC BY
4.0). Attribution: J. Jiang, D. Chin, L. Lin, X. Liu and G. Xia,
“Versatile Symbolic Music-for-Music Modeling via Function
Alignment”, in Proc. of the 26th Int. Society for Music
Information Retrieval Conf., Daejeon, South Korea, 2025.

Figure 1. Three types of sequence-to-sequence models by knowledge
transfer from pretrained LMs. $\mathbf{x}$ and $\mathbf{y}$ are
input sequences and $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ are
predictions, possibly right-shifted due to the autoregressive
targets. (a) Probing; (b) Prefix tuning; (c) Function alignment
($\mathbf{x} \to \mathbf{y}$).
At the methodology level, our approach is inspired by function
alignment [7], a recently proposed theory of mind that attributes
the emergence of intelligence to the dynamic synergy among
interacting agents, i.e., Language Models (LMs). In our work, we
contribute two concrete implementations of this idea by creating
synergy between two LMs through Parameter-Efficient Fine-Tuning
(PEFT). The first approach introduces a trainable cross-attention
layer between two separately pretrained LMs. The second, more
concise solution uses a lightweight self-attentive adapter applied
to concatenated input-output sequences within a single shared LM,
a strategy applicable when both input and output share the same
vocabulary. We show the effectiveness of both implementations
using experiments on both generative and analysis tasks,
including: (1) chord-conditioned melody generation,
(2) melody-conditioned chord generation, (3) drum-conditioned song
generation, (4) song-conditioned drum generation and (5) few-shot
symbolic music analysis.
The main contributions of this paper are as follows:
1. We achieve versatile music-for-music modeling, unifying a
broad range of music understanding and controllable generation
tasks under a shared framework.
2. At the methodological level, we bring the novel concept of
function alignment (a recently proposed theory of mind that
emphasizes synergy among agents) into the domain of music AI,
offering a fresh perspective on sequence-to-sequence tasks.
3. While the original position paper on function alignment
remains at a conceptual level, our work takes a significant step
forward by introducing two concrete, parameter-efficient
implementations in the context of modern language models: one via
cross-attentive adapters across two LMs, and another via a
self-attentive adapter within a shared LM. We demonstrate the
effectiveness of both approaches through theoretical analysis and
empirical validation.
2. RELATED WORKS
2.1 Music Foundation Models
Since the invention of the Transformer architecture [8],
transformer-based language models have become the mainstream of
music foundation models on multiple modalities, including audio
[9–12], symbolic [13–21] and text-based music representation [22].
In addition to autoregressive models, masked language models
[2, 23], diffusion models [24–29] and flow-based models [30] can
also be used as foundation models, but we focus on autoregressive
models in the literature review.
For symbolic music, the music transformer [13] is an early work to
adopt the transformer architecture for music. Some follow-up works
try to design a better representation of the music content. For
example, pop music transformer imposes a metrical structure in the
data representation [15]. MuPT trains transformers on their
proposed synchronized multi-track ABC notation [20]. Other works
aim to introduce controllability to the generative model.
MuseCoco generates the music score from text [14]. METEOR performs
melody-aware orchestral music generation with texture control
[16]. SymPAC trains symbolic generation models from transcribed
audio data with chord, section, and instrument controls [17].
Zhang et al. improve generation discriminators to better follow
rhythm and melody conditions [18]. The Theme Transformer [19] uses
a short theme condition for generation. MuseBarControl generates
music with fine-grained control at the bar level [21].
2.2 Parameter-Efficient Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) methods add lightly
parameterized adapters to large pretrained models. Compared to
full-parameter fine-tuning, PEFT requires significantly less
computation and training data. Existing methods include appending
task-specific prefixes to input sequences [4,31], injecting
low-rank adaptation (LoRA) to linear layers [32], and adding
learnable hidden states to the self-attention blocks [5,33].
PEFT has been applied to music foundation models to support new
tasks. Coco-Mulla [6] and MusiConGen [34] both adapt MusicGen to
follow content controls such as chord and rhythm. Additionally,
AirGen enables MusicGen to infill segments based on content
controls [35]. Instruct-MusicGen extends MusicGen to music editing
by text instructions [36]. Audio Prompt Adapter extends AudioLDM2
to music editing following controls such as genre, timbre, and
melody [37]. Ou et al. tune a symbolic language model for tasks
like band arrangement, piano reduction, drum arrangement and voice
separation [38].

Figure 2. The architecture of the foundation model. The left side
shows the global decoder. The right side shows the encoding of a
single time step $x_3 = \{i_3^1, n_3^1, \text{[eos]}\}$ and the
decoding of the next step
$x_4 = \{i_4^1, n_4^1, i_4^2, n_4^2, \text{[eos]}\}$.
3. METHODOLOGY
3.1 Base Model
For this study, we choose the base model (the pretrained symbolic
LM) with two main considerations. First, we do not wish to
introduce any control in the pretraining stage, since we want to
demonstrate the controllability using PEFT. We refrain from using
any annotation or metadata (i.e., chord, bar or text annotations)
to pretrain the base model. Second, we want to adopt a data
representation that can help the model align multiple sequences in
time easily. Instead of using a MIDI event-like representation
[13, 15, 39] where two time-aligned sequences might have a
significant length difference, we use a fixed time step (a 16th
note unit) for the input sequence.
Since multiple notes can occur at the same time step, we use a
hierarchical scheme to compress (decompress) the note lists on the
same time step with a local encoder (decoder), as shown in Fig. 2.
3.1.1 Data Representation
Formally, we represent a score sequence
$\mathbf{x} = \{x_1, ..., x_T\}$ with a fixed time step of a 16th
note. Since each time step may contain multiple note onsets, each
$x_t$ represents a list of $N_t$ notes whose quantized onset time
is the $t$-th 16th note (i.e., $x_t$ is a simu-note [40] at time
step $t$). We define

$$x_t = \{i_t^1, n_t^1, i_t^2, n_t^2, ..., i_t^{N_t}, n_t^{N_t}, \text{[eos]}\} \quad (1)$$

where $i_t^k \in \{0, ..., 128\}$ is the instrument ID of the
$k$-th note. We use the MIDI program number 0...127 for pitched
instruments and $i_t^k = 128$ for drums. $n_t^k = 24 p_t^k + d_t^k$
is a flattened representation of the $k$-th note's pitch
$p_t^k \in \{0, ..., 127\}$ and duration
$d_t^k \in \{0, ..., 23\}$. $p_t^k$ denotes the MIDI pitch from 0
to 127. $d_t^k \in \{0, ..., 23\}$ is the note duration quantized
into 24 possible bins; $d_t^k = j$ corresponds to a duration of
$b_j$ sixteenth notes where
$b = [1, 2, 3, 4, 6, 8, 12, 16, 24, ..., 4096]$. [eos] is a special
token marking the end of the list. All notes in $x_t$ are sorted
primarily by $i_t^k$ and secondarily by $n_t^k$.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 3. The architecture of a cross-attentive function alignment
adapter. The fire icon denotes trainable parameters, and the
snowflake icon denotes frozen parameters.
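As a rough illustration of the token layout in Eq. 1, the sketch below flattens (pitch, duration-bin) pairs into $n = 24p + d$ and builds the sorted token list for one time step. The function names and the string used for the [eos] token are assumptions made for this example, and the duration table $b$ is deliberately left out because the text only gives it in abbreviated form.

```python
NUM_DURATION_BINS = 24   # d ranges over 0..23 in Eq. 1
DRUM_INSTRUMENT = 128    # instrument ID reserved for drums
EOS = "[eos]"            # hypothetical stand-in for the end-of-list token


def flatten_note(pitch, duration_bin):
    """Map (pitch, duration bin) to the flattened index n = 24 * p + d."""
    if not (0 <= pitch <= 127 and 0 <= duration_bin < NUM_DURATION_BINS):
        raise ValueError("pitch or duration bin out of range")
    return NUM_DURATION_BINS * pitch + duration_bin


def unflatten_note(n):
    """Inverse of flatten_note: recover (pitch, duration bin)."""
    return divmod(n, NUM_DURATION_BINS)


def encode_time_step(notes):
    """Encode one time step.

    notes: iterable of (instrument, pitch, duration_bin) tuples.
    Returns [i^1, n^1, i^2, n^2, ..., EOS], sorted by instrument first
    and by flattened note index second.
    """
    pairs = sorted((inst, flatten_note(p, d)) for inst, p, d in notes)
    tokens = []
    for inst, n in pairs:
        tokens.extend([inst, n])
    tokens.append(EOS)
    return tokens


step = encode_time_step([(DRUM_INSTRUMENT, 36, 0), (0, 60, 3), (0, 48, 5)])
```

An empty time step simply encodes to a list containing only the end token, which is how silence would be represented under these assumptions.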
3.1.2 Model Design
We use a RoFormer [41], a popular transformer architecture, as the
backbone model. The model architecture is shown in Fig. 2. Since
our input sequence contains nested lists, we first encode each
$x_t$ with a local RoFormer encoder:

$$[\mathbf{h}_t, \_] = \text{LocalEncoder}(\text{[cls]}, x_t) \quad (2)$$

for all $t = 1...T$. Specifically, we prepend a [cls] token at the
beginning of $x_t$ and pass the sequence to the encoder.
$\mathbf{h}_t$ is acquired from the output representation of the
[cls] token. We then use a global RoFormer decoder to
autoregressively model the symbolic score:

$$\hat{\mathbf{h}}_t = \text{GlobalDecoder}(\mathbf{e}_{\text{sos}}, \mathbf{h}_{1...t-1}) \quad (3)$$

where $\mathbf{e}_{\text{sos}}$ is a learnable start-of-sentence
(sos) embedding. Finally, a local RoFormer decoder generates each
note by

$$\hat{x}_{t,j} = \text{LocalDecoder}(\hat{\mathbf{h}}_t, x_{t,1...j-1}) \quad (4)$$

for all $t = 1...T$. Here, $x_{t,j}$ denotes the $j$-th token of
list $x_t$ (see Eqn. 1). The local decoder terminates when an
end-of-sentence (eos) token is generated.
We will use $\hat{x}_t = \text{LM}(x_{0...t-1})$ (or simply
$\text{LM}(\mathbf{x})$) as a shorthand of the autoregressive model
of sequence $\mathbf{x}$ through Eqs. 2-4. Here, $x_0$ denotes the
global start-of-sentence embedding $\mathbf{e}_{\text{sos}}$.
3.2 Parameter-Efficient Fine-Tuning
Our fine-tuning strategy leverages pretrained LMs of $\mathbf{x}$
and $\mathbf{y}$, connected via a parameter-efficient module. We
present two variants: cross-attentive adapters for separate LMs,
and self-attentive adapters for a shared LM. We apply both
adapters to the backbone of the foundation model (the global
decoder in Eqn. 3) only.
Figure 4. The architecture of a self-attentive function alignment
adapter. Crossed vertical and horizontal arrows indicate the flow
of information between the corresponding query and key/value
pairs, while all other connections are masked by the
autoregressive self-attention mechanism. The indices 0 through 4
represent the proposed positional embeddings of the concatenated
sequence.
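3.2.1 Cross-attentive Function Alignment
Our first approach is to use a cross-attention layer between the
hidden layers of two LMs. A similar architecture has been adopted
in language processing [42] and speech processing [43]. We refer
to the design of [42] and show an adapted version in Fig. 3. For
the $l$-th attention layer of $\text{LM}(\mathbf{y})$, the original
self-attention is defined as:

$$\mathbf{h}_p^l = \text{SelfAttn}(\mathbf{W}_q^l \mathbf{z}_y^l, \mathbf{W}_k^l \mathbf{z}_y^l, \mathbf{W}_v^l \mathbf{z}_y^l) \quad (5)$$

where $\mathbf{z}_y^l$ denotes the $l$-th layer hidden states of
$\text{LM}(\mathbf{y})$ and $\mathbf{W}^l$ denotes pretrained
weights. The adapted version can be written as:

$$\mathbf{h}_a^l = \mathbf{h}_p^l + g^l \cdot \text{CrossAttn}(\mathbf{U}_q^l \mathbf{z}_y^l, \mathbf{U}_k^l \mathbf{z}_x^l, \mathbf{U}_v^l \mathbf{z}_x^l) \quad (6)$$

where $g^l$ is a zero-initialized trainable gate scalar and
$\mathbf{U}^l$ are trainable parameters. Intuitively, this allows
the query from $\text{LM}(\mathbf{y})$ to attend both to itself
(self-attention) and to the condition from $\text{LM}(\mathbf{x})$
(cross-attention).
Besides the trainable cross-attention module, we also apply LoRA
[32] to all $\mathbf{W}_q^l$ and $\mathbf{W}_v^l$ of both
pretrained models $\text{LM}(\mathbf{x})$ and
$\text{LM}(\mathbf{y})$, allowing the model to learn distinctive
features of sequences $\mathbf{x}$ and $\mathbf{y}$.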
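To make the role of the zero-initialized gate in Eq. 6 concrete, here is a minimal single-head, single-query sketch in plain Python. It assumes the query, keys and values have already been projected (the trainable matrices $\mathbf{U}^l$ are omitted), and the function names are made up for this illustration; it is not the paper's implementation.

```python
import math


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector (single head)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    top = max(scores)                       # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def gated_adapter_output(h_p, q_y, k_x, v_x, gate):
    """h_a = h_p + gate * CrossAttn(q_y, k_x, v_x), as in Eq. 6.

    h_p  : output of the frozen self-attention of LM(y) for this position
    q_y  : projected query from LM(y)
    k_x, v_x : projected keys and values from LM(x)
    gate : scalar g, zero at initialization
    """
    cross = attention(q_y, k_x, v_x)
    return [hp + gate * c for hp, c in zip(h_p, cross)]


h_p = [1.0, 2.0]
at_init = gated_adapter_output(h_p, [0.5, 0.5], [[1.0, 0.0]], [[3.0, 4.0]], 0.0)
opened = gated_adapter_output(h_p, [0.5, 0.5], [[1.0, 0.0]], [[3.0, 4.0]], 1.0)
```

With the gate at zero the adapted layer reproduces the pretrained output exactly, so fine-tuning starts from the behavior of the frozen LM and the condition is blended in only as the gate is learned.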
3.2.2 Self-attentive Function Alignment
When $\mathbf{x}$ and $\mathbf{y}$ share the same pretrained LM,
alignment becomes a special case: it can be achieved by
concatenating their sequences and feeding them into a single
model. The LM will first model $\mathbf{x}$ and predict
$\mathbf{y}$ given $\mathbf{x}$ as a prefix.
This implies that some prior PEFT methods [35, 38], which
structure the condition and generated sequence within a single
language model, can be viewed as broader forms of function
alignment. We show that a simple configuration is also effective
and explain why it realizes function alignment.
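When we directly concatenate two sequences
$[\mathbf{x}, \mathbf{y}]$ and feed them to the decoder
self-attention layer, we can decompose it into the self-attention
of $\mathbf{x}$ and $\mathbf{y}$, and an extra component
influencing $\mathbf{y}$ from $\mathbf{x}$, as shown in Fig. 4.
Specifically, we have

$$[\mathbf{h}_{xa}^l, \mathbf{h}_{ya}^l] = \text{SelfAttn}(\mathbf{W}_q^l[\mathbf{z}_x^l, \mathbf{z}_y^l], \mathbf{W}_k^l[\mathbf{z}_x^l, \mathbf{z}_y^l], \mathbf{W}_v^l[\mathbf{z}_x^l, \mathbf{z}_y^l])$$
$$= \text{SelfAttn}([\mathbf{Q}_x^l, \mathbf{Q}_y^l], [\mathbf{K}_x^l, \mathbf{K}_y^l], [\mathbf{V}_x^l, \mathbf{V}_y^l]) \quad (7)$$

for every layer $l$. In a single-head setting, we have
$\text{SelfAttn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) := \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d} + \mathbf{M})\mathbf{V}$,
where $d$ is the dimension of the key vectors and $\mathbf{M}$ is
the autoregressive mask. We can rewrite Eqn. 7 by

$$\mathbf{h}_{xa}^l = \text{SelfAttn}(\mathbf{Q}_x^l, \mathbf{K}_x^l, \mathbf{V}_x^l) \quad (8)$$

$$\mathbf{h}_{ya}^l = \frac{a}{a+b}\,\text{SelfAttn}(\mathbf{Q}_y^l, \mathbf{K}_y^l, \mathbf{V}_y^l) + \frac{b}{a+b}\,\text{CrossAttn}(\mathbf{Q}_y^l, \mathbf{K}_x^l, \mathbf{V}_x^l) \quad (9)$$

where
$a = \sum_j \exp\left[(\mathbf{Q}_y^l \mathbf{K}_y^{l\top})_j / \sqrt{d} + \mathbf{M}\right]$
and
$b = \sum_j \exp\left[(\mathbf{Q}_y^l \mathbf{K}_x^{l\top})_j / \sqrt{d}\right]$.
Note that Eqn. 9 closely mirrors Eqn. 6 in form. Although the
gating vectors $a$ and $b$ are not explicitly parameterized, we
hypothesize that this design remains effective.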
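The decomposition in Eq. 9 can be checked numerically for a single query position of $\mathbf{y}$. The sketch below is a plain-Python illustration under simplifying assumptions: one head, one query, and the autoregressive mask modeled by passing only the keys that the query is allowed to see. All names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax_attention(query, keys, values):
    """Return (attention output, partition sum) for one query vector."""
    d = len(query)
    weights = [math.exp(dot(query, k) / math.sqrt(d)) for k in keys]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) / z
           for i in range(dim)]
    return out, z


def concatenated_output(q, k_x, v_x, k_y_visible, v_y_visible):
    """Attention of a y-query over the concatenated sequence [x, y]."""
    out, _ = softmax_attention(q, k_x + k_y_visible, v_x + v_y_visible)
    return out


def decomposed_output(q, k_x, v_x, k_y_visible, v_y_visible):
    """Right-hand side of Eq. 9: a/(a+b) * Self + b/(a+b) * Cross."""
    self_out, a = softmax_attention(q, k_y_visible, v_y_visible)
    cross_out, b = softmax_attention(q, k_x, v_x)
    return [(a * s + b * c) / (a + b) for s, c in zip(self_out, cross_out)]


q = [0.3, -0.2]
k_x, v_x = [[0.1, 0.4], [-0.5, 0.2]], [[1.0, 0.0], [0.0, 1.0]]
k_y, v_y = [[0.7, -0.1]], [[2.0, 2.0]]
lhs = concatenated_output(q, k_x, v_x, k_y, v_y)
rhs = decomposed_output(q, k_x, v_x, k_y, v_y)
```

The two results agree up to floating-point error, which is the sense in which the softmax normalizers $a$ and $b$ act as an implicit, unparameterized gate.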
After concatenating $\mathbf{x}$ and $\mathbf{y}$, we reset
$\mathbf{y}$'s positional embeddings to start from 0 to better
preserve the pretrained behavior of $\text{LM}(\mathbf{y})$. To
avoid token indistinguishability due to overlapping positions, we
also add zero-initialized, learnable sentence embeddings
$\mathbf{e}_x$ and $\mathbf{e}_y$ to their respective positional
encodings, as shown in Fig. 4.
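The position reset can be pictured with a small index-building helper. This is only a sketch of the bookkeeping, with hypothetical names: `segment_ids` marks which learnable sentence embedding ($\mathbf{e}_x$ or $\mathbf{e}_y$) would be added at each position, and how those embeddings enter the model's positional encoding is not shown here.

```python
def concat_positions(len_x, len_y):
    """Position and segment indices for the concatenated sequence [x, y].

    Positions restart at 0 for y, so each half sees the same indices it
    saw during pretraining. segment_ids select the sentence embedding:
    0 for e_x, 1 for e_y.
    """
    position_ids = list(range(len_x)) + list(range(len_y))
    segment_ids = [0] * len_x + [1] * len_y
    return position_ids, segment_ids


positions, segments = concat_positions(5, 5)
```

For two five-step sequences this gives positions 0 through 4 twice, matching the indices drawn in Fig. 4, while the segment indices keep the two halves distinguishable.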
Similar to cross-attentive adapters, a trainable LoRA module is
also appended to the pretrained LM.
4. EXPERIMENTS
In the experiments, we first describe the hyperparameters and the
pretraining scheme of our foundation model (Sec. 4.1). We evaluate
our adapters on both generative and analysis tasks. We describe
the tasks in Sec. 4.2 and models in Sec. 4.3. We then show the
setting of subjective evaluation (Sec. 4.4) and objective
evaluation (Sec. 4.5), and analyze the results in Sec. 4.6.
4.1 Model Pretraining
We use a RoFormer with a 12-layer global decoder (hidden size 768,
intermediate size 3072, 12 heads). The local encoder and decoder
are smaller 3-layer RoFormers (hidden size 768, intermediate size
768, 8 heads).
We pretrain our foundation model on the Los Angeles MIDI dataset
[44], which contains approximately 405,000 MIDI files. As a
score-based model, it relies on accurate beat annotations
(inferred from tempo change events) for correct quantization.
However, many files in the pretraining dataset contain incorrect
tempo information.
To address this, we apply a rule-based filter. Normally, note
onsets are not uniformly distributed across odd and even time
steps. We compute the ratio of notes quantized to odd vs. even
time steps. If the ratio falls within 0.5 ± 0.15 for every track,
we assume it is poorly quantized and discard the song. This yields
a cleaned subset of 357,279 files. During pretraining, we also
apply a random pitch shift within [−5, 6] semitones for data
augmentation.
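A possible reading of the quantization filter is sketched below. It interprets "the ratio" as the fraction of onsets that fall on odd 16th-note steps, which is an assumption: the text does not spell out the exact definition, nor how empty tracks are handled. Function and parameter names are hypothetical.

```python
def odd_step_fraction(onset_steps):
    """Fraction of note onsets that land on odd time steps, or None if empty."""
    if not onset_steps:
        return None
    odd = sum(1 for step in onset_steps if step % 2 == 1)
    return odd / len(onset_steps)


def is_poorly_quantized(tracks, center=0.5, tolerance=0.15):
    """Flag a song when every non-empty track has a near-uniform odd/even split.

    tracks: list of tracks, each a list of quantized onset steps (ints).
    Empty tracks are ignored (an assumption of this sketch).
    """
    fractions = [odd_step_fraction(track) for track in tracks]
    fractions = [f for f in fractions if f is not None]
    if not fractions:
        return False
    return all(abs(f - center) <= tolerance for f in fractions)


well_aligned = [[0, 2, 4, 8, 12, 13], [0, 4, 8, 12]]
misaligned = [[0, 1, 2, 3, 5, 6], [1, 2, 4, 7]]
```

A well-quantized score places most onsets on even steps, so its odd fraction stays far from 0.5; a file with wrong tempo information spreads onsets almost evenly and is flagged.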
We set the global sequence length to $T = 384$ and cap the maximal
polyphony by $N_t \leq 16$, clipping excess notes per time step. A
batch size of 48 is used for pretraining. We train the model for
2,000,000 iterations using AdamW [45] with $\beta = (0.9, 0.999)$
and weight decay 0.01. We use a OneCycleLR [46] scheduler with a
maximum LR $10^{-4}$ and 10,000 warm-up steps. Pretraining takes
around 12 days on 4×A100 (40GB) GPUs.
4.2 Downstream Tasks
We evaluate the adapter on different music generation and
understanding tasks. Specifically, we have 3 sets of tasks:
• Melody to chord and chord to melody: we fine-tune the model on
the Nottingham dataset [47] with a total of 1,020 songs. The model
is asked to generate chords from a given melody or to generate a
melody given a chord progression.
• Drum to others and others to drum: we fine-tune the model on a
subset of 31,000 songs in the Los Angeles dataset with a drum
track. The model is asked to generate the drum track given the
full score of non-percussive instruments, or to generate other
instruments given a drum track.
• Few-shot symbolic music analysis: we fine-tune the model on 93
songs in the RWC Pop dataset [48]. The model is asked to
transcribe the chords and metrical structure given a symbolic pop
song. We evaluate the results on symbolic chord recognition.
In each task, we perform a random 8:1:1 split for training,
validation, and testing. For the drum-to-others and
others-to-drum tasks, RWC Pop is used as an external test set.
4.3 Compared Models
We compare the performance of the following models, with slight
hyperparameter adjustments to ensure comparable numbers of
trainable parameters.
• FA-Cross: The base model is fine-tuned with a cross-attentive
adapter (4 heads, hidden size 256), inserted every two layers of
the global decoder. A LoRA with $r = 16$, $\alpha = 32$ is used on
the query and value projectors of both LMs.
• FA-Self: The base model fine-tuned with a self-attentive
adapter. A LoRA with $r = 64$, $\alpha = 128$ is used on the query
and value projectors of both LMs.
• Coco-Mulla: The Coco-Mulla [6] adapter applied on the RoFormer
model. The adapter has a trainable positional encoding size of
384.
Figure 5. Subjective evaluation results (ratings from 0 to 5 for
Musicality, Adherence and Creativity). (a) Chord-conditioned
melody generation (Chord to Melody): FA-Self, FA-Cross, Enc-Dec,
MelodyT5, Coco-Mulla, Ground Truth. (b) Drum-conditioned song
generation (Drum to Others): FA-Self, FA-Cross, Enc-Dec,
Coco-Mulla, Ground Truth. (c) Song-conditioned drum track
generation (Others to Drum): FA-Self, FA-Cross, Enc-Dec,
Assistant, Coco-Mulla, Ground Truth. The error bars show the 95%
confidence intervals of the true mean.
• Prober: A 2-layer Multilayer Perceptron (MLP) prober as used in
[2]. The MLP layer uses a weighted sum of all layers' hidden
states and has a hidden dimension of 768.
• Enc-Dec: A baseline trained from scratch with a small RoFormer
encoder-decoder (3 layers, hidden size 256, intermediate size 512,
4 heads for both encoder and decoder).
• MelodyT5 [49]: an external baseline for the melody to chord and
chord to melody tasks. The model is trained on 261K songs
represented by ABC notations. We do not retrain this baseline.
• Assistant (Composer's Assistant V2) [50]: an external baseline
for the others to drum task. We do not retrain the baseline.
All modules are trained for up to 60,000 iterations with a fixed
learning rate of $10^{-4}$ and a batch size of 8 on a single A100
GPU. Early stopping is applied if validation loss does not improve
for 10 rounds (5,000 iterations).
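The early-stopping rule can be sketched as a patience check over validation rounds. Ten rounds spanning 5,000 iterations suggests one validation pass every 500 iterations; that interval, the function name, and the choice to count a tie as "no improvement" are assumptions of this sketch rather than details given in the text.

```python
def should_stop(val_losses, patience=10):
    """Return True when the last `patience` rounds brought no improvement.

    val_losses: validation losses in chronological order, one per round
    (assumed here to be one round every 500 training iterations).
    """
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent >= best_before


plateau = [1.0, 0.9] + [0.95] * 10      # no improvement for 10 rounds
still_improving = [1.0, 0.9] + [0.95] * 9
```

In the first history the best loss (0.9) is more than ten rounds old, so training would stop; in the second it is still inside the patience window.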
4.4 Subjective Evaluation
For the three generative tasks (chord-to-melody, drum-to-others,
and others-to-drums), we conducted a subjective evaluation via a
user survey. We selected 8 songs from the test set (2 for
chord-to-melody, 4 for drum-to-others, and 2 for
others-to-drums). We asked participants to rate both the generated
outputs and ground truth on a 5-point scale across the following
metrics:
• Musicality: Does it sound good as music?
• Adherence: Does it respect and follow the input condition?
• Creativity: Given the input conditions, is it creative in its
musical decisions?
We received a total of 65 answers, and the results are shown in
Fig. 5.

               Chord to melody   Melody to chord   Drum to others    Others to drum
FA-Cross       1.4204 ± 0.0992   1.4177 ± 0.1048   2.0459 ± 0.5629   1.8619 ± 0.5665
FA-Self        1.4116 ± 0.1172   1.4104 ± 0.1000   2.0222 ± 0.6358   1.8402 ± 0.5709
Coco-Mulla     1.8016 ± 0.1711   1.5996 ± 0.1445   2.2027 ± 0.6532   1.9860 ± 0.6857
Enc-Dec        1.6113 ± 0.1790   1.5067 ± 0.1208   2.5830 ± 0.9146   1.8765 ± 0.5382
Ground Truth   1.3917 ± 0.0988   1.3917 ± 0.0988   2.0730 ± 0.7158   2.0730 ± 0.7158

Table 1. Test set perplexity on different downstream tasks.
4.5 Objective Evaluation
For the generative tasks by fine-tuned models, we report the
generated results' perplexity on the RoFormer base model on the
test set. Since perplexity is inaccurate on long repetitive
generations [51], we only calculate the perplexity using 8-bar
generative results (128 steps) conditioned on 2-bar prompts (32
steps). The results are shown in Tab. 1.
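For reference, perplexity over a generated excerpt is the exponential of the negative mean token log-likelihood under the scoring model. The sketch below uses the standard definition and assumes natural-log probabilities have already been obtained from the base model; the 16-steps-per-bar constant follows from the 16th-note time step and the stated 128-step and 32-step windows, and the names are hypothetical.

```python
import math

STEPS_PER_BAR = 16                        # 16th-note steps per bar, as implied by 8 bars = 128 steps
GENERATION_STEPS = 8 * STEPS_PER_BAR      # 128 steps scored
PROMPT_STEPS = 2 * STEPS_PER_BAR          # 32 steps of prompt, not scored


def perplexity(token_log_probs):
    """exp(-mean log p) over the scored tokens (natural logarithm)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.5 to every token has perplexity 2.
uniform_half = perplexity([math.log(0.5)] * 4)
```

Lower values mean the base model finds the generated continuation more predictable, which is why the ground-truth row in Tab. 1 serves as a reference point rather than a bound.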
For the melody-to-chord task, we report two additional metrics to
compare with MelodyT5. We first calculate the L1 distance between
the chromagram (chroma) of the predicted chords and the
ground-truth chords. We also report the CTnCTR [52] metric between
the melody and the generated chords. Since the test part of the
Nottingham dataset has significant overlap with MelodyT5's
training set, we perform a small pitch shift (up to 2 semitones)
of all test songs to another commonly used key in the Nottingham
dataset (e.g., C major to D major, A major to G major, etc.). The
results are shown in Tab. 2.
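One plausible way to compute the chroma distance is shown below: each time step's chord is reduced to a 12-dimensional binary pitch-class vector, and the L1 distance between prediction and reference is averaged over time steps. The binary chroma and the per-step averaging are assumptions of this sketch; the text does not specify the exact normalization.

```python
def chroma(pitches):
    """12-dimensional binary pitch-class vector for a set of MIDI pitches."""
    vec = [0] * 12
    for pitch in pitches:
        vec[pitch % 12] = 1
    return vec


def mean_chroma_l1(pred_steps, ref_steps):
    """Average L1 distance between per-step chroma vectors.

    pred_steps, ref_steps: equal-length lists; each element is the list of
    MIDI pitches sounding in the chord track at that time step.
    """
    if len(pred_steps) != len(ref_steps) or not ref_steps:
        raise ValueError("sequences must be non-empty and equally long")
    total = 0
    for pred, ref in zip(pred_steps, ref_steps):
        total += sum(abs(a - b) for a, b in zip(chroma(pred), chroma(ref)))
    return total / len(ref_steps)


c_major, a_minor = [48, 52, 55], [45, 48, 52]
distance = mean_chroma_l1([c_major, c_major], [c_major, a_minor])
```

Here C major and A minor differ in two pitch classes (G versus A), so the second step contributes 2 and the two-step average is 1.0.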
For the music analysis task, we represent both the chord and the
metrical labels by MIDI notes. The chord notes are represented by
block notes using String Ensemble 1 (MIDI program 48). The bass
note is placed in the range C3 to B3 (MIDI pitch 36-41), and other
chord notes are stacked above it. We use a drum track to represent
metrical labels. We use a bass drum note (MIDI pitch 35) to
represent a downbeat and a snare drum note (MIDI pitch 38) for
subsidiary strong beats. An 8-note infilling by closed hi-hat note
(MIDI pitch 42) is also added.
For sequence-to-sequence modeling, the model predicts both tracks
from the full MIDI input, and final chord labels are derived via
template matching on the average of 16 generations. The exception
is the prober, trained as a 25-class classifier (12 major, 12
minor, 1 no-chord). We evaluate using chord metrics (root, majmin,
seventh) from the mir_eval package [53]. Results are shown in
Table 3.

                 Chroma ↓          CTnCTR ↑
Ground Truth     0.0000 ± 0.0000   0.9675 ± 0.0324
FA-Cross         1.5690 ± 0.7087   0.9113 ± 0.0750
FA-Self          1.2685 ± 0.5024   0.9484 ± 0.0415
Coco-Mulla [6]   3.4613 ± 0.5854   0.6647 ± 0.1219
Enc-Dec          3.0044 ± 0.5613   0.8387 ± 0.0749
MelodyT5 [54]    3.0428 ± 0.8694   0.8463 ± 0.1036

Table 2. Objective evaluation results on unprompted melody to
chord generation on the test split of the Nottingham dataset.

Model          Root ↑   Majmin ↑   Seventh ↑
Chorder [39]   0.7244   0.6760     0.3374
HMM [55,56]    0.8386   0.8169     0.6930
FA-Cross       0.8203   0.8455     0.6761
FA-Self        0.8275   0.8693     0.6986
Prober         0.8231   0.8370     0.6191
Enc-Dec        0.1786   0.1500     0.0378

Table 3. Evaluation results on symbolic chord recognition. The
table shows the median result among the test split of the RWC Pop
dataset.
4.6 Evaluation Results
In this subsection, we analyze the results of each downstream
task.
4.6.1 Few-shot Symbolic Music Analysis
With only 74 training songs, our adapters outperform rule-based
baselines on both majmin and seventh categories. By comparing
function alignment models (FA) with the prober, we see that using
a pretrained LM for the target sequence $\mathbf{y}$
(chord+drums) improves performance on the music understanding
task.
Between the function alignment models, the self-attentive
adapters achieve better performance than the cross-attentive
implementation. This trend is also observed in other tasks.
4.6.2 Chord to Melody
The results in subjective evaluation (Fig. 5a) show that our
proposed adapters (FA-Self, FA-Cross) achieve performance
comparable to MelodyT5. Coco-Mulla is not effective on the task,
achieving even lower performance than the Enc-Dec model. This is
also demonstrated in objective evaluation results (Tab. 1).
4.6.3 Melody to Chord
Both the perplexity results (Tab. 1) and the chord consistency
results (Tab. 2) demonstrate the effectiveness of our models,
especially the self-attentive adapters. We note that MelodyT5
shows low chroma consistency. MelodyT5 often fails to generate
music that meets the constraints of the condition melody (e.g.,
replaced by an improvised melody or inconsistent structures). This
results in a misalignment between the generated chords and the
ground truth.

Figure 6. Case study of an others-to-drum example on RWC-Pop-003.
The top displays the non-drum condition inputs with a piano roll
(structure labels are shown for reference but not used by the
model). The bottom shows the drum track by (a) FA-Cross;
(b) FA-Self; (c) Ground truth.
4.6.4 Drum to Others
Compared to other tasks, the drum-to-others task aims to model a
highly complicated $\mathbf{y}$ (output) sequence, since
$\mathbf{y}$ contains the information of the full-band
arrangement. In this category, Coco-Mulla outperforms the Enc-Dec
model, showing the usefulness of the pretrained knowledge from
$\text{LM}(\mathbf{y})$. However, Coco-Mulla does not utilize the
knowledge from $\text{LM}(\mathbf{x})$, leading to a worse
performance compared to the proposed adapters.
4.6.5 Others to Drum
The others-to-drum task yields an interesting result: our models
outperform even the ground truth both subjectively (Fig. 5(c)) and
objectively (Tab. 1). This is likely because RWC-Pop uses a
limited drum set and regular patterns, while our training data
(Los Angeles MIDI) includes diverse textures and instruments
(e.g., Cuica). Our models generate rich, varied drum patterns
aligned with long-term structure, showing strong creativity and
musicality (see Fig. 6 for an example). The baseline model
Composer's Assistant V2 [50] also produces less variation.
5. CONCLUSION AND FUTURE WORKS
In this paper, we address the problem of versatile
music-for-music modeling that unifies a broad range of music
understanding and controllable generation tasks. Inspired by
function alignment, we adopt a parameter-efficient approach by
knowledge transfer from the pretrained LM of both the input and
the output sequence. We introduce two implementations, the
cross-attentive adapter and the self-attentive adapter. Both
adapters show competitive results on analysis and generation
tasks, with the self-attentive adapters performing relatively
better.
We see two main directions for future work. First, we plan to
refine the data representation to support more music-for-music
tasks. Second, we plan to extend the framework to cross-modal
adapters, such as text-to-music tasks.
6. REFERENCES
[1] C. Donahue, J. Thickstun, and P. Liang, “Melody transcription
via generative pre-training,” arXiv preprint arXiv:2212.01884,
2022.
[2] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao,
C. Lin, A. Ragni, E. Benetos et al., “MERT: Acoustic music
understanding model with large-scale self-supervised training,”
arXiv preprint arXiv:2306.00107, 2023.
[3] D. Li, Y. Ma, W. Wei, Q. Kong, Y. Wu, M. Che, F. Xia,
E. Benetos, and W. Li, “Mertech: Instrument playing technique
detection using self-supervised pretrained model with multi-task
finetuning,” in ICASSP 2024-2024 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024,
pp. 521–525.
[4] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous
prompts for generation,” arXiv preprint arXiv:2101.00190, 2021.
[5] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li,
P. Gao, and Y. Qiao, “Llama-adapter: Efficient fine-tuning of
language models with zero-init attention,” arXiv preprint
arXiv:2303.16199, 2023.
[6] L. Lin, G. Xia, J. Jiang, and Y. Zhang, “Content-based
controls for music large language modeling,” arXiv preprint
arXiv:2310.17162, 2023.
[7] G. G. Xia, “Function alignment: A new theory of mind and
intelligence, part I: Foundations,” 2025. [Online]. Available:
https://arxiv.org/abs/2503.21106
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you
need,” Advances in neural information processing systems,
vol. 30, 2017.
[9] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and
I. Sutskever, “Jukebox: A generative model for music,” arXiv
preprint arXiv:2005.00341, 2020.
[10] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve,
Y. Adi, and A. Défossez, “Simple and controllable music
generation,” Advances in Neural Information Processing Systems,
vol. 36, pp. 47 704–47 720, 2023.
[11] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti,
A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi
et al., “MusicLM: Generating music from text,” arXiv preprint
arXiv:2301.11325, 2023.
[12] C. Zhang, Y. Ma, Q. Chen, W. Wang, S. Zhao, Z. Pan, H. Wang,
C. Ni, T. H. Nguyen, K. Zhou et al., “Inspiremusic: Integrating
super resolution and large language model for high-fidelity
long-form music generation,” arXiv preprint arXiv:2503.00084,
2025.
[13] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. M. Shazeer,
I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu,
and D. Eck, “Music transformer: Generating music with long-term
structure,” in International Conference on Learning
Representations, 2018.
[14] P. Lu, X. Xu, C. Kang, B. Yu, C. Xing, X. Tan, and J. Bian,
“Musecoco: Generating symbolic music from text,” arXiv preprint
arXiv:2306.00110, 2023.
[15] Y.-S. Huang and Y.-H. Yang, “Pop music transformer:
Beat-based modeling and generation of expressive pop piano
compositions,” in Proceedings of the 28th ACM International
Conference on Multimedia, 2020, pp. 1180–1188.
[16] D.-V.-T. Le and Y.-H. Yang, “Meteor: Melody-aware
texture-controllable symbolic orchestral music generation,” arXiv
preprint arXiv:2409.11753, 2024.
[17] H. Chen, J. B. L. Smith, J. Spijkervet, J. Wang, P. Zou,
B. Li, Q. Kong, and X. Du, “Sympac: Scalable symbolic music
generation with prompts and constraints,” in Proceedings of the
25th International Society for Music Information Retrieval
Conference, 2024, pp. 1029–1036.
[18] Z. Zhang, L. Li, J. Zhang, Z. Hu, H. Wang, C. Yan, J. Yang,
and Y. Qi, “Generating high-quality symbolic music using
fine-grained discriminators,” in International Conference on
Pattern Recognition. Springer, 2025, pp. 332–344.
[19] Y.-J. Shih, S.-L. Wu, F. Zalkow, M. Müller, and Y.-H. Yang,
“Theme transformer: Symbolic music generation with
theme-conditioned transformer,” IEEE Transactions on Multimedia,
vol. 25, pp. 3495–3508, 2022.
[20] X. Qu, Y. Bai, Y. Ma, Z. Zhou, K. M. Lo, J. Liu, R. Yuan,
L. Min, X. Liu, T. Zhang et al., “MuPT: A generative symbolic
music pretrained transformer,” arXiv preprint arXiv:2404.06393,
2024.
[21] Y. Shu, H. Xu, Z. Zhou, A. v. d. Hengel, and L. Liu,
“MuseBarControl: Enhancing fine-grained control in symbolic music
generation through pre-training and counterfactual loss,” arXiv
preprint arXiv:2407.04331, 2024.
[22] R. Yuan, H. Lin, Y. Wang, Z. Tian, S. Wu, T. Shen, G. Zhang,
Y. Wu, C. Liu, Z. Zhou et al., “Chatmusician: Understanding and
generating music intrinsically with llm,” arXiv preprint
arXiv:2402.16153, 2024.
[23] H. F. García, P. Seetharaman, R. Kumar, and B. Pardo,
“VampNet: Music generation via masked acoustic token modeling,”
in Proceedings of the 24th International Society for Music
Information Retrieval Conference, 2023, pp. 359–366.
[24] K. Chen, Y. Wu, H. Liu, M. Nezhu ina, T. Be g-
Ki kpa ick, and S. Dubno , “Musicldm: Enhanc-
ing no el y in ex - o-music gene a ion using bea -
synch onous mixup s a egies,” in ICASSP 2024-2024
IEEE In e na ional Con e ence on Acous ics, Speech
and Signal P ocessing (ICASSP). IEEE, 2024, pp.
1206–1210.
[25] S.-L. Wu, C. Donahue, S. Wa anabe, and N. J. B yan,
“Music con olne : Mul iple ime- a ying con ols o
music gene a ion,” IEEE/ACM T ansac ions on Audio,
Speech, and Language P ocessing, ol. 32, pp. 2692–
2703, 2024.
[26] M. W. Lam, Q. Tian, T. Li, Z. Yin, S. Feng, M. Tu,
Y. Ji, R. Xia, M. Ma, X. Song et al., “Efficient neural
music generation,” Advances in Neural Information
Processing Systems, vol. 36, pp. 17 450–17 463, 2023.
[27] F. Schneider, O. Kamal, Z. Jin, and B. Schölkopf,
“Moûsai: Efficient text-to-music diffusion models,” in
Proceedings of the 62nd Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), 2024, pp. 8050–8068.
[28] S. Hou, S. Liu, R. Yuan, W. Xue, Y. Shan, M. Zhao,
and C. Zhang, “Editing music with melody and text:
Using controlnet for diffusion transformer,” in ICASSP
2025-2025 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE,
2025, pp. 1–5.
[29] Z. Wang, L. Min, and G. Xia, “Whole-song hierarchical
generation of symbolic music using cascaded diffusion
models,” in The Twelfth International Conference
on Learning Representations, 2024.
[30] O. Tal, A. Ziv, I. Gat, F. Kreuk, and Y. Adi,
“Joint audio and symbolic conditioning for temporally
controlled text-to-music generation,” arXiv preprint
arXiv:2406.10970, 2024.
[31] X. Liu, K. Ji, Y. Fu, W. Tam, Z. Du, Z. Yang, and
J. Tang, “P-tuning: Prompt tuning can be comparable
to fine-tuning across scales and tasks,” in Proceedings
of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), 2022,
pp. 61–68.
[32] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li,
S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank
adaptation of large language models.” The Tenth International
Conference on Learning Representations,
vol. 1, no. 2, p. 3, 2022.
[33] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou,
W. Zhang, P. Lu, C. He, X. Yue et al., “Llama-adapter
v2: Parameter-efficient visual instruction model,”
arXiv preprint arXiv:2304.15010, 2023.
[34] Y. Lan, W. Hsiao, H. Cheng, and Y. Yang, “Musicongen:
Rhythm and chord control for transformer-based
text-to-music generation,” in Proceedings of the 25th
International Society for Music Information Retrieval
Conference, 2024, pp. 311–318.
[35] L. Lin, G. Xia, Y. Zhang, and J. Jiang, “Arrange, inpaint,
and refine: Steerable long-term music audio generation
and editing via content-based controls,” arXiv
preprint arXiv:2402.09508, 2024.
[36] Y. Zhang, Y. Ikemiya, W. Choi, N. Murata, M. A.
Martínez-Ramírez, L. Lin, G. Xia, W.-H. Liao, Y. Mitsufuji,
and S. Dixon, “Instruct-MusicGen: Unlocking
text-to-music editing for music language models via
instruction tuning,” arXiv preprint arXiv:2405.18386,
2024.
[37] F. Tsai, S. Wu, H. Kim, B. Chen, H. Cheng, and
Y. Yang, “Audio prompt adapter: Unleashing music
editing abilities for text-to-music with lightweight finetuning,”
in Proceedings of the 25th International Society
for Music Information Retrieval Conference, 2024,
pp. 634–641.
[38] L. Ou, J. Zhao, Z. Wang, G. Xia, and Y. Wang, “Unlocking
potential in pre-trained music language models
for versatile multi-track music arrangement,” arXiv
preprint arXiv:2408.15176, 2024.
[39] W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang,
“Compound word transformer: Learning to compose
full-song music over dynamic directed hypergraphs,”
in Proceedings of the AAAI Conference on Artificial Intelligence,
vol. 35, no. 1, 2021, pp. 178–186.
[40] Z. Wang, Y. Zhang, Y. Zhang, J. Jiang, R. Yang,
J. Zhao, and G. Xia, “Pianotree VAE: Structured
representation learning for polyphonic music,” arXiv
preprint arXiv:2008.07118, 2020.
[41] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen,
and Y. Liu, “Roformer: Enhanced transformer with
rotary position embedding,” 2023. [Online]. Available:
https://arxiv.org/abs/2104.09864
[42] R. Bansal, B. Samanta, S. Dalmia, N. Gupta,
S. Vashishth, S. Ganapathy, A. Bapna, P. Jain,
and P. Talukdar, “Llm augmented llms: Expanding
capabilities through composition,” arXiv preprint
arXiv:2401.02412, 2024.
[43] V. Zayats, P. Chen, M. Ferrari, and D. Padfield, “Zipper:
A multi-tower decoder architecture for fusing
modalities,” arXiv preprint arXiv:2405.18669, 2024.
[44] A. Lev, “Los Angeles MIDI dataset: SOTA kilo-scale
MIDI dataset for MIR and music AI purposes,” in
GitHub, 2024. [Online]. Available:
https://github.com/asigalov61/Los-Angeles-MIDI-Dataset
[45] I. Loshchilov and F. Hutter, “Decoupled weight decay
regularization,” arXiv preprint arXiv:1711.05101,
2017.
[46] L. N. Smith and N. Topin, “Super-convergence: Very
fast training of neural networks using large learning
rates,” in Artificial intelligence and machine learning
for multi-domain operations applications, vol. 11006.
SPIE, 2019, pp. 369–386.
[47] “Nottingham database,”
http://ifdo.ca/~seymour/nottingham/nottingham.html, accessed: 2025-03-26.
[48] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka,
“RWC music database: Popular, classical and jazz music
databases.” in ISMIR 2002, 3rd International Conference
on Music Information Retrieval, vol. 2, 2002,
pp. 287–288.
[49] S. Wu, Y. Wang, X. Li, F. Yu, and M. Sun, “Melodyt5:
A unified score-to-score transformer for symbolic
music processing,” arXiv preprint arXiv:2407.02277,
2024.
[50] M. Malandro, “Composer’s Assistant 2: Interactive
Multi-Track MIDI Infilling with Fine-Grained User
Control,” in Proc. 25th Int. Society for Music Information
Retrieval Conf., San Francisco, CA, USA, 2024,
pp. 438–445.
[51] Y. Wang, J. Deng, A. Sun, and X. Meng, “Perplexity
from plm is unreliable for evaluating text quality,”
arXiv preprint arXiv:2210.05892, 2022.
[52] Y.-C. Yeh, W.-Y. Hsiao, S. Fukayama, T. Kitahara,
B. Genchel, H.-M. Liu, H.-W. Dong, Y. Chen,
T. Leong, and Y.-H. Yang, “Automatic melody harmonization
with triad chords: A comparative study,” Journal
of New Music Research, vol. 50, no. 1, pp. 37–51,
2021.
[53] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon,
O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel,
“Mir_eval: A transparent implementation of common
mir metrics.” in Proceedings of the 15th International
Society for Music Information Retrieval Conference,
vol. 10, 2014, p. 2014.
[54] S. Wu, Y. Wang, X. Li, F. Yu, and M. Sun, “Melodyt5:
A unified score-to-score transformer for symbolic
music processing,” arXiv preprint arXiv:2407.02277,
2024.
[55] Z. Wang, K. Chen, J. Jiang, Y. Zhang, M. Xu, S. Dai,
X. Gu, and G. Xia, “Pop909: A pop-song dataset
for music arrangement generation,” arXiv preprint
arXiv:2008.07142, 2020.
[56] J. Jiang, “MIDI Chord Recognition via Bar-Level
Modeling,” https://github.com/music-x-lab/midi-chord-recognition,
2025, accessed: 2025-06-27.