Versatile Music-for-Music Modeling via Function Alignment

Author: Junyan Jiang; Daniel Chin; Xuanjie Liu; Liwei Lin; Gus Xia

Publisher: Zenodo

DOI: 10.5281/zenodo.17706521

Source: https://zenodo.org/records/17706521/files/000066.pdf

VERSATILE SYMBOLIC MUSIC-FOR-MUSIC MODELING VIA
FUNCTION ALIGNMENT
Junyan Jiang (1,2), Daniel Chin (1,2), Liwei Lin (1,2), Xuanjie Liu (2), Gus Xia (1,2)
(1) NYU Shanghai, (2) Music X Lab, MBZUAI
{jj2731, daniel.chin, ll4270, gxia}@nyu.edu, [email protected]
ABSTRACT
Many music AI models learn a map between music content and human-defined labels. However, many annotations, such as chords, can be naturally expressed within the music modality itself, e.g., as sequences of symbolic notes. This observation enables both understanding tasks (e.g., chord recognition) and conditional generation tasks (e.g., chord-conditioned melody generation) to be unified under a music-for-music sequence modeling paradigm. In this work, we propose parameter-efficient solutions for a variety of symbolic music-for-music tasks. The high-level idea is that (1) we utilize a pretrained Language Model (LM) for both the reference and the target sequence and (2) we link these two LMs via a lightweight adapter. Experiments show that our method achieves superior performance among different tasks such as chord recognition, melody generation, and drum track generation. All demos, code and model weights are publicly available¹.
1. INTRODUCTION
Many foundational tasks in music AI, such as music information retrieval (MIR) and conditional music generation, have traditionally been formulated as mappings between music and labels: either from music to task-specific annotations (e.g., chord recognition), or from descriptive conditions to music (e.g., chord-conditioned melody generation). While these tasks have long been treated separately, a key observation is that in many cases, the “labels” themselves can also be represented in the same music modality, for example, as note sequences. This suggests a unifying perspective: a wide range of MIR and generation tasks can be reformulated as sequence-to-sequence problems within the music domain. We refer to this formulation as music-for-music modeling.
To achieve versatile music-for-music modeling in a sample-efficient way, we apply knowledge transfer to pretrained foundational Language Models (LMs) using a lightly parameterized adapter. As illustrated in Fig. 1(a)-(b), many existing methods such as probing [1–3] and prefix tuning [4–6] transfer knowledge of foundation models to downstream tasks by adapting them to new input or output, but the knowledge resides in only one language: either the LM of source $\mathbf{x}$ or the LM of target $\mathbf{y}$. In contrast, our method distills knowledge from both LMs via aligning them in a layer-wise manner, as shown in Fig. 1(c).

¹ https://github.com/music-x-lab/midi-function-alignment

© J. Jiang, D. Chin, L. Lin, X. Liu and G. Xia. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J. Jiang, D. Chin, L. Lin, X. Liu and G. Xia, “Versatile Symbolic Music-for-Music Modeling via Function Alignment”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

Figure 1. Three types of sequence-to-sequence models by knowledge transfer from pretrained LMs. $\mathbf{x}$ and $\mathbf{y}$ are input sequences and $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ are predictions, possibly right-shifted due to the autoregressive targets. (a) Probing; (b) Prefix tuning; (c) Function alignment ($\mathbf{x} \to \mathbf{y}$).
At the methodology level, our approach is inspired by function alignment [7], a recently proposed theory of mind that attributes the emergence of intelligence to the dynamic synergy among interacting agents, i.e., Language Models (LMs). In our work, we contribute two concrete implementations of this idea by creating synergy between two LMs through Parameter-Efficient Fine-Tuning (PEFT). The first approach introduces a trainable cross-attention layer between two separately pretrained LMs. The second, more concise solution uses a lightweight self-attentive adapter applied to concatenated input-output sequences within a single shared LM, a strategy applicable when both input and output share the same vocabulary. We show the effectiveness of both implementations using experiments on both generative and analysis tasks, including: (1) chord-conditioned melody generation, (2) melody-conditioned chord generation, (3) drum-conditioned song generation, (4) song-conditioned drum generation and (5) few-shot symbolic music analysis.
The main contributions of this paper are as follows:
1. We achieve versatile music-for-music modeling, unifying a broad range of music understanding and controllable generation tasks under a shared framework.
2. At the methodological level, we bring the novel concept of function alignment, a recently proposed theory of mind that emphasizes synergy among agents, into the domain of music AI, offering a fresh perspective on sequence-to-sequence tasks.
3. While the original position paper on function alignment remains at a conceptual level, our work takes a significant step forward by introducing two concrete, parameter-efficient implementations in the context of modern language models: one via cross-attentive adapters across two LMs, and another via a self-attentive adapter within a shared LM. We demonstrate the effectiveness of both approaches through theoretical analysis and empirical validation.
2. RELATED WORKS
2.1 Music Foundation Models
Since the invention of the Transformer architecture [8], transformer-based language models have become the mainstream of music foundation models on multiple modalities, including audio [9–12], symbolic [13–21] and text-based music representation [22]. In addition to autoregressive models, masked language models [2, 23], diffusion models [24–29] and flow-based models [30] can also be used as foundation models, but we focus on autoregressive models in the literature review.
For symbolic music, the music transformer [13] is an early work to adopt the transformer architecture for music. Some follow-up works try to design a better representation of the music content. For example, pop music transformer imposes a metrical structure in the data representation [15]. MuPT trains transformers on their proposed synchronized multi-track ABC notation [20]. Other works aim to introduce controllability to the generative model. MuseCoco generates the music score from text [14]. METEOR performs melody-aware orchestral music generation with texture control [16]. SymPAC trains symbolic generation models from transcribed audio data with chord, section, and instrument controls [17]. Zhang et al. improve generation discriminators to better follow rhythm and melody conditions [18]. The Theme Transformer [19] uses a short theme condition for generation. MuseBarControl generates music with fine-grained control to the bar level [21].
2.2 Parameter-Efficient Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) methods add lightly parameterized adapters to large pretrained models. Compared to full-parameter fine-tuning, PEFT requires significantly less computation and training data. Existing methods include appending task-specific prefixes to input sequences [4,31], injecting low-rank adaptation (LoRA) to linear layers [32], and adding learnable hidden states to the self-attention blocks [5,33].
PEFT has been applied to music foundation models to support new tasks. Coco-Mulla [6] and MusiConGen [34] both adapt MusicGen to follow content controls such as chord and rhythm. Additionally, AirGen enables MusicGen to infill segments based on content controls [35]. Instruct-MusicGen extends MusicGen to music editing by text instructions [36]. Audio Prompt Adapter extends AudioLDM2 to music editing following controls such as genre, timbre, and melody [37]. Ou et al. tune a symbolic language model for tasks like band arrangement, piano reduction, drum arrangement and voice separation [38].

Figure 2. The architecture of the foundation model. The left side shows the global decoder. The right side shows the encoding of a single time step $\mathbf{x}_3 = \{i_3^1, n_3^1, \text{[eos]}\}$ and the decoding of the next step $\mathbf{x}_4 = \{i_4^1, n_4^1, i_4^2, n_4^2, \text{[eos]}\}$.
3. METHODOLOGY
3.1 Base Model
For this study, we choose the base model (the pretrained symbolic LM) with two main considerations. First, we do not wish to introduce any control in the pretraining stage, since we want to demonstrate the controllability using PEFT. We refrain from using any annotation or metadata (i.e., chord, bar or text annotations) to pretrain the base model. Second, we want to adopt a data representation that can help the model align multiple sequences in time easily. Instead of using a MIDI event-like representation [13, 15, 39] where two time-aligned sequences might have a significant length difference, we use a fixed time step (a 16th note unit) for the input sequence.
Since multiple notes can occur at the same time step, we use a hierarchical scheme to compress (decompress) the note lists on the same time step with a local encoder (decoder), as shown in Fig. 2.
3.1.1 Data Representation
Formally, we represent a score sequence $\mathbf{x} = \{\mathbf{x}_1, \ldots, \mathbf{x}_T\}$ with a fixed time step of a 16th note. Since each time step may contain multiple note onsets, each $\mathbf{x}_t$ represents a list of $N_t$ notes whose quantized onset time is the $t$-th 16th note (i.e., $\mathbf{x}_t$ is a simu-note [40] at time step $t$). We define

$$\mathbf{x}_t = \{i_t^1, n_t^1, i_t^2, n_t^2, \ldots, i_t^{N_t}, n_t^{N_t}, \text{[eos]}\} \qquad (1)$$

where $i_t^k \in \{0, \ldots, 128\}$ is the instrument ID of the $k$-th note. We use the MIDI program number 0...127 for pitched instruments and $i_t^k = 128$ for drums. $n_t^k = 24 p_t^k + d_t^k$ is a flattened representation of the $k$-th note's pitch $p_t^k \in \{0, \ldots, 127\}$ and duration $d_t^k \in \{0, \ldots, 23\}$. $p_t^k$ denotes the MIDI pitch from 0 to 127. $d_t^k \in \{0, \ldots, 23\}$ is the note duration quantized into 24 possible bins; $d_t^k = j$ corresponds to a duration of $b_j$ sixteenth notes where $b = [1, 2, 3, 4, 6, 8, 12, 16, 24, \ldots, 4096]$. [eos] is a special token marking the end of the list. All notes in $\mathbf{x}_t$ are sorted primarily by $i_t^k$ and secondarily by $n_t^k$.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 3. The architecture of a cross-attentive function alignment adapter. The fire icon denotes trainable parameters, and the snowflake icon denotes frozen parameters.
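The token layout of Eq. 1 can be illustrated with a minimal sketch. The function names (`flatten_note`, `unflatten_note`, `encode_time_step`) and the string used for the [eos] token are hypothetical choices made for this example and are not taken from the released code. The duration-bin table $b$ is deliberately not reproduced, since only part of it is listed in the text; the sketch works with bin indices $d \in \{0, \ldots, 23\}$ directly.

```python
PITCH_COUNT = 128       # MIDI pitches 0..127
DURATION_BINS = 24      # quantized duration bin indices 0..23
DRUM_INSTRUMENT = 128   # instrument ID reserved for drums
EOS = "[eos]"           # placeholder for the end-of-list token (illustrative)


def flatten_note(pitch, duration_bin):
    """Flatten (pitch, duration bin) into one token n = 24 * p + d."""
    if not 0 <= pitch < PITCH_COUNT:
        raise ValueError("pitch out of range")
    if not 0 <= duration_bin < DURATION_BINS:
        raise ValueError("duration bin out of range")
    return DURATION_BINS * pitch + duration_bin


def unflatten_note(n):
    """Recover (pitch, duration bin) from a flattened note token."""
    return divmod(n, DURATION_BINS)


def encode_time_step(notes):
    """Encode the notes starting at one 16th-note step as in Eq. 1.

    `notes` is an iterable of (instrument, pitch, duration_bin) tuples.
    Notes are sorted by instrument first and flattened note token second.
    """
    pairs = sorted((inst, flatten_note(p, d)) for inst, p, d in notes)
    tokens = []
    for inst, n in pairs:
        tokens.extend([inst, n])
    tokens.append(EOS)
    return tokens
```

For instance, a drum hit and a piano note on the same step are emitted with the piano note (instrument 0) first and the drum note (instrument 128) second, followed by the [eos] token.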
3.1.2 Model Design
We use RoFormer [41], a popular transformer architecture, as the backbone model. The model architecture is shown in Fig. 2. Since our input sequence contains nested lists, we first encode each $\mathbf{x}_t$ with a local RoFormer encoder:

$$[\mathbf{h}_t, \_] = \mathrm{LocalEncoder}(\text{[cls]}, \mathbf{x}_t) \qquad (2)$$

for all $t = 1 \ldots T$. Specifically, we prepend a [cls] token at the beginning of $\mathbf{x}_t$ and pass the sequence to the encoder. $\mathbf{h}_t$ is acquired from the output representation of the [cls] token. We then use a global RoFormer decoder to autoregressively model the symbolic score:

$$\hat{\mathbf{h}}_t = \mathrm{GlobalDecoder}(\mathbf{e}_{\mathrm{sos}}, \mathbf{h}_{1 \ldots t-1}) \qquad (3)$$

where $\mathbf{e}_{\mathrm{sos}}$ is a learnable start-of-sentence (sos) embedding. Finally, a local RoFormer decoder generates each note by

$$\hat{x}_{t,j} = \mathrm{LocalDecoder}(\hat{\mathbf{h}}_t, x_{t,1 \ldots j-1}) \qquad (4)$$

for all $t = 1 \ldots T$. Here, $x_{t,j}$ denotes the $j$-th token of list $\mathbf{x}_t$ (see Eqn. 1). The local decoder terminates when an end-of-sentence (eos) token is generated.
We will use $\hat{\mathbf{x}}_t = \mathrm{LM}(\mathbf{x}_{0 \ldots t-1})$ (or simply $\mathrm{LM}(\mathbf{x})$) as a shorthand for the autoregressive model of sequence $\mathbf{x}$ through Eqs. 2-4. Here, $\mathbf{x}_0$ denotes the global start-of-sentence embedding $\mathbf{e}_{\mathrm{sos}}$.
3.2 Parameter-Efficient Fine-Tuning
Our fine-tuning strategy leverages pretrained LMs of $\mathbf{x}$ and $\mathbf{y}$, connected via a parameter-efficient module. We present two variants: cross-attentive adapters for separate LMs, and self-attentive adapters for a shared LM. We apply both adapters to the backbone of the foundation model (the global decoder in Eqn. 3) only.

Figure 4. The architecture of a self-attentive function alignment adapter. Crossed vertical and horizontal arrows indicate the flow of information between the corresponding query and key/value pairs, while all other connections are masked by the autoregressive self-attention mechanism. The indices 0 through 4 represent the proposed positional embeddings of the concatenated sequence.
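3.2.1 Cross-attentive Function Alignment
Our first approach is to use a cross-attention layer between the hidden layers of two LMs. A similar architecture has been adopted in language processing [42] and speech processing [43]. We refer to the design of [42] and show an adapted version in Fig. 3. For the $l$-th attention layer of $\mathrm{LM}(\mathbf{y})$, the original self-attention is defined as:

$$\mathbf{h}_p^l = \mathrm{SelfAttn}(\mathbf{W}_q^l \mathbf{z}_y^l, \mathbf{W}_k^l \mathbf{z}_y^l, \mathbf{W}_v^l \mathbf{z}_y^l) \qquad (5)$$

where $\mathbf{z}_y^l$ denotes the $l$-th layer hidden states of $\mathrm{LM}(\mathbf{y})$ and $\mathbf{W}^l$ denotes pretrained weights. The adapted version can be written as:

$$\mathbf{h}_a^l = \mathbf{h}_p^l + g^l \cdot \mathrm{CrossAttn}(\mathbf{U}_q^l \mathbf{z}_y^l, \mathbf{U}_k^l \mathbf{z}_x^l, \mathbf{U}_v^l \mathbf{z}_x^l) \qquad (6)$$

where $g^l$ is a zero-initialized trainable gate scaler and $\mathbf{U}^l$ are trainable parameters. Intuitively, this allows the query from $\mathrm{LM}(\mathbf{y})$ to attend both to itself (self-attention) and to the condition from $\mathrm{LM}(\mathbf{x})$ (cross-attention).
Besides the trainable cross-attention module, we also apply LoRA [32] to all $\mathbf{W}_q^l$ and $\mathbf{W}_v^l$ of both pretrained models $\mathrm{LM}(\mathbf{x})$ and $\mathrm{LM}(\mathbf{y})$, allowing the model to learn distinctive features of sequences $\mathbf{x}$ and $\mathbf{y}$.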
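The role of the zero-initialized gate in Eq. 6 can be seen in a small single-head sketch in plain Python. It assumes the query, keys and values have already been projected (i.e., the products with $\mathbf{U}_q^l$, $\mathbf{U}_k^l$, $\mathbf{U}_v^l$ are passed in as arguments), and the function names are hypothetical. With a gate of 0 the adapted output equals the pretrained self-attention output, so fine-tuning starts from the unchanged behavior of $\mathrm{LM}(\mathbf{y})$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def gated_adapter_output(h_p, query_y, keys_x, values_x, gate):
    """Eq. 6 for one position: h_a = h_p + g * CrossAttn(q_y, K_x, V_x).

    h_p is the pretrained self-attention output of LM(y); the keys and
    values come from the hidden states of LM(x).
    """
    cross = attend(query_y, keys_x, values_x)
    return [hp + gate * c for hp, c in zip(h_p, cross)]
```

As training moves the gate away from zero, the cross-attention term is blended in gradually rather than perturbing the pretrained model from the first step.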
3.2.2 Self-attentive Function Alignment
When $\mathbf{x}$ and $\mathbf{y}$ share the same pretrained LM, alignment becomes a special case: it can be achieved by concatenating their sequences and feeding them into a single model. The LM will first model $\mathbf{x}$ and predict $\mathbf{y}$ given $\mathbf{x}$ as a prefix.
This implies that some prior PEFT methods [35, 38], which structure the condition and generated sequence within a single language model, can be viewed as broader forms of function alignment. We show that a simple configuration is also effective and explain why it realizes function alignment.
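When we directly concatenate two sequences $[\mathbf{x}, \mathbf{y}]$ and feed them to the decoder self-attention layer, we can decompose it into the self-attention of $\mathbf{x}$ and $\mathbf{y}$, and an extra component influencing $\mathbf{y}$ from $\mathbf{x}$, as shown in Fig. 4. Specifically, we have

$$[\mathbf{h}_{xa}^l, \mathbf{h}_{ya}^l] = \mathrm{SelfAttn}(\mathbf{W}_q^l [\mathbf{z}_x^l, \mathbf{z}_y^l], \mathbf{W}_k^l [\mathbf{z}_x^l, \mathbf{z}_y^l], \mathbf{W}_v^l [\mathbf{z}_x^l, \mathbf{z}_y^l]) = \mathrm{SelfAttn}([\mathbf{Q}_x^l, \mathbf{Q}_y^l], [\mathbf{K}_x^l, \mathbf{K}_y^l], [\mathbf{V}_x^l, \mathbf{V}_y^l]) \qquad (7)$$

for every layer $l$. In a single-head setting, we have $\mathrm{SelfAttn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) := \mathrm{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d} + \mathbf{M})\mathbf{V}$, where $d$ is the dimension of the key vectors and $\mathbf{M}$ is the autoregressive mask. We can rewrite Eqn. 7 by

$$\mathbf{h}_{xa}^l = \mathrm{SelfAttn}(\mathbf{Q}_x^l, \mathbf{K}_x^l, \mathbf{V}_x^l) \qquad (8)$$

$$\mathbf{h}_{ya}^l = \frac{a}{a+b}\,\mathrm{SelfAttn}(\mathbf{Q}_y^l, \mathbf{K}_y^l, \mathbf{V}_y^l) + \frac{b}{a+b}\,\mathrm{CrossAttn}(\mathbf{Q}_y^l, \mathbf{K}_x^l, \mathbf{V}_x^l) \qquad (9)$$

where $a = \sum_j \exp\left[(\mathbf{Q}_y^l \mathbf{K}_y^{l\top})_j / \sqrt{d} + \mathbf{M}\right]$ and $b = \sum_j \exp\left[(\mathbf{Q}_y^l \mathbf{K}_x^{l\top})_j / \sqrt{d}\right]$. Note that Eqn. 9 closely mirrors Eqn. 6 in form. Although the gating vectors $a$ and $b$ are not explicitly parameterized, we hypothesize that this design remains effective.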
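The decomposition in Eq. 9 can be checked numerically for a single query of $\mathbf{y}$. The sketch below is an illustration in plain Python, not the authors' implementation: it assumes a single head, treats `K_y` and `V_y` as the positions of $\mathbf{y}$ that the autoregressive mask leaves visible to the query, and uses hypothetical function names.

```python
import math


def scaled_scores(query, keys):
    """Scaled dot products q . k / sqrt(d) between one query and each key."""
    d = len(query)
    return [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in keys]


def attend(scores, values):
    """Softmax-weighted average of `values` given raw attention scores."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / z
            for i in range(dim)]


def concatenated_output(q_y, K_x, V_x, K_y, V_y):
    """Left-hand side: one softmax over the concatenated keys [K_x, K_y]."""
    return attend(scaled_scores(q_y, K_x + K_y), V_x + V_y)


def decomposed_output(q_y, K_x, V_x, K_y, V_y):
    """Right-hand side of Eq. 9: gated sum of self- and cross-attention."""
    s_x = scaled_scores(q_y, K_x)
    s_y = scaled_scores(q_y, K_y)
    a = sum(math.exp(s) for s in s_y)   # unnormalized mass on visible y keys
    b = sum(math.exp(s) for s in s_x)   # unnormalized mass on x keys
    self_part = attend(s_y, V_y)
    cross_part = attend(s_x, V_x)
    return [a / (a + b) * s + b / (a + b) * c
            for s, c in zip(self_part, cross_part)]
```

The two functions agree up to floating-point error, which shows that $a/(a+b)$ and $b/(a+b)$ act as input-dependent gates determined by the attention scores themselves.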
After concatenating $\mathbf{x}$ and $\mathbf{y}$, we reset $\mathbf{y}$'s positional embeddings to start from 0 to better preserve the pretrained behavior of $\mathrm{LM}(\mathbf{y})$. To avoid token indistinguishability due to overlapping positions, we also add zero-initialized, learnable sentence embeddings $\mathbf{e}_x$ and $\mathbf{e}_y$ to their respective positional encodings, as shown in Fig. 4.
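As a small illustration of the position reset, the sketch below builds the position indices and segment labels for a concatenated sequence and adds a per-segment embedding to each vector. The function names are hypothetical, and the purely additive treatment is a simplifying assumption for the example; how the sentence embeddings interact with RoFormer's positional encoding in the actual model is not specified here.

```python
def concatenated_positions(len_x, len_y):
    """Position indices restart at 0 for y; segments record the source."""
    positions = list(range(len_x)) + list(range(len_y))
    segments = ["x"] * len_x + ["y"] * len_y
    return positions, segments


def add_sentence_embeddings(vectors, segments, sentence_emb):
    """Add the segment's sentence embedding (e_x or e_y) to each vector."""
    return [[h + e for h, e in zip(vec, sentence_emb[seg])]
            for vec, seg in zip(vectors, segments)]
```

With two sequences of five steps each, the positions read 0 to 4 twice, matching the indices in Fig. 4. Because the sentence embeddings start at zero, the inputs seen by the pretrained LM are initially unchanged.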
Similar to cross-attentive adapters, a trainable LoRA module is also appended to the pretrained LM.
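For readers unfamiliar with LoRA [32], the following sketch shows the standard low-rank update $\mathbf{W}' = \mathbf{W} + (\alpha / r)\,\mathbf{B}\mathbf{A}$ on a frozen weight matrix, written with nested lists so that it has no dependencies. It illustrates the general technique only; the helper names are hypothetical and the matrices are toy values, not the model's weights.

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def lora_weight(W, A, B, r, alpha):
    """Effective weight W + (alpha / r) * B @ A.

    W is the frozen (out x in) matrix, B is (out x r) and A is (r x in).
    B is conventionally initialized to zero, so the update starts at zero.
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

Only `A` and `B` are trained, so the number of trainable parameters grows with the rank `r` rather than with the size of `W`.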
4. EXPERIMENTS
In the experiments, we first describe the hyperparameters and the pretraining scheme of our foundation model (Sec. 4.1). We evaluate our adapters on both generative and analysis tasks. We describe the tasks in Sec. 4.2 and models in Sec. 4.3. We then show the setting of subjective evaluation (Sec. 4.4) and objective evaluation (Sec. 4.5), and analyze the results in Sec. 4.6.
4.1 Model Pretraining
We use a RoFormer with a 12-layer global decoder (hidden size 768, intermediate size 3072, 12 heads). The local encoder and decoder are smaller 3-layer RoFormers (hidden size 768, intermediate size 768, 8 heads).
We pretrain our foundation model on the Los Angeles MIDI dataset [44], which contains approximately 405,000 MIDI files. As a score-based model, it relies on accurate beat annotations (inferred from tempo change events) for correct quantization. However, many files in the pretraining dataset contain incorrect tempo information.
To address this, we apply a rule-based filter. Normally, note onsets are not uniformly distributed across odd and even time steps. We compute the ratio of notes quantized to odd vs. even time steps. If the ratio falls within 0.5 ± 0.15 for every track, we assume it is poorly quantized and discard the song. This yields a cleaned subset of 357,279 files. During pretraining, we also apply a random pitch shift within [−5, 6] semitones for data augmentation.
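The filtering rule can be sketched as follows. The exact definition of the ratio is not spelled out in the text, so this example assumes it is the fraction of a track's note onsets that fall on odd 16th-note steps; empty tracks are skipped. Both choices, and the function names, are assumptions of this illustration.

```python
def odd_step_fraction(onset_steps):
    """Fraction of note onsets that fall on odd 16th-note steps."""
    odd = sum(1 for step in onset_steps if step % 2 == 1)
    return odd / len(onset_steps)


def is_poorly_quantized(tracks, center=0.5, tolerance=0.15):
    """Flag a song when every non-empty track has a near-uniform odd/even split.

    `tracks` is a list of tracks, each a list of quantized onset steps.
    """
    fractions = [odd_step_fraction(t) for t in tracks if t]
    if not fractions:
        return False
    return all(abs(f - center) <= tolerance for f in fractions)
```

A track whose onsets mostly land on even steps (strong subdivisions) pulls the fraction well below 0.35, so a single well-aligned track is enough to keep the song.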
We set the global sequence length to $T = 384$ and cap the maximal polyphony by $N_t \le 16$, clipping excess notes per time step. A batch size of 48 is used for pretraining. We train the model for 2,000,000 iterations using AdamW [45] with $\beta = (0.9, 0.999)$ and weight decay 0.01. We use a OneCycleLR [46] scheduler with a maximum LR of $10^{-4}$ and 10,000 warm-up steps. Pretraining takes around 12 days on 4×A100 (40GB) GPUs.
4.2 Downstream Tasks
We evaluate the adapters on different music generation and understanding tasks. Specifically, we have 3 sets of tasks:
• Melody to chord and chord to melody: we fine-tune the model on the Nottingham dataset [47] with a total of 1,020 songs. The model is asked to generate chords from a given melody or to generate a melody given a chord progression.
• Drum to others and others to drum: we fine-tune the model on a subset of 31,000 songs in the Los Angeles dataset with a drum track. The model is asked to generate the drum track given the full score of non-percussive instruments, or to generate other instruments given a drum track.
• Few-shot symbolic music analysis: we fine-tune the model on 93 songs in the RWC Pop dataset [48]. The model is asked to transcribe the chords and metrical structure given a symbolic pop song. We evaluate the results on symbolic chord recognition.
In each task, we perform a random 8:1:1 split for training, validation, and testing. For the drum-to-others and others-to-drum tasks, RWC Pop is used as an external test set.
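A random 8:1:1 split can be sketched as below. The seed, the rounding rule (floor for the training and validation parts, remainder to the test part) and the function name are assumptions of this example rather than details given in the text.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle and split into train/validation/test parts at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Under this rounding rule, the 93 RWC Pop songs yield 74 training, 9 validation and 10 test songs.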
4.3 Compared Models
We compare the performance of the following models, with slight hyperparameter adjustments to ensure comparable numbers of trainable parameters.
• FA-Cross: The base model is fine-tuned with a cross-attentive adapter (4 heads, hidden size 256), inserted every two layers of the global decoder. A LoRA with r = 16, α = 32 is used on the query and value projectors of both LMs.
• FA-Self: The base model is fine-tuned with a self-attentive adapter. A LoRA with r = 64, α = 128 is used on the query and value projectors of both LMs.
• Coco-Mulla: The Coco-Mulla [6] adapter applied on the RoFormer model. The adapter has a trainable positional encoding size of 384.
• Prober: A 2-layer Multilayer Perceptron (MLP) prober as used in [2]. The MLP layer uses a weighted sum of all layers' hidden states and has a hidden dimension of 768.
• Enc-Dec: A baseline trained from scratch with a small RoFormer encoder-decoder (3 layers, hidden size 256, intermediate size 512, 4 heads for both encoder and decoder).
• MelodyT5 [49]: an external baseline for the melody to chord and chord to melody tasks. The model is trained on 261K songs represented by ABC notations. We do not retrain this baseline.
• Assistant (Composer's Assistant V2) [50]: an external baseline for the others to drum task. We do not retrain the baseline.

Figure 5. Subjective evaluation results (ratings from 0 to 5 on Musicality, Adherence and Creativity). The error bars show the 95% confidence intervals of the true mean. (a) Chord-conditioned melody generation (Chord to Melody): FA-Self, FA-Cross, Enc-Dec, MelodyT5, Coco-Mulla, Ground Truth. (b) Drum-conditioned song generation (Drum to Others): FA-Self, FA-Cross, Enc-Dec, Coco-Mulla, Ground Truth. (c) Song-conditioned drum track generation (Others to Drum): FA-Self, FA-Cross, Enc-Dec, Assistant, Coco-Mulla, Ground Truth.

All modules are trained for up to 60,000 iterations with a fixed learning rate of $10^{-4}$ and a batch size of 8 on a single A100 GPU. Early stopping is applied if validation loss does not improve for 10 rounds (5,000 iterations).
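The early-stopping rule can be sketched as a patience counter over validation rounds. Since 10 rounds correspond to 5,000 iterations, the sketch assumes one validation round every 500 iterations; the function name and the strict-improvement criterion are assumptions of this example.

```python
ITERATIONS_PER_ROUND = 500  # inferred from 10 rounds = 5,000 iterations


def early_stop_round(val_losses, patience=10):
    """Return the index of the round at which training stops, or None.

    Training stops once `patience` consecutive rounds pass without a
    strictly lower validation loss than the best seen so far.
    """
    best = float("inf")
    rounds_since_best = 0
    for index, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            rounds_since_best = 0
        else:
            rounds_since_best += 1
            if rounds_since_best >= patience:
                return index
    return None
```

If the best loss occurs at round index 1 and the following ten rounds bring no improvement, training stops at round index 11, i.e., after 12 × 500 = 6,000 iterations.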
4.4 Subjective Evaluation
For the three generative tasks (chord-to-melody, drum-to-others, and others-to-drums), we conducted a subjective evaluation via a user survey. We selected 8 songs from the test set (2 for chord-to-melody, 4 for drum-to-others, and 2 for others-to-drums). We asked participants to rate both the generated outputs and ground truth on a 5-point scale across the following metrics:
• Musicality: Does it sound good as music?
• Adherence: Does it respect and follow the input condition?
• Creativity: Given the input conditions, is it creative in its musical decisions?
We received a total of 65 answers, and the results are shown in Fig. 5.

| Model | Chord to melody | Melody to chord | Drum to others | Others to drum |
| FA-Cross | 1.4204 ± 0.0992 | 1.4177 ± 0.1048 | 2.0459 ± 0.5629 | 1.8619 ± 0.5665 |
| FA-Self | 1.4116 ± 0.1172 | 1.4104 ± 0.1000 | 2.0222 ± 0.6358 | 1.8402 ± 0.5709 |
| Coco-Mulla | 1.8016 ± 0.1711 | 1.5996 ± 0.1445 | 2.2027 ± 0.6532 | 1.9860 ± 0.6857 |
| Enc-Dec | 1.6113 ± 0.1790 | 1.5067 ± 0.1208 | 2.5830 ± 0.9146 | 1.8765 ± 0.5382 |
| Ground Truth | 1.3917 ± 0.0988 | 1.3917 ± 0.0988 | 2.0730 ± 0.7158 | 2.0730 ± 0.7158 |
Table 1. Test set perplexity on different downstream tasks.
4.5 Objective Evaluation
For the generative tasks by fine-tuned models, we report the generated results' perplexity on the RoFormer base model on the test set. Since perplexity is inaccurate on long repetitive generations [51], we only calculate the perplexity using 8-bar generative results (128 steps) conditioned on 2-bar prompts (32 steps). The results are shown in Tab. 1.
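As a reminder of the metric, perplexity is the exponential of the average negative log-likelihood that the base model assigns to the evaluated tokens. The sketch below assumes natural-log probabilities and averaging over tokens; whether the reported numbers are normalized per token or per time step is not stated, so that choice is an assumption of this example.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-probability of the evaluated tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.5 to every token has a perplexity of 2, and lower values indicate that the base model finds the generated continuation more predictable.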
For the melody-to-chord task, we report two additional metrics to compare with MelodyT5. We first calculate the L1 distance between the chromagram (chroma) of the predicted chords and the ground-truth chords. We also report the CTnCTR [52] metric between the melody and the generated chords. Since the test part of the Nottingham dataset has significant overlap with MelodyT5's training set, we perform a small pitch shift (up to 2 semitones) for all test songs to another commonly used key in the Nottingham dataset (e.g., C major to D major, A major to G major, etc.). The results are shown in Tab. 2.
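The chroma distance can be illustrated for a single time step. This sketch assumes a binary 12-dimensional chroma vector built from the sounding chord pitches and an unweighted L1 distance; how the paper aggregates the distance over time is not specified, and the function names are hypothetical.

```python
def chroma(pitches):
    """Binary 12-bin pitch-class vector of a set of MIDI pitches."""
    vec = [0] * 12
    for pitch in pitches:
        vec[pitch % 12] = 1
    return vec


def chroma_l1(predicted_pitches, reference_pitches):
    """L1 distance between the chroma vectors of two chords."""
    pred = chroma(predicted_pitches)
    ref = chroma(reference_pitches)
    return sum(abs(p - r) for p, r in zip(pred, ref))
```

Because chroma discards octave information, a C major triad compares as identical in any octave, while C major versus A minor differs in two pitch classes (G and A).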
For the music analysis task, we represent both the chord and the metrical labels by MIDI notes. The chord notes are represented by block notes using String Ensemble 1 (MIDI program 48). The bass note is placed in the range C3 to B3 (MIDI pitch 36-41), and other chord notes are stacked above them. We use a drum track to represent metrical labels. We use a bass drum note (MIDI pitch 35) to represent a downbeat and a snare drum note (MIDI pitch 38) for subsidiary strong beats. An 8-note infilling by closed hi-hat note (MIDI pitch 42) is also added.
For sequence-to-sequence modeling, the model predicts both tracks from the full MIDI input, and final chord labels are derived via template matching on the average of 16 generations. The exception is the prober, trained as a 25-class classifier (12 major, 12 minor, 1 no-chord). We evaluate using chord metrics (root, majmin, seventh) from the mir_eval package [53]. Results are shown in Table 3.

| Model | Chroma ↓ | CTnCTR ↑ |
| Ground Truth | 0.0000 ± 0.0000 | 0.9675 ± 0.0324 |
| FA-Cross | 1.5690 ± 0.7087 | 0.9113 ± 0.0750 |
| FA-Self | 1.2685 ± 0.5024 | 0.9484 ± 0.0415 |
| Coco-Mulla [6] | 3.4613 ± 0.5854 | 0.6647 ± 0.1219 |
| Enc-Dec | 3.0044 ± 0.5613 | 0.8387 ± 0.0749 |
| MelodyT5 [54] | 3.0428 ± 0.8694 | 0.8463 ± 0.1036 |
Table 2. Objective evaluation results on unprompted melody to chord generation on the test split of the Nottingham dataset.

| Model | Root ↑ | Majmin ↑ | Seventh ↑ |
| Chorder [39] | 0.7244 | 0.6760 | 0.3374 |
| HMM [55,56] | 0.8386 | 0.8169 | 0.6930 |
| FA-Cross | 0.8203 | 0.8455 | 0.6761 |
| FA-Self | 0.8275 | 0.8693 | 0.6986 |
| Prober | 0.8231 | 0.8370 | 0.6191 |
| Enc-Dec | 0.1786 | 0.1500 | 0.0378 |
Table 3. Evaluation results on symbolic chord recognition. The table shows the median result among the test split of the RWC Pop dataset.
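One plausible reading of "template matching on the average of 16 generations" is sketched below: the generated chord tracks are averaged into a 12-bin chroma profile for a time span, and the best-scoring chord template is selected. The template set (major and minor triads only), the scoring rule, the label format and all function names are assumptions of this illustration, not the procedure used in the paper.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chord_templates():
    """Pitch-class sets of the 12 major and 12 minor triads."""
    templates = {}
    for root in range(12):
        name = NOTE_NAMES[root]
        templates[name + ":maj"] = {root, (root + 4) % 12, (root + 7) % 12}
        templates[name + ":min"] = {root, (root + 3) % 12, (root + 7) % 12}
    return templates


def average_chroma(generations):
    """Average binary chroma over several generated pitch lists."""
    avg = [0.0] * 12
    for pitches in generations:
        for pc in {p % 12 for p in pitches}:
            avg[pc] += 1.0 / len(generations)
    return avg


def match_chord(avg_chroma):
    """Pick the template with the highest in-template minus out-of-template
    energy; return "N" (no chord) when no template scores above zero."""
    best_label, best_score = "N", 0.0
    for label, pcs in chord_templates().items():
        inside = sum(avg_chroma[pc] for pc in pcs)
        outside = sum(avg_chroma[pc] for pc in range(12) if pc not in pcs)
        score = inside - outside
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Averaging before matching lets a majority of consistent generations outvote occasional outliers, which is one way to turn a stochastic generator into a more stable labeler.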
4.6 Evaluation Results
In this subsection, we analyze the results of each downstream task.
4.6.1 Few-shot Symbolic Music Analysis
With only 74 training songs, our adapters outperform rule-based baselines on both majmin and seventh categories. By comparing function alignment models (FA) with the prober, we see that using a pretrained LM for the target sequence $\mathbf{y}$ (chord+drums) improves performance on the music understanding task.
Between the function alignment models, the self-attentive adapters achieve better performance compared to the cross-attentive implementation. This trend is also observed in other tasks.
4.6.2 Chord to Melody
The results of the subjective evaluation (Fig. 5a) show that our proposed adapters (FA-Self, FA-Cross) achieve comparable performance to MelodyT5. Coco-Mulla is not effective on the task, achieving even lower performance than the Enc-Dec model. This is also demonstrated in the objective evaluation results (Tab. 1).
4.6.3 Melody to Chord
Both the perplexity results (Tab. 1) and the chord consistency results (Tab. 2) demonstrate the effectiveness of our models, especially the self-attentive adapters. We note that MelodyT5 shows low chroma consistency. MelodyT5 often fails to generate music that meets the constraints of the condition melody (e.g., replaced by an improvised melody or inconsistent structures). This results in a misalignment between the generated chords and the ground truth.

Figure 6. Case study of an others-to-drum example on RWC-Pop-003. The top displays the non-drum condition inputs with a piano roll (structure labels are shown for reference but not used by the model). The bottom shows the drum track by (a) FA-Cross; (b) FA-Self; (c) Ground truth.
4.6.4 Drum to Others
Compared to other tasks, the drum-to-others task aims to model a highly complicated $\mathbf{y}$ (output) sequence, since $\mathbf{y}$ contains the information of the full-band arrangement. In this category, Coco-Mulla outperforms the Enc-Dec model, showing the usefulness of the pretrained knowledge from $\mathrm{LM}(\mathbf{y})$. However, Coco-Mulla does not utilize the knowledge from $\mathrm{LM}(\mathbf{x})$, leading to a worse performance compared to the proposed adapters.
4.6.5 Others to Drum
The others-to-drum task yields an interesting result: our models outperform even the ground truth both subjectively (Fig. 5(c)) and objectively (Tab. 1). This is likely because RWC-Pop uses a limited drum set and regular patterns, while our training data (Los Angeles MIDI) includes diverse textures and instruments (e.g., Cuica). Our models generate rich, varied drum patterns aligned with long-term structure, showing strong creativity and musicality (see Fig. 6 for an example). The baseline model Composer's Assistant V2 [50] also produces less variation.
5. CONCLUSION AND FUTURE WORKS
In this paper, we address the problem of versatile music-for-music modeling that unifies a broad range of music understanding and controllable generation tasks. Inspired by function alignment, we adopt a parameter-efficient approach by knowledge transfer from the pretrained LM for both the input and the output sequence. We introduce two implementations, the cross-attentive adapter and the self-attentive adapter. Both adapters show competitive results on analysis and generation tasks, with the self-attentive adapter performing relatively better.
We see two main directions for future work. First, we plan to refine the data representation to support more music-for-music tasks. We also plan to extend the framework to cross-modal adapters, such as text-to-music tasks.
6. REFERENCES
[1] C. Donahue, J. Thickstun, and P. Liang, “Melody transcription via generative pre-training,” arXiv preprint arXiv:2212.01884, 2022.
[2] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos et al., “MERT: Acoustic music understanding model with large-scale self-supervised training,” arXiv preprint arXiv:2306.00107, 2023.
[3] D. Li, Y. Ma, W. Wei, Q. Kong, Y. Wu, M. Che, F. Xia, E. Benetos, and W. Li, “Mertech: Instrument playing technique detection using self-supervised pretrained model with multi-task finetuning,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 521–525.
[4] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv preprint arXiv:2101.00190, 2021.
[5] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, and Y. Qiao, “Llama-adapter: Efficient fine-tuning of language models with zero-init attention,” arXiv preprint arXiv:2303.16199, 2023.
[6] L. Lin, G. Xia, J. Jiang, and Y. Zhang, “Content-based controls for music large language modeling,” arXiv preprint arXiv:2310.17162, 2023.
[7] G. G. Xia, “Function alignment: A new theory of mind and intelligence, part I: Foundations,” 2025. [Online]. Available: https://arxiv.org/abs/2503.21106
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[9] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, 2020.
[10] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 47 704–47 720, 2023.
[11] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., “MusicLM: Generating music from text,” arXiv preprint arXiv:2301.11325, 2023.
[12] C. Zhang, Y. Ma, Q. Chen, W. Wang, S. Zhao, Z. Pan, H. Wang, C. Ni, T. H. Nguyen, K. Zhou et al., “Inspiremusic: Integrating super resolution and large language model for high-fidelity long-form music generation,” arXiv preprint arXiv:2503.00084, 2025.
[13] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. M. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, “Music transformer: Generating music with long-term structure,” in International Conference on Learning Representations, 2018.
[14] P. Lu, X. Xu, C. Kang, B. Yu, C. Xing, X. Tan, and J. Bian, “Musecoco: Generating symbolic music from text,” arXiv preprint arXiv:2306.00110, 2023.
[15] Y.-S. Huang and Y.-H. Yang, “Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1180–1188.
[16] D.-V.-T. Le and Y.-H. Yang, “Meteor: Melody-aware texture-controllable symbolic orchestral music generation,” arXiv preprint arXiv:2409.11753, 2024.
[17] H. Chen, J. B. L. Smith, J. Spijkervet, J. Wang, P. Zou, B. Li, Q. Kong, and X. Du, “Sympac: Scalable symbolic music generation with prompts and constraints,” in Proceedings of the 25th International Society for Music Information Retrieval Conference, 2024, pp. 1029–1036.
[18] Z. Zhang, L. Li, J. Zhang, Z. Hu, H. Wang, C. Yan, J. Yang, and Y. Qi, “Generating high-quality symbolic music using fine-grained discriminators,” in International Conference on Pattern Recognition. Springer, 2025, pp. 332–344.
[19] Y.-J. Shih, S.-L. Wu, F. Zalkow, M. Müller, and Y.-H. Yang, “Theme transformer: Symbolic music generation with theme-conditioned transformer,” IEEE Transactions on Multimedia, vol. 25, pp. 3495–3508, 2022.
[20] X. Qu, Y. Bai, Y. Ma, Z. Zhou, K. M. Lo, J. Liu, R. Yuan, L. Min, X. Liu, T. Zhang et al., “MuPT: A generative symbolic music pretrained transformer,” arXiv preprint arXiv:2404.06393, 2024.
[21] Y. Shu, H. Xu, Z. Zhou, A. v. d. Hengel, and L. Liu, “MuseBarControl: Enhancing fine-grained control in symbolic music generation through pre-training and counterfactual loss,” arXiv preprint arXiv:2407.04331, 2024.
[22] R. Yuan, H. Lin, Y. Wang, Z. Tian, S. Wu, T. Shen, G. Zhang, Y. Wu, C. Liu, Z. Zhou et al., “Chatmusician: Understanding and generating music intrinsically with llm,” arXiv preprint arXiv:2402.16153, 2024.
[23] H. F. García, P. Seetharaman, R. Kumar, and B. Pardo, “VampNet: Music generation via masked acoustic token modeling,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, 2023, pp. 359–366.
[24] K. Chen, Y. Wu, H. Liu, M. Nezhu ina, T. Be g-
Ki kpa ick, and S. Dubno , “Musicldm: Enhanc-
ing no el y in ex - o-music gene a ion using bea -
synch onous mixup s a egies,” in ICASSP 2024-2024
IEEE In e na ional Con e ence on Acous ics, Speech
and Signal P ocessing (ICASSP). IEEE, 2024, pp.
1206–1210.
[25] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan,
“Music controlnet: Multiple time-varying controls for
music generation,” IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 32, pp. 2692–2703,
2024.
[26] M. W. Lam, Q. Tian, T. Li, Z. Yin, S. Feng, M. Tu,
Y. Ji, R. Xia, M. Ma, X. Song et al., “Efficient neural
music generation,” Advances in Neural Information
Processing Systems, vol. 36, pp. 17 450–17 463, 2023.
[27] F. Schneider, O. Kamal, Z. Jin, and B. Schölkopf,
“Moûsai: Efficient text-to-music diffusion models,” in
Proceedings of the 62nd Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), 2024, pp. 8050–8068.
[28] S. Hou, S. Liu, R. Yuan, W. Xue, Y. Shan, M. Zhao,
and C. Zhang, “Editing music with melody and text:
Using controlnet for diffusion transformer,” in ICASSP
2025-2025 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE,
2025, pp. 1–5.
[29] Z. Wang, L. Min, and G. Xia, “Whole-song hierarchical
generation of symbolic music using cascaded diffusion
models,” in The Twelfth International Conference
on Learning Representations, 2024.
[30] O. Tal, A. Ziv, I. Gat, F. Kreuk, and Y. Adi,
“Joint audio and symbolic conditioning for temporally
controlled text-to-music generation,” arXiv preprint
arXiv:2406.10970, 2024.
[31] X. Liu, K. Ji, Y. Fu, W. Tam, Z. Du, Z. Yang, and
J. Tang, “P-tuning: Prompt tuning can be comparable
to fine-tuning across scales and tasks,” in Proceedings
of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), 2022,
pp. 61–68.
[32] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li,
S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank
adaptation of large language models.” The Tenth International
Conference on Learning Representations,
vol. 1, no. 2, p. 3, 2022.
[33] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou,
W. Zhang, P. Lu, C. He, X. Yue et al., “Llama-adapter
v2: Parameter-efficient visual instruction model,”
arXiv preprint arXiv:2304.15010, 2023.
[34] Y. Lan, W. Hsiao, H. Cheng, and Y. Yang, “Musicongen:
Rhythm and chord control for transformer-based
text-to-music generation,” in Proceedings of the 25th
International Society for Music Information Retrieval
Conference, 2024, pp. 311–318.
[35] L. Lin, G. Xia, Y. Zhang, and J. Jiang, “Arrange, inpaint,
and refine: Steerable long-term music audio generation
and editing via content-based controls,” arXiv
preprint arXiv:2402.09508, 2024.
[36] Y. Zhang, Y. Ikemiya, W. Choi, N. Murata, M. A.
Martínez-Ramírez, L. Lin, G. Xia, W.-H. Liao, Y. Mitsufuji,
and S. Dixon, “Instruct-MusicGen: Unlocking
text-to-music editing for music language models via
instruction tuning,” arXiv preprint arXiv:2405.18386,
2024.
[37] F. Tsai, S. Wu, H. Kim, B. Chen, H. Cheng, and
Y. Yang, “Audio prompt adapter: Unleashing music
editing abilities for text-to-music with lightweight finetuning,”
in Proceedings of the 25th International Society
for Music Information Retrieval Conference, 2024,
pp. 634–641.
[38] L. Ou, J. Zhao, Z. Wang, G. Xia, and Y. Wang, “Unlocking
potential in pre-trained music language models
for versatile multi-track music arrangement,” arXiv
preprint arXiv:2408.15176, 2024.
[39] W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang,
“Compound word transformer: Learning to compose
full-song music over dynamic directed hypergraphs,”
in Proceedings of the AAAI Conference on Artificial Intelligence,
vol. 35, no. 1, 2021, pp. 178–186.
[40] Z. Wang, Y. Zhang, Y. Zhang, J. Jiang, R. Yang,
J. Zhao, and G. Xia, “Pianotree VAE: Structured
representation learning for polyphonic music,” arXiv
preprint arXiv:2008.07118, 2020.
[41] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen,
and Y. Liu, “Roformer: Enhanced transformer with
rotary position embedding,” 2023. [Online]. Available:
https://arxiv.org/abs/2104.09864
[42] R. Bansal, B. Samanta, S. Dalmia, N. Gupta,
S. Vashishth, S. Ganapathy, A. Bapna, P. Jain,
and P. Talukdar, “Llm augmented llms: Expanding
capabilities through composition,” arXiv preprint
arXiv:2401.02412, 2024.
[43] V. Zayats, P. Chen, M. Ferrari, and D. Padfield, “Zipper:
A multi-tower decoder architecture for fusing
modalities,” arXiv preprint arXiv:2405.18669, 2024.
[44] A. Lev, “Los Angeles MIDI dataset: SOTA kilo-scale
MIDI dataset for MIR and music AI purposes,” in
GitHub, 2024. [Online]. Available: https://github.com/
asigalov61/Los-Angeles-MIDI-Dataset
[45] I. Loshchilov and F. Hutter, “Decoupled weight decay
regularization,” arXiv preprint arXiv:1711.05101,
2017.
[46] L. N. Smith and N. Topin, “Super-convergence: Very
fast training of neural networks using large learning
rates,” in Artificial intelligence and machine learning
for multi-domain operations applications, vol. 11006.
SPIE, 2019, pp. 369–386.
[47] “Nottingham database,” http://ifdo.ca/~seymour/
nottingham/nottingham.html, accessed: 2025-03-26.
[48] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka,
“RWC music database: Popular, classical and jazz music
databases.” in ISMIR 2002, 3rd International Conference
on Music Information Retrieval, vol. 2, 2002,
pp. 287–288.
[49] S. Wu, Y. Wang, X. Li, F. Yu, and M. Sun, “MelodyT5:
A unified score-to-score transformer for symbolic
music processing,” arXiv preprint arXiv:2407.02277,
2024.
[50] M. Malandro, “Composer’s Assistant 2: Interactive
Multi-Track MIDI Infilling with Fine-Grained User
Control,” in Proc. 25th Int. Society for Music Information
Retrieval Conf., San Francisco, CA, USA, 2024,
pp. 438–445.
[51] Y. Wang, J. Deng, A. Sun, and X. Meng, “Perplexity
from plm is unreliable for evaluating text quality,”
arXiv preprint arXiv:2210.05892, 2022.
[52] Y.-C. Yeh, W.-Y. Hsiao, S. Fukayama, T. Kitahara,
B. Genchel, H.-M. Liu, H.-W. Dong, Y. Chen,
T. Leong, and Y.-H. Yang, “Automatic melody harmonization
with triad chords: A comparative study,” Journal
of New Music Research, vol. 50, no. 1, pp. 37–51,
2021.
[53] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon,
O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel,
“Mir_eval: A transparent implementation of common
mir metrics.” in Proceedings of the 15th International
Society for Music Information Retrieval Conference,
vol. 10, 2014, p. 2014.
[54] S. Wu, Y. Wang, X. Li, F. Yu, and M. Sun, “MelodyT5:
A unified score-to-score transformer for symbolic
music processing,” arXiv preprint arXiv:2407.02277,
2024.
[55] Z. Wang, K. Chen, J. Jiang, Y. Zhang, M. Xu, S. Dai,
X. Gu, and G. Xia, “Pop909: A pop-song dataset
for music arrangement generation,” arXiv preprint
arXiv:2008.07142, 2020.
[56] J. Jiang, “MIDI Chord Recognition via Bar-Level
Modeling,” https://github.com/music-x-lab/
midi-chord-recognition, 2025, accessed: 2025-06-27.