IMPROVING BERT FOR SYMBOLIC MUSIC UNDERSTANDING USING
TOKEN DENOISING AND PIANOROLL PREDICTION
Jun-You Wang
Institute of Information Science
Academia Sinica
Li Su
Institute of Information Science
Academia Sinica
ABSTRACT
We propose a pre-trained BERT-like model for symbolic music understanding that achieves competitive performance across a wide range of downstream tasks. To achieve this target, we design two novel pre-training objectives, namely token correction and pianoroll prediction. First, we sample a portion of note tokens and corrupt them with a limited amount of noise, and then train the model to denoise the corrupted tokens; second, we also train the model to predict bar-level and local pianoroll-derived representations from the corrupted note tokens. We argue that these objectives guide the model to better learn specific musical knowledge such as pitch intervals. For evaluation, we propose a benchmark that incorporates 12 downstream tasks ranging from chord estimation to symbolic genre classification. Results confirm the effectiveness of the proposed pre-training objectives on downstream tasks.
1. INTRODUCTION
In recent years, music information retrieval (MIR) research in the symbolic music domain has undergone a paradigm shift from the development of task-specific models to the adoption of the pre-training/fine-tuning paradigm, where a model is first pre-trained on a large-scale dataset with self-supervised learning (SSL) objectives and then fine-tuned on specific downstream tasks [1–7]. One of the advantages is that this paradigm is adaptable to downstream tasks with only minimal task-specific designs. Inspired by the advance of natural language processing (NLP) [8–10], pre-trained symbolic music understanding models like MidiBERT [1] and MusicBERT [5] typically adopt a similar architecture and pre-training objective as BERT, i.e., masked language modeling (MLM) in text sequences [11]. Discussions of improvement have therefore focused on the masking strategy during pre-training [3, 4], leading to improved performance in downstream tasks.
However, currently there are several issues that have to be overcome. First, there is a limitation of using only MLM in symbolic music modeling due to the fundamental differences between music and text. Unlike words, symbolic music events are usually the notes of a specific music scale performed at specific time intervals. The MLM objective does not implicitly encode such music knowledge, but treats all note tokens as individuals, regardless of their possible tonal or metrical implications. Modifications regarding pre-training objectives should be made to help the model learn such aspects of symbolic music better. Second, there is a lack of a comprehensive evaluation of pre-trained models for symbolic music. Previous works are only evaluated on five or fewer downstream tasks quantitatively [1–6], which could be improved to reveal more advantages and disadvantages of a pre-trained model.

© J.-Y. Wang and L. Su. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J.-Y. Wang and L. Su, “Improving BERT for symbolic music understanding using token denoising and pianoroll prediction”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
To address these issues, we make two main contributions in this work. First, we propose two novel self-supervised learning objectives for pre-training. The first one, namely the token denoising objective, is a modified version of the MLM objective. Instead of randomly replacing note tokens with a pre-defined [MASK] token, we randomly corrupt the attributes of note tokens by adding small random noises and train the model to reconstruct the original tokens. Such a token denoising objective implicitly encodes the actual distance information of note attribute values. The second one, namely the pianoroll prediction objective, has the model infer pitch and chroma distribution from the corrupted input note sequence, which resembles the prediction of pianoroll representation. This helps the model implicitly learn the idea of pianoroll, which encompasses important features such as the temporal information and the interval between notes, which are crucial in downstream tasks such as melody extraction [12–14], Roman numeral analysis [7, 15, 16], etc. As a result, the pre-trained model acquires domain knowledge on symbolic music and performs better on downstream tasks.
Second, to verify the effectiveness of pre-trained models, we conduct a comprehensive evaluation by adapting the models to a total of 12 downstream classification tasks, which far exceeds the scale of evaluation in previous works [1–6]. To achieve this goal, we propose a unified protocol for fine-tuning, and recast various symbolic music understanding tasks with publicly available datasets into this framework. By fine-tuning a pre-trained model on such a wide range of downstream tasks, we not only demonstrate the effectiveness of the proposed methods, but also pave the way for the evaluation of future work. As a preliminary step toward formalizing the evaluation of pre-trained models in symbolic music, we invite the research community to include more diverse downstream tasks and further refine the protocol to facilitate the future development of different forms of symbolic pre-trained models.
2. RELATED WORK
Symbolic music data is an abstract representation of music [17], which represents a musical piece with a sequence of notes. Each note contains at least three attributes: onset timing, note duration, and note pitch. Additional attributes such as tempo, time signature, staff information, etc., may also be present. Symbolic music understanding aims to infer high-level information from it, such as melody [12, 14], motifs [18], repeated patterns [19, 20], functional harmony [15, 21], texture [22], genre [23], structural boundary [24], and emotion [25, 26].
As symbolic music understanding tasks rely heavily on the domain knowledge of music, annotating ground truths for these tasks typically has to be done by experts, sometimes assisted with computational tools [27]. As a result, available data for symbolic music understanding tasks are usually scarce. To address this issue, previous work has proposed to generate training data automatically [28] and to obtain annotations from the Internet [29]. Recently, there have been attempts to directly utilize large language models (LLMs) for symbolic music understanding tasks [30, 31], which removes the demand for a large-scale dataset; however, the results are unsatisfactory. In tasks related to music reasoning such as key estimation and harmony analysis, the best LLM only achieves an accuracy slightly better than random guess, possibly due to the domain gap between text and symbolic music [30].
Another research direction is to adapt a pre-trained model in the symbolic music domain to downstream tasks. In the past few years, the surge of pre-trained models that can be easily adapted to various downstream tasks has led to a paradigm change in multiple domains [32, 33], such as text [8, 9, 11], image [34], speech [35–37], and music in the audio domain [33, 38–40]. In the symbolic music domain, there are also efforts to build pre-trained models for symbolic music understanding [1, 3–5] and symbolic music generation [2, 6, 41]. Due to the similarity between text and symbolic music [17], previous pre-trained models for symbolic music understanding mostly employ similar architectures and pre-training objectives as in NLP. For example, [1, 3–5] are all based on BERT [11]. In this work, we propose novel objectives that take into account the distinct characteristics of symbolic music for pre-training.
3. PROPOSED METHOD
Figure 1 gives an overview of the proposed method. The input consists of two parts. The first is a music piece $X$ composed of $N$ notes, i.e., $X := \{x_n\}_{n=1}^{N}$, where $x_n := (o_n, p_n, d_n)$ represents the $n$-th note in $X$ with its onset, pitch, and duration. The second is a list of downbeat timings $DB$ composed of $M$ downbeat timings, i.e., $DB := \{db_m\}_{m=1}^{M}$. For MIDI data (including performance MIDI), an unreliable yet doable way to obtain the raw $DB$ information is to retrieve the tick information (e.g., the number of ticks per beat and downbeat) in MIDI. The proposed pre-trained model $F(\cdot)$ generates a contextualized representation at the note level. During pre-training, we add a prediction head $G$ on top of $F$ and train $G(F(\cdot))$ with unlabeled data through self-supervised learning objectives. Then, during fine-tuning, we remove $G$ and add another prediction head $H(\cdot)$ on top of $F$, and train both $F$ and $H$ in a downstream classification task. We aim to improve $F$ so that $H(F(\cdot))$ could achieve better overall performance in a wide range of downstream tasks.
3.1 Tokenization and segmentation
Following MidiBERT [1], we adopt a modified version of the Compound Word (CP) representation [42] for tokenization. Each note $x_n$ is represented by a CP token with four attributes, $(b_n, \mathrm{pos}_n, \mathrm{pit}_n, \mathrm{dur}_n)$. $b_n \in [0, 1]$ is a binary flag that denotes whether a note is at the start of a bar; $\mathrm{pos}_n \in [0, 15]$ denotes the onset position of a note within a bar rounded to an integer number of 1/4 crotchet beats; $\mathrm{pit}_n \in [22, 107]$ denotes note pitch; $\mathrm{dur}_n \in [1, 64]$ denotes note duration rounded to an integer number of 1/8 crotchet beats. In practice, we follow MidiBERT by assuming that the duration of a bar is four crotchet beats (i.e., 4/4 or 2/2 time signature). For bars that have a different duration (which can be inferred from $DB$), we rescale them to four crotchet beats. As for segmentation, following [5], we adopt a maximum sequence length of 1024 tokens.
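As an illustration of the token ranges described above, the following minimal sketch maps one note to a CP-style tuple. It is not the authors' implementation: the function name `to_cp_token`, the convention that onsets are given in crotchet beats relative to an already rescaled four-beat bar, and the choice to clamp out-of-range values are all assumptions made for this example.

```python
# Hypothetical helper, for illustration only (not from the paper's code base).
POS_RANGE = (0, 15)     # onset position in units of 1/4 crotchet beat
PIT_RANGE = (22, 107)   # pitch range stated in the text
DUR_RANGE = (1, 64)     # duration in units of 1/8 crotchet beat


def clamp(value, lo, hi):
    """Restrict value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def to_cp_token(onset_in_bar, duration, pitch, starts_bar):
    """Map a note to (b, pos, pit, dur).

    onset_in_bar and duration are in crotchet beats; the bar is assumed to
    have already been rescaled to four crotchet beats. Clamping values that
    fall outside the valid ranges is an assumption of this sketch.
    """
    b = 1 if starts_bar else 0
    pos = clamp(round(onset_in_bar * 4), *POS_RANGE)
    pit = clamp(pitch, *PIT_RANGE)
    dur = clamp(round(duration * 8), *DUR_RANGE)
    return (b, pos, pit, dur)


if __name__ == "__main__":
    # A crotchet-long middle C starting 1.5 beats into the bar.
    print(to_cp_token(1.5, 1.0, 60, False))
```

Under these assumptions, a one-beat note maps to a duration token of 8 and an onset at beat 1.5 maps to position 6.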
3.2 Backbone model
In previous work, BERT-BASE [11], a bidirectional encoder with 12 Transformer layers [43], has been widely used as the backbone model [1, 3–6]. In this work, we choose ModernBERT [44] as our backbone model. ModernBERT is an enhanced version of BERT that incorporates various techniques to improve training efficiency and performance, such as flash attention [45], local attention, rotary positional encoding mechanisms [46], etc. Although the official ModernBERT model contains 22 layers, we reduce them to 12 for a fair comparison with MidiBERT. We refer to our proposed ModernBERT-based model as M2BERT, which stands for Modern-MidiBERT.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 1. An overview of the proposed method.

3.3 Pre-training objectives
In both BERT and MidiBERT [1, 11], the MLM objective is applied for pre-training. They first randomly choose a fraction of input tokens (15% of tokens in practice). For the chosen tokens, 80% of them are replaced with a special [MASK] token, 10% of them are replaced with a randomly sampled token, and the remaining 10% of tokens are kept intact. Then, the model is trained to recover the original tokens from such masked inputs. However, the MLM objective does not fully account for domain-specific knowledge in music. For example, the fact that “the pitch difference between E4 and C4 is necessarily a major third” cannot be inferred simply by successfully predicting a masked token. As a result, the model lacks a mechanism to ensure that the learned embeddings of these two pitches reflect the general relationship shared by all major third intervals.
To address this issue, we propose two pre-training objectives. The first is to train the model to correct corrupted tokens rather than masked tokens, and employ a corruption strategy that is aware of the proximity of notes (named as the token denoising objective). The second is to have the model predict not only tokens but also pianoroll-like features, which can capture geometrical information of notes that is well encoded in a pianoroll representation but cannot be trivially inferred from a token sequence alone (named as the pianoroll prediction objective). The two pre-training objectives are described below.
3.3.1 Token denoising
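Similar to MLM, we also randomly sample notes and corrupt them. But instead of replacing notes with the [MASK] token, we corrupt notes by randomly perturbing the value of the tokens. For an input token $(b_n, \mathrm{pos}_n, \mathrm{pit}_n, \mathrm{dur}_n)$, we obtain a corrupted token $(\tilde{b}_n, \widetilde{\mathrm{pos}}_n, \widetilde{\mathrm{pit}}_n, \widetilde{\mathrm{dur}}_n)$ by:

\begin{align}
\tilde{b}_n &= \mathrm{rand}(0, 2), \tag{1} \\
\widetilde{\mathrm{pos}}_n &= \mathrm{rand}(\mathrm{clip}(\mathrm{pos}_n - r_{\mathrm{pos}}),\ \mathrm{clip}(\mathrm{pos}_n + r_{\mathrm{pos}})), \tag{2} \\
\widetilde{\mathrm{pit}}_n &= \mathrm{rand}(\mathrm{clip}(\mathrm{pit}_n - r_{\mathrm{pit}}),\ \mathrm{clip}(\mathrm{pit}_n + r_{\mathrm{pit}})), \tag{3} \\
\widetilde{\mathrm{dur}}_n &= \mathrm{rand}(\mathrm{clip}(\mathrm{dur}_n - r_{\mathrm{dur}}),\ \mathrm{clip}(\mathrm{dur}_n + r_{\mathrm{dur}})), \tag{4}
\end{align}

where the $\mathrm{rand}(x, y)$ function randomly samples an integer from the interval $[x, y)$; the $\mathrm{clip}(z)$ function clips $z$ such that $z$ lies in a valid range of token number (e.g., $[0, 15]$ for $\mathrm{pos}_n$); the hyperparameters $r_{\mathrm{pos}}$, $r_{\mathrm{pit}}$, and $r_{\mathrm{dur}}$ determine the range of corruption. Setting $r_{\mathrm{pos}}, r_{\mathrm{pit}}, r_{\mathrm{dur}} := \infty$ results in the case of random sampling from all possible tokens. The proposed model is trained to recover the original note tokens (i.e., denoise) from the corrupted ones with the cross-entropy loss, similar to the MLM objective in BERT.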
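The corruption rule can be sketched in a few lines of Python. This is a minimal, literal reading of the four equations above rather than the authors' code: the names `corrupt_token`, `VALID`, and `clip` are hypothetical, the corruption ranges are passed as a dictionary, and the sampling interval is kept half-open as stated in the text, which means the clipped upper bound itself is never drawn.

```python
import math
import random

# Valid token ranges stated in the tokenization section.
VALID = {"pos": (0, 15), "pit": (22, 107), "dur": (1, 64)}


def clip(z, lo, hi):
    """Clip z into the closed valid range [lo, hi]."""
    return max(lo, min(hi, z))


def corrupt_token(token, ranges, rng):
    """Corrupt one (b, pos, pit, dur) token.

    ranges maps "pos", "pit", "dur" to a corruption range of at least 1,
    or math.inf to sample over the whole valid range.
    """
    _, pos, pit, dur = token
    corrupted = []
    for name, value in (("pos", pos), ("pit", pit), ("dur", dur)):
        lo_valid, hi_valid = VALID[name]
        lo = clip(value - ranges[name], lo_valid, hi_valid)
        hi = clip(value + ranges[name], lo_valid, hi_valid)
        # Half-open interval [lo, hi), following the definition of rand(x, y).
        corrupted.append(rng.randrange(lo, hi))
    return (rng.randrange(0, 2), *corrupted)


if __name__ == "__main__":
    rng = random.Random(0)
    limited = {"pos": 4, "pit": 12, "dur": 12}
    print(corrupt_token((0, 6, 60, 8), limited, rng))
    unlimited = {"pos": math.inf, "pit": math.inf, "dur": math.inf}
    print(corrupt_token((0, 6, 60, 8), unlimited, rng))
```

With a pitch range of 12, a note at pitch 60 can only be corrupted to a pitch within one octave of the original, which is the proximity constraint the objective relies on.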
The reasons behind this corruption strategy are two-fold. First, as noted in previous NLP work, using the [MASK] token introduces a discrepancy of distribution between pre-training data and fine-tuning data, making the pre-training process allocate model dimensions to exclusively represent the [MASK] tokens [47]. The same work also shows that randomly replacing [MASK] with other tokens is a suboptimal strategy [47]. Our proposed objective (i.e., randomly corrupting tokens but within a limited range) represents a compromise strategy to reduce such discrepancy, and is easy to operate on symbolic music data.
Second, limiting the range of corruption encourages the model to better learn the relative distance between nearby note tokens, which usually carries musical meaning. For example, suppose $r_{\mathrm{pit}}$ is set to 12 semitones and the model observes a note corrupted to C6. The model is then guided to infer that the correct pitch must be located between C5 and C7 rather than at other pitch values far away from C6. This mechanism is desirable, as it guides the model to learn the knowledge related to the pitch interval.
3.3.2 Pianoroll prediction
The other proposed pre-training objective is to make the model predict pianoroll and chroma representations. Two prediction mechanisms are imposed:
• The bar-level prediction mechanism has each $x_n$ predict the pianoroll (denoted as $PR_n$) and chromagram (denoted as $CM_n$) of the bar in which the onset of $x_n$ is located.
• The local prediction mechanism has each $x_n$ predict the pianoroll and chromagram at the onset position of $x_n$, i.e., $PR_n[\mathrm{pos}_n]$ and $CM_n[\mathrm{pos}_n]$.
The two mechanisms represent different types of guidance for the model's prediction capability: bar-level prediction ensures that all notes within the same bar predict the same representation, while local prediction allows each note to individually predict the representation corresponding to its specific timing. In practice, we divide one bar into 16 tatums (1/4 crotchet beats per tatum). Given that pitch tokens range from 22 to 107, the dimensions of $PR_n$ and $CM_n$ are (16, 86) and (16, 12), respectively. We employ the L2 loss with the weight of each loss term being equal. Specifically, for each $x_n$, we have:

\begin{align}
L_n ={}& L_2(PR_n, \widehat{PR}_n) + L_2(CM_n, \widehat{CM}_n) \nonumber \\
&+ L_2(PR_n[\mathrm{pos}_n], \widehat{PR}_n[\mathrm{pos}_n]) \nonumber \\
&+ L_2(CM_n[\mathrm{pos}_n], \widehat{CM}_n[\mathrm{pos}_n]), \tag{5}
\end{align}
where $\widehat{PR}_n$ and $\widehat{CM}_n$ denote the predicted pianoroll and chromagram from $x_n$, respectively. The total loss is therefore the sum of the token reconstruction loss and the pianoroll prediction loss over all notes.
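To make the target shapes and the four-term loss concrete, here is a small pure-Python sketch. Several details are assumptions of this example and are not specified in the text: the pianoroll is taken to be binary, a note is marked active for ceil(dur / 2) tatums starting at its onset and truncated at the bar end, the chromagram folds pitches with `pit % 12`, and the L2 loss is computed as a mean squared error. The names `bar_targets`, `l2`, and `note_loss` are hypothetical.

```python
N_TATUMS = 16
PITCH_MIN, PITCH_MAX = 22, 107
N_PITCHES = PITCH_MAX - PITCH_MIN + 1  # 86 pitch bins


def bar_targets(notes):
    """Build (pianoroll, chromagram) for one bar from (pos, pit, dur) notes."""
    pianoroll = [[0.0] * N_PITCHES for _ in range(N_TATUMS)]
    chroma = [[0.0] * 12 for _ in range(N_TATUMS)]
    for pos, pit, dur in notes:
        # dur is in 1/8 beats and a tatum is 1/4 beat, so two units per tatum.
        length = max(1, (dur + 1) // 2)
        for t in range(pos, min(N_TATUMS, pos + length)):
            pianoroll[t][pit - PITCH_MIN] = 1.0
            chroma[t][pit % 12] = 1.0
    return pianoroll, chroma


def l2(target, pred):
    """Mean squared error between two equally shaped 2-D lists."""
    pairs = [(a, b) for row_t, row_p in zip(target, pred)
             for a, b in zip(row_t, row_p)]
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


def note_loss(pr, cm, pr_hat, cm_hat, pos):
    """Equation (5): two bar-level terms plus two local terms at the onset."""
    return (l2(pr, pr_hat) + l2(cm, cm_hat)
            + l2([pr[pos]], [pr_hat[pos]]) + l2([cm[pos]], [cm_hat[pos]]))


if __name__ == "__main__":
    pr, cm = bar_targets([(0, 60, 8), (4, 64, 2)])
    print(len(pr), len(pr[0]), len(cm), len(cm[0]))
    empty_pr = [[0.0] * N_PITCHES for _ in range(N_TATUMS)]
    empty_cm = [[0.0] * 12 for _ in range(N_TATUMS)]
    print(note_loss(pr, cm, empty_pr, empty_cm, pos=0))
```

Every note in the bar shares the same bar-level targets, while the local terms differ per note through its own onset position.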
3.4 Prediction heads
Two types of prediction heads $H$ are utilized to perform the downstream tasks in the fine-tuning stage. The first type is for note-level prediction tasks, where $H$ is simply a linear layer with Softmax activation. The other type is for sequence-level prediction tasks, where $H$ contains an attention-based weighted average layer, a linear layer, and Softmax activation. The attention-based weighted average layer uses two linear layers and a Softmax activation to determine the weight of each note, and then computes the weighted average of all notes' embeddings to form the sequence-level embedding. All these settings are derived from MidiBERT's protocol [1], where the only differences are that we remove the dropout layer and reduce the number of linear layers for classification to one, which simplifies $H$ and better aligns with BERT's fine-tuning protocol [11]. In this work, three downstream tasks (SGC, PS, and ER, see Section 4 for details) are trained at sequence level, while all other downstream tasks are trained at note level. The $G$ used for pre-training is also constructed with a linear layer with Softmax activation.
During fine-tuning, the backbone model $F$ and the prediction head $H$ are jointly optimized. For each downstream task, we fine-tune the model with a training set, and select the best model checkpoint based on the model performance on a validation set following the designated evaluation metric of that task.
4. THE SMC BENCHMARK
We conduct an evaluation on 12 different downstream tasks of symbolic music classification. We refer to the combination of all these tasks and datasets as the SMC benchmark, which stands for Symbolic Music Classification benchmark. The benchmark is available at https://zenodo.org/records/15681035.
Similar to [1, 3, 4, 11], we focus on classification tasks, a common problem formulation of symbolic music understanding. Additionally, we aim to integrate these downstream tasks to achieve the following three objectives. 1) Reproducibility. We only select tasks with publicly available datasets that can be freely downloaded without requiring a data access application; 2) Diversity. We include a wide range of datasets while ensuring that each dataset is used for at most two downstream tasks.¹ 3) Comparability. We prefer to select tasks that have been discussed in previous work and have publicly reported state-of-the-art results for comparison.

¹ For example, the dataset compiled by AugmentedNet [15] is used for functional harmony analysis, which is decomposed into five subtasks. We only include two of them (chord root and local key) as downstream tasks.

The 12 tasks are listed as follows:
Symbolic genre classification (SGC). This task classifies the music genre of symbolic music (in MIDI format). We adopt the CD2 part of the Tagtraum genre annotations [23] for this task. To construct a balanced dataset, we select the five most frequent genres in the dataset² and sample 1,150 songs for each. Then, we divide the dataset into training/validation/test sets with 500/150/500 songs for each class. We adopt the accuracy metric for this task.
Piano performer style classification (PS). This task classifies the piano performer of a symbolic music piece from eight candidate pianists. It was introduced in MidiBERT along with the Pianist8 dataset [1]. Instead of following the official train/test split, considering the modest scale of the dataset, we divide it into five folds and perform cross-validation. We adopt the accuracy metric for this task.
Emotion recognition (ER). This task classifies the emotion of symbolic music into four quadrants of the valence-arousal space. Following [1], we use the EMOPIA dataset [25] and adopt the accuracy metric. We divide the dataset into five folds and perform cross-validation.
Beat note prediction (BP). This task identifies beat notes from a performance MIDI note sequence, where each note contains only onset, offset, and pitch information, with onset and offset measured in seconds rather than quantized beat numbers. BP is a subtask of the Performance MIDI-to-Score conversion (PM2S) task [48]. To perform this task, we utilize the PM2S dataset compiled by Liu et al. [48], follow their train/valid/test split, and adopt the F1-score for evaluation. Since the input is not quantized in time, the pseudo-tokens of onset time and duration are obtained by the following pre-processing steps: 1) determine the global tempo $T$ by assuming the length of one beat as the median value of all notes' duration; 2) if $T \notin [40, 200]$, then $T$ is multiplied or divided by 2 to make it lie in the range; 3) quantize the note sequence to 16th notes by assuming a 4/4 time signature and constant tempo regardless of the actual performance content.³
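The three pre-processing steps can be sketched as follows. This is an illustrative reading, not the authors' pipeline: the function names `estimate_tempo` and `quantize` are hypothetical, tempo is assumed to be in beats per minute, note durations are assumed to be strictly positive, and the output tuple layout (bar index, position, pitch, duration token) is chosen for this example only.

```python
import statistics


def estimate_tempo(durations_sec):
    """Steps 1 and 2: median note duration as beat length, folded into [40, 200] BPM."""
    beat_sec = statistics.median(durations_sec)  # assumed to be > 0
    tempo = 60.0 / beat_sec
    while tempo < 40.0:
        tempo *= 2.0
    while tempo > 200.0:
        tempo /= 2.0
    return tempo


def quantize(notes, tempo):
    """Step 3: quantize (onset_sec, offset_sec, pitch) notes to a 16th-note grid.

    Assumes a 4/4 time signature and constant tempo, so one bar holds 16 steps.
    Durations are expressed in 1/8 crotchet beats and clamped to [1, 64].
    """
    beat_sec = 60.0 / tempo
    tokens = []
    for onset, offset, pitch in notes:
        step = round(onset / (beat_sec / 4))
        dur = min(64, max(1, round((offset - onset) / (beat_sec / 8))))
        tokens.append((step // 16, step % 16, pitch, dur))
    return tokens


if __name__ == "__main__":
    notes = [(0.0, 0.5, 60), (0.5, 1.0, 62), (8.0, 8.25, 64)]
    tempo = estimate_tempo([off - on for on, off, _ in notes])
    print(tempo)
    print(quantize(notes, tempo))
```

Because the interval [40, 200] spans more than a factor of two, repeated doubling or halving always terminates with a tempo inside the range.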
Downbeat note prediction (DbP). Similar to BP, this task aims to identify downbeat notes. We again use the PM2S dataset with the same train/valid/test split as Liu et al. [48] and use the F1-score metric.
Chord root estimation (CR). This task involves predicting the tatum-wise chord roots of symbolic music and serves as a crucial subtask in functional harmony analysis [7, 15, 16, 21]. We use the combined dataset compiled by AugmentedNet for this task and follow its train/valid/test split [15]. Following [7, 15, 16], we also adopt the Chord Symbol Recall (CSR) metric, but at the 1/4 crotchet beat level (instead of 1/8), since both MidiBERT and the proposed M2BERT have a time resolution of 1/4 crotchet beats. This should only lead to a negligible difference, as chord changes rarely occur at 1/8 crotchet beats.
Local key estimation (LK) estimates the local key of symbolic music and is also a crucial subtask of functional harmony analysis. We utilize the same dataset, experiment setting, and evaluation metric as in the case of CR.
² The five genres are Country, Electronic, Pop, RnB, and Rock.
³ This does not imply that the actual time signature must be 4/4, but is simply a method to construct input tokens for the proposed model.
Symbolic melody extraction (ME) classifies each note into three classes: vocal melody, instrumental melody, or accompaniment. This task was also used for evaluation in MidiBERT [1]. We use the POP909 dataset [49] for this task and follow the same train/valid/test split as in MidiBERT. Note-level accuracy is adopted for evaluation.
Symbolic velocity estimation (VE). Also introduced in MidiBERT, the objective of this task is to estimate the velocity of notes in terms of six classes: pp, p, mp, mf, f, and ff. We use the same setting and evaluation metric as in ME.
Orchestral texture classification (OTC) labels the textural layer of each track in symphony music in terms of melody, rhythm, and harmony [22, 50]. We utilize the dataset proposed by Le et al. in [50], which contains texture annotation of 24 symphony movements composed by Beethoven, Haydn, and Mozart. 18 of them are accompanied with digital scores and have been used for evaluation recently in [51]. We therefore follow the settings in [51], including the 7-class formulation of the multi-label scenario (excluding the “no label” class), the same train/test split, and bar-level accuracy for evaluation.
Motif note identification (MNID). Similar to ME, MNID classifies whether or not a note in a music piece belongs to a motif of that piece. This task plays a critical role in music analysis by highlighting important notes that contribute to the motifs of a musical piece [18]. We adopt the BPS-motif dataset for this task [18], and employ the note-level F1-score for evaluation. As a first attempt at this task, we divide this dataset into five folds and conduct cross-validation on it.
Violin fingering prediction (VF) predicts the fingering that a performer uses to perform a symbolic violin piece. This task could also be considered as a generation task, as the fingering choices are up to the performers themselves. However, we still consider this task as an aspect of music understanding since it corresponds to the notion of playability, a physical constraint over notes. We use the TNUA dataset [52] for this task and use the same train/valid/test split as [53]. Following [53], we treat violin fingering as a 240-class classification task (the combination of four string choices, five finger choices, and 12 hand position choices) and use note-level accuracy for evaluation.
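The 240-class formulation follows from 4 × 5 × 12 = 240 combinations. One possible way to flatten a (string, finger, position) triple into a single class index is sketched below; the zero-based indexing and the ordering of the three factors are assumptions of this example, and the helper names are hypothetical rather than taken from [53].

```python
N_STRINGS, N_FINGERS, N_POSITIONS = 4, 5, 12
N_CLASSES = N_STRINGS * N_FINGERS * N_POSITIONS  # 240


def encode_fingering(string, finger, position):
    """Flatten zero-based (string, finger, position) indices into one label."""
    return (string * N_FINGERS + finger) * N_POSITIONS + position


def decode_fingering(label):
    """Invert encode_fingering."""
    rest, position = divmod(label, N_POSITIONS)
    string, finger = divmod(rest, N_FINGERS)
    return string, finger, position


if __name__ == "__main__":
    label = encode_fingering(2, 3, 7)
    print(label, decode_fingering(label))
```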
5. EXPERIMENT SETUP
We utilize the StableAdamW optimizer [54] and randomly corrupt 30% of the note tokens for pre-training. The corruption hyperparameters $r_{\mathrm{pos}}$, $r_{\mathrm{pit}}$, and $r_{\mathrm{dur}}$ are set to 4, 12, and 12, respectively. All other pre-training hyperparameters remain the same as MidiBERT, namely: a learning rate of $2 \times 10^{-5}$, a weight decay of 0.01, a batch size of 12, and an 85%/15% partition of the training and validation sets. We choose the best model checkpoint as the one that maximizes the token reconstruction accuracy on the validation set.
For the pre-training dataset, we employ two settings. The first one, referred to as the “Reduced” dataset, follows MidiBERT [1] by combining POP909 [49], Pop1K7 [42], EMOPIA [25], Pianist8 [1], and ASAP [27], while excluding all non-4/4 time signature music from the training data.⁴ This leads to a dataset containing 4.89M notes. For this setting, we train the model for 150 epochs, which takes around 18 hours on an NVIDIA RTX 6000 Ada GPU.⁵ The second setting, referred to as the “Full” dataset, contains the Reduced dataset plus the lmd_matched set of the Lakh MIDI dataset [55], totaling 350.32M notes. Because of the limitation of our computational resources, we only train the model for 25 epochs when using the Full dataset, regardless of the decreasing validation loss. In the experiments, we employ the Reduced setting in all the ablation studies for a fair comparison with MidiBERT, while only employing the second setting for our best configuration.
Our comparative study incorporates two models (i.e., MidiBERT and M2BERT), two datasets (i.e., Reduced and Full), and three training objectives: 1) the original MLM objective; 2) the proposed corrupted note reconstruction objective (denoted as $\mathrm{RC}_{r_{\mathrm{pos}}, r_{\mathrm{pit}}, r_{\mathrm{dur}}}$, parametrized by the corruption ranges $(r_{\mathrm{pos}}, r_{\mathrm{pit}}, r_{\mathrm{dur}})$ labeled in the subscripts); 3) the pianoroll prediction objective (denoted as Pianoroll). It should be noted that $\mathrm{RC}_{\infty}$ denotes note corruption by random sampling on all possible tokens. The source code of the proposed model and additional ablation study results are available at https://github.com/york135/M2BERT.
6. RESULTS
Table 1 presents the experiment results over all the settings on the 12 downstream tasks. For reference, we show the state-of-the-art (SOTA) performance of the tasks in the last row, if available. First, we observe that all the settings of the proposed M2BERT model outperform MidiBERT for almost all the downstream tasks; this demonstrates the advantages of the ModernBERT architecture. Second, random sampling (i.e., RC∞) does outperform MLM, possibly due to the removal of the [MASK] token. However, the proposed RC4,12,12 further outperforms RC∞; this shows the effectiveness of limiting the range of random corruption. Finally, incorporating the Full dataset for training further improves overall performance.
⁴ We use this setting to have a fair comparison with MidiBERT. However, as discussed in Section 3.1, the proposed model can still process non-4/4 music pieces by rescaling the bars, which are frequently seen in tasks that involve classical music, such as CR, LK, and MNID.
⁵ The original MidiBERT trains the model for 500 epochs, but we found that for the proposed M2BERT, the validation loss usually reaches the lowest point between 100 and 150 epochs. Therefore, we terminate the training at the end of 150 epochs.

Model     Objectives           Dataset  SGC    BP     DbP    CR     LK     ME     VE     OTC    PS     ER     VF     MNID
                                        Acc.   F1     F1     CSR    CSR    Acc.   Acc.   Acc.   Acc.   Acc.   Acc.   F1
MidiBERT  MLM                  Reduced  .397   .843   .706   .795   .761   .964   .518   .694   .708   .638   .551   .700
M2BERT    MLM                  Reduced  .403   .867   .749   .825   .796   .975   .524   .695   .740   .641   .509   .708
M2BERT    RC∞                  Reduced  .397   .864   .768   .838   .780   .978   .536   .703   .740   .666   .538   .712
M2BERT    RC4,12,12            Reduced  .404   .867   .779   .838   .791   .983   .534   .707   .740   .658   .511   .719
M2BERT    RC4,12,12+Pianoroll  Reduced  .405   .867   .768   .847   .804   .982   .532   .725   .742   .667   .536   .718
M2BERT    RC4,12,12+Pianoroll  Full     .433   .886   .839   .855   .811   .986   .534   .720   .746   .671   .539   .745
SOTA      NA                   NA       -      .945   .757   .849   .829   .974∗  .536   .695   .736†  .719†  .489   -

Table 1. Evaluation results on the proposed SMC benchmark. If possible, we also show the state-of-the-art (SOTA) performance on the tasks reported by previous work. ∗ in the SOTA row means that it is reported under a simpler (2-class instead of 3-class) problem definition [56]; † denotes that the results are reported with a different dataset partition [4]. Results that achieve or surpass the SOTA are shown in italic, and the best results among pre-trained models are shown in bold. The SOTAs are obtained from [4, 7, 16, 51, 53, 56]. For BP and DbP, we reproduce the evaluation of [48] and achieve significantly better results than what they report (0.862 and 0.698, respectively), so we show the reproduction results.

Model     Objectives           Length  SGC    BP     DbP    CR     LK     ME     VE     OTC    PS     ER     VF     MNID
M2BERT    RC4,12,12+Pianoroll  512     .403   .870   .765   .808   .779   .976   .533   .693   .733   .647   .493   .725
M2BERT    RC4,12,12+Pianoroll  1024    .405   .867   .768   .847   .804   .982   .532   .725   .742   .667   .536   .718
M2BERT    RC4,12,12+Pianoroll  2048    .397   .887   .773   .862   .805   .985   .538   .710   .746   .653   .527   .719

Table 2. The ablation study results on the choice of maximum sequence lengths (the “Length” column).

Further delving into the results on each downstream task, we found that the individual proposed objectives do not consistently benefit all downstream tasks. Comparing MLM and RC4,12,12+Pianoroll (under M2BERT), we found that the improvements of RC4,12,12+Pianoroll mainly lie in DbP, CR, ME, OTC, and ER. Remarkably, the ME accuracy is improved from 0.975 to 0.982, which represents a relative error reduction of 28%. However, in other tasks, the improvements are marginal or even zero. We suspect that the proposed objectives tend to represent the musical domain knowledge related more to pitches, intervals, and chords, and therefore benefit these tasks more (note that DbP also benefits, possibly because it is somewhat related to chord changes). On the other hand, for BP, pitch and pitch interval are less important (as discussed in [48]), and BP therefore does not benefit.
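The 28% figure for ME can be checked with a short calculation: the error rate drops from 1 − 0.975 = 0.025 to 1 − 0.982 = 0.018, and (0.025 − 0.018) / 0.025 = 0.28. The helper below, whose name is hypothetical, reproduces this arithmetic.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the original error rate removed by the improvement."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


if __name__ == "__main__":
    # ME accuracy: M2BERT with MLM vs. RC4,12,12+Pianoroll (Reduced dataset).
    print(round(relative_error_reduction(0.975, 0.982), 2))
```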
Another interesting finding lies in the result of VF, the only task in which MidiBERT outperforms the proposed M2BERT. While this suggests that M2BERT is not universally better than MidiBERT, we should also take into account that 1) the fine-tuning dataset of VF is smaller than those of other tasks; 2) VF is different from the other tasks in the sense that VF focuses more on the performer's empirical preferences rather than music theory. These factors, rather than the model itself, might be responsible for this trend. Nevertheless, we can still observe that the proposed note correction and pianoroll prediction objectives do improve the performance of VF for M2BERT.
Furthermore, our best model, M2BERT with the RC4,12,12+Pianoroll objectives, pre-trained with the Full dataset, outperforms the SOTAs in DbP, CR, ME, OTC, PS, and VF (six of the ten tasks that have a comparable SOTA). Considering that 1) we do not employ multitask learning or data augmentation, both of which are frequently used in SOTA works [7, 16, 48]; 2) we do not adopt task-specific model design or hyperparameter search; and 3) we also do not adopt task-specific input representation; these results are promising, showing the effectiveness of using a pre-trained model for symbolic music classification.
Finally, we conduct an ablation study on the choice of different maximum sequence lengths for the proposed M2BERT model. The results are shown in Table 2. We observe that for most of the downstream tasks, using a longer token length leads to better performance. Specifically, the two tasks that benefit most from a long token length are CR, which is improved by 5.4 percentage points, and LK, which is improved by 2.6 percentage points, when increasing the sequence length from 512 to 2048, at the cost of the quadratic time complexity introduced by the Transformer architecture. Considering such a trade-off, we eventually choose the sequence length of 1024 in all other experiments (see Table 1).
7. CONCLUSION
With a systematic evaluation on a benchmark incorporating 12 downstream tasks, we have demonstrated the effectiveness of using token denoising and pianoroll prediction to enhance the pre-training of a BERT-like model for symbolic music understanding. This result underscores the important insight that symbolic music pre-training should focus on optimizing meaningful musical information, and, on the other hand, reveals the limitations of applying traditional NLP methods that optimize token sequences to symbolic music data. We also show the effectiveness of using an advanced model architecture and increasing the scale of data for pre-training. Additionally, considering that our symbolic pre-trained model was only trained on at most 350M notes, whereas text pre-trained models can leverage trillions of text tokens [9, 57], there is significant potential for scaling up our approach given more data [58]. It is also important to clarify that, to focus the discussions on model pre-training strategies, our current benchmark consists solely of classification tasks that allow for direct linear probing. Therefore, to further advance symbolic music foundation models in music understanding, our future directions include developing methods to optimize other types of musical information (e.g., rhythm), addressing more integrative symbolic music understanding tasks such as the entire functional harmony recognition task and the performance MIDI-to-score conversion task, and identifying pathways to further scale up our model.
8. ETHICS STATEMENT
In this work, all the datasets used in model training and evaluation are publicly available and can be downloaded without submitting any data access application form. While this strict policy improves the reproducibility, it also makes the evaluation biased toward high-resource tasks and music genres that have a large amount of publicly available data. For example, for classical music, we mainly focus on western classical music instead of classical music from other regions. Future efforts could be made to mitigate this issue by adding more diverse datasets and downstream tasks to the evaluation benchmark.
9. ACKNOWLEDGMENTS
This work is supported in part by the National Science and
Technology Council under Grant NSTC 113-2221-E-001-013,
the Academia Sinica Grand Challenge (GCS)
Program under Grant AS–GCS–112–M07, and the
Postdoctoral Scholar Program of Academia Sinica under Grant
AS-PD-1141-M15-2.
10. REFERENCES
[1] Y. Chou, I. Chen, C. Chang, J. Ching, and
Y. Yang, “MidiBERT-Piano: Large-scale pre-training
for symbolic music understanding,” CoRR, vol.
abs/2107.05223, 2021.
[2] X. Liang et al., “PianoBART: Symbolic piano music
generation and understanding with large-scale
pre-training,” in IEEE International Conference on
Multimedia and Expo, ICME, 2024, pp. 1–6.
[3] Z. Zhao, “Adversarial-MidiBERT: Symbolic music
understanding model based on unbias pre-training and
mask fine-tuning,” CoRR, vol. abs/2407.08306, 2024.
[4] Z. Shen, L. Yang, Z. Yang, and H. Lin, “More than
simply masking: Exploring pre-training strategies for
symbolic music understanding,” in Proceedings of the
2023 ACM International Conference on Multimedia
Retrieval, 2023, pp. 540–544.
[5] M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and
T. Liu, “MusicBERT: Symbolic music understanding
with large-scale pre-training,” in Findings of the
Association for Computational Linguistics: ACL/IJCNLP,
2021, pp. 791–800.
[6] Z. Wang and G. Xia, “MuseBERT: Pre-training music
representation for music understanding and controllable
generation,” in Proceedings of the 22nd International
Society for Music Information Retrieval Conference,
ISMIR 2021, 2021, pp. 722–729.
[7] M. Sailor, “RNBert: Fine-tuning a masked language
model for Roman numeral analysis,” in Proceedings of
the 25th International Society for Music Information
Retrieval Conference, ISMIR, 2024, pp. 814–821.
[8] OpenAI, “GPT-4 technical report,” CoRR, vol.
abs/2303.08774, 2023.
[9] H. Touvron et al., “Llama: Open and efficient
foundation language models,” CoRR, vol. abs/2302.13971,
2023.
[10] E. Beeching et al., “Open LLM leaderboard,”
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard,
2023.
[11] J. Devlin, M. Chang, K. Lee, and K. Toutanova,
“BERT: pre-training of deep bidirectional transformers
for language understanding,” in Proceedings of the
2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human
Language Technologies, NAACL-HLT, 2019, pp.
4171–4186.
[12] A. L. Uitdenbogerd and J. Zobel, “Melodic matching
techniques for large music databases,” in Proceedings
of the 7th ACM International Conference on Multimedia,
1999, pp. 57–66.
[13] F. Simonetta, C. E. C. Chacón, S. Ntalampiras, and
G. Widmer, “A convolutional approach to melody line
identification in symbolic scores,” in Proceedings of
the 20th International Society for Music Information
Retrieval Conference (ISMIR), 2019, pp. 924–931.
[14] K. Kosta, W. T. Lu, G. Medeot, and P. Chanquion,
“A deep learning method for melody extraction from
a polyphonic symbolic music representation,” in
Proceedings of the 23rd International Society for Music
Information Retrieval Conference (ISMIR), 2022, pp.
757–763.
[15] N. N. López, M. Gotham, and I. Fujinaga,
“AugmentedNet: A Roman numeral analysis network with
synthetic training examples and additional tonal tasks,”
in Proceedings of the 22nd International Society for
Music Information Retrieval Conference, ISMIR, 2021,
pp. 404–411.
[16] E. Karystinaios and G. Widmer, “Roman numeral
analysis with graph neural networks: Onset-wise
predictions from note-wise features,” in Proceedings of the
24th International Society for Music Information
Retrieval Conference, ISMIR, 2023, pp. 597–604.
[17] D. Le, L. Bigo, M. Keller, and D. Herremans, “Natural
language processing methods for symbolic music
generation and information retrieval: a survey,” CoRR, vol.
abs/2402.17467, 2024.
[18] Y. Hsiao, T. Hung, T. Chen, and L. Su, “BPS-motif:
A dataset for repeated pattern discovery of polyphonic
symbolic music,” in Proceedings of the 24th
International Society for Music Information Retrieval
Conference (ISMIR), 2023, pp. 281–288.
[19] T. Collins, “2013:Discovery of Repeated Themes &
Sections,”
https://www.music-ir.org/mirex/wiki/2013:Discovery_of_Repeated_Themes_%26_Sections,
2013.
[20] D. Meredith, K. Lemström, and G. A. Wiggins,
“Algorithms for discovering repeated patterns in
multidimensional representations of polyphonic music,” Journal
of New Music Research, vol. 31, no. 4, pp. 321–345,
2002.
[21] T.-P. Chen and L. Su, “Functional harmony
recognition of symbolic music data with multi-task recurrent
neural networks,” in Proceedings of the 19th
International Society for Music Information Retrieval
Conference (ISMIR), 2018, pp. 90–97.
[22] Z.-S. Lin et al., “S3: A symbolic music dataset for
computational music analysis of symphonies,” in
Extended Abstracts for the Late-Breaking Demo Session
of the 25th International Society for Music Information
Retrieval Conference (ISMIR), 2024.
[23] H. Schreiber, “Improving genre annotations for the
Million Song Dataset,” in Proceedings of the 16th
International Society for Music Information Retrieval
Conference, ISMIR, 2015, pp. 241–247.
[24] C. Hernandez-Olivan, S. R. Llamas, and J. R. Beltrán,
“Symbolic music structure analysis with graph
representations and changepoint detection methods,” CoRR,
vol. abs/2303.13881, 2023.
[25] H. Hung, J. Ching, S. Doh, N. Kim, J. Nam, and
Y. Yang, “EMOPIA: A multi-modal pop piano dataset
for emotion recognition and emotion-based music
generation,” in Proceedings of the 22nd International
Society for Music Information Retrieval Conference,
ISMIR, 2021, pp. 318–325.
[26] J. Zhao and K. Yoshii, “Multimodal multifaceted music
emotion recognition based on self-attentive fusion of
psychology-inspired symbolic and acoustic features,”
in Asia Pacific Signal and Information Processing
Association Annual Summit and Conference, APSIPA
ASC, 2023, pp. 1641–1645.
[27] F. Foscarin, A. McLeod, P. Rigaux, F. Jacquemard, and
M. Sakai, “ASAP: a dataset of aligned scores and
performances for piano transcription,” in Proceedings of
the 21st International Society for Music Information
Retrieval Conference, ISMIR, 2020, pp. 534–541.
[28] H. Zhang, J. Tang, S. R. Rafee, S. Dixon, G. Fazekas,
and G. A. Wiggins, “ATEPP: A dataset of automatically
transcribed expressive piano performance,” in
Proceedings of the 23rd International Society for Music
Information Retrieval Conference, ISMIR, 2022,
pp. 446–453.
[29] A. Ycart and E. Benetos, “A-MAPS: Augmented
MAPS dataset with rhythm and key annotations,” in
19th International Society for Music Information
Retrieval Conference, ISMIR Late Breaking and Demo
Papers, 2018.
[30] Z. Zhou et al., “Can LLMs "reason" in music? An
evaluation of LLMs’ capability of music understanding
and generation,” in Proceedings of the 25th
International Society for Music Information Retrieval
Conference (ISMIR), 2024.
[31] R. Yuan et al., “ChatMusician: Understanding and
generating music intrinsically with LLM,” in Findings
of the Association for Computational Linguistics, ACL,
2024, pp. 6252–6271.
[32] R. Bommasani et al., “On the opportunities and risks of
foundation models,” CoRR, vol. abs/2108.07258, 2021.
[33] M. Won, Y. Hung, and D. Le, “A foundation model for
music informatics,” in IEEE International Conference
on Acoustics, Speech and Signal Processing, ICASSP,
2024, pp. 1226–1230.
[34] A. Radford et al., “Learning transferable visual models
from natural language supervision,” in Proceedings of
the 38th International Conference on Machine Learning,
ICML, vol. 139, 2021, pp. 8748–8763.
[35] S. Yang et al., “SUPERB: speech processing universal
performance benchmark,” in Interspeech 2021, 22nd
Annual Conference of the International Speech
Communication Association, 2021, pp. 1194–1198.
[36] S. Chen et al., “WavLM: Large-scale self-supervised
pre-training for full stack speech processing,” IEEE
Journal of Selected Topics in Signal Processing,
vol. 16, no. 6, pp. 1505–1518, 2022.
[37] C. Huang et al., “Dynamic-SUPERB Phase-2: A
collaboratively expanding benchmark for measuring
the capabilities of spoken language models with 180
tasks,” in The Thirteenth International Conference on
Learning Representations, ICLR 2025, 2025.
[38] Y. Li et al., “MERT: acoustic music understanding
model with large-scale self-supervised training,” in The
Twelfth International Conference on Learning
Representations, ICLR 2024, 2024.
[39] W. Liao et al., “Music foundation model as generic
booster for music downstream tasks,” CoRR, vol.
abs/2411.01135, 2024.
[40] R. Castellon, C. Donahue, and P. Liang, “Codified
audio language modeling learns useful representations
for music information retrieval,” in Proceedings of the
22nd International Society for Music Information
Retrieval Conference, ISMIR, 2021, pp. 88–96.
[41] S. Wu, Y. Wang, X. Li, F. Yu, and M. Sun, “MelodyT5:
A unified score-to-score transformer for symbolic music
processing,” in Proceedings of the 25th International
Society for Music Information Retrieval Conference,
ISMIR, 2024, pp. 642–650.
[42] W. Hsiao, J. Liu, Y. Yeh, and Y. Yang, “Compound
word transformer: Learning to compose full-song music
over dynamic directed hypergraphs,” in Thirty-Fifth
AAAI Conference on Artificial Intelligence, AAAI,
2021, pp. 178–186.
[43] A. Vaswani et al., “Attention is all you need,” in
Advances in Neural Information Processing Systems 30:
Annual Conference on Neural Information Processing
Systems, 2017, pp. 5998–6008.
[44] B. Warner et al., “Smarter, better, faster, longer: A
modern bidirectional encoder for fast, memory
efficient, and long context finetuning and inference,”
CoRR, vol. abs/2412.13663, 2024.
[45] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré,
“FlashAttention: fast and memory-efficient exact
attention with IO-awareness,” in Advances in Neural
Information Processing Systems 35: Annual Conference
on Neural Information Processing Systems 2022,
NeurIPS, 2022.
[46] J. Su, M. H. M. Ahmed, Y. Lu, S. Pan, W. Bo, and
Y. Liu, “Roformer: Enhanced transformer with rotary
position embedding,” Neurocomputing, vol. 568,
p. 127063, 2024.
[47] Y. Meng et al., “Representation deficiency in masked
language modeling,” in The Twelfth International
Conference on Learning Representations, ICLR, 2024.
[48] L. Liu, Q. Kong, V. Morfi, and E. Benetos,
“Performance MIDI-to-score conversion by neural beat
tracking,” in Proceedings of the 23rd International Society
for Music Information Retrieval Conference, ISMIR,
2022, pp. 395–402.
[49] Z. Wang et al., “POP909: A pop-song dataset for music
arrangement generation,” in Proceedings of the 21st
International Society for Music Information Retrieval
Conference, ISMIR 2020, 2020, pp. 38–45.
[50] D. Le, M. Giraud, F. Levé, and F. Maccarini, “A
corpus describing orchestral texture in first movements
of classical and early-romantic symphonies,” in DLfM
’22: 9th International Conference on Digital Libraries
for Musicology, 2022, pp. 27–35.
[51] Y.-H. Chu and L. Su, “Orchestral texture classification
with convolution,” in Extended Abstracts for the
Late-Breaking Demo Session of the 24th International
Society for Music Information Retrieval Conference
(ISMIR), 2023.
[52] Y.-H. Jen, T.-P. Chen, S.-W. Sun, and L. Su,
“Positioning left-hand movement in violin performance: A
system and user study of fingering pattern generation,” in
Proceedings of the 26th International Conference on
Intelligent User Interfaces, 2021, pp. 208–212.
[53] W. Lin, Y. F. Wang, and L. Su, “Enhancing violin
fingering generation through audio-symbolic fusion,” in
IEEE International Conference on Acoustics, Speech
and Signal Processing, ICASSP, 2024, pp. 811–815.
[54] M. Wortsman, T. Dettmers, L. Zettlemoyer, A. Morcos,
A. Farhadi, and L. Schmidt, “Stable and low-precision
training for large-scale vision-language models,”
in Advances in Neural Information Processing
Systems 36: Annual Conference on Neural Information
Processing Systems 2023, NeurIPS, 2023.
[55] C. Raffel, “Learning-based methods for comparing
sequences, with applications to audio-to-MIDI alignment
and matching,” Ph.D. dissertation, Columbia
University, USA, 2016.
[56] J. Zhao, D. Taniar, K. Adhinugraha, V. M. Baskaran,
and K. Wong, “Multi-mmlg: a novel framework of
extracting multiple main melodies from MIDI files,”
Neural Comput. Appl., vol. 35, no. 30, pp. 22687–22704,
2023.
[57] A. Dubey et al., “The Llama 3 herd of models,” CoRR,
vol. abs/2407.21783, 2024.
[58] J. Hestness et al., “Deep learning scaling is predictable,
empirically,” CoRR, vol. abs/1712.00409, 2017.