SCALING SELF-SUPERVISED REPRESENTATION LEARNING
FOR SYMBOLIC PIANO PERFORMANCE
Louis Bradshaw¹,⁴  Honglu Fan³,⁴  Alexander Spangher²,⁴  Stella Biderman⁴  Simon Colton¹
¹ Queen Mary University of London  ² University of Southern California
³ University of Geneva  ⁴ EleutherAI
[email protected], [email protected], [email protected]
ABSTRACT
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions. After first pretraining on approximately 60,000 hours of music, we use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings by adapting the SimCLR framework to symbolic music. When evaluating piano continuation coherence, our generative model outperforms leading symbolic generation techniques and remains competitive with proprietary audio generation models. On MIR classification benchmarks, frozen representations from our contrastive model achieve state-of-the-art results in linear probe experiments, while direct finetuning demonstrates the generalizability of pretrained representations, often requiring only a few hundred labeled examples to specialize to downstream tasks.
1. INTRODUCTION
Modern machine learning systems increasingly utilize self-supervised learning (SSL) as a core component of their training pipeline. In this paradigm, general-purpose representations are learned during an initial phase of self-guided learning, which can then be adapted to specialized tasks, often outperforming purely supervised approaches, particularly when access to supervised data is limited [1].
As in other fields, recent work using neural networks to model symbolic music has started to adopt SSL [2–5]. However, the symbolic music data that these models are trained on is typically created manually, in a labor-intensive process. Acquiring it at the scale common to other modalities (e.g., text, images, audio) is challenging. Consequently, successful research often involves training from scratch on datasets such as Lakh and IMSLP [6,7], with research problems formulated around tasks that directly align with these datasets (e.g., multi-track symbolic music generation). This contrasts with other domains where substantial efforts have produced generalist models trained at an extreme scale, such as LLaMA and CLIP [8,9], which provide strong foundations for research in data-limited settings [10,11]. These constraints on symbolic music research become particularly clear when considering advancements in the neighboring area of audio modeling, where large-scale models including AudioGen and AudioLM [12, 13], alongside their underlying neural audio codecs [14, 15], have driven a broad range of advancements in music generation [16–18], and where SSL has been applied at scale to develop effective, general-purpose embedding models [19, 20].
© L. Bradshaw, H. Fan, A. Spangher, S. Biderman and S. Colton. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: L. Bradshaw, H. Fan, A. Spangher, S. Biderman and S. Colton, “Scaling Self-Supervised Representation Learning for Symbolic Piano Performance”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
[Figure 1: t-SNE scatter plot of embeddings, colored by composer. Legend: Scriabin, Chopin, Schubert, Bach, Schumann, Haydn, Beethoven, Mozart, Satie, Brahms, Tchaikovsky, Liszt, Debussy, Rachmaninoff, Ravel, Handel.]
Figure 1. t-SNE visualisation of contrastive embeddings of classical compositions, trained on MIDI data without external metadata. The cross (×) highlights Chopin’s Waltz in A minor, which was discovered¹ after the training data was compiled, ensuring that it was not included.
Fortunately, strong progress has been made towards alleviating data bottlenecks for symbolic music research by leveraging neural networks trained for automatic music transcription (AMT) [21]. In the restricted domain of solo-piano audio recordings, modern AMT models achieve highly reliable note-identification accuracy [22–24], enabling automated dataset curation pipelines that crawl raw audio and transcribe it into MIDI using a combination of web scraping, audio-based processing, and AMT methods [25–27]. Moreover, as this symbolic data is transcribed from real recordings, it captures the subtleties and dynamics of human performance. Recently, this combined progress has resulted in a new dataset of symbolic music, Aria-MIDI [28], comprising transcriptions of solo-piano recordings gathered at scale from YouTube, which has been made available for public use. At ~100k hours, Aria-MIDI is orders of magnitude larger than similar datasets [25], presenting a unique opportunity to investigate the application of scaling SSL methods to symbolic music modeling.
¹ See Javier C. Hernández, “Hear a Chopin Waltz Unearthed After Nearly 200 Years,” The New York Times, Oct. 27, 2024.
Building on this, in this work we leverage Aria-MIDI to pretrain a generative transformer model via next-token prediction, using it as a foundation to explore the effectiveness of SSL techniques applied to symbolic music at a scale close to recent applications in the text, image, and audio domains. We evaluate our model across two dimensions: generative modeling and representation learning. For generative capabilities, we conduct human listening tests comparing piano continuations generated by our model, while for representation learning we measure the ability of the pretrained model to adapt to MIR classification tasks via finetuning. To explore applications to similarity and retrieval tasks, we propose and analyze a novel self-supervised adaptation of the contrastive learning framework to symbolic music, which finetunes our model to produce embeddings that capture performance and composition-level features, as demonstrated by the natural composer clustering visualized in Figure 1. In both evaluation settings, we compare against symbolic and audio-based baselines. Overall, our experiments provide strong evidence that scaling SSL is a promising approach to tackling difficult tasks across symbolic MIR. Our key contributions are the following:
1. We introduce and open-source Aria², a pretrained autoregressive transformer model trained on transcriptions of piano recordings. Through human listening tests, we show it generates coherent continuations from short musical prompts, outperforming Anticipatory Music Transformer [29] and rivaling proprietary audio models like Suno 3.5 [30].
2. We further demonstrate the effectiveness of large-scale pretrained representations for symbolic MIR through two approaches: (1) directly finetuning our model for classification tasks, achieving strong performance when labeled examples are extremely limited, and (2) proposing a novel adaptation of contrastive learning that produces an embedding model achieving state-of-the-art accuracy in linear probe experiments including composer, genre, and style detection. Critically, we show that this contrastive approach is effective only when applied as a secondary finetuning phase.
In addition to our models, we release a MIDI preprocessing and tokenization library designed to scale to large datasets and, although this work focuses on solo piano, to natively support multi-track MIDI files. Together, these contributions may serve as a foundation for future research in symbolic music modeling.
² Available at: https://github.com/eleutherai/aria
2. RELATED WORK
Our work relates to many sub-areas of computational music, generative modeling, and representation learning. In this section, we focus on related work specific to the subfield of symbolic music modeling.
The field of symbolic music generation using neural networks has advanced rapidly. Prior to the introduction of transformers, models such as DeepBach [31] and Coconet [32] demonstrated that neural networks are effective tools for modeling musical harmonies in Baroque music. The autoregressive paradigm for symbolic music generation, which models music as a stream of tokens, gained traction by adapting architectures from natural language processing [33]. This approach was extended by [34] to incorporate expressive onset and duration timings, enabling generated music to more closely emulate human performance.
Music Transformer [35] was a seminal work demonstrating the power and scalability of the autoregressive approach. The authors trained a transformer decoder on the MAESTRO dataset [36], a collection of expressive MIDI piano recordings, and showed that autoregressive models could effectively learn long-term musical dependencies. Subsequent work from the same authors provided strong evidence that the musical and creative capabilities of their model scale well with dataset size [37], reinforcing the value of curating large-scale piano transcription datasets as a future direction, a central premise we explore in our work.
Building on this foundation, MuseNet [38] expanded this approach by adding multi-track support to its MIDI tokenizer and training a large model on a diverse corpus of multi-instrument data, including MAESTRO. Alternative tokenization schemes, such as REMI [39], have also been influential. Variations of REMI have been adopted by models including Museformer [40], Figaro [41], and MuseCoco [42], which all introduced methods for conditioning generation on various musical features. Other research has explored representations beyond MIDI, such as the ABC notation [43] used by MuPT [3]. More recently, Anticipatory Music Transformer [29] was introduced as a versatile, state-of-the-art model for prompt continuation and infilling tasks with expressive millisecond-level precision.
For representation learning, several methods have been developed to produce symbolic music embeddings, useful as feature extractors for downstream classification tasks. These include MusicVAE [44], a variational autoencoder for capturing long-term structure; MusicBERT [4], which learns self-supervised representations via a bar-masking objective; and the CLaMP series of models [5,45,46], which employ contrastive learning techniques to build cross-modal representations with natural language descriptions.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[Figure 2: a piano-roll (time axis 0 to 8 seconds; pitches C, E, G) and its token sequence under four tokenizers.]
Music Transformer: SHIFT 1000MS | SET_VEL 60 | NOTE_ON 60 | SHIFT 1000MS | SHIFT 1000MS | NOTE_ON 64 | SHIFT 1000MS | NOTE_OFF 60 | SHIFT 1000MS ...
MuseNet: WAIT 1000MS | PIANO P: C4 V: 60 | WAIT 2000MS | PIANO P: E4 V: 60 | WAIT 1000MS | PIANO P: C4 V: 0 | WAIT 1000MS | PIANO P: G4 V: 60 | WAIT 1000MS ...
REMI: BAR | POSITION 2/4 | PITCH C4 | VELOCITY 60 | DURATION 3/4 | POSITION 4/4 | PITCH E4 | VELOCITY 60 | DURATION 3/4 ...
Aria: PIANO P: 60 V: 60 | ONSET 1000MS | DURATION 3000MS | PIANO P: 64 V: 60 | ONSET 3000MS | DURATION 3000MS | <T> | PIANO P: 67 V: 60 | ONSET 0MS ...
Figure 2. Comparison of different tokenizations of a piano-roll, using various approaches. Music Transformer [35] and MuseNet [38] track the passage of time using time-shift tokens, whereas Aria uses absolute onsets relative to the current segment. The REMI tokenizer [39] uses a neural beat-tracking model to estimate positions of notes and bar delimiters [47].
3. METHOD
To explore the capabilities of large-scale self-supervised models of piano performance, we first pretrained an autoregressive transformer model using next-token prediction on a refined subset of the Aria-MIDI dataset. We adopt this setup due to its versatility: next-token prediction has a proven track record in generative modeling for both symbolic and audio-based music [13,35], as well as adaptability to downstream tasks via finetuning [48]. Apart from the tokenization scheme, which we hand-designed, we used a conventional modern transformer architecture with minimal modifications, providing a standardized foundation for evaluating our hypothesis and supporting further research.
3.1 MIDI Tokenization
To autoregressively model MIDI files as streams of discrete tokens, we chose to use a temporal resolution of 10 milliseconds for note onsets and durations, and discretize note velocity values into 12 bins. Our tokenizer is designed to natively handle multi-track (multi-instrument) MIDI files by condensing the 128 MIDI instruments, corresponding to program_change MIDI messages, into 13 instrument classes, including one for percussion.
Given a MIDI file, we resolve its constituent note_on and note_off events into a list of notes. For non-percussion instruments, we tokenize a note with pitch $p$, velocity $v$, and absolute onset/offset in milliseconds $(t_{\text{on}}, t_{\text{off}})$ as a triple of tokens:
[instrument, $p$, $v$], [onset: $t_{\text{on}}$], [duration: $t_{\text{off}} - t_{\text{on}}$]
For percussion, we tokenize a note with note number $n$ and onset $t_{\text{on}}$ as:
[drum, $n$], [onset: $t_{\text{on}}$]
The tokenization of an entire MIDI file is constructed by concatenating the tokenizations of the constituent notes in order of onset. MIDI metadata, such as key, tempo, and time signature, is discarded, and other relevant musical information, such as the sustain pedal, is incorporated directly into the duration tokens. This schema is set apart from some popular tokenization techniques used for symbolic music, such as REMI [39] and text-based score representations ABC [43] and MusicXML [49], as it does not include beat or bar information, instead representing onsets and durations in milliseconds.
In the MIDI standard [50], note_on and note_off events are spaced temporally by specifying a number of ticks to wait before processing the next event. For Music Transformer and MuseNet, the authors incorporate this into their chosen MIDI tokenization schemes [35, 38], using time-shift tokens to separate notes rather than specifying their absolute onset times. However, emerging work has provided evidence that using time-shift tokens in this way may be suboptimal in transformer-based models, resulting in reduced accuracy in sequence-to-sequence piano transcription [51], and unstable rhythm or drifting bar lines in musical generations [39]. One possible explanation is that when using relative-timing tokenization, autoregressive models struggle to maintain an exact temporal representation of the prior context, as they must sum up many sequential time-shift values to calculate temporal relationships between notes with medium to long-term dependencies. Previous studies on large language models have demonstrated that transformers can struggle with exactly this sort of arithmetic [52, 53].
In preliminary investigations, we also observed negative effects when using relative-timing tokenizations, particularly temporal instability in passages with rapid note sequences. To address these issues, we chose to adopt absolute onset times in our tokenizer. We implemented this by dividing the music into 5000-millisecond segments and recording note onsets relative to the start of each segment. This helped us avoid expanding the tokenizer’s vocabulary to include all possible absolute onset times. To remove ambiguity, we marked the start of each new segment using a special token: <T>. We designed this to resemble note timing using beat-position within a bar; however, unlike tokenization schemes that do this directly [39, 54], our approach is applicable to MIDI files that lack beat and bar information, such as those transcribed from solo piano recordings. Figure 2 illustrates how our approach differs from other approaches.
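
The segment-relative scheme described in this section can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the released tokenizer: the `Note` type, the `tokenize` function, the tuple token format, and the velocity binning (multiples of 10 clamped to 10..120) are hypothetical choices made for this example. Only the 10 ms resolution, the 12 velocity bins, the 5000 ms segments, and the <T> token are taken from the text.

```python
from typing import List, NamedTuple, Tuple, Union

SEGMENT_MS = 5000    # segment length stated in the text
TIME_STEP_MS = 10    # temporal resolution stated in the text
VELOCITY_STEP = 10   # assumption: 12 bins at 10, 20, ..., 120

Token = Union[str, Tuple]


class Note(NamedTuple):
    """Hypothetical note record resolved from note_on/note_off events."""
    instrument: str  # e.g. "piano", or "drum" for percussion
    pitch: int
    velocity: int
    onset_ms: int
    offset_ms: int


def quantize(value: int, step: int) -> int:
    """Round a non-negative integer to the nearest multiple of step."""
    return (value + step // 2) // step * step


def velocity_bin(velocity: int) -> int:
    """Map a MIDI velocity onto one of 12 assumed bins."""
    return min(max(quantize(velocity, VELOCITY_STEP), VELOCITY_STEP),
               12 * VELOCITY_STEP)


def tokenize(notes: List[Note]) -> List[Token]:
    """Emit note tokens in onset order, with <T> at each segment boundary."""
    tokens: List[Token] = []
    segment = 0
    for note in sorted(notes, key=lambda n: n.onset_ms):
        onset = quantize(note.onset_ms, TIME_STEP_MS)
        # One <T> per boundary crossed, including empty segments.
        while onset >= (segment + 1) * SEGMENT_MS:
            tokens.append("<T>")
            segment += 1
        relative_onset = onset - segment * SEGMENT_MS
        if note.instrument == "drum":
            tokens += [("drum", note.pitch), ("onset", relative_onset)]
        else:
            duration = quantize(note.offset_ms - note.onset_ms, TIME_STEP_MS)
            tokens += [
                (note.instrument, note.pitch, velocity_bin(note.velocity)),
                ("onset", relative_onset),
                ("duration", duration),
            ]
    return tokens
```

Applied to the three notes of Figure 2 (pitches 60, 64, and 67 with onsets at 1000, 3000, and 5000 ms), this sketch places a <T> token before the third note and gives it an onset of 0 ms, as in the Aria row of the figure.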
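\[
T_i - T_j =
\begin{cases}
\sum_{k=j+1}^{i} w_k & \text{Relative} \\
C(\texttt{<T>}, i, j) + \tilde{o}_i - \tilde{o}_j & \text{Hybrid (Ours)} \\
o_i - o_j & \text{Absolute}
\end{cases}
\tag{1}
\]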
Equation 1 demonstrates the arithmetic required to calculate the time separating two notes $n_i$ and $n_j$ across the different tokenization approaches, where $w_k$ denotes the length of the time-shift message preceding note $k$, $C(\texttt{<T>}, i, j)$ represents the total time spanned by complete 5000ms segments between notes $n_i$ and $n_j$, calculated by counting the number of segment tokens and multiplying by the segment duration, $o_k$ represents the absolute onset time of note $k$, and $\tilde{o}_k$ represents the adjusted absolute onset time of note $k$ relative to its 5000ms segment.
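
As a worked illustration of Equation 1, the following Python sketch computes the same time gap under the three schemes. The function names and the example onsets are hypothetical; the point is only that the relative scheme needs a sum over every intervening time-shift, while the hybrid scheme needs one segment count and two onsets.

```python
SEGMENT_MS = 5000  # segment duration stated in the text


def gap_relative(shifts, i, j):
    """Relative timing: sum the time-shifts w_{j+1}, ..., w_i."""
    return sum(shifts[j + 1 : i + 1])


def gap_hybrid(segment_index, relative_onset, i, j):
    """Hybrid timing: (number of <T> tokens between the notes) * segment
    duration, plus the difference of the segment-relative onsets."""
    complete = (segment_index[i] - segment_index[j]) * SEGMENT_MS
    return complete + relative_onset[i] - relative_onset[j]


def gap_absolute(onsets, i, j):
    """Absolute timing: a single subtraction."""
    return onsets[i] - onsets[j]


# Hypothetical example: four note onsets in milliseconds.
onsets = [1000, 3000, 5000, 12400]
shifts = [onsets[0]] + [b - a for a, b in zip(onsets, onsets[1:])]
segment_index = [o // SEGMENT_MS for o in onsets]
relative_onset = [o % SEGMENT_MS for o in onsets]
```

For the first and last notes in this example, all three functions return 11400 ms.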
3.2 Model
Our model architecture builds upon the LLaMa 3.2 model family, chosen due to its effectiveness in autoregressive tasks across modalities [55]. Using the 1B parameter configuration as a starting point, we made several architectural modifications. Firstly, guided by established principles on model-data ratios for language models [56], we reduced the hidden state dimension ($d_{\text{model}}$) from 2048 to 1536. This decreased the parameter count by roughly half, balancing model capacity with computational efficiency for our dataset scale. Secondly, we simplified the architecture by opting for standard multi-head attention (with 24 heads) and layer normalization [57,58], instead of grouped-query attention and RMS normalization as used in standard LLaMa 3 variants [59, 60].
Pretraining dataset. As our training corpus consists of automatically transcribed internet-sourced piano recordings, significant variability exists in transcription quality and content suitability, potentially introducing harmful biases or noisy data into downstream models. To mitigate this, we implemented rigorous preprocessing steps. To reduce memorization, we addressed extreme cases of composition duplication, such as repeated performances of overrepresented works, by applying filtering based on compositional metadata. Specifically, for composers with more than 250 instances of files containing opus and/or piece number tags, we retained at most 10 instances per opus/piece-number pair. For these same composers, we also discarded all other files that lack compositional identifiers. Additionally, we employed heuristic-based filtering, considering note density, pitch and duration entropy, silence, and indicators of repetitive content, to exclude problematic entries (e.g., Black MIDI³). Following these steps, our refined pretraining corpus comprises 820,944 MIDI files, amounting to 60,473 hours of solo piano music.
Pretraining recipe. We pretrained our model using standard next-token prediction on concatenated sequences of tokenized MIDI files, as detailed in Section 3.1. A sequence length of 8192 tokens was chosen to balance computational constraints with the need to learn meaningful short- and long-term dependencies within piano music. To enhance generalization and prevent overfitting, we utilized online data augmentation, randomly transposing (±5 semitones), varying tempo (±20%), and adjusting MIDI velocity (±10).
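
The online augmentation described above might be sketched as follows. This is a hedged illustration, not the authors' implementation: the `augment` function, the tuple note format, uniform sampling of one offset per file, dropping notes that leave the MIDI pitch range, and clamping velocities to 1..127 are all assumptions. Only the three ranges (±5 semitones, ±20% tempo, ±10 velocity) come from the text.

```python
import random
from typing import List, Tuple

# Hypothetical note format: (pitch, velocity, onset_ms, duration_ms)
Note = Tuple[int, int, int, int]


def augment(notes: List[Note], rng: random.Random) -> List[Note]:
    """Apply one random transposition, tempo change, and velocity shift."""
    semitones = rng.randint(-5, 5)          # transpose by up to 5 semitones
    tempo_factor = rng.uniform(0.8, 1.2)    # tempo change of up to 20%
    velocity_shift = rng.randint(-10, 10)   # velocity offset of up to 10

    augmented = []
    for pitch, velocity, onset_ms, duration_ms in notes:
        new_pitch = pitch + semitones
        if not 0 <= new_pitch <= 127:
            continue  # assumption: drop notes outside the MIDI pitch range
        new_velocity = min(127, max(1, velocity + velocity_shift))
        # A faster tempo (factor > 1) compresses onsets and durations.
        augmented.append((
            new_pitch,
            new_velocity,
            round(onset_ms / tempo_factor),
            round(duration_ms / tempo_factor),
        ))
    return augmented
```

Sampling one set of parameters per file keeps intervals, rhythm ratios, and relative dynamics within the file intact, which is the usual motivation for this style of augmentation.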
Generative finetuning. We produced a model variant tailored to generative piano-continuation tasks by applying a single-epoch finetuning phase after pretraining, annealing the learning rate to zero while training on higher-quality data. To enhance data quality, we removed all identified compositional duplicates, tightened existing quality filters, and introduced an additional filter aimed at excluding transcriptions of synthesized MIDI files⁴. Additionally, during this phase, each training sequence begins at the start of a new file (i.e., non-concatenated), and we insert a special token (<D>) approximately 100 tokens before the end of each training example to enable explicit inference-time control over generation endings.
3.3 Contrastive Representation Learning
To investigate the strength of the pretrained representations, we propose a secondary finetuning stage, adapting the pretrained model to generate embeddings for tokenized sequences. Our approach leverages the SimCLR framework for contrastive representation learning [61]. In SimCLR, an encoder is trained to produce similar embeddings for different views of the same training example while simultaneously pushing embeddings from unrelated examples apart through minimization of a contrastive loss. This approach has demonstrated strong results in music, capturing semantic relationships within embeddings effectively [62,63], and has recently been combined with large pretrained language models to produce rich textual embeddings [64, 65].
To generate two distinct views of a MIDI file, we randomly extract two different contiguous slices, each comprising between 100 and 650 notes (approximately 300–2000 tokens). Each slice undergoes independent data augmentation using our standard procedures before tokenization. To produce sequence embeddings, we replace the original language modeling head with an embedding head, projecting the final hidden state into a 512-dimensional embedding space. We derive a slice’s embedding from the hidden state associated with an end-of-sequence token appended after the final note token-triple.
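
One way to draw the two views is sketched below in Python. The sampling procedure is an assumption made for illustration: the text specifies only that the slices are contiguous, contain between 100 and 650 notes, and (as noted later in this section) do not overlap. The function name `sample_two_slices` and the choice to reject files with fewer than 200 notes are hypothetical.

```python
import random
from typing import Tuple

Slice = Tuple[int, int]  # half-open note index range [start, end)


def sample_two_slices(
    num_notes: int,
    rng: random.Random,
    min_len: int = 100,
    max_len: int = 650,
) -> Tuple[Slice, Slice]:
    """Sample two non-overlapping contiguous slices of a note list."""
    if num_notes < 2 * min_len:
        # Assumption: files too short for two slices are skipped upstream.
        raise ValueError("file too short for two non-overlapping slices")

    # Choose lengths so that both slices always fit in the file.
    len_a = rng.randint(min_len, min(max_len, num_notes - min_len))
    len_b = rng.randint(min_len, min(max_len, num_notes - len_a))

    # Distribute the unused notes before, between, and after the slices.
    slack = num_notes - len_a - len_b
    start_a = rng.randint(0, slack)
    start_b = start_a + len_a + rng.randint(0, slack - start_a)

    first = (start_a, start_a + len_a)
    second = (start_b, start_b + len_b)
    # Randomize which slice plays the role of the first view.
    return (first, second) if rng.random() < 0.5 else (second, first)
```

Each returned range would then be augmented and tokenized independently before being passed through the embedding model.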
To calculate the contrastive loss, we use the normalized temperature-scaled cross-entropy loss, NT-Xent, over minibatches of related embedding pairs:
\[
\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}
\tag{2}
\]
³ https://en.wikipedia.org/wiki/Black_MIDI
⁴ Preprocessing details: https://github.com/loubbrad/aria-midi
Here, $\mathrm{sim}(z_k, z_l)$ denotes the cosine similarity between normalized embeddings $z_k$ and $z_l$, $\mathbb{1}_{[k \neq i]} \in \{0, 1\}$ is an indicator function, and $\tau$ is the temperature parameter. Each minibatch consists of $N$ MIDI files, from which we construct $N$ pairs of related embeddings (i.e., $2N$ total embeddings), $\{z_i, z_{i+N}\}_{i=1,\dots,N}$, where both $z_i$ and $z_{i+N}$ are derived from two augmented views of the same file. We train the model by minimizing the symmetric loss: $\mathcal{L} := \frac{1}{2}\sum_{k=1}^{N}\left(\ell_{k,k+N} + \ell_{k+N,k}\right)$.
This setup has two key advantages. First, by extracting non-overlapping slices from the same file, the model learns embeddings reflecting higher-level musical semantics such as genre, composer, style, and performance nuances, rather than local details. This is important for musical performances, where standard supervised representation learning approaches, e.g., MuLan [66], are limited due to the descriptive subtlety and complexity of musical attributes. Second, our approach facilitates studying how effectively next-token prediction representations transfer to contrastive embedding frameworks. When trained from scratch, SimCLR-inspired training methods typically require large numbers of in-batch negatives, which pose significant VRAM constraints [61]. However, recent work on text embeddings shows that initializing contrastive training from pretrained models can alleviate this [64]. Thus, our method introduces a general-purpose semi-supervised framework for representation learning for symbolic music, which allows us to evaluate the transferability of next-token musical representations.
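
The NT-Xent objective of Equation 2 and the symmetric loss can be written out directly. The following dependency-free Python sketch is for illustration only; a practical implementation would use batched tensor operations, and the function names are hypothetical. It follows the indexing used in the text, where embeddings at positions k and k+N come from the same file.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent(z: List[Sequence[float]], tau: float = 0.1) -> float:
    """Symmetric NT-Xent loss over 2N embeddings.

    z[k] and z[k + N] are the two views of file k. The default
    temperature matches the tau = 0.1 used in the experiments.
    """
    two_n = len(z)
    n = two_n // 2

    def pair_loss(i: int, j: int) -> float:
        numerator = math.exp(cosine(z[i], z[j]) / tau)
        # The indicator 1[k != i] removes the self-similarity term.
        denominator = sum(
            math.exp(cosine(z[i], z[k]) / tau)
            for k in range(two_n) if k != i
        )
        return -math.log(numerator / denominator)

    total = sum(pair_loss(k, k + n) + pair_loss(k + n, k) for k in range(n))
    return 0.5 * total
```

The loss is small when each embedding is closer to its paired view than to every other embedding in the batch, which is why larger batches supply more informative negatives.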
4. EXPERIMENTS
Having outlined our methodology, we evaluate the generative capabilities of our model, as well as the contrastive representation learning framework, in the context of piano performance. To understand its capabilities in the wider area of models for generative music and MIR, we compare our approach to both symbolic and audio-based baselines, utilizing Pianoteq [67] to synthesize MIDI files into audio.
4.1 Setup
We pretrained our model using the AdamW optimizer for 75 epochs over the training corpus. We used a learning rate of 3e-4 with 1000 warmup steps, followed by a linear decay to 10% of the initial rate over the course of training. The model has approximately 650 million parameters and was pretrained for 9 days on 8 H100 GPUs with a batch size of 16 per GPU.
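
The pretraining learning-rate schedule can be expressed as a small function. This sketch assumes a linear warmup starting from zero, which the text does not specify; the function name and the `total_steps` argument are hypothetical, while the peak rate, warmup length, and final fraction are taken from the description above.

```python
def learning_rate(
    step: int,
    total_steps: int,
    peak: float = 3e-4,
    warmup_steps: int = 1000,
    final_fraction: float = 0.1,
) -> float:
    """Linear warmup to `peak`, then linear decay to `final_fraction * peak`."""
    if step < warmup_steps:
        # Assumption: warmup rises linearly from zero.
        return peak * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak * (1.0 - (1.0 - final_fraction) * progress)
```

Setting `final_fraction=0.0` gives the decay-to-zero variant used for the generative finetuning phase and the linear probes.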
In the contrastive finetuning stage, we used a learning rate of 1e-5 with the same linear decay schedule. We set the NT-Xent temperature parameter to $\tau = 0.1$. This phase lasted 25 epochs, during which each MIDI file contributes exactly one pair of augmented views per epoch. We trained on the reduced finetuning dataset described in Section 3.2; however, we relaxed the preprocessing constraints on compositional duplicates to encourage the model to distinguish between different performances of popular compositions.
Generative modeling. Following the generative finetuning procedure described in Section 3.2, we explore the generative capabilities of the resulting model by analyzing the musical coherence of continuations of short solo piano prompts. This methodology aligns with evaluations in previous work [13,29], and mitigates taste bias by having participants evaluate continuations within the same musical style.

Compared Model    Wins   Ties   Losses   p-value
AM Transformer    38     0      6        9.43e-7
Suno 3.5          18     9      21       7.49e-1
MusicGen          49     1      0        3.55e-15
Ground Truth      15     9      17       8.60e-1

Table 1. Pairwise human preference results comparing musical coherence of 45-second continuations of 15-second prompts. We report the number of times our model won, tied, or lost against the listed model. P-values are computed using a two-sided binomial test on non-tied comparisons.
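
The two-sided binomial test used for Table 1 can be reproduced with the Python standard library. The sketch below assumes the null hypothesis that wins and losses are equally likely, in which case the two-sided p-value is twice the smaller tail probability, capped at 1. The function name is hypothetical.

```python
from math import comb


def two_sided_binomial_p(wins: int, losses: int) -> float:
    """Exact two-sided binomial test against p = 0.5, ignoring ties."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of a result at least as extreme in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # The null distribution is symmetric, so double the tail and cap at 1.
    return min(1.0, 2 * tail)


# Win and loss counts from Table 1 (ties excluded).
table_1 = {
    "AM Transformer": (38, 6),
    "Suno 3.5": (18, 21),
    "MusicGen": (49, 0),
    "Ground Truth": (15, 17),
}
p_values = {name: two_sided_binomial_p(w, l) for name, (w, l) in table_1.items()}
```

With the counts in Table 1, this computation agrees with the reported p-values to three significant figures.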
In our listening test, we asked 46 participants with at least one year of musical training to compare 45-second continuations generated from 15-second solo piano prompts, evaluating their musical coherence. Participants were presented with a series of random pairwise A/B comparisons, where they were asked to indicate their preferred continuation, guided by criteria such as melodic development, rhythmic structure, harmonic progression, and stylistic coherence. To generate test samples, we selected five prompts representing different subgenres of solo piano music, and generated eight continuations per prompt (totaling 40 continuations per model). We compared our model’s outputs against several baselines, including Anticipatory Music Transformer (music-large-800k) [29], the audio-based generative models MusicGen (large) [16] and Suno 3.5 [30], and the human-composed ground-truth.
Contrastive embeddings. We evaluate our approach to learning contrastive embeddings by training linear classifiers on the frozen embeddings produced by different models and comparing their performance on held-out test sets. We assess performance using established benchmarks, Pianist8 [68] and VG-MIDI [69], as well as new benchmarks we derive from Aria-MIDI metadata. Specifically, we extracted label-balanced train-test splits comprising 10,000 and 1,000 files, respectively, for four classification tasks: Genre (2 classes), Musical Period (4 classes), Form (6 classes), and Composer (10 classes). For comparison, we include results from CLaMP 3 (saas) [46], M3 [45], and the audio-based model MERT [70]. Linear classifiers were trained on global file embeddings obtained by averaging slice embeddings within each file. We trained with a learning rate of 3e-4 and a linear decay schedule to 0, running separate experiments with 10, 20, and 50 epochs, and reporting the best result.
Supervised finetuning. To complement our linear probe experiments, we evaluate how well our pretrained model adapts to supervised musical classification tasks, employing finetuning techniques inspired by NLP literature [48, 71]. For classifier finetuning, we replaced the language modeling head with a classification head, predicting labels directly from the hidden state of the end-of-sequence token. During this phase, we finetuned all model weights end-to-end using a learning rate of 1e-5 (without warmup) with a linear decay schedule, and applied dropout to residual connections, increasing the dropout rate linearly from $p_d = 0.0$ (first layer) to $p_d = 0.2$ (final layer). By systematically varying the number of labeled training examples, using class-balanced subsets, we analyze our pretrained model’s ability to adapt to supervised symbolic MIR tasks in scenarios with limited labeled data. In each case, we trained for 10 epochs and report the results from the best-performing epoch.

Model         Genre           Form            Musical Period  Composer        Pianist8        VG-MIDI
              Acc     F1      Acc     F1      Acc     F1      Acc     F1      Acc     F1      Acc     F1
Main Results
MERT          83.00   83.00   63.89   63.90   69.50   68.94   69.60   69.30   65.06   65.18   45.45   40.37
M3            85.10   85.10   69.88   70.12   71.20   70.81   71.90   71.72   81.93   81.48   54.55   46.13
CLaMP 3       89.10   89.10   77.79   77.97   80.60   80.20   84.50   84.46   80.72   79.76   45.45   36.53
Aria_Emb      92.40   92.40   82.45   82.57   84.70   84.69   90.50   90.49   91.57   92.38   63.64   63.96
Aria_F        93.20   93.20   87.53   87.59   86.50   86.53   96.30   96.32   91.56   92.03   68.18   69.55
Embeddings
Aria†_e=25    82.30   82.30   66.94   66.96   69.00   68.50   65.50   65.41   84.34   84.56   59.09   54.29
Aria_e=1      92.90   92.90   80.53   80.69   83.80   83.71   87.60   87.62   92.77   93.71   59.09   57.80
Aria_τ=0.05   92.40   92.40   81.34   81.48   84.00   83.85   89.90   89.90   95.18   95.71   59.09   54.32
Aria_τ=0.5    92.30   92.30   73.43   73.63   80.70   80.56   70.20   70.05   91.57   92.70   54.55   45.00
Finetuning
Aria_n=100    89.50   89.50   68.26   68.20   70.20   70.64   65.30   64.10   -       -       -       -
Aria_n=200    91.10   91.10   75.25   75.54   75.10   75.68   78.10   78.08   -       -       -       -
Aria_n=500    90.80   90.80   79.31   79.49   80.90   80.91   85.20   85.18   -       -       -       -
Aria_n=1000   91.40   91.40   80.63   80.68   82.90   83.01   90.10   90.12   -       -       -       -

Table 2. Classification performance across symbolic music tasks. We report maximum accuracy (Acc) and macro-F1 scores (F1) for each task. Main Results compare our embedding model (Aria_Emb) and supervised finetuned model (Aria_F) to other models (MERT, M3, CLaMP 3). Embedding ablations vary key components of the contrastive learning setup: training epochs (e), temperature parameter (τ), and without pretraining (†), while keeping all other settings the same as Aria_Emb. Finetuning ablations show test-set performance as a function of the number of labeled training files (n).
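
The layer-wise dropout schedule used during supervised finetuning amounts to a linear interpolation across layers. The helper below is a hypothetical sketch; the number of layers is left as a parameter because it is not stated in this section.

```python
from typing import List


def residual_dropout_rates(
    num_layers: int, first: float = 0.0, last: float = 0.2
) -> List[float]:
    """Dropout rate per layer, rising linearly from `first` to `last`."""
    if num_layers == 1:
        return [first]
    step = (last - first) / (num_layers - 1)
    return [first + step * layer for layer in range(num_layers)]
```

Under this schedule the earliest layers are left almost unregularized, while the layers closest to the classification head receive the strongest dropout.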
4.2 Results
Table 1 reports the results of our listening test. Participants consistently preferred the musical coherence of continuations produced by our model over those from Anticipatory Music Transformer and MusicGen. This signals a notable improvement in symbolic models for piano performance generation, which we primarily attribute to the scale of our training dataset, given our standardized setup. It also highlights limitations in audio models like MusicGen, whose restricted context window necessitates sliding-window inference, diminishing coherence in longer generations. Conversely, we found no statistically significant preference difference between our model’s outputs and either Suno 3.5 or human-composed ground-truth continuations. We acknowledge two key limitations: Firstly, we could not include closed-access models like AudioLM [13], despite their promising reported results on similar piano-continuation benchmarks. Secondly, our evaluation excludes popular symbolic models such as MuPT [3], as their bar-level timing representation (e.g., ABC notation) is incompatible with expressive millisecond-level MIDI performances.
Table 2 summarizes the results of our linear probe and supervised finetuning classification experiments, alongside an ablation study of training configurations for contrastive learning. Our proposed method for semi-supervised representation learning substantially improves results on all benchmarks (for example, Composer accuracy rises from 84.50 with CLaMP 3 to 90.50 with our embedding model), producing embeddings that capture diverse file-level musical attributes without incorporating metadata during training. The ablation study further highlights the importance of initializing contrastive training from pretrained next-token representations, demonstrating that our contrastive method is competitive only when applied as a finetuning stage. Notably, finetuning on one embedding pair per file for a single epoch (Aria_e=1) surpasses training from scratch on 25 pairs per file (Aria†_e=25). While this represents an advancement, we note that our benchmarks focus exclusively on piano performances, whereas the comparison models support multi-instrument MIDI or audio files. Finally, our supervised finetuning experiments demonstrate the strong adaptability of next-token prediction SSL frameworks to supervised symbolic MIR tasks. Our finetuned models achieve state-of-the-art classification performance on large datasets and perform surprisingly well on complex tasks, even when trained on limited labeled data.
5. CONCLUSION
We introduce Aria, an autoregressive generative transformer model designed to investigate the scalability of self-supervised learning for symbolic music modeling. Our experiments show that this pretraining framework effectively adapts to generative modeling, MIDI-embedding generation, and supervised MIR tasks. Moreover, our findings suggest that careful data curation and large-scale training can unlock new opportunities for downstream symbolic music applications, particularly in settings where data is scarce.
6. ACKNOWLEDGMENTS
This work was supported by UKRI and EPSRC under grant EP/S022694/1. Additional support was provided by EleutherAI and StabilityAI, as well as a compute grant from the Ministry of Science and ICT of Korea and Gwangju Metropolitan City.
7. REFERENCES
[1] J. Gui, T. Chen, J. Zhang, Q. Cao, Z. Sun, H. Luo, and D. Tao, “A survey on self-supervised learning: Algorithms, applications, and future trends,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[2] Y. Wang, S. Wu, J. Hu, X. Du, Y. Peng, Y. Huang, S. Fan, X. Li, F. Yu, and M. Sun, “Notagen: Advancing musicality in symbolic music generation with large language model training paradigms,” arXiv preprint arXiv:2502.18008, 2025.
[3] X. Qu, Y. Bai, Y. Ma, Z. Zhou, K. M. Lo, J. Liu, R. Yuan, L. Min, X. Liu, T. Zhang et al., “Mupt: A generative symbolic music pretrained transformer,” arXiv preprint arXiv:2404.06393, 2024.
[4] M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and T.-Y. Liu, “Musicbert: Symbolic music understanding with large-scale pre-training,” arXiv preprint arXiv:2106.05630, 2021.
[5] S. Wu, D. Yu, X. Tan, and M. Sun, “Clamp: Contrastive language-music pre-training for cross-modal symbolic music information retrieval,” arXiv preprint arXiv:2304.11029, 2023.
[6] C. Raffel, “Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching,” Ph.D. dissertation, Columbia University, 2016.
[7] IMSLP. (2006) IMSLP/Petrucci music library. IMSLP. [Online]. Available: https://imslp.org
[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[9] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[10] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu et al., “Lima: Less is more for alignment,” Advances in Neural Information Processing Systems, vol. 36, pp. 55006–55021, 2023.
[11] A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby, “Big transfer (bit): General visual representation learning,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. Springer, 2020, pp. 491–507.
[12] F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. Défossez, J. Copet, D. Parikh, Y. Taigman, and Y. Adi, “Audiogen: Textually guided audio generation,” arXiv preprint arXiv:2209.15352, 2022.
[13] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi et al., “Audiolm: A language modeling approach to audio generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2523–2533, 2023.
[14]
A. Dé ossez, J. Cope , G. Synnae e, and Y. Adi,
“High ideli y neu al audio comp ession,” a Xi p ep in
a Xi :2210.13438, 2022.
[15]
N. Zeghidou , A. Luebs, A. Om an, J. Skoglund, and
M. Tagliasacchi, “Sounds eam: An end- o-end neu-
al audio codec,” IEEE/ACM T ansac ions on Audio,
Speech, and Language P ocessing, ol. 30, pp. 495–
507, 2021.
[16]
A. Agos inelli, T. I. Denk, Z. Bo sos, J. Engel,
M. Ve ze i, A. Caillon, Q. Huang, A. Jansen,
A. Robe s, M. Tagliasacchi e al., “Musiclm: Gene a -
ing music om ex ,” a Xi p ep in a Xi :2301.11325,
2023.
[17]
J. Cope , F. K euk, I. Ga , T. Remez, D. Kan , G. Syn-
nae e, Y. Adi, and A. Dé ossez, “Simple and con ol-
lable music gene a ion,” Ad ances in Neu al In o ma-
ion P ocessing Sys ems, ol. 36, pp. 47 704–47 720,
2023.
[18]
Z. Bo sos, M. Sha i i, D. Vincen , E. Kha i ono ,
N. Zeghidou , and M. Tagliasacchi, “Sounds o m:
E icien pa allel audio gene a ion,” a Xi p ep in
a Xi :2305.09636, 2023.
[19]
W.-N. Hsu, B. Bol e, Y.-H. H. Tsai, K. Lakho ia,
R. Salakhu dino , and A. Mohamed, “Hube : Sel -
supe ised speech ep esen a ion lea ning by masked
p edic ion o hidden uni s,” 2021. [Online]. A ailable:
h ps://a xi .o g/abs/2106.07447
[20]
A. Bae ski, Y. Zhou, A. Mohamed, and M. Auli,
“wa 2 ec 2.0: A amewo k o sel -supe ised lea ning
o speech ep esen a ions,” Ad ances in neu al in o -
ma ion p ocessing sys ems, ol. 33, pp. 12 449–12 460,
2020.
[21]
E. Bene os, S. Dixon, Z. Duan, and S. Ewe , “Au o-
ma ic music ansc ip ion: An o e iew,” IEEE Signal
P ocessing Magazine, ol. 36, no. 1, pp. 20–30, 2018.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
457
[22]
Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-
esolu ion piano ansc ip ion wi h pedals by eg essing
onse and o se imes,” IEEE/ACM T ansac ions on
Audio, Speech, and Language P ocessing, ol. 29, pp.
3707–3717, 2021.
[23]
K. Toyama, T. Akama, Y. Ikemiya, Y. Takida, W.-H.
Liao, and Y. Mi su uji, “Au oma ic piano ansc ip ion
wi h hie a chical equency- ime ans o me ,” a Xi
p ep in a Xi :2307.04305, 2023.
[24]
Y. Yan and Z. Duan, “Sco ing ime in e als using non-
hie a chical ans o me o au oma ic piano ansc ip-
ion,” a Xi p ep in a Xi :2404.09466, 2024.
[25]
Q. Kong, B. Li, J. Chen, and Y. Wang, “Gian midi-
piano: A la ge-scale midi da ase o classical piano
music,” a Xi p ep in a Xi :2010.07061, 2020.
[26] H. Zhang, J. Tang, S. Ra ee, S. Dixon, and G. Fazekas,
“A epp: A da ase o au oma ically ansc ibed exp es-
si e piano pe o mance,” in P oceedings o he 23 d
In e na ional Socie y o Music In o ma ion Re ie al
Con e ence (ISMIR), 2022.
[27]
D. Edwa ds, S. Dixon, and E. Bene os, “Pijama: Piano
jazz wi h au oma ic midi anno a ions,” T ansac ions
o he In e na ional Socie y o Music In o ma ion Re-
ie al, 2023.
[28]
L. B adshaw and S. Col on, “A ia-midi: A da ase
o piano midi iles o symbolic music modeling,” in
In e na ional Con e ence on Lea ning Rep esen a ions,
2025. [Online]. A ailable: h ps://open e iew.ne /
o um?id=X5h hgndxW
[29]
J. Thicks un, D. Hall, C. Donahue, and P. Liang,
“An icipa o y music ans o me ,” a Xi p ep in
a Xi :2306.08620, 2023.
[30]
I. Suno, “Suno AI 3.5,” 2024, compu e so wa e.
[Online]. A ailable: h ps://sunnoai.com/ 3-5/
[31]
G. Hadje es, F. Pache , and F. Nielsen, “Deepbach: A
s ee able model o bach cho ales gene a ion,” in In-
e na ional con e ence on machine lea ning. PMLR,
2017, pp. 1362–1371.
[32]
C.-Z. A. Huang, T. Cooijmans, A. Robe s, A. Cou ille,
and D. Eck, “Coun e poin by con olu ion,” a Xi
p ep in a Xi :1903.07227, 2019.
[33]
F. T. Liang, M. Go ham, M. Johnson, and J. Sho on,
“Au oma ic s ylis ic composi ion o bach cho ales wi h
deep ls m,” in ISMIR, 2017, pp. 449–456.
[34]
S. Oo e, I. Simon, S. Dieleman, D. Eck, and K. Si-
monyan, “This ime wi h eeling: Lea ning exp essi e
musical pe o mance,” Neu al Compu ing and Applica-
ions, ol. 32, pp. 955–967, 2020.
[35]
C.-Z. A. Huang, A. Vaswani, J. Uszko ei , N. Shazee ,
I. Simon, C. Haw ho ne, A. M. Dai, M. D. Ho man,
M. Dinculescu, and D. Eck, “Music ans o me ,” a Xi
p ep in a Xi :1809.04281, 2018.
[36]
C. Haw ho ne, A. S asyuk, A. Robe s, I. Simon, C.-
Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and
D. Eck, “Enabling ac o ized piano music modeling
and gene a ion wi h he maes o da ase ,” a Xi p ep in
a Xi :1810.12247, 2018.
[37]
I. Simon, C.-Z. A. Huang, J. Engel, C. Haw ho ne,
and M. Dinculescu, “Gene a ing piano music
wi h ans o me ,” h ps://magen a. enso low.o g/
piano- ans o me , Sep embe 2019, blog pos .
[Online]. A ailable: h ps://magen a. enso low.o g/
piano- ans o me
[38]
C. Payne, “Musene ,” 2019, openAI, 25 Ap . 2019.
[Online]. A ailable: h ps://openai.com/blog/musene
[39]
Y.-S. Huang and Y.-H. Yang, “Pop music ans o me :
Bea -based modeling and gene a ion o exp essi e pop
piano composi ions,” in P oceedings o he 28 h ACM
in e na ional con e ence on mul imedia, 2020, pp. 1180–
1188.
[40]
B. Yu, P. Lu, R. Wang, W. Hu, X. Tan, W. Ye, S. Zhang,
T. Qin, and T.-Y. Liu, “Muse o me : T ans o me wi h
ine-and coa se-g ained a en ion o music gene a ion,”
Ad ances in neu al in o ma ion p ocessing sys ems,
ol. 35, pp. 1376–1388, 2022.
[41]
D. on Rü e, L. Biggio, Y. Kilche , and T. Ho mann,
“Figa o: Gene a ing symbolic music wi h ine-g ained
a is ic con ol,” a Xi p ep in a Xi :2201.10936,
2022.
[42]
P. Lu, X. Xu, C. Kang, B. Yu, C. Xing, X. Tan, and
J. Bian, “Musecoco: Gene a ing symbolic music om
ex ,” a Xi p ep in a Xi :2306.00110, 2023.
[43]
C. Walshaw, “Abc no a ion,” abcno a ion.com, 2008,
e ie ed 1 Ma ch 2008.
[44]
A. Robe s, J. Engel, C. Ra el, C. Haw ho ne, and
D. Eck, “A hie a chical la en ec o model o lea ning
long- e m s uc u e in music,” in In e na ional con e -
ence on machine lea ning. PMLR, 2018, pp. 4364–
4373.
[45]
S. Wu, Y. Wang, R. Yuan, Z. Guo, X. Tan, G. Zhang,
M. Zhou, J. Chen, X. Mu, Y. Gao e al., “Clamp 2:
Mul imodal music in o ma ion e ie al ac oss 101 lan-
guages using la ge language models,” a Xi p ep in
a Xi :2410.13267, 2024.
[46]
S. Wu, Z. Guo, R. Yuan, J. Jiang, S. Doh, G. Xia, J. Nam,
X. Li, F. Yu, and M. Sun, “Clamp 3: Uni e sal music
in o ma ion e ie al ac oss unaligned modali ies and
unseen languages,” a Xi p ep in a Xi :2502.10362,
2025.
[47]
S. Böck, F. K ebs, and G. Widme , “Join bea and
downbea acking wi h ecu en neu al ne wo ks.” in
ISMIR. New Yo k Ci y, 2016, pp. 255–261.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
458
[48]
C. Ra el, N. Shazee , A. Robe s, K. Lee, S. Na ang,
M. Ma ena, Y. Zhou, W. Li, and P. J. Liu, “Explo ing
he limi s o ans e lea ning wi h a uni ied ex - o-
ex ans o me ,” Jou nal o machine lea ning esea ch,
ol. 21, no. 140, pp. 1–67, 2020.
[49]
M. Good, “MusicXML: An in e ne - iendly
o ma o shee music,” in P oceedings o
XML 2001 Con e ence, 2001. [Online]. A ail-
able: h ps://michaelgood.in o/publica ions/music/
musicxml-an-in e ne - iendly- o ma - o -shee -music/
[50]
“MIDI speci ica ion,” 1996. [Online]. A ailable:
h ps://midi.o g/midi-1-0-de ailed-speci ica ion
[51]
C. Haw ho ne, I. Simon, R. Swa ely, E. Manilow, and
J. Engel, “Sequence- o-sequence piano ansc ip ion
wi h ans o me s,” a Xi p ep in a Xi :2107.09142,
2021.
[52]
N. Lee, K. S eeni asan, J. D. Lee, K. Lee, and
D. Papailiopoulos, “Teaching a i hme ic o small
ans o me s,” 2023. [Online]. A ailable: h ps:
//a xi .o g/abs/2307.03381
[53]
S. McLeish, A. Bansal, A. S ein, N. Jain,
J. Ki chenbaue , B. R. Ba oldson, B. Kailkhu a,
A. Bha ele, J. Geiping, A. Schwa zschild, and
T. Golds ein, “T ans o me s can do a i hme ic wi h
he igh embeddings,” 2024. [Online]. A ailable:
h ps://a xi .o g/abs/2405.17399
[54]
W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang,
“Compound wo d ans o me : Lea ning o compose
ull-song music o e dynamic di ec ed hype g aphs,”
in P oceedings o he AAAI Con e ence on A i icial
In elligence, ol. 35, no. 1, 2021, pp. 178–186.
[55]
A. G a a io i, A. Dubey, A. Jauh i, A. Pandey, A. Ka-
dian, A. Al-Dahle, A. Le man, A. Ma hu , A. Schel en,
A. Vaughan e al., “The llama 3 he d o models,” a Xi
p ep in a Xi :2407.21783, 2024.
[56]
J. Ho mann, S. Bo geaud, A. Mensch, E. Bucha skaya,
T. Cai, E. Ru he o d, D. d. L. Casas, L. A. Hen-
d icks, J. Welbl, A. Cla k e al., “T aining compu e-
op imal la ge language models,” a Xi p ep in
a Xi :2203.15556, 2022.
[57]
A. Vaswani, N. Shazee , N. Pa ma , J. Uszko ei ,
L. Jones, A. N. Gomez, Ł. Kaise , and I. Polosukhin,
“A en ion is all you need,” Ad ances in neu al in o ma-
ion p ocessing sys ems, ol. 30, 2017.
[58]
J. L. Ba, J. R. Ki os, and G. E. Hin on, “Laye no mal-
iza ion,” a Xi p ep in a Xi :1607.06450, 2016.
[59]
J. Ainslie, J. Lee-Tho p, M. De Jong, Y. Zemlyanskiy,
F. Leb ón, and S. Sanghai, “Gqa: T aining gene alized
mul i-que y ans o me models om mul i-head check-
poin s,” a Xi p ep in a Xi :2305.13245, 2023.
[60]
B. Zhang and R. Senn ich, “Roo mean squa e laye
no maliza ion,” Ad ances in Neu al In o ma ion P o-
cessing Sys ems, ol. 32, 2019.
[61]
T. Chen, S. Ko nbli h, M. No ouzi, and G. Hin on, “A
simple amewo k o con as i e lea ning o isual ep-
esen a ions,” in In e na ional con e ence on machine
lea ning, 2020, pp. 1597–1607.
[62]
J. Spijke e and J. A. Bu goyne, “Con as i e
lea ning o musical ep esen a ions,” a Xi p ep in
a Xi :2103.09410, 2021.
[63]
J. Choi, S. Jang, H. Cho, and S. Chung, “Towa ds p ope
con as i e sel -supe ised lea ning s a egies o mu-
sic audio ep esen a ion,” in 2022 IEEE In e na ional
Con e ence on Mul imedia and Expo (ICME). IEEE,
2022, pp. 1–6.
[64]
T. Gao, X. Yao, and D. Chen, “Simcse: Simple
con as i e lea ning o sen ence embeddings,” a Xi
p ep in a Xi :2104.08821, 2021.
[65]
L. Wang, N. Yang, X. Huang, L. Yang, R. Majumde ,
and F. Wei, “Imp o ing ex embeddings wi h la ge
language models,” a Xi p ep in a Xi :2401.00368,
2023.
[66]
Q. Huang, A. Jansen, J. Lee, R. Gan i, J. Y. Li, and D. P.
Ellis, “Mulan: A join embedding o music audio and
na u al language,” a Xi p ep in a Xi :2208.12415,
2022.
[67]
Moda , “Piano eq,” h ps://www.moda .com/
piano eq, accessed: 2025-03-28.
[68]
Y.-H. Chou, I. Chen, C.-J. Chang, J. Ching, Y.-H.
Yang e al., “Midibe -piano: La ge-scale p e- aining
o symbolic music unde s anding,” a Xi p ep in
a Xi :2107.05223, ol. 2, 2021.
[69]
L. N. Fe ei a and J. Whi ehead, “Lea ning o
gene a e music wi h sen imen ,” a Xi p ep in
a Xi :2103.06125, 2021.
[70]
Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin,
C. Xiao, C. Lin, A. Ragni, E. Bene os e al.,
“Me : Acous ic music unde s anding model wi h
la ge-scale sel -supe ised aining,” a Xi p ep in
a Xi :2306.00107, 2023.
[71]
J. De lin, M.-W. Chang, K. Lee, and K. Tou ano a,
“Be : P e- aining o deep bidi ec ional ans o me s
o language unde s anding,” in P oceedings o he 2019
con e ence o he No h Ame ican chap e o he asso-
cia ion o compu a ional linguis ics: human language
echnologies, olume 1 (long and sho pape s), 2019,
pp. 4171–4186.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
459