Scaling Self-Supervised Representation Learning for Symbolic Piano Performance

Author: Louis Bradshaw; Alexander Spangher; Honglu Fan; Stella Biderman; Simon Colton

Publisher: Zenodo

DOI: 10.5281/zenodo.17706484

Source: https://zenodo.org/records/17706484/files/000052.pdf

SCALING SELF-SUPERVISED REPRESENTATION LEARNING
FOR SYMBOLIC PIANO PERFORMANCE
Louis Bradshaw (1,4), Honglu Fan (3,4), Alexander Spangher (2,4), Stella Biderman (4), Simon Colton (1)
(1) Queen Mary University of London, (2) University of Southern California,
(3) University of Geneva, (4) EleutherAI
ABSTRACT
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions. After first pretraining on approximately 60,000 hours of music, we use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings by adapting the SimCLR framework to symbolic music. When evaluating piano continuation coherence, our generative model outperforms leading symbolic generation techniques and remains competitive with proprietary audio generation models. On MIR classification benchmarks, frozen representations from our contrastive model achieve state-of-the-art results in linear probe experiments, while direct finetuning demonstrates the generalizability of pretrained representations, often requiring only a few hundred labeled examples to specialize to downstream tasks.
1. INTRODUCTION
Modern machine learning systems increasingly utilize self-supervised learning (SSL) as a core component of their training pipeline. In this paradigm, general-purpose representations are learned during an initial phase of self-guided learning, which can then be adapted to specialized tasks, often outperforming purely supervised approaches, particularly when access to supervised data is limited [1].
As in other fields, recent work using neural networks to model symbolic music has started to adopt SSL [2–5]. However, the symbolic music data that these models are trained on is typically created manually, in a labor-intensive process. Acquiring it at the scale common to other modalities (e.g., text, images, audio) is challenging. Consequently, successful research often involves training from scratch on datasets such as Lakh and IMSLP [6,7], with research problems formulated around tasks that directly align with these datasets (e.g., multi-track symbolic music generation). This contrasts with other domains where substantial efforts have produced generalist models trained at an extreme scale, such as LLaMA and CLIP [8,9], which provide strong foundations for research in data-limited settings [10,11]. These constraints on symbolic music research become particularly clear when considering advancements in the neighboring area of audio modeling, where large-scale models including AudioGen and AudioLM [12, 13], alongside their underlying neural audio codecs [14, 15], have driven a broad range of advancements in music generation [16–18], and where SSL has been applied at scale to develop effective, general-purpose embedding models [19, 20].

[Figure 1: t-SNE scatter plot. Legend (Composer): Scriabin, Chopin, Schubert, Bach, Schumann, Haydn, Beethoven, Mozart, Satie, Brahms, Tchaikovsky, Liszt, Debussy, Rachmaninoff, Ravel, Handel.]
Figure 1. t-SNE visualisation of contrastive embeddings of classical compositions, trained on MIDI data without external metadata. The cross (×) highlights Chopin's Waltz in A minor, which was discovered (footnote 1) after the training data was compiled, ensuring that it was not included.

© L. Bradshaw, H. Fan, A. Spangher, S. Biderman and S. Colton. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: L. Bradshaw, H. Fan, A. Spangher, S. Biderman and S. Colton, “Scaling Self-Supervised Representation Learning for Symbolic Piano Performance”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Fortunately, strong progress has been made towards alleviating data bottlenecks for symbolic music research by leveraging neural networks trained for automatic music transcription (AMT) [21]. In the restricted domain of solo-piano audio recordings, modern AMT models achieve highly reliable note-identification accuracy [22–24], enabling automated dataset curation pipelines that crawl raw audio and transcribe it into MIDI using a combination of web scraping, audio-based processing, and AMT methods [25–27]. Moreover, as this symbolic data is transcribed from real recordings, it captures the subtleties and dynamics of human performance. Recently, this combined progress has resulted in a new dataset of symbolic music, Aria-MIDI [28], comprising transcriptions of solo-piano recordings gathered at scale from YouTube, which has been made available for public use. At ~100k hours, Aria-MIDI is orders of magnitude larger than similar datasets [25], presenting a unique opportunity to investigate the application of scaling SSL methods to symbolic music modeling.

1 See Javier C. Hernández, “Hear a Chopin Waltz Unearthed After Nearly 200 Years,” The New York Times, Oct. 27, 2024.
Building on this, in this work we leverage Aria-MIDI to pretrain a generative transformer model via next-token prediction, using it as a foundation to explore the effectiveness of SSL techniques applied to symbolic music at a scale close to recent applications in the text, image, and audio domains. We evaluate our model across two dimensions: generative modeling and representation learning. For generative capabilities, we conduct human listening tests comparing piano continuations generated by our model, while for representation learning we measure the ability of the pretrained model to adapt to MIR classification tasks via finetuning. To explore applications to similarity and retrieval tasks, we propose and analyze a novel self-supervised adaptation of the contrastive learning framework for symbolic music, which finetunes our model to produce embeddings that capture performance and composition-level features, as demonstrated by the natural composer clustering visualized in Figure 1. In both evaluation settings, we compare against symbolic and audio-based baselines. Overall, our experiments provide strong evidence that scaling SSL is a promising approach to tackling difficult tasks across symbolic MIR. Our key contributions are the following:
1. We introduce and open-source Aria (footnote 2), a pretrained autoregressive transformer model trained on transcriptions of piano recordings. Through human listening tests, we show it generates coherent continuations from short musical prompts, outperforming Anticipatory Music Transformer [29] and rivaling proprietary audio models like Suno 3.5 [30].
2. We further demonstrate the effectiveness of large-scale pretrained representations for symbolic MIR through two approaches: (1) directly finetuning our model for classification tasks, achieving strong performance when labeled examples are extremely limited, and (2) proposing a novel adaptation of contrastive learning that produces an embedding model achieving state-of-the-art accuracy in linear probe experiments including composer, genre, and style detection. Critically, we show that this contrastive approach is effective only when applied as a secondary finetuning phase.

In addition to our models, we release a MIDI preprocessing and tokenization library designed to scale to large datasets and, although this work focuses on solo piano, to natively support multi-track MIDI files. Together, these contributions may serve as a foundation for future research in symbolic music modeling.

2 Available at: https://github.com/eleutherai/aria
2. RELATED WORK
Our work relates to many sub-areas of computational music, generative modeling, and representation learning. In this section, we focus on related work specific to the subfield of symbolic music modeling.
The field of symbolic music generation using neural networks has advanced rapidly. Prior to the introduction of transformers, models such as DeepBach [31] and Coconet [32] demonstrated that neural networks are effective tools for modeling musical harmonies in Baroque music. The autoregressive paradigm for symbolic music generation, which models music as a stream of tokens, gained traction by adapting architectures from natural language processing [33]. This approach was extended by [34] to incorporate expressive onset and duration timings, enabling generated music to more closely emulate human performance.
Music Transformer [35] was a seminal work demonstrating the power and scalability of the autoregressive approach. The authors trained a transformer decoder on the MAESTRO dataset [36], a collection of expressive MIDI piano recordings, and showed that autoregressive models could effectively learn long-term musical dependencies. Subsequent work from the same authors provided strong evidence that the musical and creative capabilities of their model scale well with dataset size [37], reinforcing the value of curating large-scale piano transcription datasets as a future direction, a central premise we explore in our work.
Building on this foundation, MuseNet [38] expanded this approach by adding multi-track support to its MIDI tokenizer and training a large model on a diverse corpus of multi-instrument data, including MAESTRO. Alternative tokenization schemes, such as REMI [39], have also been influential. Variations of REMI have been adopted by models including Museformer [40], Figaro [41], and MuseCoco [42], which all introduced methods for conditioning generation on various musical features. Other research has explored representations beyond MIDI, such as the ABC notation [43] used by MuPT [3]. More recently, Anticipatory Music Transformer [29] was introduced as a versatile, state-of-the-art model for prompt continuation and infilling tasks with expressive millisecond-level precision.
For representation learning, several methods have been developed to produce symbolic music embeddings, useful as feature extractors for downstream classification tasks. These include MusicVAE [44], a variational autoencoder for capturing long-term structure; MusicBERT [4], which learns self-supervised representations via a bar-masking objective; and the CLaMP series of models [5,45,46], which employ contrastive learning techniques to build cross-modal representations with natural language descriptions.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[Figure 2: a piano-roll (time axis 0 to 8 seconds; pitches C, E, G) shown alongside four tokenizations of it.]
MUSIC TRANSFORMER: SHIFT 1000MS | SET_VEL 60 | NOTE_ON 60 | SHIFT 1000MS | SHIFT 1000MS | NOTE_ON 64 | SHIFT 1000MS | NOTE_OFF 60 | SHIFT 1000MS ...
MUSENET: WAIT 1000MS | PIANO P: C4 V: 60 | WAIT 2000MS | PIANO P: E4 V: 60 | WAIT 1000MS | PIANO P: C4 V: 0 | WAIT 1000MS | PIANO P: G4 V: 60 | WAIT 1000MS ...
REMI: BAR | POSITION 2/4 | PITCH C4 | VELOCITY 60 | DURATION 3/4 | POSITION 4/4 | PITCH E4 | VELOCITY 60 | DURATION 3/4 ...
ARIA: PIANO P: 60 V: 60 | ONSET 1000MS | DURATION 3000MS | PIANO P: 64 V: 60 | ONSET 3000MS | DURATION 3000MS | <T> | PIANO P: 67 V: 60 | ONSET 0MS ...
Figure 2. Comparison of different tokenizations of a piano-roll, using various approaches. Music Transformer [35] and MuseNet [38] track the passage of time using time-shift tokens, whereas Aria uses absolute onsets relative to the current segment. The REMI tokenizer [39] uses a neural beat-tracking model to estimate positions of notes and bar delimiters [47].
3. METHOD
To explore the capabilities of large-scale self-supervised models of piano performance, we first pretrained an autoregressive transformer model using next-token prediction on a refined subset of the Aria-MIDI dataset. We adopt this setup due to its versatility: next-token prediction has a proven track record in generative modeling of both symbolic and audio-based music [13,35], as well as adaptability to downstream tasks via finetuning [48]. Apart from the tokenization scheme, which we hand-designed, we used a conventional modern transformer architecture with minimal modifications, providing a standardized foundation for evaluating our hypothesis and supporting further research.
3.1 MIDI Tokenization
To autoregressively model MIDI files as streams of discrete tokens, we chose to use a temporal resolution of 10 milliseconds for note onsets and durations, and discretize note velocity values into 12 bins. Our tokenizer is designed to natively handle multi-track (multi-instrument) MIDI files by condensing the 128 MIDI instruments, corresponding to program_change MIDI messages, into 13 instrument classes, including one for percussion.
Given a MIDI file, we resolve its constituent note_on and note_off events into a list of notes. For non-percussion instruments, we tokenize a note with pitch $p$, velocity $v$, and absolute onset/offset in milliseconds $(t_{on}, t_{off})$ as a triple of tokens:
[instrument, $p$, $v$], [onset: $t_{on}$], [duration: $t_{off} - t_{on}$]
For percussion, we tokenize a note with note number $n$ and onset $t_{on}$ as:
[drum, $n$], [onset: $t_{on}$]
The tokenization of an entire MIDI file is constructed by concatenating the tokenizations of the constituent notes in order of onset. MIDI metadata, such as key, tempo, and time signature, is discarded, and other relevant musical information, such as the sustain pedal, is incorporated directly into the duration tokens. This schema is set apart from some popular tokenization techniques used for symbolic music, such as REMI [39] and the text-based score representations ABC [43] and MusicXML [49], as it does not include beat or bar information, instead representing onsets and durations in milliseconds.
In the MIDI standard [50], note_on and note_off events are spaced temporally by specifying a number of ticks to wait before processing the next event. For Music Transformer and MuseNet, the authors incorporate this into their chosen MIDI tokenization schemes [35, 38], using time-shift tokens to separate notes rather than specifying their absolute onset times. However, emerging work has provided evidence that using time-shift tokens in this way may be suboptimal in transformer-based models, resulting in reduced accuracy in sequence-to-sequence piano transcription [51], and unstable rhythm or drifting bar lines in musical generations [39]. One possible explanation is that when using relative-timing tokenization, autoregressive models struggle to maintain an exact temporal representation of the prior context, as they must sum up many sequential time-shift values to calculate temporal relationships between notes with medium or long-term dependencies. Previous studies on large language models have demonstrated that transformers can struggle with exactly this sort of arithmetic [52, 53].
In preliminary investigations, we also observed negative effects when using relative-timing tokenizations, particularly on temporal instability in passages with rapid note sequences. To address these issues, we chose to adopt absolute onset times in our tokenizer. We implemented this by dividing the music into 5000-millisecond segments and recording note onsets relative to the start of each segment; this helped us avoid expanding the tokenizer's vocabulary to include all possible absolute onset times. To remove ambiguity, we marked the start of each new segment using a special token: <T>. We designed this to resemble note timing using beat-position within a bar; however, unlike tokenization schemes that do this directly [39, 54], our approach is applicable to MIDI files that lack beat and bar information, such as those transcribed from solo piano recordings. Figure 2 illustrates how our approach differs from other approaches.
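The segment-relative scheme described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released tokenizer: the note tuple layout, the `tokenize` and `quantize` names, and the token tuples are assumptions made for the example, velocity binning and percussion are omitted, and the offset of the final note is invented.

```python
from typing import List, Tuple

SEGMENT_MS = 5000    # segment length stated in the text
RESOLUTION_MS = 10   # temporal resolution stated in the text

# Hypothetical input format: (pitch, velocity, onset_ms, offset_ms)
Note = Tuple[int, int, int, int]


def quantize(ms: int) -> int:
    """Round a time in milliseconds to a multiple of RESOLUTION_MS."""
    return RESOLUTION_MS * round(ms / RESOLUTION_MS)


def tokenize(notes: List[Note], instrument: str = "piano") -> List[tuple]:
    """Emit (instrument, pitch, velocity), (onset, ms), (duration, ms) triples.

    A ("<T>",) token is inserted each time a 5000 ms segment boundary is
    crossed, and onsets are stored relative to the current segment start.
    """
    tokens: List[tuple] = []
    segment = 0
    for pitch, velocity, onset, offset in sorted(notes, key=lambda n: n[2]):
        onset_q = quantize(onset)
        while onset_q >= (segment + 1) * SEGMENT_MS:
            tokens.append(("<T>",))
            segment += 1
        tokens.append((instrument, pitch, velocity))
        tokens.append(("onset", onset_q - segment * SEGMENT_MS))
        tokens.append(("duration", quantize(offset) - onset_q))
    return tokens


# The three notes of the Figure 2 piano-roll (the last offset is assumed).
example = [(60, 60, 1000, 4000), (64, 60, 3000, 6000), (67, 60, 5000, 8000)]
for token in tokenize(example):
    print(token)
```

Because onsets never exceed the segment length, the onset vocabulary stays bounded at 5000 / 10 = 500 values regardless of how long the piece is.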
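$$
T_i - T_j =
\begin{cases}
\sum_{k=j+1}^{i} w_k & \text{Relative} \\
C(\texttt{<T>}, i, j) + \tilde{o}_i - \tilde{o}_j & \text{Hybrid (Ours)} \\
o_i - o_j & \text{Absolute}
\end{cases}
\tag{1}
$$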
Equation 1 demonstrates the arithmetic required to calculate the time separating two notes $n_i$ and $n_j$ across the different tokenization approaches, where $w_k$ denotes the length of the time-shift message preceding note $k$, $C(\texttt{<T>}, i, j)$ represents the total time spanned by complete 5000ms segments between notes $n_i$ and $n_j$, calculated by counting the number of segment tokens and multiplying by the segment duration, $o_k$ represents the absolute onset time of note $k$, and $\tilde{o}_k$ represents the adjusted absolute onset time of note $k$ relative to its 5000ms segment.
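As a worked illustration of Equation 1, the sketch below computes the same inter-note gap under the three schemes. The function names and the example onsets are hypothetical; the point is only that the relative scheme needs a sum whose length grows with the distance between the notes, while the hybrid scheme needs one multiplication and one subtraction.

```python
SEGMENT_MS = 5000


def gap_relative(shifts, i, j):
    """Relative timing: sum the time-shifts w_k for k = j+1 .. i."""
    return sum(shifts[k] for k in range(j + 1, i + 1))


def gap_hybrid(segment_index, local_onsets, i, j):
    """Hybrid: C(<T>, i, j) plus the difference of segment-relative onsets."""
    c = (segment_index[i] - segment_index[j]) * SEGMENT_MS
    return c + local_onsets[i] - local_onsets[j]


def gap_absolute(onsets, i, j):
    """Absolute timing: a single subtraction."""
    return onsets[i] - onsets[j]


# Hypothetical example: four notes with these absolute onsets (ms).
onsets = [1000, 3000, 5000, 12500]
shifts = [1000, 2000, 2000, 7500]      # w_k: time-shift preceding note k
segment_index = [o // SEGMENT_MS for o in onsets]
local_onsets = [o % SEGMENT_MS for o in onsets]

print(gap_relative(shifts, 3, 0))
print(gap_hybrid(segment_index, local_onsets, 3, 0))
print(gap_absolute(onsets, 3, 0))
```

All three calls describe the gap between the first and last note and should agree.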
3.2 Model
Our model architecture builds upon the LLaMa 3.2 model family, chosen due to its effectiveness in autoregressive tasks across modalities [55]. Using the 1B parameter configuration as a starting point, we made several architectural modifications. Firstly, guided by established principles on model-data ratios for language models [56], we reduced the hidden state dimension ($d_{model}$) from 2048 to 1536. This decreased the parameter count by roughly half, balancing model capacity with computational efficiency for our dataset scale. Secondly, we simplified the architecture by opting for standard multi-head attention (with 24 heads) and layer normalization [57,58], instead of grouped-query attention and RMS normalization as used in standard LLaMa 3 variants [59, 60].
Pretraining dataset. As our training corpus consists of automatically transcribed internet-sourced piano recordings, significant variability exists in transcription quality and content suitability, potentially introducing harmful biases or noisy data into downstream models. To mitigate this, we implemented rigorous preprocessing steps. To reduce memorization, we addressed extreme cases of composition duplication, such as repeated performances of overrepresented works, by applying filtering based on compositional metadata. Specifically, for composers with more than 250 instances of files containing opus and/or piece number tags, we retained at most 10 instances per opus/piece-number pair. For these same composers, we also discarded all other files that lack compositional identifiers. Additionally, we employed heuristic-based filtering, considering note density, pitch and duration entropy, silence, and indicators of repetitive content, to exclude problematic entries (e.g., Black MIDI (footnote 3)). Following these steps, our refined pretraining corpus comprises 820,944 MIDI files, amounting to 60,473 hours of solo piano music.
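The metadata-based deduplication rule can be sketched as follows. This is an illustrative reading of the rule, not the authors' preprocessing code: the record layout (dictionaries with `composer`, `opus`, and `piece` keys) and the function name are assumptions, and which 10 instances survive is decided here simply by input order.

```python
from collections import Counter
from typing import Dict, List


def filter_duplicates(files: List[Dict], min_tagged: int = 250,
                      max_per_work: int = 10) -> List[Dict]:
    """Cap over-represented works for heavily represented composers."""

    def has_id(f: Dict) -> bool:
        return f.get("opus") is not None or f.get("piece") is not None

    # Count files carrying an opus and/or piece tag, per composer.
    tagged = Counter(f["composer"] for f in files
                     if f.get("composer") and has_id(f))
    heavy = {c for c, n in tagged.items() if n > min_tagged}

    kept, seen = [], Counter()
    for f in files:
        composer = f.get("composer")
        if composer not in heavy:
            kept.append(f)            # rule applies only to heavy composers
            continue
        if not has_id(f):
            continue                  # no compositional identifier: discard
        key = (composer, f.get("opus"), f.get("piece"))
        if seen[key] < max_per_work:
            seen[key] += 1
            kept.append(f)
    return kept
```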
Pretraining recipe. We pretrained our model using standard next-token prediction on concatenated sequences of tokenized MIDI files, as detailed in Section 3.1. A sequence length of 8192 tokens was chosen to balance computational constraints with the need to learn meaningful short- and long-term dependencies within piano music. To enhance generalization and prevent overfitting, we utilized online data augmentation, randomly transposing (±5 semitones), varying tempo (±20%), and adjusting MIDI velocity (±10).
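A minimal sketch of the three augmentations is given below, operating on a list of notes before tokenization. The ranges come from the text; everything else is an assumption made for illustration, including the note tuple layout, drawing one random value per sequence, uniform sampling, and dropping notes that leave the MIDI pitch range.

```python
import random
from typing import List, Tuple

# Hypothetical input format: (pitch, velocity, onset_ms, offset_ms)
Note = Tuple[int, int, int, int]


def augment(notes: List[Note], rng: random.Random) -> List[Note]:
    """Apply one random transposition, tempo change, and velocity shift."""
    shift = rng.randint(-5, 5)         # semitones
    tempo = rng.uniform(0.8, 1.2)      # playback-speed factor (+/- 20%)
    vel_delta = rng.randint(-10, 10)   # MIDI velocity units

    out: List[Note] = []
    for pitch, velocity, onset, offset in notes:
        new_pitch = pitch + shift
        if not 0 <= new_pitch <= 127:
            continue  # assumption: drop notes pushed outside the MIDI range
        new_velocity = min(127, max(1, velocity + vel_delta))
        out.append((new_pitch, new_velocity,
                    round(onset / tempo), round(offset / tempo)))
    return out


augmented = augment([(60, 60, 1000, 4000), (64, 60, 3000, 6000)],
                    random.Random(0))
```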
Generative finetuning. We produced a model variant tailored to generative piano-continuation tasks by applying a single-epoch finetuning phase after pretraining, annealing the learning rate to zero while training on higher-quality data. To enhance data quality, we removed all identified compositional duplicates, tightened existing quality filters, and introduced an additional filter aimed at excluding transcriptions of synthesized MIDI files (footnote 4). Additionally, during this phase, each training sequence begins at the start of a new file (i.e., non-concatenated), and we insert a special token (<D>) approximately 100 tokens before the end of each training example to enable explicit inference-time control over generation endings.
3.3 Contrastive Representation Learning
To investigate the strength of the pretrained representations, we propose a secondary finetuning stage, adapting the pretrained model to generate embeddings of tokenized sequences. Our approach leverages the SimCLR framework for contrastive representation learning [61]. In SimCLR, an encoder is trained to produce similar embeddings for different views of the same training example while simultaneously pushing embeddings from unrelated examples apart through minimization of a contrastive loss. This approach has demonstrated strong results in music, capturing semantic relationships within embeddings effectively [62,63], and has recently been combined with large pretrained language models to produce rich textual embeddings [64, 65].
To generate two distinct views of a MIDI file, we randomly extract two different contiguous slices, each comprising between 100 and 650 notes (approximately 300–2000 tokens). Each slice undergoes independent data augmentation using our standard procedures before tokenization. To produce sequence embeddings, we replace the original language modeling head with an embedding head, projecting the final hidden state into a 512-dimensional embedding space. We derive a slice's embedding from the hidden state associated with an end-of-sequence token appended after the final note token-triple.
To calculate the contrastive loss, we use the normalized temperature-scaled cross-entropy loss, NT-Xent, over minibatches of related embedding pairs:
$$
\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \tag{2}
$$
3 https://en.wikipedia.org/wiki/Black_MIDI
4 Preprocessing details: https://github.com/loubbrad/aria-midi
Here, $\mathrm{sim}(z_k, z_l)$ denotes the cosine similarity between normalized embeddings $z_k$ and $z_l$, $\mathbb{1}_{[k \neq i]} \in \{0,1\}$ is an indicator function, and $\tau$ is the temperature parameter. Each minibatch consists of $N$ MIDI files, from which we construct $N$ pairs of related embeddings (i.e., $2N$ total embeddings), $\{z_i, z_{i+N}\}_{i=1,\ldots,N}$, where both $z_i$ and $z_{i+N}$ are derived from two augmented views of the same file. We train the model by minimizing the symmetric loss:
$$
\mathcal{L} := \frac{1}{2N} \sum_{k=1}^{N} \left(\ell_{k,k+N} + \ell_{k+N,k}\right).
$$
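The NT-Xent loss and its symmetric batch form can be written out directly from the definitions above. The sketch below uses plain Python lists so that every term is visible; it is an illustration of the formula, not the training code, and a practical implementation would use a batched tensor library. The function names are chosen for this example.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(z: List[Sequence[float]], tau: float = 0.1) -> float:
    """Symmetric NT-Xent loss over 2N embeddings.

    z[i] and z[i + N] are assumed to be the two views of the same file.
    """
    total = len(z)
    n = total // 2

    def pair_loss(i: int, j: int) -> float:
        numerator = math.exp(cosine(z[i], z[j]) / tau)
        # The indicator [k != i] removes the self-similarity term.
        denominator = sum(math.exp(cosine(z[i], z[k]) / tau)
                          for k in range(total) if k != i)
        return -math.log(numerator / denominator)

    summed = sum(pair_loss(k, k + n) + pair_loss(k + n, k) for k in range(n))
    return summed / (2 * n)


# Two files, two views each: matching views point in similar directions.
aligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [0.01, 1.0]]
shuffled = [[1.0, 0.0], [0.0, 1.0], [0.01, 1.0], [1.0, 0.01]]
print(nt_xent(aligned), nt_xent(shuffled))
```

In this toy batch the loss is lower when each embedding's partner is its nearest neighbour, which is the behaviour the objective rewards.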
This setup has two key advantages. First, by extracting non-overlapping slices from the same file, the model learns embeddings reflecting higher-level musical semantics such as genre, composer, style, and performance nuances, rather than local details. This is important for musical performances, where standard supervised representation learning approaches, e.g., MuLan [66], are limited due to the descriptive subtlety and complexity of musical attributes. Second, our approach facilitates studying how effectively next-token prediction representations transfer to contrastive embedding frameworks. When trained from scratch, SimCLR-inspired training methods typically require large amounts of in-batch negatives, which pose significant VRAM constraints [61]. However, recent work on text embeddings shows that initializing contrastive training from pretrained models can alleviate this [64]. Thus, our method introduces a general-purpose semi-supervised framework for representation learning of symbolic music, which allows us to evaluate the transferability of next-token musical representations.
4. EXPERIMENTS
Having outlined our methodology, we evaluate the generative capabilities of our model, as well as the contrastive representation learning framework, in the context of piano performance. To understand its capabilities in the wider area of models of generative music and MIR, we compare our approach to both symbolic and audio-based baselines, utilizing Pianoteq [67] to synthesize MIDI files into audio.
4.1 Setup
We pretrained our model using the AdamW optimizer for 75 epochs over the training corpus. We used a learning rate of 3e-4 with 1000 warmup steps, followed by a linear decay to 10% of the initial rate over the course of training. The model has approximately 650 million parameters and was pretrained for 9 days on 8 H100 GPUs with a batch size of 16 per GPU.
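The learning-rate schedule described above (warmup, then linear decay to 10% of the peak) can be expressed as a small function. The peak rate, warmup length, and final fraction come from the text; the linear shape of the warmup, the function name, and the `total_steps` argument are assumptions for illustration, since the total number of optimizer steps is not stated.

```python
def learning_rate(step: int, total_steps: int, peak: float = 3e-4,
                  warmup: int = 1000, final_fraction: float = 0.1) -> float:
    """Linear warmup to `peak`, then linear decay to `final_fraction * peak`."""
    if step < warmup:
        # Assumed warmup shape: linear ramp reaching `peak` at the last
        # warmup step.
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    progress = min(1.0, progress)
    return peak * (1.0 - (1.0 - final_fraction) * progress)
```

For the contrastive and supervised finetuning stages, the same function with a different `peak` (and, where the text says so, `warmup=0` or `final_fraction=0.0`) would describe the stated schedules under the same assumptions.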
In the contrastive finetuning stage, we used a learning rate of 1e-5 with the same linear decay schedule. We set the NT-Xent temperature parameter to $\tau = 0.1$. This phase lasted 25 epochs, during which each MIDI file contributes exactly one pair of augmented views per epoch. We trained on the reduced finetuning dataset described in Section 3.2; however, we relaxed the preprocessing constraints on compositional duplicates to encourage the model to distinguish between different performances of popular compositions.
Generative modeling. Following the generative finetuning procedure described in Section 3.2, we explore the generative capabilities of the resulting model by analyzing the musical coherence of continuations of short solo piano prompts. This methodology aligns with evaluations in previous work [13,29], and mitigates taste bias by having participants evaluate continuations within the same musical style.

Compared Model    Wins  Ties  Losses  p-value
AM Transformer    38    0     6       9.43e-7
Suno 3.5          18    9     21      7.49e-1
MusicGen          49    1     0       3.55e-15
Ground Truth      15    9     17      8.60e-1
Table 1. Pairwise human preference results comparing musical coherence of 45-second continuations of 15-second prompts. We report the number of times our model won, tied, or lost against the listed model. P-values are computed using a two-sided binomial test on non-tied comparisons.
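The p-values in Table 1 can be checked with a short calculation. The sketch below assumes the null hypothesis of no preference (win probability 0.5), under which the two-sided binomial p-value is twice the smaller tail, capped at 1; the function name is chosen for this example.

```python
from math import comb


def two_sided_binomial_p(wins: int, losses: int) -> float:
    """Two-sided binomial test against p = 0.5, ignoring ties."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of a result at least as extreme in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # The null distribution is symmetric, so double the tail.
    return min(1.0, 2.0 * tail)


for name, wins, losses in [("AM Transformer", 38, 6), ("Suno 3.5", 18, 21),
                           ("MusicGen", 49, 0), ("Ground Truth", 15, 17)]:
    print(f"{name}: {two_sided_binomial_p(wins, losses):.2e}")
```

Under this assumption the computed values agree with the table to the reported precision, which confirms that ties were excluded from the test.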
In our listening test, we asked 46 participants with at least one year of musical training to compare 45-second continuations generated from 15-second solo piano prompts, evaluating their musical coherence. Participants were presented with a series of random pairwise A/B comparisons, where they were asked to indicate their preferred continuation, guided by criteria such as melodic development, rhythmic structure, harmonic progression, and stylistic coherence. To generate test samples, we selected five prompts representing different subgenres of solo piano music, and generated eight continuations per prompt (totaling 40 continuations per model). We compared our model's outputs against several baselines, including Anticipatory Music Transformer (music-large-800k) [29], the audio-based generative models MusicGen (large) [16] and Suno 3.5 [30], and the human-composed ground-truth.
Contrastive embeddings. We evaluate our approach to learning contrastive embeddings by training linear classifiers on the frozen embeddings produced by different models and comparing their performance on held-out test sets. We assess performance using established benchmarks, Pianist8 [68] and VG-MIDI [69], as well as new benchmarks we derive from Aria-MIDI metadata. Specifically, we extracted label-balanced train-test splits comprising 10,000 and 1,000 files, respectively, for four classification tasks: Genre (2 classes), Musical Period (4 classes), Form (6 classes), and Composer (10 classes). For comparison, we include results from CLaMP 3 (saas) [46], M3 [45], and the audio-based model MERT [70]. Linear classifiers were trained on global file embeddings obtained by averaging slice embeddings within each file. We trained with a learning rate of 3e-4 and a linear decay schedule to 0, running separate experiments with 10, 20, and 50 epochs, and reporting the best result.
Supervised finetuning. To complement our linear probe experiments, we evaluate how well our pretrained model adapts to supervised musical classification tasks, employing finetuning techniques inspired by NLP literature [48, 71]. For classifier finetuning, we replaced the language modeling head with a classification head, predicting labels directly from the hidden state of the end-of-sequence token. During this phase, we finetuned all model weights end-to-end using a learning rate of 1e-5 (without warmup) with a linear decay schedule, and applied dropout to residual connections, increasing the dropout rate linearly from $p_d = 0.0$ (first layer) to $p_d = 0.2$ (final layer). By systematically varying the number of labeled training examples, using class-balanced subsets, we analyze our pretrained model's ability to adapt to supervised symbolic MIR tasks in scenarios with limited labeled data. In each case, we trained for 10 epochs and report the results from the best-performing epoch.

Model         Genre           Form            Musical Period  Composer        Pianist8        VG-MIDI
              Acc     F1      Acc     F1      Acc     F1      Acc     F1      Acc     F1      Acc     F1
Main Results
MERT          83.00   83.00   63.89   63.90   69.50   68.94   69.60   69.30   65.06   65.18   45.45   40.37
M3            85.10   85.10   69.88   70.12   71.20   70.81   71.90   71.72   81.93   81.48   54.55   46.13
CLaMP 3       89.10   89.10   77.79   77.97   80.60   80.20   84.50   84.46   80.72   79.76   45.45   36.53
Aria_Emb      92.40   92.40   82.45   82.57   84.70   84.69   90.50   90.49   91.57   92.38   63.64   63.96
Aria_FT       93.20   93.20   87.53   87.59   86.50   86.53   96.30   96.32   91.56   92.03   68.18   69.55
Embeddings
Aria†_e=25    82.30   82.30   66.94   66.96   69.00   68.50   65.50   65.41   84.34   84.56   59.09   54.29
Aria_e=1      92.90   92.90   80.53   80.69   83.80   83.71   87.60   87.62   92.77   93.71   59.09   57.80
Aria_τ=0.05   92.40   92.40   81.34   81.48   84.00   83.85   89.90   89.90   95.18   95.71   59.09   54.32
Aria_τ=0.5    92.30   92.30   73.43   73.63   80.70   80.56   70.20   70.05   91.57   92.70   54.55   45.00
Finetuning
Aria_n=100    89.50   89.50   68.26   68.20   70.20   70.64   65.30   64.10   -       -       -       -
Aria_n=200    91.10   91.10   75.25   75.54   75.10   75.68   78.10   78.08   -       -       -       -
Aria_n=500    90.80   90.80   79.31   79.49   80.90   80.91   85.20   85.18   -       -       -       -
Aria_n=1000   91.40   91.40   80.63   80.68   82.90   83.01   90.10   90.12   -       -       -       -
Table 2. Classification performance across symbolic music tasks. We report maximum accuracy (Acc) and macro-F1 scores (F1) for each task. Main Results compare our embedding model (Aria_Emb) and supervised finetuned model (Aria_FT) to other models (MERT, M3, CLaMP 3). Embedding ablations vary key components of the contrastive learning setup: training epochs (e), temperature parameter (τ), and without pretraining (†), while keeping all other settings the same as Aria_Emb. Finetuning ablations show test-set performance as a function of the number of labeled training files (n).
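The layer-wise dropout schedule for residual connections amounts to a linear interpolation across depth. The sketch below shows that interpolation only; the function name is chosen for this example, and the number of layers passed in is hypothetical, since the layer count is not restated here.

```python
from typing import List


def layer_dropout_rates(num_layers: int, p_first: float = 0.0,
                        p_last: float = 0.2) -> List[float]:
    """Dropout rate per layer, rising linearly from p_first to p_last."""
    if num_layers == 1:
        return [p_first]
    step = (p_last - p_first) / (num_layers - 1)
    return [p_first + i * step for i in range(num_layers)]


# Hypothetical 5-layer example: rates rise in equal steps toward 0.2.
rates = layer_dropout_rates(5)
```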
4.2 Results
Table 1 reports the results of our listening test. Participants consistently preferred the musical coherence of continuations produced by our model over those from Anticipatory Music Transformer and MusicGen. This signals a notable improvement in symbolic models of piano performance generation, which we primarily attribute to the scale of our training dataset, given our standardized setup. It also highlights limitations in audio models like MusicGen, whose restricted context window necessitates sliding-window inference, diminishing coherence in longer generations. Conversely, we found no statistically significant preference difference between our model's outputs and either Suno 3.5 or human-composed ground-truth continuations. We acknowledge two key limitations. Firstly, we could not include closed-access models like AudioLM [13], despite their promising reported results on similar piano-continuation benchmarks. Secondly, our evaluation excludes popular symbolic models such as MuPT [3], as their bar-level timing representation (e.g., ABC notation) is incompatible with expressive millisecond-level MIDI performances.
Table 2 summarizes the results of our linear probe and supervised finetuning classification experiments, alongside an ablation study of training configurations for contrastive learning. Our proposed method for semi-supervised representation learning substantially improves results on all benchmarks, producing embeddings that capture diverse file-level musical attributes without incorporating metadata during training. The ablation study further highlights the importance of initializing contrastive training from pretrained next-token representations, demonstrating that our contrastive method is competitive only when applied as a finetuning stage. Notably, finetuning on one embedding pair per file for a single epoch (Aria_e=1) surpasses training from scratch on 25 pairs per file (Aria†_e=25). While this represents an advancement, we note that our benchmarks focus exclusively on piano performances, whereas the comparison models support multi-instrument MIDI or audio files. Finally, our supervised finetuning experiments demonstrate the strong adaptability of next-token prediction SSL frameworks to supervised symbolic MIR tasks. Our finetuned models achieve state-of-the-art classification performance on large datasets and perform surprisingly well on complex tasks, even when trained on limited labeled data.
5. CONCLUSION
We introduce Aria, an autoregressive generative transformer model designed to investigate the scalability of self-supervised learning for symbolic music modeling. Our experiments show that this pretraining framework effectively adapts to generative modeling, MIDI-embedding generation, and supervised MIR tasks. Moreover, our findings suggest that careful data curation and large-scale training can unlock new opportunities for downstream symbolic music applications, particularly in settings where data is scarce.
6. ACKNOWLEDGMENTS
This work was supported by UKRI and EPSRC under grant EP/S022694/1. Additional support was provided by EleutherAI and StabilityAI, as well as a compute grant from the Ministry of Science and ICT of Korea and Gwangju Metropolitan City.
7. REFERENCES
[1]
J. Gui, T. Chen, J. Zhang, Q. Cao, Z. Sun, H. Luo, and
D. Tao, “A su ey on sel -supe ised lea ning: Algo-
i hms, applica ions, and u u e ends,” IEEE T ans-
ac ions on Pa e n Analysis and Machine In elligence,
2024.
[2]
Y. Wang, S. Wu, J. Hu, X. Du, Y. Peng, Y. Huang,
S. Fan, X. Li, F. Yu, and M. Sun, “No agen: Ad anc-
ing musicali y in symbolic music gene a ion wi h la ge
language model aining pa adigms,” a Xi p ep in
a Xi :2502.18008, 2025.
[3]
X. Qu, Y. Bai, Y. Ma, Z. Zhou, K. M. Lo, J. Liu, R. Yuan,
L. Min, X. Liu, T. Zhang e al., “Mup : A gene a i e
symbolic music p e ained ans o me ,” a Xi p ep in
a Xi :2404.06393, 2024.
[4]
M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and T.-Y. Liu,
“Musicbe : Symbolic music unde s anding wi h la ge-
scale p e- aining,” a Xi p ep in a Xi :2106.05630,
2021.
[5]
S. Wu, D. Yu, X. Tan, and M. Sun, “Clamp: Con-
as i e language-music p e- aining o c oss-modal
symbolic music in o ma ion e ie al,” a Xi p ep in
a Xi :2304.11029, 2023.
[6]
C. Ra el, “Lea ning-based me hods o compa ing se-
quences, wi h applica ions o audio- o-midi alignmen
and ma ching,” Ph.D. disse a ion, Columbia Uni e si y,
2016.
[7]
IMSLP. (2006) IMSLP/Pe ucci music lib a y. IMSLP.
[Online]. A ailable: h ps://imslp.o g
[8]
A. Rad o d, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh,
S. Aga wal, G. Sas y, A. Askell, P. Mishkin, J. Cla k
e al., “Lea ning ans e able isual models om na u al
language supe ision,” in In e na ional con e ence on
machine lea ning. PmLR, 2021, pp. 8748–8763.
[9]
H. Tou on, T. La il, G. Izaca d, X. Ma ine , M.-A.
Lachaux, T. Lac oix, B. Roziè e, N. Goyal, E. Hamb o,
F. Azha e al., “Llama: Open and e icien ounda ion
language models,” a Xi p ep in a Xi :2302.13971,
2023.
[10]
C. Zhou, P. Liu, P. Xu, S. Iye , J. Sun, Y. Mao, X. Ma,
A. E a , P. Yu, L. Yu e al., “Lima: Less is mo e o
alignmen ,” Ad ances in Neu al In o ma ion P ocessing
Sys ems, ol. 36, pp. 55 006–55 021, 2023.
[11]
A. Kolesniko , L. Beye , X. Zhai, J. Puigce e , J. Yung,
S. Gelly, and N. Houlsby, “Big ans e (bi ): Gene al
isual ep esen a ion lea ning,” in Compu e Vision–
ECCV 2020: 16 h Eu opean Con e ence, Glasgow, UK,
Augus 23–28, 2020, P oceedings, Pa V 16. Sp inge ,
2020, pp. 491–507.
[12]
F. K euk, G. Synnae e, A. Polyak, U. Singe , A. Dé-
ossez, J. Cope , D. Pa ikh, Y. Taigman, and Y. Adi,
“Audiogen: Tex ually guided audio gene a ion,” a Xi
p ep in a Xi :2209.15352, 2022.
[13]
Z. Bo sos, R. Ma inie , D. Vincen , E. Kha i ono ,
O. Pie quin, M. Sha i i, D. Roblek, O. Teboul, D. G ang-
ie , M. Tagliasacchi e al., “Audiolm: A language mod-
eling app oach o audio gene a ion,” IEEE/ACM ans-
ac ions on audio, speech, and language p ocessing,
ol. 31, pp. 2523–2533, 2023.
[14]
A. Dé ossez, J. Cope , G. Synnae e, and Y. Adi,
“High ideli y neu al audio comp ession,” a Xi p ep in
a Xi :2210.13438, 2022.
[15]
N. Zeghidou , A. Luebs, A. Om an, J. Skoglund, and
M. Tagliasacchi, “Sounds eam: An end- o-end neu-
al audio codec,” IEEE/ACM T ansac ions on Audio,
Speech, and Language P ocessing, ol. 30, pp. 495–
507, 2021.
[16]
A. Agos inelli, T. I. Denk, Z. Bo sos, J. Engel,
M. Ve ze i, A. Caillon, Q. Huang, A. Jansen,
A. Robe s, M. Tagliasacchi e al., “Musiclm: Gene a -
ing music om ex ,” a Xi p ep in a Xi :2301.11325,
2023.
[17]
J. Cope , F. K euk, I. Ga , T. Remez, D. Kan , G. Syn-
nae e, Y. Adi, and A. Dé ossez, “Simple and con ol-
lable music gene a ion,” Ad ances in Neu al In o ma-
ion P ocessing Sys ems, ol. 36, pp. 47 704–47 720,
2023.
[18]
Z. Bo sos, M. Sha i i, D. Vincen , E. Kha i ono ,
N. Zeghidou , and M. Tagliasacchi, “Sounds o m:
E icien pa allel audio gene a ion,” a Xi p ep in
a Xi :2305.09636, 2023.
[19]
W.-N. Hsu, B. Bol e, Y.-H. H. Tsai, K. Lakho ia,
R. Salakhu dino , and A. Mohamed, “Hube : Sel -
supe ised speech ep esen a ion lea ning by masked
p edic ion o hidden uni s,” 2021. [Online]. A ailable:
h ps://a xi .o g/abs/2106.07447
[20]
A. Bae ski, Y. Zhou, A. Mohamed, and M. Auli,
“wa 2 ec 2.0: A amewo k o sel -supe ised lea ning
o speech ep esen a ions,” Ad ances in neu al in o -
ma ion p ocessing sys ems, ol. 33, pp. 12 449–12 460,
2020.
[21]
E. Bene os, S. Dixon, Z. Duan, and S. Ewe , “Au o-
ma ic music ansc ip ion: An o e iew,” IEEE Signal
P ocessing Magazine, ol. 36, no. 1, pp. 20–30, 2018.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
457
[22]
Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-
esolu ion piano ansc ip ion wi h pedals by eg essing
onse and o se imes,” IEEE/ACM T ansac ions on
Audio, Speech, and Language P ocessing, ol. 29, pp.
3707–3717, 2021.
[23]
K. Toyama, T. Akama, Y. Ikemiya, Y. Takida, W.-H.
Liao, and Y. Mi su uji, “Au oma ic piano ansc ip ion
wi h hie a chical equency- ime ans o me ,” a Xi
p ep in a Xi :2307.04305, 2023.
[24]
Y. Yan and Z. Duan, “Sco ing ime in e als using non-
hie a chical ans o me o au oma ic piano ansc ip-
ion,” a Xi p ep in a Xi :2404.09466, 2024.
[25]
Q. Kong, B. Li, J. Chen, and Y. Wang, “Gian midi-
piano: A la ge-scale midi da ase o classical piano
music,” a Xi p ep in a Xi :2010.07061, 2020.
[26] H. Zhang, J. Tang, S. Ra ee, S. Dixon, and G. Fazekas,
“A epp: A da ase o au oma ically ansc ibed exp es-
si e piano pe o mance,” in P oceedings o he 23 d
In e na ional Socie y o Music In o ma ion Re ie al
Con e ence (ISMIR), 2022.
[27]
D. Edwa ds, S. Dixon, and E. Bene os, “Pijama: Piano
jazz wi h au oma ic midi anno a ions,” T ansac ions
o he In e na ional Socie y o Music In o ma ion Re-
ie al, 2023.
[28]
L. B adshaw and S. Col on, “A ia-midi: A da ase
o piano midi iles o symbolic music modeling,” in
In e na ional Con e ence on Lea ning Rep esen a ions,
2025. [Online]. A ailable: h ps://open e iew.ne /
o um?id=X5h hgndxW
[29]
J. Thicks un, D. Hall, C. Donahue, and P. Liang,
“An icipa o y music ans o me ,” a Xi p ep in
a Xi :2306.08620, 2023.
[30]
I. Suno, “Suno AI 3.5,” 2024, compu e so wa e.
[Online]. A ailable: h ps://sunnoai.com/ 3-5/
[31]
G. Hadje es, F. Pache , and F. Nielsen, “Deepbach: A
s ee able model o bach cho ales gene a ion,” in In-
e na ional con e ence on machine lea ning. PMLR,
2017, pp. 1362–1371.
[32]
C.-Z. A. Huang, T. Cooijmans, A. Robe s, A. Cou ille,
and D. Eck, “Coun e poin by con olu ion,” a Xi
p ep in a Xi :1903.07227, 2019.
[33]
F. T. Liang, M. Go ham, M. Johnson, and J. Sho on,
“Au oma ic s ylis ic composi ion o bach cho ales wi h
deep ls m,” in ISMIR, 2017, pp. 449–456.
[34]
S. Oo e, I. Simon, S. Dieleman, D. Eck, and K. Si-
monyan, “This ime wi h eeling: Lea ning exp essi e
musical pe o mance,” Neu al Compu ing and Applica-
ions, ol. 32, pp. 955–967, 2020.
[35]
C.-Z. A. Huang, A. Vaswani, J. Uszko ei , N. Shazee ,
I. Simon, C. Haw ho ne, A. M. Dai, M. D. Ho man,
M. Dinculescu, and D. Eck, “Music ans o me ,” a Xi
p ep in a Xi :1809.04281, 2018.
[36]
C. Haw ho ne, A. S asyuk, A. Robe s, I. Simon, C.-
Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and
D. Eck, “Enabling ac o ized piano music modeling
and gene a ion wi h he maes o da ase ,” a Xi p ep in
a Xi :1810.12247, 2018.
[37]
I. Simon, C.-Z. A. Huang, J. Engel, C. Haw ho ne,
and M. Dinculescu, “Gene a ing piano music
wi h ans o me ,” h ps://magen a. enso low.o g/
piano- ans o me , Sep embe 2019, blog pos .
[Online]. A ailable: h ps://magen a. enso low.o g/
piano- ans o me
[38]
C. Payne, “Musene ,” 2019, openAI, 25 Ap . 2019.
[Online]. A ailable: h ps://openai.com/blog/musene
[39]
Y.-S. Huang and Y.-H. Yang, “Pop music ans o me :
Bea -based modeling and gene a ion o exp essi e pop
piano composi ions,” in P oceedings o he 28 h ACM
in e na ional con e ence on mul imedia, 2020, pp. 1180–
1188.
[40]
B. Yu, P. Lu, R. Wang, W. Hu, X. Tan, W. Ye, S. Zhang,
T. Qin, and T.-Y. Liu, “Muse o me : T ans o me wi h
ine-and coa se-g ained a en ion o music gene a ion,”
Ad ances in neu al in o ma ion p ocessing sys ems,
ol. 35, pp. 1376–1388, 2022.
[41]
D. on Rü e, L. Biggio, Y. Kilche , and T. Ho mann,
“Figa o: Gene a ing symbolic music wi h ine-g ained
a is ic con ol,” a Xi p ep in a Xi :2201.10936,
2022.
[42]
P. Lu, X. Xu, C. Kang, B. Yu, C. Xing, X. Tan, and
J. Bian, “Musecoco: Gene a ing symbolic music om
ex ,” a Xi p ep in a Xi :2306.00110, 2023.
[43]
C. Walshaw, “Abc no a ion,” abcno a ion.com, 2008,
e ie ed 1 Ma ch 2008.
[44]
A. Robe s, J. Engel, C. Ra el, C. Haw ho ne, and
D. Eck, “A hie a chical la en ec o model o lea ning
long- e m s uc u e in music,” in In e na ional con e -
ence on machine lea ning. PMLR, 2018, pp. 4364–
4373.
[45]
S. Wu, Y. Wang, R. Yuan, Z. Guo, X. Tan, G. Zhang,
M. Zhou, J. Chen, X. Mu, Y. Gao e al., “Clamp 2:
Mul imodal music in o ma ion e ie al ac oss 101 lan-
guages using la ge language models,” a Xi p ep in
a Xi :2410.13267, 2024.
[46]
S. Wu, Z. Guo, R. Yuan, J. Jiang, S. Doh, G. Xia, J. Nam,
X. Li, F. Yu, and M. Sun, “Clamp 3: Uni e sal music
in o ma ion e ie al ac oss unaligned modali ies and
unseen languages,” a Xi p ep in a Xi :2502.10362,
2025.
[47]
S. Böck, F. K ebs, and G. Widme , “Join bea and
downbea acking wi h ecu en neu al ne wo ks.” in
ISMIR. New Yo k Ci y, 2016, pp. 255–261.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
458
[48]
C. Ra el, N. Shazee , A. Robe s, K. Lee, S. Na ang,
M. Ma ena, Y. Zhou, W. Li, and P. J. Liu, “Explo ing
he limi s o ans e lea ning wi h a uni ied ex - o-
ex ans o me ,” Jou nal o machine lea ning esea ch,
ol. 21, no. 140, pp. 1–67, 2020.
[49]
M. Good, “MusicXML: An in e ne - iendly
o ma o shee music,” in P oceedings o
XML 2001 Con e ence, 2001. [Online]. A ail-
able: h ps://michaelgood.in o/publica ions/music/
musicxml-an-in e ne - iendly- o ma - o -shee -music/
[50]
“MIDI speci ica ion,” 1996. [Online]. A ailable:
h ps://midi.o g/midi-1-0-de ailed-speci ica ion
[51]
C. Haw ho ne, I. Simon, R. Swa ely, E. Manilow, and
J. Engel, “Sequence- o-sequence piano ansc ip ion
wi h ans o me s,” a Xi p ep in a Xi :2107.09142,
2021.
[52]
N. Lee, K. S eeni asan, J. D. Lee, K. Lee, and
D. Papailiopoulos, “Teaching a i hme ic o small
ans o me s,” 2023. [Online]. A ailable: h ps:
//a xi .o g/abs/2307.03381
[53]
S. McLeish, A. Bansal, A. S ein, N. Jain,
J. Ki chenbaue , B. R. Ba oldson, B. Kailkhu a,
A. Bha ele, J. Geiping, A. Schwa zschild, and
T. Golds ein, “T ans o me s can do a i hme ic wi h
he igh embeddings,” 2024. [Online]. A ailable:
h ps://a xi .o g/abs/2405.17399
[54]
W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang,
“Compound wo d ans o me : Lea ning o compose
ull-song music o e dynamic di ec ed hype g aphs,”
in P oceedings o he AAAI Con e ence on A i icial
In elligence, ol. 35, no. 1, 2021, pp. 178–186.
[55]
A. G a a io i, A. Dubey, A. Jauh i, A. Pandey, A. Ka-
dian, A. Al-Dahle, A. Le man, A. Ma hu , A. Schel en,
A. Vaughan e al., “The llama 3 he d o models,” a Xi
p ep in a Xi :2407.21783, 2024.
[56]
J. Ho mann, S. Bo geaud, A. Mensch, E. Bucha skaya,
T. Cai, E. Ru he o d, D. d. L. Casas, L. A. Hen-
d icks, J. Welbl, A. Cla k e al., “T aining compu e-
op imal la ge language models,” a Xi p ep in
a Xi :2203.15556, 2022.
[57]
A. Vaswani, N. Shazee , N. Pa ma , J. Uszko ei ,
L. Jones, A. N. Gomez, Ł. Kaise , and I. Polosukhin,
“A en ion is all you need,” Ad ances in neu al in o ma-
ion p ocessing sys ems, ol. 30, 2017.
[58]
J. L. Ba, J. R. Ki os, and G. E. Hin on, “Laye no mal-
iza ion,” a Xi p ep in a Xi :1607.06450, 2016.
[59]
J. Ainslie, J. Lee-Tho p, M. De Jong, Y. Zemlyanskiy,
F. Leb ón, and S. Sanghai, “Gqa: T aining gene alized
mul i-que y ans o me models om mul i-head check-
poin s,” a Xi p ep in a Xi :2305.13245, 2023.
[60]
B. Zhang and R. Senn ich, “Roo mean squa e laye
no maliza ion,” Ad ances in Neu al In o ma ion P o-
cessing Sys ems, ol. 32, 2019.
[61]
T. Chen, S. Ko nbli h, M. No ouzi, and G. Hin on, “A
simple amewo k o con as i e lea ning o isual ep-
esen a ions,” in In e na ional con e ence on machine
lea ning, 2020, pp. 1597–1607.
[62]
J. Spijke e and J. A. Bu goyne, “Con as i e
lea ning o musical ep esen a ions,” a Xi p ep in
a Xi :2103.09410, 2021.
[63]
J. Choi, S. Jang, H. Cho, and S. Chung, “Towa ds p ope
con as i e sel -supe ised lea ning s a egies o mu-
sic audio ep esen a ion,” in 2022 IEEE In e na ional
Con e ence on Mul imedia and Expo (ICME). IEEE,
2022, pp. 1–6.
[64]
T. Gao, X. Yao, and D. Chen, “Simcse: Simple
con as i e lea ning o sen ence embeddings,” a Xi
p ep in a Xi :2104.08821, 2021.
[65]
L. Wang, N. Yang, X. Huang, L. Yang, R. Majumde ,
and F. Wei, “Imp o ing ex embeddings wi h la ge
language models,” a Xi p ep in a Xi :2401.00368,
2023.
[66]
Q. Huang, A. Jansen, J. Lee, R. Gan i, J. Y. Li, and D. P.
Ellis, “Mulan: A join embedding o music audio and
na u al language,” a Xi p ep in a Xi :2208.12415,
2022.
[67]
Moda , “Piano eq,” h ps://www.moda .com/
piano eq, accessed: 2025-03-28.
[68]
Y.-H. Chou, I. Chen, C.-J. Chang, J. Ching, Y.-H.
Yang e al., “Midibe -piano: La ge-scale p e- aining
o symbolic music unde s anding,” a Xi p ep in
a Xi :2107.05223, ol. 2, 2021.
[69]
L. N. Fe ei a and J. Whi ehead, “Lea ning o
gene a e music wi h sen imen ,” a Xi p ep in
a Xi :2103.06125, 2021.
[70]
Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin,
C. Xiao, C. Lin, A. Ragni, E. Bene os e al.,
“Me : Acous ic music unde s anding model wi h
la ge-scale sel -supe ised aining,” a Xi p ep in
a Xi :2306.00107, 2023.
[71]
J. De lin, M.-W. Chang, K. Lee, and K. Tou ano a,
“Be : P e- aining o deep bidi ec ional ans o me s
o language unde s anding,” in P oceedings o he 2019
con e ence o he No h Ame ican chap e o he asso-
cia ion o compu a ional linguis ics: human language
echnologies, olume 1 (long and sho pape s), 2019,
pp. 4171–4186.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
459
