PianoBind: A Multi-Modal Joint Embedding Model for Pop-Piano Music

Author: Hayeon Bang; Eunjin Choi; Seungheon Doh; Juhan Nam

Publisher: Zenodo

DOI: 10.5281/zenodo.17706422

Source: https://zenodo.org/records/17706422/files/000045.pdf

PIANOBIND: A MULTIMODAL JOINT EMBEDDING MODEL FOR
POP-PIANO MUSIC
Hayeon Bang Eunjin Choi Seungheon Doh Juhan Nam
Graduate School of Culture Technology, KAIST, South Korea
{hayeonbang,jech,seungheondoh,juhan.nam}@kaist.ac.kr
ABSTRACT
Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.
1. INTRODUCTION
The piano stands as a uniquely versatile solo instrument capable of conveying complex polyphonic musical expression through a single instrument. With its expansive tonal range, harmonic possibilities, and expressive capabilities (even allowing orchestral works to be effectively performed on a single keyboard), piano music encompasses diverse genres, styles, and expressive content. Numerous studies in the Music Information Retrieval (MIR) field have targeted tasks focused on piano music, such as piano music generation [1–4] and automatic music transcription [5–8]. However, research on piano-specific representation models remains limited. Existing approaches are typically constrained to a single modality [9, 10], failing to reflect the inherently multimodal nature of piano music that encompasses audio recordings, symbolic MIDI, and semantic descriptions.
Figure 1. Illustration of PianoBind: A multimodal piano music representation model integrating audio, MIDI, and text.
© Hayeon Bang, Eunjin Choi, Seungheon Doh, and Juhan Nam. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Hayeon Bang, Eunjin Choi, Seungheon Doh, and Juhan Nam, “PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Recent advances in multimodal joint embedding models have shown promise in bridging the gap between audio and text domains [11–15] and symbolic domains [16–18], yet they often fall short in specialized domains, particularly in solo piano music. Despite their versatility across diverse music categories, general-purpose models typically lack the sensitivity needed to capture subtle semantic nuances within homogeneous solo piano music. This limitation arises primarily from the scarcity of high-quality piano-specific data in general-purpose music-text datasets, which are typically dominated by multi-instrumental or vocal-centric music.
In this work, we present PianoBind, a multimodal joint embedding model that integrates multiple modalities of solo piano music (audio, symbolic (MIDI), and textual descriptions) within a unified embedding space (illustrated in Figure 1), enabling a more comprehensive representation. Our goal is to capture the fine-grained semantic characteristics of piano music, spanning genre, mood, and style, that are often overlooked by large-scale, general-purpose models. To achieve this, we use the PIAST dataset [19] for training, a pop-piano music (e.g. new-age, piano cover, jazz and its sub-genres) dataset that includes audio, MIDI, and textual descriptions. We systematically explore multi-source training strategies tailored to the characteristics of domain-specific datasets. We utilize a comparatively large amount of automatically collected textual data with weaker temporal alignment to audio. This approach compensates for the limited amount of human-annotated data. Furthermore, we propose effective methods for combining multimodal information both during training and at retrieval time, where joint audio-symbolic embeddings can significantly enhance text-to-music retrieval, particularly in distinguishing highly similar piano pieces.
Our experiments on both in-domain and out-of-domain piano datasets show that PianoBind outperforms general-purpose music joint embedding models in capturing the nuances of piano solo music. By focusing on small-scale, homogeneous datasets, our findings also offer valuable guidelines for developing specialized multimodal representation learning approaches in other domains with limited data. As such, this study contributes to the growing body of piano-centric MIR research. It also contributes to broader discussions on efficient and fine-grained multimodal modeling. These contributions are especially relevant when data availability is inherently constrained. We have publicly released code and pretrained weights for PianoBind, with the demo online (footnote 1).
2. RELATED WORKS
2.1 Piano Music Representation Learning
Piano music has long served as a central subject in MIR, owing to its structural richness and expressive depth. However, despite this sustained attention, existing representation learning approaches for piano music are predominantly unimodal, relying solely on symbolic or audio data, thus failing to reflect the inherently multimodal nature of piano music.
Early piano music representation in the symbolic music domain primarily focused on disentangling low-level symbolic features, mostly with the goal of enhancing controllability in music generation tasks. Models such as PianoTree VAE [20], Wang et al. [21], and CollageNet [22] exemplify this approach, decomposing music into attributes such as rhythm, harmony, texture, and structure to facilitate user-controlled music generation.
More recently, transformer-based models like MidiBERT-Piano [9] and PianoBART [10] have expanded piano music understanding through large-scale pretraining, capturing both low-level features and higher-level musical attributes. MidiBERT-Piano introduced masked modeling objectives for solo piano MIDI data, demonstrating strong transferability to downstream tasks such as composer classification and expressive attribute prediction. Building upon this foundation, PianoBART extended these capabilities from understanding toward generation tasks, facilitating more sophisticated music creation and symbolic inpainting tasks. Nonetheless, these models remain purely symbolic, lacking integration with acoustic information or natural language semantics.
Footnote 1: https://hayeonbang.github.io/PianoBind/
Despite the progress in piano music representation learning, multimodal understanding of piano music has been limited, primarily due to the scarcity of datasets that support such approaches. Until recently, few resources existed that combined multiple modalities of piano performance. EMOPIA [23] made an initial step by providing paired audio and MIDI recordings with emotion labels, while PIAST [19] has recently extended this further by adding comprehensive textual annotations describing genre and mood. However, even with these multimodal datasets becoming available, no prior research has proposed an integrated approach that jointly leverages audio, symbolic, and textual modalities for piano music understanding.
2.2 Multimodal Joint Embedding Models
Multimodal joint embedding models aim to align data from different modalities in a shared embedding space. This alignment creates semantically meaningful representations that capture relationships between modalities, enabling cross-modal retrieval, understanding, and conditioned generation. MuLan [11] pioneered large-scale audio–text training with over 44 million pairs, showing strong retrieval capabilities. MusCALL [12] proposed an audio-text dual encoder architecture, demonstrating improved retrieval performance and downstream performance. CLAP [13] expanded on this by leveraging audio captioning corpora and keyword-to-caption augmentation strategies, and introduced a joint audio-text embedding model. Further refinements include models like TTMR [14] and TTMR++ [15], which address varying query granularities, ranging from single tags to full sentences, and incorporate rich metadata to produce more descriptive and contextual text embeddings. These methods improved retrieval accuracy by modeling both linguistic and musical subtleties.
In parallel, symbolic-text joint embedding models have been explored through the CLaMP series [16–18], which introduced contrastive learning between symbolic music (e.g., ABC or MIDI) and textual descriptions. While CLaMP pioneered symbolic music retrieval through text-ABC joint training, CLaMP2 expanded this to multilingual text and MIDI data, and CLaMP3 further incorporated audio and performance signals, marking the first trimodal framework in music representation learning to jointly align audio, symbolic, and textual modalities.
Despite these advances, most current models are trained on general-purpose, large-scale datasets covering a broad range of musical styles and instrumentation. As a result, they often fall short of capturing the fine-grained semantic differences required for more homogeneous domains like solo piano music. These models tend to underperform in settings where subtle variations in genre, mood, and style must be accurately distinguished.
To address this gap, our proposed model PianoBind integrates piano-specific audio, symbolic, and textual modalities in a unified embedding space, enabling more precise retrieval within this specialized musical domain.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 2. Training strategies of PianoBind: (1) Multi-source training combines small strongly aligned (human-annotated) and large weakly aligned (automatically collected) datasets; (2) Trimodal learning simultaneously aligns audio, symbolic, and textual embeddings; (3) Multimodal item embedding merges multiple modalities into a unified representation.
3. PIANOBIND
Considering the specific characteristics of piano solo datasets, such as homogeneous data distribution, limited dataset size, and multimodality, we propose a multimodal joint embedding model specialized for solo piano music. In this section, we describe the overall architecture of PianoBind (section 3.1), and the training strategies it explores (section 3.2), comprising: (1) multi-source learning with strongly and weakly aligned pairs, and (2) modality integration across audio, symbolic, and textual features.
3.1 Architecture Overview
3.1.1 Audio Encoder
Following previous work [12], we adopt a modified ResNet-50 [24] architecture to process mel-spectrogram representations of piano recordings. We extract 128-band mel-spectrograms with a 1024-point FFT, 512-point hop length, and apply log-scaling. As in MusCALL, we apply three stem convolutional layers followed by average pooling, and implement anti-aliased blur pooling. We also follow the downsizing strategy employed in the previous work.
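As a rough check on the size of the audio encoder's input, the sketch below counts the STFT frames that the stated settings would produce. It assumes the 20-second, 16 kHz clips described in Section 4.3 and the common centred-frame convention (one frame per hop, plus one); the padding convention is not stated in the text, so the function name `num_stft_frames` and its `center` flag are illustrative assumptions only.

```python
def num_stft_frames(num_samples: int, hop_length: int, n_fft: int, center: bool = True) -> int:
    """Number of STFT frames for a signal of num_samples samples."""
    if center:
        # Centred framing pads the signal so frames are centred at multiples of hop_length.
        return 1 + num_samples // hop_length
    # Without padding, only windows that fit entirely inside the signal are kept.
    return 1 + (num_samples - n_fft) // hop_length


sample_rate = 16_000   # Hz, from Section 4.3
clip_seconds = 20      # from Section 4.3
num_samples = sample_rate * clip_seconds

frames = num_stft_frames(num_samples, hop_length=512, n_fft=1024)
mel_shape = (128, frames)  # (mel bands, time frames) under the assumptions above
```

Under these assumptions each clip becomes a 128 x 626 log-mel matrix before it enters the ResNet-style encoder.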
3.1.2 Symbolic Encoder
We utilize MidiBERT [9] as a symbolic encoder. The model handles bar, position, pitch, and duration information of MIDI by adapting Compound Word (CP) representation [2] into its model. Since MidiBERT does not contain the [CLS] token, we obtained the final MIDI embedding by mean-pooling over the sequence of hidden states from the last Transformer layer, resulting in a dense representation that is then projected to the shared embedding space.
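The mean-pooling step can be sketched as follows. This is a minimal illustration in plain Python rather than the model's tensor code, and it assumes that padded positions are excluded through an attention mask, which the text does not specify; `masked_mean_pool` is a hypothetical helper name.

```python
from typing import List, Sequence


def masked_mean_pool(hidden_states: Sequence[Sequence[float]],
                     attention_mask: Sequence[int]) -> List[float]:
    """Average the hidden vectors of real tokens, skipping padded positions."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have equal length")
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vector[d]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Three token positions with 2-dimensional states; the last one is padding.
states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(states, mask)  # mean of the first two rows only
```

The same pooling applies to the text encoder described next, where the final-layer hidden states are averaged instead of using a pooler output.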
3.1.3 Text Encoder
Our text processing pipeline employs RoBERTa [25], using its byte-pair encoding (BPE) tokenizer. The tokenized input passes through 12 Transformer layers with 768-dimensional hidden states. Since RoBERTa lacks the standard pooler output found in BERT, we create sentence-level representations by mean-pooling across the final layer’s hidden states. These textual embeddings are also mapped to our shared space through a linear projection layer.
3.1.4 Joint Embedding with Contrastive Loss
The representations from our three encoders (audio, MIDI, and text) are aligned through modality-specific linear projections into a 512-dimensional shared embedding space. All embeddings undergo ℓ2-normalization to ensure consistent scaling across modalities. Cross-modal similarities are computed via dot products between these normalized embeddings, providing the foundation for our various training objectives. We use the N-pair Contrastive loss, known as the InfoNCE [26] loss, which maximizes the cosine similarity between positive music-text embedding pairs while minimizing the similarity of negative pairs. For audio-text alignment, the InfoNCE loss is defined as follows:
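$$\mathcal{L}_{a \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(z_{a,i} \cdot z^{+}_{t,i} / \tau)}{\sum_{z \in \{z^{+}_{t,i},\, z^{-}_{t,i}\}} \exp(z_{a,i} \cdot z / \tau)} \qquad (1)$$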
where $z_{a,i}$ and $z_{t,i}$ refer to audio and text embeddings respectively, in the audio-text training. An analogous loss is computed for MIDI–text pairs by substituting audio embeddings with symbolic ones. $\tau$ represents the temperature parameter, and $z^{-}_{t,i}$ denotes a set of negative text embeddings. This aims to align embeddings between relevant music-text pairs while separating irrelevant pairs in the embedding space. The total loss is computed by symmetrically combining losses in both directions (music-to-text and text-to-music). The final symmetric loss is defined as:
$$\mathcal{L}_{a \leftrightarrow t} = \frac{\mathcal{L}_{a \to t} + \mathcal{L}_{t \to a}}{2} \qquad (2)$$
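The two loss equations can be made concrete with a small plain-Python sketch. It assumes in-batch negatives (every other text in the batch acts as a negative for a given music item) and uses a fixed temperature of 0.07 purely for illustration, whereas the paper learns the temperature jointly with the model; the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def info_nce(queries: Sequence[Vector], keys: Sequence[Vector], tau: float) -> float:
    """One-directional loss: keys[i] is the positive for queries[i];
    every other key in the batch serves as a negative (an assumption)."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / tau for kj in k]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        loss -= logits[i] - log_denominator
    return loss / len(q)


def symmetric_info_nce(music: Sequence[Vector], text: Sequence[Vector],
                       tau: float = 0.07) -> float:
    """Average of the music-to-text and text-to-music losses."""
    return 0.5 * (info_nce(music, text, tau) + info_nce(text, music, tau))


music = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.1], [0.1, 1.0]]   # each text is close to its own music item
text_swapped = [[0.1, 1.0], [1.0, 0.1]]   # pairs deliberately mismatched
loss_aligned = symmetric_info_nce(music, text_aligned)
loss_swapped = symmetric_info_nce(music, text_swapped)
```

Correctly paired embeddings give a lower loss than mismatched ones, and when all keys are identical the one-directional loss reduces to log N, the value for a uniform guess over the batch.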
3.2 Training Framework
3.2.1 Multi-source Training
Considering the small size and specialized annotation required for piano datasets, we explore two strategies to effectively leverage both large-scale, weakly aligned audio–text data and small-scale, expert-annotated data.
Combined Training. We jointly train the model on both sources by mixing them within training batches following prior work [13, 15, 27, 28]. To mitigate data imbalance and noise from weak supervision, we carefully control the sampling ratio between the two sources. This enables the model to learn generalizable language–music alignments while gradually incorporating domain-specific subtleties of solo piano music.
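One way to picture the controlled sampling ratio is a batch sampler that draws each item from the weakly aligned source with a fixed probability. The sketch below is an assumption about the mechanism, not the released implementation: the text states only that the two sources are mixed within batches (Section 4.3 gives a 7:3 ratio between PIAST-YT and PIAST-AT), so per-item sampling, the helper name `sample_mixed_batch`, and the placeholder item identifiers are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def sample_mixed_batch(weak_items: Sequence[str], strong_items: Sequence[str],
                       batch_size: int, weak_fraction: float,
                       rng: random.Random) -> List[Tuple[str, str]]:
    """Draw a batch whose items come from the weak source with probability
    weak_fraction and from the strong source otherwise."""
    batch = []
    for _ in range(batch_size):
        if rng.random() < weak_fraction:
            batch.append(("weak", rng.choice(weak_items)))
        else:
            batch.append(("strong", rng.choice(strong_items)))
    return batch


rng = random.Random(0)
weak = [f"yt_{i}" for i in range(1000)]    # stand-in for weakly aligned items
strong = [f"at_{i}" for i in range(100)]   # stand-in for expert-annotated items
batch = sample_mixed_batch(weak, strong, batch_size=64, weak_fraction=0.7, rng=rng)
weak_share = sum(1 for source, _ in batch if source == "weak") / len(batch)
```

Because the ratio is applied per draw, the share of weak items in any single batch fluctuates around the target; a deterministic alternative would fix the counts per batch instead.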
Pre-training and Fine-tuning. Alternatively, we also adopt the two-stage training strategy, pre-training and fine-tuning, following [29]. The model is first pre-trained on the large-scale, weakly labeled dataset to acquire generalizable music representations. It is then fine-tuned on the smaller, expert-annotated data, with encoder parameters updated. This sequential strategy enables the model to benefit from broad semantic coverage during pre-training, while later adapting to the expressive and stylistic nuances of solo piano music through fine-tuning.
3.2.2 Trimodal Representation Learning
To fully exploit the multimodal characteristics of piano music, we extend beyond traditional bimodal setups by integrating audio, symbolic (MIDI), and text modalities into a unified retrieval framework. Inspired by the training strategy from AudioCLIP [30], we compute the contrastive loss across modality pairs (audio-text, MIDI-text) and average them to form the final objective. However, considering the nature of our MIDI data as transcribed representations derived directly from corresponding audio recordings, including an audio-MIDI loss would not significantly contribute additional semantic distinction. Consequently, we utilize only the audio-text and MIDI-text contrastive losses, calculating their average as our final training objective:
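$$\mathcal{L}_{\text{total}} = \frac{\mathcal{L}_{a \leftrightarrow t} + \mathcal{L}_{m \leftrightarrow t}}{2} \qquad (3)$$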
Furthermore, we leverage multimodal information not only during the training phase but also at evaluation time. To leverage the complementary strengths of different modalities, we propose multimodal item embeddings that use audio and MIDI information. Specifically, audio and MIDI embeddings are integrated through average fusion during the evaluation process.
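A minimal sketch of average fusion at retrieval time is shown below. It assumes that both embeddings are ℓ2-normalized before averaging and that the fused vector is normalized again so dot products remain cosine similarities; the second normalization is an assumption, and `average_fusion` and `rank_items` are hypothetical helper names.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def average_fusion(audio_emb: Vector, midi_emb: Vector) -> List[float]:
    """Fuse one item's audio and MIDI embeddings by averaging them.
    Undefined if the two normalized vectors are exact opposites."""
    a = l2_normalize(audio_emb)
    m = l2_normalize(midi_emb)
    return l2_normalize([(x + y) / 2 for x, y in zip(a, m)])


def rank_items(text_emb: Vector, item_embs: Sequence[Vector]) -> List[int]:
    """Item indices sorted from most to least similar to the text query."""
    q = l2_normalize(text_emb)
    scores = [sum(a * b for a, b in zip(q, item)) for item in item_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


fused = average_fusion([2.0, 0.0], [0.0, 5.0])   # lies midway between both directions
order = rank_items([0.0, 1.0], [[1.0, 0.0], [0.0, 1.0], fused])
```

Averaging is only meaningful when the audio and MIDI embeddings share one space, which is what the joint training objective in Eq. (3) is meant to provide.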
4. EXPERIMENT
4.1 Dataset
This study utilizes the PIAST dataset [19], the first music-text dataset explicitly designed for pop-piano music. The dataset consists of audio, MIDI, and textual descriptions, based on a comprehensive piano-specific taxonomy of 31 semantic tags across genre, emotion/mood, and style. The dataset comprises two subsets: PIAST-YT, a large-scale collection of approximately 7,367 tracks (about 900 hours) automatically collected from YouTube, with accompanying textual metadata (titles, descriptions, and tags) refined using a large language model; and PIAST-AT, a smaller, expert-annotated set of 1,986 tracks (about 17 hours in total, 30 seconds per track). For both subsets, MIDI data is generated via automatic piano transcription. The transcribed MIDI files were synchronized to downbeat estimates, and melody and chord information was extracted.
Since the textual data of PIAST-YT is automatically collected, it exhibits weak alignment with the audio content, introducing considerable noise in the text-audio relationships. To address this challenge, we apply the two multi-source learning strategies described in Section 3, leveraging both the large-scale but weakly aligned PIAST-YT data and the smaller, high-quality PIAST-AT annotations. This combined approach enables more robust representation learning despite the inherent data limitations. For experiments, we use a 9:1 train–validation split of PIAST-YT, and an 8:1:1 train–validation–test split of PIAST-AT.
4.2 Evaluation
We evaluate our model using both in-domain and out-of-domain text-to-music retrieval tasks. For in-domain evaluation, we use the 10% held-out test split from PIAST-AT, comprising 199 tracks. For out-of-domain evaluation, we introduce EMOPIA-Caps by manually annotating descriptive tag labels for the EMOPIA test split [23], and transforming them into natural language captions using a large language model. The initial tags were naturally overlapping with the vocabulary used in the PIAST-AT piano-music taxonomy. To address this and better approximate real-world natural user queries, we paraphrased the tags into free-form natural language captions using GPT-4o [31]. The generated captions were then reviewed and refined by a human music expert with a major in composition, to ensure their semantic accuracy and musical relevance. This transformation not only better approximates user-style queries, but also enables evaluation of the model’s ability to generalize to diverse textual expressions. Both evaluation datasets use sentence-form textual inputs; however, the in-domain set consists of concatenated tags, whereas the out-of-domain set contains free-form, natural language captions.
For both evaluation settings, we perform text-to-music retrieval and report Recall@K (R@1, R@5, R@10) and Median Rank (MedR), as these metrics reflect the model’s ability to generalize to diverse and unconstrained textual queries.
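For reference, Recall@K and Median Rank can be computed from a query-by-item similarity matrix as sketched below. The sketch assumes exactly one ground-truth item per query (item i for query i) and resolves ties optimistically; the evaluation protocol for queries with several matching tracks is not described in the text, so this is an illustration rather than the evaluation script.

```python
from statistics import median
from typing import Dict, Sequence


def rank_of_target(similarities: Sequence[float], target_index: int) -> int:
    """1-based rank of the target when items are sorted by descending similarity."""
    target_score = similarities[target_index]
    return 1 + sum(1 for s in similarities if s > target_score)


def retrieval_metrics(similarity_matrix: Sequence[Sequence[float]],
                      ks: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """Row i holds the similarities of query i to every item; item i is its match."""
    ranks = [rank_of_target(row, i) for i, row in enumerate(similarity_matrix)]
    metrics = {f"R@{k}": 100.0 * sum(1 for r in ranks if r <= k) / len(ranks) for k in ks}
    metrics["MedR"] = median(ranks)
    return metrics


# Toy example: 3 text queries against 3 items.
sims = [
    [0.9, 0.1, 0.0],   # query 0 ranks its own item first
    [0.8, 0.2, 0.1],   # query 1 ranks its own item second
    [0.1, 0.3, 0.2],   # query 2 ranks its own item second
]
metrics = retrieval_metrics(sims, ks=(1, 2))
```

Recall@K is reported as a percentage (higher is better), while Median Rank is the median position of the correct item (lower is better), matching the arrows in the result tables.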
Training Strategy             In-domain (199 tracks)          Out-of-domain (88 tracks)
                              R@1     R@5     R@10    MedR↓   R@1     R@5     R@10    MedR↓
Combined Training
  Audio                       8.04    25.62   37.18   17      2.56    10.26   35.90   17
  Symbolic                    4.02    17.09   27.13   27      2.56    28.21   43.59   13
  Trimodal                    6.53    24.12   37.68   17      5.13    25.64   46.15   14
Pre-training & Fine-tuning
  Audio                       6.53    28.14   42.71   15      7.69    20.51   46.15   12
  Symbolic                    8.04    26.63   45.23   12      5.13    20.51   35.90   12
  Trimodal                    10.55   35.67   52.76   10      15.38   41.03   51.28   10
Table 1. Performance comparison on text-based music retrieval tasks, on the in-domain (PIAST-AT) and out-of-domain (EMOPIA-Caps) datasets.
4.3 Implementation Details
For audio processing, we use 20-second signals with a sampling rate of 16 kHz, consistent with methods established by previous work [12]. To ensure temporal alignment between audio and MIDI, we match each audio segment’s start time (in seconds) with the nearest MIDI bar onset, extracting MIDI tokens from bars that correspond to each audio segment. These sequences are subsequently standardized to exactly 512 tokens through padding or truncation. For text inputs, we employ the RoBERTa tokenizer [25] with a 77-token length limit. To combat overfitting and enhance the diversity of our textual data, we implement a dynamic text dropout strategy building on approaches from several recent works [14–16]. This technique randomly selects and combines available textual elements (such as tags and captions) in varying orders for each training instance. For combined training, we use a 7:3 sampling ratio between PIAST-YT and PIAST-AT.
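Two of the preprocessing steps above, fixed-length MIDI token sequences and dynamic text dropout, can be sketched as follows. Both functions are illustrative assumptions: the padding token, the per-element keep probability of 0.5, and the comma-joined output format are not specified in the text, and the names `pad_or_truncate` and `dynamic_text_dropout` are hypothetical.

```python
import random
from typing import List, Sequence


def pad_or_truncate(tokens: Sequence[int], length: int = 512, pad_token: int = 0) -> List[int]:
    """Cut a token sequence to `length`, or right-pad it with pad_token."""
    clipped = list(tokens[:length])
    return clipped + [pad_token] * (length - len(clipped))


def dynamic_text_dropout(elements: Sequence[str], rng: random.Random,
                         keep_probability: float = 0.5) -> str:
    """Keep a random non-empty subset of the text elements and join them in random order."""
    kept = [e for e in elements if rng.random() < keep_probability]
    if not kept:
        # Always keep at least one element so the text input is never empty.
        kept = [rng.choice(list(elements))]
    rng.shuffle(kept)
    return ", ".join(kept)


tags = ["jazz", "relaxing", "slow tempo", "piano cover"]
rng = random.Random(0)
variants = [dynamic_text_dropout(tags, rng) for _ in range(3)]  # a new text per training step
padded = pad_or_truncate([5, 7, 9], length=6)
```

Calling the dropout function anew for every training instance means the same track is paired with differently worded text across epochs, which is the source of the added textual diversity.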
All models are trained using the AdamW optimizer with a 5e-5 initial learning rate, 0.2 weight decay, and a consistent batch size of 64 across all experiments. For the contrastive loss, we jointly optimize the temperature parameter τ alongside encoder and projection parameters, following successful approaches demonstrated in recent multimodal works [12, 14]. We select optimal model checkpoints based on median rank performance on our validation dataset. Our implementation is based on PyTorch, using automatic mixed precision and trained on an NVIDIA A6000 GPU.
5. RESULTS
5.1 Comparison of Training Strategies
5.1.1 Multi-source training
Table 1 shows the performance of different training strategies across both in-domain (PIAST-AT) and out-of-domain (EMOPIA-Caps) test sets. We compare two multi-source learning approaches: combined training and pre-training followed by fine-tuning, each evaluated using two bimodal configurations (audio–text and symbolic–text) and one trimodal (audio–symbolic–text) configuration.
The results demonstrate that the pre-training and fine-tuning approach generally outperforms the combined training across modality configurations and metrics. In the in-domain setting, the pre-training and fine-tuning approach with trimodal integration achieves the best performance, reaching a Median Rank of 10, compared with 17 for the trimodal model under combined training. Similarly, in the out-of-domain setting, it yields superior results with the same Median Rank of 10, compared with 14 for combined training. While a few isolated metrics, such as R@1 for audio in-domain and R@5 for symbolic out-of-domain, are higher in the combined training setup, these exceptions do not contradict the overall trend. These findings underscore the challenges of small-scale annotated datasets. When dealing with limited high-quality annotations, the combined training approach exposes the model to data imbalance issues, where the large but noisier dataset can potentially overpower the signal from the smaller expert-annotated data. In contrast, the sequential knowledge transfer approach, which first learns generalizable representations from broader data before adapting to specialized piano-specific contexts, enables the model to better leverage both data sources.
5.1.2 Trimodal Integration
Our results demonstrate the substantial performance gains achieved through trimodal integration compared to bimodal approaches. Across both multi-source training strategies, the trimodal model consistently outperforms both audio-text and symbolic-text configurations. The improvements are particularly pronounced in the pre-training and fine-tuning approach, where trimodal integration achieves an in-domain Median Rank of 10 compared to 12 for symbolic-only and 15 for audio-only. This performance advantage extends to in-domain R@10, with the trimodal approach achieving 52.76%, clearly ahead of both symbolic (45.23%) and audio (42.71%).
These findings strongly support our hypothesis that effective representation of piano music requires integrating multiple modalities. While both audio and symbolic representations capture valuable information, with symbolic representations slightly outperforming audio on in-domain retrieval, their combination in a trimodal framework yields representations that more comprehensively capture the semantic nuances of piano music. This suggests that audio and symbolic modalities provide complementary perspectives.
Model             Piano     Item Modality      In-domain (199 tracks)          Out-of-domain (88 tracks)
                  Specific                     R@1     R@5     R@10    MedR↓   R@1     R@5     R@10    MedR↓
CLAP-Music        ✗         Audio              0.00    7.38    9.85    54      5.13    20.51   41.03   15
TTMR++            ✗         Audio              1.47    6.40    12.31   45      5.13    15.38   28.21   16
CLaMP3_saas       ✗         Audio              1.50    7.53    13.56   49      2.56    20.51   41.03   12
CLaMP2            ✗         Symbolic           3.02    8.54    14.57   43      5.13    30.77   43.59   14
CLaMP3_sa^c2      ✗         Symbolic           4.02    12.06   22.11   39      12.82   30.77   46.15   12
CLaMP3_saas       ✗         Audio + Symbolic   1.50    7.53    12.06   68      2.56    33.33   46.15   13
CLaMP3_sa^c2      ✗         Audio + Symbolic   2.51    10.55   17.58   47      7.69    28.20   43.58   13
PianoBind (Ours)  ✓         Audio + Symbolic   10.55   35.67   52.76   10      15.38   41.03   51.28   10
Table 2. Performance comparisons between PianoBind and previous text-to-music retrieval models, conducted on the in-domain (PIAST-AT) and out-of-domain (EMOPIA-Caps) datasets.
5.2 Comparison with Existing Models
Table 2 presents a comparative analysis between PianoBind and existing text-to-music retrieval models. We compare against leading audio-based models (CLAP-Music, TTMR++, and CLaMP3_saas, which is optimized for audio) as well as symbolic-based models, including CLaMP2 and CLaMP3_sa^c2 (optimized for symbolic). All of these models were trained on large-scale, general-purpose datasets. The results demonstrate PianoBind’s substantial performance advantage over general-purpose models. For in-domain retrieval, PianoBind achieves the lowest Median Rank of 10, significantly outperforming the best-performing general-purpose model, CLaMP3_sa^c2, which achieved a Median Rank of 39. This performance advantage extends to out-of-domain retrieval as well, where PianoBind maintains its lead with an R@10 of 51.28% and Median Rank of 10, compared to the next best model, CLaMP3_sa^c2, with 46.15% and 12, respectively.
Additionally, we also extended the models from CLaMP3 by implementing multimodal item embeddings through feature fusion, which is not present in the original work. However, even with this approach, CLaMP3 models with feature fusion underperform compared to their bi-modality results. This underperformance likely stems from the specialized nature of CLaMP3 variants, where each model was optimized for a specific modality.
5.3 Comparative Analysis of Model Design Choices
We conducted additional experiments to validate key model design choices. Specifically, we compared our averaged-loss training strategy in trimodal learning with the saas (symbolic → audio → audio → symbolic) alignment strategy adopted in CLaMP3. As shown in Table 3, our averaged-loss training clearly outperforms both the original CLaMP3_saas model and our own reimplementation (Ours_saas) in both in-domain and out-of-domain retrieval. These results demonstrate the advantage of jointly learning audio–text and MIDI–text embeddings, rather than aligning independently trained modalities through a staged alignment process. The consistent performance gains in Median Rank and R@10 further highlight the effectiveness of our unified training objective in capturing fine-grained semantic relationships in piano music.
Model           ID                OOD
                R@10    MedR↓     R@10    MedR↓
CLaMP3_saas     13.56   49        41.03   12
Ours_saas       33.66   18        30.77   19
Ours_LossAvg    52.76   10        52.76   10
Table 3. Comparison between Saas and Averaged Loss in trimodal learning in both in-domain (ID) and out-of-domain (OOD) evaluations.
6. CONCLUSION
In this paper, we introduced PianoBind, a multimodal joint embedding model designed for pop-piano music, integrating audio, symbolic, and textual modalities. Despite using substantially less training data than general-purpose models, PianoBind achieved strong retrieval performance. Our findings suggest that a sequential multi-source training strategy (pre-training on large-scale noisy data followed by fine-tuning on human-annotated examples) is more effective than training on the two sources simultaneously, particularly in low-resource settings. We also observed that integrating audio and symbolic modalities captures complementary semantic cues, and their joint use leads to more robust embeddings. Moreover, combining audio and symbolic embeddings at inference time improves retrieval performance, provided that the modalities are well-aligned through joint training.
A limitation of our study is that our evaluation datasets, both in-domain and out-of-domain, are relatively small-scale, potentially restricting the generalizability of our findings. Moreover, our datasets primarily focused on pop-piano genres, lacking sufficient representation of classical and other diverse piano genres. Addressing these limitations by constructing larger-scale and more genre-diverse piano-text datasets remains an important direction for future work. This further highlights the need for comprehensive benchmarks to rigorously evaluate multimodal embedding models across varied piano music contexts.
7. ACKNOWLEDGMENTS
This work has been supported by the collaboration with NCSOFT, Korea.
8. REFERENCES
[1] Y.-S. Huang and Y.-H. Yang, “Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 1180–1188.
[2] W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang, “Compound word transformer: Learning to compose full-song music over dynamic directed hypergraphs,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 1, 2021, pp. 178–186.
[3] S.-L. Wu and Y.-H. Yang, “Compose & embellish: Well-structured piano performance generation via a two-stage approach,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[4] C.-P. Tan, H. Ai, Y.-H. Chang, S.-H. Guan, and Y.-H. Yang, “Picogen2: Piano cover generation with transfer learning approach and weakly aligned data,” in Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), San Francisco, CA, United States, Nov. 2024.
[5] S. Sigtia, E. Benetos, and S. Dixon, “An end-to-end neural network for polyphonic piano music transcription,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 927–939, 2016.
[6] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck, “Onsets and frames: Dual-objective piano transcription,” arXiv preprint arXiv:1710.11153, 2017.
[7] Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-resolution piano transcription with pedals by regressing onset and offset times,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3707–3717, 2021.
[8] T. Kwon, D. Jeong, and J. Nam, “Polyphonic piano transcription using autoregressive multi-state note model,” in International Society for Music Information Retrieval Conference, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:222125050
[9] Y.-H. Chou, I. Chen, C.-J. Chang, J. Ching, Y.-H. Yang et al., “MidiBERT-Piano: Large-scale pre-training for symbolic music understanding,” arXiv preprint arXiv:2107.05223, 2021.
[10] X. Liang, Z. Zhao, W. Zeng, Y. He, F. He, Y. Wang, and C. Gao, “Pianobart: Symbolic piano music generation and understanding with large-scale pre-training,” in 2024 IEEE International Conference on Multimedia and Expo (ICME), 2024, pp. 1–6.
[11] Q. Huang, A. Jansen, J. Lee, R. Gan i, J. Y. Li, and D. P.
Ellis, “Mulan: A join embedding o music audio and
na u al language,” a Xi p ep in a Xi :2208.12415,
2022.
[12] I. Manco, E. Bene os, E. Quin on, and G. Fazekas,
“Con as i e audio-language lea ning o music,” in
P oceedings o he 23 d In e na ional Socie y o Mu-
sic In o ma ion Re ie al Con e ence (ISMIR), 2022.
[13] Y. Wu*, K. Chen*, T. Zhang*, Y. Hui*, T. Be g-
Ki kpa ick, and S. Dubno , “La ge-scale con as i e
language-audio p e aining wi h ea u e usion and
keywo d- o-cap ion augmen a ion,” in IEEE In e na-
ional Con e ence on Acous ics, Speech and Signal
P ocessing, ICASSP, 2023.
[14] S. Doh, M. Won, K. Choi, and J. Nam, “Towa d uni-
e sal ex - o-music e ie al,” in ICASSP 2023-2023
IEEE In e na ional Con e ence on Acous ics, Speech
and Signal P ocessing (ICASSP). IEEE, 2023, pp.
1–5.
[15] S. Doh, M. Lee, D. Jeong, and J. Nam, “Enriching
music descriptions with a finetuned-llm and metadata
for text-to-music retrieval,” in ICASSP 2024 - 2024
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2024, pp. 826–830.
[16] S. Wu, D. Yu, X. Tan, and M. Sun, “Clamp: Contrastive
language-music pre-training for cross-modal
symbolic music information retrieval,” arXiv preprint
arXiv:2304.11029, 2023.
[17] S. Wu, Y. Wang, R. Yuan, Z. Guo, X. Tan, G. Zhang,
M. Zhou, J. Chen, X. Mu, Y. Gao et al., “Clamp 2:
Multimodal music information retrieval across 101 languages
using large language models,” arXiv preprint
arXiv:2410.13267, 2024.
[18] S. Wu, Z. Guo, R. Yuan, J. Jiang, S. Doh, G. Xia,
J. Nam, X. Li, F. Yu, and M. Sun, “Clamp 3:
Universal music information retrieval across unaligned
modalities and unseen languages,” 2025. [Online].
Available: https://arxiv.org/abs/2502.10362
[19] H. Bang, E. Choi, M. Finch, S. Doh, S. Lee, G.-H. Lee,
and J. Nam, “PIAST: A multimodal piano dataset with
audio, symbolic and text,” in Proceedings of the 3rd
Workshop on NLP for Music and Audio (NLP4MusA),
Nov. 2024, pp. 5–10.
[20] Z. Wang, Y. Zhang, Y. Zhang, J. Jiang, R. Yang,
J. Zhao, and G. Xia, “Pianotree vae: Structured
representation learning for polyphonic music,” arXiv
preprint arXiv:2008.07118, 2020.
[21] Z. Wang, D. Wang, Y. Zhang, and G. Xia, “Learning
interpretable representation for controllable polyphonic
music generation,” Proceedings of the 23rd International
Society for Music Information Retrieval Conference
(ISMIR), 2020.
[22] A. Wuerkaixi, C. Benetatos, Z. Duan, and C. Zhang,
“Collagenet: Fusing arbitrary melody and accompaniment
into a coherent song,” International Society for
Music Information Retrieval, 2022.
[23] H.-T. Hung, J. Ching, S. Doh, N. Kim, J. Nam, and
Y.-H. Yang, “EMOPIA: A multi-modal pop piano dataset
for emotion recognition and emotion-based music
generation,” in Proceedings of 22nd International Conference
on Music Information Retrieval (ISMIR), 2021.
[24] K. He, Y. Wang, and J. Hopcroft, “A powerful generative
model using random weights for the deep image
representation,” Advances in Neural Information
Processing Systems, vol. 29, 2016.
[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi,
D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and
V. Stoyanov, “Roberta: A robustly optimized BERT
pretraining approach,” CoRR, vol. abs/1907.11692,
2019. [Online]. Available: http://arxiv.org/abs/1907.11692
[26] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation
learning with contrastive predictive coding,” arXiv
preprint arXiv:1807.03748, 2018.
[27] J. Gardner, I. Simon, E. Manilow, C. Hawthorne, and
J. Engel, “Mt3: Multi-task multitrack music transcription,”
arXiv preprint arXiv:2111.03017, 2021.
[28] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick,
and S. Dubnov, “Large-scale contrastive language-audio
pretraining with feature fusion and keyword-to-caption
augmentation,” in ICASSP 2023-2023 IEEE International
Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2023, pp. 1–5.
[29] S. Doh, K. Choi, J. Lee, and J. Nam, “Lp-musiccaps:
Llm-based pseudo music captioning,” in ISMIR, 2023.
[30] A. Guzhov, F. Raue, J. Hees, and A. Dengel, “Audioclip:
Extending clip to image, text and audio,” in
ICASSP 2022-2022 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2022, pp. 976–980.
[31] OpenAI, “Chatgpt-4o (gpt-4 omni),” https://openai.com/index/gpt-4o,
2024, accessed: 2025-03-26.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
