
PianoBind: A Multi-Modal Joint Embedding Model for Pop-Piano Music

Author: Hayeon Bang; Eunjin Choi; Seungheon Doh; Juhan Nam
Publisher: Zenodo
DOI: 10.5281/zenodo.17706422
Source: https://zenodo.org/records/17706422/files/000045.pdf
PIANOBIND: A MULTIMODAL JOINT EMBEDDING MODEL FOR
POP-PIANO MUSIC
Hayeon Bang Eunjin Choi Seungheon Doh Juhan Nam
Graduate School of Culture Technology, KAIST, South Korea
{hayeonbang,jech,seungheondoh,juhan.nam}@kaist.ac.kr
ABSTRACT
Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.
1. INTRODUCTION
The piano stands as a uniquely versatile solo instrument capable of conveying complex polyphonic musical expression through a single instrument. With its expansive tonal range, harmonic possibilities, and expressive capabilities (even allowing orchestral works to be effectively performed on a single keyboard), piano music encompasses diverse genres, styles, and expressive content. Numerous studies in the Music Information Retrieval (MIR) field have targeted tasks focused on piano music, such as piano music generation [1–4] and automatic music transcription [5–8]. However, research on piano-specific representation models remains limited. Existing approaches are typically constrained to a single modality [9, 10], failing to reflect the inherently multimodal nature of piano music that encompasses audio recordings, symbolic MIDI, and semantic descriptions.
Figure 1. Illustration of PianoBind: A multimodal piano music representation model integrating audio, MIDI, and text.
© Hayeon Bang, Eunjin Choi, Seungheon Doh, and Juhan Nam. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Hayeon Bang, Eunjin Choi, Seungheon Doh, and Juhan Nam, “PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Recent advances in multimodal joint embedding models have shown promise in bridging the gap between audio and text domains [11–15] and symbolic domains [16–18], yet they often fall short in specialized domains, particularly in solo piano music. Despite their versatility across diverse music categories, general-purpose models typically lack the sensitivity needed to capture subtle semantic nuances within homogeneous solo piano music. This limitation arises primarily from the scarcity of high-quality piano-specific data in general-purpose music-text datasets, which are typically dominated by multi-instrumental or vocal-centric music.
In this work, we present PianoBind, a multimodal joint embedding model that integrates multiple modalities of solo piano music (audio, symbolic (MIDI), and textual descriptions) within a unified embedding space (illustrated in Figure 1), enabling a more comprehensive representation. Our goal is to capture the fine-grained semantic characteristics of piano music, spanning genre, mood, and style, that are often overlooked by large-scale, general-purpose models. To achieve this, we use the PIAST dataset [19] for training, a pop-piano music (e.g. new-age, piano cover, jazz and its sub-genres) dataset that includes audio, MIDI, and textual descriptions. We systematically explore multi-source training strategies tailored to the characteristics of domain-specific datasets. We utilize a comparatively large amount of automatically collected textual data with weaker temporal alignment to audio. This approach compensates for the limited amount of human-annotated data. Furthermore, we propose effective methods for combining multimodal information both during training and at retrieval time, where joint audio-symbolic embeddings can significantly enhance text-to-music retrieval, particularly in distinguishing highly similar piano pieces.
Our experiments on both in-domain and out-of-domain piano datasets show that PianoBind outperforms general-purpose music joint embedding models in capturing the nuances of piano solo music. By focusing on small-scale, homogeneous datasets, our findings also offer valuable guidelines for developing specialized multimodal representation learning approaches in other domains with limited data. As such, this study contributes to the growing body of piano-centric MIR research. It also contributes to broader discussions on efficient and fine-grained multimodal modeling. These contributions are especially relevant when data availability is inherently constrained. We have publicly released code and pretrained weights for PianoBind, with the demo online (see footnote 1).
2. RELATED WORKS
2.1 Piano Music Representation Learning
Piano music has long served as a central subject in MIR, owing to its structural richness and expressive depth. However, despite this sustained attention, existing representation learning approaches for piano music are predominantly unimodal, relying solely on symbolic or audio data, thus failing to reflect the inherently multimodal nature of piano music.
Early piano music representation in the symbolic music domain primarily focused on disentangling low-level symbolic features, mostly with the goal of enhancing controllability in music generation tasks. Models such as PianoTree VAE [20], Wang et al. [21], and CollageNet [22] exemplify this approach, decomposing music into attributes such as rhythm, harmony, texture, and structure to facilitate user-controlled music generation.
More recently, transformer-based models like MidiBERT-Piano [9] and PianoBART [10] have expanded piano music understanding through large-scale pretraining, capturing both low-level features and higher-level musical attributes. MidiBERT-Piano introduced masked modeling objectives for solo piano MIDI data, demonstrating strong transferability to downstream tasks such as composer classification and expressive attribute prediction. Building upon this foundation, PianoBART extended these capabilities from understanding toward generation tasks, facilitating more sophisticated music creation and symbolic inpainting tasks. Nonetheless, these models remain purely symbolic, lacking integration with acoustic information or natural language semantics.
Footnote 1: https://hayeonbang.github.io/PianoBind/
Despite the progress in piano music representation learning, multimodal understanding of piano music has been limited, primarily due to the scarcity of datasets that support such approaches. Until recently, few resources existed that combined multiple modalities of piano performance. EMOPIA [23] made an initial step by providing paired audio and MIDI recordings with emotion labels, while PIAST [19] has recently extended this further by adding comprehensive textual annotations describing genre and mood. However, even with these multimodal datasets becoming available, no prior research has proposed an integrated approach that jointly leverages audio, symbolic, and textual modalities for piano music understanding.
2.2 Multimodal Joint Embedding Models
Multimodal joint embedding models aim to align data from different modalities in a shared embedding space. This alignment creates semantically meaningful representations that capture relationships between modalities, enabling cross-modal retrieval, understanding, and conditioned generation. MuLan [11] pioneered large-scale audio–text training with over 44 million pairs, showing strong retrieval capabilities. MusCALL [12] proposed an audio-text dual encoder architecture, demonstrating improved retrieval performance and downstream performance. CLAP [13] expanded on this by leveraging audio captioning corpora and keyword-to-caption augmentation strategies, and introduced a joint audio-text embedding model. Further refinements include models like TTMR [14] and TTMR++ [15], which address varying query granularities (ranging from single tags to full sentences) and incorporate rich metadata to produce more descriptive and contextual text embeddings. These methods improved retrieval accuracy by modeling both linguistic and musical subtleties.
In parallel, symbolic-text joint embedding models have been explored through the CLaMP series [16–18], which introduced contrastive learning between symbolic music (e.g., ABC or MIDI) and textual descriptions. While CLaMP pioneered symbolic music retrieval through text-ABC joint training, CLaMP2 expanded this to multilingual text and MIDI data, and CLaMP3 further incorporated audio and performance signals, marking the first trimodal framework in music representation learning to jointly align audio, symbolic, and textual modalities.
Despite these advances, most current models are trained on general-purpose, large-scale datasets covering a broad range of musical styles and instrumentation. As a result, they often fall short of capturing the fine-grained semantic differences required for more homogeneous domains like solo piano music. These models tend to underperform in settings where subtle variations in genre, mood, and style must be accurately distinguished.
To address this gap, our proposed model PianoBind integrates piano-specific audio, symbolic, and textual modalities in a unified embedding space, enabling more precise retrieval within this specialized musical domain.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 2. Training strategies of PianoBind: (1) Multi-source training combines small strongly aligned (human-annotated) and large weakly aligned (automatically collected) datasets; (2) Trimodal learning simultaneously aligns audio, symbolic, and textual embeddings; (3) Multimodal item embedding merges multiple modalities into a unified representation.
3. PIANOBIND
Considering the specific characteristics of piano solo datasets, such as homogeneous data distribution, limited dataset size, and multimodality, we propose a multimodal joint embedding model specialized for solo piano music. In this section, we describe the overall architecture of PianoBind (Section 3.1), and the training strategies it explores (Section 3.2), comprising: (1) multi-source learning with strongly and weakly aligned pairs, and (2) modality integration across audio, symbolic, and textual features.
3.1 Architecture Overview
3.1.1 Audio Encoder
Following previous work [12], we adopt a modified ResNet-50 [24] architecture to process mel-spectrogram representations of piano recordings. We extract 128-band mel-spectrograms with a 1024-point FFT, 512-point hop length, and apply log-scaling. As in MusCALL, we apply three stem convolutional layers followed by average pooling, and implement anti-aliased blur pooling. We also follow the downsizing strategy employed in the previous work.
3.1.2 Symbolic Encoder
We utilize MidiBERT [9] as a symbolic encoder. The model handles bar, position, pitch, and duration information of MIDI by adapting Compound Word (CP) representation [2] into its model. Since MidiBERT does not contain the [CLS] token, we obtained the final MIDI embedding by mean-pooling over the sequence of hidden states from the last Transformer layer, resulting in a dense representation that is then projected to the shared embedding space.
3.1.3 Text Encoder
Our text processing pipeline employs RoBERTa [25], using its byte-pair encoding (BPE) tokenizer. The tokenized input passes through 12 Transformer layers with 768-dimensional hidden states. Since RoBERTa lacks the standard pooler output found in BERT, we create sentence-level representations by mean-pooling across the final layer’s hidden states. These textual embeddings are also mapped to our shared space through a linear projection layer.
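The mean-pooling and projection steps described for the symbolic and text encoders can be sketched in a few lines. This is a minimal illustration in plain Python, not the released implementation: the function names `mean_pool` and `project`, the toy dimensions, and the use of an attention mask to skip padded positions are all assumptions, since the paper does not state how padding is handled during pooling.

```python
from typing import List, Sequence


def mean_pool(hidden_states: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the hidden vectors at positions where mask == 1.

    Skipping padded positions is an assumption made for this sketch.
    """
    kept = [h for h, m in zip(hidden_states, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]


def project(vec: Sequence[float], weight: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """Linear projection; `weight` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


# Toy example: three token states of size 2, the last one is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(hidden, mask)

# Hypothetical 2 -> 3 projection standing in for the 768 -> 512 layer.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
shared = project(pooled, weight, bias)
```

In the model itself the same pattern would operate on 768-dimensional hidden states and a learned 512-dimensional projection; only the shapes differ.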
3.1.4 Joint Embedding with Contrastive Loss
The representations from our three encoders (audio, MIDI, and text) are aligned through modality-specific linear projections into a 512-dimensional shared embedding space. All embeddings undergo ℓ2-normalization to ensure consistent scaling across modalities. Cross-modal similarities are computed via dot products between these normalized embeddings, providing the foundation for our various training objectives. We use the N-pair contrastive loss, known as the InfoNCE [26] loss, which maximizes the cosine similarity between positive music-text embedding pairs while minimizing the similarity of negative pairs. For audio-text alignment, the InfoNCE loss is defined as follows:
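$$\mathcal{L}_{a \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(z_{a,i} \cdot z^{+}_{t,i} / \tau)}{\sum_{z \in \{z^{+}_{t,i},\, z^{-}_{t,i}\}} \exp(z_{a,i} \cdot z / \tau)} \tag{1}$$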
where $z_{a,i}$ and $z_{t,i}$ refer to audio and text embeddings respectively, in the audio-text training. An analogous loss is computed for MIDI–text pairs by substituting audio embeddings with symbolic ones. $\tau$ represents the temperature parameter, and $z^{-}_{t,i}$ denotes a set of negative text embeddings. This aims to align embeddings between relevant music-text pairs while separating irrelevant pairs in the embedding space. The total loss is computed by symmetrically combining losses in both directions (music-to-text and text-to-music). The final symmetric loss is defined as:
$$\mathcal{L}_{a \leftrightarrow t} = \frac{\mathcal{L}_{a \to t} + \mathcal{L}_{t \to a}}{2} \tag{2}$$
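To make the two equations above concrete, the following is a small sketch of the symmetric InfoNCE loss in plain Python. It assumes that the negatives for each item are the other items in the same batch, which is a common choice but is not stated explicitly in the text. The function names and the fixed temperature of 0.07 are illustrative only; in the paper the temperature is learned.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def info_nce(queries, keys, tau: float) -> float:
    """One direction of Eq. (1): keys[i] is the positive for queries[i];
    every other key in the batch is treated as a negative (assumption)."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / tau for k in keys]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += logits[i] - log_denominator
    return -total / len(queries)


def symmetric_info_nce(audio, text, tau: float = 0.07) -> float:
    """Eq. (2): average of the audio-to-text and text-to-audio losses."""
    a = [l2_normalize(v) for v in audio]
    t = [l2_normalize(v) for v in text]
    return 0.5 * (info_nce(a, t, tau) + info_nce(t, a, tau))


audio_batch = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[2.0, 0.0], [0.0, 3.0]]   # same directions as the audio items
swapped_text = [[0.0, 3.0], [2.0, 0.0]]   # positives point the wrong way
loss_matched = symmetric_info_nce(audio_batch, matched_text)
loss_swapped = symmetric_info_nce(audio_batch, swapped_text)
```

Correctly paired embeddings give a loss near zero, while swapping the pairs gives a large loss. The MIDI–text loss follows by passing symbolic embeddings in place of `audio`.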
3.2 Training Framework
3.2.1 Multi-source Training
Considering the small size and specialized annotation required for piano datasets, we explore two strategies to effectively leverage both large-scale, weakly aligned audio–text data and small-scale, expert-annotated data.
Combined Training. We jointly train the model on both sources by mixing them within training batches following prior work [13, 15, 27, 28]. To mitigate data imbalance and noise from weak supervision, we carefully control the sampling ratio between the two sources. This enables the model to learn generalizable language–music alignments while gradually incorporating domain-specific subtleties of solo piano music.
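As an illustration of controlling the sampling ratio between the two sources, the sketch below builds one mixed batch from a weakly aligned pool and a strongly aligned pool. The helper name `sample_mixed_batch` is hypothetical, and fixing the ratio per batch with sampling with replacement is an assumption; the text only states that the ratio is controlled. The values 64 and 0.7 mirror the batch size and the 7:3 ratio reported in Section 4.3.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, int]  # (source name, item index) used as a stand-in for real data


def sample_mixed_batch(
    weak_pool: Sequence[Example],
    strong_pool: Sequence[Example],
    batch_size: int,
    weak_ratio: float,
    rng: random.Random,
) -> List[Example]:
    """Draw a batch whose share of weakly aligned items is `weak_ratio`."""
    n_weak = round(batch_size * weak_ratio)
    n_strong = batch_size - n_weak
    # Sampling with replacement keeps the ratio fixed even when the
    # strongly aligned pool is much smaller than the weak one.
    batch = rng.choices(weak_pool, k=n_weak) + rng.choices(strong_pool, k=n_strong)
    rng.shuffle(batch)
    return batch


weak = [("PIAST-YT", i) for i in range(100)]
strong = [("PIAST-AT", i) for i in range(100)]
batch = sample_mixed_batch(weak, strong, batch_size=64, weak_ratio=0.7, rng=random.Random(0))
n_from_weak = sum(1 for source, _ in batch if source == "PIAST-YT")
```

With a batch of 64 this yields 45 weakly aligned and 19 expert-annotated items per step.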
Pre-training and Fine-tuning. Alternatively, we also adopt the two-stage training strategy, pre-training and fine-tuning, following [29]. The model is first pre-trained on the large-scale, weakly labeled dataset to acquire generalizable music representations. It is then fine-tuned on the smaller, expert-annotated data, with encoder parameters updated. This sequential strategy enables the model to benefit from broad semantic coverage during pre-training, while later adapting to the expressive and stylistic nuances of solo piano music through fine-tuning.
3.2.2 Trimodal Representation Learning
To fully exploit the multimodal characteristics of piano music, we extend beyond traditional bimodal setups by integrating audio, symbolic (MIDI), and text modalities into a unified retrieval framework. Inspired by the training strategy from AudioCLIP [30], we compute the contrastive loss across modality pairs (audio-text, MIDI-text) and average them to form the final objective. However, considering the nature of our MIDI data as transcribed representations derived directly from corresponding audio recordings, including an audio-MIDI loss would not significantly contribute additional semantic distinction. Consequently, we utilize only the audio-text and MIDI-text contrastive losses, calculating their average as our final training objective:
$$\mathcal{L}_{total} = \frac{\mathcal{L}_{a \leftrightarrow t} + \mathcal{L}_{m \leftrightarrow t}}{2} \tag{3}$$
Furthermore, we leverage multimodal information not only during the training phase but also at evaluation time. To leverage the complementary strengths of different modalities, we propose multimodal item embeddings that use audio and MIDI information. Specifically, audio and MIDI embeddings are integrated through average fusion during the evaluation process.
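The average fusion of audio and MIDI embeddings at evaluation time might look like the sketch below. The names `fuse_item` and `rank_items` are hypothetical, and renormalizing the fused vector before computing similarities is an assumption; the text only says that the two embeddings are averaged.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def fuse_item(audio_emb: Sequence[float], midi_emb: Sequence[float]) -> List[float]:
    """Average the normalized audio and MIDI embeddings of one track,
    then renormalize (the renormalization is an assumption)."""
    a = l2_normalize(audio_emb)
    m = l2_normalize(midi_emb)
    return l2_normalize([(x + y) / 2 for x, y in zip(a, m)])


def rank_items(text_emb: Sequence[float], items: Sequence[Sequence[float]]) -> List[int]:
    """Return item indices sorted from most to least similar to the query."""
    q = l2_normalize(text_emb)
    scores = [dot(q, item) for item in items]
    return sorted(range(len(items)), key=lambda i: scores[i], reverse=True)


# Two toy tracks, each with an audio and a MIDI embedding.
audio_embs = [[1.0, 0.0], [0.0, 1.0]]
midi_embs = [[1.0, 1.0], [-1.0, 1.0]]
fused = [fuse_item(a, m) for a, m in zip(audio_embs, midi_embs)]
ranking = rank_items([1.0, 0.2], fused)
```

Because both modalities live in the same shared space after joint training, a single text query can be compared directly against the fused item vectors.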
4. EXPERIMENT
4.1 Dataset
This study utilizes the PIAST dataset [19], the first music-text dataset explicitly designed for pop-piano music. The dataset consists of audio, MIDI, and textual descriptions, based on a comprehensive piano-specific taxonomy of 31 semantic tags across genre, emotion/mood, and style. The dataset comprises two subsets: PIAST-YT, a large-scale collection of approximately 7,367 tracks (about 900 hours) automatically collected from YouTube, with accompanying textual metadata (titles, descriptions, and tags) refined using a large language model; and PIAST-AT, a smaller, expert-annotated set of 1,986 tracks (about 17 hours in total, 30 seconds per track). For both subsets, MIDI data is generated via automatic piano transcription. The transcribed MIDI files were synchronized to downbeat estimates, and melody and chord information was extracted.
Since the textual data of PIAST-YT is automatically collected, it exhibits weak alignment with the audio content, introducing considerable noise in the text-audio relationships. To address this challenge, we apply the two multi-source learning strategies described in Section 3, leveraging both the large-scale but weakly aligned PIAST-YT data and the smaller, high-quality PIAST-AT annotations. This combined approach enables more robust representation learning despite the inherent data limitations. For experiments, we use a 9:1 train–validation split of PIAST-YT, and an 8:1:1 train–validation–test split of PIAST-AT.
4.2 Evaluation
We evaluate our model using both in-domain and out-of-domain text-to-music retrieval tasks. For in-domain evaluation, we use the 10% held-out test split from PIAST-AT, comprising 199 tracks. For out-of-domain evaluation, we introduce EMOPIA-Caps by manually annotating descriptive tag labels for the EMOPIA test split [23], and transforming them into natural language captions using a large language model. The initial tags were naturally overlapping with the vocabulary used in the PIAST-AT piano-music taxonomy. To address this and better approximate real-world natural user queries, we paraphrased the tags into free-form natural language captions using GPT-4o [31]. The generated captions were then reviewed and refined by a human music expert with a major in composition, to ensure their semantic accuracy and musical relevance. This transformation not only better approximates user-style queries, but also enables evaluation of the model’s ability to generalize to diverse textual expressions. Both evaluation datasets use sentence-form textual inputs; however, the in-domain set consists of concatenated tags, whereas the out-of-domain set contains free-form, natural language captions.
For both evaluation settings, we measure text-to-music retrieval performance with Recall@K (R@1, R@5, R@10) and Median Rank (MedR), as these metrics reflect the model’s ability to generalize to diverse and unconstrained textual queries.
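For reference, Recall@K and Median Rank can be computed from a query-by-item similarity matrix as sketched below. The sketch assumes exactly one ground-truth track per text query and breaks ties in favor of the target; neither detail is specified in the text, and the function names are illustrative.

```python
from statistics import median
from typing import List, Sequence


def ranks_of_targets(similarity: Sequence[Sequence[float]], targets: Sequence[int]) -> List[int]:
    """similarity[q][j] is the score between query q and item j.
    Returns the 1-based rank of each query's ground-truth item."""
    ranks = []
    for scores, target in zip(similarity, targets):
        better = sum(1 for s in scores if s > scores[target])
        ranks.append(better + 1)
    return ranks


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose target appears in the top k results."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks: Sequence[int]) -> float:
    """Median position of the target across queries (lower is better)."""
    return median(ranks)


# Three queries against three items; query q should retrieve item q.
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
ranks = ranks_of_targets(similarity, targets=[0, 1, 2])
r_at_1 = recall_at_k(ranks, 1)
med_r = median_rank(ranks)
```

Here the targets land at ranks 1, 2, and 3, so one of three queries is a hit at K = 1 and the median rank is 2.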
Training Strategy             In-domain (199 tracks)          Out-of-domain (88 tracks)
                              R@1    R@5    R@10   MedR↓      R@1    R@5    R@10   MedR↓
Combined Training
  Audio                       8.04   25.62  37.18  17         2.56   10.26  35.90  17
  Symbolic                    4.02   17.09  27.13  27         2.56   28.21  43.59  13
  Trimodal                    6.53   24.12  37.68  17         5.13   25.64  46.15  14
Pre-training & Fine-tuning
  Audio                       6.53   28.14  42.71  15         7.69   20.51  46.15  12
  Symbolic                    8.04   26.63  45.23  12         5.13   20.51  35.90  12
  Trimodal                    10.55  35.67  52.76  10         15.38  41.03  51.28  10
Table 1. Performance comparison on text-based music retrieval tasks, on the in-domain (PIAST-AT) and out-of-domain (EMOPIA-Caps) datasets.
4.3 Implementation Details
For audio processing, we use 20-second signals with a sampling rate of 16 kHz, consistent with methods established by previous work [12]. To ensure temporal alignment between audio and MIDI, we match each audio segment’s start time (in seconds) with the nearest MIDI bar onset, extracting MIDI tokens from bars that correspond to each audio segment. These sequences are subsequently standardized to exactly 512 tokens through padding or truncation. For text inputs, we employ the RoBERTa tokenizer [25] with a 77-token length limit. To combat overfitting and enhance the diversity of our textual data, we implement a dynamic text dropout strategy building on approaches from several recent works [14–16]. This technique randomly selects and combines available textual elements (such as tags and captions) in varying orders for each training instance. For combined training, we use a 7:3 sampling ratio between PIAST-YT and PIAST-AT.
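Two of the preprocessing steps above, fixing MIDI sequences to 512 tokens and dynamic text dropout, can be sketched as follows. Both helpers are hypothetical illustrations: the padding id of 0, the comma separator, and the rule of keeping at least one text element are assumptions rather than details taken from the paper.

```python
import random
from typing import List, Sequence


def pad_or_truncate(tokens: Sequence[int], length: int = 512, pad_id: int = 0) -> List[int]:
    """Cut the sequence to `length` tokens, or right-pad it with `pad_id`."""
    clipped = list(tokens[:length])
    return clipped + [pad_id] * (length - len(clipped))


def dynamic_text_dropout(elements: Sequence[str], rng: random.Random, min_keep: int = 1) -> str:
    """Keep a random subset of the text elements, in random order,
    and join them into a single training string."""
    k = rng.randint(min_keep, len(elements))
    chosen = rng.sample(list(elements), k)  # sample() also randomizes the order
    return ", ".join(chosen)


short_seq = pad_or_truncate([5, 6, 7])
long_seq = pad_or_truncate(list(range(600)))

text_elements = ["jazz", "calm", "a slow piano cover of a pop song"]
training_text = dynamic_text_dropout(text_elements, random.Random(0))
```

Calling `dynamic_text_dropout` once per training instance yields a different combination of tags and captions each time the same track is seen.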
All models are trained using the AdamW optimizer with a 5e-5 initial learning rate, 0.2 weight decay, and a consistent batch size of 64 across all experiments. For the contrastive loss, we jointly optimize the temperature parameter τ alongside encoder and projection parameters, following successful approaches demonstrated in recent multimodal works [12, 14]. We select optimal model checkpoints based on median rank performance on our validation dataset. Our implementation is based on PyTorch, using automatic mixed precision and trained on an NVIDIA A6000 GPU.
5. RESULTS
5.1 Comparison of Training Strategies
5.1.1 Multi-source training
Table 1 shows the performance of different training strategies across both in-domain (PIAST-AT) and out-of-domain (EMOPIA-Caps) test sets. We compare two multi-source learning approaches: combined training and pre-training followed by fine-tuning, each evaluated using two bimodal configurations (audio–text and symbolic–text) and one trimodal (audio–symbolic–text) configuration.
The results demonstrate that the pre-training and fine-tuning approach generally outperforms the combined training across modality configurations and metrics. In the in-domain setting, the pre-training and fine-tuning approach with trimodal integration achieves the best performance, reaching a Median Rank of 10 compared with 17 for trimodal combined training. Similarly, in the out-of-domain context, it yields superior results with the same Median Rank of 10, compared with 14 for combined training. While a few isolated metrics, such as R@1 for audio in-domain and R@5 for symbolic out-of-domain, are higher in the combined training setup, these exceptions do not contradict the overall trend. These findings underscore the challenges of small-scale annotated datasets. When dealing with limited high-quality annotations, the combined training approach exposes the model to data imbalance issues, where the larger but noisier dataset can potentially overpower the signal from the smaller expert-annotated data. In contrast, the sequential knowledge transfer approach, which first learns generalizable representations from broader data before adapting to specialized piano-specific contexts, enables the model to better leverage both data sources.
5.1.2 Trimodal Integration
Our results demonstrate the substantial performance gains achieved through trimodal integration compared to bimodal approaches. Across both multi-source training strategies, the trimodal model achieves the highest R@10 in both evaluation settings, and under pre-training and fine-tuning it outperforms both the audio-text and symbolic-text configurations on every metric. The improvements are particularly pronounced in the pre-training and fine-tuning approach, where trimodal integration achieves an in-domain Median Rank of 10 compared to 12 for symbolic-only and 15 for audio-only. This performance advantage extends to R@10, with the trimodal approach achieving 52.76%, compared to 45.23% for symbolic and 42.71% for audio.
These findings strongly support our hypothesis that effective representation of piano music requires integrating multiple modalities. While both audio and symbolic representations capture valuable information, with symbolic representations slightly outperforming audio on in-domain retrieval, their combination in a trimodal framework yields representations that more comprehensively capture the semantic nuances of piano music. This suggests that audio and symbolic modalities provide complementary perspectives.
Model              Piano     Item Modality      In-domain (199 tracks)          Out-of-domain (88 tracks)
                   Specific                     R@1    R@5    R@10   MedR↓      R@1    R@5    R@10   MedR↓
CLAP-Music         ✗         Audio              0.00   7.38   9.85   54         5.13   20.51  41.03  15
TTMR++             ✗         Audio              1.47   6.40   12.31  45         5.13   15.38  28.21  16
CLaMP3_saas        ✗         Audio              1.50   7.53   13.56  49         2.56   20.51  41.03  12
CLaMP2             ✗         Symbolic           3.02   8.54   14.57  43         5.13   30.77  43.59  14
CLaMP3^c2_sa       ✗         Symbolic           4.02   12.06  22.11  39         12.82  30.77  46.15  12
CLaMP3_saas        ✗         Audio + Symbolic   1.50   7.53   12.06  68         2.56   33.33  46.15  13
CLaMP3^c2_sa       ✗         Audio + Symbolic   2.51   10.55  17.58  47         7.69   28.20  43.58  13
PianoBind (Ours)   ✓         Audio + Symbolic   10.55  35.67  52.76  10         15.38  41.03  51.28  10
Table 2. Performance comparisons between PianoBind and previous text-to-music retrieval models, conducted on the in-domain (PIAST-AT) and out-of-domain (EMOPIA-Caps) datasets.
5.2 Comparison with Existing Models
Table 2 presents a comparative analysis between PianoBind and existing text-to-music retrieval models. We compare against leading audio-based models, namely CLAP-Music, TTMR++, and CLaMP3_saas (optimized for audio), as well as symbolic-based models, including CLaMP2 and CLaMP3^c2_sa (optimized for symbolic). All baseline models were trained on large-scale, general-purpose datasets. The results demonstrate PianoBind’s substantial performance advantage over general-purpose models. For in-domain retrieval, PianoBind achieves the lowest Median Rank of 10, outperforming the best-performing general-purpose model, CLaMP3^c2_sa, which achieved a Median Rank of 39. This performance advantage extends to out-of-domain retrieval as well, where PianoBind maintains its lead with an R@10 of 51.28% and Median Rank of 10, compared to the next best model, CLaMP3^c2_sa, with 46.15% and 12, respectively.
Additionally, we extended the CLaMP3 models by implementing multimodal item embeddings through feature fusion, which is not present in the original work. However, even with this approach, CLaMP3 models with feature fusion largely underperform their single-item-modality results (for example, the in-domain Median Rank rises from 49 to 68 for CLaMP3_saas and from 39 to 47 for CLaMP3^c2_sa). This underperformance likely stems from the specialized nature of CLaMP3 variants, where each model was optimized for a specific modality.
5.3 Comparative Analysis of Model Design Choices
We conducted additional experiments to validate key model design choices. Specifically, we compared our averaged-loss training strategy in trimodal learning with the saas (symbolic → audio → audio → symbolic) alignment strategy adopted in CLaMP3. As shown in Table 3, our averaged-loss training clearly outperforms both the original CLaMP3_saas model and our own reimplementation (Ours_saas) in both in-domain and out-of-domain retrieval. These results demonstrate the advantage of jointly learning audio–text and MIDI–text embeddings, rather than aligning independently trained modalities through a staged alignment process. The consistent performance gains in Median Rank and R@10 further highlight the effectiveness of our unified training objective in capturing fine-grained semantic relationships in piano music.
Model           ID                OOD
                R@10   MedR↓      R@10   MedR↓
CLaMP3_saas     13.56  49         41.03  12
Ours_saas       33.66  18         30.77  19
Ours_LossAvg    52.76  10         52.76  10
Table 3. Comparison between Saas and Averaged Loss in trimodal learning in both in-domain (ID) and out-of-domain (OOD) evaluations.
6. CONCLUSION
In this paper, we introduced PianoBind, a multimodal joint embedding model designed for pop-piano music, integrating audio, symbolic, and textual modalities. Despite using substantially less training data than general-purpose models, PianoBind achieved strong retrieval performance. Our findings suggest that a sequential multi-source training strategy (pre-training on large-scale noisy data followed by fine-tuning on human-annotated examples) is more effective than training on the two sources simultaneously, particularly in low-resource settings. We also observed that integrating audio and symbolic modalities captures complementary semantic cues, and their joint use leads to more robust embeddings. Moreover, combining audio and symbolic embeddings at inference time improves retrieval performance, provided that the modalities are well-aligned through joint training.
A limitation of our study is that our evaluation datasets, both in-domain and out-of-domain, are relatively small-scale, potentially restricting the generalizability of our findings. Moreover, our datasets primarily focused on pop-piano genres, lacking sufficient representation of classical and other diverse piano genres. Addressing these limitations by constructing larger-scale and more genre-diverse piano-text datasets remains an important direction for future work. This further highlights the need for comprehensive benchmarks to rigorously evaluate multimodal embedding models across varied piano music contexts.
7. ACKNOWLEDGMENTS
This work has been supported by the collaboration with NCSOFT, Korea.
8. REFERENCES
[1] Y.-S. Huang and Y.-H. Yang, “Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 1180–1188.
[2] W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang, “Compound word transformer: Learning to compose full-song music over dynamic directed hypergraphs,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 1, 2021, pp. 178–186.
[3] S.-L. Wu and Y.-H. Yang, “Compose & embellish: Well-structured piano performance generation via a two-stage approach,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[4] C.-P. Tan, H. Ai, Y.-H. Chang, S.-H. Guan, and Y.-H. Yang, “Picogen2: Piano cover generation with transfer learning approach and weakly aligned data,” in Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), San Francisco, CA, United States, Nov. 2024.
[5] S. Sigtia, E. Benetos, and S. Dixon, “An end-to-end neural network for polyphonic piano music transcription,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 927–939, 2016.
[6] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck, “Onsets and frames: Dual-objective piano transcription,” arXiv preprint arXiv:1710.11153, 2017.
[7] Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-resolution piano transcription with pedals by regressing onset and offset times,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3707–3717, 2021.
[8] T. Kwon, D. Jeong, and J. Nam, “Polyphonic piano transcription using autoregressive multi-state note model,” in International Society for Music Information Retrieval Conference, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:222125050
[9] Y.-H. Chou, I. Chen, C.-J. Chang, J. Ching, Y.-H. Yang et al., “MidiBERT-Piano: Large-scale pre-training for symbolic music understanding,” arXiv preprint arXiv:2107.05223, 2021.
[10] X. Liang, Z. Zhao, W. Zeng, Y. He, F. He, Y. Wang, and C. Gao, “Pianobart: Symbolic piano music generation and understanding with large-scale pre-training,” in 2024 IEEE International Conference on Multimedia and Expo (ICME), 2024, pp. 1–6.
[11] Q. Huang, A. Jansen, J. Lee, R. Ganti, J. Y. Li, and D. P. Ellis, “Mulan: A joint embedding of music audio and natural language,” arXiv preprint arXiv:2208.12415, 2022.
[12] I. Manco, E. Benetos, E. Quinton, and G. Fazekas, “Contrastive audio-language learning for music,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022.
[13] Y. Wu*, K. Chen*, T. Zhang*, Y. Hui*, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023.
[14] S. Doh, M. Won, K. Choi, and J. Nam, “Toward universal text-to-music retrieval,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[15] S. Doh, M. Lee, D. Jeong, and J. Nam, “Enriching music descriptions with a finetuned-llm and metadata for text-to-music retrieval,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 826–830.
[16] S. Wu, D. Yu, X. Tan, and M. Sun, “Clamp: Contrastive language-music pre-training for cross-modal symbolic music information retrieval,” arXiv preprint arXiv:2304.11029, 2023.
[17] S. Wu, Y. Wang, R. Yuan, Z. Guo, X. Tan, G. Zhang, M. Zhou, J. Chen, X. Mu, Y. Gao et al., “Clamp 2: Multimodal music information retrieval across 101 languages using large language models,” arXiv preprint arXiv:2410.13267, 2024.
[18] S. Wu, Z. Guo, R. Yuan, J. Jiang, S. Doh, G. Xia, J. Nam, X. Li, F. Yu, and M. Sun, “Clamp 3: Universal music information retrieval across unaligned modalities and unseen languages,” 2025. [Online]. Available: https://arxiv.org/abs/2502.10362
[19] H. Bang, E. Choi, M. Finch, S. Doh, S. Lee, G.-H. Lee, and J. Nam, “PIAST: A multimodal piano dataset with audio, symbolic and text,” in Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA), Nov. 2024, pp. 5–10.
[20] Z. Wang, Y. Zhang, Y. Zhang, J. Jiang, R. Yang, J. Zhao, and G. Xia, “Pianotree vae: Structured representation learning for polyphonic music,” arXiv preprint arXiv:2008.07118, 2020.
[21] Z. Wang, D. Wang, Y. Zhang, and G. Xia, “Learning interpretable representation for controllable polyphonic music generation,” Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2020.
[22] A. Wuerkaixi, C. Benetatos, Z. Duan, and C. Zhang, “Collagenet: Fusing arbitrary melody and accompaniment into a coherent song,” International Society for Music Information Retrieval, 2022.
[23] H.-T. Hung, J. Ching, S. Doh, N. Kim, J. Nam, and Y.-H. Yang, “EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation,” in Proceedings of the 22nd International Conference on Music Information Retrieval (ISMIR), 2021.
[24] K. He, Y. Wang, and J. Hopcroft, “A powerful generative model using random weights for the deep image representation,” Advances in Neural Information Processing Systems, vol. 29, 2016.
[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” CoRR, vol. abs/1907.11692, 2019. [Online]. Available: http://arxiv.org/abs/1907.11692
[26] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[27] J. Gardner, I. Simon, E. Manilow, C. Hawthorne, and J. Engel, “Mt3: Multi-task multitrack music transcription,” arXiv preprint arXiv:2111.03017, 2021.
[28] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[29] S. Doh, K. Choi, J. Lee, and J. Nam, “Lp-musiccaps: Llm-based pseudo music captioning,” in ISMIR, 2023.
[30] A. Guzhov, F. Raue, J. Hees, and A. Dengel, “Audioclip: Extending clip to image, text and audio,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 976–980.
[31] OpenAI, “Chatgpt-4o (gpt-4 omni),” https://openai.com/index/gpt-4o, 2024, accessed: 2025-03-26.