PianoVAM: A Multimodal Piano Performance Dataset

Author: Yonghyun Kim; Junhyung Park; Joonhyung Bae; Kirak Kim; Taegyun Kwon; Alexander Lerch; Juhan Nam
Publisher: Zenodo
DOI: 10.5281/zenodo.17706504
Source: https://zenodo.org/records/17706504/files/000061.pdf
Yonghyun Kim♭, Junhyung Park♮, Joonhyung Bae♯, Kirak Kim♯, Taegyun Kwon♯, Alexander Lerch♭, Juhan Nam♯
♭ Music Informatics Group, Georgia Institute of Technology, USA
♮ Department of Mathematical Sciences, KAIST, South Korea
♯ Graduate School of Culture Technology, KAIST, South Korea
{yonghyun.kim, alexander.lerch}@gatech.edu, {tonyishappy, jh.bae, kirak, ilcobo2, juhan.nam}@kaist.ac.kr
ABSTRACT
The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.
1. INTRODUCTION
Music performance is inherently multimodal, involving not only audio but also motion, posture, and other visual elements as part of the expressive sound creation process [1,2]. In the field of Music Information Retrieval (MIR), there has been growing interest in collecting multimodal performance data to enhance the extraction of musical information by leveraging multiple modalities [3,4]. Such multimodal data, particularly audio-visual data, have been utilized in various MIR tasks across diverse musical genres. Examples include automatic MIDI transcription of solo piano performances [5,6], vibrato analysis of polyphonic string music [7], singing voice separation [8], and melodic motif identification in Indian vocal performances [9]. In these tasks, visual information from performer videos has been shown to improve model robustness by providing additional musical cues. This paper focuses on the multimodal data collection of solo piano performances to improve transcription and explore other potential applications.

Figure 1. The PianoVAM Dataset: Synchronized video, audio, MIDI with fingering, hand landmarks, and metadata.

© Y. Kim, J. Park, J. Bae, T. Kwon, K. Kim, A. Lerch and J. Nam. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Y. Kim, J. Park, J. Bae, T. Kwon, K. Kim, A. Lerch and J. Nam, “PianoVAM: A Multimodal Piano Performance Dataset”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

Piano transcription, which converts audio recordings into symbolic representations like MIDI or sheet music, is a well-established MIR task that has made significant progress through large-scale, clean audio-MIDI datasets such as MAESTRO [14] and carefully designed deep learning models [15, 16]. As benchmarking performance on the MAESTRO dataset approaches its ceiling [17], new challenges have emerged in piano transcription. One major challenge is achieving acoustic robustness to ensure reliable transcription from real-world piano performance recordings, which often feature diverse piano timbres, reverberation, or other interfering noise sources. While data augmentation has been a common technique to address this issue [10,18], leveraging visual data from performance videos has recently emerged as an alternative research direction [6,19–21]. Other key challenges include capturing richer performance information beyond a single MIDI track, such as left-right hand separation, piano fingering, and other playing details [12, 13]. Addressing these challenges requires capturing visual-domain data, such as performer motion or keyboard-view videos, and synchronizing them with audio and MIDI data. However, collecting such multimodal data is costly, requiring a dedicated data acquisition system, and time-consuming, as it depends on the availability of prepared piano players.
This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. An overview of PianoVAM is presented in Figure 1. The dataset was collected from amateur pianists during their daily practice sessions on a Disklavier piano, with synchronized top-view videos captured in realistic and varied performance conditions. We extracted hand landmarks and generated fingering pseudo-labels using a pretrained hand pose estimation model combined with a semi-automated fingering detection algorithm. We describe the challenges faced during data collection and the alignment of multiple modalities. Furthermore, we detail our fingering annotation method, which utilizes hand landmarks extracted from the videos. Lastly, we present experimental results on both audio-only and audio-visual piano transcription using the PianoVAM dataset for benchmarking, along with a discussion of its potential applications. PianoVAM is available for download from the GitHub page¹ under the CC BY-NC 4.0 license.

| Dataset | Size (hrs) | Video | Audio | Audio Type | MIDI | Fingering |
|---|---|---|---|---|---|---|
| MAESTRO 3 [10] | 198.7 | ✗ | 44.1–48 kHz, Stereo | Real | ✓ | ✗ |
| MAPS (MUS subset) [11] | 18.6 | ✗ | 44.1 kHz, Stereo | Synth. & Real | ✓ | ✗ |
| OMAPS2 [6] | 6.7 | 1080p/30fps | 44.1 kHz, Mono | Real | △ (.TXT) | ✗ |
| PianoYT [5] | ∼20 | Varies | Varies (YouTube) | Varies (YouTube) | △ (Pseudo) | ✗ |
| PianoVAM (Ours) | 21.0 | 1080p/60fps | 44.1 kHz, Mono | Real | ✓ | △ (Pseudo) |

Table 1. Comparison of piano transcription datasets.

| Dataset | Total notes | # of pieces | Labeled ratio (%) | Data type | Reliability | Annotation |
|---|---|---|---|---|---|---|
| PIG [12] | 100,044 | 309 | 100 | MIDI & Score (.PDF) | By pianists | Manual |
| ThumbSet [13] | – | 2,523 | 52 | MusicXML | By miscellaneous | Manual |
| PianoVAM (Ours) | 1,050,966 | 106 | 100 | Multimodal | ∼0.95 | Semi-Auto |

Table 2. Comparison of piano fingering datasets.
2. RELATED WORK
2.1 Audio-Visual Datasets
The emergence of audio-visual datasets represents a promising frontier in MIR, unlocking new research possibilities by providing visual information that complements audio signals. The URMP dataset [3] offers synchronized audio, video, and MIDI recordings of multi-instrument classical performances, supporting multimodal analysis of ensemble music such as audio-visual source association via vibrato modeling [7]. The Acappella dataset [8] contains solo a cappella videos, enabling audio-visual singing voice separation with fine-grained control over visual and acoustic conditions. Nadkarni et al. present an audio-visual dataset of Hindustani vocal performances annotated with melodic motifs and stable notes, enabling gesture-based music analysis and raga classification through movement-melody correspondence [9].
2.2 Piano Transcription Datasets
Table 1 compares existing piano performance datasets with PianoVAM. MAESTRO [10] includes high-quality audio and MIDI data recorded from proficient pianists on Disklavier pianos but lacks top-view videos. While MAPS [11] provides audio and MIDI from actual performances (MUS subset), a significant portion (210 out of 270 recordings) is synthesized. OMAPS2 [6] and PianoYT [5] incorporate video data but offer only limited MIDI annotations: OMAPS2 provides MIDI-like labels, while PianoYT uses pseudo-MIDI annotations transcribed with the Onsets and Frames model [14]. In comparison, PianoVAM offers the most comprehensive multimodal dataset, including real performance audio, synchronized MIDI, top-view videos, and fingering pseudo-labels, although its total duration is limited compared to the larger datasets listed.

¹ https://yonghyunk1m.github.io/PianoVAM
2.3 Fingering Datasets
Table 2 compares existing piano fingering datasets with PianoVAM. PIG [12] provides fingering and MIDI information for sections of several pieces played by professional pianists, including different fingerings of the same piece by various pianists. Ramoneda et al. [13] attempted to annotate the fingering of complete pieces from partially annotated MusicXML files with the support of the ThumbSet dataset, which in turn crowd-sourced fingering information for numerous pieces from MuseScore², an online piano score website, but the source of the fingering annotation is not clear. The presented PianoVAM dataset, in contrast, utilizes a fingering detection algorithm applied to top-view video data synced with MIDI, and incomplete fingering labels are manually annotated to improve reliability.
3. DATASET ACQUISITION & PRE-PROCESSING
3.1 Acquisition
We developed a data acquisition system to streamline the unsupervised recording of video, audio, MIDI, and associated metadata, such as performer and piece details.
3.1.1 Acquisition Workflow
The acquisition workflow comprises the following steps. First, new users register by providing basic personal details and receive unique QR codes for identification and recording control. A single interface launch initializes the required software: OBS Studio for video and audio, and Logic Pro for audio and MIDI. Before recording, users input performance details and display their profile QR code to a top-view camera to begin capturing. When the performance concludes, presenting a stop QR code ends the recording.

² https://musescore.com (Last accessed: June 28, 2025)

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 2. Distributional comparison of pitch, velocity, and pedal usage between MAESTRO 3 and PianoVAM.
Pitch Distribution (MIDI Note): MAESTRO 3: Min=21, Median=64, Max=108; PianoVAM: Min=21, Median=65, Max=108
Velocity Distribution (MIDI Velocity): MAESTRO 3: Min=1, Median=66, Max=126; PianoVAM: Min=1, Median=65, Max=126
Sustain Pedal Usage Distribution (Sustain Pedal Usage Ratio): MAESTRO 3: Min=0.00, Median=0.66, Max=0.99; PianoVAM: Min=0.00, Median=0.91, Max=1.00

3.1.2 Hardware Setup
The data acquisition process was managed by a control script on a PC. An overhead-mounted webcam captured the video, while a dedicated microphone and a Disklavier piano provided the audio and MIDI signals, respectively. OBS Studio recorded the audio-video stream, and Logic Pro concurrently recorded the audio-MIDI stream. A common audio signal captured by both systems provided a reference for the global time alignment to correct for recording latency.
3.2 Pre-processing
3.2.1 Alignment
The time alignment of audio and MIDI data was further refined using the fine alignment technique used for the MAESTRO dataset [10]. Specifically, the recorded audio was down-mixed to mono and resampled to 22.05 kHz. The MIDI data was then synthesized into an audio signal at the same sampling rate using FluidSynth with a SoundFont sampled from Disklavier Pro recordings,³ since we could not access the original Disklavier 7 SoundFont. A Constant-Q Transform was applied to both audio signals using a hop length of 64 samples (∼3 ms resolution). Finally, we applied Dynamic Time Warping within a Sakoe-Chiba band of ±2.5 s to correct any remaining temporal discrepancies, such as small constant offsets or jitter.
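
The banded alignment step can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the inputs are toy 1-D sequences rather than Constant-Q frames, and the function name `banded_dtw` and the absolute-difference cost are assumptions made for illustration. With a hop of 64 samples at 22.05 kHz, one frame spans 64/22050 ≈ 2.9 ms, so a ±2.5 s Sakoe-Chiba band corresponds to roughly ±861 frames.

```python
import math

def banded_dtw(x, y, band):
    """DTW cost and path between 1-D sequences, restricted to |i - j| <= band."""
    n, m = len(x), len(y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the Sakoe-Chiba band are ever filled in.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path

# Band width in frames for the settings quoted in the text.
hop_seconds = 64 / 22050
band_frames = round(2.5 / hop_seconds)

# Toy example: y is x delayed by one frame.
x = [0, 0, 1, 2, 1, 0]
y = [0, 1, 2, 1, 0, 0]
total_cost, warp_path = banded_dtw(x, y, band=2)
```

With `band=0` the path is forced onto the diagonal and the one-frame offset can no longer be absorbed, which is why the band has to be at least as wide as the largest expected drift.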
3.2.2 Audio Loudness Normalization
Recordings were collected over a six-month period in a shared studio, with a notable gap between May and August. This interruption may have introduced inconsistencies in loudness due to variations in recording conditions, such as gain settings or microphone placement. To mitigate potential mismatches between loudness and MIDI velocity across the dataset, we applied a loudness normalization procedure. First, the collected MIDI data were synthesized using FluidSynth with a Disklavier-sampled SoundFont³. The integrated loudness of each synthesized audio file was measured using the pyloudnorm package [22]. We then computed the average loudness across all rendered files and defined −23 LUFS as the desired global average. A uniform gain offset was calculated based on the difference between the measured average and this global reference. This offset was then applied to each rendered file's loudness to yield a unique target loudness per rendition. Each real PianoVAM audio recording was then scaled to match the target loudness of the corresponding rendition. This process is designed to enhance loudness-velocity consistency while preserving the natural dynamic range of the performances.

³ https://freepats.zenvoid.org/Piano/YDP-GrandPiano/YDP-GrandPiano-SF2-20160804.tar.bz2 (Last accessed: June 28, 2025)
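
The arithmetic of this procedure can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function names are hypothetical, and the loudness values are invented numbers standing in for integrated-loudness measurements (which the paper obtains with pyloudnorm).

```python
def target_loudness(rendered_lufs, global_target=-23.0):
    """Shift every rendered file's loudness by one shared offset.

    The offset moves the dataset-wide mean to `global_target` while keeping
    the relative loudness differences between renditions intact.
    """
    mean = sum(rendered_lufs) / len(rendered_lufs)
    offset = global_target - mean
    return [lufs + offset for lufs in rendered_lufs]

def linear_gain(current_lufs, target_lufs):
    """Amplitude factor that moves a recording from current to target loudness."""
    return 10 ** ((target_lufs - current_lufs) / 20)

# Invented loudness values (LUFS) for three synthesized renditions.
rendered = [-20.0, -26.0, -29.0]
targets = target_loudness(rendered)

# A real recording measured at -24 LUFS whose rendition targets -18 LUFS.
gain = linear_gain(-24.0, targets[0])
```

Because a single offset is shared by all files, a loudly played piece stays louder than a softly played one after normalization, which is the property the text describes as preserving the natural dynamic range.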
4. DATASET STATISTICS
The dataset contains 106 solo piano recordings from 10 amateur performers, totaling approximately 21 hours. The repertoire is stylistically diverse, spanning works from 38 composers from the Baroque to modern eras (e.g., Bach, Chopin, Kapustin, Joe Hisaishi) and includes several improvisations. Performers' self-reported skill levels are advanced (70 recordings), intermediate (26), and beginner (10). Although the recording system allowed performers to choose between two performance types, Performance and DailyPractice, all recordings were self-labeled as DailyPractice. This indicates that the dataset primarily captures informal practice sessions where strict score adherence cannot always be expected.
To investigate differences in expressive characteristics across datasets, we conducted a brief comparative analysis of MIDI-related distributions between MAESTRO 3 and PianoVAM. Specifically, we examined three aspects displayed in Figure 2: (i) the distribution of pitch (MIDI note numbers), (ii) the distribution of velocity (MIDI velocity values), and (iii) the distribution of sustain pedal usage on a per-file basis. We computed Cohen's d for each musical feature to assess the practical significance of inter-dataset differences. Pitch (d = 0.0446) and velocity (d = -0.0379) exhibited negligible differences between datasets. However, pedal usage revealed a large effect size (d = 0.870), indicating a substantially higher use of the sustain pedal in PianoVAM compared to MAESTRO 3. We speculate this difference stems from an interplay between a pedal-demanding Romantic/Impressionist repertoire, the generous pedaling tendencies of amateur performers in practice, and a less reverberant studio environment that encourages compensation.
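
For readers unfamiliar with the effect-size measure, a minimal sketch of Cohen's d follows. The paper does not state which variant was used; the pooled-standard-deviation form below is an assumption, and the sample values are invented.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled

# Invented per-file pedal usage ratios for two small groups.
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 3.0, 5.0]
d = cohens_d(group_a, group_b)
```

The sign of d depends on the order of the arguments, so a positive value for pedal usage is consistent with the first group (here standing in for PianoVAM) having the higher mean.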
Figure 3. Flowchart of the fingering detection algorithm.
[Flowchart stages: Input, Preprocess, Labeling, Output. Inputs: Video, MIDI, Keyboard area info. Steps: Hand landmark detection (MediaPipe); Framewise hand area detection; if depth < threshold, then detect floating hand(s); Framewise finger position evaluation; Calculate finger score for each note; if a single candidate, then determine fingering, else human selection from multiple candidates. Output: MIDI w/ hand & fingering info.]
5. ANNOTATION OF FINGERING LABELS
To generate fingering annotations, we developed the hybrid algorithm shown in Figure 3. The algorithm first processes performance videos to map hand landmarks to potential finger candidates for each MIDI note. For notes with a single, unambiguous candidate, the fingering is determined automatically, achieving a precision of ∼95% (cf. Table 3). In ambiguous cases with multiple candidates (affecting ∼20% of notes), a custom GUI prompts a human annotator to make the final selection. This approach ensures complete and accurate fingering annotations for the entire dataset.
5.1 Inputs & Outputs
The inputs are video, MIDI, keyboard corner locations and lens distortion coefficients in the first frame of each video, which can also be set manually by the GUI-based annotation tool. The outputs are fingering information, and a separate MIDI file for each hand.
5.2 Method
The algorithm suggests candidate fingers that are likely to play each note in the MIDI file. First, hand landmark information is extracted from the input video frame with MediaPipe Hands [23]. A floating hand obfuscating the other hand but not playing any notes should be detected and excluded from the pool of candidates. Thus, we define a metric to measure the $z$-depth of a hand to detect such floating hands. Assuming that we know the hand skeleton of the player explicitly, we can calculate the $z$-depth from the $xy$ coordinate information. For each video, we find the model skeleton of each hand, which is a standard for all skeletons that should be unbent, not tilted, and on the keyboard. To measure tilt, the plane defined by three hand key points, namely Wrist ($W$), Index Finger Metacarpals ($I$), and Ring Finger Metacarpals ($R$), is utilized. We assume that the angle $\angle IWR$ of the untilted skeleton (parallel to the plane of the keyboard) should be close to $28^\circ$, which is our heuristic estimate of the average human hand in its neutral position. To measure how much the hand is bent, we calculate the ratio $r = \frac{|\triangle IWR|}{|WF_1F_2F_3F_4F_5|}$ of the triangle area $|\triangle IWR|$ and the area of the hexagon $|WF_1F_2F_3F_4F_5|$, where $F_i$ is the fingertip of the $i$-th finger. Finally, assuming that the hand is playing in the majority of frames, the median $\triangle IWR$ is selected as the default position:

$$ I_0 W_0 R_0 = \operatorname*{median}_{\substack{|\angle IWR - 28^\circ|\ \text{low } 10\%, \\ r\ \text{high } 50\%}} |\triangle IWR|. \tag{1} $$
Let the plane of the projected 2D image be $z = z_0$, and the area of the 2D image be

$$ A := \{(x, y) \mid x \in (-1, 1),\ y \in (-AR, AR)\} \subset \mathbb{R}^2 \cong \{(x, y, z_0) \mid x \in (-1, 1),\ y \in (-AR, AR)\} \subset \mathbb{PR}^3, \tag{2} $$

where the origin of the real projective space $\mathbb{PR}^3$ is the center of the camera lens and $AR$ is the aspect ratio of the video. Since the goal is to calculate the relative depth rather than the real depth, we set $z_0 = 1$ for convenience of calculation. Knowing that the original point is contained in the line $\{(xt, yt, t) \mid t \in \mathbb{R}^+\}$, we have three equations likely in general position and three variables $t, u, v > 0$ to solve from

$$ \|(x_I t, y_I t, t), (x_W u, y_W u, u)\|_{\mathbb{R}^3} = \overline{I_0 W_0} \tag{3} $$

$$ \|(x_W u, y_W u, u), (x_R v, y_R v, v)\|_{\mathbb{R}^3} = \overline{W_0 R_0} \tag{4} $$

$$ \|(x_R v, y_R v, v), (x_I t, y_I t, t)\|_{\mathbb{R}^3} = \overline{R_0 I_0}. \tag{5} $$

Here, our desired solution can be approximately calculated using Powell's dog leg algorithm with a close initial guess $(t_i, u_i, v_i) = (1, 1, 1)$. By substituting the solution $(t, u, v)$, we get the 3D coordinates of $I, W, R$. Finally, $d = (t + u + v)/3$ becomes the relative $z$-depth of the center of mass of the hand skeleton for each frame. If the $z$-depth is less than the threshold $0.9$, or equivalently, if the hand is floating more than 10% of the distance between the camera and the keyboard, we decide that the target hand of the target frame is floating.
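
The depth recovery can be sketched in a few lines. To stay dependency-free, this sketch swaps Powell's dog leg for a plain Newton iteration on the squared-distance form of Eqs. (3)-(5); this is a deliberate substitution, not the authors' solver. The function names and the sample coordinates are hypothetical.

```python
import math

def solve3(J, b):
    """Solve the 3x3 linear system J x = b with Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det(J)
    cols = []
    for k in range(3):
        Jk = [[b[r] if c == k else J[r][c] for c in range(3)] for r in range(3)]
        cols.append(det(Jk) / d)
    return cols

def relative_depths(img_pts, lengths, iters=20):
    """Return (t, u, v) for image points (I, W, R) and model lengths
    (|I0W0|, |W0R0|, |R0I0|), starting from the guess (1, 1, 1)."""
    pairs = ((0, 1), (1, 2), (2, 0))
    dirs = [(x, y, 1.0) for x, y in img_pts]  # camera rays through each image point
    s = [1.0, 1.0, 1.0]
    for _ in range(iters):
        res = []
        J = [[0.0] * 3 for _ in range(3)]
        for row, ((a, b), length) in enumerate(zip(pairs, lengths)):
            diff = [s[a] * dirs[a][k] - s[b] * dirs[b][k] for k in range(3)]
            res.append(sum(c * c for c in diff) - length ** 2)
            J[row][a] = 2 * sum(c * e for c, e in zip(diff, dirs[a]))
            J[row][b] = -2 * sum(c * e for c, e in zip(diff, dirs[b]))
        step = solve3(J, [-r for r in res])
        s = [si + di for si, di in zip(s, step)]
    return tuple(s)

def is_floating(depths, threshold=0.9):
    """Apply the text's rule: mean relative depth below the threshold."""
    return sum(depths) / 3 < threshold

# Hypothetical model skeleton (I0, W0, R0) lying on the keyboard plane z = 1.
model = [(-0.04, 0.08), (0.0, 0.24), (0.04, 0.096)]
lengths = [math.dist(model[a], model[b]) for a, b in ((0, 1), (1, 2), (2, 0))]

# The same hand lifted to depth 0.8 appears enlarged by 1/0.8 in the image.
lifted = [(x / 0.8, y / 0.8) for x, y in model]
depths = relative_depths(lifted, lengths)
```

A hand resting on the keys reproduces the model lengths at depth 1, while a lifted hand looks larger in the image and is therefore assigned a depth below 1.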
After discarding all floating hands from the detection, possible finger candidates for each note can be chosen. For each note, a fingering score is calculated, indicating the likelihood of each finger pressing the note. The score is based on the number of frames in which the fingertip is in the selected note area. If the fingertip is completely in the area, a value of 1 is assigned; if it is slightly off, a correction weight is applied, reducing the score for this frame. Thus, the final fingering score cannot exceed the total number of frames of each note. From the fingering score of each finger for the $n$-th note, we add the finger as a normal candidate or a strong candidate if the fingering score is greater than 50% or 80% of the theoretical maximum score (note length in frames), respectively. If there exists only one strong candidate, we pick the strong candidate as the only candidate. Note that there might be notes with either no candidate or multiple candidates, in which case the algorithm will leave these notes unlabeled.
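
The thresholding rule above can be sketched as follows, assuming the per-finger scores have already been accumulated over the frames of a note. The function name and the dictionary layout are hypothetical; only the 50% and 80% thresholds come from the text.

```python
def select_finger(scores, note_frames, normal=0.5, strong=0.8):
    """Return (label, candidates) for one note.

    scores: dict mapping finger id -> fingering score (at most note_frames).
    label is a finger id when the choice is unambiguous, otherwise None,
    meaning the note is left for manual annotation.
    """
    max_score = float(note_frames)
    candidates = [f for f, s in scores.items() if s > normal * max_score]
    strong_c = [f for f in candidates if scores[f] > strong * max_score]
    if len(strong_c) == 1:
        return strong_c[0], candidates   # a single strong candidate wins
    if len(candidates) == 1:
        return candidates[0], candidates  # a single candidate overall
    return None, candidates               # none or several: unlabeled

# Finger 1 is the only strong candidate, so it is chosen despite finger 2.
label_a, cands_a = select_finger({1: 9.0, 2: 6.0, 3: 1.0}, note_frames=10)
# Two normal candidates and no strong one: left unlabeled.
label_b, cands_b = select_finger({2: 6.0, 3: 5.5}, note_frames=10)
# No finger reaches 50% of the maximum score: also unlabeled.
label_c, cands_c = select_finger({4: 2.0}, note_frames=10)
```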
| Piece (Index) | Prec. (150) | Total No C. | Total Multi C. |
|---|---|---|---|
| Chopin, Op.18 (8) | 91.7 | 12.7 | 8.0 |
| Debussy, L.75 Mvt.3 (17)∗ | 97.1 | 7.9 | 2.8 |
| Grieg, Op.16 Mvt.2 (18) | 99.3 | 10.6 | 4.4 |
| Yiruma, Kiss the Rain (29) | 95.2 | 5.9 | 8.7 |
| Improvisation (31) | 98.6 | 16.7 | 3.0 |
| Schumann, Op.17 Mvt.1 (34) | 82.9 | 8.0 | 5.4 |
| Kapustin, Op.40 No.6 (42) | 98.4 | 8.4 | 5.5 |
| Scarlatti, K.380 (60) | 100.0 | 3.7 | 3.5 |
| Ravel, M.30 (81) | 92.4 | 35.1 | 4.4 |
| Satie, Gymnopedie No.1 (93) | 100.0 | 4.1 | 2.4 |
| Average of 10 pieces | 95.6 | 13.0† | 5.1† |

Table 3. Precision (Prec.) over the first 150 notes, and percentage of notes with none or multiple candidates (C.) over all notes, of 10 selected pieces (∗ For finger substitutions, we admit both fingers as correct fingering; † Weighted average with the number of total notes of each piece as the weight).
5.2.1 Reliability
With the support of the dedicated GUI-based fingering annotation tool, we manually labeled the ground truth fingering of the first 150 notes of 10 pieces in PianoVAM to assess the results. As detailed in Table 3, the algorithm achieved an average precision of over 95%, with most prediction errors involving adjacent fingers. The table also reports the percentage of notes with no candidates and multiple candidates. Notably, Ravel's Jeux d'eau stands out as an outlier with a high ratio of notes without candidates. Excluding this piece, the weighted averages of notes with no and multiple candidates are 9.4% and 5.2%, respectively.
6. BENCHMARK RESULTS
To demonstrate its utility, we benchmark the PianoVAM dataset on the task of piano transcription under two settings: audio-only and audio-visual. The audio-visual experiments are specifically designed to assess the visual modality's contribution to enhancing performance under challenging acoustic conditions.
6.1 Data Split
To facilitate reproducibility of results, we provide information on data splits designed to meet the following criteria: (i) no composition appears in more than one split, and (ii) the dataset is divided approximately into 80/10/10 percent for the training, validation, and test set, respectively, based on total duration. The resulting train/validation/test splits contain 73, 19, and 14 files, respectively. While these splits are intended to support reproducibility and comparability, we acknowledge that different experimental objectives might require different splits.
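
One way to satisfy both criteria is a greedy assignment of whole compositions to splits. The sketch below is an assumption about how such a split could be built, not the procedure used for the published splits; the function name and the toy file list are hypothetical.

```python
from collections import defaultdict

def split_by_composition(files, ratios=(0.8, 0.1, 0.1)):
    """files: list of (file_id, composition, duration_seconds).

    Every composition is assigned as a whole, so no composition can
    appear in more than one split.
    """
    names = ("train", "validation", "test")
    groups = defaultdict(list)
    for file_id, composition, duration in files:
        groups[composition].append((file_id, duration))
    total = sum(duration for _, _, duration in files)
    targets = {n: r * total for n, r in zip(names, ratios)}
    filled = {n: 0.0 for n in names}
    splits = {n: [] for n in names}
    # Longest compositions first, each into the split furthest below its target.
    ordered = sorted(groups.items(), key=lambda kv: -sum(d for _, d in kv[1]))
    for _, items in ordered:
        name = max(names, key=lambda n: targets[n] - filled[n])
        splits[name].extend(file_id for file_id, _ in items)
        filled[name] += sum(d for _, d in items)
    return splits

# Toy example: composition "A" was recorded twice and must stay together.
toy_files = [("a1", "A", 400), ("a2", "A", 400), ("b", "B", 100), ("c", "C", 100)]
toy_splits = split_by_composition(toy_files)
```

Because compositions differ in length and in the number of takes, a duration-based 80/10/10 target does not translate into an 80/10/10 split of file counts, which is consistent with the 73/19/14 file counts reported above.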
6.2 Audio-Only Piano Transcription
As MAESTRO is widely regarded as a standard dataset in piano transcription research, we deemed it a suitable reference point for evaluating our dataset. Accordingly, we performed a comparative analysis using the Onsets and Frames model [14], following its original specifications. The model was trained on each dataset as well as on a combined version. We utilized the model weights corresponding to the checkpoint with the lowest validation loss for inference.

| Train Dataset | Note | w/ Offset | w/ Vel. | Frame |
|---|---|---|---|---|
| MAESTRO 3 | 93.4 | 62.3 | 90.3 | 78.2 |
| PianoVAM | 95.8 | 60.4 | 93.9 | 80.0 |
| Combined | 95.2 | 73.5 | 93.0 | 86.9 |

Table 4. Transcription F1 scores on the PianoVAM test split. Bold: highest; Underline: significantly higher than the lowest; Double-line: significantly higher over both others (p < .0167).

The results in Table 4 are reported as F1 Scores (%) and calculated over the entire duration of the respective test splits. The terms Note, w/ Offset, and w/ Vel. refer to note evaluation with onset, with onset & offset, and with onset & velocity, respectively (cf. [15]). All four evaluation metrics, including Frame, were computed using the mir_eval package [24]. Following established transcription research conventions, offset timings were adjusted to the pedal-release time if the sustain pedal remained engaged. Statistical tests confirm significant differences across training sets for all metrics (Friedman test, p < .001). Post-hoc Wilcoxon tests with Bonferroni correction (α = .0167) showed that both PianoVAM and Combined models significantly outperformed MAESTRO 3 in Note and w/ Velocity. For w/ Offset and Frame, only the Combined model yielded significantly higher scores than both others. While PianoVAM slightly outperformed Combined in Note and w/ Velocity, only the latter difference was statistically significant.
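
The pedal-based offset adjustment mentioned above can be sketched as follows. This is a simplified illustration, not the evaluation code used in the paper: the function name and data layout are hypothetical, and refinements found in common implementations (such as cutting a sustained note when the same pitch is struck again) are omitted.

```python
def extend_offsets(notes, pedal_intervals):
    """Extend note offsets to the pedal release when the pedal is held.

    notes: list of (onset, offset, pitch) in seconds.
    pedal_intervals: list of (pedal_down, pedal_up) times in seconds.
    """
    extended = []
    for onset, offset, pitch in notes:
        for down, up in pedal_intervals:
            # The key is released while the sustain pedal is still engaged.
            if down <= offset < up:
                offset = up
                break
        extended.append((onset, offset, pitch))
    return extended

toy_notes = [(0.0, 0.5, 60), (1.0, 1.2, 64), (3.0, 3.5, 67)]
toy_pedal = [(0.4, 2.0)]
sustained = extend_offsets(toy_notes, toy_pedal)
```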
6.3 Audio-Visual Piano Transcription
Various approaches have been explored for piano transcription when both audio and video are available. For instance, Wan et al. and Wang et al. proposed methods enhancing the output of an audio-only automatic music transcription (AMT) system by incorporating visual information [19, 20], while Li et al. utilized both modalities jointly to improve onset prediction [6,21].
For this benchmark experiment, we focus on examining how visual information can be used to improve transcription performance under suboptimal recording conditions. We implement a simple post-processing pipeline that refines MIDI outputs from an audio-only AMT model by using top-view video, estimated piano keyboard corner coordinates, and hand skeletons detected with MediaPipe Hands [23]. This process enables the elimination of physically implausible notes by referencing visual evidence, thereby improving onset precision. The full implementation and additional details are available on GitHub¹, and a brief overview follows.
First, onset events are extracted from the predicted MIDI file. For each onset, the nearest video frame is retrieved, and hand landmarks are predicted [23]. Each video frame's timestamp is defined as the midpoint of the time interval it covers. If no hand is detected, the corresponding onset is unchanged. When both hands are detected, a perspective transformation is applied using the keyboard corner metadata to produce a normalized rectangular image ($H : W = 125 : 1024$), which maintains the standard height-to-width ratio ($1 : 8.147$) of an 88-key piano. The same transformation is applied to the predicted hand landmarks. Assuming that the 52 white keys are evenly spaced, the algorithm estimates which white key region each fingertip corresponds to, based on its transformed x-coordinate. To account for possible errors in hand landmark detection, multiple candidate keys are considered for each fingertip, with a tunable threshold determining the candidate range ($\pm 2$ white keys in our experiment). The final set of valid pitch candidates is obtained by intersecting all fingertip candidate sets. For each onset, if the pitch predicted by the audio-only AMT model falls within this candidate set, the note is retained; otherwise, it is discarded. This process is repeated for all onsets in the transcription.

| Input | Method | Precision | Recall | F1 |
|---|---|---|---|---|
| Noisy | Vanilla | 96.0 | 43.7 | 57.2 |
| Noisy | + NoiseAug | 96.1 | 82.8 | 88.7 |
| Noisy | + Video | 97.2 | 82.7 | 89.2 |
| Reverberant | Vanilla | 66.8 | 68.2 | 64.4 |
| Reverberant | + Video | 68.1 | 67.8 | 64.8 |

Table 5. Onset prediction performance under different acoustic conditions. Bold: highest in each column; Underline: significantly higher over the preceding method (paired t-test, p < 0.05).
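
The per-fingertip part of this filter can be sketched as follows. It is a simplified reading of the description, not the released implementation: the function names are hypothetical, the black keys lying between the candidate white keys are included as an assumption, and the step that combines the sets of several fingertips is left out, so `keep_onset` takes an already combined candidate set.

```python
# Semitone offsets above A for the white keys A, B, C, D, E, F, G.
WHITE_OFFSETS = (0, 2, 3, 5, 7, 8, 10)

def frame_timestamp(index, fps=60.0):
    """Midpoint of the time interval covered by a video frame."""
    return (index + 0.5) / fps

def white_key_index(x, width=1024, n_white=52):
    """White-key region (0-based) for an x-coordinate in the rectified image."""
    k = int(x * n_white / width)
    return min(max(k, 0), n_white - 1)

def white_key_to_midi(k):
    """MIDI pitch of the k-th white key on an 88-key piano (key 0 is A0 = 21)."""
    return 21 + 12 * (k // 7) + WHITE_OFFSETS[k % 7]

def candidate_pitches(x, tol=2, width=1024, n_white=52):
    """Pitches within +/- tol white keys of a fingertip.

    Assumption: black keys between the outermost candidate white keys
    are treated as candidates too.
    """
    k = white_key_index(x, width, n_white)
    lo = max(k - tol, 0)
    hi = min(k + tol, n_white - 1)
    return set(range(white_key_to_midi(lo), white_key_to_midi(hi) + 1))

def keep_onset(pitch, candidates):
    """Retain a predicted onset only if its pitch is visually plausible."""
    return pitch in candidates

# A fingertip above middle C (MIDI 60), which is white key 23.
near_middle_c = candidate_pitches(462.8)
```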
Table 5 summarizes onset prediction performance under two challenging acoustic conditions: SNR = 0 dB Gaussian noise, and added reverberation. To evaluate the model's robustness under reverberant acoustic conditions, we applied convolutional reverb using a real-world impulse response (IR) recorded in St. George's Church⁴. The IR was originally sampled at 96 kHz and downsampled to 16 kHz to match the audio input. All audio samples were convolved with the mono version of this IR using FFT-based convolution. To compensate for the inherent delay in the IR (with its peak located at sample index 653), we removed the first 653 samples from each convolved output to ensure proper temporal alignment. The resulting signals were then peak-normalized to maintain consistent amplitude and avoid distortion.
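
The delay compensation and peak normalization can be illustrated on toy data. For self-containment this sketch uses direct (time-domain) convolution in place of the FFT-based convolution named in the text; the two produce the same output, only at different cost. Function names and sample values are hypothetical.

```python
def convolve(signal, ir):
    """Direct linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out

def apply_reverb(signal, ir, peak=1.0):
    """Convolve with an IR, drop the IR's pre-peak delay, and peak-normalize."""
    # Index of the IR peak; 653 for the IR described in the text.
    delay = max(range(len(ir)), key=lambda k: abs(ir[k]))
    wet = convolve(signal, ir)[delay:]
    top = max(abs(v) for v in wet)
    return [v * peak / top for v in wet] if top > 0 else wet

# Toy IR whose peak arrives two samples late, followed by a short tail.
toy_ir = [0.0, 0.0, 2.0, 1.0]
toy_signal = [1.0, 0.0, 0.0, 0.5]
wet_signal = apply_reverb(toy_signal, toy_ir)
```

After trimming, the first output sample lines up with the first input sample, so onset annotations from the dry recording remain valid for the reverberant version.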
Under noisy conditions, the baseline model (Vanilla), trained on clean audio only, exhibited substantial degradation. Introducing noise during training (+ NoiseAug) significantly improves recall and F1 ($p < .0001$), while precision remains unchanged. For the + NoiseAug condition, the model was trained on a 50/50 mixture of original clean audio and augmented noisy samples. The noisy samples were generated by adding Gaussian noise with signal-to-noise ratios (SNR) randomly sampled from 0 to 24 dB (cf. [25]). Adding visual filtering (+ Video) further improves precision ($p = .0052$) and F1 ($p = .0101$); the change in recall, however, is not statistically significant.
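
A minimal sketch of the noise augmentation follows. The function names are hypothetical, and two details are assumptions rather than statements from the paper: the 50/50 mixture is modeled as an independent coin flip per training example, and the noise is rescaled so that the requested SNR is met exactly for each example.

```python
import math
import random

def add_noise(signal, snr_db, rng):
    """Add Gaussian noise scaled so the result has exactly the requested SNR."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]

def noise_augment(signal, rng, p_noisy=0.5, snr_range=(0.0, 24.0)):
    """Return a noisy copy with probability p_noisy, else the clean signal."""
    if rng.random() < p_noisy:
        return add_noise(signal, rng.uniform(*snr_range), rng)
    return list(signal)

def measured_snr_db(clean, noisy):
    """SNR in dB of a noisy signal relative to its clean reference."""
    noise = [b - a for a, b in zip(clean, noisy)]
    p_signal = sum(s * s for s in clean)
    p_noise = sum(n * n for n in noise)
    return 10 * math.log10(p_signal / p_noise)

rng = random.Random(0)
clean = [math.sin(0.1 * k) for k in range(1000)]
noisy = add_noise(clean, 0.0, rng)  # the SNR = 0 dB test condition
```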
⁴ https://webfiles.york.ac.uk/OPENAIR/IRs/st-georges-episcopal-church/st-georges-episcopal-church.zip; st_georges_far.wav (Last accessed: June 28, 2025)
In reverberant conditions, visual post-processing significantly improves precision ($p = .0005$) and marginally improves F1 ($p = .0508$), with no significant change in recall. Qualitative inspection revealed that reverberant tails were sometimes misclassified as new onsets and the visual modality helped reduce such errors.
7. DISCUSSION
The dataset was collected using a system designed to facilitate unsupervised recording, allowing performers to play freely without on-site assistance. While this approach streamlines data acquisition, the dataset exhibits biases in performer identity, pedal usage, and composer representation. In addition, since all recordings originate from practice sessions, the dataset is unsuitable for comparative studies with corresponding musical scores. Our fingering detection approach, while promising, faces challenges from visual ambiguities. These arise from both complex pianistic techniques, such as the multi-finger preparations for rapid repetitions in Chopin's Grande valse brillante, and visual artifacts like performance-induced motion blur in Ravel's Jeux d'eau or shadows in the recording of Schumann's Fantasie in C. Furthermore, the algorithm is intentionally focused on conventional playing, thus excluding extended techniques like glissandi or playing with the fist. Our future extensions may include expert performances, expanded modalities (e.g., multi-angle video), and contextually rich data to support more robust and musically meaningful analysis. Moreover, we aim to improve fingering detection precision by leveraging state-of-the-art models for hand pose estimation [26] and 3D reconstruction [27].
8. CONCLUSION
We presented PianoVAM, a comprehensive multimodal dataset of amateur piano practice sessions that captures synchronized top-view video, audio, MIDI, hand landmarks, fingering labels, and rich metadata. Recorded using a Yamaha Disklavier in natural, varied practice conditions, PianoVAM addresses key limitations of existing datasets that often lack specific modalities or rely on synthetic or incomplete annotations. To generate fingering labels, we propose a semi-automated method that combines hand landmark detection from video with manual refinement. We also discuss the challenges of multimodal alignment and data collection. To demonstrate the utility of PianoVAM, we report baseline results for both audio-only and audio-visual piano transcription tasks and showcase its potential for advancing a range of MIR applications. Future extensions of the dataset may address current imbalances in musical content and metadata diversity.
9. ETHICS STATEMENT
This study involved human participants for data collection, which was approved by the Institutional Review Board (IRB) at KAIST (Approval No. KH2023-235). All procedures strictly adhered to established ethical guidelines.
10. ACKNOWLEDGMENTS
We sincerely appreciate the KAIST music and audio computing lab and PIAST (Piano club) members who participated in the dataset acquisition as performers. This research was supported by the National Research Foundation of Korea (NRF) funded by the Korea Government (MSIT) under Grant RS-2023-NR077289 and Grant RS-2024-00358448.
11. REFERENCES
[1]
V. Be ge on and D. M. Lopes, “Hea ing and seeing
musical exp ession,” Philosophy and Phenomenological
Resea ch, ol. 78, no. 1, pp. 1–16, 2009.
[2]
F. Pla z and R. Kopiez, “When he eye lis ens: A me a-
analysis o how audio- isual p esen a ion enhances he
app ecia ion o music pe o mance,” Music Pe cep ion:
An In e disciplina y Jou nal, ol. 30, no. 1, pp. 71–83,
2012.
[3]
B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sha ma, “C e-
a ing a mul i ack classical music pe o mance da ase
o mul imodal music analysis: Challenges, insigh s,
and applica ions,” IEEE T ansac ions on Mul imedia,
ol. 21, no. 2, pp. 522–535, 2018.
[4]
Z. Duan, S. Essid, C. C. S. Liem, G. Richa d, and
G. Sha ma, “Audio isual analysis o music pe o -
mances: O e iew o an eme ging ield,” IEEE Signal
P ocessing Magazine, ol. 36, no. 1, pp. 63–73, 2019.
[5]
A. S. Koepke, O. Wiles, Y. Moses, and A. Zisse man,
“Sigh o sound: An end- o-end app oach o isual piano
ansc ip ion,” in P oceedings o he IEEE In e na ional
Con e ence on Acous ics, Speech and Signal P ocessing
(ICASSP), 2020, pp. 1838–1842.
[6]
Y. Li, X. Wang, R. Wu, W. Xu, and W. Chen, “A c nn-
gcn piano ansc ip ion model based on audio and skele-
on ea u es,” in IEEE In e na ional Con e ence on
Acous ics, Speech, and Signal P ocessing Wo kshops
(ICASSPW), 2023, pp. 1–5.
[7] B. Li, C. Xu, and Z. Duan, “Audiovisual source
association for string ensembles through multi-modal vibrato
analysis,” in Proceedings of the Sound and Music
Computing Conference (SMC), 2017.
[8] J. F. Montesinos, V. S. Kadandale, and G. Haro, “A
cappella: Audio-visual singing voice separation,” in
Proceedings of the 32nd British Machine Vision Conference
(BMVC), 2021.
[9] S. Nadkarni, P. Rao, and M. Clayton, “Identifying
melodic motifs and stable notes from gestural information
in Indian vocal performances,” Transactions of the
International Society for Music Information Retrieval,
vol. 7, no. 1, 2024.
[10] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon,
C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and
D. Eck, “Enabling factorized piano music modeling
and generation with the MAESTRO dataset,” in
Proceedings of the International Conference on Learning
Representations (ICLR), 2019.
[11] V. Emiya, N. Bertin, B. David, and R. Badeau,
“Maps - a piano database for multipitch estimation
and automatic transcription of music,” Research
Report, Tech. Rep., 2010. [Online]. Available:
https://hal.inria.fr/inria-00544155
[12] E. Nakamura, Y. Saito, and K. Yoshii, “Statistical
learning and estimation of piano fingering,” Information
Sciences, vol. 517, pp. 68–85, 2020.
[13] P. Ramoneda, D. Jeong, E. Nakamura, X. Serra, and
M. Miron, “Automatic piano fingering from partially
annotated scores using autoregressive neural networks,” in
Proceedings of the 30th ACM International Conference
on Multimedia (ACM MM), 2022, pp. 6502–6510.
[14] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon,
C. Raffel, J. Engel, S. Oore, and D. Eck, “Onsets and
frames: Dual-objective piano transcription,” in Proceedings
of the 19th International Society for Music Information
Retrieval Conference (ISMIR), 2018, pp. 50–57.
[15] Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang,
“High-resolution piano transcription with pedals by regressing
onset and offset times,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 29, pp.
3707–3717, 2021.
[16] W. Wei, P. Li, Y. Yu, and W. Li, “Hppnet: Modeling
the harmonic structure and pitch invariance in piano
transcription,” in Proceedings of the 23rd International
Society for Music Information Retrieval Conference
(ISMIR), 2022, pp. 709–716.
[17] Y. Yan and Z. Duan, “Scoring time intervals using
non-hierarchical transformer for automatic piano
transcription,” in Proceedings of the 25th International Society
for Music Information Retrieval Conference (ISMIR),
2024, pp. 973–980.
[18] D. Edwards, S. Dixon, E. Benetos, A. Maezawa, and
Y. Kusaka, “A data-driven analysis of robust automatic
piano transcription,” IEEE Signal Processing Letters,
vol. 31, pp. 681–685, 2024.
[19] Y. Wan, X. Wang, R. Zhou, and Y. Yan, “Automatic
piano music transcription using audio-visual features,”
Chinese Journal of Electronics, vol. 24, no. 3, pp.
597–603, 2015.
[20] X. Wang, W. Xu, J. Liu, W. Yang, and W. Cheng, “An
audio-visual fusion piano transcription approach based
on strategy,” in Proceedings of the 24th International
Conference on Digital Audio Effects (DAFx), 2021, pp.
308–315.
[21] Y. Li, X. Wang, R. Wu, W. Xu, and W. Cheng, “A
two-stage audio-visual fusion piano transcription model
based on the attention mechanism,” IEEE/ACM Transactions
on Audio, Speech, and Language Processing,
vol. 32, pp. 3618–3630, 2024.
[22] C. J. Steinmetz and J. D. Reiss, “pyloudnorm: A simple
yet flexible loudness meter in python,” in 150th AES
Convention, 2021.
[23] F. Zhang, V. Bazarevsky, A. Vakunov, A. Tkachenka,
G. Sung, C.-L. Chang, and M. Grundmann, “Mediapipe
hands: On-device real-time hand tracking,” 2020.
[Online]. Available: https://arxiv.org/abs/2006.10214
[24] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon,
O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel, “Mir_eval: A
transparent implementation of common mir metrics,” in
Proceedings of the 15th International Society for Music
Information Retrieval Conference (ISMIR), 2014, pp.
367–372.
[25] Y. Kim and A. Lerch, “Towards robust transcription:
Exploring noise injection strategies for training data
augmentation,” in Late Breaking Demo of the 25th
International Society for Music Information Retrieval
Conference (ISMIR), 2024.
[26] Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “ViTPose++:
Vision Transformer for Generic Body Pose Estimation,”
IEEE Transactions on Pattern Analysis & Machine
Intelligence, vol. 46, no. 02, pp. 1212–1230, 2024.
[27] H. Dong, A. Chharia, W. Gou, F. Vicente Carrasco,
and F. D. De la Torre, “Hamba: Single-view 3d hand
reconstruction with graph-guided bi-scanning mamba,”
Advances in Neural Information Processing Systems,
vol. 37, pp. 2127–2160, 2024.