
Predicting Flutist Onset Timing in Duet Performance: A Multimodal Analysis of Gesture and Breath Cues

Author: Jaeran Choi; Taegyun Kwon; Juhan Nam
Publisher: Zenodo
DOI: 10.5281/zenodo.17706343
Source: https://zenodo.org/records/17706343/files/000012.pdf
Graduate School of Culture Technology, KAIST, South Korea
{jaeran.choi,ilcobo2,juhan.nam}@kaist.ac.kr
ABSTRACT
In ensemble performances, musicians use gesture and breath cues to synchronize their initial notes at the beginning of a piece, but the precise relationship between these cues and onset timing remains under-explored. This study investigates how flutists’ gesture and breath cues encode the timing information of the initial note onset. This research consists of four components: (1) Collection of a cue dataset containing synchronized video and audio recordings of flute-piano duets, (2) Identification of cue candidate points through facial movement curves and breath onset-offset analysis, (3) Verification of predicted onset accuracy using linear regression on these cues compared to human onset asynchronies and (4) Introduction and exploration of a ‘trigger’ concept, defined as immediate, clearly perceivable gestures (such as stopping or raising the head) indicating the precise moment of onset. Our findings suggest a dual-cue system: preparatory cues broadly predict onset timing, while precise triggers refine the exact onset. We compared the time difference between the predicted and piano onsets with the flute–piano asynchronies and verified the concepts of cue and trigger through expert interviews. This research contributes to a deeper understanding of the complex phenomena of musical cues during performance through multimodal analysis. This paper provides an open-access cue dataset, which can be found on the accompanying website (see footnote 1).
1. INTRODUCTION
Music performance is inherently multimodal, combining sound and motion. Although these elements primarily convey musical expression to audiences [1], they also play a critical role in ensemble synchronization among performers [2]. Performers often employ specific gestures and breathing sounds as musical cues to synchronize their note onsets, particularly at the beginning or during critical moments in a performance. The convention of cueing approximately one beat before the musical onset has been extensively documented in previous studies [3–5]. Head nodding gestures, in particular, are commonly used as intuitive musical cues, and previous studies have utilized such gestures to define cue timings, even extending their application to interactions with robotic musicians [5–7]. Therefore, understanding musical cues not only deepens synchronization knowledge but is also crucial for designing interactive performance systems.

Footnote 1: https://github.com/jaeranchoi/flutist_cue_dataset
© J. Choi, T. Kwon and J. Nam. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J. Choi, T. Kwon and J. Nam, “Predicting Flutist Onset Timing in Duet Performance: A Multimodal Analysis of Gesture and Breath Cues”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Previous studies have examined the role of gestures and breathing in performer synchronization. Bishop et al. showed that performers use visual and auditory cues to synchronize after silence or rests [8]. Additionally, gestures at the beginning of a piece were found to correlate with tempo, particularly through falling acceleration curves that typically reach their midpoint approximately one beat before the onset [9]. However, this midpoint timing was often not precisely one beat ahead, and exact onset prediction based on these curves was not explored. Vera et al. demonstrated increased onset asynchrony when performers restarted together after rests without visual contact, and identified relationships between breath onset-offset timings and rest durations, yet did not clarify how gestures or breathing specifically encode onset timing [10]. Although these studies highlight the significance of gestures and breath cues in synchronization, they have not thoroughly investigated how combined visual and auditory cues precisely predict intended onset timing, especially at the beginning of a piece.
One reason for the limited quantitative analyses in previous studies is the difficulty of accurately tracking gestures. Bishop et al. utilized Kinect sensors and accelerometers to measure motion curves [9], while Timmers et al. employed infrared markers to measure bow velocity in a string quartet setting [11]. In contrast, our approach uses sensorless image processing techniques, applying optical flow methods [12] to quantitatively track facial gesture movements.
This paper investigates how a flutist’s intended onset timing is encoded through gesture and breath cues, comprising four main components: (1) Collection of synchronized video and audio data from flute-piano duet performances involving 20 flutists (1,320 trials in total), (2) Extraction and annotation of gesture cue features using face movement curves analyzed by optical flow, alongside manual annotation of breath onset and offset timings, (3) Verification of the predictive accuracy of cue-based onset predictions by comparing linear regression–derived timings to observed human onset asynchronies and (4) Exploration of a trigger concept, which involves three types of immediate, clearly perceivable movements: quickly raising the head, stopping head movement after the cue, or slowly raising the head. The first type triggers onset just before the head is fully raised, the second triggers onset immediately after stopping, and the third provides a less clear signal. These trigger movements allow performers to precisely detect the onset moment for accurate synchronization.

Figure 1: The overall framework of the musical cue detection and onset timing prediction.
Our analysis demonstrates clear relationships between the lengths of gesture and breath cues and onset timings. Gesture cues showed a linear relationship between position, velocity and acceleration peaks in vertical facial movements and subsequent onset timing. Similarly, breath cues exhibited a correlation between breath duration and the timing interval to the note onset, with variations observed across different tempos. We further verified our hypotheses through expert interviews. Additionally, we conducted a case study to address related phenomena, including adaptation effects (reduced discrepancies through repeated rehearsal) and instances of cue execution failures, where significant differences between cue-based predicted onset and actual onset were observed. Based on these observations, we conclude that musicians utilize two primary synchronization strategies: rough timing indication through gesture and breath cues, and precise, immediate signals through triggers. Furthermore, exact synchronization is refined through repetitive rehearsals.
2. RELATED WORKS
2.1 Musical Cue for Synchronization
Musical cues, essential for performer communication, include visual gestures and non-musical elements like breathing. They are particularly valuable for precise coordination, such as in pieces with abrupt tempo changes [13], or synchronization after rests and tempo variations [8]. The beginning of a musical piece is challenging for coordination due to the absence of preceding audio cues. Bishop et al. examined visual cues at piece initiation, finding that the peak of the acceleration curve in nodding gestures indicated beat positions, while gesture duration and periodicity conveyed tempo information [9]. However, they did not investigate the predictive capability of gesture cues for onset timing prediction.
Most studies on synchronization between gestures and musical rhythm have emphasized velocity peaks rather than spatial positions as primary features. Su [14] demonstrated this using minimal laboratory setups with bouncing point-light and auditory stimuli. Similar findings emerged from string quartet studies linking bow speed to tempo cues [11], and conductor studies emphasizing baton velocity and acceleration [15]. Vera et al. further highlighted breath cues’ significance in synchronization, particularly when visual contact is limited [10]. These studies underline the role of gesture and breath cues in synchronization, suggesting that velocity, acceleration, and cue length influence timing. However, few studies have simultaneously examined those cues in wind instruments to assess their impact on timing. Our study extends this by quantitatively examining correlations between gestures, breath cues, and onset timings through multimodal analysis of flute-piano duets.
2.2 Gesture Analysis for Performance
Motion tracking methods utilizing sensors or optical flow-based video tracking [16] are common for gesture analysis. Previous studies by Bishop et al. and Timmers et al. used attachable sensors or markers to measure movement and acceleration [9, 11], whereas Li et al. applied audio-visual analysis with optical flow to examine vibrato patterns from string players’ hand movements [17, 18]. Maezawa et al. proposed MuEns, a multimodal score-following system employing optical flow to track gestures for automated piano accompaniment. However, in that particular work, the authors utilized arbitrarily defined gesture cues instead of systematically analyzing how performers naturally encode onset timings [5].
3. DATASET
3.1 Musical Cue Dataset
We created a multimodal cue dataset containing video and audio recordings of flute-piano duet performances to analyze musical cues from gestures and breathing sounds. Following previous studies [9,14], we identified peaks in position, velocity, and acceleration curves as potential gesture cue points. Breath cue points were defined by the breath onset and offset timings. We termed the interval between paired cue points as ‘cue length’ and the duration from cue initiation to flute onset as ‘cue-onset length’.
3.1.1 Participants
A total of 20 professional flutists and 3 pianists participated in the experiment, all holding bachelor’s or master’s degrees in performance. Each flutist-pianist pair had not previously performed together.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 2: Breath Cue and Onset Annotation. The breath cue is defined by the breath onset and offset annotated on the mel spectrogram. Six markers are annotated per trial. Detailed information can be found in Section 3.2.
3.1.2 Placement and Equipment
The setup mimicked a concert stage, with flutists positioned facing away from the pianists. The pianists could observe the flutists, while the flutists were instructed to give cues without turning or looking at the pianists. A camera recorded flutists at 60 fps (see footnote 2), and microphones separately captured audio from each instrument at 44.1 kHz. A neutral-colored screen behind flutists minimized background interference for accurate facial movement tracking.
3.1.3 Musical Piece
This dataset comprises simultaneously starting flute–piano duets. Part 1 included a C major scale and Pachelbel’s Canon performed at slow (50 BPM), medium (100 BPM) and fast (150 BPM) tempos, each repeated twice as warm-up exercises (12 trials). Part 2 consisted of 18 classical pieces simplified for piano and arranged for simultaneous starts. Each piece was assigned a specific tempo (50, 100, or 150 BPM), and the entire set of 18 pieces was repeated three times, resulting in 54 trials. Each duet performed a total of 66 trials, resulting in 1,320 trials overall. Sheet music and sample audio were provided in advance.
3.1.4 Procedure
The recording procedure for each piece included: (1) An experimenter’s clap signaling the start, followed by a measure of clicks matching the given tempo; (2) The flutist giving a cue after the clicks ended; (3) The duet beginning in response to the cue.
3.1.5 Post-session Interview
The interviews collected the participants’ insights on cue strategies. Participants reported providing cues approximately one beat (or half or two beats, depending on the piece) ahead, using body movements or breath. Some participants mentioned that in typical performance situations, they adjust their cue timing based on the accompanying instrument and ensemble context.

Footnote 2: Some videos were recorded at 30 fps due to camera overheating.

Figure 3: Gesture Cue Example. The gesture cue is defined by the maximum (red) and minimum (black) peaks. The interval between these peaks is called ‘cue length’ (x, red line), and the duration from the maximum peak to the flute onset is called ‘cue-onset length’ (y, blue line).
3.2 Annotation and Preprocessing
Video and audio data synchronization was achieved through an experimenter’s clap at the start. Using the spectrogram viewer in Adobe Audition, we manually annotated six markers per trial on the mel spectrogram (Figure 2): the experimenter’s clap (Start), breath sound onset and offset (Breath Onset, Breath Offset), initial note onsets of flute and piano (Flute Onset, Piano Onset), and the flute’s second measure onset (2nd Measure).
4. METHODS
4.1 Gesture Cue Detection
4.1.1 Motion Detection
To detect gesture cues from flutists, we used MediaPipe’s face landmark detection (see footnote 3) [19] to reliably identify face regions, even when partially obscured by the flute. Subsequently, optical flow methods [17,18] were applied to track facial motion. A pilot study indicated optimal face landmark detection accuracy when the face occupied at least 50% of the video frame height; videos were accordingly resized.
4.1.2 Motion Feature Extraction
We analyzed facial gestures by extracting position, velocity, and acceleration magnitude curves from averaged y-axis optical flow values. Due to quantized pixel positions causing discrete velocity curves, we applied zero-phase filtering (see footnote 4) to smooth the curve while preserving peak positions.
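The smoothing step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the averaged y-axis optical flow is a per-frame displacement in pixels, and the Butterworth low-pass design, the 5 Hz cutoff, the filter order, and the function name `motion_curves` are all hypothetical choices. Only the use of `scipy.signal.filtfilt` for zero-phase filtering comes from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt


def motion_curves(flow_y, fps=60.0, cutoff_hz=5.0, order=2):
    """Smooth per-frame vertical flow and derive position/velocity/acceleration.

    flow_y    : averaged y-axis optical flow per frame (assumed pixels/frame)
    cutoff_hz : low-pass cutoff (hypothetical value, not from the paper)
    """
    flow_y = np.asarray(flow_y, dtype=float)
    # Low-pass filter design; the filter type and order are assumptions.
    b, a = butter(order, cutoff_hz, btype="low", fs=fps)
    # filtfilt runs the filter forward and backward, so the net phase shift
    # is zero and peaks are not delayed in time.
    velocity = filtfilt(b, a, flow_y) * fps          # pixels/second
    position = np.cumsum(velocity) / fps             # integrate velocity
    acceleration = np.gradient(velocity, 1.0 / fps)  # differentiate velocity
    return position, velocity, acceleration


# Synthetic nod: a 1 Hz sinusoidal flow, rounded to whole pixels to mimic
# the quantization described in the text.
t = np.arange(0, 2.0, 1 / 60.0)
true_flow = 3.0 * np.sin(2 * np.pi * 1.0 * t)
quantized = np.round(true_flow)
position, velocity, acceleration = motion_curves(quantized)
```

A single forward pass (for example `scipy.signal.lfilter`) would delay the peaks by the filter's group delay, which matters here because the peak times themselves are the cue features.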
4.1.3 Motion Peak Picking
Figure 3 illustrates a typical gesture cue pattern. Within a one-measure window preceding the flute onset (‘cue window’), we identified maximum and minimum peaks on position, velocity, and acceleration curves using the find-peaks algorithm (see footnote 5). This approach automatically detected 848 peaks, with 394 manually annotated. Trials without clear peak patterns or with failed tracking were excluded, resulting in 1,242 usable trials out of 1,320. Gesture curves from cases where peak tracking failed are also available on the accompanying webpage.

Footnote 3: available at: https://developers.google.com/mediapipe
Footnote 4: scipy.signal.filtfilt
Footnote 5: scipy.signal.find_peaks

Figure 4: Relationship between Cue Length and Cue-Onset Length for Gesture Position, Velocity, Acceleration, and Breath Cue. Black line: overall regression; blue, green, pink: 50, 100, 150 BPM, respectively. Slopes (y=ax) and correlation (r) are shown for each curve.
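The peak-picking step within the cue window might look like the sketch below. The paper names `scipy.signal.find_peaks` but does not state its parameters or how the peak pair is chosen, so the prominence threshold, the rule "last maximum that is followed by a minimum", and the function name `gesture_cue` are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import find_peaks


def gesture_cue(curve, onset_frame, window_frames, fps=60.0, prominence=0.1):
    """Return (cue_length, cue_onset_length) in seconds, or None.

    curve         : position, velocity or acceleration curve (one value/frame)
    onset_frame   : frame index of the annotated flute onset
    window_frames : length of the one-measure cue window in frames
    """
    start = max(0, onset_frame - window_frames)
    segment = np.asarray(curve[start:onset_frame], dtype=float)
    maxima, _ = find_peaks(segment, prominence=prominence)
    minima, _ = find_peaks(-segment, prominence=prominence)  # minima as peaks
    # Hypothetical pairing rule: latest maximum that has a later minimum.
    for max_idx in maxima[::-1]:
        later = minima[minima > max_idx]
        if later.size:
            min_idx = later[0]
            cue_length = (min_idx - max_idx) / fps             # x
            cue_onset_length = (len(segment) - max_idx) / fps  # y
            return float(cue_length), float(cue_onset_length)
    return None  # no clear peak pair: trial would need manual inspection


# Synthetic position curve: static, then one sinusoidal nod, then static.
frames = np.arange(120)
curve = np.where((frames >= 40) & (frames < 100),
                 np.sin(2 * np.pi * (frames - 40) / 60.0), 0.0)
result = gesture_cue(curve, onset_frame=120, window_frames=120)
```

Returning `None` when no maximum-minimum pair is found mirrors the paper's exclusion of trials without clear peak patterns, although the actual exclusion criteria are not specified.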
4.2 Breath Cue Detection
To detect breath cues, we annotated breath onset and offset within the same one-measure ‘cue window’ preceding flute onset, based on mel spectrograms (Figure 2). Analyses were conducted exclusively on the 1,242 trials that were verified to contain valid gesture cues.
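As a rough sketch, the breath cue quantities follow directly from the six annotated markers of Section 3.2. The class name, field names, and the sign convention for asynchrony (piano minus flute) are hypothetical; the example times are invented for illustration and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialMarkers:
    """Six manually annotated marker times for one trial, in seconds."""
    start: float           # experimenter's clap
    breath_onset: float
    breath_offset: float
    flute_onset: float
    piano_onset: float
    second_measure: float  # flute's second-measure onset

    def breath_cue_length(self) -> float:
        # 'cue length': interval between the paired breath cue points
        return self.breath_offset - self.breath_onset

    def breath_cue_onset_length(self) -> float:
        # 'cue-onset length': from cue initiation to the flute onset
        return self.flute_onset - self.breath_onset

    def asynchrony_ms(self) -> float:
        # Assumed convention: positive when the piano enters after the flute
        return (self.piano_onset - self.flute_onset) * 1000.0


# Invented example values, for illustration only.
trial = TrialMarkers(start=0.0, breath_onset=5.2, breath_offset=5.8,
                     flute_onset=6.3, piano_onset=6.35, second_measure=8.7)
x = trial.breath_cue_length()
y = trial.breath_cue_onset_length()
```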
4.3 Onset Timing Prediction
We examined four cues: gesture position, velocity, acceleration, and breath. To explore the relationship between these cues and onset timing, we applied a simple linear regression model. Each cue length was set as the independent variable x, while the cue-onset length, i.e., the interval from the cue start position (maximum peak or breath onset) to the actual onset, was the dependent variable y. The regression model, without bias, is defined by y = ax, as illustrated in Figure 3. The slope a derived from the regression indicates the ratio between cue length and onset timing, enabling onset prediction. Additionally, we analyzed the trials separately by tempo to assess differences in cue-onset relationships, examining both the regression slope and Pearson correlation. A high correlation would confirm the cue’s validity for predicting onset timing.
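A bias-free regression y = ax has the closed-form least-squares slope a = Σxy / Σx². The sketch below assumes ordinary least squares (the paper does not state the fitting procedure), and the function names and sample values are illustrative only.

```python
import math


def fit_slope(x, y):
    """Least-squares slope of y = a * x (regression through the origin)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


def predict_onset(cue_start, cue_length, slope):
    """Predicted onset time: cue start plus the predicted cue-onset length."""
    return cue_start + slope * cue_length


# Invented cue lengths (x) and cue-onset lengths (y) in seconds.
cue_lengths = [0.4, 0.6, 0.8, 1.2]
cue_onset_lengths = [0.9, 1.1, 1.7, 2.3]

slope = fit_slope(cue_lengths, cue_onset_lengths)
r = pearson_r(cue_lengths, cue_onset_lengths)
predicted = predict_onset(cue_start=10.0, cue_length=0.5, slope=slope)
```

Running the same fit on the subsets recorded at 50, 100 and 150 BPM would give the per-tempo slopes and correlations described above.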
5. RESULTS AND DISCUSSION
5.1 Patterns in Gesture Curves
The position curve did not consistently show the same shape in all trials, but in most cases it remained static initially and then displayed a clear downward-upward-downward motion pattern that served as a signal (Figure 3). Even when the position curve deviated from the typical pattern, an up-down motion immediately preceding onset was consistently present (1,242 out of 1,320 trials, see Section 4.1.3). These movements were often periodic and sinusoidal, resulting in velocity and acceleration curves that mirrored the position curve, but phase-shifted by approximately a quarter cycle. Further investigation could examine how variations in curve characteristics, such as the degree of sinusoidal shape and periodicity consistency, influence performers’ interpretation of cues and their subsequent synchronization accuracy. Moreover, analyzing deviations from typical sinusoidal patterns might uncover additional insights into performer-specific gesture strategies. Due to challenges in quantifying these curve characteristics precisely, this topic remains open for future research.
5.2 Results of the Linear Regression
Figure 4 illustrates the relationships between cue length and cue-onset length for gesture and breath cues across all participants. Both types of cues showed linear relationships with cue-onset durations. For breath cues, eight outliers were identified due to ambiguous annotations; these were excluded from subsequent regression analyses. Linear regression indicated slopes of 1.53 (gesture position), 2.28 (gesture velocity), 3.05 (gesture acceleration), and 1.78 (breath cue). Interestingly, none of the cues yielded a slope close to 2, suggesting that counting the cue length and subsequent duration as equal units (similar to counting two beats) is not consistently applicable. The velocity and breath cues had slopes closest to 2, but the interval from the end of the velocity cue to onset was slightly longer than the cue itself (1.28 times cue length), while for the breath cue it was slightly shorter (0.78 times cue length). Gesture acceleration showed a strong correlation (0.72) with cue-onset durations, but breath cues exhibited an even stronger correlation (0.78), highlighting their superior reliability for predicting onsets. The higher correlation with velocity and acceleration compared to position aligns with previous studies, suggesting performers primarily perceive velocity or acceleration peaks as cue indicators rather than positional points. However, acceleration peaks closely align with previous position peaks, indicating a potential two-peak encoding pattern in position curves. Precisely quantifying this relationship is challenging and thus remains a topic for future research.
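The factors 1.28 and 0.78 follow from the regression model if the cue is taken to end at the second paired cue point (the minimum peak or the breath offset), which is an interpretation of the definitions in Section 3.1 rather than a statement from the paper. Under that assumption, the interval remaining after the cue is y − x:

```latex
\begin{aligned}
y &= a\,x \\
y - x &= (a - 1)\,x \\
\text{velocity: } a = 2.28 &\;\Rightarrow\; y - x = 1.28\,x \\
\text{breath: } a = 1.78 &\;\Rightarrow\; y - x = 0.78\,x
\end{aligned}
```

For a hypothetical cue length of x = 0.5 s, the model would place the onset 0.64 s after the end of a velocity cue and 0.39 s after the end of a breath cue.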
Additionally, slopes generally decreased slightly with increased tempo, indicating flute onsets occurred sooner than expected based on proportionally shortened cue lengths. Furthermore, correlation coefficients for gesture cues decreased at higher tempos, whereas breath cues exhibited higher correlations at faster tempos, reinforcing the effectiveness of breath cues for synchronization.
Figure 5: Histogram of Note Onset Asynchrony. Onset time differences between the pianist’s onset and 1. Flutist onset (Ground truth) 2-4. Predicted onset from gesture position, velocity, acceleration cue 5. Predicted onset from breath cue.

ID | Group | Gesture Clarity | Gesture Onset Timing | Breath Clarity | Breath Onset Timing | Async (ms) | Abs Async (ms) | STD (ms)
A  | late  | 4.5 | 3.3 | 3.9 | 3.4 |  71 |  71 | 39
A  | good  | 4.0 | 3.6 | 4.8 | 3.0 |  54 |  54 | 31
A  | fast  | 3.8 | 2.4 | 4.4 | 2.6 | -54 | 125 | 66
B  | late  | 4.4 | 3.1 | 3.9 | 3.1 |  71 |  71 | 39
B  | good  | 3.1 | 3.4 | 3.3 | 3.4 |  54 |  54 | 31
B  | fast  | 3.9 | 3.9 | 4.5 | 3.9 | -54 | 125 | 66

Table 1: Result of the Expert Evaluation. Expert ratings of cue clarity and execution timing (1 = late, 3 = on-time, 5 = early) on a 5-point scale, with asynchrony metrics (ms) indicating average group asynchrony.
5.3 Note Onset Asynchrony
5.3.1 Asynchrony Comparison
Figure 5 shows histograms of time differences between the pianist’s onset and five conditions: actual flutist onset (gray) and four predicted onsets from the linear regression model (Section 5.2). This error represents the discrepancy expected if the flutist perfectly followed our linear model and executed the cues precisely. A notable feature in the pianist-flutist asynchrony histogram is the minimal occurrence of trials just before zero milliseconds. This suggests a tendency for the leader (flutist) to start slightly earlier than the follower (pianist), aligning with observations confirmed by expert interviews (Section 5.5.2). Additionally, we consider auditory reaction (performers starting in response to hearing the partner’s onset) unlikely in most cases, as the observed asynchronies are typically smaller than the average auditory reaction time (150 ms) [20].
The ‘Pianist-Flutist asynchrony’ had the narrowest spread (absolute (Abs) mean = 79 ms, standard deviation (STD) = 168 ms), consistent with previous research reporting first-onset asynchronies slightly above 80 ms [9]. Gesture-based predictions showed absolute mean errors between 121 and 168 ms, with acceleration predictions exhibiting higher variability (STD = 247 ms). Breath predictions were more consistent (Abs mean = 109 ms, STD = 209 ms). Although the acceleration cue demonstrated the highest Pearson correlation among the gesture cues, its greater variability and longer cue length contributed to larger errors. These findings suggest that while cue-based linear predictions are slightly less precise than human synchronization, breath cues provide more reliable predictions than gesture-based cues. The precision of predictions compared with actual flutist onsets exhibited similar patterns.
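The summary statistics reported here (signed mean, absolute mean, and standard deviation of onset differences) can be computed as in the sketch below. The sign convention (piano minus reference onset), the use of the population standard deviation, and the function name are assumptions, and the onset times are invented.

```python
from statistics import mean, pstdev


def asynchrony_stats(piano_onsets, reference_onsets):
    """Summarize onset differences in milliseconds.

    reference_onsets may be the actual flute onsets or the onsets predicted
    from a cue; both inputs are in seconds.
    """
    diffs_ms = [(p - r) * 1000.0 for p, r in zip(piano_onsets, reference_onsets)]
    return {
        "mean": mean(diffs_ms),                          # signed asynchrony
        "abs_mean": mean([abs(d) for d in diffs_ms]),    # absolute asynchrony
        "std": pstdev(diffs_ms),                         # population STD (assumed)
    }


# Invented onset times in seconds, for illustration only.
piano = [1.05, 2.10, 2.98, 4.00]
flute = [1.00, 2.00, 3.00, 4.00]
stats = asynchrony_stats(piano, flute)
```

The signed mean can be near zero even when individual trials are far apart, which is why the absolute mean and the spread are reported alongside it.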
5.3.2 Reduction of Asynchrony Through Repetition
We also investigated whether synchronization accuracy improves as pianists and flutists adapt to each other’s cues. As described in Section 3.1.3, we observed asynchrony changes through an initial warm-up session of 12 trials (S1) and three repeated sets of 18 pieces (T1–T3), illustrated in Figure 6. Asynchrony notably decreased from the initial warm-up session (S1) through the second repetition session (T2) but stabilized thereafter. We interpret this as indicating adaptation effects, where performers quickly improved synchronization by familiarizing themselves with each other’s cues. However, ongoing piece variation and unresolved consensus about triggers (Section 5.4) likely prevented further reduction in asynchrony beyond a certain threshold.

Figure 6: Human onset asynchrony across session repetition.
5.4 Triggers
Despite the considerable accuracy of linear predictions (Section 5.2), questions remained regarding precise cue recognition, particularly for velocity and acceleration peaks. Given potential perception errors in identifying cue points, we investigated additional factors musicians might use to ensure precise synchronization. We observed characteristic gesture patterns immediately following cue movements near onset timings. After the downward-upward-downward cue motion, flutists typically executed one of three distinct trigger patterns (Figure 7): (A) initiating onset just before raising the head again, typically aligned with the previous upward cue movement, (B) briefly pausing with the head lowered and starting immediately afterward, or (C) ambiguously initiating onset while slowly raising the head. Patterns (A) and (B) provided clear, immediate signals suitable as precise triggers. Pattern (C), however, represented ambiguous or absent triggers. Without prior agreement, these ambiguous gestures could cause confusion; for example, a flutist intending pattern (A) might be misinterpreted by the pianist as pattern (B), resulting in significant timing discrepancies. Such cases were indeed observed, and the validity of these trigger patterns was further confirmed through expert interviews (Section 5.5). Thus, we propose that musicians initially encode approximate timing through cues and then achieve precise synchronization through clearly defined trigger gestures.

Figure 7: Examples of three trigger types (A, B, C). Green dashed lines indicate breath onset and offset, the blue line marks flute onset, and the red line piano onset. The red shaded area represents the cue interval.
5.5 Expert Evaluation and Interview
After the main analysis, we conducted a two-hour interview with two expert flutists, each with over 20 years of ensemble and teaching experience. The session involved two main tasks: evaluating trials categorized as fast, good, or late based on onset predictions from velocity cues, and assessing our trigger hypothesis. Due to time constraints, only velocity cues were used, as they offered the most interpretable and perceptually reliable basis for cue classification.
5.5.1 Evaluation of Linear Model Predictions
Experts reviewed 8 trials from each group (fast, good, late) to evaluate the clarity of the cues (accuracy) and whether the actual onset timing matched the timing implied by the cue (execution timing). The experts evaluated the 24 randomly ordered trials without being informed of the group labels. After agreeing with the assumption that head-nodding gestures and breathing serve as cues, experts rated cue accuracy on a scale from 1 (no recognizable cue) to 5 (clearly recognizable cue) and execution timing from 1 (very late) to 5 (very early), with 3 representing precise timing. Results are summarized in Table 1.
Surprisingly, the experts differed notably in their evaluations of breath cue execution timing, which also did not align clearly with actual pianist-flutist asynchrony. Expert A rated the ‘good’ group’s gesture cues as slightly late but breath cues as most accurate. Expert B rated the groups in descending order (late-good-fast) of perceived lateness but still considered the ‘late’ group relatively early (average 3.1). Actual asynchrony partly aligned with our predictions; the ‘good’ group exhibited the smallest asynchrony, consistent with the cue hypothesis. However, unexpected discrepancies emerged, such as the ‘fast’ group showing later flute onsets than the piano. These inconsistencies likely reflect individual variations in interpreting and executing cues.
5.5.2 In-depth Interview
After the video evaluation, we conducted detailed discussions about the linear prediction model and the trigger concept. Experts acknowledged the general practice of cueing approximately one beat ahead, noting that specific movements were intuitively executed rather than explicitly planned. Both emphasized tempo-related influences on gesture and breath cues, highlighting gesture periodicity as crucial for clear cue delivery.
Regarding triggers, experts initially did not consciously recognize different trigger types but agreed with the proposed classifications after reviewing examples. Both agreed type (A) was optimal, while type (B) was considered challenging due to the flute’s physical constraints. Opinions on type (C) diverged: Expert A considered it inherently prone to higher errors, whereas Expert B believed it could be viable when accompanied by precise breath cues and rehearsal. Experts noted that triggers may vary depending on musical context (phrasing, emphasis). They also observed potential triggers, including the flute’s endpoint position, lip shape, and finger movements. Finally, both proposed that in flute-piano duets, slight delays in piano onset might cognitively benefit synchronization, potentially explaining the scarcity of trials where piano onset preceded flute onset, as observed in Figure 5.
5.6 Limitation
Although this study contributes to understanding cue-based synchronization, several limitations remain. First, our trigger analysis was not fully quantitative; future research should develop precise methods for defining triggers from gesture curves and consider additional signals such as lip shape, finger movements, and horizontal flute actions. Second, the decision-making process for selecting triggers was not investigated. Exploring how musicians choose and agree on triggers could further clarify synchronization strategies. Additionally, individual variability in cue and trigger preferences was not quantitatively examined; future research could explore personalized synchronization strategies. Lastly, our findings apply specifically to flute-piano duets and piece initiation. Generalizing these methods to other instrument combinations and ongoing musical contexts remains necessary.
6. CONCLUSION
This study investigated how gesture and breath cues used by flutists in flute-piano duets encode note onset timing at the initiation of musical pieces. To quantitatively analyze cues, we collected a cue dataset, identifying cue points via motion tracking and breath sounds. Our multimodal analysis revealed linear relationships between cue lengths and onset timings, confirming that gesture and breath cues reliably predict onset timing. In addition, we introduced the concept of triggers, defined as immediate gestures that indicate precise onset moments, which we validated through expert interviews. Future research could further develop the analysis of triggers to achieve a more systematic understanding. Additionally, developing regression models that jointly use gesture and breath cues, or exploring sophisticated predictive models, would be valuable. Future work could also extend our cue analysis methods to other instruments, ensemble configurations, and within-performance synchronization, providing deeper insights into the complexity of ensemble coordination.
7. ETHICS STATEMENT
This study was approved by the Institutional Review Board (IRB), and all participants consented to video recording and data sharing. Each flutist’s session lasted 1.5 hours, and pianists had 30-minute breaks between sessions, with up to four sessions per day. Participants received appropriate compensation. To protect privacy, the video data will remain confidential.
8. ACKNOWLEDGMENTS
This work has been supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) under Grant RS-2023-NR077289 and Grant RS-2024-00358448.
9. REFERENCES
[1] C.-J. Tsay, “Sight over sound in the judgment of music performance,” Proceedings of the National Academy of Sciences, vol. 110, no. 36, pp. 14580–14585, 2013.
[2] L. Bishop, C. Cancino-Chacón, and W. Goebl, “Moving to communicate, moving to interact: Patterns of body motion in musical duo performance,” Music Perception: An Interdisciplinary Journal, vol. 37, no. 1, pp. 1–25, 2019.
[3] F. K. Hukporti, Your Guide to Basic Conducting. Accra: Noyam Publishers, 2023.
[4] R. Page-Shipp, D. Joseph, and C. van Niekerk, “Conductorless singing group: a particular kind of self-managed team?” Team Performance Management: An International Journal, vol. 24, no. 5/6, pp. 331–346, 2018.
[5] A. Maezawa and K. Yamamoto, “MuEns: A multimodal human-machine music ensemble for live concert performance,” in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2017, pp. 4290–4301.
[6] X. Gao, A. Rogel, R. Sankaranarayanan, B. Dowling, and G. Weinberg, “Music, body, and machine: gesture-based synchronization in human-robot musical interaction,” Frontiers in Robotics and AI, vol. 11, 2024.
[7] A. Lim, T. Mizumoto, L.-K. Cahier, T. Otsuka, T. Takahashi, K. Komatani, T. Ogata, and H. G. Okuno, “Robot musical accompaniment: integrating audio and visual cues for real-time synchronization with a human flutist,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010, pp. 1964–1969.
[8] L. Bishop and W. Goebl, “When they listen and when they watch: Pianists’ use of nonverbal audio and visual cues during duet performance,” Musicae Scientiae, vol. 19, no. 1, pp. 84–110, 2015.
[9] L. Bishop and W. Goebl, “Beating time: How ensemble musicians’ cueing gestures communicate beat position and tempo,” Psychology of Music, vol. 46, no. 1, pp. 84–106, 2018.
[10] B. Vera, E. Chew, and P. G. Healey, “A study of ensemble synchronisation under restricted line of sight,” in International Society for Music Information Retrieval Conference (ISMIR), 2013, pp. 293–298.
[11] R. Timmers, S. Endo, A. Bradbury, and A. M. Wing, “Synchronization and leadership in string quartet performance: a case study of auditory and visual cues,” Frontiers in Psychology, vol. 5, p. 645, 2014.
[12] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, “Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 522–535, 2018.
[13] S. Kawase, “Gazing behavior and coordination during piano duo performance,” Attention, Perception, & Psychophysics, vol. 76, pp. 527–540, 2014.
[14] Y.-H. Su, “Audiovisual beat induction in complex auditory rhythms: Point-light figure movement as an effective visual beat,” Acta Psychologica, vol. 151, pp. 40–50, 2014.
[15] G. Luck and J. A. Sloboda, “Spatio-temporal cues for visually mediated synchronization,” Music Perception, vol. 26, no. 5, pp. 465–473, 2009.
[16] B. K. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, no. 1-3, pp. 185–203, 1981.
[17] B. Li, K. Dinesh, Z. Duan, and G. Sharma, “See and listen: Score-informed association of sound tracks to players in chamber music performance videos,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2906–2910.
[18] B. Li, C. Xu, and Z. Duan, “Audiovisual source association for string ensembles through multi-modal vibrato analysis,” Proc. Sound and Music Computing (SMC), pp. 159–166, 2017.
[19] V. Bazarevsky, I. Grishchenko, K. Raveendran, T. Zhu, F. Zhang, and M. Grundmann, “Blazepose: On-device real-time body pose tracking,” arXiv preprint arXiv:2006.10204, 2020.
[20] R. J. Kosinski, “A literature review on reaction time,” Clemson University, vol. 10, no. 1, pp. 337–344, 2008.