
Exploring Network Adaptations for Minimum Latency Real-Time Piano Transcription

Author: Patricia Hu; Silvan Peter; Jan Schlüter; Gerhard Widmer
Publisher: Zenodo
DOI: 10.5281/zenodo.17706339
Source: https://zenodo.org/records/17706339/files/000010.pdf
EXPLORING SYSTEM ADAPTATIONS FOR
MINIMUM LATENCY REAL-TIME PIANO TRANSCRIPTION
Patricia Hu¹, Silvan David Peter¹, Jan Schlüter¹, Gerhard Widmer¹,²
¹ Institute of Computational Perception, Johannes Kepler University Linz, Austria
² LIT AI Lab, Linz Institute of Technology, Austria
[email protected]
ABSTRACT
Advances in neural network design and the availability of large-scale labeled datasets have
driven major improvements in piano transcription. Existing approaches target either offline
applications, with no restrictions on computational demands, or online transcription, with
delays of 128–320 ms. However, most real-time musical applications require latencies below
30 ms. In this work, we investigate whether and how the current state-of-the-art online
transcription model can be adapted to real-time piano transcription. Specifically, we
eliminate all non-causal processing, and reduce computational load through shared
computations across core model components and variations in model size. Additionally, we
explore different pre- and postprocessing strategies, and related label encoding schemes,
and discuss their suitability for real-time transcription. Evaluating the adaptations on the
MAESTRO dataset, we find a drop in transcription accuracy due to strictly causal processing
as well as a tradeoff between the preprocessing latency and prediction accuracy. We release
our system as a baseline to support researchers in designing models towards minimum latency
real-time transcription.
1. INTRODUCTION
Automatic music transcription (AMT) is the task of transforming audio signals into their
symbolic music representation, and is commonly referred to as one of the holy grails in
Music Information Retrieval (MIR), given its role in linking the audio and symbolic domain,
as well as its relevance to various downstream tasks and musical applications [1,2]. The
transcription of piano solo music is among the most extensively studied tasks, driven by the
instrument's well-defined onset characteristics and the availability of large-scale,
strongly labeled training data [1]. Consequently, research on automatic piano transcription
has seen substantial progress. Notable contributions include [1–4], with the former two
setting new benchmarks by leveraging large architectures, increased model complexity,
extended training, and a novel regression-based target encoding approach, leading to
higher-resolution piano transcription.

© P. Hu, S. Peter, J. Schlüter and G. Widmer. Licensed under a Creative Commons Attribution
4.0 International License (CC BY 4.0). Attribution: P. Hu, S. Peter, J. Schlüter and
G. Widmer, "Exploring System Adaptations for Minimum Latency Real-Time Piano Transcription",
in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South
Korea, 2025.

Progress in automatic piano transcription has focused almost exclusively on offline methods,
with only a few investigations attempting to solve this task online [5–7]. These online
systems typically reconfigure offline approaches for block-wise updates while keeping audio
representations and network structure, and achieve latencies between 128 and 320 ms. Latency
values in the hundreds of milliseconds are suitable for musical applications such as
subtitling, page turning, or visualizations. However, most interactive musical applications
require lower latencies: for digital instruments, a commonly named bound is 10 ms [8–10],
and for networked ensemble playing, latencies of up to 30 ms are acceptable [11–13]. A
real-time transcription model should enable fluent musical interaction, for which we take
30 ms as a minimal requirement, and 10 ms as a goal for imperceptible latency. Latency stems
from various sources: audio buffering, preprocessing, model inference and postprocessing.
For an automatic piano transcription system to perform inference in real-time, it must
minimize latency in all sources. It should adhere to strict causality requirements in both
the model architecture and the postprocessing, ensuring that both rely exclusively on past
information.
In this work, we adapt a state-of-the-art model to real-time transcription towards minimal
latency. We do so by allowing only causal processing within the model and reducing
computational load by sharing computations across core components of the model. We further
investigate the latencies incurred by widely used pre- and postprocessing strategies, and
explore options to mitigate these using a combination of adapted STFT processing, label
encoding schemes, loss functions, and causal postprocessing. Our contribution is two-fold:
First, we examine the sources of latency and causality violations in the current
state-of-the-art system in online piano transcription, and propose and evaluate changes in
the modeling and pre- and postprocessing stages for efficient real-time transcription.
Second, we provide an open-source basis for low-latency, real-time automatic piano
transcription, inviting further research and development in this area.
The remainder of this article is structured as follows: In Section 2 we point to related
work, one of which will form the starting point for our adaptations towards minimum latency,
and which we will therefore focus on in greater detail in Section 3. In Section 4 we present
our latency-minimizing adaptations and report the experiments conducted to assess their
effect on transcription accuracy. We discuss the challenges and lessons learned, and give an
outlook to future work, in Section 5.
2. RELATED WORK
As outlined in the previous section, both online and real-time transcription remain largely
unexplored. We will discuss three noteworthy contributions [5–7].
Fernandez [6] proposes a purely convolutional modeling approach that focuses solely on onset
and velocity prediction, achieving a latency between 4 and 9 seconds.
Kwon et al. [7, 14, 15] propose an autoregressive neural network for efficient online piano
transcription. The architecture comprises two main components: an acoustic module consisting
of a stack of convolutional layers with frequency-conditioned FiLM layers, and a note
sequence module consisting of pitchwise LSTMs, a multi-state softmax output (for different
note states: onset, sustain, re-onset, offset, and off), and an autoregressive connection
from the note state output of the previous time step to the current sequence module input.
The authors propose various architectures that balance accuracy and latency. Overall, their
models achieve latencies from 128 to 320 ms.
Kusaka and Maezawa [5] introduce Mobile-AMT, a framework designed to tackle both real-time
processing and generalization to unseen recording environments in automatic piano
transcription. They optimize a state-of-the-art offline transcription model [2] by replacing
its conventional convolutional components with lightweight, efficient alternatives [16] and
train it using a data augmentation scheme that simulates four distinct acoustic distribution
shifts. The resulting model sets the current state of the art in online automatic piano
transcription, achieving F1 scores comparable to offline state-of-the-art models while being
robust to in-the-wild recordings. Its latency is reported as 174 ms, but an apparent
oversight in the architecture increases latency to 10 s. We use this method as our starting
point and detail it in the next section.
3. STARTING POINT
As Mobile-AMT [5] represents the current state of the art in real-time piano transcription,
we use this method as the foundation and reference method for our adaptations towards
minimum-latency transcription. To provide context for these modifications, we first outline
the structure of the model, and particularly focus on the modeling aspects that violate
causality requirements of real-time systems and necessitate adaptation.
3.1 Model Architecture
Mobile-AMT [5] is a lightweight adaptation of the state-of-the-art offline piano
transcription model by Kong et al. [2]. It replaces all conventional convolutional blocks
with MobileNet [16] equivalents, which consist of depthwise separable convolutions that
reduce computational complexity while maintaining representational power. Additionally,
Mobile-AMT removes all bi-directional flows in recurrent model layers, and drops one of the
originally four acoustic model core stacks (the one for note offset prediction, which is
instead conditioned on the frame and onset output; see footnote 1). All optimization and
activation layers are retained from the original offline model. With these modifications,
the authors report a resulting latency of 174 ms, and argue their model to be capable of
real-time inference.
Apart from depthwise separable convolutions, Mobile-AMT also adopts MobileNet's
Squeeze-and-Excitation (SE) layers to dynamically recalibrate channel-wise features. The
squeeze operation involves global pooling over all spatial dimensions, and therefore relies
on information from the entire feature map, making the operation non-causal. As Mobile-AMT
processes its input in 10-second blocks, the squeeze operation adds 10 seconds of latency,
which is not accounted for in the authors' calculations.
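To make the causality issue concrete, the following minimal sketch contrasts a block-wide squeeze with a running mean. It is an illustration under simplifying assumptions, not the Mobile-AMT code: pooling is done over the time axis only, the array sizes are hypothetical, and `running_squeeze` is a made-up contrast case rather than something proposed in the paper.

```python
import numpy as np

def global_squeeze(features):
    """SE-style squeeze: one mean per channel over the whole block (non-causal)."""
    return features.mean(axis=-1)

def running_squeeze(features):
    """Contrast case: mean over the frames seen so far, one value per frame (causal)."""
    counts = np.arange(1, features.shape[-1] + 1)
    return np.cumsum(features, axis=-1) / counts

rng = np.random.default_rng(0)
block = rng.standard_normal((4, 500))   # (channels, frames); hypothetical sizes
changed = block.copy()
changed[:, -1] += 10.0                  # alter only the final frame of the block

# The block-wide statistic rescales every frame, including the first one,
# so a change at the end of the block alters the output for the beginning.
global_differs = not np.allclose(global_squeeze(block), global_squeeze(changed))

# The running statistic for all earlier frames does not depend on the last frame.
past_unchanged = np.allclose(running_squeeze(block)[:, :-1],
                             running_squeeze(changed)[:, :-1])
print(global_differs, past_unchanged)
```

Because the block-wide mean is only available once the last frame of the block has arrived, no output frame can be emitted before the block is complete.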
3.2 Postprocessing
Mobile-AMT uses the same regression target encoding format as its underlying offline
transcription model [2], resulting in incrementally increasing and decreasing onset and
offset targets over a sequence of frames instead of binary, pointwise classification
targets.
While this target encoding format allows for high, sub-frame resolution of onset and offset
detection, it necessitates non-causal postprocessing, as the detection of an onset/offset
relies on both past and future frames. The authors of Mobile-AMT adopt the same
postprocessing strategy as [2], and account for it in their overall latency calculation.
3.3 Training and Evaluation Setup
Mobile-AMT is trained on 10-second segments of 16 kHz audio, which are transformed into
229-bin mel spectrograms after an STFT with a 2048-sample Hann window and a hop size of
320 samples. The loss function for the onset, offset and frame targets is binary
cross-entropy (BCE), and velocities are trained using mean squared error (MSE).
Mobile-AMT proposes a data augmentation scheme for training to enhance robustness to
real-world, in-the-wild recordings; training duration therefore depends on the data
augmentations applied. For the non-augmented baseline, Mobile-AMT is trained for 3000 epochs
on the MAESTRO dataset [1] with a batch size of 16 and uses the Adam optimizer with a
learning rate of 0.001 annealed to 0 following a cosine schedule [12].
4. REAL-TIME ADAPTATIONS
As our main contribution, we implement and evaluate adaptations to the starting point that
aim to reduce the system's latency. We form three groups of experiments: adapting the
training and postprocessing, adapting the audio preprocessing, and adapting the network
architecture. We conduct each experiment with a reduced training time of 500 epochs (30k
updates, approximately one-sixth of the total training time reported in our reference method
[5]) to save computational resources, as most effects become evident already during the
early training stages. We perform one final comparison of a model in which we combine
selected modifications, and compare this to our reference method, with both models trained
for 2000 epochs.

Footnote 1: The offset prediction is omitted during inference, but no additional details are
provided on whether or how the postprocessing is adjusted to account for this missing
information.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

For all our experiments, unless otherwise noted, we train on 3-second segments of audio at
16 kHz, transformed into 229-bin mel spectrograms after an STFT with a Hann window of 2048
samples and a hop size of 160 samples. We double the frame rate compared to Mobile-AMT to
allow for more frequent updates, and therefore reduced latency.
Apart from the training duration and data encoding, we follow the same training scheme as
used in the non-augmented setup in our reference method, i.e., all adapted models are
trained and evaluated on the MAESTRO 3.0 dataset using mir_eval [17]. As our work focuses on
minimizing delay in real-time transcription, we mostly focus on the note onset metrics.
Except for the final comparison, we only evaluate on the validation set. Since our goal is
to achieve minimal latency for real-time musical interactions, we assess inference
performance on stricter timing tolerances (10–30 ms) to reduce the mismatch between system
evaluation and target performance.
4.1 Training and Postprocessing
In Mobile-AMT, the training target for an onset or offset extends over multiple time steps,
forming a triangle centered on the target annotation. Originally proposed by Kong et al.
[2], such triangles allow expressing positions at a higher resolution than the frame rate.
However, predicting such triangles requires lookahead in the model, as the output must
increase before the actual event. Furthermore, interpreting the predictions requires
lookahead in postprocessing in order to find each triangle's maximum. As a preparation for
switching to a causal version of Mobile-AMT with causal postprocessing, we thus replace the
triangular targets with binary targets that are active only at the frame that is closest to
an annotation.
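The difference between the two target encodings can be sketched as follows. This is a simplified illustration, not the encoding code of [2] or [5]: the triangle half-width of 5 frames, the frame rate argument and both function names are assumptions chosen for the example.

```python
import numpy as np

def binary_targets(onset_times, n_frames, fps=100.0):
    """Pointwise targets: only the frame closest to each annotation is active."""
    target = np.zeros(n_frames)
    idx = np.rint(np.asarray(onset_times) * fps).astype(int)
    idx = idx[(idx >= 0) & (idx < n_frames)]
    target[idx] = 1.0
    return target

def triangular_targets(onset_times, n_frames, fps=100.0, half_width=5):
    """Triangle centred on each annotation (half_width is a hypothetical value)."""
    frames = np.arange(n_frames)
    target = np.zeros(n_frames)
    for t in onset_times:
        dist = np.abs(frames - t * fps)   # distance to the annotation in frames
        target = np.maximum(target, np.clip(1.0 - dist / half_width, 0.0, 1.0))
    return target

onsets = [0.25]                           # one onset at 250 ms, i.e. frame 25
tri = triangular_targets(onsets, n_frames=50)
pointwise = binary_targets(onsets, n_frames=50)

# The triangle is already non-zero before the event, which a causal model
# cannot know about; the pointwise target is zero until the event frame.
print(tri[21:26], pointwise[21:26])
```

The rising flank of the triangle is exactly the part that requires lookahead, which is why the pointwise encoding is the one compatible with causal processing.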
As the changed label encoding results in a heavily imbalanced binary classification task,
particularly for onset and offset targets, we try weighting their positive occurrences with
a factor of 10 in the binary cross-entropy loss [18]. Additionally, pointwise classification
targets penalize small temporal deviations more strongly than triangular targets. To account
for this, we test applying a shift-tolerant loss recently proposed for beat detection [19],
with a tolerance of ±1 frame (±10 ms at our frame rate).
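As a rough illustration of the positive weighting, the sketch below scales the loss term of active frames by a factor of 10. It is a plain NumPy stand-in, not the training code of the paper; the function name and the `eps` clipping are assumptions, and the shift-tolerant loss of [19] is deliberately not reproduced here.

```python
import numpy as np

def weighted_bce(probs, targets, pos_weight=10.0, eps=1e-7):
    """Binary cross-entropy in which positive frames count pos_weight times."""
    probs = np.clip(probs, eps, 1.0 - eps)          # avoid log(0)
    loss = -(pos_weight * targets * np.log(probs)
             + (1.0 - targets) * np.log(1.0 - probs))
    return loss.mean()

# One active and one inactive frame, both predicted with probability 0.5:
probs = np.array([0.5, 0.5])
targets = np.array([1.0, 0.0])
unweighted = weighted_bce(probs, targets, pos_weight=1.0)
weighted = weighted_bce(probs, targets, pos_weight=10.0)
print(unweighted, weighted)
```

With the weight applied, a missed onset costs as much as ten false activations of the same confidence, which counteracts the rarity of active frames in pointwise targets.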
We also modify the postprocessing to ensure that the prediction for the current frame relies
only on past information. First, we binarize the output probabilities: for onset and offset
targets, an activation is detected when the current frame exceeds a given threshold while
the previous frame does not. A frame activation is recorded if the current frame surpasses
the threshold. Next, we eliminate re-onsets of the same pitch that occur within a predefined
minimum re-onset distance. Finally, we determine the offset of an active note based on the
earlier occurring one of either frame inactivity or offset activity.

Exp.   Tol. 10 ms      Tol. 20 ms      Tol. 30 ms
Postprocessing onset threshold: 0.45
TP1     9.58 ±2.29     25.80 ±4.83     47.44 ±7.30
TP2    29.84 ±8.37     47.64 ±11.76    50.28 ±11.83
TP3    13.95 ±4.91     34.34 ±8.05     45.10 ±8.33
TP4    27.24 ±7.47     55.49 ±9.73     64.95 ±10.06
TP5    16.00 ±5.32     41.02 ±8.50     55.40 ±8.53
Postprocessing onset threshold: 0.55
TP1    13.42 ±2.60     35.07 ±5.31     56.02 ±8.14
TP2    23.56 ±8.41     35.50 ±11.44    36.87 ±11.49
TP3    17.04 ±5.52     40.54 ±8.26     51.46 ±8.29
TP4    27.42 ±7.52     54.95 ±10.18    63.67 ±10.72
TP5    17.46 ±5.59     44.19 ±8.53     58.68 ±8.61
Postprocessing onset threshold: 0.65
TP1    19.10 ±3.12     43.88 ±7.02     59.93 ±9.77
TP2    14.32 ±6.70     20.52 ±8.68     21.07 ±8.70
TP3    20.70 ±5.90     46.48 ±8.28     56.72 ±8.28
TP4    27.18 ±7.65     53.65 ±10.71    61.55 ±11.36
TP5    18.74 ±5.80     46.80 ±8.60     61.05 ±8.89
Table 1. Comparison of note onset F1 scores (mean ± std each over three experimental runs)
on the MAESTRO v.3 validation set across three onset tolerances for different target
encodings, (weighted) loss functions and onset thresholds in postprocessing.

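The onset part of this causal postprocessing can be sketched as a frame-by-frame loop that only ever looks at the current and the previous frame. This is a minimal sketch under assumptions: the function name, the threshold of 0.5 and the minimum re-onset distance of 3 frames are placeholders, and the offset rule is left out for brevity.

```python
import numpy as np

def causal_onsets(probs, threshold=0.5, min_reonset_frames=3):
    """Detect onsets from (frames, pitches) probabilities using past frames only."""
    n_frames, n_pitches = probs.shape
    last_onset = np.full(n_pitches, -np.inf)       # frame of last accepted onset
    prev_active = np.zeros(n_pitches, dtype=bool)
    events = []
    for t in range(n_frames):                      # one step per incoming frame
        active = probs[t] > threshold
        rising = active & ~prev_active             # above threshold now, not before
        for p in np.flatnonzero(rising):
            if t - last_onset[p] >= min_reonset_frames:
                events.append((t, int(p)))         # emit immediately, no lookahead
                last_onset[p] = t
        prev_active = active
    return events

# Single pitch: crossings at frames 1, 3 and 6; frame 3 is too close to frame 1.
demo = np.array([[0.1], [0.9], [0.1], [0.9], [0.1], [0.1], [0.9]])
print(causal_onsets(demo))
```

Each event is emitted in the same iteration in which its frame arrives, so the postprocessing itself adds no delay beyond the frame hop.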
We test the change in label encoding in combination with different (weighted) loss functions
in five different experimental setups: In TP1 we train our reference model with the original
loss functions and regression target encoding scheme as proposed by the authors [5]. In TP2
we use classification targets, and in TP3 we additionally apply a weight factor of 10 on
both the onset and offset targets. Finally, we apply the shift-tolerant BCE loss (with a
tolerance of ±10 ms), either unweighted (TP4) or weighted again by a factor of 10 (TP5). All
five setups use our strictly causal postprocessing described above. The model architecture
remains as proposed in Mobile-AMT.
Table 1 lists the note onset F1 scores on the MAESTRO 3.0 validation set for the different
training targets and loss functions, combined with different onset threshold values applied
during postprocessing, as evaluated on different onset tolerance thresholds. We choose to
evaluate at lower tolerance thresholds as they reflect a real-time system's practical
responsiveness better than the commonly used ±50 ms, which might mask latency issues by
crediting the model for detections that would feel delayed to a user in an interactive
scenario. Learning pointwise binary targets without any further modification (TP2) proves
superior to short sequential regression (TP1), provided a low enough onset threshold during
postprocessing. The effect of lower onset detection thresholds is somewhat mitigated by
weighting positive labeled examples higher (TP3), and virtually fully eliminated by using a
shift-tolerant loss (TP4). There is no further improvement when combining weights and
shift-tolerance (TP5).

Figure 1. Mobile-AMT centers STFT windows on the time point to predict for, incurring a
delay of 1024 samples. We shift the window to reduce the delay to 160 samples, and change
the window function to better use that limited amount of future information.

For the next experiments, we use binary targets with weights. While the shift-tolerant loss
performed favorably here, our goal is to use a causal model, for which shift tolerance could
result in systematically delayed predictions.
4.2 Audio Preprocessing
Mobile-AMT processes audio in spectrogram frames of 2048-sample windows centered on the time
points to predict events for. Thus, even with a causal model that does not process
information from future frames, in real-time inference, every time a complete audio buffer
of 2048 samples is filled, inference will be triggered to predict events that are already
1024 samples (64 ms) in the past. Figure 1 illustrates this: the top row shows an audio
waveform, the second row a typical STFT window centered at the onset. Shorter filters reduce
this delay, as the delay is fixed to half the filter length, but this comes at the cost of
lower frequency resolution, which we should avoid: With a 2048-sample STFT at 16 kHz, the
bin width is 7.8 Hz, which is already too coarse to achieve semitone precision for the
lowest piano notes (A0 at 27.5 Hz, B♭0 at 29.14 Hz).
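The numbers in this paragraph follow from two short formulas: the bin width is the sample rate divided by the window length, and the delay of a centered window is half its length. The sketch below reproduces them; the helper names are illustrative, and the MIDI-to-frequency conversion assumes equal temperament with A4 = 440 Hz.

```python
SAMPLE_RATE = 16_000  # Hz, as used in the paper

def bin_width_hz(window, sample_rate=SAMPLE_RATE):
    """Spacing of STFT frequency bins for a given window length in samples."""
    return sample_rate / window

def centered_delay_ms(window, sample_rate=SAMPLE_RATE):
    """Delay of a centered window: half the window length, in milliseconds."""
    return 1000.0 * (window / 2) / sample_rate

def midi_to_hz(midi_note):
    """Equal-tempered pitch with A4 (MIDI 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12)

a0, b_flat0 = midi_to_hz(21), midi_to_hz(22)
semitone_gap_hz = b_flat0 - a0          # spacing of the two lowest piano notes

for window in (2048, 1024, 512):
    # Halving the window halves the delay but doubles the bin width
    print(window, bin_width_hz(window), centered_delay_ms(window))
print(a0, round(b_flat0, 2), round(semitone_gap_hz, 2))
```

For the 2048-sample window this gives 7.8125 Hz bins and 64 ms of delay, while the two lowest piano notes are less than 2 Hz apart, so any shorter window only worsens the pitch resolution at the low end.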
We can however reduce this delay to a lower number n_s of samples (e.g., 160 samples or
10 ms) while keeping the window length and frequency resolution unchanged, by shifting
windows so they end n_s samples after their reference point instead of being centered. The
third row in Figure 1 shows the Hann window shifted to n_s = 160 samples. The Hann window
strongly attenuates the boundaries, the right one of which now contains highly relevant
information for the prediction. To mitigate this unwanted attenuation, we can replace the
Hann window with an asymmetric window that tapers (2048 − n_s) samples before and n_s
samples after the reference point. The last row in Figure 1 illustrates this windowing
function for a 1888/160 sample asymmetry. Note how we keep more information from the
incoming samples in the gray shaded area under the window function, albeit at the cost of
increasing spectral leakage (by about 20 dB for n_s = 160).
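One way to build such a window is to join the rising half of a long Hann window with the falling half of a short one. The text does not spell out the exact window shape, so the construction below is an assumption for illustration, and the function name is made up for this sketch.

```python
import numpy as np

def asymmetric_window(length=2048, n_after=160):
    """Window peaking n_after samples before its end (assumed construction)."""
    n_before = length - n_after
    rise = np.hanning(2 * n_before)[:n_before]   # rising half of a long Hann
    fall = np.hanning(2 * n_after)[n_after:]     # falling half of a short Hann
    return np.concatenate([rise, fall])

window = asymmetric_window()
hann = np.hanning(2048)

# Weight given to the 160 most recent samples by each window
recent_asym = window[-160:].sum()
recent_hann = hann[-160:].sum()
print(len(window), int(np.argmax(window)), recent_asym > recent_hann)
```

The peak sits next to the reference point 1888 samples into the window, and the most recent 160 samples receive far more weight than under a Hann window of the same length, at the price of a much steeper right flank.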
Exp.   Window   Delay    Tol. 10 ms     Tol. 20 ms     Tol. 30 ms
H1     Hann     64 ms    17.43 ±5.10    34.65 ±7.40    39.86 ±7.45
H2     Hann     10 ms     0.00 ±0.01     0.00 ±0.02     0.04 ±0.10
T1     asym.    10 ms    22.25 ±5.17    25.21 ±5.20    25.84 ±5.18
T2     asym.    20 ms    28.61 ±7.09    33.76 ±7.01    34.65 ±6.89
T3     asym.    30 ms    27.61 ±6.79    37.91 ±7.54    39.43 ±7.33
T4     asym.    40 ms    24.39 ±6.50    37.51 ±7.81    39.87 ±7.60
T5     asym.    50 ms    20.99 ±6.02    36.88 ±7.75    40.47 ±7.59
ST     asym.    10 ms     0.41 ±0.40     1.57 ±0.82    11.86 ±4.25
Table 2. Note onset F1 scores on the MAESTRO v.3 validation set for different windowing
functions. The last row additionally uses a shift-tolerant training loss.
To experiment with different windowing configurations for reducing the delay in audio
preprocessing, we modify our reference method to apply only causal processing, as allowing
the model access to future frames would render our interventions meaningless. Specifically,
we make each convolution causal, so the model's receptive field of 9 frames extends 8 frames
into the past, rather than splitting 4 frames into the past and 4 frames into the future.
Additionally, we remove the Squeeze-Excitation layers of the MobileNetV3 blocks, which
perform global average pooling over both past and future frames in an excerpt.
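The effect of making a convolution causal can be shown with padding alone. The sketch below is an illustration under assumptions: it uses a single 9-tap kernel along the time axis, whereas in the model the 9-frame receptive field results from the stacked layers, and the function name is hypothetical.

```python
import numpy as np

def conv1d_time(x, kernel, causal=True):
    """Filter a 1-D frame sequence; the output has the same length as the input."""
    k = len(kernel)
    # Causal: pad only on the left, so output t sees frames t-(k-1) .. t.
    # Centered: pad on both sides, so output t sees frames before and after t.
    pad = (k - 1, 0) if causal else ((k - 1) // 2, k // 2)
    padded = np.pad(x, pad)
    windows = np.lib.stride_tricks.sliding_window_view(padded, k)
    return windows @ kernel

x = np.zeros(40)
x[20] = 1.0                      # a single event at frame 20
kernel = np.ones(9)

causal_out = conv1d_time(x, kernel, causal=True)
centered_out = conv1d_time(x, kernel, causal=False)

# First output frame that is influenced by the event
print(np.flatnonzero(causal_out)[0], np.flatnonzero(centered_out)[0])
```

With centered padding the event at frame 20 already influences the output at frame 16, i.e. the layer needs 4 frames of lookahead, whereas with left padding the first affected output is frame 20 itself.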
Table 2 shows our results. The original centered Hann window with our causal model (H1)
performs worse than our non-causal starting point (TP3 in Table 1). Shifting the Hann window
from a delay of 64 ms to a delay of 10 ms (H2) seems to completely attenuate usable
information in the frames. Using an asymmetric window (T1) improves performance, but still
falls behind the centered Hann window. Successively increasing the delay up to 50 ms, we see
a strong improvement (T2–T5 and Figure 2). For a delay of 30 ms or more, we match
performance of the centered Hann window at an evaluation onset tolerance of 30 ms. For
stricter tolerances, shifted asymmetric windows of 20 ms delay or more surpass the centered
Hann window.
We also take the chance to investigate how a shift-tolerant loss of ±1 frame affects results
for the causal model. The loss could allow the model to systematically predict events one
frame (10 ms) later than annotated. Surprisingly, using an asymmetric window with 10 ms of
delay, we find that the shift-tolerant loss (ST) performs on par with 30 ms delay (T3) when
admitting an evaluation tolerance of 50 ms (not shown in table), but breaks down with any
stricter tolerance (as seen in the table).
For the third group of experiments, we keep the strictest setting with asymmetric windows at
a delay of 10 ms.
4.3 Model Architecture
In our final group of experiments, we investigate architectural modifications to our
reference model. The architecture of Mobile-AMT consists of three acoustic stacks, each
consisting of recurrent convolutional blocks. Each stack learns one target (onset, frame or
velocity). For some targets, the outputs of multiple stacks are concatenated to condition
the final predictions. Compared to their reference offline model [2], Mobile-AMT omits the
acoustic stack for the offset target, reusing the stack for the frame target.

[Figure 2 plot: "Asymmetric window with different amounts of shift"; x-axis: onset tolerance
(ms) at 10, 20, 30 and 50; y-axis: mean note onset F1 score; one curve each for shifts of
10 ms, 20 ms, 30 ms, 40 ms and 50 ms.]
Figure 2. Note onset F1 scores (means only) for different window delays and onset tolerance
thresholds.

Exp.   Tol. 10 ms     Tol. 20 ms     Tol. 30 ms
Note onset F1 (mean ± std)
A1     20.23 ±4.29    23.03 ±4.25    23.75 ±4.24
A2     25.77 ±4.65    30.66 ±4.98    31.79 ±4.93
A3     21.59 ±4.79    24.48 ±4.83    25.14 ±4.80
A4     21.53 ±4.31    24.51 ±4.24    25.22 ±4.19
A5     19.44 ±4.20    22.39 ±4.34    23.12 ±4.30
A6     25.82 ±4.48    30.52 ±4.51    31.56 ±4.44
Note onset and offset F1 (mean ± std)
A1      3.25 ±1.31     5.13 ±2.54     7.20 ±3.51
A2      3.56 ±1.29     6.17 ±2.80     8.39 ±4.01
A3      5.84 ±1.41     5.84 ±2.48     7.71 ±3.43
A4      5.94 ±1.31     5.94 ±2.48     7.89 ±3.49
A5      4.84 ±1.11     4.84 ±2.36     6.61 ±3.32
A6      6.79 ±1.15     6.79 ±2.53     8.89 ±3.59
Table 3. Note onset and onset-and-offset F1 scores on the MAESTRO v.3 validation set across
three onset tolerances for architectural modifications and input representations.

Our experiments involve the following adaptations, each tested independently: First, in A1
we (re)introduce a separate offset acoustic stack to explore whether and how it improves
offset label prediction. In A2 we remove the velocity conditioning on the onsets. Next, we
examine whether further streamlining the architecture by sharing a fourth (A3), half (A4) or
all (A5) of the convolutional blocks in the model's acoustic stacks affects performance.
Lastly, in A6 we examine the effect of training on the original 10-second sequence length.
Table 3 presents the note onset and note onset-and-offset F1 scores for the model and data
adaptations on the MAESTRO validation set. Overall, it is evident that the combined impact
of binary, heavily imbalanced pointwise targets, causal modeling, and shifted asymmetric
window results in a significantly harder learning problem, with the same training duration
(500 epochs) leading to significantly poorer scores than the base case (TP1 in Table 1).
However, across all experimental setups in Section 4.1 compared to the current one, all our
causal modifications demonstrate significantly stronger robustness to decreasing tolerance
thresholds, which is important to guarantee low latency in predictions, and therefore appear
promising for further training.
Furthermore, when comparing all model architecture modifications (A1–A5) on note onset and
note onset-and-offset F1 score, we observe two unexpected model behaviours: First, adding a
separate offset stack (A1) does not improve offset prediction. As our postprocessing detects
a note offset as the earlier of either offset activation or frame inactivation, we
hypothesize that frame activity is sufficiently learned to compensate for the absence of an
offset acoustic stack. Second, removing the velocity conditioning on onset prediction (A2)
results in a strong improvement in onset prediction. Furthermore, sharing the acoustic stack
across increasing proportions (A3–A5) does not appear to hinder the model's ability to learn
meaningful representations. Finally, experiment A6 suggests that the model benefits from the
larger contextual window.
4.4 Final comparison
For our final comparison, we proceed with the following data and model configurations: we
continue with the shifted asymmetric window (160 samples) for the STFT (T1 in Sec. 4.2),
remove the velocity conditioning (A2) and share all convolutional layers in the acoustic
stack across all targets (A5). Mobile-AMT uses the original non-causal postprocessing
described in Section 3.2, while our model uses the causal postprocessing introduced in
Section 4.1.
Table 4 summarizes the results over different onset (and offset) thresholds for note onset
and onset-and-offset metrics. As expected, Mobile-AMT outperforms our modified causal model
across all metrics, with a significant margin. Upon reviewing all experiments conducted, we
conclude that the largest performance drops are attributed to the shifted window function
and the causal convolutions in our model. When comparing Mobile-AMT and our adapted model
across various onset tolerance thresholds, we observe, similar to the previous experiment,
that while our modified causal model predicts fewer targets with lower accuracy overall, it
demonstrates higher precision and robustness when evaluated at stricter timing tolerances.
5. DISCUSSION AND OUTLOOK
In this work, we investigate whether and how the current state of the art in real-time piano
transcription can be adapted to achieve minimum-latency automatic piano transcription
suitable for real-time musical interaction.
What latency is suitable cannot be answered universally, so our choice of 10–30 ms is worthy
of discussion. While 10 ms is suggested in digital instrument design [8–10], thresholds of
latency perception vary depending on the musical situation, task, and instrument: for
percussive digital instruments, decreased ratings for values of 20 ms and above were found
[20], instrument-specific thresholds between below 10 ms and 40 ms are reported in a live
monitoring setting [21], and about 30 ms were found for gestural control [22]. Offsets as
low as 6 ms may be perceived in simple isochronously spaced stimuli [23], while other
researchers found just noticeable latency differences at 27 ms and higher [24].
Transcription-enabled real-time applications like interactive accompaniment or generative
improvisation are more akin to ensemble playing than direct instrument control. In networked
musical contexts, researchers typically aim for 20–30 ms of latency to meet performance
conditions that mirror traditional in-person ensembles [11, 12]. However, studies also found
that musicians may be able to compensate for latencies up to 50 ms [13,25] or even 100 ms
[26] for one piano piece, a value that was deemed "neither musical nor interactive" in
another study [27]. A real-time transcription model should not only be tolerable but enable
fluent musical interaction, so we took 30 ms as a minimal requirement, and 10 ms as a goal
for imperceptible latency.

                         Note Onset                                    Note Onset with Offset
Model        Tol. (ms)   Precision      Recall          F1             Precision      Recall         F1
Causal-AMT   10          43.51 ±6.87    25.60 ±8.64     31.55 ±7.76     6.86 ±2.15     4.11 ±2.03     5.03 ±2.08
Mobile-AMT   10          22.70 ±5.82    15.57 ±2.96     18.26 ±3.51     3.27 ±1.19     2.32 ±0.94     2.69 ±1.00
Causal-AMT   20          50.86 ±7.52    29.71 ±9.11     36.70 ±7.96    10.89 ±3.91     6.34 ±2.90     7.86 ±3.16
Mobile-AMT   20          59.28 ±7.41    41.87 ±8.44     48.52 ±6.96    11.03 ±4.15     7.98 ±3.73     9.16 ±3.84
Causal-AMT   30          51.85 ±7.49    30.24 ±9.07     37.38 ±7.85    14.49 ±5.72     8.33 ±3.93    10.37 ±4.42
Mobile-AMT   30          81.17 ±6.29    57.84 ±12.79    66.80 ±9.78    18.42 ±6.89    13.27 ±6.09    15.26 ±6.30
Table 4. Final comparison between our implementation of Mobile-AMT and our modified
minimal-latency, strictly causally adapted model.

We investigate multiple adaptations to reduce latency, including label encoding with causal
postprocessing, shifted asymmetric window functions during preprocessing, and architectural
modifications that enforce causal processing within the model. Additionally, we reduce the
model size (from 320 to 160 GFLOPs for 3 seconds of input) by sharing computations across
core model components for all targets.
In a first set of experiments, we assess the impact of regression versus classification loss
encodings for non-causal models. The original regression targets only make sense in
conjunction with a lookahead, as the targets begin to increase several frames before the
actual onset, which is impossible for a causal model to predict. To mitigate the cost in
training stability and accuracy incurred by localized, causal-ready targets, we experiment
with loss functions that weigh the active frames over the inactive frames to combat label
imbalance, and loss functions that are tolerant to small temporal shifts. We find that the
weighted classification losses approximate the baseline, and the shift-tolerant losses reach
the same level in the absence of targets requiring lookahead.
In a second experiment, we investigate the delay incurred by the computation of audio
feature representations. Specifically, we look at STFT windows and their corresponding
centered targets. Transcription requires high frequency resolution for pitch estimation,
which requires large windows. Centering the targets results in an often overlooked delay of
half the window length, 64 ms in our case. We test configurations of shifted windows along
with asymmetric windowing functions that do not attenuate the most recent samples. We find
that aggressively shifted windows at 10 ms do deteriorate the transcription accuracy by a
lot, yet at 30 ms, we reach comparable performance to an unshifted causal model. Here, a
shift-tolerant loss does not improve performance. At the same time, configuring the model
architecture for strictly causal processing also deteriorates performance with respect to
the baseline with more than 100 ms of lookahead.
In a third experiment, we assess different model architectures and their impact. We observe
that sharing the convolutional components of the acoustic stack across different target
types proves beneficial. We hypothesize that the local acoustic features captured in the
convolutional layers of the acoustic model can be effectively learned independently of
sequential information, making them invariant to the target type. Furthermore, removing the
velocity conditioning on the onsets strongly improves the accuracy of onset predictions.
Overall, we find that we can compensate well for algorithmic issues: we can scale the model
and use lookahead-free targets without a major drop in performance. What proves difficult,
however, is to render the model strictly causal and to effectively process the incoming
audio without loss of relevant information. For a latency of 10 ms, it would be required
that the model predicts pitches with at most 10 ms of incoming audio samples. For onsets of
the lowest two octaves on the piano, this means that there is not even a full period of the
fundamental frequency present in the samples, and predictions may need to rely on harmonic
partials. Along with the transient phase and the consequently blurry STFT frame, this leads
to an increasingly hard transcription task. We hope that these findings and pinpointed
challenges will contribute to future research on real-time, minimum latency automatic piano
transcription.
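The claim about the lowest octaves can be checked with a short calculation: a fundamental of frequency f has a period of 1000/f milliseconds, so every note below 100 Hz needs more than 10 ms for a single period. The sketch assumes equal temperament with A4 = 440 Hz and the standard 88-key range (MIDI notes 21 to 108); the names are illustrative.

```python
LATENCY_MS = 10.0

def midi_to_hz(midi_note):
    """Equal-tempered pitch with A4 (MIDI 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12)

def period_ms(midi_note):
    """Duration of one period of the fundamental, in milliseconds."""
    return 1000.0 / midi_to_hz(midi_note)

piano_keys = range(21, 109)   # A0 .. C8
# Keys whose fundamental does not complete one period within the latency budget
too_slow = [m for m in piano_keys if period_ms(m) > LATENCY_MS]

print(len(too_slow), too_slow[0], too_slow[-1], round(period_ms(21), 1))
```

Under these assumptions the affected keys run from A0 (a period of about 36 ms) up to G2, which is close to the two octaves named in the text.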
While this study primarily focuses on the algorithmic
performance and robustness of a real-time transcription model,
we acknowledge that a detailed analysis of processing time
(including both network inference and preprocessing) across
different hardware platforms remains an important area of future
work to allow for the practical deployment of a real-time
transcription system in real-world scenarios. Likewise, we want
to take a closer inspection into the design of the underlying
window function and filterbank, in order to find an appropriate
balance between reducing prediction delay and increasing future
context, all while maintaining the desired STFT properties.
Lastly, across all experimental groups, our system adaptations
consistently outperformed the baseline at lower timing
tolerance, which we consider a desirable property worthy of
further investigation.
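To make the notion of timing tolerance concrete, the following is a simplified sketch of an onset F-measure. It uses greedy one-to-one matching on sorted onset times and ignores pitch. It is not the evaluation code used in this work, and the onset times and the two tolerance values (50 ms and 25 ms) are illustrative assumptions only.

```python
def onset_f_measure(reference, estimated, tolerance):
    """F-measure for onset times (seconds) matched within +/- `tolerance`.

    Simplified sketch: greedy one-to-one matching on sorted times, pitch ignored.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = matched = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            matched += 1      # count the pair and consume both onsets
            i += 1
            j += 1
        elif diff < 0:
            j += 1            # estimate is too early for this reference
        else:
            i += 1            # reference has no estimate close enough
    if matched == 0:
        return 0.0
    precision = matched / len(est)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up onset times: the estimates are 10 to 40 ms late.
reference = [0.50, 1.00, 1.50, 2.00]
estimated = [0.51, 1.03, 1.52, 2.04]

f_wide = onset_f_measure(reference, estimated, tolerance=0.050)
f_tight = onset_f_measure(reference, estimated, tolerance=0.025)
```

In this toy case all four onsets match at the wider tolerance, while only the two with deviations of 10 ms and 20 ms match at the tighter one, so the score drops. A system with smaller timing deviations therefore retains more of its score as the tolerance shrinks.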
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
6. ACKNOWLEDGEMENTS
This research acknowledges support by the European Research
Council (ERC), under the European Union’s Horizon 2020 research
and innovation programme, grant agreement No. 101019375
Whither Music?. The LIT AI Lab is supported by the Federal
State of Upper Austria.
7. REFERENCES
[1] C. Haw ho ne, A. S asyuk, A. Robe s, I. Simon, C.-
Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and
D. Eck, “Enabling ac o ized piano music modeling
and gene a ion wi h he MAESTRO da ase ,” in In-
e na ional Con e ence on Lea ning Rep esen a ions,
2019.
[2] Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-
Resolu ion Piano T ansc ip ion wi h Pedals by Re-
g essing Onse and O se Times,” IEEE/ACM T ans-
ac ions on Audio Speech and Language P ocessing,
ol. 29, pp. 3707–3717, 2021.
[3] S. Sig ia, E. Bene os, and S. Dixon, “An end- o-end
neu al ne wo k o polyphonic piano music ansc ip-
ion,” IEEE/ACM T ansac ions on Audio, Speech, and
Language P ocessing, ol. 24, no. 5, pp. 927–939,
2016.
[4] R. Kelz, S. Böck, and G. Widme , “Deep polyphonic
ads piano no e ansc ip ion,” in ICASSP 2019-2019
IEEE In e na ional Con e ence on Acous ics, Speech
and Signal P ocessing (ICASSP). IEEE, 2019, pp.
246–250.
[5] Y. Kusaka and A. Maezawa, “Mobile-AMT: Real-
Time Polyphonic Piano T ansc ip ion o In- he-Wild
Reco dings,” in 2024 32nd Eu opean Signal P ocess-
ing Con e ence (EUSIPCO). IEEE, 2024, pp. 36–40.
[6] A. Fe nandez, “Onse s and Veloci ies: A o dable
Real-Time Piano T ansc ip ion Using Con olu ional
Neu al Ne wo ks,” in 2023 31s Eu opean Signal P o-
cessing Con e ence (EUSIPCO). IEEE, 2023, pp.
151–155.
[7] T. Kwon, D. Jeong, and J. Nam, “Towa ds E icien and
Real-Time Piano T ansc ip ion Using Neu al Au o e-
g essi e Models,” IEEE/ACM T ansac ions on Audio,
Speech, and Language P ocessing, 2024.
[8] D. Wessel and M. W igh , “P oblems and p ospec s o
in ima e musical con ol o compu e s,” Compu e mu-
sic jou nal, ol. 26, no. 3, pp. 11–22, 2002.
[9] A. P. McPhe son, R. H. Jack, and G. Mo o, “Ac ion-
sound la ency: A e ou ools as enough?” in
16 h In e na ional Con e ence on New In e aces o
Musical Exp ession, NIME 2016, G i i h Uni e si y,
B isbane, Aus alia, July 11-15, 2016. nime.o g,
2016, pp. 20–25. [Online]. A ailable: h ps://doi.o g/
10.5281/zenodo.3964611
[10] F. Caspe, J. Shie , M. Sandle , C. Sai is, and
A. McPhe son, “Designing neu al syn hesize s o low
la ency in e ac ion,” a Xi p ep in a Xi :2503.11562,
2025.
[11] L. Tu che and C. Ro ondi, “On he ela ion be ween
he ields o ne wo ked music pe o mances, ubiqui-
ous music, and in e ne o musical hings,” Pe sonal
and Ubiqui ous Compu ing, ol. 27, no. 5, pp. 1783–
1792, 2023.
[12] E. Lakio akis, C. Liaskos, and X. Dimi opoulos, “Im-
p o ing ne wo ked music pe o mance sys ems us-
ing applica ion-ne wo k collabo a ion,” Concu ency
and Compu a ion: P ac ice and Expe ience, ol. 31,
no. 24, p. e4730, 2019.
[13] E. Chew, R. Zimme mann, A. A. Sawchuk, C. Ky -
iakakis, C. Papadopoulos, A. F ançois, G. Kim,
A. Rizzo, and A. Volk, “Musical in e ac ion a a dis-
ance: Dis ibu ed imme si e pe o mance,” in P o-
ceedings o he MusicNe wo k Fou h Open Wo kshop
on In eg a ion o Music in Mul imedia Applica ions.
MusicNe wo k Ba celona, 2004, pp. 15–16.
[14] T. Kwon, D. Jeong, and J. Nam, “Polyphonic Piano
T ansc ip ion Using Au o eg essi e Mul i-S a e No e
Model,” in The 21 h In e na ional Socie y o Music
In o ma ion Re ie al Con e ence (ISMIR). In e na-
ional Socie y o Music In o ma ion Re ie al, 2020.
[15] D. Jeong and S. Telecom, “Real- ime au oma ic piano
music ansc ip ion sys em,” in La e B eaking Demo.
In e na ional Socie y o Music In o ma ion Re ie al,
2020, pp. 4–6.
[16] A. Howa d, M. Sandle , G. Chu, L.-C. Chen, B. Chen,
M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasude an e al.,
“Sea ching o MobileNe V3,” in P oceedings o he
IEEE/CVF in e na ional con e ence on compu e i-
sion, 2019, pp. 1314–1324.
[17] C. Ra el, B. McFee, E. J. Humph ey, J. Salamon,
O. Nie o, D. Liang, D. P. Ellis, and C. C. Ra el,
“MIR_EVAL: A T anspa en Implemen a ion o Com-
mon MIR Me ics.” in ISMIR, ol. 10, 2014, p. 2014.
[18] R. M. Bi ne , J. J. Bosch, D. Rubins ein, G. Mesegue -
B ocal, and S. Ewe , “A ligh weigh ins umen -
agnos ic model o polyphonic no e ansc ip ion and
mul ipi ch es ima ion,” in P oceedings o he IEEE In-
e na ional Con e ence on Acous ics, Speech, and Sig-
nal P ocessing (ICASSP), Singapo e, 2022.
[19] F. Fosca in, J. Schlü e , and G. Widme , “Bea his!
Accu a e bea acking wi hou DBN pos p ocessing,”
a Xi p ep in a Xi :2407.21658, 2024.
[20] R. H. Jack, A. Meh abi, T. S ockman, and A. McPhe -
son, “Ac ion-sound la ency and he pe cei ed quali y o
digi al musical ins umen s: Compa ing p o essional
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
89
pe cussionis s and ama eu musicians,” Music Pe cep-
ion, ol. 36, no. 1, pp. 109–128, 09 2018. [Online].
A ailable: h ps://doi.o g/10.1525/mp.2018.36.1.109
[21] M. Les e and J. Boley, “The e ec s o la ency on li e
sound moni o ing,” Jou nal o he Audio Enginee ing
Socie y, no. 7198, oc obe 2007.
[22] T. Mäki-Pa ola and P. Hämäläinen, “La ency ole ance
o ges u e con olled con inuous sound ins umen
wi hou ac ile eedback,” in P oceedings o he 2004
In e na ional Compu e Music Con e ence, ICMC
2004, Miami, Flo ida, USA, No embe 1-6, 2004.
Michigan Publishing, 2004. [Online]. A ailable:
h ps://hdl.handle.ne /2027/spo.bbp2372.2004.032
[23] A. F ibe g and J. Sundbe g, “Time disc imina ion in
a mono onic, isoch onous sequence,” The Jou nal o
he Acous ical Socie y o Ame ica, ol. 98, no. 5, pp.
2524–2531, 1995.
[24] A. Schmid, M. Amb os, J. Bogon, and R. Wimme ,
“Measu ing he jus no iceable di e ence o audio
la ency,” in P oceedings o he 19 h In e na ional
Audio Mos ly Con e ence: Explo a ions in Sonic
Cul u es, AM 2024, Milan, I aly, Sep embe 18-
20, 2024, L. A. Ludo ico and D. A. Mau o,
Eds. ACM, 2024, pp. 325–331. [Online]. A ailable:
h ps://doi.o g/10.1145/3678299.3678331
[25] S. Dahl and R. B esin, “Is he playe mo e in luenced
by he audi o y han he ac ile eedback om he in-
s umen ,” in P oceedings o he Digi al Audio E ec s
Con e ence (DAFx), 2001, pp. 6–9.
[26] A. A. Sawchuk, E. Chew, R. Zimme mann, C. Pa-
padopoulos, and C. Ky iakakis, “F om emo e media
imme sion o dis ibu ed imme si e pe o mance,”
in P oceedings o he 2003 ACM SIGMM Wo k-
shop on Expe ien ial Telep esence, se . ETP ’03.
New Yo k, NY, USA: Associa ion o Compu ing
Machine y, 2003, p. 110–120. [Online]. A ailable:
h ps://doi.o g/10.1145/982484.982506
[27] C. Ba le e, D. Headlam, M. Bocko, and G. Velikic,
“E ec o ne wo k la ency on in e ac i e musical
pe o mance,” Music Pe cep ion, ol. 24, no. 1,
pp. 49–62, 09 2006. [Online]. A ailable: h ps:
//doi.o g/10.1525/mp.2006.24.1.49
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
90