
Joint Transcription of Acoustic Guitar Strumming Directions and Chords

Author: Sebastian Murgul; Johannes Schimper; Michael Heizmann
Publisher: Zenodo
DOI: 10.5281/zenodo.17706490
Source: https://zenodo.org/records/17706490/files/000055.pdf
JOINT TRANSCRIPTION OF ACOUSTIC GUITAR STRUMMING
DIRECTIONS AND CHORDS
Sebastian Murgul1,2, Johannes Schimper2, Michael Heizmann2
1 Klangio GmbH, Karlsruhe, Germany
2 Karlsruhe Institute of Technology, Karlsruhe, Germany
[email protected]
ABSTRACT
Automatic transcription of guitar strumming is an underrepresented and challenging task in Music Information Retrieval (MIR), particularly for extracting both strumming directions and chord progressions from audio signals. While existing methods show promise, their effectiveness is often hindered by limited datasets. In this work, we extend a multimodal approach to guitar strumming transcription by introducing a novel dataset and a deep learning-based transcription model. We collect 90 min of real-world guitar recordings using an ESP32 smartwatch motion sensor and a structured recording protocol, complemented by a synthetic dataset of 4 h of labeled strumming audio. A Convolutional Recurrent Neural Network (CRNN) model is trained to detect strumming events, classify their direction, and identify the corresponding chords using only microphone audio. Our evaluation demonstrates significant improvements over baseline onset detection algorithms, with a hybrid method combining synthetic and real-world data achieving the highest accuracy for both strumming action detection and chord classification. These results highlight the potential of deep learning for robust guitar strumming transcription and open new avenues for automatic rhythm guitar analysis.
1. INTRODUCTION
Automatic music transcription is a key task in Music Information Retrieval (MIR), aiming to convert audio signals into symbolic representations. For the transcription of solo instrument music, numerous new approaches and tools have been proposed over the last years [1]. While classical note-tracking models such as [2], [3], and [4] perform well for fingerpicking, they are not designed to predict strumming directions. These models focus on individual note onsets and often struggle with the dense polyphony and rhythmic structure of strumming, where the emphasis lies on chord-level articulation. This limitation highlights the need for a dedicated strumming transcription system with applications in music education, DAW plugins, and notation software.
© S. Murgul, J. Schimper, and M. Heizmann. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: S. Murgul, J. Schimper, and M. Heizmann, “Joint Transcription of Acoustic Guitar Strumming Directions and Chords”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Research on guitar strumming transcription has primarily followed two main approaches: audio-based classification and sensor-based motion analysis. In 2019, Bello et al. proposed a neural network-based classification system to distinguish between up and down strokes using Mel-Frequency Cepstral Coefficient (MFCC) segments as input features [5]. Their approach achieved a classification accuracy of 72.5 % for a Convolutional Neural Network (CNN) and 70 % for a Long Short-Term Memory (LSTM) model. Earlier, in 2013, Matsushita et al. developed a wristwatch-like device designed to analyze down-strumming actions in terms of note timing and intensity [6]. More recently, Freire et al. (2020) explored strumming gestures in greater detail using inertial measurement units (IMUs) and motion capture technology, further advancing sensor-based analysis of guitar performance [7]. A multimodal approach was introduced in 2022 by Murgul et al., who combined a back-of-hand-mounted motion sensor with guitar pickup audio for strumming action transcription [8]. Their method involved recording a small manually labeled dataset, which was used to evaluate algorithmic annotation techniques based on onset detection in the pickup signal and thresholding the first-order derivative of the motion data.
Building on the approach in Murgul et al., we extend the multimodal approach to create a bigger and more diverse dataset in order to train a neural network. We increase the dataset size from 5 min to 90 min and from 4 chords to 24 chords (major / minor) while also adding more complex strumming rhythms and performance parameter variations. To this end, an improved hand motion sensor based on an off-the-shelf ESP32 smartwatch module is developed, and a sophisticated recording plan with specific instructions for the players is created. A new guitar strumming dataset is recorded by three guitar players using this approach and semi-automatically annotated using the multimodal information. While the semi-automatic annotation process is scalable, the recording process still takes some time. Therefore, to complement the real-world dataset, we present a guitar strumming data synthesis approach that is used to generate an additional 4 h of labeled strumming audio. These datasets are then used to train a CRNN model to automatically detect strumming events and classify the strumming direction as well as the played chord from microphone audio alone. Finally, the transcription results are evaluated using the test split of the real strumming recordings and compared with baseline algorithms.
2. MULTIMODAL STRUMMING RECORDING
2.1 Motion Recording Hardware
To capture hand movement and, consequently, the strumming direction, a compact and lightweight system is required that can be attached to the playing hand. It must enable wireless communication for transmitting motion data and be capable of starting and stopping audio recordings on a computer via wireless commands. Additionally, the system should be intuitive for guitarists to use. For scalable applications, the solution should be cost-efficient. The ESP32-S3-Touch-LCD-1.28 module from Waveshare meets these requirements and serves as the central microcontroller [9]. It features a 3-axis accelerometer (QMI8658) and a LiPo battery connector with battery management, and supports the wireless standards Wi-Fi and Bluetooth Low Energy (BLE). Furthermore, the module includes an LCD screen with touch functionality and a compact form factor.
A custom 3D-printed enclosure enables a watch-like attachment on the back of the hand. The enclosure also houses a 350 mAh LiPo battery, as shown in Figure 1a. Figure 1b illustrates the sensor system attached to the back of the hand.

Figure 1. Hand sensor in its enclosure. (a) Backside of the hand sensor without cover. (b) Sensor attached to the back of the hand.
Hand movement is described using a simplified model like in [8], in which the hand performs a semicircular motion around the elbow. The $x$-axis runs along the back of the hand, orthogonal to the fingers, while the $y$-axis is orthogonal to the $x$-direction, pointing towards the fingertips. The relevant acceleration components are gravitational acceleration $A_g$, centripetal acceleration $A_\text{centripetal}$, and tangential acceleration $A_\text{tangential}$ [10]. The spatial orientations recorded by the sensor, along with the measured accelerations for different hand positions, are shown in Figure 2. The centripetal acceleration acts exclusively in the $y$-direction, while the tangential acceleration occurs along the $x$-axis. Consequently, the acceleration $A_x$ in the $x$-direction and $A_y$ in the $y$-direction are given by

$$A_x = A_\text{tangential} + A_g \cdot \cos(\varphi) \tag{1}$$

$$A_y = A_\text{centripetal} + A_g \cdot \sin(\varphi) \tag{2}$$

where $\varphi$ is the angle relative to the horizontal axis, ranging from $-90°$ to $90°$. For slow, quasi-stationary hand movements, $A_x$ ranges from $-1g$ to $0g$, while $A_y$ takes values between $-1g$ and $1g$. Due to the symmetry properties of the sine function, $A_x$ alone cannot determine the movement direction. However, by differentiating the acceleration in the $y$-direction, the movement direction can be inferred. A negative gradient corresponds to an upward motion, while a positive gradient corresponds to a downward motion.

Figure 2. Motion model of the sensor.

In non-stationary cases, such as during strumming, both tangential and centripetal acceleration contribute to $A_x$ and $A_y$, respectively, alongside the gravitational acceleration. The $y$-direction experiences an additional, constant centripetal acceleration. Because our method relies on acceleration derivatives, the constant centripetal acceleration can be ignored.
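The quasi-stationary case of Eqs. (1) and (2) can be illustrated with a minimal sketch. It is not the sensor firmware or the authors' code: the function names, the 10° sweep, the 10 ms time step, and the choice $A_g = -1g$ (picked so that $A_x$ falls into the stated range of $-1g$ to $0g$) are assumptions made for illustration.

```python
import math

def quasi_static_acceleration(phi_deg, a_g=-1.0):
    """Return (A_x, A_y) in units of g from Eqs. (1)-(2).

    Tangential and centripetal terms are set to zero (slow movement).
    a_g = -1.0 is an assumed sign convention, not taken from the paper.
    """
    phi = math.radians(phi_deg)
    return a_g * math.cos(phi), a_g * math.sin(phi)

def gradient(samples, dt):
    """First-order finite difference of a uniformly sampled signal."""
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]

# Hand sweeping from -60 deg to +60 deg in 10 deg steps (hypothetical trajectory)
angles = list(range(-60, 61, 10))
a_x = [quasi_static_acceleration(p)[0] for p in angles]
a_y = [quasi_static_acceleration(p)[1] for p in angles]

# The same sweep in both directions: A_y's gradient changes sign, A_x's values do not
dy_forward = gradient(a_y, dt=0.01)
dy_backward = gradient(a_y[::-1], dt=0.01)
```

In this sketch `a_x` takes the same values at $+\varphi$ and $-\varphi$, so it carries no direction information, whereas the sign of the $A_y$ gradient flips when the sweep is reversed. Which sign corresponds to an up strum depends on how the sensor is mounted.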
2.2 Recording Process
Table 1 gives an overview of the playing instructions given to the guitarists. To compile the datasets, 28 different strumming patterns in 4/4 time signature based on [11, 12] are used, ranging from rhythmically simple to complex syncopated patterns. The patterns vary in parameters like tempo (60, 80, 100 BPM), chord progressions, playing style (plectrum, finger), and volume (soft, medium, loud). The variations were determined randomly based on a uniform distribution.
| Parameter | Values |
|---|---|
| Pattern | 28 patterns |
| Tempo | 60 BPM, 80 BPM, 100 BPM |
| Movement | little, normal, large |
| Volume | quiet, medium, loud |
| Technique | finger, pick |
| Chords | major and minor chord progressions |

Table 1. Overview of the playing instructions given to the guitarists.
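A recording plan of this kind can be drawn with a few lines of code. The following is a minimal sketch under stated assumptions: the pattern numbering, the dictionary keys, and the count of 90 takes (90 takes of 60 s would add up to 90 min) are illustrative, and chord progressions are left out because the text does not list them.

```python
import random

# Parameter values from Table 1; keys and pattern numbering are hypothetical
RECORDING_PARAMETERS = {
    "pattern": list(range(1, 29)),          # 28 strumming patterns
    "tempo_bpm": [60, 80, 100],
    "movement": ["little", "normal", "large"],
    "volume": ["quiet", "medium", "loud"],
    "technique": ["finger", "pick"],
}

def sample_take(rng):
    """Draw one take by sampling every parameter uniformly and independently."""
    return {name: rng.choice(values) for name, values in RECORDING_PARAMETERS.items()}

rng = random.Random(0)  # fixed seed so the plan is reproducible
plan = [sample_take(rng) for _ in range(90)]
```

Each entry of `plan` is one set of instructions that could be shown to a guitarist before a take.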
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

The data collection was conducted with three guitarists, including a professional guitar teacher and two experienced amateur guitarists. The strumming patterns were played for 60 s each to a metronome, following the predefined parameters. Simultaneously, audio recordings from the guitar pickup and acceleration data were captured. Synchronization of both audio signals was performed using cross-correlation. Additionally, the guitarists’ playing was recorded using the microphone on an iPhone 15 Pro. The total recording duration amounts to 90 min. Due to the lightweight design and the mounting position on the back of the hand, the guitar players’ fingers were not constrained by the sensor during performance.
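Cross-correlation synchronization amounts to finding the time shift at which two recordings of the same performance overlap best. The sketch below is a brute-force illustration on toy data, not the authors' implementation. The function name, the signal names, and the search range are assumptions, and a practical version would use an FFT-based correlation on real audio.

```python
def estimate_lag(reference, delayed, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive result means `delayed` starts later than `reference`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(reference):
            m = n + lag
            if 0 <= m < len(delayed):
                score += x * delayed[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Toy example: the same click pattern, delayed by 3 samples in the second signal
pickup = [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0]
phone = [0, 0, 0] + pickup[:-3]
lag = estimate_lag(pickup, phone, max_lag=5)  # -> 3
```

Dividing the estimated lag by the sample rate gives the offset in seconds by which one recording has to be shifted.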
2.3 Semi-Automatic Annotation
The annotation process involves identifying the onset times and strumming directions within the pickup recordings as well as the synchronization with the motion sensor signal. Instead of relying solely on automated onset detection, the process is optimized by incorporating prior knowledge from the recording plan, which includes tempo, rhythm patterns, chords, and strumming sequences. This structured information allows for a more robust prediction of expected onset times, reducing reliance on purely signal-based onset detection. To determine actual onset times, spectral flux analysis [13] is used to detect significant changes in the audio signal. However, since the guitarist does not necessarily start at the exact zero-second mark, a user-assisted graphical interface is employed to align the estimated onsets with the theoretical pattern. The process involves selecting the actual start time and iteratively adjusting until the detected onsets align with the expected timing based on the metronome.

Strumming direction is determined using acceleration data, which is synchronized with the audio signal. Since transmission latency and system delays introduce a time offset between the audio and acceleration data, manual adjustments are required. An interactive visualization displays both spectral flux and differentiated acceleration, allowing users to shift the acceleration data until the peaks of acceleration derivatives align with the detected onsets. To assign strumming direction, peaks in the acceleration derivative corresponding to upward and downward hand movements are matched with detected onset peaks in spectral flux. If the acceleration derivative is positive at an onset time, it is labeled as an up strum; if negative, it is labeled as a down strum.

Next, we use the a priori information from the recording plan to automatically correct the annotations and add chord labels. Since we use a metronome, it can be assumed that the rhythmical pattern is played consistently enough to interpolate missed strumming events. Finally, the annotated data undergoes manual validation and correction by a human annotator. The annotator visually inspects and adjusts the detected onsets and strumming directions using an interactive graphical interface.
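Two steps of this procedure lend themselves to a short sketch: deriving the expected onset grid from the recording plan, and labeling a strum from the sign of the acceleration derivative. The code below is illustrative only. The function names, the eighth-note example pattern, and the nearest-sample lookup are assumptions, while the sign convention (positive means up) is the one stated in this section.

```python
def expected_onsets(pattern, bpm, start_time, bars=1, steps_per_beat=2):
    """Expected strum times (s) for a pattern given as one flag per grid step."""
    step = 60.0 / bpm / steps_per_beat
    times = []
    for bar in range(bars):
        for i, active in enumerate(pattern):
            if active:
                times.append(start_time + (bar * len(pattern) + i) * step)
    return times

def label_direction(onset_time, accel_derivative, sample_rate, offset=0.0):
    """Label a strum from the sign of the acceleration derivative at the onset.

    `offset` is the manually determined audio/sensor time shift in seconds.
    """
    index = round((onset_time + offset) * sample_rate)
    index = min(max(index, 0), len(accel_derivative) - 1)
    return "up" if accel_derivative[index] > 0 else "down"

# Hypothetical eighth-note pattern in 4/4 at 60 BPM, first strum at 0.25 s
pattern = [1, 0, 1, 1, 0, 1, 1, 1]
onsets = expected_onsets(pattern, bpm=60, start_time=0.25)
# -> [0.25, 1.25, 1.75, 2.75, 3.25, 3.75]
```

Detected onsets that deviate from this grid can then be corrected, and grid positions without a detection can be filled in, which corresponds to the interpolation of missed strumming events described above.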
3. GUITAR STRUMMING SYNTHESIS
To create a diverse and scalable dataset for training strumming transcription models, we introduce a novel strumming synthesis approach consisting of three stages: strumming tablature sampling, audio rendering, and audio augmentation. This method generates approximately 1000 examples totaling 4 h of audio, which are randomly split into 90 % training, 5 % validation, and 5 % testing sets.
3.1 Strumming Tablature Sampling
The first step involves generating synthetic strumming tablatures, as illustrated in Figure 3. A database of 51 chord progressions in functional notation and 36 strumming patterns defined on a 16th-note grid serves as the foundation for generating variations. Each example is created by randomly selecting a chord progression, transposing it to a random key, and mapping each chord to a fingering from a lookup table. A random strumming pattern and tempo are then applied to create a complete tablature. To introduce natural imperfections, the last note of a strumming chord is randomly dropped in 50 % of cases, simulating playing inconsistencies typical of amateur guitarists. The generated tablatures are stored in the Guitar Pro¹ format, alongside a CSV annotation file containing timing, strumming action, and chord labels.
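The sampling steps can be sketched as follows. This is a simplified illustration rather than the authors' generator: the degree table covers only a few diatonic chords of a major key, the example progression and pattern are made up, one bar per chord is assumed, and fingering lookup as well as Guitar Pro export are omitted.

```python
import random

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# Scale degree -> (semitones above the tonic, chord quality); illustrative subset
DEGREES = {"I": (0, ""), "ii": (2, "m"), "iii": (4, "m"),
           "IV": (5, ""), "V": (7, ""), "vi": (9, "m")}

def transpose_progression(progression, key_index):
    """Map functional notation (e.g. 'vi') to chord names in the given major key."""
    chords = []
    for degree in progression:
        semitones, quality = DEGREES[degree]
        chords.append(NOTE_NAMES[(key_index + semitones) % 12] + quality)
    return chords

def sample_strum_events(progression, pattern, bpm, rng):
    """Return (time_s, action, chord, drop_last_note) tuples, one bar per chord."""
    chords = transpose_progression(progression, rng.randrange(12))
    step = 60.0 / bpm / 4  # duration of one 16th note
    events = []
    for bar, chord in enumerate(chords):
        for i, action in enumerate(pattern):
            if action in ("D", "U"):
                time_s = (bar * len(pattern) + i) * step
                drop_last = rng.random() < 0.5  # drop the last note in 50 % of cases
                events.append((time_s, action, chord, drop_last))
    return events

rng = random.Random(1)
pattern = list("D...D.U..U.UD.U.")  # hypothetical 16-step pattern, '.' = no strum
events = sample_strum_events(["I", "V", "vi", "IV"], pattern, bpm=80, rng=rng)
```

The resulting event list already contains the fields needed for the CSV annotation file (timing, strumming action, and chord label).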
3.2 Audio Rendering
The synthesized tablatures are rendered into audio using DAWDreamer [14] and Ample Sound’s virtual guitar instruments², following a methodology similar to SynthTab [15]. Instead of converting tablatures to MIDI, we use .fxp preset files to load the Guitar Pro notation directly into the virtual instrument engine. This way, up and down stroke information can be input from the tablature. To enhance realism, rendering parameters are randomized, including the blend between virtual microphones and the amount of fret noise introduced. The final output is saved as a 44.1 kHz WAV file. Since the rendering process introduces an average 40 ms latency, this delay is accounted for in the dataset annotations to maintain synchronization accuracy.
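One plausible way to account for the latency is to shift every annotated onset by the average delay. The sketch below assumes that the rendered audio lags behind the tablature timing, so the annotation times are moved later by 40 ms; the function name and the rounding to microseconds are illustrative choices.

```python
RENDER_LATENCY_S = 0.040  # average rendering latency reported in the text

def compensate_latency(onset_times, latency=RENDER_LATENCY_S):
    """Shift tablature-derived onset times so they line up with the rendered audio."""
    return [round(t + latency, 6) for t in onset_times]

annotations = compensate_latency([0.0, 0.5, 1.0])  # -> [0.04, 0.54, 1.04]
```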
3.3 Audio Augmentation
To further improve realism and variability, a post-processing step applies a chain of effects using the Pedalboard library [16]. The augmentation pipeline introduces controlled distortions and environmental factors to better simulate real-world recordings. The processing chain includes distortion, high- and low-pass filtering, and compression to mimic tonal variations across different recording conditions. To simulate room acoustics, a convolutional reverb effect is applied. Additional background noise layers, including ambient recordings (traffic, weather, and living room sounds) and white noise, are incorporated to model microphone imperfections and noisy environments. Finally, short bursts of fretting sounds and percussive noises, such as light tapping or clapping, are injected at random intervals to emulate natural guitar handling. The effect parameters, such as signal-to-noise ratio (SNR), filter cut-off frequencies, and dry/wet mix ratios, are randomized to ensure broad generalization.
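As an illustration of one randomized parameter, the sketch below mixes a noise layer into a signal at a randomly drawn SNR. It uses plain Python instead of the Pedalboard effects chain, and the test tone, the white-noise source, and the SNR range of 5 to 30 dB are assumptions, since the text does not state the ranges used.

```python
import math
import random

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture has the requested signal-to-noise ratio in dB."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(signal, noise)], gain

rng = random.Random(0)
sample_rate = 16000
# One second of a 220 Hz tone as a stand-in for rendered guitar audio
signal = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]
snr_db = rng.uniform(5.0, 30.0)  # hypothetical range
mixture, gain = mix_at_snr(signal, noise, snr_db)
```

The same pattern (draw a parameter, then apply the effect) extends to filter cut-off frequencies and dry/wet ratios.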
4. MODEL
Our model builds upon the Convolutional Recurrent Neural Network (CRNN) architecture proposed by Kong et al. [17] for piano transcription. Unlike traditional classification-based approaches that estimate a discrete piano roll representation, this method employs a regression-based strategy to predict the time to the next onset or offset event. This design allows for more precise onset estimations beyond the limitations of fixed frame step sizes, while also increasing robustness against minor misalignments in onset label annotations during training.

¹ See https://www.guitar-pro.com for more information.
² Available at https://amplesound.net/en/index.asp.

Figure 3. Flow chart of the strumming tablature sampling process. Blocks: 51 Chord Progressions; Transpose to Random Key; Create Chord Tablature; Chord Fingerings; Apply Strumming Pattern w/ Tempo; 36 Strumming Patterns; Drop Last Note; GP5 & Annotations.
4.1 Pre-Processing
The input audio is resampled to 16 kHz and segmented into overlapping 10 s clips with a hop size of 1 s to enhance data diversity. Each segment is converted into a logarithmic Mel spectrogram, which serves as the input representation for the neural network. The spectrogram is computed using a window size of 2048 samples and a hop size of 160 samples, resulting in a time-frequency representation with 229 frequency bins, starting at a minimum frequency of 30 Hz. To improve generalization, random pitch shifts in the range $[-6, 6]$ semitones are applied during training, with chord labels transposed accordingly. The overlapping segmentation and augmentation ensure robust feature learning across diverse strumming patterns.
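The segmentation arithmetic and the label transposition that accompanies a pitch shift can be sketched as below. The chord class encoding (indices 0 to 11 for major, 12 to 23 for minor roots) and all names are assumptions for illustration; the spectrogram computation itself is omitted.

```python
SAMPLE_RATE = 16000
CLIP_SECONDS, CLIP_HOP_SECONDS = 10, 1
FRAME_HOP = 160  # samples, i.e. 10 ms per spectrogram frame at 16 kHz

def clip_starts(num_samples):
    """Start indices (in samples) of overlapping 10 s clips taken every 1 s."""
    clip_len = CLIP_SECONDS * SAMPLE_RATE
    hop = CLIP_HOP_SECONDS * SAMPLE_RATE
    return list(range(0, max(num_samples - clip_len, 0) + 1, hop))

def transpose_chord_label(label, semitones):
    """Shift a chord class index by a pitch shift (assumed encoding, see above)."""
    quality, root = divmod(label, 12)
    return quality * 12 + (root + semitones) % 12

starts = clip_starts(60 * SAMPLE_RATE)                 # a 60 s recording -> 51 clips
frames_per_clip = CLIP_SECONDS * SAMPLE_RATE // FRAME_HOP  # -> 1000 frames
```

With these settings, a single 60 s take yields 51 overlapping training clips, which illustrates how the 1 s hop multiplies the number of examples.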
4.2 Architecture
The model consists of two main components: a strumming onset regression network and a chord classification network. The input Mel spectrogram is first processed by a convolutional layer stack (ConvStack) designed to capture time-frequency features. The structure of the ConvStack follows the design in [17] and consists of four convolutional blocks. Each block contains two convolutional layers with identical kernel sizes, followed by a pooling operation that reduces the spectral dimension while preserving temporal information. After the final convolutional block, the extracted features are flattened for subsequent processing. The flattened feature representation is passed through a fully connected (FC) layer before being fed into a bidirectional GRU (biGRU) layer with 256 units. The output of the biGRU is then passed through another fully connected layer, which generates regression values for up strums and down strums.
In parallel to the onset regression, a separate chord feature extraction stack processes the input spectrogram in a similar manner. Since chord labels are only available at strumming event times, the outputs of both networks are merged before passing through an additional biGRU and fully connected layer to produce final classification logits for 24 major and minor chord classes. Figure 4 provides an overview of the full model architecture.

Figure 4. Joint strumming action detection and chord recognition network using logarithmic Mel spectrogram as input feature. Blocks: Log Mel-Spectrogram ($T \times 512$); ConvStack; FC, c=768; biGRU, c=256; FC, c=2; Action Regression ($T \times 2$); ConvStack; FC, c=768; biGRU, c=256; FC, c=24; biGRU, c=256; FC, c=24; Chord Classification ($T \times 24$).

Figure 5. Structure of Strumming Action Onset Regression Labels. Target values around a strumming action: $g(\Delta_{-2})$, $g(\Delta_{-1})$, $g(\Delta_{0})$, $g(\Delta_{1})$, $g(\Delta_{2})$, 0, 0.
4.3 Regression Targets
Instead of relying on binary frame-based labels, a regression-based approach is used to determine strumming actions, as illustrated in Figure 5. The regression target function $g(\Delta_i) \in [0, 1]$ encodes the time difference to the next strumming action onset $\Delta_i$, where $i$ is the index of a frame, using a triangular distribution. The target is defined as

$$g(\Delta_i) = \begin{cases} 1 - \dfrac{|\Delta_i|}{J\Delta}, & |i| \le J \\ 0, & |i| > J \end{cases} \tag{3}$$

where $\Delta$ denotes the frame hop size and $J$ is a hyperparameter that controls the sharpness of the regression labels, which is set to 5 in our experiments. The loss function consists of two components: one for strumming onset regression and another for chord classification. The strumming action regression loss $l_\text{action}$ is calculated from the regression output $R_\text{action}$ and the target $G_\text{action}$ by

$$l_\text{action} = \sum_{t=1}^{T} \sum_{k=1}^{K} l_\text{bce}\big(G_\text{action}(t, k), R_\text{action}(t, k)\big), \tag{4}$$

where $l_\text{bce}$ represents the binary cross-entropy loss, $T$ is the number of time steps, and $K$ denotes the number of strumming action categories. For chord classification, a similar loss function is used on the prediction outputs $P_\text{chord}$ and the targets $G_\text{chord}$:

$$l_\text{chord} = \sum_{t=1}^{T} \sum_{c=1}^{C} l_\text{bce}\big(G_\text{chord}(t, c), P_\text{chord}(t, c)\big), \tag{5}$$

where $C$ represents the number of possible chord labels. The total loss function used during training is simply the sum of both components:

$$l = l_\text{action} + l_\text{chord}. \tag{6}$$
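Eqs. (3) to (6) can be written out in a few lines. The sketch below handles a single onset per target track and uses toy values (a hop of 0.5 s and $J = 2$) chosen only so that the numbers are easy to read; the settings in the text are a hop of 160 samples at 16 kHz and $J = 5$. Function names and the clipping constant `eps` are assumptions.

```python
import math

def regression_target(num_frames, onset_time, hop_seconds, J=5):
    """Triangular target of Eq. (3) for one onset; frames beyond J hops get 0."""
    target = []
    for frame in range(num_frames):
        delta = frame * hop_seconds - onset_time  # time difference to the onset
        value = 1.0 - abs(delta) / (J * hop_seconds)
        target.append(max(value, 0.0))
    return target

def bce(g, p, eps=1e-7):
    """Binary cross-entropy for one target/prediction pair."""
    p = min(max(p, eps), 1.0 - eps)
    return -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))

def summed_bce(targets, predictions):
    """Double sum over time steps and classes as in Eqs. (4) and (5)."""
    return sum(bce(g, p)
               for g_row, p_row in zip(targets, predictions)
               for g, p in zip(g_row, p_row))

# Toy target: onset at 2.0 s, frames every 0.5 s, J = 2
target = regression_target(num_frames=9, onset_time=2.0, hop_seconds=0.5, J=2)
# -> [0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0]

# Total loss of Eq. (6) for one time step with made-up predictions
l_action = summed_bce([[1.0, 0.0]], [[0.5, 0.5]])
l_chord = summed_bce([[0.0, 1.0, 0.0]], [[0.1, 0.8, 0.1]])
total_loss = l_action + l_chord
```

Because the target is a soft value in $[0, 1]$ rather than a binary label, an onset that falls between two frames is represented by the relative heights of the neighboring target values.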
The model is trained using the AdamW optimizer [18] with an initial learning rate of $10^{-4}$. The training process is run for 20,000 steps with a batch size of 6. On an NVIDIA Tesla V100 GPU, training takes approximately 2 h.
5. EXPERIMENTS AND RESULTS
This section evaluates the performance of our proposed method for strumming onset detection, direction classification, and chord recognition. We begin by assessing the detection accuracy using guitar pickup signals, followed by an evaluation of real-world microphone recordings. Finally, we analyze the effectiveness of pitch shift augmentation and compare our chord recognition with existing approaches.

Model performance is measured using precision, recall, and F1-score for strumming detection. Specifically, we report these metrics for down strums ($F1_\text{down}$), up strums ($F1_\text{up}$), and strumming class agnostic ($F1_\text{any}$). A 50 ms tolerance window is used, following the mir_eval library [19].
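To make the metric concrete, the following sketch computes precision, recall, and F1 with a 50 ms tolerance. It uses a simple greedy one-to-one matching on sorted onsets as a simplification, whereas mir_eval matches events with a bipartite matching; the function name and the example onset times are made up for illustration.

```python
def onset_prf(reference, estimated, window=0.05):
    """Precision, recall and F1 with greedy one-to-one matching within +/- window s."""
    reference, estimated = sorted(reference), sorted(estimated)
    matched, j = 0, 0
    for ref in reference:
        # Skip estimates that are too early for this and all later reference onsets
        while j < len(estimated) and estimated[j] < ref - window:
            j += 1
        if j < len(estimated) and abs(estimated[j] - ref) <= window:
            matched += 1
            j += 1
    precision = matched / len(estimated) if estimated else 0.0
    recall = matched / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

reference = [0.50, 1.00, 1.50, 2.00]
estimated = [0.52, 1.08, 1.49, 2.01, 2.60]
precision, recall, f1 = onset_prf(reference, estimated)
# Three of five estimates match: precision 0.6, recall 0.75, F1 about 0.667
```

For the class-specific scores, the same function would be applied separately to the down strum and up strum onsets.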
5.1 Results on Guitar Pickup Signals
In our first experiment, we explore the performance of our model directly on the guitar pickup signals. We use two of the guitarists we recorded to train our model and evaluate on the third guitarist. We compare the detection quality of our trained model with the common onset detection functions spectral flux [13], superflux [13], and Complex Domain Onset Detection Function (CD-ODF) [20]. For spectral flux and superflux, we use the implementation given in the librosa library [21]. The resulting precision, recall, and F1-score for any strumming direction are highlighted in Table 2 for comparison. Of the onset detection functions, the spectral flux offers the best detection results, directly followed by the CD-ODF. Compared with spectral flux and superflux, the CD-ODF offers a noticeably high recall. Therefore, it might be suitable for an active learning labeling scenario. Our model outperforms the onset detection functions in all three metrics: precision, recall, and F1-score. By achieving an F1-score of about 98 %, the model is quite capable of reliably detecting the strumming actions in the pickup signal.
| Method | F1_any | P_any | R_any |
|---|---|---|---|
| Spectral Flux [13] | 79.49 % | 78.53 % | 81.86 % |
| SuperFlux [13] | 74.36 % | 77.04 % | 73.36 % |
| CD-ODF [20] | 79.32 % | 68.50 % | 98.15 % |
| Ours | 97.60 % | 96.54 % | 98.73 % |

Table 2. Strumming detection results on pickup audio.
By matching the detected strumming onsets with the movement data from the hand sensor, the strumming direction can also be determined. In Table 3, we compare the results of the multimodal algorithmic approach with our CRNN model. For all our approaches, the F1-score for down strums is higher than for up strums. Our CRNN model outperforms the algorithmic approaches for the down strum as well as the up strum class, whereby the increase is especially noticeable for up strum events. Combining the CRNN detection with the acceleration-based classification leads to the overall best results. Therefore, the labeling could be automated quite efficiently by using a hybrid approach with the pickup audio signal to detect the events in the audio and the motion sensor data to get the strumming event class algorithmically.
| Methods | F1_any | F1_down | F1_up |
|---|---|---|---|
| Spectral Flux [13] | 79.49 % | 85.40 % | 68.60 % |
| SuperFlux [13] | 74.36 % | 84.40 % | 67.80 % |
| CD-ODF [20] | 79.32 % | 82.20 % | 78.40 % |
| Ours | 97.60 % | 87.87 % | 84.90 % |
| Ours + Sensor | 97.60 % | 90.02 % | 88.66 % |

Table 3. Strumming event detection results by class. The onset detection function results are paired with the hand movement signal in order to classify the events.
5.2 Results on Microphone Recordings
Next, we examine the action detection performance on the real-world microphone data. The real-world audio contains overall more noise, reverb, and ambient sounds. The detection performance for different training dataset constellations (synthetic (Sy), microphone exclusively (Ph), microphone and pickup (Ph + Pi), and all three datasets (Sy + Ph + Pi)) is compared in Table 4. The $F1_\text{any}$ results for all datasets lie in a similar range. The synthetic dataset achieves about 5 % better results than when only using the comparably small training dataset of real-world phone recordings. When the pickup audio dataset is used in addition to the microphone recordings, we see a clear increase across all models. The increase is especially significant for up strums. In general, the real-world data performs significantly better than the synthetic dataset exclusively. Here, we see an increase of over 40 % compared to the synthetic dataset exclusively. Therefore, reliable onset detection itself can be trained from synthetic examples alone, but the classification of the strumming action profits from real-world audio. The best overall results are obtained by combining the synthetic dataset with the microphone and pickup dataset. This indicates that increasing the real-world dataset in additional recording sessions might yield further improvements. Interestingly, fine-tuning a checkpoint pretrained on synthetic data on the phone and pickup data leads to worse results than joining all three training datasets.

| Training Data | F1_any | R_any | P_any | F1_down | R_down | P_down | F1_up | R_up | P_up |
|---|---|---|---|---|---|---|---|---|---|
| Sy | 89.77 % | 89.47 % | 90.56 % | 73.92 % | 75.00 % | 74.04 % | 52.64 % | 56.99 % | 51.04 % |
| Ph | 85.06 % | 84.11 % | 86.12 % | 79.90 % | 78.70 % | 81.42 % | 66.81 % | 67.52 % | 67.88 % |
| Ph + Pi | 89.45 % | 88.37 % | 90.64 % | 82.94 % | 83.72 % | 82.40 % | 75.10 % | 73.17 % | 78.24 % |
| Sy + Ph + Pi | 92.75 % | 92.50 % | 93.25 % | 85.51 % | 85.87 % | 85.43 % | 79.02 % | 81.15 % | 77.80 % |

Table 4. Results on microphone audio trained on various combinations of the synthetic dataset (Sy), real-world pickup audio (Pi), and real-world microphone recordings (Ph).
5.3 Effect of Pitch Shift Augmentation

| Max Pitch Shift | F1_any | F1_down | F1_up |
|---|---|---|---|
| None | 81.15 % | 71.04 % | 55.80 % |
| ±3 semitones | 85.06 % | 79.10 % | 71.99 % |
| ±6 semitones | 89.45 % | 82.94 % | 75.10 % |
| ±12 semitones | 85.90 % | 80.89 % | 72.25 % |

Table 5. Effect of the max pitch shift parameter in the pre-processing step on the strumming detection performance.
In the model pre-processing, we perform data augmentation in the form of a random pitch shift before calculating the input spectrogram. The effect of the pitch shift augmentation is studied using the training on the combined phone and pickup dataset. The results of this experiment are shown in Table 5. Applying a max pitch shift of 6 semitones leads to the best results. The F1-score for down strums increases by 10 % and the F1-score for up strums by 14 %. While the pitch shift introduces more artifacts as the note shift increases, it also increases the diversity of chords used and therefore helps the model generalize.
5.4 Chord Recognition
While the previous experiments only focused on the strumming action detection and classification, the chord recognition performance is quantified in this experiment and compared with a popular CNN-based model [22] and a state-of-the-art transformer model [23]. We use the checkpoints provided by the authors. In contrast to the chord recognition task, where typically a musical piece is segmented into sections of a specific chord, we are interested in assigning a chord to a detected strumming event. Therefore, we use the ground truth strumming action times to determine a chord label. For the training of our own model, we use a maximum pitch shift of 6 semitones. The resulting accuracy scores for the major-minor vocabulary are shown in Table 6. The chord recognition transformer model and our model trained on the combined dataset achieve the best results of about 90 %. The CNN-based chord tracking shows the weakest performance. In contrast to the strumming action detection, our model trained on the synthetic dataset alone performs significantly better than with only the smaller real-world dataset. Training on all three datasets further increases the performance of our approach.

| Method (Dataset) | Accuracy |
|---|---|
| Deep Chroma Chord Recognition [22] | 80.37 % |
| Chord Recognition BTC [23] | 89.21 % |
| Ours (Sy) | 87.84 % |
| Ours (Ph + Pi) | 81.52 % |
| Ours (Sy + Ph + Pi) | 90.06 % |

Table 6. Results of chord recognition on the microphone audio of the real-world recordings.
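Evaluating chords at strumming event times can be sketched as follows. The example assumes that a segment-based chord recognizer outputs segment start times with one label each and that its prediction for a strum is the label of the segment containing the ground truth strum time. This lookup, the function names, and the toy data are illustrative assumptions and not a description of how the baselines were actually queried.

```python
import bisect

def chord_at(time_s, segment_starts, segment_labels):
    """Return the label of the chord segment that contains `time_s`."""
    index = bisect.bisect_right(segment_starts, time_s) - 1
    return segment_labels[max(index, 0)]

def strum_chord_accuracy(strum_times, true_chords, segment_starts, segment_labels):
    """Fraction of ground truth strums whose looked-up chord equals the annotation."""
    hits = sum(chord_at(t, segment_starts, segment_labels) == c
               for t, c in zip(strum_times, true_chords))
    return hits / len(strum_times)

# Hypothetical segment-level output of a chord recognizer
segment_starts = [0.0, 2.0, 4.0]
segment_labels = ["C", "G", "Am"]
accuracy = strum_chord_accuracy([0.5, 1.5, 2.5, 4.5], ["C", "C", "G", "A"],
                                segment_starts, segment_labels)  # -> 0.75
```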
6. CONCLUSION
This study demonstrates the effectiveness of a CRNN-based model for the joint transcription of guitar strumming actions and chords. We introduced a novel approach to strumming synthesis, generating a large dataset of synthetic strumming examples. By extending an existing multimodal strumming transcription framework, we also collected 90 minutes of real-world guitar recordings, enhanced with semi-automatic annotations. The combination of synthetic and real-world datasets allowed us to train a robust transcription model capable of accurately detecting strumming onsets, classifying strumming direction, and identifying chords from microphone audio.

Future work could extend this approach to cover a broader range of rhythmic patterns, including muted strumming events, which pose a challenge for motion-based annotation methods. Additionally, the chord vocabulary, currently limited to major and minor chords, could be expanded to include seventh chords, suspended chords, and other common chord voicings. These improvements would further enhance the versatility and real-world applicability of automatic strumming transcription models.
7. REFERENCES
[1] E. Benetos, S. Dixon, Z. Duan, and S. Ewert, “Automatic Music Transcription: An Overview,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20–30, 2018.
[2] X. Riley, D. Edwards, and S. Dixon, “High Resolution Guitar Transcription via Domain Adaptation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1051–1055.
[3] S. Chang, E. Benetos, H. Kirchhoff, and S. Dixon, “YourMT3+: Multi-Instrument Music Transcription with Enhanced Transformer Architectures and Cross-Dataset STEM Augmentation,” in 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), 2024, pp. 1–6.
[4] A. Wiggins and Y. Kim, “Guitar Tablature Estimation With a Convolutional Neural Network,” in Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), 2019, pp. 284–291.
[5] K. Bello and P. Mayol, “Classification of Acoustic Guitar Strum using Convolutional Neural Networks and Long-Short-Term-Memory,” Philippine e-Journal for Applied Research and Development, vol. 9, pp. 49–57, 2019.
[6] S. Matsushita and D. Iwase, “Detecting Strumming Action While Playing Guitar,” in Proceedings of the 2013 International Symposium on Wearable Computers, 2013, pp. 145–146.
[7] S. Freire, G. Santos, A. Armondes, E. Meneses, and M. Wanderley, “Evaluation of Inertial Sensor Data by a Comparison With Optical Motion Capture Data of Guitar Strumming Gestures,” Sensors, vol. 20, no. 19, p. 5722, 2020.
[8] S. Murgul and M. Heizmann, “A Multimodal Approach to Acoustic Guitar Strumming Action Transcription,” in Extended Abstracts for the Late-Breaking Demo Session of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022.
[9] Waveshare. (2025) Esp32-s3 touch lcd 1.28”. https://www.waveshare.com/esp32-s3-touch-lcd-1.28.htm. (accessed Feb. 28, 2025).
[10] D. Kleppner and R. J. Kolenkow, An Introduction To Mechanics, 2nd ed. Cambridge, UK: Cambridge University Press, 2014.
[11] D. Sam a. (2025) Schlagmuster für Gitarre. https://www.gitarrenpark.de/blog/schlagmuster-gitarre-strumming-patterns/. (accessed Feb. 28, 2025).
[12] E. Swanson. (2025) Strumming Patterns. https://www.eriksguitarlessons.com/wp-content/uploads/2015/02/Strumming-Patterns-for-Guitar1.pdf. (accessed Feb. 28, 2025).
[13] S. Böck and G. Widmer, “Maximum Filter Vibrato Suppression for Onset Detection,” in Proceedings of the 16th International Conference on Digital Audio Effects (DAFx), 2013, p. 4.
[14] D. Braun, “DawDreamer: Bridging the Gap Between Digital Audio Workstations and Python Interfaces,” in Extended Abstracts for the Late-Breaking Demo Session of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 2021.
[15] Y. Zang, Y. Zhong, F. Cwitkowitz, and Z. Duan, “Synthtab: Leveraging Synthesized Data for Guitar Tablature Transcription,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1286–1290.
[16] P. Sobot, “Pedalboard,” Jul. 2021. [Online]. Available: https://doi.org/10.5281/zenodo.7817838
[17] Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3707–3717, 2021.
[18] I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” in International Conference on Learning Representations (ICLR), 2017.
[19] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel, “MIR_EVAL: A Transparent Implementation of Common MIR Metrics,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014, p. 2014.
[20] J. P. Bello, C. Duxbury, M. Davies, and M. Sandler, “On the Use of Phase and Energy for Musical Onset Detection in the Complex Domain,” IEEE Signal Processing Letters, vol. 11, no. 6, pp. 553–556, May 2004.
[21] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “Librosa: Audio and Music Signal Analysis in Python,” SciPy, vol. 2015, pp. 18–24, 2015.
[22] F. Korzeniowski and G. Widmer, “Feature Learning for Chord Recognition: The Deep Chroma Extractor,” in Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR), 2016.
[23] J. Park, K. Choi, S. Jeon, D. Kim, and J. Park, “A Bi-Directional Transformer for Musical Chord Recognition,” in Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), 2019.