Joint Transcription of Acoustic Guitar Strumming Directions and Chords

Author: Sebastian Murgul; Johannes Schimper; Michael Heizmann

Publisher: Zenodo

DOI: 10.5281/zenodo.17706490

Source: https://zenodo.org/records/17706490/files/000055.pdf

JOINT TRANSCRIPTION OF ACOUSTIC GUITAR STRUMMING
DIRECTIONS AND CHORDS
Sebastian Murgul1,2, Johannes Schimper2, Michael Heizmann2
1 Klangio GmbH, Karlsruhe, Germany
2 Karlsruhe Institute of Technology, Karlsruhe, Germany
ABSTRACT
Automatic transcription of guitar strumming is an underrepresented and challenging task in Music Information Retrieval (MIR), particularly for extracting both strumming directions and chord progressions from audio signals. While existing methods show promise, their effectiveness is often hindered by limited datasets. In this work, we extend a multimodal approach to guitar strumming transcription by introducing a novel dataset and a deep learning-based transcription model. We collect 90 min of real-world guitar recordings using an ESP32 smartwatch motion sensor and a structured recording protocol, complemented by a synthetic dataset of 4 h of labeled strumming audio. A Convolutional Recurrent Neural Network (CRNN) model is trained to detect strumming events, classify their direction, and identify the corresponding chords using only microphone audio. Our evaluation demonstrates significant improvements over baseline onset detection algorithms, with a hybrid method combining synthetic and real-world data achieving the highest accuracy for both strumming action detection and chord classification. These results highlight the potential of deep learning for robust guitar strumming transcription and open new avenues for automatic rhythm guitar analysis.
1. INTRODUCTION
Automatic music transcription is a key task in Music Information Retrieval (MIR), aiming to convert audio signals into symbolic representations. For the transcription of solo instrument music, numerous new approaches and tools have been proposed over the last years [1]. While classical note-tracking models such as [2], [3], and [4] perform well for fingerpicking, they are not designed to predict strumming directions. These models focus on individual note onsets and often struggle with the dense polyphony and rhythmic structure of strumming, where the emphasis lies on chord-level articulation. This limitation highlights the need for a dedicated strumming transcription system with applications in music education, DAW plugins, and notation software.
© S. Murgul, J. Schimper, and M. Heizmann. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: S. Murgul, J. Schimper, and M. Heizmann, “Joint Transcription of Acoustic Guitar Strumming Directions and Chords”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Research on guitar strumming transcription has primarily followed two main approaches: audio-based classification and sensor-based motion analysis. In 2019, Bello et al. proposed a neural network-based classification system to distinguish between up and down strokes using Mel-Frequency Cepstral Coefficient (MFCC) segments as input features [5]. Their approach achieved a classification accuracy of 72.5 % for a Convolutional Neural Network (CNN) and 70 % for a Long Short-Term Memory (LSTM) model. Earlier, in 2013, Matsushita et al. developed a wristwatch-like device designed to analyze down-strumming actions in terms of note timing and intensity [6]. More recently, Freire et al. (2020) explored strumming gestures in greater detail using inertial measurement units (IMUs) and motion capture technology, further advancing sensor-based analysis of guitar performance [7]. A multimodal approach was introduced in 2022 by Murgul et al., who combined a back-of-hand-mounted motion sensor with guitar pickup audio for strumming action transcription [8]. Their method involved recording a small manually labeled dataset, which was used to evaluate algorithmic annotation techniques based on onset detection in the pickup signal and thresholding the first-order derivative of the motion data.
Building on the approach in Murgul et al., we extend the multimodal approach to create a bigger and more diverse dataset in order to train a neural network. We increase the dataset size from 5 min to 90 min and from 4 chords to 24 chords (major / minor) while also adding more complex strumming rhythms and performance parameter variations. To this end, an improved hand motion sensor based on an off-the-shelf ESP32 smartwatch module is developed, and a sophisticated recording plan with specific instructions for the players is created. A new guitar strumming dataset is recorded by three guitar players using this approach and semi-automatically annotated using the multimodal information. While the semi-automatic annotation process is scalable, the recording process still takes some time. Therefore, to complement the real-world dataset, we present a guitar strumming data synthesis approach that is used to generate an additional 4 h of labeled strumming audio. These datasets are then used to train a CRNN model to automatically detect strumming events and classify the strumming direction as well as the played chord solely from microphone audio. Finally, the transcription results are evaluated using the test split of the real strumming recordings and compared with baseline algorithms.
2. MULTIMODAL STRUMMING RECORDING
2.1 Motion Recording Hardware
To capture hand movement and, consequently, the strumming direction, a compact and lightweight system is required that can be attached to the playing hand. It must enable wireless communication for transmitting motion data and be capable of starting and stopping audio recordings on a computer via wireless commands. Additionally, the system should be intuitive for guitarists to use. For scalable applications, the solution should be cost-efficient. The ESP32-S3-Touch-LCD-1.28 module from Waveshare meets these requirements and serves as the central microcontroller [9]. It features a 3-axis accelerometer (QMI8658), a LiPo battery connector with battery management, and supports the wireless standards Wi-Fi and Bluetooth Low Energy (BLE). Furthermore, the module includes an LCD screen with touch functionality and a compact form factor.
A custom 3D-printed enclosure enables a watch-like attachment on the back of the hand. The enclosure also houses a 350 mAh LiPo battery, as shown in Figure 1a. Figure 1b illustrates the sensor system attached to the back of the hand.
Figure 1. Hand sensor in its enclosure: (a) backside of the hand sensor without cover; (b) sensor attached to the back of the hand.
Hand movement is described using a simplified model like in [8], in which the hand performs a semicircular motion around the elbow. The $x$-axis runs along the back of the hand, orthogonal to the fingers, while the $y$-axis is orthogonal to the $x$-direction, pointing towards the fingertips. The relevant acceleration components are gravitational acceleration $A_g$, centripetal acceleration $A_\text{centripetal}$, and tangential acceleration $A_\text{tangential}$ [10]. The spatial orientations recorded by the sensor, along with the measured accelerations for different hand positions, are shown in Figure 2. The centripetal acceleration acts exclusively in the $y$-direction, while the tangential acceleration occurs along the $x$-axis. Consequently, the acceleration $A_x$ in the $x$-direction and $A_y$ in the $y$-direction are given by

$$A_x = A_\text{tangential} + A_g \cdot \cos(\varphi) \tag{1}$$
$$A_y = A_\text{centripetal} + A_g \cdot \sin(\varphi) \tag{2}$$

where $\varphi$ is the angle relative to the horizontal axis, ranging from −90° to 90°. For slow, quasi-stationary hand movements, $A_x$ ranges from −1 g to 0 g, while $A_y$ takes values between −1 g and 1 g. Due to the symmetry properties of the sine function, $A_x$ alone cannot determine the movement direction. However, by differentiating the acceleration in the $y$-direction, the movement direction can be inferred. A negative gradient corresponds to an upward motion, while a positive gradient corresponds to a downward motion.

Figure 2. Motion model of the sensor.

In non-stationary cases, such as during strumming, both tangential and centripetal acceleration contribute to $A_x$ and $A_y$ respectively alongside the gravitational acceleration. The $y$-direction experiences an additional, constant centripetal acceleration. Because our method relies on acceleration derivatives, the constant centripetal acceleration can be ignored.
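The quasi-stationary case of the motion model can be illustrated numerically. The following is a minimal sketch, assuming NumPy, zero tangential and centripetal terms, and a gravitational term normalized to −1 g (the sign is an assumption chosen so that $A_x$ stays within the stated range of −1 g to 0 g). It only shows that $A_x$ takes the same values at $+\varphi$ and $-\varphi$, whereas the gradient of $A_y$ changes sign with the sweep direction; it does not assign up or down labels.

```python
import numpy as np

A_G = -1.0  # gravitational term in units of g; sign is an assumption (keeps A_x in [-1, 0])

def quasi_static_acceleration(phi_deg):
    """Eqs. (1)-(2) with the tangential and centripetal terms set to zero."""
    phi = np.deg2rad(phi_deg)
    return A_G * np.cos(phi), A_G * np.sin(phi)

# Two hypothetical slow sweeps over the same angular range, in opposite directions.
sweep_a = np.linspace(-90.0, 90.0, 181)
sweep_b = sweep_a[::-1]

ax_a, ay_a = quasi_static_acceleration(sweep_a)
ax_b, ay_b = quasi_static_acceleration(sweep_b)

# A_x is an even function of phi, so its values at +phi and -phi coincide.
ax_is_symmetric = np.allclose(ax_a, ax_a[::-1])

# The gradient of A_y keeps one sign per sweep, and the sign flips with the direction.
grad_a = np.sign(np.diff(ay_a))
grad_b = np.sign(np.diff(ay_b))
```

Which sign corresponds to which strumming direction depends on how the sensor is mounted, so that mapping is deliberately left out of the sketch.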
2.2 Recording Process
Table 1 gives an overview of the playing instructions given to the guitarists. To compile the datasets, 28 different strumming patterns in 4/4 time signature based on [11, 12] are used, ranging from rhythmically simple to complex syncopated patterns. The patterns vary in parameters like tempo (60, 80, 100 BPM), chord progressions, playing style (plectrum, finger), and volume (soft, medium, loud). The variations were determined randomly based on a uniform distribution.
| Parameter | Values |
|---|---|
| Pattern | 28 patterns |
| Tempo | 60 BPM, 80 BPM, 100 BPM |
| Movement | little, normal, large |
| Volume | quiet, medium, loud |
| Technique | finger, pick |
| Chords | major and minor chord progressions |

Table 1. Overview of the playing instructions given to the guitarists.
The data collection was conducted with three guitarists, including a professional guitar teacher and two experienced amateur guitarists. The strumming patterns were played for 60 s each to a metronome, following the predefined parameters. Simultaneously, audio recordings from the guitar pickup and acceleration data were captured. Additionally, the guitarists’ playing was recorded using the microphone on an iPhone 15 Pro. Synchronization of both audio signals was performed using cross-correlation. The total recording duration amounts to 90 min. Due to the lightweight design and the mounting position on the back of the hand, the guitar players’ fingers were not constrained by the sensor during performance.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
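The cross-correlation synchronization of the two audio signals mentioned above can be sketched as follows. This is a minimal illustration assuming NumPy and synthetic stand-in signals; the function name `estimate_lag` and the test signals are hypothetical, and the source does not state which implementation was used.

```python
import numpy as np

def estimate_lag(reference, delayed):
    """Return the lag (in samples) at which `delayed` best matches `reference`."""
    ref = reference - reference.mean()
    sig = delayed - delayed.mean()
    corr = np.correlate(sig, ref, mode="full")
    # Index len(ref) - 1 of the full correlation corresponds to zero lag.
    return int(np.argmax(corr)) - (len(ref) - 1)

# Synthetic check: a noise signal and a copy delayed by 37 samples (stand-ins for pickup and phone audio).
rng = np.random.default_rng(0)
pickup = rng.standard_normal(2000)
true_lag = 37
microphone = np.concatenate([np.zeros(true_lag), pickup])[: len(pickup)]

lag = estimate_lag(pickup, microphone)
aligned_microphone = microphone[lag:]  # drop the leading samples (valid for non-negative lags)
```

For recordings of 60 s, a direct correlation like this becomes slow, so an FFT-based correlation would be the more practical choice.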
2.3 Semi-Automatic Annotation
The annotation process involves identifying the onset times and strumming directions within the pickup recordings as well as the synchronization with the motion sensor signal. Instead of relying solely on automated onset detection, the process is optimized by incorporating prior knowledge from the recording plan, which includes tempo, rhythm patterns, chords, and strumming sequences. This structured information allows for a more robust prediction of expected onset times, reducing reliance on purely signal-based onset detection. To determine actual onset times, spectral flux analysis [13] is used to detect significant changes in the audio signal. However, since the guitarist does not necessarily start at the exact zero-second mark, a user-assisted graphical interface is employed to align the estimated onsets with the theoretical pattern. The process involves selecting the actual start time and iteratively adjusting until the detected onsets align with the expected timing based on the metronome.
Strumming direction is determined using acceleration data, which is synchronized with the audio signal. Since transmission latency and system delays introduce a time offset between the audio and acceleration data, manual adjustments are required. An interactive visualization displays both spectral flux and differentiated acceleration, allowing users to shift the acceleration data until the peaks of acceleration derivatives align with the detected onsets. To assign strumming direction, peaks in the acceleration derivative corresponding to upward and downward hand movements are matched with detected onset peaks in spectral flux. If the acceleration derivative is positive at an onset time, it is labeled as an up strum; if negative, it is labeled as a down strum.
Next, we use the a priori information from the recording plan to automatically correct the annotations and add chord labels. Since we use a metronome, it can be assumed that the rhythmical pattern is played consistently enough to interpolate missed strumming events. Finally, the annotated data undergoes manual validation and correction by a human annotator. The annotator visually inspects and adjusts the detected onsets and strumming directions using an interactive graphical interface.
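How the recording plan yields expected onset times, and how the sign rule assigns a direction, can be sketched in a few lines. This is an illustrative sketch only: the 16-character pattern string, the function names, and the example pattern are hypothetical, and the 16th-note grid is an assumption borrowed from the synthesis section, not from the recording plan. The sign rule is the one quoted in this section.

```python
def expected_onsets(pattern, tempo_bpm, start_time, bars=1):
    """Expected strum times (s) for a pattern on a 16th-note grid in 4/4.

    `pattern` is a hypothetical 16-character string per bar:
    'D' = down strum, 'U' = up strum, '-' = no strum.
    """
    sixteenth = 60.0 / tempo_bpm / 4.0  # four 16th notes per quarter-note beat
    events = []
    for bar in range(bars):
        for slot, symbol in enumerate(pattern):
            if symbol != "-":
                events.append((start_time + (bar * 16 + slot) * sixteenth, symbol))
    return events

def label_direction(accel_derivative_at_onset):
    """Sign rule from this section: positive -> up strum, otherwise down strum.

    A derivative of exactly zero falls into the 'D' branch here, which is an arbitrary choice.
    """
    return "U" if accel_derivative_at_onset > 0 else "D"

# One bar at 60 BPM, with the player starting 0.5 s into the recording.
events = expected_onsets("D---D-U---U-D-U-", tempo_bpm=60, start_time=0.5)
```

Because the grid is fully determined by tempo and start time, a missed onset can be filled in from its expected position, which is the interpolation the text refers to.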
3. GUITAR STRUMMING SYNTHESIS
To create a diverse and scalable dataset for training strumming transcription models, we introduce a novel strumming synthesis approach consisting of three stages: strumming tablature sampling, audio rendering, and audio augmentation. This method generates approximately 1000 examples totaling 4 h of audio, which are randomly split into 90 % training, 5 % validation, and 5 % testing sets.
3.1 Strumming Tablature Sampling
The first step involves generating synthetic strumming tablatures, as illustrated in Figure 3. A database of 51 chord progressions in functional notation and 36 strumming patterns defined on a 16th-note grid serves as the foundation for generating variations. Each example is created by randomly selecting a chord progression, transposing it to a random key, and mapping each chord to a fingering from a lookup table. A random strumming pattern and tempo are then applied to create a complete tablature. To introduce natural imperfections, the last note of a strummed chord is randomly dropped in 50 % of cases, simulating playing inconsistencies typical of amateur guitarists. The generated tablatures are stored in the Guitar Pro¹ format, alongside a CSV annotation file containing timing, strumming action, and chord labels.
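The sampling steps above can be sketched with the standard library alone. Everything concrete in this sketch is a hypothetical stand-in: the two progressions, the two patterns, the tempo set, the one-chord-per-bar layout, and the dictionary output replace the actual database of 51 progressions and 36 patterns, the fingering lookup, and the Guitar Pro export, none of which are specified in detail in the text.

```python
import random

# Hypothetical, heavily reduced stand-ins for the progression and pattern databases.
PROGRESSIONS = [["I", "V", "vi", "IV"], ["i", "VI", "III", "VII"]]
PATTERNS = ["D---D-U---U-D-U-", "D---D---D---D-U-"]
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# Semitone offsets from the tonic (simplified; III, VI, VII follow the natural minor scale).
DEGREES = {"I": 0, "i": 0, "III": 3, "IV": 5, "V": 7, "VI": 8, "vi": 9, "VII": 10}

def realize(numeral, key):
    """Map a Roman numeral to a chord name in the given key; lowercase numerals are minor."""
    root = NOTE_NAMES[(key + DEGREES[numeral]) % 12]
    return root + ("m" if numeral.islower() else "")

def sample_tablature(rng, drop_probability=0.5):
    progression = rng.choice(PROGRESSIONS)
    key = rng.randrange(12)                      # transpose to a random key
    pattern = rng.choice(PATTERNS)
    tempo = rng.choice([60, 80, 100])            # assumed tempo set
    events = []
    for bar, numeral in enumerate(progression):  # one chord per bar (assumption)
        chord = realize(numeral, key)
        for slot, symbol in enumerate(pattern):
            if symbol == "-":
                continue
            events.append({
                "time": (bar * 16 + slot) * 60.0 / tempo / 4.0,
                "action": symbol,
                "chord": chord,
                "drop_last_note": rng.random() < drop_probability,
            })
    return {"tempo": tempo, "key": NOTE_NAMES[key], "events": events}

tab = sample_tablature(random.Random(0))
```

Each event row carries the three fields the text lists for the CSV annotation (timing, strumming action, chord), plus the flag for the randomly dropped last note.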
3.2 Audio Rendering
The synthesized tablatures are rendered into audio using DawDreamer [14] and Ample Sound’s virtual guitar instruments², following a methodology similar to SynthTab [15]. Instead of converting tablatures to MIDI, we use .fxp preset files to load the Guitar Pro notation directly into the virtual instrument engine. This way, up and down stroke information can be input from the tablature. To enhance realism, rendering parameters are randomized, including the blend between virtual microphones and the amount of fret noise introduced. The final output is saved as a 44.1 kHz WAV file. Since the rendering process introduces an average 40 ms latency, this delay is accounted for in the dataset annotations to maintain synchronization accuracy.
3.3 Audio Augmentation
To further improve realism and variability, a post-processing step applies a chain of effects using the Pedalboard library [16]. The augmentation pipeline introduces controlled distortions and environmental factors to better simulate real-world recordings. The processing chain includes distortion, high- and low-pass filtering, and compression to mimic tonal variations across different recording conditions. To simulate room acoustics, a convolutional reverb effect is applied. Additional background noise layers, including ambient recordings (traffic, weather, and living room sounds) and white noise, are incorporated to model microphone imperfections and noisy environments. Finally, short bursts of fretting sounds and percussive noises, such as light tapping or clapping, are injected at random intervals to emulate natural guitar handling. The effect parameters, such as signal-to-noise ratio (SNR), filter cut-off frequencies, and dry/wet mix ratios, are randomized to ensure broad generalization.
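One element of this pipeline, mixing in background noise at a randomized SNR, can be sketched with plain NumPy. This is not the Pedalboard chain described above; it is a stand-in for the noise-layer step only, and the sine tone, the white-noise source, and the SNR range of 10 to 40 dB are assumptions made for illustration.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that the signal-to-noise ratio equals `snr_db`, then add it."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return signal + noise * np.sqrt(target_noise_power / noise_power)

rng = np.random.default_rng(0)
sr = 44100
t = np.arange(sr) / sr
dry = 0.5 * np.sin(2 * np.pi * 196.0 * t)   # stand-in for one second of rendered audio
white = rng.standard_normal(sr)             # white-noise layer
snr_db = rng.uniform(10.0, 40.0)            # hypothetical randomization range
noisy = mix_at_snr(dry, white, snr_db)
```

Drawing `snr_db` anew for every example is what spreads the training data over quiet and noisy conditions; the same pattern would apply to ambient recordings in place of `white`.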
4. MODEL
Our model builds upon the Convolutional Recurrent Neural Network (CRNN) architecture proposed by Kong et al. [17] for piano transcription. Unlike traditional classification-based approaches that estimate a discrete piano roll representation, this method employs a regression-based strategy to predict the time of the next onset or offset event. This design allows for more precise onset estimations beyond the limitations of fixed frame step sizes, while also increasing robustness against minor misalignments in onset label annotations during training.

Figure 3. Flow chart of the strumming tablature sampling process: 51 chord progressions → transpose to random key → create chord tablature (using chord fingerings) → apply strumming pattern with tempo (from 36 strumming patterns) → drop last note → GP5 and annotations.

¹ See https://www.guitar-pro.com for more information.
² Available at https://amplesound.net/en/index.asp.
4.1 Pre-Processing
The input audio is resampled to 16 kHz and segmented into overlapping 10 s clips with a hop size of 1 s to enhance data diversity. Each segment is converted into a logarithmic Mel spectrogram, which serves as the input representation for the neural network. The spectrogram is computed using a window size of 2048 samples and a hop size of 160 samples, resulting in a time-frequency representation with 229 frequency bins, starting at a minimum frequency of 30 Hz. To improve generalization, random pitch shifts in the range [−6, 6] semitones are applied during training, with chord labels transposed accordingly. The overlapping segmentation and augmentation ensure robust feature learning across diverse strumming patterns.
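The segmentation arithmetic implied by these numbers can be written out directly. This is a small sketch using only the values stated above; how a trailing partial clip is handled is not specified in the text, so dropping it here is an assumption, as are the constant and function names.

```python
SAMPLE_RATE = 16_000   # Hz, after resampling
CLIP_SECONDS = 10
CLIP_HOP_SECONDS = 1
STFT_HOP = 160         # samples between spectrogram frames

def clip_starts(num_samples):
    """Start indices (in samples) of all full-length overlapping clips."""
    clip_len = CLIP_SECONDS * SAMPLE_RATE
    hop_len = CLIP_HOP_SECONDS * SAMPLE_RATE
    if num_samples < clip_len:
        return []  # assumption: recordings shorter than one clip are skipped
    return list(range(0, num_samples - clip_len + 1, hop_len))

frames_per_second = SAMPLE_RATE // STFT_HOP                 # spectrogram frame rate
frames_per_clip = CLIP_SECONDS * SAMPLE_RATE // STFT_HOP    # frames per clip, ignoring edge padding
starts = clip_starts(60 * SAMPLE_RATE)                      # one 60 s recording
```

With a hop of 160 samples at 16 kHz, each frame covers 10 ms, so one 60 s recording yields many heavily overlapping 10 s training clips.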
4.2 Architecture
The model consists of two main components: a strumming onset regression network and a chord classification network. The input Mel spectrogram is first processed by a convolutional layer stack (ConvStack) designed to capture time-frequency features. The structure of the ConvStack follows the design in [17] and consists of four convolutional blocks. Each block contains two convolutional layers with identical kernel sizes, followed by a pooling operation that reduces the spectral dimension while preserving temporal information. After the final convolutional block, the extracted features are flattened for subsequent processing. The flattened feature representation is passed through a fully connected (FC) layer before being fed into a bidirectional GRU (biGRU) layer with 256 units. The output of the biGRU is then passed through another fully connected layer, which generates regression values for up strums and down strums.
In parallel to the onset regression, a separate chord feature extraction stack processes the input spectrogram in a similar manner. Since chord labels are only available at strumming event times, the outputs of both networks are merged before passing through an additional biGRU and fully connected layer to produce final classification logits for 24 major and minor chord classes. Figure 4 provides an overview of the full model architecture.

Figure 4. Joint strumming action detection and chord recognition network using logarithmic Mel spectrogram as input feature. Input: log Mel spectrogram ($T \times 512$). Action regression branch: ConvStack → FC ($c=768$) → biGRU ($c=256$) → FC ($c=2$), output $T \times 2$. Chord classification branch: ConvStack → FC ($c=768$) → biGRU ($c=256$) → FC ($c=24$), followed by a further biGRU ($c=256$) and FC ($c=24$), output $T \times 24$.

Figure 5. Structure of strumming action onset regression labels: frames at offsets $\Delta_{-2}, \ldots, \Delta_{2}$ around a strumming action receive the targets $g(\Delta_{-2}), \ldots, g(\Delta_{2})$; frames further away receive 0.
4.3 Regression Targets
Instead of relying on binary frame-based labels, a regression-based approach is used to determine strumming actions, as illustrated in Figure 5. The regression target function $g(\Delta_i) \in [0, 1]$ encodes the time difference to the next strumming action onset $\Delta_i$, where $i$ is the index of a frame, using a triangular distribution. The target is defined as

$$g(\Delta_i) = \begin{cases} 1 - \dfrac{|\Delta_i|}{J\Delta}, & |i| \le J \\ 0, & |i| > J \end{cases} \tag{3}$$

where $\Delta$ denotes the frame hop size and $J$ is a hyperparameter that controls the sharpness of the regression labels, which is set to 5 in our experiments. The loss function consists of two components: one for strumming onset regression and another for chord classification. The strumming action regression loss $l_\text{action}$ is calculated from the regression output $R_\text{action}$ and the target $G_\text{action}$ by

$$l_\text{action} = \sum_{t=1}^{T} \sum_{k=1}^{K} l_\text{bce}\big(G_\text{action}(t, k), R_\text{action}(t, k)\big), \tag{4}$$

where $l_\text{bce}$ represents the binary cross-entropy loss, $T$ is the number of time steps, and $K$ denotes the number of strumming action categories. For chord classification, a similar loss function is used on the prediction outputs $P_\text{chord}$ and the targets $G_\text{chord}$:

$$l_\text{chord} = \sum_{t=1}^{T} \sum_{c=1}^{C} l_\text{bce}\big(G_\text{chord}(t, c), P_\text{chord}(t, c)\big), \tag{5}$$

where $C$ represents the number of possible chord labels. The total loss function used during training is simply the sum of both components:

$$l = l_\text{action} + l_\text{chord}. \tag{6}$$
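The triangular targets of Eq. (3) and the summed binary cross-entropy of Eqs. (4) and (5) can be sketched in NumPy for a single action class. The hop of 10 ms follows from the 160-sample hop at 16 kHz; the function names are hypothetical, and two choices are assumptions not specified in the text: the condition $|i| \le J$ is realized by clipping the triangle at zero, and overlapping triangles from neighbouring onsets are combined with a maximum.

```python
import numpy as np

def regression_targets(onset_times, num_frames, hop_seconds=0.01, J=5):
    """Triangular targets of Eq. (3): 1 - |delta| / (J * hop) within J frames of an onset."""
    frame_times = np.arange(num_frames) * hop_seconds
    target = np.zeros(num_frames)
    for onset in onset_times:
        delta = np.abs(frame_times - onset)                   # |Delta_i| for every frame
        tri = np.clip(1.0 - delta / (J * hop_seconds), 0.0, 1.0)
        target = np.maximum(target, tri)                      # assumption: nearest onset wins
    return target

def summed_bce(target, prediction, eps=1e-7):
    """Binary cross-entropy summed over all entries, as in Eqs. (4) and (5)."""
    p = np.clip(prediction, eps, 1.0 - eps)
    return float(-np.sum(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

# One onset at 0.123 s, which falls between the frames at 0.12 s and 0.13 s.
g = regression_targets([0.123], num_frames=30)
loss_total = summed_bce(g, g) + 0.0   # Eq. (6) would add the chord loss term here
```

Because the target values encode the sub-frame distance to the onset, the two frames around 0.123 s receive different values, which is what lets the onset time be estimated more finely than the 10 ms frame grid.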
The model is trained using the AdamW optimizer [18] with an initial learning rate of $10^{-4}$. The training process is run for 20,000 steps with a batch size of 6. On an NVIDIA Tesla V100 GPU, training takes approximately 2 h.
5. EXPERIMENTS AND RESULTS
This section evaluates the performance of our proposed method for strumming onset detection, direction classification, and chord recognition. We begin by assessing the detection accuracy using guitar pickup signals, followed by an evaluation of real-world microphone recordings. Finally, we analyze the effectiveness of pitch shift augmentation and compare our chord recognition with existing approaches.
Model performance is measured using precision, recall, and F1-score for strumming detection. Specifically, we report these metrics for down strums ($F1_\text{down}$), up strums ($F1_\text{up}$), and strumming class agnostic ($F1_\text{any}$). A 50 ms tolerance window is used, following the mir_eval library [19].
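The onset metrics with a tolerance window can be illustrated with a dependency-free sketch. This is a simplified stand-in rather than the mir_eval implementation: it uses a greedy two-pointer matching on sorted onset lists, and the function name and example onset times are made up. In practice, `mir_eval.onset.f_measure` with `window=0.05` would be the call to use.

```python
def onset_scores(reference, estimated, window=0.05):
    """Precision, recall and F1 with one-to-one matching inside `window` seconds."""
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = matched = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= window:
            matched += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1          # estimate is too early to match any remaining reference
        else:
            i += 1          # reference has no estimate close enough
    precision = matched / len(est) if est else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

# Made-up example: two of four reference onsets are hit, three estimates are spurious.
p, r, f1 = onset_scores([0.50, 1.00, 1.50, 2.00], [0.52, 1.08, 1.49, 2.60, 3.00])
```

Class-specific scores such as $F1_\text{down}$ and $F1_\text{up}$ follow by running the same function on the onsets of one direction only.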
5.1 Results on Guitar Pickup Signals
In our first experiment, we explore the performance of our model directly on the guitar pickup signals. We use two of the guitarists we recorded to train our model and evaluate on the third guitarist. We compare the detection quality of our trained model with the common onset detection functions spectral flux [13], SuperFlux [13], and Complex Domain Onset Detection Function (CD-ODF) [20]. For spectral flux and SuperFlux, we use the implementation given in the librosa library [21]. The resulting precision, recall, and F1-score for any strumming direction are shown in Table 2 for comparison. Of the onset detection functions, spectral flux offers the best detection results, directly followed by the CD-ODF. Compared with spectral flux and SuperFlux, the CD-ODF offers a noticeably higher recall. Therefore, it might be suitable for an active learning labeling scenario. Our model outperforms the onset detection functions in all three metrics: precision, recall, and F1-score. By achieving an F1-score of about 98 %, the model is quite capable of reliably detecting the strumming actions in the pickup signal.

| Method | F1any | Pany | Rany |
|---|---|---|---|
| Spectral Flux [13] | 79.49 % | 78.53 % | 81.86 % |
| SuperFlux [13] | 74.36 % | 77.04 % | 73.36 % |
| CD-ODF [20] | 79.32 % | 68.50 % | 98.15 % |
| Ours | 97.60 % | 96.54 % | 98.73 % |

Table 2. Strumming detection results on pickup audio.
By matching the detected strumming onsets with the movement data from the hand sensor, the strumming direction can also be determined. In Table 3, we compare the results of the multimodal algorithmic approach with our CRNN model. For all approaches, the F1-score for down strums is higher than for up strums. Our CRNN model outperforms the algorithmic approaches for the down strum as well as the up strum class, whereby the increase is especially noticeable for up strum events. Combining the CRNN detection with the acceleration-based classification leads to the overall best results. Therefore, the labeling could be automated quite efficiently by using a hybrid approach with the pickup audio signal to detect the events in the audio and the motion sensor data to get the strumming event class algorithmically.

| Methods | F1any | F1down | F1up |
|---|---|---|---|
| Spectral Flux [13] | 79.49 % | 85.40 % | 68.60 % |
| SuperFlux [13] | 74.36 % | 84.40 % | 67.80 % |
| CD-ODF [20] | 79.32 % | 82.20 % | 78.40 % |
| Ours | 97.60 % | 87.87 % | 84.90 % |
| Ours + Sensor | 97.60 % | 90.02 % | 88.66 % |

Table 3. Strumming event detection results by class. The onset detection function results are paired with the hand movement signal in order to classify the events.
5.2 Results on Microphone Recordings
Next, we examine the action detection performance on the real-world microphone data. The real-world audio contains overall more noise, reverb and ambient sounds. The detection performance for different training dataset constellations (synthetic (Sy), microphone exclusively (Ph), microphone and pickup (Ph + Pi), and all three datasets (Sy + Ph + Pi)) is compared in Table 4. The $F1_\text{any}$ results for all datasets lie in a similar range. The synthetic dataset achieves about 5 % better results than when only using the comparably small training dataset of real-world phone recordings. When the pickup audio dataset is used in addition to the microphone recordings, we see a clear increase across all models. The increase is especially significant for up strums. In general, training on real-world data classifies the strumming direction significantly better than training on the synthetic dataset exclusively: for up strums, we see a relative increase of over 40 % ($F1_\text{up}$ of 75.10 % for Ph + Pi versus 52.64 % for Sy). Therefore, reliable onset detection itself can be trained from synthetic examples alone, but the classification of the strumming action profits from real-world audio. The best overall results are obtained by combining the synthetic dataset with the microphone and pickup dataset. This indicates that increasing the real-world dataset in additional recording sessions might yield further improvements. Interestingly, fine-tuning a checkpoint pretrained on synthetic data on the phone and pickup data leads to worse results than joining all three training datasets.

| Training Data | F1any | Rany | Pany | F1down | Rdown | Pdown | F1up | Rup | Pup |
|---|---|---|---|---|---|---|---|---|---|
| Sy | 89.77 % | 89.47 % | 90.56 % | 73.92 % | 75.00 % | 74.04 % | 52.64 % | 56.99 % | 51.04 % |
| Ph | 85.06 % | 84.11 % | 86.12 % | 79.90 % | 78.70 % | 81.42 % | 66.81 % | 67.52 % | 67.88 % |
| Ph + Pi | 89.45 % | 88.37 % | 90.64 % | 82.94 % | 83.72 % | 82.40 % | 75.10 % | 73.17 % | 78.24 % |
| Sy + Ph + Pi | 92.75 % | 92.50 % | 93.25 % | 85.51 % | 85.87 % | 85.43 % | 79.02 % | 81.15 % | 77.80 % |

Table 4. Results on microphone audio trained on various combinations of the synthetic dataset (Sy), real-world pickup audio (Pi), and real-world microphone recordings (Ph).
5.3 Effect of Pitch Shift Augmentation
In the model pre-processing we perform data augmentation in the form of a random pitch shift before calculating the input spectrogram. The effect of the pitch shift augmentation is studied using the training on the combined phone and pickup dataset. The results of this experiment are shown in Table 5. Applying a max pitch shift of 6 semitones leads to the best results. The F1-score for down strums increases by 10 % and up strums F1-score by 14 %. While the pitch shift introduces more artifacts as the note shift increases, it also increases the diversity of chords used and therefore helps the model generalize.

| Max Pitch Shift | F1any | F1down | F1up |
|---|---|---|---|
| None | 81.15 % | 71.04 % | 55.80 % |
| ±3 semitones | 85.06 % | 79.10 % | 71.99 % |
| ±6 semitones | 89.45 % | 82.94 % | 75.10 % |
| ±12 semitones | 85.90 % | 80.89 % | 72.25 % |

Table 5. Effect of the max pitch shift parameter in the pre-processing step on the strumming detection performance.
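Pitch-shifting the audio only helps chord recognition if the chord labels move with it. The label side of this augmentation can be sketched as below. The 24-class index layout (0 to 11 major, 12 to 23 minor), the function names, and the restriction to integer semitone shifts are all assumptions made for illustration; the text only states the shift range and that labels are transposed accordingly.

```python
import random

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chord_to_index(chord):
    """Hypothetical 24-class encoding: 0-11 major roots, 12-23 minor roots."""
    is_minor = chord.endswith("m")
    root = chord[:-1] if is_minor else chord
    return NOTE_NAMES.index(root) + (12 if is_minor else 0)

def transpose_index(index, semitones):
    """Shift the root by `semitones` while keeping the major/minor quality."""
    quality_offset = 12 * (index // 12)
    return quality_offset + (index % 12 + semitones) % 12

def sample_pitch_shift(rng, max_shift=6):
    """Draw an integer shift in [-max_shift, max_shift], inclusive on both ends."""
    return rng.randint(-max_shift, max_shift)

shift = sample_pitch_shift(random.Random(0))
shifted_label = transpose_index(chord_to_index("Am"), shift)
```

The modulo arithmetic is also why shifts expose the model to all 12 roots of each quality, which matches the observation that augmentation increases the diversity of chords seen in training.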
5.4 Chord Recognition
While the previous experiments only focused on the strumming action detection and classification, the chord recognition performance is quantified in this experiment and compared with a popular CNN-based [22] and a state-of-the-art transformer model [23]. We use the checkpoints provided by the authors. In contrast to the chord recognition task, where typically a musical piece is segmented into sections of a specific chord, we are interested in assigning a chord to a detected strumming event. Therefore, we use the ground truth strumming action times to determine a chord label. For the training of our own model, we use a maximum pitch shift of 6 semitones. The resulting accuracy scores for the major-minor vocabulary are shown in Table 6. The chord recognition transformer model and our model trained on the combined dataset achieve the best results of about 90 %. The CNN-based chord tracking shows the weakest performance. In contrast to the strumming action detection, our model trained on the synthetic dataset alone performs significantly better than with only the smaller real-world dataset. Training on all three datasets further increases the performance of our approach.

| Method (Dataset) | Accuracy |
|---|---|
| Deep Chroma Chord Recognition [22] | 80.37 % |
| Chord Recognition BTC [23] | 89.21 % |
| Ours (Sy) | 87.84 % |
| Ours (Ph + Pi) | 81.52 % |
| Ours (Sy + Ph + Pi) | 90.06 % |

Table 6. Results of chord recognition on the microphone audio of the real-world recordings.
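Event-wise chord evaluation, where a chord is read off at each ground-truth strumming time rather than scored per segment, can be sketched as follows. The frame rate of 100 frames per second follows from the pre-processing settings; the nearest-frame lookup, the function names, and the toy labels are assumptions, since the text does not describe how frame-wise outputs are sampled.

```python
def chords_at_strums(frame_chords, strum_times, frames_per_second=100):
    """Read the frame-wise chord label at each strumming time (nearest frame)."""
    labels = []
    for t in strum_times:
        frame = min(int(round(t * frames_per_second)), len(frame_chords) - 1)
        labels.append(frame_chords[frame])
    return labels

def accuracy(predicted, reference):
    """Fraction of strumming events whose predicted chord equals the reference chord."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# Hypothetical frame-wise predictions: 2 s of "C" followed by 2 s of "Am".
frame_chords = ["C"] * 200 + ["Am"] * 200
strum_times = [0.5, 1.5, 2.5, 3.5]
predicted = chords_at_strums(frame_chords, strum_times)
score = accuracy(predicted, ["C", "C", "Am", "G"])
```

The same lookup could be applied to the frame-wise outputs of a segment-based chord recognizer, which is one way such baselines can be compared on a per-strum basis.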
6. CONCLUSION
This study demonstrates the effectiveness of a CRNN-based model for the joint transcription of guitar strumming actions and chords. We introduced a novel approach to strumming synthesis, generating a large dataset of synthetic strumming examples. By extending an existing multimodal strumming transcription framework, we also collected 90 minutes of real-world guitar recordings, enhanced with semi-automatic annotations. The combination of synthetic and real-world datasets allowed us to train a robust transcription model capable of accurately detecting strumming onsets, classifying strumming direction, and identifying chords from microphone audio.
Future work could extend this approach to cover a broader range of rhythmic patterns, including muted strumming events, which pose a challenge for motion-based annotation methods. Additionally, the chord vocabulary, currently limited to major and minor chords, could be expanded to include seventh chords, suspended chords, and other common chord voicings. These improvements would further enhance the versatility and real-world applicability of automatic strumming transcription models.
7. REFERENCES
[1] E. Benetos, S. Dixon, Z. Duan, and S. Ewert, “Automatic Music Transcription: An Overview,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20–30, 2018.
[2] X. Riley, D. Edwards, and S. Dixon, “High Resolution Guitar Transcription via Domain Adaptation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1051–1055.
[3] S. Chang, E. Benetos, H. Kirchhoff, and S. Dixon, “YourMT3+: Multi-Instrument Music Transcription with Enhanced Transformer Architectures and Cross-Dataset STEM Augmentation,” in 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), 2024, pp. 1–6.
[4] A. Wiggins and Y. Kim, “Guitar Tablature Estimation With a Convolutional Neural Network,” in Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), 2019, pp. 284–291.
[5] K. Bello and P. Mayol, “Classification of Acoustic Guitar Strum using Convolutional Neural Networks and Long-Short-Term-Memory,” Philippine e-Journal for Applied Research and Development, vol. 9, pp. 49–57, 2019.
[6] S. Matsushita and D. Iwase, “Detecting Strumming Action While Playing Guitar,” in Proceedings of the 2013 International Symposium on Wearable Computers, 2013, pp. 145–146.
[7] S. Freire, G. Santos, A. Armondes, E. Meneses, and M. Wanderley, “Evaluation of Inertial Sensor Data by a Comparison With Optical Motion Capture Data of Guitar Strumming Gestures,” Sensors, vol. 20, no. 19, p. 5722, 2020.
[8] S. Murgul and M. Heizmann, “A Multimodal Approach to Acoustic Guitar Strumming Action Transcription,” in Extended Abstracts for the Late-Breaking Demo Session of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022.
[9] Waveshare. (2025) Esp32-s3 touch lcd 1.28”. https://www.waveshare.com/esp32-s3-touch-lcd-1.28.htm. (accessed Feb. 28, 2025).
[10] D. Kleppner and R. J. Kolenkow, An Introduction To Mechanics, 2nd ed. Cambridge, UK: Cambridge University Press, 2014.
[11] D. Sam a. (2025) Schlagmuster für Gitarre. https://www.gitarrenpark.de/blog/schlagmuster-gitarre-strumming-patterns/. (accessed Feb. 28, 2025).
[12] E. Swanson. (2025) Strumming Patterns. https://www.eriksguitarlessons.com/wp-content/uploads/2015/02/Strumming-Patterns-for-Guitar1.pdf. (accessed Feb. 28, 2025).
[13] S. Böck and G. Widmer, “Maximum Filter Vibrato Suppression for Onset Detection,” in Proceedings of the 16th International Conference on Digital Audio Effects (DAFx), 2013, p. 4.
[14] D. Braun, “DawDreamer: Bridging the Gap Between Digital Audio Workstations and Python Interfaces,” in Extended Abstracts for the Late-Breaking Demo Session of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 2021.
[15] Y. Zang, Y. Zhong, F. Cwitkowitz, and Z. Duan, “Synthtab: Leveraging Synthesized Data for Guitar Tablature Transcription,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1286–1290.
[16] P. Sobot, “Pedalboard,” Jul. 2021. [Online]. Available: https://doi.org/10.5281/zenodo.7817838
[17] Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3707–3717, 2021.
[18] I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” in International Conference on Learning Representations (ICLR), 2017.
[19] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel, “MIR_EVAL: A Transparent Implementation of Common MIR Metrics,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014, p. 2014.
[20] J. P. Bello, C. Duxbury, M. Davies, and M. Sandler, “On the Use of Phase and Energy for Musical Onset Detection in the Complex Domain,” IEEE Signal Processing Letters, vol. 11, no. 6, pp. 553–556, May 2004.
[21] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “Librosa: Audio and Music Signal Analysis in Python,” SciPy, vol. 2015, pp. 18–24, 2015.
[22] F. Korzeniowski and G. Widmer, “Feature Learning for Chord Recognition: The Deep Chroma Extractor,” in Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR), 2016.
[23] J. Park, K. Choi, S. Jeon, D. Kim, and J. Park, “A Bi-Directional Transformer for Musical Chord Recognition,” in Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), 2019.