UNDERSTANDING PERFORMANCE LIMITATIONS IN AUTOMATIC
DRUM TRANSCRIPTION
Philipp Weyers¹, Christian Uhle¹,², Meinard Müller¹,², Matthias Lang¹
¹ Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, Germany
² International Audio Laboratories Erlangen, Germany
ABSTRACT
Recent advancements in Automatic Drum Transcription (ADT) have improved overall
transcription performance. However, state-of-the-art (SOTA) models still struggle with
certain drum classes, particularly toms and cymbals, and the specific factors limiting
their performance remain unclear. This paper addresses this gap by leveraging the
Separate-Tracks-Annotate-Resynthesize Drums (STAR Drums) dataset to create multiple
dataset versions that systematically eliminate potential performance constraints. We
conduct experiments using three common ADT deep neural network (DNN) architectures to
identify and quantify these limitations. For drum transcription in the presence of
melodic instruments (DTM), the primary limiting factor is interference from melodic
instruments and singing. Aside from this, performance improves by approximately five
percent when training and testing use the same single drum kit, only strong onsets are
present, or notes are not played simultaneously. For drum transcription of drum-only
recordings (DTD), nearly error-free transcription is achieved when simultaneous onsets
are removed. This confirms that overlapping drum hits are the main performance
constraint. By identifying key ADT challenges, we provide insights to enhance SOTA
models and improve overall transcription accuracy.
1. INTRODUCTION
As a sub-field of Automatic Music Transcription (AMT) within the broader field of Music
Information Retrieval (MIR), Automatic Drum Transcription (ADT) focuses on identifying
and classifying drum sounds in audio signals. Drum transcription of drum-only recordings
(DTD) is considered less challenging due to the absence of sounds originating from other
instruments, whereas drum transcription in the presence of melodic instruments (DTM)
presents the challenge of drum sounds potentially being masked by melodic instruments
and singing, or non-drum sounds being misclassified as drums [1].

© P. Weyers, C. Uhle, M. Müller, and M. Lang. Licensed under a Creative Commons
Attribution 4.0 International License (CC BY 4.0). Attribution: P. Weyers, C. Uhle,
M. Müller, and M. Lang, “Understanding Performance Limitations in Automatic Drum
Transcription”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf.,
Daejeon, South Korea, 2025.

Applications of ADT include music education software that provides real-time feedback
to students practicing on acoustic drum kits, and music production tools that use
transcriptions to add or replace drum samples [1].
For these applications, high-quality drum transcription is essential. However,
achieving this requires overcoming several key challenges:
• Interference from melodic instruments and vocals, which can mask drum sounds [1].
• Overlapping drum sounds from different classes, leading to mutual masking [2].
• Weak onsets, which are difficult to detect due to low loudness and energy [3].
• Limited generalization, which affects transcription performance across diverse
  datasets.
We systematically investigate and quantify the impact of these challenges by creating
multiple versions of the Separate-Tracks-Annotate-Resynthesize Drums (STAR Drums)
dataset [4] that simplify the ADT problem. Our main contribution is to provide insights
into performance improvements achievable by systematically eliminating limiting factors
in training and test data. Experiments are conducted for DTM and DTD separately,
supporting the development of more robust ADT algorithms and a deeper understanding of
the problem.
The paper is structured as follows: Section 2 reviews related work, Section 3 describes
the methodology, Section 4 details the experiments, Section 5 presents the results, and
Section 6 concludes.
2. RELATED WORK
The emergence of deep neural networks (DNNs) in ADT improved transcription performance.
In [5], various Convolutional Neural Network (CNN) and Convolutional Recurrent Neural
Network (CRNN) architectures were compared, and later, these were trained on large
amounts of synthetic data generated from MIDI files, combined with smaller manually
labeled datasets [2].
Subsequent works [4,6–9] have cited [2] or [5] as state of the art (SOTA), and explored
improvements using different datasets or alternative DNN architectures.
Few-shot learning (FSL) has been applied to ADT with promising results, though it
required examples of each class at inference time [6]. Dynamic FSL addressed this by
eliminating the need to provide examples of drum classes present in initial training,
while still allowing adaptation to drum sounds at inference time [4].
The authors of [7] and [10] created the Automatic Drums Transcription On Fire (ADTOF)
dataset using crowd-sourced annotations and trained models with CRNN and CNN
architectures incorporating self-attention. Both frame-wise and tatum grid synchronized
models achieved similar performance. Tatum-level attention-based networks were also
effective in [11].
In [8], the A2MD dataset was created using semi-automatic labeling. The proposed models
also evaluated beat information, resulting in a modest performance increase. The authors
of [9] employed a language model to regularize training for suppressing musically
unnatural onsets. While this approach improved performance, the transcription was
limited to the three main drum instruments, bass drum, snare drum, and hi-hat.
While state-of-the-art (SOTA) algorithms for DTM achieve good overall performance
(global F-measures above 0.8 on hand-annotated datasets [3,10]), classes such as toms
and cymbals still show mediocre results.
In [3], the performance of models trained on the ADTOF dataset is analyzed in detail.
Several hypotheses are proposed to explain transcription errors. The issue of soft
onsets being masked is partly investigated by introducing a tempo octave F-measure,
which disregards errors when transcriptions occur at half or double the tempo. The
underlying assumption is that especially cymbals are often played with alternating weak
and strong onsets, resulting in every second weak onset being missed. Additionally,
confusion matrices are used to analyze class confusions, revealing similar problems
identified in [2]: Misclassification frequently occurs among similar sounding
instruments, such as hi-hat and cymbals. Other common errors are linked to weak onsets,
characterized by low loudness, or masking effects.
In this paper, we go beyond describing current SOTA performance by attributing the
remaining performance gap to a perfect transcription to specific performance-limiting
factors. This allows us to target these factors in future work and estimate the
potential maximum performance gains.
3. METHODOLOGY
We take advantage of the STAR Drums dataset, first utilized in [4], to create training
and test data where drum stems are modified to eliminate potential error sources,
thereby progressively reducing the complexity of the ADT task. STAR Drums is created
from audio recordings including melodic instruments, singing, and drums. We either
utilize audio data provided as separate drum and non-drum stems or apply a Music Source
Separation (MSS) algorithm to separate mixture recordings into drum and non-drum stems.
Subsequently, we annotate the drum stem using an ADT algorithm published alongside [2]
and regard this information as estimated reference annotation. We then re-synthesize
the drums by rendering the estimated reference annotations using several virtual drum
kits and normalize the loudness of the re-synthesized drum stem according to
Recommendation ITU-R BS.1770-4 (2015) to match the loudness of the original drum stem.
Finally, the re-synthesized drum stem is mixed with the original non-drum stem to
create the audio signal for training and testing ADT algorithms.
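The level matching and mixing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: plain RMS level is swapped in here as a stand-in for the ITU-R BS.1770-4 loudness measure, and the function names and the list-of-floats signal representation are assumptions made for illustration only.

```python
import math


def rms(signal):
    """Root-mean-square level of a mono signal given as a list of floats."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def match_level_and_mix(resynth_drums, original_drums, non_drums):
    """Scale the re-synthesized drum stem to the level of the original drum
    stem, then add the original non-drum stem sample by sample.

    ASSUMPTION: RMS replaces the BS.1770 loudness measurement used in the
    paper; all names here are hypothetical.
    """
    resynth_level = rms(resynth_drums)
    if resynth_level == 0.0:
        raise ValueError("re-synthesized drum stem is silent")
    gain = rms(original_drums) / resynth_level
    scaled = [gain * x for x in resynth_drums]
    # zip() truncates to the shorter stem; equal lengths are assumed.
    mix = [d + n for d, n in zip(scaled, non_drums)]
    return scaled, mix
```

Returning the scaled drum stem alongside the mix mirrors the fact that both a drum-only and a full-mix signal are needed for the DTD and DTM experiments.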
The input data of STAR Drums originates from MUSDB18 [12], ISMIR04 [13], and
MTG-Jamendo [14]. With data from ISMIR04 (originally for genre classification), STAR
Drums covers a wide range of genres, while Rock and Pop are emphasized due to MUSDB18
and MTG-Jamendo. The data from MUSDB18 is used for validation and testing because it is
already available as stems, thus avoiding biases in the results caused by artifacts
introduced by MSS. By using 60 s excerpts from ISMIR04 and MTG-Jamendo data and full
items from MUSDB18, we achieve reasonable ratios between the lengths of training,
validation, and test splits, corresponding to 114.7 h, 8.3 h, and 1.6 h, respectively.
STAR Drums contains recordings of instruments played by musicians and vocals, unlike
fully synthetic datasets. In [4], STAR Drums outperformed training with Slakh [15],
which uses only synthetic data. STAR Drums also allows full control over the
re-synthesized drum stem, unlike other datasets [7, 8, 10] where separate drum stems
are unavailable and drum sounds cannot be modified. Additionally, the results are not
affected by labeling errors, as only the re-synthesized drum stem, which matches the
estimated annotation exactly, is included. In contrast, the extent of labeling errors
in ADTOF is unknown, and even human annotators often do not fully agree, as
demonstrated in [3].
Leveraging STAR Drums’s flexibility, we create simplified versions of the
re-synthesized drum stem to systematically reduce ADT task complexity, allowing precise
quantification of key limiting factors.
Table 1 presents the five variants of STAR Drums and the specific research questions
they are designed to address. The 20Kits version, serving as the baseline, utilizes 20
virtual drum kits to generate the re-synthesized drum stem. It includes simultaneous
onsets and captures a full dynamic range, with MIDI velocity values from 40 to 127. The
performance of models trained on different STAR Drums variants will be evaluated
relative to this baseline.
For 10Kits, we divide the 20 virtual drum kits into two distinct splits and train a
model on each. To evaluate the impact of identical drum sounds in training and testing,
we first test each model on the split it was trained on. Additionally, we perform 2-way
cross-validation by testing each model on the split it was not trained on, which allows
us to assess performance when the drum sounds in training and testing differ.
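The 10Kits protocol above can be sketched as an enumeration of train/test pairings. The kit identifiers and the interleaved assignment of kits to splits are assumptions for illustration; the text does not state how the 20 kits are distributed over the two splits.

```python
def make_splits(kit_ids, n_splits):
    """Partition kit identifiers into n_splits disjoint groups.

    ASSUMPTION: interleaved assignment; any disjoint partition would do.
    """
    return [kit_ids[i::n_splits] for i in range(n_splits)]


def evaluation_pairs(splits):
    """List every (train split, test split) combination and mark whether
    training and testing share the same drum kits."""
    pairs = []
    for i, train in enumerate(splits):
        for j, test in enumerate(splits):
            pairs.append({"train": train, "test": test, "same_drums": i == j})
    return pairs


# Hypothetical kit names kit_00 ... kit_19.
kits = [f"kit_{k:02d}" for k in range(20)]
splits_10kits = make_splits(kits, 2)
pairs_10kits = evaluation_pairs(splits_10kits)
```

With two splits this yields four evaluations: two with identical drum sounds in training and testing and two cross-validation runs with different drum sounds.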
The 1Kit version is created using a single drum kit. We repeat the dataset creation,
training, and testing with four different drum kits to enhance generalizability and
perform the same evaluation as for 10Kits, assessing performance with both identical
and different drum sounds in training and testing. For clarity, we report only the
averaged performance across all splits for 10Kits and 1Kit.

STAR Drums Variant           Identifier     Research Question
20 drum kits                 20Kits         Baseline
10 drum kits (two splits)    10Kits         How does training and testing with a reduced number of
                                            different drum sounds impact performance?
                                            How does testing with drum sounds not included in
                                            training impact performance?
1 drum kit (four splits)     1Kit           How does training and testing with a reduced number of
                                            different drum sounds impact performance?
                                            How does testing with drum sounds not included in
                                            training impact performance?
20 drum kits -               20KitsNoWeak   How does the absence of weak onsets in training and
No weak onsets                              testing impact performance?
20 drum kits -               20KitsNoSim    How does the absence of simultaneous onsets in training
No simultaneous onsets                      and testing impact performance?
Table 1. Variants of STAR Drums with identifier and corresponding research questions.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

The goal of 10Kits and 1Kit is to examine how the presence of identical versus
different drum sounds in training and testing impacts performance. Additionally, we
assess how reducing the number of drum kits affects transcription accuracy in both
scenarios.
20KitsNoWeak uses all 20 virtual drum kits and contains only strong onsets with high
loudness by limiting the MIDI velocity range to 100 to 127 during dataset creation,
providing insight into the influence of weak onsets, characterized by low loudness, on
transcription performance.
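One possible way to restrict the velocity range is sketched below. The text only states the source range (40 to 127) and the target range (100 to 127); the linear rescaling used here is an assumption, and clipping weak velocities up to 100 would be an equally plausible reading. The function name is hypothetical.

```python
def raise_weak_velocities(velocities, src_min=40, src_max=127,
                          dst_min=100, dst_max=127):
    """Map MIDI velocities from [src_min, src_max] to [dst_min, dst_max].

    ASSUMPTION: a linear rescaling; the exact mapping used for
    20KitsNoWeak is not specified in the text.
    """
    scale = (dst_max - dst_min) / (src_max - src_min)
    remapped = []
    for velocity in velocities:
        # Clamp to the source range before rescaling.
        velocity = min(max(velocity, src_min), src_max)
        remapped.append(round(dst_min + (velocity - src_min) * scale))
    return remapped
```

For instance, `raise_weak_velocities([40, 64, 100, 127])` keeps the relative ordering of the notes while moving all of them into the strong-onset range.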
Lastly, we create the 20KitsNoSim version by using all 20 virtual drum kits and
ensuring a minimum inter-onset interval of 50 ms during the re-synthesis process to
investigate the effect of simultaneous onsets. In cases where multiple onsets occur
within a 50 ms window, we randomly choose one onset and discard the others.
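The thinning rule above can be sketched as follows. How the 50 ms windows are anchored is not specified in the text; this sketch assumes that a window starts at the earliest remaining onset and that, after one onset is kept, all onsets closer than 50 ms to it are discarded so that the minimum inter-onset interval holds. Onsets are represented as hypothetical `(time_in_seconds, class_name)` tuples.

```python
import random


def remove_simultaneous_onsets(onsets, min_ioi=0.05, rng=None):
    """Keep one randomly chosen onset per window of length min_ioi and
    guarantee that kept onsets are at least min_ioi apart.

    ASSUMPTION: window anchoring and tuple layout are illustrative only.
    """
    rng = rng or random.Random()
    pending = sorted(onsets)  # sort by time (then class name)
    kept = []
    i = 0
    while i < len(pending):
        window_start = pending[i][0]
        j = i
        # Collect all onsets that fall into the window opened at pending[i].
        while j < len(pending) and pending[j][0] - window_start < min_ioi:
            j += 1
        choice = rng.choice(pending[i:j])
        kept.append(choice)
        # Skip the rest of the window and anything too close to the kept onset.
        i = j
        while i < len(pending) and pending[i][0] - choice[0] < min_ioi:
            i += 1
    return kept
```

Passing a seeded `random.Random` instance makes the selection reproducible, which is convenient when the same dataset version has to be regenerated.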
For each of the five variants, we generate both a drum-only version and a full-mix
version to compare transcription performance in the more challenging DTM scenario
against the less demanding DTD scenario.
4. EXPERIMENTS
We train models using each of the five variants of STAR Drums listed in Table 1 across
three different architectures. An overview of the DNN architectures used is provided in
Table 2. The models process monaural mel spectra with 96 bands and an upper cut-off
frequency of 16 kHz. The spectra are computed using a 1024-point short-time Fourier
transform (STFT) with a hop length of 512 samples, derived from audio signals sampled
at 48 kHz, resulting in a frame period (hop duration) of 10.7 ms.
While all models utilize CNN layers, the CRNN model additionally incorporates Recurrent
Neural Network (RNN) layers, and the CNNSA model employs self-attention layers. Each
model incorporates three dense layers to map the output to the number of classes,
followed by a sigmoid activation function to generate probability estimates.

Model    # Frames   # Params.   Architecture
CNN      25         2.4M        5 CNN layers, 3 Dense layers
CRNN     400        2.9M        4 CNN layers, 3 RNN layers, 3 Dense layers
CNNSA    400        6.9M        4 CNN layers, 2 Self-att. layers, 3 Dense layers
Table 2. Overview of models used, including input frame length, parameter count, and
architectural details.

The CNN model operates on blocks of 25 STFT frames. Networks utilizing RNN or
self-attention layers can capture temporal dependencies in the input data, allowing
them to process larger blocks of 400 STFT frames for improved context modeling. Similar
block lengths were proposed in [5,10].
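The timing figures implied by these settings can be checked with a few lines of arithmetic. The constant and function names are illustrative, and the block duration is approximated as the number of frames times the hop duration, ignoring the extra half window at the block edges.

```python
SAMPLE_RATE_HZ = 48_000  # sampling rate stated in the text
N_FFT = 1024             # STFT window length in samples
HOP = 512                # hop length in samples

hop_ms = 1000 * HOP / SAMPLE_RATE_HZ       # time between consecutive frames
window_ms = 1000 * N_FFT / SAMPLE_RATE_HZ  # duration covered by one STFT window


def block_duration_s(n_frames):
    """Approximate audio context covered by a block of n_frames frames."""
    return n_frames * HOP / SAMPLE_RATE_HZ


cnn_context_s = block_duration_s(25)     # CNN input block
crnn_context_s = block_duration_s(400)   # CRNN and CNNSA input block
```

Under these assumptions, the CNN sees roughly a quarter of a second of audio per block, whereas the CRNN and CNNSA models see a little more than four seconds.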
We categorize onsets into eight classes, following the mapping proposed in [2]: bass
drum, snare drum, hi-hat, toms, bell, cymbals, ride cymbals, and clave. To consider a
wide range of drum sounds, we avoid using the three-class mapping, only including bass
drum, snare drum, and hi-hat, and the five-class mapping utilized in [3,7,10]. At the
same time, we refrain from using the 18-class mapping, as used in [2], since it remains
unclear to which extent classification ambiguities arising from the fine-grained
mapping impact performance. As noted in [2], no clear frequency ranges exist for low,
mid, and high toms. Additionally, distinguishing between sounds like closed hi-hat and
pedal hi-hat can be challenging, even for humans.

Test Dataset   Same drums     Model   Train Dataset
               train + test?          20Kits        10Kits        1Kit          20KitsNoWeak  20KitsNoSim
                                      DTM    DTD    DTM    DTD    DTM    DTD    DTM    DTD    DTM    DTD
20Kits         ✓              CNN     0.73   0.89   0.71   0.85   0.56   0.64   -      -      -      -
                              CRNN    0.78   0.91   0.76   0.87   0.58   0.65   -      -      -      -
                              CNNSA   0.78   0.91   0.75   0.86   0.59   0.64   -      -      -      -
10Kits         ✓              CNN     -      -      0.73   0.89   -      -      -      -      -      -
                              CRNN    -      -      0.79   0.91   -      -      -      -      -      -
                              CNNSA   -      -      0.78   0.91   -      -      -      -      -      -
10Kits         ✗              CNN     -      -      0.67   0.78   -      -      -      -      -      -
                              CRNN    -      -      0.73   0.81   -      -      -      -      -      -
                              CNNSA   -      -      0.71   0.81   -      -      -      -      -      -
1Kit           ✓              CNN     -      -      -      -      0.78   0.91   -      -      -      -
                              CRNN    -      -      -      -      0.82   0.92   -      -      -      -
                              CNNSA   -      -      -      -      0.80   0.92   -      -      -      -
1Kit           ✗              CNN     -      -      -      -      0.61   0.70   -      -      -      -
                              CRNN    -      -      -      -      0.64   0.71   -      -      -      -
                              CNNSA   -      -      -      -      0.64   0.71   -      -      -      -
20KitsNoWeak   ✓              CNN     0.74   0.89   -      -      -      -      0.77   0.90   -      -
                              CRNN    0.80   0.91   -      -      -      -      0.82   0.92   -      -
                              CNNSA   0.79   0.91   -      -      -      -      0.83   0.92   -      -
20KitsNoSim    ✓              CNN     0.71   0.92   -      -      -      -      -      -      0.77   0.97
                              CRNN    0.78   0.94   -      -      -      -      -      -      0.83   0.98
                              CNNSA   0.75   0.94   -      -      -      -      -      -      0.81   0.98
ENST Drums     ✗              CNN     0.70   0.74   0.69   0.72   0.62   0.63   0.69   0.73   0.62   0.65
                              CRNN    0.73   0.76   0.73   0.73   0.65   0.64   0.72   0.75   0.62   0.65
                              CNNSA   0.73   0.76   0.71   0.73   0.64   0.66   0.71   0.73   0.62   0.63
MDB Drums      ✗              CNN     0.71   0.79   0.71   0.76   0.64   0.65   0.71   0.79   0.66   0.74
                              CRNN    0.75   0.74   0.73   0.74   0.63   0.64   0.72   0.75   0.66   0.73
                              CNNSA   0.71   0.76   0.69   0.73   0.61   0.63   0.71   0.78   0.67   0.71
Table 3. Results in terms of global F-measure when training and testing on different
STAR Drums versions using the CNN, the CRNN, and the CNNSA model for DTM and DTD. The
last two rows show results for MDB Drums and ENST Drums.

Onset times are extracted from the detection probabilities using a peak picking
algorithm with a fixed threshold of 0.55 across all classes. We identify true
positives, false positives, and false negatives using a tolerance window of 50 ms,
following [3, 10], with the Python package mir_eval [16]. The global F-measure is used
for performance comparison and is computed using micro averaging [17], which involves
summing all true positives, false positives, and false negatives across all classes and
tracks before calculating the F-measure. This approach assigns equal weight to every
onset.
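The evaluation chain described above (thresholded peak picking, matching within a 50 ms tolerance, micro-averaged F-measure) can be sketched in plain Python. This is a simplified illustration and not the evaluation code used in the paper: the peak picker is a basic local-maximum rule, since the exact algorithm is not specified, and the greedy matcher stands in for the matching performed by mir_eval. All function names are hypothetical.

```python
def pick_peaks(activations, threshold=0.55, hop_s=512 / 48_000):
    """Return onset times (in seconds) at local maxima that reach the threshold."""
    times = []
    for i, value in enumerate(activations):
        left = activations[i - 1] if i > 0 else float("-inf")
        right = activations[i + 1] if i + 1 < len(activations) else float("-inf")
        # "> left, >= right" reports a flat plateau only once.
        if value >= threshold and value > left and value >= right:
            times.append(i * hop_s)
    return times


def count_matches(reference, estimated, tolerance=0.05):
    """Greedy one-to-one matching of sorted onset times.

    Returns (true positives, false positives, false negatives).
    """
    ref, est = sorted(reference), sorted(estimated)
    i = j = tp = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            tp, i, j = tp + 1, i + 1, j + 1
        elif diff < 0:
            j += 1  # estimate is too early for this and all later references
        else:
            i += 1  # reference is too early for this and all later estimates
    return tp, len(est) - tp, len(ref) - tp


def micro_f_measure(counts):
    """Sum (tp, fp, fn) over all classes and tracks, then compute F once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

Because the counts are pooled before the F-measure is computed, a class or track with many onsets contributes proportionally more than one with few onsets.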
In addition to testing on the STAR Drums test split, we use MDB Drums [18] and ENST
Drums [19], publicly available ADT datasets commonly used for testing. MDB Drums and
ENST Drums include 0.4 and 1.0 h of hand-annotated audio, respectively, provided as
complete mixtures containing recordings of drum sounds alongside melodic instruments.
Additionally, drum-only stems are provided by the authors.
5. RESULTS
Table 3 presents the results of all experiments. For clarity, we evaluate only the
combinations of training and test datasets that address the research questions in
Table 1. Each cell in Table 3 shows two global F-measures for the three model
architectures from Table 2, where the first row corresponds to the CNN model, the
second row to the CRNN, and the third row to the CNNSA model. The first value is the
global F-measure for DTM, and the second value is for DTD. The results of the single
splits for 10Kits and 1Kit are averaged.
Overall, the CRNN and CNNSA models outperform the CNN model, aligning with the findings
of [2]. The CNNSA model achieves similar performance to the CRNN model, as observed in
[10]. The best performance on MDB Drums and ENST Drums with an F-measure of 0.75 and
0.73, respectively, is slightly lower than reported in [10], likely due to their use of
a less challenging five-class mapping. In all experiments with STAR Drums, DTD
performance surpasses DTM performance.
In the following subsections, we provide a detailed analysis of the results in relation
to the research questions outlined in Table 1.
5.1 Reducing the Number of Drum Kits
Performance remains similar for DTM when the number of drum kits in training and
testing is reduced from 20 to 10 (10Kits), provided the drum sounds are identical. In
contrast, performance decreases when drum kits in training and testing differ. For
instance, comparing 20Kits and 10Kits, the results for the CNN model decrease from 0.73
to 0.67 and for the CRNN model from 0.78 to 0.73.
Reducing from 20 (20Kits) to 1 drum kit (1Kit) that is identical in training and
testing increases DTM performance notably for the CNN and CRNN models from 0.73 to 0.78
and 0.78 to 0.82, respectively. This performance improvement suggests that, in DTM, it
can be beneficial for a model not to have to generalize across many different drum
kits. Conversely, using 1Kit with different drum kits in training and testing leads to
a significant performance drop.
The first and last two rows of Table 3 show slight DTM performance decreases for models
trained on 10Kits compared to 20Kits when testing on 20Kits and MDB Drums (CRNN: 0.78
to 0.76 and 0.75 to 0.73), with performance remaining constant on ENST Drums. In
contrast, training with 1Kit leads to lower performance for all models, with the CRNN
model dropping to 0.63 on MDB Drums.
MDB Drums and ENST Drums include different drum sounds than STAR Drums, featuring
recorded drums rather than synthesized audio. The small performance decrease observed
for training with 10Kits suggests that even a relatively low number of virtual drum
kits allows for the models to perform well on unseen real-world data.
For DTD, a greater performance decrease compared to DTM is observed with 10Kits when
the drum kits differ between training and testing (CRNN: 0.91 to 0.81). For 1Kit, we
see a performance decrease of up to 0.2 in F-measure (CRNN: 0.91 to 0.71). This decline
may be attributed to DTM benefiting from the presence of melodic instruments and
singing, which act as implicit data augmentation by introducing background noise to the
drum sounds [20]. Consequently, in DTM, the model does not rely as heavily on training
with a diverse range of drum sounds as it does for DTD. In contrast, DTD is more prone
to overfitting due to the lack of this additional variability. When drum kits are
identical in both training and testing, performance remains consistent across 20 and 10
drum kits. For one drum kit, the performance increase is less pronounced than for DTM
(CNNSA: 0.91 to 0.92).
In summary, DTM generally performs worse than DTD but benefits more when drum sounds in
training and test are consistent and originate from a single drum kit. In contrast, DTD
relies more on diverse drum sounds for robust generalization.
5.2 No Weak Onsets
For DTM, using 20KitsNoWeak in training and testing, where the velocity of weak notes
is increased to include only strong onsets, results in a performance improvement across
all models, with an increase of 0.04 to 0.05 in F-measure. For example, the CNN model’s
global F-measure improves from 0.73 to 0.77.
For DTD, there is a consistent but small performance increase across all models,
indicating that transcription errors related to weak onsets are not a strong
performance-limiting factor for DTD.
When only testing on 20KitsNoWeak and training on 20Kits, the performance for DTM
increases slightly and remains constant for DTD.
5.3 No Simultaneous Onsets
In DTM, using a dataset which does not include simultaneous onsets (20KitsNoSim) leads
to a performance increase similar to that achieved by avoiding weak onsets, with
improvements ranging from 0.03 to 0.05 in F-measure (CRNN: 0.78 to 0.83).
For DTD, simultaneous onsets are the main performance-limiting factor when training and
testing with identical drum sounds, resulting in an F-measure increase of 0.07 to 0.08
when eliminated. This results in a nearly perfect transcription performance with an
F-measure of 0.98 for the CRNN and CNNSA models, compared to 0.91 on 20Kits.
When testing on 20KitsNoSim after training on 20Kits, performance remains constant for
the CRNN model and decreases for the other two architectures for DTM. This decrease is
mainly due to a reduction in precision, especially for bass drum and snare drum. A
possible explanation is that the models, having learned conventions about classes
frequently occurring simultaneously, generate more false positives when such
simultaneity is absent in the test data. For DTD, we observe a smaller performance gain
compared to when both training and testing are conducted on 20KitsNoSim.
5.4 Comparison of DTM and DTD Performance
The presence of melodic instruments and singing significantly limits DTM performance,
leading to worse results in all evaluations carried out with STAR Drums. For DTD,
simultaneous onsets are the only significant performance-limiting factor when drum
sounds in training and testing are identical, while DTM performance is also affected by
weak onsets and can increase when drum sounds originate from one single and identical
drum kit in training and testing.
In DTM, models trained with the 20Kits dataset generalized well to the drum sounds of
MDB Drums, resulting in a small performance gap of 0.02 to 0.07 in F-measure between
the results for the 20Kits test split and MDB Drums (CNN: 0.73 and 0.71). Conversely,
models trained for DTD on 20Kits exhibited a larger performance gap, ranging from 0.1
to 0.17 in F-measure (CNNSA: 0.91 and 0.76). The reasons may again be attributed to
melodic instruments and singing serving as data augmentation as outlined in
Section 5.1.
5.5 Relative Performance Changes
Table 4 summarizes findings relative to the research questions outlined in Table 1. To
provide a clearer understanding of the performance changes detailed in Table 3, we
first calculated the average global F-measure across all three model architectures for
each STAR Drums variant. Subsequently, we computed the relative performance changes by
dividing these averages from each variant by the average global F-measure of the 20Kits
variant. Values are provided for both DTM and DTD.
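The averaging and normalization just described can be reproduced approximately from the published numbers. The sketch below uses F-measures copied from Table 3 (CNN, CRNN, CNNSA, with identical drum sounds in training and testing); because those entries are rounded to two decimals, the computed percentages can deviate from the values reported in Table 4 by a few tenths of a percentage point. The dictionary layout and function name are illustrative.

```python
# F-measures (CNN, CRNN, CNNSA) taken from Table 3, train and test on the
# same variant with identical drum sounds.
TABLE3 = {
    "20Kits":       {"DTM": (0.73, 0.78, 0.78), "DTD": (0.89, 0.91, 0.91)},
    "1Kit":         {"DTM": (0.78, 0.82, 0.80), "DTD": (0.91, 0.92, 0.92)},
    "20KitsNoWeak": {"DTM": (0.77, 0.82, 0.83), "DTD": (0.90, 0.92, 0.92)},
    "20KitsNoSim":  {"DTM": (0.77, 0.83, 0.81), "DTD": (0.97, 0.98, 0.98)},
}


def relative_change_percent(variant, task, baseline="20Kits"):
    """Average over the three architectures, divide by the baseline average,
    and express the result as a percentage change."""
    variant_scores = TABLE3[variant][task]
    baseline_scores = TABLE3[baseline][task]
    variant_mean = sum(variant_scores) / len(variant_scores)
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    return 100.0 * (variant_mean / baseline_mean - 1.0)


for name in ("1Kit", "20KitsNoWeak", "20KitsNoSim"):
    dtm = relative_change_percent(name, "DTM")
    dtd = relative_change_percent(name, "DTD")
    print(f"{name:<13} DTM {dtm:+.1f} %   DTD {dtd:+.1f} %")
```

Using unrounded F-measures, which are not available in the text, would be required to match the reported percentages exactly.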
For DTM, transcription performance improves by 5.0 % when all drum sounds during
training and testing originate from the same drum kit. For DTD, the same experiment
results in a 1.6 % improvement.
When drum kits in training and testing differ, reducing the number of drum kits from 20
to 10 leads to an 8.1 % performance drop for DTM and an 11.3 % decrease for DTD.
Further reducing to a single drum kit results in a 17.4 % decrease for DTM and a 21.5 %
decrease for DTD. These findings suggest that diverse drum sounds in training are more
critical for DTD than for DTM.

How does transcription performance change when ...               Change in global F-measure
                                                                 DTM          DTD
... reducing the number of kits in training from 20 to 1,
    with same drum sounds in training and testing.               +5.0 %       +1.6 %
... reducing the number of kits in training from 20 to 10,
    with different drum sounds in training and testing.          −8.1 %       −11.3 %
... reducing the number of kits in training from 20 to 1,
    with different drum sounds in training and testing.          −17.4 %      −21.5 %
... no weak onsets are present.                                  +5.5 %       +1.1 %
... no simultaneous onsets are present.                          +5.1 %       +8.5 %
... no sounds of melodic instruments and singing are present.          +14.5 %
Table 4. Relative changes in global F-measure averaged across used DNN architectures
when comparing the results for different versions of STAR Drums and when comparing all
DTM results to all DTD results across all STAR Drums versions.

The absence of weak and simultaneous onsets in DTM leads to similar performance
increases of 5.5 % and 5.1 %, respectively. In contrast, weak onsets have a minimal
impact on DTD, with only a 1.1 % increase. However, preventing simultaneous onsets in
DTD yields a more substantial performance increase of 8.5 %.
Finally, performing DTD compared to DTM results in an average performance increase of
14.5 % across all experiments conducted using STAR Drums variants.
5.6 Qualitative Analysis of Transcription Errors
After presenting the quantitative results, we manually inspected transcriptions from
the best-performing CRNN model (trained on 20Kits) for MDB Drums to identify systematic
errors related to previous findings and the challenges outlined in Section 1.
Jazz excerpts with many soft onsets are challenging: weak snare, bass drum, and cymbal
onsets are often missed in DTM. Additionally, percussive events from melodic
instruments can cause false positives. For example, accentuated bass guitar notes are
sometimes labeled as bass drum hits.
When a cymbal’s decay overlaps with the attack of another drum sound, the system may
falsely detect a cymbal in DTM, while the transcription is correct for DTD.
Simultaneous hi-hat and snare sounds are sometimes missed, presumably due to masking
and similar spectral features. In contrast, false positive hi-hat detections can occur
when only snare drum is active. Similar confusions arise between bass drum and low
toms, and between low-pitched snares and toms. Heavily distorted drum sounds in some
items are not classified reliably, presumably because they differ too much from the
sounds included in training.
These observations support our quantitative findings: simultaneity of drum sounds, weak
onsets, limited generalization, and interference from melodic instruments remain key
challenges for modern ADT systems.
6. CONCLUSION
In this study, we utilized the STAR Drums dataset to quantify several
performance-limiting factors in ADT. We created five increasingly simplified versions
of STAR Drums and conducted experiments using three different DNN architectures.
Our findings highlight three factors in DTM that increase performance by approximately
5 % each:
• Training and testing use identical drum sounds originating from a single kit.
• No weak onsets are present.
• No simultaneous onsets are present.
For DTD, simultaneous onsets are the central performance-limiting factor, with nearly
error-free transcriptions achieved in their absence. Moreover, DTD benefits more from a
diverse set of training drum kits than DTM.
Increasing the diversity of training data of STAR Drums can be achieved by employing
various data augmentation techniques, such as pitch shifting, dynamic range
compression, and reverberation.
Our findings suggest that the transcription quality of music education apps, which
analyze student recordings to provide feedback, will decrease if a student plays
advanced drum patterns with more simultaneous onsets, or uses a kit whose timbre
differs strongly from the training data. Employing FSL could help mimic the effects of
having identical drum sounds for training and testing. Additionally, creating versions
of STAR Drums with a higher number of simultaneous onsets with varying class
combinations could facilitate efficient learning of these complex scenarios.
In music production tools that use transcriptions to add or replace drum samples,
genre-specific playing styles matter: for example, a jazz track with frequent soft ride
and ghost notes will be transcribed less accurately than a pop track with more uniform
dynamics, especially for DTM. Exploring the optimal ratio of weak to strong onsets in
STAR Drums could further contribute to performance improvements.
By sharing these insights, we aim to support further advancements in ADT by providing
clear indications of how addressing key challenges can enhance transcription
performance.
7. ACKNOWLEDGMENTS
The International Audio Laboratories Erlangen are a joint institution of the
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for
Integrated Circuits IIS.
8. REFERENCES
[1] C. Wu, C. Dittmar, C. Southall, R. Vogl, G. Widmer, J. Hockman, M. Müller, and
A. Lerch, “A review of automatic drum transcription,” IEEE ACM Trans. Audio Speech
Lang. Process., vol. 26, no. 9, pp. 1457–1483, 2018.
[2] R. Vogl, G. Widmer, and P. Knees, “Towards multi-instrument drum transcription,” in
Proc. of 21st DAFx’18, 2018.
[3] M. Zehren, M. Alunno, and P. Bientinesi, “In-depth performance analysis of the
ADTOF-based algorithm for automatic drum transcription,” in Proc. of 25th ISMIR, 2024,
pp. 1060–1067.
[4] P. Weber, C. Uhle, M. Müller, and M. Lang, “Real-time automatic drum transcription
using dynamic few-shot learning,” in Proc. of 5th International Symposium on the
Internet of Sounds (IS2). IEEE, 2024.
[5] R. Vogl, M. Dorfer, G. Widmer, and P. Knees, “Drum transcription via joint beat and
drum modeling using convolutional recurrent neural networks,” in Proc. of 18th ISMIR,
2017, pp. 150–157.
[6] Y. Wang, J. Salamon, M. Cartwright, N. Bryan, and J. Bello, “Few-shot drum
transcription in polyphonic music,” in Proc. of 21st ISMIR, 2020.
[7] M. Zehren, M. Alunno, and P. Bientinesi, “ADTOF: A large dataset of non-synthetic
music for automatic drum transcription,” in Proc. of 22nd ISMIR, 2021, pp. 818–824.
[8] I. Wei, C. Wu, and L. Su, “Improving automatic drum transcription using large-scale
audio-to-midi aligned data,” in Proc. of ICASSP. IEEE, 2021, pp. 246–250.
[9] R. Ishizuka, R. Nishikimi, E. Nakamura, and K. Yoshii, “Tatum-level drum
transcription based on a convolutional recurrent neural network with language
model-based regularized training,” in Proc. of Asia-Pacific Signal and Information
Processing Association Annual Summit and Conference. IEEE, 2020, pp. 359–364.
[10] M. Zehren, M. Alunno, and P. Bientinesi, “High-quality and reproducible automatic
drum transcription from crowdsourced data,” Signals, vol. 4, no. 4, pp. 768–787, 2023.
[11] R. Ishizuka, R. Nishikimi, and K. Yoshii, “Global structure-aware drum
transcription based on self-attention mechanisms,” Signals, vol. 2, no. 3,
pp. 508–526, 2021.
[12] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “The MUSDB18
corpus for music separation,” Dec. 2017. [Online]. Available:
https://doi.org/10.5281/zenodo.1117372
[13] P. Cano, E. Gómez, F. Gouyon, P. Herrera, M. Koppenberger, B. Ong, X. Serra,
S. Streich, and N. Wack, “ISMIR 2004 audio description contest,” Music Technology
Group of the Universitat Pompeu Fabra, Tech. Rep., 2006.
[14] D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, “The MTG-Jamendo
dataset for automatic music tagging,” in Machine Learning for Music Discovery Workshop,
International Conference on Machine Learning, Long Beach, CA, United States, 2019.
[15] E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux, “Cutting music source
separation some Slakh: A dataset to study the impact of training data quality and
quantity,” in Proc. of WASPAA. IEEE, 2019, pp. 45–49.
[16] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and
D. P. W. Ellis, “MIR_EVAL: A transparent implementation of common MIR metrics,” in
Proc. of 15th ISMIR, 2014, pp. 367–372.
[17] K. Takahashi, K. Yamamoto, A. Kuchiba, and T. Koyama, “Confidence interval for
micro-averaged F1 and macro-averaged F1 scores,” Appl. Intell., vol. 52, no. 5,
pp. 4961–4972, 2022.
[18] C. Southall, C.-W. Wu, A. Lerch, and J. Hockman, “MDB drums: An annotated subset
of MedleyDB for automatic drum transcription,” in Proc. of 18th ISMIR, 2017. [Online].
Available: https://github.com/CarlSouthall/MDBDrums
[19] O. Gillet and G. Richard, “ENST-drums: an extensive audio-visual database for drum
signals processing,” in Proc. of 7th ISMIR, 2006, pp. 156–159.
[20] B. McFee, E. J. Humphrey, and J. P. Bello, “A software framework for musical data
augmentation,” in Proc. of 16th ISMIR, 2015, pp. 248–254.