Towards Human-in-the-Loop Onset Detection: A Transfer Learning Approach for Maracatu

Author: António Pinto (INESC TEC; University of Porto, Faculty of Engineering)
Publisher: Zenodo
DOI: 10.5281/zenodo.17706403
Source: https://zenodo.org/records/17706403/files/000037.pdf
TOWARDS HUMAN-IN-THE-LOOP ONSET DETECTION: A TRANSFER
LEARNING APPROACH FOR MARACATU
António Sá Pinto
Faculdade de Engenharia da Universidade do Porto, Porto, Portugal
INESC TEC, Porto, Portugal
ABSTRACT
We explore transfer learning strategies for musical onset detection in the Afro-Brazilian Maracatu tradition, which features complex rhythmic patterns that challenge conventional models. We adapt two Temporal Convolutional Network architectures: one pre-trained for onset detection (intra-task) and another for beat tracking (inter-task). Using only 5-second annotated snippets per instrument, we fine-tune these models through layer-wise retraining strategies for five traditional percussion instruments. Our results demonstrate significant improvements over baseline performance, with F1 scores reaching up to 0.998 in the intra-task setting and improvements of over 50 percentage points in best-case scenarios. The cross-task adaptation proves particularly effective for time-keeping instruments, where onsets naturally align with beat positions. The optimal fine-tuning configuration varies by instrument, highlighting the importance of instrument-specific adaptation strategies. This approach addresses the challenges of underrepresented musical traditions, offering an efficient human-in-the-loop methodology that minimizes annotation effort while maximizing performance. Our findings contribute to more inclusive music information retrieval tools applicable beyond Western musical contexts.
1. INTRODUCTION
Accurately identifying the precise moment when a musical note begins remains one of the fundamental challenges in audio signal processing. This task, known as musical onset detection, serves as a cornerstone for numerous Music Information Retrieval (MIR) applications. Onset detection has historically been essential for rhythmic analysis, notably in beat tracking systems [1–3]. While end-to-end learning models have recently bypassed this explicit step in some contexts, onset detection continues to be critical for diverse applications such as score following [4], music segmentation [5], and polyphonic music transcription [6].
The methodological evolution of onset detection mirrors broader trends in MIR research. Early approaches relied on signal processing techniques to identify significant changes in audio properties [7, 8], followed by the introduction of feature-based machine learning methods [9, 10]. The field then shifted toward neural network architectures, beginning with Recurrent Neural Networks (RNNs) [11] and advancing to Convolutional Neural Networks (CNNs) [12], which extract relevant features directly from raw audio or spectral representations. Despite impressive advances in performance metrics (with top models achieving F1 scores approaching 90% in recent evaluations¹), significant challenges persist in onset detection. In particular, accurately detecting soft onsets remains difficult even for advanced models [13]. Moreover, these data-driven approaches introduce additional challenges related to training data requirements and generalizability.
© A. S. Pinto. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: A. S. Pinto, "Towards Human-in-the-loop Onset Detection: A Transfer Learning Approach for Maracatu", in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
The effectiveness of supervised learning models hinges on the quality and diversity of training data [14]. Current systems experience performance drops when analysing non-Western musical traditions or rare instruments, primarily due to insufficient representation in existing datasets. Addressing these gaps requires costly annotation efforts that demand both domain-specific and culturally-informed expertise [15], further complicating dataset curation. Furthermore, the annotation process itself reveals limitations: manual labelling of onsets is prone to human error and inconsistencies [16], with even isolated percussive signals proving difficult to label precisely [17]. These constraints restrict the practical deployment of state-of-the-art systems in diverse musical contexts, pointing to the need for more adaptable strategies.
Moving beyond the specific challenges of onset detection, MIR research has employed several adaptive strategies within rhythm analysis tasks. Informed methods leverage a priori knowledge about rhythmic content for tasks such as beat tracking [18] and meter determination [19], which, while effective in specific genres, lack generalizability. Transfer learning leverages knowledge across domains, with examples including adaptations of mainstream beat-tracking models to Greek folk music [20] and facilitating adaptive rhythm microtiming generation [21]. Additionally, user-centric approaches like Active Learning and Few-Shot Learning optimize learning through strategic sample selection, enhancing adaptability in polyphonic drum transcription [22, 23] and enabling interactive refinement of onset detection [24] and beat tracking [25]. This shift toward user involvement exemplifies the current human-centred landscape of MIR, recognizing users' essential role in data-driven systems [26]. The integration of human expertise into computational frameworks provides a promising avenue when existing solutions prove insufficient.
¹ MIREX 2018, at https://nema.lis.illinois.edu/nema_out/mirex2018/results/aod/
Recent research has explored incorporating user-provided information to enhance beat tracking performance. Techniques such as high-level model parameterization [27] and integrating user-annotated data snippets in a fine-tuning cycle [28] have shown promise for improving state-of-the-art accuracy. These methods are particularly effective in addressing challenges in underrepresented musical contexts, where conventional MIR techniques underperform. Such approaches have proven instrumental in the creation of the Maracatu onset dataset [17], meter determination in Latin-American music [29], and beat tracking in highly challenging music signals [30]. The implementation of transfer learning for these tasks varies considerably: while some approaches retrain only final layers to leverage basic rhythmic representations [20, 31], others target input and output layers for instrument-specific adaptation [17], and some retrain entire networks [28]. Despite these varied strategies, no studies have empirically evaluated the impact of layer-wise retraining on model performance, leaving this critical question unexplored.
Building on this foundation, this paper explores a user-driven transfer learning approach to onset detection, focusing on the Afro-Brazilian tradition of Maracatu. We use the eponymous dataset [17], which features complex rhythms and unique instrumental acoustic characteristics that cause leading models to struggle with achieving satisfactory performance.
Our methodology involves adapting a deep neural network to each instrument in the "terno", the percussion ensemble central to Maracatu's rhythm, based on a short annotated snippet per instrument. Through these instrument-specific adaptations, we demonstrate an effective and straightforward method to enhance state-of-the-art performance. We investigate two distinct transfer learning scenarios: one with a model initially trained for onset detection, and another novel approach adapting a beat-tracking model for onset detection. This extends previous research [17, 32] by exploring cross-task feature transferability and leveraging more complex models trained on larger datasets. Furthermore, we systematically evaluate layer-wise retraining strategies, examining the effectiveness of freezing different layer groups to identify optimal configurations for Maracatu onset detection.
2. METHODOLOGY
Our approach addresses the limitations of existing models in non-mainstream signals by integrating user-provided short annotated snippets. We adapt the human-in-the-loop method proposed by Pinto et al. for beat tracking [27, 28, 30] to the task of onset detection, leveraging state-of-the-art models through in-situ fine-tuning. This user-centred methodology eliminates the need for extensive training from scratch, enabling end-users to swiftly obtain high-quality onset estimates that align with their judgments.
For onset detection in monotimbral signals, we adapt neural networks to each instrument's unique acoustic characteristics using just a single 5-second annotated snippet per instrument as the fine-tuning target. This approach demonstrates both minimal annotation effort and rapid adaptation cycles, yielding instrument-specific networks optimized for their corresponding acoustic properties while remaining computationally feasible for standard resources. While our method is applicable to various DNN architectures, this study employs Temporal Convolutional Network (TCN)-based models for their efficient retraining capabilities. The TCN's performance in onset detection tasks is comparable to state-of-the-art models, as demonstrated in Section 3.1, making it suitable for our investigation.
We explore two transfer learning scenarios: an intra-task setting using a TCN onset detection model [32] and an inter-task setting that adapts a TCN beat tracking model [33] to onset detection. This inter-task approach can be framed as a domain adaptation problem, where a model trained for beat tracking is repurposed for onset detection. Given the inherent relationship between beats and onsets, this adaptation may benefit from the typically broader training data available for beat tracking models. To the best of our knowledge, this is the first study to explore domain adaptation from beat tracking to onset detection.
Furthermore, onset detection's unambiguous objective, when contrasted with the multifaceted nature of beat tracking, allows for clearer adaptation targets and, consequently, more straightforward interpretation of results. This motivated us to extend previous research by examining layer-wise retraining strategies. We systematically freeze different segments of the 15-layer TCN architectures, from the initial convolutional layers with small receptive fields to the deeper layers with large dilation rates and wide receptive fields. In total, our experimental cycle comprises 150 fine-tuning cycles (15 layer configurations × 5 instruments × 2 models). Through this comprehensive evaluation, we aim to investigate feature transferability between related rhythm analysis tasks and systematically assess the impact of different layer freezing configurations.
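The size of this experimental grid can be checked with a short sketch. It assumes (this is not stated explicitly in the text) that the 15 layer configurations are the 14 freezable layer labels, Conv1 to Conv3 and Tcn1 to Tcn1024, plus one fully trainable configuration, here given the hypothetical label "none". The instrument and model identifiers are likewise illustrative.

```python
from itertools import product

# Layer labels as they appear in the architecture figure; the output layer is
# left out because it is always updated.
CONV_LAYERS = ["Conv1", "Conv2", "Conv3"]
TCN_LAYERS = [f"Tcn{2 ** i}" for i in range(11)]  # Tcn1, Tcn2, ..., Tcn1024

# Assumption: 14 freeze labels plus one fully trainable configuration ("none").
CONFIGS = ["none"] + CONV_LAYERS + TCN_LAYERS
INSTRUMENTS = ["cuica", "gonge-lo", "mineiro", "tambor-hi", "tarol"]
MODELS = ["TCN1", "TCN2"]  # hypothetical identifiers for the two base models


def experiment_grid():
    """Enumerate every (configuration, instrument, model) fine-tuning run."""
    return list(product(CONFIGS, INSTRUMENTS, MODELS))


grid = experiment_grid()
print(len(CONFIGS), len(INSTRUMENTS), len(MODELS), len(grid))
```

Under these assumptions the grid contains 15 × 5 × 2 = 150 runs, matching the count given in the text.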
In line with open science principles [34], we provide a GitHub repository with our code and detailed results, including per-file evaluation metrics for all configurations and higher-resolution figures for detailed analysis². The remainder of this section outlines the Maracatu dataset composition, experimental settings, base models' description, and fine-tuning and evaluation details.
² https://github.com/asapsmc/HIILOnsetDetection
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
2.1 Dataset
Maracatu de baque solto³, also known as Maracatu "rural", is a vibrant carnival performance from Pernambuco, Northeast Brazil, combining music, poetry, and dance [36]. The rhythmic nucleus of Maracatu, known as the "terno" ensemble, consists of five percussionists playing traditional handmade instruments: cuica, gonge-lo, tarol, mineiro, and tambor-hi. The Maracatu dataset [17] captures these instruments using contact microphones for largely isolated per-instrument tracks, recorded during a fixed location performance and comprising 34 individual pieces totalling approximately 33 minutes⁴.
Maracatu features two main rhythmic patterns: "marcha" and "samba", characterized by fast tempi of approximately 165 and 180 beats-per-minute (bpm), respectively. This rapid pace creates a complex timing profile across the ensemble. Time-keeping instruments (cuica and gonge-lo) maintain rhythmic stability despite their sporadic use, with a mean onset count of around 4,700 (2.5 annotations per second). In contrast, the "voicing" instruments (tarol, mineiro, and tambor-hi) play more expressive roles, resulting in a higher mean onset count of approximately 16,600 (8.9 annotations per second).
[Figure 1: waveform panels for Cuica, Gonge-Lo, Mineiro, Tambor-Hi, and Tarol; x-axis: time (seconds); y-axis: amplitude]
Figure 1. Onset-annotated waveforms of the Maracatu instruments. Left: 5-second fine-tuning snippet; Right: zoomed-in waveform, from the second onset to the sample before the third onset (in blue).
The intricacy of these rhythms and distinct waveform shapes, as illustrated in Figure 1, complicates onset detection and annotation. The mineiro exemplifies this challenge with its unusual waveform characteristics, which led to its exclusion from microtiming analysis in the original dataset creation study due to annotation difficulties [17]. Combined with the under-representation of these instruments in available model training data, these factors create substantial obstacles for both human annotators and automated systems. The Maracatu dataset thus provides an ideal testbed for our human-in-the-loop strategy, extending the approach previously employed in the dataset's creation.
³ Hereafter referred to as Maracatu, this genre should be distinguished from Maracatu de baque virado (or "Nação"). Both share African origins and certain musical similarities, but differ significantly in instrumentation, practice, and narrative [35].
⁴ While the original dataset contains 34 files per instrument, we excluded Instrument_34 files across all sub-datasets due to a corrupted Mineiro_34 file.
2.2 Base Models
This study employs two pre-trained models, both derived from the TCN architecture proposed by Davies and Böck [37]. For the intra-task setting, we use a modified version of the original TCN model with an additional 11th dilation rate level [32], trained from scratch on the OnsetDB dataset [4] for onset detection. In the inter-task scenario, we utilize an adaptation of the multitask network of [33], modified by masking its tempo and downbeat loss to function as a single-task (beat) network, trained on various beat-tracking datasets. Hereafter, we refer to these models as TCN 1 and TCN 2, respectively.
[Figure 2: layer sequence Input, Conv1, Conv2, Conv3, Tcn1, Tcn2, Tcn4, Tcn8, Tcn16, Tcn32, Tcn64, Tcn128, Tcn256, Tcn512, Tcn1024, Output]
Figure 2. High-level architecture shared by the TCN 1 and TCN 2 models. Both follow the same layer sequence and depth, but differ in convolutional filter configuration, resulting in distinct receptive fields and overall model sizes.
As illustrated in Figure 2, both models share the same high-level architecture and signal conditioning stages, but their implementations differ significantly. TCN 1 consists of three convolutional layers with 16 filters and filter shapes of 3×3, 3×3, and 1×8, with max pooling over three frequency bins after the first two layers. In contrast, TCN 2 employs three convolutional layers with 20 filters and filter shapes of 3×3, 1×10, and 3×3, each followed by max pooling over three frequency bins. Both architectures use dropout after each convolutional stage. The ensuing TCN block operates non-causally and consists of 11 dilation levels, 16 filters, and a kernel size of 5. The TCN 1 model comprises 21,890 parameters, while the TCN 2 model has 116,302 parameters. The original training procedures also differed slightly in optimization techniques: TCN 1 employed a standard Adam optimizer, whereas TCN 2 used a Rectified Adam plus Lookahead approach.
2.3 Fine-tuning
For both intra-task and inter-task transfer learning settings, we adopt the fine-tuning strategy described in [28], using a 5-second annotated sample per instrument to demonstrate minimal annotation effort. Each base model is fine-tuned for 50 epochs with the learning rate reduced to one-quarter of the original value, maintaining the original optimizers for seamless training continuation. Early stopping and learning rate reduction mechanisms were not implemented, as these parameters proved sufficient for convergence given the short training duration and small dataset size. Given our systematic layer-wise analysis comprising 150 fine-tuning cycles, we omitted data augmentation and additional hyperparameter optimization to maintain experimental tractability and support isolated analysis of how layer-wise freezing strategies relate to each instrument's acoustic characteristics.
We evaluate all possible fine-tuning configurations, denoted as A-B, where A and B indicate the starting and ending layers of the frozen section, respectively. The output layer is always updated and thus excluded from this notation. We explore configurations from Conv1...3 to Tcn1...1024, including the fully trainable configuration. These are compared with the intra-task baseline bsl and the inter-task baseline bsl*.
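As an illustration of the A-B notation, the sketch below builds a per-layer trainable mask in plain Python. The layer names follow the architecture figure; the function names, the use of None for the fully trainable configuration, and the helper for the reduced learning rate are assumptions made for this example and are not taken from the released code.

```python
# Layer order from input to output, following the architecture figure.
LAYERS = (
    ["Conv1", "Conv2", "Conv3"]
    + [f"Tcn{2 ** i}" for i in range(11)]  # Tcn1 ... Tcn1024
    + ["Output"]
)

FINE_TUNE_EPOCHS = 50  # number of fine-tuning epochs stated in the text


def trainable_mask(start=None, end=None):
    """Map each layer name to True (updated) or False (frozen).

    Layers from `start` to `end` inclusive are frozen. Passing no arguments
    gives the fully trainable configuration. The output layer can never be
    frozen, since it is always updated.
    """
    if start is None and end is None:
        return {name: True for name in LAYERS}
    if "Output" in (start, end):
        raise ValueError("the output layer is always updated")
    first, last = LAYERS.index(start), LAYERS.index(end)
    if first > last:
        raise ValueError("start layer must not come after end layer")
    return {name: not (first <= i <= last) for i, name in enumerate(LAYERS)}


def fine_tune_learning_rate(original_lr):
    """Learning rate reduced to one-quarter of the original value."""
    return original_lr / 4


mask = trainable_mask("Conv1", "Tcn2")
print([name for name, trainable in mask.items() if not trainable])
```

In a deep learning framework, such a mask would typically be applied by toggling each layer's trainable flag before recompiling or rebuilding the optimizer; that step is omitted here because it depends on the framework used.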
2.4 Evaluation
The network output is an onset activation function with a 10-millisecond (ms) temporal resolution. We apply the standard madmom peak-picking algorithm to obtain onset estimates. Performance is evaluated using the F1 metric with the default 25 ms tolerance window [4]. We implement a holdout validation approach where, for each instrument, we extract a 5-second segment from the first file (Instrument_01) for fine-tuning and then exclude this entire file from the evaluation set to prevent data leakage. This ensures unbiased assessment of the instrument-adapted models by evaluating performance on the remaining 32 files per instrument.
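The evaluation metric can be sketched as follows. This is a simplified greedy matcher written for illustration, not madmom's implementation, and it assumes the 25 ms tolerance is applied as ±25 ms around each annotation. The onset times in the usage lines are invented.

```python
def match_onsets(detections, annotations, window=0.025):
    """Count true positives, false positives and false negatives.

    Each detection may match at most one annotation and vice versa.
    Times are in seconds; `window` is the tolerance on either side.
    """
    det = sorted(detections)
    ann = sorted(annotations)
    tp = 0
    i = j = 0
    while i < len(det) and j < len(ann):
        diff = det[i] - ann[j]
        if abs(diff) <= window:
            tp += 1          # matched pair, consume both
            i += 1
            j += 1
        elif diff < 0:
            i += 1           # detection precedes every remaining annotation
        else:
            j += 1           # annotation has no detection close enough
    return tp, len(det) - tp, len(ann) - tp


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented example: two of four detections fall within 25 ms of an annotation.
tp, fp, fn = match_onsets([0.100, 0.520, 1.000, 1.400], [0.110, 0.500, 1.050])
print(tp, fp, fn, round(f1_score(tp, fp, fn), 3))  # 2 2 1 0.571
```

Tightening `window` is one way to probe temporal precision more strictly than the default setting allows.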
3. EXPERIMENTS AND RESULTS
3.1 Preliminary Model Analysis
To contextualize our approach, we first compare the performance of our base TCN models with previous state-of-the-art methods on the OnsetDB dataset [4]. Our base models, TCN 1 and TCN 2, achieve F1 scores of 0.907 and 0.340, respectively. The lower performance of TCN 2 is expected, as it was originally trained for beat tracking rather than onset detection. The madmomRNN and madmomCNN models, pre-trained and provided as inference-ready models in the madmom package [38], achieve F1 scores of 0.849 and 0.913, respectively. However, it is important to note that these evaluations were conducted without knowledge of the original training/test splits used for these pre-trained models, creating potential data leakage that may lead to an overestimation of their performance. The 2nd generation onset CNN [39] remains the established benchmark, with a reported F1 score of 0.903, verified through k-fold cross-validation. Unlike the madmom models, our TCN models were evaluated under the same validation conditions as the 2nd gen CNN, ensuring comparability. These results indicate that TCN 1 is competitive with the current state of the art in onset detection.
Table 1. Representative configurations demonstrating improvements across transfer learning settings.
             Onset-to-Onset                                        Beat-to-Onset
Instrument   Config                     Adapted (best)   bsl       Config   Adapted (best)   bsl*
Cuica        Tcn16                      0.985            0.477              0.955            0.429
Gonge-Lo     Tcn2/4/16                  0.998            0.508              0.956            0.892
Mineiro      Tcn16                      0.972            0.946     Tcn8     0.790            0.193
Tambor-Hi    (fully trainable)/Tcn1024  0.978            0.965     Tcn1     0.723            0.443
Tarol        Conv3                      0.997            0.993              0.884            0.139
3.2 Onset-to-Onset Transfer Learning Results
Figure 3 (top) presents the F1 scores obtained for each fine-tuning configuration in comparison to the baseline. The results can be grouped based on the rhythmic role of the instruments: time-keeping (cuica and gonge-lo) vs. voicing (tarol, mineiro, and tambor-hi).
For time-keeping instruments, the baseline performance is moderate (F1 ≈ 0.5), but fine-tuning yields significant improvements, with scores reaching the 0.8–1.0 range. In contrast, expressive instruments exhibit higher initial F1 scores (≈ 0.9–1.0), which limits the relative improvement. This disparity can be attributed to the conventional nature of tarol and tambor-hi, which are more aligned with the training data, whereas cuica and gonge-lo diverge more in terms of acoustic characteristics. An exception is mineiro, which achieves a relatively high baseline score despite its distinct waveform characteristics. However, the reported lower precision of these ground-truth annotations [17] complicates direct performance comparisons.
Table 1 presents high-performing configurations to demonstrate the achievable improvements across instruments. The Tcn16 configuration achieves the highest accuracy for cuica and mineiro (0.985 and 0.972, respectively), while Tcn2, Tcn4, and Tcn16 all achieve the highest F1 score for gonge-lo (F1 = 0.998). For tambor-hi, the best performance is obtained with both Tcn1024 and the fully trainable configuration (F1 = 0.978). For tarol, the highest F1 score (0.997) is achieved with Conv3, though many configurations show comparable performance with marginal differences. These configurations consistently outperform the baseline, with the most notable gains observed in cuica and gonge-lo, where F1 improvements exceed 50 percentage points (p.p.).
In summary, all instruments benefit from adaptation, as most fine-tuned configurations (and in particular the best for each instrument) consistently outperform the baseline. The improvement is especially pronounced for time-keeping instruments (cuica and gonge-lo), likely due to their lower baseline accuracy, which allows more room for improvement, and the relative ease of detecting sparse onsets compared to those that are closely clustered in time, even though the spacing between onsets remains well above the network's temporal resolution of 10 ms. The optimal freeze configuration varies by instrument, with no clear global trend. However, some patterns emerge: for voicing instruments, full-network fine-tuning ranks among the top-performing configurations, whereas it degrades performance for time-keeping instruments.
[Figure 3: F1 (0.0 to 1.0) per configuration (bsl*/bsl, Conv1 to Conv3, Tcn1 to Tcn1024) for Cuica, Gonge-Lo, Mineiro, Tambor-Hi, and Tarol; top row: Onset-to-Onset; bottom row: Beat-to-Onset]
Figure 3. Distribution of F1 scores per layer-wise configuration under two transfer learning settings: Onset-to-Onset (top), where fine-tuned models are compared against their baseline, and Beat-to-Onset (bottom), where we assess cross-task versus within-task transfer learning, with comparable performance observed for time-keeping instruments.
3.3 Beat-to-Onset Transfer Learning Results
In this section, we focus on domain adaptation, where a model pre-trained for beat tracking is adapted to onset detection. Unlike the previous setting, the goal here is not to compare fine-tuned models to their baseline, as this originates from a different task. We also refrain from an in-depth analysis of mean F1 scores across datasets, given their limited interpretative value. Instead, we assess whether models fine-tuned in this setting achieve results comparable to those in the onset-to-onset transfer learning scenario. Figure 3 (bottom) provides an overview of the results.
Time-keeping instruments, such as cuica and gonge-lo, achieve relatively high baseline (bsl*) accuracies, likely due to the alignment between their onsets and beat locations. Adaptation improves accuracy across all instruments, confirming the feasibility of beat-to-onset transfer learning. However, while the fine-tuned models consistently outperform the beat-tracking baseline, direct comparisons to the onset-to-onset setting reveal performance disparities that vary by instrument. Specifically, for time-keeping instruments, performance remains nearly identical across both transfer learning scenarios, with differences of only 1.6 p.p. for cuica and 3.7 p.p. for gonge-lo. In contrast, voicing instruments exhibit progressively larger discrepancies, with F1-score differences of 11.3 p.p. for tarol, 27.5 p.p. for mineiro, and the largest gap of 32.2 p.p. for tambor-hi.
Close inspection of the layer-wise results reveals additional patterns. The accuracy generally increases as more layers are fine-tuned up to the 3rd or 4th dilation level, beyond which no further gains are observed. However, this trend does not hold for tarol, where deeper fine-tuning leads to additional performance improvements. These observations highlight that, while fine-tuning is beneficial across all cases, the optimal retraining depth remains instrument-dependent.
Altogether, the results indicate that feature transferability from beat tracking to onset detection is more effective for time-keeping instruments than for voicing instruments. Specifically, gonge-lo exhibits a clearly higher baseline F1 accuracy in the beat-to-onset setting compared to its onset-to-onset counterpart (0.892 vs. 0.508), while cuica achieves a comparable performance (0.429 vs. 0.477), as reported in Table 1. This enhanced cross-task adaptability arises from the metrical function of time-keeping instruments: their onsets inherently coincide with beat positions, making them natural targets for the pre-trained model's rhythmic representations. Examining these results more closely, we verify that Maracatu's tempo range of 165–180 BPM corresponds to inter-beat intervals of 333–363 ms. These durations approximately match the waveform spans of cuica and gonge-lo, but not those of the other instruments⁵.
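The tempo-to-interval conversion follows from IBI = 60000 / BPM, with the result in milliseconds. A minimal sketch, with the waveform spans copied from footnote 5 and the ratio to the mean inter-beat interval added here purely as an illustrative measure of how closely each span matches:

```python
def inter_beat_interval_ms(bpm):
    """Duration of one beat in milliseconds at the given tempo."""
    return 60000.0 / bpm


# Waveform spans in ms (lower, upper), as listed in footnote 5.
WAVEFORM_SPANS_MS = {
    "cuica": (384, 428),
    "gonge-lo": (376, 400),
    "tarol": (77, 107),
    "mineiro": (90, 180),
    "tambor-hi": (120, 230),
}

ibi_slow = inter_beat_interval_ms(165)  # "marcha" tempo
ibi_fast = inter_beat_interval_ms(180)  # "samba" tempo
print(round(ibi_slow, 1), round(ibi_fast, 1))  # 363.6 333.3

# Ratio of each span's midpoint to the mean inter-beat interval
# (illustrative measure only; values near 1 indicate a close match).
mean_ibi = (ibi_slow + ibi_fast) / 2
for name, (lower, upper) in WAVEFORM_SPANS_MS.items():
    midpoint = (lower + upper) / 2
    print(name, round(midpoint / mean_ibi, 2))
```

Only cuica and gonge-lo yield ratios close to 1; the voicing instruments come out at roughly half the inter-beat interval or less.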
This temporal alignment, where the instruments' acoustic profiles align with the genre's inter-beat intervals, explains the high baseline accuracies. Additionally, the larger capacity of TCN 2 (116,302 parameters vs. 21,890 in TCN 1) and its exposure to a broader training set may further contribute to this advantage. This suggests that model expressivity and pre-training diversity can compensate for task differences in certain transfer learning scenarios.
⁵ According to an informal inspection of waveform spans: cuica: 384-428 ms, gonge-lo: 376-400 ms, tarol: 77-107 ms, mineiro: 90-180 ms, and tambor-hi: 120-230 ms.
3.4 Discussion
Our investigation of two contrasting transfer learning scenarios reveals that adaptation outperforms baseline approaches across all instruments, with varying degrees of improvement.
In the within-domain setting, adaptation yielded high accuracies with F1 scores from 0.972 (mineiro) to 0.998 (gonge-lo) and 0.997 (tarol). Improvement was most pronounced for time-keeping instruments with lower baseline accuracies (≈ 0.500), with cuica showing a 52 p.p. gain. For the cross-domain adaptation, while improvements over the beat-tracking baseline (bsl*) were evident, comparison against the onset-tracking baseline (bsl) revealed instrument-dependent patterns. Voicing instruments' best F1 scores remained below the onset-tracking baseline by 11-24 p.p., indicating limited benefits from domain adaptation. However, for time-keeping instruments whose onsets align with the pre-trained model's rhythmic priors, cross-task adaptation yielded improvements of 45-48 percentage points.
These findings provide key insights: i) Fine-tuning consistently enhances performance in both settings, making it valuable for achieving high accuracy in underrepresented music genres; ii) Models trained on beat-tracking can be effectively adapted for onset detection, leveraging model scale to compensate for task divergence and addressing limited data availability for non-mainstream instruments. However, effectiveness varies by instrument type: beat-to-onset adaptation benefits time-keeping instruments, while onset-to-onset adaptation consistently improves performance across all instruments. These improvements are naturally more substantial when baseline accuracy is lower, as observed in voicing instruments.
Our results also demonstrate that optimal fine-tuning configurations vary by instrument, necessitating tailored strategies for selecting which layer weights to update during fine-tuning. This challenges the assumption that only layers closest to the musical surface and the output layer would require recalibration to optimize a network for a specific instrument [17].
Finally, several limitations warrant consideration. Our results represent a single experimental cycle, and despite prior research suggesting relative stability across runs [28, 30], the stochastic nature of the (re)training process, due to convolutional dropout, implies that results may vary. While unlikely to affect general trends, multiple cycles would be needed to investigate specific aspects such as receptive field size impact and its relation to optimal layer freeze selection for instrument waveform profiles. Note that, as previously discussed, corresponding layers across the two models differ in their temporal receptive fields despite having the same labels. For instance, while Conv3 corresponds to approximately 50 ms in both models, the layer Tcn2 spans 170 ms in TCN 1 vs. 410 ms in TCN 2. This discrepancy must be considered when interpreting results, limiting direct comparison between specific freeze configurations across scenarios. The lower annotation precision of mineiro further limits some result interpretation, potentially explaining its anomalous performance (e.g. lowest fine-tuned and baseline accuracy in each setting).
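The 50 ms and 170 ms figures quoted for TCN 1 can be reproduced with a small receptive-field calculation. The sketch assumes stride-1 convolutions with no temporal pooling, that the first number of each filter shape is its extent in time (3, 3 and 1 frames for TCN 1), one dilated convolution of kernel size 5 per TCN level, and 10 ms frames. The 410 ms quoted for TCN 2 does not follow from these assumptions, so TCN 2 presumably has a different per-level structure that is not modelled here.

```python
FRAME_MS = 10  # temporal resolution of the network, as stated in the text


def cumulative_receptive_field_ms(conv_time_kernels, tcn_kernel, dilations):
    """Return [(layer_label, receptive_field_ms)] after Conv3 and each TCN level.

    Assumes stride-1 layers stacked in sequence, so each layer widens the
    receptive field by (kernel - 1) * dilation frames.
    """
    frames = 1
    for kernel in conv_time_kernels:
        frames += kernel - 1                      # plain convolution, dilation 1
    spans = [("Conv3", frames * FRAME_MS)]
    for dilation in dilations:
        frames += (tcn_kernel - 1) * dilation     # dilated convolution
        spans.append((f"Tcn{dilation}", frames * FRAME_MS))
    return spans


# TCN 1 under the stated assumptions: time extents 3, 3, 1; kernel size 5;
# dilations 1, 2, 4, ..., 1024.
tcn1_spans = dict(
    cumulative_receptive_field_ms(
        conv_time_kernels=(3, 3, 1),
        tcn_kernel=5,
        dilations=[2 ** i for i in range(11)],
    )
)
print(tcn1_spans["Conv3"], tcn1_spans["Tcn2"])  # 50 170
```

The same function could be used to relate each freeze boundary to the waveform spans of the instruments, which is one of the aspects the limitations above leave for future multi-run experiments.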
Notably, our current results were achieved with minimal adjustment to the experimental pipeline to maintain fair comparison with baselines. This conservative approach suggests greater improvements might be possible through hyperparameter optimization; for example, cross-task adaptation may require more epochs to converge than within-task adaptation. While such optimization exceeded this study's scope, it represents a promising direction for extending the clear performance gains demonstrated here.
4. CONCLUSION
This study investigated onset detection in Maracatu de baque solto through two transfer learning strategies: onset-to-onset adaptation and beat-to-onset adaptation. Both approaches yielded notable improvements over baseline models, underlining the advantages of fine-tuning for enhancing accuracy.
We demonstrated that cross-task adaptation of models is viable for less-represented tasks such as onset detection when structural alignment exists between source and target domains. Transfer learning effectively addresses limited data availability and circumvents extensive manual annotation or costly training from scratch, a finding with important implications for music information retrieval, particularly when facing data scarcity challenges.
Future work should address this study's limitations while exploring in greater detail the factors influencing transfer learning effectiveness. Multiple-run experiments would confirm observed trends and investigate specific aspects, such as optimal freeze segment selection and its relation with network receptive field and instrument waveform profiles, alongside potential improvements through hyperparameter optimization. Additional research directions include extending the analysis to other datasets and underrepresented instruments, and refining training protocols. Evaluating our adaptive approach using stricter tolerance windows would provide deeper insights into temporal precision, particularly for expressive instruments and microtiming analysis applications where fine-scale temporal variations are significant.
In summary, this study demonstrates the effectiveness of transfer learning in improving musical onset detection for diverse traditions beyond the Western canon. By adapting existing models, we can improve accuracy and robustness for underrepresented sounds. These methods and insights contribute to developing more inclusive tools for music analysis, with applications extending beyond the specific genres and tasks studied here to benefit the broader field of Music Information Retrieval.
5. REFERENCES
[1] M. Goto and Y. Muraoka, "A beat tracking system for acoustic signals of music," in Proceedings of the 2nd ACM International Conference on Multimedia (MULTIMEDIA '94). ACM Press, 1994, pp. 365–372.
[2] S. Dixon, "Automatic Extraction of Tempo and Beat From Expressive Performances," Journal of New Music Research, vol. 30, no. 1, pp. 39–58, 2001.
[3] R. B. Dannenberg, "Toward automated holistic beat tracking, music analysis, and understanding," in Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR), 2005, pp. 366–373.
[4] S. Böck, F. Krebs, and M. Schedl, "Evaluating the online capabilities of onset detection methods," in Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), 2012, pp. 49–54.
[5] J. Pons, R. Gong, and X. Serra, "Score-informed syllable segmentation for a cappella singing voice with convolutional neural networks," in Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 383–389.
[6] R. Vogl, M. Dorfer, G. Widmer, and P. Knees, "Drum Transcription via Joint Beat and Drum Modeling using Convolutional Recurrent Neural Networks," in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2017, pp. 150–157.
[7] J. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. Sandler, "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 1035–1047, 2005.
[8] S. Dixon, "Onset detection revisited," in Proceedings of the 9th International Conference on Digital Audio Effects (DAFx), 2006, pp. 133–137.
[9] M. Marolt, A. Kavcic, and M. Privosnik, "Neural networks for note onset detection in piano music," in Proceedings of the International Computer Music Conference (ICMC), 2002.
[10] A. Lacoste and D. Eck, "A Supervised Classification Algorithm for Note Onset Detection," EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1, p. 043745, 2006.
[11] F. Eyben, S. Böck, B. Schuller, and A. Graves, "Universal onset detection with bidirectional long short-term memory neural networks," Proceedings of the 11th International Society for Music Information Retrieval Conference, ISMIR 2010, no. January, pp. 589–594, 2010.
[12] J. Schlüter and S. Böck, "Musical onset detection with Convolutional Neural Networks," in 6th international workshop on machine learning and music (MML), 2013.
[13] M. Tomczak and J. Hockman, "Onset Detection for String Instruments Using Bidirectional Temporal and Convolutional Recurrent Networks," in Proceedings of the 18th International Audio Mostly Conference. ACM, 2023, pp. 136–142.
[14] G. Peeters, "The Deep Learning Revolution in MIR: The Pros and Cons, the Needs and the Challenges," in Perception, Representations, Image, Sound, Mu-
sic - 14 h In e na ional Symposium, CMMR 2019,
Ma seille, F ance, Oc obe 14-18, 2019, Re ised Se-
lec ed Pape s, se . Lec u e No es in Compu e Science,
R. K onland-Ma ine , S. Ys ad, and M. A amaki, Eds.,
ol. 12631. Sp inge , 2021, pp. 3–30.
[15] A. S ini asamu hy, A. Holzap el, and X. Se a, “In
Sea ch o Au oma ic Rhy hm Analysis Me hods o
Tu kish and Indian A Music,” Jou nal o New Music
Resea ch, ol. 43, no. 1, pp. 94–114, 2014.
[16] J. Bol and G. Fazekas, “Supe ised Con as i e Lea n-
ing Fo Musical Onse De ec ion,” in P oceedings
o he 18 h In e na ional Audio Mos ly Con e ence.
ACM, 2023, pp. 130–135.
[17] M. E. P. Da ies, M. Fuen es, J. Fonseca, L. Aly,
M. Je ónimo, and F. B. Ba aldi, “Mo ing in Time:
Compu a ional Analysis o Mic o iming in Ma aca u
de Baque Sol o,” in P oceedings o he 21 h In e na-
ional Socie y o Music In o ma ion Re ie al Con e -
ence (ISMIR), 2020, pp. 795–802.
[18] M. Fuen es, B. McFee, H. C. C ayencou , S. Essid, and
J. P. Bello, “A Music S uc u e In o med Downbea
T acking Sys em Using Skip-chain Condi ional Ran-
dom Fields and Deep Lea ning,” in IEEE In e na ional
Con e ence on Acous ics, Speech and Signal P ocess-
ing (ICASSP), ol. 2019-May. IEEE, 2019, pp. 481–
485.
[19] A. S ini asamu hy, A. Holzap el, and X. Se a, “In-
o med au oma ic me e analysis o music eco d-
ings,” in P oceedings o he 18 h In e na ional Socie y
o Music In o ma ion Re ie al Con e ence (ISMIR),
2017, pp. 679–685.
[20] D. Fiocchi, M. Buccoli, M. Zanoni, F. An onacci,
and A. Sa i, “Bea T acking using Recu en Neu-
al Ne wo k: A T ans e Lea ning App oach,” in 26 h
Eu opean Signal P ocessing Con e ence (EUSIPCO).
IEEE, 2018, pp. 1915–1919.
[21] G. Bu loiu, “In e ac i e Lea ning o Mic o iming in an
Exp essi e D um Machine,” in The Join Con e ence
on AI Music C ea i i y, 2020.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
326
[22] Y. Wang, J. Salamon, M. Ca w igh , N. J. B yan, and
J. P. Bello, “Few-Sho D um T ansc ip ion in Poly-
phonic Music,” in P oceedings o he 21s In e na-
ional Socie y o Music In o ma ion Re ie al Con e -
ence (ISMIR), 2020, pp. 117–124.
[23] Y. Wang, N. J. B yan, M. Ca w igh , J. Pablo Bello,
and J. Salamon, “Few-Sho Con inual Lea ning o Au-
dio Classi ica ion,” in IEEE In e na ional Con e ence
on Acous ics, Speech and Signal P ocessing (ICASSP).
IEEE, 2021, pp. 321–325.
[24] J. J. Vale o-Mas and J. M. Iñes a, “In e ac i e use co -
ec ion o au oma ically de ec ed onse s: app oach and
e alua ion,” EURASIP Jou nal on Audio, Speech, and
Music P ocessing, ol. 2017, no. 1, p. 15, 2017.
[25] K. Yamamo o, “Human-in- he-Loop Adap a ion o In-
e ac i e Musical Bea T acking,” in P oceedings o he
22nd In e na ional Socie y o Music In o ma ion Re-
ie al Con e ence (ISMIR), 2021, pp. 794–801.
[26] M. Schedl, E. Gómez, and J. U bano, “Music In o -
ma ion Re ie al: Recen De elopmen s and Applica-
ions,” Founda ions and T ends® in In o ma ion Re-
ie al, ol. 8, no. 2-3, pp. 127–261, 2014.
[27] A. S. Pin o and M. E. P. Da ies, “Towa ds use -
in o med bea acking o musical audio,” in 14 h In-
e na ional Symposium on Compu e Music Mul idis-
ciplina y Resea ch (CMMR), 2019, pp. 577–588.
[28] A. Pin o, S. Böck, J. Ca doso, and M. Da ies, “Use -
D i en Fine-Tuning o Bea T acking,” Elec onics,
ol. 10, no. 13, p. 1518, 2021.
[29] L. S. Maia, M. Rocamo a, and M. Fuen es, “Adap -
ing me e acking models o La in ame ican music,” in
P oceedings o he 23 h In e na ional Socie y o Mu-
sic In o ma ion Re ie al Con e ence (ISMIR), 2022,
pp. 361–368.
[30] A. S. Pin o and G. Be na des, “B idging he Rhy h-
mic Gap : A Use -Cen ic App oach o Bea T acking
in Challenging Music Signals,” in 16 h In e na ional
Symposium on Compu e Music Mul idisciplina y Re-
sea ch (CMMR), 2023, pp. 1–12.
[31] K. Choi, G. Fazekas, M. Sandle , and K. Cho, “T ans-
e lea ning o music classi ica ion and eg ession
asks,” in P oceedings o he 18 h In e na ional Con-
e ence on Music In o ma ion Re ie al (ISMIR), 2017,
pp. 141–149.
[32] J. Fonseca, M. Fuen es, F. Bonini Ba aldi, and M. E.
Da ies, “On he Use o Au oma ic Onse De ec ion
o he Analysis o Ma aca u de Baque Sol o,” in Pe -
spec i es on Music, Sound and Musicology.Cu en Re-
sea ch in Sys ema ic Musicology, ol. 10., L. Co eia
Cas ilho, R. Dias, and J. Pinho, Eds. Sp inge Cham,
2021, pp. 209–225.
[33] S. Böck and M. E. P. Da ies, “Decons uc , Anal-
yse, Recons uc : How To Imp o e Tempo, Bea , and
Downbea Es ima ion,” in P oceedings o he 21s In-
e na ional Socie y o Music In o ma ion Re ie al
Con e ence (ISMIR), 2020, pp. 574–582.
[34] B. McFee, J. W. Kim, M. Ca w igh , J. Salamon,
R. Bi ne , J. P. Bello, and O.-s. P ac ices, “Open-
Sou ce P ac ices o Music Signal P ocessing Re-
sea ch: Recommenda ions o T anspa en , Sus ain-
able, and Rep oducible Audio Resea ch,” IEEE Signal
P ocessing Magazine, ol. 36, no. Janua y, pp. 128–
137, 2019.
[35] C. d. O. San os, T. S. Resende, and P. M. Keays,
Ba uque Book: Ma aca u Baque Vi ado e Baque Sol o.
Au ho ’s edi ion., 2009.
[36] G. P. Bessoni e Sil a, “Ma aca u de Baque Sol o: de
b incadei a a pa imônio cul u al,” Cade no Vi ual de
Tu ismo, ol. 21, no. 2, p. 113, 2021.
[37] M. E. P. Da ies and S. Böck, “Tempo al con olu ional
ne wo ks o musical audio bea acking,” in P oceed-
ings o he 27 h Eu opean Signal P ocessing Con e -
ence (EUSIPCO), 2019.
[38] S. Böck, F. Ko zeniowski, J. Schlü e , F. K ebs, and
G. Widme , “madmom: A New Py hon Audio and
Music Signal P ocessing Lib a y,” in P oceedings o
he 24 h ACM In e na ional Con e ence on Mul imedia
(MM ’16). ACM, 2016, pp. 1174–1178.
[39] J. Schlü e and S. Böck, “Imp o ed musical onse de-
ec ion wi h Con olu ional Neu al Ne wo ks,” in IEEE
In e na ional Con e ence on Acous ics, Speech and
Signal P ocessing (ICASSP). IEEE, 2014, pp. 6979–
6983.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
327