THE RHYTHM IN ANYTHING: AUDIO-PROMPTED DRUMS
GENERATION WITH MASKED LANGUAGE MODELING
Patrick O’Reilly¹, Julia Barnett¹, Hugo Flores Garcia¹,
Annie Chu¹, Nathan Pruyne¹, Prem Seetharaman², Bryan Pardo¹
¹ Northwestern University, Evanston, USA  ² Adobe Research, San Francisco, USA
[email protected]
ABSTRACT
Musicians and nonmusicians alike use rhythmic sound gestures, such as tapping and beatboxing, to express drum patterns. While these gestures effectively communicate musical ideas, realizing these ideas as fully-produced drum recordings can be time-consuming, potentially disrupting many creative workflows. To bridge this gap, we present TRIA (The Rhythm In Anything), a masked transformer model for mapping rhythmic sound gestures to high-fidelity drum recordings. Given an audio prompt of the desired rhythmic pattern and a second prompt to represent drumkit timbre, TRIA produces audio of a drumkit playing the desired rhythm (with appropriate elaborations) in the desired timbre. Subjective and objective evaluations show that a TRIA model trained on less than 10 hours of publicly-available drum data can generate high-quality, faithful realizations of sound gestures across a wide range of timbres in a zero-shot manner.
1. INTRODUCTION
Sound gestures such as tapping and beatboxing provide a convenient and idiomatic means of expressing rhythmic ideas. Rather than “literally” specifying a rhythmic idea through one-to-one imitation, sound gestures often capture a reduced, high-level representation of the desired rhythm; for instance, a beatboxer may only voice one element where many have simultaneous onsets, or leave certain elements unvoiced and implied. Realizing these gestures as fully-produced drum arrangements often requires many steps: the voiced sound elements in a gesture must be mapped to appropriate drum parts, unvoiced or implied elements must be plausibly reconstructed, the resulting arrangement must be performed and recorded or sequenced and synthesized digitally in audio editing software, and the final recording may require further processing to shape the timbre satisfactorily. By contrast, many creative workflows may benefit from the ability to rapidly generate diverse full-drumkit realizations of rhythmic sound gestures.

Figure 1. TRIA conditions generation of a new drum recording on two prompts: the timbre of an example drum recording (illustrated by a spectrogram), and the rhythm of a sound gesture (the dualized features in Section 3.1).

© P. O’Reilly, J. Barnett, H. F. Garcia, A. Chu, N. Pruyne, P. Seetharaman, and B. Pardo. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: P. O’Reilly, J. Barnett, H. F. Garcia, A. Chu, N. Pruyne, P. Seetharaman, and B. Pardo, “The Rhythm In Anything: Audio-Prompted Drums Generation with Masked Language Modeling”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
To bridge this gap, we propose TRIA (The Rhythm In Anything), a masked transformer model for mapping arbitrary rhythmic sound gestures to high-fidelity drum recordings. Given two audio prompts (one specifying the basic desired rhythm via a sound gesture, and one specifying the desired drum timbre via an example recording), TRIA synthesizes full-drumkit audio playing a fleshed-out arrangement of the desired rhythm in the desired timbre. TRIA can faithfully realize sound gestures in unseen timbres in a zero-shot manner despite its relatively small model size (43M trainable parameters) and training dataset (less than 10 hours of publicly-available drum recordings from MusDB18-HQ [1]). Through both quantitative comparisons and qualitative human listening evaluations, we demonstrate that TRIA matches or exceeds the performance of a 1-billion parameter state-of-the-art model [2] trained on 20,000 hours of public and private data in converting sound gestures to drum recordings.
Our contributions are as follows:
1. A model capable of mapping arbitrary rhythmic sound gestures to high-fidelity drum recordings using drum timbres specified at inference time
2. A dualized representation that lets the model capture salient rhythmic structure across drum and non-drum sound classes
3. Subjective and objective evaluations showing the importance of the dualized representation and the model’s ability to generate musically-pleasing translations that adhere to rhythm and timbre prompts
We provide audio examples and code on our webpage.¹

Figure 2. The proposed TRIA system. During training (left), acoustic tokens of a tokenized drum recording are predicted, conditioned on surrounding unmasked tokens and rhythm features extracted from an augmented version of the recording; we illustrate three training examples. During inference (right), we fix the timbre prompt as a prefix and predict a masked suffix conditioned on aligned features extracted from the rhythm prompt. Inference predicts tokens in coarse-to-fine order.
2. RELATED WORK
The translation of simple rhythmic gestures into full drum beats has been explored in the symbolic domain, notably in the GrooVAE models proposed by Gillick et al. [3]. While these allow for mapping single-voice MIDI drum patterns to full-drumkit expressive MIDI performances, they do not allow for audio-prompted rhythm or timbre specification.
In the audio domain, Santos & Cardoso applied RAVE [4] models to a tap-to-drums translation task [5]. However, RAVE does not support zero-shot audio-prompted timbre specification, but instead requires re-training for each new specified timbre. In general, recent neural network-based timbre transfer systems are similarly constrained or else support only pitched instruments [6, 7]. One exception is MelodyFlow [2], which performs text-guided audio editing via latent diffusion inversion, hypothetically allowing for the specification of arbitrary timbres via text prompts. We perform extensive comparisons between our proposed system and MelodyFlow in Section 4.
A number of systems use transcription to translate beatbox audio into drum recordings via synthesis from a predicted MIDI representation, but generally require user-specific calibration for accurate transcription and do not support audio-prompted timbre specification [8,9]. In general, transcription-based systems are constrained to narrow sound gesture types with well-defined audio-symbol mappings or available annotated data for supervised training (e.g., beatboxing), and limited to “literal” mappings of timbres onto atomic sound events. By contrast, we propose an audio-prompted, self-supervised approach to mapping simple rhythmic gestures to potentially complex full-drumkit recordings, allowing for the generation of arrangement details not explicit in the rhythm gesture.

¹ https://therhythminanything.github.io/
Previous works have hypothesized that musicians often perceive and arrange drum patterns using implicit two-voice “dualized” representations that oscillate between low and high states [10,11]. However, the use of dualized representations for music generation has been limited to the symbolic domain [12]. Our proposed system obtains dualized representations from audio (Section 3.1) to guide the generation of drum audio, letting us specify rhythmic structure with non-drum sounds (e.g. finger tapping).
Finally, our work differs from prior work on generating symbolic rhythm patterns [13–15], drum loops [16,17], and drum samples [18–20] in that we seek to convert sound gestures into audio-domain full drumkit performances.
3. METHOD
We next describe the design of the proposed TRIA system.
Architecture: Similar to VampNet [21], TRIA is a transformer-based masked language model. TRIA consists of 12 standard transformer blocks, each with hidden size h = 512, 8 attention heads, and rotary positional encoding [22], resulting in 43 million trainable parameters.
Audio Tokenization: TRIA predicts acoustic tokens produced by Descript Audio Codec (DAC) [23]. Within DAC, audio is segmented into a series of T frames, each of which is mapped to a vector representation via a fully convolutional encoder. Encoded vectors are quantized with a hierarchical sequence of C vector-quantizers, each with its own codebook. Each quantizer encodes the residual between the original and the quantized representation produced by the previous quantizers. Quantized vectors are represented by their codebook indices, resulting in a token representation of C codebooks by T frames. A matched decoder converts C×T token representations into audio.
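The residual quantization described above can be illustrated with a toy sketch. This is a minimal pure-Python illustration and not DAC's implementation: the two tiny codebooks, the 2-dimensional frame vector, and the helper names `nearest`, `rvq_encode`, and `rvq_decode` are all assumptions made for the example.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize one frame vector into one token per codebook, coarse to fine."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # The next quantizer only sees what this level failed to explain.
        residual = [r - q for r, q in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected entries of every codebook to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + q for o, q in zip(out, codebook[idx])]
    return out


# Hypothetical toy setup: C = 2 codebooks over 2-dimensional vectors.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse level
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine level (residual corrections)
]
frame = [1.2, 1.0]
tokens = rvq_encode(frame, codebooks)
reconstruction = rvq_decode(tokens, codebooks)
```

Applying `rvq_encode` to each of T frames and stacking the results by codebook gives the C×T token grid that TRIA operates on.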
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Masked Language Modeling: TRIA generates drum audio by predicting missing or “masked” DAC tokens within a partially-masked “buffer” of size C×T, conditioned on unmasked tokens (representing the target timbre and generated content), as is typical of masked token modeling. TRIA, however, also conditions generation on aligned rhythm features representing the target rhythm (see Section 3.1). Once all masked tokens are predicted, they are mapped to 44.1kHz mono audio via the DAC decoder.
To produce predictions for masked tokens in the buffer within a specific codebook c ∈ [0, C−1], all tokens in the buffer are first mapped to continuous vectors of size h via separate learned embedding tables per codebook, with masked tokens mapped to a single learned “mask” embedding shared across all codebooks. Recall that the tokens in every codebook at level c′ > c correct the residual error of the token at level c. Therefore, if a token at level c is masked, all corresponding embedding vectors in codebooks c′ > c are zeroed. Embedding vectors are then summed across codebooks to obtain a sequence of shape h×T. Rhythm features (Section 3.1) are projected to the hidden dimension and zeroed for frames in which there are no masked tokens, resulting in a corresponding conditioning sequence of shape h×T. The two sequences are summed and passed to the transformer, which predicts a probability distribution over tokens in codebook c at each frame via one of C codebook-specific projection layers.
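The embedding rule above (per-codebook lookup, a shared mask embedding, and zeroing of finer levels beneath a masked token) can be sketched as follows. This is an illustrative toy rather than TRIA's code: the `MASK` sentinel, the function name `embed_buffer`, the table values, and the hidden size of 2 are assumptions, and the result is stored frame-major (T×h) for readability.

```python
MASK = -1  # hypothetical sentinel marking a masked token


def embed_buffer(tokens, tables, mask_vec):
    """Sum per-codebook embeddings for each frame of a C x T token buffer.

    tokens[c][t] is a token index or MASK; tables[c][k] is the embedding of
    token k in codebook c; mask_vec is the single shared mask embedding.
    """
    num_codebooks, num_frames = len(tokens), len(tokens[0])
    embedded = []
    for t in range(num_frames):
        frame = [0.0] * len(mask_vec)
        for c in range(num_codebooks):
            if tokens[c][t] == MASK:
                frame = [f + m for f, m in zip(frame, mask_vec)]
                # Finer levels only refine this (unknown) token, so they
                # contribute zero vectors: stop accumulating for this frame.
                break
            frame = [f + e for f, e in zip(frame, tables[c][tokens[c][t]])]
        embedded.append(frame)
    return embedded


# Hypothetical toy buffer: C = 2 codebooks, T = 3 frames, hidden size h = 2.
tables = [
    [[1.0, 0.0], [2.0, 0.0]],  # codebook 0 embeddings
    [[0.0, 1.0], [0.0, 2.0]],  # codebook 1 embeddings
]
mask_vec = [9.0, 9.0]
tokens = [
    [0, 1, MASK],  # coarse codebook
    [1, MASK, 0],  # fine codebook
]
embedded = embed_buffer(tokens, tables, mask_vec)
```

In the third frame the fine-level token is present but ignored, because the coarse token it refines is masked.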
Inference: At inference, we take as inputs a timbre prompt (drum) recording and a rhythm prompt (sound gesture) recording. We construct a buffer in which the tokenized timbre prompt serves as an unmasked prefix, with all subsequent frames (corresponding to the length of the rhythm prompt) fully masked. We compute rhythm features aligned to this masked suffix from the rhythm prompt. We then perform SoundStorm-style inference [24] to iteratively predict masked tokens in each codebook in coarse-to-fine order, using the schedule of Chang et al. [25] to gradually unmask or “confirm” tokens in the suffix. We adopt temperature-based nondeterministic unmasking from VampNet and causal bias from StemGen [26] to favor unmasking earlier tokens in the buffer first.
Thus, we fill in the masked suffix using timbral information from the timbre prompt and rhythmic information from the rhythm prompt, resulting in a generation that adheres to both prompts. By specifying the number of inference iterations over which each codebook is unmasked, we can expend more compute on challenging high-entropy early generation steps and less on highly-determined later steps. For all experiments reported in this paper, we use an inference schedule of {8,8,8,8,8,4,4,4,4} iterations for DAC’s 9 respective codebooks in coarse-to-fine order, classifier-free guidance [27] weight 2.0, unmasking temperature 10.0, and causal bias 1.0.
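To make the iteration budget concrete, the sketch below tabulates how many suffix tokens would remain masked after each iteration of one codebook. It assumes the cosine schedule takes the commonly used form cos(πr/2) for progress r in (0, 1]; the exact rounding, the function name `tokens_still_masked`, and the suffix length of 344 frames are assumptions for illustration, not values from the paper.

```python
import math

# Per-codebook iteration counts reported above, in coarse-to-fine order.
SCHEDULE = [8, 8, 8, 8, 8, 4, 4, 4, 4]


def tokens_still_masked(step, num_steps, num_tokens):
    """Tokens left masked after iteration `step` (0-indexed) of `num_steps`."""
    if step == num_steps - 1:
        return 0  # the final iteration confirms everything that is left
    ratio = math.cos(0.5 * math.pi * (step + 1) / num_steps)
    return math.floor(ratio * num_tokens)


SUFFIX_FRAMES = 344  # hypothetical suffix length in frames
remaining = [
    tokens_still_masked(step, SCHEDULE[0], SUFFIX_FRAMES)
    for step in range(SCHEDULE[0])
]
total_iterations = sum(SCHEDULE)  # iterations across all 9 codebooks
```

Under this form of the schedule, few tokens are confirmed in the first iterations and many in the last ones, so early, high-entropy decisions are made a few at a time.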
Figure 3. Results of the listener preference evaluation detailed in Section 4.3. We plot win rates of TRIA and MelodyFlow generations from rhythm prompts sampled from the AVP and TapTamDrum (TTD) datasets, as well as random anchors from MoisesDB drums.

Training: At each training iteration, we sample a drum recording that serves as both timbre and rhythm prompt, tokenizing with DAC to obtain a buffer and computing rhythm features at a matching temporal resolution. We select a random codebook and a random span of consecutive frames covering 50% to 75% of the buffer length, and mask a subset of tokens within this codebook and span according to the cosine schedule proposed by Chang et al. [25]; we then compute cross-entropy loss between TRIA’s predicted distributions at masked token positions and the corresponding ground-truth tokens. To allow TRIA to process rhythm prompts from a variety of sound sources and recording conditions, we apply noise, band-pass filtering, pitch shift, phase shift, and equalization to the rhythm prompt audio with independent 25% probabilities at each iteration. To provide control over the degree of adherence to the rhythm prompt, we implement classifier-free guidance [27] by zeroing rhythm features in 20% of training iterations to learn unconditional mappings, and then performing weighted interpolation between unconditional and conditional predictions at inference time.
We train all TRIA models on drums from a 90% split of the MusDB18-HQ dataset [1], totaling 8 hours of audio. We train on 6-second random excerpts for 100,000 iterations at a batch size of 48 on 4×NVIDIA A10G GPUs, requiring ∼27 hours per model. Training and inference are illustrated in Figure 2.
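The two halves of classifier-free guidance described above can be sketched in a few lines. This is a hedged illustration: the helper names `drop_rhythm` and `guide` and the toy logits are assumptions, and the combination rule shown (unconditional logits plus a weighted difference) is the common formulation, which may differ in detail from TRIA's code.

```python
import random


def drop_rhythm(features, rng, p_drop=0.2):
    """Training side: zero all rhythm features with probability p_drop."""
    if rng.random() < p_drop:
        return [[0.0] * len(row) for row in features]
    return features


def guide(cond_logits, uncond_logits, weight=2.0):
    """Inference side: move from unconditional toward conditional logits.

    weight = 0 gives the unconditional prediction, weight = 1 the conditional
    one, and weight > 1 extrapolates past it for stronger rhythm adherence.
    """
    return [u + weight * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Hypothetical logits over a 3-token vocabulary at one masked position.
cond = [2.0, 0.0, -1.0]    # predicted with rhythm features
uncond = [1.0, 0.5, -1.0]  # predicted with rhythm features zeroed
guided = guide(cond, uncond, weight=2.0)

rng = random.Random(0)
features = [[0.5, 0.25], [0.75, 1.0]]
always_dropped = drop_rhythm(features, rng, p_drop=1.0)
never_dropped = drop_rhythm(features, rng, p_drop=0.0)
```

With the reported weight of 2.0, tokens that the rhythm features make more likely are boosted beyond the conditional prediction alone.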
3.1 Dualized Rhythm Representation
To allow inference on arbitrary sound gestures while training only on drum audio, we require (1) timbre-rhythm disentanglement, with timbre information for the prediction of masked token spans provided by unmasked tokens outside the span and rhythm information provided by aligned rhythm features within the span; and (2) a rhythm feature representation that captures the structure of both drums and sound gestures with vastly different frequency energy distributions. If timbre-rhythm disentanglement is not enforced, e.g. if timbre information leaks from the rhythm features, TRIA will not apply the specified timbre. If there exists a modality gap between drums and sound gestures within the rhythm feature representation, TRIA will struggle to map sound gestures to plausible drum generations.
The simplest rhythm representation satisfying these criteria is a one-dimensional sequence of loudness estimates, which captures onset information similar to GrooVAE [3].
                    F1 Snare ↑         F1 Kick ↑
Model               30ms     100ms     30ms     100ms
Random anchor       0.04     0.15      0.09     0.29
MelodyFlow_0.0      0.08     0.16      0.11     0.19
MelodyFlow_0.1      0.11     0.13      0.13     0.18
MelodyFlow_0.2      0.19     0.23      0.21     0.23
TRIA_1Band          0.23     0.35      0.38     0.50
TRIA_2Band*         0.32     0.47      0.52     0.66
TRIA_2Band-NA       0.10     0.17      0.47     0.62
TRIA_3Band          0.33     0.50      0.61     0.71
TRIA_4Band          0.30     0.47      0.59     0.72

Table 1. F1 scores for automatic snare and kick transcriptions of MelodyFlow and TRIA generations from annotated AVP beatbox recordings at 30ms and 100ms onset tolerances. Higher scores indicate generations preserve the placement of kicks and snares from beatbox recordings.
However, researchers have found that onset representations fail to adequately capture relationships between multiple elements within percussion patterns and human sound gestures [10, 11]; for instance, distinct kick and snare vocalizations within a beatbox recording may be “flattened” into indistinguishable loudness spikes, making it difficult for TRIA to faithfully map the beatbox to drums. On the other hand, if the rhythm feature representation is too fine-grained, e.g. a full spectrogram, it will leak timbre information from the rhythm prompt and cause TRIA to ignore the timbre prompt. Additionally, drums and sound gestures will likely manifest distinctly in fine-grained feature representations, causing a train-inference mismatch.
To address these potential pitfalls, we propose a rhythm feature representation based on a two-band spectrogram with an adaptive splitting frequency. We start with an 80-bin mel-spectrogram of the rhythm prompt audio and compute a splitting frequency that equally divides energy into low and high bands, summing all bins within each band. We then standardize each band independently, apply a sigmoid nonlinearity to bound all values to [0,1], and quantize all values to 33 steps (0, 1/32, 2/32, ..., 1) within this range.
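The feature pipeline above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `dualize` is hypothetical, the input is assumed to be a precomputed mel-spectrogram stored as a list of bins (rows) by frames (columns), and the split point is assumed to be chosen once per clip from energy summed over time.

```python
import math


def dualize(mel, levels=33):
    """Reduce a (bins x frames) mel-spectrogram to two quantized band curves."""
    n_bins, n_frames = len(mel), len(mel[0])

    # Adaptive split: first bin boundary at which the low band holds at
    # least half of the total energy (assumed to be computed per clip).
    bin_energy = [sum(row) for row in mel]
    half = 0.5 * sum(bin_energy)
    running, split = 0.0, n_bins - 1
    for b, energy in enumerate(bin_energy):
        running += energy
        if running >= half:
            split = b + 1
            break
    split = min(max(split, 1), n_bins - 1)  # keep both bands non-empty

    bands = []
    for rows in (mel[:split], mel[split:]):
        band = [sum(row[t] for row in rows) for t in range(n_frames)]
        mean = sum(band) / n_frames
        std = math.sqrt(sum((x - mean) ** 2 for x in band) / n_frames) or 1.0
        squashed = [1.0 / (1.0 + math.exp(-(x - mean) / std)) for x in band]
        # Quantize to `levels` evenly spaced values in [0, 1].
        bands.append([round(s * (levels - 1)) / (levels - 1) for s in squashed])
    return split, bands


# Hypothetical 4-bin, 4-frame spectrogram: low hits alternate with high hits.
mel = [
    [4.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [0.0, 2.0, 0.0, 2.0],
]
split, (low, high) = dualize(mel)
```

In the toy input the low and high curves alternate in opposition, which is the two-voice "dualized" pattern the representation is meant to expose regardless of the sound source.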
Our motivation for this representation is twofold. First, a two-voice representation allows core elements of drum recordings and sound gestures to manifest distinctly, but lacks sufficient detail to leak timbre information or distinguish between drum recordings and sound gestures. Second, two-voice or “dualized” rhythm representations have been explored previously for the analysis and generation of drum patterns in the symbolic domain [10–12]. We extend this line of inquiry by evaluating the efficacy of audio-derived dualizations for audio generation.
4. EXPERIMENTS
We empirically validate TRIA’s ability to map sound gestures to full-drumkit recordings in user-specified timbres across two specific sound gesture types (beatboxing and tapping). We conduct both subjective human evaluations and objective evaluations of generation quality and adherence to rhythm and timbre prompts.

                    MFCC-Sim
Model               Rhythm ↓   Timbre ↑   Random
MelodyFlow_0.0      0.88       --         0.81
MelodyFlow_0.1      0.92       --         0.86
MelodyFlow_0.2      0.96       --         0.85
TRIA_1Band          0.85       0.95       0.87
TRIA_2Band*         0.85       0.96       0.87
TRIA_2Band-NA       0.83       0.93       0.85
TRIA_3Band          0.86       0.95       0.87
TRIA_4Band          0.84       0.96       0.86

Table 2. Timbral similarity between model outputs, input rhythm/timbre prompts, and random drum recordings as measured by time-averaged MFCC cosine similarity. Higher-than-random similarity with the rhythm prompt implies timbre leakage, while higher-than-random similarity with the timbre prompt implies prompt adherence.
4.1 Models
TRIA: In addition to the TRIA system described in Section 3 (TRIA_2Band*), we validate our choice of rhythm feature representation by comparing variants of TRIA trained on 1-band (TRIA_1Band), 2-band with no adaptive frequency split (TRIA_2Band-NA), 3-band (TRIA_3Band), and 4-band rhythm features (TRIA_4Band).
MelodyFlow: We compare TRIA to MelodyFlow [2], a state-of-the-art text-prompted music editing system. MelodyFlow can apply text-specified timbres to sound gestures using regularized latent inversion, which maps an encoded sound gesture to an initial noise estimate and then resynthesizes it conditioned on the text prompt via flow-matching. This is done by a 1-billion parameter transformer model trained on a mix of private and licensed music totalling 20,000 hours. The degree to which MelodyFlow preserves the structure of the rhythm prompt can be coarsely controlled by specifying the “target flow step” of inversion, with 0.0 corresponding to full noising and 1.0 corresponding to no noising (where the audio is left unaltered). In our experiments we compare target flow steps of 0.0, 0.1, and 0.2 (MelodyFlow_0.0, MelodyFlow_0.1, and MelodyFlow_0.2, respectively); we find that higher values result in negligible adherence to the specified timbre. We use the default settings of 128 inference steps, “Euler” solver, and ReNoise [28] regularization strength 0.2. To allow fair comparisons with TRIA, we downmix MelodyFlow generations from stereo to mono and downsample from 48kHz to 44.1kHz.
4.2 Datasets
We evaluate both TRIA and MelodyFlow on rhythm prompts drawn from two datasets of sound gestures: AVP [29], containing 56 amateur beatbox improvisations across 28 participants and 2 conditions with human-annotated transcriptions; and TapTamDrum [11], containing 1116 two-tone tapping imitations of drum beats across 4 participants. To avoid overlap with TRIA’s training data, we sample audio timbre prompts from the MoisesDB dataset [30], which contains drum stems from 240 commercial-quality music tracks. Because MelodyFlow requires timbre specification via text rather than audio prompts, we generate 50 descriptions of acoustic and electronic drum kit timbres using GPT-4.5 [31], which we manually inspect to ensure quality and diversity. Due to the lack of available drumkit timbre description datasets and our difficulty in obtaining diverse captions from drum audio using existing multimodal models [32], we settle on these synthetic descriptions as a reasonable approximation of “plausible” text prompts, and consult with the MelodyFlow authors to ensure descriptions are formatted appropriately for the model. In all experiments, we sample 2-second timbre prompts for TRIA and generate from rhythm prompts trimmed to a maximum duration of 4 seconds.

Model               KAD_PANN ↓   KAD_CLAP ↓
TRIA_1Band          6.95         6.81
TRIA_2Band*         4.56         5.05
TRIA_2Band-NA       6.61         10.63
TRIA_3Band          4.53         5.46
TRIA_4Band          4.14         4.61

Table 3. Kernel Audio Distance (KAD) between a set of 500 generations from each model and a reference distribution of 500 drum excerpts from MoisesDB; lower scores indicate better audio quality.
4.3 Subjective Evaluation
We first aim to understand how human listeners rate TRIA’s translations of sound gestures to drums when compared to the state-of-the-art model MelodyFlow. To this end, we conduct a listening evaluation utilizing ReSEval [33], a framework for subjective evaluation tasks on crowdworker platforms; we recruit evaluators through the online research platform Prolific.² We evaluate the TRIA_2Band* and MelodyFlow_0.2 variants, as we find that these models provide a good balance of adherence to both rhythm and timbre prompts.
Data Preparation: We prepared 80 sets of short (3–4 second) audio clips. Each set contained (1) a reference sound gesture serving as a rhythm prompt, drawn either from the AVP “personal” condition (beatboxing) or TapTamDrum (tapping); (2) a TRIA generation from the rhythm prompt; (3) a MelodyFlow generation from the rhythm prompt; and (4) a random MoisesDB drum excerpt, unrelated to the rhythm prompt, as a low anchor. We generated these 80 sets using 10 rhythm prompts (5 beatboxing, 5 tapping) and 8 timbre prompts per rhythm prompt. TRIA’s audio timbre prompts were drawn randomly from MoisesDB drum excerpts, while MelodyFlow’s text timbre prompts were drawn from the aforementioned set of 50 generated timbre descriptions. To ensure broadly comparable timbres across generations, we restricted our audio timbre prompts to acoustic drum kit recordings and our text prompts to descriptions of acoustic drum kit timbres.

² https://www.prolific.com/
ABX Trials: We leveraged the findings of Cartwright et al. [34, 35] and deployed pairwise comparison evaluations using remote crowdworkers. In our study, crowdworkers performed ABX trials: they heard a reference rhythm prompt (“X”) and were randomly presented with two clips (“A” and “B”) from the corresponding (1) TRIA generation, (2) MelodyFlow generation, or (3) a random drum excerpt to act as a low anchor. They were then asked to select “A” or “B” given the criterion:
    Select which of the two choices is a more musically pleasing translation from the reference clip to drums that captures the original rhythm and groove of the reference clip.
Full coverage of our 80 sets required 3 pairwise comparisons per set: TRIA vs. MelodyFlow, TRIA vs. Random Excerpt, and MelodyFlow vs. Random Excerpt. We required that 5 listeners evaluate each comparison, resulting in 80 × 3 × 5 = 1200 total trials. From our ABX results, we computed the win rate of TRIA and MelodyFlow on each dataset and evaluated the statistical significance of the indicated listener preference. We present the results of our subjective evaluation in Figure 3.
Participant Recruitment: We recruited 120 US English-speaking human listeners with an approval rating of ≥95% and a record of completing 100+ prior tasks on Prolific. Each listener evaluated 10 randomly assigned ABX pairwise comparisons. To ensure data quality, participants had to pass a listening test assessing tone sensitivity from 55 Hz to 10 kHz [36], along with attention checks. They were paid $2.50 per set of 10 comparisons, estimated to be equivalent to $18.75/hour. We excluded participants who failed the listening test, as well as those who preferred the Random Excerpt ≥80% of the time, as this suggests they disregarded the given evaluation criterion of rhythm adherence. Following data cleaning, we had 116 participants with a total of 1160 evaluation pairs.
4.4 Rhythm Prompt Adherence
To evaluate TRIA’s preservation of the rhythmic structure of sound gestures when translating to drums, we conduct an automated transcription evaluation. We sample 250 generations from each MelodyFlow and TRIA variant conditioned on rhythm prompts drawn from the AVP beatbox dataset, all of which have ground-truth human annotations of kick drum, snare, and hi-hat vocalizations. We then transcribe these generations using the pretrained “Frame-RNN” drum transcription model of Zehren et al. [37]. Finally, we measure the correspondence between transcribed and ground-truth kick and snare drum parts using the onset F1 score with 30ms and 100ms tolerances, as is common in the drum transcription literature [38,39]. Higher F1 scores indicate tighter correspondence between the kick and snare vocalizations in the rhythm prompt and the kick and snare drums in the generated audio. We report results in Table 1.
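For readers unfamiliar with the metric, the following sketch computes an onset F1 score at a given tolerance. It is a simplified illustration, not the evaluation code used here: the function name `onset_f1` and the onset times are hypothetical, and the greedy in-order matching is an assumption that can differ from the optimal matching used by standard evaluation toolkits when onsets are densely packed.

```python
def onset_f1(reference, estimated, tolerance=0.03):
    """F1 between reference and estimated onset times (seconds).

    Each onset may be matched at most once; a match requires the two times
    to differ by no more than `tolerance`.
    """
    ref, est = sorted(reference), sorted(estimated)
    i = j = matches = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            matches += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # estimated onset is too early to match anything later
        else:
            i += 1  # reference onset was missed
    if matches == 0:
        return 0.0
    precision = matches / len(est)
    recall = matches / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical snare annotations vs. transcribed snare hits from a generation.
annotated = [0.0, 0.5, 1.0, 1.5]
transcribed = [0.02, 0.56, 1.0]
f1_30ms = onset_f1(annotated, transcribed, tolerance=0.03)
f1_100ms = onset_f1(annotated, transcribed, tolerance=0.1)
```

The hit at 0.56 s is 60 ms late, so it counts as an error at the 30ms tolerance but as a match at 100ms, which is why the looser tolerance always scores at least as high.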
4.5 Timbre Prompt Adherence
To evaluate TRIA’s treatment of timbre information, we compute the cosine similarity between time-averaged 80-dimensional MFCC representations of the generated audio and timbre prompt (indicating adherence to the timbre prompt), the generated audio and rhythm prompt (indicating the degree of timbre leakage from the rhythm prompt), and the generated audio and a random excerpt from MoisesDB drums (as an anchor). While this measurement of spectral correspondence provides a coarse approximation of timbral similarity, we find that it captures strong trends in each model’s treatment of timbre. Because MelodyFlow allows timbre specification through a text prompt, not an audio prompt, we can only compute its output similarity to the rhythm prompt and random excerpt. Our results are reported in Table 2. We further illustrate the processing of rhythm and timbre prompts by both systems in Figure 4.
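The similarity measure above reduces each clip to a single averaged MFCC vector before comparison. A minimal sketch follows, assuming MFCCs have already been computed elsewhere and are stored as coefficients (rows) by frames (columns); the function names and the tiny 2-coefficient matrices are hypothetical stand-ins for the 80-dimensional features used in the evaluation.

```python
import math


def time_average(mfcc):
    """Average each coefficient (row) over all time frames (columns)."""
    return [sum(row) / len(row) for row in mfcc]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def mfcc_sim(mfcc_a, mfcc_b):
    """Cosine similarity between the time-averaged MFCCs of two clips."""
    return cosine_similarity(time_average(mfcc_a), time_average(mfcc_b))


# Hypothetical toy features: 2 coefficients x 2 frames per clip.
generated = [[1.0, 2.0], [3.0, 4.0]]
timbre_prompt = [[3.0, 3.0], [7.0, 7.0]]
unrelated = [[7.0, 7.0], [-3.0, -3.0]]
sim_to_prompt = mfcc_sim(generated, timbre_prompt)
sim_to_unrelated = mfcc_sim(generated, unrelated)
```

Because the features are averaged over time, the measure ignores when events occur and responds only to overall spectral shape, which is why it serves here as a rough proxy for timbre rather than rhythm.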
4.6 Audio Quality
To evaluate the realism of generated audio, we compute the Kernel Audio Distance (KAD) [40, 41] between 500 outputs from each method and a reference distribution of 500 random excerpts of MoisesDB drums. Similar to Fréchet Audio Distance (FAD) [42], KAD measures the similarity of the generated distribution to a reference distribution, while showing stronger alignment with human quality ratings and less bias at small sample sizes. For KAD we consider the “PANN” embedding variant [43], which the authors show is most correlated with human perception, and the “CLAP-Laion-Music” embedding variant [44], which leverages a model trained specifically on music. Because TRIA receives audio timbre prompts from the reference distribution while MelodyFlow receives text timbre prompts, we compare only variants of TRIA for fairness. We report results in Table 3.
5. DISCUSSION
Our experimental results demonstrate TRIA’s efficacy in translating rhythm gestures to full-drumkit recordings faithful to the rhythm and timbre prompts. As illustrated in Figure 3, our subjective evaluation shows no statistically significant preference between TRIA and MelodyFlow generations. This is promising given that MelodyFlow is roughly 25× the size, and trained on 2,000× the data. Additionally, both models are strongly preferred to random drum excerpts at significance p≤0.001 according to a two-sided binomial test, indicating that both succeed in capturing the core groove and structure of rhythm prompts.
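As a reminder of how such a test works, the sketch below implements an exact two-sided binomial test against a 50% chance level using only the standard library. The function name `binom_two_sided` and the win counts are hypothetical and are not the study's data; the two-sided rule used (summing all outcomes no more likely than the observed one) is one common convention.

```python
from math import comb


def binom_two_sided(wins, trials, p=0.5):
    """Exact two-sided binomial p-value for `wins` successes in `trials`.

    Sums the probabilities of every outcome that is no more likely than the
    observed one under the null hypothesis of win probability `p`.
    """
    pmf = [
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[wins]
    # Small relative slack so that ties survive floating-point rounding.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-7)))


# Hypothetical counts, for illustration only.
p_strong = binom_two_sided(170, 200)  # a model beats the low anchor 85% of the time
p_even = binom_two_sided(5, 10)       # an even split carries no evidence
p_small = binom_two_sided(9, 10)      # few trials give weaker evidence
win_rate = 170 / 200
```

A win rate near 50% between two systems yields a large p-value (no detectable preference), while a lopsided win rate over many trials yields a very small one.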
The results of our transcription evaluation, presented in Table 1, show that TRIA strongly outperforms MelodyFlow in preserving the rhythmic structure of beatbox sound gestures as indicated by correspondence of kick and snare placement in the rhythm prompt and generated audio. While increasing the target flow step improves MelodyFlow’s rhythm adherence slightly, it still significantly underperforms all evaluated TRIA variants. These results demonstrate the strength of TRIA’s dualized rhythm feature representation, which outperforms both a 1-band representation and a non-adaptive 2-band representation that naively splits the mel spectrogram along its center frequency. Adaptive 3- and 4-band rhythm feature representations yield diminishing returns as they slightly increase the accuracy of kick placement, but do not have a meaningful effect on snare placement. This indicates that a dualized representation may be sufficient to capture the core rhythmic structure of many sound gestures, while single-voice representations are likely insufficient.

Figure 4. Given a rhythm prompt (top) with vocal kick and snare drum imitations, the “snare” sound can be replaced by user-provided samples via TRIA’s timbre prompting ability: (a) a bongo drum, (b) wood cracks, and (c) a noise burst. Given corresponding timbre prompts in text form, MelodyFlow adheres more closely to the spectral content of the rhythm prompt.
The results of our timbre evaluation, presented in Table 2, show that TRIA generations exhibit lower spectral correlation with the rhythm prompt than random anchors, and higher correlation with the timbre prompt than random anchors, indicating both a lack of timbre leakage from the rhythm prompt and strong adherence to the timbre prompt. In contrast, MelodyFlow generations exhibit higher-than-random spectral correlation with the rhythm prompt, indicating timbre leakage. We provide examples illustrating these behaviors in Figure 4: MelodyFlow often mimics the spectral structure of rhythm prompts, while TRIA effectively utilizes a diverse array of timbre prompts to determine spectral structure. This audio-prompted timbre mapping is a key advantage of TRIA over text-prompted systems, allowing for more specific exemplar-based steering of generations. Finally, as shown in Table 3, our dualized rhythm features outperform both 1-band and non-adaptive 2-band features in producing realistic drum audio.
Overall, these results show the promise of our proposed approach even in small model and data regimes. Directions for future work include scaling the model and dataset; leveraging TRIA’s existing capabilities for other inference paradigms such as inpainting and drums-to-drums conversion; and exploring learnable dualized rhythm features.
6. ACKNOWLEDGEMENTS
This work was supported by NSF Award Number 2222369. We would also like to thank André Carvalho dos Santos and Gaël Le Lan for productive discussions.
7. ETHICS STATEMENT
In this section we acknowledge (1) the broader ethical implications of generative music models in the context of our work, (2) the ethical implications of using crowdworkers to perform our subjective evaluation, and (3) our positionality as authors of this work.
7.1 Broader Ethical Implications
A recent work on the ethical implications of generative audio models [45] identified a set of potential harms specific to generative music models: (1) loss of agency and authorship, (2) stifling of creativity, (3) predominance of western bias, (4) cultural appropriation, (5) copyright infringement, and (6) climate impact of these models; we address this work with regard to each of these six harms.
This work is intended to provide creators with the ability to turn any sound gesture into a drum beat with their desired timbre. We see this as a means to provide music creators with additional agency; however, we do acknowledge that there is the (1) potential for removing agency or (2) stifling the creativity of percussion composition and production.
We recognize that our work is trained on a small dataset of drums and thus performs best with timbres present in that dataset, and so (3) may perform poorly with out-of-domain timbres such as traditional eastern music percussion instruments. This is a limitation of the dataset and current iteration of TRIA but not the proposed method itself, as future work could train TRIA on non-western drum beats to overcome this limitation.
In its current iteration, we do not believe there is a strong potential for (4) cultural appropriation with TRIA; however, if someone were to re-train TRIA on a dataset of percussion from a culture to which they do not belong, it would enable that act. In regard to (5) potential for copyright infringement, TRIA was trained on MusDB18-HQ [1], which is licensed for any educational purposes. If TRIA were to be used for commercial purposes, it would require re-training on proprietary datasets or otherwise non-copyrighted work in order to protect the copyright holders of these tracks, though we are not proposing this work be used for commercial purposes.
Finally, we acknowledge (6) all generative models have an environmental impact; for transparency as encouraged by [46], we documented our computational resources used for training, training time, and number of parameters, which in all cases are far less than needed for competing models such as MelodyFlow. Based on our 4×NVIDIA A10G GPUs (150W) and 27 hours of training time, we estimate each training run has an energy cost of 16.2 kWh. For comparison, MelodyFlow was trained on 8×H100 96GB GPUs (350W), with no reported training time. If we assume conservatively an equal training time of 27 hours, then one MelodyFlow training run would cost at least 280 kWh, or at a minimum 17× the energy consumption of TRIA.
7.2 Crowdworkers
Our subjective evaluation utilizing human listeners was approved (and determined to be exempt) by the Institutional Review Board at the host university of the first author. We also ensured that each evaluator was paid a fair wage with an estimated hourly pay of $18.75, which is above the minimum wage of every city in the United States. We also paid those who failed the listening test and thus could not partake in our study $0.50 for their time. We used crowdworkers for this evaluation, and acknowledge that ethical use of crowdworkers goes beyond fair pay [47]; we tested the study among the author team prior to launch to ensure there would be no burden to workers beyond potential boredom and made sure the evaluators knew they could stop the study at any time.
7.3 Positionality
Finally, we would like to address the positionality of the
authors. This is a diverse team of researchers, though we
are predominantly from western developed countries (with
one author being from the Global South). We are all both
musicians and AI researchers, and thus share a mentality
that AI technologies used for generative music can have
a net positive impact as long as they are tools used to
empower and assist musicians and creators rather than replace
them. We acknowledge a bias in the conduct of this work
reflecting an overall positive attitude towards AI
technologies in this regard, and recognize that this is not a
universal belief.
Ultimately, we believe that the benefits of this work far
outweigh these potential risks, and we took care to keep
them in mind as we conducted this research.
8. REFERENCES
[1] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis,
and R. Bittner, “The MUSDB18 corpus for music
separation,” Dec. 2017. [Online]. Available:
https://doi.org/10.5281/zenodo.1117372
[2] G. L. Lan, B. Shi, Z. Ni, S. Srinivasan, A. Kumar,
B. Ellis, D. Kant, V. Nagaraja, E. Chang, W.-N. Hsu,
Y. Shi, and V. Chandra, “High fidelity text-guided
music editing via single-stage flow matching,” arXiv
preprint arXiv:2407.03648, 2024.
[3] J. Gillick, A. Roberts, J. Engel, D. Eck, and D. Bamman,
“Learning to groove with inverse sequence
transformations,” in ICML, 2019.
[4] A. Caillon and P. Esling, “Rave: A variational
autoencoder for fast and high-quality neural audio synthesis,”
arXiv preprint arXiv:2111.05011, 2021.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[5] A. C. Santos and A. Cardoso, “From taps to drums:
Audio-to-audio percussion style transfer,” in Extended
Abstracts for the Late-Breaking Demo Session of the
25th Int. Society for Music Information Retrieval Conf.,
2023.
[6] J. Engel, L. Hantrakul, C. Gu, and A. Roberts,
“Ddsp: Differentiable digital signal processing,” arXiv
preprint arXiv:2001.04643, 2020.
[7] N. Demerlé, P. Esling, G. Doras, and D. Genova,
“Combining audio control and style transfer using
latent diffusion,” in Proceedings of the 25th Int.
Society for Music Information Retrieval Conf., 2024.
[8] A. Ramires, R. Penha, and M. E. P. Davies, “User
specific adaptation in automatic transcription of vocalised
percussion,” arXiv preprint arXiv:1811.02406, 2018.
[9] Vochlea, “Dubler 2.” [Online]. Available:
https://vochlea.com/products/dubler2
[10] O. Lartillot and F. Bruford, “Bistate reduction and
comparison of drum patterns,” in Proceedings of the
21st International Society for Music Information
Retrieval Conference. ISMIR, 2020.
[11] B. Haki, B. Kotowski, C. Lee, and S. Jorda,
“Taptamdrum: A dataset for dualized drum patterns,” in
Proceedings of the 24th International Society for Music
Information Retrieval Conference. ISMIR, 2023.
[12] B. Kotowski, “Dualization of rhythm patterns,”
Master’s thesis, Universitat Pompeu Fabra, 2020.
[13] I.-C. Wei, C.-W. Wu, and L. Su, “Generating structured
drum patterns using variational autoencoder and
self-similarity matrix,” in International Society for Music
Information Retrieval Conference (ISMIR), 2019.
[14] D. P. W. Ellis and J. Arroyo, “Eigenrhythms: Drum
pattern basis sets for classification and generation,” in
International Society for Music Information Retrieval
Conference (ISMIR), 2004.
[15] D. Gómez-Marín, S. Jordà, and P. Herrera, “Network
representations of drum sequences for classification
and generation,” Frontiers in Computer Science, vol. 6,
2024.
[16] G. Alain, M. Chevalier-Boisvert, F. Osterrath, and
R. Piché-Taillefer, “Deepdrummer: Generating drum
loops using deep learning and a human in the loop,”
arXiv preprint arXiv:2008.04391, 2020.
[17] S. Lattner and M. Grachten, “Drumnet: High-level
control of drum track generation using learned patterns
of rhythmic interaction,” in Proc. IEEE Workshop on
Applications of Signal Processing to Audio and
Acoustics (WASPAA), 2019.
[18] J. Nistal, S. Lattner, and G. Richard, “Drumgan:
Synthesis of drum sounds with timbral feature conditioning
using generative adversarial networks,” in International
Society for Music Information Retrieval Conference
(ISMIR), 2020.
[19] A. Lavault, A. Roebel, and M. Voiry, “Stylewavegan:
Style-based synthesis of drum sounds with extensive
controls using generative adversarial networks,” in
Proceedings of the Sound and Music Computing
Conference (SMC), 2022.
[20] J. Drysdale, M. Tomczak, and J. Hockman, “Style-based
drum synthesis with gan inversion,” in Extended
Abstracts for the Late-Breaking Demo Session of the
22nd International Society for Music Information
Retrieval Conference (ISMIR), 2021.
[21] H. Flores Garcia, P. Seetharaman, R. Kumar, and
B. Pardo, “VampNet: Music generation via masked
acoustic token modeling,” in Conference of the
International Society for Music Information Retrieval
(ISMIR), 2023.
[22] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and
Y. Liu, “Roformer: Enhanced transformer with
rotary position embedding,” Neurocomputing, vol.
568, no. C, Mar. 2024. [Online]. Available:
https://doi.org/10.1016/j.neucom.2023.127063
[23] R. Kumar, P. Seetharaman, I. K. Alejandro Luebs, and
K. Kumar, “High-fidelity audio compression with
improved rvqgan,” in NeurIPS, 2023.
[24] Z. Borsos, M. Sharifi, D. Vincent, E. Kharitonov,
N. Zeghidour, and M. Tagliasacchi, “Soundstorm:
Efficient parallel audio generation,” arXiv preprint
arXiv:2305.09636, 2023.
[25] H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T.
Freeman, “Maskgit: Masked generative image
transformer,” in CVPR, 2022.
[26] J. D. Parker, J. Spijkervet, K. Kosta, F. Yesiler,
B. Kuznetsov, J.-C. Wang, M. Avent, J. Chen, and
D. Le, “Stemgen: A music generation model that
listens,” in ICASSP, 2024.
[27] J. Ho and T. Salimans, “Classifier-free diffusion
guidance,” in NeurIPS 2021 Workshop on Deep Generative
Models and Downstream Applications, 2021.
[28] D. Garibi, O. Patashnik, A. Voynov, H. Averbuch-Elor,
and D. Cohen-Or, “Real image inversion through iterative
noising,” arXiv preprint arXiv:2403.14602, 2024.
[29] A. Delgado, S. McDonald, N. Xu, and M. Sandler,
“A new dataset for amateur vocal percussion analysis,”
in Proceedings of the 14th International Audio Mostly
Conference: A Journey in Sound, ser. AM ’19.
New York, NY, USA: Association for Computing
Machinery, 2019, p. 17–23. [Online]. Available:
https://doi.org/10.1145/3356590.3356844
[30] I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl,
“Moisesdb: A dataset for source separation beyond
4-stems,” in Conference of the International Society for
Music Information Retrieval (ISMIR), 2023.
[31] OpenAI, “OpenAI GPT-4.5 System Card,”
https://openai.com/index/gpt-4-5-system-card/,
February 2025.
[32] S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim,
W. Ping, R. Valle, D. Manocha, and B. Catanzaro,
“Audio flamingo 2: An audio-language model with
long-audio understanding and expert reasoning abilities,”
arXiv preprint arXiv:2503.03983, 2025.
[33] M. Morrison, B. Tang, G. Tan, and B. Pardo,
“Reproducible subjective evaluation,” in ICLR Workshop on
ML Evaluation Standards, April 2022.
[34] M. Cartwright, B. Pardo, G. J. Mysore, and M. Hoffman,
“Fast and easy crowdsourced perceptual audio
evaluation,” in 2016 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2016, pp. 619–623.
[35] M. Cartwright, B. Pardo, and G. J. Mysore,
“Crowdsourced pairwise-comparison for source separation
evaluation,” in 2018 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2018, pp. 606–610.
[36] E. Rumbold, G. Tzanetakis, and B. Pardo,
“Correlations between objective and subjective evaluations of
music source separation,” 2024.
[37] M. Zehren, M. Alunno, and P. Bientinesi,
“High-quality and reproducible automatic drum transcription
from crowdsourced data,” Signals, vol. 4, no. 4,
pp. 768–787, 2023. [Online]. Available:
https://www.mdpi.com/2624-6120/4/4/42
[38] R. Vogl, M. Dorfer, and P. Knees, “Recurrent neural
networks for drum transcription,” in Proceedings of the
17th International Society for Music Information
Retrieval Conference, 2016.
[39] M. Heydari, F. Cwitkowitz, and Z. Duan, “Beatnet:
Crnn and particle filtering for online joint beat
downbeat and meter tracking,” in Proceedings of the 22nd
International Society for Music Information Retrieval
Conference, 2021.
[40] Y. Chung, P. Eu, J. Lee, K. Choi, J. Nam, and
B. S. Chon, “Kad: No more fad! an effective
and efficient evaluation metric for audio generation,”
arXiv:2502.15602, 2025. [Online]. Available:
https://arxiv.org/abs/2502.15602
[41] J. Nistal, S. Lattner, and G. Richard, “Comparing
representations for audio synthesis using generative
adversarial networks,” in 27th European Signal Processing
Conference (EUSIPCO), 2019.
[42] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi,
“Fréchet audio distance: A reference-free metric for
evaluating music enhancement algorithms,” 2019.
[43] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang,
and M. D. Plumbley, “Panns: Large-scale pretrained
audio neural networks for audio pattern recognition,”
IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 2020.
[44] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick,
and S. Dubnov, “Large-scale contrastive language-audio
pretraining with feature fusion and keyword-to-caption
augmentation,” in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
2023.
[45] J. Barnett, “The ethical implications of generative
audio models: A systematic literature review,” in
Proceedings of the 2023 AAAI/ACM Conference on AI,
Ethics, and Society, 2023, pp. 146–161.
[46] A. Holzapfel, A.-K. Kaila, and P. Jääskeläinen, “Green
mir?: Investigating computational cost of recent
music-ai research in ismir,” in International Society
for Music Information Retrieval Conference (ISMIR),
2024.
[47] B. Shmueli, J. Fell, S. Ray, and L.-W. Ku, “Beyond fair
pay: Ethical implications of nlp crowdsourcing,” arXiv
preprint arXiv:2104.10097, 2021.