
Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification

Author: Recep Oguz Araz; Guillem Cortès-Sebastià; Emilio Molina; Joan Serra; Xavier Serra; Yuhki Mitsufuji; Dmitry Bogdanov
Publisher: Zenodo
DOI: 10.5281/zenodo.17706424
Source: https://zenodo.org/records/17706424/files/000046.pdf
ENHANCING NEURAL AUDIO FINGERPRINT ROBUSTNESS TO AUDIO
DEGRADATION FOR MUSIC IDENTIFICATION
R. Oguz Araz1, Guillem Cortès-Sebastià2, Emilio Molina2, Joan Serrà3,
Xavier Serra1, Yuki Mitsufuji3,4, Dmitry Bogdanov1
1 Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
2 BMAT Licensing S.L., Barcelona, Spain
3 Sony AI, 4 Sony Group Corporation
[email protected]
ABSTRACT
Audio fingerprinting (AFP) allows the identification of unknown audio content by extracting compact representations, termed audio fingerprints, that are designed to remain robust against common audio degradations. Neural AFP methods often employ metric learning, where representation quality is influenced by the nature of the supervision and the utilized loss function. However, recent work unrealistically simulates real-life audio degradation during training, resulting in sub-optimal supervision. Additionally, although several modern metric learning approaches have been proposed, current neural AFP methods continue to rely on the NT-Xent loss without exploring the recent advances or classical alternatives. In this work, we propose a series of best practices to enhance the self-supervision by leveraging musical signal properties and realistic room acoustics. We then present the first systematic evaluation of various metric learning approaches in the context of AFP, demonstrating that a self-supervised adaptation of the triplet loss yields superior performance. Our results also reveal that training with multiple positive samples per anchor has critically different effects across loss functions. Our approach is built upon these insights and achieves state-of-the-art performance on both a large, synthetically degraded dataset and a real-world dataset recorded using microphones in diverse music venues.
1. INTRODUCTION
Audio fingerprinting (AFP) methods are used for identifying unknown audio content [1–3]. Identification is achieved by extracting distinctive representations, known as fingerprints, from audio segments, which are matched against a database of reference audio. Possible AFP applications include music identification (MI) [4], duplicate detection [5], and broadcast monitoring [6]. This paper focuses on MI, where scalability is crucial due to large databases, requiring compact fingerprints for efficient retrieval. MI can be performed at both the track-level and segment-level, the latter involving time-aligned matching between query audio and reference tracks.

© R. O. Araz, G. Cortès-Sebastià, E. Molina, J. Serrà, X. Serra, Y. Mitsufuji, and D. Bogdanov. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: R. O. Araz, G. Cortès-Sebastià, E. Molina, J. Serrà, X. Serra, Y. Mitsufuji, and D. Bogdanov, “Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

In MI, query audio can significantly differ from its reference track due to various factors. Intentional transformations, such as pitch shifting and time stretching, can alter a track, and since MI often relies on microphone recordings of playback sounds, the query audio may be degraded due to signal transduction and environmental factors during sound propagation. For accurate identification, fingerprints should be robust enough to be recognized despite such alterations. One existing approach, NowPlaying [7], learns to create robust fingerprints from real recordings of aligned clean-degraded music pairs. However, such datasets are costly to gather and, hence, most methods rely on self-supervision from synthetic audio degradations [8–11].

The quality of the representations obtained by self-supervised learning (SSL) is influenced by various factors, including the nature of the self-supervision signal and the calculation of the loss function. Yet, current neural AFP approaches unrealistically simulate real-life audio degradation during training, providing sub-optimal supervision. Moreover, although recent work in neural AFP generally assumes that the NT-Xent [12] loss outperforms other classical losses on MI, a systematic comparison between these objectives is missing. Besides, several modern contrastive self-supervised losses that build on the NT-Xent loss have shown promise in related audio tasks [13,14], but have not been evaluated in the context of AFP. In addition, some of these loss functions allow the construction of training batches with multiple differently degraded versions of the same clean audio, a strategy that could improve identification but remains unexplored within the scope of AFP.

Our work builds upon NAFP [9], a neural AFP approach that generates real-valued fingerprints, with open-source code and pre-trained weights. We uncover several sub-optimal design choices of NAFP regarding its SSL strategy and treatment of room acoustics. To address these issues, we propose a series of best practices that significantly enhance identification performance. Moreover, we reveal critical implementation problems in NAFP’s evaluation method that skew performance metrics. We revise these problems and conduct evaluations on both synthetically degraded queries and a real-world dataset recorded with microphones across diverse music venues. Next, we assess the effectiveness of different loss functions for AFP, demonstrating that the triplet loss, contrary to common belief, delivers the best performance. We then explore the benefits and drawbacks of increasing the number of positive samples per anchor in a batch, providing practical training guidelines. Building on these improvements, we obtain a state-of-the-art approach, outperforming the publicly available baselines. We share our pre-trained model weights, open-source code, and data (footnote 1).
2. RELATED WORK
2.1 Metric learning
Metric learning aims to create representations in which semantically similar elements are closer to each other than semantically dissimilar ones. The triplet loss [15] is a supervised metric learning function that is proven effective in various retrieval tasks. It uses class labels to form triplets of anchor, positive, and negative samples.

Recent work leverages self-supervised metric learning, with NT-Xent being a particularly successful loss function [12]. More recent work then builds upon NT-Xent: DCL [16] decouples the positive pair’s similarities from the negative pairs’; A&U [17] proposes two qualities that good representations should obtain, and theoretically proves that using NT-Xent with an infinite-size batch optimizes them; KCL [18] re-formulates A&U to require a smaller batch. NT-Xent and its extensions are only defined for a single positive sample per anchor. SupCon [19] (supervised) and MultiPosCon [20] (self-supervised) both extend NT-Xent to multiple positives per anchor, reporting benefits in image-related tasks. Notably, MultiPosCon and NT-Xent are equivalent when each anchor has only one positive. Hence, for simplicity, we treat them synonymously.

The quality of learned representations is influenced by how the batches are constructed, particularly by the number of anchor samples and the number of positives per anchor [15]. However, these aspects remain unexplored in AFP. To address this gap, we perform controlled experiments. Throughout this paper, let N_A denote the number of anchors in a batch and N_PPA the number of positives per anchor. The number of total samples in a batch is then N_B = N_A (1 + N_PPA).
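
As a small illustration of this relation, the sketch below computes N_B for a few (N_A, N_PPA) pairs. The helper name `total_batch_size` is hypothetical and only serves this example; the three pairs are chosen because they share the same total of 1536 samples.

```python
def total_batch_size(n_anchors: int, n_pos_per_anchor: int) -> int:
    """N_B = N_A * (1 + N_PPA): every anchor plus its positives."""
    if n_anchors < 1 or n_pos_per_anchor < 1:
        raise ValueError("need at least one anchor and one positive per anchor")
    return n_anchors * (1 + n_pos_per_anchor)


# Three configurations with the same number of samples per batch.
for n_a, n_ppa in [(768, 1), (512, 2), (384, 3)]:
    print(f"N_A={n_a}, N_PPA={n_ppa} -> N_B={total_batch_size(n_a, n_ppa)}")
```

With a fixed N_B, every extra positive per anchor therefore reduces the number of distinct anchors that fit in the batch.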
2.2 Neural AFP
Neural AFP approaches can be broadly classified into two categories based on the generated fingerprints: binary and real-valued. In this work, we focus on the latter. Relevant approaches include NowPlaying [7], which employs the triplet loss to train a convolutional neural network (CNN), and also CULAF [8], NAFP [9], and ABFP [10], all of which use the NT-Xent loss to train CNNs. Additionally, GraFP [11] trains a graph CNN with the NT-Xent loss.

1 https://github.com/raraz15/neural-music-fp

Model            F_S [Hz]  T_L [s]  T_H [s]  Available
NowPlaying [7]   n/a       n/a      1.000    ✗
CULAF [8]        16k       2.50     2.125    ✗
NAFP [9]         8k        1.00     0.500    ✓
ABFP [10]        16k       0.96     0.096    ✗
GraFP [11]       16k       1.00     0.100    ✓
NMFP (proposed)  8k        1.00     0.500    ✓
Table 1. Real-valued neural AFP models. Columns indicate the sampling rate (F_S), segment duration (T_L), hop duration (T_H), and publicly available code and weights.

Table 1 compares these approaches in terms of two aspects that determine an AFP system’s use case: audio sampling rate and fingerprint generation rate (measured by the hop duration). We do not consider high sampling rates necessary for MI, as query audio is often transmitted over bandwidth-limited channels. Moreover, since humans can typically identify music even under low-pass filtering, a robust system should be capable of doing the same. Likewise, the hop sizes around 100 ms used by GraFP and ABFP substantially increase the time, storage, and computational costs of fingerprint extraction and retrieval, reducing overall scalability. This becomes evident by looking at their reported test databases containing 30-second audio chunks instead of full tracks, selected from the Free Music Archive (FMA) [21]. They contain about 97k (fma_large) and 25k (fma_medium) chunks, corresponding to 28.3 M and 7.3 M fingerprints, respectively. By contrast, NAFP’s test database (and ours, as discussed in Section 3.4) contains over 93k full-length tracks (fma_full), totaling 53.6 M fingerprints. Using a 100 ms hop duration would result in 268 M fingerprints, requiring much longer time for both extraction and retrieval.
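
To make the scaling argument concrete, the following sketch counts the fingerprints produced by sliding a fixed-length segment over a recording. It assumes that only full segments are fingerprinted (no padding at the end), which is an assumption of this example rather than a documented detail of any of the systems; `n_fingerprints` is a hypothetical helper.

```python
def n_fingerprints(duration_ms: int, segment_ms: int, hop_ms: int) -> int:
    """Count full segments of segment_ms taken every hop_ms from one recording."""
    if duration_ms < segment_ms:
        return 0
    return (duration_ms - segment_ms) // hop_ms + 1


# A 30-second chunk with 1 s segments, at the two hop durations discussed above.
per_chunk_500 = n_fingerprints(30_000, 1_000, 500)  # 59
per_chunk_100 = n_fingerprints(30_000, 1_000, 100)  # 291
print(per_chunk_500, per_chunk_100)

# For long recordings the count scales roughly with 1 / hop, so moving from a
# 500 ms hop to a 100 ms hop multiplies the database size by about five.
db_500_millions = 53.6
print(round(db_500_millions * 500 / 100))  # 268
```

Under this assumption, 97k chunks with 291 fingerprints each give roughly 28.2 M fingerprints, which is in line with the figure quoted above.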
3. METHODOLOGY
In this section, we outline NAFP’s methodology and, where applicable, describe our modifications.
3.1 Audio degradations
NAFP authors focus on achieving robustness against three types of audio degradation: additive background noise, room reverberation, and microphone response. We follow their degradation chain, but curate more extensive sets for each type. NAFP uses background noise recordings featuring a mix of random noises and two specific acoustic scenes: subway and pub environments. However, we found this selection to lack sufficient diversity, and instead adopted the TUT Acoustic Scenes 2016 dataset [22], which includes 15 distinct acoustic scenes representing various potential MI use cases. The training and test sets consist of 585 and 195 minutes of recordings, respectively.

For room impulse responses (IRs), we use OpenAIR [23] and AIR [24] datasets, similar to NAFP, but with adjustments considering room acoustics. From OpenAIR, we use all 143 mono and stereo recordings from 28 diverse environments, such as halls and churches, chosen for their long-duration IRs. From AIR, we use the IRs recorded without a dummy head. For the binaural recordings, we include both channels separately, as they capture different reflection patterns, resulting in 60 IR measurements from 6 rooms. Unlike NAFP, we also utilize the MIT-Survey [25] dataset, contributing 270 IRs from various public spaces.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

For microphone IRs, we exclusively use the dataset provided by Franco et al. [26]. It contains measurements of 25 microphones encompassing 38 unique microphone-polar pattern combinations. The IRs were measured at distances of 0.5 m, 1.25 m, and 5 m and at multiple incident angles, from which we chose integer multiples of 60°.

We partition the recordings of each degradation into train and test sets. For background noise, we use the partitions proposed in Mesaros et al. [22]. For IRs, we ensure that the measurements of the same room or microphone are contained in either the training or test set. This consideration does not hold in the publicly available IR partitions of NAFP, which can be verified by comparing the filenames.
3.2 Training
NAFP authors experiment with two variations based on the training optimizer: Adam [27] and LAMB [28]. The authors reported that LAMB requires a much larger batch size to achieve high performance, making training on consumer-grade GPUs impractical. As a result, we adopt the Adam variation as our baseline, whose pre-trained weights are unavailable. Therefore, we train NAFP-Adam ourselves using the official implementation and the provided audio degradation datasets (footnote 2). In all our experiments, we use the same training music as NAFP: 10k audio chunks of 30 s duration, sourced from fma_medium.

NAFP has an audio context of one second, and fingerprints are extracted with a 500 ms hop duration at inference. At training, a ±200 ms random offset is applied between positive and anchor samples to enhance robustness to potential misalignment between query and reference fingerprints. We increase this offset to ±250 ms, corresponding to 50% of the hop duration, which we find more intuitive. As for input features, we follow the same magnitude mel-spectrogram extraction parameters as NAFP but apply a slightly different formulation during the power conversion step. Finally, we scale the features to [−1,1], using the global dynamic range of 80 dB.
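
The exact power conversion used by the authors is not specified here, so the following is only a generic sketch of how magnitudes could be mapped to [−1, 1] over a fixed 80 dB range. The reference level `ref`, the floor `eps`, and the function name are assumptions made for illustration.

```python
import math


def scale_db_to_unit_range(magnitudes, dynamic_range_db=80.0, ref=1.0, eps=1e-10):
    """Map magnitudes to [-1, 1] over a fixed dynamic range below `ref` (assumed)."""
    scaled = []
    for m in magnitudes:
        db = 20.0 * math.log10(max(m, eps) / ref)    # magnitude -> dB relative to ref
        db = min(0.0, max(-dynamic_range_db, db))    # clip to [-range, 0] dB
        scaled.append(2.0 * (db + dynamic_range_db) / dynamic_range_db - 1.0)
    return scaled


# 0 dB maps to +1, -40 dB to 0, and anything at or below -80 dB to -1.
print(scale_db_to_unit_range([1.0, 0.01, 1e-4, 0.0]))
```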
We train all models for 100 epochs with the same architecture, optimizer, learning rate scheduler, SpecAugment [29] implementation, and NT-Xent parameter τ as in NAFP-Adam. However, we train our models using automatic mixed precision. The configuration N_A = 768, N_PPA = 1 requires approximately 15 hours to complete on a single NVIDIA RTX 4090 GPU and 20 CPU cores.

2 https://github.com/mimbres/neural-audio-fp/

3.3 Retrieval
NAFP performs a two-stage retrieval using Faiss [30], a library for efficient large-scale similarity search. First, for each fingerprint in the query sequence, the top 20 approximately most similar segments are retrieved. Then, a candidate sequence is constructed for each unique segment, considering its position in the query sequence, and the average similarity score is calculated between the query and candidate sequences. The original Faiss index type and hyper-parameters are kept unchanged in our work.
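
The two-stage procedure can be sketched as follows. This is a simplified illustration, not the NAFP implementation: it uses an exact inner-product search in plain Python instead of an approximate Faiss index, treats the database as one flat list of fingerprints, and all function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_fp, database, k):
    """Indices of the k most similar database fingerprints (exact search here)."""
    order = sorted(range(len(database)),
                   key=lambda j: dot(query_fp, database[j]), reverse=True)
    return order[:k]


def sequence_match(query_seq, database, k=20):
    """Return (best_start_index, mean_similarity) for a query fingerprint sequence."""
    n = len(query_seq)
    # Stage 1: every query fingerprint proposes candidate sequence starts.
    starts = set()
    for i, q in enumerate(query_seq):
        for j in top_k(q, database, k):
            start = j - i  # align the hit with position i of the query
            if 0 <= start <= len(database) - n:
                starts.add(start)
    # Stage 2: score each candidate sequence by its average similarity.
    best = (None, float("-inf"))
    for start in sorted(starts):
        score = sum(dot(q, database[start + i]) for i, q in enumerate(query_seq)) / n
        if score > best[1]:
            best = (start, score)
    return best


# Toy database of six one-hot "fingerprints"; the query is the slice [2:5].
database = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
            [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
print(sequence_match(database[2:5], database, k=2))
```

A real system would additionally map each database index back to a track and a time offset, and would keep candidate sequences from crossing track boundaries.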
3.4 Evaluation
NAFP measures identification performance using the Top-1 hit rate metric, which we also employ. However, whereas NAFP evaluates only segment-level identification (focusing on exact and near matches within one hop duration), we evaluate both track- and segment-level identification.

We discovered that, in NAFP’s evaluation, some query tracks were represented up to 11 times, while others were not represented at all, skewing the results. To address this, we use 30-second audio chunks as queries and select 6 equally spaced indices within the chunk. Starting from each index, we query fingerprint sequences of lengths 1, 3, 9, and 19, corresponding to 1, 2, 5, and 10 seconds of audio, respectively. This method efficiently utilizes the chunks and ensures that each track contributes uniformly to the metric. Additionally, NAFP’s fingerprint storage implementation simply concatenates fingerprints from all query chunks, disregarding track boundaries. As a result, 12.7% of the query sequences contain fingerprints from two tracks. In our implementation, we ensure that each query sequence is contained within a single track.
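
The correspondence between sequence length and audio duration follows from the one-second segment and the 500 ms hop: n fingerprints span 1 + 0.5 (n − 1) seconds. The sketch below checks this and shows one possible way to pick 6 equally spaced start indices; the spacing rule (leaving room for the longest sequence inside a 59-fingerprint chunk) is an assumption of this example, not a documented detail.

```python
def sequence_duration_ms(n_fingerprints, segment_ms=1000, hop_ms=500):
    """Audio covered by n consecutive fingerprints, in milliseconds."""
    return segment_ms + (n_fingerprints - 1) * hop_ms


def equally_spaced_starts(n_total, max_len, n_starts=6):
    """Start indices spread evenly so that a sequence of max_len always fits."""
    last = n_total - max_len
    return [round(i * last / (n_starts - 1)) for i in range(n_starts)]


print([sequence_duration_ms(n) for n in (1, 3, 9, 19)])  # [1000, 2000, 5000, 10000]
print(equally_spaced_starts(n_total=59, max_len=19))     # [0, 8, 16, 24, 32, 40]
```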
Various approaches, including NAFP, perform evaluations on synthetically degraded queries [1,2,31–33], while only a few consider real microphone recordings [7,34]. We evaluate our models on both types of data.
Synthetically degraded queries: We build our synthetic evaluation on NAFP’s publicly available test set, which is a subset of fma_full. However, it is known that FMA contains duplicates, so we perform duplicate removal. Their test database contains 93,358 full-length tracks. We noted that 1,205 unique tracks were not included, which we included in our test database, resulting in 95,163 total tracks, distinct from the training tracks.
For the query set, NAFP uses audio chunks of 30-second duration from 500 tracks, which is insufficient for a comprehensive evaluation. A larger set increases statistical power, offers better insight into robustness across various degradations, and reduces potential biases. Therefore, we additionally incorporate 9,500 random tracks from our test database, bringing the total to 10,000 query tracks. We then degrade each track from start to end by sequentially applying additive background noise (using a random SNR sampled uniformly from [0,10] dB), followed by convolution with room and microphone IRs (interleaved with random gain, using full IR durations). From the resulting audio, we randomly sample 30-second chunks to be used as queries. We make this dataset publicly available.
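
The degradation chain described above can be sketched in plain Python as below. This is a toy illustration under stated assumptions: signals are short lists of floats, the SNR is defined on RMS levels, the random gains interleaved between the convolutions are omitted, and all function names are hypothetical.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def add_noise_at_snr(signal, noise, snr_db):
    """Scale the noise so the signal-to-noise RMS ratio equals snr_db, then add it."""
    gain = rms(signal) / (rms(noise) * 10 ** (snr_db / 20.0))
    return [s + gain * n for s, n in zip(signal, noise)]


def convolve(x, h):
    """Full linear convolution (direct form, fine for a toy example)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def degrade(signal, noise, room_ir, mic_ir, rng):
    """Noise at a random SNR in [0, 10] dB, then room IR, then microphone IR."""
    snr_db = rng.uniform(0.0, 10.0)
    noisy = add_noise_at_snr(signal, noise, snr_db)
    reverberant = convolve(noisy, room_ir)
    recorded = convolve(reverberant, mic_ir)
    return recorded[:len(signal)]  # keep the original length


signal = [math.sin(0.1 * i) for i in range(200)]
noise = [math.cos(0.37 * i) for i in range(200)]
degraded = degrade(signal, noise, room_ir=[1.0, 0.0, 0.3], mic_ir=[0.9, 0.1],
                   rng=random.Random(0))
print(len(degraded))
```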
It is worth emphasizing that our evaluation setup uses significantly more queries than those used in NAFP, GraFP, and ABFP, and our database is substantially larger than those used in GraFP and ABFP. Therefore, our evaluation provides a more comprehensive and realistic assessment of scalability than prior work. We argue that scalability claims should be evaluated under similar conditions.
Real-world data: We conduct a real-world evaluation in collaboration with BMAT Licensing S.L. using a dataset of microphone recordings captured using smartphones and digital recorders in various settings, including bars, nightclubs, and concert halls. The ground truth was established by a fingerprinting system against a large database, with results verified by human annotators. All reference tracks have at least one match, totaling 3,692 query-reference pairs. Although the dataset is not large enough to test scalability, it serves well for testing robustness in real environments.
4. NEURAL MUSIC FINGERPRINTING
We propose a new approach for training neural AFP models that focuses on musical signal properties and real-life audio degradation. First, to establish a baseline, we benchmark the NAFP-Adam model on track-level identification. Next, we incrementally apply a series of best practices to this baseline to facilitate the learning process, validating the contribution of each. Finally, we explore a range of metric learning approaches to further enhance performance. We refer to our method as NMFP (Neural Music Fingerprinting) to highlight its focus on MI.
4.1 Best practices for MI
In this section, we improve the self-supervision during training by more closely simulating real-life audio degradation, eliminating faulty learning signals, and providing the model with additional cues. Altogether, our practices improve the Top-1 hit rate for 1-second queries by 8.3% and 20.9% (absolute) on the synthetic and real-world datasets, respectively (Table 2). Notably, our practices add negligible training overhead while substantially raising performance, making them our recommended best practices for training.

Appropriate audio degradation datasets: Self-supervision comes from the objective of embedding the clean audio and its synthetically degraded version close to each other. By exposing the model to degradations that closely resemble those encountered in real-world scenarios, we encourage structuring the learned representations accordingly. Therefore, the choice of degradation data is critical; it should reflect realistic use cases. Training with our degradation datasets as opposed to those of NAFP increases performance on both datasets (Table 2, rows 1–2).

Eliminating false negatives: The objective of the NT-Xent loss is to increase the similarity between an anchor and its positive, while decreasing the similarity with all its negatives in the batch. However, in NAFP, a batch can contain multiple anchor samples from the same track with an 18% probability (64 segments chosen from 590,000 segments across 10,000 tracks with 59 segments each). Belonging to the same track, these samples share various musical properties. Yet, the loss function treats them as negative pairs, resulting in a faulty learning signal. In our implementation, we eliminate false negatives by ensuring that each batch contains one anchor sample per track. This increases fingerprint distinctiveness (Table 2, rows 2–3).
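
The 18% figure can be reproduced approximately with a birthday-problem style calculation. The sketch assumes that the 64 anchor segments are drawn uniformly without replacement from all segments, which may differ from NAFP's actual sampler; `sample_one_anchor_per_track` illustrates one simple way to avoid the issue and is not the authors' code.

```python
import random


def prob_shared_track(n_anchors=64, n_tracks=10_000, segs_per_track=59):
    """P(at least two of n_anchors uniformly drawn segments share a track)."""
    total = n_tracks * segs_per_track
    p_all_distinct = 1.0
    for i in range(n_anchors):
        # i tracks are already used; none of their segments may be drawn next.
        p_all_distinct *= (total - i * segs_per_track) / (total - i)
    return 1.0 - p_all_distinct


def sample_one_anchor_per_track(n_anchors, n_tracks, segs_per_track, rng):
    """Pick n_anchors distinct tracks, then one random segment in each."""
    tracks = rng.sample(range(n_tracks), n_anchors)
    return [(track, rng.randrange(segs_per_track)) for track in tracks]


print(f"{prob_shared_track():.3f}")  # close to 0.18
batch = sample_one_anchor_per_track(64, 10_000, 59, random.Random(0))
print(len({track for track, _ in batch}))  # 64 distinct tracks
```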
Full IRs: During training, NAFP truncates all IRs to 75 ms. For a room IR, this duration only includes the early reflections [35]. We aim for more realistic degradation so that our models can learn to generate robust fingerprints against real-life reverberation. Therefore, we use the full duration of IRs, which can go up to several seconds for large rooms [25]. However, in practice, we truncate IRs to the model’s context length, as the longer part will not contribute to the current segment. The resulting model exhibits improved robustness (Table 2, rows 3–4).
Past reverberation degradation: Query fingerprints extracted from microphone recordings contain reverberation, including the tails of past sound events. Previous audio degradation methods often overlook this aspect, convolving an audio segment with a room IR as if the sound starts abruptly, without acoustic history. Misrepresenting reverberation can result in learning unrealistic representations, especially in such fine-grained applications. Hence, we convolve audio segments starting from their past context and discard the past after convolution, yielding a segment that contains the reverberation of current and past events. Matching the model’s context, we use a one-second past context duration. This increases robustness considerably (Table 2, rows 4–5).
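
A minimal sketch of this idea, assuming a mono signal stored as a list of samples and a direct (non-FFT) convolution for readability. The function names and the toy signal are hypothetical; a real pipeline would work on audio arrays at the model's sampling rate.

```python
def convolve(x, h):
    """Full linear convolution (direct form, fine for a toy example)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def reverberate_with_past(audio, start, seg_len, past_len, room_ir):
    """Reverberate audio[start:start+seg_len], keeping tails of the preceding audio."""
    ctx_start = max(0, start - past_len)
    context = audio[ctx_start:start + seg_len]  # past context + current segment
    wet = convolve(context, room_ir)
    offset = start - ctx_start                  # past samples actually available
    return wet[offset:offset + seg_len]         # drop the past and the trailing tail


# An impulse just before the segment, and an IR with an echo three samples later.
audio = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
room_ir = [1.0, 0.0, 0.0, 0.5]
with_past = reverberate_with_past(audio, start=4, seg_len=4, past_len=4, room_ir=room_ir)
without_past = reverberate_with_past(audio, start=4, seg_len=4, past_len=0, room_ir=room_ir)
print(with_past)     # the echo of the past impulse falls inside the segment
print(without_past)  # all zeros: the acoustic history is lost
```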
Lower frequencies: The 300 Hz lower frequency cut-off applied in NAFP’s feature extraction discards valuable information that can provide additional musical cues. In music venues such as concert halls and festival areas, at long distances from the speakers, the bass frequencies will form the majority of the sounds surviving the background noise from the crowd. Since most microphones, including smartphone microphones, can capture lower frequencies, extending the frequency range can provide benefits across different recording devices. We tested multiple values and selected a 160 Hz bound. The resulting model achieves further improved performance (Table 2, rows 5–6).
4.2 Exploring metric learning
Having improved the self-supervision, we now explore different metric learning methods to further enhance the identification performance. This exploration includes comparing several loss functions, investigating the effect of training with different numbers of anchors and positives per anchor, and tuning loss function hyper-parameters.

Top-1 hit rate (%); "Syn." = synthetic data, "Real" = real-world data
Row  Approach  FN  T_IR   Past  F_LC    Syn. 1 s  Syn. 2 s  Syn. 5 s  Syn. 10 s  Real 1 s  Real 2 s  Real 5 s  Real 10 s
1    Baseline  ✗   75 ms  ✗     300 Hz  72.2      82.8      89.7      92.3       33.6      49.9      65.4      71.7
2    Ours      ✗   75 ms  ✗     300 Hz  74.4      83.8      89.9      92.1       39.4      56.2      70.5      76.1
3    Ours      ✓   75 ms  ✗     300 Hz  76.7      85.8      91.6      93.6       45.5      64.8      77.3      81.9
4    Ours      ✓   1 s    ✗     300 Hz  77.9      86.7      92.0      93.8       49.3      66.7      78.5      82.6
5    Ours      ✓   1 s    ✓     300 Hz  80.1      87.9      92.5      94.2       50.6      68.1      79.2      83.0
6    Ours      ✓   1 s    ✓     160 Hz  80.5      88.0      92.6      94.2       54.5      70.7      81.0      84.3
Table 2. Improvements over NAFP in track-level identification using our test set. ‘FN’ indicates if the false negatives issue in the batches is corrected, T_IR denotes the impulse response truncation duration, ‘Past’ indicates whether the acoustic history is applied during reverberation, and F_LC specifies the lower cut-off frequency during feature extraction.

Top-1 hit rate (%); "Syn." = synthetic data, "Real" = real-world data
Loss     Syn. 1 s  Syn. 2 s  Real 1 s  Real 2 s
NT-Xent  84.1      90.1      58.7      72.3
DCL      83.0      89.6      54.5      68.7
A&U      76.4      86.2      48.0      66.3
KCL      38.8      60.6      18.2      41.1
Triplet  86.4      91.6      63.4      75.1
Table 3. Loss function comparison on track-level identification using N_A = 768 and N_PPA = 1.

Top-1 hit rate (%); "Syn." = synthetic data, "Real" = real-world data
Loss     N_A  N_PPA  Syn. 1 s  Syn. 2 s  Real 1 s  Real 2 s
NT-Xent  64   1      80.5      88.0      54.5      70.7
NT-Xent  512  1      84.1      90.2      58.2      71.8
NT-Xent  768  1      84.1      90.1      58.7      72.3
NT-Xent  512  2      76.9      85.5      46.0      63.4
NT-Xent  384  3      74.6      83.7      44.2      62.0
Triplet  64   1      82.8      89.4      57.4      72.1
Triplet  512  1      86.1      91.4      62.5      74.7
Triplet  768  1      86.4      91.6      63.4      75.1
Triplet  512  2      86.6      91.7      63.9      75.0
Triplet  384  3      86.1      91.2      63.0      74.6
Table 4. Effect of N_A and N_PPA for NT-Xent and triplet losses.

Loss function comparison: Most neural AFP models rely on NT-Xent without comparison with other losses under consistent settings. Here, we systematically compare multiple losses, searching for additional benefits. Specifically, we consider the triplet, NT-Xent, DCL, KCL, and A&U losses. For DCL, we use the same τ parameter as NT-Xent. For A&U and KCL, we take the default parameters in the respective publications. For the triplet loss, we employ hard positive and semi-hard negative mining, computing the loss function using the squared Euclidean distance with a margin of α = 0.5. In this experiment, for a fair comparison, we use N_A = 768 for all losses and use one positive per anchor for the triplet loss (N_PPA = 1), since the remaining losses are only defined for this setting.
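
For readers unfamiliar with this mining scheme, the sketch below shows one common reading of it: for each anchor, take the farthest positive (hard positive) and the closest negative that is farther than that positive but still within the margin (semi-hard negative). Treating every sample in the batch as an anchor and skipping anchors without a semi-hard negative are assumptions of this example; the paper's implementation may differ.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(embeddings, labels, margin=0.5):
    """Mean triplet loss with hard-positive / semi-hard-negative mining (sketch)."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos = [sq_dist(anchor, e) for j, (e, l) in enumerate(zip(embeddings, labels))
               if l == label and j != i]
        neg = [sq_dist(anchor, e) for e, l in zip(embeddings, labels) if l != label]
        if not pos or not neg:
            continue
        d_ap = max(pos)  # hard positive: the farthest sample with the same label
        semi_hard = [d for d in neg if d_ap < d < d_ap + margin]
        if semi_hard:  # assumption: anchors without a semi-hard negative are skipped
            d_an = min(semi_hard)
            losses.append(d_ap - d_an + margin)
    return sum(losses) / len(losses) if losses else 0.0


# Samples sharing a label play the role of a clean segment and its degraded versions.
embeddings = [[0.0, 0.0], [0.5, 0.0], [0.7, 0.0], [5.0, 0.0]]
labels = [0, 0, 1, 2]
print(triplet_loss(embeddings, labels, margin=0.5))
```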
Table 3 reports the results, where the triplet loss outperforms all other loss functions. Compared to its closest competitor, NT-Xent, it scores 2.3% and 4.7% higher for 1-second queries on the synthetic and real-world data, respectively. Notably, the NT-Xent loss outperforms its extensions: DCL, A&U, and KCL. While we find the decoupling idea of DCL intuitive, our results do not show an improvement. Based on these results, we retain only the triplet and NT-Xent losses for the remainder of our experiments.

Top-1 hit rate (%); "Syn." = synthetic data, "Real" = real-world data
Loss         Value  Syn. 1 s  Syn. 2 s  Real 1 s  Real 2 s
NT-Xent (τ)  0.01   83.2      89.8      61.2      74.3
NT-Xent (τ)  0.02   83.8      90.0      61.8      74.6
NT-Xent (τ)  0.05   84.1      90.1      58.7      72.3
NT-Xent (τ)  0.07   81.8      88.8      53.8      69.2
Triplet (α)  0.3    85.9      91.2      63.8      74.9
Triplet (α)  0.5    86.6      91.7      63.9      75.0
Triplet (α)  0.7    86.7      91.6      63.3      74.7
Table 5. Hyper-parameter tuning results on track-level identification for NT-Xent (N_A = 768, N_PPA = 1) and triplet (N_A = 512, N_PPA = 2) losses.

Increasing the number of anchors: For the NT-Xent loss in Table 4, increasing N_A from 64 to 512 yields a 3.6% improvement on synthetic data (1-second queries), whereas further increasing to 768 causes a saturation. On real-world data, however, increasing N_A from 64 to 768 consistently improves performance. For the triplet loss, training with larger N_A progressively increases performance on both datasets.

Number of anchors vs. positives per anchor: On the one hand, exposing the model to a diverse set of tracks within a batch is beneficial for learning discriminative representations. On the other hand, presenting multiple degraded versions of the same audio segment can help the model learn invariance to real-world distortions. However, due to the GPU memory constraint, increasing the number of positives per anchor reduces the number of anchors that can fit in a batch, creating a trade-off between diversity and invariance. To create multiple positives for an anchor, we randomly shift the anchor independently and use a different combination of degradations.

In Table 4, when the number of total samples in a batch is set to 1536, switching from N_PPA = 1 to N_PPA = 2 or N_PPA = 3 significantly degrades NT-Xent’s performance on both datasets. Moreover, when N_A = 512, using N_PPA = 2 performs significantly worse than N_PPA = 1 on both datasets. These two observations show that, for NT-Xent, using more than one positive per anchor is detrimental, which is not caused by the reduced number of anchors. Indeed, training with N_A = 64 and N_PPA = 1 (roughly ten times smaller batch size) performs significantly better on both datasets. Therefore, we conclude that N_PPA = 1 is the best setting for NT-Xent. For the triplet loss, Table 4 shows that training with N_PPA = 2 yields the best performance and, unlike NT-Xent, increasing N_PPA does not deteriorate the performance significantly. Notably, training with N_PPA = 3 and N_PPA = 1 performs comparably, even though the musical variety in a batch is half the amount.

Top-1 hit rate (%); "Syn." = synthetic data, "Real" = real-world data
Model              Syn. 1 s  Syn. 2 s  Syn. 5 s  Syn. 10 s  Real 1 s  Real 2 s  Real 5 s  Real 10 s
NAFP-Adam          72.2      82.8      89.7      92.3       33.6      49.9      65.4      71.7
NAFP-LAMB [9]      73.8      83.9      90.4      92.7       42.0      58.6      71.4      76.3
GraFP-500 ms       17.0      34.8      52.1      63.6       20.0      42.5      62.0      69.4
GraFP-100 ms [11]  19.7      53.6      72.3      80.0       46.6      60.8      74.8      82.5
NMFP (proposed)    86.6      91.7      94.5      95.6       63.9      75.0      82.0      84.6
Table 6. Final comparison on track-level identification.

Hyper-parameter tuning: In Table 4, we have seen that the triplet loss outperforms NT-Xent in all N_PPA-N_A combinations. To gain insight into the effect of hyper-parameters, we experiment with different τ (NT-Xent) and α (triplet) values. The results are given in Table 5, where the triplet loss again outperforms NT-Xent by a significant amount on both datasets. Notably, NT-Xent’s performance is improved on real-world data by using a smaller τ parameter. Based on these results, we choose the triplet loss with N_A = 512, N_PPA = 2, and α = 0.5 for our NMFP model.
5. RESULTS
Here, we compare NMFP with NAFP-Adam (baseline method) and the only state-of-the-art models with publicly available weights: NAFP-LAMB [9] and GraFP [11]. NMFP and NAFP models are trained at 8 kHz sampling rate, while GraFP was trained at 16 kHz, hence requiring upsampling. Additionally, GraFP uses a 100 ms hop, whereas our models operate with 500 ms. We consider GraFP with each hop duration.

Table 6 shows that NMFP substantially outperforms both NAFP and GraFP on track-level identification. In particular, for 1-second queries, NMFP improves on its baseline (NAFP-Adam) by 14.4% (absolute) on synthetic data and 30.3% on real-world data, and it surpasses the official NAFP-LAMB by 12.8% and 21.9%, respectively. Against GraFP, NMFP scores 69.6% higher on synthetic data and 43.9% on real-world data when using a 500 ms hop. With a 100 ms hop, it still outperforms GraFP by 66.9% on synthetic data and 17.3% on real-world data. This drastic difference could be due to the applied upsampling, which cannot create the higher frequencies that GraFP likely depends on.
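
The differences quoted above can be checked directly against the 1-second columns of Table 6; they are absolute differences in Top-1 hit rate. The snippet below recomputes them from the table values (the dictionary layout and `gain_over` are just for this illustration).

```python
# Top-1 hit rates (%) for 1-second queries, copied from Table 6.
hit_rate_1s = {
    "NAFP-Adam":    {"synthetic": 72.2, "real": 33.6},
    "NAFP-LAMB":    {"synthetic": 73.8, "real": 42.0},
    "GraFP-500 ms": {"synthetic": 17.0, "real": 20.0},
    "GraFP-100 ms": {"synthetic": 19.7, "real": 46.6},
    "NMFP":         {"synthetic": 86.6, "real": 63.9},
}


def gain_over(baseline, proposed="NMFP"):
    """Absolute hit-rate difference (percentage points) of `proposed` over `baseline`."""
    return {
        data: round(hit_rate_1s[proposed][data] - hit_rate_1s[baseline][data], 1)
        for data in ("synthetic", "real")
    }


for model in ("NAFP-Adam", "NAFP-LAMB", "GraFP-500 ms", "GraFP-100 ms"):
    print(model, gain_over(model))
```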
In Table 7, we compare NMFP against NAFP on segment-level identification on the synthetic data. We exclude GraFP from this comparison, as its authors do not consider segment-level identification, and due to their peak picking methods’ interaction with silent segments. For 1-second queries, NMFP outperforms its baseline, NAFP-Adam, by 13.3% in exact matches and 14.4% in near matches. NMFP also outperforms NAFP-LAMB by 8.7% in exact matches and 12.9% in near matches. Together, the results in Table 6 and Table 7 demonstrate that our model, NMFP, sets the state-of-the-art on both track- and segment-level MI.

Top-1 hit rate (%)
Model            Match  1 s   2 s   5 s   10 s
NAFP-Adam        exact  50.5  64.8  74.1  78.2
NAFP-Adam        near   61.2  74.2  83.4  87.7
NAFP-LAMB [9]    exact  55.1  66.4  74.4  77.9
NAFP-LAMB [9]    near   62.7  74.6  83.3  87.4
NMFP (proposed)  exact  63.8  74.8  82.0  85.0
NMFP (proposed)  near   75.6  83.5  89.2  92.0
Table 7. Segment-level identification results on the synthetic dataset.
6. CONCLUSION
We present a comprehensive framework for enhancing the robustness of neural AFP models against real-world audio degradation. By correcting evaluation flaws in prior work, we establish a more reliable benchmark for future AFP research. Our evaluations, conducted on both a synthetic dataset and a real-world dataset recorded in diverse music venues, show that NMFP significantly outperforms existing neural AFP models with publicly available weights. Specifically, on track-level identification with 1-second queries, it outperforms the official NAFP model by 12.8% on synthetic data and by 21.9% on real-world data (Table 6).

Our success stems from two key areas. First, we show that paying careful attention to musical signal properties and room acoustics enhances performance considerably. Second, by revisiting metric learning, we uncovered several key findings that further improve performance. We discovered that the triplet loss, despite common assumptions, outperforms modern alternatives such as NT-Xent. We also found that the triplet loss does not suffer from the performance saturation seen with NT-Xent at large batch sizes. Finally, we characterized a critical trade-off between the number of anchors and positives per anchor in training batches. Together, these insights form a set of validated, high-impact principles for neural AFP development.
7. ACKNOWLEDGMENTS
This work was supported by the pre-doctoral program AGAUR-FI ajuts (2024 FI-3 00065) Joan Oró, funded by the Secretaria d’Universitats i Recerca of the Departament de Recerca i Universitats of the Generalitat de Catalunya; and by the Cátedras ENIA program “IA y Música: Cátedra en Inteligencia Artificial y Música” (TSI-100929-2023-1), funded by the Secretaría de Estado de Digitalización e Inteligencia Artificial and the European Union – Next Generation EU.
This work was also part of the project TROBA – Technologies for the recognition of musical works in the era of dynamic generation of audio content (ACE014/20/000051), within the call Nuclis d’R+D 2024, with the support of ACCIÓ (Agency for Business Competitiveness, Government of Catalonia).
8. REFERENCES
[1] J. Haitsma, T. Kalker, and J. Oostveen, “Robust audio hashing for content identification,” in Int. Workshop on Content-Based Multimedia Indexing (CBMI), 2001.
[2] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system,” in Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR), 2002.
[3] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A review of algorithms for audio fingerprinting,” in IEEE Workshop on Multimedia Signal Processing (MMSP), 2002.
[4] A. Wang, “The Shazam music recognition service,” Communications of the ACM, vol. 49, no. 8, pp. 44–48, 2006.
[5] C. Burges, J. Platt, and S. Jana, “Distortion discriminant analysis for audio fingerprinting,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 3, pp. 165–174, 2003.
[6] G. Cortès, A. Ciurana, E. Molina, M. Miron, O. Meyers, J. Six, and X. Serra, “BAF: An audio fingerprinting dataset for broadcast monitoring,” in Proc. of the 23rd Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2022.
[7] B. A. y. Arcas, B. Gfeller, R. Guo, K. Kilgour, S. Kumar, J. Lyon, J. Odell, M. Ritter, D. Roblek, M. Sharifi, and M. Velimirović, “Now Playing: Continuous low-power music recognition,” 2017, arXiv:1711.10958 [cs, eess]. [Online]. Available: http://arxiv.org/abs/1711.10958
[8] Z. Yu, X. Du, B. Zhu, and Z. Ma, “Contrastive unsupervised learning for audio fingerprinting,” 2020, arXiv:2010.13540 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2010.13540
[9] S. Chang, D. Lee, J. Park, H. Lim, K. Lee, K. Ko, and Y. Han, “Neural audio fingerprint for high-specific audio retrieval based on contrastive learning,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021.
[10] A. Singh, K. Demuynck, and V. Arora, “Attention-based audio embeddings for query-by-example,” in Proc. of the 23rd Int. Society for Music Information Retrieval Conf. (ISMIR), 2022.
[11] A. Bhattacharjee, S. Singh, and E. Benetos, “GraFPrint: A GNN-based approach for audio identification,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[12] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proc. of the 37th Int. Conf. on Machine Learning (ICML), 2020.
[13] J. Serrà, R. O. Araz, D. Bogdanov, and Y. Mitsufuji, “Supervised contrastive learning from weakly-labeled audio segments for musical version matching,” in Proc.
o he 42nd In . Con . on Machine Lea ning (ICML),
2025.
[14] J. Guino , E. Quin on, and G. Fazekas, “Semi-
supe ised con as i e lea ning o musical ep esen a-
ions,” in P oc. o he 25 h In . Soc. o Music In o ma-
ion Re ie al Con . (ISMIR), 2024.
[15] F. Sch o , D. Kalenichenko, and J. Philbin, “Facene :
A uni ied embedding o ace ecogni ion and clus e -
ing,” in IEEE Con . on Compu e Vision and Pa e n
Recogni ion (CVPR), 2015, pp. 815–823.
[16] C.-H. Yeh, C.-Y. Hong, Y.-C. Hsu, T.-L. Liu, Y. Chen,
and Y. LeCun, “Decoupled con as i e lea ning,” in
Compu e Vision – ECCV, 2022.
[17] T. Wang and P. Isola, “Unde s anding con as i e ep-
esen a ion lea ning h ough alignmen and uni o mi y
on he hype sphe e,” in P oc. o he 37 h In . Con . on
Machine Lea ning (ICML), 2020.
[18] P. Ko omilas, G. Bou i sas, T. Giannakopoulos, M. A.
Nicolaou, and Y. Panagakis, “B idging mini-ba ch and
asymp o ic analysis in con as i e lea ning: F om In-
oNCE o ke nel-based losses,” in P oc. o he 41s In .
Con . on Machine Lea ning (ICML), 2024.
[19] P. Khosla, P. Te e wak, C. Wang, A. Sa na, Y. Tian,
P. Isola, A. Maschino , C. Liu, and D. K ishnan, “Su-
pe ised con as i e lea ning,” in P oc. o he 34 h
In . Con . on Neu al In o ma ion P ocessing Sys ems
(Neu IPS), 2020.
[20] Y. Tian, L. Fan, P. Isola, H. Chang, and D. K ish-
nan, “S ableRep: Syn he ic images om ex - o-image
models make s ong isual ep esen a ion lea ne s,” in
P oc. o he 37 h In . Con . on Neu al In o ma ion P o-
cessing Sys ems (Neu IPS), 2023.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
405
[21] M. De e a d, K. Benzi, P. Vande gheyns , and
X. B esson, “FMA: A da ase o music analysis,” in
P oc. o he 18 h In . Soc. o Music In o ma ion Re-
ie al Con . (ISMIR), 2017.
[22] A. Mesa os, T. Hei ola, and T. Vi anen, “TUT
da abase o acous ic scene classi ica ion and sound
e en de ec ion,” in 24 h Eu opean Signal P ocessing
Con . (EUSIPCO), 2016.
[23] D. T. Mu phy and S. Shelley, “OpenAIR: An In e ac-
i e Au aliza ion Web Resou ce and Da abase,” in Au-
dio Enginee ing Socie y Con en ion 129, 2010.
[24] M. Jeub, M. Schä e , and P. Va y, “A binau al oom im-
pulse esponse da abase o he e alua ion o de e e -
be a ion algo i hms,” in P oc. o In . Con . on Digi al
Signal P ocessing (DSP), 2009.
[25] J. T ae and J. H. McDe mo , “S a is ics o na u al e-
e be a ion enable pe cep ual sepa a ion o sound and
space,” P oc. o he Na ional Academy o Sciences, ol.
113, no. 48, pp. E7856–E7865, 2016.
[26] J. F anco, B. Bˇ
acilˇ
a, T. B ookes, and E. De Sena, “A
mul i-angle, mul i-dis ance da ase o mic ophone im-
pulse esponses,” Jou nal o he Audio Enginee ing So-
cie y, 2022.
[27] D. Kingma and J. Ba, “Adam: A me hod o s ochas ic
op imiza ion,” in P oc. o he 3 d In . Con . o Lea n-
ing Rep esen a ions (ICLR), 2015.
[28] Y. You, J. Li, S. Reddi, J. Hseu, S. Kuma , S. Bhojana-
palli, X. Song, J. Demmel, K. Keu ze , and C.-J. Hsieh,
“La ge ba ch op imiza ion o deep lea ning: T aining
BERT in 76 minu es,” in P oc. o he 8 h In . Con . on
Lea ning Rep esen a ions (ICLR), 2020.
[29] D. S. Pa k, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph,
E. D. Cubuk, and Q. V. Le, “SpecAugmen : A simple
da a augmen a ion me hod o au oma ic speech ecog-
ni ion,” in 20 h Annual Con . o he In . Speech Com-
munica ion Associa ion (INTERSPEECH), 2019.
[30] J. Johnson, M. Douze, and H. Jégou, “Billion-scale
simila i y sea ch wi h GPUs,” IEEE T ansac ions on
Big Da a, ol. 7, no. 3, pp. 535–547, 2019.
[31] R. Sonnlei ne and G. Widme , “Robus quad-based
audio inge p in ing,” IEEE/ACM T ans. on Audio,
Speech, and Language P ocessing, ol. 24, no. 3, pp.
409–421, 2016.
[32] A. Báez-Suá ez, N. Shah, J. A. Nolazco-Flo es, S.-
H. S. Huang, O. Gnawali, and W. Shi, “SAMAF:
Sequence- o-sequence au oencode model o audio
inge p in ing,” ACM T ans. Mul imedia Compu .
Commun. Appl., ol. 16, no. 2, pp. 1–23, 2020.
[33] A. Aga waal, P. Kanaujia, S. S. Roy, and
S. Ghose, “Robus and ligh weigh audio in-
ge p in o au oma ic con en ecogni ion,” 2023,
a Xi :2305.09559 [cs, eess]. [Online]. A ailable:
h p://a xi .o g/abs/2305.09559
[34] M. Ramona and G. Pee e s, “AudioP in : An e icien
audio inge p in sys em based on a no el cos -less syn-
ch oniza ion scheme,” in IEEE In . Con . on Acous ics,
Speech and Signal P ocessing (ICASSP), 2013.
[35] L. L. Be anek, “Conce hall acous ics—1992,” The
Jou nal o he Acous ical Socie y o Ame ica, ol. 92,
no. 1, pp. 1–39, 1992.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
406