
PeakNetFP: Peak-Based Neural Audio Fingerprinting Robust to Extreme Time Stretching

Author: Guillem Cortès-Sebastià; Benjamin Martin; Emilio Molina; Xavier Serra; Romain Hennequin
Publisher: Zenodo
DOI: 10.5281/zenodo.17706375
Source: https://zenodo.org/records/17706375/files/000025.pdf
PEAKNETFP: PEAK-BASED NEURAL AUDIO FINGERPRINTING
ROBUST TO EXTREME TIME STRETCHING
Guillem Cortès-Sebastià¹,³  Benjamin Martin²  Emilio Molina¹
Xavier Serra³  Romain Hennequin²
¹ BMAT Licensing S.L., Barcelona, Spain
² Deezer Research, Paris, France
³ Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
[email protected], [email protected]
ABSTRACT
This work introduces PeakNetFP, the first neural audio fingerprinting (AFP) system
designed specifically around spectral peaks. This novel system is designed to leverage
the sparse spectral coordinates typically computed by traditional peak-based AFP
methods. PeakNetFP performs hierarchical point feature extraction techniques similar
to the computer vision model PointNet++, and is trained using contrastive learning
like in the state-of-the-art deep learning AFP, NeuralFP. This combination allows
PeakNetFP to outperform conventional AFP systems and achieves comparable performance
to NeuralFP when handling challenging time-stretched audio data. In extensive
evaluation, PeakNetFP maintains a Top-1 hit rate of over 90% for stretching factors
ranging from 50% to 200%. Moreover, PeakNetFP offers significant efficiency
advantages: compared to NeuralFP, it has 100 times fewer parameters and uses 11 times
smaller input data. These features make PeakNetFP a lightweight and efficient solution
for AFP tasks where time stretching is involved. Overall, this system represents a
promising direction for future AFP technologies, as it successfully merges the
lightweight nature of peak-based AFP with the adaptability and pattern recognition
capabilities of neural network-based approaches, paving the way for more scalable and
efficient solutions in the field.
1. INTRODUCTION
Audio Fingerprinting (AFP) is the MIR task of identifying audio recordings within a
database of reference tracks. Early AFP systems date back twenty years, with Shazam
[1] and Philips [2] systems. Since then AFP has been extensively studied for various
use cases, such as query-by-example [1], integrity verification [3], content-based
copy detection [4], DJ-set monitoring [5], or high specific audio retrieval [6].
Peak-based AFP systems have a long trajectory in the field and multiple works use
this approach to enhance their robustness to pitch shifting and time stretching [7],
background music identification [8], or to create a lightweight AFP that can run in
embedded systems [9]. These algorithms are based on extracting and linking salient
spectral peaks computed from time-frequency representations. These are mature,
production-ready systems that do not require training, and can scale to industrial
levels in which databases consist of millions of references [10,11]. Thus, companies
with massive data catalogs rely on them for content identification [12].

© G. Cortès-Sebastià, B. Martin, E. Molina, X. Serra, R. Hennequin. Licensed under a
Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution:
G. Cortès-Sebastià, B. Martin, E. Molina, X. Serra, R. Hennequin, “PeakNetFP:
Peak-based Neural Audio Fingerprinting Robust to Extreme Time Stretching”, in Proc. of
the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

Representation learning systems, such as Now Playing [13] or NeuralFP [6], have
recently emerged as novel approaches that leverage Contrastive Learning (CL) and
Convolutional Neural Networks (CNNs) to learn the similarities between a distorted
audio clip and its corresponding reference track. They are designed to perform highly
sensitive audio retrieval, capable of matching short segments, and significantly
outperform traditional peak-based methods under challenging conditions. This is due
to their ability to capture more complex and nuanced features from the data, making
them more robust to various types of distortions and noise that traditional methods
struggle with [6,14].
This comes at the expense of requiring large computational resources, dense input
data, model training, and GPU computing, which might not be suitable for some
applications. In industrial solutions, these requirements may be hard to overcome,
and peak-based features are still considered as a viable alternative [11, 12, 15, 16].
Practically, it is common that audio features have to be computed on a client device,
then uploaded to a server to perform identification against a reference database. In
such conditions, requiring dense spectrograms as audio features significantly
increases the amount of data to be uploaded compared to sending only sparse spectral
peaks. When fingerprint generation is considered to run client-side, it may be more
complex and battery intensive to run inference on trained models as compared to
simple rule-based peak extraction algorithms, especially for query-by-example
applications [15] where client devices are generally variable in specifications
(e.g. smartphones). An alternative setup is fully in-device audio identification
[13], although this generally implies even more restrictive computational
requirements on the device in terms of memory footprint and limits the database size.
Additionally, music copyrights owners are often reluctant to share any dense
representation that could be either inverted or used for other tasks than
fingerprinting, and are more inclined to compute sparse targeted features that carry
less information and can hardly be used for anything else than what they were
designed for (with rare exceptions [17]). Finally, as peak-based AFP has been used
extensively by industrial systems, it is relevant for private companies to leverage
such large datasets of pre-computed spectral peaks for neural audio fingerprinting
approaches [1, 12]. For these reasons, in this work we propose to keep the
traditional peak-based features as input, and use them in a modern neural approach.
In this first publication on a neural sparse peak-based model, we choose to focus our
study on time stretching in extreme conditions, which has been underexplored in the
literature. Time stretching is an audio processing technique that alters the tempo of
a track without changing its pitch. This method is commonly used by DJs to
synchronize the tempo of different songs within a mix or to create remixes that are
either slowed down or sped up [18]. In challenging situations such as mash-ups,
blends, or licensing circumvention attempts, time stretching happens in complex
identification situations where severe tempo modifications are used on short
excerpts, making them very hard to be automatically identified [19].
The main contribution of this work is to introduce a novel AFP system operating with
lightweight spectral peaks as input, but grounded in a representation learning
approach and evaluated in the context of time stretching. Specifically, our model
PeakNetFP applies contrastive learning to learn fingerprints from sparse spectral
peaks input, leveraging the hierarchical point set learning algorithm PointNet++
[20]. It is designed to exhibit the good performance of neural state-of-the-art
approaches while keeping memory footprint low thanks to sparse input. To our
knowledge, this is the first attempt at combining traditional peaks and
representation learning for audio fingerprinting, and the first time a point-cloud
network is used for AFP. As a subsequent contribution, we evaluate PeakNetFP
alongside the SOTA algorithm on time stretching, QuadFP [21], which is a peak-based
approach, and NeuralFP [6], the SOTA neural audio fingerprinting, in a new scenario
for it. We finally show that PeakNetFP achieves performance close to the SOTA method
NeuralFP, despite using 100 times fewer parameters and 11 times smaller input data
than the latter.
In section 2 we summarize the works relevant to this publication. In section 3, we
describe the hierarchical peak set feature extraction as well as the contrastive
representation learning framework at the core of PeakNetFP. Finally, in section 4 we
present its evaluation in the context of extreme time stretching and show how it
compares to the peak-based time stretching baseline QuadFP, and to the
spectrogram-based SOTA model NeuralFP. PeakNetFP code, dataset, and model are open
and available¹.
¹ https://github.com/guillemcortes/peaknetfp
2. RELATED WORK
Over the past two decades, the research community has worked to advance audio
fingerprinting systems for multiple use cases. Some of these innovations include
wavelets [22] for noise resilience, constant Q-transform [23] or Fundamental
Frequency Map [24] for pitch-shifting robustness, and cosine filters [25] for
broadcast monitoring, to name a few.
We can classify AFP methods into three broad categories: local descriptors-based
[4, 22, 24–29], peak-based [1,7,19,21,23,28,30–32], and neural audio fingerprints
[6,13,33–37]. Peak-based fingerprints started with Shazam’s algorithm [1], which set
the basis of spectral peak pairs linking to form hashes that are robust to noise.
Then, Six & Leman proposed linking triplets to obtain robustness to time and
frequency modifications in Panako [7], although it is not suitable for short queries
or extreme time stretching since it was designed for content deduplication of audio
collections of old recordings that were digitalized by replaying. Similarly,
Sonnleitner & Widmer proposed QuadFP [21], which adapts blind astrometry research
[38] to build quadruplets of peaks and generate hashes robust to significant time and
frequency modifications [19]. Other peak-based AFP works [23, 28, 30–32] have also
studied how to improve the robustness to time and frequency modifications. Peak-based
traditional methods perform well even in the presence of alterations such as noise,
compression, or reverberation, for instance. They generally produce lightweight
hashes that can be efficiently indexed into lookup tables, which makes them scalable
to hundreds of thousands or even tens of millions of music pieces. Additionally, such
methods do not require training or accelerated computing hardware.
2.1 QuadFP
However, such traditional methods significantly underperform in the presence of
extremely challenging scenarios, like in the case of strong time stretching [21].
QuadFP stands out as one of the most advanced peak-based AFP in that regard. Designed
to be robust to time and frequency modifications, its core innovation is the use of
quadruple peak descriptors, which capture not only the position of each peak but also
its relationship with neighboring peaks. Each quadruple describes a constellation of
four peaks (local maxima) in the time-frequency domain, effectively encoding local
patterns and relationships between peaks. This approach is more robust to noise and
variations in audio content compared to other fingerprinting methods [21] such as
Panako [7]. Once the quadruple features are extracted, QuadFP uses a hashing
mechanism to map these descriptors to a database. In this publication, we use QuadFP
as the most advanced peak-based AFP baseline for robustness against time stretching.
It also aligns with the use case of this study, which restricts the input data to
spectral peaks. Our objective is to show how much a neural network can improve the
best traditional system for time stretching.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
2.2 NeuralFP
In the last decade, neural networks have been successfully used for AFP. In 2017,
Google presented the first neural AFP system Now Playing [13]. It was designed to run
on mobile devices with datasets of limited size while featuring high robustness to
noise. A very recent approach, GraFPrint [39], leverages the structural learning
capabilities of Graph Neural Networks (GNNs) to generate robust fingerprints from
time-frequency representations. As opposed to our method, rather than using sparse
spectral peaks, GraFPrint extracts from spectrograms localization-aware,
low-dimensional features using a convolutional encoder. Other neural AFP systems
[6,33–37] use targeted augmentations in a contrastive learning framework to achieve
robustness to noise, reverberation, echo, and other distortions. Among them, NeuralFP
[6] stands out as being the only fully replicable neural AFP. The implementation is
open-source, with a public dataset and model weights. NeuralFP is also lighter than
other proposed models such as transformer-based AFPs [34,36].
For these reasons, we use NeuralFP as a foundation for the development of PeakNetFP.
NeuralFP leverages contrastive learning to achieve high-sensitive audio retrieval,
employing a convolutional encoder to extract meaningful features from
mel-spectrograms. In this paper, we propose to reuse NeuralFP’s contrastive learning
framework while changing input data from dense spectrograms to sparse peak-based
features. The original NeuralFP is then used as a reference model, showing what could
be achieved when considering full spectrograms as input.
2.3 AFP for time stretching
Some fingerprinting methods have been designed to effectively handle time stretching,
the focus of this work. QuadFP [21] appears as a milestone on this topic. Using a
quadruple-based spectral peak grouping (see section 2.1) combined with an asymmetric
query-reference fingerprints configuration that maximizes the number of quads
generated in queries, they report high precision and accuracy measures for multiple
tempo modifications for 20 seconds queries. Their reference database consists of
100,000 tracks from Jamendo and they test 300 query tracks time-stretched with 13
stretching factors between 70% and 130%. In this experiment, they report an average
accuracy of 92.9% for 10-second queries, but 28.1% average accuracy for 2.5-second
queries. This shows how the performance collapses as the query length shrinks. We can
expect this performance to be lower if fewer quads per second are used, a more likely
scenario in an industrial environment.
Yao et al. [40] use the same dataset as [21]. Experiments were done for 13 stretching
factors between 70% and 130% and a query length of 20 seconds. They report similar
performance to QuadFP but with a 20% drop in recall. SAMAF [41] reports that for
different query lengths ranging from 1 to 6 seconds, they achieve over 80% accuracy
for mild stretching (0.9 and 1.1) but this collapses with severe stretching (0.5,
1.5), with less than 13% of accuracy. Panako [7] reports results for queries of 20,
40, and 60 seconds on a database of 30,000 songs. Less than a third of the queries
are resolved correctly after a time stretching modification of 8%, though. Son et al.
[24] achieve perfect precision for tempo modification in the range of 70% to 130%.
However, their dataset only comprises 100 audio files and they query using the full
audio length. George & Jhunjhunwala [42] propose to encode the features using only
frequency information and thus making it independent of time, as opposed to [1],
which encodes with respect to time. They test with tempo modifications in the ±50%
range. They achieve over 97% of accuracy but on a small dataset of 300 samples of 20
seconds each. Their algorithm is also not suitable for short queries.

Figure 1. Considered AFP systems overview. From top to bottom: NeuralFP, PeakNetFP
(ours), QuadFP. Dashed lines represent extra data used for training. Our model
PeakNetFP learns features from the same input as QuadFP, in the same contrastive
learning framework as NeuralFP.

3. PEAKNETFP
In this section, we describe our proposed model, PeakNetFP, starting from the sparse
input data through to the contrastive learning framework, highlighting the
hierarchical peak set feature extraction process. Additionally, we introduce the
dataset used for evaluation. Figure 1 provides an overview of all the AFP systems
considered, including our PeakNetFP, the baseline QuadFP [21], and the SOTA AFP model
NeuralFP [6].
3.1 Sparse input data
As we described in the introduction, our model is designed to handle sparse data in
the form of 3-dimensional spectral peaks, as features extracted from a third-party
traditional AFP system. Typically, such peaks represent a subset of the local maxima
from the spectrogram, chosen based on criteria that identify the most salient ones
[1,7,30]. In our system, we extract local maxima in the melspectrogram using 3x3
kernels and stride 1 as a proxy for traditional peak-based fingerprints for
simplicity and to avoid biasing the results on other system criteria. This allows the
neural network to learn which peaks are most relevant for matching or classification
even though a more refined peak selection could help reduce the dimensionality of the
input, improving computational efficiency. In practice, we select the 256 highest
amplitude local maxima per each 1-second segment, ensuring that we capture the most
prominent features within each window.

Figure 2. PeakNetFP overview consisting of Sampling, Grouping + Feature Extraction
(G+FE) and Multi-Scale feature Grouping layers (MSG).

When working with peaks rather than continuous data points, it becomes challenging to
inject locality into convolutional kernels, which typically rely on dense, grid-like
input structures. Peaks create a sparse representation of data, much like how point
clouds are used in computer vision tasks. This sparsity complicates the direct
application of traditional convolutional methods, as there is no inherent
neighborhood structure. However, approaches like PointNet [43] and derived methods
from the point cloud literature offer a potential solution by grouping local peaks
and processing them similarly to how convolutions operate on spectrograms. By
leveraging local relationships among peaks, we can capture meaningful patterns
without requiring dense, continuous input.
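
The peak selection described in this subsection (3x3 local maxima with stride 1,
then the 256 highest-amplitude maxima per 1-second segment) can be sketched in a few
lines. This is a minimal illustration in plain Python rather than the authors' code:
the function names, the `spec[frame][bin]` layout, the strict-maximum rule, and the
handling of border cells are all assumptions made for the example, and each peak is
assumed to be a (time, frequency, amplitude) triple.

```python
from typing import List, Tuple

Peak = Tuple[int, int, float]  # (time frame, mel bin, amplitude); assumed layout


def local_maxima_3x3(spec: List[List[float]]) -> List[Peak]:
    """Return cells strictly greater than every in-bounds neighbour of their 3x3 window."""
    n_frames = len(spec)
    n_bins = len(spec[0]) if spec else 0
    peaks: List[Peak] = []
    for t in range(n_frames):
        for f in range(n_bins):
            value = spec[t][f]
            is_peak = True
            for dt in (-1, 0, 1):
                for df in (-1, 0, 1):
                    tt, ff = t + dt, f + df
                    # Skip the centre cell and neighbours outside the segment.
                    if (dt, df) == (0, 0) or not (0 <= tt < n_frames and 0 <= ff < n_bins):
                        continue
                    if spec[tt][ff] >= value:
                        is_peak = False
            if is_peak:
                peaks.append((t, f, value))
    return peaks


def top_k_peaks(peaks: List[Peak], k: int = 256) -> List[Peak]:
    """Keep the k highest-amplitude peaks of one segment (k=256 in the paper)."""
    return sorted(peaks, key=lambda p: p[2], reverse=True)[:k]


# Toy segment: 4 frames x 4 mel bins with two isolated maxima.
toy_segment = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],
]
toy_peaks = top_k_peaks(local_maxima_3x3(toy_segment), k=256)
```

A real segment has far more than 256 local maxima, so the amplitude ranking is what
bounds the input size; on the toy segment both maxima are kept.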
3.2 Hierarchical peak set feature extraction
Hierarchical PointNet, or PointNet++ [20], introduces a multi-level approach to
capture both local and global features in sparse data, akin to the nested
convolutions found in traditional CNNs. PointNet++ organizes points (peaks in our
context) into hierarchical groupings, where local neighborhoods are progressively
sampled and processed, similar to the way convolutions scan across dense data. This
hierarchical structure enables PointNet++ to effectively learn both fine-grained and
high-level features from sparse data. In PeakNetFP, we incorporate the hierarchical
peak set feature extraction from PointNet++ into the contrastive learning framework
from NeuralFP. Figure 2 illustrates the architecture of PeakNetFP and Figure 3 the
Grouping and Feature Extraction (G+FE) block of the second layer, represented in blue
in Figure 2. The sparse peaks encoding starts with two Set Abstraction (SA) layers,
represented in green and blue on Figure 2, which are responsible for grouping
neighbouring peaks in a hierarchical way through Multi-Scale feature Grouping (MSG).
Each layer $i$ operates in three key steps:
(I) Sampling: The $N^{(i)}$ peaks with greatest amplitudes are selected as anchor
peaks that will be the center of the peak groups. This step helps control
computational complexity as the network goes deeper.
(II) Grouping + Feature Extraction block (G+FE): each block is made of 3 parallel
layers, each layer $j$ comprising:
(i) Grouping: for each anchor peak, we select the $G_j^{(i)}$ closest peaks within
radius $R_j^{(i)}$ using query balls to form local neighborhoods. These neighborhoods
act as local receptive fields, similar to convolutional patches in CNNs. The query
ball is a crucial element because it allows precise control over the distance and
radius of hierarchical search in the point cloud. Unlike traditional convolutions
that rely on fixed grid structures, the query ball adapts to the irregular
distribution of peaks by grouping them based on actual spatial proximity.
(ii) Feature Extraction: Within each neighborhood, an MLP with 3 layers of respective
dimensions $A_j^{(i)}$, $B_j^{(i)}$, $C_j^{(i)}$ is applied to learn local features.
The MLP aggregates features for each point and uses max-pooling to summarize them
into a single vector of dimension $N^{(i)} \times C_j^{(i)}$ representing the local
region.
(III) Multi-scale feature grouping: Features of all parallel extraction layers are
concatenated to a single embedding of dimension
$N^{(i)} \times (C_1^{(i)} + C_2^{(i)} + C_3^{(i)})$. This step allows the model to
capture features at multiple scales simultaneously by using different neighborhood
radii during the grouping stage, which helps consider both fine and coarse features.

Figure 3. The Grouping + Feature Extraction block of the second layer ($i = 2$). For
each specific sublayer $j$, $R_j$ and $G_j$ are the radius and group size of the
queryball, and $(A_j, B_j, C_j)$ are the dimensions of the MLP layers.

After each SA layer, the output is a smaller set of peaks with higher-dimensional
feature vectors. These vectors are passed to the next SA layer, where the process
repeats with a new sampling (Peaks(2) in Figure 2), further abstracting the data. As
we move deeper into the network, the receptive fields become larger, allowing the
network to capture broader contextual information while maintaining local details.
The last SA layer, represented in purple in Figure 2, is similar to a Grouping +
Feature Extraction block, but where all points are grouped together, forming a single
128-dimensional feature vector. This final vector thus encodes both local and global
information about the peaks, and can be used as a fingerprint.
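
Steps (I) to (III) of one Set Abstraction layer can be illustrated with the following
sketch. It is not the PeakNetFP implementation: the weights are random and untrained,
the MLP has no bias terms, distances are assumed to be Euclidean in normalized
(time, frequency) coordinates, and the per-point input (relative position plus
amplitude) is an assumption made for the example. All function names are
hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Peak = Tuple[float, float, float]  # (time, frequency, amplitude), assumed scaled to [0, 1]


def sample_anchors(peaks: Sequence[Peak], n_anchors: int) -> List[Peak]:
    """(I) Sampling: keep the n_anchors peaks with the greatest amplitude."""
    return sorted(peaks, key=lambda p: p[2], reverse=True)[:n_anchors]


def query_ball(anchor: Peak, peaks: Sequence[Peak], radius: float, group_size: int) -> List[Peak]:
    """(II.i) Grouping: the group_size closest peaks within radius of the anchor."""
    in_ball = []
    for p in peaks:
        dist = math.hypot(p[0] - anchor[0], p[1] - anchor[1])
        if dist <= radius:
            in_ball.append((dist, p))
    in_ball.sort(key=lambda item: item[0])
    return [p for _, p in in_ball[:group_size]]


def make_mlp(dims: Sequence[int], rng: random.Random) -> List[List[List[float]]]:
    """Random weight matrices for a bias-free MLP (stand-in for trained weights)."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(dims[k])] for _ in range(dims[k + 1])]
            for k in range(len(dims) - 1)]


def apply_mlp(layers: List[List[List[float]]], x: List[float]) -> List[float]:
    """Apply each linear layer followed by a ReLU."""
    for weights in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
    return x


def gfe_sublayer(anchors, peaks, radius, group_size, mlp) -> List[List[float]]:
    """(II) One G+FE sublayer: group, run the shared MLP per point, max-pool per anchor."""
    pooled = []
    for a in anchors:
        group = query_ball(a, peaks, radius, group_size)  # never empty: contains the anchor
        feats = [apply_mlp(mlp, [p[0] - a[0], p[1] - a[1], p[2]]) for p in group]
        pooled.append([max(column) for column in zip(*feats)])
    return pooled


def set_abstraction(peaks, n_anchors, configs, seed=0):
    """(III) Concatenate the outputs of the parallel sublayers (multi-scale grouping).

    configs holds one (G, R, A, B, C) tuple per sublayer, as in Table 1.
    """
    rng = random.Random(seed)
    anchors = sample_anchors(peaks, n_anchors)
    per_scale = [gfe_sublayer(anchors, peaks, r, g, make_mlp([3, a, b, c], rng))
                 for (g, r, a, b, c) in configs]
    features = [sum((scale[n] for scale in per_scale), []) for n in range(len(anchors))]
    return anchors, features
```

With the first-layer settings of Table 1, each anchor ends up with a vector of
32 + 64 + 64 = 160 features, which matches the $N^{(i)} \times (C_1^{(i)} + C_2^{(i)}
+ C_3^{(i)})$ shape given in step (III).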
3.3 Contrastive learning framework
PeakNetFP relies on the NeuralFP contrastive learning framework [6] to learn
fingerprints, which we describe in the following. It operates on 1-second windows
with a 50% overlap. It creates data pairs by applying time stretching to short audio
snippets. Each mini-batch $MB$ is formed by $N$ samples and $N$ augmented replicas of
the same samples to generate positive pairs $x_i$ and $x_j$ so that
$MB = \{x_i, x_j, \ldots, x_{N_i}, x_{N_j}\}$ and $|MB| = 2N$. NT-Xent loss [44] is
chosen to maximize an agreement between positive pairs in a mini-batch $MB$. No
explicit negative sampling is performed, thus, given a positive pair, the other
$2(N-1)$ data points are to be treated as negative samples. The NT-Xent loss for a
given pair of embeddings $z_i$ and $z_j$ is defined as:

$$\ell(i, j) = -\log \frac{\exp(a_{i,j}/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(a_{i,k}/\tau)} \qquad (1)$$

where $a_{i,j} = z_i^T z_j$ for $i, j \in \{1, \ldots, 2N\}$. $\tau$ is the
temperature scaling factor of the softmax. Computing the Top-1 in the softmax
function is equivalent to Maximum Inner Product Search (MIPS).
$\mathbb{1}_{[k \neq i]}$ ensures that the summation excludes the similarity of the
anchor with itself. Then, the loss $L$ averages $\ell$ across all positive pairs,
both $(i, j)$ and $(j, i)$:

$$L = \frac{1}{2N} \sum_{k=1}^{N} \left[ \ell(2k-1, 2k) + \ell(2k, 2k-1) \right] \qquad (2)$$
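
Equations (1) and (2) can be checked numerically with a short sketch. The code below
is a plain-Python illustration, not the training code: it assumes the embeddings are
already L2-normalized and stored so that consecutive rows form a positive pair, uses
zero-based indices, and the default temperature is a placeholder value chosen for the
example.

```python
import math
from typing import List, Sequence


def inner(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product a = u^T v."""
    return sum(x * y for x, y in zip(u, v))


def nt_xent_pair(z: List[Sequence[float]], i: int, j: int, tau: float) -> float:
    """Equation (1): loss of anchor i with positive j; the sum skips k == i only."""
    numerator = math.exp(inner(z[i], z[j]) / tau)
    denominator = sum(math.exp(inner(z[i], z[k]) / tau) for k in range(len(z)) if k != i)
    return -math.log(numerator / denominator)


def nt_xent_loss(z: List[Sequence[float]], tau: float = 0.05) -> float:
    """Equation (2): average over both orderings of the N positive pairs.

    Rows (0, 1), (2, 3), ... of z are the positive pairs (zero-based version of
    the (2k-1, 2k) indexing in the text). tau=0.05 is an assumed default.
    """
    n_pairs = len(z) // 2
    total = 0.0
    for k in range(n_pairs):
        i, j = 2 * k, 2 * k + 1
        total += nt_xent_pair(z, i, j, tau) + nt_xent_pair(z, j, i, tau)
    return total / (2 * n_pairs)


# Two positive pairs: aligned positives give a lower loss than orthogonal ones.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent_loss(aligned, tau=1.0)
loss_misaligned = nt_xent_loss(misaligned, tau=1.0)
```

For the `aligned` batch with $\tau = 1$, every term of equation (1) equals
$\log(1 + 2/e)$, since the denominator contains the positive ($e^1$) and two
negatives ($e^0$ each).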
During retrieval, as in [6], we retrieve 20 candidate segments from an Inverted File
Product Quantization (IVFPQ) index built with Faiss [45]. Then, we perform a sequence
matching in which each segment’s embedding is compared to candidate embeddings via
inner product, and results are ordered based on this score.
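
The two retrieval stages can be sketched as follows. This is one plausible reading of
the procedure rather than the method of [6]: an exhaustive inner-product search
stands in for the Faiss IVFPQ index, each candidate is converted into a hypothetical
start position in the database, and candidate sequences are scored by their mean
inner product with the query segments. All names are assumptions made for the
example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def inner(u: Vector, v: Vector) -> float:
    """Inner product between two embeddings."""
    return sum(x * y for x, y in zip(u, v))


def top_k_segments(query: Vector, db: List[Vector], k: int) -> List[int]:
    """Exhaustive maximum inner product search (stand-in for the IVFPQ index)."""
    ranked = sorted(range(len(db)), key=lambda idx: inner(query, db[idx]), reverse=True)
    return ranked[:k]


def sequence_match(query_seq: List[Vector], db: List[Vector], k: int = 20) -> List[Tuple[float, int]]:
    """Score candidate alignments of a multi-segment query against the database.

    Returns (score, start index) pairs sorted by decreasing score.
    """
    last_start = len(db) - len(query_seq)
    starts = set()
    for pos, segment in enumerate(query_seq):
        for idx in top_k_segments(segment, db, k):
            start = idx - pos  # database index where the query would begin
            if 0 <= start <= last_start:
                starts.add(start)
    scored = []
    for start in starts:
        total = sum(inner(seg, db[start + pos]) for pos, seg in enumerate(query_seq))
        scored.append((total / len(query_seq), start))
    return sorted(scored, reverse=True)


# Toy database of six orthogonal unit embeddings; the query is segments 2 and 3.
toy_db = [[1.0 if d == n else 0.0 for d in range(6)] for n in range(6)]
toy_matches = sequence_match([toy_db[2], toy_db[3]], toy_db, k=2)
```

Scoring whole sequences instead of single segments lets a few poorly matched segments
be outvoted by the rest of the query, which is why longer queries tend to be easier
to identify.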
3.4 Dataset
To develop PeakNetFP, we use the same dataset as in NeuralFP [6] but change the
augmentations to time stretching. The dataset consists in multiple audio files
extracted from fma_medium dataset [46] that comes with defined subsets, which we also
use to train and test our models. The Train subset contains 10,000 30-second audio
clips while Test-Query/DB contains 500 30-second audio clips. To increase the
reference set, we use Test-Dummy-DB, which comprises 100,000 full tracks with an
average length of 278 seconds each. This is useful for testing the scalability of the
system. In the evaluation step, we use the same 2,000 segments selected randomly from
the 500 clips as in the NeuralFP evaluation.
During training, the stretching augmentations are performed at the spectrogram level,
which we resize only on the time axis using bilinear interpolation. Unlike
waveform-based stretching methods from SOX², this naive method integrates easily with
the training pipeline and ensures that our model is not overfitting to any
particularities of a specific model but learns to handle stretching. In testing, we
use SOX to generate realistic queries from the DB tracks of Test-Query/DB set. We
generate queries for stretching factors 1.05, 1.1, 1.2, 1.4, 1.6, 1.8, and 2 that
increase the tempo of the song, and their counterparts reducing the tempo 0.975,
0.95, 0.9, 0.8, 0.7, 0.6, and 0.5. Note that a stretching factor of 2 doubles the
tempo while 0.5 halves it. This test set is publicly available in Zenodo³.
² https://sourceforge.net/projects/sox/
³ https://zenodo.org/records/15646861
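
The spectrogram-level augmentation described above can be sketched as a resize along
the time axis only, where bilinear interpolation reduces to linear interpolation
between neighbouring frames. The sketch below is an illustration in plain Python, not
the training pipeline: the function name, the `spec[frame][bin]` layout, the rounding
of the output length, and the clamping at the last frame are assumptions.

```python
from typing import List


def stretch_time_axis(spec: List[List[float]], factor: float) -> List[List[float]]:
    """Resize a spectrogram along time only; factor 2 doubles the tempo (half the frames)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_frames = len(spec)
    if n_frames == 0:
        return []
    n_out = max(1, int(n_frames / factor + 0.5))  # rounded output length
    stretched = []
    for u in range(n_out):
        pos = min(u * factor, n_frames - 1)  # source position, clamped to the last frame
        lo = int(pos)
        hi = min(lo + 1, n_frames - 1)
        frac = pos - lo
        # Linear interpolation between the two neighbouring source frames.
        stretched.append([(1.0 - frac) * a + frac * b for a, b in zip(spec[lo], spec[hi])])
    return stretched


# Toy spectrogram: 4 frames, 1 frequency bin, values rising linearly over time.
toy_spec = [[0.0], [2.0], [4.0], [6.0]]
faster = stretch_time_axis(toy_spec, 2.0)   # tempo doubled
slower = stretch_time_axis(toy_spec, 0.5)   # tempo halved
```

The frequency axis is left untouched, which mirrors the definition of time stretching
given in the introduction: the tempo changes while the pitch does not.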
4. EVALUATION
In this section, we present the particularities of the evaluation framework as well
as the metric used and the results. Finally, we also report the computational cost of
the benchmarked systems.
4.1 Evaluation framework
PeakNetFP evaluation is strictly based on NeuralFP to allow a fair comparison and can
be examined in the repository accompanying this publication. We train both PeakNetFP
and NeuralFP for 100 epochs using a batch size of 240 with Adam optimizer [47]
following the authors’ recommendation [6]. Table 1 summarizes the parameters of
PeakNetFP layers, including the number of anchor peaks N, the queryballs radii R, and
the number of peaks per queryball grouping G.
Since there is no public QuadFP implementation, we create our own version based on
the original paper [21]. Although not all implementation details are provided for
exact replication, we closely followed the model’s key aspects, such as computing
more quads for queries than references and applying cascading heuristics to
efficiently filter out irrelevant quads during comparison. We validate our
implementation in section 4.2 by comparing our results with the ones from the
original publication [21].
We evaluate PeakNetFP as well as QuadFP and NeuralFP in the presence of time
stretching ranging from the extreme values 0.5x to 2x the original speed.
Additionally, we test each model with query lengths of 2, 3, 5, 6, and 10 seconds to
ensure relevance for query-by-example applications. 1s queries are not considered,
since their resulting size with time factors over 1 would make them smaller than
NeuralFP window size.
To compare the AFP systems fairly, we align with the literature [6] by using Top-1
hit rate HR@1 defined as the number of hits at Top-1 divided by the number of
queries. Note that both NeuralFP and PeakNetFP always return a match, so in this
case, Top-1 hit rate is equivalent to both precision and recall. Future work could
include training a classifier on the matching scores to adapt the system to
out-of-vocabulary queries.
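
As a concrete reading of the HR@1 definition above, the following sketch counts a hit
whenever the top-ranked reference of a query is the correct one. The function name
and the use of string identifiers are assumptions made for the example.

```python
from typing import Hashable, Sequence


def top1_hit_rate(predicted: Sequence[Hashable], expected: Sequence[Hashable]) -> float:
    """HR@1: number of queries whose Top-1 match is correct, divided by the number of queries."""
    if len(predicted) != len(expected):
        raise ValueError("one prediction is required per query")
    if not expected:
        raise ValueError("at least one query is required")
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / len(expected)


# Four queries, three of them identified correctly at rank 1.
example_hr1 = top1_hit_rate(["a", "b", "c", "d"], ["a", "b", "x", "d"])
```

Because a match is always returned, every miss is both a false positive and a false
negative, which is why HR@1 coincides with precision and recall in this setting.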
Layer        N     j   G    R     MLP A   MLP B   MLP C
SA + MSG 1   200   1   4    0.1   16      16      32
                   2   8    0.2   32      32      64
                   3   16   0.3   32      48      64
SA + MSG 2   100   1   4    0.2   32      32      64
                   2   8    0.3   64      64      128
                   3   16   0.4   64      64      128
SA                                128     256     128

Table 1. PeakNetFP layers and their parameters: number of anchors N, layer index j,
number of peaks to group G, grouping radius R, and dimensions of the 3 MLP layers A,
B and C.

[Figure 4: line plot. x-axis: stretching factor (0.5 to 2.0); y-axis: HR@1 (%), 0 to
100; curves for NeuralFP, PeakNetFP, and QuadFP, labelled by query length.]
Figure 4. Top-1 hit rate (HR@1) as a function of stretching factors for NeuralFP,
PeakNetFP (ours), and QuadFP. Each curve represents one (model, query length) pair.
4.2 Resul s
To check the validity of our custom implementation, we compare our QuadFP results
with the ones reported on Table I in [21]. Our implementation of QuadFP shows better
results than the original for short queries (≤5 seconds), with an average 48% HR@1
for 2s queries in our case against a reported 28% for 2.5s queries in [21], and
slightly worse results for larger ones, with 94% for our implementation vs 98% in
[21] for 20-second queries. We acknowledge that the datasets differ between both
studies (FMA versus Jamendo), but hypothesize that since their size and nature is
similar, the comparison in the context of AFP systems still remains valid. In
summary, we conclude that our implementation yields results comparable to [21], with
remaining differences coming from either evaluation datasets or implementation
differences.
Figure 4 illustrates Top-1 hit rates HR@1 of all 3 models as a function of stretching
factor. For each model, we report 5 different curves that correspond to the 5 query
lengths tested, drawn in blue plain lines for NeuralFP, orange dashed-dot lines for
QuadFP, and magenta dashed lines for our model PeakNetFP. We highlight the curves
corresponding to 10 seconds queries to compare with the best setup of our QuadFP
baseline.
PeakNetFP outperforms QuadFP globally, effectively handling time stretching. It also
exhibits excellent performance, achieving over 98% HR@1 for 10-second queries within
the commonly reported 0.7 to 1.4 stretching factor range [21,24,40]. For extreme
stretching factors (<0.7 and >1.4) PeakNetFP performance decreases a bit but still
maintains over 90% HR@1. As a reference, QuadFP only achieves 3.6% HR@1 for a 0.5
factor. In fact, we observe that QuadFP performs well for minor stretching (0.9 to
1.1) as reported before [21], but its performance rapidly deteriorates as the
stretching deviates further from 1, reaching nearly zero HR@1 at extreme stretching
factors of 0.5 and 2. This effect is probably due to the lack of enough preserved
quads at such strong time stretching factors. In terms of query length, QuadFP’s
performance degrades quickly as queries get smaller. It should be reminded here that
QuadFP is a rule based algorithm that does not require training as opposed to
PeakNetFP or NeuralFP.
Results on the SOTA model NeuralFP exhibit a strong robustness to time stretching,
even in extreme cases. As a reminder, NeuralFP processes entire spectrograms while
PeakNetFP processes sparse peaks only. Nonetheless, for the commonly reported time
stretching factors (0.7 to 1.4), PeakNetFP and NeuralFP obtain almost identical
performance, with a maximum difference of ±0.7% HR@1. For the more extreme factors,
the performance of both systems is diminished, with PeakNetFP being more affected
than NeuralFP: the maximum difference between systems is 1.85% HR@1 at factor 0.5.
Regarding the query length, PeakNetFP requires queries of at least 5s to keep
performance systematically above 90% HR@1 at extreme time factors.
We conclude that PeakNetFP has a performance comparable to NeuralFP with a slight
decrease at extreme stretching factors. However, PeakNetFP is significantly lighter
than NeuralFP. It uses an input of 256 3D peaks, approximately 11 times smaller than
NeuralFP’s 256 × 32 spectrograms. With 169k trainable parameters, PeakNetFP’s model
size is 100 times smaller than NeuralFP’s 16.9M. This also requires substantially
less inference memory: 800 MiB for PeakNetFP versus 2338 MiB for NeuralFP (batch size
125 on a single RTX 3090). This improves our model’s scalability and reduces memory
usage for catalog embedding generation.
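
The size ratios quoted above follow directly from the reported figures, as the short
calculation below shows. The variable names are illustrative, and the only inputs are
the numbers stated in the text.

```python
# Input size per 1-second segment, counted in scalar values.
peaknetfp_input = 256 * 3      # 256 peaks, 3 values each
neuralfp_input = 256 * 32      # 256 x 32 spectrogram
input_ratio = neuralfp_input / peaknetfp_input

# Trainable parameters.
peaknetfp_params = 169_000     # 169k
neuralfp_params = 16_900_000   # 16.9M
param_ratio = neuralfp_params / peaknetfp_params

# Inference memory in MiB (batch size 125 on a single RTX 3090).
memory_ratio = 2338 / 800

print(f"input: {input_ratio:.2f}x, parameters: {param_ratio:.0f}x, memory: {memory_ratio:.2f}x")
```

The input ratio is 8192 / 768, roughly 10.7, which the text rounds to 11; the
parameter ratio is exactly 100, and the inference memory ratio is close to 2.9.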
5. CONCLUSION
In this work, we introduce a novel audio fingerprinting system, PeakNetFP, that is
designed as a hybrid approach combining the strengths of traditional peak-based
fingerprint systems, heavily used in industrial contexts, with modern neural
network-based representation learning approaches. We use a computer vision-inspired
point cloud network to handle sparse peaks, which we use in a contrastive learning
approach similar to modern AFP methods. Our evaluation in the context of extreme time
stretching demonstrates that PeakNetFP consistently outperforms the SOTA on
time-stretched data, QuadFP. Moreover, we show that, while the spectrogram-based SOTA
in AFP, NeuralFP, performs very well in such a task, our model PeakNetFP can achieve
comparable performance while working on peaks and thus requiring 11 times smaller
input data, and using 100 times fewer parameters than the former.
In conclusion, PeakNetFP provides a scalable and efficient solution for audio
identification tasks that involve significant tempo alterations, combining the
compactness of peak-based methods with the robustness and flexibility of neural
networks. It improves with respect to traditional methods for severe to extreme
stretching factors, and appears as an alternative to fully neural approaches,
especially for contexts where memory and computational efficiency are critical.
Future works will focus on improving the model for applications beyond time
stretching, such as pitch shifting.
6. ACKNOWLEDGEMENTS
This research is part of resCUE – Smart system for automatic usage reporting of
musical works in audiovisual productions (SAV-20221147) funded by CDTI and the
European Union - Next Generation EU, and supported by the Spanish Ministerio de
Ciencia, Innovación y Universidades and the Ministerio para la Transformación Digital
y de la Función Pública. Furthermore, it has received support from the Industrial
Doctorates plan of the Secretaria d’Universitats i Recerca, Departament d’Empresa i
Coneixement de la Generalitat de Catalunya, grant agreement No. DI46-2020.
7. REFERENCES
[1] A. Wang, “An industrial strength audio search algorithm,” in Proceedings of the
4th International Society for Music Information Retrieval Conference (ISMIR 2003),
Baltimore, Maryland, USA, 2003, pp. 7–13.
[2] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system,” in
Proceedings of the 3rd International Society for Music Information Retrieval
Conference (ISMIR 2002), Paris, France, 2002, pp. 107–115.
[3] E. Gomez, P. Cano, L. Gomes, E. Batlle, and M. Bonnet, “Mixed
watermarking-fingerprinting approach for integrity verification of audio recordings,”
in Proceedings of the International Telecommunications Symposium, Natal, Brazil, 2002.
[4] C. Ouali, P. Dumouchel, and V. Gupta, “A robust audio fingerprinting method for
content-based copy detection,” in 12th International Workshop on Content-Based
Multimedia Indexing (CBMI 2014), Klagenfurt, Austria, 2014, pp. 1–6.
[5] R. Sonnleitner, A. Arzt, and G. Widmer, “Landmark-based audio fingerprinting for
DJ mix monitoring,” in Proceedings of the 17th International Society for Music
Information Retrieval Conference (ISMIR 2016), New York City, New York, USA, 2016,
pp. 185–191.
[6] S. Chang, D. Lee, J. Park, H. Lim, K. Lee, K. Ko, and Y. Han, “Neural audio
fingerprint for high-specific audio retrieval based on contrastive learning,” in IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021),
Toronto, Ontario, Canada, 2021, pp. 3025–3029.
[7] J. Six and M. Leman, “Panako: A scalable acoustic fingerprinting system handling
time-scale and pitch modification,” in Proceedings of the 15th International Society
for Music Information Retrieval Conference (ISMIR 2014), Taipei, Taiwan, 2014,
pp. 259–264.
[8] H. Kim, J. Kim, J. Park, S. Kim, C. Park, and W. Yoo, “Background music
monitoring framework and dataset for broadcast audio,” ETRI Journal, 2024.
[9] J. Six, “Olaf: Overly lightweight acoustic fingerprinting,” in Proceedings of the
21st International Society for Music Information Retrieval Conference (ISMIR 2020),
Montréal, Canada, 2020.
[10] A. L.-C. Wang and D. Culbert, “Robust and invariant audio pattern matching,”
United States Patent US007 627 477B2, 2009, Shazam Investments Ltd. and Apple Inc.
[11] A. Master, B. Mont-Reynaud, K. Mohajer, and
T. S onehocke , “Sys ems and me hods o p o iding
iden i ica ion in o ma ion in esponse o an audio seg-
men ,” Uni ed S a es Pa en US10 657 174B2, 2020,
SoundHound, Inc.
[12] K. Akesbi, “Audio denoising o obus audio inge -
p in ing,” Mas e ’s hesis, Ecole no male supé ieu e
Pa is-Saclay, Pa is, F ance, 2022.
[13] B. G elle , B. Ague a-A cas, D. Roblek, J. D. Lyon,
J. J. Odell, K. Kilgou , M. Ri e , M. Sha i i, M. Ve-
limi o i´
c, R. Guo, and S. Kuma , “Now playing: Con-
inuous low-powe music ecogni ion,” in P oceedings
o he 31s Annual Con e ence on Neu al In o ma ion
P ocessing Sys ems (NIPS 2017) Wo kshop: Machine
Lea ning on he Phone, Long Beach, CA, USA, 2017.
[14] G. Co ès, A. Ciu ana, E. Molina, M. Mi on, O. Mey-
e s, J. Six, and X. Se a, “BAF: An audio inge p in -
ing da ase o b oadcas moni o ing,” in P oceedings
o he 23 d In e na ional Socie y o Music In o ma ion
Re ie al Con e ence (ISMIR 2022), Bengalu u, India,
2022, pp. 908–916.
[15] A. Wang, “The shazam music ecogni ion se ice,”
Communica ions o he ACM, ol. 49, no. 8, pp. 44–
48, 2006.
[16] S. Bilob o , “Indexing based on ime- a ian ans-
o ms o an audio signal’s spec og am,” Uni ed S a es
Pa en US10418 051B2, 2019, Facebook, Inc.
[17] M. P is e , R. Michael, M. Boll, C. Kö e , K. Rieck,
and D. A p, “Lis ening be ween he bi s: P i acy leaks
in audio inge p in s,” in P oceedings o he In e na-
ional Con e ence on De ec ion o In usions and Mal-
wa e, and Vulne abili y Assessmen , Cham, 2024, pp.
184–204.
[18] D. Schwa z and D. Fou e , “Unmixdb: A da ase o dj-
mix in o ma ion e ie al,” in P oceedings o he 19 h
In e na ional Socie y o Music In o ma ion Re ie al
Con e ence (ISMIR 2018), Pa is, F ance, 2018.
[19] R. Sonnlei ne , “Audio iden i ica ion ia inge p in ing.
achie ing obus ness o se e e signal modi ica ions,”
PhD hesis, Johannes Keple Uni e si y Linz, Linz,
Ös e eich, 2017.
[20] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Poin -
ne ++: deep hie a chical ea u e lea ning on poin se s
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
212
in a me ic space,” in P oceedings o he 31s In e na-
ional Con e ence on Neu al In o ma ion P ocessing
Sys ems (NIPS 2017), Long Beach, CA, USA, 2017, p.
5105–5114.
[21] R. Sonnlei ne and G. Widme , “Robus quad-based au-
dio inge p in ing,” IEEE/ACM T ansac ions on Audio,
Speech, and Language P ocessing, ol. 24, no. 3, pp.
409–421, 2016.
[22] S. Baluja and M. Co ell, “Wa ep in : E icien
wa ele -based audio inge p in ing,” Pa e n Recogni-
ion, ol. 41, no. 11, pp. 3467–3480, 2008.
[23] S. Fene , G. Richa d, and Y. G enie , “A scalable
audio inge p in me hod wi h obus ness o pi ch-
shi ing,” in P oceedings o he 12 h In e na ional So-
cie y o Music In o ma ion Re ie al Con e ence (IS-
MIR 2011), Miami, Flo ida, USA, 2011, pp. 121–126.
[24] H. Son, S. Byun, and S. Lee, “A obus audio inge -
p in ing using a new hashing me hod,” IEEE Access,
ol. 8, pp. 172 343–172 351, 2020.
[25] M. Ramona and G. Pee e s, “Audiop in : An e icien
audio inge p in sys em based on a no el cos -less
synch oniza ion scheme,” in IEEE In e na ional Con-
e ence on Acous ics, Speech and Signal P ocessing
(ICASSP 2013). Vancou e , BC, Canada: IEEE, 2013,
pp. 818–822.
[26] J. Hai sma, T. Kalke , and J. Oos een, “Robus au-
dio hashing o con en iden i ica ion,” Con en Based
Mul imedia Indexing, B escia, I aly, 2001.
[27] X. Angue a, A. Ga zon, and T. Adamek, “MASK:
obus local ea u es o audio inge p in ing,”
in P oceedings o he 2012 IEEE In e na ional
Con e ence on Mul imedia and Expo, ICME.
Melbou ne, Aus alia: IEEE Compu e Soci-
e y, 7 2012, pp. 455–460. [Online]. A ailable:
h ps://doi.o g/10.1109/ICME.2012.137
[28] E. Dup az and G. Richa d, “Robus equency-based
audio inge p in ing,” in P oceedings o he IEEE In-
e na ional Con e ence on Acous ics, Speech, and Sig-
nal P ocessing (ICASSP 2010), Dallas, Texas, USA,
2010, pp. 281–284.
[29] A. Aga waal, P. Kanaujia, S. S. Roy, and S. Ghose,
“Robus and ligh weigh audio inge p in o au o-
ma ic con en ecogni ion,” 2023. [Online]. A ailable:
h ps://a xi .o g/abs/2305.09559
[30] R. Sonnlei ne and G. Widme , “Quad-based audio in-
ge p in ing obus o ime and equency scaling,” in
P oceedings o he 17 h In e na ional Con e ence on
Digi al Audio E ec s (DAFx-14), E langen, Ge many,
9 2014, pp. 173–180.
[31] M. Malekesmaeili and R. K. Wa d, “A local inge -
p in ing app oach o audio copy de ec ion,” Signal
P ocess., ol. 98, pp. 308–321, 2014.
[32] J.-Y. Lee and H.-G. Kim, “Audio inge p in ing using
a obus hash unc ion based on he MCLT peak-pai ,”
The Jou nal o he Acous ical Socie y o Ko ea, ol. 34,
no. 2, pp. 157–162, 2015.
[33] Z. Yu, X. Du, B. Zhu, and Z. Ma, “Con as i e unsu-
pe ised lea ning o audio inge p in ing,” Compu ing
Resea ch Reposi o y (CoRR), 2020.
[34] A. Singh, K. Demuynck, and V. A o a, “A en ion-
based audio embeddings o que y-by-example,” in
P oceedings o he 23 d In e na ional Socie y o Mu-
sic In o ma ion Re ie al Con e ence (ISMIR 2022),
Bengalu u, India, 2022, pp. 52–58.
[35] X. Wu and H. Wang, “Asymme ic con as i e lea ning
o audio inge p in ing,” IEEE Signal P ocess. Le .,
ol. 29, pp. 1873–1877, 2022.
[36] A. Singh, K. Demuynck, and V. A o a, “Simul ane-
ously lea ning obus audio embeddings and balanced
hash codes o que y-by-example,” in IEEE In e na-
ional Con e ence on Acous ics, Speech and Signal
P ocessing (ICASSP 2023), Rhodes Island, G eece,
2023, pp. 1–5.
[37] Y. Fuji a and T. Koma su, “Audio inge p in ing wi h
holog aphic educed ep esen a ions,” in 25 h Annual
Con e ence o he In e na ional Speech Communi-
ca ion Associa ion (In e speech 2024), Kos, G eece,
2024.
[38] D. Lang, D. W. Hogg, K. Mie le, M. Blan on, and
S. Roweis, “As ome y. ne : Blind as ome ic calib a-
ion o a bi a y as onomical images,” The as onomi-
cal jou nal, ol. 139, no. 5, p. 1782, 2010.
[39] A. Bha acha jee, S. Singh, and E. Bene os, “G a p in :
A gnn-based app oach o audio iden i ica ion,” in
IEEE In e na ional Con e ence on Acous ics, Speech
and Signal P ocessing (ICASSP 2025), Hyde abad, In-
dia, 2025.
[40] S. Yao, B. Niu, and J. Liu, “Enhancing sampling and
coun ing me hod o audio e ie al wi h ime-s e ch
esis ance,” in 2018 IEEE Fou h In e na ional Con-
e ence on Mul imedia Big Da a (BigMM), 2018.
[41] A. Báez-Suá ez, N. Shah, J. A. Nolazco-Flo es, S.-
H. S. Huang, O. Gnawali, and W. Shi, “SAMAF:
Sequence- o-sequence au oencode model o au-
dio inge p in ing,” ACM T ansac ions on Mul ime-
dia Compu ing, Communica ions, and Applica ions
(TOMM), ol. 16, no. 2, 2020.
[42] J. Geo ge and A. Jhunjhunwala, “Scalable and o-
bus audio inge p in ing me hod ole able o ime-
s e ching,” in 2015 IEEE In e na ional con e ence on
digi al signal p ocessing (DSP), 2015, pp. 436–440.
[43] C. R. Qi, H. Su, M. Kaichun, and L. J. Guibas, “Poin -
ne : Deep lea ning on poin se s o 3d classi ica ion
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
213
and segmen a ion,” in 2017 IEEE Con e ence on Com-
pu e Vision and Pa e n Recogni ion (CVPR), 2017,
pp. 77–85.
[44] T. Chen, S. Ko nbli h, M. No ouzi, and G. Hin on,
“A simple amewo k o con as i e lea ning o isual
ep esen a ions,” in In e na ional Con e ence on Ma-
chine Lea ning (ICML), PmLR, 2020, pp. 1597–1607.
[45] J. Johnson, M. Douze, and H. Jégou, “Billion-scale
simila i y sea ch wi h GPUs,” IEEE T ansac ions on
Big Da a, ol. 7, no. 3, pp. 535–547, 2019.
[46] M. De e a d, K. Benzi, P. Vande gheyns , and
X. B esson, “FMA: A da ase o music analysis,” in
P oceedings o he 18 h In e na ional Socie y o Mu-
sic In o ma ion Re ie al Con e ence (ISMIR 2017),
Suzhou, China, 2017, pp. 316–323.
[47] D. P. Kingma and J. Ba, “Adam: A me hod o s ochas-
ic op imiza ion,” in 3 d In e na ional Con e ence on
Lea ning Rep esen a ions (ICLR 2015), San Diego,
CA, USA, 2015.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
214