PeakNetFP: Peak-Based Neural Audio Fingerprinting Robust to Extreme Time Stretching

Author: Guillem Cortès-Sebastià; Benjamin Martin; Emilio Molina; Xavier Serra; Romain Hennequin

Publisher: Zenodo

DOI: 10.5281/zenodo.17706375

Source: https://zenodo.org/records/17706375/files/000025.pdf

PEAKNETFP: PEAK-BASED NEURAL AUDIO FINGERPRINTING
ROBUST TO EXTREME TIME STRETCHING
Guillem Cortès-Sebastià¹,³ Benjamin Martin² Emilio Molina¹
Xavier Serra³ Romain Hennequin²
¹ BMAT Licensing S.L., Barcelona, Spain
² Deezer Research, Paris, France
³ Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
[email protected], [email protected]
ABSTRACT
This work introduces PeakNetFP, the first neural audio fingerprinting (AFP) system designed specifically around spectral peaks. This novel system is designed to leverage the sparse spectral coordinates typically computed by traditional peak-based AFP methods. PeakNetFP performs hierarchical point feature extraction techniques similar to the computer vision model PointNet++, and is trained using contrastive learning like in the state-of-the-art deep learning AFP, NeuralFP. This combination allows PeakNetFP to outperform conventional AFP systems and achieve comparable performance to NeuralFP when handling challenging time-stretched audio data. In extensive evaluation, PeakNetFP maintains a Top-1 hit rate of over 90% for stretching factors ranging from 50% to 200%. Moreover, PeakNetFP offers significant efficiency advantages: compared to NeuralFP, it has 100 times fewer parameters and uses 11 times smaller input data. These features make PeakNetFP a lightweight and efficient solution for AFP tasks where time stretching is involved. Overall, this system represents a promising direction for future AFP technologies, as it successfully merges the lightweight nature of peak-based AFP with the adaptability and pattern recognition capabilities of neural network-based approaches, paving the way for more scalable and efficient solutions in the field.
© G. Cortès-Sebastià, B. Martin, E. Molina, X. Serra, R. Hennequin. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: G. Cortès-Sebastià, B. Martin, E. Molina, X. Serra, R. Hennequin, “PeakNetFP: Peak-based Neural Audio Fingerprinting Robust to Extreme Time Stretching”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
1. INTRODUCTION
Audio Fingerprinting (AFP) is the MIR task of identifying audio recordings within a database of reference tracks. Early AFP systems date back twenty years, with Shazam [1] and Philips [2] systems. Since then AFP has been extensively studied for various use cases, such as query-by-example [1], integrity verification [3], content-based copy detection [4], DJ-set monitoring [5], or high specific audio retrieval [6]. Peak-based AFP systems have a long trajectory in the field and multiple works use this approach to enhance their robustness to pitch shifting and time stretching [7], background music identification [8], or to create a lightweight AFP that can run in embedded systems [9]. These algorithms are based on extracting and linking salient spectral peaks computed from time-frequency representations. These are mature, production-ready systems that do not require training, and can scale to industrial levels in which databases consist of millions of references [10,11]. Thus, companies with massive data catalogs rely on them for content identification [12].
Representation learning systems, such as Now Playing [13] or NeuralFP [6], have recently emerged as novel approaches that leverage Contrastive Learning (CL) and Convolutional Neural Networks (CNNs) to learn the similarities between a distorted audio clip and its corresponding reference track. They are designed to perform highly sensitive audio retrieval, are capable of matching short segments, and significantly outperform traditional peak-based methods under challenging conditions. This is due to their ability to capture more complex and nuanced features from the data, making them more robust to various types of distortions and noise that traditional methods struggle with [6,14].
This comes at the expense of requiring large computational resources, dense input data, model training, and GPU computing, which might not be suitable for some applications. In industrial solutions, these requirements may be hard to overcome, and peak-based features are still considered as a viable alternative [11, 12, 15, 16]. Practically, it is common that audio features have to be computed on a client device, then uploaded to a server to perform identification against a reference database. In such conditions, requiring dense spectrograms as audio features significantly increases the amount of data to be uploaded compared to sending only sparse spectral peaks. When fingerprint generation is considered to run client-side, it may be more complex and battery intensive to run inference on trained models as compared to simple rule-based peak extraction algorithms, especially for query-by-example applications [15] where client devices are generally variable in specifications (e.g. smartphones). An alternative setup is fully in-device audio identification [13], although this generally implies even more restrictive computational requirements on the device in terms of memory footprint and limits the database size. Additionally, music copyrights owners are often reluctant to share any dense representation that could be either inverted or used for other tasks than fingerprinting, and are more inclined to compute sparse targeted features that carry less information and can hardly be used for anything else than what they were designed for (with rare exceptions [17]). Finally, as peak-based AFP has been used extensively by industrial systems, it is relevant for private companies to leverage such large datasets of precomputed spectral peaks for neural audio fingerprinting approaches [1, 12]. For these reasons, in this work we propose to keep the traditional peak-based features as input, and use them in a modern neural approach.
In this first publication on a neural sparse peak-based model, we choose to focus our study on time stretching in extreme conditions, which has been underexplored in the literature. Time stretching is an audio processing technique that alters the tempo of a track without changing its pitch. This method is commonly used by DJs to synchronize the tempo of different songs within a mix or to create remixes that are either slowed down or sped up [18]. In challenging situations such as mash-ups, blends, or licensing circumvention attempts, time stretching happens in complex identification situations where severe tempo modifications are used on short excerpts, making them very hard to be automatically identified [19].
The main contribution of this work is to introduce a novel AFP system operating with lightweight spectral peaks as input, but grounded in a representation learning approach and evaluated in the context of time stretching. Specifically, our model PeakNetFP applies contrastive learning to learn fingerprints from sparse spectral peaks input, leveraging the hierarchical point set learning algorithm PointNet++ [20]. It is designed to exhibit the good performance of neural state-of-the-art approaches while keeping memory footprint low thanks to sparse input. To our knowledge, this is the first attempt at combining traditional peaks and representation learning for audio fingerprinting, and the first time a point-cloud network is used for AFP. As a subsequent contribution, we evaluate PeakNetFP alongside the SOTA algorithm on time stretching, QuadFP [21], which is a peak-based approach, and NeuralFP [6], the SOTA neural audio fingerprinting, in a new scenario for it. We finally show that PeakNetFP achieves performance close to the SOTA method NeuralFP, despite using 100 times fewer parameters and 11 times smaller input data than the latter.
In section 2 we summarize the works relevant to this publication. In section 3, we describe the hierarchical peak set feature extraction as well as the contrastive representation learning framework at the core of PeakNetFP. Finally, in section 4 we present its evaluation in the context of extreme time stretching and show how it compares to the peak-based time stretching baseline QuadFP, and to the spectrogram-based SOTA model NeuralFP. PeakNetFP code, dataset, and model are open and available.¹
¹ https://github.com/guillemcortes/peaknetfp
2. RELATED WORK
Over the past two decades, the research community has worked to advance audio fingerprinting systems for multiple use cases. Some of these innovations include wavelets [22] for noise resilience, constant Q-transform [23] or Fundamental Frequency Map [24] for pitch-shifting robustness, and cosine filters [25] for broadcast monitoring, to name a few.
We can classify AFP methods into three broad categories: local descriptors-based [4, 22, 24–29], peak-based [1,7,19,21,23,28,30–32], and neural audio fingerprints [6,13,33–37]. Peak-based fingerprints started with Shazam’s algorithm [1], which set the basis of spectral peak pairs linking to form hashes that are robust to noise. Then, Six & Leman proposed linking triplets to obtain robustness to time and frequency modifications in Panako [7], although it is not suitable for short queries or extreme time stretching since it was designed for content deduplication of audio collections of old recordings that were digitalized by replaying. Similarly, Sonnleitner & Widmer proposed QuadFP [21], which adapts blind astrometry research [38] to build quadruplets of peaks and generate hashes robust to significant time and frequency modifications [19]. Other peak-based AFP works [23, 28, 30–32] have also studied how to improve the robustness to time and frequency modifications. Peak-based traditional methods perform well even in the presence of alterations such as noise, compression, or reverberation, for instance. They generally produce lightweight hashes that can be efficiently indexed into lookup tables, which makes them scalable to hundreds of thousands or even tens of millions of music pieces. Additionally, such methods do not require training or accelerated computing hardware.
2.1 QuadFP
However, such traditional methods significantly underperform in the presence of extremely challenging scenarios, like in the case of strong time stretching [21]. QuadFP stands out as one of the most advanced peak-based AFP in that regard. Designed to be robust to time and frequency modifications, its core innovation is the use of quadruple peak descriptors, which capture not only the position of each peak but also its relationship with neighboring peaks. Each quadruple describes a constellation of four peaks (local maxima) in the time-frequency domain, effectively encoding local patterns and relationships between peaks. This approach is more robust to noise and variations in audio content compared to other fingerprinting methods [21] such as Panako [7]. Once the quadruple features are extracted, QuadFP uses a hashing mechanism to map these descriptors to a database. In this publication, we use QuadFP as the most advanced peak-based AFP baseline for robustness against time stretching. It also aligns with the use case of this study, which restricts the input data to spectral peaks. Our objective is to show how much a neural network can improve the best traditional system for time stretching.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
2.2 NeuralFP
In the last decade, neural networks have been successfully used for AFP. In 2017, Google presented the first neural AFP system Now Playing [13]. It was designed to run on mobile devices with datasets of limited size while featuring high robustness to noise. A very recent approach, GraFPrint [39], leverages the structural learning capabilities of Graph Neural Networks (GNNs) to generate robust fingerprints from time-frequency representations. As opposed to our method, rather than using sparse spectral peaks, GraFPrint extracts from spectrograms localization-aware, low-dimensional features using a convolutional encoder. Other neural AFP systems [6,33–37] use targeted augmentations in a contrastive learning framework to achieve robustness to noise, reverberation, echo, and other distortions. Among them, NeuralFP [6] stands out as being the only fully replicable neural AFP. The implementation is open-source, with a public dataset and model weights. NeuralFP is also lighter than other proposed models such as transformer-based AFPs [34,36].
For these reasons, we use NeuralFP as a foundation for the development of PeakNetFP. NeuralFP leverages contrastive learning to achieve high-sensitive audio retrieval, employing a convolutional encoder to extract meaningful features from mel-spectrograms. In this paper, we propose to reuse NeuralFP’s contrastive learning framework while changing input data from dense spectrograms to sparse peak-based features. The original NeuralFP is then used as a reference model, showing what could be achieved when considering full spectrograms as input.
2.3 AFP for time stretching
Some fingerprinting methods have been designed to effectively handle time stretching, the focus of this work. QuadFP [21] appears as a milestone on this topic. Using a quadruple-based spectral peak grouping (see section 2.1) combined with an asymmetric query-reference fingerprints configuration that maximizes the number of quads generated in queries, they report high precision and accuracy measures for multiple tempo modifications of 20 seconds queries. Their reference database consists of 100,000 tracks from Jamendo and they test 300 query tracks time-stretched with 13 stretching factors between 70% and 130%. In this experiment, they report an average accuracy of 92.9% for 10-second queries, but 28.1% average accuracy for 2.5-second queries. This shows how the performance collapses as the query length shrinks. We can expect this performance to be lower if fewer quads per second are used, a more likely scenario in an industrial environment.
Yao et al. [40] use the same dataset as [21]. Experiments were done for 13 stretching factors between 70% and 130% and a query length of 20 seconds. They report similar performance to QuadFP but with a 20% drop in recall. SAMAF [41] reports that for different query lengths ranging from 1 to 6 seconds, they achieve over 80% accuracy for mild stretching (0.9 and 1.1) but this collapses with severe stretching (0.5, 1.5), with less than 13% of accuracy. Panako [7] reports results for queries of 20, 40, and 60 seconds on a database of 30,000 songs. Less than a third of the queries are resolved correctly after a time stretching modification of 8%, though. Son et al. [24] achieve perfect precision for tempo modification in the range of 70% to 130%. However, their dataset only comprises 100 audio files and they query using the full audio length. George & Jhunjhunwala [42] propose to encode the features using only frequency information and thus making it independent of time, as opposed to [1], which encodes with respect to time. They test with tempo modifications in the ±50% range. They achieve over 97% of accuracy but on a small dataset of 300 samples of 20 seconds each. Their algorithm is also not suitable for short queries.
Figure 1. Considered AFP systems overview. From top to bottom: NeuralFP, PeakNetFP (ours), QuadFP. Dashed lines represent extra data used for training. Our model PeakNetFP learns features from the same input as QuadFP, in the same contrastive learning framework as NeuralFP.
3. PEAKNETFP
In this section, we describe our proposed model, PeakNetFP, starting from the sparse input data through to the contrastive learning framework, highlighting the hierarchical peak set feature extraction process. Additionally, we introduce the dataset used for evaluation. Figure 1 provides an overview of all the AFP systems considered, including our PeakNetFP, the baseline QuadFP [21], and the SOTA AFP model NeuralFP [6].
3.1 Sparse input data
As we described in the introduction, our model is designed to handle sparse data in the form of 3-dimensional spectral peaks, as features extracted from a third-party traditional AFP system. Typically, such peaks represent a subset of the local maxima from the spectrogram, chosen based on criteria that identify the most salient ones [1,7,30]. In our system, we extract local maxima in the melspectrogram using 3x3 kernels and stride 1 as a proxy of traditional peak-based fingerprints for simplicity and to avoid biasing the results on other system criteria. This allows the neural network to learn which peaks are most relevant for matching or classification even though a more refined peak selection could help reduce the dimensionality of the input, improving computational efficiency. In practice, we select the 256 highest amplitude local maxima per each 1-second segment, ensuring that we capture the most prominent features within each window.
Figure 2. PeakNetFP overview consisting of Sampling, Grouping + Feature Extraction (G+FE) and Multi-Scale feature Grouping layers (MSG).
When working with peaks rather than continuous data points, it becomes challenging to inject locality into convolutional kernels, which typically rely on dense, grid-like input structures. Peaks create a sparse representation of data, much like how point clouds are used in computer vision tasks. This sparsity complicates the direct application of traditional convolutional methods, as there is no inherent neighborhood structure. However, approaches like PointNet [43] and derived methods from the point cloud literature offer a potential solution by grouping local peaks and processing them similarly to how convolutions operate on spectrograms. By leveraging local relationships among peaks, we can capture meaningful patterns without requiring dense, continuous input.
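The peak selection described in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the spectrogram is assumed to be a list of frames (`spec[t][f]`), a peak is assumed to be a cell strictly greater than its available 3x3 neighbours, and each peak is stored as a hypothetical `(time frame, mel bin, amplitude)` triple. The function names are illustrative.

```python
from typing import List, Tuple

# Assumed layout of one 3-dimensional peak: (time frame, mel bin, amplitude).
Peak = Tuple[int, int, float]


def local_maxima_3x3(spec: List[List[float]]) -> List[Peak]:
    """Return every cell that is strictly greater than its 3x3 neighbours.

    Border cells are compared only against the neighbours that exist
    (an assumption; the paper does not specify border handling).
    """
    n_frames = len(spec)
    n_bins = len(spec[0]) if spec else 0
    peaks: List[Peak] = []
    for t in range(n_frames):
        for f in range(n_bins):
            value = spec[t][f]
            neighbours = [
                spec[tt][ff]
                for tt in range(max(0, t - 1), min(n_frames, t + 2))
                for ff in range(max(0, f - 1), min(n_bins, f + 2))
                if (tt, ff) != (t, f)
            ]
            if all(value > other for other in neighbours):
                peaks.append((t, f, value))
    return peaks


def top_k_peaks(peaks: List[Peak], k: int = 256) -> List[Peak]:
    """Keep the k highest-amplitude peaks (256 per 1-second segment in the text)."""
    return sorted(peaks, key=lambda p: p[2], reverse=True)[:k]
```

Applied to each 1-second mel-spectrogram segment, `top_k_peaks(local_maxima_3x3(segment))` would yield at most 256 triples, which is the sparse input size the text refers to.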
3.2 Hierarchical peak set feature extraction
Hierarchical PointNet, or PointNet++ [20], introduces a multi-level approach to capture both local and global features in sparse data, akin to the nested convolutions found in traditional CNNs. PointNet++ organizes points (peaks in our context) into hierarchical groupings, where local neighborhoods are progressively sampled and processed, similar to the way convolutions scan across dense data. This hierarchical structure enables PointNet++ to effectively learn both fine-grained and high-level features from sparse data. In PeakNetFP, we incorporate the hierarchical peak set feature extraction from PointNet++ into the contrastive learning framework from NeuralFP. Figure 2 illustrates the architecture of PeakNetFP and Figure 3 the Grouping and Feature Extraction (G+FE) block of the second layer, represented in blue in Figure 2. The sparse peaks encoding starts with two Set Abstraction (SA) layers, represented in green and blue on Figure 2, which are responsible for grouping neighbouring peaks in a hierarchical way through Multi-Scale feature Grouping (MSG). Each layer $i$ operates in three key steps:
(I) Sampling: The $N^{(i)}$ peaks with greatest amplitudes are selected as anchor peaks that will be the center of the peak groups. This step helps control computational complexity as the network goes deeper.
(II) Grouping + Feature Extraction block (G+FE): each block is made of 3 parallel layers, each layer $j$ comprising:
(i) Grouping: for each anchor peak, we select the $G_j^{(i)}$ closest peaks within radius $R_j^{(i)}$ using query balls to form local neighborhoods. These neighborhoods act as local receptive fields, similar to convolutional patches in CNNs. The query ball is a crucial element because it allows precise control over the distance and radius of hierarchical search in the point cloud. Unlike traditional convolutions that rely on fixed grid structures, the query ball adapts to the irregular distribution of peaks by grouping them based on actual spatial proximity.
(ii) Feature Extraction: Within each neighborhood, an MLP with 3 layers of respective dimensions $A_j^{(i)}, B_j^{(i)}, C_j^{(i)}$ is applied to learn local features. The MLP aggregates features of each point and uses max-pooling to summarize them into a single vector of dimension $N^{(i)} \times C_j^{(i)}$ representing the local region.
(III) Multi-scale feature grouping: Features of all parallel extraction layers are concatenated to a single embedding of dimension $N^{(i)} \times (C_1^{(i)} + C_2^{(i)} + C_3^{(i)})$. This step allows the model to capture features at multiple scales simultaneously by using different neighborhood radii during the grouping stage, which helps consider both fine and coarse features.
Figure 3. The Grouping + Feature Extraction block of the second layer ($i = 2$). For each specific sublayer $j$, $R_j$ and $G_j$ are the radius and group size of the queryball, and $(A_j, B_j, C_j)$ are the dimensions of the MLP layers.
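To make the three steps concrete, the following is a minimal sketch of one set abstraction layer in plain Python. It is not the PeakNetFP code: the learned 3-layer MLP is replaced by a hypothetical hand-written feature (`toy_point_feature`), peaks are assumed to be `(time, frequency, amplitude)` triples with coordinates normalised to [0, 1], and the query ball is assumed to measure Euclidean distance in the time-frequency plane only.

```python
import math
from typing import List, Sequence, Tuple

# Assumed peak layout: (time, frequency, amplitude), coordinates in [0, 1].
Peak = Tuple[float, float, float]


def sample_anchors(peaks: Sequence[Peak], n: int) -> List[Peak]:
    """Step (I): keep the n peaks with the greatest amplitude as anchors."""
    return sorted(peaks, key=lambda p: p[2], reverse=True)[:n]


def query_ball(anchor: Peak, peaks: Sequence[Peak],
               radius: float, group_size: int) -> List[Peak]:
    """Step (II.i): the group_size closest peaks within radius of the anchor."""
    in_ball = []
    for p in peaks:
        distance = math.hypot(p[0] - anchor[0], p[1] - anchor[1])
        if distance <= radius:
            in_ball.append((distance, p))
    in_ball.sort(key=lambda item: item[0])
    return [p for _, p in in_ball[:group_size]]


def toy_point_feature(anchor: Peak, p: Peak) -> List[float]:
    """Stand-in for the learned MLP: offsets relative to the anchor plus amplitude."""
    return [p[0] - anchor[0], p[1] - anchor[1], p[2]]


def max_pool(features: List[List[float]]) -> List[float]:
    """Step (II.ii): summarise a neighbourhood by the per-dimension maximum."""
    return [max(column) for column in zip(*features)]


def set_abstraction(peaks: Sequence[Peak], n_anchors: int,
                    scales: Sequence[Tuple[int, float]]) -> List[List[float]]:
    """Steps (I) to (III): one embedding per anchor, concatenated over scales.

    scales is a list of (group_size G, radius R) pairs, one per parallel layer.
    """
    embeddings = []
    for anchor in sample_anchors(peaks, n_anchors):
        embedding: List[float] = []
        for group_size, radius in scales:
            # The anchor itself is always inside its own ball, so the group is non-empty.
            group = query_ball(anchor, peaks, radius, group_size)
            features = [toy_point_feature(anchor, p) for p in group]
            embedding.extend(max_pool(features))  # Step (III): concatenate scales.
        embeddings.append(embedding)
    return embeddings
```

With a real MLP of output width $C_j$ in place of the 3-dimensional toy feature, each anchor embedding would have $C_1 + C_2 + C_3$ entries, matching the dimension given in step (III).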
After each SA layer, the output is a smaller set of peaks with higher-dimensional feature vectors. These vectors are passed to the next SA layer, where the process repeats with a new sampling (Peaks(2) in Figure 2), further abstracting the data. As we move deeper into the network, the receptive fields become larger, allowing the network to capture broader contextual information while maintaining local details. The last SA layer, represented in purple in Figure 2, is similar to a Grouping + Feature Extraction block, but where all points are grouped together, forming a single 128-dimensional feature vector. This final vector thus encodes both local and global information about the peaks, and can be used as a fingerprint.
3.3 Contrastive learning framework
PeakNetFP relies on the NeuralFP contrastive learning framework [6] to learn fingerprints, which we describe in the following. It operates on 1-second windows with a 50% overlap. It creates data pairs by applying time stretching to short audio snippets. Each mini-batch $MB$ is formed by $N$ samples and $N$ augmented replicas of the same samples to generate positive pairs $x_i$ and $x_j$ so that $MB = \{x_i, x_j, \ldots, x_{N_i}, x_{N_j}\}$ and $|MB| = 2N$. NT-Xent loss [44] is chosen to maximize an agreement between positive pairs in a mini-batch $MB$. No explicit negative sampling is performed, thus, given a positive pair, the other $2(N-1)$ data points are to be treated as negative samples. The NT-Xent loss for a given pair of embeddings $z_i$ and $z_j$ is defined as:
$$\ell(i, j) = -\log \frac{\exp(a_{i,j}/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(a_{i,k}/\tau)} \qquad (1)$$
where $a_{i,j} = z_i^T z_j$ for $i, j \in \{1, \ldots, 2N\}$, and $\tau$ is the temperature scaling factor of the softmax. Computing the Top-1 in the softmax function is equivalent to Maximum Inner Product Search (MIPS). The indicator $\mathbb{1}_{[k \neq i]}$ ensures that the summation excludes the similarity of the anchor with itself ($k = i$). Then, the loss $\mathcal{L}$ averages $\ell$ across all positive pairs, both $(i, j)$ and $(j, i)$:
$$\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} \left[ \ell(2k-1, 2k) + \ell(2k, 2k-1) \right] \qquad (2)$$
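Equations (1) and (2) can be checked numerically with a small plain-Python sketch. It assumes the mini-batch embeddings are stored in the interleaved order implied by Equation (2), that is `[original_1, replica_1, original_2, replica_2, ...]`, and uses 0-based indices. The default temperature of 0.05 is an illustrative assumption, not a value stated in this text.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def inner_product(u: Vector, v: Vector) -> float:
    """a_{i,j} = z_i^T z_j."""
    return sum(a * b for a, b in zip(u, v))


def nt_xent_pair(z: Sequence[Vector], i: int, j: int, tau: float) -> float:
    """Equation (1) for anchor i and positive j (0-based indices)."""
    numerator = math.exp(inner_product(z[i], z[j]) / tau)
    # The indicator 1[k != i] drops only the anchor's similarity with itself.
    denominator = sum(
        math.exp(inner_product(z[i], z[k]) / tau)
        for k in range(len(z))
        if k != i
    )
    return -math.log(numerator / denominator)


def nt_xent_batch(z: Sequence[Vector], tau: float = 0.05) -> float:
    """Equation (2): average over all positive pairs in both directions.

    tau=0.05 is an assumed default used here for illustration only.
    """
    n = len(z) // 2
    total = 0.0
    for k in range(n):
        original, replica = 2 * k, 2 * k + 1
        total += nt_xent_pair(z, original, replica, tau)
        total += nt_xent_pair(z, replica, original, tau)
    return total / (2 * n)
```

When each replica embedding coincides with its original and differs from the other samples, the loss is small; swapping the replicas between samples makes it grow, which is the behaviour the contrastive objective relies on.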
During retrieval, as in [6], we retrieve 20 candidate segments from an Inverted File Product Quantization (IVFPQ) index built with Faiss [45]. Then, we perform a sequence matching in which each segment’s embedding is compared to candidate embeddings via inner product, and results are ordered based on this score.
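The two retrieval stages can be illustrated with a simplified sketch. It deliberately swaps the IVFPQ index for an exhaustive inner-product search, so it shows the logic only, not the Faiss-based pipeline. It also assumes the database is a single ordered list of segment embeddings and that a candidate sequence is scored by its mean inner product with the query sequence; both are assumptions for illustration, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def inner_product(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def top_k_by_inner_product(query: Vector, database: Sequence[Vector],
                           k: int) -> List[int]:
    """Exhaustive maximum inner product search (stand-in for the IVFPQ index)."""
    order = sorted(
        range(len(database)),
        key=lambda idx: inner_product(query, database[idx]),
        reverse=True,
    )
    return order[:k]


def match_sequence(query_seq: Sequence[Vector], database: Sequence[Vector],
                   k: int = 20) -> Optional[Tuple[int, float]]:
    """Return (start index, score) of the best-matching database sequence.

    Each retrieved segment votes for the start offset it would imply,
    then every candidate offset is scored over the whole query sequence.
    """
    n_query = len(query_seq)
    candidate_starts = set()
    for position, query in enumerate(query_seq):
        for idx in top_k_by_inner_product(query, database, k):
            start = idx - position
            if 0 <= start <= len(database) - n_query:
                candidate_starts.add(start)
    best: Optional[Tuple[int, float]] = None
    for start in sorted(candidate_starts):
        score = sum(
            inner_product(q, database[start + p]) for p, q in enumerate(query_seq)
        ) / n_query
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

A real system would also need to keep candidate sequences from crossing track boundaries, which this sketch ignores.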
3.4 Dataset
To develop PeakNetFP, we use the same dataset as in NeuralFP [6] but change the augmentations to time stretching. The dataset consists of multiple audio files extracted from the fma_medium dataset [46] that comes with defined subsets, which we also use to train and test our models. The Train subset contains 10,000 30-second audio clips while Test-Query/DB contains 500 30-second audio clips. To increase the reference set, we use Test-Dummy-DB, which comprises 100,000 full tracks with an average length of 278 seconds each. This is useful for testing the scalability of the system. In the evaluation step, we use the same 2,000 segments selected randomly from the 500 clips as in the NeuralFP evaluation.
During training, the stretching augmentations are performed at the spectrogram level, which we resize only on the time axis using bilinear interpolation. Unlike waveform-based stretching methods from SOX², this naive method integrates easily with the training pipeline and ensures that our model is not overfitting to any particularities of a specific model but learns to handle stretching.
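Resizing a spectrogram along the time axis only reduces bilinear interpolation to linear interpolation between neighbouring frames. The sketch below illustrates that idea in plain Python. The `spec[t][f]` layout, the rounding of the output length, and the endpoint-aligned sampling grid are assumptions made for the example, not details taken from the training pipeline.

```python
import math
from typing import List


def stretch_time_axis(spec: List[List[float]], factor: float) -> List[List[float]]:
    """Linearly resample spec[t][f] along time.

    A factor of 2 doubles the tempo (about half as many frames),
    a factor of 0.5 halves it (about twice as many frames).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(spec)
    if n_in == 0:
        return []
    n_out = max(1, round(n_in / factor))
    stretched = []
    for t_out in range(n_out):
        # Map the output frame onto the input time axis, keeping both ends aligned.
        position = t_out * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0
        low = int(math.floor(position))
        high = min(low + 1, n_in - 1)
        weight = position - low
        stretched.append(
            [(1 - weight) * a + weight * b for a, b in zip(spec[low], spec[high])]
        )
    return stretched
```

Because the frequency axis is left untouched, pitch content is preserved while the temporal spacing of events changes, which is the property time stretching is meant to have.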
In testing, we use SOX to generate realistic queries from the DB tracks of the Test-Query/DB set. We generate queries of stretching factors 1.05, 1.1, 1.2, 1.4, 1.6, 1.8, and 2 that increase the tempo of the song, and their counterparts reducing the tempo 0.975, 0.95, 0.9, 0.8, 0.7, 0.6, and 0.5. Note that a stretching factor of 2 doubles the tempo while 0.5 halves it. This test set is publicly available in Zenodo.³
² https://sourceforge.net/projects/sox/
³ https://zenodo.org/records/15646861
4. EVALUATION
In this section, we present the particularities of the evaluation framework as well as the metric used and the results. Finally, we also report the computational cost of the benchmarked systems.
4.1 Evaluation framework
PeakNetFP evaluation is strictly based on NeuralFP to allow a fair comparison and can be examined in the repository accompanying this publication. We train both PeakNetFP and NeuralFP for 100 epochs using a batch size of 240 with Adam optimizer [47] following the authors’ recommendation [6]. Table 1 summarizes the parameters of PeakNetFP layers, including the number of anchor peaks N, the queryballs radii R, and the number of peaks per queryball grouping G.
Since there is no public QuadFP implementation, we create our own version based on the original paper [21]. Although not all implementation details are provided for exact replication, we closely followed the model’s key aspects, such as computing more quads for queries than references and applying cascading heuristics to efficiently filter out irrelevant quads during comparison. We validate our implementation in section 4.2 by comparing our results with the ones from the original publication [21].
We evaluate PeakNetFP as well as QuadFP and NeuralFP in the presence of time stretching ranging from the extreme values 0.5x to 2x the original speed. Additionally, we test each model with query lengths of 2, 3, 5, 6, and 10 seconds to ensure relevance for query-by-example applications. 1s queries are not considered, since their resulting size with time factors over 1 would make them smaller than NeuralFP window size.
To compare the AFP systems fairly, we align with the literature [6] by using Top-1 hit rate HR@1, defined as the number of hits at Top-1 divided by the number of queries. Note that both NeuralFP and PeakNetFP always return a match, so in this case, Top-1 hit rate is equivalent to both precision and recall. Future work could include training a classifier on the matching scores to adapt the system to out-of-vocabulary queries.
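The metric is simple enough to state in code. The sketch below assumes each query yields exactly one predicted reference identifier, as is the case for systems that always return a match; the function name and the use of string identifiers are illustrative.

```python
from typing import Hashable, Sequence


def top1_hit_rate(predicted_ids: Sequence[Hashable],
                  true_ids: Sequence[Hashable]) -> float:
    """HR@1: number of Top-1 hits divided by the number of queries."""
    if len(predicted_ids) != len(true_ids):
        raise ValueError("one prediction is expected per query")
    if not true_ids:
        raise ValueError("at least one query is required")
    hits = sum(1 for p, t in zip(predicted_ids, true_ids) if p == t)
    return hits / len(true_ids)
```

For example, three correct Top-1 matches out of four queries give an HR@1 of 0.75, that is 75%.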
Layer        N     j   G    R     MLP A   MLP B   MLP C
SA + MSG 1   200   1   4    0.1   16      16      32
                   2   8    0.2   32      32      64
                   3   16   0.3   32      48      64
SA + MSG 2   100   1   4    0.2   32      32      64
                   2   8    0.3   64      64      128
                   3   16   0.4   64      64      128
SA           -     -   -    -     128     256     128
Table 1. PeakNetFP layers and their parameters: number of anchors N, layer index j, number of peaks to group G, grouping radius R, and dimensions of the 3 MLP layers A, B and C.

Figure 4. Top-1 hit rate (HR@1, in %) as a function of stretching factors (0.5 to 2.0) for NeuralFP, PeakNetFP (ours), and QuadFP. Each curve represents one (model, query length) pair.
4.2 Results
To check the validity of our custom implementation, we compare our QuadFP results with the ones reported on Table I in [21]. Our implementation of QuadFP shows better results than the original for short queries (≤5 seconds), with an average 48% HR@1 for 2s queries in our case against a reported 28% for 2.5s queries in [21], and slightly worse results for larger ones, with 94% for our implementation vs 98% in [21] for 20-second queries. We acknowledge that the datasets differ between both studies (FMA versus Jamendo), but hypothesize that since their size and nature are similar, the comparison in the context of AFP systems still remains valid. In summary, we conclude that our implementation yields results comparable to [21], with remaining differences coming from either evaluation datasets or implementation differences.
Figure 4 illustrates Top-1 hit rates HR@1 for all 3 models as a function of stretching factor. For each model, we report 5 different curves that correspond to the 5 query lengths tested, drawn in blue plain lines for NeuralFP, orange dashed-dot lines for QuadFP, and magenta dashed lines for our model PeakNetFP. We highlight the curves corresponding to 10 seconds queries to compare with the best setup of our QuadFP baseline.
PeakNetFP outperforms QuadFP globally, effectively handling time stretching. It also exhibits excellent performance, achieving over 98% HR@1 for 10-second queries within the commonly reported 0.7 to 1.4 stretching factor range [21,24,40]. For extreme stretching factors (<0.7 and >1.4) PeakNetFP performance decreases a bit but still maintains over 90% HR@1. As a reference, QuadFP only achieves 3.6% HR@1 for a 0.5 factor. In fact, we observe that QuadFP performs well for minor stretching (0.9 to 1.1) as reported before [21], but its performance rapidly deteriorates as the stretching deviates further from 1, reaching nearly zero HR@1 at extreme stretching factors of 0.5 and 2. This effect is probably due to the lack of enough preserved quads at such strong time stretching factors. In terms of query length, QuadFP’s performance degrades quickly as queries get smaller. It should be reminded here that QuadFP is a rule based algorithm that does not require training as opposed to PeakNetFP or NeuralFP.
Results on the SOTA model NeuralFP exhibit a strong robustness to time stretching, even in extreme cases. As a reminder, NeuralFP processes entire spectrograms while PeakNetFP processes sparse peaks only. Nonetheless, for the commonly reported time stretching factors (0.7 to 1.4), PeakNetFP and NeuralFP obtain almost identical performance, with a maximum difference of ±0.7% HR@1. For the more extreme factors, the performance of both systems is diminished, with PeakNetFP being more affected than NeuralFP, the maximum difference between systems being 1.85% HR@1 at factor 0.5. Regarding the query length, PeakNetFP requires queries of at least 5s to keep performance systematically above 90% HR@1 at extreme time factors.
We conclude that PeakNetFP has a performance comparable to NeuralFP with a slight decrease at extreme stretching factors. However, PeakNetFP is significantly lighter than NeuralFP. It uses an input of 256 3D peaks, approximately 11 times smaller than NeuralFP’s 256 × 32 spectrograms. With 169k trainable parameters, PeakNetFP’s model size is 100 times smaller than NeuralFP’s 16.9M. This also requires substantially less inference memory: 800 MiB for PeakNetFP versus 2338 MiB for NeuralFP (batch size 125 on a single RTX 3090). This improves our model’s scalability and reduces memory usage for catalog embedding generation.
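The size ratios quoted above follow directly from the reported figures, as the short calculation below shows. The variable names are illustrative and the only inputs are the numbers given in the text.

```python
# Input size per 1-second segment, counted in scalar values.
peak_input_values = 256 * 3    # 256 peaks, 3 coordinates each
spec_input_values = 256 * 32   # NeuralFP's 256 x 32 spectrogram
input_ratio = spec_input_values / peak_input_values  # about 10.67, i.e. roughly 11x

# Trainable parameters.
peaknetfp_params = 169e3
neuralfp_params = 16.9e6
param_ratio = neuralfp_params / peaknetfp_params  # about 100x

# Inference memory at batch size 125, in MiB.
memory_ratio = 2338 / 800  # about 2.9x
```

The input ratio is 8192 / 768, which rounds to the factor of 11 cited in the text.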
5. CONCLUSION
In this work, we introduce a novel audio fingerprinting system, PeakNetFP, that is designed as a hybrid approach combining the strengths of traditional peak-based fingerprint systems, heavily used in industrial contexts, with modern neural network-based representation learning approaches. We use a computer vision-inspired point cloud network to handle sparse peaks, which we use in a contrastive learning approach similar to modern AFP methods. Our evaluation in the context of extreme time stretching demonstrates that PeakNetFP consistently outperforms the SOTA on time-stretched data, QuadFP. Moreover, we show that, while the spectrogram-based SOTA in AFP, NeuralFP, performs very well in such a task, our model PeakNetFP can achieve comparable performance while working on peaks and thus requiring 11 times smaller input data, and using 100 times fewer parameters than the former.
In conclusion, PeakNetFP provides a scalable and efficient solution for audio identification tasks that involve significant tempo alterations, combining the compactness of peak-based methods with the robustness and flexibility of neural networks. It improves with respect to traditional methods for severe or extreme stretching factors, and appears as an alternative to fully neural approaches, especially for contexts where memory and computational efficiency are critical. Future works will focus on improving the model for applications beyond time stretching, such as pitch shifting.
6. ACKNOWLEDGEMENTS
This research is part of resCUE – Smart system for automatic usage reporting of musical works in audiovisual productions (SAV-20221147) funded by CDTI and the European Union - Next Generation EU, and supported by the Spanish Ministerio de Ciencia, Innovación y Universidades and the Ministerio para la Transformación Digital y de la Función Pública. Furthermore, it has received support from the Industrial Doctorates plan of the Secretaria d’Universitats i Recerca, Departament d’Empresa i Coneixement de la Generalitat de Catalunya, grant agreement No. DI46-2020.
7. REFERENCES
[1] A. Wang, “An industrial strength audio search
algorithm,” in Proceedings of the 4th International
Society for Music Information Retrieval Conference
(ISMIR 2003), Baltimore, Maryland, USA, 2003, pp. 7–13.
[2] J. Haitsma and T. Kalker, “A highly robust audio
fingerprinting system,” in Proceedings of the 3rd
International Society for Music Information Retrieval
Conference (ISMIR 2002), Paris, France, 2002, pp. 107–115.
[3] E. Gomez, P. Cano, L. Gomes, E. Batlle, and M. Bonnet,
“Mixed watermarking-fingerprinting approach for
integrity verification of audio recordings,” in
Proceedings of the International Telecommunications
Symposium, Natal, Brazil, 2002.
[4] C. Ouali, P. Dumouchel, and V. Gupta, “A robust audio
fingerprinting method for content-based copy
detection,” in 12th International Workshop on
Content-Based Multimedia Indexing (CBMI 2014), Klagenfurt,
Austria, 2014, pp. 1–6.
[5] R. Sonnleitner, A. Arzt, and G. Widmer, “Landmark-based
audio fingerprinting for DJ mix monitoring,” in
Proceedings of the 17th International Society for Music
Information Retrieval Conference (ISMIR 2016),
New York City, New York, USA, 2016, pp. 185–191.
[6] S. Chang, D. Lee, J. Park, H. Lim, K. Lee, K. Ko, and
Y. Han, “Neural audio fingerprint for high-specific
audio retrieval based on contrastive learning,” in IEEE
International Conference on Acoustics, Speech and
Signal Processing (ICASSP 2021), Toronto, Ontario,
Canada, 2021, pp. 3025–3029.
[7] J. Six and M. Leman, “Panako: A scalable acoustic
fingerprinting system handling time-scale and pitch
modification,” in Proceedings of the 15th International
Society for Music Information Retrieval Conference
(ISMIR 2014), Taipei, Taiwan, 2014, pp. 259–264.
[8] H. Kim, J. Kim, J. Park, S. Kim, C. Park, and W. Yoo,
“Background music monitoring framework and dataset
for broadcast audio,” ETRI Journal, 2024.
[9] J. Six, “Olaf: Overly lightweight acoustic
fingerprinting,” in Proceedings of the 21st International
Society for Music Information Retrieval Conference (ISMIR
2020), Montréal, Canada, 2020.
[10] A. L.-C. Wang and D. Culbert, “Robust and invariant
audio pattern matching,” United States Patent
US007 627 477B2, 2009, Shazam Investments Ltd. and
Apple Inc.
[11] A. Master, B. Mont-Reynaud, K. Mohajer, and
T. Stonehocker, “Systems and methods for providing
identification information in response to an audio
segment,” United States Patent US10 657 174B2, 2020,
SoundHound, Inc.
[12] K. Akesbi, “Audio denoising for robust audio
fingerprinting,” Master’s thesis, Ecole normale supérieure
Paris-Saclay, Paris, France, 2022.
[13] B. Gfeller, B. Aguera-Arcas, D. Roblek, J. D. Lyon,
J. J. Odell, K. Kilgour, M. Ritter, M. Sharifi,
M. Velimirović, R. Guo, and S. Kumar, “Now playing:
Continuous low-power music recognition,” in Proceedings
of the 31st Annual Conference on Neural Information
Processing Systems (NIPS 2017) Workshop: Machine
Learning on the Phone, Long Beach, CA, USA, 2017.
[14] G. Cortès, A. Ciurana, E. Molina, M. Miron, O. Meyers,
J. Six, and X. Serra, “BAF: An audio fingerprinting
dataset for broadcast monitoring,” in Proceedings
of the 23rd International Society for Music Information
Retrieval Conference (ISMIR 2022), Bengaluru, India,
2022, pp. 908–916.
[15] A. Wang, “The Shazam music recognition service,”
Communications of the ACM, vol. 49, no. 8, pp. 44–48,
2006.
[16] S. Bilobrov, “Indexing based on time-variant
transforms of an audio signal’s spectrogram,” United States
Patent US10418 051B2, 2019, Facebook, Inc.
[17] M. Pfister, R. Michael, M. Boll, C. Körfer, K. Rieck,
and D. Arp, “Listening between the bits: Privacy leaks
in audio fingerprints,” in Proceedings of the
International Conference on Detection of Intrusions and
Malware, and Vulnerability Assessment, Cham, 2024, pp.
184–204.
[18] D. Schwarz and D. Fourer, “UnmixDB: A dataset for
DJ-mix information retrieval,” in Proceedings of the 19th
International Society for Music Information Retrieval
Conference (ISMIR 2018), Paris, France, 2018.
[19] R. Sonnleitner, “Audio identification via fingerprinting.
Achieving robustness to severe signal modifications,”
PhD thesis, Johannes Kepler University Linz, Linz,
Austria, 2017.
[20] C. R. Qi, L. Yi, H. Su, and L. J. Guibas,
“PointNet++: Deep hierarchical feature learning on point sets
in a metric space,” in Proceedings of the 31st
International Conference on Neural Information Processing
Systems (NIPS 2017), Long Beach, CA, USA, 2017, pp.
5105–5114.
[21] R. Sonnleitner and G. Widmer, “Robust quad-based
audio fingerprinting,” IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 24, no. 3, pp.
409–421, 2016.
[22] S. Baluja and M. Covell, “Waveprint: Efficient
wavelet-based audio fingerprinting,” Pattern Recognition,
vol. 41, no. 11, pp. 3467–3480, 2008.
[23] S. Fenet, G. Richard, and Y. Grenier, “A scalable
audio fingerprint method with robustness to
pitch-shifting,” in Proceedings of the 12th International
Society for Music Information Retrieval Conference
(ISMIR 2011), Miami, Florida, USA, 2011, pp. 121–126.
[24] H. Son, S. Byun, and S. Lee, “A robust audio
fingerprinting using a new hashing method,” IEEE Access,
vol. 8, pp. 172 343–172 351, 2020.
[25] M. Ramona and G. Peeters, “Audioprint: An efficient
audio fingerprint system based on a novel cost-less
synchronization scheme,” in IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP 2013). Vancouver, BC, Canada: IEEE, 2013,
pp. 818–822.
[26] J. Haitsma, T. Kalker, and J. Oostveen, “Robust
audio hashing for content identification,” in Content Based
Multimedia Indexing, Brescia, Italy, 2001.
[27] X. Anguera, A. Garzon, and T. Adamek, “MASK:
Robust local features for audio fingerprinting,”
in Proceedings of the 2012 IEEE International
Conference on Multimedia and Expo, ICME.
Melbourne, Australia: IEEE Computer Society,
Jul. 2012, pp. 455–460. [Online]. Available:
https://doi.org/10.1109/ICME.2012.137
[28] E. Dupraz and G. Richard, “Robust frequency-based
audio fingerprinting,” in Proceedings of the IEEE
International Conference on Acoustics, Speech, and
Signal Processing (ICASSP 2010), Dallas, Texas, USA,
2010, pp. 281–284.
[29] A. Agarwaal, P. Kanaujia, S. S. Roy, and S. Ghose,
“Robust and lightweight audio fingerprint for
automatic content recognition,” 2023. [Online]. Available:
https://arxiv.org/abs/2305.09559
[30] R. Sonnleitner and G. Widmer, “Quad-based audio
fingerprinting robust to time and frequency scaling,” in
Proceedings of the 17th International Conference on
Digital Audio Effects (DAFx-14), Erlangen, Germany,
Sep. 2014, pp. 173–180.
[31] M. Malekesmaeili and R. K. Ward, “A local
fingerprinting approach for audio copy detection,” Signal
Process., vol. 98, pp. 308–321, 2014.
[32] J.-Y. Lee and H.-G. Kim, “Audio fingerprinting using
a robust hash function based on the MCLT peak-pair,”
The Journal of the Acoustical Society of Korea, vol. 34,
no. 2, pp. 157–162, 2015.
[33] Z. Yu, X. Du, B. Zhu, and Z. Ma, “Contrastive
unsupervised learning for audio fingerprinting,” Computing
Research Repository (CoRR), 2020.
[34] A. Singh, K. Demuynck, and V. Arora,
“Attention-based audio embeddings for query-by-example,” in
Proceedings of the 23rd International Society for Music
Information Retrieval Conference (ISMIR 2022),
Bengaluru, India, 2022, pp. 52–58.
[35] X. Wu and H. Wang, “Asymmetric contrastive learning
for audio fingerprinting,” IEEE Signal Process. Lett.,
vol. 29, pp. 1873–1877, 2022.
[36] A. Singh, K. Demuynck, and V. Arora,
“Simultaneously learning robust audio embeddings and balanced
hash codes for query-by-example,” in IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP 2023), Rhodes Island, Greece,
2023, pp. 1–5.
[37] Y. Fujita and T. Komatsu, “Audio fingerprinting with
holographic reduced representations,” in 25th Annual
Conference of the International Speech Communication
Association (Interspeech 2024), Kos, Greece,
2024.
[38] D. Lang, D. W. Hogg, K. Mierle, M. Blanton, and
S. Roweis, “Astrometry.net: Blind astrometric
calibration of arbitrary astronomical images,” The
Astronomical Journal, vol. 139, no. 5, p. 1782, 2010.
[39] A. Bhattacharjee, S. Singh, and E. Benetos, “GraFPrint:
A GNN-based approach for audio identification,” in
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP 2025), Hyderabad,
India, 2025.
[40] S. Yao, B. Niu, and J. Liu, “Enhancing sampling and
counting method for audio retrieval with time-stretch
resistance,” in 2018 IEEE Fourth International
Conference on Multimedia Big Data (BigMM), 2018.
[41] A. Báez-Suárez, N. Shah, J. A. Nolazco-Flores,
S.-H. S. Huang, O. Gnawali, and W. Shi, “SAMAF:
Sequence-to-sequence autoencoder model for audio
fingerprinting,” ACM Transactions on Multimedia
Computing, Communications, and Applications
(TOMM), vol. 16, no. 2, 2020.
[42] J. George and A. Jhunjhunwala, “Scalable and robust
audio fingerprinting method tolerable to
time-stretching,” in 2015 IEEE International Conference on
Digital Signal Processing (DSP), 2015, pp. 436–440.
[43] C. R. Qi, H. Su, M. Kaichun, and L. J. Guibas,
“PointNet: Deep learning on point sets for 3D classification
and segmentation,” in 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2017,
pp. 77–85.
[44] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton,
“A simple framework for contrastive learning of visual
representations,” in International Conference on
Machine Learning (ICML), PMLR, 2020, pp. 1597–1607.
[45] J. Johnson, M. Douze, and H. Jégou, “Billion-scale
similarity search with GPUs,” IEEE Transactions on
Big Data, vol. 7, no. 3, pp. 535–547, 2019.
[46] M. Defferrard, K. Benzi, P. Vandergheynst, and
X. Bresson, “FMA: A dataset for music analysis,” in
Proceedings of the 18th International Society for Music
Information Retrieval Conference (ISMIR 2017),
Suzhou, China, 2017, pp. 316–323.
[47] D. P. Kingma and J. Ba, “Adam: A method for
stochastic optimization,” in 3rd International Conference on
Learning Representations (ICLR 2015), San Diego,
CA, USA, 2015.
