Refining Music Sample Identification With a Self-Supervised Graph Neural Network

Author: Aditya Bhattacharjee; Ivan Meresman Higgs; Mark Sandler; Emmanouil Benetos

Publisher: Zenodo

DOI: 10.5281/zenodo.17706498

Source: https://zenodo.org/records/17706498/files/000059.pdf

REFINING MUSIC SAMPLE IDENTIFICATION WITH A
SELF-SUPERVISED GRAPH NEURAL NETWORK
Aditya Bhattacharjee¹, Ivan Meresman Higgs¹, Mark Sandler¹, Emmanouil Benetos¹
¹Queen Mary University of London, UK
{a.bhattacharjee, i.meresman-higgs, mark.sandler, emmanouil.benetos}@qmul.ac.uk
ABSTRACT
Automatic sample identification (ASID) - the detection and identification of portions of audio recordings that have been reused in new musical works - is an essential but challenging task in the field of audio query-based retrieval. While a related task, audio fingerprinting, has made significant progress in accurately retrieving musical content under “real world” (noisy, reverberant) conditions, ASID systems struggle to identify samples that have undergone musical modifications. Thus, a system robust to common music production transformations such as time-stretching, pitch-shifting, effects processing, and underlying or overlaying music is an important open challenge. In this work, we propose a lightweight and scalable encoding architecture employing a Graph Neural Network within a contrastive learning framework. Our model uses only 9% of the trainable parameters compared to the current state-of-the-art system while achieving comparable performance, reaching a mean average precision (mAP) of 44.2%.
To enhance retrieval quality, we introduce a two-stage approach consisting of an initial coarse similarity search for candidate selection, followed by a cross-attention classifier that rejects irrelevant matches and refines the ranking of retrieved candidates - an essential capability absent in prior models. In addition, as queries in real-world applications are often short in duration, we benchmark our system for short queries using new fine-grained annotations for the Sample100 dataset, which we publish as part of this work.
1. INTRODUCTION
Sampling is a musical technique that “incorporates portions of existing sound recordings into a newly collaged composition” [1]. The samples often undergo significant modification during this creative process: they may be pitch-shifted, time-stretched and heavily processed with audio effects (henceforth sampling transformations), and are typically combined with other musical elements, creating “musical interference” which makes identification difficult even for human experts. The relevance of this practice is highlighted as, since the mass popularisation of hip hop, disco and electronic dance music, this kind of “transformative appropriation” has become one of the most important techniques for composers and songwriters [2].
© A. Bhattacharjee, I. Meresman Higgs, M. Sandler, and E. Benetos. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: A. Bhattacharjee, I. Meresman Higgs, M. Sandler, and E. Benetos, “Refining music sample identification with a self-supervised graph neural network”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Automatic sample identification (ASID) is a crucial task in music retrieval: given an audio query - either a small segment or an entire music track - the goal is to retrieve the sample source from a database of music recordings, even if sampling transformations have been applied. The potential to substantially impact domains such as attribution and copyright highlights the relevance of this task to music creators and rights holders, as well as music information retrieval (MIR) researchers.
This task is particularly challenging as sampling transformations can drastically alter the audio features while maintaining perceptual similarity. A reasonable approach is to take cues from deep learning-based audio fingerprinting research, learning metrics that allow for a similarity-based search and retrieval system. Additionally, augmentations in the training pipeline allow models to learn invariance to sampling transformations employed in music production. Recent audio fingerprinting research has successfully employed Graph Neural Networks (GNNs), achieving state-of-the-art results while using compact architectures that facilitate efficient training, which informs this work.
Progress in ASID has been hindered by the limited availability of well-annotated datasets that reflect real-world sampling practices. The Sample100 dataset [3] is the only publicly available dataset of annotations specifically addressing the presence of samples in commercially produced songs. In this paper we present a revised version of this dataset, annotated by experts to include more fine-grained temporal annotations of the samples, as well as additional comments, time-stretching estimates and instrumentation information. We use these new annotations to report segment-wise hit-rates and to analyse the performance of our system in relation to the type of sample and augmentations performed during the artistic process.
Our key contributions are as follows:
• We propose the adaptation of a lightweight Graph Neural Network as the neural encoder for ASID.
• We introduce a binary cross-attention classifier to facilitate an accurate ranking and refining of retrieved audio fingerprints.
• We contribute new fine-grained temporal annotations for the Sample100 dataset, and evaluate our model’s performance on short-query retrieval, demonstrating superior top-N hit-rates compared to the baseline.
• We present a detailed analysis of retrieval performances on different types of samples and discuss the viability of the proposed framework.
Our code as well as the newly extended Sample100 dataset have been made available for reproducibility¹.
2. RELATED WORKS
Despite the ASID task being a relevant and challenging one for the MIR community, there have been few attempts to tackle it. Foundational work by Van Balen et al. [3] introduced the Sample100 dataset and proposed the adaptation of a spectral peak-based audio fingerprinting framework to make it robust to pitch-shifting. Gururani et al. [4] proposed a system inspired by music cover identification, using Non-negative Matrix Factorization to create templates of the samples and Dynamic Time Warping to achieve a detection algorithm that could be robust to time-shifting. Both of these works focus primarily on robustness against individual sampling transformations but neither addresses the broader range typically encountered in real-world scenarios. Other traditional fingerprinting methods that were effective for audio retrieval tasks, such as audfprint [5] and Panako [6], have also been tested on this task [7] and proved insufficient for ASID, struggling with the challenges of combined sampling transformations and interfering “musical noise” (the overlying musical composition).
More recently, the first deep learning-based approach by Cheston et al. [7] achieved state-of-the-art performance on the Sample100 dataset using a CNN architecture (ResNet50-IBN) previously used for cover song identification [8] and exploiting music source separation to create synthetic training data. This approach serves as our baseline and demonstrates both the feasibility and remaining challenges of applying deep learning to ASID.
Current state-of-the-art audio retrieval systems predominantly use CNNs [8–10] or transformers [11] trained with contrastive learning objectives. While effective, these architectures typically require significant computational resources and large training batches, limiting their practical viability. These limitations can be addressed by more parameter-efficient approaches based on Graph Neural Networks (GNNs), which excel at capturing complex structural patterns in non-Euclidean spaces [12]. GNNs have proven effective for audio tasks where temporal and spectral relationships are important, including audio fingerprinting [13] and audio tagging [14], by effectively modelling local and global interactions between time-frequency regions.
3. METHODOLOGY
ASID involves two categories of audio recordings: a reference, an original music recording, and a query, a new recording that incorporates (i.e., samples) parts of the reference. For training, we generate query-reference pairs by re-mixing source separated stems as proposed in [7]. For evaluation, our retrieval methodology employs a two-stage process: initial candidate selection via approximate nearest-neighbour search, followed by fine-grained ranking with the cross-attention classifier. Figure 1 illustrates the complete retrieval pipeline, detailing how reference matches are retrieved and ranked for a given query.
¹ https://github.com/chymaera96/NeuralSampleID
3.1 Input Features
Our system employs log-scaled Mel-spectrograms as input features. Given an audio waveform $y \in \mathbb{R}^{t}$, sampled at 16 kHz, we first compute its Mel-spectrogram representation $X \in \mathbb{R}^{F \times T}$. Here, $F$ denotes the number of Mel-frequency bins, and $T$ is the number of temporal frames.
During training, we randomly sample short audio segments of fixed duration $t_{seg}$ from each recording in the training dataset and use them to generate proxy query-reference pairs (see Section 3.4.1). For retrieval, we use real query and reference audio recordings which are segmented into overlapping segments of length $t_{seg}$. Section 5 details the configuration of the input features and hyperparameters.
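To make the segmentation step concrete, the following minimal Python sketch cuts a 16 kHz signal into overlapping windows, using the 4 s window and 0.5 s hop listed in Table 1. The function name `segment_signal` and the decision to drop a trailing partial window are assumptions made for illustration; they are not taken from the authors' code.

```python
def segment_signal(samples, sample_rate=16000, window_s=4.0, hop_s=0.5):
    """Split a 1-D sequence of samples into overlapping fixed-length windows.

    Windows that would run past the end of the signal are dropped
    (an assumption of this sketch).
    """
    win = int(round(window_s * sample_rate))   # window length in samples
    hop = int(round(hop_s * sample_rate))      # hop length in samples
    return [samples[start:start + win]
            for start in range(0, len(samples) - win + 1, hop)]


# A 10-second dummy signal at 16 kHz.
signal = list(range(16000 * 10))
segments = segment_signal(signal)
```

With these settings consecutive windows overlap by 3.5 s, so a short sample has several chances to fall well inside a window.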
3.2 Encoder Architecture
Our GNN encoder builds upon the architecture introduced in [13]. Given an input Mel-spectrogram $X$, we first represent it as a set of three-dimensional time-frequency points, each described by its time index, frequency bin index, and amplitude value. From this initial representation, we produce overlapping patch embeddings by aggregating local neighbourhoods of time-frequency points into latent vectors. Formally, each resulting patch embedding is represented by:
$$f : \mathbb{R}^{3 \times p} \rightarrow \mathbb{R}^{d}, \quad (1)$$
where $p$ denotes the number of neighbouring points aggregated per patch, and $d$ is the dimensionality of the latent embedding. These patch embeddings serve directly as nodes in the subsequent graph structure.
Next, we construct a k-nearest neighbours (kNN) graph from these node embeddings. Specifically, for each node embedding $x_i$, we identify its $k$ nearest neighbours based on cosine similarity in the latent embedding space. The resulting edges represent latent structural relationships among Mel-spectrogram patches.
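The graph construction can be sketched in a few lines of plain Python. This is an illustrative brute-force version under stated assumptions: the function names are hypothetical, self-loops are excluded, and ties are broken by node index. A practical implementation would use batched tensor operations instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_graph(nodes, k):
    """For each node, return the indices of its k most cosine-similar other nodes."""
    neighbours = []
    for i, x_i in enumerate(nodes):
        sims = [(cosine_similarity(x_i, x_j), j)
                for j, x_j in enumerate(nodes) if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))   # most similar first, ties by index
        neighbours.append([j for _, j in sims[:k]])
    return neighbours


nodes = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = knn_graph(nodes, k=1)
```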
Node embeddings are then iteratively refined via graph convolution (GraphConv) layers. For each node embedding $x_i$, we aggregate information from its neighbours $x_j$, where $j \in \mathcal{N}(x_i)$. Formally, the update rule is given by:
$$y_i = x_i + \sigma\big(\mathrm{AGG}(\{x_j : j \in \mathcal{N}(x_i)\})\big), \quad (2)$$
where $y_i$ is the updated embedding, $\sigma$ denotes a nonlinear activation function, $\mathcal{N}(x_i)$ is the set of neighbours of node $x_i$, and AGG represents an aggregation operation summarizing relevant information from neighbouring nodes.
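A minimal sketch of the residual update in Eq. (2) follows. The text does not specify AGG or $\sigma$, so this example assumes mean aggregation and a ReLU activation, and it omits the learnable weights a real GraphConv layer would carry. All function names are hypothetical.

```python
def relu(vector):
    """Element-wise ReLU, standing in for the unspecified nonlinearity sigma."""
    return [max(0.0, a) for a in vector]


def mean_aggregate(vectors):
    """Element-wise mean of a non-empty list of vectors (assumed choice of AGG)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def graph_conv_update(nodes, neighbours):
    """Apply y_i = x_i + sigma(AGG({x_j : j in N(x_i)})) to every node."""
    updated = []
    for x_i, neighbour_ids in zip(nodes, neighbours):
        aggregated = mean_aggregate([nodes[j] for j in neighbour_ids])
        activated = relu(aggregated)
        updated.append([a + b for a, b in zip(x_i, activated)])  # residual connection
    return updated


nodes = [[1.0, -1.0], [3.0, -2.0], [-1.0, 4.0]]
neighbours = [[1, 2], [0], [0, 1]]
refined = graph_conv_update(nodes, neighbours)
```

The residual term keeps each node's own embedding intact, so stacking several such layers adds neighbourhood context without overwriting the original patch information.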
Through iterative aggregation, each node embedding progressively encodes increasingly rich contextual and structural information. The GNN encoder comprises multiple blocks of GraphConv layers, each followed by feedforward network (FFN) layers. At the beginning of each block, the kNN graph is dynamically reconstructed to reflect the updated node embeddings.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[Figure 1: retrieval pipeline diagram. Panel A: query Mel-spectrogram segments, GNN encoder, L2 projection layer, query embeddings matched against reference embeddings from the database of reference songs, list of candidates. Panel B: GNN encoder, MHCA classifier (MHCA + mean, context vector), refine + rank, final ranked output.]
Figure 1. Illustrated ASID methodology: (A) Given a query, we compute segment-level embeddings (fingerprints), matched to reference embeddings via approximate nearest-neighbour (ANN) search; based on which, candidate songs are retrieved from the reference database through a lookup process (dotted arrows). (B) A multi-head cross-attention (MHCA) classifier refines and ranks candidates using node embedding matrices $NM_q$ (query) and $NM_r$ (references).
The output of the GNN encoder is a set of refined node embeddings, collectively referred to as the node embedding matrix, which serve as input features to the cross-attention classifier. Finally, these node embeddings are average-pooled and projected into audio fingerprints. Both latent embeddings are used in the subsequent retrieval refinement stage.
For a comprehensive discussion of architectural details and design considerations, we refer readers to [13].
3.3 Cross-Attention Classifier
To capture the latent relationships between these two sets of node embeddings, we introduce a multi-head cross-attention classifier. Given a query and a reference audio segment, we first compute the node embedding matrices $q \in \mathbb{R}^{N \times d_n}$ and $r \in \mathbb{R}^{N \times d_n}$, respectively. Here, $N$ is the number of nodes, and $d_n$ is the dimensionality of each node embedding. We compute attention-weighted embeddings as follows:
$$C = \mathrm{MHA}(q, r, r), \quad (3)$$
where $\mathrm{MHA}(\cdot)$ denotes standard multi-head attention [15]. The resulting embedding matrix $C \in \mathbb{R}^{N \times d_n}$ is an attention-weighted transformation of $r$, where attention is computed between corresponding node embeddings in $q$ and $r$. $C$ is then aggregated by mean pooling, producing a single context vector $c \in \mathbb{R}^{d_n}$:
$$c = \frac{1}{N} \sum_{j=1}^{N} C_j, \quad (4)$$
where $C_j$ is the context vector of the $j$-th node embedding. Finally, the context vector $c$ is transformed by a shallow nonlinear classifier into a scalar confidence score $s$:
$$s = \sigma(w^{T} c + b), \quad (5)$$
where $w \in \mathbb{R}^{d_n}$, $b \in \mathbb{R}$ are learnable parameters, and $\sigma$ denotes the sigmoid activation function. The scalar $s$ indicates the confidence that the query and reference segments match. As shown in Figure 1, at retrieval time, this is used as a ranking mechanism as well as a measure for rejecting low-confidence candidates.
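The data flow of Eqs. (3) to (5) can be traced with a small pure-Python sketch. It is deliberately simplified: a single attention head with identity projections stands in for standard multi-head attention, and the weights `w` and `b` are supplied by hand rather than learned. Function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(a - peak) for a in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q, r):
    """Single-head scaled dot-product attention: queries from q, keys and values from r."""
    d = len(q[0])
    output = []
    for q_i in q:
        scores = [sum(a * b for a, b in zip(q_i, r_j)) / math.sqrt(d) for r_j in r]
        weights = softmax(scores)
        output.append([sum(w * r_j[col] for w, r_j in zip(weights, r))
                       for col in range(d)])
    return output


def match_score(q, r, w, b):
    """Mean-pool the attended matrix (Eq. 4) and apply a sigmoid classifier (Eq. 5)."""
    attended = cross_attention(q, r)
    n = len(attended)
    context = [sum(column) / n for column in zip(*attended)]
    logit = sum(w_k * c_k for w_k, c_k in zip(w, context)) + b
    return 1.0 / (1.0 + math.exp(-logit))


q = [[1.0, 0.0], [0.0, 1.0]]
r = [[2.0, 0.0], [0.0, 2.0]]
score = match_score(q, r, w=[1.0, 1.0], b=0.0)
```

Because each output row is a convex combination of the rows of `r`, the pooled context vector summarises the reference as seen through the query.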
3.4 Training Pipeline
Our proposed approach involves two distinct training stages: a self-supervised contrastive learning stage for embedding training and a subsequent binary classification stage for the downstream cross-attention classifier. Both stages use identical procedures to produce proxy query-reference pairs from the source-separated training data, closely following the methodology established in prior work [7].
3.4.1 Query-Reference Pair Generation
Let us denote the stems extracted from the training audio source $x$ as a set $S = \{s_1, s_2, ..., s_K\}$, where each stem $s_k$ corresponds to a source-separated audio component (e.g., vocals, drums, bass). Given a random timestamp segment $t_s$ starting at $t$ and of length $\Delta t$, we first extract corresponding audio segments from each stem as
$$s_k(t_s) = s_k[t, t + \Delta t], \quad (6)$$
resulting in the set $\{s_1(t_s), s_2(t_s), ..., s_K(t_s)\}$. These stem segments are partitioned randomly into two subsets, $S_q$ and $S_r$, with $S_q \cup S_r = S$ and $S_q \cap S_r = \emptyset$.
A query segment $x_q$ is formed as the sum of stems in $S_q$:
$$x_q = \sum_{s \in S_q} s(t_s). \quad (7)$$
The reference segment $x_r$ is generated by mixing an augmented version of the query segment with the remaining stems:
$$x_r = \mathrm{aug}_2\Big(\mathrm{aug}_1(x_q) + \sum_{s \in S_r} s(t_s)\Big). \quad (8)$$
Here, $\mathrm{aug}_1$ and $\mathrm{aug}_2$ represent audio effects functions applied sequentially to simulate realistic music production transformations. The effect parameters are sampled from a uniform distribution. Specifically,
• $\mathrm{aug}_1$: time-offset (±250 ms) and gain variation (±10 dB).
• $\mathrm{aug}_2$: pitch-shifting (±3 semitones) and time-stretching (70 - 150%).
The source-separation system (see Section 4.1) allows the extraction of musically salient sources that can constitute a sample. The pair $(x_q, x_r)$ constitutes a positive query-reference example; $x_q$ is a proxy for a query containing an instance of a sample, and $x_r$ represents a reference example which contains the sample that is creatively distorted and is present in a mix along with other musical elements.
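The pair-generation procedure of Eqs. (6) to (8) can be sketched as below. This is a partial illustration under stated assumptions: only the gain part of $\mathrm{aug}_1$ is actually applied, the time offset, pitch shift and time stretch are merely sampled (real implementations need a DSP library), the 70 - 150% stretch range is read as a rate between 0.70 and 1.50, and both stem subsets are forced to be non-empty. All names are hypothetical.

```python
import random


def sample_augmentation_params(rng):
    """Draw effect parameters uniformly from the ranges given in the text."""
    return {
        "offset_ms": rng.uniform(-250.0, 250.0),
        "gain_db": rng.uniform(-10.0, 10.0),
        "pitch_semitones": rng.uniform(-3.0, 3.0),
        "stretch_rate": rng.uniform(0.70, 1.50),
    }


def partition_stems(stem_names, rng):
    """Randomly split stems into disjoint query and reference subsets (needs >= 2 stems)."""
    names = list(stem_names)
    rng.shuffle(names)
    cut = rng.randint(1, len(names) - 1)   # assumption: neither subset is empty
    return names[:cut], names[cut:]


def mix(stems, names):
    """Sample-wise sum of the selected stem segments."""
    return [sum(values) for values in zip(*(stems[n] for n in names))]


def apply_gain(x, gain_db):
    """Scale a signal by a gain expressed in decibels."""
    factor = 10.0 ** (gain_db / 20.0)
    return [factor * a for a in x]


def make_pair(stems, rng):
    """Build (x_q, x_r) from stem segments; aug2 is left as an identity placeholder."""
    query_names, reference_names = partition_stems(stems.keys(), rng)
    params = sample_augmentation_params(rng)
    x_q = mix(stems, query_names)
    rest = mix(stems, reference_names)
    x_r = [a + b for a, b in zip(apply_gain(x_q, params["gain_db"]), rest)]
    return x_q, x_r, query_names, reference_names, params


stems = {"vocals": [1.0, 2.0], "drums": [0.5, 0.5], "bass": [0.0, 1.0], "other": [2.0, 0.0]}
x_q, x_r, q_names, r_names, params = make_pair(stems, random.Random(0))
```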
3.4.2 Contrastive Learning
We train the encoder using a self-supervised contrastive learning framework. Given a batch of $B$ pairs $\{(x_q^i, x_r^i)\}_{i=1}^{B}$, we obtain their corresponding audio fingerprints $\{z_q^i, z_r^i\}_{i=1}^{B}$ from the encoder. We then employ the Normalized Temperature-scaled Cross Entropy (NT-Xent) loss [16] to maximize similarity between embeddings from positive pairs, while minimizing cosine similarity for embeddings from all other pairs in the batch.
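For readers unfamiliar with NT-Xent, here is a small reference sketch in plain Python, following the usual SimCLR formulation in which every one of the $2B$ fingerprints acts as an anchor. The temperature default of 0.05 comes from Table 1; the function names and the list-based data layout are assumptions of this example.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector]


def nt_xent(z_q, z_r, temperature=0.05):
    """NT-Xent loss over B positive pairs (z_q[i], z_r[i]), averaged over 2B anchors."""
    batch = len(z_q)
    z = [l2_normalize(v) for v in z_q + z_r]
    total = 0.0
    for i in range(2 * batch):
        positive = (i + batch) % (2 * batch)      # index of the paired fingerprint
        logits = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(2 * batch) if j != i]
        positive_logit = sum(a * b for a, b in zip(z[i], z[positive])) / temperature
        peak = max(logits)                        # log-sum-exp for stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - positive_logit
    return total / (2 * batch)


aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each query fingerprint coincides with its own reference and large when it coincides with another pair's reference.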
3.4.3 Downstream Classifier Training
The cross-attention classifier is trained as a downstream task, with the encoder parameters frozen after the contrastive learning stage. For this stage, we discard the previously used projection network and directly use the node embedding matrix obtained from the encoder.
Training batches consist of query-reference pairs generated identically to the contrastive learning stage. Let $Q = \{q_i\}_{i=1}^{B_c}$ and $R = \{r_j\}_{j=1}^{B_c}$ represent query and reference embedding sets in a batch, respectively, where each embedding $q_i, r_j \in \mathbb{R}^{N \times d_n}$, and $B_c$ is the batch size.
Positive examples correspond to pairs of identical indices:
$$P = \{(q_i, r_j) \mid i = j\}, \quad (9)$$
while negative examples are selected from pairs with non-identical indices via hard-negative mining. Specifically, we select negative pairs as the subset of non-positive pairs that maximize audio fingerprint similarity, thus being the most confounding:
$$N = \{(q_i, r_j^-) \mid i \neq j,\ r_j^- = \arg\max_{j,\, j \neq i} \mathrm{sim}(z_i, z_j)\}. \quad (10)$$
We maintain a fixed ratio of 1:3 of positive to negative pairs within each training batch. The classifier outputs a scalar prediction $p \in [0, 1]$, trained with the binary cross-entropy (BCE) loss, where the label for pairs $(q_i, r_j) \in P$ is 1 and for pairs $(q_i, r_j^-) \in N$ is 0.
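One way to realise the 1:3 positive-to-negative ratio is to keep, for every query, its own reference plus the three most similar non-matching references. The sketch below illustrates that reading; how the authors actually distribute negatives across the batch is not stated, so the per-query top-3 rule, the tie-breaking by index and the function names are assumptions.

```python
import math


def mine_pairs(sim, negatives_per_positive=3):
    """Return (query_index, reference_index, label) triples.

    sim[i][j] is the fingerprint similarity between query i and reference j.
    Label 1 marks the matching pair (i == j); label 0 marks hard negatives.
    """
    batch = len(sim)
    pairs = []
    for i in range(batch):
        pairs.append((i, i, 1))
        others = sorted((j for j in range(batch) if j != i),
                        key=lambda j: (-sim[i][j], j))   # most confusable first
        for j in others[:negatives_per_positive]:
            pairs.append((i, j, 0))
    return pairs


def bce(p, label, eps=1e-12):
    """Binary cross-entropy for one prediction p in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Toy similarity matrix: references with nearby indices are more confusable.
sim = [[1.0 - abs(i - j) / 10 for j in range(5)] for i in range(5)]
pairs = mine_pairs(sim)
```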
3.5 Retrieval and Evaluation
Our retrieval system, illustrated in Figure 1, operates in two sequential stages:
• Approximate nearest-neighbour (ANN) search to do a fast and coarse search of candidate reference audio fingerprints from the database.
• Cross-attention classifier scoring to refine the candidate set and rank them based on relevance.
For every overlapping segment (computed as described in Section 3.1) in the query, we probe the reference database for matches based on the similarity of the audio fingerprints; thus yielding a set of candidate matches.
In the second stage, we utilize the cross-attention classifier to refine these candidate matches. For each candidate segment retrieved, we extract its corresponding node embedding matrix. Given a query recording, represented as a sequence of node embedding matrices, we compute classifier scores $p(q, r)$ for each pair of query $q$ and retrieved candidate $r$. The final candidate segment-level confidence score is determined by selecting the maximum classifier score over all segments of the query:
$$p_{clf}(q, r) = \max_{q_i \in Q} p(q_i, r). \quad (11)$$
We reject candidate segments with confidence scores $p_{clf}(q, r) < 0.5$. Subsequently, we aggregate these accepted segment-level scores to obtain a song-level retrieval score. Specifically, for each unique reference recording, we sum the segment-level confidence scores:
$$P_{song}(q, R) = \sum_{r \in R} p_{clf}(q, r), \quad (12)$$
where $R$ denotes the set of retrieved segments belonging to the same reference song. The resulting aggregated scores $P_{song}(q, R)$ provide a robust ranking of candidate songs for each query recording.
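The second-stage aggregation in Eqs. (11) and (12) amounts to a max, a threshold and a per-song sum. The following sketch assumes the classifier scores have already been computed and are stored in a dictionary keyed by candidate segment; the data layout, identifiers and function name are illustrative assumptions.

```python
from collections import defaultdict


def rank_songs(scores, segment_to_song, threshold=0.5):
    """Rank reference songs for one query recording.

    scores[segment] is a list of classifier scores p(q_i, r), one per query segment.
    """
    song_scores = defaultdict(float)
    for segment, per_query_scores in scores.items():
        p_clf = max(per_query_scores)                      # Eq. (11)
        if p_clf < threshold:
            continue                                       # reject low-confidence candidate
        song_scores[segment_to_song[segment]] += p_clf     # Eq. (12)
    # Highest song-level score first; ties broken by song identifier.
    return sorted(song_scores.items(), key=lambda item: (-item[1], item[0]))


scores = {
    "a1": [0.2, 0.9],
    "a2": [0.6, 0.4],
    "b1": [0.7, 0.1],
    "c1": [0.3, 0.4],
}
segment_to_song = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
ranking = rank_songs(scores, segment_to_song)
```

Song C never appears in the ranking because its only candidate segment falls below the 0.5 threshold, which is how the rejection capability shows up at song level.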
4. DATASET
4.1 Training Dataset
For training, we use the Free Music Archive (FMA) medium dataset [17], which contains 25,000 30-second tracks across 16 genres. We pre-processed this dataset to make it suitable for our stem-mixing contrastive learning approach. We used the current SOTA algorithm “Beat This” [18] to perform beat tracking and use this as a proxy for musical rhythmic regularity in the FMA tracks, excluding 2,533 tracks with fewer than 32 beats after the first downbeat. This filtering ensured that our training data consisted only of musical content with some level of rhythmic structure.
To generate the stems that will be used for the synthetic training pairs, we applied source separation using the Hybrid Transformer Demucs model (htdemucs) [19] to each usable track, separating them into drums, bass, vocals, and “other” stems.
4.2 Evaluation Dataset
For the evaluation of our system, we use the Sample100 dataset [3]. The dataset consists of 75 full-length hip-hop recordings (queries) containing samples from 68 full-length songs (references) across a variety of genres, with R&B/Soul representing the majority [7]. It contains 106 sample relationships and a total of 137 sample occurrences, as some queries use multiple samples and some references appear in multiple queries. To challenge retrieval systems, the dataset includes 320 additional “noise” tracks with a similar genre distribution, which are not sampled in any query. Because samples are typically created from a short segment of a song, only a small portion of each candidate track is sampled and present in queries - sample lengths range from just one second to 26 seconds. The samples represent real-world musical “transformative appropriation” [2], including tonal samples (riffs), percussive drum breaks (beats), and 1-note micro-samples. Non-musical samples (e.g. film dialogue) are not included.
To enable more detailed evaluation, we present an extended version of the Sample100 dataset with fine-grained temporal annotations performed by expert musicians using Sonic Visualiser [20]. Unlike the original dataset, which only provided first occurrence timestamps at 1-second precision, our annotations include precise start and end times of all sample occurrences with ±250 ms resolution, transforming the dataset into a segment-wise evaluation resource. This improved temporal granularity allows for more accurate evaluation of ASID systems by testing with short query snippets from anywhere within the sampled material.
We further enrich the dataset by adding estimates of the time-stretching ratio between the reference and query tracks, as well as instrumentation (stem) annotations of both the original material and the interfering instruments in the query, and expanding the comments about the samples. The time-stretching ratio was calculated from the tempo of both query and reference segments, determined through a combination of automatic beat tracking [18] with manual verification. Stem annotations were performed by listening to the tracks and their source-separated stems to ensure accuracy. Relevant sample class counts are shown in Table 4, including a categorisation into substantial or minimal time-stretching. This new information will enable more nuanced analysis of our model’s performance across different types of sampling practices in Section 6.3.
5. EXPERIMENTAL SETUP
5.1 Hyperparameters and Configuration
Our experimental setup and hyperparameter choices are summarized in Table 1, with certain parameters detailed in the preceding sections. The contrastive learning stage was performed using an NVIDIA A100 GPU, with models trained for a maximum of 180 epochs; we employed early stopping based on validation performance. Training utilized the Adam optimizer coupled with a cosine annealing learning-rate scheduler. For the downstream cross-attention classifier, we trained for a maximum of 5 epochs using the Adam optimizer with a fixed learning rate, keeping the encoder parameters frozen to preserve the learned representations from the contrastive learning stage. For the ANN search algorithm, we use IVF-PQ [21], an efficient choice for retrieval tasks in large vector databases.
Hyperparameter                              Value
Sampling rate                               16,000 Hz
Log-power Mel-spectrogram size F × T        64 × 32
Fingerprint {window length, hop}            {4 s, 0.5 s}
Fingerprint dimension                       128
Node matrix dimension {N, d_n}              {32, 512}
Temperature τ                               0.05
Contrastive batch size B                    1024
Downstream batch size B_c                   32
Table 1. Experimental Configuration
5.2 Evaluation Metrics
The ASID task is fundamentally a retrieval problem, where the goal is to rank candidate audio segments based on their relevance to a query. Hence, we adopt mean average precision (mAP) [22] as our primary metric, where the query is computed from a full song containing a sample. The mean average precision (mAP) summarizes retrieval quality by aggregating the precision values at the ranks where relevant items are retrieved, averaged across all queries.
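As a worked illustration of the metric, the sketch below computes average precision for one ranked list and averages it over queries. It uses the common convention of dividing by the total number of relevant items, so relevant items that are never retrieved lower the score; whether the evaluation code behind the reported numbers uses exactly this convention is an assumption.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precisions = []
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant)


def mean_average_precision(queries):
    """queries is a list of (ranked_results, relevant_items) tuples, one per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in queries) / len(queries)


queries = [
    (["x", "a", "y", "b"], {"a", "b"}),   # relevant items at ranks 2 and 4
    (["a"], {"a"}),                       # relevant item at rank 1
]
result = mean_average_precision(queries)
```

For the first query the precision values are 1/2 at rank 2 and 2/4 at rank 4, giving an average precision of 0.5; the second query scores 1.0, so the mean is 0.75.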
Additionally, inspired by an established practice in audio fingerprinting literature [10], we report top-$N$ hit rates. Specifically, we measure the proportion of queries for which at least one correct sample is retrieved within the top $N$ ranked results. We do so for different query sizes (5 s to 20 s). This metric provides an intuitive indication of practical retrieval accuracy and the system’s efficacy for short queries.
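The top-$N$ hit rate defined above is simple to compute from ranked result lists. A minimal sketch follows, with a hypothetical function name and toy data:

```python
def top_n_hit_rate(results, n):
    """Percentage of queries with at least one correct item among the top n results.

    results is a list of (ranked_results, correct_items) tuples, one per query.
    """
    hits = sum(1 for ranked, correct in results if set(ranked[:n]) & set(correct))
    return 100.0 * hits / len(results)


results = [
    (["a", "b", "c"], {"c"}),   # correct item at rank 3
    (["x", "y", "z"], {"x"}),   # correct item at rank 1
    (["p", "q", "r"], {"s"}),   # correct item never retrieved
]
top1 = top_n_hit_rate(results, 1)
top3 = top_n_hit_rate(results, 3)
```

In this toy example one of three queries is a hit at $N = 1$ and two of three at $N = 3$.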
5.3 Baseline Framework
We compare our proposed system against the recent state-of-the-art baseline introduced by Cheston et al. [7]. Their framework employs a ResNet50-IBN architecture and utilizes a multi-task learning approach that jointly optimises a metric learning objective through triplet loss and an auxiliary classification task. This architecture has achieved state-of-the-art retrieval performance in terms of mean average precision (mAP). Due to practical computational constraints, we instead adopt and report results on a ResNet18-IBN model, which has a comparable number of parameters to our proposed GNN-based encoder. Apart from the model size, we closely adhere to the training procedures and evaluation methodology outlined in [7]. We also include their reported best performance for reference.
6. RESULTS AND DISCUSSION
6.1 Benchmarking
We present the performance comparison between our proposed GNN+MHCA architecture and the baseline in Table 2. Our model matches the reported performance of the much larger ResNet50-IBN model, and significantly outperforms the reimplemented baseline.
Model                                     # params   mAP
ResNet50-IBN (Cheston et al.)             222M       0.441
ResNet18-IBN (Baseline)                   34M        0.330
GNN (Ours), batch size = 1024             20M        0.416
GNN + MHCA (Ours), batch size = 256       20M        0.373
GNN + MHCA (Ours), batch size = 512       20M        0.411
GNN + MHCA (Ours), batch size = 1024      20M        0.442
Table 2. Performance of models on Sample100 dataset.
A key factor influencing our model’s performance is the batch size used during contrastive learning. Increasing the batch size from 256 to 1024 leads to an improvement of 6.9pp in mean average precision (mAP). This improvement occurs because larger batch sizes provide more negative samples per positive pair, enriching the diversity of the contrastive space. Consequently, the model learns embeddings that better discriminate between relevant and irrelevant examples. While larger batch sizes (e.g., 2048) were explored, the associated training time rendered hyperparameter tuning impractical, and thus these results are not reported.
Model                         Query length
                              5 s     7 s     10 s    15 s    20 s
Top-1 Hit Rate (%)
ResNet18-IBN (Baseline)       15.8    15.8    16.0    16.5    21.7
GNN+MHCA (Ours)               15.5    24.6    27.5    26.1    30.7
Top-3 Hit Rate (%)
ResNet18-IBN (Baseline)       24.3    24.3    29.2    26.1    32.3
GNN+MHCA (Ours)               19.1    32.1    38.3    40.9    50.3
Top-10 Hit Rate (%)
ResNet18-IBN (Baseline)       27.4    27.4    40.4    45.2    49.1
GNN+MHCA (Ours)               19.1    33.8    44.7    51.3    63.2
Table 3. Hit rates of our framework and baseline.
Table 3 shows our model’s performance on short queries, a common use case in real-world sample identification scenarios. While the hit rates for shorter queries are comparable to the baseline, our framework exhibits significantly superior performance for longer queries (for 20-second-long queries, a 9.0pp increase in top-1 hit rate and a 14.1pp increase in top-10 hit rate). The progressive improvement in hit rates with increasing query length shows that our approach effectively aggregates segment-level confidence scores to retrieve the correct reference song.
6.2 Retrieval Refinement via Cross-Attention Classifier
To examine the impact of the cross-attention classifier as a retrieval refinement step, we conduct an ablation study. Table 2 shows that incorporating the classifier (MHCA) to rank retrieved results improves mAP by 2.6pp, confirming the utility of this refinement stage. Additionally, to evaluate the classifier’s capability to reject irrelevant matches, we construct a balanced validation set comprising 300 positive query-reference pairs and 300 negative pairs drawn from the “noise” data described in Section 4.2. The classifier achieves an AUROC score of 0.776, indicating that it does not perfectly discriminate between genuine and confounding examples. Thus, the observed improvement in retrieval performance can be attributed to the combined effect of the two-stage retrieval process, rather than solely to the rejection capability of the classifier.
6.3 Performance by Sample Characteristics
To understand the performance of the model across sample characteristics, we computed the mAP for different categories of Sample100. As shown in Table 4, there is a modest performance gap between melodic/harmonic riff samples and percussive beat samples (the two 1-note samples were not taken into account). This may be attributed to the nature of beat samples, which consist primarily of drums that are often subject to overdubbing techniques where producers layer additional percussion elements, and that might also be buried deeper in the mix beneath other instrumentation, potentially making them less salient for the GNN to capture. Further analysis by looking at the specific instrumentation in the reference and query, or by applying source separation at detection time, is left for future work.
              Type                        Time stretching
              Riff     Beat     1-note    >5%      <5%
# samples     71       33       2         44       62
mAP           0.471    0.391    -         0.340    0.503
Table 4. Performance according to sample class.
A significant performance gap was observed in relation to time stretching, where we classified samples into those subjected to minimal time stretching (<5%) and those with significant time stretching (>5%). This 16.3pp difference in performance shows that although our model is robust to some degree of time-stretching, larger changes in tempo fundamentally alter the temporal relationships between audio features that our model relies on for identification.
7. CONCLUSION AND FUTURE WORK
This paper presents a lightweight GNN-based approach to automatic sample identification that achieves state-of-the-art performance while using only 9% of the parameters compared to previous methods. Our key contributions include adapting a GNN encoder for sample identification, introducing a cross-attention classifier for refining retrieval results, and extending the Sample100 dataset with fine-grained temporal annotations that enable granular evaluation.
Our results show that the proposed framework achieves a mAP of 44.2%, with strong performance on melodic-harmonic samples and samples with low time-stretching. Our framework’s cross-attention stage is useful in the refining of the ranking, and introduces rejection capabilities.
Future work should explore end-to-end training methods robust to sampling transformations, and explore integrating source separation during inference to improve performance on heavily masked samples. Newly available annotations can be leveraged for analysis of how specific attributes of samples (instrumentation type, interpolation, genre) affect identification accuracy, and we hope the release of our extended Sample100 dataset will aid the development of specialized techniques that address the most challenging cases in ASID.
8. ACKNOWLEDGMENTS
We thank Matthew Alan Walford for his contributions to the new annotations for the Sample100 dataset. This research utilised Queen Mary’s Apocrita HPC facility, supported by QMUL Research-IT (http://doi.org/10.5281/zenodo.438045). This work was supported by UKRI - Innovate UK (Project no. 10102241). A. Bhattacharjee is a research student at the UKRI Centre for Doctoral Training in Artificial Intelligence and Music, supported jointly by UK Research and Innovation [grant number EP/S022694/1] and Queen Mary University of London.
9. REFERENCES
[1]
K. McLeod, P. DiCola, J. Toomey, and K. Thomson,
C ea i e License: The Law and Cul u e o Digi al Sam-
pling. Du ham, NC, USA: Duke Uni e si y P ess,
2011.
[2]
J. Deme s, S eal This Music: How In ellec ual P ope y
Law A ec s Musical C ea i i y. A hens, GA, USA:
Uni e si y o Geo gia P ess, 2006.
[3]
J. V. Balen, “Au oma ic ecogni ion o samples in musi-
cal audio,” Mas e ’s hesis, Uni e si a Pompeu Fab a,
Ba celona, Spain, 2011.
[4]
S. Gu u ani and A. Le ch, “Au oma ic sample de ec ion
in polyphonic music,” in P oc. o he 18 h In . Socie y
o Music In o ma ion Re ie al Con ., Suzhou, China,
2017.
[5]
D. Ellis, “aud p in : Landma k-based audio in-
ge p in ing,” 2014, so wa e. [Online]. A ailable:
h ps://gi hub.com/dpwe/aud p in
[6]
J. Six and M. Leman, “Panako - a scalable acous ic in-
ge p in ing sys em handling ime-scale and pi ch mod-
i ica ion,” in P oc. o he 15 h In . Socie y o Music
In o ma ion Re ie al Con ., Taipei, Taiwan, 2014.
[7]
H. Ches on, J. V. Balen, and S. Du and, “Au oma ic
iden i ica ion o samples in hip-hop music ia
mul i-loss aining and an a i icial da ase ,” 2025,
a Xi p ep in a Xi :2502.06364. [Online]. A ailable:
h ps://a xi .o g/abs/2502.06364
[8]
X. Du, Z. Yu, B. Zhu, X. Chen, and Z. Ma, “By eco e :
Co e song iden i ica ion ia mul i-loss aining,” in
P oc. IEEE In . Con . Acous ics, Speech and Signal
P ocessing, To on o, ON, Canada, 2021, pp. 551–555.
[9]
X. Xu, X. Chen, and D. Yang, “Key-in a ian con olu-
ional neu al ne wo k owa d e icien co e song iden i-
ica ion,” in P oc. IEEE In . Con . Mul imedia and Expo,
San Diego, CA, USA, 2018.
[10]
S. Chang, D. Lee, J. Pa k, H. Lim, K. Lee, K. Ko, and
Y. Han, “Neu al audio inge p in o high-speci ic au-
dio e ie al based on con as i e lea ning,” in P oc.
IEEE In . Con . Acous ics, Speech and Signal P ocess-
ing, To on o, ON, Canada, 2021, pp. 3025–3029.
[11]
A. Singh, K. Demuynck, and V. A o a, “A en ion-based
audio embeddings o que y-by-example,” in P oc. o
he 23 d In . Socie y o Music In o ma ion Re ie al
Con ., Bengalu u, India, 2022, pp. 52–58.
[12]
G. Li, M. Mulle , A. Thabe , and B. Ghanem, “Deep-
gcns: Can gcns go as deep as cnns?” in P oc. IEEE/CVF
In . Con . Compu e Vision, Seoul, Ko ea, 2019, pp.
9267–9276.
[13]
A. Bha acha jee, S. Singh, and E. Bene os, “G a p in :
A gnn-based app oach o audio iden i ica ion,” in P oc.
IEEE In . Con . Acous ics, Speech and Signal P ocess-
ing, Hyde abad, India, 2025.
[14]
S. Singh, C. J. S einme z, E. Bene os, H. Phan, and
D. S owell, “A gnn: Audio agging g aph neu al ne -
wo k,” IEEE Signal P ocessing Le e s, 2024.
[15]
A. Vaswani, N. Shazee , N. Pa ma , J. Uszko ei ,
L. Jones, A. N. Gomez, Łukasz Kaise , and I. Polo-
sukhin, “A en ion is all you need,” Ad ances in Neu al
In o ma ion P ocessing Sys ems, ol. 30, 2017.
[16]
T. Chen, S. Ko nbli h, M. No ouzi, and G. Hin on, “A
simple amewo k o con as i e lea ning o isual ep-
esen a ions,” in P oc. In . Con . Machine Lea ning,
2020, pp. 1597–1607.
[17]
M. De e a d, K. Benzi, P. Vande gheyns , and X. B es-
son, “Fma: A da ase o music analysis,” in P oc. o
he 18 h In . Socie y o Music In o ma ion Re ie al
Con ., Suzhou, China, 2017.
[18]
F. Fosca in, J. Schlü e , and G. Widme , “Bea his!
accu a e bea acking wi hou dbn pos p ocessing,” in
P oc. o he 25 h In . Socie y o Music In o ma ion
Re ie al Con ., San F ancisco, CA, USA, 2024.
[19]
S. Roua d, F. Massa, and A. Dé ossez, “Hyb id ans-
o me s o music sou ce sepa a ion,” in P oc. IEEE In .
Con . Acous ics, Speech and Signal P ocessing, Rhodes
Island, G eece, 2023.
[20]
C. Cannam, C. Landone, and M. Sandle , “Sonic
isualise : An open sou ce applica ion o iewing,
analysing, and anno a ing music audio iles,” in P oc.
ACM Mul imedia In . Con ., Fi enze, I aly, 2010, pp.
1467–1468.
[21]
J. Johnson, M. Douze, and H. Jégou, “Billion-scale
simila i y sea ch wi h gpus,” IEEE T ansac ions on Big
Da a, ol. 7, no. 3, pp. 535–547, 2019.
[22]
J. S. Downie, “The music in o ma ion e ie al e al-
ua ion exchange (2005–2007): A window in o music
in o ma ion e ie al esea ch,” Acous ical Science and
Technology, ol. 29, no. 4, pp. 247–255, 2008.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025