REFINING MUSIC SAMPLE IDENTIFICATION WITH A
SELF-SUPERVISED GRAPH NEURAL NETWORK
Aditya Bhattacharjee¹, Ivan Meresman Higgs¹, Mark Sandler¹, Emmanouil Benetos¹
¹ Queen Mary University of London, UK
{a.bhattacharjee, i.meresman-higgs, mark.sandler, emmanouil.benetos}@qmul.ac.uk
ABSTRACT
Automatic sample identification (ASID) - the detection and identification of portions of audio recordings that have been reused in new musical works - is an essential but challenging task in the field of audio query-based retrieval. While a related task, audio fingerprinting, has made significant progress in accurately retrieving musical content under “real world” (noisy, reverberant) conditions, ASID systems struggle to identify samples that have undergone musical modifications. Thus, a system robust to common music production transformations such as time-stretching, pitch-shifting, effects processing, and underlying or overlaying music is an important open challenge. In this work, we propose a lightweight and scalable encoding architecture employing a Graph Neural Network within a contrastive learning framework. Our model uses only 9% of the trainable parameters compared to the current state-of-the-art system while achieving comparable performance, reaching a mean average precision (mAP) of 44.2%.
To enhance retrieval quality, we introduce a two-stage approach consisting of an initial coarse similarity search for candidate selection, followed by a cross-attention classifier that rejects irrelevant matches and refines the ranking of retrieved candidates - an essential capability absent in prior models. In addition, as queries in real-world applications are often short in duration, we benchmark our system for short queries using new fine-grained annotations for the Sample100 dataset, which we publish as part of this work.
1. INTRODUCTION
Sampling is a musical technique that “incorporates portions of existing sound recordings into a newly collaged composition” [1]. The samples often undergo significant modification during this creative process: they may be pitch-shifted, time-stretched and heavily processed with audio effects (henceforth sampling transformations), and are typically combined with other musical elements, creating “musical interference” which makes identification difficult even for human experts. The relevance of this practice is highlighted as, since the mass popularisation of hip hop, disco and electronic dance music, this kind of “transformative appropriation” has become one of the most important techniques for composers and songwriters [2].
© A. Bhattacharjee, I. Meresman Higgs, M. Sandler, and E. Benetos. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: A. Bhattacharjee, I. Meresman Higgs, M. Sandler, and E. Benetos, “Refining music sample identification with a self-supervised graph neural network”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Automatic sample identification (ASID) is a crucial task in music retrieval: given an audio query - either a small segment or an entire music track - the goal is to retrieve the sample source from a database of music recordings, even if sampling transformations have been applied. The potential to substantially impact domains such as attribution and copyright highlights the relevance of this task to music creators and rights holders, as well as music information retrieval (MIR) researchers.
This task is particularly challenging as sampling transformations can drastically alter the audio features while maintaining perceptual similarity. A reasonable approach is to take cues from deep learning-based audio fingerprinting research, learning metrics that allow for a similarity-based search and retrieval system. Additionally, augmentations in the training pipeline allow models to learn invariance to sampling transformations employed in music production. Recent audio fingerprinting research has successfully employed Graph Neural Networks (GNNs), achieving state-of-the-art results while using compact architectures that facilitate efficient training, which informs this work.
Progress in ASID has been hindered by the limited availability of well-annotated datasets that reflect real-world sampling practices. The Sample100 dataset [3] is the only publicly available dataset of annotations specifically addressing the presence of samples in commercially produced songs. In this paper we present a revised version of this dataset, annotated by experts to include more fine-grained temporal annotations of the samples, as well as additional comments, time-stretching estimates and instrumentation information. We use these new annotations to report segment-wise hit-rates and to analyse the performance of our system in relation to the type of sample and augmentations performed during the artistic process.
Our key contributions are as follows:
• We propose the adaptation of a lightweight Graph Neural Network as the neural encoder for ASID.
• We introduce a binary cross-attention classifier to facilitate an accurate ranking and refining of retrieved audio fingerprints.
• We contribute new fine-grained temporal annotations for the Sample100 dataset, and evaluate our model’s performance on short-query retrieval, demonstrating superior top-N hit-rates compared to the baseline.
• We present a detailed analysis of retrieval performances on different types of samples and discuss the viability of the proposed framework.
Our code as well as the newly extended Sample100 dataset have been made available for reproducibility (footnote 1).
2. RELATED WORKS
Despite the ASID task being a relevant and challenging one for the MIR community, there have been few attempts to tackle it. Foundational work by Van Balen et al. [3] introduced the Sample100 dataset and proposed the adaptation of a spectral peak-based audio fingerprinting framework to make it robust to pitch-shifting. Gururani et al. [4] proposed a system inspired by music cover identification, using Non-negative Matrix Factorization to create templates of the samples and Dynamic Time Warping to achieve a detection algorithm that could be robust to time-shifting. Both of these works focus primarily on robustness against individual sampling transformations, but neither addresses the broader range typically encountered in real-world scenarios. Other traditional fingerprinting methods that were effective for audio retrieval tasks, such as audfprint [5] and Panako [6], have also been tested on this task [7] and proved insufficient for ASID, struggling with the challenges of combined sampling transformations and interfering “musical noise” (the overlying musical composition).
More recently, the first deep learning-based approach by Cheston et al. [7] achieved state-of-the-art performance on the Sample100 dataset using a CNN architecture (ResNet50-IBN) previously used for cover song identification [8] and exploiting music source separation to create synthetic training data. This approach serves as our baseline and demonstrates both the feasibility and remaining challenges of applying deep learning to ASID.
Current state-of-the-art audio retrieval systems predominantly use CNNs [8–10] or transformers [11] trained with contrastive learning objectives. While effective, these architectures typically require significant computational resources and large training batches, limiting their practical viability. These limitations can be addressed by more parameter-efficient approaches based on Graph Neural Networks (GNNs), which excel at capturing complex structural patterns in non-Euclidean spaces [12]. GNNs have proven effective for audio tasks where temporal and spectral relationships are important, including audio fingerprinting [13] and audio tagging [14], by effectively modelling local and global interactions between time-frequency regions.
3. METHODOLOGY
ASID involves two categories of audio recordings: a reference, an original music recording, and a query, a new recording that incorporates (i.e., samples) parts of the reference. For training, we generate query-reference pairs by re-mixing source-separated stems as proposed in [7]. For evaluation, our retrieval methodology employs a two-stage process: initial candidate selection via approximate nearest-neighbour search, followed by fine-grained ranking with the cross-attention classifier. Figure 1 illustrates the complete retrieval pipeline, detailing how reference matches are retrieved and ranked for a given query.
Footnote 1: https://github.com/chymaera96/NeuralSampleID
3.1 Input Features
Our system employs log-scaled Mel-spectrograms as input features. Given an audio waveform $y \in \mathbb{R}^{t}$, sampled at 16 kHz, we first compute its Mel-spectrogram representation $X \in \mathbb{R}^{F \times T}$. Here, $F$ denotes the number of Mel-frequency bins, and $T$ is the number of temporal frames.
During training, we randomly sample short audio segments of fixed duration $t_{seg}$ from each recording in the training dataset and use them to generate proxy query-reference pairs (see Section 3.4.1). For retrieval, we use real query and reference audio recordings which are segmented into overlapping segments of length $t_{seg}$. Section 5 details the configuration of the input features and hyperparameters.
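The overlapping segmentation used at retrieval time can be sketched as follows. This is a minimal illustration rather than the released implementation: the function names are hypothetical, and the defaults (4 s window, 0.5 s hop, 16 kHz) are taken from Table 1.

```python
def segment_starts(duration_s, window_s=4.0, hop_s=0.5):
    """Return start times (seconds) of all overlapping windows that fit in the audio."""
    if duration_s < window_s:
        return []
    n_segments = int((duration_s - window_s) // hop_s) + 1
    return [i * hop_s for i in range(n_segments)]


def segment_waveform(samples, sample_rate=16000, window_s=4.0, hop_s=0.5):
    """Slice a sequence of samples into overlapping fixed-length segments."""
    win = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    # Segments that would run past the end of the recording are dropped (assumption).
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, hop)]
```

Each segment would then be converted to a log-scaled Mel-spectrogram before being passed to the encoder; that step is omitted here.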
3.2 Encoder Architecture
Our GNN encoder builds upon the architecture introduced in [13]. Given an input Mel-spectrogram $X$, we first represent it as a set of three-dimensional time-frequency points, each described by its time index, frequency bin index, and amplitude value. From this initial representation, we produce overlapping patch embeddings by aggregating local neighbourhoods of time-frequency points into latent vectors. Formally, each resulting patch embedding is represented by:
$f : \mathbb{R}^{3 \times p} \to \mathbb{R}^{d}$, (1)
where $p$ denotes the number of neighbouring points aggregated per patch, and $d$ is the dimensionality of the latent embedding. These patch embeddings serve directly as nodes in the subsequent graph structure.
Next, we construct a k-nearest neighbours (kNN) graph from these node embeddings. Specifically, for each node embedding $x_i$, we identify its $k$ nearest neighbours based on cosine similarity in the latent embedding space. The resulting edges represent latent structural relationships among Mel-spectrogram patches.
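The kNN graph construction described above can be illustrated with a small sketch. It assumes that a node is never its own neighbour and breaks ties by node index; neither detail is stated in the text, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def knn_graph(nodes, k):
    """For each node i, return the indices of its k most cosine-similar other nodes."""
    neighbours = []
    for i, x_i in enumerate(nodes):
        sims = [(cosine(x_i, x_j), j) for j, x_j in enumerate(nodes) if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))  # highest similarity first
        neighbours.append([j for _, j in sims[:k]])
    return neighbours
```

For the 32 nodes per segment listed in Table 1, this brute-force search over all pairs stays cheap.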
Node embeddings are then iteratively refined via graph convolution (GraphConv) layers. For each node embedding $x_i$, we aggregate information from its neighbours $x_j$, where $j \in \mathcal{N}(x_i)$. Formally, the update rule is given by:
$y_i = x_i + \sigma\big(\mathrm{AGG}(\{x_j : j \in \mathcal{N}(x_i)\})\big)$, (2)
where $y_i$ is the updated embedding, $\sigma$ denotes a nonlinear activation function, $\mathcal{N}(x_i)$ is the set of neighbours of node $x_i$, and AGG represents an aggregation operation summarizing relevant information from neighbouring nodes.
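A worked instance of the residual update in Eq. (2) is sketched below. The text leaves AGG and $\sigma$ unspecified, so this sketch assumes mean aggregation and a ReLU activation, and it omits the learnable weights that a real GraphConv layer would contain.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def mean_aggregate(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def graph_conv_update(nodes, neighbours):
    """Apply y_i = x_i + relu(mean of neighbour embeddings) to every node."""
    updated = []
    for x_i, nbrs in zip(nodes, neighbours):
        agg = mean_aggregate([nodes[j] for j in nbrs])
        updated.append([a + b for a, b in zip(x_i, relu(agg))])
    return updated
```

The skip connection keeps each node's own embedding intact, so a layer can only add neighbourhood context rather than overwrite the node.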
Through iterative aggregation, each node embedding progressively encodes increasingly rich contextual and structural information. The GNN encoder comprises multiple blocks of GraphConv layers, each followed by feedforward network (FFN) layers. At the beginning of each block, the kNN graph is dynamically reconstructed to reflect the updated node embeddings.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 1. Illustrated ASID methodology: (A) Given a query, we compute segment-level embeddings (fingerprints), matched to reference embeddings via approximate nearest-neighbour (ANN) search; based on which, candidate songs are retrieved from the reference database through a lookup process (dotted arrows). (B) A multi-head cross-attention (MHCA) classifier refines and ranks candidates using node embedding matrices $NM_q$ (query) and $NM_r$ (references).
The output of the GNN encoder is a set of refined node embeddings, collectively referred to as the node embedding matrix, which serve as input features for the cross-attention classifier. Finally, these node embeddings are average-pooled and projected into audio fingerprints. Both latent embeddings are used in the subsequent retrieval refinement stage.
For a comprehensive discussion of architectural details and design considerations, we refer readers to [13].
3.3 Cross-Attention Classifier
To capture the latent relationships between the query and reference sets of node embeddings, we introduce a multi-head cross-attention classifier. Given a query and a reference audio segment, we first compute the node embedding matrices $q \in \mathbb{R}^{N \times d_n}$ and $r \in \mathbb{R}^{N \times d_n}$, respectively. Here, $N$ is the number of nodes, and $d_n$ is the dimensionality of each node embedding. We compute attention-weighted embeddings as follows:
$C = \mathrm{MHA}(q, r, r)$ (3)
where $\mathrm{MHA}(\cdot)$ denotes standard multi-head attention [15]. The resulting embedding matrix $C \in \mathbb{R}^{N \times d_n}$ is an attention-weighted transformation of $r$, where attention is computed between corresponding node embeddings in $q$ and $r$. $C$ is then aggregated by mean pooling, producing a single context vector $c \in \mathbb{R}^{d_n}$:
$c = \frac{1}{N} \sum_{j=1}^{N} C_j$ (4)
where $C_j$ is the context vector of the $j$-th node embedding.
Finally, the context vector $c$ is transformed by a shallow nonlinear classifier into a scalar confidence score $s$:
$s = \sigma(w^{T} c + b)$ (5)
where $w \in \mathbb{R}^{d_n}$ and $b \in \mathbb{R}$ are learnable parameters, and $\sigma$ denotes the sigmoid activation function. The scalar $s$ indicates the confidence that the query and reference segments match. As shown in Figure 1, at retrieval time, this score is used as a ranking mechanism as well as a measure for rejecting low-confidence candidates.
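Eqs. (3) to (5) can be traced end to end with a simplified sketch. It replaces multi-head attention with a single head and drops the learned query, key and value projections, so it illustrates the data flow only and is not the classifier used in this work; all names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(a - m) for a in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q, r):
    """Single-head attention: queries from q, keys and values from r."""
    d = len(q[0])
    out = []
    for q_i in q:
        scores = [sum(a * b for a, b in zip(q_i, r_j)) / math.sqrt(d) for r_j in r]
        weights = softmax(scores)
        out.append([sum(w * r_j[t] for w, r_j in zip(weights, r)) for t in range(d)])
    return out


def match_score(q, r, w, b):
    """Mean-pool the attended matrix C into c (Eq. 4), then apply sigmoid(w^T c + b) (Eq. 5)."""
    C = cross_attention(q, r)
    n = len(C)
    c = [sum(col) / n for col in zip(*C)]
    logit = sum(wi * ci for wi, ci in zip(w, c)) + b
    return 1.0 / (1.0 + math.exp(-logit))
```

Because keys and values both come from the reference, every row of $C$ is a convex combination of reference node embeddings, weighted by their similarity to one query node.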
3.4 Training Pipeline
Our proposed approach involves two distinct training stages: a self-supervised contrastive learning stage for embedding training and a subsequent binary classification stage for the downstream cross-attention classifier. Both stages use identical procedures to produce proxy query-reference pairs from the source-separated training data, closely following the methodology established in prior work [7].
3.4.1 Query-Reference Pair Generation
Let us denote the stems extracted from the training audio source $x$ as a set $S = \{s_1, s_2, \ldots, s_K\}$, where each stem $s_k$ corresponds to a source-separated audio component (e.g., vocals, drums, bass). Given a random timestamp segment $t_s$ starting at $t$ and of length $\Delta t$, we first extract corresponding audio segments from each stem as
$s_k(t_s) = s_k[t, t + \Delta t]$ (6)
resulting in the set $\{s_1(t_s), s_2(t_s), \ldots, s_K(t_s)\}$. These stem segments are partitioned randomly into two subsets, $S_q$ and $S_r$, with $S_q \cup S_r = S$ and $S_q \cap S_r = \emptyset$.
A query segment $x_q$ is formed as the sum of stems in $S_q$:
$x_q = \sum_{s \in S_q} s(t_s)$. (7)
The reference segment $x_r$ is generated by mixing an augmented version of the query segment with the remaining stems:
$x_r = \mathrm{aug}_2\Big(\mathrm{aug}_1(x_q) + \sum_{s \in S_r} s(t_s)\Big)$. (8)
Here, $\mathrm{aug}_1$ and $\mathrm{aug}_2$ represent audio effects functions applied sequentially to simulate realistic music production transformations. The effect parameters are sampled from a uniform distribution. Specifically,
• $\mathrm{aug}_1$: time-offset (±250 ms) and gain variation (±10 dB).
• $\mathrm{aug}_2$: pitch-shifting (±3 semitones) and time-stretching (70 - 150%).
The source-separation system (see Section 4.1) allows the extraction of musically salient sources that can constitute a sample. The pair $(x_q, x_r)$ constitutes a positive query-reference example; $x_q$ is a proxy for a query containing an instance of a sample, and $x_r$ represents a reference example which contains the sample that is creatively distorted and is present in a mix along with other musical elements.
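The pair construction of Eqs. (7) and (8) can be sketched on toy signals. Several simplifications are assumed here: stems are plain lists of floats, $\mathrm{aug}_1$ is reduced to a uniformly sampled gain, $\mathrm{aug}_2$ defaults to the identity (pitch-shifting and time-stretching need a DSP library), and both subsets are forced to be non-empty. The function names are hypothetical.

```python
import random


def apply_gain_db(signal, gain_db):
    """Scale a signal by a gain expressed in decibels."""
    scale = 10.0 ** (gain_db / 20.0)
    return [scale * a for a in signal]


def mix(signals, length):
    """Sample-wise sum of equal-length signals."""
    return [sum(sig[t] for sig in signals) for t in range(length)]


def make_pair(stem_segments, rng, aug2=lambda x: x):
    """Build (x_q, x_r) from time-aligned stem segments (dict: stem name -> samples)."""
    names = list(stem_segments)
    rng.shuffle(names)
    split = rng.randint(1, len(names) - 1)  # assumption: neither subset is empty
    s_q, s_r = names[:split], names[split:]
    length = len(stem_segments[names[0]])
    x_q = mix([stem_segments[n] for n in s_q], length)             # Eq. (7)
    x_q_aug = apply_gain_db(x_q, rng.uniform(-10.0, 10.0))         # aug1, gain only
    rest = mix([stem_segments[n] for n in s_r], length)
    x_r = aug2([a + b for a, b in zip(x_q_aug, rest)])             # Eq. (8)
    return x_q, x_r, s_q, s_r
```

Passing a seeded `random.Random` instance as `rng` makes the partition and the sampled gain reproducible.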
3.4.2 Contrastive Learning
We train the encoder using a self-supervised contrastive learning framework. Given a batch of $B$ pairs $\{(x_q^i, x_r^i)\}_{i=1}^{B}$, we obtain their corresponding audio fingerprints $\{z_q^i, z_r^i\}_{i=1}^{B}$ from the encoder. We then employ the Normalized Temperature-scaled Cross Entropy (NT-Xent) loss [16] to maximize similarity between embeddings from positive pairs, while minimizing cosine similarity for embeddings from all other pairs in the batch.
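A reference computation of the NT-Xent loss on small lists of fingerprints is sketched below. It follows the usual SimCLR formulation, in which each of the $2B$ embeddings is contrasted against the other $2B - 1$; whether the training code uses exactly this variant is an assumption. The default temperature is the value of $\tau$ in Table 1.

```python
import math


def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def nt_xent(z_q, z_r, tau=0.05):
    """NT-Xent loss over B positive pairs (z_q[i], z_r[i]); all other items are negatives."""
    z = [normalize(v) for v in z_q + z_r]
    b = len(z_q)
    total = 0.0
    for i in range(2 * b):
        pos = (i + b) % (2 * b)  # index of the paired embedding
        logits = [sum(x * y for x, y in zip(z[i], z[j])) / tau
                  for j in range(2 * b) if j != i]
        pos_logit = sum(x * y for x, y in zip(z[i], z[pos])) / tau
        m = max(logits)  # subtract the maximum for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * b)
```

With a larger batch, the denominator sums over more negatives, which is the mechanism behind the batch-size effect discussed in Section 6.1.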
3.4.3 Downstream Classifier Training
The cross-attention classifier is trained as a downstream task, with the encoder parameters frozen after the contrastive learning stage. For this stage, we discard the previously used projection network and directly use the node embedding matrix obtained from the encoder.
Training batches consist of query-reference pairs generated identically to the contrastive learning stage. Let $Q = \{q_i\}_{i=1}^{B_c}$ and $R = \{r_j\}_{j=1}^{B_c}$ represent query and reference embedding sets in a batch, respectively, where each embedding $q_i, r_j \in \mathbb{R}^{N \times d_n}$, and $B_c$ is the batch size.
Positive examples correspond to pairs of identical indices:
$\mathcal{P} = \{(q_i, r_j) \mid i = j\}$, (9)
while negative examples are selected from pairs with non-identical indices via hard-negative mining. Specifically, we select negative pairs as the subset of non-positive pairs that maximize audio fingerprint similarity, thus being the most confounding:
$\mathcal{N} = \{(q_i, r_j^{-}) \mid i \neq j,\ r_j^{-} = \arg\max_{j,\, j \neq i} \mathrm{sim}(z_i, z_j)\}$. (10)
We maintain a fixed ratio of 1:3 of positive to negative pairs within each training batch. The classifier outputs a scalar prediction $p \in [0, 1]$, trained with the binary cross-entropy (BCE) loss, where the label for pairs $(q_i, r_j) \in \mathcal{P}$ is 1 and for pairs $(q_i, r_j^{-}) \in \mathcal{N}$ is 0.
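The hard-negative selection in Eq. (10) can be sketched as a ranking of non-matching references by fingerprint similarity. The text does not say how the arg max is combined with the 1:3 ratio, so this sketch assumes the three most similar non-matching references are kept per query, and that $\mathrm{sim}$ is the cosine similarity between a query fingerprint and a reference fingerprint.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def hard_negatives(z_q, z_r, per_positive=3):
    """For each query index i, return the indices j != i of the most similar references."""
    mined = {}
    for i, z_i in enumerate(z_q):
        candidates = [j for j in range(len(z_r)) if j != i]
        ranked = sorted(candidates, key=lambda j: cosine(z_i, z_r[j]), reverse=True)
        mined[i] = ranked[:per_positive]
    return mined
```

The returned indices select the node embedding matrices $r_j^{-}$ that are paired with $q_i$ and labelled 0 for the BCE loss.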
3.5 Retrieval and Evaluation
Our retrieval system, illustrated in Figure 1, operates in two sequential stages:
• Approximate nearest-neighbour (ANN) search to do a fast and coarse search of candidate reference audio fingerprints from the database.
• Cross-attention classifier scoring to refine the candidate set and rank them based on relevance.
For every overlapping segment (computed as described in Section 3.1) in the query, we probe the reference database for matches based on the similarity of the audio fingerprints, thus yielding a set of candidate matches.
In the second stage, we utilize the cross-attention classifier to refine these candidate matches. For each candidate segment retrieved, we extract its corresponding node embedding matrix. Given a query recording, represented as a sequence of node embedding matrices, we compute classifier scores $p(q, r)$ for each pair of query $q$ and retrieved candidate $r$. The final candidate segment-level confidence score is determined by selecting the maximum classifier score over all segments of the query:
$p_{clf}(q, r) = \max_{q_i \in Q} p(q_i, r)$. (11)
We reject candidate segments with confidence scores $p_{clf}(q, r) < 0.5$. Subsequently, we aggregate these accepted segment-level scores to obtain a song-level retrieval score. Specifically, for each unique reference recording, we sum the segment-level confidence scores:
$P_{song}(q, R) = \sum_{r \in R} p_{clf}(q, r)$, (12)
where $R$ denotes the set of retrieved segments belonging to the same reference song. The resulting aggregated scores $P_{song}(q, R)$ provide a robust ranking of candidate songs for each query recording.
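The second-stage aggregation in Eqs. (11) and (12) amounts to a maximum, a threshold and a per-song sum. A minimal sketch follows, with hypothetical container names: `segment_scores` maps each retrieved reference segment to its classifier scores against every query segment, and `segment_to_song` maps each reference segment to its parent song.

```python
from collections import defaultdict


def rank_songs(segment_scores, segment_to_song, threshold=0.5):
    """Return (song, score) pairs sorted by aggregated confidence, highest first."""
    song_scores = defaultdict(float)
    for ref_segment, scores in segment_scores.items():
        p_clf = max(scores)          # Eq. (11): best score over all query segments
        if p_clf < threshold:        # reject low-confidence candidates
            continue
        song_scores[segment_to_song[ref_segment]] += p_clf  # Eq. (12)
    return sorted(song_scores.items(), key=lambda kv: kv[1], reverse=True)
```

A song matched by several accepted segments accumulates a higher score than a song with a single strong match, which favours references sampled over longer stretches of the query.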
4. DATASET
4.1 Training Dataset
For training, we use the Free Music Archive (FMA) medium dataset [17], which contains 25,000 30-second tracks across 16 genres. We pre-processed this dataset to make it suitable for our stem-mixing contrastive learning approach. We used the current SOTA algorithm “Beat This” [18] to perform beat tracking and use this as a proxy for musical rhythmic regularity in the FMA tracks, excluding 2,533 tracks with fewer than 32 beats after the first downbeat. This filtering ensured that our training data consisted only of musical content with some level of rhythmic structure.
To generate the stems that will be used for the synthetic training pairs, we applied source separation using the Hybrid Transformer Demucs model (htdemucs) [19] to each usable track, separating them into drums, bass, vocals, and “other” stems.
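The rhythmic-regularity filter described above reduces to a count over beat times. The sketch below assumes beat and downbeat times are already available as lists of seconds from a beat tracker, that the first downbeat itself is not counted, and that tracks without any detected downbeat are excluded; none of these details is stated in the text.

```python
def has_enough_beats(beat_times, downbeat_times, min_beats=32):
    """True if at least min_beats beats occur strictly after the first downbeat."""
    if not downbeat_times:
        return False  # assumption: no downbeat detected means the track is excluded
    first_downbeat = min(downbeat_times)
    return sum(1 for t in beat_times if t > first_downbeat) >= min_beats
```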
4.2 Evaluation Dataset
For the evaluation of our system, we use the Sample100 dataset [3]. The dataset consists of 75 full-length hip-hop recordings (queries) containing samples from 68 full-length songs (references) across a variety of genres, with R&B/Soul representing the majority [7]. It contains 106 sample relationships and a total of 137 sample occurrences, as some queries use multiple samples and some references appear in multiple queries. To challenge retrieval systems, the dataset includes 320 additional “noise” tracks with a similar genre distribution, which are not sampled in any query.
Because samples are typically created from a short segment of a song, only a small portion of each candidate track is sampled and present in queries - sample lengths range from just one second to 26 seconds. The samples represent real-world musical “transformative appropriation” [2], including tonal (riffs), percussive drum break (beats), and 1-note micro-samples. Non-musical samples (e.g. film dialogue) are not included.
To enable more detailed evaluation, we present an extended version of the Sample100 dataset with fine-grained temporal annotations performed by expert musicians using Sonic Visualiser [20]. Unlike the original dataset, which only provided first occurrence timestamps at 1-second precision, our annotations include precise start and end times of all sample occurrences with ±250 ms resolution, transforming the dataset into a segment-wise evaluation resource. This improved temporal granularity allows for more accurate evaluation of ASID systems by testing with short query snippets from anywhere within the sampled material.
We further enrich the dataset by adding estimates of the time-stretching ratio between the reference and query tracks, as well as instrumentation (stem) annotations of both the original material and the interfering instruments in the query, and expanding the comments about the samples. The time-stretching ratio was calculated from the tempo of both query and reference segments, determined through a combination of automatic beat tracking [18] with manual verification. Stem annotations were performed by listening to the tracks and their source-separated stems to ensure accuracy. Relevant sample class counts are shown in Table 4, including a categorisation into substantial or minimal time-stretching. This new information will enable more nuanced analysis of our model’s performance across different types of sampling practices in Section 6.3.
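One way to derive the time-stretching categories from tempo annotations is sketched below. The exact formula is not given in the text, so the ratio is assumed to be query tempo divided by reference tempo, and a deviation of exactly 5% is assigned to the minimal class; both choices are illustrative.

```python
def stretch_ratio(query_bpm, reference_bpm):
    """Tempo ratio between the sampled material in the query and in the reference."""
    return query_bpm / reference_bpm


def stretch_category(query_bpm, reference_bpm, threshold=0.05):
    """Label a sample as 'substantial' (>5% tempo change) or 'minimal' otherwise."""
    deviation = abs(stretch_ratio(query_bpm, reference_bpm) - 1.0)
    return "substantial" if deviation > threshold else "minimal"
```

For example, a 90 BPM reference sampled at 100 BPM has a ratio of about 1.11 and falls into the substantial class.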
5. EXPERIMENTAL SETUP
5.1 Hyperparameters and Configuration
Our experimental setup and hyperparameter choices are summarized in Table 1, with certain parameters detailed in the preceding sections. The contrastive learning stage was performed using an NVIDIA A100 GPU, with models trained for a maximum of 180 epochs; we employed early stopping based on validation performance. Training utilized the Adam optimizer coupled with a cosine annealing learning-rate scheduler. For the downstream cross-attention classifier, we trained for a maximum of 5 epochs using the Adam optimizer with a fixed learning rate, keeping the encoder parameters frozen to preserve the learned representations from the contrastive learning stage. For the ANN search algorithm, we use IVF-PQ [21], an efficient choice for retrieval tasks in large vector databases.
Hyperparameter                              Value
Sampling rate                               16,000 Hz
Log-power Mel-spectrogram size F × T        64 × 32
Fingerprint {window length, hop}            {4 s, 0.5 s}
Fingerprint dimension                       128
Node matrix dimension {N, d_n}              {32, 512}
Temperature τ                               0.05
Contrastive batch size B                    1024
Downstream batch size B_c                   32
Table 1. Experimental Configuration
5.2 Evaluation Metrics
The ASID task is fundamentally a retrieval problem, where the goal is to rank candidate audio segments based on their relevance to a query. Hence, we adopt mean average precision (mAP) [22] as our primary metric, where the query is computed from a full song containing a sample. The mAP summarizes retrieval quality by aggregating the precision values at the ranks where relevant items are retrieved, averaged across all queries.
Additionally, inspired by an established practice in audio fingerprinting literature [10], we report top-N hit rates. Specifically, we measure the proportion of queries for which at least one correct sample is retrieved within the top N ranked results. We do so for different query sizes (5 s to 20 s). This metric provides an intuitive indication of practical retrieval accuracy and the system’s efficacy for short queries.
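Both metrics can be stated compactly in code. The sketch below assumes each query yields a ranked list of reference identifiers and a set of ground-truth references, and it normalizes average precision by the number of relevant items, a common convention that the text does not spell out.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(results):
    """results: list of (ranked_list, relevant_set) tuples, one per query."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)


def top_n_hit_rate(results, n):
    """Fraction of queries with at least one relevant item in the top n results."""
    hits = sum(1 for ranked, rel in results if any(item in rel for item in ranked[:n]))
    return hits / len(results)
```

As a worked case, a query with relevant items at ranks 1 and 3 has an average precision of (1/1 + 2/3) / 2 ≈ 0.833.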
5.3 Baseline Framework
We compare our proposed system against the recent state-of-the-art baseline introduced by Cheston et al. [7]. Their framework employs a ResNet50-IBN architecture and utilizes a multi-task learning approach that jointly optimises a metric learning objective through triplet loss and an auxiliary classification task. This architecture has achieved state-of-the-art retrieval performance in terms of mean average precision (mAP). Due to practical computational constraints, we instead adopt and report results on a ResNet18-IBN model, which has a comparable number of parameters to our proposed GNN-based encoder. Apart from the model size, we closely adhere to the training procedures and evaluation methodology outlined in [7]. We also include their reported best performance for reference.
6. RESULTS AND DISCUSSION
6.1 Benchmarking
We present the performance comparison between our proposed GNN+MHCA architecture and the baseline in Table 2. Our model matches the reported performance of the much larger ResNet50-IBN model, and significantly outperforms the reimplemented baseline.
Model                                    # params   mAP
ResNet50-IBN (Cheston et al.)            222M       0.441
ResNet18-IBN (Baseline)                  34M        0.330
GNN (Ours), batch size = 1024            20M        0.416
GNN + MHCA (Ours), batch size = 256      20M        0.373
GNN + MHCA (Ours), batch size = 512      20M        0.411
GNN + MHCA (Ours), batch size = 1024     20M        0.442
Table 2. Performance of models on Sample100 dataset.
A key factor influencing our model’s performance is the batch size used during contrastive learning. Increasing the batch size from 256 to 1024 leads to an improvement of 6.9pp in mean average precision (mAP). This improvement occurs because larger batch sizes provide more negative samples per positive pair, enriching the diversity of the contrastive space. Consequently, the model learns embeddings that better discriminate between relevant and irrelevant examples. While larger batch sizes (e.g., 2048) were explored, the associated training time rendered hyperparameter tuning impractical, and thus these results are not reported.
Models                       Query Length
                             5 sec   7 sec   10 sec  15 sec  20 sec
Top-1 Hit Rate (%)
ResNet18-IBN (Baseline)      15.8    15.8    16.0    16.5    21.7
GNN+MHCA (Ours)              15.5    24.6    27.5    26.1    30.7
Top-3 Hit Rate (%)
ResNet18-IBN (Baseline)      24.3    24.3    29.2    26.1    32.3
GNN+MHCA (Ours)              19.1    32.1    38.3    40.9    50.3
Top-10 Hit Rate (%)
ResNet18-IBN (Baseline)      27.4    27.4    40.4    45.2    49.1
GNN+MHCA (Ours)              19.1    33.8    44.7    51.3    63.2
Table 3. Hit rates of our framework and baseline.
Table 3 shows our model’s performance on short queries, a common use case in real-world sample identification scenarios. While the hit rates for shorter queries are comparable to the baseline, our framework exhibits significantly superior performance for longer queries (14.1pp increase in top-10 hit rate for 20-second-long queries). The progressive improvement in hit rates with increasing query length shows that our approach effectively aggregates segment-level confidence scores to retrieve the correct reference song.
6.2 Retrieval Refinement via Cross-Attention Classifier
To examine the impact of the cross-attention classifier as a retrieval refinement step, we conduct an ablation study. Table 2 shows that incorporating the classifier (MHCA) to rank retrieved results improves mAP by 2.6pp, confirming the utility of this refinement stage. Additionally, to evaluate the classifier’s capability to reject irrelevant matches, we construct a balanced validation set comprising 300 positive query-reference pairs and 300 negative pairs drawn from the “noise” data described in Section 4.2. The classifier achieves an AUROC score of 0.776, indicating that it does not perfectly discriminate between genuine and confounding examples. Thus, the observed improvement in retrieval performance can be attributed to the combined effect of the two-stage retrieval process, rather than solely to the rejection capability of the classifier.
6.3 Performance by Sample Characteristics
To understand the performance of the model across sample characteristics, we computed the mAP for different categories of Sample100. As shown in Table 4, there is a modest performance gap between melodic/harmonic riff samples and percussive beat samples (the two 1-note samples were not taken into account). This may be attributed to the nature of beat samples, which consist primarily of drums that are often subject to overdubbing techniques where producers layer additional percussion elements, and that might also be buried deeper in the mix beneath other instrumentation, potentially making them less salient for the GNN to capture. Further analysis by looking at the specific instrumentation in the reference and query, or by applying source separation at detection time, is left for future work.
             Type                      Time stretching
             Riff    Beat    1-note    >5%     <5%
# samples    71      33      2         44      62
mAP          0.471   0.391   -         0.340   0.503
Table 4. Performance according to sample class.
A significant performance gap was observed in relation to time stretching, where we divided samples into those subjected to minimal time stretching (<5%) and those with significant time stretching (>5%). This 16.3pp difference in performance shows that although our model is robust to some degree of time-stretching, large changes in tempo fundamentally alter the temporal relationships between audio features that our model relies on for identification.
7. CONCLUSION AND FUTURE WORK
This paper presents a lightweight GNN-based approach to automatic sample identification that achieves state-of-the-art performance while using only 9% of the parameters compared to previous methods. Our key contributions include adapting a GNN encoder for sample identification, introducing a cross-attention classifier for refining retrieval results, and extending the Sample100 dataset with fine-grained temporal annotations that enable granular evaluation.
Our results show that the proposed framework achieves a mAP of 44.2%, with strong performance on melodic-harmonic samples and samples with low time-stretching. Our framework’s cross-attention stage is useful in refining the ranking, and introduces rejection capabilities.
Future work should explore end-to-end training methods robust to sampling transformations, and explore integrating source separation during inference to improve performance on heavily masked samples. Newly available annotations can be leveraged for analysis of how specific attributes of samples (instrumentation type, interpolation, genre) affect identification accuracy, and we hope the release of our extended Sample100 dataset will aid the development of specialized techniques that address the most challenging cases in ASID.
8. ACKNOWLEDGMENTS
We thank Matthew Alan Walford for his contributions to the new annotations of the Sample100 dataset. This research utilised Queen Mary’s Apocrita HPC facility, supported by QMUL Research-IT (http://doi.org/10.5281/zenodo.438045). This work was supported by UKRI - Innovate UK (Project no. 10102241). A. Bhattacharjee is a research student at the UKRI Centre for Doctoral Training in Artificial Intelligence and Music, supported jointly by UK Research and Innovation [grant number EP/S022694/1] and Queen Mary University of London.
9. REFERENCES
[1] K. McLeod, P. DiCola, J. Toomey, and K. Thomson, Creative License: The Law and Culture of Digital Sampling. Durham, NC, USA: Duke University Press, 2011.
[2] J. Demers, Steal This Music: How Intellectual Property Law Affects Musical Creativity. Athens, GA, USA: University of Georgia Press, 2006.
[3] J. V. Balen, “Automatic recognition of samples in musical audio,” Master’s thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2011.
[4] S. Gururani and A. Lerch, “Automatic sample detection in polyphonic music,” in Proc. of the 18th Int. Society for Music Information Retrieval Conf., Suzhou, China, 2017.
[5] D. Ellis, “audfprint: Landmark-based audio fingerprinting,” 2014, software. [Online]. Available: https://github.com/dpwe/audfprint
[6] J. Six and M. Leman, “Panako - a scalable acoustic fingerprinting system handling time-scale and pitch modification,” in Proc. of the 15th Int. Society for Music Information Retrieval Conf., Taipei, Taiwan, 2014.
[7] H. Cheston, J. V. Balen, and S. Durand, “Automatic identification of samples in hip-hop music via multi-loss training and an artificial dataset,” 2025, arXiv preprint arXiv:2502.06364. [Online]. Available: https://arxiv.org/abs/2502.06364
[8] X. Du, Z. Yu, B. Zhu, X. Chen, and Z. Ma, “Bytecover: Cover song identification via multi-loss training,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 2021, pp. 551–555.
[9] X. Xu, X. Chen, and D. Yang, “Key-invariant convolutional neural network toward efficient cover song identification,” in Proc. IEEE Int. Conf. Multimedia and Expo, San Diego, CA, USA, 2018.
[10] S. Chang, D. Lee, J. Park, H. Lim, K. Lee, K. Ko, and Y. Han, “Neural audio fingerprint for high-specific audio retrieval based on contrastive learning,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 2021, pp. 3025–3029.
[11] A. Singh, K. Demuynck, and V. Arora, “Attention-based audio embeddings for query-by-example,” in Proc. of the 23rd Int. Society for Music Information Retrieval Conf., Bengaluru, India, 2022, pp. 52–58.
[12] G. Li, M. Muller, A. Thabet, and B. Ghanem, “Deepgcns: Can gcns go as deep as cnns?” in Proc. IEEE/CVF Int. Conf. Computer Vision, Seoul, Korea, 2019, pp. 9267–9276.
[13] A. Bhattacharjee, S. Singh, and E. Benetos, “Grafprint: A gnn-based approach for audio identification,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Hyderabad, India, 2025.
[14] S. Singh, C. J. Steinmetz, E. Benetos, H. Phan, and D. Stowell, “Atgnn: Audio tagging graph neural network,” IEEE Signal Processing Letters, 2024.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Łukasz Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[16] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proc. Int. Conf. Machine Learning, 2020, pp. 1597–1607.
[17] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “Fma: A dataset for music analysis,” in Proc. of the 18th Int. Society for Music Information Retrieval Conf., Suzhou, China, 2017.
[18] F. Foscarin, J. Schlüter, and G. Widmer, “Beat this! accurate beat tracking without dbn postprocessing,” in Proc. of the 25th Int. Society for Music Information Retrieval Conf., San Francisco, CA, USA, 2024.
[19] S. Rouard, F. Massa, and A. Défossez, “Hybrid transformers for music source separation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023.
[20] C. Cannam, C. Landone, and M. Sandler, “Sonic visualiser: An open source application for viewing, analysing, and annotating music audio files,” in Proc. ACM Multimedia Int. Conf., Firenze, Italy, 2010, pp. 1467–1468.
[21] J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with gpus,” IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019.
[22] J. S. Downie, “The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval research,” Acoustical Science and Technology, vol. 29, no. 4, pp. 247–255, 2008.