DISCOGS-VINET-MIREX
R. Oğuz Araz¹, Joan Serrà², Xavier Serra¹, Yuki Mitsufuji², Dmitry Bogdanov¹
¹ Universitat Pompeu Fabra, Music Technology Group, Barcelona
² Sony AI
[email protected]
ABSTRACT
This technical report presents our submission to the cover song identification task of the 2024 edition of the Music Information Retrieval Evaluation eXchange (MIREX). For this submission, we enhanced our Discogs-VINet model by changing the definition of an epoch, incorporating automatic mixed precision (AMP) during both training and inference, and sampling four versions per clique during triplet mining (which became possible with AMP). Due to this enhanced model’s performance on the Discogs-VI test set, we trained a new model from scratch using the entire Discogs-VI dataset, rather than just the training partition used in Discogs-VINet (a 45% increase in the number of versions). This enhanced and retrained model is named Discogs-VINet-MIREX.
1. INTRODUCTION
Version identification (VI), also known as cover song identification (CSI), aims to identify the different versions of a musical work from a collection of tracks [1]. The identification process relies on generating digital audio representations of tracks, where the representations of versions of the same musical work are designed to be close to each other compared to non-version tracks. In contemporary VI approaches, closeness is typically measured using a vector space operation such as the cosine similarity or Euclidean distance. Audio representations are obtained by training neural networks on datasets composed of multiple sets of versions, known as cliques. During the retrieval phase, the pre-trained neural network creates such representations, which are used for identifying versions.
Datasets such as Da-TACOS [2] or SHS100K [3] were commonly used to train neural networks for VI. However, the relatively small size of these datasets became a limiting factor in advancing VI models. To address this challenge, the Discogs-VI-YT dataset was recently introduced [4]. This new dataset offers a significant improvement, containing more than four times the number of versions found in the SHS100K dataset and approximately ten times the amount of cliques.

© R. O. Araz, J. Serrà, X. Serra, Y. Mitsufuji, and D. Bogdanov. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: R. O. Araz, J. Serrà, X. Serra, Y. Mitsufuji, and D. Bogdanov, “Discogs-VINet-MIREX”, in Proc. of the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024.

With the release of the Discogs-VI-YT dataset, we also introduced the Discogs-VINet model [4], built upon the CQT-Net [5] architecture. Discogs-VINet was designed to be trainable on a single commercial-grade GPU such as the NVIDIA RTX2080, and to exclude any data augmentations during training, highlighting the dataset’s potential without additional pre-processing techniques. Given the presence of about 80,000 training set cliques, training the model with a classification objective was impractical on a commercial GPU. As a result, the model was trained exclusively using the triplet loss. In this submission, we improve the model in several aspects.
2. PROPOSED SYSTEM
2.1 Model
For this submission, we implemented several key modifications to the Discogs-VINet model. First, we used non-deterministic CUDA operations to cut back on the considerably long training time to handle the large amount of data in the Discogs-VI-YT dataset. Then, we introduced automatic mixed precision (AMP) training with the same goal. Importantly, besides increasing computation speed, AMP enables larger batch sizes, which are important for effective triplet mining strategies. During each training iteration, we randomly sample 54 cliques, with four versions selected from each clique, resulting in a total batch size of 216. If a clique contains fewer than four versions, we duplicate versions as needed. Each version in the batch serves as an anchor sample during triplet mining.
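
The batch construction described above (54 cliques, four versions each, 216 items in total) can be sketched as follows. This is a minimal illustration, not the submission's code: the function name `sample_batch`, the dictionary layout of `cliques`, and the choice to duplicate randomly picked versions when a clique is too small are all assumptions, since the text only says that versions are duplicated as needed.

```python
import random


def sample_batch(cliques, n_cliques=54, versions_per_clique=4, rng=random):
    """Return a list of (clique_id, version_id) pairs.

    `cliques` is assumed to map a clique id to a list of version ids.
    """
    chosen = rng.sample(sorted(cliques), n_cliques)
    batch = []
    for clique_id in chosen:
        versions = cliques[clique_id]
        if len(versions) >= versions_per_clique:
            # Enough versions: pick four distinct ones at random.
            picked = rng.sample(versions, versions_per_clique)
        else:
            # Too few versions: keep all of them, then pad with random
            # duplicates (the duplication policy is an assumption).
            extra = rng.choices(versions, k=versions_per_clique - len(versions))
            picked = list(versions) + extra
        batch.extend((clique_id, version) for version in picked)
    return batch
```

With the default arguments the returned list always has 54 × 4 = 216 entries, and every entry can then act as an anchor during triplet mining.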
We utilize online triplet mining with random positives and hard negatives. An epoch is defined as one pass over the entire training set where each version has been used as the anchor sample once. We train the model for 40 epochs with this new and improved definition. For each version, we extract 7,600 constant-Q transform frames (about 176.4 s with 22,050 Hz sampling rate and a hop size of 512 samples). The learning rate was scheduled using cosine annealing with five warm-up steps, starting at 0.0001, reaching 0.01 at the end of the warm-up, and annealing down to 0.00001 by the end of training. The model generates 512-dimensional embeddings, and we apply L2-normalization after the last layer to ensure the embeddings lie on the unit hypersphere. This version of the model was trained on the entire Discogs-VI-YT dataset, which contains approximately 493,000 versions of about 98,000 compositions. We also increased the number of parameters to 8 million from 5 million by using multiples of 40 channels in the convolutional layers instead of 32. The triplet loss margin was set to 0.3, with all other parameters following the Discogs-VINet configuration.
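
To make the mining strategy concrete, the following sketch computes a batch triplet loss with a random positive and the hardest negative for every anchor, using the margin of 0.3 stated above. It is an illustration under assumptions: the distance function (cosine distance on L2-normalized embeddings), the averaging over anchors, and the names `cosine_distance` and `batch_triplet_loss` are not taken from the submission.

```python
import random


def cosine_distance(a, b):
    # Embeddings are assumed to be L2-normalized, so 1 - <a, b> is the
    # cosine distance (the choice of distance is an assumption).
    return 1.0 - sum(x * y for x, y in zip(a, b))


def batch_triplet_loss(embeddings, labels, margin=0.3, rng=random):
    """Mean triplet loss over a batch; every item serves as an anchor once."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [j for j, l in enumerate(labels) if l == label and j != i]
        negatives = [j for j, l in enumerate(labels) if l != label]
        if not positives or not negatives:
            continue  # no valid triplet for this anchor
        # Random positive: any other version from the same clique.
        p = rng.choice(positives)
        # Hard negative: the closest item from a different clique.
        n = min(negatives, key=lambda j: cosine_distance(anchor, embeddings[j]))
        d_pos = cosine_distance(anchor, embeddings[p])
        d_neg = cosine_distance(anchor, embeddings[n])
        losses.append(max(0.0, d_pos - d_neg + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

Because the hard negative is searched within the batch, a larger batch gives each anchor more candidate negatives, which is consistent with the remark that AMP-enabled batch sizes help triplet mining.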
2.2 Re ie al
We use maximum inner product search for retrieval. Since our model creates L2-normalized representations, this is equivalent to maximum cosine similarity search.
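
The equivalence holds because cosine similarity is the inner product divided by the two vector norms, and both norms equal 1 after L2-normalization. The sketch below illustrates this with plain Python lists; the helper names are hypothetical, and a real system would likely use a vectorized or indexed search rather than this brute-force loop.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def rank_by_inner_product(query, database):
    """Return database indices sorted from most to least similar."""
    scores = [inner_product(query, item) for item in database]
    return sorted(range(len(database)), key=lambda i: scores[i], reverse=True)
```

Ranking normalized embeddings with `rank_by_inner_product` yields the same order as ranking the unnormalized vectors by `cosine_similarity`.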
2.3 E alua ion
We trained Discogs-VINet-MIREX on the entire Discogs-VI-YT dataset, and since Discogs-VI-YT contains most of the cliques of both SHS100K and Da-TACOS datasets, we did not evaluate the model on any external datasets. However, during training we monitored the validation set performance to guarantee over-fitting, and the model achieved about 0.91 mAP and 4 MR1 performance by the end of training. This final model was benchmarked in the competition.
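
For readers unfamiliar with the two reported metrics, the sketch below shows how mean average precision (mAP) and the mean rank of the first relevant item (MR1) are commonly computed from ranked retrieval results. It assumes each query comes with a best-first list of booleans marking which retrieved items are versions of the query; the exact evaluation protocol of the submission (for example, how the query itself is excluded) is not specified here, and the function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average of precision values at the rank of each relevant item."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def first_relevant_rank(ranked_relevance):
    """1-based rank of the first relevant item, or None if there is none."""
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            return rank
    return None


def evaluate(rankings):
    """Return (mAP, MR1); assumes at least one query has a relevant item."""
    aps = [average_precision(r) for r in rankings]
    ranks = [first_relevant_rank(r) for r in rankings]
    ranks = [r for r in ranks if r is not None]
    return sum(aps) / len(aps), sum(ranks) / len(ranks)
```

Higher mAP is better (1.0 is perfect), while lower MR1 is better (1 means the top result is always a correct version).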
3. ACKNOWLEDGMENTS
This work is supported by the pre-doctoral program AGAUR-FI ajuts (2024 FI-3 00065) Joan Oró, funded by the Secretaria d’Universitats i Recerca del Departament de Recerca i Universitats de la Generalitat de Catalunya; and the Cátedras ENIA program “IA y Música: Cátedra en Inteligencia Artificial y Música” (TSI-100929-2023-1), funded by the Secretaría de Estado de Digitalización e Inteligencia Artificial and the European Union-Next Generation EU.
4. REFERENCES
[1] F. Yesiler, G. Doras, R. M. Bittner, C. J. Tralie, and J. Serrà, “Audio-Based Musical Version Identification: Elements and challenges,” IEEE Signal Processing Magazine, vol. 38, no. 6, pp. 115–136, 2021.
[2] F. Yesiler, C. Tralie, A. Correya, D. F. Silva, P. Tovstogan, E. Gómez, and X. Serra, “Da-TACOS: A Dataset for Cover Song Identification and Understanding,” in Proc. of the 20th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2019.
[3] X. Xu, X. Chen, and D. Yang, “Key-Invariant Convolutional Neural Network Toward Efficient Cover Song Identification,” in IEEE Int. Conf. on Multimedia and Expo (ICME), 2018.
[4] R. O. Araz, X. Serra, and D. Bogdanov, “Discogs-VI: A musical version identification dataset based on public editorial metadata,” in Proc. of the 25th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2024.
[5] Z. Yu, X. Xu, X. Chen, and D. Yang, “Learning a Representation for Cover Song Identification Using Convolutional Neural Network,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020.