Discogs-VINet-MIREX

Author: Araz, Recep Oğuz; Serrà, Joan; Serra, Xavier; Mitsufuji, Yuki; Bogdanov, Dmitry

Publisher: Zenodo

DOI: 10.5281/zenodo.17335627

Source: https://zenodo.org/records/17335627/files/R_Oguz_Araz-MIREX2024.pdf

DISCOGS-VINET-MIREX
R. Oğuz Araz¹  Joan Serrà²  Xavier Serra¹
Yuki Mitsufuji²  Dmitry Bogdanov¹
¹ Universitat Pompeu Fabra, Music Technology Group, Barcelona
² Sony AI
[email protected]
ABSTRACT
This technical report presents our submission to the cover song identification task of the 2024 edition of the Music Information Retrieval Evaluation eXchange (MIREX). For this submission, we enhanced our Discogs-VINet model by changing the definition of an epoch, incorporating automatic mixed precision (AMP) during both training and inference, and sampling four versions per clique during triplet mining (which became possible with AMP). Due to this enhanced model’s performance on the Discogs-VI test set, we trained a new model from scratch using the entire Discogs-VI dataset, rather than just the training partition used in Discogs-VINet (a 45% increase in the number of versions). This enhanced and retrained model is named Discogs-VINet-MIREX.
1. INTRODUCTION
Version identification (VI), also known as cover song identification (CSI), aims to identify the different versions of a musical work from a collection of tracks [1]. The identification process relies on generating digital audio representations of tracks, where the representations of versions of the same musical work are designed to be close to each other compared to non-version tracks. In contemporary VI approaches, closeness is typically measured using a vector space operation such as the cosine similarity or Euclidean distance. Audio representations are obtained by training neural networks on datasets composed of multiple sets of versions, known as cliques. During the retrieval phase, the pre-trained neural network creates such representations, which are used for identifying versions.
Datasets such as Da-TACOS [2] or SHS100K [3] were commonly used to train neural networks for VI. However, the relatively small size of these datasets became a limiting factor in advancing VI models. To address this challenge, the Discogs-VI-YT dataset was recently introduced [4]. This new dataset offers a significant improvement, containing more than four times the number of versions found in the SHS100K dataset and approximately ten times the number of cliques.
© R. O. Araz, J. Serrà, X. Serra, Y. Mitsufuji, and D. Bogdanov. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: R. O. Araz, J. Serrà, X. Serra, Y. Mitsufuji, and D. Bogdanov, “Discogs-VINet-MIREX”, in Proc. of the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024.
With the release of the Discogs-VI-YT dataset, we also introduced the Discogs-VINet model [4], built upon the CQT-Net [5] architecture. Discogs-VINet was designed to be trainable on a single commercial-grade GPU such as the NVIDIA RTX2080, and to exclude any data augmentations during training, highlighting the dataset’s potential without additional pre-processing techniques. Given the presence of about 80,000 training set cliques, training the model with a classification objective was impractical on a commercial GPU. As a result, the model was trained exclusively using the triplet loss. In this submission, we improve the model in several aspects.
2. PROPOSED SYSTEM
2.1 Model
For this submission, we implemented several key modifications to the Discogs-VINet model. First, we used non-deterministic CUDA operations to cut back on the considerably long training time to handle the large amount of data in the Discogs-VI-YT dataset. Then, we introduced automatic mixed precision (AMP) training with the same goal. Importantly, besides increasing computation speed, AMP enables larger batch sizes, which are important for effective triplet mining strategies. During each training iteration, we randomly sample 54 cliques, with four versions selected from each clique, resulting in a total batch size of 216. If a clique contains fewer than four versions, we duplicate versions as needed. Each version in the batch serves as an anchor sample during triplet mining.
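The batch construction described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `sample_batch`, the dictionary layout of `toy_cliques`, and the choice to fill short cliques by randomly repeating their versions are all assumptions made for the example.

```python
import random


def sample_batch(cliques, n_cliques=54, versions_per_clique=4, rng=None):
    """Return (clique_id, version_id) pairs for one training iteration.

    `cliques` maps a clique id to the list of its version ids
    (a hypothetical layout chosen for this sketch).
    """
    rng = rng or random.Random()
    batch = []
    for clique_id in rng.sample(sorted(cliques), n_cliques):
        versions = cliques[clique_id]
        if len(versions) >= versions_per_clique:
            chosen = rng.sample(versions, versions_per_clique)
        else:
            # Fewer than four versions: keep all of them and repeat
            # randomly chosen ones until the quota is filled (assumed policy).
            extra = rng.choices(versions, k=versions_per_clique - len(versions))
            chosen = list(versions) + extra
        batch.extend((clique_id, version) for version in chosen)
    return batch


# Toy data: 60 cliques holding between 2 and 6 versions each.
toy_rng = random.Random(0)
toy_cliques = {
    c: [f"c{c}_v{i}" for i in range(toy_rng.randint(2, 6))] for c in range(60)
}
batch = sample_batch(toy_cliques, rng=random.Random(1))
print(len(batch))  # 54 cliques x 4 versions = 216
```

Because every clique contributes exactly four items, each anchor in the batch has three candidate positives (possibly duplicates) and 212 candidate negatives.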
We utilize online triplet mining with random positives and hard negatives. An epoch is defined as one pass over the entire training set where each version has been used as the anchor sample once. We train the model for 40 epochs with this new and improved definition. For each version, we extract 7,600 constant-Q transform frames (about 176.4 s with 22,050 Hz sampling rate and a hop size of 512 samples). The learning rate was scheduled using cosine annealing with five warm-up steps, starting at 0.0001, reaching 0.01 at the end of the warm-up, and annealing down to 0.00001 by the end of training. The model generates 512-dimensional embeddings, and we apply L2-normalization after the last layer to ensure the embeddings lie on the unit hypersphere. This version of the model was trained on the entire Discogs-VI-YT dataset, which contains approximately 493,000 versions of about 98,000 compositions. We also increased the number of parameters to 8 million from 5 million by using multiples of 40 channels in the convolutional layers instead of 32. The triplet loss margin was set to 0.3, with all other parameters following the Discogs-VINet configuration.
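A minimal sketch of online triplet mining with a random positive and the hardest negative per anchor is given below, using the margin of 0.3 stated above. It is not the authors' implementation: the Euclidean distance, the function names, and the two-dimensional toy embeddings are assumptions for illustration, and a real training loop would operate on batched tensors on the GPU.

```python
import math
import random


def l2_normalize(vector):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def euclidean(a, b):
    """Euclidean distance (assumed here; the report does not name the distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss_random_pos_hard_neg(embeddings, labels, margin=0.3, rng=None):
    """Mean triplet loss where every item in the batch acts as an anchor."""
    rng = rng or random.Random()
    losses = []
    for i, anchor in enumerate(embeddings):
        positives = [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
        negatives = [j for j, lab in enumerate(labels) if lab != labels[i]]
        if not positives or not negatives:
            continue  # no valid triplet for this anchor
        pos = rng.choice(positives)  # random positive
        # Hard negative: the non-version closest to the anchor.
        neg = min(negatives, key=lambda j: euclidean(anchor, embeddings[j]))
        d_ap = euclidean(anchor, embeddings[pos])
        d_an = euclidean(anchor, embeddings[neg])
        losses.append(max(0.0, d_ap - d_an + margin))
    return sum(losses) / len(losses) if losses else 0.0


labels = [0, 0, 1, 1]  # two cliques with two versions each
separated = [l2_normalize(v) for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9])]
collapsed = [l2_normalize([1.0, 0.0]) for _ in labels]

loss_separated = triplet_loss_random_pos_hard_neg(separated, labels, rng=random.Random(0))
loss_collapsed = triplet_loss_random_pos_hard_neg(collapsed, labels, rng=random.Random(0))
print(loss_separated, loss_collapsed)
```

When the cliques are well separated the loss is zero, and when all embeddings coincide every triplet incurs a loss equal to the margin, which is the behavior the margin is meant to enforce.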
2.2 Retrieval
We use maximum inner product search for retrieval. Since our model creates L2-normalized representations, this is equivalent to maximum cosine similarity search.
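The equivalence can be checked with a small brute-force example: ranking by inner product over L2-normalized vectors gives the same order as ranking by cosine similarity over the raw vectors. The vectors and helper names below are hypothetical, and a deployed system would likely use a vectorized or approximate search library rather than this exhaustive loop.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def rank(query, collection, score):
    """Indices of `collection` sorted from best to worst match for `query`."""
    return sorted(
        range(len(collection)),
        key=lambda i: score(query, collection[i]),
        reverse=True,
    )


# Toy, un-normalized embeddings (hypothetical values).
query = [1.0, 2.0, 3.0]
collection = [[3.0, 2.0, 1.0], [1.0, 2.0, 3.5], [-1.0, 0.0, 1.0], [2.0, 4.0, 6.1]]

# Cosine similarity on the raw vectors.
ranking_cos = rank(query, collection, cosine_similarity)

# Inner product on the L2-normalized vectors.
unit_query = l2_normalize(query)
unit_collection = [l2_normalize(v) for v in collection]
ranking_ip = rank(unit_query, unit_collection, inner_product)

print(ranking_ip == ranking_cos)  # True
```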
2.3 Evaluation
We trained Discogs-VINet-MIREX on the entire Discogs-VI-YT dataset, and since Discogs-VI-YT contains most of the cliques of both SHS100K and Da-TACOS datasets, we did not evaluate the model on any external datasets. However, during training we monitored the validation set performance to guarantee over-fitting, and the model achieved about 0.91 mAP and 4 MR1 performance by the end of training. This final model was benchmarked in the competition.
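For reference, the two reported metrics can be computed from ranked retrieval results as sketched below. The report does not spell out its exact evaluation protocol, so this assumes the usual definitions: mAP as the mean over queries of average precision, and MR1 as the mean rank of the first correctly retrieved version. The function names and toy relevance lists are hypothetical.

```python
def average_precision(relevance):
    """Average precision for one query, given a ranked list of booleans."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def first_relevant_rank(relevance):
    """1-based rank of the first relevant item, or None if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return rank
    return None


def evaluate(ranked_relevance):
    """Return (mAP, MR1) over a list of per-query ranked relevance lists."""
    aps = [average_precision(r) for r in ranked_relevance]
    ranks = [first_relevant_rank(r) for r in ranked_relevance]
    ranks = [r for r in ranks if r is not None]  # skip queries with no version
    return sum(aps) / len(aps), sum(ranks) / len(ranks)


# Two toy queries; True marks a retrieved track from the query's own clique.
toy_results = [
    [True, False, True],   # AP = (1/1 + 2/3) / 2, first hit at rank 1
    [False, True, False],  # AP = 1/2, first hit at rank 2
]
mean_ap, mr1 = evaluate(toy_results)
print(round(mean_ap, 4), mr1)  # 0.6667 1.5
```

Higher mAP is better, while lower MR1 is better, since an MR1 of 1 would mean the top result is always a correct version.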
3. ACKNOWLEDGMENTS
This work is supported by the pre-doctoral program AGAUR-FI ajuts (2024 FI-3 00065) Joan Oró, funded by the Secretaria d’Universitats i Recerca del Departament de Recerca i Universitats de la Generalitat de Catalunya; and the Cátedras ENIA program “IA y Música: Cátedra en Inteligencia Artificial y Música” (TSI-100929-2023-1), funded by the Secretaría de Estado de Digitalización e Inteligencia Artificial and the European Union-Next Generation EU.
4. REFERENCES
[1] F. Yesiler, G. Doras, R. M. Bittner, C. J. Tralie, and J. Serrà, “Audio-Based Musical Version Identification: Elements and challenges,” IEEE Signal Processing Magazine, vol. 38, no. 6, pp. 115–136, 2021.
[2] F. Yesiler, C. Tralie, A. Correya, D. F. Silva, P. Tovstogan, E. Gómez, and X. Serra, “Da-TACOS: A Dataset for Cover Song Identification and Understanding,” in Proc. of the 20th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2019.
[3] X. Xu, X. Chen, and D. Yang, “Key-Invariant Convolutional Neural Network Toward Efficient Cover Song Identification,” in IEEE Int. Conf. on Multimedia and Expo (ICME), 2018.
[4] R. O. Araz, X. Serra, and D. Bogdanov, “Discogs-VI: A musical version identification dataset based on public editorial metadata,” in Proc. of the 25th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2024.
[5] Z. Yu, X. Xu, X. Chen, and D. Yang, “Learning a Representation for Cover Song Identification Using Convolutional Neural Network,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020.
