ON THE DE-DUPLICATION OF THE LAKH MIDI DATASET
Eunjin Choi1, Hyerin Kim2, Jiwoo Ryu2, Juhan Nam1, Dasaem Jeong2
1 Graduate School of Culture Technology, KAIST, South Korea
2 Department of Art & Technology, Sogang University, South Korea
{jech,juhan.nam}@kaist.ac.kr, {kime0225, clayryu338}@gmail.com, {dasaemj}@sogang.ac.kr
ABSTRACT
A large-scale dataset is essential for training a well-generalized deep-learning model. Most such datasets are collected via scraping from various internet sources, inevitably introducing duplicated data. In the symbolic music domain, these duplicates often come from multiple user arrangements and metadata changes after simple editing. However, despite critical issues such as unreliable training evaluation from data leakage during random splitting, dataset duplication has not been extensively addressed in the MIR community. This study investigates the dataset duplication issues regarding the Lakh MIDI Dataset (LMD), one of the largest publicly available sources in the symbolic music domain. To find and evaluate the best retrieval method for duplicated data, we employed the Clean MIDI subset of the LMD as a benchmark test set, in which different versions of the same songs are grouped together. We first evaluated rule-based approaches and previous symbolic music retrieval models for de-duplication and also investigated a contrastive learning-based BERT model with various augmentations to find duplicate files. As a result, we propose three different versions of the filtered list of LMD, which filters out at least 38,134 samples in the most conservative settings among 178,561 files.
1. INTRODUCTION
As data-driven approaches become mainstream and huge generative neural networks become popular, the significance of large-scale datasets is increasing. The tremendous performance of generative models has been attributed to massive dataset availability. For example, the availability of billions of text-image pair data boosted high-quality image synthesis from text in the computer vision domain [1]. Among several dataset collection strategies, web scraping [2] or synthesis [3] has emerged as the most practical method for assembling large-scale datasets, given the high costs associated with manual collection.
However, collecting large-scale datasets via web crawling can inevitably cause problems, including privacy, copyright, and data duplication issues. In this paper, we focus on data duplication, particularly highlighting issues in the use of the Lakh MIDI Dataset (LMD) [2], which is widely recognized as one of the large-scale datasets in MIR. While there are studies in other fields that discuss the downsides of dataset duplication [4] and the benefits of de-duplication [5], we found that such discussions are notably lacking in the music information retrieval (MIR) community. One related work is the examination of dataset validity issues in GTZAN [6,7], and our work shares this intent by addressing the correct usage of LMD through de-duplication.
© F. Author, S. Author, and T. Author. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: F. Author, S. Author, and T. Author, “On the de-duplication of the Lakh MIDI dataset”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In particular, we argue that the current use of the LMD in symbolic music generation impairs the validity of existing experiments. The issue arises from unreliable training evaluations caused by data leakage during random splits. Duplicates across training, validation, and test splits can skew evaluation metrics such as cross-entropy loss, which many previous studies used to assess their model performance [8–13]. In the music generation domain, subjective evaluation is costly, and objective evaluation metrics are often insufficient to fully capture the quality of generated music. Consequently, studies rely on validation loss as a metric to claim non-overfitting and model effectiveness [13]. Duplicates can also bias listening-based evaluations, particularly when the test split is used for conditioning. This is common practice in conditional music generation [10, 12–14]. If the duplication remains unaddressed, random splits that lead to data leakage will continue to undermine the reliability of evaluations.
However, manually finding duplicates in a large-scale dataset such as LMD is virtually infeasible. Therefore, we explored cleaning LMD using rule-based and neural approaches. Our contributions are as follows:
• How to clean the LMD?
We evaluated rule-based methods and existing symbolic music retrieval models for duplicate detection. We also explored training an unsupervised contrastive BERT model with augmentations to detect duplicates.
• How can we evaluate the de-duplication?
We propose an evaluation method to evaluate duplicate detection by utilizing the metadata of the Clean MIDI subset of LMD, referred to as LMD-clean in this paper.
• How many duplicates exist in the dataset?
Using the duplicate detection method with the best performance, we classify the duplicated files in the LMD.
Paper                Dataset Version         Split          Evaluation
MuseGAN [15]         LPD-5-matched           N.A.           .
MIDI-Sandwich2 [8]   LPD-full                N.A.           NLL
LakhNES [9]          LMD-full                train, valid   PPL
PopMAG [10]          LMD-matched             random         PPL, ♪
MMM [16]             LMD-full                N.A.           .
PiRhDy [17]          LMD-full                N.A.           .
MMD [18]             LMD-matched             N.A.           .
Han et al. [19]      LMD-full                N.A.           .
Museformer [11]      LMD-full                8:1:1          PPL
MusicBERT [20]       LMD-full                N.A.           .
FIGARO [12]          LMD-full                8:1:1          PPL, ♪
MIDI2Vec [21]        LMD-matched             9:1 with CV    .
YM2413-MDB [22]      LMD-full                9:1:1          .
Sulun et al. [23]    LM(P)D-full, matched    N.A.           .
Han et al. [24]      LMD-full                N.A.           .
Anticipatory [13]    LMD-full                87:6:6         PPL, ♪
text2midi [14]       LMD-full (MidiCaps)     N.A.           ♪
Table 1. Studies that used LMD for training. Studies that mentioned the dataset duplication issue are bolded. LPD is the piano roll version of LMD suggested by [15]. Split strategies are N.A. when not described in the paper. CV means cross-validation. The last column explains whether the authors utilized the dataset during the evaluation. ♪ means that the dataset is used in the listening test.
Even with the most conservative threshold for rejection, we find 38,134 duplicated files to be filtered. Finally, we present a filtering list of LMD from both our proposed configuration and the most conservative threshold (footnote 1).
2. RELATED WORKS
2.1 LMD and Related Studies
LMD [2] is a dataset released in 2016 with a method for efficiently matching a large-scale MIDI corpus collected from the Internet to the Million Song Dataset (MSD) [25]. At the time, the dataset was intended to be used for applications such as content-based retrieval, corpus studies of music structure and patterns, and transcription using paired audio and MIDI. Diverging from its initially suggested applications, this dataset has become frequently used for training symbolic music generation models, primarily because it is one of the largest available symbolic music datasets. Among the available LMD versions, LMD-full, which contains 178,561 files with unique MD5 hashes, is the most popular for multi-instrumental pop music generation, and LMD-matched, which consists of 45,129 songs matched with MSD, is also used in several studies.
Since the release of LMD, larger-scale datasets based on web scraping, such as MetaMIDI (MMD) [18] and GigaMIDI [26], have emerged. Notably, the recently released GigaMIDI dataset is a superset of LMD. Additionally, the introduction of the MidiCaps dataset [27], which incorporates LLM-generated text captions aligned with LMD, has further expanded applications of LMD in text-to-music generation [14] and music retrieval tasks [28,29].
As shown in Table 1, most studies employed the LMD-full and split it for training without mentioning the split strategies. Also, several papers employed the NLL loss or PPL values for their evaluation. We note that a few papers in Table 1 pointed out the duplication issues within the dataset and removed the duplicated files using a rule-based approach, such as MIDI encoding hash matching. However, we found that the rule-based approach is not sufficient to remove all of the duplicated files in the dataset, which we will show in the evaluation section. In this study, we used LMD-clean as our de-duplication study and test dataset. This dataset contains 17,184 files organized by the artist and song name in the directory and filename, which we found to have multiple MIDI files of the same song.
1 All training and evaluation code is publicly available: https://github.com/jech2/LMD_Deduplication
2.2 Dataset De-duplication and Related Issues
Recently, the need for dataset de-duplication has gained attention across various domains. In computer vision, [4] proposed a method to eliminate duplication by compressing CLIP features using a contrastive feature compression technique. In music information retrieval (MIR), [30] recently introduced a method for detecting exact duplicates in training data using audio-based music similarity metrics. Furthermore, in natural language processing, [5] investigated the effects of dataset de-duplication by applying exact substring matching and hash-based techniques. Their findings showed that removing duplicates improves language model performance, reduces training time, and lowers the rate of training data memorization without harming perplexity. In this work, we focus on the de-duplication method for large-scale symbolic music datasets.
2.3 Symbolic Music Understanding and Retrieval
With the advent of large-scale language understanding models such as BERT [31] and BART [32] in the natural language domain, several counterparts have been introduced for symbolic music, including MidiBERT-Piano [33], MusicBERT [20], and PianoBART [34]. Among these, MusicBERT is a large-scale pre-trained model trained on LMD and a private dataset. While earlier methods focused solely on the MIDI modality, recent approaches have explored symbolic music retrieval using text, led by the introduction of CLaMP [35], which learns joint embeddings of symbolic music and text through contrastive learning, using ABC notation as its input format. CLaMP2 [28] extends this approach by enhancing the symbolic encoder to support both ABC and MIDI formats with multilingual support. CLaMP3 [29] further generalizes the model to handle additional modalities, including audio and images. We used these models for our dataset de-duplication task by leveraging their embeddings.
3. DUPLICATION TYPES IN THE LMD
We describe the types of duplicated MIDI files observed in LMD-clean. In a broad sense, if we focus on music generation, all arrangements of the same song should be defined as duplication (i.e., the same song cannot be in different splits). However, in tasks such as music arrangement, different arrangements can be considered as different data samples. Considering this aspect, we separated the duplication type into two cases: hard duplication and soft duplication (footnote 2). This work focuses on detecting hard duplication.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3.1 Hard Duplication from Similar Arrangement
We define hard duplication as files that share identical sections or arrangements with minor differences. These differences include instrument mapping or order, tempo, start offset, file length, missing tracks, or note-level alterations such as pitch, duration, or velocity. Melodic variations may also appear, such as added ornamentation or changes in the number of chord tones played by a specific instrument. Key transpositions are considered hard duplication when other musical elements remain nearly identical but are treated as soft duplication if accompanied by significant stylistic or structural changes. We assume that some hard duplicates were likely collected as users modified and re-uploaded existing files originally created by other arrangers.
3.2 Soft Duplication from Different Arrangement
In soft duplication, MIDI files preserve essential musical elements such as melody, harmony, ornaments, and instrumentation but differ in arrangement style, reflecting the diverse styles of individual arrangers. Such duplication includes cases where the core melody remains unchanged or similar, but accompaniment styles (e.g., arpeggio, waltz), pitch ranges, and overall structure vary considerably. In cases of extremely different arrangements, these variations might lead a listener to perceive them as distinct songs.
4. DUPLICATE DETECTION: DATASET AND
CONVENTIONAL APPROACHES
Along with the dataset for evaluation, we first explored several approaches for identifying duplicates, including simple rule-based methods and pre-trained symbolic music retrieval models.
4.1 Dataset
We used LMD-clean as our evaluation benchmark to assess how well each method detects duplicates within the dataset. LMD-clean is organized by artist folders and song filenames, where duplicate instances of the same song by the same artist are labeled with variations in the filenames (e.g., Dancing Queen.mid, Dancing queen.2.mid). According to this metadata, 10,355 out of 17,184 files in LMD-clean are considered duplicates.
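As an illustration of how such filename metadata can be turned into duplicate labels, the following is a minimal sketch. The normalization rule (lowercasing and dropping a trailing numeric index such as ".2") is an assumption inferred from the example filenames above, not the exact rule used in this study, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def song_key(artist: str, filename: str) -> tuple[str, str]:
    """Map an (artist folder, filename) pair to a normalized song identity.

    Assumed rule: strip the '.mid' extension, strip a trailing numeric
    version index such as '.2', and compare case-insensitively.
    """
    stem = re.sub(r"\.mid$", "", filename, flags=re.IGNORECASE)
    stem = re.sub(r"\.\d+$", "", stem)  # drop a trailing index such as '.2'
    return artist.strip().lower(), stem.strip().lower()


def group_duplicates(paths):
    """Group (artist, filename) pairs by song key; keep groups with 2+ files."""
    groups = defaultdict(list)
    for artist, filename in paths:
        groups[song_key(artist, filename)].append((artist, filename))
    return {key: files for key, files in groups.items() if len(files) > 1}
```

A rule this simple would also merge titles that legitimately end in a dot followed by digits, so any real labeling pass would need manual spot checks.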
4.2 Rule-based Approach
The following rule-based methods serve as baselines for identifying duplicates in the dataset. We assumed that hard duplicated samples share highly similar MIDI-level features. Based on this assumption, we explored several methods aimed at detecting and filtering out files with identical or nearly identical features at the beat or pitch level.
2 We show examples of duplication types in the companion website.
4.2.1 MIDI Encoding Hash
As discussed in Section 2.1, some studies [11, 20, 23] employed a hash-based approach to detect duplicated MIDI files with different metadata. Here, we used the file de-duplication code of MusicBERT [20]. The string versions of Octuple representations are encoded according to the MD5 hash value, and the hash values of all MIDI files from LMD-clean are compared.
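The idea of hashing an encoded token string can be sketched as below. This is not the MusicBERT code: the token tuples are a hypothetical stand-in for Octuple tokens, and the way they are joined into a string is an assumption made for illustration.

```python
import hashlib
from collections import defaultdict


def encoding_hash(tokens):
    """MD5 digest of the string form of a token sequence.

    `tokens` is a hypothetical stand-in for an Octuple-encoded file,
    e.g. a list of (bar, position, instrument, pitch, ...) tuples.
    """
    text = " ".join(str(token) for token in tokens)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def hash_duplicates(encoded_files):
    """Return groups of file names whose encodings hash to the same value."""
    buckets = defaultdict(list)
    for name, tokens in encoded_files.items():
        buckets[encoding_hash(tokens)].append(name)
    return [sorted(names) for names in buckets.values() if len(names) > 1]
```

Because the hash is computed on the musical encoding rather than on the raw bytes, two files that differ only in metadata collide, while any change to a single note breaks the match.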
4.2.2 Beat Position Entropy
In our preliminary study, we found there are many duplicates that have exactly the same music but with different instrument mapping or track order. To detect these duplicates, we applied a simple method that checks the distribution of note position within a bar using the MIDI encoding scheme of [36]. We computed entropy values from note position distributions at a 16th-note resolution. Files with identical entropy values were identified as hard duplicates.
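A minimal sketch of this entropy feature follows. It assumes 4/4 bars with 16 positions each, onsets already quantized to 16th-note steps, and entropy measured in bits; none of these details are specified in the text, and the encoding scheme of [36] is not reproduced.

```python
import math
from collections import Counter


def position_entropy(onsets_in_16ths, steps_per_bar=16):
    """Shannon entropy (bits) of note onset positions within a bar.

    `onsets_in_16ths` are absolute onset times in 16th-note steps.
    Instrument and track identity are ignored, so remapping instruments
    or reordering tracks leaves the value unchanged.
    """
    counts = Counter(onset % steps_per_bar for onset in onsets_in_16ths)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_similarity(entropy_a, entropy_b):
    """Similarity as 1 minus the absolute entropy difference (Section 6.1)."""
    return 1.0 - abs(entropy_a - entropy_b)
```

The feature is deliberately coarse: unrelated pieces with the same rhythmic grid can share an entropy value, which is one reason such a rule can produce false matches.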
4.2.3 Chroma-DTW
To detect duplicates with similar pitch content, we measured the chroma-level distance between MIDI files. Piano roll-based chromagrams were first generated and aligned by transposing them with the highest pitch occurrence across files. Dynamic Time Warping (DTW) was then applied to measure the similarity between aligned chromagrams. To reduce the computational cost of applying DTW to the entire dataset, we first computed pitch histograms of all files and measured the pairwise Kullback-Leibler (KL) divergence. For each file, we selected the top 250 candidates with the lowest KL divergence and then applied chromagram-based DTW to these candidates. Although this approach discards temporal information, it serves as a rough prefiltering step.
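The two-stage procedure (histogram prefilter, then DTW on the shortlisted candidates) can be sketched as follows. Several details are assumptions made for illustration: a 12-bin pitch-class histogram, additive smoothing with a small `eps` to keep the KL divergence finite, and an L1 frame cost inside DTW. The transposition alignment step is omitted.

```python
import math


def pitch_class_histogram(pitches, eps=1e-6):
    """Normalized 12-bin pitch-class histogram with additive smoothing."""
    hist = [eps] * 12
    for pitch in pitches:
        hist[pitch % 12] += 1.0
    total = sum(hist)
    return [h / total for h in hist]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def top_candidates(query, histograms, k=250):
    """Names of the k files whose histograms have the lowest KL from the query."""
    scores = [(kl_divergence(histograms[query], hist), name)
              for name, hist in histograms.items() if name != query]
    return [name for _, name in sorted(scores)[:k]]


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW over sequences of chroma frames."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = sum(abs(x - y) for x, y in zip(a[i - 1], b[j - 1]))
            cost[i][j] = frame_cost + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

The prefilter costs one cheap comparison per pair, while the quadratic DTW is only run on the shortlisted candidates for each file.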
4.3 Previous Symbolic Music Embedding Models
We utilized pre-trained MusicBERT [20] and the CLaMP model series [28, 29, 35] since they support multi-instrumental MIDI. For MusicBERT, we used the pre-trained MusicBERT-small and MusicBERT-base models for inference. For CLaMP, we used the pre-trained CLaMP-512, CLaMP-1024, CLaMP2 and CLaMP3 models. Since the CLaMP-512 and CLaMP-1024 models use XML for input files and ABC notation for their internal data representation, the MIDI files are first converted using Musescore batch processing and then converted with the XML to ABC conversion algorithm. For CLaMP 2 and 3, we converted MIDI to MTF, their MIDI encoding scheme.
5. DUPLICATE DETECTION: A CONTRASTIVE
LEARNING-BASED APPROACH
Previous pre-trained symbolic embedding models were not originally trained for duplicate detection. Inspired by [37,38] that evaluated the robustness of audio or music embedding models against perturbations such as pitch shift, we explore whether training with such perturbations would improve song identification despite variations. To this end, we developed a BERT-based model and applied various augmentations to the positive samples in contrastive learning, which we refer to as CAugBERT.
Level     Augmentation          Value
Note      Onset Shift           (-2, 2)
Note      Duration Shift        (-4, 4)
Note      Velocity Shift        (-3, 3)
Track     Pitch Octave Shift    (-24, 24)
Track     Inst Order            Shuffle
Track     Inst Mapping          Except Drum
Track     Inst Drop             Less than 50%
Track     Bar Drop              15%
Segment   Bar Shift             (1, 4)
Segment   Note Drop             15%
Segment   Pitch Transpose       (-6, 6)
Table 2. List of augmentations for generating MIDI variations. Values in braces are in the token level (inclusive).
5.1 Data Representation
We used the LMD-full dataset, excluding all files present in LMD-clean by matching MD5 hash values. The resulting dataset is referred to as LMD-filtered. We randomly split the LMD-filtered into a 98:1:1 ratio and pre-processed the MIDI files with Octuple encoding using MidiTok [39]. Although we initially held out the remaining 1% for testing, we did not use it in our final experiments, as we chose to evaluate our model on LMD-clean instead.
5.2 Data Augmentation
To reflect the types of variations described in Section 3.1, we constructed positive pairs for contrastive learning using different augmentations, referred to as MIDI variation augmentation. Each augmentation of MIDI variation is independently and randomly applied during training. The details of these augmentations are provided in Table 2. In addition, following [24], we used neighbor segments, which are different parts from the same piece, as positive pairs for training. Each piece was first segmented into chunks of 1024 tokens, and then a random segment was selected. MIDI variation augmentation was also applied to neighbor segments to enhance robustness.
5.3 Model Description
The implementation of CAugBERT is based on the code from [24], which applies contrastive learning to a BERT architecture. To align with the parameter settings of MusicBERT-small, we used a 4-layer transformer with a sequence length of 1024, hidden size and vocabulary embedding size of 512, and a feedforward dimension of 2048. We use a total batch size of 64 across two A6000 GPUs. For masked language modeling (MLM), we adopted the same element, compound, and bar-level masking strategies used in MusicBERT. Contrastive learning was guided by the NT-Xent loss [40]. The final loss was computed as a weighted sum of the MLM and contrastive (NT-Xent) losses, with weights of 0.3 and 1.0, respectively.
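For reference, the weighted objective can be written out as below. The NT-Xent term follows the standard definition from [40]; the symbols are notation introduced here for illustration: z_i and z_j are the embeddings of a positive pair, N is the number of pairs in a batch, sim is cosine similarity, and τ is a temperature whose value is not stated in the text.

```latex
\mathcal{L} = 0.3\,\mathcal{L}_{\mathrm{MLM}} + 1.0\,\mathcal{L}_{\mathrm{NT\text{-}Xent}},
\qquad
\ell_{i,j} = -\log
\frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
     {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]}\,\exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)}
```

Here the contrastive term is the average of ℓ over all positive pairs in the batch, taken in both orders (i, j) and (j, i).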
During training, each encoded MIDI segment was augmented using either MIDI variation or neighbor augmentation described in Section 5.2, to maximize the diversity of augmentations within each batch. For validation, fixed manual seed values were used to maintain consistent augmentations across validation batches. To evaluate the effectiveness of the contrastive learning approach for duplication detection, we conducted an ablation study on the contrastive loss, as presented in Table 3.
6. EVALUATION
We evaluated all approaches from two perspectives: (1) Does the system rank duplicates as more similar than others? (2) How accurately does it identify true duplicates? To answer these questions, we utilized the metrics that are commonly used for recommendation and retrieval systems.
6.1 Measuring Similarities
For MIDI Encoding Hash, the similarity between samples was set as 1 when encoding hashes matched. Similarity for beat position entropy was computed by subtracting the absolute difference in entropy from the maximum value of 1. For Chroma-DTW, the similarity was calculated as 1 minus the DTW distance. In the MusicBERT series, similarity was measured using the cosine similarity of the average token embeddings from the Transformer’s final hidden layer. For CAugBERT, we used the [CLS] token embedding from the final hidden layer. All BERT-based models utilized 512-dimensional embedding. For the CLaMP series, we used a pre-trained 768-dimensional embedding where the last hidden state was average pooled and passed through a projection layer, following the code provided in [28,29,35].
6.2 Evaluation with Retrieval Metrics
To evaluate how well each method assigns higher similarity scores to the duplicates, we adopt normalized Discounted Cumulative Gain (nDCG) and Mean Reciprocal Rank (MRR) as evaluation metrics. nDCG measures how highly relevant items are ranked, assigning higher scores when duplicates appear closer to the top of the retrieval list. It is computed by normalizing Discounted Cumulative Gain with the optimal ranking where all duplicates are retrieved at the highest possible ranks. For each query in LMD-clean, we assign relevance 1 to duplicates and 0 to others when computing nDCG. MRR is defined as the average, over queries, of the inverse rank of the highest-ranked relevant item. It therefore reflects how close to the top the most similar duplicate of each query is ranked.
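Both metrics can be computed from a ranked list of binary relevance labels. The sketch below uses the standard definitions with a log2 discount and scores the full ranking, as in nDCG@all; the function names are illustrative and this is not the evaluation code released with the paper.

```python
import math


def ndcg(relevance):
    """nDCG over a full ranked list of 0/1 relevance labels (best match first)."""
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevance))
    # Ideal ranking: all relevant items occupy the top positions.
    ideal = sum(1 / math.log2(rank + 2) for rank in range(sum(relevance)))
    return dcg / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0 if there is none."""
    for rank, rel in enumerate(relevance):
        if rel:
            return 1.0 / (rank + 1)
    return 0.0


def mean_reciprocal_rank(rankings):
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)
```

For example, a query whose only duplicate is retrieved second gets a reciprocal rank of 0.5 and an nDCG of 1/log2(3), roughly 0.63.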
For the nDCG and MRR metrics, neural network-based approaches outperformed the rule-based methods. Among them, the CLaMP model series consistently achieved higher scores than BERT-based models, with CLaMP3 showing the best overall performance. Since CLaMP models were specifically trained for retrieval tasks, the result is consistent with their intended design.
However, we observed that all approaches performed below a certain upper bound. In particular, while neural network-based models performed well on hard duplicates, they struggled with detecting soft duplicates. Manual inspection of retrieved samples revealed that similarity scores often failed to reflect perceptual similarity in these soft duplicate cases. For pre-trained symbolic music understanding and retrieval models, we can interpret this as a result of the pre-trained model not being optimized for duplication detection tasks. For the contrastive learning-based approach, detecting soft duplications involving different arrangements appears to be challenging under the current MIDI variation augmentation strategies, as they rely on relatively simple, rule-based augmentations.
Method                      nDCG@all   MRR     Precision   Recall   F1      FN
MIDI Encoding Hash          0.283      0.280   0.922       0.172    0.291   6,759
Beat Position Entropy       0.344      0.329   0.919       0.192    0.318   6,435
Chroma-DTW                  0.157      0.081   0.826       0.050    0.094   8,730
MusicBERT_small [20]        0.563      0.590   0.904       0.336    0.490   4,685
MusicBERT_base [20]         0.556      0.580   0.904       0.324    0.477   4,799
CLaMP-512 [35]              0.580      0.610   0.906       0.273    0.419   5,289
CLaMP-1024 [35]             0.646      0.673   0.903       0.317    0.469   4,762
CLaMP2 [28]                 0.627      0.647   0.903       0.117    0.208   7,384
CLaMP3 [29]                 0.697      0.709   0.902       0.210    0.341   6,017
BERT without Contrastive    0.549      0.580   0.901       0.253    0.395   5,630
CAugBERT                    0.558      0.586   0.903       0.339    0.493   4,623
Table 3. Overall performances of retrieval and classification metrics on LMD-clean. For nDCG and MRR, we report the mean value. For precision, recall, and F1 of neural network-based approaches, the performance with the best threshold (precision > 0.9) is reported. FN represents the false negative, the number of duplicates that the model failed to detect. Bold text indicates the best performance, while underline represents the second best.
Prediction Strategies         Precision   Recall   F1      FN
M_all                         0.881       0.411    0.561   3,734
M_rule                        0.897       0.233    0.370   5,858
MusicBERT ∪ CAugBERT          0.901       0.369    0.523   4,300
CLaMP ∪ CAugBERT*             0.899       0.395    0.548   3,954
CLaMP ∪ CAugBERT ∪ M_rule     0.888       0.395    0.547   3,917
Table 4. Classification performance using combinations of available methods on LMD-clean. CLaMP: CLaMP-1024 / MusicBERT: MusicBERT_small / M_all: the union of all methods / M_rule: the subset consisting only of rule-based methods. *: the proposed configuration that we used in the final LMD de-duplication process / FN: the false negative, where duplicates exist but are not predicted / Bold: the best performance / underline: the second best.
6.3 Evaluation with Classification Metrics
We report the precision and recall of each model. While F1-score is commonly used for deciding the best threshold to balance precision and recall, such thresholds may result in an unacceptably high rate of false positives in practical de-duplication. To ensure that our model comparison remains meaningful under realistic scenarios, we selected the lowest threshold that satisfies a precision of 0.9.
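This threshold rule can be sketched as a sweep over candidate pairs sorted by similarity. The sketch assumes a pair is predicted as a duplicate when its score is at or above the threshold and reads "satisfies a precision of 0.9" as precision ≥ 0.9; both are interpretations for illustration, and the function name is hypothetical.

```python
def lowest_threshold_at_precision(scores, labels, target=0.9):
    """Lowest similarity threshold whose predictions reach the target precision.

    scores: similarity per candidate pair; labels: 1 for a true duplicate, else 0.
    Returns (threshold, precision, recall), or None if no threshold qualifies.
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    total_positives = sum(labels)
    best = None
    true_positives = 0
    for i, (score, label) in enumerate(pairs):
        true_positives += label
        # Tied scores are accepted or rejected together, so only evaluate
        # once the whole run of equal scores has been consumed.
        if i + 1 < len(pairs) and pairs[i + 1][0] == score:
            continue
        precision = true_positives / (i + 1)
        if precision >= target:
            recall = true_positives / total_positives if total_positives else 0.0
            best = (score, precision, recall)  # later hits are lower thresholds
    return best
```

Choosing the lowest qualifying threshold maximizes recall under the precision constraint, which matches the goal of removing as many duplicates as possible without discarding many distinct files.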
According to Table 3, CAugBERT achieved the best recall and F1 score (0.339 and 0.493), marginally higher than the MusicBERT_small model (0.336 and 0.490). In addition, CAugBERT outperformed the BERT without contrastive loss, which implies that contrastive learning with augmentations is an effective scheme for duplicate detection. In contrast to the retrieval tasks, recent CLaMP models (CLaMP2 and CLaMP3) performed worse than the original CLaMP. We note that BERT-based and CLaMP-based models exhibit different performance trends in both retrieval and classification tasks, implying that the two approaches detect duplication from slightly different perspectives.
Similar to retrieval metrics, rule-based approaches consistently underperform neural network-based models. In particular, Chroma-DTW failed to achieve high precision across all thresholds. We also note that the precision of MIDI Encoding Hash is not 1.0, which implies that several files in LMD-clean incorrectly matched the song name label; we found a few songs with the exact same MIDI content were matched with different song names. Upon manually reviewing the corresponding recordings, we identified two distinct issues: 1) the metadata of some MIDI files was incorrectly linked to entirely different pieces, and 2) a song appeared multiple times with different metadata, often due to variations in song titles across international releases.
6.4 Proposed Method for De-duplication
Prior to performing the de-duplication of LMD, we investigated whether combining multiple methods could further enhance performance. The results of these experiments are presented in Table 4. Given the differing perspectives of each method, we hypothesized that they may detect duplication based on complementary criteria. To test this, we evaluated all possible method combinations with CAugBERT. Among all two-model combinations, the union of CAugBERT and CLaMP-1024 yielded the highest F1 score. While MusicBERT_small achieved the second-highest F1 score in Table 3, its union with CAugBERT resulted in a lower F1 score than the union with CLaMP. Notably, this combination achieved performance comparable to that of combining all available methods, suggesting that these two models cover the majority of duplicate predictions. While we also explored incorporating rule-based methods into the final ensemble, the performance gain was negligible. Consequently, we selected the union of the CLaMP-1024 and CAugBERT as our proposed configuration for LMD de-duplication.
Methods                       # Clusters   # Duplicates
MIDI Encoding Hash [20]       16,633       26,167
Proposed Configuration        23,566       68,075
Conservative Configuration    20,797       38,134
Table 5. Duplicate count in LMD-full when queried with LMD-full, using two configurations. Proposed refers to the configuration described in Section 6.4. Conservative refers to the configuration using embedding similarity ≥ 0.99.
Figure 1. Listening test on 100 random LMD-full samples with detected duplicates. X-axis shows classified categories from Section 7. Unhatched bars use the proposed threshold; hatched bars use the stricter ≥ 0.99 threshold.
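The union of two detectors in Section 6.4 amounts to flagging a pair when either model's similarity clears that model's own threshold. A minimal sketch, assuming pairwise similarities stored in dictionaries keyed by file pair and hypothetical threshold values:

```python
def union_duplicates(sim_a, sim_b, thr_a, thr_b):
    """Pairs flagged by either model.

    sim_a, sim_b: dicts mapping (file_i, file_j) -> similarity for two models.
    thr_a, thr_b: per-model thresholds (hypothetical values in the usage below).
    """
    flagged_a = {pair for pair, score in sim_a.items() if score >= thr_a}
    flagged_b = {pair for pair, score in sim_b.items() if score >= thr_b}
    return flagged_a | flagged_b


def precision_recall_f1(predicted, gold):
    """Pair-level precision, recall, and F1 against a gold set of duplicate pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

A union can only add predictions, so recall never decreases, while precision can fall below that of either model alone. This is consistent with the combined rows of Table 4 sitting slightly under the 0.9 precision target.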
To assess how well these thresholds determined on LMD-clean generalize to LMD-full, we conducted a manual validation of the retrieval results. We first extracted all embeddings of LMD-full using the proposed configuration and computed the similarity between embeddings. Duplicate pairs were identified using the thresholds applied in the previous experiment. We then randomly sampled 100 MIDI files, each having at least one detected duplicate, resulting in a total of 506 MIDI samples including the 100 queries. Afterward, one of the authors with a bachelor’s degree in music composition manually listened to the retrieved items and classified the retrieved samples into four categories. As described in Figure 1, we found that 72.9% of items are duplicates (soft and hard), 6.9% are not duplicates but are very similar in instrument combinations and chord progressions, and 20.2% are irrelevant. Upon closer inspection, 79.3% of the irrelevant items turned out to be either fugue-like pieces or short single-instrument segments, both types of data that were underrepresented in the evaluation on LMD-clean. When increasing the threshold up to 0.99, the proportion of similar and irrelevant items dropped to 2.22% and 2.22%.
7. DE-DUPLICATION OF LMD
After we thresholded the similarity, we constructed an adjacency list, where each detected duplication is an edge and each file is a node in a graph. Afterward, we ran a depth-first search to find the clusters of duplication (i.e., connected components of the graph) and generated a duplicated file list that excludes the one sample with the highest total note count in each cluster. After the clustering process, we found that the number of clusters with duplicates is 23,566, and the number of duplicate files is 68,075, which is the number of files to be filtered, as shown in Table 5.
Figure 2. Duplicate count in LMD-full when queried with LMD-clean, using three different methods. LMD-clean duplicates based on the artist folder and filename.
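The clustering step of Section 7 can be sketched as follows: build an adjacency list from the thresholded pairs, find connected components with a depth-first search, and keep only the file with the most notes in each component. The function name and the input format (a list of pairs plus a note-count dictionary) are assumptions for illustration, not the released implementation.

```python
from collections import defaultdict


def files_to_filter(duplicate_pairs, note_counts):
    """Cluster duplicate pairs and list every file except one keeper per cluster.

    duplicate_pairs: iterable of (file_a, file_b) detected as duplicates.
    note_counts: dict mapping file name -> total number of notes.
    Returns (clusters, removed), where removed is sorted for determinism.
    """
    adjacency = defaultdict(set)
    for a, b in duplicate_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    visited, clusters = set(), []
    for start in adjacency:
        if start in visited:
            continue
        # Iterative DFS avoids recursion limits on very large clusters.
        stack, component = [start], []
        visited.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    stack.append(neighbor)
        clusters.append(component)

    removed = []
    for component in clusters:
        keeper = max(component, key=lambda name: note_counts[name])
        removed.extend(name for name in component if name != keeper)
    return clusters, sorted(removed)
```

Because clusters are connected components, duplication is treated as transitive: if A matches B and B matches C, all three fall into one cluster even when A and C were never matched directly. A single false match can therefore merge two unrelated clusters, which is one reason to also offer a conservative threshold.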
Given the performance limitation of the current configuration, we offer three de-duplication options of LMD as our best available solution.
The first option is filtering pieces based only on the LMD-clean, as the primary use case of LMD is generating multi-instrumental pop music. LMD-clean covers various types of famous songs, which take a large portion of duplicates in LMD. We release a duplicate list generated by querying each piece of LMD-clean against LMD-full using our best configuration. The number of detected duplicates is shown in Figure 2. This will help researchers effectively remove highly duplicated popular tracks.
Also, to address diverse use cases, we provide duplicate lists by querying from the entire LMD-full, using both the proposed configuration and a more conservative threshold (0.99 for both models). These options are designed to support scenarios that either prioritize aggressive duplicate removal or aim to minimize the elimination of irrelevant items. We note that even a conservative rejection threshold of 0.99 yielded 20,797 duplicate clusters with 38,134 duplicated files to be filtered, which is more than the 26,167 files that MusicBERT’s file encoding hash detects.
8. CONCLUSION
In this study, we highlighted dataset duplication issues in MIR and examined the capacities of various approaches for de-duplicating the LMD. As a result, we found that at least 38,134 files in the LMD-full (21.4%) are considered duplicates with high confidence, even when the proposed configuration is used with the conservative 0.99 threshold. The resulting de-duplication lists and our approaches can enhance the overall validity of symbolic music research. We expect the methods we explored can be applied to other large-scale symbolic music datasets.
Future work includes investigating the impact of de-duplication on the training and evaluation of MIR models, similar to the analysis of training data attribution in audio-based music generation by [37]. Improving soft duplicate detection while minimizing irrelevant matches also remains an open challenge. One promising direction is to leverage neural methods of MIDI variation generation [19,41] to construct positive pairs for contrastive learning.
9. ACKNOWLEDGEMENTS
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-00560548).
10. REFERENCES
[1] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
[2] C. Raffel, “Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching,” Ph.D. dissertation, Columbia University, 2016.
[3] E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux, “Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019.
[4] R. Webster, J. Rabin, L. Simon, and F. Jurie, “On the de-duplication of LAION-2b,” arXiv preprint: 2303.12733, 2023.
[5] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini, “Deduplicating training data makes language models better,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.
[6] B. L. Sturm, “The gtzan dataset: Its contents, its faults, their effects on evaluation, and its future use,” arXiv preprint: 1306.1461, 2013.
[7] B. L. Sturm, “The state of the art ten years after a state of the art: Future research in music information retrieval,” Journal of new music research, vol. 43, no. 2, pp. 147–172, 2014.
[8] X. Liang, J. Wu, and J. Cao, “MIDI-Sandwich2: Rnn-
based hierarchical multi-modal fusion generation vae
networks for multi-track symbolic music generation,”
arXiv preprint: 1909.03522, 2019.
[9] C. Donahue, H. H. Mao, Y. E. Li, G. W. Cottrell, and
J. McAuley, “LakhNES: Improving multi-instrumental
music generation with cross-domain pre-training,” in
Proceedings of the 20th International Society for
Music Information Retrieval Conference, 2019.
[10] Y. Ren, J. He, X. Tan, T. Qin, Z. Zhao, and T.-Y. Liu,
“PopMAG: Pop music accompaniment generation,” in
Proceedings of the 28th ACM International Conference
on Multimedia, 2020.
[11] B. Yu, P. Lu, R. Wang, W. Hu, X. Tan, W. Ye, S. Zhang,
T. Qin, and T.-Y. Liu, “Museformer: Transformer with
fine- and coarse-grained attention for music
generation,” in Advances in Neural Information Processing
Systems 35 (NeurIPS 2022), 2022.
[12] D. von Rütte, L. Biggio, Y. Kilcher, and T. Hofmann,
“Figaro: Generating symbolic music with fine-grained
artistic control,” Proceedings of the International
Conference on Learning Representations (ICLR), 2023.
[13] J. Thickstun, D. L. W. Hall, C. Donahue, and P. Liang,
“Anticipatory Music Transformer,” Transactions on
Machine Learning Research, 2024.
[14] K. Bhandari, A. Roy, K. Wang, G. Puri, S. Colton, and
D. Herremans, “Text2midi: Generating symbolic
music from captions,” in Proceedings of the 39th AAAI
Conference on Artificial Intelligence, 2025.
[15] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang,
“Musegan: Multi-track sequential generative adversarial
networks for symbolic music generation and
accompaniment,” in Proceedings of the 32nd AAAI
Conference on Artificial Intelligence, 2018.
[16] J. Ens and P. Pasquier, “MMM: Exploring conditional
multi-track music generation with the transformer,”
arXiv preprint: 2008.06048, 2020.
[17] H. Liang, W. Lei, P. Y. Chan, Z. Yang, M. Sun, and
T.-S. Chua, “Pirhdy: Learning pitch-, rhythm-, and
dynamics-aware embeddings for symbolic music,” in
Proceedings of the 28th ACM International Conference
on Multimedia, 2020.
[18] J. Ens and P. Pasquier, “Building the MetaMIDI
dataset: Linking symbolic and audio musical data,” in
Proceedings of the 22nd International Society for
Music Information Retrieval Conference, 2021.
[19] S. Han, H. Ihm, D. Ahn, and W. Lim, “Instrument
separation of symbolic music by explicitly guided diffusion
model,” Proceedings of the NeurIPS Workshop on
Machine Learning for Creativity and Design, 2022.
[20] M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and
T.-Y. Liu, “MusicBERT: Symbolic music understanding
with large-scale pre-training,” in Findings of the
Association for Computational Linguistics: ACL-IJCNLP
2021, 2021.
[21] P. Lisena, A. Meroño-Peñuela, and R. Troncy,
“Midi2vec: Learning midi embeddings for reliable
prediction of symbolic music metadata,” Semantic Web,
vol. 13, no. 3, pp. 357–377, 2022.
[22] E. Choi, Y. Chung, S. Lee, J. Jeon, T. Kwon, and
J. Nam, “YM2413-MDB: A multi-instrumental FM
video game music dataset with emotion annotations,”
in Proceedings of the 23rd International Society for
Music Information Retrieval Conference, 2022.
[23] S. Sulun, M. E. Davies, and P. Viana, “Symbolic
music generation conditioned on continuous-valued
emotions,” IEEE Access, vol. 10, pp. 44617–44626, 2022.
[24] S. Han, H. Ihm, and W. Lim, “Systematic analysis
of music representations from bert,” arXiv preprint:
2306.04628, 2023.
[25] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and
P. Lamere, “The Million Song Dataset,” in Proceedings
of the 12th International Society for Music Information
Retrieval Conference, 2011.
[26] K. J. M. Lee, J. Ens, S. Adkins, P. Sarmento,
M. Barthet, and P. Pasquier, “The GigaMIDI dataset with
features for expressive music performance detection,”
Transactions of the International Society for Music
Information Retrieval, vol. 8, no. 1, pp. 1–19, 2025.
[27] J. Melechovsky, A. Roy, and D. Herremans,
“MidiCaps: A large-scale midi dataset with text captions,” in
Proceedings of the 25th International Society for
Music Information Retrieval Conference, 2024.
[28] S. Wu, Y. Wang, R. Yuan, Z. Guo, X. Tan, G. Zhang,
M. Zhou, J. Chen, X. Mu, Y. Gao, Y. Dong, J. Liu,
X. Li, F. Yu, and M. Sun, “CLaMP 2: Multimodal
music information retrieval across 101 languages using
large language models,” arXiv preprint: 2410.13267,
2024.
[29] S. Wu, Z. Guo, R. Yuan, J. Jiang, S. Doh, G. Xia,
J. Nam, X. Li, F. Yu, and M. Sun, “CLaMP 3:
Universal music information retrieval across unaligned
modalities and unseen languages,” arXiv preprint:
2502.10362, 2025.
[30] R. Batlle-Roca, W.-H. Liao, X. Serra, Y. Mitsufuji, and
E. Gómez, “Towards assessing data replication in
music generation with music similarity metrics on raw
audio,” in Proceedings of the 25th International Society
for Music Information Retrieval Conference, 2024.
[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova,
“BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding,” in Proceedings of
the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short
Papers), 2019.
[32] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad,
A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer,
“BART: Denoising sequence-to-sequence pre-training
for natural language generation, translation, and
comprehension,” in Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics,
2020.
[33] Y.-H. Chou, I.-C. Chen, J. Ching, C.-J. Chang, and
Y.-H. Yang, “Midibert-piano: Large-scale pre-training for
symbolic music classification tasks,” Journal of
Creative Music Systems, vol. 8, no. 1, 2024.
[34] X. Liang, Z. Zhao, W. Zeng, Y. He, F. He, Y. Wang,
and C. Gao, “Pianobart: Symbolic piano music
generation and understanding with large-scale pre-training,”
in Proceedings of the 2024 IEEE International
Conference on Multimedia and Expo (ICME), 2024.
[35] S. Wu, D. Yu, X. Tan, and M. Sun, “CLaMP:
Contrastive Language-Music Pre-training for Cross-Modal
Symbolic Music Information Retrieval,” in
Proceedings of the 24th International Society for Music
Information Retrieval Conference, 2023.
[36] J. Ryu, H.-W. Dong, J. Jung, and D. Jeong, “Nested
music transformer: Sequentially decoding compound
tokens in symbolic music and audio generation,” in
Proceedings of the 25th International Society for
Music Information Retrieval Conference, 2024.
[37] J. Barnett, H. F. Garcia, and B. Pardo, “Exploring
musical roots: Applying audio embeddings to empower
influence attribution for a generative music model,” in
Proceedings of the 25th International Society for
Music Information Retrieval Conference, 2024.
[38] N. Fradet, N. Gutowski, F. Chhel, and J.-P. Briot,
“Impact of time and note duration tokenizations on deep
learning symbolic music modeling,” in Proceedings of
the 24th International Society for Music Information
Retrieval Conference, 2023.
[39] N. Fradet, J.-P. Briot, F. Chhel, A. El Fallah
Seghrouchni, and N. Gutowski, “MidiTok: A
python package for MIDI file tokenization,” in
Extended Abstracts for the Late-Breaking Demo
Session of the 22nd International Society for Music
Information Retrieval Conference, 2021.
[40] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton,
“A simple framework for contrastive learning of visual
representations,” in Proceedings of the 37th
International Conference on Machine Learning (ICML), 2020.
[41] J. Huerta, B. Liu, and P. Stone, “Varynote: A method
to automatically vary the number of notes in symbolic
music,” in Bridge after the turmoil - The 16th
International Symposium, CMMR 2023, 2023.