On the De-Duplication of the Lakh MIDI Dataset

Author: Eunjin Choi; Hyerin Kim; Jiwoo Ryu; Juhan Nam; Dasaem Jeong

Publisher: Zenodo

DOI: 10.5281/zenodo.17706329

Source: https://zenodo.org/records/17706329/files/000005.pdf

ON THE DE-DUPLICATION OF THE LAKH MIDI DATASET
Eunjin Choi1, Hyerin Kim2, Jiwoo Ryu2, Juhan Nam1, Dasaem Jeong2
1 Graduate School of Culture Technology, KAIST, South Korea
2 Department of Art & Technology, Sogang University, South Korea
{jech,juhan.nam}@kaist.ac.kr, {kime0225, clayryu338}@gmail.com, {dasaemj}@sogang.ac.kr
ABSTRACT
A large-scale dataset is essential to training a well-generalized deep-learning model. Most such datasets are collected via scraping from various internet sources, inevitably introducing duplicated data. In the symbolic music domain, these duplicates often come from multiple user arrangements and metadata changes after simple editing. However, despite critical issues such as unreliable training evaluation from data leakage during random splitting, dataset duplication has not been extensively addressed in the MIR community. This study investigates the dataset duplication issues regarding the Lakh MIDI Dataset (LMD), one of the largest publicly available sources in the symbolic music domain. To find and evaluate the best retrieval method for duplicated data, we employed the Clean MIDI subset of the LMD as a benchmark test set, in which different versions of the same songs are grouped together. We first evaluated rule-based approaches and previous symbolic music retrieval models for de-duplication and also investigated a contrastive learning-based BERT model with various augmentations to find duplicate files. As a result, we propose three different versions of the filtered list of LMD, which filters out at least 38,134 samples in the most conservative settings among 178,561 files.
1. INTRODUCTION
As data-driven approaches become mainstream and huge generative neural networks become popular, the significance of large-scale datasets is increasing. The tremendous performance of generative models has been attributed to massive dataset availability. For example, the availability of billions of text-image pair data boosted high-quality image synthesis from text in the computer vision domain [1]. Among several dataset collection strategies, web scraping [2] or synthesis [3] has emerged as the most practical method for assembling large-scale datasets, given the high costs associated with manual collection.
However, collecting large-scale datasets via web crawling can inevitably cause problems, including privacy, copyright, and data duplication issues. In this paper, we focus on data duplication, particularly highlighting issues in the use of the Lakh MIDI Dataset (LMD) [2], which is widely recognized as one of the large-scale datasets in MIR. While there are studies in other fields that discuss the downsides of dataset duplication [4] and the benefits of de-duplication [5], we found that such discussions are notably lacking in the music information retrieval (MIR) community. One related work is the examination of dataset validity issues in GTZAN [6,7], and our work shares this intent by addressing the correct usage of LMD through de-duplication.

© F. Author, S. Author, and T. Author. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: F. Author, S. Author, and T. Author, “On the de-duplication of the Lakh MIDI dataset”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In particular, we argue that the current use of the LMD in symbolic music generation impairs the validity of existing experiments. The issue arises from unreliable training evaluations caused by data leakage during random splits. Duplicates across training, validation, and test splits can skew evaluation metrics such as cross-entropy loss, which many previous studies used to assess their model performance [8–13]. In the music generation domain, subjective evaluation is costly, and objective evaluation metrics are often insufficient to fully capture the quality of generated music. Consequently, studies rely on validation loss as a metric to claim non-overfitting and model effectiveness [13]. Duplicates can also bias listening-based evaluations, particularly when the test split is used for conditioning. This is common practice in conditional music generation [10, 12–14]. If the duplication remains unaddressed, random splits that lead to data leakage will continue to undermine the reliability of evaluations.
However, manually finding duplicates in a large-scale dataset such as LMD is virtually infeasible. Therefore, we explored cleaning LMD using rule-based and neural approaches. Our contributions are as follows:
• How to clean the LMD?
We evaluated rule-based methods and existing symbolic music retrieval models for duplicate detection. We also explored training an unsupervised contrastive BERT model with augmentations to detect duplicates.
• How can we evaluate the de-duplication?
We propose an evaluation method to evaluate duplicate detection by utilizing the metadata of the Clean MIDI subset of LMD, referred to as LMD-clean in this paper.
• How many duplicates exist in the dataset?
Using the duplicate detection method with the best performance, we classify the duplicated files in the LMD. Even with the most conservative threshold of rejection, we find 38,134 duplicated files to be filtered. Finally, we present a filtering list of LMD from both our proposed configuration and the most conservative threshold.¹

Paper              | Dataset Version       | Split        | Evaluation
MuseGAN [15]       | LPD-5-matched         | N.A.         | -
MIDI-Sandwich2 [8] | LPD-full              | N.A.         | NLL
LakhNES [9]        | LMD-full              | train, valid | PPL
PopMAG [10]        | LMD-matched           | random       | PPL, ♪
MMM [16]           | LMD-full              | N.A.         | -
PiRhDy [17]        | LMD-full              | N.A.         | -
MMD [18]           | LMD-matched           | N.A.         | -
Han et al. [19]    | LMD-full              | N.A.         | -
Museformer [11]    | LMD-full              | 8:1:1        | PPL
MusicBERT [20]     | LMD-full              | N.A.         | -
FIGARO [12]        | LMD-full              | 8:1:1        | PPL, ♪
MIDI2Vec [21]      | LMD-matched           | 9:1 with CV  | -
YM2413-MDB [22]    | LMD-full              | 9:1:1        | -
Sulun et al. [23]  | LM(P)D-full, matched  | N.A.         | -
Han et al. [24]    | LMD-full              | N.A.         | -
Anticipatory [13]  | LMD-full              | 87:6:6       | PPL, ♪
text2midi [14]     | LMD-full (MidiCaps)   | N.A.         | ♪
Table 1. Studies that used LMD for training. Studies that mentioned the dataset duplication issue are bolded. LPD is the piano roll version of LMD suggested by [15]. Split strategies are N.A. when not described in the paper. CV means cross-validation. The last column explains whether the authors utilized the dataset during the evaluation. ♪ means that the dataset is used in the listening test.
2. RELATED WORKS
2.1 LMD and Related Studies
LMD [2] is a dataset released in 2016 with a method for efficiently matching a large-scale MIDI corpus collected from the Internet to the Million Song Dataset (MSD) [25]. At the time, the dataset was intended to be used for applications such as content-based retrieval, corpus studies of music structure and patterns, and transcription using paired audio and MIDI. Diverging from its initially suggested applications, this dataset has become frequently used for training symbolic music generation models, primarily because it is one of the largest available symbolic music datasets. Among the available LMD versions, LMD-full, which contains 178,561 files with unique MD5 hashes, is the most popular for multi-instrumental pop music generation, and LMD-matched, which consists of 45,129 songs matched with MSD, is also used in several studies.
Since the release of LMD, larger-scale datasets based on web scraping, such as MetaMIDI (MMD) [18] and GigaMIDI [26], have emerged. Notably, the recently released GigaMIDI dataset is a superset of LMD. Additionally, the introduction of the MidiCaps dataset [27], which incorporates LLM-generated text captions aligned with LMD, has further expanded applications of LMD in text-to-music generation [14] and music retrieval tasks [28,29].
As shown in Table 1, most studies employed the LMD-full and split it for training without mentioning the split strategies. Also, several papers employed the NLL loss or PPL values for their evaluation. We note that a few papers in Table 1 pointed out the duplication issues within the dataset and removed the duplicated files using a rule-based approach, such as MIDI encoding hash matching. However, we found that the rule-based approach is not sufficient to remove all of the duplicated files in the dataset, which we will show in the evaluation section. In this study, we used LMD-clean as our de-duplication study and test dataset. This dataset contains 17,184 files organized by the artist and song name in the directory and filename, which we found to have multiple MIDI files of the same song.

¹ All training and evaluation code is publicly available: https://github.com/jech2/LMD_Deduplication
2.2 Dataset De-duplication and Related Issues
Recently, the need for dataset de-duplication has gained attention across various domains. In computer vision, [4] proposed a method to eliminate duplication by compressing CLIP features using a contrastive feature compression technique. In music information retrieval (MIR), [30] recently introduced a method for detecting exact duplicates in training data using audio-based music similarity metrics. Furthermore, in natural language processing, [5] investigated the effects of dataset de-duplication by applying exact substring matching and hash-based techniques. Their findings showed that removing duplicates improves language model performance, reduces training time, and lowers the rate of training data memorization without harming perplexity. In this work, we focus on the de-duplication method for large-scale symbolic music datasets.
2.3 Symbolic Music Understanding and Retrieval
With the advent of large-scale language understanding models such as BERT [31] and BART [32] in the natural language domain, several counterparts have been introduced for symbolic music, including MidiBERT-Piano [33], MusicBERT [20], and PianoBART [34]. Among these, MusicBERT is a large-scale pre-trained model trained on LMD and a private dataset. While earlier methods focused solely on the MIDI modality, recent approaches have explored symbolic music retrieval using text, led by the introduction of CLaMP [35], which learns joint embeddings of symbolic music and text through contrastive learning, using ABC notation as its input format. CLaMP2 [28] extends this approach by enhancing the symbolic encoder to support both ABC and MIDI formats with multilingual support. CLaMP3 [29] further generalizes the model to handle additional modalities, including audio and images. We used these models for our dataset de-duplication task by leveraging their embeddings.
3. DUPLICATION TYPES IN THE LMD
We describe the types of duplicated MIDI files observed in LMD-clean. In a broad sense, if we focus on music generation, all arrangements of the same song should be defined as duplication (i.e., the same song cannot be in different splits). However, in tasks such as music arrangement, different arrangements can be considered as different data samples. Considering this aspect, we separated the duplication type into two cases: hard duplication and soft duplication². This work focuses on detecting hard duplication.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3.1 Hard Duplication from Similar Arrangement
We define hard duplication as files that share identical sections of arrangements with minor differences. These differences include instrument mapping or order, tempo, start offset, file length, missing tracks, or note-level alterations such as pitch, duration, or velocity. Melodic variations may also appear, such as added ornamentation or changes in the number of chord tones played by a specific instrument. Key transpositions are considered hard duplication when other musical elements remain nearly identical but are treated as soft duplication if accompanied by significant stylistic or structural changes. We assume that some hard duplicates were likely collected as users modified and re-uploaded existing files originally created by other arrangers.
3.2 Soft Duplication from Different Arrangement
In soft duplication, MIDI files preserve essential musical elements such as melody, harmony, ornaments, and instrumentation but differ in arrangement style, reflecting the diverse styles of individual arrangers. Such duplication includes cases where the core melody remains unchanged or similar, but accompaniment styles (e.g., arpeggio, waltz), pitch ranges, and overall structure vary considerably. In cases of extremely different arrangements, these variations might lead a listener to perceive them as distinct songs.
4. DUPLICATE DETECTION: DATASET AND CONVENTIONAL APPROACHES
Along with the dataset for evaluation, we first explored several approaches for identifying duplicates, including simple rule-based methods and pre-trained symbolic music retrieval models.
4.1 Dataset
We used LMD-clean as our evaluation benchmark to assess how well each method detects duplicates within the dataset. LMD-clean is organized by artist folders and song filenames, where duplicate instances of the same song by the same artist are labeled with variations in the filenames (e.g., Dancing Queen.mid, Dancing queen.2.mid). According to this metadata, 10,355 out of 17,184 files in LMD-clean are considered duplicates.
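As an illustration of how such filename metadata can be turned into duplicate labels, the following minimal sketch groups paths by artist folder and a normalized song title. The normalization rule (lowercasing and stripping a trailing numeric suffix such as .2), the folder name in the example, and the helper names song_key and group_by_song are assumptions made for this example, not the exact procedure used in this study.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def song_key(path):
    """Return (artist folder, normalized song title) for an LMD-clean style path."""
    p = PurePosixPath(path)
    title = re.sub(r"\.mid$", "", p.name, flags=re.IGNORECASE)  # drop the extension
    # Assumed rule: a trailing ".<number>" marks another version of the same song.
    title = re.sub(r"\.\d+$", "", title)
    return (p.parent.name, title.strip().lower())


def group_by_song(paths):
    """Group file paths that share the same artist folder and normalized title."""
    groups = defaultdict(list)
    for path in paths:
        groups[song_key(path)].append(path)
    return dict(groups)


# Hypothetical folder layout, used only for illustration.
paths = ["ABBA/Dancing Queen.mid", "ABBA/Dancing queen.2.mid", "ABBA/Waterloo.mid"]
groups = group_by_song(paths)
```

Under these assumptions, every group with more than one path corresponds to a set of files that the metadata labels as versions of the same song.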
4.2 Rule-based Approach
The following rule-based methods serve as baselines for identifying duplicates in the dataset. We assumed that hard duplicated samples share highly similar MIDI-level features. Based on this assumption, we explored several methods aimed at detecting and filtering out files with identical or nearly identical features at the beat or pitch level.

² We show examples of duplication types in the companion website.
4.2.1 MIDI Encoding Hash
As discussed in Section 2.1, some studies [11, 20, 23] employed a hash-based approach to detect duplicated MIDI files with different metadata. Here, we used the file de-duplication code of MusicBERT [20]. The string versions of Octuple representations are encoded according to the MD5 hash value, and the hash values of all MIDI files from LMD-clean are compared.
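The idea of hashing the encoded note content rather than the raw file can be sketched as follows. This is not the MusicBERT de-duplication code: the token layout (tuples of integers standing in for Octuple tokens), the string serialization, and the function names are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict


def encoding_hash(tokens):
    """MD5 digest of a string serialization of the encoded tokens."""
    text = " ".join("-".join(str(v) for v in tok) for tok in tokens)
    return hashlib.md5(text.encode("ascii")).hexdigest()


def find_hash_duplicates(encoded_files):
    """Return groups of file names whose encoded content hashes to the same value."""
    buckets = defaultdict(list)
    for name, tokens in encoded_files.items():
        buckets[encoding_hash(tokens)].append(name)
    return [sorted(names) for names in buckets.values() if len(names) > 1]


# Toy stand-ins for encoded files; file metadata is not part of the tokens.
encoded = {
    "a.mid": [(0, 0, 0, 60, 4, 20), (0, 4, 0, 64, 4, 20)],
    "b.mid": [(0, 0, 0, 60, 4, 20), (0, 4, 0, 64, 4, 20)],
    "c.mid": [(0, 0, 0, 62, 4, 20)],
}
duplicate_groups = find_hash_duplicates(encoded)
```

Because only the token sequence is hashed, two files that differ solely in metadata fall into the same bucket, while any change to a single token produces a different hash.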
4.2.2 Beat Position Entropy
In our preliminary study, we found there are many duplicates that have exactly the same music but with different instrument mapping or track order. To detect these duplicates, we applied a simple method that checks the distribution of note position within a bar using the MIDI encoding scheme of [36]. We computed entropy values from note position distributions at a 16th-note resolution. Files with identical entropy values were identified as hard duplicates.
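A minimal sketch of this entropy computation is given below, assuming note onsets are already quantized to 16th-note steps and that every bar has 16 steps (4/4 time). Both assumptions, as well as the function names, are ours; the entropy_similarity helper mirrors the "1 minus absolute entropy difference" similarity described in Section 6.1.

```python
import math
from collections import Counter


def beat_position_entropy(onsets_16th, steps_per_bar=16):
    """Shannon entropy (bits) of note positions within a bar."""
    positions = Counter(onset % steps_per_bar for onset in onsets_16th)
    total = sum(positions.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in positions.values())


def entropy_similarity(h_a, h_b):
    """Similarity as 1 minus the absolute entropy difference."""
    return 1.0 - abs(h_a - h_b)


# The same rhythm played in two different bars gives the same entropy,
# regardless of which instrument or track the notes belong to.
h_first = beat_position_entropy([0, 4, 8, 12])
h_second = beat_position_entropy([16, 20, 24, 28])
```

Since the statistic ignores pitch and instrument, equal entropy is a necessary rather than a sufficient sign of identical content, which is consistent with the precision below 1.0 reported for this method in Table 3.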
4.2.3 Chroma-DTW
To detect duplicates with similar pitch content, we measured the chroma-level distance between MIDI files. Piano roll-based chromagrams were first generated and aligned by transposing them with the highest pitch occurrence across files. Dynamic Time Warping (DTW) was then applied to measure the similarity between aligned chromagrams. To reduce the computational cost of applying DTW to the entire dataset, we first computed pitch histograms of all files and measured the pairwise Kullback-Leibler (KL) divergence. For each file, we selected the top 250 candidates with the lowest KL divergence and then applied chromagram-based DTW to these candidates. Although this approach discards temporal information, it serves as a rough prefiltering step.
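The two-stage procedure (histogram-based candidate selection followed by DTW) can be sketched as below. This is a simplified illustration under several assumptions that are not stated in the text: histograms are taken over 12 pitch classes with a small smoothing constant, the DTW local cost is Euclidean, and the transposition alignment step is omitted. All function names are hypothetical.

```python
import math


def pitch_histogram(pitches, eps=1e-6):
    """Smoothed, normalized 12-bin pitch-class histogram (assumed binning)."""
    counts = [eps] * 12
    for p in pitches:
        counts[p % 12] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def top_k_candidates(query, histograms, k=250):
    """Names of the k files whose histograms have the lowest KL divergence from the query."""
    scored = sorted(
        (kl_divergence(histograms[query], h), name)
        for name, h in histograms.items()
        if name != query
    )
    return [name for _, name in scored[:k]]


def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of chroma vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])  # Euclidean local cost (assumed)
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


histograms = {
    "a": pitch_histogram([60, 64, 67]),
    "b": pitch_histogram([60, 64, 67, 72]),
    "c": pitch_histogram([61, 63, 66]),
}
candidates = top_k_candidates("a", histograms, k=1)
```

The prefilter keeps the quadratic-cost DTW step restricted to a fixed number of candidates per file, at the price of missing duplicates whose pitch histograms differ strongly.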
4.3 Previous Symbolic Music Embedding Models
We utilized the pre-trained MusicBERT [20] and CLaMP model series [28, 29, 35] since they support multi-instrumental MIDI. For MusicBERT, we used the pre-trained MusicBERT-small and MusicBERT-base models for inference. For CLaMP, we used the pre-trained CLaMP-512, CLaMP-1024, CLaMP2 and CLaMP3 models. Since the CLaMP-512 and CLaMP-1024 models use XML for input files and ABC notation for their internal data representation, the MIDI files are first converted using Musescore batch processing and then converted with the XML to ABC conversion algorithm. For CLaMP 2 and 3, we converted MIDI to MTF, their MIDI encoding scheme.
5. DUPLICATE DETECTION: A CONTRASTIVE LEARNING-BASED APPROACH
Previous pre-trained symbolic embedding models were not originally trained for duplicate detection. Inspired by [37,38], which evaluated the robustness of audio or music embedding models against perturbations such as pitch shift, we explore whether training with such perturbations would improve song identification despite variations. To this end, we developed a BERT-based model and applied various augmentations to the positive samples in contrastive learning, which we refer to as CAugBERT.

Level   | Augmentation       | Value
Note    | Onset Shift        | (-2, 2)
Note    | Duration Shift     | (-4, 4)
Note    | Velocity Shift     | (-3, 3)
Track   | Pitch Octave Shift | (-24, 24)
Track   | Inst Order         | Shuffle
Track   | Inst Mapping       | Except Drum
Track   | Inst Drop          | Less than 50%
Track   | Bar Drop           | 15%
Segment | Bar Shift          | (1, 4)
Segment | Note Drop          | 15%
Segment | Pitch Transpose    | (-6, 6)
Table 2. List of augmentations for generating MIDI variations. Values in parentheses are ranges at the token level (inclusive).
5.1 Data Representation
We used the LMD-full dataset, excluding all files present in LMD-clean by matching MD5 hash values. The resulting dataset is referred to as LMD-filtered. We randomly split LMD-filtered into a 98:1:1 ratio and pre-processed the MIDI files with Octuple encoding using MidiTok [39]. Although we initially held out the remaining 1% for testing, we did not use it in our final experiments, as we chose to evaluate our model on LMD-clean instead.
5.2 Data Augmentation
To reflect the types of variations described in Section 3.1, we constructed positive pairs for contrastive learning using different augmentations, referred to as MIDI variation augmentation. Each augmentation of MIDI variation is independently and randomly applied during training. The details of these augmentations are provided in Table 2. In addition, following [24], we used neighbor segments, which are different parts from the same piece, as positive pairs for training. Each piece was first segmented into chunks of 1024 tokens, and then a random segment was selected. MIDI variation augmentation was also applied to neighbor segments to enhance robustness.
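To make the augmentation scheme more concrete, the sketch below applies three of the Table 2 augmentations (segment-level pitch transpose, note drop, and note-level velocity shift) to a list of note tokens. The note representation as (onset, duration, pitch, velocity) token indices, the application probability p_apply, and the function name are assumptions for illustration; the remaining augmentations and the Octuple encoding used in this study are not reproduced here.

```python
import random


def augment_segment(notes, rng, p_apply=0.5):
    """Return an augmented copy of a segment of (onset, duration, pitch, velocity) tokens."""
    # Each augmentation is switched on independently (p_apply is an assumed value).
    transpose = rng.randint(-6, 6) if rng.random() < p_apply else 0
    drop_notes = rng.random() < p_apply
    shift_velocity = rng.random() < p_apply

    augmented = []
    for onset, duration, pitch, velocity in notes:
        if drop_notes and rng.random() < 0.15:  # Note Drop: 15%
            continue
        if shift_velocity:  # Velocity Shift: (-3, 3), inclusive
            velocity = max(0, velocity + rng.randint(-3, 3))
        new_pitch = min(127, max(0, pitch + transpose))  # Pitch Transpose: (-6, 6)
        augmented.append((onset, duration, new_pitch, velocity))
    return augmented


segment = [(i, 4, 60 + i, 20) for i in range(20)]
positive_view = augment_segment(segment, random.Random(0))
```

The original segment and its augmented view would then form one positive pair; passing a seeded random.Random instance keeps the augmentation reproducible, which matches the use of fixed seeds for validation described in Section 5.3.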
5.3 Model Description
The implementation of CAugBERT is based on the code from [24], which applies contrastive learning to a BERT architecture. To align with the parameter settings of MusicBERT-small, we used a 4-layer transformer with a sequence length of 1024, hidden size and vocabulary embedding size of 512, and a feedforward dimension of 2048. We use a total batch size of 64 across two A6000 GPUs. For masked language modeling (MLM), we adopted the same element, compound, and bar-level masking strategies used in MusicBERT. Contrastive learning was guided by the NT-Xent loss [40]. The final loss was computed as a weighted sum of the MLM and contrastive (NT-Xent) losses, with weights of 0.3 and 1.0, respectively.
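The loss combination can be illustrated with a small pure-Python sketch of NT-Xent over a batch of paired embeddings, followed by the weighted sum with weights 0.3 and 1.0. The temperature value of 0.1 is an assumption (it is not given in the text), the function names are hypothetical, and a real training setup would compute this on GPU tensors rather than Python lists.

```python
import math


def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent(views_a, views_b, temperature=0.1):
    """NT-Xent loss where views_a[i] and views_b[i] form a positive pair."""
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)
        # Similarities to every other sample in the batch (self excluded).
        logits = [_cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - _cosine(z[i], z[positive]) / temperature
    return total / (2 * n)


def total_loss(mlm_loss, contrastive_loss, w_mlm=0.3, w_contrastive=1.0):
    """Weighted sum of the MLM and contrastive losses."""
    return w_mlm * mlm_loss + w_contrastive * contrastive_loss


anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]      # each view is close to its own anchor
mismatched = [[0.1, 0.9], [0.9, 0.1]]   # each view is close to the other anchor
```

The loss is low when each augmented view is closest to its own source segment and high when it is closer to another sample in the batch, which is the property the duplicate detector relies on.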
During training, each encoded MIDI segment was augmented using either MIDI variation or neighbor augmentation described in Section 5.2, to maximize the diversity of augmentations within each batch. For validation, fixed manual seed values were used to maintain consistent augmentations across validation batches. To evaluate the effectiveness of the contrastive learning approach for duplication detection, we conducted an ablation study on the contrastive loss, as presented in Table 3.
6. EVALUATION
We evaluated all approaches from two perspectives: (1) Does the system rank duplicates as more similar than others? (2) How accurately does it identify true duplicates? To answer these questions, we utilized the metrics that are commonly used for recommendation and retrieval systems.
6.1 Measuring Similarities
For MIDI Encoding Hash, the similarity between samples was set as 1 when encoding hashes matched. Similarity of beat position entropy was computed by subtracting the absolute difference in entropy from the maximum value of 1. For Chroma-DTW, the similarity was calculated as 1 minus the DTW distance. In the MusicBERT series, similarity was measured using the cosine similarity of the average token embeddings from the Transformer’s final hidden layer. For CAugBERT, we used the [CLS] token embedding from the final hidden layer. All BERT-based models utilized 512-dimensional embeddings. For the CLaMP series, we used a pre-trained 768-dimensional embedding where the last hidden state was average pooled and passed through a projection layer, following the code provided in [28,29,35].
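The embedding-based similarity for the MusicBERT series (average pooling over token embeddings followed by cosine similarity) can be written out as the short sketch below. The toy 2-dimensional vectors stand in for the 512-dimensional hidden states, and the function names are ours.

```python
import math


def mean_pool(token_embeddings):
    """Average the token embeddings of one file into a single vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


file_a = mean_pool([[1.0, 0.0], [3.0, 2.0]])   # -> [2.0, 1.0]
file_b = mean_pool([[4.0, 2.0], [4.0, 2.0]])   # -> [4.0, 2.0]
similarity = cosine_similarity(file_a, file_b)
```

Cosine similarity ignores vector length, so two files whose pooled embeddings point in the same direction receive the maximum score even if their norms differ.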
6.2 Evaluation with Retrieval Metrics
To evaluate how well each method assigns higher similarity scores to the duplicates, we adopt normalized Discounted Cumulative Gain (nDCG) and Mean Reciprocal Rank (MRR) as evaluation metrics. nDCG measures how highly relevant items are ranked, assigning higher scores when duplicates appear closer to the top of the retrieval list. It is computed by normalizing Discounted Cumulative Gain with the optimal ranking where all duplicates are retrieved at the highest possible ranks. For each query in LMD-clean, we assign relevance 1 to duplicates and 0 to others when computing nDCG. MRR is defined as the average, over queries, of the inverse rank of the highest-ranked relevant item. It therefore reflects how close to the top the most similar duplicate of each query is ranked.
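For reference, both metrics with binary relevance can be computed as in the following sketch, where each query is represented by its ranked list of relevance labels (1 for a duplicate, 0 otherwise). The function names are ours, and the base-2 logarithmic discount is the standard nDCG formulation, which we assume is the one used here.

```python
import math


def ndcg_at_all(ranked_relevance):
    """nDCG over the full ranked list with binary relevance labels."""
    n_relevant = sum(ranked_relevance)
    if n_relevant == 0:
        return 0.0
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(ranked_relevance, start=1))
    # Ideal ranking: all relevant items occupy the top positions.
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, n_relevant + 1))
    return dcg / ideal


def reciprocal_rank(ranked_relevance):
    """Inverse rank of the first relevant item, or 0 if there is none."""
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


queries = [[1, 1, 0, 0], [0, 1, 0, 0]]
mrr = mean_reciprocal_rank(queries)  # (1/1 + 1/2) / 2
```

A query whose duplicates all appear first reaches nDCG 1.0, while pushing the only duplicate from rank 1 to rank 2 lowers nDCG to about 0.63 and the reciprocal rank to 0.5.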
For the nDCG and MRR metrics, neural network-based approaches outperformed the rule-based methods. Among them, the CLaMP model series consistently achieved higher scores than BERT-based models, with CLaMP3 showing the best overall performance. Since CLaMP models were specifically trained for retrieval tasks, the result is consistent with their intended design.
However, we observed that all approaches performed below a certain upper bound. In particular, while neural network-based models performed well on hard duplicates, they struggled with detecting soft duplicates. Manual inspection of retrieved samples revealed that similarity scores often failed to reflect perceptual similarity in these soft duplicate cases. For pre-trained symbolic music understanding and retrieval models, we can interpret this as a result of the pre-trained model not being optimized for duplication detection tasks. For the contrastive learning-based approach, detecting soft duplications involving different arrangements appears to be challenging under the current MIDI variation augmentation strategies, as they rely on relatively simple, rule-based augmentations.

Method                   | nDCG@all | MRR   | Precision | Recall | F1    | FN
MIDI Encoding Hash       | 0.283    | 0.280 | 0.922     | 0.172  | 0.291 | 6,759
Beat Position Entropy    | 0.344    | 0.329 | 0.919     | 0.192  | 0.318 | 6,435
Chroma-DTW               | 0.157    | 0.081 | 0.826     | 0.050  | 0.094 | 8,730
MusicBERT_small [20]     | 0.563    | 0.590 | 0.904     | 0.336  | 0.490 | 4,685
MusicBERT_base [20]      | 0.556    | 0.580 | 0.904     | 0.324  | 0.477 | 4,799
CLaMP-512 [35]           | 0.580    | 0.610 | 0.906     | 0.273  | 0.419 | 5,289
CLaMP-1024 [35]          | 0.646    | 0.673 | 0.903     | 0.317  | 0.469 | 4,762
CLaMP2 [28]              | 0.627    | 0.647 | 0.903     | 0.117  | 0.208 | 7,384
CLaMP3 [29]              | 0.697    | 0.709 | 0.902     | 0.210  | 0.341 | 6,017
BERT without Contrastive | 0.549    | 0.580 | 0.901     | 0.253  | 0.395 | 5,630
CAugBERT                 | 0.558    | 0.586 | 0.903     | 0.339  | 0.493 | 4,623
Table 3. Overall performance on retrieval and classification metrics on LMD-clean. For nDCG and MRR, we report the mean value. For precision, recall, and F1 of neural network-based approaches, the performance with the best threshold (precision > 0.9) is reported. FN represents the false negatives, the number of duplicates that the model failed to detect. Bold text indicates the best performance, while underline represents the second best.

Prediction Strategies       | Precision | Recall | F1    | FN
M_all                       | 0.881     | 0.411  | 0.561 | 3,734
M_rule                      | 0.897     | 0.233  | 0.370 | 5,858
MusicBERT ∪ CAugBERT        | 0.901     | 0.369  | 0.523 | 4,300
CLaMP ∪ CAugBERT*           | 0.899     | 0.395  | 0.548 | 3,954
CLaMP ∪ CAugBERT ∪ M_rule   | 0.888     | 0.395  | 0.547 | 3,917
Table 4. Classification performance using combinations of available methods on LMD-clean. CLaMP: CLaMP-1024 / MusicBERT: MusicBERT_small / M_all: the union of all methods / M_rule: the subset consisting only of rule-based methods. *: the proposed configuration that we used in the final LMD de-duplication process / FN: the false negatives, where duplicates exist but are not predicted / Bold: the best performance / underline: the second best.
6.3 Evaluation with Classification Metrics
We report the precision and recall of each model. While F1-score is commonly used for deciding the best threshold to balance precision and recall, such thresholds may result in an unacceptably high rate of false positives in practical de-duplication. To ensure that our model comparison remains meaningful under realistic scenarios, we selected the lowest threshold that satisfies a precision of 0.9.
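The threshold selection rule can be sketched as a sweep over candidate thresholds from high to low, keeping the lowest one at which precision still reaches the target. The input format (a list of similarity scores with ground-truth duplicate labels) and the function name are assumptions for this illustration.

```python
def lowest_threshold_for_precision(scored_pairs, target=0.9):
    """Lowest similarity threshold whose predictions reach the target precision.

    scored_pairs: iterable of (similarity, is_duplicate) tuples.
    A pair is predicted as duplicate when similarity >= threshold.
    Returns None if no threshold reaches the target.
    """
    ordered = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    best = None
    true_pos = false_pos = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Include every pair tied at this similarity value.
        while i < len(ordered) and ordered[i][0] == threshold:
            if ordered[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        if true_pos / (true_pos + false_pos) >= target:
            best = threshold  # precision is not monotonic, so keep scanning
    return best


pairs = [(0.99, True), (0.98, True), (0.97, True), (0.96, True), (0.95, True),
         (0.94, True), (0.93, True), (0.92, True), (0.91, True),
         (0.90, False), (0.80, False)]
threshold = lowest_threshold_for_precision(pairs)
```

In this toy example the threshold 0.90 yields nine true positives out of ten predictions (precision 0.9), while lowering it to 0.80 would drop precision to 9/11, so 0.90 is selected.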
According to Table 3, CAugBERT achieved the best recall and F1 (0.339 and 0.493), which is marginally higher than the MusicBERT_small model (0.336 and 0.490). In addition, CAugBERT outperformed the BERT without contrastive loss, which implies that contrastive learning with augmentations is an effective scheme for duplicate detection. In contrast to the retrieval tasks, recent CLaMP models (CLaMP2 and CLaMP3) performed worse than the original CLaMP. We note that BERT-based and CLaMP-based models exhibit different performance trends in both retrieval and classification tasks, implying that the two approaches detect duplication from slightly different perspectives.
Similar to retrieval metrics, rule-based approaches consistently underperform neural network-based models. In particular, Chroma-DTW failed to achieve high precision across all thresholds. We also note that the precision of MIDI Encoding Hash is not 1.0, which implies that several files in LMD-clean incorrectly matched the song name label; we found a few songs with the exact same MIDI content were matched with different song names. Upon manually reviewing the corresponding recordings, we identified two distinct issues: 1) the metadata of some MIDI files was incorrectly linked to entirely different pieces, and 2) a song appeared multiple times with different metadata, often due to variations in song titles across international releases.
6.4 Proposed Method for De-duplication
Prior to performing the de-duplication of LMD, we investigated whether combining multiple methods could further enhance performance. The results of these experiments are presented in Table 4. Given the differing perspectives of each method, we hypothesized that they may detect duplication based on complementary criteria. To test this, we evaluated all possible method combinations with CAugBERT. Among all two-model combinations, the union of CAugBERT and CLaMP-1024 yielded the highest F1 score. While MusicBERT_small achieved the second-highest F1 score in Table 3, its union with CAugBERT resulted in a lower F1 score than the union with CLaMP. Notably, this combination achieved performance comparable to that of combining all available methods, suggesting that these two models cover the majority of duplicate predictions. While we also explored incorporating rule-based methods into the final ensemble, the performance gain was negligible. Consequently, we selected the union of the CLaMP-1024 and CAugBERT as our proposed configuration for LMD de-duplication.

Methods                    | # Clusters | # Duplicates
MIDI Encoding Hash [20]    | 16,633     | 26,167
Proposed Configuration     | 23,566     | 68,075
Conservative Configuration | 20,797     | 38,134
Table 5. Duplicate count in LMD-full when queried with LMD-full, using two configurations. Proposed refers to the configuration described in Section 6.4. Conservative refers to the configuration using embedding similarity ≥0.99.

Figure 1. Listening test on 100 random LMD-full samples with detected duplicates. X-axis shows classified categories from Section 7. Unhatched bars use the proposed threshold; hatched bars use the stricter ≥0.99 threshold.
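The union strategy of Section 6.4 amounts to thresholding each model separately and merging the predicted duplicate pairs. The sketch below shows this with hypothetical similarity tables and a hypothetical threshold of 0.9 for both models; the actual per-model thresholds are those selected on LMD-clean and are not reproduced here.

```python
def predicted_pairs(similarities, threshold):
    """Unordered file pairs whose similarity reaches the model's threshold."""
    return {frozenset(pair) for pair, sim in similarities.items() if sim >= threshold}


def union_predictions(*pair_sets):
    """A pair is a predicted duplicate if any model predicts it."""
    merged = set()
    for pairs in pair_sets:
        merged |= pairs
    return merged


def precision_recall_f1(predicted, truth):
    """Pair-level precision, recall, and F1 against ground-truth duplicate pairs."""
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


# Hypothetical similarity scores from two models over the same files.
model_a = {("a", "b"): 0.95, ("a", "c"): 0.40}
model_b = {("b", "a"): 0.30, ("c", "d"): 0.97}
merged = union_predictions(predicted_pairs(model_a, 0.9), predicted_pairs(model_b, 0.9))
truth = {frozenset(("a", "b")), frozenset(("c", "d")), frozenset(("e", "f"))}
precision, recall, f1 = precision_recall_f1(merged, truth)
```

A union can only add predictions, so it raises recall at a possible cost in precision, which matches the pattern in Table 4 where the combined strategies have higher recall and slightly lower precision than the single models in Table 3.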
To assess how well these thresholds determined on LMD-clean generalize to LMD-full, we conducted a manual validation of the retrieval results. We first extracted all embeddings of LMD-full using the proposed configuration and computed the similarity between embeddings. Duplicate pairs were identified using the thresholds applied in the previous experiment. We then randomly sampled 100 MIDI files, each containing at least one detected duplicate, resulting in a total of 506 MIDI samples including the 100 queries. Afterward, one of the authors with a bachelor’s degree in music composition manually listened to the retrieved items and classified the retrieved samples into four categories. As described in Figure 1, we found that 72.9% of items are duplicates (soft and hard), whereas items that are almost similar in instrument combinations and chord progressions but are not duplicates were 6.9%, and the percentage of irrelevant items is 20.2%. Upon closer inspection, 79.3% of the irrelevant items turned out to be either fugue-like pieces or short single-instrument segments, both types of data that were underrepresented when evaluating on LMD-clean. When increasing the threshold up to 0.99, the proportion of similar and irrelevant items dropped to 2.22% and 2.22%, respectively.
7. DE-DUPLICATION OF LMD
After we thresholded the similarity, we constructed an adjacency list in which each detected duplication is an edge and each file is a node of a graph. Afterward, we ran a depth-first search to find the clusters of duplication (i.e., connected components of the graph) and generated a duplicated file list containing every file except the one sample with the highest total note count in each cluster. After the clustering process, we found that the number of clusters with duplicates is 23,566, and the number of duplicate files is 68,075, which is the number of files to be filtered, as shown in Table 5.

Figure 2. Duplicate count in LMD-full when queried with LMD-clean, using three different methods. LMD-clean duplicates based on the artist folder and filename.
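The clustering step can be sketched as follows: detected duplicate pairs form an undirected graph, an iterative depth-first search extracts its connected components, and every file except the one with the highest note count is marked for filtering. The input format and the function names are assumptions for this illustration, and ties in note count are broken arbitrarily here (the first file in sorted order is kept).

```python
from collections import defaultdict


def duplicate_clusters(files, duplicate_pairs):
    """Connected components (size >= 2) of the duplicate graph, found by DFS."""
    adjacency = defaultdict(set)
    for a, b in duplicate_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    clusters = []
    for start in files:
        if start in seen or start not in adjacency:
            continue  # already clustered, or a file without any duplicate
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


def files_to_filter(clusters, note_counts):
    """All files of each cluster except the one with the highest note count."""
    removed = []
    for cluster in clusters:
        keep = max(cluster, key=lambda name: note_counts[name])
        removed.extend(name for name in cluster if name != keep)
    return sorted(removed)


files = ["a", "b", "c", "d", "e", "f"]
pairs = [("a", "b"), ("b", "c"), ("d", "e")]
note_counts = {"a": 10, "b": 30, "c": 20, "d": 5, "e": 5, "f": 40}
clusters = duplicate_clusters(files, pairs)
filtered = files_to_filter(clusters, note_counts)
```

Note that "a" and "c" end up in the same cluster even though they were never directly matched, because connectivity is transitive. The number of filtered files equals the total cluster size minus one per cluster, which is how the cluster and duplicate counts in Table 5 relate to each other.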
Given the performance limitation of the current configuration, we offer three de-duplication options of LMD as our best available solution.
The first option is filtering pieces based only on the LMD-clean, as the primary use case of LMD is generating multi-instrumental pop music. LMD-clean covers various types of famous songs, which take up a large portion of duplicates in LMD. We release a duplicate list generated by querying each piece of LMD-clean against LMD-full using our best configuration. The number of detected duplicates is shown in Figure 2. This will help researchers effectively remove highly duplicated popular tracks.
Also, to address diverse use cases, we provide duplicate lists by querying from the entire LMD-full, using both the proposed configuration and a more conservative threshold (0.99 for both models). These options are designed to support scenarios that either prioritize aggressive duplicate removal or aim to minimize the elimination of irrelevant items. We note that even with a conservative rejection threshold of 0.99, it yielded 20,797 duplicate clusters with 38,134 duplicated files to be filtered, which is larger than the 26,167 files detected by MusicBERT’s file encoding hash.
8. CONCLUSION
In this study, we highlighted dataset duplication issues in MIR and examined the capacities of various approaches for de-duplicating the LMD. As a result, we found that 38,134 files in the LMD-full (21.4%) are considered duplicates with high confidence, even under the most conservative configuration (embedding similarity ≥0.99). The resulting de-duplication lists and our approaches can enhance the overall validity of symbolic music research. We expect the methods we explored can be applied to other large-scale symbolic music datasets.
Future work includes investigating the impact of de-duplication on the training and evaluation of MIR models, similar to the analysis of training data attribution in audio-based music generation by [37]. Improving soft duplicate detection while minimizing irrelevant matches also remains an open challenge. One promising direction is to leverage neural methods for MIDI variation generation [19,41] to construct positive pairs for contrastive learning.
9. ACKNOWLEDGEMENTS
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-00560548).
10. REFERENCES
[1] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Pa ekh, H. Pham,
Q. Le, Y.-H. Sung, Z. Li, and T. Due ig, “Scaling up i-
sual and ision-language ep esen a ion lea ning wi h
noisy ex supe ision,” in P oceedings o he 38 h In-
e na ional Con e ence on Machine Lea ning (ICML),
2021.
[2] C. Ra el, “Lea ning-based me hods o compa ing se-
quences, wi h applica ions o audio- o-midi alignmen
and ma ching,” Ph.D. disse a ion, Columbia Uni e -
si y, 2016.
[3] E. Manilow, G. Wiche n, P. See ha aman, and
J. Le Roux, “Cu ing music sou ce sepa a ion some
Slakh: A da ase o s udy he impac o aining da a
quali y and quan i y,” in P oceedings o IEEE Wo k-
shop on Applica ions o Signal P ocessing o Audio
and Acous ics (WASPAA), 2019.
[4] R. Webs e , J. Rabin, L. Simon, and F. Ju ie, “On
he de-duplica ion o LAION-2b,” a Xi p ep in :
2303.12733, 2023.
[5] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini, “Deduplicating training data makes language models better,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.
[6] B. L. Sturm, “The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use,” arXiv preprint: 1306.1461, 2013.
[7] ——, “The state of the art ten years after a state of the art: Future research in music information retrieval,” Journal of New Music Research, vol. 43, no. 2, pp. 147–172, 2014.
[8] X. Liang, J. Wu, and J. Cao, “MIDI-Sandwich2: RNN-based hierarchical multi-modal fusion generation VAE networks for multi-track symbolic music generation,” arXiv preprint: 1909.03522, 2019.
[9] C. Donahue, H. H. Mao, Y. E. Li, G. W. Cottrell, and J. McAuley, “LakhNES: Improving multi-instrumental music generation with cross-domain pre-training,” in Proceedings of the 20th International Society for Music Information Retrieval Conference, 2019.
[10] Y. Ren, J. He, X. Tan, T. Qin, Z. Zhao, and T.-Y. Liu, “PopMAG: Pop music accompaniment generation,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[11] B. Yu, P. Lu, R. Wang, W. Hu, X. Tan, W. Ye, S. Zhang, T. Qin, and T.-Y. Liu, “Museformer: Transformer with fine- and coarse-grained attention for music generation,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
[12] D. von Rütte, L. Biggio, Y. Kilcher, and T. Hofmann, “Figaro: Generating symbolic music with fine-grained artistic control,” in Proceedings of the International Conference on Learning Representations (ICLR), 2023.
[13] J. Thickstun, D. L. W. Hall, C. Donahue, and P. Liang, “Anticipatory Music Transformer,” Transactions on Machine Learning Research, 2024.
[14] K. Bhandari, A. Roy, K. Wang, G. Puri, S. Colton, and D. Herremans, “Text2midi: Generating symbolic music from captions,” in Proceedings of the 39th AAAI Conference on Artificial Intelligence, 2025.
[15] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, “Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.
[16] J. Ens and P. Pasquier, “MMM: Exploring conditional multi-track music generation with the transformer,” arXiv preprint: 2008.06048, 2020.
[17] H. Liang, W. Lei, P. Y. Chan, Z. Yang, M. Sun, and T.-S. Chua, “Pirhdy: Learning pitch-, rhythm-, and dynamics-aware embeddings for symbolic music,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[18] J. Ens and P. Pasquier, “Building the MetaMIDI dataset: Linking symbolic and audio musical data,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference, 2021.
[19] S. Han, H. Ihm, D. Ahn, and W. Lim, “Instrument separation of symbolic music by explicitly guided diffusion model,” in Proceedings of the NeurIPS Workshop on Machine Learning for Creativity and Design, 2022.
[20] M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and T.-Y. Liu, “MusicBERT: Symbolic music understanding with large-scale pre-training,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021.
[21] P. Lisena, A. Meroño-Peñuela, and R. Troncy, “Midi2vec: Learning midi embeddings for reliable prediction of symbolic music metadata,” Semantic Web, vol. 13, no. 3, pp. 357–377, 2022.
[22] E. Choi, Y. Chung, S. Lee, J. Jeon, T. Kwon, and J. Nam, “YM2413-MDB: A multi-instrumental FM video game music dataset with emotion annotations,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference, 2022.
[23] S. Sulun, M. E. Davies, and P. Viana, “Symbolic music generation conditioned on continuous-valued emotions,” IEEE Access, vol. 10, pp. 44617–44626, 2022.
[24] S. Han, H. Ihm, and W. Lim, “Systematic analysis of music representations from BERT,” arXiv preprint: 2306.04628, 2023.
[25] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and P. Lamere, “The Million Song Dataset,” in Proceedings of the 12th International Society for Music Information Retrieval Conference, 2011.
[26] K. J. M. Lee, J. Ens, S. Adkins, P. Sarmento, M. Barthet, and P. Pasquier, “The GigaMIDI dataset with features for expressive music performance detection,” Transactions of the International Society for Music Information Retrieval, vol. 8, no. 1, pp. 1–19, 2025.
[27] J. Melechovsky, A. Roy, and D. Herremans, “MidiCaps: A large-scale midi dataset with text captions,” in Proceedings of the 25th International Society for Music Information Retrieval Conference, 2024.
[28] S. Wu, Y. Wang, R. Yuan, Z. Guo, X. Tan, G. Zhang, M. Zhou, J. Chen, X. Mu, Y. Gao, Y. Dong, J. Liu, X. Li, F. Yu, and M. Sun, “CLaMP 2: Multimodal music information retrieval across 101 languages using large language models,” arXiv preprint: 2410.13267, 2024.
[29] S. Wu, Z. Guo, R. Yuan, J. Jiang, S. Doh, G. Xia, J. Nam, X. Li, F. Yu, and M. Sun, “CLaMP 3: Universal music information retrieval across unaligned modalities and unseen languages,” arXiv preprint: 2502.10362, 2025.
[30] R. Batlle-Roca, W.-H. Liao, X. Serra, Y. Mitsufuji, and E. Gómez, “Towards assessing data replication in music generation with music similarity metrics on raw audio,” in Proceedings of the 25th International Society for Music Information Retrieval Conference, 2024.
[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
[32] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[33] Y.-H. Chou, I.-C. Chen, J. Ching, C.-J. Chang, and Y.-H. Yang, “Midibert-piano: Large-scale pre-training for symbolic music classification tasks,” Journal of Creative Music Systems, vol. 8, no. 1, 2024.
[34] X. Liang, Z. Zhao, W. Zeng, Y. He, F. He, Y. Wang, and C. Gao, “Pianobart: Symbolic piano music generation and understanding with large-scale pre-training,” in Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), 2024.
[35] S. Wu, D. Yu, X. Tan, and M. Sun, “CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, 2023.
[36] J. Ryu, H.-W. Dong, J. Jung, and D. Jeong, “Nested music transformer: Sequentially decoding compound tokens in symbolic music and audio generation,” in Proceedings of the 25th International Society for Music Information Retrieval Conference, 2024.
[37] J. Barnett, H. F. Garcia, and B. Pardo, “Exploring musical roots: Applying audio embeddings to empower influence attribution for a generative music model,” in Proceedings of the 25th International Society for Music Information Retrieval Conference, 2024.
[38] N. Fradet, N. Gutowski, F. Chhel, and J.-P. Briot, “Impact of time and note duration tokenizations on deep learning symbolic music modeling,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, 2023.
[39] N. Fradet, J.-P. Briot, F. Chhel, A. El Fallah Seghrouchni, and N. Gutowski, “MidiTok: A python package for MIDI file tokenization,” in Extended Abstracts for the Late-Breaking Demo Session of the 22nd International Society for Music Information Retrieval Conference, 2021.
[40] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.
[41] J. Huerta, B. Liu, and P. Stone, “Varynote: A method to automatically vary the number of notes in symbolic music,” in Bridge after the turmoil - The 16th International Symposium, CMMR 2023, 2023.
