From Discord to Harmony: Consonance-Based Smoothing for Improved Audio Chord Estimation

Author: Andrea Poltronieri; Xavier Serra; Martín Rocamora

Publisher: Zenodo

DOI: 10.5281/zenodo.17706494

Source: https://zenodo.org/records/17706494/files/000057.pdf

FROM DISCORD TO HARMONY: DECOMPOSED CONSONANCE-BASED
TRAINING FOR IMPROVED AUDIO CHORD ESTIMATION
Andrea Poltronieri, Xavier Serra, Martín Rocamora
Music Technology Group, Universitat Pompeu Fabra
{andrea.poltronieri, xavier.serra, martin.rocamora}@upf.edu
ABSTRACT
Audio Chord Estimation (ACE) holds a pivotal role in music information research,
having garnered attention for over two decades due to its relevance to music
transcription and analysis. Despite notable advancements, challenges persist in
the task, particularly concerning unique characteristics of harmonic content,
which have resulted in existing systems' performances reaching a glass ceiling.
These challenges include annotator subjectivity, where varying interpretations
among annotators lead to inconsistencies, and class imbalance within chord
datasets, where certain chord classes are over-represented compared to others,
posing difficulties in model training and evaluation. As a first contribution,
this paper presents an evaluation of inter-annotator agreement in chord
annotations, using metrics that extend beyond traditional binary measures. In
addition, we propose a consonance-informed distance metric that reflects the
perceptual similarity between harmonic annotations. Our analysis suggests that
consonance-based distance metrics more effectively capture musically meaningful
agreement between annotations. Expanding on these findings, we introduce a novel
ACE conformer-based model that integrates consonance concepts into the model
through consonance-based label smoothing. The proposed model also addresses
class imbalance by separately estimating root, bass, and all note activations,
enabling the reconstruction of chord labels from decomposed outputs.
1. INTRODUCTION
In Western music theory, chords denote simultaneous combinations of three or
more notes, forming harmonic structures integral to musical composition and
analysis [1–4]. However, manually annotating chords from audio recordings is a
labour-intensive task requiring music professionals' expertise. Consequently,
Audio Chord Estimation (ACE) emerged as a crucial task in Music Information
Retrieval/Research (MIR) to automate chord transcription from audio, owing to
its numerous applications in music transcription and analysis.
© Andrea Poltronieri, Xavier Serra, Martín Rocamora. Licensed under a Creative
Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Andrea
Poltronieri, Xavier Serra, Martín Rocamora, “From Discord to Harmony: Decomposed
Consonance-based Training for Improved Audio Chord Estimation”, in Proc. of the
26th Int. Society for Music Informat. Retrieval Conf., Daejeon, South Korea, 2025.
The research in ACE has witnessed more than two decades of exploration, but
despite the important advancements achieved [5], performance results have
stagnated in recent years, leading some researchers to suggest that the task has
hit a glass ceiling [6]. These challenges stem from several significant open
problems [5], which are fundamentally linked to the complex nature of harmonic
content and its representation within audio signals.
One such challenge is the chord vocabulary imbalance, stemming from the unequal
frequency of occurrence among chord labels. For instance, in ChoCo [7], the most
extensive corpus of chord annotations to date, just five chord types (major,
minor, major seventh, minor seventh and dominant seventh) account for
approximately 74.9% of the distribution over the 8064 distinct chord classes.
Another critical challenge is inter-annotator agreement, which arises from the
inherent ambiguity in what constitutes a chord from a musical perspective and
the subjective nature of human annotation processes. For example, a clear
distinction between a chord sequence and a melodic line can be subject to
individual interpretation. Moreover, there is significant variance among
annotators regarding the level of detail in annotating chord sequences [8].
Various studies have investigated inter-annotator agreement in chord annotation,
reporting agreement rates for the root note ranging from 76% [8] to 92% [9],
using different datasets and numbers of annotators. Such evaluations typically
use binary metrics to compare labels, but penalising the agreement evaluation
equally for every discrepancy can be inappropriate [10]. Indeed, binary
evaluation risks overlooking harmonic aspects that might be shared among chord
sequences, although annotated differently.
As a preliminary contribution of this paper, we analyse patterns of
inter-annotator disagreement in chord annotation. Our analysis reveals that when
annotators disagree, their chord labels tend to be harmonically related rather
than randomly different. Specifically, we find that disagreements commonly occur
between chords that share significant harmonic content (cf. Section 3.1).
Building upon these insights, we propose a method for incorporating such
information into the supervised training of ACE systems. Hence, we introduce a
novel model integrating consonance-based label smoothing [11] (cf. Section
3.2.2). To tackle the class imbalance issue, instead of mapping audio features
to a predetermined vocabulary of chord labels, we adopt an approach inspired by
[12], in which the chord root, bass, and all note activations are classified
separately. The final predicted chord label is derived from decoding these three
sets of information without explicitly imposing any vocabulary on it (cf.
Section 3.2.1).
The proposed model leverages the Conformer architecture [13], which has recently
been explored in several music audio applications [14–16]. We demonstrate that
the proposed model performs better than the state-of-the-art approaches,
especially when evaluated using non-binary and consonance-based distance metrics
(cf. Section 4).
2. RELATED WORK
Since Fujishima’s early work [17], chord recognition has followed
knowledge-driven approaches [18], typically extracting chroma [19] or Tonnetz
features [20], and classifying them via HMMs, DBNs [19], or CRFs [21].
With the emergence of deep learning, various architectures have been explored
for the task, including Convolutional Neural Networks (CNNs) [12,21], Recurrent
architectures (RNN) [22], Convolutional Recurrent Neural Networks (CRNNs) [23],
and Transformers [24]. While deep-learning approaches have surpassed traditional
knowledge-driven ones, several challenges must be tackled. Most of the proposed
approaches for addressing the chord class imbalance challenge can be divided
into two categories: chord simplification and chord decomposition. The former
reduces the size of the chord vocabulary by converting complex chord labels into
simple representations. Notably, the vast majority of studies have adopted
restricted vocabularies of approximately 25 symbols, encompassing major-minor
chords [17,18]. Chord decomposition strategies focus on predicting the chord
constituting components separately, and then map them to templates to predict
the final chord [12,23,25]. Some additional approaches do not fall into these
two categories, like addressing the unequal distribution of chords through a
balanced learning process [26], or using a curriculum learning training scheme
to begin with simple chord qualities and then move to more complex and less
common ones [27].
The inter-annotator agreement in chord annotation continues to pose a
significant challenge. Despite existing diagnoses and quantification of this
phenomenon in the literature [8, 9], definitive solutions have yet to emerge.
De Clercq and Temperley [9] observe an inter-annotator agreement rate of 94% for
the root note between two different annotations of the top 20 tracks from
Rolling Stone magazine’s list of the 500 Greatest Songs of All Time. In
contrast, Koops et al. [8] report an inter-annotator agreement rate of 76% for
the root note on four different annotations of a 50-song subset of the Billboard
dataset [28]. To address annotation subjectivity, Koops et al. [8, 29] propose a
personalised chord estimation framework that adapts labels to individual
annotator vocabularies. Their method computes Shared Harmonic Interval Profiles
(SHIPs) from multiple reference annotations aligned with CQT frames and trains a
neural network to predict user-specific chord labels, offering an alternative to
fixed-vocabulary systems. While this approach offers valuable insights into
annotation variability, it addresses personalization rather than resolving
fundamental inter-annotator disagreement. In contrast, our proposed method
develops generalized harmonic representations grounded in music theory
principles, thereby eliminating dependence on predefined chord vocabularies.
Moreover, our method applies Label Smoothing (LS), a technique employed to
enhance the generalisation and learning speed of multi-class neural networks.
Originally proposed in [30], LS redistributes a portion of the probability mass
from the observed class to other classes, thereby softening the distribution and
generating what is referred to as soft targets. This regularisation method has
found widespread application in various state-of-the-art models across domains
such as image classification, language translation, and speech recognition. It
has also been tested for music classification tasks [31], improving performance
and reducing overfitting in small network training.
While LS primarily serves as a regularisation technique, numerous studies have
delved into its potential for encoding meaningful relationships among different
categories. For instance, in [32], authors propose an impactful method for
generating more reliable soft labels that explicitly consider the relationships
among various categories. Similarly, in [33], a novel approach known as label
relaxation is introduced, which involves replacing a degenerate probability
distribution associated with an observed class label, not by a single smoothed
distribution but rather by a larger set of candidate distributions.
We integrate label smoothing into a model based on the conformer architecture
[13], which has recently emerged in Automatic Speech Recognition (ASR) as an
effective way of modelling global and local audio dependencies by leveraging a
combination of CNNs and Transformer architectures. It has showcased remarkable
success across various tasks not only in speech [34] but also in music [15],
including melodic transcription [14], representation learning [35], and music
audio enhancement [36]. It also proved to be suitable for harmonic analysis, as
it has been used for audio–chord alignment [16] and more recently adapted for
chord estimation [37], where it is combined with the large-vocabulary decoding
scheme proposed in [23].
3. METHODS
We present a four-part investigation into chord estimation:
(i) we conduct a comprehensive analysis of inter-annotator agreement across
multiple chord similarity metrics, assessing how non-binary metrics measure
inter-annotator agreement scores;
(ii) we introduce a new perceptually-informed distance metric and we demonstrate
how it can improve agreement between annotators;
(iii) we introduce a consonance-based label smoothing that leverages consonance
to improve chord recognition;
(iv) we present a novel chord label encoding/decoding methodology, inspired by
[12].
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3.1 Analysis of Inter-Annotator Agreement
As outlined in Section 1, standard metrics employed to evaluate chord estimation
systems have traditionally relied on binary comparison approaches [5]. The most
fundamental of these is the binary distance $B_{dist}(C_1, C_2)$, which is
defined as 1 if $C_1 = C_2$, and 0 otherwise.
When evaluating chord annotations or estimation algorithms, this binary
comparison is typically weighted by the duration of each chord segment to
compute the Chord Symbol Recall (CSR) [38]:
$$\mathrm{CSR} = \frac{|S_a \cap S_e|}{|S_a|} \tag{1}$$
where $S_e$ represents the set of time segments where the estimated chords match
the reference annotations, and $S_a$ represents the total duration of annotated
segments.
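As a minimal sketch of Eq. (1), assuming each annotation is a list of
non-overlapping (start, end, label) segments and that two labels match only when
their strings are identical, CSR could be computed as below. The function name
and the data layout are illustrative choices and are not taken from mir_eval.

```python
def chord_symbol_recall(reference, estimate):
    """Duration-weighted binary agreement between two chord annotations.

    Both arguments are lists of (start, end, label) tuples in seconds.
    Segments within each list are assumed not to overlap one another.
    """
    total = sum(end - start for start, end, _ in reference)
    matched = 0.0
    for r_start, r_end, r_label in reference:
        for e_start, e_end, e_label in estimate:
            # Length of the time span shared by the two segments.
            overlap = min(r_end, e_end) - max(r_start, e_start)
            if overlap > 0 and r_label == e_label:
                matched += overlap
    return matched / total if total > 0 else 0.0


reference = [(0.0, 2.0, "C:maj"), (2.0, 4.0, "G:maj")]
estimate = [(0.0, 1.0, "C:maj"), (1.0, 4.0, "G:maj")]
print(chord_symbol_recall(reference, estimate))  # 3 matching seconds out of 4
```

Exact string equality is the strictest possible comparison; the granular metrics
described next relax it by comparing only part of each chord.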
In addition to overall binary agreement, several granular evaluation metrics
have been introduced, each capturing different levels of harmonic detail. The
Root metric compares only the root note, ignoring chord quality and extensions.
Thirds extends this by incorporating major and minor third intervals. Triads
evaluate the full triadic structure, including major, minor, augmented,
diminished, and suspended chords, up to the fifth scale degree. Tetrads consider
closed-voicing chords with extended tones (e.g., 9ths, 11ths, 13ths) collapsed
into a single octave. The Sevenths metric restricts evaluation to a predefined
set of common seventh chord types. Finally, the MIREX metric deems an estimate
correct if it shares at least three pitch classes with the reference chord,
regardless of root or quality. These metrics can optionally account for chord
inversions by requiring the bass note to match as well. All are implemented in
the mir_eval library [39], which is the de facto standard for chord estimation
evaluation. These metrics have been consistently used in literature to assess
inter-annotator agreement in chord datasets, reporting agreement rates for the
root note ranging from 76% [8] to 92% [9].
However, to overcome the inherent limitations of binary evaluation metrics,
recent research has introduced alternative measures. McLeod et al. [10] proposed
three new metrics that more accurately represent musical relationships among
chords: Spectral Pitch Similarity, Tone-by-Tone Distance, and Mechanical
Distance.
Spectral Pitch Similarity, which assesses perceived pitch content based on
psychoacoustic principles, lies beyond the scope of this study. On the other
hand, Tone-by-Tone Distance (TbT) treats chords as pitch-class sets,
categorising pitches as either tonal or neutral. This metric quantifies chord
similarity by measuring the proportion of shared pitch classes, resulting in a
distance value reflecting their pitch-content similarity. In contrast,
Mechanical Distance provides a more granular evaluation by approximating the
physical distance between chord labels as they would be played on an instrument.
It extends Tone-by-Tone Distance by quantifying not only the proportion of
incorrect pitches but also the magnitude of each deviation from the target
chord, by default measured in semitones.
While this approach introduces a more musically grounded notion of distance, the
original formulation of Mechanical Distance still treats all semitone deviations
as perceptually equivalent. This simplification overlooks the fact that, in
Western tonal harmony, the perceptual impact of an interval depends not only on
its size but also on its harmonic function. To address this limitation, we
propose an extension that incorporates consonance-based weighting into the
Mechanical Distance. Specifically, we introduce the Mechanical-Consonance
metric, which integrates the perceptual consonance vector presented in [40],
grounded in empirical studies of Western tonal harmony. The consonance vector is
defined as:
$$v = [0, 7, 5, 1, 1, 2, 3, 1, 2, 2, 4, 6] \tag{2}$$
where each position corresponds to an interval in semitones, assigning lower
values to more consonant intervals. For instance, perfect fifths and thirds (P5,
m3, M3) receive the lowest score (1), indicating high consonance, while
dissonant intervals such as major sevenths, minor seconds, and tritones are
assigned higher values (up to 7). Intervals of intermediate consonance, such as
fourths and sixths, are assigned moderate values. By weighting semitone
deviations using this vector, the Mechanical-Consonance metric adjusts the
contribution of each error based on its perceptual salience.
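To illustrate how the vector in Eq. (2) can be read, the following sketch looks
up the weight of the interval between two pitch classes. It only shows the
lookup: how the weight enters the full Mechanical Distance computation is not
spelled out in this section, and measuring the interval upwards from the
reference pitch is an assumption of this example. The names are illustrative.

```python
# Consonance vector from Eq. (2); index = interval size in semitones.
CONSONANCE_VECTOR = [0, 7, 5, 1, 1, 2, 3, 1, 2, 2, 4, 6]


def consonance_weight(reference_pc, estimated_pc):
    """Weight of the interval from a reference to an estimated pitch class.

    Pitch classes are integers 0..11 (C = 0). The interval is measured
    upwards from the reference (an assumption made for this sketch).
    """
    interval = (estimated_pc - reference_pc) % 12
    return CONSONANCE_VECTOR[interval]


# Deviations from C to a few other pitch classes.
for name, pitch_class in [("G", 7), ("F", 5), ("F#", 6), ("C#", 1)]:
    print(f"C -> {name}: weight {consonance_weight(0, pitch_class)}")
```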
As a first contribution of this paper, we assess inter-annotator agreement
across various chord granularity levels (e.g., root, thirds, triads) by
comparing standard mir_eval metrics with Tone-by-Tone Distance and Mechanical
Distance. To align these non-binary metrics with the granularity levels
typically employed in ACE evaluations, we apply two heuristics: (i) restricting
comparisons to the pitch ranges considered by the respective mir_eval metrics
(e.g., pitches up to the fifth of the chord for the MajMin metric); and (ii)
limiting comparisons only to chords included in the mir_eval metric evaluation
(e.g., diminished and seventh chords are excluded from the MajMin metric).

CASD Dataset
Metric      mir_eval↑   TbT↑    Mech↓   Mech-Cons↓
Root        0.757       0.773   0.817   0.604
Thirds      0.741       0.773   0.896   0.716
Triads      0.710       0.796   1.549   1.663
MajMin      0.734       0.803   1.465   1.577
Tetrads     0.572       0.786   1.859   1.803
Sevenths    0.592       0.794   1.771   1.715
MIREX       0.744       0.786   1.859   1.803

Random Dataset
Metric      mir_eval↑   TbT↑    Mech↓   Mech-Cons↓
Root        0.145       0.158   2.914   2.336
Thirds      0.140       0.158   2.914   2.336
Triads      0.121       0.253   5.536   5.861
MajMin      0.124       0.248   5.530   5.958
Tetrads     0.121       0.253   5.536   5.861
Sevenths    0.124       0.248   5.530   5.961
MIREX       0.121       0.253   5.536   5.861

Table 1. Inter-Annotator Agreement Scores for Chord Annotations. TbT =
Tone-by-Tone distance, Mech = Mechanical distance, Mech-Cons = Mechanical with
Consonance distance.

Figure 1. Overview of the Conformer model architecture, which comprises the
preprocessing stage, the conformer-based model, and the symbolic chord decoder.
We conduct this analysis on the Chordify Annotator Subjectivity Dataset (CASD)
[41], which represents the largest available dataset for assessing chord
annotation agreement and was previously used for similar studies [8]. Moreover,
to establish baseline performance and assess metric reliability, we conduct
parallel experiments on a synthetically generated dataset replicating CASD’s
structure (50 tracks with 4 annotations each), but populated with randomly
generated chord sequences that preserve both its chord vocabulary and
sequence-length distributions.
Table 1 reports the results for both the CASD and synthetic datasets,
highlighting the performance and reliability of each metric across different
evaluation settings. To aid interpretation, we first clarify the nature and
scaling of each metric under comparison.
The mir_eval metrics are formulated as similarity measures, returning values in
the range [0,1], where 1 indicates perfect agreement and 0 indicates complete
disagreement. In contrast, Tone-by-Tone Distance is defined as a distance metric
in [0,1], with 0 indicating identical pitch-class content and 1 indicating no
overlap; we convert it to a similarity score by computing 1−TbT. Mechanical
Distance returns an unbounded distance value influenced by the number of notes
in the chords, the sequence length, and the underlying pitch distance function.
Due to these variable factors, we report Mechanical Distance in its original
form without normalisation, as any fixed rescaling would obscure meaningful
differences.
The results show a clear separation between the CASD and random datasets,
confirming that all metrics are sensitive to musically meaningful agreement. TbT
similarity scores are remarkably stable across all chord granularity levels,
including more complex ones such as Sevenths and Tetrads. In the random dataset,
TbT returns consistently higher values than mir_eval, and scores increase
progressively as more notes are considered in the evaluation (e.g., from Root to
Sevenths). This trend indicates that TbT is more permissive than discrete
match-based approaches and more sensitive to coincidental pitch-class overlap
when more components are involved.
Mechanical Distance exhibits lower distances (i.e., higher agreement) for simple
structures (e.g., Root and Thirds), closely mirroring the mir_eval pattern. This
is also reflected in the random dataset, where increasing the chord complexity
leads to proportionally larger distances.
Mechanical-Consonance generally produces lower scores for the CASD dataset and
higher scores for the random dataset compared to its unweighted counterpart.
Notably, the mean difference between CASD and random results is 3.326 for
Mechanical Distance and 3.471 for Mechanical-Consonance. This larger separation
supports the idea that inter-annotator disagreements are not random but often
occur between harmonically related chords. The consonance-weighted formulation
reinforces this insight by penalising perceptually dissonant deviations more
heavily, further distinguishing musically plausible disagreements from
unstructured noise.
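The separation figures quoted above can be checked against Table 1. The sketch
below assumes that each figure is the unweighted mean over the seven metric rows
of the random dataset minus the same mean for CASD; under that assumption the
rounded table entries reproduce the reported values to within rounding. The
variable and function names are illustrative.

```python
# Mech and Mech-Cons columns of Table 1, rows Root..MIREX.
CASD = {
    "Mech": [0.817, 0.896, 1.549, 1.465, 1.859, 1.771, 1.859],
    "Mech-Cons": [0.604, 0.716, 1.663, 1.577, 1.803, 1.715, 1.803],
}
RANDOM = {
    "Mech": [2.914, 2.914, 5.536, 5.530, 5.536, 5.530, 5.536],
    "Mech-Cons": [2.336, 2.336, 5.861, 5.958, 5.861, 5.961, 5.861],
}


def mean(values):
    return sum(values) / len(values)


def mean_separation(metric):
    """Mean random-dataset distance minus mean CASD distance."""
    return mean(RANDOM[metric]) - mean(CASD[metric])


for metric in ("Mech", "Mech-Cons"):
    print(f"{metric}: {mean_separation(metric):.3f}")
```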
3.2 Proposed Model
As a second contribution, this paper presents a novel ACE model, illustrated in
Figure 1, which leverages the Conformer architecture [13]. As a first step, the
audio is resampled to a sampling rate of 22050 Hz, and a hop size of 2048 is
applied. Then, the Constant-Q Transform (CQT) features are calculated on 6
octaves starting from C1, with 24 bins per octave, resulting in a total of 144
bins. The CQT features are fed to a conformer encoder [13] before being passed
to the decoder layers.
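The front-end parameters above fix the time and frequency grid of the input. The
sketch below derives that grid with plain arithmetic rather than calling an audio
library. It assumes the hop size is given in samples, A4 = 440 Hz tuning, and
that C1 corresponds to MIDI note 24; none of these is stated explicitly in the
text.

```python
SAMPLE_RATE = 22050      # Hz, from the text
HOP_SIZE = 2048          # assumed to be in samples
N_OCTAVES = 6
BINS_PER_OCTAVE = 24     # i.e. two bins per semitone
C1_MIDI = 24             # assumption: C1 is MIDI note 24


def midi_to_hz(midi_note, a4=440.0):
    """Equal-tempered frequency of a MIDI note (A4 = MIDI 69)."""
    return a4 * 2 ** ((midi_note - 69) / 12)


def cqt_center_frequencies():
    """Geometrically spaced bin centres starting at C1."""
    f_min = midi_to_hz(C1_MIDI)
    n_bins = N_OCTAVES * BINS_PER_OCTAVE
    return [f_min * 2 ** (k / BINS_PER_OCTAVE) for k in range(n_bins)]


frames_per_second = SAMPLE_RATE / HOP_SIZE
frequencies = cqt_center_frequencies()
print(len(frequencies), "bins,", round(frames_per_second, 2), "frames per second")
```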
3.2.1 Chord Decomposition and Decoding
Label encoding follows a similar approach as [12]. Root and bass notes are
encoded as a 13-dimensional one-hot vector, where the first 12 positions
represent the semitones from C to B, and the last one indicates silence (denoted
as N). Chord tones are encoded using a 12-dimensional multi-hot vector, where
each dimension indicates the presence (1) or absence (0) of a pitch class in the
chord.
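The encoding just described can be sketched as follows for the D:maj7/3 chord
used in Figure 2. Parsing of Harte labels is deliberately left out: the pitch
classes are passed in directly, and all function names are hypothetical rather
than taken from the authors' code.

```python
NO_CHORD_INDEX = 12  # position 13 of the one-hot vector, denoted N


def one_hot_note(pitch_class):
    """13-dimensional one-hot vector; None encodes silence (N)."""
    vector = [0] * 13
    vector[NO_CHORD_INDEX if pitch_class is None else pitch_class] = 1
    return vector


def multi_hot_tones(pitch_classes):
    """12-dimensional multi-hot vector of the pitch classes in the chord."""
    vector = [0] * 12
    for pitch_class in pitch_classes:
        vector[pitch_class % 12] = 1
    return vector


def encode_chord(root, bass, tones):
    return {
        "root": one_hot_note(root),
        "bass": one_hot_note(bass),
        "tones": multi_hot_tones(tones),
    }


# D:maj7/3 -> root D (2), bass F# (6, the third), tones D, F#, A, C#.
encoded = encode_chord(root=2, bass=6, tones=[2, 6, 9, 1])
print(encoded["tones"])
```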
The output of the Conformer layers is first passed through a fully connected
head to predict chord tones. These chord predictions then serve as conditioning
information for two additional components: bass and root prediction. Each of
these components employs a feature fusion mechanism that concatenates the
original Conformer features with the chord logits, creating an enriched
representation that captures both the acoustic context and the predicted
harmonic content. This hierarchical approach reflects the musical intuition that
bass and root notes are contextually dependent on the overall harmonic content,
rather than treating all three components as independent prediction tasks.

Figure 2. Example of chord label decoding for a D:maj7/3 chord using the
decomposed decoder, inspired by [12].

To train the model, we use a composite loss that aligns with this encoding
scheme. Cross-entropy loss is applied to root and bass predictions, and binary
cross-entropy loss is used for chord tone predictions. Additionally, we
introduce a regularisation term that penalises discrepancies between the
predicted and actual number of active pitch classes.
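The total loss is defined as:
$$\mathcal{L} = \lambda_{\text{root}} \mathcal{L}^{\text{root}}_{\text{CE}} + \lambda_{\text{bass}} \mathcal{L}^{\text{bass}}_{\text{CE}} + \lambda_{\text{chord}} \mathcal{L}^{\text{chord}}_{\text{BCE}} + \lambda_{\text{card}} \lVert \hat{c} - c \rVert_1 \tag{3}$$
where $c$ and $\hat{c}$ are the number of active notes in the ground truth and
those predicted above a threshold, respectively.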
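A single-frame version of Eq. (3) can be sketched in plain Python as below. This
is an illustration under stated assumptions, not the training code: the four
weights default to 1.0 because their values are not given in this section, the
binary cross-entropy is averaged over the 12 pitch classes, and the cardinality
term counts sigmoid activations above a 0.5 threshold.

```python
import math


def sigmoid(x):
    # Numerically stable for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one frame (log-sum-exp form)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target_index]


def binary_cross_entropy(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over independent pitch-class outputs."""
    total = 0.0
    for logit, target in zip(logits, targets):
        p = min(max(sigmoid(logit), eps), 1.0 - eps)
        total -= target * math.log(p) + (1 - target) * math.log(1.0 - p)
    return total / len(targets)


def composite_loss(root_logits, bass_logits, chord_logits,
                   root_target, bass_target, chord_target,
                   weights=(1.0, 1.0, 1.0, 1.0), threshold=0.5):
    """Eq. (3) for one frame; the weights are hypothetical placeholders."""
    w_root, w_bass, w_chord, w_card = weights
    predicted_count = sum(1 for v in chord_logits if sigmoid(v) > threshold)
    true_count = sum(chord_target)
    return (w_root * cross_entropy(root_logits, root_target)
            + w_bass * cross_entropy(bass_logits, bass_target)
            + w_chord * binary_cross_entropy(chord_logits, chord_target)
            + w_card * abs(predicted_count - true_count))
```

Because a hard threshold has zero gradient almost everywhere, a gradient-based
implementation would presumably need a differentiable surrogate for the count;
the text does not say which one is used.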
Differently from [12], where the outputs of the bass, root, and pitch activation
predictions are combined and passed through a final linear layer to predict
chord labels, we directly use these three components to reconstruct the final
chord label. The novelty of this approach lies in the fact that, unlike
vocabulary-constrained decoding strategies such as [23], our method does not
require a predefined chord vocabulary.
Chord labels are reconstructed from the predicted probabilities in a modular
decoding process. First, the root note is identified by selecting the pitch
class with the highest predicted probability, which is then mapped to its
symbolic representation. For the chord tones, a fixed threshold (default: 0.5)
is applied to the predicted pitch activations; only pitches exceeding this
threshold are retained. These pitch classes are then converted into intervals
relative to the predicted root. An analogous procedure is applied to the bass
prediction, allowing the full reconstruction of the chord structure, as
illustrated in Figure 2. Finally, the decoded chord is passed to the
harte_library (see footnote 1), which implements utilities for converting the
predicted chord label into the respective shorthand notation.
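The decoding steps above can be sketched as follows. The sketch stops at
semitone intervals relative to the root: the final conversion to Harte degrees
and shorthand is left to the Harte library and is not reproduced here. The
function name and the handling of the N class are assumptions of this example.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
NO_CHORD_INDEX = 12


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def decode_frame(root_probs, bass_probs, tone_probs, threshold=0.5):
    """Rebuild a chord description from the three decomposed outputs.

    root_probs and bass_probs have 13 entries (12 pitch classes + N);
    tone_probs has 12 entries. Intervals are semitones above the root.
    """
    root = argmax(root_probs)
    if root == NO_CHORD_INDEX:  # assumption: N root means no chord
        return {"root": "N", "intervals": [], "bass_interval": None}
    bass = argmax(bass_probs)
    active = [pc for pc, p in enumerate(tone_probs) if p > threshold]
    intervals = sorted((pc - root) % 12 for pc in active)
    bass_interval = None if bass == NO_CHORD_INDEX else (bass - root) % 12
    return {
        "root": PITCH_CLASSES[root],
        "intervals": intervals,
        "bass_interval": bass_interval,
    }


# Probabilities resembling a D:maj7/3 frame.
root_probs = [0.01] * 13
root_probs[2] = 0.9
bass_probs = [0.01] * 13
bass_probs[6] = 0.9
tone_probs = [0.1, 0.8, 0.9, 0.1, 0.2, 0.1, 0.7, 0.1, 0.1, 0.6, 0.1, 0.3]
print(decode_frame(root_probs, bass_probs, tone_probs))
```

In this example the retained intervals are 0, 4, 7 and 11 semitones (root, major
third, fifth, major seventh) and the bass sits 4 semitones above the root, which
matches a major seventh chord in first inversion.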
3.2.2 Consonance-based Smoothing
We introduce a novel label smoothing technique that leverages music-perceptual
knowledge by incorporating consonance relationships between pitch classes.
Unlike conventional label smoothing that uniformly distributes probability mass
across incorrect classes, our approach allocates probability according to the
consonance relationship between pitch classes.
Footnote 1: https://github.com/andreamust/harte-library
Let $c = [c_0, c_1, \ldots, c_{11}] \in \mathbb{R}^{12}$ be a consonance vector
where each element $c_i$ quantifies the dissonance level of the interval $i$
semitones above the reference pitch. Lower values of $c_i$ indicate more
consonant intervals (e.g., perfect fifth, major third). We transform this vector
into a similarity measure $s \in \mathbb{R}^{12}$ as follows:
$$s = 1 - \frac{c}{\max(c)} \tag{4}$$
This ensures that more consonant intervals receive higher similarity scores,
with perfect consonance (unison) having a similarity of 1. For a given target
pitch class $t \in \{0, 1, \ldots, 11\}$ and smoothing factor $\alpha \in [0, 1]$,
we define the smoothed target distribution $q \in \mathbb{R}^{12}$ as:
$$q_i = \begin{cases} 1 - \alpha & \text{if } i = t \\ \alpha \cdot s_{(i - t) \bmod 12} & \text{if } i \neq t \end{cases} \tag{5}$$
The distribution is then normalised to ensure $\sum_{i=0}^{11} q_i = 1$:
$$q = \frac{q}{\sum_{i=0}^{11} q_i} \tag{6}$$
This formulation creates a probability distribution where the target class
receives the highest probability (1 − α), while the remaining probability mass α
is distributed among other pitch classes proportionally to their consonance
relationship with the target. For example, when the true class is C (0), pitch
classes G (7) and F (5) will receive higher probability than more dissonant
intervals like C# (1) or B (11), reflecting their stronger harmonic
relationships.
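Equations (4) to (6) translate directly into a few lines of code. The sketch
below uses the consonance vector of Eq. (2) and a smoothing factor of 0.1, which
is a placeholder since the value of α is not given in this section. It covers
only the 12 pitch classes; how the additional N class of the root and bass
outputs is treated is not described here and is left out.

```python
# Consonance vector from Eq. (2); index = interval in semitones.
CONSONANCE = [0, 7, 5, 1, 1, 2, 3, 1, 2, 2, 4, 6]


def consonance_smoothed_target(target, alpha=0.1, consonance=CONSONANCE):
    """Smoothed 12-class target distribution following Eqs. (4)-(6).

    target: pitch class 0..11 of the ground-truth note.
    alpha:  smoothing factor in [0, 1] (0.1 is a hypothetical default).
    """
    max_c = max(consonance)
    similarity = [1 - c / max_c for c in consonance]            # Eq. (4)
    q = [
        (1 - alpha) if i == target
        else alpha * similarity[(i - target) % 12]              # Eq. (5)
        for i in range(12)
    ]
    total = sum(q)
    return [value / total for value in q]                       # Eq. (6)


q = consonance_smoothed_target(target=0)
for name, pitch_class in [("C", 0), ("G", 7), ("F", 5), ("B", 11), ("C#", 1)]:
    print(f"{name}: {q[pitch_class]:.3f}")
```

Note that the normalisation in Eq. (6) rescales every entry, so the probability
that finally lands on the target class depends on both α and the sum of the
similarity scores and need not equal 1 − α exactly.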
4. EVALUATION
In this section, we compare the performance of the proposed ACE model with a
state-of-the-art method [24], using standard mir_eval metrics, Tone-by-Tone
(TbT) similarity, and Mechanical distances. Additionally, we evaluate the
effectiveness of the proposed chord decoder by benchmarking it against a
conventional frame-wise classification approach, focusing on its ability to
accurately capture chord inversions using the inverted mir_eval metrics.
All chord annotations were sourced from ChoCo [7], which provides standardized
labels in Harte syntax [42]. Specifically, we use annotations from the
Isophonics dataset [43] and the McGill Billboard corpus [44] for training and
validation, while the RWC Pop [45] and USPop datasets [23] serve as test sets.
This setup enables evaluation of both model performance and generalization
across diverse chord vocabularies.

Model  Vocab   Smooth  Root↑  MajMin↑  Thirds↑  Triads↑  Tetrads↑  7th↑  MIREX↑  TbT↑  Mech↓  MechCons↓
Ours   170     -       81.4   77.5     78.1     72.3     59.6      64.7  79.4    77.9  1.55   1.35
Ours   Decom.  -       83.4   77.2     79.7     72.2     59.2      64.6  79.3    80.5  1.57   1.37
Ours   Decom.  Cons.   84.0   77.8     80.3     72.7     60.8      66.0  79.8    81.7  1.44   1.30
BTC    170     -       81.6   77.3     78.4     72.1     60.0      65.7  79.0    78.4  1.60   1.40
BTC    Decom.  -       82.9   76.0     79.2     70.9     57.2      62.4  77.4    80.4  1.52   1.35
BTC    Decom.  Cons.   82.8   76.1     79.3     70.9     59.5      64.7  79.0    80.7  1.49   1.32

Table 2. Performance comparison across different model variants using both
standard mir_eval metrics and non-binary metrics. Results are reported for our
conformer-based model with and without the decomposition decoder and
consonance-based label smoothing. Additionally, we compare these settings with
the BTC model [24].
To increase data density while preserving local harmonic continuity, each track
is segmented into 20-second excerpts with 50% overlap. We employ data
augmentation by transposing both audio and targets from −5 to +6 semitones.
During training, we use the AdamW optimiser and cosine annealing learning rate
schedule to dynamically adjust the learning rate during training cycles.
Additionally, we adopted mixed precision training [46] to accelerate training.
To prevent overfitting, we implement early stopping, terminating training when
performance on a validation set ceased to improve after 10 epochs. The code and
all hyper-parameters used in the experiments are available on the GitHub
repository of the project (see footnote 2).
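The segmentation and the target side of the transposition augmentation can be
sketched as below. Two details are assumptions of this example rather than
statements from the text: a trailing remainder shorter than 20 seconds is
dropped, and only pitch-class targets are shifted (the N class would be left
unchanged, and the audio itself would need a separate pitch-shifting step).

```python
def segment_bounds(duration, length=20.0, overlap=0.5):
    """Start and end times (seconds) of fixed-length overlapping excerpts.

    A final remainder shorter than `length` is dropped (assumption).
    """
    hop = length * (1.0 - overlap)
    bounds = []
    start = 0.0
    while start + length <= duration:
        bounds.append((start, start + length))
        start += hop
    return bounds


# Shifts from -5 to +6 semitones inclusive: 12 values including 0.
TRANSPOSITIONS = list(range(-5, 7))


def transpose_pitch_class(pitch_class, semitones):
    """Shift a pitch-class target (0..11) by a number of semitones."""
    return (pitch_class + semitones) % 12


print(segment_bounds(45.0))
print(sorted(transpose_pitch_class(0, shift) for shift in TRANSPOSITIONS))
```

With 12 consecutive shifts, every excerpt is presented in all 12 transpositions,
so each pitch class is reached exactly once from any starting root.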
Metric          BTC    Ours   Ours     Ours
Vocab.          170    170    Decom.   Decom. Cons.
MajMin Inv.↑    71.5   72.4   75.6     75.6
Thirds Inv.↑    72.6   72.9   77.2     77.9
Triads Inv.↑    67.2   67.6   70.2     70.8
Tetrads Inv.↑   56.2   55.7   57.7     59.4
Sevenths Inv.↑  60.8   60.0   62.9     64.4

Table 3. Performance comparison on inverted chords between traditional
architectures and the proposed decomposed model, evaluated using mir_eval
metrics.
4.1 Evaluation of the ACE Model
We evaluate our model using TbT similarity, Mechanical Distance, and its
consonance-weighted variant, as introduced in Section 3.1, alongside standard
binary metrics from mir_eval [39]. For comparison, we adopt the BTC model [24],
a state-of-the-art baseline for audio chord estimation. We reimplemented and
retrained the BTC model using the hyper-parameter settings specified in the
original paper, enabling a direct evaluation of our proposed decomposition-based
decoder and the impact of consonance-informed label smoothing. The experimental
results are summarized in Table 2.
As noted by [23], differences among models are often marginal when evaluated
with standard metrics. This holds true in our comparison: both models yield
similar results on the standard classification task over a 170-class chord
vocabulary. However, our proposed decomposition-based decoder consistently
outperforms the standard frame-wise classification architecture across several
metrics, with the advantage of not relying on a fixed chord vocabulary. Notably,
we observe the greatest improvement in Root and Thirds metrics. Additionally,
the use of non-binary metrics further highlights the benefits of the proposed
decoder.
As shown in Table 3, inverted metrics also improve when using the proposed
decomposed decoder. This improvement stems from the fact that the proposed chord
decoding scheme explicitly predicts the bass note, enabling accurate inversion
prediction, a capability that standard chord classification approaches
inherently lack. The same trend is confirmed when applying the decomposed
decoder to the BTC model, which yields performance increases across several
metrics, especially the non-binary ones.
When integrating the consonance-weighted loss on root and bass predictions
within the proposed decoding architecture, performance improves slightly on all
metrics. Notably, improvements are also observed on non-binary metrics and on
inverted chords. The trend is also confirmed when applying consonance smoothing
to the BTC model with the decomposed decoder. Overall, evaluation results
suggest that both the proposed decoder and the consonance smoothing improve
accuracy in most metrics, and lead to predictions more consonant with the
target.

Footnote 2: https://github.com/andreamust/consonance-ACE
5. CONCLUSIONS
In this paper, we presented a novel model for Audio Chord Estimation based on
the conformer architecture, enhanced with a consonance-informed label smoothing
strategy and a decomposition-based decoding scheme. The motivation for
incorporating perceptual smoothing emerged from our inter-annotator agreement
analysis, which employed non-binary distance metrics and revealed that
annotation discrepancies often involve harmonically related chords. Building on
these insights, we introduced a learning strategy that integrates
consonance-weighted targets into the training process.
Experimental results show that the proposed model achieves strong performance
across both standard and non-binary evaluation metrics, with notable gains in
capturing fine-grained harmonic relationships. Additionally, the proposed
decomposition decoder not only enables chord prediction without relying on a
fixed chord vocabulary, but also contributes to consistent performance
improvements.
6. ACKNOWLEDGMENTS
This work is supported by IA y Música: Cátedra en Inteligencia Artificial y
Música (TSI-100929-2023-1), funded by the Secretaría de Estado de Digitalización
e Inteligencia Artificial, and the European Union-Next Generation EU, under the
program Cátedras ENIA 2022 para la creación de cátedras universidad-empresa en
IA, and IMPA: Multimodal AI for Audio Processing (PID2023-152250OB-I00), funded
by the Ministry of Science, Innovation and Universities of the Spanish
Government, the Agencia Estatal de Investigación (AEI) and co-financed by the
European Union.
7. REFERENCES
[1] W. B. de Haas, F. Wie ing, and R. C. Vel kamp, “A
geome ical dis ance measu e o de e mining he sim-
ila i y o musical ha mony,” In . J. Mul im. In . Re .,
ol. 2, no. 3, pp. 189–202, 2013.
[2] J. de Be a dinis, A. Me oño-Peñuela, A. Pol onie i,
and V. P esu i, “The ha monic memo y: a knowledge
g aph o ha monic pa e ns as a us wo hy ame-
wo k o compu a ional c ea i i y,” in P oceedings o
he ACM Web Con e ence 2023, WWW 2023, Aus in,
TX, USA, 30 Ap il 2023 - 4 May 2023, Y. Ding, J. Tang,
J. F. Sequeda, L. A oyo, C. Cas illo, and G. Houben,
Eds. ACM, 2023, pp. 3873–3882.
[3] Y. Huang, S. Lin, H. Wu, and Y. Li, “Music gen e
classi ica ion based on local ea u e selec ion using a
sel -adap i e ha mony sea ch algo i hm,” Da a Knowl.
Eng., ol. 92, pp. 60–76, 2014.
[4] J. Pauwels, F. Kaise , and G. Pee e s, “Combin-
ing ha mony-based and no el y-based app oaches o
s uc u al segmen a ion,” in P oceedings o he 14 h
In e na ional Socie y o Music In o ma ion Re ie al
Con e ence, ISMIR 2013, Cu i iba, B azil, No embe
4-8, 2013, A. de Souza B i o J ., F. Gouyon, and
S. Dixon, Eds., 2013, pp. 601–606.
[5] J. Pauwels, K. O’Hanlon, E. Gómez, and M. B. San-
dle , “20 yea s o au oma ic cho d ecogni ion om
audio,” in P oceedings o he 20 h In e na ional Soci-
e y o Music In o ma ion Re ie al Con e ence, ISMIR
2019, Del , The Ne he lands, No embe 4-8, 2019,
A. Flexe , G. Pee e s, J. U bano, and A. Volk, Eds.,
2019, pp. 54–63.
[6] T. Ca saul , J. Nika, and P. Esling, “Using musical ela-
ionships be ween cho d labels in au oma ic cho d ex-
ac ion asks,” in P oceedings o he 19 h In e na ional
Socie y o Music In o ma ion Re ie al Con e ence,
ISMIR 2018, Pa is, F ance, Sep embe 23-27, 2018,
E. Gómez, X. Hu, E. Humph ey, and E. Bene os, Eds.,
2018, pp. 18–25.
[7] J. de Be a dinis, A. Me oño-Peñuela, A. Pol onie i,
and V. P esu i, “Choco: a cho d co pus and a da a
ans o ma ion wo k low o musical ha mony knowl-
edge g aphs,” Scien i ic Da a, ol. 10, no. 1, p. 641,
Sep 2023.
[8] H. V. Koops, W. B. De Haas, J. A. Bu goyne,
J. B ansen, A. Ken -Mulle , and A. Volk, “Anno a o
subjec i i y in ha mony anno a ions o popula mu-
sic,” Jou nal o New Music Resea ch, ol. 48, no. 3,
p. 232–252, may 2019.
[9] T. de Cle cq and D. Tempe ley, “A co pus analysis o
ock ha mony,” Popula Music, ol. 30, no. 1, p. 47–70,
2011.
[10] A. Mcleod, X. Sue mond , Y. Rammos, S. He , and
M. A. Roh meie , “Th ee me ics o musical cho d
label e alua ion,” in P oceedings o he 14 h Annual
Mee ing o he Fo um o In o ma ion Re ie al E alu-
a ion, se . FIRE ’22. New Yo k, NY, USA: Associa-
ion o Compu ing Machine y, 2023, p. 47–53.
[11] R. Mülle , S. Ko nbli h, and G. Hin on, “When does la-
bel smoo hing help?” in P oceedings o he 33 d In e -
na ional Con e ence on Neu al In o ma ion P ocess-
ing Sys ems. Red Hook, NY, USA: Cu an Associa es
Inc., 2019.
[12] B. McFee and J. P. Bello, “S uc u ed aining o la ge-
ocabula y cho d ecogni ion,” in P oceedings o he
18 h In e na ional Socie y o Music In o ma ion Re-
ie al Con e ence, ISMIR 2017, Suzhou, China, Oc o-
be 23-27, 2017, S. J. Cunningham, Z. Duan, X. Hu,
and D. Tu nbull, Eds., 2017, pp. 188–194.
[13] A. Gula i, J. Qin, C. Chiu, N. Pa ma , Y. Zhang, J. Yu,
W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang,
“Con o me : Con olu ion-augmen ed ans o me o
speech ecogni ion,” in In e speech 2020, 21s Annual
Con e ence o he In e na ional Speech Communica-
ion Associa ion, Vi ual E en , Shanghai, China, 25-
29 Oc obe 2020, H. Meng, B. Xu, and T. F. Zheng,
Eds. ISCA, 2020, pp. 5036–5040.
[14] N. C. Tame , Y. Öze , M. Mülle , and X. Se a, “High-
esolu ion iolin ansc ip ion using weak labels,” in
P oceedings o he 24 h In e na ional Socie y o Mu-
sic In o ma ion Re ie al Con e ence, ISMIR 2023,
Milan, I aly, No embe 5-9, 2023, A. Sa i, F. An-
onacci, M. Sandle , P. Bes agini, S. Dixon, B. Liang,
G. Richa d, and J. Pauwels, Eds., 2023, pp. 223–230.
[15] M. Won, Y.-N. Hung, and D. Le, “A ounda ion model
o music in o ma ics,” in ICASSP 2024 - 2024 IEEE
In e na ional Con e ence on Acous ics, Speech and
Signal P ocessing (ICASSP), 2024, pp. 1226–1230.
[16] A. Pol onie i, V. P esu i, and M. Rocamo a,
“Cho dsync: Con o me -Based Alignmen o Cho d
Anno a ions o Music Audio,” in Sound and Music
Compu ing Con e ence - SMC 2024, 2024.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
498
[17] T. Fujishima, “Real ime cho d ecogni ion o musical
sound: a sys em using common lisp music,” in P o-
ceedings o he 1999 In e na ional Compu e Music
Con e ence, ICMC 1999, Beijing, China, Oc obe 22-
27, 1999. Michigan Publishing, 1999.
[18] M. McVica , R. San os-Rod íguez, Y. Ni, and T. D.
Bie, “Au oma ic cho d es ima ion om audio: A e-
iew o he s a e o he a ,” IEEE/ACM T ansac ions
on Audio, Speech, and Language P ocessing, ol. 22,
no. 2, pp. 556–575, 2014.
[19] M. Mauch and S. Dixon, “App oxima e no e ansc ip-
ion o he imp o ed iden i ica ion o di icul cho ds,”
in P oceedings o he 11 h In e na ional Socie y o
Music In o ma ion Re ie al Con e ence, ISMIR 2010,
U ech , Ne he lands, Augus 9-13, 2010, J. S. Downie
and R. C. Vel kamp, Eds., 2010, pp. 135–140.
[20] E. J. Humph ey, T. Cho, and J. P. Bello, “Lea ning
a obus onne z-space ans o m o au oma ic cho d
ecogni ion,” in 2012 IEEE In e na ional Con e ence
on Acous ics, Speech and Signal P ocessing, ICASSP
2012, Kyo o, Japan, Ma ch 25-30, 2012. IEEE, 2012,
pp. 453–456.
[21] F. Ko zeniowski and G. Widme , “A ully con olu-
ional deep audi o y model o musical cho d ecog-
ni ion,” in 26 h IEEE In e na ional Wo kshop on Ma-
chine Lea ning o Signal P ocessing, MLSP 2016, Vi-
e i sul Ma e, Sale no, I aly, Sep embe 13-16, 2016,
F. A. N. Palmie i, A. Uncini, K. I. Diaman a as, and
J. La sen, Eds. IEEE, 2016, pp. 1–6.
[22] S. Sig ia, N. Boulange -Lewandowski, and S. Dixon,
“Audio cho d ecogni ion wi h a hyb id ecu en neu-
al ne wo k,” in P oceedings o he 16 h In e na ional
Socie y o Music In o ma ion Re ie al Con e ence,
ISMIR 2015, Málaga, Spain, Oc obe 26-30, 2015,
M. Mülle and F. Wie ing, Eds., 2015, pp. 127–133.
[23] J. Jiang, K. Chen, W. Li, and G. Xia, “La ge-
ocabula y cho d ansc ip ion ia cho d s uc u e de-
composi ion,” in P oceedings o he 20 h In e na ional
Socie y o Music In o ma ion Re ie al Con e ence,
ISMIR 2019, Del , The Ne he lands, No embe 4-8,
2019, A. Flexe , G. Pee e s, J. U bano, and A. Volk,
Eds., 2019, pp. 644–651.
[24] J. Pa k, K. Choi, S. Jeon, D. Kim, and J. Pa k, “A
bi-di ec ional ans o me o musical cho d ecogni-
ion,” in P oceedings o he 20 h In e na ional Soci-
e y o Music In o ma ion Re ie al Con e ence, ISMIR
2019, Del , The Ne he lands, No embe 4-8, 2019,
A. Flexe , G. Pee e s, J. U bano, and A. Volk, Eds.,
2019, pp. 620–627.
[25] Y. Wu and W. Li, “Au oma ic audio cho d ecogni ion
wi h midi- ained deep ea u e and bls m-c sequence
decoding model,” IEEE/ACM T ansac ions on Audio,
Speech, and Language P ocessing, ol. 27, no. 2, pp.
355–366, 2019.
[26] J. Deng and Y. Kwok, “La ge ocabula y au o-
ma ic cho d es ima ion wi h an e en chance aining
scheme,” in P oceedings o he 18 h In e na ional So-
cie y o Music In o ma ion Re ie al Con e ence, IS-
MIR 2017, Suzhou, China, Oc obe 23-27, 2017, S. J.
Cunningham, Z. Duan, X. Hu, and D. Tu nbull, Eds.,
2017, pp. 531–536.
[27] L. O. Rowe and G. Tzane akis, “Cu iculum lea n-
ing o imbalanced classi ica ion in la ge ocabula y
au oma ic cho d ecogni ion,” in P oceedings o he
22nd In e na ional Socie y o Music In o ma ion Re-
ie al Con e ence, ISMIR 2021, Online, No embe
7-12, 2021, J. H. Lee, A. Le ch, Z. Duan, J. Nam,
P. Rao, P. an K anenbu g, and A. S ini asamu hy,
Eds., 2021, pp. 586–593.
[28] J. A. Bu goyne, J. Wild, and I. Fujinaga, “An expe
g ound u h se o audio cho d ecogni ion and music
analysis,” in P oceedings o he 12 h In e na ional So-
cie y o Music In o ma ion Re ie al Con e ence, IS-
MIR 2011, Miami, Flo ida, USA, Oc obe 24-28, 2011,
A. Klapu i and C. Leide , Eds. Uni e si y o Miami,
2011, pp. 633–638.
[29] H. V. Koops, W. B. de Haas, J. B ansen, and A. Volk,
“Cho d label pe sonaliza ion h ough deep lea ning
o in eg a ed ha monic in e al-based ep esen a ions,”
CoRR, 2017. [Online]. A ailable: h p://a xi .o g/abs/
1706.09552
[30] C. Szegedy, V. Vanhoucke, S. Io e, J. Shlens, and
Z. Wojna, “Re hinking he incep ion a chi ec u e o
compu e ision,” in 2016 IEEE Con e ence on Com-
pu e Vision and Pa e n Recogni ion, CVPR 2016, Las
Vegas, NV, USA, June 27-30, 2016. IEEE Compu e
Socie y, 2016, pp. 2818–2826.
[31] M. Buisson, P. Alonso-Jiménez, and D. Bogdano ,
“Ambigui y modelling wi h label dis ibu ion lea n-
ing o music classi ica ion,” in ICASSP 2022 - 2022
IEEE In e na ional Con e ence on Acous ics, Speech
and Signal P ocessing (ICASSP), 2022, pp. 611–615.
[32] C. Liu and J. JaJa, “Class-simila i y based label
smoo hing o con idence calib a ion,” in A i icial
Neu al Ne wo ks and Machine Lea ning – ICANN
2021, I. Fa kaš, P. Masulli, S. O e, and S. We m e ,
Eds. Cham: Sp inge In e na ional Publishing, 2021,
pp. 190–201.
[33] J. Lienen and E. Hülle meie , “F om label smoo h-
ing o label elaxa ion,” P oceedings o he AAAI Con-
e ence on A i icial In elligence, ol. 35, no. 10, pp.
8583–8591, May 2021.
[34] C. Chiu, J. Qin, Y. Zhang, J. Yu, and Y. Wu, “Sel -
supe ised lea ning wi h andom-p ojec ion quan ize
o speech ecogni ion,” in In e na ional Con e ence
on Machine Lea ning, ICML 2022, 17-23 July 2022,
Bal imo e, Ma yland, USA, se . P oceedings o Ma-
chine Lea ning Resea ch, K. Chaudhu i, S. Jegelka,
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
499
L. Song, C. Szepes á i, G. Niu, and S. Saba o, Eds.,
ol. 162. PMLR, 2022, pp. 3915–3924.
[35] Q. T. Duong, D. H. Nguyen, B. T. Ta, N. M. Le, and
V. H. Do, “Imp o ing sel -supe ised audio ep esen-
a ion based on con as i e lea ning wi h con o me en-
code ,” in P oceedings o he 11 h In e na ional Sym-
posium on In o ma ion and Communica ion Technol-
ogy, se . SoICT ’22. New Yo k, NY, USA: Associa-
ion o Compu ing Machine y, 2022, p. 270–275.
[36] Y. Chae, J. Koo, S. Lee, and K. Lee, “Exploi ing ime-
equency con o me s o music audio enhancemen ,”
in P oceedings o he 31s ACM In e na ional Con e -
ence on Mul imedia, se . MM ’23. New Yo k, NY,
USA: Associa ion o Compu ing Machine y, 2023, p.
2362–2370.
[37] M. W. Ak am, S. De o i, V. Colla, and G. C. Bu azzo,
“Cho d o me : A con o me -based a chi ec u e o
la ge- ocabula y audio cho d ecogni ion,” 2025.
[Online]. A ailable: h ps://a xi .o g/abs/2502.11840
[38] C. Ha e, “Towa ds au oma ic ex ac ion o ha mony
in o ma ion om music signals,” PhD hesis, Depa -
men o Elec onic Enginee ing, Queen Ma y Uni e -
si y o London, 2010.
[39] C. Ra el, B. McFee, E. J. Humph ey, J. Salamon,
O. Nie o, D. Liang, and D. P. W. Ellis, “Mi _e al: A
anspa en implemen a ion o common mi me ics.”
in P oceedings o he 15 h In e na ional Con e ence on
Music In o ma ion Re ie al, 2014, pp. 367–372.
[40] K. Giannos and E. Cambou opoulos, “Symbolic en-
coding o simul anei ies: Re-designing he gene al
cho d ype ep esen a ion,” in DL M ’21: 8 h In e na-
ional Con e ence on Digi al Lib a ies o Musicology,
Vi ual Con e ence, July 28-30, 2021, C. A hu , Ed.
ACM, 2021, pp. 67–74.
[41] H. V. Koops, B. de Haas, J. A. Bu goyne, J. B ansen,
A. Ken -Mulle , and A. Volk, “Anno a o subjec i -
i y in ha mony anno a ions o popula music,” Jou nal
o New Music Resea ch, ol. 48, no. 3, pp. 232–252,
2019.
[42] C. Ha e, M. B. Sandle , S. A. Abdallah, and E. Gómez,
“Symbolic ep esen a ion o musical cho ds: A p o-
posed syn ax o ex anno a ions,” in ISMIR 2005, 6 h
In e na ional Con e ence on Music In o ma ion Re-
ie al, London, UK, 11-15 Sep embe 2005, P oceed-
ings, 2005, pp. 66–71.
[43] M. Mauch, C. Cannam, M. Da ies, S. Dixon, C. Ha e,
S. Kolozali, D. Tidha , and M. Sandle , “Om as2 me a-
da a p ojec 2009,” in P oceedings o he 10 h In e -
na ional Socie y o Music In o ma ion Re ie al Con-
e ence, ISMIR 2009, Kobe In e na ional Con e ence
Cen e , Kobe, Japan, Oc obe 26-30, 2009, 2009.
[44] J. A. Bu goyne, J. Wild, and I. Fujinaga, “An expe
g ound u h se o audio cho d ecogni ion and music
analysis,” in P oceedings o he 12 h In e na ional So-
cie y o Music In o ma ion Re ie al Con e ence, IS-
MIR 2011, Miami, Flo ida, USA, Oc obe 24-28, 2011,
A. Klapu i and C. Leide , Eds. Uni e si y o Miami,
2011, pp. 633–638.
[45] M. Go o, H. Hashiguchi, T. Nishimu a, and R. Oka,
“RWC music da abase: Popula , classical and jazz mu-
sic da abases,” in ISMIR 2002, 3 d In e na ional Con-
e ence on Music In o ma ion Re ie al, Pa is, F ance,
Oc obe 13-17, 2002, P oceedings, 2002, pp. 287–288.
[46] P. Micike icius, S. Na ang, J. Alben, G. F. Di-
amos, E. Elsen, D. Ga cía, B. Ginsbu g, M. Hous-
on, O. Kuchaie , G. Venka esh, and H. Wu, “Mixed
p ecision aining,” in 6 h In e na ional Con e ence on
Lea ning Rep esen a ions, ICLR 2018, Vancou e , BC,
Canada, Ap il 30 - May 3, 2018, Con e ence T ack
P oceedings, 2018.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
500