Investigating an Overfitting and Degeneration Phenomenon in Self-Supervised Multi-Pitch Estimation

Author: Frank Cwitkowitz; Zhiyao Duan

Publisher: Zenodo

DOI: 10.5281/zenodo.17706527

Source: https://zenodo.org/records/17706527/files/000069.pdf

INVESTIGATING AN OVERFITTING AND DEGENERATION
PHENOMENON IN SELF-SUPERVISED MULTI-PITCH ESTIMATION
Frank Cwitkowitz, Zhiyao Duan
Audio Information Research Lab, University of Rochester
[email protected], [email protected]
ABSTRACT
Multi-Pitch Estimation (MPE) continues to be a sought-after capability of Music Information Retrieval (MIR) systems, and is critical for many applications and downstream tasks involving pitch, including music transcription. However, existing methods are largely based on supervised learning, and there are significant challenges in collecting annotated data for the task. Recently, self-supervised techniques exploiting intrinsic properties of pitch and harmonic signals have shown promise for both monophonic and polyphonic pitch estimation, but these still remain inferior to supervised methods. In this work, we extend the classic supervised MPE paradigm by incorporating several self-supervised objectives based on pitch-invariant and pitch-equivariant properties. This joint training results in a substantial improvement under closed training conditions, which naturally suggests that applying the same objectives to a broader collection of data will yield further improvements. However, in doing so we uncover a phenomenon whereby our model simultaneously overfits to the supervised data while degenerating on data used for self-supervision only. We demonstrate and investigate this and offer our insights on the underlying problem.
1. INTRODUCTION
Pitch is a perceptual attribute of sound events that produce waves of harmonics that oscillate at integer multiples of a fundamental frequency (F0) [1]. Pitch is a foundational aspect of music, and it is often useful to represent musical content in terms of relationships between pitch (i.e., melody and harmony). In Music Information Retrieval (MIR) research, the task of detecting pitch activity and estimating the corresponding F0s within a polyphonic signal is known as Multi-Pitch Estimation (MPE) [2]. This is an important task with exciting applications in machine listening, human-computer interaction, and music databasing. Pitch estimation is also necessary for more high-level MIR tasks such as Automatic Music Transcription (AMT), where MPE is often performed in conjunction with the estimation of note events. Currently, state-of-the-art MPE methods are heavily based on supervised machine learning techniques and require large amounts of rich and diverse training data with pitch annotations [3]. However, there are significant challenges with procuring multi-pitch annotations, especially for audio recordings comprising multi-instrument mixtures, difficult-to-annotate polyphonic instruments (e.g., guitar), or less common instruments. For this reason, such methods are unable to scale beyond the available datasets, which are generally homogeneous (e.g., solo piano) or limited in size (i.e., less than 10 hours).

© F. Cwitkowitz and Z. Duan. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: F. Cwitkowitz and Z. Duan, “Investigating an Overfitting and Degeneration Phenomenon in Self-Supervised Multi-Pitch Estimation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

Several strategies have been proposed to mitigate such issues, including semi-supervision on data with weakly aligned annotations [4] or pre-training on mixtures of large-scale monophonic data [5]. However, these methods still fall within the supervised learning paradigm and as such are subject to the size and quality of the data and annotations. In parallel, there has also been work to build music foundation models that learn more general representations of music which can be transferred to downstream tasks [6, 7]. However, these representations still struggle to capture the level of granularity needed for low-level tasks like MPE. An alternative approach is to define task-specific self-supervised objectives that encourage a model to respect properties of pitch, such as equivariance to pitch shifting and invariance to timbral transformations [8, 9]. These techniques have demonstrated remarkable success in learning to estimate pitch from unlabeled data and have also been generalized to polyphonic data [10].

In this work, we expand upon, refine, and integrate the techniques proposed in [10] into a supervised MPE framework resembling that of recent methods employing a convolutional neural network (CNN) [11–14] trained to estimate a multi-pitch salience-gram for an input spectrogram. We show these self-supervised objectives can significantly improve the performance of the supervised framework under a joint training paradigm. However, when attempting to apply the same objectives to additional data with no corresponding supervision, we observe a surprising phenomenon: self-supervision on the additional data does not improve performance but actually steers our model toward degeneration, i.e. blank pitch salience estimates, on such data. The model still exhibits the correct behavior on the validation set of the annotated dataset as well as evaluation data following a similar distribution. We demonstrate this issue and conduct several follow-up experiments in an attempt to identify and explain the underlying problem.
2. FRAMEWORK
In this section, we describe our feature extraction module, model architecture, and training objectives. Our methodology can be viewed as the integration of self-supervised techniques for MPE [10] into a supervised framework.
2.1 Model & Features
We adopt a modified version of the fully convolutional 2D autoencoder used in the Timbre-Trap framework [14]. This model comprises four encoder and decoder blocks with dilated convolutions, residual connections, and strided or transposed convolutions for resampling features across frequency. Although Timbre-Trap was proposed as a unified framework to perform transcription and reconstruction, we adopt only the base model and discard the latent feature used to switch between modes. We also insert layer normalization after the initial convolutional layer of both the encoder and decoder, after the strided and transposed convolution in each encoder and decoder block, and after the latent space convolution. Finally, we double the number of filters in each convolutional layer.

We also replace the complex Constant-Q Transform (CQT) module used in the Timbre-Trap framework with calculation of Harmonic CQT¹ (HCQT) spectrograms [11] $X_H \in [0, 1]^{6 \times K \times N}$ with $K = 440$ frequency bins starting from $f_{min} = 27.5$ Hz and a resolution of 5 bins per semitone. Input audio is resampled to 22,050 Hz, and $N$ is the number of frames using a hop size of 256 samples. We maintain the original set of harmonics $H = \{0.5, 1, 2, 3, 4, 5\}$. The main advantage of the HCQT is its capacity to index harmonic energy across the channel dimension. This structure is perfectly suited for convolutional layers and establishes a strong inductive bias for pitch estimation.
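
As a rough illustration of why the HCQT aligns harmonic energy across channels, the sketch below builds the center-frequency grid implied by the parameters above. It assumes standard geometric CQT spacing; the function and constant names are hypothetical and are not taken from the authors' code.

```python
# Sketch of the HCQT frequency grid (assumes geometric bin spacing).
F_MIN = 27.5             # Hz, lowest bin of the h = 1 channel
BINS_PER_SEMITONE = 5
K = 440                  # frequency bins per channel
HARMONICS = [0.5, 1, 2, 3, 4, 5]


def hcqt_center_frequencies(f_min=F_MIN, k_bins=K,
                            bins_per_semitone=BINS_PER_SEMITONE,
                            harmonics=HARMONICS):
    """Return one list of K center frequencies (Hz) per harmonic channel."""
    bins_per_octave = 12 * bins_per_semitone
    return [
        [h * f_min * 2.0 ** (k / bins_per_octave) for k in range(k_bins)]
        for h in harmonics
    ]


freqs = hcqt_center_frequencies()
# Bin k of channel h sits at h times the frequency of bin k in the h = 1
# channel, so the h-th harmonic of a pitch at bin k appears at index k in
# channel h.
```

With this layout, a convolutional filter that spans the channel dimension sees a pitch and its harmonics at the same frequency index.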
One consequence of our model, denoted by $F(\cdot)$, is the resulting shared dimensionality between $X_H$ and predictions $\hat{Y} = F(X_H)$, which ideally represent multi-pitch salience-grams. This makes it convenient to formulate the self-supervised techniques proposed in Sec. 2.3. Note that this configuration of model and features is very similar to what was used in the SS-MPE framework [10].
2.2 Supervised Training
Given the ground-truth pitch activations $Y \in [0, 1]^{K \times N}$ corresponding to $X_H$, a supervised loss can be defined as

$$L_{spv} = \frac{1}{N} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} B\left(\hat{Y}[k, n], \tilde{Y}[k, n]\right), \tag{1}$$

where $B(\cdot, \cdot)$ represents binary cross-entropy (BCE) loss and $\tilde{Y}$ represents the target multi-pitch salience-gram for $X_H$. Following [11], we blur each frame of $Y$ using a Gaussian kernel with $\sigma = \frac{1}{5}$ semitones (1 bin) to obtain $\tilde{Y}$. Minimization of $L_{spv}$ represents the classic training objective of supervised MPE and is used as the primary training signal within our framework.
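
The supervised objective in (1) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: arrays are stored as `[N][K]` nested lists (frame-major, the transpose of the paper's `[k, n]` indexing), and the kernel radius, peak normalization, clipping, and `EPS` guard are all assumptions made here.

```python
import math

EPS = 1e-7  # numerical guard for log(0); an assumption of this sketch


def bce(p, t):
    """Binary cross-entropy between a prediction p and a target t in [0, 1]."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def blur_frame(frame, sigma_bins=1.0, radius=3):
    """Blur one frame of activations with a peak-normalized Gaussian kernel."""
    kernel = [math.exp(-0.5 * (d / sigma_bins) ** 2)
              for d in range(-radius, radius + 1)]
    blurred = []
    for k in range(len(frame)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = k + j - radius
            if 0 <= idx < len(frame):
                acc += weight * frame[idx]
        blurred.append(min(acc, 1.0))  # keep targets inside [0, 1]
    return blurred


def supervised_loss(y_hat, y):
    """Eq. (1): sum BCE over bins, then average over frames."""
    total = 0.0
    for pred_frame, gt_frame in zip(y_hat, y):
        target = blur_frame(gt_frame)
        total += sum(bce(p, t) for p, t in zip(pred_frame, target))
    return total / len(y)
```

For a fixed target, BCE is minimized when the prediction equals the target, so a prediction matching the blurred ground truth scores lower than an uninformative one.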
¹ As in [10], a variable Q-factor [15] is employed for improved computational efficiency and increased time resolution at lower frequencies.
2.3 Self-Supervised Techniques
2.3.1 Invariance & Equivariance Objectives
We further define two classes of self-supervised objectives, adapted from [10], based on pitch-invariant and pitch-equivariant properties. Under our framework, these objectives are meant to encourage the model to implicitly encode various properties of pitch. A pitch-invariant transformation $t_{iv}(\cdot)$ performs some manipulation of $X_H$ that ideally should not affect the predicted multi-pitch salience-gram $\hat{Y}$. These transformations can be used to formulate invariance-based losses of the form

$$L_{iv} = \frac{1}{N} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} B\left(F(t_{iv}(X_H))[k, n], \hat{Y}[k, n]\right). \tag{2}$$

While the relative strength of energy at harmonic frequencies is primarily what influences timbre, pitch is determined by an actual or implied F0. As such, we simulate pitch-invariant timbral transformations $t_{iv-t}$ by applying random parabolic equalization curves $u[k] = 1 - 2\alpha(k - \beta)^2$ [16], where $\beta \in [0, K - 1]$ and $\alpha \in [0, \frac{1}{(K - 1)^2}]$ are sampled uniformly, to each frame and channel of $X_H$, to define a timbre-invariance loss $L_{iv-t}$. Similarly, there is no discernible pitch associated with non-harmonic sounds, which nonetheless make up an important aspect of music (e.g., percussion). As such, we create musically-relevant pitch-invariant transformations $t_{iv-p}$ by randomly sampling and superimposing percussive audio from the Expanded Groove MIDI Dataset (E-GMD) [17] (at the waveform-level) with volume in $[0, 1]$, sampled uniformly, to define a percussion-invariance loss $L_{iv-p}$.
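
To make the timbre transformation concrete, the following sketch samples one parabolic equalization curve from the stated ranges. Applying the curve multiplicatively, and leaving its values unclipped, are assumptions of this sketch; the text does not specify either detail, and the function names are hypothetical.

```python
import random


def sample_equalization_curve(k_bins, rng=random):
    """Sample u[k] = 1 - 2 * alpha * (k - beta)^2 for k = 0, ..., K - 1."""
    beta = rng.uniform(0.0, k_bins - 1)                # vertex position (bins)
    alpha = rng.uniform(0.0, 1.0 / (k_bins - 1) ** 2)  # curvature
    return [1.0 - 2.0 * alpha * (k - beta) ** 2 for k in range(k_bins)]


def apply_curve(frame, curve):
    """Scale one frame of one channel bin by bin (multiplicative use assumed)."""
    return [x * u for x, u in zip(frame, curve)]
```

Because the curve only reweights energy across frequency, the pitch content of the frame is unchanged, which is what the invariance loss in (2) relies on.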
Conversely, a pitch-equivariant transformation $t_{ev}(\cdot)$ performs some manipulation of $X_H$ that ideally should correspond to a parallel manipulation of the predicted multi-pitch salience-gram $\hat{Y}$. These transformations can be used to formulate equivariance-based losses of the form

$$L_{ev} = \frac{1}{N} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} B\left(F(t_{ev}(X_H))[k, n], t_{ev}(\hat{Y})[k, n]\right). \tag{3}$$

The HCQT spectrograms fed into our model and the corresponding expected multi-pitch salience-grams have an equivariant relationship for various geometric transformations. These include vertical translations, which correspond to a pitch shift of $\frac{\Delta k}{5}$ semitones, horizontal translations, which correspond to a time delay of $\frac{4 \Delta n}{N}$ seconds, and horizontal stretching, which corresponds to a speed-up by a factor of $\gamma$. We perform random pitch-equivariant transformations $t_{ev-g}$ with uniformly sampled $\Delta k \in [-b_{oct}, b_{oct}]$ bins, $\Delta n \in [-\frac{N}{4}, \frac{N}{4}]$ frames, and $\gamma \in [0.5, 2]$² to define a geometric-equivariance loss $L_{ev-g}$.
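
The geometric transformations can be illustrated with a small sketch. It shows a vertical translation with zero-filling (the fill strategy is an assumption), the unit conversions quoted above, and the two-interval sampling of the stretch factor. Arrays are `[N][K]` nested lists and all names are hypothetical.

```python
import random


def sample_gamma(rng=random):
    """Stretch factor: [0.5, 1] or [1, 2], each interval chosen half the time."""
    if rng.random() < 0.5:
        return rng.uniform(0.5, 1.0)
    return rng.uniform(1.0, 2.0)


def translate_vertical(frames, delta_k):
    """Shift every frame by delta_k bins (positive = up), zero-filling gaps."""
    k_bins = len(frames[0])
    shifted = []
    for frame in frames:
        out = [0.0] * k_bins
        for k, value in enumerate(frame):
            if 0 <= k + delta_k < k_bins:
                out[k + delta_k] = value
        shifted.append(out)
    return shifted


def shift_in_semitones(delta_k, bins_per_semitone=5):
    """Pitch shift corresponding to a vertical translation of delta_k bins."""
    return delta_k / bins_per_semitone


def delay_in_seconds(delta_n, n_frames, excerpt_seconds=4.0):
    """Time delay corresponding to a horizontal translation of delta_n frames."""
    return excerpt_seconds * delta_n / n_frames
```

For the equivariance loss in (3), the same translation would be applied both to the input spectrogram and to the prediction that serves as the target.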
While all of these invariance and equivariance properties can be learned implicitly to some degree through a supervised objective or fully convolutional inductive bias, an explicit training signal can lead to less overfitting. Moreover, these techniques can broaden the training data and introduce previously unseen elements such as percussion.
² Sampled uniformly from [0.5, 1] and [1, 2] in equal proportion.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
2.3.2 Energy-Based Stimulus
Objectives based on the losses from Sec. 2.3.1 can be vulnerable to trivial solutions, e.g. uniformly inactive or active multi-pitch salience-grams. One way to protect against such degeneration is to inject some sort of energy-based stimulus. The ground-truth and its derivative $\tilde{Y}$ are one form of stimulus that can prevent collapse, but of course they are not always available. In lieu of ground-truth, a loss leveraging energy-based targets can be formulated as

$$L_{eg} = \frac{1}{N} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} B\left(\hat{Y}[k, n], \tilde{X}[k, n]\right), \tag{4}$$

where $\tilde{X}^{(lin)} = \sum_{h=1}^{5} \frac{1}{h^4} X_h^{(lin)}$ represents a weighted average of $X_H$ across harmonic channels, computed in linear-scale and converted to decibel-scale. Note that (4) is a simplification of the harmonic and support loss used originally in [14]. While this loss will protect against trivial solutions, the target $\tilde{X}[k, n]$ by nature is quite coarse and contains many false alarms (see [14]). A simple improvement is to induce sparsity through another loss:

$$L_{spr} = \frac{1}{N} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} \hat{Y}[k, n]. \tag{5}$$

In practice, $L_{eg}$ and $L_{spr}$, if computed at all, are coupled and computed only for predictions without ground-truth.
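
The interplay between the two losses in (4) and (5) can be seen in a short sketch. It takes the energy-based target as a given `[N][K]` nested list, since the exact linear-to-decibel conversion and rescaling are not described here; the function names and the `EPS` guard are assumptions.

```python
import math

EPS = 1e-7  # numerical guard for log(0); an assumption of this sketch


def bce(p, t):
    """Binary cross-entropy between a prediction p and a target t in [0, 1]."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def energy_loss(y_hat, x_tilde):
    """Eq. (4): BCE against the energy-based target, averaged over frames."""
    total = sum(bce(p, t)
                for pred_frame, target_frame in zip(y_hat, x_tilde)
                for p, t in zip(pred_frame, target_frame))
    return total / len(y_hat)


def sparsity_loss(y_hat):
    """Eq. (5): total predicted salience, averaged over frames."""
    return sum(sum(frame) for frame in y_hat) / len(y_hat)
```

A blank prediction minimizes the sparsity term but is penalized heavily by the energy term wherever the target carries energy, which is why the two are used together.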
3. EXPERIMENTS
In this section, we detail our experimental setup and our initial investigation into the joint training paradigm.
3.1 Training & Evaluation Details
We train and validate the model in each experiment on URMP [18] following the splits proposed in [3]. Training is conducted on batches of 4-second excerpts using the AdamW optimizer [19] with batch size 8 and learning rate 0.0005 for the encoder and 0.00025 for the decoder. Only one excerpt per track is sampled over the course of each epoch. In experiments with self-supervision on additional data, the batch size is expanded to accommodate extra samples without reducing the amount of supervision. The supervised objective (1) is computed and averaged across supervised samples, whereas the self-supervised objectives (2-5) are computed and averaged across all samples within each batch. Since (1-4) are formulated using BCE, they all operate on roughly the same numerical scale. Learning rate warmup is applied over the first 100 epochs of training, and gradient clipping with an L2-norm of 1.0 is applied to improve training stability. The final model for each experiment is chosen as the checkpoint with the maximum F1-score on the validation set across 2500 epochs.
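
The averaging scheme described above can be sketched as follows. The function is hypothetical and assumes the per-sample loss values have already been computed, with the annotated samples placed first in the batch and the terms combined by an unweighted sum.

```python
def batch_objective(supervised_losses, self_supervised_losses):
    """
    Combine per-sample losses for one batch.

    supervised_losses: values of Eq. (1), one per annotated sample.
    self_supervised_losses: summed self-supervised terms, one per sample in
        the whole batch (annotated samples first, then any extra samples).
    """
    if len(self_supervised_losses) < len(supervised_losses):
        raise ValueError("every annotated sample is also self-supervised")
    l_spv = sum(supervised_losses) / len(supervised_losses)
    l_self = sum(self_supervised_losses) / len(self_supervised_losses)
    return l_spv + l_self
```

With 8 annotated samples and 16 extra samples, the supervised term is still averaged over 8 values, while the self-supervised term is averaged over all 24.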
We evaluate on several MPE and AMT datasets, including Bach10 [20], Su [21], TRIOS [22], the ten-track test set of MusicNet [23], and GuitarSet [24]. We utilize the community-standard mir_eval package [25] to compute precision (P), recall (R), and F1-score (F1). Multi-pitch estimates are generated by performing local peak-picking on the output multi-pitch salience-grams and thresholding at 0.5. The final results are computed by averaging across all tracks within an individual dataset.
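
The post-processing step can be illustrated for a single frame. This sketch assumes a strict local maximum along frequency, a threshold applied as "greater than or equal", and the geometric bin spacing of the input features; these choices and the function name are assumptions rather than details confirmed by the text.

```python
def pick_pitches(salience_frame, f_min=27.5, bins_per_semitone=5,
                 threshold=0.5):
    """Return frequencies (Hz) of thresholded local maxima in one frame."""
    bins_per_octave = 12 * bins_per_semitone
    last = len(salience_frame) - 1
    pitches = []
    for k, value in enumerate(salience_frame):
        left = salience_frame[k - 1] if k > 0 else float("-inf")
        right = salience_frame[k + 1] if k < last else float("-inf")
        # Keep bins that are strictly greater than both neighbors.
        if value >= threshold and value > left and value > right:
            pitches.append(f_min * 2.0 ** (k / bins_per_octave))
    return pitches
```

Per-frame frequency lists of this kind, paired with frame times, are the form of input expected by `mir_eval.multipitch.evaluate`.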
3.2 Baselines
We compare results of our experiments to several supervised CNN-based approaches to MPE. Deep-Salience [11] feeds an HCQT spectrogram with harmonics $H$ and 5 bins per semitone into several convolutional layers to produce a multi-pitch salience-gram. It is functionally similar to our framework under a supervised-only setting. The model was trained on a private subset of multitrack mixtures (including percussion) from MedleyDB [26]. Basic-Pitch [12] is similar to Deep-Salience, performing MPE at 3 bins per semitone, but employs a more shallow network and further estimates pitch and onset activations at 1 bin per semitone to generate note predictions. The input to the model is an approximation of the HCQT.³ Data augmentation techniques including additive noise, equalization, and reverb simulation are also utilized. The model was trained on portions of several medium-to-large-sized datasets, including GuitarSet [24]. PUnet:XL [13] is a method drawing inspiration from the idea of pre-stacking AMT models with a U-Net [27]. It processes fixed-length windows of a 6-octave HCQT with harmonics $H$ and 3 bins per semitone, and makes predictions at 1 bin per semitone. The model is trained on MusicNet [23] and incorporates an auxiliary task of degree-of-polyphony estimation at the latent layer and data augmentation techniques including transposition (similar to our geometric-equivariance objective), tuning manipulation, additive noise, and equalization following [16]. Timbre-Trap [14] is a 2D autoencoder designed to perform MPE and audio synthesis jointly based on a simple conditioning mechanism at the latent space. The backbone architecture is nearly identical to that of our framework, barring the modifications noted in Sec. 2.1. However, in the original framework the model received both the real and imaginary part (as separate channels) of an invertible complex CQT [28] as input. Timbre-Trap was trained on URMP [18] following the same splits used in this work. We follow the same post-processing steps described in Sec. 3.1 to evaluate the baseline models,⁴ and provide the results at the top of Table 1.
3.3 Joint Training Paradigm
We first conduct an initial set of experiments evaluating the invariance- and equivariance-based self-supervised objectives under closed training conditions on URMP [18]. In particular, we experiment with the supervised objective in isolation, the supervised objective with each invariance- and equivariance-based objective independently, and all of these objectives together: $L_{total} = L_{spv} + L_{iv-t} + L_{iv-p} + L_{ev-g}$. The results are given at the bottom of Table 1. The first thing to note is that overall our framework under the supervised-only setting achieves results comparable to or better than each of the baselines for several datasets. This is especially true with respect to Timbre-Trap [14], which is arguably the most comparable due to its similar architecture and identical training data. However, PUnet:XL [13] appears to offer a significant advantage for AMT datasets (i.e., Su, TRIOS, and MusicNet) since it was trained on such data to predict pitch directly at the note-level. Next, we can observe that each self-supervised objective has the potential to improve performance on one or more datasets. The timbre-invariance objective is least effective and has a mixed effect across datasets, which is somewhat contrary to what has been observed under the fully self-supervised context [10]. It is possible that the property of timbre-invariance is already strongly indicated by the supervised objective. The percussion-invariance objective is moderately beneficial, even for some datasets without percussion. Out of all the evaluation datasets, there is only one track in TRIOS [22] that has percussion. However, even non-percussive audio can have percussive elements, e.g. originating from playing notes on certain instruments. The geometric-equivariance objective is most effective and yields a significant improvement across all datasets. This is likely due to the model capturing harmonic relationships explicitly and efficiently by leveraging the shift-invariance exhibited by the HCQT representation. Combining the supervised objective with all invariance- and equivariance-based objectives produces the best performance overall, suggesting that each contributes distinctly to robustness. Given that our framework was trained with such a small amount of audio (i.e., 1-2 hours), these results are quite remarkable, especially when considering performance on data unseen during training (i.e., GuitarSet).

³ Note that Basic-Pitch utilizes 7 harmonics and a sub-harmonic.
⁴ Adopting the original hyperparameters, we threshold Deep-Salience and Basic-Pitch at 0.3, and PUnet:XL (without peak-picking) at 0.4.

Table 1. Comparison of precision (P), recall (R), and F1-score (F1) (in percentage points) over several MPE and AMT datasets for baseline methods* and experiments conducted using the proposed framework with self-supervised objectives.

Method               | Bach10 P / R / F1  | Su P / R / F1      | TRIOS P / R / F1   | MusicNet P / R / F1 | GuitarSet P / R / F1
Deep-Salience [11]   | 86.0 / 61.0 / 71.3 | 74.1 / 47.9 / 57.1 | 92.3 / 39.4 / 54.2 | 63.0 / 48.1 / 53.3  | 77.7 / 70.6 / 72.2
Basic-Pitch [12]     | 90.2 / 75.7 / 82.2 | 54.2 / 44.1 / 47.5 | 88.2 / 44.2 / 57.9 | 50.6 / 42.1 / 45.7  | 80.9 / 75.9 / 77.7
PUnet:XL [13]        | 88.2 / 77.6 / 82.5 | 76.2 / 69.8 / 71.8 | 89.6 / 51.3 / 64.8 | 77.6 / 67.6 / 72.0  | 74.7 / 55.2 / 62.4
Timbre-Trap [14]     | 81.2 / 84.2 / 82.6 | 52.1 / 53.0 / 51.4 | 69.4 / 49.7 / 56.8 | 44.1 / 57.3 / 48.7  | 48.6 / 75.6 / 58.0
L_spv                | 88.8 / 83.1 / 85.8 | 61.0 / 47.7 / 52.1 | 82.7 / 48.0 / 59.4 | 53.8 / 54.7 / 53.6  | 70.2 / 72.9 / 69.8
L_spv + L_iv-t       | 88.5 / 81.7 / 84.9 | 62.5 / 47.1 / 52.3 | 84.4 / 46.4 / 58.8 | 54.9 / 54.3 / 53.9  | 75.4 / 70.1 / 70.6
L_spv + L_iv-p       | 88.0 / 84.1 / 85.9 | 57.3 / 52.5 / 53.4 | 81.0 / 50.1 / 60.6 | 50.4 / 58.9 / 53.8  | 70.9 / 78.0 / 73.2
L_spv + L_ev-g       | 91.5 / 88.6 / 90.0 | 64.9 / 62.9 / 62.9 | 90.7 / 57.9 / 69.8 | 58.0 / 62.9 / 59.8  | 79.7 / 80.4 / 79.3
L_total (Ref.)       | 92.1 / 88.0 / 90.0 | 65.0 / 65.0 / 64.1 | 91.0 / 58.4 / 70.2 | 55.8 / 65.1 / 59.6  | 80.5 / 82.3 / 80.9

* Grayed values indicate a portion of the test data was used for training.

3.4 Self-Supervision on Additional Data
Given the success with integrating self-supervised objectives into our supervised framework, it is reasonable to question whether the same objectives could be applied to more general music datasets lacking multi-pitch annotations. Indeed, training a model to maintain pitch-invariant and pitch-equivariant properties over a broader collection of data could be one potential way to circumvent issues with low data availability for MPE. In this vein, we conduct additional experiments under the joint training paradigm where additional data is included in each batch for self-supervision only. Specifically, we repeat the experiment combining all objectives (denoted Ref.), but with an additional 16 samples per batch which only influence the invariance- and equivariance-based losses. The datasets we use for additional samples represent different music domains, i.e., simple synthetic monophonic data (NSynth [29]), recordings of classical music mixtures (MusicNet [23]), and high-quality production-level audio (FMA [30]). We extract samples from the training splits of NSynth and MusicNet, and the large (30-second clip) variant of FMA.

Surprisingly, these experiments all exhibit undesirable behavior: performance for URMP [18] remains stable and consistent with experiments from Sec. 3.3, but performance for the other datasets collapses. The performance at validation checkpoints over the course of training is presented in Fig. 1. Upon closer inspection, we found that the model predictions degenerate to a trivial solution (blank predictions) for all datasets except for URMP. As such, we repeated each experiment with the energy-based objectives (+EG) from Sec. 2.3.2 on the data used for self-supervision. Although this does prevent collapse, it ultimately still leads to degraded performance. Finally, we experiment with initializing the model with the weights from the best validation checkpoint of Ref. and fine-tuning with $\frac{1}{5}$ the learning rate (-FT). However, the two-stage fine-tuning paradigm still exhibits the same behavior.
4. DISCUSSION
In this section, we investigate the phenomenon uncovered in Sec. 3.4 further and conduct several follow-up experiments in an effort to identify the underlying problem.
4.1 Overfitting & Degeneration
In order to illustrate and characterize the problem of degeneration, Fig. 2 shows predictions for a single sample from Ref. along with predictions for the same sample at 25%, 50%, and 75% of the full duration of fine-tuning on FMA. It can clearly be seen that the strength of predictions (i.e., recall) decreases over time, indicating that there is a trend towards a trivial solution. This also suggests that for the regular experiments without fine-tuning, the model is always struggling to move past the trivial solution.

[Figure 1: six panels (URMP, Bach10, Su, TRIOS, MusicNet, GuitarSet); x-axis: # Batches (0 to 10000); y-axis: F1-Score (0.0 to 1.0); legend: Ref., +NS-16, +NS-16+EG, +MN-16, +MN-16+EG, +MN-16-FT, +FMA-16, +FMA-16+EG, +FMA-16-FT]
Figure 1. Performance over the course of training for experiments leveraging an additional 16 samples per batch from NSynth [29], MusicNet [23], and FMA [30] for self-supervision. EG - Energy-based objectives. FT - Fine-tuning scheme.

[Figure 2: five panels (Ground-Truth, Ref., +FMA-16-FT@2500, +FMA-16-FT@5000, +FMA-16-FT@7500)]
Figure 2. Ground-truth, baseline, and intermediate predictions at 25%, 50%, and 75% of the duration of the experiment with additional self-supervision on FMA [30] under fine-tuning scheme for track 01-AchGottundHerr of Bach10 [20].

Next, to examine whether the role of self-supervised learning was too extreme, we repeat the MusicNet [23] experiments with an exponentially decreasing amount of samples for self-supervision only (i.e., with 8, 4, and 2 additional samples). The performance of these experiments at validation checkpoints is plotted in Fig. 3. Interestingly, there is a noticeable relationship between the amount of samples for self-supervision only and the severity of degeneration. Moreover, the degeneration on MusicNet [23] itself is quite prominent and also exhibits this relationship.
4.2 Training Distributions
In order to see whether there is an issue regarding mismatch between the distribution of the data used for supervision and that of the data used for self-supervision only, we further extract 10 samples from the URMP [18] training set spanning multiple degrees of polyphony. We denote this split as URMP-T2 and the remainder as URMP-T1. We then re-run the reference experiment using only URMP-T1 for training, and then again with additional self-supervised learning and no corresponding supervision on URMP-T2. For completeness, we further run this experiment applying the energy-based objectives to URMP-T2. The performance of these experiments is plotted in Fig. 4.

Relative to Ref., performance barely decreases when only URMP-T1 is used for training, which is quite interesting by itself. More importantly, when conducting additional self-supervision on the 10 samples of URMP-T2, which follow a very similar distribution to URMP-T1, there is still a moderate decrease in performance. The only difference between these two experiments is that each of the self-supervised losses is averaged over the samples from both URMP-T1 and URMP-T2 for each batch instead of only URMP-T1. It is worth noting that the collapse here happens to be less extreme than when training with self-supervision on other datasets with more samples (see Fig. 3). Furthermore, the degradation on URMP and Bach10 is actually more extreme than what we observed when training on NSynth (+NS-16), but we note that this could also be explained by less supervision on URMP. Another interesting observation is that the energy-based objectives actually further degrade performance, likely due to conflicting with the supervised objective on the same distribution.
4.3 Mitigating Degeneration
Given all of our observations, it appears the underlying issue is too strong of a pull towards the trivial solution for the non-supervised data, irrespective of its distribution. As it stands, self-supervision without corresponding supervision essentially pushes the model to degenerate on data following a certain distribution. If the distribution of data used for self-supervision is close to that of the supervised data, it will by extension hurt performance on the supervised data. If the distribution of data used for self-supervision is distinct from that of the supervised data, it will have less of an effect on the performance for data outside the distribution. It is unclear whether these interactions would persist if larger and more diverse data were used for supervision.

[Figure 3: six panels (URMP, Bach10, Su, TRIOS, MusicNet, GuitarSet); x-axis: # Batches (0 to 10000); y-axis: F1-Score (0.0 to 1.0); legend: Ref., +MN-16, +MN-8, +MN-4, +MN-2, +MN-16+EG, +MN-16-FT]
Figure 3. Performance over the course of training for experiments varying amount of self-supervision on MusicNet [23].

[Figure 4: six panels (URMP, Bach10, Su, TRIOS, MusicNet, GuitarSet); x-axis: # Batches (0 to 7000); y-axis: F1-Score (0.0 to 1.0); legend: Ref., T1, T1/T2, T1/T2+EG]
Figure 4. Performance over the course of training for experiments which use a reduced portion (T1) of our URMP [18] training set for supervised training with additional self-supervision on the remaining 10 non-overlapping samples (T2).

Degeneration also seems to be unique to MPE, since self-supervised methods for monophonic pitch estimation [8, 9] rely on the inductive bias of monophony and formulate their objectives using categorical cross-entropy. A solution for the polyphonic setting may require some sort of objective that enforces the existence of content in the predictions, to protect against the trivial solution. The energy-based objectives are one such protection, but they evidently remove too much flexibility and lead to worse predictions. Despite these current challenges, we still believe that self-supervised learning holds promise for advancing MPE.
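
The contrast between the two output formulations can be made concrete with a toy example. It is a sketch of the blank-output case only, not a reproduction of any cited method: with independent per-bin outputs scored by BCE, two blank frames agree almost perfectly, whereas a categorical (softmax) output always carries unit mass and so cannot be blank.

```python
import math

EPS = 1e-7  # numerical guard for log(0); an assumption of this sketch


def bce(p, t):
    """Binary cross-entropy between a prediction p and a target t in [0, 1]."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Polyphonic case: a blank frame is a valid output, and comparing the blank
# prediction for a transformed input against the blank prediction for the
# original input costs almost nothing.
blank = [0.0, 0.0, 0.0, 0.0]
consistency_loss = sum(bce(p, t) for p, t in zip(blank, blank))

# Monophonic case: softmax probabilities sum to one for any logits, so the
# output frame always contains some activity.
probs = softmax([-50.0, -50.0, -50.0, -50.0])
```

Here `consistency_loss` is close to zero, which is why invariance and equivariance losses alone do not discourage blank salience-grams.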
5. CONCLUSION
We have demonstrated that self-supervised objectives can substantially improve upon the standard supervised training paradigm for MPE. However, in attempting to extend self-supervised learning beyond the distribution of data that is already grounded with supervised learning, we encounter issues whereby our model simultaneously overfits to the distribution of the supervised training data while degenerating on the distribution of the self-supervised training data. Self-supervised objectives utilizing energy-based targets can protect against degeneration, but these are too inflexible. Fine-tuning cannot circumvent the problem either. Moreover, we show that degeneration persists even when the supervised and self-supervised training data are taken from the same distribution. We conclude with several remarks and ideas toward overcoming highlighted issues.
6. ACKNOWLEDGMENTS
This work is supported by National Science Foundation (NSF) grant No. 2222129 and synergistic activities funded by NSF grant DGE-1922591.
7. REFERENCES
[1] M. Müller, Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer, 2015.
[2] E. Benetos, S. Dixon, Z. Duan, and S. Ewert, “Automatic music transcription: An overview,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20–30, 2019.
[3] J. Gardner, I. Simon, E. Manilow, C. Hawthorne, and J. Engel, “MT3: Multi-task multitrack music transcription,” in Proceedings of ICLR, 2021.
[4] B. Maman and A. H. Bermano, “Unaligned supervision for automatic music transcription in the wild,” in Proceedings of ICML, 2022.
[5] I. Simon, J. Gardner, C. Hawthorne, E. Manilow, and J. Engel, “Scaling polyphonic transcription with mixtures of monophonic transcriptions,” in Proceedings of ISMIR, 2022.
[6] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Lin, A. Ragni, E. Benetos, N. Gyenge et al., “MERT: Acoustic music understanding model with large-scale self-supervised training,” in Proceedings of ICLR, 2024.
[7] W. Liao, Y. Takida, Y. Ikemiya, Z. Zhong, C.-H. Lai, G. Fabbro, K. Shimada, K. Toyama, K. Cheuk, M. A. Martínez-Ramírez et al., “Music foundation model as generic booster for music downstream tasks,” arXiv preprint arXiv:2411.01135, 2024.
[8] B. Gfeller, C. Frank, D. Roblek, M. Sharifi, M. Tagliasacchi, and M. Velimirović, “SPICE: Self-supervised pitch estimation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 28, pp. 1118–1128, 2020.
[9] A. Riou, S. Lattner, G. Hadjeres, and G. Peeters, “PESTO: Pitch estimation with self-supervised transposition-equivariant objective,” in Proceedings of ISMIR, 2023.
[10] F. Cwitkowitz and Z. Duan, “Toward fully self-supervised multi-pitch estimation,” arXiv preprint arXiv:2402.15569, 2024.
[11] R. M. Bittner, B. McFee, J. Salamon, P. Li, and J. P. Bello, “Deep salience representations for F0 estimation in polyphonic music,” in Proceedings of ISMIR, 2017.
[12] R. M. Bittner, J. J. Bosch, D. Rubinstein, G. Meseguer-Brocal, and S. Ewert, “A lightweight instrument-agnostic model for polyphonic note transcription and multipitch estimation,” in Proceedings of ICASSP, 2022.
[13] C. Weiß and G. Peeters, “Comparing deep models and evaluation strategies for multi-pitch estimation in music recordings,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 30, pp. 2814–2827, 2022.
[14] F. Cwitkowitz, K. W. Cheuk, W. Choi, M. A. Martínez-Ramírez, K. Toyama, W.-H. Liao, and Y. Mitsufuji, “Timbre-Trap: A low-resource framework for instrument-agnostic music transcription,” in Proceedings of ICASSP, 2024.
[15] C. Schörkhuber, A. Klapuri, N. Holighaus, and M. Dörfler, “A matlab toolbox for efficient perfect reconstruction time-frequency transforms with log-frequency resolution,” in Proceedings of AES, 2014.
[16] J. Abeßer and M. Müller, “Jazz bass transcription using a U-net architecture,” Electronics, vol. 10, no. 6, p. 670, 2021.
[17] L. Callender, C. Hawthorne, and J. Engel, “Improving perceptual quality of drum transcription with the expanded groove MIDI dataset,” arXiv preprint arXiv:2004.00188, 2020.
[18] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, “Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications,” IEEE Transactions on Multimedia, vol. 21, pp. 522–535, 2018.
[19] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proceedings of ICLR, 2019.
[20] Z. Duan, B. Pardo, and C. Zhang, “Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions,” IEEE Transactions on Audio, Speech, and Language Processing (TASLP), vol. 18, no. 8, pp. 2121–2133, 2010.
[21] L. Su and Y.-H. Yang, “Escaping from the abyss of manual annotation: New methodology of building polyphonic datasets for automatic music transcription,” in Proceedings of CMMR, 2015.
[22] J. Fritsch, “High quality musical audio source separation,” Master’s thesis, UPMC / IRCAM / Telécom ParisTech, 2012.
[23] J. Thickstun, Z. Harchaoui, and S. Kakade, “Learning features of music from scratch,” in Proceedings of ICLR, 2017.
[24] Q. Xi, R. M. Bittner, J. Pauwels, X. Ye, and J. P. Bello, “GuitarSet: A dataset for guitar transcription,” in Proceedings of ISMIR, 2018.
[25] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. Ellis, “mir_eval: A transparent implementation of common MIR metrics,” in Proceedings of ISMIR, 2014.
[26] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, “MedleyDB: A multitrack dataset for annotation-intensive MIR research,” in Proceedings of ISMIR, 2014.
[27] F. Pedersoli, G. Tzanetakis, and K. M. Yi, “Improving music transcription by pre-stacking a U-Net,” in Proceedings of ICASSP, 2020.
[28] N. Holighaus, M. Dörfler, G. A. Velasco, and T. Grill, “A framework for invertible, real-time constant-Q transforms,” IEEE Transactions on Audio, Speech, and Language Processing (TASLP), vol. 21, pp. 775–785, 2012.
[29] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in Proceedings of ICML, 2017.
[30] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” in Proceedings of ISMIR, 2017.
