Investigating an Overfitting and Degeneration Phenomenon in Self-Supervised Multi-Pitch Estimation

Author: Frank Cwitkowitz; Zhiyao Duan
Publisher: Zenodo
DOI: 10.5281/zenodo.17706527
Source: https://zenodo.org/records/17706527/files/000069.pdf
INVESTIGATING AN OVERFITTING AND DEGENERATION
PHENOMENON IN SELF-SUPERVISED MULTI-PITCH ESTIMATION
Frank Cwitkowitz, Zhiyao Duan
Audio Information Research Lab, University of Rochester
ABSTRACT
Multi-Pitch Estimation (MPE) continues to be a sought-after capability of Music Information Retrieval (MIR) systems, and is critical to many applications and downstream tasks involving pitch, including music transcription. However, existing methods are largely based on supervised learning, and there are significant challenges in collecting annotated data for the task. Recently, self-supervised techniques exploiting intrinsic properties of pitch and harmonic signals have shown promise for both monophonic and polyphonic pitch estimation, but these still remain inferior to supervised methods. In this work, we extend the classic supervised MPE paradigm by incorporating several self-supervised objectives based on pitch-invariant and pitch-equivariant properties. This joint training results in a substantial improvement under closed training conditions, which naturally suggests that applying the same objectives to a broader collection of data will yield further improvements. However, in doing so we uncover a phenomenon whereby our model simultaneously overfits to the supervised data while degenerating on data used for self-supervision only. We demonstrate and investigate this and offer our insights on the underlying problem.
1. INTRODUCTION
Pitch is a perceptual attribute of sound events that produce waves of harmonics that oscillate at integer multiples of a fundamental frequency (F0) [1]. Pitch is a foundational aspect of music, and it is often useful to represent musical content in terms of relationships between pitch (i.e., melody and harmony). In Music Information Retrieval (MIR) research, the task of detecting pitch activity and estimating the corresponding F0s within a polyphonic signal is known as Multi-Pitch Estimation (MPE) [2]. This is an important task with exciting applications in machine listening, human-computer interaction, and music databasing. Pitch estimation is also necessary for more high-level MIR tasks such as Automatic Music Transcription (AMT), where MPE is often performed in conjunction with the estimation of note events. Currently, state-of-the-art MPE methods are heavily based on supervised machine learning techniques and require large amounts of rich and diverse training data with pitch annotations [3]. However, there are significant challenges with procuring multi-pitch annotations, especially for audio recordings comprising multi-instrument mixtures, difficult-to-annotate polyphonic instruments (e.g., guitar), or less common instruments. For this reason, such methods are unable to scale beyond the available datasets, which are generally homogeneous (e.g., solo piano) or limited in size (i.e., less than 10 hours).

© F. Cwitkowitz and Z. Duan. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: F. Cwitkowitz and Z. Duan, “Investigating an Overfitting and Degeneration Phenomenon in Self-Supervised Multi-Pitch Estimation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Several strategies have been proposed to mitigate such issues, including semi-supervision on data with weakly aligned annotations [4] or pre-training on mixtures of large-scale monophonic data [5]. However, these methods still fall within the supervised learning paradigm and as such are subject to the size and quality of the data and annotations. In parallel, there has also been work to build music foundation models that learn more general representations of music which can be transferred to downstream tasks [6, 7]. However, these representations still struggle to capture the level of granularity needed for low-level tasks like MPE. An alternative approach is to define task-specific self-supervised objectives that encourage a model to respect properties of pitch, such as equivariance to pitch shifting and invariance to timbral transformations [8, 9]. These techniques have demonstrated remarkable success in learning to estimate pitch from unlabeled data and have also been generalized to polyphonic data [10].
In this work, we expand upon, refine, and integrate the techniques proposed in [10] into a supervised MPE framework resembling that of recent methods employing a convolutional neural network (CNN) [11–14] trained to estimate a multi-pitch salience-gram for an input spectrogram. We show these self-supervised objectives can significantly improve the performance of the supervised framework under a joint training paradigm. However, when attempting to apply the same objectives to additional data with no corresponding supervision, we observe a surprising phenomenon: self-supervision on the additional data does not improve performance but actually steers our model toward degeneration, i.e. blank pitch salience estimates, on such data. The model still exhibits the correct behavior on the validation set of the annotated dataset as well as evaluation data following a similar distribution. We demonstrate this issue and conduct several follow-up experiments in an attempt to identify and explain the underlying problem.
2. FRAMEWORK
In this section, we describe our feature extraction module, model architecture, and training objectives. Our methodology can be viewed as the integration of self-supervised techniques for MPE [10] into a supervised framework.
2.1 Model & Features
We adopt a modified version of the fully convolutional 2D autoencoder used in the Timbre-Trap framework [14]. This model comprises four encoder and decoder blocks with dilated convolutions, residual connections, and strided or transposed convolutions for resampling features across frequency. Although Timbre-Trap was proposed as a unified framework to perform transcription and reconstruction, we adopt only the base model and discard the latent feature used to switch between modes. We also insert layer normalization after the initial convolutional layer of both the encoder and decoder, after the strided and transposed convolution in each encoder and decoder block, and after the latent space convolution. Finally, we double the number of filters in each convolutional layer.
We also replace the complex Constant-Q Transform (CQT) module used in the Timbre-Trap framework with calculation of Harmonic CQT¹ (HCQT) spectrograms [11] $X_H \in [0,1]^{6 \times K \times N}$ with $K = 440$ frequency bins starting from $f_{min} = 27.5$ Hz and 5 bins per semitone resolution. Input audio is resampled to 22,050 Hz, and $N$ is the number of frames using a hop size of 256 samples. We maintain the original set of harmonics $H = \{0.5, 1, 2, 3, 4, 5\}$. The main advantage of the HCQT is its capacity to index harmonic energy across the channel dimension. This structure is perfectly suited for convolutional layers and establishes a strong inductive bias for pitch estimation.
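As a rough illustration of the feature geometry described above, the following sketch maps HCQT bin indices to center frequencies and counts frames for an excerpt. It assumes geometrically spaced bins and that the channel for harmonic h simply scales the minimum frequency by h. The function names are hypothetical, and framing details such as padding are ignored.

```python
F_MIN = 27.5             # Hz, lowest bin of the h = 1 channel (from the text)
BINS_PER_SEMITONE = 5    # resolution stated in the text
K = 440                  # frequency bins per channel
HARMONICS = (0.5, 1, 2, 3, 4, 5)
SAMPLE_RATE = 22050      # Hz
HOP = 256                # samples


def bin_frequency(k, h=1.0):
    """Center frequency (Hz) of bin k in the channel for harmonic h (assumed spacing)."""
    bins_per_octave = 12 * BINS_PER_SEMITONE
    return h * F_MIN * 2.0 ** (k / bins_per_octave)


def num_frames(seconds):
    """Approximate frame count for an excerpt, ignoring padding conventions."""
    return int(seconds * SAMPLE_RATE / HOP)


octaves_spanned = K / (12 * BINS_PER_SEMITONE)  # 440 bins at 60 bins per octave
top_frequency = bin_frequency(K - 1)            # highest bin of the h = 1 channel
```

Under these assumptions each channel spans a little over seven octaves, and bin k of the channel for harmonic h lines up with the h-th multiple of the frequency at bin k of the h = 1 channel, which is what lets a convolution see harmonically related energy along the channel axis.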
One consequence of our model, denoted by $F(\cdot)$, is the resulting shared dimensionality between $X_H$ and predictions $\hat{Y} = F(X_H)$, which ideally represent multi-pitch salience-grams. This makes it convenient to formulate the self-supervised techniques proposed in Sec. 2.3. Note that this configuration of model and features is very similar to what was used in the SS-MPE framework [10].
2.2 Supervised Training
Given the ground-truth pitch activations $Y \in [0,1]^{K \times N}$ corresponding to $X_H$, a supervised loss can be defined as
$$L_{spv} = \frac{1}{N} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} B\left(\hat{Y}[k, n], \tilde{Y}[k, n]\right), \tag{1}$$
where $B(\cdot,\cdot)$ represents binary cross-entropy (BCE) loss and $\tilde{Y}$ represents the target multi-pitch salience-gram for $X_H$. Following [11], we blur each frame of $Y$ using a Gaussian kernel with $\sigma = \frac{1}{5}$ semitones (1 bin) to obtain $\tilde{Y}$. Minimization of $L_{spv}$ represents the classic training objective for supervised MPE and is used as the primary training signal within our framework.
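The supervised objective can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the Gaussian taps are left unnormalized so an isolated peak keeps the value 1, the blurred target is clipped to [0, 1], and arrays are stored frame-major (`[N][K]`) rather than as `[k, n]`. None of these details are specified in the text, and all function names are hypothetical.

```python
import math


def gaussian_kernel(sigma_bins=1.0, radius=3):
    """Gaussian taps with peak value 1 (normalization is an assumption)."""
    return [math.exp(-0.5 * (d / sigma_bins) ** 2) for d in range(-radius, radius + 1)]


def blur_frame(frame, kernel):
    """Blur one frame (a list over frequency bins) and clip the result to [0, 1]."""
    radius = len(kernel) // 2
    out = []
    for k in range(len(frame)):
        acc = 0.0
        for d, w in enumerate(kernel, start=-radius):
            if 0 <= k + d < len(frame):
                acc += w * frame[k + d]
        out.append(min(acc, 1.0))
    return out


def bce(p, q, eps=1e-7):
    """Binary cross-entropy between prediction p and (soft) target q."""
    p = min(max(p, eps), 1.0 - eps)
    return -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))


def supervised_loss(pred, target):
    """Eq. (1): sum BCE over bins, then average over frames ([N][K] lists)."""
    total = sum(bce(p, q) for pf, tf in zip(pred, target) for p, q in zip(pf, tf))
    return total / len(pred)
```

Because the targets are soft, the loss at a perfect match equals the summed binary entropy of the target rather than zero, so only differences between loss values are meaningful.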
¹ As in [10], a variable Q-factor [15] is employed for improved computational efficiency and increased time resolution at lower frequencies.
2.3 Self-Supervised Techniques
2.3.1 Invariance & Equivariance Objectives
We further define two classes of self-supervised objectives, adapted from [10], based on pitch-invariant and pitch-equivariant properties. Under our framework, these objectives are meant to encourage the model to implicitly encode various properties of pitch. A pitch-invariant transformation $t_{iv}(\cdot)$ performs some manipulation of $X_H$ that ideally should not affect the predicted multi-pitch salience-gram $\hat{Y}$. These transformations can be used to formulate invariance-based losses of the form
$$L_{iv} = \frac{1}{N} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} B\left(F(t_{iv}(X_H))[k, n], \hat{Y}[k, n]\right). \tag{2}$$
While the relative strength of energy at harmonic frequencies is primarily what influences timbre, pitch is determined by an actual or implied F0. As such, we simulate pitch-invariant timbral transformations $t_{iv-t}$ by applying random parabolic equalization curves $u[k] = 1 - 2\alpha(k - \beta)^2$ [16], where $\beta \in [0, K-1]$ and $\alpha \in [0, \frac{1}{(K-1)^2}]$ are sampled uniformly, to each frame and channel of $X_H$, to define a timbre-invariance loss $L_{iv-t}$. Similarly, there is no discernible pitch associated with non-harmonic sounds, which nonetheless make up an important aspect of music (i.e., percussion). As such, we create musically-relevant pitch-invariant transformations $t_{iv-p}$ by randomly sampling and superimposing percussive audio from the Expanded Groove MIDI Dataset (E-GMD) [17] (at the waveform-level) with volume in $[0, 1]$, sampled uniformly, to define a percussion-invariance loss $L_{iv-p}$.
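The parabolic equalization curve can be sampled as in the sketch below. It follows the sampling ranges given in the text. The clipping of the equalized frame to [0, 1] and the function names are assumptions made for illustration, and whether one curve is shared across frames and channels is not stated in the text.

```python
import random


def sample_eq_curve(num_bins, rng):
    """Sample u[k] = 1 - 2 * alpha * (k - beta)^2 with the ranges stated in the text."""
    beta = rng.uniform(0, num_bins - 1)
    alpha = rng.uniform(0, 1.0 / (num_bins - 1) ** 2)
    return [1.0 - 2.0 * alpha * (k - beta) ** 2 for k in range(num_bins)]


def apply_eq(frame, curve):
    """Scale one frame of one HCQT channel, clipping to [0, 1] (clipping is assumed)."""
    return [min(max(x * u, 0.0), 1.0) for x, u in zip(frame, curve)]


rng = random.Random(0)                 # fixed seed only to make the sketch repeatable
curve = sample_eq_curve(440, rng)
equalized = apply_eq([0.5] * 440, curve)
```

With these ranges the curve peaks at 1 at bin beta and can fall as low as -1 at the far edge, which is why some form of clipping or rectification is assumed here.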
Conversely, a pitch-equivariant transformation $t_{ev}(\cdot)$ performs some manipulation of $X_H$ that ideally should correspond to a parallel manipulation of the predicted multi-pitch salience-gram $\hat{Y}$. These transformations can be used to formulate equivariance-based losses of the form
$$L_{ev} = \frac{1}{N} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} B\left(F(t_{ev}(X_H))[k, n], t_{ev}(\hat{Y})[k, n]\right). \tag{3}$$
The HCQT spectrograms fed into our model and the corresponding expected multi-pitch salience-grams have an equivariant relationship to various geometric transformations. These include vertical translations, which correspond to a pitch shift of $\frac{\Delta k}{5}$ semitones, horizontal translations, which correspond to a time delay of $\frac{4 \Delta n}{N}$ seconds, and horizontal stretching, which corresponds to a speed-up by a factor of $\gamma$. We perform random pitch-equivariant transformations $t_{ev-g}$ with uniformly sampled $\Delta k \in [-b_{oct}, b_{oct}]$ bins, $\Delta n \in [-\frac{N}{4}, \frac{N}{4}]$ frames, and $\gamma \in [0.5, 2]$² to define a geometric-equivariance loss $L_{ev-g}$.
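The translations above can be illustrated with a small sketch that shifts a `[K][N]` array and converts the shift sizes to musical units. Zero-filling at the edges is an assumption, horizontal stretching is omitted for brevity, and all function names are hypothetical. In the equivariance loss, the same shift would be applied to every HCQT channel of the input and to the prediction it is compared against.

```python
def translate(gram, dk=0, dn=0, fill=0.0):
    """Shift a [K][N] array by dk bins (toward higher pitch) and dn frames (later)."""
    num_bins, num_frames = len(gram), len(gram[0])
    out = [[fill] * num_frames for _ in range(num_bins)]
    for k in range(num_bins):
        for n in range(num_frames):
            ks, ns = k - dk, n - dn          # source position for output cell (k, n)
            if 0 <= ks < num_bins and 0 <= ns < num_frames:
                out[k][n] = gram[ks][ns]
    return out


def shift_in_semitones(dk, bins_per_semitone=5):
    """Pitch shift corresponding to a vertical translation of dk bins."""
    return dk / bins_per_semitone


def delay_in_seconds(dn, num_frames, excerpt_seconds=4.0):
    """Time delay corresponding to a horizontal translation of dn frames."""
    return excerpt_seconds * dn / num_frames


gram = [[0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
shifted = translate(gram, dk=1, dn=1)
```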
While all of these invariance and equivariance properties can be learned implicitly to some degree through a supervised objective or fully convolutional inductive bias, an explicit training signal can lead to less overfitting. Moreover, these techniques can broaden the training data and introduce previously unseen elements such as percussion.
² Sampled uniformly from [0.5, 1] and [1, 2] in equal proportion.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
2.3.2 Energy-Based Stimulus
Objectives based on the losses from Sec. 2.3.1 can be vulnerable to trivial solutions, e.g. uniformly inactive or active multi-pitch salience-grams. One way to protect against such degeneration is to inject some sort of energy-based stimulus. The ground-truth and its derivative $\tilde{Y}$ are one form of stimulus that can prevent collapse, but of course they are not always available. In lieu of ground-truth, a loss leveraging energy-based targets can be formulated as
$$L_{eg} = \frac{1}{N} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} B\left(\hat{Y}[k, n], \tilde{X}[k, n]\right), \tag{4}$$
where $\tilde{X}^{(lin)} = \sum_{h=1}^{5} \frac{1}{h^4} X_h^{(lin)}$ represents a weighted average of $X_H$ across harmonic channels, computed in linear-scale and converted to decibel-scale. Note that (4) is a simplification of the harmonic and support loss used originally in [14]. While this loss will protect against trivial solutions, the target $\tilde{X}[k, n]$ by nature is quite coarse and contains many false alarms (see [14]). A simple improvement is to induce sparsity through another loss:
$$L_{spr} = \frac{1}{N} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} \hat{Y}[k, n]. \tag{5}$$
In practice, $L_{eg}$ and $L_{spr}$, if computed at all, are coupled and computed only for predictions without ground-truth.
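To make the role of the two energy-based terms concrete, the sketch below evaluates both on a toy frame. The energy target is taken as given (its computation from the HCQT is not reproduced here), arrays are frame-major (`[N][K]`), and the function names are hypothetical.

```python
import math


def bce(p, q, eps=1e-7):
    """Binary cross-entropy between prediction p and (soft) target q."""
    p = min(max(p, eps), 1.0 - eps)
    return -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))


def energy_loss(pred, energy_target):
    """Eq. (4): BCE against an energy-based target, averaged over frames."""
    total = sum(bce(p, q) for pf, tf in zip(pred, energy_target) for p, q in zip(pf, tf))
    return total / len(pred)


def sparsity_loss(pred):
    """Eq. (5): total predicted activation, averaged over frames."""
    return sum(sum(frame) for frame in pred) / len(pred)


energy_target = [[0.0, 0.8, 0.1]]   # toy single-frame target, values made up
blank = [[0.0, 0.0, 0.0]]           # degenerate (blank) prediction
```

In this toy case a blank prediction minimizes the sparsity term but is heavily penalized by the energy term, which is the sense in which the energy-based target acts as a stimulus against collapse.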
3. EXPERIMENTS
In this section, we detail our experimental setup and our initial investigation into the joint training paradigm.
3.1 Training & Evaluation Details
We train and validate the model in each experiment on URMP [18] following the splits proposed in [3]. Training is conducted on batches of 4 second excerpts using AdamW optimizer [19] with batch size 8 and learning rate 0.0005 for the encoder and 0.00025 for the decoder. Only one excerpt per track is sampled over the course of each epoch. In experiments with self-supervision on additional data, the batch size is expanded to accommodate extra samples without reducing the amount of supervision. The supervised objective (1) is computed and averaged across supervised samples, whereas the self-supervised objectives (2-5) are computed and averaged across all samples within each batch. Since (1-4) are formulated using BCE, they all operate on roughly the same numerical scale. Learning rate warmup is applied over the first 100 epochs of training, and gradient clipping with an L2-norm of 1.0 is applied to improve training stability. The final model for each experiment is chosen as the checkpoint with the maximum F1-score on the validation set across 2500 epochs.
We evaluate on several MPE and AMT datasets, including Bach10 [20], Su [21], TRIOS [22], the ten-track test set of MusicNet [23], and GuitarSet [24]. We utilize the community-standard mir_eval package [25] to compute precision (P), recall (R), and F1-score (F1). Multi-pitch estimates are generated by performing local peak-picking on the output multi-pitch salience-grams and thresholding at 0.5. The final results are computed by averaging across all tracks within an individual dataset.
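The post-processing step can be sketched for a single frame as follows. The exact peak-picking rule is not given in the text, so the strict-left, non-strict-right comparison, the exclusion of the two edge bins, and the inclusive threshold are assumptions, as are the function names.

```python
def pick_pitches(frame, threshold=0.5):
    """Return bin indices that are local maxima of `frame` and reach `threshold`."""
    peaks = []
    for k in range(1, len(frame) - 1):
        is_peak = frame[k] > frame[k - 1] and frame[k] >= frame[k + 1]
        if is_peak and frame[k] >= threshold:
            peaks.append(k)
    return peaks


def bin_to_hz(k, f_min=27.5, bins_per_semitone=5):
    """Convert a salience-gram bin index to a frequency, assuming geometric spacing."""
    return f_min * 2.0 ** (k / (12 * bins_per_semitone))


frame = [0.1, 0.6, 0.9, 0.7, 0.2, 0.4, 0.45, 0.3]
estimates_hz = [bin_to_hz(k) for k in pick_pitches(frame)]
```

Here bin 2 is the only estimate: bins 1 and 3 exceed the threshold but are not local maxima, and the local maximum at bin 6 falls below it.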
3.2 Baselines
We compare results of our experiments to several supervised CNN-based approaches to MPE.

Deep-Salience [11] feeds an HCQT spectrogram with harmonics $H$ and 5 bins per semitone into several convolutional layers to produce a multi-pitch salience-gram. It is functionally similar to our framework under a supervised-only setting. The model was trained on a private subset of multitrack mixtures (including percussion) from MedleyDB [26].

Basic-Pitch [12] is similar to Deep-Salience, performing MPE at 3 bins per semitone, but employs a more shallow network and further estimates pitch and onset activations at 1 bin per semitone to generate note predictions. The input to the model is an approximation of the HCQT.³ Data augmentation techniques including additive noise, equalization, and reverb simulation are also utilized. The model was trained on portions of several medium-to-large-sized datasets, including GuitarSet [24].

PUnet:XL [13] is a method drawing inspiration from the idea of pre-stacking AMT models with a U-Net [27]. It processes fixed-length windows of a 6-octave HCQT with harmonics $H$ and 3 bins per semitone, and makes predictions at 1 bin per semitone. The model is trained on MusicNet [23] and incorporates an auxiliary task of degree-of-polyphony estimation at the latent layer and data augmentation techniques including transposition (similar to our geometric-equivariance objective), tuning manipulation, additive noise, and equalization following [16].

Timbre-Trap [14] is a 2D autoencoder designed to perform MPE and audio synthesis jointly based on a simple conditioning mechanism at the latent space. The backbone architecture is nearly identical to that of our framework, barring the modifications noted in Sec. 2.1. However, in the original framework the model received both the real and imaginary part (as separate channels) of an invertible complex CQT [28] as input. Timbre-Trap was trained on URMP [18] following the same splits used in this work.

We follow the same post-processing steps described in Sec. 3.1 to evaluate the baseline models,⁴ and provide the results at the top of Table 1.
3.3 Joint Training Paradigm
We first conduct an initial set of experiments evaluating the invariance- and equivariance-based self-supervised objectives under closed training conditions on URMP [18]. In particular, we experiment with the supervised objective in isolation, the supervised objective with each invariance- and equivariance-based objective independently, and all of these objectives together: $L_{total} = L_{spv} + L_{iv-t} + L_{iv-p} + L_{ev-g}$. The results are given at the bottom of Table 1. The first thing to note is that overall our framework under the supervised-only setting achieves results comparable to or better than each of the baselines for several datasets. This is especially true with respect to Timbre-Trap [14], which is arguably the most comparable due to its similar architecture and identical training data. However, PUnet:XL [13] appears to offer a significant advantage for AMT datasets (i.e., Su, TRIOS, and MusicNet) since it was trained on such data to predict pitch directly at the note-level.

Next, we can observe that each self-supervised objective has the potential to improve performance on one or more datasets. The timbre-invariance objective is least effective and has a mixed effect across datasets, which is somewhat contrary to what has been observed under the fully self-supervised context [10]. It is possible that the property of timbre-invariance is already strongly indicated by the supervised objective. The percussion-invariance objective is moderately beneficial, even for some datasets without percussion. Out of all the evaluation datasets, there is only one track in TRIOS [22] that has percussion. However, even non-percussive audio can have percussive elements, i.e. originating from playing notes on certain instruments. The geometric-equivariance objective is most effective and yields a significant improvement across all datasets. This is likely due to the model capturing harmonic relationships explicitly and efficiently by leveraging the shift-invariance exhibited by the HCQT representation.

Combining the supervised objective with all invariance- and equivariance-based objectives produces the best performance, suggesting that each contributes distinctly to robustness. Given that our framework was trained with such a small amount of audio (i.e., 1-2 hours), these results are quite remarkable, especially when considering performance on data unseen during training (i.e., GuitarSet).

³ Note that Basic-Pitch utilizes 7 harmonics and a sub-harmonic.
⁴ Adopting the original hyperparameters, we threshold Deep-Salience and Basic-Pitch at 0.3, and PUnet:XL (without peak-picking) at 0.4.

Table 1. Comparison of precision (P), recall (R), and F1-score (F1) (in percentage points) over several MPE and AMT datasets for baseline methods* and experiments conducted using the proposed framework with self-supervised objectives.

Method               Bach10              Su                  TRIOS               MusicNet            GuitarSet
                     P     R     F1      P     R     F1      P     R     F1      P     R     F1      P     R     F1
Deep-Salience [11]   86.0  61.0  71.3    74.1  47.9  57.1    92.3  39.4  54.2    63.0  48.1  53.3    77.7  70.6  72.2
Basic-Pitch [12]     90.2  75.7  82.2    54.2  44.1  47.5    88.2  44.2  57.9    50.6  42.1  45.7    80.9  75.9  77.7
PUnet:XL [13]        88.2  77.6  82.5    76.2  69.8  71.8    89.6  51.3  64.8    77.6  67.6  72.0    74.7  55.2  62.4
Timbre-Trap [14]     81.2  84.2  82.6    52.1  53.0  51.4    69.4  49.7  56.8    44.1  57.3  48.7    48.6  75.6  58.0
L_spv                88.8  83.1  85.8    61.0  47.7  52.1    82.7  48.0  59.4    53.8  54.7  53.6    70.2  72.9  69.8
L_spv + L_iv-t       88.5  81.7  84.9    62.5  47.1  52.3    84.4  46.4  58.8    54.9  54.3  53.9    75.4  70.1  70.6
L_spv + L_iv-p       88.0  84.1  85.9    57.3  52.5  53.4    81.0  50.1  60.6    50.4  58.9  53.8    70.9  78.0  73.2
L_spv + L_ev-g       91.5  88.6  90.0    64.9  62.9  62.9    90.7  57.9  69.8    58.0  62.9  59.8    79.7  80.4  79.3
L_total (Ref.)       92.1  88.0  90.0    65.0  65.0  64.1    91.0  58.4  70.2    55.8  65.1  59.6    80.5  82.3  80.9

*Grayed values indicate a portion of the test data was used for training.
3.4 Self-Supervision on Additional Data
Given the success with integrating self-supervised objectives into our supervised framework, it is reasonable to question whether the same objectives could be applied to more general music datasets lacking multi-pitch annotations. Indeed, training a model to maintain pitch-invariant and pitch-equivariant properties over a broader collection of data could be one potential way to circumvent issues with low data availability for MPE. In this vein, we conduct additional experiments under the joint training paradigm where additional data is included in each batch for self-supervision only. Specifically, we repeat the experiment combining all objectives (denoted Ref.), but with an additional 16 samples per batch which only influence the invariance- and equivariance-based losses. The datasets we use for additional samples represent different music domains, i.e., simple synthetic monophonic data (NSynth [29]), recordings of classical music mixtures (MusicNet [23]), and high-quality production-level audio (FMA [30]). We extract samples from the training splits of NSynth and MusicNet, and the large (30-second clip) variant of FMA.
Surprisingly, these experiments all exhibit undesirable behavior: performance for URMP [18] remains stable and consistent with experiments from Sec. 3.3, but performance for the other datasets collapses. The performance at validation checkpoints over the course of training is presented in Fig. 1. Upon closer inspection, we found that the model predictions degenerate to a trivial solution (blank predictions) for all datasets except for URMP. As such, we repeated each experiment with the energy-based objectives (+EG) from Sec. 2.3.2 on the data used for self-supervision. Although this does prevent collapse, it ultimately still leads to degraded performance. Finally, we experiment with initializing the model with the weights from the best validation checkpoint of Ref. and fine-tuning with 1/5 the learning rate (-FT). However, the two-stage fine-tuning paradigm still exhibits the same behavior.
4. DISCUSSION
In this section, we investigate the phenomenon uncovered in Sec. 3.4 further and conduct several follow-up experiments in an effort to identify the underlying problem.
4.1 Overfitting & Degeneration
In order to illustrate and characterize the problem of degeneration, Fig. 2 shows predictions for a single sample from Ref. along with predictions for the same sample at 25%, 50%, and 75% of the full duration of fine-tuning on FMA. It can clearly be seen that the strength of predictions (i.e., recall) decreases over time, indicating that there is a trend towards a trivial solution. This also suggests that for the regular experiments without fine-tuning, the model is always struggling to move past the trivial solution.

[Figure 1: six panels (URMP, Bach10, Su, TRIOS, MusicNet, GuitarSet); x-axis: # Batches (0 to 10000); y-axis: F1-Score (0.0 to 1.0); curves: Ref., +NS-16, +NS-16+EG, +MN-16, +MN-16+EG, +MN-16-FT, +FMA-16, +FMA-16+EG, +FMA-16-FT]
Figure 1. Performance over the course of training for experiments leveraging an additional 16 samples per batch from NSynth [29], MusicNet [23], and FMA [30] for self-supervision. EG - Energy-based objectives. FT - Fine-tuning scheme.

[Figure 2: five panels (Ground-Truth, Ref., +FMA-16-FT@2500, +FMA-16-FT@5000, +FMA-16-FT@7500)]
Figure 2. Ground-truth, baseline, and intermediate predictions at 25%, 50%, and 75% of the duration of the experiment with additional self-supervision on FMA [30] under fine-tuning scheme for track 01-AchGottundHerr of Bach10 [20].

Next, to examine whether the role of self-supervised learning was too extreme, we repeat the MusicNet [23] experiments with an exponentially decreasing amount of samples for self-supervision only (i.e., with 8, 4, and 2 additional samples). The performance of these experiments at validation checkpoints is plotted in Fig. 3. Interestingly, there is a noticeable relationship between the amount of samples for self-supervision only and the severity of degeneration. Moreover, the degeneration on MusicNet [23] itself is quite prominent and also exhibits this relationship.
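One way to read the sample-count experiment is through the batch averaging described in Sec. 3.1: with 8 annotated samples per batch and the self-supervised losses averaged uniformly over all samples, the annotated data contributes a shrinking share of the self-supervised average as extra samples are added. The sketch below only works out this arithmetic; the function name is hypothetical and uniform per-sample weighting is assumed.

```python
from fractions import Fraction

SUPERVISED_PER_BATCH = 8  # batch size stated in Sec. 3.1


def annotated_share(extra_samples, supervised=SUPERVISED_PER_BATCH):
    """Share of the self-supervised batch average contributed by annotated samples."""
    return Fraction(supervised, supervised + extra_samples)


# Extra-sample counts used across the experiments (0 corresponds to Ref.).
shares = {m: annotated_share(m) for m in (0, 2, 4, 8, 16)}
for m, share in shares.items():
    print(f"+{m:>2} extra samples -> annotated share {share}")
```

The share drops from 1 with no extra samples to 4/5, 2/3, 1/2, and 1/3 as 2, 4, 8, and 16 samples are added. This arithmetic does not by itself explain the degeneration, but it quantifies how the balance of the self-supervised signal shifts across the settings compared in Fig. 3.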
4.2 Training Distributions
In order to see whether there is an issue regarding mismatch between the distribution of the data used for supervision and that of the data used for self-supervision only, we further extract 10 samples from the URMP [18] training set spanning multiple degrees of polyphony. We denote this split as URMP-T2 and the remainder as URMP-T1. We then re-run the reference experiment using only URMP-T1 for training, and then again with additional self-supervised learning and no corresponding supervision on URMP-T2. For completeness, we further run this experiment applying the energy-based objectives to URMP-T2. The performance of these experiments is plotted in Fig. 4.
Relative to Ref., performance barely decreases when only URMP-T1 is used for training, which is quite interesting by itself. More importantly, when conducting additional self-supervision on the 10 samples of URMP-T2, which follow a very similar distribution to URMP-T1, there is still a moderate decrease in performance. The only difference between these two experiments is that each of the self-supervised losses is averaged over the samples from both URMP-T1 and URMP-T2 for each batch instead of only URMP-T1. It is worth noting that the collapse here happens to be less extreme than when training with self-supervision on other datasets with more samples (see Fig. 3). Furthermore, the degradation on URMP and Bach10 is actually more extreme than what we observed when training on NSynth (+NS-16), but we note that this could also be explained by less supervision on URMP. Another interesting observation is that the energy-based objectives actually further degrade performance, likely due to conflicting with the supervised objective on the same distribution.
4.3 Mitigating Degeneration
Given all of our observations, it appears the underlying issue is too strong of a pull towards the trivial solution for the non-supervised data, irrespective of its distribution. As it stands, self-supervision without corresponding supervision essentially pushes the model to degenerate on data following a certain distribution. If the distribution of data used for self-supervision is close to that of the supervised data, it will by extension hurt performance on the supervised data. If the distribution of data used for self-supervision is distinct from that of the supervised data, it will have less of an effect on the performance for data outside the distribution. It is unclear whether these interactions would persist if larger and more diverse data were used for supervision.

[Figure 3: six panels (URMP, Bach10, Su, TRIOS, MusicNet, GuitarSet); x-axis: # Batches (0 to 10000); y-axis: F1-Score (0.0 to 1.0); curves: Ref., +MN-16, +MN-8, +MN-4, +MN-2, +MN-16+EG, +MN-16-FT]
Figure 3. Performance over the course of training for experiments varying amount of self-supervision on MusicNet [23].

[Figure 4: six panels (URMP, Bach10, Su, TRIOS, MusicNet, GuitarSet); x-axis: # Batches (0 to 7000); y-axis: F1-Score (0.0 to 1.0); curves: Ref., T1, T1/T2, T1/T2+EG]
Figure 4. Performance over the course of training for experiments which use a reduced portion (T1) of our URMP [18] training set for supervised training with additional self-supervision on the remaining 10 non-overlapping samples (T2).
Degeneration also seems to be unique to MPE, since self-supervised methods for monophonic pitch estimation [8, 9] rely on the inductive bias of monophony and formulate their objectives using categorical cross-entropy. A solution for the polyphonic setting may require some sort of objective that enforces the existence of content in the predictions, to protect against the trivial solution. The energy-based objectives are one such protection, but they evidently remove too much flexibility and lead to worse predictions. Despite these current challenges, we still believe that self-supervised learning holds promise for advancing MPE.
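The pull toward the trivial solution can be made explicit with a short worked argument. It assumes the standard definition of binary cross-entropy with the convention $0 \log 0 = 0$, and that the transformations map an all-zero array to an all-zero array (true for translation with zero-filling and for stretching). It is a sketch of why blank outputs are attractive to the invariance and equivariance losses, not a claim about the exact training dynamics.

```latex
% Binary cross-entropy between a prediction p and a (possibly soft) target q:
B(p, q) = -q \log p - (1 - q) \log (1 - p), \qquad p, q \in [0, 1].

% For a fixed target q it is minimized at p = q, where it equals the binary
% entropy of q, so it is never negative:
B(p, q) \;\ge\; B(q, q) \;\ge\; 0 .

% Blank outputs: suppose F(X_H) = 0 and F(t(X_H)) = 0 at every bin and frame,
% so that \hat{Y} = 0 and t_{ev}(\hat{Y}) = 0. Every term of Eqs. (2) and (3)
% is then
B(0, 0) = 0 \quad\Longrightarrow\quad L_{iv} = L_{ev} = 0 .

% The same holds for uniformly active outputs, since B(1, 1) = 0.
```

Since no prediction can push these losses below zero, blank (or fully active) salience-grams are global minimizers of the self-supervised terms on data without ground-truth, and only an additional term such as the supervised or energy-based loss penalizes them.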
5. CONCLUSION
We have demonstrated that self-supervised objectives can substantially improve upon the standard supervised training paradigm for MPE. However, in attempting to extend self-supervised learning beyond the distribution of data that is already grounded with supervised learning, we encounter issues whereby our model simultaneously overfits to the distribution of the supervised training data while degenerating on the distribution of the self-supervised training data. Self-supervised objectives utilizing energy-based targets can protect against degeneration, but these are too inflexible. Fine-tuning cannot circumvent the problem either. Moreover, we show that degeneration persists even when the supervised and self-supervised training data are taken from the same distribution. We conclude with several remarks and ideas toward overcoming highlighted issues.
6. ACKNOWLEDGMENTS
This work is supported by National Science Foundation (NSF) grant No. 2222129 and synergistic activities funded by NSF grant DGE-1922591.
7. REFERENCES
[1] M. Müller, Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer, 2015.
[2] E. Benetos, S. Dixon, Z. Duan, and S. Ewert, “Automatic music transcription: An overview,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20–30, 2019.
[3] J. Gardner, I. Simon, E. Manilow, C. Hawthorne, and J. Engel, “MT3: Multi-task multitrack music transcription,” in Proceedings of ICLR, 2021.
[4] B. Maman and A. H. Bermano, “Unaligned supervision for automatic music transcription in the wild,” in Proceedings of ICML, 2022.
[5] I. Simon, J. Gardner, C. Hawthorne, E. Manilow, and J. Engel, “Scaling polyphonic transcription with mixtures of monophonic transcriptions,” in Proceedings of ISMIR, 2022.
[6] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Lin, A. Ragni, E. Benetos, N. Gyenge et al., “MERT: Acoustic music understanding model with large-scale self-supervised training,” in Proceedings of ICLR, 2024.
[7] W. Liao, Y. Takida, Y. Ikemiya, Z. Zhong, C.-H. Lai, G. Fabbro, K. Shimada, K. Toyama, K. Cheuk, M. A. Martínez-Ramírez et al., “Music foundation model as generic booster for music downstream tasks,” arXiv preprint arXiv:2411.01135, 2024.
[8] B. Gfeller, C. Frank, D. Roblek, M. Sharifi, M. Tagliasacchi, and M. Velimirović, “SPICE: Self-supervised pitch estimation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 28, pp. 1118–1128, 2020.
[9] A. Riou, S. Lattner, G. Hadjeres, and G. Peeters, “PESTO: Pitch estimation with self-supervised transposition-equivariant objective,” in Proceedings of ISMIR, 2023.
[10] F. Cwitkowitz and Z. Duan, “Toward fully self-supervised multi-pitch estimation,” arXiv preprint arXiv:2402.15569, 2024.
[11] R. M. Bittner, B. McFee, J. Salamon, P. Li, and J. P. Bello, “Deep salience representations for F0 estimation in polyphonic music,” in Proceedings of ISMIR, 2017.
[12] R. M. Bittner, J. J. Bosch, D. Rubinstein, G. Meseguer-Brocal, and S. Ewert, “A lightweight instrument-agnostic model for polyphonic note transcription and multipitch estimation,” in Proceedings of ICASSP, 2022.
[13] C. Weiß and G. Peeters, “Comparing deep models and evaluation strategies for multi-pitch estimation in music recordings,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 30, pp. 2814–2827, 2022.
[14] F. Cwitkowitz, K. W. Cheuk, W. Choi, M. A. Martínez-Ramírez, K. Toyama, W.-H. Liao, and Y. Mitsufuji, “Timbre-Trap: A low-resource framework for instrument-agnostic music transcription,” in Proceedings of ICASSP, 2024.
[15] C. Schörkhuber, A. Klapuri, N. Holighaus, and M. Dörfler, “A matlab toolbox for efficient perfect reconstruction time-frequency transforms with log-frequency resolution,” in Proceedings of AES, 2014.
[16] J. Abeßer and M. Müller, “Jazz bass transcription using a U-net architecture,” Electronics, vol. 10, no. 6, p. 670, 2021.
[17] L. Callender, C. Hawthorne, and J. Engel, “Improving perceptual quality of drum transcription with the expanded groove MIDI dataset,” arXiv preprint arXiv:2004.00188, 2020.
[18] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, “Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications,” IEEE Transactions on Multimedia, vol. 21, pp. 522–535, 2018.
[19] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proceedings of ICLR, 2019.
[20] Z. Duan, B. Pardo, and C. Zhang, “Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions,” IEEE Transactions on Audio, Speech, and Language Processing (TASLP), vol. 18, no. 8, pp. 2121–2133, 2010.
[21] L. Su and Y.-H. Yang, “Escaping from the abyss of manual annotation: New methodology of building polyphonic datasets for automatic music transcription,” in Proceedings of CMMR, 2015.
[22] J. Fritsch, “High quality musical audio source separation,” Master’s thesis, UPMC / IRCAM / Telécom ParisTech, 2012.
[23] J. Thickstun, Z. Harchaoui, and S. Kakade, “Learning features of music from scratch,” in Proceedings of ICLR, 2017.
[24] Q. Xi, R. M. Bittner, J. Pauwels, X. Ye, and J. P. Bello, “GuitarSet: A dataset for guitar transcription,” in Proceedings of ISMIR, 2018.
[25] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. Ellis, “mir_eval: A transparent implementation of common MIR metrics,” in Proceedings of ISMIR, 2014.
[26] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, “MedleyDB: A multitrack dataset for annotation-intensive MIR research,” in Proceedings of ISMIR, 2014.
[27] F. Pedersoli, G. Tzanetakis, and K. M. Yi, “Improving music transcription by pre-stacking a U-Net,” in Proceedings of ICASSP, 2020.
[28] N. Holighaus, M. Dörfler, G. A. Velasco, and T. Grill, “A framework for invertible, real-time constant-Q transforms,” IEEE Transactions on Audio, Speech, and Language Processing (TASLP), vol. 21, pp. 775–785, 2012.
[29] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in Proceedings of ICML, 2017.
[30] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” in Proceedings of ISMIR, 2017.