Improving Neural Pitch Estimation With SWIPE Kernels

Author: David Marttila; Joshua D. Reiss
Publisher: Zenodo
DOI: 10.5281/zenodo.17706561
Source: https://zenodo.org/records/17706561/files/000080.pdf
IMPROVING NEURAL PITCH ESTIMATION WITH SWIPE KERNELS
David Marttila, Joshua D. Reiss
Centre for Digital Music, Queen Mary University of London
ABSTRACT
Neural networks have become the dominant technique for accurate pitch and periodicity estimation. Although a lot of research has gone into improving network architectures and training paradigms, most approaches operate directly on the raw audio waveform or on general-purpose time-frequency representations. We investigate the use of Sawtooth-Inspired Pitch Estimation (SWIPE) kernels as an audio frontend and find that these hand-crafted, task-specific features can make neural pitch estimators more accurate, more robust to noise, and more parameter-efficient. We evaluate supervised and self-supervised state-of-the-art architectures on common datasets and show that the SWIPE audio frontend allows for reducing the network size by an order of magnitude without performance degradation. Additionally, we show that the SWIPE algorithm on its own is much more accurate than commonly reported, outperforming state-of-the-art self-supervised neural pitch estimators.
1. INTRODUCTION
Pitch plays a central role in how humans perceive sound. Consequently, pitch estimation is a fundamental task in many music, speech and audio processing pipelines. While pitch is a psychoacoustic phenomenon, it closely correlates to the signal processing concept of the fundamental frequency $f_0$. Recent literature commonly uses the term “pitch estimation” to refer to the task of estimating an audio signal’s $f_0$.
Given the importance of accurate pitch estimation, the topic has received a considerable amount of research attention over the past decades. Numerous digital signal processing (DSP) techniques estimate pitch based on the cepstrum [1], the power spectrum [2–4], or the autocorrelation function [5,6].
More recently, deep neural networks have been applied to the task of pitch estimation [7–10]. In a typical architecture, a convolutional neural network (CNN) is given overlapping frames of raw audio as input and trained to predict a probability distribution over a discrete set of $f_0$ candidates in a supervised fashion. While these models can reach very high accuracy, they require a large amount of training data annotated with reliable ground truth pitch values and can struggle with out-of-domain generalization and robustness to noise and reverberation. Additionally, the CNNs usually consist of millions of parameters, making them less suitable for use in low-resource environments.

© D. Marttila and J. D. Reiss. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: D. Marttila and J. D. Reiss, “Improving Neural Pitch Estimation with SWIPE Kernels”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

These drawbacks have been addressed in two different ways. Self-supervised training paradigms [11–14] do not require labeled training data, and incorporating traditional DSP approaches into neural networks has been shown to increase efficiency and robustness [15, 16].
In this paper, we combine these two approaches and merge task-specific DSP-based features with both supervised and self-supervised training paradigms. Specifically, we substitute the audio frontend in the Pitch Estimation with Self-Supervised Transposition-Equivariant Objective (PESTO) architecture [14] for a representation obtained from the Sawtooth Waveform Inspired Pitch Estimator (SWIPE) [3], which estimates pitch by measuring the similarity of the input spectrum to that of sawtooth waves at various pitch candidates. We also investigate the use of SWIPE as a frontend for supervised neural pitch estimators, which usually operate directly on the audio waveform.
The core insights of our work are these:
• Although SWIPE is commonly used as a baseline for neural pitch estimation, we find that its performance has been significantly underreported. We show that SWIPE in its original form surpasses the accuracy of the state of the art in self-supervised neural pitch detection (PESTO).
• SWIPE is a well-suited audio frontend for neural pitch estimators in both supervised and self-supervised settings, and can improve the state of the art in terms of accuracy, robustness, efficiency, and latency.
Trained models alongside a SWIPE implementation in PyTorch [17] are available online at https://github.com/dsuedholt/neural-pitch-swipe. The remainder of this paper is structured as follows. In Section 2, we give an overview of SWIPE and neural pitch estimation methods. Section 3 covers some aspects of our SWIPE implementation choices and details how we embed SWIPE into neural pitch estimation architectures. Section 4 describes how we evaluate our approach, and Section 5 presents the results of our evaluation.
Figure 1. SWIPE and SWIPE’ kernels corresponding to a pitch candidate at 330 Hz. The SWIPE kernel contains peaks at all integer harmonics of the candidate frequency. The SWIPE’ kernel is obtained by removing the peaks at non-prime harmonics.
2. BACKGROUND
2.1 SWIPE
The Sawtooth Waveform Inspired Pitch Estimator (SWIPE) [3] estimates pitch by identifying the fundamental frequency of a sawtooth waveform whose spectrum best matches that of the input signal. To achieve this, it constructs spectral kernels for a number of discrete pitch candidates, and assigns a score to each pitch candidate by measuring the similarity between its associated kernel and the spectrum of the input signal.
2.1.1 Score Calculation
More formally, consider a (windowed) audio signal $x[n]$ of length $N$ and its discrete Fourier Transform (DFT) $X[k]$, which may be truncated to the $K = \lfloor N/2 \rfloor + 1$ bins corresponding to non-negative frequencies if $x$ is real-valued. Let now $C = \{f_1, f_2, \ldots, f_{|C|}\}$ be a set of $|C|$ pitch candidates. Then $S_c[k]$ is the spectral kernel associated with the pitch candidate $f_c$, and we can compute its SWIPE score $Z(f_c) : C \to [-1, 1]$ as the normalized inner product between $S_c$ and $X$:

$$Z(f_c) = \frac{\sum_{k=0}^{K-1} S_c[k] \cdot |X[k]|^{1/2}}{\sum_{k=0}^{K-1} |X[k]|^{1/2}} \qquad (1)$$

The pitch estimate is then given by the $f_c$ that maximizes $Z(f_c)$ and may optionally be further refined by e.g. parabolic interpolation of the local maximum.
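The score of Eqn (1) and the selection of the best candidate can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the implementation released with the paper; the function names, the toy kernels and the choice to return 0 for a silent frame are assumptions made for the example.

```python
import math

def swipe_score(kernel, magnitudes):
    """Eqn (1): inner product of a kernel with sqrt(|X[k]|), divided by sum(sqrt(|X[k]|))."""
    root_mags = [math.sqrt(m) for m in magnitudes]
    denom = sum(root_mags)
    if denom == 0.0:
        return 0.0  # silent frame (assumption: report a neutral score)
    return sum(s * r for s, r in zip(kernel, root_mags)) / denom

def best_candidate(kernels, magnitudes):
    """Return the candidate frequency with the highest score, plus all scores."""
    scores = {f_c: swipe_score(k, magnitudes) for f_c, k in kernels.items()}
    return max(scores, key=scores.get), scores

# Toy data: four spectral bins and two hypothetical kernels with values in [-1, 1].
toy_magnitudes = [0.0, 4.0, 0.0, 4.0]
toy_kernels = {
    100.0: [-1.0, 1.0, -1.0, 1.0],  # peaks line up with the spectral energy
    150.0: [1.0, -1.0, 1.0, -1.0],  # valleys line up with the spectral energy
}
estimate, toy_scores = best_candidate(toy_kernels, toy_magnitudes)
```

In the toy example the first kernel aligns its peaks with the energy and reaches the upper bound of the score range, while the second aligns its valleys with the energy and reaches the lower bound.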
While Eqn (1) computes the inner product over all bins of the DFT for notational simplicity, the original SWIPE paper suggests resampling the spectrum to the Equivalent Rectangular Bandwidth (ERB) [18] scale for speech data, or to the mel scale for musical instruments.
2.1.2 Kernel Design
The kernel $S_c$ is designed to maximize the inner product with $X$ if $x$ is a signal with fundamental frequency $f_0 = f_c$. To achieve that, it contains cosine lobes of width $f_c/2$ at all integer harmonics of $f_c$, decaying in magnitude to mimic the spectrum of a sawtooth wave. As the authors of SWIPE lay out, this corresponds precisely to the square root of the main lobes of a Hann-windowed sawtooth wave if the size of the analysis window is exactly $T = 8/f_c$. The kernel further contains negative-valued valleys at $\frac{1}{2} f_c, \frac{3}{2} f_c, \ldots$, i.e. at the midpoint between each harmonic peak.
Since all harmonics of a signal with a true fundamental frequency of $f$ also contribute to the scores of the pitch candidates at $f/2, f/3, \ldots$, a common variant of the SWIPE algorithm removes the non-prime harmonics (except for the first one) of all kernels to reduce the problem of octave errors. This variant is known as SWIPE’, but it is effective and widespread enough that it is often simply referred to as SWIPE, for example in the Speech Processing Toolkit (SPTK) [19] implementation, which is based directly on the MATLAB code published along with SWIPE. We take the same approach in this paper and will generally assume that the scores $Z$ are calculated using SWIPE’ kernels. An example of such a kernel is illustrated in Figure 1.
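To make the harmonic selection of SWIPE’ concrete, the sketch below lists which harmonic numbers keep a peak for a given candidate. It only covers the selection rule described above (first harmonic plus prime harmonics); the lobe shape and magnitude decay are left out. The function names and the 4000 Hz upper limit are hypothetical choices for the example.

```python
def is_prime(n):
    """Trial-division primality check, sufficient for small harmonic numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def swipe_prime_harmonics(f_c, f_limit):
    """Harmonic numbers that keep a peak in a SWIPE' kernel: 1 and all primes up to f_limit."""
    highest = int(f_limit // f_c)
    return [n for n in range(1, highest + 1) if n == 1 or is_prime(n)]

# Candidate at 330 Hz (as in Figure 1), with an illustrative upper limit of 4000 Hz.
kept = swipe_prime_harmonics(330.0, 4000.0)
peak_frequencies = [n * 330.0 for n in kept]
```

For the 330 Hz candidate this keeps harmonics 1, 2, 3, 5, 7 and 11, so the peaks at 4, 6, 8, 9, 10 and 12 times the candidate frequency are removed.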
2.2 Supervised Neural Pitch Estimation
The established way of using neural networks for pitch estimation is to interpret an audio signal $x[n]$ as a vector $\mathbf{x}$. A network $f_\theta$ then maps $\mathbf{x}$ to a vector $\mathbf{y} \in [0, 1]^{|C|}$, where each entry $y_c$ represents the probability that a corresponding pitch candidate $f_c$ is the pitch of $\mathbf{x}$. In supervised training, this is treated as a multi-class classification problem, calculating the loss using the cross-entropy to the ground truth pitch. The ground truth distribution may be smoothed using Gaussian blurring to aid training [7]. Voicing confidence can be deduced from the entropy of the predicted probability distribution [9].
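The supervised objective described above can be sketched as follows. This is an illustration under stated assumptions rather than the training code of any cited model: the blur width `sigma_bins`, the normalization of the blurred target to sum to one, and the small `eps` inside the logarithm are all choices made for this example.

```python
import math

def gaussian_target(true_bin, num_bins, sigma_bins=1.0):
    """Ground-truth distribution: a Gaussian bump centred on the true pitch bin."""
    weights = [math.exp(-0.5 * ((i - true_bin) / sigma_bins) ** 2) for i in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]  # normalized so it sums to 1 (assumption)

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Example: 32 pitch bins, true pitch in bin 10.
target = gaussian_target(true_bin=10, num_bins=32, sigma_bins=1.5)
uniform_prediction = [1.0 / 32] * 32
loss_uniform = cross_entropy(target, uniform_prediction)
loss_perfect = cross_entropy(target, target)
```

A prediction that equals the blurred target gives a lower loss than an uninformative uniform prediction, which is the behaviour the training objective relies on.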
2.3 Self-Supervised Neural Pitch Estimation
Pitch Estimation with Self-Supervised Transposition-Equivariant Objective (PESTO) [14] is a state-of-the-art architecture for self-supervised training of neural network pitch estimators, where natural symmetries of the input are exploited to learn a translation-equivariant representation, instead of providing ground truth pitches to the model.
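2.3.1 Training Setup
During training, the model learns to optimize a combination of three losses:
An equivariance loss enforces that pitch-shifted versions of an input should result in output distributions that are transpositions of the output for the original input. Given an input $\mathbf{x}$ and its pitch-shifted version $\mathbf{x}^{(k)}$ (shifted by $k$ semitones), their respective outputs $\mathbf{y}$ and $\mathbf{y}^{(k)}$ should satisfy $\phi(\mathbf{y}^{(k)}) = \alpha^k \phi(\mathbf{y})$, where $\phi$ is a deterministic linear mapping:

$$\phi : \mathbb{R}^{|C|} \to \mathbb{R}, \qquad \mathbf{y} \mapsto (\alpha, \alpha^2, \ldots, \alpha^{|C|})\,\mathbf{y} \qquad (2)$$

and $\alpha$ is a hyperparameter.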
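The property that Eqn (2) is built around can be checked numerically. In the sketch below, the shift is expressed directly in output bins for simplicity, the value of `ALPHA` is an arbitrary illustrative choice, and the helper `transpose_up` assumes that the top bins of the distribution are empty so that no probability mass is lost by shifting; none of these details are taken from the PESTO implementation.

```python
def phi(y, alpha):
    """Eqn (2): dot product of y with (alpha, alpha**2, ..., alpha**len(y))."""
    return sum(alpha ** (i + 1) * p for i, p in enumerate(y))

def transpose_up(y, k):
    """Shift a distribution up by k bins; assumes the top k bins of y are empty."""
    return [0.0] * k + list(y[:len(y) - k])

ALPHA = 1.02  # hypothetical value of the hyperparameter
y = [0.0, 0.1, 0.6, 0.3, 0.0, 0.0, 0.0, 0.0]
k = 2

lhs = phi(transpose_up(y, k), ALPHA)  # phi of the transposed output
rhs = ALPHA ** k * phi(y, ALPHA)      # alpha**k times phi of the original output
```

Because every bin index increases by `k`, each term of the sum picks up a factor of `ALPHA ** k`, so `lhs` and `rhs` agree up to floating-point rounding.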
A regularization loss further ensures that the network’s outputs for pitch-shifted inputs maintain the expected transposition relationship. For a pair of outputs $\mathbf{y}$ and $\mathbf{y}^{(k)}$, the shifted cross-entropy loss

$$\mathcal{L}_{\mathrm{SCE}}(\mathbf{y}, \mathbf{y}^{(k)}, k) = \sum_{i=0}^{|C|-1} y_i \log y^{(k)}_{i+k} \qquad (3)$$

measures how well $\mathbf{y}^{(k)}$ matches the $k$-semitone shift of $\mathbf{y}$.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Finally, an invariance loss encourages the mapping $f_\theta$ to be invariant to the timbre of the signal. During training, PESTO draws random transforms $t$ from a set of pitch-preserving data augmentations $T$. Given $\tilde{\mathbf{x}} = t(\mathbf{x})$, the invariance loss is then expressed as the cross-entropy between $\mathbf{y} = f_\theta(\mathbf{x})$ and $\tilde{\mathbf{y}} = f_\theta(\tilde{\mathbf{x}})$.
2.3.2 Model Architecture
The PESTO architecture uses the constant-Q transform (CQT) of an audio frame as its input, where the bins of the CQT exactly correspond to the pitch candidates. The transforms $T$ take the form of adding random noise and gain to the CQT frames. A CNN processes the frame, and its flattened output is fed to a final linear layer followed by a softmax layer which produces a probability distribution. Importantly, the final linear layer uses a Toeplitz matrix as its weight matrix to preserve the transposition equivariance of the CNN.
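Why a Toeplitz weight matrix preserves transposition equivariance can be seen in a small example. The sketch below builds a Toeplitz matrix from a single kernel and shows that shifting the input by one bin shifts the output by one bin. The helper names and the tiny sizes are hypothetical; a Toeplitz matrix with `n_out` rows and `n_in` columns has `n_in + n_out - 1` free parameters, which is why it can be viewed as a convolution with a single filter.

```python
def toeplitz_matrix(kernel, n_in, n_out):
    """Build an n_out x n_in matrix whose entry (i, j) depends only on i - j."""
    assert len(kernel) == n_in + n_out - 1
    return [[kernel[i - j + n_in - 1] for j in range(n_in)] for i in range(n_out)]

def apply_matrix(matrix, x):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

# 4 inputs, 4 outputs, so the layer has 4 + 4 - 1 = 7 parameters.
kernel = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
weights = toeplitz_matrix(kernel, n_in=4, n_out=4)

out = apply_matrix(weights, [0.0, 1.0, 0.0, 0.0])          # peak in bin 1
out_shifted = apply_matrix(weights, [0.0, 0.0, 1.0, 0.0])  # same peak, one bin higher
```

Apart from the boundary bin, `out_shifted` is `out` moved up by one position, which is the equivariance property a dense, unconstrained weight matrix would not guarantee.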
3. METHODS
The core insight of this work is that SWIPE scores encode rich pitch information and are thus well suited as an audio frontend for neural pitch estimation in both supervised and self-supervised settings. This section first covers our implementation of SWIPE in detail, and then describes how we adapt supervised and self-supervised neural pitch estimators to work with SWIPE scores.
3.1 SWIPE Implementation
We calculate the SWIPE scores by sampling the spectrum at 1024 frequencies, which are linearly spaced on the mel scale over a range from $0.25 \cdot f_{\min}$ to $1.25 \cdot f_{\max}$. We used the Slaney-style mel scale, which is linear up to 1 kHz and logarithmic above, as implemented in the librosa toolkit [20]. For each of the various window sizes, the spectrum is calculated with the same FFT resolution (zero-padding the input as needed) and evaluated at the sampling frequencies using linear interpolation. We arrange the pitch candidates to match the CQT resolution used in PESTO: logarithmically spaced over a range of $f_{\min} = 27.5$ Hz to $f_{\max} = 8055$ Hz, using a resolution of 3 bins per semitone, for a total of 295 bins.
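The frequency sampling grid described above can be sketched without any external dependencies. The conversion below re-implements the piecewise Slaney-style mel formula (linear below 1 kHz, logarithmic above) in plain Python instead of calling librosa, so the constant and function names are choices made for this example rather than part of the released code.

```python
import math

F_SP = 200.0 / 3.0               # Hz per mel in the linear region
MIN_LOG_HZ = 1000.0              # start of the logarithmic region
MIN_LOG_MEL = MIN_LOG_HZ / F_SP  # mel value at 1 kHz (15 mel)
LOGSTEP = math.log(6.4) / 27.0   # step size of the logarithmic region

def hz_to_mel(f):
    """Slaney-style mel scale: linear below 1 kHz, logarithmic above."""
    if f < MIN_LOG_HZ:
        return f / F_SP
    return MIN_LOG_MEL + math.log(f / MIN_LOG_HZ) / LOGSTEP

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    if m < MIN_LOG_MEL:
        return m * F_SP
    return MIN_LOG_HZ * math.exp(LOGSTEP * (m - MIN_LOG_MEL))

def mel_spaced_frequencies(f_low, f_high, count):
    """Frequencies in Hz that are linearly spaced on the mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (count - 1)
    return [mel_to_hz(m_low + i * step) for i in range(count)]

F_MIN, F_MAX = 27.5, 8055.0
sample_freqs = mel_spaced_frequencies(0.25 * F_MIN, 1.25 * F_MAX, 1024)
```

The spectrum computed for each window size would then be evaluated at `sample_freqs` by linear interpolation before taking the inner product with the kernels.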
Although many papers on neural pitch estimation compare their work to SWIPE as a baseline, they generally do not cite the implementation they used or report the parameters they chose. To make sure that our implementation does not significantly underperform, we compare it to the most popular open-source implementation of SWIPE, which is contained in the Speech Processing Toolkit (SPTK) [19]. It uses 8 pitch bins per semitone by default, samples the input frequency spectrum according to the ERB [18] scale, and refines the estimate using parabolic interpolation.
Table 1 contains the Raw Pitch Accuracy (RPA) achieved by the SPTK and our implementation on the MDB-stem-synth and MIR-1K datasets and compares it to previously reported baseline values. The metrics and datasets are described in more detail in Section 4. The accuracy of the SPTK implementation seems to significantly deteriorate for large search ranges. We report the values for upper limits of 2 kHz and 8 kHz, where the lower limit is 30 Hz for both. We set the score threshold that pitch candidates need to exceed in order to be considered to 0.
Our implementation appears to be a lot more robust to its large search range (27.5–8055 Hz). Switching from ERB to mel sampling results in a notable accuracy gain on MDB-stem-synth, which contains more varied timbres. This matches the results of the original SWIPE paper.
Both the SPTK and our own implementation can perform much more accurately than the values that were previously reported as baselines in the neural pitch estimation literature suggest, to the extent that SWIPE outperforms even state-of-the-art self-supervised pitch detection models (see Section 5.2).
3.2 Supervised Neural Pitch Estimation
We experiment with using both SWIPE scores and the CQT as an audio frontend in a supervised training context. We feed the input into a CNN with 6 1D-convolutional layers, applying layer normalization and a leaky ReLU non-linearity with slope 0.3 between each layer. Zero-padding is applied to the input in each layer to preserve the input dimension. After flattening, the output of the final layer is reduced to the dimensionality of the pitch bins and fed into a Softmax layer to obtain a probability distribution. We find that using a dense linear layer to perform the dimensionality reduction strongly degraded generalization in this setup, and instead also employ a Toeplitz layer in the supervised model.
3.3 Self-Supervised Neural Pitch Estimation
The original PESTO architecture is already well suited to work with SWIPE scores, which can be directly substituted for the CQT bins without violating the assumptions on translation equivariance. Since SWIPE scores encode periodicity information much more explicitly than CQT frames, we expect the encoder network to achieve similar performance with fewer parameters. We test this hypothesis by training a PESTO-style encoder with a drastically reduced parameter count, consisting only of the final Toeplitz fully-connected layer (a convolutional layer with a single filter of size 647) and softmax normalization. In this very simple architecture, the Toeplitz layer can be seen as essentially learning a reweighting of the SWIPE scores, and as refining the location of the peak score if the output resolution is larger than the input resolution.
Initial experiments indicated that applying random data augmentation to the SWIPE scores only resulted in degraded performance compared to the baseline DSP algorithm. We additionally augment the audio frame in the time domain by adding random noise and applying a finite impulse response (FIR) filter with a randomized amplitude response. This results in an increased computational cost of training, since the SWIPE scores need to be recalculated at every training step, but does not affect the computational cost of inference once training has finished.

           Reported in [12,13]   SPTK (2kHz)   SPTK (8kHz)   Ours (ERB)   Ours (mel)
MIR-1K     86.6%                 96.5%         68.2%         95.7%        96.2%
MDB        90.7%                 94.1%         61.4%         94.0%        96.1%

Table 1. Raw Pitch Accuracy obtained by SWIPE implementations on the MIR-1K and MDB-stem-synth datasets, compared to baseline values previously reported in self-supervised pitch estimation papers. For SPTK, we report different upper limits of the search range. For our implementation, the search range is constant, but the frequency sampling scale changes.

4. EXPERIMENTAL SETUP
4.1 Datasets
Our experiments use three $f_0$-annotated datasets that are commonly used for training and benchmarking pitch detectors:
MDB-stem-synth [21] contains 230 solo tracks (418 minutes total) of instrument sounds and vocals. The audio is re-synthesized from its $f_0$ annotations, which means that the $f_0$ annotations are perfect. It is annotated with a hop size of 2.9 ms.
PTDB-TUG [22] contains 4720 audio and laryngograph recordings (576 minutes total) of 20 English speakers reading sentences. It is annotated with a hop size of 10 ms.
MIR-1K [23] contains 1000 short recordings (133 minutes total) of Chinese karaoke performances. It is annotated with a hop size of 20 ms.
4.2 Baselines
We compare all results to two DSP-based pitch detection baselines: PYIN [24] and SWIPE. We do not perform Viterbi decoding or any sort of peak refinement, simply selecting the pitch candidate with the highest score. PYIN scores are based on autocorrelation and so their natural resolution is expressed in integer samples. We resample the scores to the same pitch candidate resolution as used for SWIPE using linear interpolation.
As a baseline for supervised training, we choose FCNF0++ [9], which to the best of our knowledge is the currently best-performing supervised monophonic neural pitch detector that operates on a frame-by-frame basis, rather than processing the entire audio signal at once.
In the self-supervised setting, we compare our results against the original PESTO architecture, which is the current state of the art in self-supervised monophonic neural pitch estimation.
4.3 Evaluation Metrics
We use the mir_eval package [25] to report the following metrics:
Raw Pitch Accuracy (RPA), the percentage of voiced frames for which the model predicted a pitch within 50 cents of the ground truth.
F-Score, measuring the accuracy of the binary voiced/unvoiced decision.
Overall Accuracy (OA), the percentage of all frames (voiced and unvoiced) for which a correct voicing decision was made, and for which the model predicted a pitch within 50 cents of the ground truth if the frame is voiced.
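As an illustration of the 50-cent criterion behind RPA, the sketch below computes the metric for a handful of frames in plain Python. It is not the mir_eval implementation used for the reported numbers; marking unvoiced frames with a frequency of 0 and counting a deviation of exactly 50 cents as correct are assumptions made for this example.

```python
import math

def cents_between(f_est, f_ref):
    """Signed distance from f_ref to f_est in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(f_est / f_ref)

def raw_pitch_accuracy(ref_hz, est_hz, tolerance_cents=50.0):
    """Fraction of voiced reference frames whose estimate lies within the tolerance."""
    voiced = [(r, e) for r, e in zip(ref_hz, est_hz) if r > 0]  # 0 Hz marks unvoiced
    if not voiced:
        return 0.0
    hits = sum(1 for r, e in voiced
               if e > 0 and abs(cents_between(e, r)) <= tolerance_cents)
    return hits / len(voiced)

# Four frames: a close estimate, an estimate far off, an unvoiced frame, an exact match.
reference = [440.0, 440.0, 0.0, 220.0]
estimate = [445.0, 500.0, 100.0, 220.0]
rpa = raw_pitch_accuracy(reference, estimate)
```

Here 445 Hz is about 20 cents above 440 Hz and counts as correct, 500 Hz is more than two semitones off and does not, and the unvoiced frame is ignored, so two of the three voiced frames are hits.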
5. RESULTS AND DISCUSSION
We report separate experimental results for the supervised and self-supervised approaches, in each case closely replicating the training setup of the baselines (FCNF0++ and PESTO, respectively) to assess the impact of using SWIPE scores as an audio frontend.
5.1 Supervised Models
We refer to the two proposed supervised models (see Section 3.2) as CQT-sup and SWIPE-sup. We train our models as well as the FCNF0++ baseline on MDB-stem-synth and PTDB-TUG at the same time. While networks that take CQT or SWIPE scores as input are sample-rate agnostic, FCNF0++ operates on the raw audio waveform and was designed to work with a sampling rate of 8 kHz, so we resample its input accordingly.
In the interest of a direct comparison, we use the 70-15-15 split into training, validation and testing partitions that was published in [9]. The performance of the trained models is measured by calculating RPA, F-Score and OA on the testing set. To better measure generalization performance on unseen data, we additionally evaluate the trained models on the full MIR-1K dataset, which is not used in training.
We train the models for 500,000 steps, using a batch size of 256 and the Adam optimizer [26] with an initial learning rate of 0.0002. Table 2 shows the results of the evaluation. Overall, both CQT-sup and SWIPE-sup seem competitive with FCNF0++, but not clearly superior. They are able to almost match the in-domain RPA of FCNF0++, and outperform it in terms of voiced/unvoiced accuracy and generalization. The two proposed models use fewer trainable parameters than FCNF0++, but require a larger context window.
The CQT input features seem to be particularly well suited for making voiced/unvoiced decisions, with the CQT-sup model attaining the highest F-Score on both the test set and on MIR-1K.

                                                Test Partition             MIR-1K
Method             # Params   Window Size [ms]  RPA     F-Score  OA        RPA     F-Score  OA
PYIN               -          145               89.4%   -        -         95.4%   -        -
SWIPE              -          327               93.2%   -        -         96.2%   -        -
CQT-sup (ours)     0.9M       1871              98.1%   98.7%    99.2%     92.2%   91.2%    87.1%
SWIPE-sup (ours)   0.9M       327               97.9%   98.1%    98.6%     93.5%   90.0%    87.8%
FCNF0++ [9]        6.6M       128               98.3%   98.2%    98.7%     91.0%   87.8%    86.0%

Table 2. Evaluation results of the supervised models as measured by Raw Pitch Accuracy (RPA), the F-Score of the voiced/unvoiced decision, and Overall Accuracy (OA). We also report the sizes of the neural networks and the maximum window size required by the estimator. Models are evaluated using the combined test partition of MDB-stem-synth and PTDB-TUG published in [9], as well as on the entire MIR-1K dataset, which was not used in training.

All three models struggle with generalization, staying well behind the DSP baselines on MIR-1K. The best generalization behavior is shown by SWIPE-sup, even though it was the least accurate model on the test set.
5.2 Self-Supervised Models
We refer to the three modified PESTO models (see Section 3.3) as CQT-tiny, SWIPE-full, and SWIPE-tiny, where “tiny” refers to the Toeplitz-only encoder and “full” to the original PESTO encoder architecture with a multi-layer CNN. We train the models on the whole of MIR-1K and measure their performance on MDB-stem-synth, and vice versa. The models are trained for 50 epochs using a batch size of 256 and the Adam optimizer with an initial learning rate of 0.0001.
The results of the evaluation are given in Table 3. The baseline SWIPE implementation outperforms PESTO on both datasets, regardless of which dataset the model was trained on. This means that the original SWIPE algorithm outperforms all work on self-supervised monophonic pitch detection published to date.
When using CQT frames as input, reducing the encoder network to just the final Toeplitz layer noticeably degrades performance, especially in the across-dataset evaluation. However, it is worth noting that CQT-tiny still achieves relatively good same-dataset accuracy. Since no explicit pitch information is given to the model during training, this is a strong indicator of the usefulness of the transposition-equivariant training structure that PESTO introduced.
The highest accuracy on the same-dataset evaluation is achieved by the two models that use SWIPE scores as input. Like CQT-tiny, however, their performance plummets when trained on MIR-1K and evaluated on MDB-stem-synth. MDB-stem-synth covers a larger pitch range than MIR-1K and contains more varied timbres, making generalization challenging. The original PESTO is the only model that is able to make this jump reasonably well.
In the reverse direction, however, SWIPE-tiny achieves the highest RPA out of the four models when trained on MDB-stem-synth and evaluated on MIR-1K, as well as the best same-dataset RPA for MDB-stem-synth. Adding the Toeplitz layer on top of the SWIPE scores improves their performance, but the additional network layers in SWIPE-full do not bring further accuracy gains, seemingly hindering performance instead.
Figure 2. Top: The spectrum of a frame of audio from MIR-1K. The solid vertical line marks the ground truth pitch. Bottom: The SWIPE scores for the frame, before (dotted) and after (dashed) they were transformed by the SWIPE-tiny encoder trained on MDB-stem-synth. The vertical lines indicate the pitch estimate obtained from the basic SWIPE algorithm (dotted), the estimate given by SWIPE-tiny (dashed), and the ground truth (solid).
The Toeplitz-only encoder in SWIPE-tiny seems to learn to refine the peaks of the SWIPE scores, mitigating errors caused by the quantization of the search space or by input spectra that deviate too far from the harmonic ideal. Figure 2 illustrates a frame where an estimation error of 90 cents is reduced to 30 cents after feeding the SWIPE scores through the SWIPE-tiny encoder.
5.3 Latency-Accuracy Tradeoff for SWIPE
The frame-based structure of the evaluated models, and especially the lightweight architecture of the self-supervised estimators, lend themselves naturally to use in real-time, streaming applications. In this context, small window sizes are desirable to reduce latency.

                                                             Raw Pitch Accuracy
Method                       # params   Trained on           MIR-1K    MDB-stem-synth
PYIN                         -          -                    95.4%     91.6%
SWIPE                        -          -                    96.2%     96.1%
PESTO (baseline from [14])   28.9k      MIR-1K               96.1%     94.6%
                                        MDB-stem-synth       93.5%     95.5%
CQT-tiny                     647        MIR-1K               95.6%     78.8%
                                        MDB-stem-synth       91.7%     95.5%
SWIPE-full                   28.2k      MIR-1K               97.0%     89.7%
                                        MDB-stem-synth       96.1%     96.4%
SWIPE-tiny                   647        MIR-1K               96.6%     90.1%
                                        MDB-stem-synth       96.4%     96.5%

Table 3. Evaluation results of the self-supervised models. For both datasets, we highlight the best result achieved when training and evaluating on the same dataset (no explicit pitch information is provided to the model during training), and when training on one dataset and evaluating on the other. The performance of PYIN and SWIPE is given for comparison.

Window Size            Raw Pitch Accuracy
[Samples]   [ms]       SWIPE-tiny   SWIPE-sup
16384       372        96.4%        97.2%
8192        186        96.4%        97.1%
4096        93         96.2%        96.7%
2048        46         85.0%        86.9%

Table 4. The effect of reducing the maximum window size (for a sampling rate of 44.1 kHz) at which SWIPE scores are calculated. RPA on MIR-1K is reported for SWIPE-tiny (trained on MDB-stem-synth) and SWIPE-sup.

As described in Section 2.1, the theoretical ideal window size for each pitch candidate $f_c$ is exactly $8/f_c$. In practice, however, it is sufficient to only consider window sizes whose length in samples is a power of two. For a given pitch candidate with an ideal window length $W$, the score is then calculated twice at window lengths $2^{\lfloor \log_2(W) \rfloor}$ and $2^{\lceil \log_2(W) \rceil}$, and linearly interpolated to obtain an approximation of the score at the ideal size. Given a sampling rate of $f_s = 44.1$ kHz and a minimum pitch candidate of $f_{\min} = 27.5$ Hz, the next-longest window with a power-of-two length in samples corresponds to 327 ms, which is already a significant improvement compared to the 1871 ms required by the CQT.
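The window-size rule can be worked through numerically. The sketch below computes the ideal window length for a candidate and the two bracketing power-of-two lengths, and interpolates between two scores. The function names are hypothetical, and interpolating linearly in the window length (rather than, say, in its logarithm) is an assumption of this example.

```python
import math

def bracketing_windows(f_c, sample_rate):
    """Ideal window length 8 / f_c in samples, and the two nearest powers of two."""
    ideal = 8.0 * sample_rate / f_c
    exponent = math.log2(ideal)
    return ideal, 2 ** math.floor(exponent), 2 ** math.ceil(exponent)

def interpolate_score(score_shorter, score_longer, ideal, shorter, longer):
    """Linear interpolation between the scores computed at the two window lengths."""
    if shorter == longer:
        return score_shorter
    weight = (ideal - shorter) / (longer - shorter)
    return (1.0 - weight) * score_shorter + weight * score_longer

# Lowest pitch candidate (27.5 Hz) at a sampling rate of 44.1 kHz.
ideal, shorter, longer = bracketing_windows(27.5, 44100)
longest_ms = 1000.0 * longer / 44100
```

For the lowest candidate the ideal length is about 12829 samples, bracketed by 8192 and 16384 samples. At 44.1 kHz, 16384 samples last about 372 ms, which is the first row of Table 4.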
However, this can be reduced further. The SWIPE-based models offer a straightforward way to reduce both latency and computational cost at the expense of accuracy by simply calculating the scores for lower pitch candidates at shorter window sizes (without interpolation). Crucially, this adjustment can be made flexibly at inference time without retraining the model. Table 4 shows the effect that reducing the window size has on the RPA of two selected models. Note that reducing the window size to a length that is not a power of two is also possible if finer control over the tradeoff is desired.
5.4 Robustness to Noise
In the final experiment, we investigate how robust various trained models are to noisy conditions by adding white noise to the input audio at decreasing signal-to-noise ratios.
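Mixing white noise into a signal at a prescribed signal-to-noise ratio can be sketched as follows. This is one plausible way to set up such an experiment, not the evaluation code behind the reported numbers: the Gaussian noise source, the rescaling of the noise so that the realized SNR matches the target exactly, and the function name are assumptions made for this example. The input signal is assumed to be non-silent.

```python
import math
import random

def add_white_noise(signal, snr_db, rng):
    """Return signal plus Gaussian white noise scaled to the requested SNR in dB."""
    signal_power = sum(s * s for s in signal) / len(signal)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    measured_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(target_noise_power / measured_power)  # match the target exactly
    return [s + scale * n for s, n in zip(signal, noise)]

# A 440 Hz sine at 44.1 kHz, mixed with noise at -5 dB SNR (noise louder than signal).
sine = [math.sin(2.0 * math.pi * 440.0 * n / 44100.0) for n in range(4096)]
noisy = add_white_noise(sine, snr_db=-5.0, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the noise reproducible, which makes it easier to compare different models on identical noisy inputs.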

              Raw Pitch Accuracy (MIR-1K)
Model         clean    5 dB     0 dB     -5 dB    -10 dB
PYIN          95.4%    95.3%    95.1%    93.7%    85.8%
SWIPE         96.2%    93.9%    91.2%    85.6%    75.2%
CQT-sup       92.2%    91.5%    89.3%    87.3%    82.3%
SWIPE-sup     93.5%    91.6%    90.0%    87.1%    72.2%
SWIPE-tiny    96.6%    96.0%    95.3%    93.4%    88.5%
PESTO         94.6%    93.3%    92.9%    90.1%    81.7%
FCNF0++       91.0%    90.3%    89.0%    83.5%    81.0%

Table 5. The effect that adding white noise at various signal-to-noise ratios to the input audio has on the raw pitch accuracy on the MIR-1K dataset for various models.

The results are shown in Table 5. SWIPE-tiny appears to be fairly robust to background noise, especially compared to the base SWIPE algorithm. The performance of the supervised models degrades somewhat quicker than that of the self-supervised ones. This is not too surprising, since invariance to added noise is an explicit training objective for the self-supervised models (see Section 3.3).
6. CONCLUSION
We investigated the potential of combining the SWIPE algorithm with neural pitch estimation. We adapted established supervised and self-supervised training techniques to use SWIPE scores as an audio frontend and obtained accurate, efficient, robust and flexible pitch estimators.
We demonstrated that the potential of SWIPE has been significantly underestimated in the literature despite being commonly used as a baseline. The algorithm in its original form outperforms state-of-the-art self-supervised neural pitch estimators.
In the future, we plan to explore whether the performance of the pitch estimators can be further improved by hybrid training schemes that make simultaneous use of labeled data and self-supervised training objectives. We would also like to investigate whether certain parameters of SWIPE, such as the frequency envelope or the weights of individual harmonics, can be directly learned from data.
7. ACKNOWLEDGMENTS
This work was supported by UK Research and Innovation [grant number EP/S022694/1]. The authors would like to thank the anonymous reviewers for their valuable feedback, which significantly improved this paper.
8. REFERENCES
[1] A. M. Noll, “Cepstrum Pitch Determination,” The Journal of the Acoustical Society of America, vol. 41, no. 2, pp. 293–309, Feb. 1967.
[2] R. C. Maher and J. W. Beauchamp, “Fundamental frequency estimation of musical signals using a two-way mismatch procedure,” The Journal of the Acoustical Society of America, vol. 95, no. 4, pp. 2254–2263, 1994.
[3] A. Camacho and J. G. Harris, “A sawtooth waveform inspired pitch estimator for speech and music,” The Journal of the Acoustical Society of America, vol. 124, no. 3, pp. 1638–1652, Sep. 2008.
[4] S. Gonzalez and M. Brookes, “PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 518–530, Feb. 2014.
[5] A. de Cheveigné and H. Kawahara, “YIN, a Fundamental Frequency Estimator for Speech and Music,” The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[6] P. McLeod and G. Wyvill, “A smarter way to find pitch,” in International Computer Music Conference (ICMC), 2005.
[7] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “CREPE: A Convolutional Representation for Pitch Estimation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
[8] L. Ardaillon and A. Roebel, “Fully-Convolutional Network for Pitch Estimation of Speech Signals,” in Interspeech 2019, 2019.
[9] M. Morrison, C. Hsieh, N. Pruyne, and B. Pardo, “Cross-domain Neural Pitch and Periodicity Estimation,” http://arxiv.org/abs/2301.12258, Jun. 2023.
[10] X. Li, H. Huang, Y. Hu, L. He, J. Zhang, and Y. Wang, “YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch Estimation,” in Interspeech 2024, 2024.
[11] B. Gfeller, C. Frank, D. Roblek, M. Sharifi, M. Tagliasacchi, and M. Velimirović, “Pitch estimation via self-supervision,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[12] B. Gfeller, C. Frank, D. Roblek, M. Sharifi, M. Tagliasacchi, and M. Velimirović, “SPICE: Self-Supervised Pitch Estimation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 28, pp. 1118–1128, 2020.
[13] J. Engel, R. Swavely, A. Roberts, L. H. Hantrakul, and C. Hawthorne, “Self-Supervised Pitch Detection by Inverse Audio Synthesis,” Workshop on Self-Supervision in Audio and Speech at the 37th International Conference on Machine Learning (ICML 2020), 2020.
[14] A. Riou, S. Lattner, G. Hadjeres, and G. Peeters, “PESTO: Pitch estimation with self-supervised transposition-equivariant objective,” in Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR), 2023.
[15] P. Rengaswamy, M. G. Reddy, K. S. Rao, and P. Dasgupta, “hf0: A Hybrid Pitch Extraction Method for Multimodal Voice,” Circuits, Systems, and Signal Processing, vol. 40, no. 1, pp. 262–275, Jan. 2021.
[16] E. S. Hassan, B. Neyazi, H. S. Seddeq, A. Z. Mahmoud, A. S. Oshaba, A. El-Emary, and F. E. Abd El-Samie, “HAEPF: Hybrid approach for estimating pitch frequency in the presence of reverberation,” Multimedia Tools and Applications, vol. 83, no. 32, pp. 77489–77508, Feb. 2024.
[17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, 2019, pp. 8024–8035.
[18] B. R. Glasberg and B. C. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hearing Research, vol. 47, no. 1, pp. 103–138, 1990.
[19] T. Yoshimura, T. Fujimoto, K. Oura, and K. Tokuda, “SPTK4: An open-source software toolkit for speech signal processing,” in 12th ISCA Speech Synthesis Workshop (SSW 2023), 2023, pp. 211–217.
[20] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” in Proceedings of the 14th Python in Science Conference, 2015.
[21] J. Salamon, R. M. Bittner, J. Bonada, J. J. Bosch, E. Gómez, and J. P. Bello, “An analysis/synthesis framework for automatic f0 annotation of multitrack datasets,” in Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), 2017.
[22] G. Pirker, M. Wohlmayr, S. Petrik, and F. Pernkopf, “A pitch tracking corpus with evaluation on multipitch tracking scenario,” in Interspeech, 2011, pp. 1509–1512.
[23] C.-L. Hsu and J.-S. R. Jang, “On the improvement of singing voice separation for monaural recordings using the mir-1k dataset,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 310–319, 2010.
[24] M. Mauch and S. Dixon, “pYIN: A fundamental frequency estimator using probabilistic threshold distributions,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 659–663.
[25] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel, “Mir_eval: A transparent implementation of common mir metrics,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014.
[26] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations (ICLR), 2015.