Improving Neural Pitch Estimation With SWIPE Kernels

Author: David Marttila; Joshua D. Reiss

Publisher: Zenodo

DOI: 10.5281/zenodo.17706561

Source: https://zenodo.org/records/17706561/files/000080.pdf

IMPROVING NEURAL PITCH ESTIMATION WITH SWIPE KERNELS
David Marttila, Joshua D. Reiss
Centre for Digital Music, Queen Mary University of London
ABSTRACT
Neural networks have become the dominant technique for accurate pitch and periodicity estimation. Although a lot of research has gone into improving network architectures and training paradigms, most approaches operate directly on the raw audio waveform or on general-purpose time-frequency representations. We investigate the use of Sawtooth Waveform Inspired Pitch Estimator (SWIPE) kernels as an audio frontend and find that these hand-crafted, task-specific features can make neural pitch estimators more accurate, more robust to noise, and more parameter-efficient. We evaluate supervised and self-supervised state-of-the-art architectures on common datasets and show that the SWIPE audio frontend allows the network size to be reduced by an order of magnitude without performance degradation. Additionally, we show that the SWIPE algorithm on its own is much more accurate than commonly reported, outperforming state-of-the-art self-supervised neural pitch estimators.
1. INTRODUCTION
Pitch plays a central role in how humans perceive sound. Consequently, pitch estimation is a fundamental task in many music, speech and audio processing pipelines. While pitch is a psychoacoustic phenomenon, it closely correlates to the signal processing concept of the fundamental frequency $f_0$. Recent literature commonly uses the term “pitch estimation” to refer to the task of estimating an audio signal’s $f_0$.
Given the importance of accurate pitch estimation, the topic has received a considerable amount of research attention over the past decades. Numerous digital signal processing (DSP) techniques estimate pitch based on the cepstrum [1], the power spectrum [2–4], or the autocorrelation function [5,6].
More recently, deep neural networks have been applied to the task of pitch estimation [7–10]. In a typical architecture, a convolutional neural network (CNN) is given overlapping frames of raw audio as input and trained to predict a probability distribution over a discrete set of $f_0$ candidates in a supervised fashion. While these models can reach very high accuracy, they require a large amount of training data annotated with reliable ground truth pitch values and can struggle with out-of-domain generalization and robustness to noise and reverberation. Additionally, the CNNs usually consist of millions of parameters, making them less suitable for use in low-resource environments.
© D. Marttila and J. D. Reiss. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: D. Marttila and J. D. Reiss, “Improving Neural Pitch Estimation with SWIPE Kernels”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
These drawbacks have been addressed in two different ways. Self-supervised training paradigms [11–14] do not require labeled training data, and incorporating traditional DSP approaches into neural networks has been shown to increase efficiency and robustness [15, 16].
In this paper, we combine these two approaches and merge task-specific DSP-based features with both supervised and self-supervised training paradigms. Specifically, we substitute the audio frontend in the Pitch Estimation with Self-Supervised Transposition-Equivariant Objective (PESTO) architecture [14] for a representation obtained from the Sawtooth Waveform Inspired Pitch Estimator (SWIPE) [3], which estimates pitch by measuring the similarity of the input spectrum to that of sawtooth waves at various pitch candidates. We also investigate the use of SWIPE as a frontend for supervised neural pitch estimators, which usually operate directly on the audio waveform.
The core insights of our work are these:
• Although SWIPE is commonly used as a baseline for neural pitch estimation, we find that its performance has been significantly underreported. We show that SWIPE in its original form surpasses the accuracy of the state of the art in self-supervised neural pitch detection (PESTO).
• SWIPE is a well-suited audio frontend for neural pitch estimators in both supervised and self-supervised settings, and can improve the state of the art in terms of accuracy, robustness, efficiency, and latency.
Trained models alongside a SWIPE implementation in PyTorch [17] are available online at https://github.com/dsuedholt/neural-pitch-swipe.
The remainder of this paper is structured as follows. In Section 2, we give an overview of SWIPE and neural pitch estimation methods. Section 3 covers some aspects of our SWIPE implementation choices and details how we embed SWIPE into neural pitch estimation architectures. Section 4 describes how we evaluate our approach, and Section 5 presents the results of our evaluation.
Figure 1. SWIPE and SWIPE’ kernels corresponding to a pitch candidate at 330 Hz. The SWIPE kernel contains peaks at all integer harmonics of the candidate frequency. The SWIPE’ kernel is obtained by removing the peaks at non-prime harmonics.
2. BACKGROUND
2.1 SWIPE
The Sawtooth Waveform Inspired Pitch Estimator (SWIPE) [3] estimates pitch by identifying the fundamental frequency of a sawtooth waveform whose spectrum best matches that of the input signal. To achieve this, it constructs spectral kernels for a number of discrete pitch candidates, and assigns a score to each pitch candidate by measuring the similarity between its associated kernel and the spectrum of the input signal.
2.1.1 Score Calculation
More formally, consider a (windowed) audio signal $x[n]$ of length $N$ and its discrete Fourier Transform (DFT) $X[k]$, which may be truncated to the $K = \lfloor N/2 \rfloor + 1$ bins corresponding to non-negative frequencies if $x$ is real-valued. Let now $C = \{f_1, f_2, \ldots, f_{|C|}\}$ be a set of $|C|$ pitch candidates. Then $S_c[k]$ is the spectral kernel associated with the pitch candidate $f_c$, and we can compute its SWIPE score $Z(f_c): C \to [-1, 1]$ as the normalized inner product between $S_c$ and $X$:
$$Z(f_c) = \frac{\sum_{k=0}^{K-1} S_c[k] \cdot |X[k]|^{1/2}}{\sum_{k=0}^{K-1} |X[k]|^{1/2}} \tag{1}$$
The pitch estimate is then given by the $f_c$ that maximizes $Z(f_c)$ and may optionally be further refined by e.g. parabolic interpolation of the local maximum.
While Eqn (1) computes the inner product over all bins of the DFT for notational simplicity, the original SWIPE paper suggests resampling the spectrum to the Equivalent Rectangular Bandwidth (ERB) [18] scale for speech data, or to the mel scale for musical instruments.
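As a rough illustration of Eqn (1) and the optional refinement step, the following sketch computes the score of one kernel against one magnitude spectrum and estimates a sub-bin peak offset from three neighbouring scores. The function names and the handling of a silent frame are assumptions made for this example; it is not the implementation used in the paper.

```python
import math


def swipe_score(kernel, magnitudes):
    """Eqn (1): sum(S_c[k] * sqrt|X[k]|) / sum(sqrt|X[k]|)."""
    if len(kernel) != len(magnitudes):
        raise ValueError("kernel and spectrum must have the same number of bins")
    root = [math.sqrt(abs(m)) for m in magnitudes]
    denom = sum(root)
    if denom == 0.0:
        return 0.0  # assumption: a silent frame supports no candidate
    return sum(s * r for s, r in zip(kernel, root)) / denom


def parabolic_peak_offset(left, center, right):
    """Vertex offset (in bins) of the parabola through three scores around a local maximum."""
    curvature = left - 2.0 * center + right
    if curvature >= 0.0:
        return 0.0  # not a strict local maximum: keep the grid position
    return 0.5 * (left - right) / curvature
```

The offset is expressed in units of the candidate grid, so on a logarithmically spaced grid it would be applied to the bin index before converting back to Hz.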
2.1.2 Kernel Design
The kernel $S_c$ is designed to maximize the inner product with $X$ if $x$ is a signal with fundamental frequency $f_0 = f_c$. To achieve that, it contains cosine lobes of width $f_c/2$ at all integer harmonics of $f_c$, decaying in magnitude to mimic the spectrum of a sawtooth wave. As the authors of SWIPE lay out, this corresponds precisely to the square root of the main lobes of a Hann-windowed sawtooth wave if the size of the analysis window is exactly $T = 8/f_c$. The kernel further contains negative-valued valleys at $\frac{1}{2} f_c, \frac{3}{2} f_c, \ldots$, i.e. at the midpoint between each harmonic peak.
Since all harmonics of a signal with a true fundamental frequency of $f$ also contribute to the scores of the pitch candidates at $f/2, f/3, \ldots$, a common variant of the SWIPE algorithm removes the non-prime harmonics (except for the first one) of all kernels to reduce the problem of octave errors. This variant is known as SWIPE’, but it is effective and widespread enough that it is often simply referred to as SWIPE, for example in the Speech Processing Toolkit (SPTK) [19] implementation, which is based directly on the MATLAB code published along with SWIPE. We take the same approach in this paper and will generally assume that the scores $Z$ are calculated using SWIPE’ kernels. An example of such a kernel is illustrated in Figure 1.
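The kernel shape described above can be sketched in a few lines. The exact lobe boundaries (positive lobes within a quarter of $f_c$ of each harmonic, half-height valleys between a quarter and three quarters), the $1/\sqrt{f}$ envelope, and the omission of any final normalization are assumptions chosen for illustration; `swipe_kernel` and `is_prime` are hypothetical helper names.

```python
import math


def is_prime(n):
    """Trial-division primality test, sufficient for small harmonic numbers."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True


def swipe_kernel(freqs, f_c, prime_only=True):
    """Sample a SWIPE (or SWIPE') kernel for candidate f_c at the given frequencies in Hz."""
    n_harm = int(max(freqs) / f_c + 0.75)  # harmonics whose lobes or valleys can reach the range
    harmonics = [h for h in range(1, n_harm + 1)
                 if not prime_only or h == 1 or is_prime(h)]
    kernel = []
    for f in freqs:
        q = f / f_c  # frequency in units of the candidate
        value = 0.0
        for h in harmonics:
            d = abs(q - h)
            if d < 0.25:
                value = math.cos(2.0 * math.pi * q)          # positive lobe, width f_c / 2
            elif 0.25 < d < 0.75:
                value += 0.5 * math.cos(2.0 * math.pi * q)   # negative valley next to the lobe
        # Decaying envelope: square root of the 1/f harmonic amplitudes of a sawtooth.
        kernel.append(value / math.sqrt(f) if f > 0 else 0.0)
    return kernel
```

With `prime_only=True`, a frequency at the fourth harmonic of the candidate receives no weight, whereas `prime_only=False` places a positive lobe there.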
2.2 Supervised Neural Pitch Estimation
The established way of using neural networks for pitch estimation is to interpret an audio signal $x[n]$ as a vector $\mathbf{x}$. A network $f_\theta$ then maps $\mathbf{x}$ to a vector $\mathbf{y} \in [0,1]^{|C|}$, where each entry $y_c$ represents the probability that a corresponding pitch candidate $f_c$ is the pitch of $\mathbf{x}$. In supervised training, this is treated as a multi-class classification problem, calculating the loss using the cross-entropy to the ground truth pitch. The ground truth distribution may be smoothed using Gaussian blurring to aid training [7]. Voicing confidence can be deduced from the entropy of the predicted probability distribution [9].
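The training target and the entropy-based confidence can be illustrated with a small sketch. The blur width `sigma_bins`, the normalization of the blurred target to a distribution, and the `eps` guard are assumptions for this example and are not taken from the cited works.

```python
import math


def gaussian_target(true_bin, n_bins, sigma_bins=1.0):
    """Ground truth distribution: a Gaussian bump centred on the true bin, normalized to sum to 1."""
    weights = [math.exp(-0.5 * ((i - true_bin) / sigma_bins) ** 2) for i in range(n_bins)]
    total = sum(weights)
    return [w / total for w in weights]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def entropy(dist, eps=1e-12):
    """Shannon entropy; a flat (high-entropy) prediction suggests an unvoiced frame."""
    return -sum(p * math.log(p + eps) for p in dist)
```

A peaked prediction yields low entropy and a uniform one yields the maximum, $\log |C|$, so a threshold on the entropy gives a simple voiced/unvoiced decision.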
2.3 Self-Supervised Neural Pitch Estimation
Pitch Estimation with Self-Supervised Transposition-Equivariant Objective (PESTO) [14] is a state-of-the-art architecture for self-supervised training of neural network pitch estimators, where natural symmetries of the input are exploited to learn a translation-equivariant representation, instead of providing ground truth pitches to the model.
2.3.1 Training Setup
During training, the model learns to optimize a combination of three losses:
An equivariance loss enforces that pitch-shifted versions of an input should result in output distributions that are transpositions of the output for the original input. Given an input $\mathbf{x}$ and its pitch-shifted version $\mathbf{x}^{(k)}$ (shifted by $k$ semitones), their respective outputs $\mathbf{y}$ and $\mathbf{y}^{(k)}$ should satisfy $\phi(\mathbf{y}^{(k)}) = \alpha^k \phi(\mathbf{y})$, where $\phi$ is a deterministic linear mapping:
$$\phi: \mathbb{R}^{|C|} \to \mathbb{R}, \quad \mathbf{y} \mapsto (\alpha, \alpha^2, \ldots, \alpha^{|C|})\,\mathbf{y} \tag{2}$$
and $\alpha$ is a hyperparameter.
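The equivariance property behind Eqn (2) can be checked numerically with a small sketch. It treats one output bin as one unit of shift and pads with zeros at the edges, both simplifying assumptions; `phi` and `shift_bins` are hypothetical helper names.

```python
def phi(y, alpha):
    """Eqn (2): inner product of y with (alpha, alpha^2, ..., alpha^|C|)."""
    return sum(alpha ** (i + 1) * y_i for i, y_i in enumerate(y))


def shift_bins(y, k):
    """Transpose a distribution by k bins, zero-padding (mass shifted past the edge is dropped)."""
    n = len(y)
    out = [0.0] * n
    for i, value in enumerate(y):
        if 0 <= i + k < n:
            out[i + k] = value
    return out
```

As long as no probability mass leaves the grid, `phi(shift_bins(y, k), alpha)` equals `alpha ** k * phi(y, alpha)` up to rounding, which is the relation the equivariance loss encourages.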
A regularization loss further ensures that the network’s outputs for pitch-shifted inputs maintain the expected transposition relationship. For a pair of outputs $\mathbf{y}$ and $\mathbf{y}^{(k)}$, the shifted cross-entropy loss
$$\mathcal{L}_{\mathrm{SCE}}(\mathbf{y}, \mathbf{y}^{(k)}, k) = \sum_{i=0}^{|C|-1} y_i \log y^{(k)}_{i+k} \tag{3}$$
measures how well $\mathbf{y}^{(k)}$ matches the $k$-semitone shift of $\mathbf{y}$.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Finally, an invariance loss encourages the mapping $f_\theta$ to be invariant to the timbre of the signal. During training, PESTO draws random transforms $t$ from a set of pitch-preserving data augmentations $T$. Given $\tilde{\mathbf{x}} = t(\mathbf{x})$, the invariance loss is then expressed as the cross-entropy between $\mathbf{y} = f_\theta(\mathbf{x})$ and $\tilde{\mathbf{y}} = f_\theta(\tilde{\mathbf{x}})$.
2.3.2 Model Architecture
The PESTO architecture uses the constant-Q transform (CQT) of an audio frame as its input, where the bins of the CQT exactly correspond to the pitch candidates. The transforms $T$ take the form of adding random noise and gain to the CQT frames. A CNN processes the frame, and its flattened output is fed to a final linear layer followed by a softmax layer which produces a probability distribution. Importantly, the final linear layer uses a Toeplitz matrix as its weight matrix to preserve the transposition equivariance of the CNN.
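To see why a Toeplitz weight matrix preserves transposition equivariance, consider the following sketch. A Toeplitz matrix has entries that depend only on the difference between row and column index, so it has `n_in + n_out - 1` free parameters and acts like a single convolution filter. The helper names and the small sizes are assumptions for illustration.

```python
def toeplitz_matrix(filt, n_in, n_out):
    """Build an n_out x n_in matrix whose entry (r, c) depends only on r - c."""
    if len(filt) != n_in + n_out - 1:
        raise ValueError("a Toeplitz matrix of this shape has n_in + n_out - 1 parameters")
    return [[filt[r - c + n_in - 1] for c in range(n_in)] for r in range(n_out)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Shifting the input by one bin shifts the output by one bin (away from the edges).
weights = toeplitz_matrix([float(i) for i in range(8)], n_in=4, n_out=5)
out_a = matvec(weights, [0.0, 1.0, 0.0, 0.0])
out_b = matvec(weights, [0.0, 0.0, 1.0, 0.0])
```

Here `out_b[1:]` equals `out_a[:-1]`, which is the shift relationship that a dense, unconstrained weight matrix would not guarantee.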
3. METHODS
The core insight of this work is that SWIPE scores encode rich pitch information and are thus well suited as an audio frontend to neural pitch estimation in both supervised and self-supervised settings. This section first covers our implementation of SWIPE in detail, and then describes how we adapt supervised and self-supervised neural pitch estimators to work with SWIPE scores.
3.1 SWIPE Implementation
We calculate the SWIPE scores by sampling the spectrum at 1024 frequencies, which are linearly spaced on the mel scale over a range from $0.25 \cdot f_{\min}$ to $1.25 \cdot f_{\max}$. We used the Slaney-style mel scale, which is linear up to 1 kHz and logarithmic above, as implemented in the librosa toolkit [20]. For each of the various window sizes, the spectrum is calculated with the same FFT resolution (zero-padding the input as needed) and evaluated at the sampling frequencies using linear interpolation. We arrange the pitch candidates to match the CQT resolution used in PESTO: logarithmically spaced over a range of $f_{\min} = 27.5$ Hz to $f_{\max} = 8055$ Hz, using a resolution of 3 bins per semitone, for a total of 295 bins.
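The two frequency grids described above can be sketched as follows. Two points are assumptions of this example rather than statements from the text: the candidates are taken to be $f_{\min} \cdot 2^{i/36}$ for $i = 0, \ldots, 294$, so that roughly 8055 Hz is the exclusive upper edge of the grid, and the Slaney mel formula uses the constants commonly associated with that scale (200/3 Hz per mel below 1 kHz, a log step of $\ln(6.4)/27$ above).

```python
import math

F_MIN = 27.5
F_MAX = 8055.0
_F_SP = 200.0 / 3.0               # Hz per mel in the linear region
_MIN_LOG_HZ = 1000.0              # start of the logarithmic region
_MIN_LOG_MEL = _MIN_LOG_HZ / _F_SP
_LOGSTEP = math.log(6.4) / 27.0


def pitch_candidates(f_min=F_MIN, bins_per_semitone=3, n_bins=295):
    """Logarithmically spaced candidates: f_min * 2^(i / (12 * bins_per_semitone))."""
    return [f_min * 2.0 ** (i / (12.0 * bins_per_semitone)) for i in range(n_bins)]


def hz_to_mel_slaney(f):
    if f >= _MIN_LOG_HZ:
        return _MIN_LOG_MEL + math.log(f / _MIN_LOG_HZ) / _LOGSTEP
    return f / _F_SP


def mel_to_hz_slaney(mel):
    if mel >= _MIN_LOG_MEL:
        return _MIN_LOG_HZ * math.exp(_LOGSTEP * (mel - _MIN_LOG_MEL))
    return mel * _F_SP


def mel_spaced_frequencies(f_lo, f_hi, n):
    """n frequencies in Hz, linearly spaced on the (Slaney) mel scale."""
    m_lo, m_hi = hz_to_mel_slaney(f_lo), hz_to_mel_slaney(f_hi)
    return [mel_to_hz_slaney(m_lo + (m_hi - m_lo) * i / (n - 1)) for i in range(n)]


candidates = pitch_candidates()
sampling_freqs = mel_spaced_frequencies(0.25 * F_MIN, 1.25 * F_MAX, 1024)
```

Under this reading, the highest candidate lies one bin below the upper edge, and the spectrum is sampled from about 6.9 Hz to about 10.07 kHz.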
Although many papers on neural pitch estimation compare their work to SWIPE as a baseline, they generally do not cite the implementation they used or report the parameters they chose. To make sure that our implementation does not significantly underperform, we compare it to the most popular open-source implementation of SWIPE, which is contained in the Speech Processing Toolkit (SPTK) [19]. It uses 8 pitch bins per semitone by default, samples the input frequency spectrum according to the ERB [18] scale, and refines the estimate using parabolic interpolation.
Table 1 contains the Raw Pitch Accuracy (RPA) achieved by the SPTK and our implementation on the MDB-stem-synth and MIR-1K datasets and compares it to previously reported baseline values. The metrics and datasets are described in more detail in Section 4. The accuracy of the SPTK implementation seems to significantly deteriorate for larger search ranges. We report the values for upper limits of 2 kHz and 8 kHz, where the lower limit is 30 Hz for both. We set the score threshold which pitch candidates need to exceed to be considered to 0.
Our implementation appears to be a lot more robust to its larger search range (27.5–8055 Hz). Switching from ERB to mel sampling results in a notable accuracy gain on MDB-stem-synth, which contains more varied timbres. This matches the results of the original SWIPE paper.
Both the SPTK and our own implementation can perform much more accurately than the values that were previously reported as baselines in the neural pitch estimation literature suggest, to the extent that SWIPE outperforms even state-of-the-art self-supervised pitch detection models (see Section 5.2).
3.2 Supervised Neural Pitch Estimation
We experiment with using both SWIPE scores and the CQT as an audio frontend in a supervised training context. We feed the input into a CNN with 6 1D-convolutional layers, applying layer normalization and a leaky ReLU non-linearity with slope 0.3 between each layer. Zero-padding is applied to the input in each layer to preserve the input dimension. After flattening, the output of the final layer is reduced to the dimensionality of the pitch bins and fed into a Softmax layer to obtain a probability distribution. We find that using a dense linear layer to perform the dimensionality reduction strongly degraded generalization in this setup, and instead also employ a Toeplitz layer in the supervised model.
3.3 Self-Supervised Neural Pitch Estimation
The original PESTO architecture is already well suited to work with SWIPE scores, which can be directly substituted for the CQT bins without violating the assumptions on translation equivariance. Since SWIPE scores encode periodicity information much more explicitly than CQT frames, we expect the encoder network to achieve similar performance with fewer parameters. We test this hypothesis by training a PESTO-style encoder with a drastically reduced parameter count, consisting only of the final Toeplitz fully-connected layer (a convolutional layer with a single filter of size 647) and softmax normalization. In this very simple architecture, the Toeplitz layer can be seen as essentially learning a reweighting of the SWIPE scores, and as refining the location of the peak score if the output resolution is larger than the input resolution.
Initial experiments indicated that applying random data augmentation to the SWIPE scores only resulted in degraded performance compared to the baseline DSP algorithm. We additionally augment the audio frame in the time domain by adding random noise and applying a finite impulse response (FIR) filter with a randomized amplitude response. This results in an increased computational cost of training, since the SWIPE scores need to be recalculated at every training step, but does not affect the computational cost of inference once training has finished.

          Reported in [12,13]   SPTK (2kHz)   SPTK (8kHz)   Ours (ERB)   Ours (mel)
MIR-1K    86.6%                 96.5%         68.2%         95.7%        96.2%
MDB       90.7%                 94.1%         61.4%         94.0%        96.1%

Table 1. Raw Pitch Accuracy obtained by SWIPE implementations on the MIR-1K and MDB-stem-synth datasets, compared to baseline values previously reported in self-supervised pitch estimation papers. For SPTK, we report different upper limits of the search range. For our implementation, the search range is constant, but the frequency sampling scale changes.
4. EXPERIMENTAL SETUP
4.1 Datasets
Our experiments use three $f_0$-annotated datasets that are commonly used for training and benchmarking pitch detectors:
MDB-stem-synth [21] contains 230 solo tracks (418 minutes total) of instrument sounds and vocals. The audio is re-synthesized from its $f_0$ annotations, which means that the $f_0$ annotations are perfect. It is annotated with a hop size of 2.9 ms.
PTDB-TUG [22] contains 4720 audio and laryngograph recordings (576 minutes total) of 20 English speakers reading sentences. It is annotated with a hop size of 10 ms.
MIR-1K [23] contains 1000 short recordings (133 minutes total) of Chinese karaoke performances. It is annotated with a hop size of 20 ms.
4.2 Baselines
We compare all results to two DSP-based pitch detection baselines: PYIN [24] and SWIPE. We do not perform Viterbi decoding or any sort of peak refinement, simply selecting the pitch candidate with the highest score. PYIN scores are based on autocorrelation and so their natural resolution is expressed in integer samples. We resample the scores to the same pitch candidate resolution as used for SWIPE using linear interpolation.
As a baseline for supervised training, we choose FCNF0++ [9], which to the best of our knowledge is the currently best-performing supervised monophonic neural pitch detector that operates on a frame-by-frame basis, rather than processing the entire audio signal at once.
In the self-supervised setting, we compare our results against the original PESTO architecture, which is the current state of the art in self-supervised monophonic neural pitch estimation.
4.3 Evaluation Metrics
We use the mir_eval package [25] to report the following metrics:
Raw Pitch Accuracy (RPA), the percentage of voiced frames for which the model predicted a pitch within 50 cents of the ground truth.
F-Score, measuring the accuracy of the binary voiced/unvoiced decision.
Overall Accuracy (OA), the percentage of all frames (voiced and unvoiced) for which a correct voicing decision was made, and for which the model predicted a pitch within 50 cents of the ground truth if the frame is voiced.
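For intuition, RPA as defined above can be written out directly. This standalone sketch is not the mir_eval code; the convention that a reference value of 0 marks an unvoiced frame and the inclusive 50-cent boundary are assumptions of this example.

```python
import math


def cents_difference(f_est, f_ref):
    """Signed distance between two frequencies in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(f_est / f_ref)


def raw_pitch_accuracy(ref_hz, est_hz, tolerance_cents=50.0):
    """Fraction of voiced reference frames whose estimate lies within the tolerance."""
    voiced = [(r, e) for r, e in zip(ref_hz, est_hz) if r > 0]
    if not voiced:
        return 0.0
    hits = sum(1 for r, e in voiced
               if e > 0 and abs(cents_difference(e, r)) <= tolerance_cents)
    return hits / len(voiced)


# Three voiced frames: about 20 cents off (hit), a semitone off (miss), exact (hit).
rpa = raw_pitch_accuracy([440.0, 440.0, 0.0, 220.0], [445.0, 466.16, 300.0, 220.0])
```

The unvoiced third frame is ignored, so two of the three voiced frames count as correct.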
5. RESULTS AND DISCUSSION
We report separate experimental results for the supervised and self-supervised approaches, in each case closely replicating the training setup of the baselines (FCNF0++ and PESTO, respectively) to assess the impact of using SWIPE scores as an audio frontend.
5.1 Supervised Models
We refer to the two proposed supervised models (see Section 3.2) as CQT-sup and SWIPE-sup. We train our models as well as the FCNF0++ baseline on MDB-stem-synth and PTDB-TUG at the same time. While networks that take CQT or SWIPE scores as input are sample-rate agnostic, FCNF0++ operates on the raw audio waveform and was designed to work with a sampling rate of 8 kHz, so we resample its input accordingly.
In the interest of a direct comparison, we use the 70-15-15 split into training, validation and testing partitions that was published in [9]. The performance of the trained models is measured by calculating RPA, F-Score and OA on the testing set. To better measure generalization performance on unseen data, we additionally evaluate the trained models on the full MIR-1K dataset, which is not used in training.
We train the models for 500,000 steps, using a batch size of 256 and the Adam optimizer [26] with an initial learning rate of 0.0002. Table 2 shows the results of the evaluation. Overall, both CQT-sup and SWIPE-sup seem competitive with FCNF0++, but not clearly superior. They are able to almost match the in-domain RPA of FCNF0++, and outperform it in terms of voiced/unvoiced accuracy and generalization. The two proposed models use fewer trainable parameters than FCNF0++, but require a larger context window.
The CQT input features seem to be particularly well suited to making voiced/unvoiced decisions, with the CQT-sup model attaining the highest F-Score on both the test set and on MIR-1K.
                                                 Test Partition             MIR-1K
Method             # Params   Window Size [ms]   RPA     F-Score   OA       RPA     F-Score   OA
PYIN               -          145                89.4%   -         -        95.4%   -         -
SWIPE              -          327                93.2%   -         -        96.2%   -         -
CQT-sup (ours)     0.9M       1871               98.1%   98.7%     99.2%    92.2%   91.2%     87.1%
SWIPE-sup (ours)   0.9M       327                97.9%   98.1%     98.6%    93.5%   90.0%     87.8%
FCNF0++ [9]        6.6M       128                98.3%   98.2%     98.7%    91.0%   87.8%     86.0%

Table 2. Evaluation results of the supervised models as measured by Raw Pitch Accuracy (RPA), the F-Score of the voiced/unvoiced decision, and Overall Accuracy (OA). We also report the sizes of the neural networks and the maximum window size required by the estimator. Models are evaluated using the combined test partition of MDB-stem-synth and PTDB-TUG published in [9], as well as on the entire MIR-1K dataset, which was not used in training.
All three models struggle with generalization, staying well behind the DSP baselines on MIR-1K. The best generalization behavior is shown by SWIPE-sup, even though it was the least accurate model on the test set.
5.2 Self-Supervised Models
We refer to the three modified PESTO models (see Section 3.3) as CQT-tiny, SWIPE-full, and SWIPE-tiny, where “tiny” refers to the Toeplitz-only encoder and “full” to the original PESTO encoder architecture with a multi-layer CNN. We train the models on the whole of MIR-1K and measure their performance on MDB-stem-synth, and vice versa. The models are trained for 50 epochs using a batch size of 256 and the Adam optimizer with an initial learning rate of 0.0001.
The results of the evaluation are given in Table 3. The baseline SWIPE implementation outperforms PESTO on both datasets, regardless of which dataset the model was trained on. This means that the original SWIPE algorithm outperforms all work on self-supervised monophonic pitch detection published to date.
When using CQT frames as input, reducing the encoder network to just the final Toeplitz layer noticeably degrades performance, especially in the across-dataset evaluation. However, it is worth noting that CQT-tiny still achieves relatively good same-dataset accuracy. Since no explicit pitch information is given to the model during training, this is a strong indicator of the usefulness of the transposition-equivariant training structure that PESTO introduced.
The highest accuracy on the same-dataset evaluation is achieved by the two models that use SWIPE scores as input. Like CQT-tiny, however, their performance plummets when trained on MIR-1K and evaluated on MDB-stem-synth. MDB-stem-synth covers a larger pitch range than MIR-1K and contains more varied timbres, making generalization challenging. The original PESTO is the only model that is able to make this jump reasonably well.
In the reverse direction, however, SWIPE-tiny achieves the highest RPA out of the four models when trained on MDB-stem-synth and evaluated on MIR-1K, as well as the best same-dataset RPA for MDB-stem-synth. Adding the Toeplitz layer on top of the SWIPE scores improves their performance, but the additional network layers in SWIPE-full do not bring further accuracy gains, seemingly hindering performance instead.
Figure 2. Top: The spectrum of a frame of audio from MIR-1K. The solid vertical line marks the ground truth pitch. Bottom: The SWIPE scores for the frame, before (dotted) and after (dashed) they were transformed by the SWIPE-tiny encoder trained on MDB-stem-synth. The vertical lines indicate the pitch estimate obtained from the basic SWIPE algorithm (dotted), the estimate given by SWIPE-tiny (dashed), and the ground truth (solid).
The Toeplitz-only encoder in SWIPE-tiny seems to learn to refine the peaks of the SWIPE scores, mitigating errors caused by the quantization of the search space or by input spectra that deviate too far from the harmonic ideal. Figure 2 illustrates a frame where an estimation error of 90 cents is reduced to 30 cents after feeding the SWIPE scores through the SWIPE-tiny encoder.
                                                      Raw Pitch Accuracy
Method                       # params   Trained on        MIR-1K   MDB-stem-synth
PYIN                         -          -                 95.4%    91.6%
SWIPE                        -          -                 96.2%    96.1%
PESTO (baseline from [14])   28.9k      MIR-1K            96.1%    94.6%
                                        MDB-stem-synth    93.5%    95.5%
CQT-tiny                     647        MIR-1K            95.6%    78.8%
                                        MDB-stem-synth    91.7%    95.5%
SWIPE-full                   28.2k      MIR-1K            97.0%    89.7%
                                        MDB-stem-synth    96.1%    96.4%
SWIPE-tiny                   647        MIR-1K            96.6%    90.1%
                                        MDB-stem-synth    96.4%    96.5%

Table 3. Evaluation results of the self-supervised models. For both datasets, we highlight the best result achieved when training and evaluating on the same dataset (no explicit pitch information is provided to the model during training), and when training on one dataset and evaluating on the other. The performance of PYIN and SWIPE is given for comparison.

5.3 Latency-Accuracy Tradeoff of SWIPE
The frame-based structure of the evaluated models, and especially the lightweight architecture of the self-supervised estimators, lend themselves naturally to use in real-time, streaming applications. In this context, small window sizes are desirable to reduce latency.
As described in Section 2.1, the theoretical ideal window size for each pitch candidate $f_c$ is exactly $8/f_c$. In practice, however, it is sufficient to only consider window sizes whose length in samples is a power of two. For a given pitch candidate with an ideal window length $W$, the score is then calculated twice at window lengths $2^{\lfloor \log_2(W) \rfloor}$ and $2^{\lceil \log_2(W) \rceil}$, and linearly interpolated to obtain an approximation of the score at the ideal size. Given a sampling rate of $f_s = 44.1$ kHz and a minimum pitch candidate of $f_{\min} = 27.5$ Hz, the next-longest window with a power-of-two length in samples corresponds to 327 ms, which is already a significant improvement compared to the 1871 ms required by the CQT.
However, this can be reduced further. The SWIPE-based models offer a straightforward way to reduce both latency and computational cost at the expense of accuracy by simply calculating the scores for lower pitch candidates at shorter window sizes (without interpolation). Crucially, this adjustment can be made flexibly at inference time without retraining the model. Table 4 shows the effect that reducing the window size has on the RPA of two selected models. Note that reducing the window size to a length that is not a power of two is also possible if fine control over the tradeoff is desired.

Window Size             Raw Pitch Accuracy
[Samples]   [ms]        SWIPE-tiny   SWIPE-sup
16384       372         96.4%        97.2%
8192        186         96.4%        97.1%
4096        93          96.2%        96.7%
2048        46          85.0%        86.9%

Table 4. The effect of reducing the maximum window size (for a sampling rate of 44.1 kHz) at which SWIPE scores are calculated. RPA on MIR-1K is reported for SWIPE-tiny (trained on MDB-stem-synth) and SWIPE-sup.
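The window selection described in this section can be sketched as a small planning function that returns, for one pitch candidate, the window lengths to evaluate and their interpolation weights. Weighting the two power-of-two windows by the fractional part of $\log_2(W)$ and replacing both by a single capped window when a maximum is imposed are assumptions of this example; `window_plan` is a hypothetical name.

```python
import math


def window_plan(f_c, sample_rate, max_window=None):
    """Return [(window_length_in_samples, weight), ...] for pitch candidate f_c."""
    ideal = 8.0 * sample_rate / f_c          # ideal length 8 / f_c, expressed in samples
    log_w = math.log2(ideal)
    lower = 2 ** math.floor(log_w)
    upper = 2 ** math.ceil(log_w)
    if max_window is not None and upper > max_window:
        # Latency cap: one shorter window, no interpolation.
        return [(max_window, 1.0)]
    if lower == upper:
        return [(lower, 1.0)]
    frac = log_w - math.floor(log_w)         # position between the two powers of two
    return [(lower, 1.0 - frac), (upper, frac)]


lowest = window_plan(27.5, 44100)
capped = window_plan(27.5, 44100, max_window=4096)
```

For the lowest candidate at 44.1 kHz, the ideal length of roughly 12829 samples is bracketed by 8192 and 16384 samples, the latter being the largest window listed in Table 4. High candidates need only short windows and are unaffected by the cap.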
5.4 Robustness to Noise
In the final experiment, we investigate how robust various trained models are to noisy conditions by adding white noise to the input audio at decreasing signal-to-noise ratios.

             Raw Pitch Accuracy (MIR-1K)
Model        clean    5 dB     0 dB     -5 dB    -10 dB
PYIN         95.4%    95.3%    95.1%    93.7%    85.8%
SWIPE        96.2%    93.9%    91.2%    85.6%    75.2%
CQT-sup      92.2%    91.5%    89.3%    87.3%    82.3%
SWIPE-sup    93.5%    91.6%    90.0%    87.1%    72.2%
SWIPE-tiny   96.6%    96.0%    95.3%    93.4%    88.5%
PESTO        94.6%    93.3%    92.9%    90.1%    81.7%
FCNF0++      91.0%    90.3%    89.0%    83.5%    81.0%

Table 5. The effect that adding white noise at various signal-to-noise ratios to the input audio has on the raw pitch accuracy on the MIR-1K dataset for various models.

The results are shown in Table 5. SWIPE-tiny appears to be fairly robust to background noise, especially compared to the base SWIPE algorithm. The performance of the supervised models degrades somewhat quicker than that of the self-supervised ones. This is not too surprising, since invariance to added noise is an explicit training objective for the self-supervised models (see Section 3.3).
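One way to set up such a noise test is sketched below: Gaussian white noise is scaled so that the ratio of signal power to noise power matches a target SNR in dB. How the noise was generated and scaled in the experiments is not specified here, so the Gaussian distribution, the power-based SNR definition, and the function names are assumptions.

```python
import math
import random


def mean_power(samples):
    """Average of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus Gaussian white noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    p_signal, p_noise = mean_power(signal), mean_power(noise)
    if p_signal == 0.0 or p_noise == 0.0:
        return list(signal)  # nothing meaningful to scale against
    # Choose the gain so that p_signal / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(signal, noise)]


tone = [math.sin(2.0 * math.pi * 220.0 * n / 16000.0) for n in range(16000)]
noisy = add_white_noise(tone, snr_db=-5.0, seed=1)
```

At negative SNR values such as -5 dB or -10 dB, the noise carries more power than the signal itself, which explains the sharp accuracy drops in the rightmost columns of Table 5.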
6. CONCLUSION
We investigated the potential of combining the SWIPE algorithm with neural pitch estimation. We adapted established supervised and self-supervised training techniques to use SWIPE scores as an audio frontend and obtained accurate, efficient, robust and flexible pitch estimators. We demonstrated that the potential of SWIPE has been significantly underestimated in the literature despite being commonly used as a baseline. The algorithm in its original form outperforms state-of-the-art self-supervised neural pitch estimators.
In the future, we plan to explore whether the performance of the pitch estimators can be further improved by hybrid training schemes that make simultaneous use of labeled data and self-supervised training objectives. We would also like to investigate whether certain parameters of SWIPE, such as the frequency envelope or the weights of individual harmonics, can be directly learned from data.
7. ACKNOWLEDGMENTS
This work was supported by UK Research and Innovation [grant number EP/S022694/1]. The authors would like to thank the anonymous reviewers for their valuable feedback which significantly improved this paper.
8. REFERENCES
[1] A. M. Noll, “Cepstrum Pitch Determination,” The Journal of the Acoustical Society of America, vol. 41, no. 2, pp. 293–309, Feb. 1967.
[2] R. C. Maher and J. W. Beauchamp, “Fundamental frequency estimation of musical signals using a two-way mismatch procedure,” The Journal of the Acoustical Society of America, vol. 95, no. 4, pp. 2254–2263, 1994.
[3] A. Camacho and J. G. Harris, “A sawtooth waveform inspired pitch estimator for speech and music,” The Journal of the Acoustical Society of America, vol. 124, no. 3, pp. 1638–1652, Sep. 2008.
[4] S. Gonzalez and M. Brookes, “PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 518–530, Feb. 2014.
[5] A. de Cheveigné and H. Kawahara, “YIN, a Fundamental Frequency Estimator for Speech and Music,” The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[6] P. McLeod and G. Wyvill, “A smarter way to find pitch,” in International Computer Music Conference (ICMC), 2005.
[7] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “CREPE: A Convolutional Representation for Pitch Estimation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
[8] L. Ardaillon and A. Roebel, “Fully-Convolutional Network for Pitch Estimation of Speech Signals,” in Interspeech 2019, 2019.
[9] M. Morrison, C. Hsieh, N. Pruyne, and B. Pardo, “Cross-domain Neural Pitch and Periodicity Estimation,” http://arxiv.org/abs/2301.12258, Jun. 2023.
[10] X. Li, H. Huang, Y. Hu, L. He, J. Zhang, and Y. Wang, “YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch Estimation,” in Interspeech 2024, 2024.
[11] B. Gfeller, C. Frank, D. Roblek, M. Sharifi, M. Tagliasacchi, and M. Velimirović, “Pitch estimation via self-supervision,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[12] B. Gfeller, C. Frank, D. Roblek, M. Sharifi, M. Tagliasacchi, and M. Velimirović, “SPICE: Self-Supervised Pitch Estimation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 28, pp. 1118–1128, 2020.
[13] J. Engel, R. Swavely, A. Roberts, L. H. Hantrakul, and C. Hawthorne, “Self-Supervised Pitch Detection by Inverse Audio Synthesis,” Workshop on Self-Supervision in Audio and Speech at the 37th International Conference on Machine Learning (ICML 2020), 2020.
[14] A. Riou, S. Lattner, G. Hadjeres, and G. Peeters, “PESTO: Pitch estimation with self-supervised transposition-equivariant objective,” in Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR), 2023.
[15] P. Rengaswamy, M. G. Reddy, K. S. Rao, and P. Dasgupta, “hf0: A Hybrid Pitch Extraction Method for Multimodal Voice,” Circuits, Systems, and Signal Processing, vol. 40, no. 1, pp. 262–275, Jan. 2021.
[16] E. S. Hassan, B. Neyazi, H. S. Seddeq, A. Z. Mahmoud, A. S. Oshaba, A. El-Emary, and F. E. Abd El-Samie, “HAEPF: Hybrid approach for estimating pitch frequency in the presence of reverberation,” Multimedia Tools and Applications, vol. 83, no. 32, pp. 77489–77508, Feb. 2024.
[17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, 2019, pp. 8024–8035.
[18] B. R. Glasberg and B. C. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hearing Research, vol. 47, no. 1, pp. 103–138, 1990.
[19] T. Yoshimura, T. Fujimoto, K. Oura, and K. Tokuda, “SPTK4: An open-source software toolkit for speech signal processing,” in 12th ISCA Speech Synthesis Workshop (SSW 2023), 2023, pp. 211–217.
[20] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” in Proceedings of the 14th Python in Science Conference, 2015.
[21] J. Salamon, R. M. Bittner, J. Bonada, J. J. Bosch, E. Gómez, and J. P. Bello, “An analysis/synthesis framework for automatic f0 annotation of multitrack datasets,” in Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), 2017.
[22] G. Pirker, M. Wohlmayr, S. Petrik, and F. Pernkopf, “A pitch tracking corpus with evaluation on multipitch tracking scenario,” in Interspeech, 2011, pp. 1509–1512.
[23] C.-L. Hsu and J.-S. R. Jang, “On the improvement of singing voice separation for monaural recordings using the mir-1k dataset,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 310–319, 2010.
[24] M. Mauch and S. Dixon, “pYIN: A fundamental frequency estimator using probabilistic threshold distributions,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 659–663.
[25] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel, “Mir_eval: A transparent implementation of common mir metrics,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014.
[26] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations (ICLR), 2015.