TUNING MATTERS: ANALYZING MUSICAL TUNING BIAS IN NEURAL
VOCODERS
Hans-Ulrich Berendes, Ben Maman, Meinard Müller
International Audio Laboratories Erlangen
{hans-ulrich.berendes, ben.maman, meinard.mueller}@audiolabs-erlangen.de
ABSTRACT
Vocoders, which reconstruct time-domain waveforms from spectral representations such as mel-spectrograms, are essential in modern music and speech synthesis. Traditional signal-processing techniques like the Griffin-Lim algorithm have largely been replaced by neural vocoders, which leverage generative models to achieve superior audio quality. However, these models can introduce artifacts and biases, potentially affecting their output in unforeseen ways. In this study, we examine how different musical tunings affect neural mel-to-audio vocoders within the context of Western music, where performances do not necessarily adhere to the modern 440 Hz standard tuning. As a key contribution, we evaluate several recent neural vocoders on datasets containing piano, violin, and singing voice recordings. Our results reveal that different vocoders exhibit distinct biases, causing deviation in tuning and affecting waveform reconstruction quality in the case of non-standard tuning. Our work underscores the need for improved vocoder robustness in music synthesis and provides insights for refining future models.
1. INTRODUCTION
Recent advances in speech and music synthesis often follow a two-stage approach: An initial acoustic model generates an intermediate spectral representation, from which a second model, frequently referred to as a vocoder, reconstructs a time-domain waveform [1–4]. A common choice for this intermediate representation is a mel-spectrogram. While traditional signal-processing methods can reconstruct waveforms from mel-spectrograms, their quality depends on the spectral dimensionality. Recent deep learning-based generative models, such as Generative Adversarial Networks (GANs) [5] or Diffusion models [6], achieve high-fidelity reconstruction even from low-dimensional representations. Although historically used only in speech transmission, the term "Vocoder" has recently been adopted for general spectrogram-to-audio models [7,8].
© H.-U. Berendes, B. Maman, and M. Müller. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: H.-U. Berendes, B. Maman, and M. Müller, "Tuning Matters: Analyzing Musical Tuning Bias in Neural Vocoders", in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
[Figure 1: scatter plot with marginal histograms. Horizontal axis: Reference Tuning [Hz], 430 to 450. Vertical axis: Vocoded Tuning [Hz], 430 to 450. Legend: Recordings, Ideal Vocoder.]
Figure 1: Scatter plot of estimated tuning from vocoded recordings (vertical axis, vocoder from [1]) compared to original audio tuning (horizontal axis). Marginal distributions are shown as histogram plots with a Gaussian kernel density estimation (red and blue line). For an ideal vocoder, the blue and red distributions would align.
This two-stage approach factorizes the problem of audio synthesis, enabling different modeling techniques for each stage. Vocoders can be trained in a self-supervised manner on large amounts of data, enabling researchers in music or speech synthesis to rely on pre-trained vocoders. However, vocoders may introduce artifacts or biases, particularly in case of domain shift, such as unseen musical instruments. While recent works in music synthesis have moved towards more musically informed spectrogram generation models (such as enabling fine-grained control over timbre and pitch [2,3]), the impact of vocoders on the musical characteristics of the output remains underexplored. In particular, one musically important but often overlooked aspect is the influence that different musical tunings have on signal reconstruction.
Tuning plays a fundamental role in music and varies across styles and traditions. In Western music, tuning typically refers to the frequency of a reference pitch, from which the frequencies of all other pitches can be derived. While A4 = 440 Hz is the modern standard tuning [9], real-world recordings often exhibit deviations due to historical reasons or artistic choices [10].
This paper aims to explore how musical tuning affects vocoder performance. A key contribution is our analysis of tuning preservation during waveform reconstruction from mel-spectrograms of real recordings, revealing systematic biases in certain vocoders. Our evaluation compares multiple neural vocoders and a signal-processing baseline across three diverse datasets: piano, violin, and singing with piano accompaniment. Figure 1 highlights one of our findings, showing how a specific vocoder introduces tuning bias, leading to a mismatch between the tuning distributions of the original and reconstructed recordings. As a further contribution, we conduct a listening test to assess how non-standard tuning affects the perceived quality of vocoded audio.
2. BACKGROUND
2.1 Mel-Spectrogram Inversion
Mel-spectrogram computation involves two main stages, both of which can lead to information loss. First, the magnitude short-time Fourier transform (STFT) is computed, discarding phase information. Second, a mel filter bank is applied to the magnitude-STFT, typically reducing frequency resolution.
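To give a rough sense of scale for this loss of frequency resolution, the following sketch expresses the spacing of linear STFT bins in Cents. It is an illustration only and not part of the evaluation: the sampling rate of 16 kHz and the FFT size of 640 samples are assumed example values (the FFT size is assumed equal to the window length), and the helper name `bin_spacing_cents` is hypothetical.

```python
import math


def bin_spacing_cents(sample_rate_hz, n_fft, freq_hz):
    """Width of one linear STFT bin, expressed in cents at freq_hz."""
    bin_hz = sample_rate_hz / n_fft  # spacing between adjacent STFT bins in Hz
    return 1200.0 * math.log2((freq_hz + bin_hz) / freq_hz)


# Assumed example values: 16 kHz audio and an FFT size of 640 samples.
spacing_a4 = bin_spacing_cents(16_000, 640, 440.0)
spacing_a6 = bin_spacing_cents(16_000, 640, 1760.0)
print(f"{spacing_a4:.1f} cents at 440 Hz, {spacing_a6:.1f} cents at 1760 Hz")
```

Under these assumptions, a single bin near A4 spans close to a semitone, so a deviation of a few Cents is not visible as a shift between bins; it is only encoded in how energy is distributed across neighboring bins, before the mel filter bank reduces the resolution further.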
A signal processing-based approach for mel-spectrogram inversion is to first estimate the magnitude-STFT, often via a pseudo-inverse, using Non-Negative Least Squares (NNLS) [11], and then reconstruct the waveform by estimating the phase, typically using the Griffin-Lim algorithm [12]. The quality of the reconstructed waveform depends heavily on the spectral resolution of both the STFT and mel-spectrogram.
In contrast, neural vocoders are able to synthesize high-quality audio from mel-spectrograms with a lower spectral dimensionality, making them more practical for audio synthesis. Early neural vocoders focused on speech and often failed to generalize to unseen domains such as new speakers or musical instruments. More recently, "universal" neural vocoders have emerged that can robustly handle diverse audio sources, including complex musical signals [5,13].
For example, Hawthorne et al. [1] train a GAN-based vocoder on 16,000 hours of music data, building upon SoundStream [14] and SEANet [15]. This vocoder is widely used in music synthesis [1–3, 16, 17]. Similarly, BigVGAN [5], originally trained on speech data, has been extended by BigVGAN-V2¹ with a broader training set including music and environmental sounds, enabling more robust performance across domains. Despite their impressive audio quality, neural vocoders are sensitive to their training data. As a result, they may struggle with out-of-distribution inputs, such as unfamiliar instruments or non-standard musical tunings.
Previous studies have found that many mel spectrogram inversion models produce waveforms with locally unstable pitch when applied to music [18, 19]. However, our study takes a broader perspective by examining tuning as a global statistic over entire recordings, distinguishing it from local pitch fluctuations, as discussed in the next section.

¹ https://github.com/NVIDIA/BigVGAN

[Figure 2: histograms for BPSD, SWD, and VE. Lower horizontal axis: Tuning [Cents], −50 to 50. Upper horizontal axis: Reference A4 [Hz], 430 to 450. Vertical axis: Norm. Num. Occurrences, 0.00 to 0.07.]
Figure 2: Distribution of tuning (5-Cent resolution) per recording for the three investigated datasets.
2.2 Tuning and Tuning Estimation
In the context of the 12-tone equal temperament system in Western music, tuning can be characterized by the frequency of a given reference pitch, often called concert pitch. The modern tuning standard was established in 1975 in ISO16, defining the reference pitch to be 440 Hz for the note A4 [9]. However, this has been subject to change over time, and even today, it is by no means a universally applied standard. To illustrate this, Figure 2 shows the distribution of tuning values per recording for the three datasets used in this paper, which we introduce in Section 3. All three tuning distributions peak around 440 Hz, with a slight tendency toward higher values. However, we can also see considerable variations for some recordings, going as low as 430 Hz. Studies have found that only 50% of Western classical music recordings fall into the tuning range 440–443 Hz [20], underscoring the natural diversity in musical tuning. Qin and Lerch [21] found that tuning can be a confounding variable for music classification algorithms, highlighting the potential impact of tuning on deep-learning-based models.
Tuning can also be expressed as deviation in Cents from A4 = 440 Hz, where one semitone equals 100 Cents, as shown by the two x-axes in Figure 2. In this context, tuning estimation is the task of finding the concert pitch frequency, or equivalently, the deviation from the standard 440 Hz pitch. Due to the importance of tuning in music, many different approaches have been developed for tuning estimation of full performances [20, 22–24]. Most approaches define tuning as a circular offset from a reference pitch (typically A4 = 440 Hz) within ±50 Cents, since a deviation of more than ±50 Cents is indistinguishable from a transposition to the next semitone. For example, an A that is 60 Cents flat (lower) is indistinguishable from a G# that is 40 Cents sharp (higher). For recordings deviating by more than ±50 Cents, the estimated tuning therefore "wraps around" to the opposite side.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

To ensure a robust and unbiased evaluation, we employ two independent tuning estimation methods. This redundancy allows us to cross-validate results and account for potential inaccuracies or method-specific biases in tuning estimation. Both methods operate with a 1-Cent resolution.
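The relation between a reference frequency in Hz and a tuning value in Cents, together with the wrap-around behaviour, can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and the half-open range [−50, 50) is an assumed convention for resolving the ambiguity at exactly ±50 Cents.

```python
import math

A4_REF_HZ = 440.0  # modern standard tuning


def hz_to_cents(a4_hz, ref_hz=A4_REF_HZ):
    """Deviation of a concert pitch from the reference, in cents."""
    return 1200.0 * math.log2(a4_hz / ref_hz)


def cents_to_hz(cents, ref_hz=A4_REF_HZ):
    """Concert pitch in Hz for a given deviation in cents."""
    return ref_hz * 2.0 ** (cents / 1200.0)


def wrap_cents(cents):
    """Fold any deviation onto the circular range [-50, 50)."""
    return (cents + 50.0) % 100.0 - 50.0


print(round(hz_to_cents(443.0), 2))  # 443 Hz is roughly 12 cents sharp
print(wrap_cents(-60.0))             # 60 cents flat wraps to 40 cents sharp
```

The second call reproduces the example above: an A that is 60 Cents flat is reported as 40 Cents sharp relative to the neighboring semitone.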
The first tuning estimation method, implemented in the LibROSA package [25], follows a two-stage process. First, an STFT is computed, and frequency peaks are identified and refined using parabolic interpolation as described in [26]. In the second stage, a frequency histogram corresponding to tuning values is constructed by mapping the interpolated frequencies to the range ±50 Cents using a modulo operation. The tuning value with the highest count in the histogram is then selected. We denote this approach FreqHist. The second tuning estimation method is implemented in the LibFMP package [27] and differs mostly in the first stage. Rather than frequency interpolation, an STFT with a large window size is used to obtain the necessary frequency resolution. The STFT is averaged over time and the resulting frequency is converted to a Cent-scale with 1-Cent resolution using cubic interpolation. The resulting distribution is compared with a set of comb-like template vectors, each representing a specific tuning within ±50 Cents. The final estimate is given by the template that maximizes the correlation with the distribution. We denote this technique TempMatch.
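The second stage of the histogram-based approach can be sketched in a few lines. The code below is a simplified stand-in, not the LibROSA implementation: it assumes that a list of already interpolated peak frequencies is available, and the function name `estimate_tuning_cents` and its parameters are hypothetical.

```python
import math
from collections import Counter


def estimate_tuning_cents(peak_freqs_hz, ref_hz=440.0, resolution=1):
    """Most frequent deviation (in cents) of the peaks from the equal-tempered grid."""
    counts = Counter()
    for freq in peak_freqs_hz:
        cents = 1200.0 * math.log2(freq / ref_hz)
        deviation = (cents + 50.0) % 100.0 - 50.0  # modulo mapping to [-50, 50)
        bin_center = resolution * round(deviation / resolution)
        if bin_center >= 50:  # keep the histogram circular
            bin_center -= 100
        counts[bin_center] += 1
    return counts.most_common(1)[0][0]


# Synthetic peaks: notes of an A minor chord with A4 tuned to 443 Hz.
peaks = [443.0 * 2.0 ** (k / 12.0) for k in (-12, 0, 3, 7)]
print(estimate_tuning_cents(peaks))
```

Because every peak is folded onto the same ±50 Cent range, notes from different pitch classes and octaves vote for the same histogram bin, which is what makes the estimate a global statistic of the recording.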
3. EXPERIMENTAL SETUP
3.1 Datasets
We use three datasets with distinct instrumentation, including piano, singing, and violin. The Beethoven Piano Sonata Dataset (BPSD) [28] consists of 11 versions of all first movements of Beethoven's Piano Sonatas, totaling 352 recordings and approximately 40 hours. We choose a piano dataset because the discrete pitch set and stable tuning throughout a piece enable a robust tuning estimate. The BPSD in particular is well-suited for our evaluation for two reasons: First, it contains diverse recordings spanning nearly 90 years (1935–2022) from different performers, acoustic conditions, and pianos. Second, as shown in Figure 2, the dataset exhibits a wide range of tunings. The Schubert Winterreise Dataset (SWD) [29] contains nine complete recordings of the "Winterreise" song cycle for singing voice and piano, by nine different performers, totaling approximately 10.5 hours. Unlike the piano, singing voice has a continuous pitch range and the tuning is less stable, making it a valuable addition to our experiments. However, the piano in the SWD provides a stabilizing reference for the voice. The Violin Etudes (VE) dataset [30] consists of 925 monophonic violin recordings (approximately 28 hours) from YouTube. Unlike piano, violin tuning estimation can be less reliable due to continuous pitch variation. To ensure robust evaluation, we filter out recordings where the two tuning estimation methods disagree by more than 5 Cents, resulting in a final selection of 651 recordings.
[Figure 3: block diagram with the blocks Recording, Tuning estimation, shift, Pitch shifting, Pitch-shifted recording with tuning, Mel + Vocoder, and Tuning estimation.]
Figure 3: Experimental setup for a single recording $x$.
3.2 Pitch-shift Augmentation and Vocoding
Since our datasets do not cover the full range of tuning values with a sufficient number of recordings (shown in Figure 2), we use pitch shift augmentation to create a new version of our datasets with a uniform distribution in the tuning space similar to [21], using the Rubber Band Library.² Figure 3 shows the experimental setup for the applied pitch shift augmentation. For a given recording $x$, we estimate the original tuning $\tau_x$ using the FreqHist estimator. We sample a target tuning $\tau \sim U(-50, 50)$ and pitch shift $x$ by the difference $\delta = \tau - \tau_x$. This yields a modified recording $y$ with a tuning of $\tau_y = \tau$ (equality holds up to the tuning estimation error). For each recording in our datasets, we generate four pitch-shifted versions, which are subsequently downsampled to 16 kHz. While pitch shifting may introduce minor artifacts, we argue these affect perceived quality but not tuning estimation accuracy.
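The sampling step of this augmentation can be sketched as below. The sketch only plans the shifts and does not call a pitch shifter; the function name `plan_pitch_shifts`, the fixed seed, and the returned dictionary layout are assumptions for illustration.

```python
import random


def plan_pitch_shifts(original_tuning_cents, n_versions=4, seed=0):
    """Sample target tunings and derive the pitch shift needed to reach each one."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    plans = []
    for _ in range(n_versions):
        target = rng.uniform(-50.0, 50.0)             # tau ~ U(-50, 50)
        shift_cents = target - original_tuning_cents  # delta = tau - tau_x
        plans.append({
            "target_cents": target,
            "shift_cents": shift_cents,
            "shift_semitones": shift_cents / 100.0,
            "frequency_ratio": 2.0 ** (shift_cents / 1200.0),
        })
    return plans


# Example: a recording estimated at +12 cents, four augmented versions.
for plan in plan_pitch_shifts(12.0):
    print(f"{plan['target_cents']:+.1f} -> shift by {plan['shift_cents']:+.1f} cents")
```

Depending on the pitch shifter, the shift is passed either in semitones or as a frequency ratio, which is why both conversions are included.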
Next, we calculate a mel-spectrogram from $y$ and subsequently reconstruct the time-domain signal using a vocoder, producing the output $\hat{y}$. We refer to this process as vocoding $y$. The parameters of the mel-spectrogram are always chosen to fit the given vocoder, and after vocoding, each signal is downsampled back to 16 kHz. The tuning of $\hat{y}$ is then estimated, yielding $\hat{\tau}$. By comparing $\hat{\tau}$ with $\tau$, we assess the vocoder's ability to preserve tuning.
3.3 Quantitative Metrics Tuning Preservation
We introduce two metrics to evaluate tuning preservation. A straightforward approach would be to compute the difference $\hat{\tau} - \tau$. However, since our tuning estimation algorithms are circular, large errors can arise from semitone confusion. For example, if a vocoder is applied to a signal with a tuning of $\tau = 45$ Cents and it raises the tuning by 10 Cents, the estimation would return $\hat{\tau} = -45$ Cents (equivalent to +55 Cents). A simple difference would then yield $\hat{\tau} - \tau = -90$ Cents, even though the vocoder only changed the tuning by 10 Cents in this case.
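To address this, we introduce a circular difference, considering tuning estimates on a circle where $\tau = 50$ and $\tau = -50$ are equivalent. Formally, we define the circular difference between two estimates $\tau_1$ and $\tau_2$ as:

\[
\delta_{\mathrm{circ}} =
\begin{cases}
\delta + 100, & \text{if } \delta < -50 \\
\delta - 100, & \text{if } \delta > 50 \\
\delta, & \text{otherwise}
\end{cases}
\tag{1}
\]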
² https://github.com/breakfastquay/rubberband
Vocoder               Short Name  Training Data      # Param.  Fs                # Mel Bands      STFT Win. Len.  Hop Length
Hawthorne et al. [1]  HAWT        Music              15M       16 kHz            128              640             320
BigVGAN [5]           BV          Speech             112M      [22, 24] kHz      [80, 100]        1024            256
BigVGAN-V2 [31]       BV2         Music, Speech, ES  112M      [22, 24, 44] kHz  [80, 100, 128]   1024            256
NNLS & GL [12,25]     LSGL        -                  -         16 kHz            [100, 128, 150]  640             320

Table 1: Overview of the investigated vocoders. For vocoders with multiple versions, we show lists with parameters for each version (in order). For example, BigVGAN-V2 with 128 mel bands has a sampling frequency of 44 kHz. "ES" stands for environmental sounds, "NNLS&GL" for Non-Negative Least Squares & Griffin-Lim.
where $\delta = \tau_2 - \tau_1$. This guarantees $\delta_{\mathrm{circ}} \in [-50, 50]$.
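The circular difference translates directly into code. The following sketch mirrors the case distinction above and adds the mean absolute value over a set of recordings; the function names are hypothetical, and inputs are assumed to lie in [−50, 50].

```python
def circular_difference(tau_1, tau_2):
    """Circular difference between two tuning estimates in cents."""
    delta = tau_2 - tau_1
    if delta < -50:
        return delta + 100
    if delta > 50:
        return delta - 100
    return delta


def mean_abs_circular_difference(targets, estimates):
    """Average magnitude of the circular difference over paired values."""
    diffs = [abs(circular_difference(t, e)) for t, e in zip(targets, estimates)]
    return sum(diffs) / len(diffs)


# Example from the text: target 45 cents, estimate -45 cents.
print(circular_difference(45, -45))  # 10 rather than -90
```

For the semitone-confusion example above, the function returns 10 Cents, which matches the actual change introduced by the vocoder.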
While the circular difference captures the deviation between $\tau$ and $\hat{\tau}$, it does not provide insight into the statistical distribution of $\hat{\tau}$. By comparing the distributions of $\tau$ and $\hat{\tau}$ (shown in blue and red for the example in Figure 1, respectively), we quantify how strongly the tuning distribution of the vocoded audio deviates from the input distribution. To this end, we use the Wasserstein Distance (also referred to as the earth mover's distance, or EMD), which is the optimal transport cost between two probability distributions [32]. A lower Wasserstein Distance indicates greater similarity. In particular, given the circular nature of tuning estimation, we compute the Circular Wasserstein Distance (CWD), as described in [33]. Thus, in this optimal transport problem, probability mass can flow across the boundaries of the estimation range, as both ends are connected on the considered circle.
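One way to compute such a distance on binned tuning values is sketched below. It relies on the known closed form for the first-order transport cost on a circle, where the cumulative difference of the two histograms is shifted by its median before summing absolute values. The order of the distance, the 1-Cent binning, and the function name are assumptions; the implementation following [33] may differ in these details.

```python
from statistics import median


def circular_wasserstein(samples_a, samples_b, n_bins=100, lo=-50.0, hi=50.0):
    """First-order Wasserstein distance between two samples on a circle (in cents)."""
    span = hi - lo
    width = span / n_bins

    def histogram(samples):
        probs = [0.0] * n_bins
        for s in samples:
            idx = int(((s - lo) % span) // width)  # circular bin index
            probs[min(idx, n_bins - 1)] += 1.0 / len(samples)
        return probs

    p, q = histogram(samples_a), histogram(samples_b)
    cum_diff, running = [], 0.0
    for p_k, q_k in zip(p, q):
        running += p_k - q_k
        cum_diff.append(running)
    shift = median(cum_diff)  # optimal offset for transport across the boundary
    return width * sum(abs(d - shift) for d in cum_diff)


# Mass near -50 and mass near +50 are neighbors on the circle.
print(circular_wasserstein([-49.5], [49.5]))
```

In the example, the two point masses are 99 Cents apart on a line but only 1 Cent apart on the circle, and the sketch reports the shorter path.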
3.4 Vocoders
In Section 2.1, we briefly introduced the main vocoder architectures investigated in this work, for which we will use the following shorthand notations throughout the remainder of the paper: Hawthorne et al. [1] (HAWT), BigVGAN [5] (BV), BigVGAN-V2 [31] (BV2), and the signal-processing-based approach of NNLS [11] followed by Griffin-Lim [12] (LSGL). Note that the vocoder by Hawthorne et al. is ambiguously also referred to as SoundStream in the literature [2, 3].
Table 1 gives an overview of the investigated vocoders. HAWT has only a single version, with 128 mel bands. Multiple versions exist for BV and BV2, which differ in the number of mel bands and sampling frequency Fs. For LSGL we use three different numbers of mel bands, with the same underlying STFT properties. In our results, we identify each vocoder by its short name and the number of mel bands. For instance, BV-80 refers to the BigVGAN model with 80 mel bands and a 22 kHz sampling rate. This naming convention is unambiguous, as variants of the same vocoder with different sampling rates also have distinct numbers of mel bands.
4. RESULTS TUNING PRESERVATION
4.1 Quantitative Results
Figure 4a presents the mean absolute $\delta_{\mathrm{circ}}$ for all tested vocoders, where distinct colors represent the datasets, and the color shade indicates the tuning estimator. In addition to the tested vocoders, we include metrics for ground truth audio (GT in the figure), where we compare the tuning estimate $\tau_y$ of the pitch-shifted audio with the target tuning $\tau$. For GT, we can see that tuning estimation aligns with pitch-shifting, meaning the estimated tuning of a pitch-shifted version differs on average by no more than 2 Cents from the target tuning.

[Figure 4: two bar charts over HAWT, BV 80, BV 100, BV2 80, BV2 100, BV2 128, LSGL 100, LSGL 128, LSGL 150, and GT, with vertical axes from 0 to 20. (a) Mean absolute $\delta_{\mathrm{circ}}$ [Cents]. (b) Circular Wasserstein Distance (CWD). Legend (Dataset & Tuning Est. Method): BPSD-TempMatch, SWD-TempMatch, VE-TempMatch, BPSD-FreqHist, SWD-FreqHist, VE-FreqHist.]
Figure 4: Evaluation metrics for each vocoder, dataset, and tuning estimation method.
As a first and central observation, we see that most neural vocoders introduce tuning deviation, whereas the signal-processing-based LSGL generally shows a lower circular difference. However, BV2-100 and BV2-128 are an exception to this. We also observe that a higher number of mel bands for LSGL and BV2 leads to less tuning deviation, reaching values only marginally above the tuning estimation inconsistencies for a high number of mel bands. When comparing neural vocoders, HAWT exhibits the overall highest circular difference, reaching values up to $\delta_{\mathrm{circ}} = 18.7$ Cents. BV2 shows lower values on average compared to the original BV. From further observing Figure 4a, the different datasets seem to have an impact on the tuning preservation of some vocoders. For instance, BV and HAWT show notably higher tuning deviations for SWD compared to other datasets. The VE dataset is least affected by tuning deviations for all neural vocoders, possibly due to its continuous pitch nature, since it includes violin only. However, this dataset seems to be difficult for LSGL. Comparing the results of the two tuning estimation methods, we observe similar trends despite some variations for specific vocoder-dataset combinations.

[Figure 5: grid of scatter plots with marginal histograms. Rows: HAWT, BV2-80. Columns: BPSD, SWD, VE. Both axes in Cents, −50 to 50.]
Figure 5: Tuning estimates of vocoded audio ($\hat{\tau}$) over input audio ($\tau$) on all three datasets for vocoders HAWT and BV2-80. Marginal distributions are shown as histograms with a moving average smoothing of width five applied and a Gaussian kernel density estimate (line). Tunings were estimated with the TempMatch estimator.
Figure 4b shows the CWD for all vocoders, datasets, and tuning estimation methods. We observe a strong correlation with $\delta_{\mathrm{circ}}$ from Figure 4a. A high CWD indicates that the tuning of the vocoder output follows a distribution different from that of the uniformly distributed pitch-shift augmented datasets, suggesting that, in general, when a vocoder introduces tuning deviations, these deviations follow a non-uniform distribution.
4.2 Qualitative Results
[Figure 6: two scatter plots, (a) BV2-128 and (b) LSGL-150. Both axes in Cents, −50 to 50.]
Figure 6: Tuning estimates of vocoded audio ($\hat{\tau}$) over input audio ($\tau$) for BV2-128 and LSGL-150 on the SWD.
To better understand the tuning changes introduced by the vocoders, we analyze two example vocoders in more detail: HAWT and BV2-80, as these are among the most commonly used for music data in the literature [1–3, 17, 34, 35]. Figure 5 shows a scatter plot of all vocoded tunings $\hat{\tau}$ over input tunings $\tau$ for all three datasets, alongside marginal distributions with a Gaussian kernel density estimation (KDE). Note that we account for the circular continuity of the tuning estimates to calculate the KDE.
As indicated by our quantitative analysis, the output distribution of $\hat{\tau}$ is heavily altered from the uniform input distribution in almost all cases. For both vocoders, a cluster around $\hat{\tau} = 0$ is evident, corresponding to A4 = 440 Hz, though the actual peak is slightly above $\hat{\tau} = 0$. This could reflect a bias in the training data, as many recordings are tuned between 440 and 443 Hz, as discussed in Section 2.2. This effect is similar for the BPSD and the SWD, but less pronounced for the VE. Figure 6 shows the same scatter plot for BV2-128 and LSGL-150 on SWD, where output tuning closely follows input tuning, consistent with the low tuning deviation metrics discussed earlier. The bias observed in Figure 5 appears to vanish for the higher resolution BV2 model (which has more mel bands and a higher sampling rate). This is potentially due to the reduced information loss in the mel spectrograms, making accurate reconstruction an easier task. Additional figures for all datasets, vocoders, and tuning estimator combinations are available on our website.³
³ https://www.audiolabs-erlangen.de/resources/2025-ISMIR-VocoderTuningEstimation
5. LISTENING TEST
While we showed that vocoders may not preserve tuning, we did not examine whether this affects perceived quality for non-standard tuning. In principle, tuning deviation and output quality could be orthogonal; a vocoder might alter tuning yet still produce high-quality audio.
To investigate this question, we conducted a listening test, focusing on the BPSD due to its diverse original tuning. Instead of comparing samples from different vocoders, we compare only vocoded samples of the same excerpt under different tunings from a single vocoder. This goal introduces two challenges for the listening test design. First, we require test items with identical original quality but different tunings. A potential solution is pitch shifting, which can robustly replicate a specific tuning (Section 4.1, Figure 4a), but here, the introduction of small artifacts might impact the perceived quality. Second, the commonly used MUSHRA test [36] is unsuitable for comparing items that differ in tuning from the reference, as pitch differences would directly influence the listeners' judgment. We address both issues in our test design.
The test follows an AB format without a reference: Participants compare two excerpts and choose the one with the better-perceived quality, having a "no preference" option as well. As a first step, we select four BPSD recordings with diverse original tunings of −42, −11, 0, and 34 Cents and pitch-shift each one three times to replicate the other three tunings, yielding 16 test items (four per tuning). In order to analyze tuning bias within each vocoder, we do not compare across vocoders but rather select one set of item pairs that is then vocoded and presented independently for each vocoder. Each original (non-pitch-shifted) recording is paired with its three pitch-shifted versions, ensuring each tuning is tested against the original. For example, the item with an original tuning of −42 Cents is paired with its pitch-shifted versions with tunings −11, 0, and 34 Cents. In this example, if a listener prefers the 0-Cent tuning over the original −42-Cent tuning, even though the 0-Cent version was obtained through pitch-shifting, this can indicate that tuning affects quality more strongly than pitch shifting. In total, this yields 12 item pairs per vocoder, and we test four vocoders: HAWT, BV2-80, BV2-128, and LSGL-150.
We split the items into two separate listening tests with 24 pairs each, to limit the test duration per listener. Additionally, we include four control pairs for both subgroups with identical items, where attentive listeners should indicate "no preference". We exclude listeners who indicate a preference for control item pairs more than once.
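The pairing scheme and the resulting counts can be checked with a short enumeration. This is an illustrative sketch: the split into two halves by list order is arbitrary and not necessarily how the two subsets were formed, and all names are hypothetical.

```python
from itertools import product

TUNINGS_CENTS = [-42, -11, 0, 34]
VOCODERS = ["HAWT", "BV2-80", "BV2-128", "LSGL-150"]


def build_item_pairs(tunings=TUNINGS_CENTS):
    """Pair each original recording with its three pitch-shifted versions."""
    pairs = []
    for original in tunings:
        for shifted in tunings:
            if shifted != original:
                pairs.append((original, shifted))  # (original tuning, shifted tuning)
    return pairs


item_pairs = build_item_pairs()
all_trials = list(product(VOCODERS, item_pairs))

# Arbitrary split into two listening tests of equal size.
half = len(all_trials) // 2
test_a, test_b = all_trials[:half], all_trials[half:]
print(len(item_pairs), len(all_trials), len(test_a), len(test_b))
```

Four originals with three shifted versions each give 12 pairs per vocoder, and four vocoders give 48 pairs in total, consistent with two tests of 24 pairs.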
5.1 Results
In total, 25 participants took part in our listening test, 19 male and 6 female, with a median age of 27, ranging from 21 to 58. Among them, 20 participants had some prior experience with listening tests. A total of 5 listeners did not meet the post-screening criterion, leaving 20 listeners distributed evenly among the two item subsets. For each tuning value, we aggregate the number of times it was preferred. If tuning had no impact on quality, we would expect either a tendency towards "no preference" votes, or an approximately equal distribution of preferences across tuning values.

[Figure 7: four bar charts (HAWT, BV2 80, BV2 128, LSGL 150). Horizontal axis: Tuning [Cents] with values −42, −11, 0, 34. Vertical axis: Preference [%], 0 to 50. Legend: Preference, No Preference.]
Figure 7: Listening test results: listeners' preference towards tuning values as a percentage of total votes for each vocoder individually. "No preference" votes are split evenly between both excerpts in a pair, e.g., a "no preference" vote between 0 and −11 counts as half a vote in the red bar for each, meaning that bars sum up to 100% for each vocoder.
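The vote counting described in the caption of Figure 7 could be sketched as follows. The vote format (a tuple of both tunings and the chosen tuning, with `None` for "no preference") and the function name are assumptions made for this illustration, and the example votes are invented.

```python
from collections import defaultdict


def preference_percentages(votes):
    """Aggregate AB votes per tuning; 'no preference' counts half for each item."""
    preferred = defaultdict(float)
    no_preference = defaultdict(float)
    tunings = set()
    for tuning_a, tuning_b, choice in votes:
        tunings.update((tuning_a, tuning_b))
        if choice is None:  # "no preference": split the vote evenly
            no_preference[tuning_a] += 0.5
            no_preference[tuning_b] += 0.5
        else:
            preferred[choice] += 1.0
    total = len(votes)
    return {
        t: {
            "preference": 100.0 * preferred[t] / total,
            "no_preference": 100.0 * no_preference[t] / total,
        }
        for t in sorted(tunings)
    }


# Invented example votes for a single vocoder.
example_votes = [(-42, 0, 0), (-42, -11, None), (0, 34, 0), (0, -11, None)]
shares = preference_percentages(example_votes)
print(shares)
```

Because every vote contributes exactly one unit in total, the stacked bars over all tuning values add up to 100% per vocoder.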
Figure 7 illustrates this percentage of preference votes for each tuning value and tested vocoder. For HAWT and BV2-80 we can see a trend: The tunings −42 and +34 Cents receive fewer preference votes compared to the middle values. In contrast, BV2-128 does not exhibit a strong trend, while LSGL-150 shows a generally lower number of preference votes, suggesting that listeners perceived fewer quality differences compared to the neural vocoders.
When aggregating preferences into groups of "original" and "pitch-shifted", listeners show a slight preference towards the original items for the neural vocoders, indicating that pitch-shifting also has a negative influence on quality (see supplementary website). Therefore, fully disentangling the effects of pitch shifting and tuning in the listening test remains challenging. However, due to our test design, always comparing the vocoded original recordings with their vocoded pitch-shifted counterpart, Figure 7 still shows a meaningful trend.
Overall, the results indicate that vocoders which show a bias in tuning preservation (HAWT and BV2-80) also show a decrease in quality when reconstructing signals with out-of-distribution tuning.
6. CONCLUSIONS
In this study, we investigated how musical tuning affects neural vocoders, focusing mainly on tuning preservation. Our findings reveal that vocoders can significantly alter both individual tunings and overall tuning distributions, with some exhibiting a bias towards modern standard tuning. Additionally, our listening test suggests a decline in reconstruction quality for signals with non-standard tunings when processed by a vocoder with tuning bias.
Our work underscores the importance of tuning in music generation and vocoder design. Future work should focus on mitigating tuning biases during vocoder training. Our evaluation approach, based on pitch shifting and quantitative evaluation metrics, gives researchers a straightforward yet effective method for assessing tuning robustness in music vocoders.
7. ACKNOWLEDGEMENTS
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant No. 350953655 (MU 2686/11-2) and Grant No. 500643750 (MU 2686/15-1). The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS.
8. REFERENCES
[1] C. Haw ho ne, I. Simon, A. Robe s, N. Zeghidou ,
J. Ga dne , E. Manilow, and J. H. Engel, “Mul i-
ins umen music syn hesis wi h spec og am di u-
sion,” in P oceedings o he In e na ional Socie y
o Music In o ma ion Re ie al Con e ence (ISMIR),
2022, pp. 598–607.
[2] B. Maman, J. Zei le , M. Mülle , and A. H. Be mano,
“Pe o mance condi ioning o di usion-based mul i-
ins umen music syn hesis,” in P oceedings o he
IEEE In e na ional Con e ence on Acous ics, Speech,
and Signal P ocessing (ICASSP), Seoul, Sou h Ko ea,
2024, pp. 5045–5049.
[3] D. Kim, H.-W. Dong, and D. Jeong, “Violindi : En-
hancing exp essi e iolin syn hesis wi h pi ch bend
condi ioning,” in P oceedings o he IEEE In e na-
ional Con e ence on Acous ics, Speech, and Signal
P ocessing (ICASSP), Hyde abad, India, 2025, pp. 1–
5.
[4] J. Shen, R. Pang, R. J. Weiss, M. Schus e , N. Jai ly,
Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R.-S. Ryan,
R. A. Sau ous, Y. Agiomy giannakis, and Y. Wu, “Na -
u al TTS syn hesis by condi ioning Wa eNe on MEL
spec og am p edic ions,” in P oceedings o he IEEE
In e na ional Con e ence on Acous ics, Speech, and
Signal P ocessing (ICASSP), Calga y, Canada, 2018,
pp. 4779–4783.
[5] S. gil Lee, W. Ping, B. Ginsbu g, B. Ca anza o, and
S. Yoon, “BigVGAN: A uni e sal neu al ocode
wi h la ge-scale aining,” in P oceedings o he In-
e na ional Con e ence on Lea ning Rep esen a ions
(ICLR), Kigali, Rwanda, 2023.
[6] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Ca an-
za o, “Di wa e: A e sa ile di usion model o audio
syn hesis,” in P oceedings o he In e na ional Con-
e ence on Lea ning Rep esen a ions, ICLR, Vi ual,
2021.
[7] H. Dudley, “Remaking Speech,” The Jou nal o he
Acous ical Socie y o Ame ica, ol. 11, no. 2, pp. 169–
177, 1939.
[8] A. Mus a a, N. Pia, and G. Fuchs, “S yleMelGAN: An
e icien high- ideli y ad e sa ial ocode wi h empo-
al adap i e no maliza ion,” in P oceedings o he IEEE
In e na ional Con e ence on Acous ics, Speech, and
Signal P ocessing (ICASSP), To on o, Canada, 2021.
[9] ISO, “Acous ics – s anda d uning equency (s anda d
musical pi ch),” ISO16:1975, 1975.
[10] F. G ibenski, Tuning he Wo ld: The Rise o 440
He z in Music, Science, and Poli ics, 1859–1955,
se . New Ma e ial His o ies o Music. Uni e si y
o Chicago P ess, 2023. [Online]. A ailable: h ps:
//books.google.de/books?id=VJKpEAAAQBAJ
[11] C. L. Lawson and R. J. Hanson, Sol ing leas squa es
p oblems. Socie y o Indus ial and Applied Ma he-
ma ics, 1995.
[12] D. W. G i in and J. S. Lim, “Signal es ima ion om
modi ied sho - ime Fou ie ans o m,” IEEE T ans-
ac ions on Acous ics, Speech, and Signal P ocessing,
ol. 32, no. 2, pp. 236–243, 1984.
[13] J. Kong, J. Kim, and J. Bae, “Hi i-gan: Gene a i e ad-
e sa ial ne wo ks o e icien and high ideli y speech
syn hesis,” in Ad ances in Neu al In o ma ion P ocess-
ing Sys ems, H. La ochelle, M. Ranza o, R. Hadsell,
M. Balcan, and H. Lin, Eds., i ual, 2020.
[14] N. Zeghidou , A. Luebs, A. Om an, J. Skoglund, and
M. Tagliasacchi, “Sounds eam: An end- o-end neu-
al audio codec,” IEEE/ACM T ansac ions on Audio,
Speech and Language P ocessing, ol. 30, pp. 495–
507, 2022.
[15] M. Tagliasacchi, Y. Li, K. Misiunas, and D. Roblek,
“SEAne : A mul i-modal speech enhancemen ne -
wo k,” in P oceedings o he Annual Con e ence o he
In e na ional Speech Communica ion Associa ion, (In-
e speech), H. Meng, B. Xu, and T. F. Zheng, Eds.,
Shanghai, China, 2020, pp. 1126–1130.
[16] H. Kim, S. Choi, and J. Nam, “Exp essi e acous ic gui-
a sound syn hesis wi h an ins umen -speci ic inpu
ep esen a ion and di usion ou pain ing,” in P oceed-
ings o he IEEE In e na ional Con e ence on Acous-
ics, Speech and Signal P ocessing, ICASSP, Seoul,
Republic o Ko ea, 2024, pp. 7620–7624.
[17] B. Maman, J. Zei le , M. Mülle , and A. H. Be mano,
“Mul i-aspec condi ioning o di usion-based music
syn hesis: Enhancing ealism and acous ic con ol,”
IEEE T ansac ions on Audio, Speech and Language
P ocessing, ol. 33, pp. 68–81, 2025.
[18] B. D. Giorgi, M. Levy, and R. Sharp, “Mel spectrogram
inversion with stable pitch,” in Proceedings of the
International Society for Music Information Retrieval
Conference, ISMIR, Bengaluru, India, 2022, pp. 233–239.
[19] J. H. Engel, K. K. Agrawal, S. Chen, I. Gulrajani,
C. Donahue, and A. Roberts, “Gansynth: Adversarial
neural audio synthesis,” in International Conference
on Learning Representations, ICLR, New Orleans, LA,
USA, 2019.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[20] A. Lerch, “On the requirement of automatic tuning
frequency estimation,” in Proceedings of the International
Society for Music Information Retrieval Conference
(ISMIR), Victoria, Canada, 2006, pp. 212–215.
[21] Y. Qin and A. Lerch, “Tuning frequency dependency
in music classification,” in IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019, pp. 401–405.
[22] A. Degani, M. Dalai, R. Leonardi, and P. Migliorati,
“Comparison of tuning frequency estimation methods,”
Multimedia Tools and Applications, vol. 74,
no. 15, pp. 5917–5934, Aug. 2015.
[23] K. Dressler and S. Streich, “Tuning frequency estimation
using circular statistics,” in Proceedings of the
International Society for Music Information Retrieval
Conference (ISMIR), Vienna, Austria, 2007, pp. 357–360.
[24] V. Gnann, M. Kitza, J. Becker, and M. Spiertz,
“Least-squares local tuning frequency estimation for choir
music,” in Proceedings of the Audio Engineering Society
(AES) Convention, New York City, New York, USA,
2011.
[25] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar,
E. Battenberg, and O. Nieto, “Librosa: Audio and
music signal analysis in Python,” in Proceedings of the
Python in Science Conference, Austin, Texas, USA,
2015, pp. 18–25.
[26] A. de Cheveigné and H. Kawahara, “YIN, a fundamental
frequency estimator for speech and music,” Journal
of the Acoustical Society of America (JASA), vol. 111,
no. 4, pp. 1917–1930, 2002.
[27] M. Müller and F. Zalkow, “libfmp: A Python package
for fundamentals of music processing,” Journal
of Open Source Software (JOSS), vol. 6, no. 63, pp.
3326:1–5, 2021.
[28] J. Zeitler, C. Weiß, V. Arifi-Müller, and M. Müller,
“BPSD: A coherent multi-version dataset for analyzing
the first movements of Beethoven’s piano sonatas,”
Transactions of the International Society for Music
Information Retrieval, vol. 7, no. 1, pp. 195–212, 2024.
[29] C. Weiß, F. Zalkow, V. Arifi-Müller, M. Müller, H. V.
Koops, A. Volk, and H. Grohganz, “Schubert Winterreise
dataset: A multimodal scenario for music analysis,”
ACM Journal on Computing and Cultural Heritage
(JOCCH), vol. 14, no. 2, pp. 25:1–18, 2021.
[30] N. C. Tamer, P. Ramoneda, and X. Serra, “Violin
etudes: A comprehensive dataset for f0 estimation and
performance analysis,” in Proceedings of the International
Society for Music Information Retrieval Conference
(ISMIR), Bengaluru, India, 2022, pp. 517–524.
[31] S.-G. Lee, W. Ping, B. Ginsburg, B. Catanzaro,
and S. Yoon, “BigVGAN GitHub repository,”
2024. [Online]. Available:
https://github.com/NVIDIA/BigVGAN
[32] G. Peyré and M. Cuturi, “Computational optimal transport:
With applications to data science,” Foundations
and Trends in Machine Learning, vol. 11, no. 5–6, pp.
355–607, 2019.
[33] J. Delon, J. Salomon, and A. Sobolevski, “Fast Transport
Optimization for Monge Costs on the Circle,” Society
for Industrial and Applied Mathematics Journal
on Applied Mathematics, vol. 70, no. 7, pp. 2239–2258,
2010.
[34] H. Kim, S. Choi, and J. Nam, “Expressive acoustic guitar
sound synthesis with an instrument-specific input
representation and diffusion outpainting,” in ICASSP
2024 - 2024 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE,
2024, pp. 7620–7624.
[35] S. Dai, M.-Y. Liu, R. Valle, and S. Gururani,
“Expressivesinger: Multilingual and multi-style score-based
singing voice synthesis with expressive performance
control,” in Proceedings of the 32nd ACM International
Conference on Multimedia, 2024, pp. 3229–3238.
[36] International Telecommunications Union, “ITU-R
Rec. BS.1534-3: Method for the subjective assessment
of intermediate quality levels of coding systems,”
2015.