Tuning Matters: Analyzing Musical Tuning Bias in Neural Vocoders

Author: Hans-Ulrich Berendes; Ben Maman; Meinard Müller

Publisher: Zenodo

DOI: 10.5281/zenodo.17706359

Source: https://zenodo.org/records/17706359/files/000020.pdf

TUNING MATTERS: ANALYZING MUSICAL TUNING BIAS IN NEURAL
VOCODERS
Hans-Ulrich Berendes, Ben Maman, Meinard Müller
International Audio Laboratories Erlangen
{hans-ulrich.berendes, ben.maman, meinard.mueller}@audiolabs-erlangen.de
ABSTRACT
Vocoders, which reconstruct time-domain waveforms from spectral representations such as mel-spectrograms, are essential in modern music and speech synthesis. Traditional signal-processing techniques like the Griffin-Lim algorithm have largely been replaced by neural vocoders, which leverage generative models to achieve superior audio quality. However, these models can introduce artifacts and biases, potentially affecting their output in unforeseen ways. In this study, we examine how different musical tunings affect neural mel-to-audio vocoders within the context of Western music, where performances do not necessarily adhere to the modern 440 Hz standard tuning. As a key contribution, we evaluate several recent neural vocoders on datasets containing piano, violin, and singing voice recordings. Our results reveal that different vocoders exhibit distinct biases, causing deviation in tuning, and affecting waveform reconstruction quality in case of non-standard tuning. Our work underscores the need for improved vocoder robustness in music synthesis and provides insights for refining future models.
1. INTRODUCTION
Recent advances in speech and music synthesis often follow a two-stage approach: An initial acoustic model generates an intermediate spectral representation, from which a second model, frequently referred to as a vocoder, reconstructs a time-domain waveform [1–4]. A common choice for this intermediate representation is a mel-spectrogram. While traditional signal-processing methods can reconstruct waveforms from mel-spectrograms, their quality depends on the spectral dimensionality. Recent deep learning-based generative models, such as Generative Adversarial Networks (GANs) [5] or Diffusion models [6], achieve high-fidelity reconstruction even from low-dimensional representations. Although historically used only in speech transmission, the term “Vocoder” has recently been adopted for general spectrogram-to-audio models [7,8].
© H.-U. Berendes, B. Maman, and M. Müller. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: H.-U. Berendes, B. Maman, and M. Müller, “Tuning Matters: Analyzing Musical Tuning Bias in Neural Vocoders”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
[Figure 1: scatter plot with marginal histograms. Horizontal axis: Reference Tuning [Hz], 430 to 450. Vertical axis: Vocoded Tuning [Hz], 430 to 450. Legend: Recordings, Ideal Vocoder.]
Figure 1: Scatter plot of estimated tuning from vocoded recordings (vertical axis, vocoder from [1]) compared to original audio tuning (horizontal axis). Marginal distributions are shown as histogram plots with a Gaussian kernel density estimation (red and blue line). For an ideal vocoder, the blue and red distribution would align.
This two-stage approach factorizes the problem of audio synthesis, enabling different modeling techniques for each stage. Vocoders can be trained in a self-supervised manner on large amounts of data, enabling researchers in music or speech synthesis to rely on pre-trained vocoders. However, vocoders may introduce artifacts or biases, particularly in case of domain-shift, such as unseen musical instruments. While recent works in music synthesis have moved towards more musically informed spectrogram generation models (such as enabling fine-grained control over timbre and pitch [2,3]), the impact of vocoders on the musical characteristics of the output remains underexplored. In particular, one musically important but often overlooked aspect is the influence that different musical tunings have on signal reconstruction.
Tuning plays a fundamental role in music and varies across styles and traditions. In Western music, tuning typically refers to the frequency of a reference pitch, from which the frequencies of all other pitches can be derived. While A4 = 440 Hz is the modern standard tuning [9], real-world recordings often exhibit deviations due to historical reasons or artistic choices [10].
This paper aims to explore how musical tuning affects vocoder performance. A key contribution is our analysis of tuning preservation during waveform reconstruction from mel-spectrograms of real recordings, revealing systematic biases in certain vocoders. Our evaluation compares multiple neural vocoders and a signal-processing baseline across three diverse datasets: piano, violin, and singing with piano accompaniment. Figure 1 highlights one of our findings, showing how a specific vocoder introduces tuning bias, leading to a mismatch between the tuning distributions of the original and reconstructed recordings. As a further contribution, we conduct a listening test to assess how non-standard tuning affects the perceived quality of vocoded audio.
2. BACKGROUND
2.1 Mel-Spectrogram Inversion
Mel-spectrogram computation involves two main stages, both of which can lead to information loss. First, the magnitude short-time Fourier transform (STFT) is computed, discarding phase information. Second, a mel filter bank is applied to the magnitude-STFT, typically reducing frequency resolution.
A signal processing-based approach to mel-spectrogram inversion is to first estimate the magnitude-STFT, often via a pseudo-inverse, using Non-Negative Least Squares (NNLS) [11], and then reconstruct the waveform by estimating the phase, typically using the Griffin-Lim algorithm [12]. The quality of the reconstructed waveform depends heavily on the spectral resolution of both the STFT and mel-spectrogram.
In contrast, neural vocoders are able to synthesize high-quality audio from mel-spectrograms with a lower spectral dimensionality, making them more practical for audio synthesis. Early neural vocoders focused on speech and often failed to generalize to unseen domains such as new speakers or musical instruments. More recently, “universal” neural vocoders have emerged that can robustly handle diverse audio sources, including complex musical signals [5,13].
For example, Hawthorne et al. [1] train a GAN-based vocoder on 16,000 hours of music data, building upon SoundStream [14] and SEANet [15]. This vocoder is widely used in music synthesis [1–3, 16, 17]. Similarly, BigVGAN [5], originally trained on speech data, has been extended by BigVGAN-V2¹ with a broader training set including music and environmental sounds, enabling more robust performance across domains. Despite their impressive audio quality, neural vocoders are sensitive to their training data. As a result, they may struggle with out-of-distribution inputs, such as unfamiliar instruments or non-standard musical tunings.
Previous studies have found that many mel spectrogram inversion models produce waveforms with locally unstable pitch when applied to music [18, 19]. However, our study takes a broader perspective by examining tuning as a global statistic over entire recordings, distinguishing it from local pitch fluctuations, as discussed in the next section.
¹ https://github.com/NVIDIA/BigVGAN
[Figure 2: normalized histograms for the datasets BPSD, SWD, and VE. Lower horizontal axis: Tuning [Cents], −50 to 50. Upper horizontal axis: Reference A4 [Hz], 430 to 450. Vertical axis: Norm. Num. Occurrences, 0.00 to 0.07.]
Figure 2: Distribution of tuning (5-Cent resolution) per recording for the three investigated datasets.
2.2 Tuning and Tuning Estimation
In the context of the 12-tone equal temperament system in Western music, tuning can be characterized by the frequency of a given reference pitch, often called concert pitch. The modern tuning standard was established in 1975 in ISO 16, defining the reference pitch to be 440 Hz for the note A4 [9]. However, this has been subject to change over time, and even today, it is by no means a universally applied standard. To illustrate this, Figure 2 shows the distribution of tuning values per recording for the three datasets used in this paper, which we introduce in Section 3. All three tuning distributions peak around 440 Hz, with a slight tendency toward higher values. However, we can also see considerable variations for some recordings, going as low as 430 Hz. Studies have found that only 50% of Western classical music recordings fall into the tuning range 440–443 Hz [20], underscoring the natural diversity in musical tuning. Qin and Lerch [21] found that tuning can be a confounding variable for music classification algorithms, highlighting the potential impact of tuning on deep-learning-based models.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Tuning can also be expressed as deviation in Cents from A4 = 440 Hz, where one semitone equals 100 Cents, as shown by the two x-axes in Figure 2. In this context, tuning estimation is the task of finding the concert pitch frequency, or equivalently, the deviation from the standard 440 Hz pitch. Due to the importance of tuning in music, many different approaches have been developed for tuning estimation of full performances [20, 22–24]. Most approaches define tuning as a circular offset from a reference pitch (typically A4 = 440 Hz) within ±50 Cents since a deviation of more than ±50 Cents is indistinguishable from a transposition to the next semitone. For example, an A that is 60 Cents flat (lower) is indistinguishable from a G# that is 40 Cents sharp (higher). For recordings deviating by more than ±50 Cents, the estimated tuning therefore “wraps around” to the opposite side.
To ensure a robust and unbiased evaluation, we employ two independent tuning estimation methods. This redundancy allows us to cross-validate results and account for potential inaccuracies or method-specific biases in tuning estimation. Both methods operate with a 1-Cent resolution.
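The relation between a reference frequency in Hz and a tuning in Cents, including the wrap-around at ±50 Cents, can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function names are hypothetical, and treating the range as half-open, [−50, 50), is an assumption about how the boundary is handled.

```python
import math

A4_STANDARD_HZ = 440.0  # modern standard concert pitch (ISO 16)


def hz_to_cents(ref_hz, standard_hz=A4_STANDARD_HZ):
    """Deviation of a reference pitch from the standard, in Cents."""
    return 1200.0 * math.log2(ref_hz / standard_hz)


def cents_to_hz(cents, standard_hz=A4_STANDARD_HZ):
    """Reference frequency in Hz for a given deviation in Cents."""
    return standard_hz * 2.0 ** (cents / 1200.0)


def wrap_cents(cents):
    """Fold a deviation onto the circular range [-50, 50) Cents."""
    return (cents + 50.0) % 100.0 - 50.0


if __name__ == "__main__":
    # 430 Hz, the lowest tuning mentioned for the datasets, in Cents.
    print(round(hz_to_cents(430.0), 1))
    # An A that is 60 Cents flat reads as a G# that is 40 Cents sharp.
    print(wrap_cents(-60.0))
```

Under these definitions, `wrap_cents(-60.0)` evaluates to `40.0`, which mirrors the A/G# example in the text, and 430 Hz corresponds to roughly −40 Cents.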
The first tuning estimation method, implemented in the LibROSA package [25], follows a two-stage process. First, an STFT is computed, and frequency peaks are identified and refined using parabolic interpolation as described in [26]. In the second stage, a frequency histogram corresponding to tuning values is constructed by mapping the interpolated frequencies to the range ±50 Cents using a modulo operation. The tuning value with the highest count in the histogram is then selected. We denote this approach FreqHist.
The second tuning estimation method is implemented in the LibFMP package [27] and differs mostly in the first stage. Rather than frequency interpolation, an STFT with a large window size is used to obtain the necessary frequency resolution. The STFT is averaged over time and the resulting frequency is converted to a Cent-scale with 1-Cent resolution using cubic interpolation. The resulting distribution is compared with a set of comb-like template vectors, each representing a specific tuning within ±50 Cents. The final estimate is given by the template that maximizes the correlation with the distribution. We denote this technique TempMatch.
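The second stage of a histogram-based estimator such as FreqHist can be sketched as follows. This is a simplified illustration under stated assumptions, not the LibROSA implementation: the peak frequencies are assumed to be already extracted and refined, every peak counts equally, and the name `estimate_tuning_cents` is hypothetical.

```python
import math
from collections import Counter


def estimate_tuning_cents(peak_freqs_hz, resolution_cents=1, a4_hz=440.0):
    """Return the most frequent deviation from equal temperament in Cents.

    peak_freqs_hz: spectral peak frequencies, assumed to be already refined
    (e.g. by parabolic interpolation); that first stage is not shown here.
    """
    if not peak_freqs_hz:
        raise ValueError("need at least one peak frequency")
    counts = Counter()
    for freq in peak_freqs_hz:
        # Distance to A4 in Cents.
        cents = 1200.0 * math.log2(freq / a4_hz)
        # Modulo operation: deviation from the nearest equal-tempered pitch.
        deviation = (cents + 50.0) % 100.0 - 50.0
        # Quantize to the histogram resolution.
        bin_center = round(deviation / resolution_cents) * resolution_cents
        if bin_center >= 50:
            bin_center -= 100  # +50 and -50 are the same point on the circle
        counts[bin_center] += 1
    # The histogram bin with the highest count is the tuning estimate.
    return counts.most_common(1)[0][0]
```

For instance, peaks of an A major chord that is uniformly 20 Cents sharp all fall into the same bin, so the estimate is 20 even when a few unrelated peaks are present.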
3. EXPERIMENTAL SETUP
3.1 Datasets
We use three datasets with distinct instrumentation, including piano, singing, and violin. The Beethoven Piano Sonata Dataset (BPSD) [28] consists of 11 versions of all first movements of Beethoven’s Piano Sonatas, totaling 352 recordings and approximately 40 hours. We choose a piano dataset because the discrete pitch set and stable tuning throughout a piece enable a robust tuning estimate. The BPSD in particular is well-suited for our evaluation for two reasons: First, it contains diverse recordings spanning nearly 90 years (1935–2022) from different performers, acoustic conditions, and pianos. Second, as shown in Figure 2, the dataset exhibits a wide range of tunings.
The Schubert Winterreise Dataset (SWD) [29] contains nine complete recordings of the “Winterreise” song cycle for singing voice and piano, by nine different performers, totaling approximately 10.5 hours. Unlike the piano, singing voice has a continuous pitch range and the tuning is less stable, making it a valuable addition to our experiments. However, the piano in the SWD provides a stabilizing reference for the voice.
The Violin Etudes (VE) dataset [30] consists of 925 monophonic violin recordings (approximately 28 hours) from YouTube. Unlike piano, violin tuning estimation can be less reliable due to continuous pitch variation. To ensure robust evaluation, we filter out recordings where the two tuning estimation methods disagree by more than 5 Cents, resulting in a final selection of 651 recordings.
[Figure 3: block diagram with the blocks Recording, Tuning estimation, Pitch shifting (by the estimated shift), Pitch-shifted recording with tuning, Mel + Vocoder, and Tuning estimation.]
Figure 3: Experimental setup for a single recording x.
3.2 Pitch-shift Augmentation and Vocoding
Since our datasets do not cover the full range of tuning values with a sufficient number of recordings (shown in Figure 2), we use pitch shift augmentation to create a new version of our datasets with a uniform distribution in the tuning space similar to [21], using the Rubber Band Library.² Figure 3 shows the experimental setup for the applied pitch shift augmentation. For a given recording x, we estimate the original tuning τ_x using the FreqHist estimator. We sample a target tuning τ ∼ U(−50, 50) and pitch shift x by the difference δ = τ − τ_x. This yields a modified recording y with a tuning of τ_y = τ (equality holds up to the tuning estimation error). For each recording in our datasets, we generate four pitch-shifted versions, which are subsequently downsampled to 16 kHz. While pitch shifting may introduce minor artifacts, we argue these affect perceived quality but not tuning estimation accuracy.
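The sampling step of this augmentation can be sketched as follows. The sketch only plans the shifts and does not call any pitch shifter. The helper name `plan_pitch_shifts` and the conversion of δ into a frequency ratio (which many pitch shifters accept, alongside a shift in semitones, δ/100) are assumptions made for illustration.

```python
import random


def plan_pitch_shifts(tau_x, n_versions=4, rng=None):
    """Sample target tunings and derive the required pitch shifts.

    tau_x: estimated original tuning of the recording in Cents.
    Returns a list of (tau, delta, ratio) tuples, where tau is the target
    tuning, delta = tau - tau_x is the shift in Cents, and ratio is the
    equivalent frequency scaling factor.
    """
    rng = rng or random.Random()
    plans = []
    for _ in range(n_versions):
        tau = rng.uniform(-50.0, 50.0)   # target tuning, tau ~ U(-50, 50)
        delta = tau - tau_x              # required shift in Cents
        ratio = 2.0 ** (delta / 1200.0)  # 1200 Cents correspond to one octave
        plans.append((tau, delta, ratio))
    return plans


if __name__ == "__main__":
    for tau, delta, ratio in plan_pitch_shifts(12.0, rng=random.Random(0)):
        print(f"target {tau:+.1f} Cents, shift {delta:+.1f} Cents, ratio {ratio:.5f}")
```

Because |δ| never exceeds 100 Cents, the resulting frequency ratios stay within about ±6% of 1, so the shifts remain well below a whole tone.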
Next, we calculate a mel-spectrogram from y and subsequently reconstruct the time-domain signal using a vocoder, producing the output ŷ. We refer to this process as vocoding y. The parameters of the mel-spectrogram are always chosen to fit the given vocoder, and after vocoding, each signal is downsampled back to 16 kHz. The tuning of ŷ is then estimated, yielding τ̂. By comparing τ̂ with τ, we assess the vocoder’s ability to preserve tuning.
3.3 Quantitative Metrics: Tuning Preservation
We introduce two metrics to evaluate tuning preservation. A straightforward approach would be to compute the difference τ̂ − τ. However, since our tuning estimation algorithms are circular, large errors can arise from semitone confusion. For example, if a vocoder is applied to a signal with a tuning of τ = 45 Cents and it raises the tuning by 10 Cents, the estimation would return τ̂ = −45 Cents (equivalent to +55 Cents). A simple difference would then yield τ̂ − τ = −90 Cents, even though the vocoder only changed the tuning by 10 Cents in this case.
² https://github.com/breakfastquay/rubberband

Vocoder              | Short Name | Training Data     | # Param. | Fs               | # Mel Bands     | STFT Win. Len. | Hop Length
Hawthorne et al. [1] | HAWT       | Music             | 15M      | 16 kHz           | 128             | 640            | 320
BigVGAN [5]          | BV         | Speech            | 112M     | [22, 24] kHz     | [80, 100]       | 1024           | 256
BigVGAN-V2 [31]      | BV2        | Music, Speech, ES | 112M     | [22, 24, 44] kHz | [80, 100, 128]  | 1024           | 256
NNLS & GL [12,25]    | LSGL       | -                 | -        | 16 kHz           | [100, 128, 150] | 640            | 320
Table 1: Overview of investigated vocoders. For vocoders with multiple versions, we show lists with parameters for each version (in order). For example, BigVGAN-V2 with 128 mel bands has a sampling frequency of 44 kHz. “ES” stands for environmental sounds, “NNLS&GL” for Non-Negative Least Squares & Griffin-Lim.

To address this, we introduce a circular difference, considering tuning estimates on a circle where τ = 50 and τ = −50 are equivalent. Formally, we define the circular difference between two estimates τ_1 and τ_2 as:
\[
\delta_{\mathrm{circ}} =
\begin{cases}
\delta + 100, & \text{if } \delta < -50 \\
\delta - 100, & \text{if } \delta > 50 \\
\delta, & \text{otherwise}
\end{cases}
\tag{1}
\]
where δ = τ_2 − τ_1. This guarantees δ_circ ∈ [−50, 50].
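Equation (1) translates directly into code. The following sketch is illustrative only: the function names are hypothetical, τ_1 is taken to be the input tuning and τ_2 the tuning estimated after vocoding, and both are assumed to lie in [−50, 50].

```python
def circular_difference(tau_1, tau_2):
    """Circular difference in Cents between two tuning estimates, Eq. (1)."""
    delta = tau_2 - tau_1
    if delta < -50:
        return delta + 100
    if delta > 50:
        return delta - 100
    return delta


def mean_abs_circular_difference(targets, estimates):
    """Mean absolute circular difference over paired tuning values."""
    if len(targets) != len(estimates) or not targets:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(circular_difference(t, e)) for t, e in zip(targets, estimates))
    return total / len(targets)
```

For the example from the text, `circular_difference(45, -45)` gives 10 rather than the −90 of the plain difference.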
While the circular difference captures the deviation between τ and τ̂, it does not provide insight into the statistical distribution of τ̂. By comparing the distributions of τ and τ̂ (shown in blue and red for the example in Figure 1, respectively), we quantify how strongly the tuning distribution of the vocoded audio deviates from the input distribution. To this end, we use the Wasserstein Distance (also referred to as the earth mover’s distance, or EMD), which is the optimal transport cost between two probability distributions [32]. A lower Wasserstein Distance indicates greater similarity. In particular, given the circular nature of tuning estimation, we compute the Circular Wasserstein Distance (CWD), as described in [33]. Thus, in this optimal transport problem, probability mass can flow across the boundaries of the estimation range, as both ends are connected on the considered circle.
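To make the idea of transport across the range boundary concrete, the sketch below computes a 1-Wasserstein distance between two histograms on a circle. It uses a standard closed form for equal-width bins: take the cumulative difference of the two normalized histograms, subtract its median, and sum the absolute values. Whether this coincides with the exact variant described in [33] is an assumption, and the helper names are hypothetical.

```python
from statistics import median


def tuning_histogram(tunings_cents, n_bins=100):
    """Count tunings in n_bins equal-width bins covering [-50, 50) Cents."""
    hist = [0] * n_bins
    for tau in tunings_cents:
        position = (tau + 50.0) % 100.0            # position on the circle
        index = int(position * n_bins / 100.0) % n_bins
        hist[index] += 1
    return hist


def circular_wasserstein(p, q, bin_width=1.0):
    """1-Wasserstein distance between two histograms on a circle."""
    if len(p) != len(q) or not p:
        raise ValueError("histograms must be non-empty and of equal length")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("histograms must contain mass")
    # Cumulative difference of the two normalized histograms.
    diff_cdf = []
    running = 0.0
    for p_i, q_i in zip(p, q):
        running += p_i / sum_p - q_i / sum_q
        diff_cdf.append(running)
    # On a circle the starting point of the CDF is arbitrary; subtracting
    # the median picks the offset with the lowest transport cost.
    offset = median(diff_cdf)
    return bin_width * sum(abs(d - offset) for d in diff_cdf)
```

With 1-Cent bins, a distribution concentrated at −50 Cents and one concentrated at +49 Cents are 1 Cent apart under this measure, whereas a non-circular distance would report 99 Cents.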
3.4 Vocoders
In Section 2.1, we briefly introduced the main vocoder architectures investigated in this work, for which we will use the following shorthand notations throughout the remainder of the paper: Hawthorne et al. [1] (HAWT), BigVGAN [5] (BV), BigVGAN-V2 [31] (BV2), and the signal-processing-based approach of NNLS [11] followed by Griffin-Lim [12] (LSGL). Note that the vocoder by Hawthorne et al. is ambiguously also referred to as SoundStream in the literature [2, 3].
Table 1 gives an overview of the investigated vocoders. HAWT has only a single version, with 128 mel bands. Multiple versions exist for BV and BV2, which differ in the number of mel bands and sampling frequency Fs. For LSGL we use three different numbers of mel bands, with the same underlying STFT properties. In our results, we identify each vocoder by its short name and the number of mel bands. For instance, BV-80 refers to the BigVGAN model with 80 mel bands and a 22 kHz sampling rate. This naming convention is unambiguous, as variants of the same vocoder with different sampling rates also have distinct numbers of mel bands.
4. RESULTS: TUNING PRESERVATION
4.1 Quantitative Results
Figure 4a presents the mean absolute δ_circ for all tested vocoders, where distinct colors represent the datasets, and the color shade indicates the tuning estimator. In addition to the tested vocoders, we include metrics for ground truth audio (GT) in the figure, where we compare the tuning estimate τ_y of the pitch-shifted audio with the target tuning τ. For GT, we can see that tuning estimation aligns with pitch-shifting, meaning the estimated tuning of a pitch-shifted version differs on average no more than 2 Cents from the target tuning.
[Figure 4: two bar charts over HAWT, BV 80, BV 100, BV2 80, BV2 100, BV2 128, LSGL 100, LSGL 128, LSGL 150, and GT, with vertical axes from 0 to 20. Legend (dataset and tuning estimation method): BPSD-TempMatch, SWD-TempMatch, VE-TempMatch, BPSD-FreqHist, SWD-FreqHist, VE-FreqHist. (a) Mean absolute δ_circ [Cents]. (b) Circular Wasserstein Distance (CWD).]
Figure 4: Evaluation metrics for each vocoder, dataset, and tuning estimation method.
As a first and central observation, we see that most neural vocoders introduce tuning deviation, whereas the signal-processing-based LSGL generally shows a lower circular difference. However, BV2-100 and BV2-128 are an exception to this. We also observe that a higher number of mel bands for LSGL and BV2 leads to less tuning deviation, reaching values only marginally above the tuning estimation inconsistencies for a high number of mel bands. When comparing neural vocoders, HAWT exhibits the overall highest circular difference, reaching values up to δ_circ = 18.7 Cents. BV2 shows lower values on average compared to the original BV. From further observing Figure 4a, the different datasets seem to have an impact on the tuning preservation of some vocoders. For instance, BV and HAWT show notably higher tuning deviations for SWD compared to other datasets. The VE dataset is least affected by tuning deviations for all neural vocoders, possibly due to its continuous pitch nature, since it includes violin only. However, this dataset seems to be difficult for LSGL. Comparing the results of the two tuning estimation methods, we observe similar trends despite some variations for specific vocoder-dataset combinations.
[Figure 5: grid of scatter plots with marginal distributions. Rows: HAWT, BV2-80. Columns: BPSD, SWD, VE. Both axes in Cents, from −50 to 50.]
Figure 5: Tuning estimates of vocoded audio (τ̂) over input audio (τ) on all three datasets for vocoders HAWT and BV2-80. Marginal distributions are shown as histograms with a moving average smoothing of width five applied and a Gaussian kernel density estimate (line). Tunings were estimated with the TempMatch estimator.
Figure 4b shows the CWD for all vocoders, datasets, and tuning estimation methods. We observe a strong correlation with δ_circ from Figure 4a. A high CWD indicates that the tuning of the vocoder output follows a distribution different from that of the uniformly distributed pitch-shift augmented datasets, suggesting that, in general, when a vocoder introduces tuning deviations, these deviations follow a non-uniform distribution.
4.2 Qualitative Results
[Figure 6: two scatter plots, (a) BV2-128 and (b) LSGL-150. Both axes in Cents, from −50 to 50.]
Figure 6: Tuning estimates of vocoded audio (τ̂) over input audio (τ) for BV2-128 and LSGL-150 on the SWD.
To better understand the tuning changes introduced by the vocoders, we analyze two example vocoders in more detail: HAWT and BV2-80, as these are among the most commonly used for music data in the literature [1–3, 17, 34, 35]. Figure 5 shows a scatter plot of all vocoded tunings τ̂ over input tunings τ for all three datasets, alongside marginal distributions with a Gaussian kernel density estimation (KDE). Note that we account for the circular continuity of the tuning estimates to calculate the KDE.
As indicated by our quantitative analysis, the output distribution of τ̂ is heavily altered from the uniform input distribution in almost all cases. For both vocoders, a cluster around τ̂ = 0 is evident, corresponding to A4 = 440 Hz, though the actual peak is slightly above τ̂ = 0. This could reflect a bias in the training data, as many recordings are tuned between 440 and 443 Hz, as discussed in Section 2.2. This effect is similar for the BPSD and the SWD, but less pronounced for the VE. Figure 6 shows the same scatter plot for BV2-128 and LSGL-150 on SWD, where output tuning closely follows input tuning, consistent with the low tuning deviation metrics discussed earlier. The bias observed in Figure 5 appears to vanish for the higher resolution BV2 model (which has more mel bands and a higher sampling rate). This is potentially due to the reduced information loss in the mel spectrograms, making accurate reconstruction an easier task. Additional figures for all datasets, vocoders, and tuning estimator combinations are available on our website.³
³ https://www.audiolabs-erlangen.de/resources/2025-ISMIR-VocoderTuningEstimation

5. LISTENING TEST
While we showed that vocoders may not preserve tuning, we did not examine whether this affects perceived quality for non-standard tuning. In principle, tuning deviation and output quality could be orthogonal; a vocoder might alter tuning yet still produce high-quality audio.
To investigate this question, we conducted a listening test, focusing on the BPSD due to its diverse original tuning. Instead of comparing samples from different vocoders, we compare only vocoded samples of the same excerpt under different tunings from a single vocoder. This goal introduces two challenges for the listening test design. First, we require test items with identical original quality but different tunings. A potential solution is pitch shifting, which can robustly replicate a specific tuning (Section 4.1, Figure 4a), but here, the introduction of small artifacts might impact the perceived quality. Second, the commonly used MUSHRA test [36] is unsuitable for comparing items that differ in tuning from the reference, as pitch differences would directly influence the listeners’ judgment. We address both issues in our test design.
The test follows an AB format without a reference: Participants compare two excerpts and choose the one with the better perceived quality, with a “no preference” option available as well. As a first step, we select four BPSD recordings with diverse original tunings of −42, −11, 0, and 34 Cents and pitch-shift each one three times to replicate the other three tunings, yielding 16 test items (four per tuning). In order to analyze tuning bias within each vocoder, we do not compare across vocoders but rather select one set of item pairs that is then vocoded and presented independently for each vocoder. Each original (non-pitch-shifted) recording is paired with its three pitch-shifted versions, ensuring each tuning is tested against the original. For example, the item with an original tuning of −42 Cents is paired with its pitch-shifted versions with tunings −11, 0, and 34 Cents. In this example, if a listener prefers the 0-Cent version over the original −42-Cent version, even though the 0-Cent version was obtained through pitch-shifting, this can indicate that tuning affects quality more strongly than pitch shifting. In total, this yields 12 item pairs per vocoder, and we test four vocoders: HAWT, BV2-80, BV2-128, and LSGL-150.
We split the items into two separate listening tests with 24 pairs each, to limit the test duration per listener. Additionally, we include four control pairs with identical items in both subgroups, where attentive listeners should indicate “no preference”. We exclude listeners who indicate a preference for more than one control pair.
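The pair construction described above can be enumerated in a few lines. The sketch only reproduces the counting: how the 48 pairs were assigned to the two listening tests is not specified in the text, so no split is shown, and the names used here are hypothetical.

```python
TUNINGS_CENTS = (-42, -11, 0, 34)                     # original tunings of the four recordings
VOCODERS = ("HAWT", "BV2-80", "BV2-128", "LSGL-150")  # vocoders in the listening test


def build_item_pairs(tunings=TUNINGS_CENTS, vocoders=VOCODERS):
    """List (vocoder, original_tuning, shifted_tuning) for every AB pair.

    Each original recording is paired with its pitch-shifted versions at
    the other tunings, and the same set of pairs is used for every vocoder.
    """
    pairs = []
    for vocoder in vocoders:
        for original in tunings:
            for shifted in tunings:
                if shifted != original:
                    pairs.append((vocoder, original, shifted))
    return pairs


if __name__ == "__main__":
    pairs = build_item_pairs()
    print(len(pairs), "pairs in total,", len(pairs) // len(VOCODERS), "per vocoder")
```

This yields 12 pairs per vocoder and 48 pairs overall, consistent with two listening tests of 24 pairs each.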
5.1 Results
In total, 25 participants took part in our listening test, 19 male and 6 female, with a median age of 27, ranging from 21 to 58. Among them, 20 participants had some prior experience with listening tests. A total of 5 listeners did not meet the post-screening criterion, leaving 20 listeners distributed evenly among the two item subsets. For each tuning value, we aggregate the number of times it was preferred. If tuning had no impact on quality, we would expect either a tendency towards “no preference” votes, or an approximately equal distribution of preferences across tuning values.
[Figure 7: bar charts, one panel per vocoder (HAWT, BV2 80, BV2 128, LSGL 150). Horizontal axis: Tuning [Cents] with the values −42, −11, 0, 34. Vertical axis: Preference [%], 0 to 50. Legend: Preference, No Preference.]
Figure 7: Listening test results: listeners’ preference towards tuning values as a percentage of total votes for each vocoder individually. “No preference” votes are split evenly between both excerpts in a pair, e.g., a “no preference” vote between 0 and −11 counts as half a vote in the red bar for each, meaning that bars sum up to 100% for each vocoder.
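The vote aggregation described for Figure 7 can be sketched as follows. The data layout (one tuple per vote, with `None` marking “no preference”) and the function name are assumptions made for illustration, and the example votes in the usage note are invented, not results from the study.

```python
from collections import defaultdict


def preference_percentages(votes):
    """Aggregate AB votes of one vocoder into percentages per tuning.

    votes: iterable of (tuning_a, tuning_b, choice), where choice is the
    preferred tuning or None for "no preference".
    Returns {tuning: (preference_pct, no_preference_pct)}.
    """
    votes = list(votes)
    if not votes:
        raise ValueError("need at least one vote")
    preferred = defaultdict(float)
    undecided = defaultdict(float)
    tunings = set()
    for tuning_a, tuning_b, choice in votes:
        tunings.update((tuning_a, tuning_b))
        if choice is None:
            # Split a "no preference" vote evenly between both excerpts.
            undecided[tuning_a] += 0.5
            undecided[tuning_b] += 0.5
        elif choice in (tuning_a, tuning_b):
            preferred[choice] += 1.0
        else:
            raise ValueError("choice must be one of the two tunings or None")
    total = len(votes)
    return {
        t: (100.0 * preferred[t] / total, 100.0 * undecided[t] / total)
        for t in sorted(tunings)
    }
```

For four made-up votes `[(-42, 0, 0), (-42, 0, None), (0, 34, 0), (-11, 0, -11)]`, the 0-Cent tuning receives 50% preference plus 12.5% from the split vote, and all bars together add up to 100%.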
Figure 7 illustrates this percentage of preference votes for each tuning value and tested vocoder. For HAWT and BV2-80 we can see a trend: The tunings −42 and +34 Cents receive fewer preference votes compared to the middle values. In contrast, BV2-128 does not exhibit a strong trend, while LSGL-150 shows a generally lower number of preference votes, suggesting that listeners perceived fewer quality differences compared to the neural vocoders. When aggregating preferences into groups of “original” and “pitch-shifted”, listeners show a slight preference towards the original items for the neural vocoders, indicating that pitch-shifting also has a negative influence on quality (see supplementary website). Therefore, fully disentangling the effects of pitch shifting and tuning in the listening test remains challenging. However, due to our test design, always comparing the vocoded original recordings with their vocoded pitch-shifted counterpart, Figure 7 still shows a meaningful trend.
Overall, the results indicate that vocoders which show a bias in tuning preservation (HAWT and BV2-80) also show a decrease in quality when reconstructing signals with out-of-distribution tuning.
6. CONCLUSIONS
In this study, we investigated how musical tuning affects neural vocoders, focusing mainly on tuning preservation. Our findings reveal that vocoders can significantly alter both individual tunings and overall tuning distributions, with some exhibiting a bias towards modern standard tuning. Additionally, our listening test suggests a decline in reconstruction quality for signals with non-standard tunings when processed by a vocoder with tuning bias.
Our work underscores the importance of tuning in music generation and vocoder design. Future work should focus on mitigating tuning biases during vocoder training. Our evaluation approach, based on pitch shifting and quantitative evaluation metrics, gives researchers a straightforward yet effective method for assessing tuning robustness in music vocoders.
7. ACKNOWLEDGEMENTS
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant No. 350953655 (MU 2686/11-2) and Grant No. 500643750 (MU 2686/15-1). The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS.
8. REFERENCES
[1] C. Haw ho ne, I. Simon, A. Robe s, N. Zeghidou ,
J. Ga dne , E. Manilow, and J. H. Engel, “Mul i-
ins umen music syn hesis wi h spec og am di u-
sion,” in P oceedings o he In e na ional Socie y
o Music In o ma ion Re ie al Con e ence (ISMIR),
2022, pp. 598–607.
[2] B. Maman, J. Zei le , M. Mülle , and A. H. Be mano,
“Pe o mance condi ioning o di usion-based mul i-
ins umen music syn hesis,” in P oceedings o he
IEEE In e na ional Con e ence on Acous ics, Speech,
and Signal P ocessing (ICASSP), Seoul, Sou h Ko ea,
2024, pp. 5045–5049.
[3] D. Kim, H.-W. Dong, and D. Jeong, “Violindi : En-
hancing exp essi e iolin syn hesis wi h pi ch bend
condi ioning,” in P oceedings o he IEEE In e na-
ional Con e ence on Acous ics, Speech, and Signal
P ocessing (ICASSP), Hyde abad, India, 2025, pp. 1–
5.
[4] J. Shen, R. Pang, R. J. Weiss, M. Schus e , N. Jai ly,
Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R.-S. Ryan,
R. A. Sau ous, Y. Agiomy giannakis, and Y. Wu, “Na -
u al TTS syn hesis by condi ioning Wa eNe on MEL
spec og am p edic ions,” in P oceedings o he IEEE
In e na ional Con e ence on Acous ics, Speech, and
Signal P ocessing (ICASSP), Calga y, Canada, 2018,
pp. 4779–4783.
[5] S. gil Lee, W. Ping, B. Ginsbu g, B. Ca anza o, and
S. Yoon, “BigVGAN: A uni e sal neu al ocode
wi h la ge-scale aining,” in P oceedings o he In-
e na ional Con e ence on Lea ning Rep esen a ions
(ICLR), Kigali, Rwanda, 2023.
[6] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Ca an-
za o, “Di wa e: A e sa ile di usion model o audio
syn hesis,” in P oceedings o he In e na ional Con-
e ence on Lea ning Rep esen a ions, ICLR, Vi ual,
2021.
[7] H. Dudley, “Remaking Speech,” The Jou nal o he
Acous ical Socie y o Ame ica, ol. 11, no. 2, pp. 169–
177, 1939.
[8] A. Mus a a, N. Pia, and G. Fuchs, “S yleMelGAN: An
e icien high- ideli y ad e sa ial ocode wi h empo-
al adap i e no maliza ion,” in P oceedings o he IEEE
In e na ional Con e ence on Acous ics, Speech, and
Signal P ocessing (ICASSP), To on o, Canada, 2021.
[9] ISO, “Acous ics – s anda d uning equency (s anda d
musical pi ch),” ISO16:1975, 1975.
[10] F. G ibenski, Tuning he Wo ld: The Rise o 440
He z in Music, Science, and Poli ics, 1859–1955,
se . New Ma e ial His o ies o Music. Uni e si y
o Chicago P ess, 2023. [Online]. A ailable: h ps:
//books.google.de/books?id=VJKpEAAAQBAJ
[11] C. L. Lawson and R. J. Hanson, Sol ing leas squa es
p oblems. Socie y o Indus ial and Applied Ma he-
ma ics, 1995.
[12] D. W. G i in and J. S. Lim, “Signal es ima ion om
modi ied sho - ime Fou ie ans o m,” IEEE T ans-
ac ions on Acous ics, Speech, and Signal P ocessing,
ol. 32, no. 2, pp. 236–243, 1984.
[13] J. Kong, J. Kim, and J. Bae, “Hi i-gan: Gene a i e ad-
e sa ial ne wo ks o e icien and high ideli y speech
syn hesis,” in Ad ances in Neu al In o ma ion P ocess-
ing Sys ems, H. La ochelle, M. Ranza o, R. Hadsell,
M. Balcan, and H. Lin, Eds., i ual, 2020.
[14] N. Zeghidou , A. Luebs, A. Om an, J. Skoglund, and
M. Tagliasacchi, “Sounds eam: An end- o-end neu-
al audio codec,” IEEE/ACM T ansac ions on Audio,
Speech and Language P ocessing, ol. 30, pp. 495–
507, 2022.
[15] M. Tagliasacchi, Y. Li, K. Misiunas, and D. Roblek,
“SEAne : A mul i-modal speech enhancemen ne -
wo k,” in P oceedings o he Annual Con e ence o he
In e na ional Speech Communica ion Associa ion, (In-
e speech), H. Meng, B. Xu, and T. F. Zheng, Eds.,
Shanghai, China, 2020, pp. 1126–1130.
[16] H. Kim, S. Choi, and J. Nam, “Exp essi e acous ic gui-
a sound syn hesis wi h an ins umen -speci ic inpu
ep esen a ion and di usion ou pain ing,” in P oceed-
ings o he IEEE In e na ional Con e ence on Acous-
ics, Speech and Signal P ocessing, ICASSP, Seoul,
Republic o Ko ea, 2024, pp. 7620–7624.
[17] B. Maman, J. Zeitler, M. Müller, and A. H. Bermano,
“Multi-aspect conditioning for diffusion-based music
synthesis: Enhancing realism and acoustic control,”
IEEE Transactions on Audio, Speech and Language
Processing, vol. 33, pp. 68–81, 2025.
[18] B. D. Giorgi, M. Levy, and R. Sharp, “Mel
spectrogram inversion with stable pitch,” in Proceedings of the
International Society for Music Information Retrieval
Conference, ISMIR, Bengaluru, India, 2022,
pp. 233–239.
[19] J. H. Engel, K. K. Agrawal, S. Chen, I. Gulrajani,
C. Donahue, and A. Roberts, “Gansynth: Adversarial
neural audio synthesis,” in International Conference
on Learning Representations, ICLR, New Orleans, LA,
USA, 2019.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[20] A. Lerch, “On the requirement of automatic tuning
frequency estimation,” in Proceedings of the
International Society for Music Information Retrieval
Conference (ISMIR), Victoria, Canada, 2006, pp. 212–215.
[21] Y. Qin and A. Lerch, “Tuning frequency dependency
in music classification,” in IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019, pp. 401–405.
[22] A. Degani, M. Dalai, R. Leonardi, and P. Migliorati,
“Comparison of tuning frequency estimation
methods,” Multimedia Tools and Applications, vol. 74,
no. 15, pp. 5917–5934, Aug. 2015.
[23] K. Dressler and S. Streich, “Tuning frequency
estimation using circular statistics,” in Proceedings of the
International Society for Music Information Retrieval
Conference (ISMIR), Vienna, Austria, 2007,
pp. 357–360.
[24] V. Gnann, M. Kitza, J. Becker, and M. Spiertz,
“Least-squares local tuning frequency estimation for choir
music,” in Proceedings of the Audio Engineering Society
(AES) Convention, New York City, New York, USA,
2011.
[25] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar,
E. Battenberg, and O. Nieto, “Librosa: Audio and
music signal analysis in Python,” in Proceedings of the
Python Science Conference, Austin, Texas, USA,
2015, pp. 18–25.
[26] A. de Cheveigné and H. Kawahara, “YIN, a
fundamental frequency estimator for speech and music,” Journal
of the Acoustical Society of America (JASA), vol. 111,
no. 4, pp. 1917–1930, 2002.
[27] M. Müller and F. Zalkow, “libfmp: A Python
package for fundamentals of music processing,” Journal
of Open Source Software (JOSS), vol. 6, no. 63, pp.
3326:1–5, 2021.
[28] J. Zeitler, C. Weiß, V. Arifi-Müller, and M. Müller,
“BPSD: A coherent multi-version dataset for
analyzing the first movements of Beethoven’s piano sonatas,”
Transactions of the International Society for Music
Information Retrieval, vol. 7, no. 1, pp. 195–212, 2024.
[29] C. Weiß, F. Zalkow, V. Arifi-Müller, M. Müller, H. V.
Koops, A. Volk, and H. Grohganz, “Schubert
Winterreise dataset: A multimodal scenario for music
analysis,” ACM Journal on Computing and Cultural
Heritage (JOCCH), vol. 14, no. 2, pp. 25:1–18, 2021.
[30] N. C. Tamer, P. Ramoneda, and X. Serra, “Violin
etudes: A comprehensive dataset for f0 estimation and
performance analysis,” in Proceedings of the
International Society for Music Information Retrieval
Conference (ISMIR), Bengaluru, India, 2022, pp. 517–524.
[31] S.-G. Lee, W. Ping, B. Ginsburg, B. Catanzaro,
and S. Yoon, “BigVGAN GitHub repository,”
2024. [Online]. Available:
https://github.com/NVIDIA/BigVGAN
[32] G. Peyré and M. Cuturi, “Computational optimal
transport: With applications to data science,” Foundations
and Trends in Machine Learning, vol. 11, no. 5–6, pp.
355–607, 2019.
[33] J. Delon, J. Salomon, and A. Sobolevski, “Fast
Transport Optimization for Monge Costs on the Circle,”
Society for Industrial and Applied Mathematics Journal
on Applied Mathematics, vol. 70, no. 7,
pp. 2239–2258, 2010.
[34] H. Kim, S. Choi, and J. Nam, “Expressive acoustic
guitar sound synthesis with an instrument-specific input
representation and diffusion outpainting,” in ICASSP
2024-2024 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2024, pp. 7620–7624.
[35] S. Dai, M.-Y. Liu, R. Valle, and S. Gururani,
“Expressivesinger: Multilingual and multi-style score-based
singing voice synthesis with expressive performance
control,” in Proceedings of the 32nd ACM
International Conference on Multimedia, 2024,
pp. 3229–3238.
[36] International Telecommunications Union, “ITU-R
Rec. BS.1534-3: Method for the subjective
assessment of intermediate quality levels of coding systems,”
2015.
