
Enrichment of Oesophageal Speech: Voice Conversion with Duration-Matched Synthetic Speech as Target

Author: Raman, Sneha; Sarasola Aramendia, Xabier; Navas Cordón, Eva; Hernáez Rioja, Inmaculada
Publisher: MDPI
Year: 2021
DOI: 10.3390/app11135940
Source: https://addi.ehu.eus/bitstream/10810/52459/1/applsci-11-05940-v2.pdf
Applied Sciences
Article
Enrichment of Oesophageal Speech: Voice Conversion with Duration–Matched Synthetic Speech as Target
Sneha Raman *, Xabier Sarasola, Eva Navas and Inma Hernaez *


Citation: Raman, S.; Sarasola, X.; Navas, E.; Hernaez, I. Enrichment of Oesophageal Speech: Voice Conversion with Duration–Matched Synthetic Speech as Target. Appl. Sci. 2021, 11, 5940. https://doi.org/10.3390/app11135940
Academic Editor: Francesc Alías
Received: 9 April 2021
Accepted: 18 June 2021
Published: 26 June 2021
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
HiTZ-Aholab, University of the Basque Country (UPV/EHU), 48013 Bilbao, Spain; xabier [email protected] (X.S.); [email protected] (E.N.)
* Correspondence: [email protected] (S.R.); [email protected] (I.H.)
Abstract: Pathological speech such as Oesophageal Speech (OS) is difficult to understand due to the presence of undesired artefacts and lack of normal healthy speech characteristics. Modern speech technologies and machine learning enable us to transform pathological speech to improve intelligibility and quality. We have used a neural network based voice conversion method with the aim of improving the intelligibility and reducing the listening effort (LE) of four OS speakers of varying speaking proficiency. The novelty of this method is the use of synthetic speech matched in duration with the source OS as the target, instead of parallel aligned healthy speech. We evaluated the converted samples from this system using a collection of Automatic Speech Recognition systems (ASR), an objective intelligibility metric (STOI) and a subjective test. ASR evaluation shows that the proposed system had significantly better word recognition accuracy compared to unprocessed OS, and baseline systems which used aligned healthy speech as the target. There was an improvement of at least 15% on STOI scores indicating a higher intelligibility for the proposed system compared to unprocessed OS, and a higher target similarity in the proposed system compared to baseline systems. The subjective test reveals a significant preference for the proposed system compared to unprocessed OS for all OS speakers, except one who was the least proficient OS speaker in the data set.
Keywords: pathological speech; voice conversion; intelligibility; speech recognition
1. Introduction
Laryngectomy is the surgical procedure of removing the larynx [1]. In addition to several functional disorders and lifestyle changes [2], this results in the loss of vocal folds and the patient’s pre-surgery speech [3]. One of the several alternative ways that a laryngectomee can communicate [3,4] is to speak using the vibrations of the pharyngoesophageal segment [5], known as Oesophageal Speech (OS). Generating OS introduces acoustic artefacts [6] and makes OS less intelligible [7,8], which affects communication, social activities and quality of life [2,9].
OS is less intelligible and more effortful to listen to compared to healthy speech (HS). This is evident from previous listening experiments [10,11] as well as acoustic characteristics and challenges of OS [12]. Prolonged exposure to effortful speech causes fatigue in listeners [13]. Therefore, there is a strong motivation to make OS more intelligible and pleasant to listen to. We aim to enrich OS by closing the OS-HS gaps in intelligibility, quality and listening effort (LE).
Modern speech technologies and machine learning have great potential for use in the healthcare sector, be it for improvement of healthcare services [14] or to aid patients with speech impairments [15]. One such application is transforming pathological speech with the aim of making it more intelligible, pleasant and easier to process. This can reduce the load on the listeners and improve communication for people with speech pathologies.
One of the possible approaches to enrich OS is to use a voice conversion (VC) system. The goal of a VC system is to convert the utterances of a source speaker to sound like those of a target speaker [16]. In the OS enrichment context, utterances of an OS speaker can be mapped to those of a healthy speaker, thereby having OS acquire characteristics of HS.
Some OS enrichment has been done using statistical VC methods such as Gaussian Mixture models (GMMs) [17–19]. In these methods, OS and HS are modelled by a linear combination of Gaussian distributions. In the training process, the Gaussian distributions of OS are mapped to those of HS. The output of such a training session is a conversion function mapping OS to HS. This conversion function can then be used to convert new OS samples, thereby getting OS speech that has characteristics of HS. In recent times, Deep Neural Networks (DNN) are more popular and effective compared to GMM based methods for enhancement of alaryngeal speech [20–23] and other types of pathological speech [24,25]. Another attempt to enrich OS was by using the eigenvoices concept [26], which was inspired by the eigenfaces concept [27]. Some studies have used filtering approaches [28], formant synthesis [29] and increasing the harmonics to noise ratio of OS [30].
Like our previous approaches [31,32], the proposed method is also based on VC. The bidirectional long short-term memory (BLSTM) based transformation [31] had better Automatic Speech Recognition (ASR) scores compared to OS. The method used by Serrano et al. [32] was inspired by a Phonetic Posteriorgrams (PPG) based system [33] which had good results for HS-HS VC. When applied to OS-HS VC, there was no improvement in ASR. Mel Cepstral Distortion (MCD) was reduced by both systems. Unprocessed OS was preferred over both of the systems in preference tests.
VC systems may be parallel (requires temporally aligned source–target utterance pairs) or non-parallel (requires many hours of speech data). Due to data limitations (100 sentences per speaker), parallel VC is best suited for our purposes. A parallel VC requires the parallel source and target sentences to be aligned for training. This is primarily done by Dynamic Time Warping (DTW) alignment which finds an optimal match based on similarities in the two sequences.
Helander et al. [34] describe some challenges of DTW in the context of VC. One of them is the presence of silences or extra sounds in the source and not in the target. Another one is the poor estimation of end points of silences and phonemes. A third case is the many-to-one and one-to-many nature of the DTW mapping. For example, if the source contains a phoneme with a longer duration compared to the target, then a single frame of the target may be mapped to several frames of the source. OS has undesired silences and artefacts and longer and varying durations of phonemes. These qualities make DTW challenging in the OS-HS VC task.
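The many-to-one behaviour of DTW can be reproduced with a small dynamic programming sketch. The Python code below is only an illustration and is not the alignment code used in this work: the function name `dtw_path`, the absolute-difference distance and the toy one-dimensional "frames" are assumptions made for the example.

```python
def dtw_path(source, target):
    """Return the DTW alignment path between two 1-D sequences.

    Illustrative only: real VC systems align multi-dimensional
    spectral frames, not scalars.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(source[i - 1] - target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end to recover the aligned frame pairs
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    return path[::-1]


# Hypothetical toy data: the source holds the value 5 for three frames,
# the target for only one frame.
source = [1, 5, 5, 5, 2]
target = [1, 5, 2]
path = dtw_path(source, target)
print(path)  # [(0, 0), (1, 1), (2, 1), (3, 1), (4, 2)]
```

In this toy case, target frame 1 is paired with source frames 1, 2 and 3, which is the many-to-one mapping that arises when a source phoneme is longer than its target counterpart.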
As a workaround, in our previous attempt [31], we performed alignment at two stages: first aligning the phone boundaries and then applying DTW, anchoring the phone boundaries. In this paper, we took advantage of the available phone labels and the possibility of generating synthetic speech (SS) with explicit phone durations. This resulted in SS that matches in duration with the source OS utterances, and would be a perfectly aligned target. This eliminated the need for DTW and its limitations. We hypothesise that this DTW-free VC would improve the intelligibility and quality of the enriched OS compared to our previous methods which required source–target alignment.
A robust enrichment system should ideally work with OS speakers of varying speaking proficiency. Therefore, we performed enrichments for OS speakers ranging from very low to very high intelligibility. As the enrichment system is built to improve communications for the OS speaker, it is important that the output of the enrichment system is preferred by listeners over the unprocessed OS. Moreover, given that voice interactions with machines are becoming more and more common, the enriched outputs should be intelligible to machines. Taking these points into consideration, we evaluated the subjective preference for the enriched system amongst human listeners as well as objective intelligibility and ASR performance.
To sum up, we present a novel, DTW-free, parallel VC system for OS enrichment which includes an SS target. We evaluate its outputs for ASR performance, an objective intelligibility metric and a preference test in comparison with unprocessed OS.
2. Data
We chose four OS speakers (3 male, 1 female) with a wide range of intelligibility from an OS database that contains over 30 OS speakers [12]. In the original database, the four speakers were identified as ‘02M3’, ‘04M3’, ‘16M3’ and ‘25F3’, and we continue to use these IDs. Some details of the four speakers such as age, sex, time passed since the laryngectomy operation, stimulus duration and speaking rates are presented in Table 1 [35]. The age and the time since laryngectomy were collected on the day of the recording. Speakers 04M3 and 16M3 are relatively recent laryngectomees and hence, are less proficient than the other two speakers.
For each speaker, we used a parallel dataset of 100 phonetically-balanced Spanish sentences (described in detail in [12]). The sentences were syntactically and semantically predictable but had some low frequency words. The number of words in each sentence ranged between 9 and 18 words (mean = 13.19, SD = 3.66).
Table 1. Speaker characteristics.
Speaker ID | Sex | Age | Time since laryngectomy | Duration per stimulus, mean ± SD (seconds) | Speaking rate, mean ± SD (syllables per second)
02M3 | Male | 75 years 5 months | 8 years 1 month | 7.48 ± 1.67 | 4.32 ± 1.80
04M3 | Male | 59 years 4 months | 1 year 7 months | 9.27 ± 2.36 | 3.84 ± 1.71
16M3 | Male | 66 years 4 months | 1 year 10 months | 12.52 ± 3.61 | 2.59 ± 1.19
25F3 | Female | 59 years 3 months | 11 years 11 months | 7.85 ± 2.02 | 4.24 ± 1.86
3. Proposed VC System
The proposed VC system, BLSTM with SS as target (BLSTMSS), is a DNN based system with OS as source and SS with matching durations as target (see Figure 1). The procedure is described in detail in the following steps.
Figure 1. The proposed OS-HS VC system: BLSTMSS.
3.1. Labelling of Oesophageal Speech
Segmentation and labelling of OS is a tricky process owing to undesired artefacts, incorrect pronunciations of some consonants and unstable fundamental frequency. The forced alignment feature built in to generic Spanish ASR systems such as Kaldi [36] was unsuitable for OS. Therefore, using the Montreal Forced Alignment tool [37], and with the aid of a manually labelled set of one speaker (speaker 02M3), new models were created by using OS as the training material. Performing segmentation with this forced aligner gave us the phone labels and their durations for the source OS utterances.
3.2. Generating Target Synthetic Speech
Using the labels, their durations and the utterance text, SS was generated by explicitly assigning these durations to the phones. We used an HMM based text-to-speech system [38] which was originally developed for the Basque language. The Spanish version is described in [39]. This process gave us equal-sized frame-by-frame aligned pairs of OS and SS.
Due to constant swallowing of air to produce speech, OS contains several pauses with artefacts within utterances. During the SS generation, these pauses were replaced with silences.
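As a rough illustration of how explicit phone durations lead to equal-sized source and target sequences, the sketch below converts forced-alignment segments into per-phone frame counts and rewrites pause labels as silence. It is a hypothetical example, not the synthesis front end used here: the function `frames_per_phone`, the label names `sp`, `pau` and `sil`, and the segment times are all assumptions, and the 5 ms frame shift is taken from the vocoder analysis described in Section 3.3.

```python
FRAME_SHIFT_MS = 5  # frame shift of the vocoder analysis (Section 3.3)


def frames_per_phone(segments, pause_labels=("sp", "pau")):
    """Map (label, start_ms, end_ms) segments to (label, n_frames) pairs.

    Pause labels are rewritten to 'sil'. All label names are hypothetical.
    """
    out = []
    for label, start_ms, end_ms in segments:
        # Use boundary frame indices so the counts sum to the utterance length
        n_frames = end_ms // FRAME_SHIFT_MS - start_ms // FRAME_SHIFT_MS
        out.append(("sil" if label in pause_labels else label, n_frames))
    return out


# Hypothetical alignment of a short OS utterance with a long internal pause
segments = [("k", 0, 80), ("a", 80, 250), ("sp", 250, 600),
            ("s", 600, 735), ("a", 735, 900)]
targets = frames_per_phone(segments)
total_frames = sum(n for _, n in targets)
print(targets, total_frames)
```

Because the frame counts are derived from the source boundaries, a target synthesised with these durations has the same number of frames as the source utterance.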
3.3. Voice Conversion Neural Network
Voice conversion was performed with the VC recipe of the Merlin toolkit [40]. Parametrisation and resynthesis were done using the WORLD Vocoder [41]. The extracted parameters included 60 Mel Cepstral Coefficients (MCCs), 1 excitation parameter (log F0), 1 Band Aperiodicity Parameter (BAP), the deltas and the delta deltas of the MCCs, log F0 and BAP and a voiced/unvoiced binary parameter. In all, there were 187 parameters extracted every 5 milliseconds.
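The total of 187 follows directly from the parameter list: 62 static parameters, each with a delta and a delta delta, plus the voiced/unvoiced flag. The short check below spells out this arithmetic; the variable names are illustrative only.

```python
# Static parameters per 5 ms frame, as listed in the text
static = {"mcc": 60, "log_f0": 1, "bap": 1}
n_static = sum(static.values())   # 62

# Each static parameter also has a delta and a delta-delta
n_with_dynamics = 3 * n_static    # 186

# One additional voiced/unvoiced binary parameter
n_total = n_with_dynamics + 1     # 187
print(n_total)
```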
Matrices of size 187 × (number of 5 ms frames) for the OS and SS utterances were the source and target inputs, respectively. We split the 100 source–target pairs into 90 train and 10 test pairs. As the source and the target had the same number of frames, we skipped the alignment step in the training process. The train parameters were normalised to 0 mean and unit variance and then fed into a 4 layered BLSTM (4 × 1024) training network. After training, the source test utterance parameters were converted using the trained model. A denormalisation of the mean and the variance was applied to the output parameters, followed by a Maximum Likelihood Parameter Generation using the variances from the training data. The resulting converted parameters were fed into the vocoder to synthesise the converted speech. A cross validation was performed 10 times, so that all the 100 sentences were available as test sentences.
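The data handling around the network can be sketched as follows. This is a minimal illustration under stated assumptions, not the Merlin recipe itself: the helper names are hypothetical, the folds are assumed to be contiguous blocks of 10 sentences (the text does not say how sentences were assigned to folds), and a single scalar stream stands in for the 187-dimensional parameter vectors.

```python
from statistics import mean, pstdev


def make_folds(n_items=100, n_folds=10):
    """Yield (train_ids, test_ids) so that every item is tested exactly once."""
    ids = list(range(n_items))
    fold_size = n_items // n_folds
    for k in range(n_folds):
        test = ids[k * fold_size:(k + 1) * fold_size]
        train = ids[:k * fold_size] + ids[(k + 1) * fold_size:]
        yield train, test


def fit_normaliser(train_values):
    """Estimate mean and standard deviation on the training data only."""
    return mean(train_values), pstdev(train_values)


def normalise(x, mu, sigma):
    return (x - mu) / sigma


def denormalise(z, mu, sigma):
    return z * sigma + mu


folds = list(make_folds())
mu, sigma = fit_normaliser([1.0, 2.0, 3.0, 4.0])  # toy values
z = [normalise(x, mu, sigma) for x in [1.0, 2.0, 3.0, 4.0]]
```

Each fold gives 90 training and 10 test sentences, and after ten folds all 100 sentences have been converted as unseen test material. The normalisation statistics are estimated on the training portion and reused to denormalise the network output.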
4. Evaluations and Results
Evaluations involved comparing BLSTMSS outputs to unprocessed OS using three ASR systems, STOI scores and a preference test. In addition, we compared ASR and STOI scores of BLSTMSS with those of our previous systems.
4.1. ASR
We evaluated the outputs of our proposed enrichment system using three ASR systems: the speech-to-text system from Microsoft Azure using the python azure-cognitiveservices-speech library (ASR 1) [42], the Elhuyar speech recognition system (ASR 2) [43] and a Kaldi [36] based system (ASR 3) developed in our laboratory and described in [44]. The input files to these ASR systems were the 100 single channel wav files sampled at 16,000 Hz. The outputs were text files containing the transcriptions.
The reason for using three ASR systems was to have a diverse set of evaluations. ASR 1 is a well known commercial ASR system used world wide and therefore easier for comparisons in future studies elsewhere. ASR 2 is a commercial system built locally in Spain, and therefore, better adapted to the speech style and vocabulary of the speakers involved in this study. ASR 3 is a customised Kaldi based ASR with full control of all the components such as the language model, dictionary etc. We presume that the amount of audio used to train ASR 3 (approximately 5 h of audio) was smaller in comparison to the other two commercial systems. It uses a lexicon limited to the vocabulary of the corpus used in this study. It also uses a unigram language model and was used in our previous studies [31,32]. The advantage of ASR 3 is that it is not prone to updates as is the case for commercial ASRs. This allows us to make fair and accurate comparisons of our ongoing work with our previous work.
We calculated two metrics from the ASR transcriptions: Word Error Rates (WER) and Percentage Words Correct (PWC). For WER, we calculated the Levenshtein distance [45] between the reference sentence (original recording utterance text) and the hypothesis sentence (ASR transcription output) using the Word Error Rate Matlab toolbox [46]. The WER formula is shown in Equation (1). The Levenshtein distance WER takes into account the insertions, deletions, and substitutions that are observed in the transcribed output. Please note that the WER can be more than 100% if the total insertions, substitutions and deletions exceed the total number of words in the reference sentence.
\[ \mathrm{WER} = \frac{\text{Substitutions} + \text{Insertions} + \text{Deletions}}{\text{Total number of words in reference sentence}} \times 100. \qquad (1) \]
PWC is the percentage of words from the reference sentence correctly identified in the transcribed sentence. The PWC formula is shown in Equation (2).
\[ \mathrm{PWC} = \frac{\text{Words correctly identified in transcription}}{\text{Total number of words in reference sentence}} \times 100. \qquad (2) \]
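Equations (1) and (2) can be illustrated with a short Python sketch. This is not the Matlab toolbox cited above: the function names are hypothetical, and counting "words correctly identified" as the matches in a minimum-edit-distance word alignment is an assumption made for the example.

```python
def align_counts(reference, hypothesis):
    """Return (edit distance, matched words, reference length) at word level."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (edit distance, -matches) for ref[:i] versus hyp[:j];
    # taking the tuple minimum prefers fewer edits, then more matches.
    table = [[(j, 0) for j in range(len(hyp) + 1)]]
    for i in range(1, len(ref) + 1):
        prev, row = table[i - 1], [(i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                diag = (prev[j - 1][0], prev[j - 1][1] - 1)   # match
            else:
                diag = (prev[j - 1][0] + 1, prev[j - 1][1])   # substitution
            deletion = (prev[j][0] + 1, prev[j][1])
            insertion = (row[j - 1][0] + 1, row[j - 1][1])
            row.append(min(diag, deletion, insertion))
        table.append(row)
    distance, neg_matches = table[-1][-1]
    return distance, -neg_matches, len(ref)


def wer(reference, hypothesis):
    distance, _, n_ref = align_counts(reference, hypothesis)
    return 100.0 * distance / n_ref


def pwc(reference, hypothesis):
    _, matches, n_ref = align_counts(reference, hypothesis)
    return 100.0 * matches / n_ref


# One substitution and one deletion against a 4-word reference
print(wer("a b c d", "a x c"), pwc("a b c d", "a x c"))  # 50.0 50.0
# Five insertions against a 4-word reference: WER exceeds 100%
print(wer("the cat sat down", "the cat sat down on the mat today please"))  # 125.0
```

The second call shows why WER is not bounded by 100%: every reference word is recognised (PWC of 100%), yet the five inserted words outnumber the four reference words.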
Figures 2–4 show mean WER and PWC scores of the 100 sentences obtained from the transcriptions of ASR 1, 2 and 3, respectively. WER scores were lower (i.e., higher intelligibility) for BLSTMSS compared to unprocessed OS for all ASRs and speakers with 2 exceptions: speaker 04M3 in ASR 1 and speaker 16M3 in ASR 2. In the case of PWC scores, a higher PWC score (i.e., higher intelligibility) was observed for the BLSTMSS samples compared to unprocessed OS samples for all speakers and ASRs.
When comparing the different ASR systems, the best WER and PWC scores for unprocessed OS were obtained by ASR 1, followed by ASR 3 and ASR 2. In addition, there were fewer differences between OS and enriched OS in ASR 1 compared to the other two systems. Amongst all the ASRs, ASR 3 had the best WER and PWC scores for enriched OS.
We did correlation analysis of WERs and PWCs of OS obtained from ASR 1. There was a significant negative correlation (Pearson’s r = −0.959, p = 0.041) between WER and the number of months since laryngectomy and a significant positive correlation (Pearson’s r = 0.952, p = 0.048) between PWC and the number of months since laryngectomy. A similar correlation was observed with speaking rate, but it did not reach significance.
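The reported p-values can be checked against the reported r values if the correlation is assumed to be computed over the four speakers (n = 4, one mean score per speaker), which the text does not state explicitly. With n = 4 the test statistic follows a Student t distribution with 2 degrees of freedom, whose cumulative distribution has a closed form. The sketch below is a consistency check under that assumption, with a hypothetical function name.

```python
import math


def pearson_p_two_tailed_n4(r):
    """Two-tailed p-value for Pearson's r when n = 4 (2 degrees of freedom)."""
    df = 2
    t = r * math.sqrt(df / (1.0 - r * r))
    # Closed-form Student t CDF for df = 2
    cdf = 0.5 + t / (2.0 * math.sqrt(2.0 + t * t))
    return 2.0 * min(cdf, 1.0 - cdf)


p_wer = pearson_p_two_tailed_n4(-0.959)
p_pwc = pearson_p_two_tailed_n4(0.952)
print(round(p_wer, 3), round(p_pwc, 3))  # 0.041 0.048
```

For 2 degrees of freedom the expression simplifies to p = 1 − |r|, so the values 0.041 and 0.048 are consistent with a correlation over four data points.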
In our previous studies [31,32], we worked with speaker 02M3 and ASR 3. Figure 5 shows the WER scores of BLSTMSS in comparison to our previous methods, PPG [32] and BLSTMHS [31]. It can be observed that the proposed system was able to significantly reduce ASR errors in comparison to previous methods.

Figure 2. ASR 1 WER and PWC scores of unprocessed OS (source), the BLSTMSS converted outputs and target SS (target): (a) Word Error Rates; (b) Percentage Words Correct. Error bars show standard errors.
Figure 3. ASR 2 WER and PWC scores of unprocessed OS (source), the BLSTMSS converted outputs and target SS (target): (a) Word Error Rates; (b) Percentage Words Correct. Error bars show standard errors.
Figure 4. ASR 3 WER and PWC scores of unprocessed OS (source), the BLSTMSS converted outputs and target SS (target): (a) Word Error Rates; (b) Percentage Words Correct. Error bars show standard errors.
Figure 5. WER scores of Unprocessed OS, previous systems (PPG and BLSTMHS) and the proposed BLSTMSS system as calculated by ASR 3 for speaker 02M3. Error bars show standard errors.
4.2. STOI Sco es
STOI [47] is an intrusive objective intelligibility measure which is known to be correlated with subjective intelligibility scores of noisy speech. An intrusive intelligibility measurement requires a degraded signal and an aligned reference signal. We calculated STOI for unprocessed OS samples and converted BLSTMSS samples of the four OS speakers using the already aligned duration–matched SS (target signal) as the reference signal. We used the SS as reference because they were the best possible clean aligned signals available. Calculating STOI with aligned healthy laryngeal speech would have resulted in alignment errors and hence, in an inaccurate STOI measurement. The STOI results are shown in Figure 6.
Figure 6. STOI scores of the four OS speakers and the enriched versions. Reference signal for STOI is duration–matched SS. Error bars show standard errors.
We can observe that the STOI scores have improved considerably (at least 15 percentage points) from OS to BLSTMSS for all four speakers. A high STOI score of over 60% was observed for all the BLSTMSS samples with intelligible synthetic speech (>70% ASR accuracy) as reference.
Like the ASR, we compared the STOI scores of the proposed system with those of our previous methods (see Figure 7). The references used to calculate STOI were the same duration–matched SS signals. The proposed system has higher STOI scores (about 5%) compared to previous systems and unprocessed OS.
Figure 7. STOI scores of Unprocessed OS, previous systems (PPG and BLSTMHS) and proposed BLSTMSS system for speaker 02M3. Reference signal for STOI is duration–matched SS. Error bars show standard errors.
5. Subjective Test
While unprocessed OS has several undesired artefacts and lacks a natural fundamental frequency, it is natural speech. On the other hand, although the BLSTMSS outputs are much clearer sounding, they are synthetically produced and may have some limitations because of that. The success of the enrichment depends majorly on whether listeners prefer to listen to the enriched version more than the unprocessed OS. Therefore, we performed a preference test to collect listeners’ opinion on whether they prefer listening to the outputs of the proposed system or the unprocessed OS.
The preference test was a 5-point Comparison Mean Opinion Scores (CMOS) test conducted using a web-based interface (https://aholab.ehu.eus/users/sneha/BLSTMSS_evaluation/preference_test.php (accessed on 25 June 2021)). A web-based test was considered more appropriate owing to COVID restrictions. Participants were sourced by sending emails to speech technology networks in Spain and other local networks. The participants were instructed to perform the test with headphones. They were informed that there are no correct or incorrect answers in the test and that they should state their opinions with full liberty.
The participants listened to 10 pairs of sentences from each of the four speakers, a total of 40 pairs of sentences. Each pair contained one unprocessed OS sentence and the corresponding BLSTMSS output of the same sentence. The chosen 10 pairs were the shortest sentences in the set, as that allowed us to have the maximum number of evaluations while keeping the test under 20 min. The presentation order of all the pairs, as well as the order of the BLSTMSS and OS sentences within each pair, was randomised to avoid order bias. After listening to the two stimuli in each pair, the participants were asked to mark the preferred stimulus. To do so, they were given the following options: ‘Prefiero claramente la primera’ (I clearly prefer the first one), ‘Prefiero la primera’ (I prefer the first one), ‘No percibo diferencia/Ninguna suena mejor’ (I do not perceive any difference/Neither one sounds better), ‘Prefiero la segunda’ (I prefer the second one), ‘Prefiero claramente la segunda’ (I clearly prefer the second one).
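Because the order of the two stimuli within a pair was randomised, a response given in terms of "first" and "second" has to be mapped back to the system it refers to before responses can be pooled. The sketch below shows one way to do this. It is an assumption about the bookkeeping, not the code behind the web interface: the numeric scale from −2 to +2 and the function name are hypothetical, while the five category names are those used in the results.

```python
# Score relative to presentation order: positive means the second stimulus is preferred
OPTION_SCORES = {
    "Prefiero claramente la primera": -2,
    "Prefiero la primera": -1,
    "No percibo diferencia/Ninguna suena mejor": 0,
    "Prefiero la segunda": 1,
    "Prefiero claramente la segunda": 2,
}

# Score relative to the systems: positive means BLSTMSS is preferred
CATEGORIES = {
    -2: "Clear preference for OS",
    -1: "Preference for OS",
    0: "No preference for either",
    1: "Preference for BLSTMSS",
    2: "Clear preference for BLSTMSS",
}


def categorise(option, blstmss_first):
    """Convert a first/second response into a system-level preference category."""
    score = OPTION_SCORES[option]
    if blstmss_first:
        score = -score  # preferring the first stimulus means preferring BLSTMSS
    return CATEGORIES[score]


print(categorise("Prefiero claramente la primera", blstmss_first=True))
print(categorise("Prefiero la segunda", blstmss_first=True))
```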
Apart from the 40 test pairs, there were 4 pairs (presented at regular intervals) where both the samples were the same file, which was a sentence spoken by a healthy speaker. As both the files in these 4 control pairs were the same exact file, we expected the participants to mark the third option (‘I do not perceive any difference/Neither one sounds better’). Only those participants who correctly marked this option for at least 3 of these 4 control pairs were included in the analysis. This ensured reliability of the participants’ responses.
We asked the participants to describe the audio equipment they used during the test. This was to ensure that they were not using any bad equipment. The options were: good headphones, normal headphones, good loudspeakers, normal loudspeakers and bad equipment. We also asked whether the participants had any experience with using speech technologies. The options were: no experience, experts, sporadic users and through perception tests. This was not to study the effect of speech expertise on the evaluations, but to ensure a good mix of all kinds of listeners.
A total of 32 native Spanish participants performed the listening test. Two of them were rejected because they failed the control test. One of the participants described their audio equipment as ‘bad equipment’ and was excluded too. Of the remaining 29 listeners, 16 had no experience with using speech technologies, 5 were speech technology experts, 4 were sporadic users of speech technology and 4 stated that their experience of speech technologies was through perception tests.
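The two exclusion rules (at least 3 of the 4 control pairs answered with the "no difference" option, and no self-reported bad equipment) can be written as a simple filter. The sketch below is illustrative only: the record layout, the function name and the four example participants are hypothetical.

```python
NO_DIFFERENCE = "No percibo diferencia/Ninguna suena mejor"


def is_included(participant, min_correct_controls=3):
    """Apply the control-pair and equipment criteria to one participant record."""
    correct = sum(answer == NO_DIFFERENCE for answer in participant["control_answers"])
    return correct >= min_correct_controls and participant["equipment"] != "bad equipment"


# Hypothetical participant records
participants = [
    {"id": "p1", "equipment": "good headphones",
     "control_answers": [NO_DIFFERENCE] * 4},
    {"id": "p2", "equipment": "normal loudspeakers",
     "control_answers": [NO_DIFFERENCE] * 3 + ["Prefiero la primera"]},
    {"id": "p3", "equipment": "normal headphones",
     "control_answers": [NO_DIFFERENCE] * 2 + ["Prefiero la segunda"] * 2},
    {"id": "p4", "equipment": "bad equipment",
     "control_answers": [NO_DIFFERENCE] * 4},
]
included = [p["id"] for p in participants if is_included(p)]
print(included)  # ['p1', 'p2']
```

Applied to the real responses, these rules reduce the 32 participants to 29: two fail the control criterion and one reports bad equipment.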
Overall, the most chosen option was ‘Preference for BLSTMSS’ as can be observed in Figure 8c. There were more responses in the ‘Clear preference for BLSTMSS’ and ‘Preference for BLSTMSS’ categories compared to ‘Clear preference for OS’ and ‘Preference for OS’ categories, respectively. ‘Clear preference for OS’ was the least chosen option.
When looking at speakers separately, we observed that speaker 16M3 (Figure 8d) has a different trend compared to other speakers. For speakers 02M3 (Figure 8a), 04M3 (Figure 8b) and 25F3 (Figure 8e), the most preferred option was ‘Preference for BLSTMSS’. However, for speaker 16M3, the least intelligible speaker in the dataset, most responses were in the ‘No preference for either’ or the undecided category. The next most preferred option was ‘Preference for BLSTMSS’. Additionally, for the more proficient speakers (25F3 and 02M3), there were fewer instances of the ‘No preference for either’ category compared to the non-proficient speakers.
Figure 8. Histogram plots of the preference scores of the four speakers separately and all together: (a) Speaker 02M3; (b) Speaker 04M3; (c) All Speakers; (d) Speaker 16M3; (e) Speaker 25F3; (f) Categories.