Enrichment of Oesophageal Speech: Voice Conversion with Duration-Matched Synthetic Speech as Target

Author: Raman, Sneha; Sarasola Aramendia, Xabier; Navas Cordón, Eva; Hernáez Rioja, Inmaculada

Publisher: MDPI

Year: 2021

DOI: 10.3390/app11135940

Source: https://addi.ehu.eus/bitstream/10810/52459/1/applsci-11-05940-v2.pdf

Applied Sciences
Article
Enrichment of Oesophageal Speech: Voice Conversion with Duration–Matched Synthetic Speech as Target
Sneha Raman *, Xabier Sarasola, Eva Navas and Inma Hernaez *


Citation: Raman, S.; Sarasola, X.; Navas, E.; Hernaez, I. Enrichment of Oesophageal Speech: Voice Conversion with Duration–Matched Synthetic Speech as Target. Appl. Sci. 2021, 11, 5940. https://doi.org/10.3390/app11135940
Academic Editor: Francesc Alías
Received: 9 April 2021
Accepted: 18 June 2021
Published: 26 June 2021
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
HiTZ-Aholab, University of the Basque Country (UPV/EHU), 48013 Bilbao, Spain; xabier [email protected] (X.S.); [email protected] (E.N.)
* Correspondence: [email protected] (S.R.), [email protected] (I.H.)
Abstract: Pathological speech such as Oesophageal Speech (OS) is difficult to understand due to the presence of undesired artefacts and lack of normal healthy speech characteristics. Modern speech technologies and machine learning enable us to transform pathological speech to improve intelligibility and quality. We have used a neural network based voice conversion method with the aim of improving the intelligibility and reducing the listening effort (LE) of four OS speakers of varying speaking proficiency. The novelty of this method is the use of synthetic speech matched in duration with the source OS as the target, instead of parallel aligned healthy speech. We evaluated the converted samples from this system using a collection of Automatic Speech Recognition systems (ASR), an objective intelligibility metric (STOI) and a subjective test. ASR evaluation shows that the proposed system had significantly better word recognition accuracy compared to unprocessed OS, and baseline systems which used aligned healthy speech as the target. There was an improvement of at least 15% on STOI scores indicating a higher intelligibility for the proposed system compared to unprocessed OS, and a higher target similarity in the proposed system compared to baseline systems. The subjective test reveals a significant preference for the proposed system compared to unprocessed OS for all OS speakers, except one who was the least proficient OS speaker in the data set.
Keywords: pathological speech; voice conversion; intelligibility; speech recognition
1. Introduction
Laryngectomy is the surgical procedure of removing the larynx [1]. In addition to several functional disorders and lifestyle changes [2], this results in the loss of vocal folds and the patient’s pre-surgery speech [3]. One of the several alternative ways that a laryngectomee can communicate [3,4] is to speak using the vibrations of the pharyngoesophageal segment [5], known as Oesophageal Speech (OS). Generating OS introduces acoustic artefacts [6] and makes OS less intelligible [7,8], which affects communication, social activities and quality of life [2,9].
OS is less intelligible and more effortful to listen to compared to healthy speech (HS). This is evident from previous listening experiments [10,11] as well as acoustic characteristics and challenges of OS [12]. Prolonged exposure to effortful speech causes fatigue in listeners [13]. Therefore, there is a strong motivation to make OS more intelligible and pleasant to listen to. We aim to enrich OS by closing the OS-HS gaps in intelligibility, quality and listening effort (LE).
Modern speech technologies and machine learning have great potential for use in the healthcare sector, be it for improvement of healthcare services [14] or to aid patients with speech impairments [15]. One such application is transforming pathological speech with the aim of making it more intelligible, pleasant and easier to process. This can reduce the load on the listeners and improve communication for people with speech pathologies.
One of the possible approaches to enrich OS is to use a voice conversion (VC) system. The goal of a VC system is to convert the utterances of a source speaker to sound like those of a target speaker [16]. In the OS enrichment context, utterances of an OS speaker can be mapped to those of a healthy speaker, thereby having OS acquire characteristics of HS.
Some OS enrichment has been done using statistical VC methods such as Gaussian Mixture models (GMMs) [17–19]. In these methods, OS and HS are modelled by a linear combination of Gaussian distributions. In the training process, the Gaussian distributions of OS are mapped to those of HS. The output of such a training session is a conversion function mapping OS to HS. This conversion function can then be used to convert new OS samples, thereby getting OS speech that has characteristics of HS. In recent times, Deep Neural Networks (DNN) are more popular and effective compared to GMM based methods for enhancement of alaryngeal speech [20–23] and other types of pathological speech [24,25]. Another attempt to enrich OS was by using the eigenvoices concept [26], which was inspired by the eigenfaces concept [27]. Some studies have used filtering approaches [28], formant synthesis [29] and increasing the harmonics to noise ratio of OS [30].
Like our previous approaches [31,32], the proposed method is also based on VC. The bidirectional long short-term memory (BLSTM) based transformation [31] had better Automatic Speech Recognition (ASR) scores compared to OS. The method used by Serrano et al. [32] was inspired by a Phonetic Posteriorgrams (PPG) based system [33] which had good results for HS-HS VC. When applied to OS-HS VC, there was no improvement in ASR. Mel Cepstral Distortion (MCD) was reduced by both systems. Unprocessed OS was preferred over both of the systems in preference tests.
VC systems may be parallel (requiring temporally aligned source–target utterance pairs) or non-parallel (requiring many hours of speech data). Due to data limitations (100 sentences per speaker), parallel VC is best suited for our purposes. A parallel VC requires the parallel source and target sentences to be aligned for training. This is primarily done by Dynamic Time Warping (DTW) alignment, which finds an optimal match based on similarities in the two sequences.
Helander et al. [34] describe some challenges of DTW in the context of VC. One of them is the presence of silences or extra sounds in the source and not in the target. Another one is the poor estimation of end points of silences and phonemes. A third case is the many-to-one and one-to-many nature of the DTW mapping. For example, if the source contains a phoneme with a longer duration compared to the target, then a single frame of the target may be mapped to several frames of the source. OS has undesired silences and artefacts and longer and varying durations of phonemes. These qualities make DTW challenging in the OS-HS VC task.
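The many-to-one behaviour of DTW can be seen in a toy example. The following is a minimal sketch, not the alignment code used in the cited work: it assumes one-dimensional "features" and an absolute-difference local cost, and the function name `dtw_path` and the sample values are illustrative only.

```python
def dtw_path(source, target):
    """Return the optimal DTW alignment path between two 1-D sequences."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(source[i - 1] - target[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat target frame
                                     cost[i][j - 1],      # repeat source frame
                                     cost[i - 1][j - 1])  # advance both
    # Backtrack from the end to recover the (source, target) frame pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return path


# Toy data (illustrative): the source holds its middle "phone" for three
# frames, the target for only one.
source = [1.0, 5.0, 5.0, 5.0, 2.0]
target = [1.0, 5.0, 2.0]
path = dtw_path(source, target)
print(path)
```

In the resulting path, target frame 1 is paired with source frames 1, 2 and 3, which is the one-to-many mapping described above.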
As a workaround, in our previous attempt [31], we performed alignment at two stages: first aligning the phone boundaries and then applying DTW, anchoring the phone boundaries. In this paper, we took advantage of the available phone labels and the possibility of generating synthetic speech (SS) with explicit phone durations. This resulted in SS that matches in duration with the source OS utterances, and would be a perfectly aligned target. This eliminated the need for DTW and its limitations. We hypothesise that this DTW-free VC would improve the intelligibility and quality of the enriched OS compared to our previous methods which required source–target alignment.
A robust enrichment system should ideally work with OS speakers of varying speaking proficiency. Therefore, we performed enrichments of OS speakers ranging from very low to very high intelligibility. As the enrichment system is built to improve communications of the OS speaker, it is important that the output of the enrichment system is preferred by listeners over the unprocessed OS. Moreover, given that voice interactions with machines are becoming more and more common, the enriched outputs should be intelligible to machines. Taking these points into consideration, we evaluated the subjective preference of the enriched system amongst human listeners as well as objective intelligibility and ASR performance.
To sum up, we present a novel, DTW-free, parallel VC system for OS enrichment which includes an SS target. We evaluate its outputs for ASR performance, an objective intelligibility metric and a preference test in comparison with unprocessed OS.
2. Data
We chose four OS speakers (3 male, 1 female) with a wide range of intelligibility from an OS database that contains over 30 OS speakers [12]. In the original database, the four speakers were identified as ‘02M3’, ‘04M3’, ‘16M3’ and ‘25F3’, and we continue to use these IDs. Some details of the four speakers such as age, sex, time passed since the laryngectomy operation, stimulus duration and speaking rates are presented in Table 1 [35]. The age and the time since laryngectomy were collected on the day of the recording. Speakers 04M3 and 16M3 are relatively recent laryngectomees and hence, are less proficient than the other two speakers.
For each speaker, we used a parallel dataset of 100 phonetically-balanced Spanish sentences (described in detail in [12]). The sentences were syntactically and semantically predictable but had some low frequency words. The number of words in each sentence ranged between 9 and 18 words (mean = 13.19, SD = 3.66).
Table 1. Speaker characteristics.
| Speaker ID | Sex | Age | Time since Laryngectomy | Duration Per Stimulus, Mean ± SD (Seconds) | Speaking Rate, Mean ± SD (Syllables Per Second) |
|---|---|---|---|---|---|
| 02M3 | Male | 75 years 5 months | 8 years 1 month | 7.48 ± 1.67 | 4.32 ± 1.80 |
| 04M3 | Male | 59 years 4 months | 1 year 7 months | 9.27 ± 2.36 | 3.84 ± 1.71 |
| 16M3 | Male | 66 years 4 months | 1 year 10 months | 12.52 ± 3.61 | 2.59 ± 1.19 |
| 25F3 | Female | 59 years 3 months | 11 years 11 months | 7.85 ± 2.02 | 4.24 ± 1.86 |
3. Proposed VC System
The proposed VC system, BLSTM with SS as target (BLSTMSS), is a DNN based system with OS as source and SS with matching durations as target (see Figure 1). The procedure is described in detail in the following steps.
Figure 1. The proposed OS-HS VC system: BLSTMSS.
3.1. Labelling of Oesophageal Speech
Segmentation and labelling of OS is a tricky process owing to undesired artefacts, incorrect pronunciations of some consonants and unstable fundamental frequency. The forced alignment feature built into generic Spanish ASR systems such as Kaldi [36] was unsuitable for OS. Therefore, using the Montreal Forced Alignment tool [37], and with the aid of a manually labelled set of one speaker (speaker 02M3), new models were created by using OS as the training material. Performing segmentation with this forced aligner gave us the phone labels and their durations of the source OS utterances.
3.2. Generating Target Synthetic Speech
Using the labels, their durations and the utterance text, SS was generated by explicitly assigning these durations to the phones. We used an HMM based text-to-speech system [38] which was originally developed for the Basque language. The Spanish version is described in [39]. This process gave us equal-sized frame-by-frame aligned pairs of OS and SS.
Due to constant swallowing of air to produce speech, OS contains several pauses with artefacts within utterances. During the SS generation, these pauses were replaced with silences.
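The duration-matching idea can be illustrated with a short sketch that expands a phone segmentation into one label per frame, so that source and target end up with the same number of frames by construction. This is an assumption-laden illustration rather than the authors' pipeline: the segment format `(label, start_ms, end_ms)`, the labels `pau` and `sil`, and the example boundaries are hypothetical, and the 5 ms frame shift is the one used for parameter extraction in Section 3.3.

```python
FRAME_SHIFT_MS = 5  # frame shift used for parameter extraction


def phones_to_frame_labels(phones, pause_labels=("pau",), silence_label="sil"):
    """Expand (label, start_ms, end_ms) segments into one label per frame.

    Pause segments in the source are relabelled as silence, mirroring the
    replacement of OS pauses by silences in the synthetic target.
    """
    frames = []
    for label, start_ms, end_ms in phones:
        if label in pause_labels:
            label = silence_label
        n_frames = (end_ms - start_ms) // FRAME_SHIFT_MS
        frames.extend([label] * n_frames)
    return frames


# Hypothetical segmentation of a short OS fragment (times in milliseconds).
segments = [("o", 0, 120), ("pau", 120, 300), ("l", 300, 365), ("a", 365, 500)]
labels = phones_to_frame_labels(segments)
print(len(labels), labels.count("sil"))
```

Because the target is synthesised from the same per-phone frame counts, no warping step is needed to pair source and target frames.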
3.3. Voice Conversion Neural Network
Voice conversion was performed with the VC recipe of the Merlin toolkit [40]. Parameterisation and resynthesis was done using the WORLD Vocoder [41]. The extracted parameters included 60 Mel Cepstral Coefficients (MCCs), 1 excitation parameter (log F0), 1 Band Aperiodicity Parameter (BAP), the deltas and the delta deltas of the MCCs, log F0 and BAP and a voiced/unvoiced binary parameter. In all, there were 187 parameters extracted every 5 milliseconds.
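The figure of 187 parameters follows from the list above: (60 + 1 + 1) static parameters, each with a delta and a delta-delta, plus one voiced/unvoiced flag. A small sketch of that arithmetic is given below; the names are illustrative, and the frame-count helper simply assumes one frame per 5 ms.

```python
# Static parameters per frame, as listed in the text.
STATIC = {"mcc": 60, "log_f0": 1, "bap": 1}
STREAMS_PER_PARAM = 3  # static, delta and delta-delta
VUV = 1                # voiced/unvoiced binary parameter


def frame_dimension(static=STATIC, streams=STREAMS_PER_PARAM, vuv=VUV):
    """Number of parameters in one frame."""
    return sum(static.values()) * streams + vuv


def n_frames(duration_s, shift_ms=5):
    """Approximate number of frames for an utterance of the given duration."""
    return int(round(duration_s * 1000 / shift_ms))


print(frame_dimension())  # 62 * 3 + 1
print(n_frames(7.48))     # mean stimulus duration of speaker 02M3 in Table 1
```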
Matrices of size 187 × (number of 5 ms frames) of the OS and SS utterances were the source and target inputs, respectively. We split the 100 source–target pairs into 90 train and 10 test pairs. As the source and the target had the same number of frames, we skipped the alignment step in the training process. The train parameters were normalised to 0 mean and unit variance and then fed into a 4 layered BLSTM (4 × 1024) training network. After training, the source test utterance parameters were converted using the trained model. A denormalisation of the mean and the variance was applied to the output parameters, followed by a Maximum Likelihood Parameter Generation using the variances from the training data. The resulting converted parameters were fed into the vocoder to synthesise the converted speech. A cross validation was performed 10 times, so that all the 100 sentences were available as test sentences.
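The data handling described above (a 90/10 split repeated 10 times, and normalisation to zero mean and unit variance with the statistics reapplied after conversion) can be sketched as follows. This is a simplified illustration, not the Merlin recipe: it assumes contiguous folds, which the text does not specify, and it normalises a single scalar parameter stream rather than full 187-dimensional frames.

```python
def ten_fold_splits(n_items=100, n_folds=10):
    """Yield (train_ids, test_ids) so that every item is tested exactly once.

    Contiguous folds are an assumption made for this sketch.
    """
    fold_size = n_items // n_folds
    for k in range(n_folds):
        test_ids = list(range(k * fold_size, (k + 1) * fold_size))
        held_out = set(test_ids)
        train_ids = [i for i in range(n_items) if i not in held_out]
        yield train_ids, test_ids


def fit_stats(values):
    """Mean and standard deviation estimated on training data only."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0  # guard against a constant stream
    return mean, std


def normalise(values, mean, std):
    return [(v - mean) / std for v in values]


def denormalise(values, mean, std):
    return [v * std + mean for v in values]


for train_ids, test_ids in ten_fold_splits():
    print(len(train_ids), len(test_ids), test_ids[0], test_ids[-1])
```

Estimating the statistics on the training portion and reusing them on the converted output keeps the test sentences unseen in every fold.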
4. Evaluations and Results
Evaluations involved comparing BLSTMSS outputs to unprocessed OS using three ASR systems, STOI scores and a preference test. In addition, we compared ASR and STOI scores of BLSTMSS with those of our previous systems.
4.1. ASR
We evaluated the outputs of our proposed enrichment system using three ASR systems: the speech-to-text system from Microsoft Azure using the python azure-cognitiveservices-speech library (ASR 1) [42], the Elhuyar speech recognition system (ASR 2) [43] and a Kaldi [36] based system (ASR 3) developed in our laboratory and described in [44]. The input files to these ASR systems were the 100 single channel wav files sampled at 16,000 Hz. The outputs were text files containing the transcriptions.
The reason for using three ASR systems was to have a diverse set of evaluations. ASR 1 is a well known commercial ASR system used world wide and therefore easier for comparisons in future studies elsewhere. ASR 2 is a commercial system built locally in Spain, and therefore, better adapted to the speech style and vocabulary of the speakers involved in this study. ASR 3 is a customised Kaldi based ASR with full control of all the components such as the language model, dictionary etc. We presume that the amount of audio used to train ASR 3 (approximately 5 h of audio) was smaller in comparison to the other two commercial systems. It uses a lexicon limited to the vocabulary of the corpus used in this study. It also uses a unigram language model and was used in our previous studies [31,32]. The advantage of ASR 3 is that it is not prone to updates as is the case of commercial ASRs. This allows us to make fair and accurate comparisons of our ongoing work with our previous work.
We calculated two metrics from the ASR transcriptions: Word Error Rates (WER) and Percentage Words Correct (PWC). For WER, we calculated the Levenshtein distance [45] between the reference sentence (original recording utterance text) and the hypothesis sentence (ASR transcription output) using the Word Error Rate Matlab toolbox [46]. The WER formula is shown in Equation (1). The Levenshtein distance WER takes into account the insertions, deletions, and substitutions that are observed in the transcribed output. Please note that the WER can be more than 100% if the total insertions, substitutions and deletions exceed the total number of words in the reference sentence.
$$\mathrm{WER} = \frac{\text{Substitutions} + \text{Insertions} + \text{Deletions}}{\text{Total number of words in reference sentence}} \times 100. \tag{1}$$
PWC is the percentage of words from the reference sentence correctly identified in the transcribed sentence. The PWC formula is shown in Equation (2).
$$\mathrm{PWC} = \frac{\text{Words correctly identified in transcription}}{\text{Total number of words in reference sentence}} \times 100. \tag{2}$$
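Equations (1) and (2) can be sketched in a few lines of Python. This is not the Matlab toolbox cited above: it is a minimal word-level Levenshtein alignment, and it assumes that "words correctly identified" means the matched words (hits) in that alignment. The example sentences are made up for illustration.

```python
def align_counts(reference, hypothesis):
    """Count hits, substitutions, insertions and deletions at word level."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # dist[i][j]: edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(diag, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace to classify each edit operation.
    counts = {"hits": 0, "sub": 0, "ins": 0, "del": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            counts["hits" if ref[i - 1] == hyp[j - 1] else "sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["del"] += 1
            i -= 1
        else:
            counts["ins"] += 1
            j -= 1
    return counts, n


def wer(reference, hypothesis):
    c, n_ref = align_counts(reference, hypothesis)
    return 100.0 * (c["sub"] + c["ins"] + c["del"]) / n_ref


def pwc(reference, hypothesis):
    c, n_ref = align_counts(reference, hypothesis)
    return 100.0 * c["hits"] / n_ref


# Made-up example: one substitution and two insertions against 4 reference words.
print(wer("el perro come pan", "el gato come pan duro hoy"))
print(pwc("el perro come pan", "el gato come pan duro hoy"))
# WER above 100%: the edits outnumber the words in the reference.
print(wer("hola", "no se que"))
```

The last call illustrates why WER is not bounded by 100%, whereas PWC always lies between 0 and 100.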
Figures 2–4 show mean WER and PWC scores of the 100 sentences obtained from the transcriptions of ASR 1, 2 and 3, respectively. WER scores were lower (i.e., higher intelligibility) for BLSTMSS compared to unprocessed OS for all ASRs and speakers with 2 exceptions: speaker 04M3 in ASR 1 and speaker 16M3 in ASR 2. In the case of PWC scores, a higher PWC score (i.e., higher intelligibility) was observed for the BLSTMSS samples compared to unprocessed OS samples for all speakers and ASRs.
When comparing the different ASR systems, the best WER and PWC scores for unprocessed OS were obtained by ASR 1, followed by ASR 3 and ASR 2. In addition, there were fewer differences between OS and enriched OS in ASR 1 compared to the other two systems. Amongst all the ASRs, ASR 3 had the best WER and PWC scores for enriched OS.
We did correlation analysis of WERs and PWCs of OS obtained from ASR 1. There was a significant negative correlation (Pearson’s r = −0.959, p = 0.041) between WER and the number of months since laryngectomy and a significant positive correlation (Pearson’s r = 0.952, p = 0.048) between PWC and the number of months since laryngectomy. A similar correlation was observed with speaking rate, but it did not reach significance.
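For readers who want to reproduce this kind of analysis, the sketch below computes Pearson's r between months since laryngectomy and a per-speaker score. The months are derived from Table 1; the WER values are placeholders invented for this example (the per-speaker means are only reported graphically in Figure 2), so the printed coefficient is not the one reported above. The p-value computation is omitted.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def to_months(years, months):
    return 12 * years + months


# Time since laryngectomy from Table 1, for 02M3, 04M3, 16M3 and 25F3.
months_since = [to_months(8, 1), to_months(1, 7), to_months(1, 10), to_months(11, 11)]

# PLACEHOLDER mean WER values (%), invented for illustration only.
wer_placeholder = [30.0, 60.0, 65.0, 20.0]

r = pearson_r(months_since, wer_placeholder)
print(months_since, round(r, 3))
```

With only four speakers, a correlation needs to be very strong to reach significance, which is consistent with the speaking-rate result reported above.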
In our previous studies [31,32], we worked with speaker 02M3 and ASR 3. Figure 5 shows the WER scores of BLSTMSS in comparison to our previous methods, PPG [32] and BLSTMHS [31]. It can be observed that the proposed system was able to significantly reduce ASR errors in comparison to previous methods.

Figure 2. ASR 1 WER and PWC scores of unprocessed OS (source), the BLSTMSS converted outputs and target SS (target). Panels: (a) Word Error Rates; (b) Percentage Words Correct. Error bars show standard errors.
Figure 3. ASR 2 WER and PWC scores of unprocessed OS (source), the BLSTMSS converted outputs and target SS (target). Panels: (a) Word Error Rates; (b) Percentage Words Correct. Error bars show standard errors.
Figure 4. ASR 3 WER and PWC scores of unprocessed OS (source), the BLSTMSS converted outputs and target SS (target). Panels: (a) Word Error Rates; (b) Percentage Words Correct. Error bars show standard errors.
Figure 5. WER scores of Unprocessed OS, previous systems (PPG and BLSTMHS) and the proposed BLSTMSS system as calculated by ASR 3 for speaker 02M3. Error bars show standard errors.
4.2. STOI Scores
STOI [47] is an intrusive objective intelligibility measure which is known to be correlated with subjective intelligibility scores of noisy speech. An intrusive intelligibility measurement requires a degraded signal and an aligned reference signal. We calculated STOI for unprocessed OS samples and converted BLSTMSS samples of the four OS speakers using the already aligned duration–matched SS (target signal) as the reference signal. We used the SS as reference because they were the best possible clean aligned signals available. Calculating STOI with aligned healthy laryngeal speech would have resulted in alignment errors and hence, in an inaccurate STOI measurement. The STOI results are shown in Figure 6.
Figure 6. STOI scores of the four OS speakers and the enriched versions. Reference signal for STOI is duration–matched SS. Error bars show standard errors.
We can observe that the STOI scores have improved considerably (at least 15 percentage points) from OS to BLSTMSS for all four speakers. A high STOI score of over 60% was observed for all the BLSTMSS samples with intelligible synthetic speech (>70% ASR accuracy) as reference.
Like the ASR, we compared the STOI scores of the proposed system with those of our previous methods (see Figure 7). The references used to calculate STOI were the same duration–matched SS signals. The proposed system has higher STOI scores (about 5%) compared to previous systems and unprocessed OS.
Figure 7. STOI scores of Unprocessed OS, previous systems (PPG and BLSTMHS) and proposed BLSTMSS system for speaker 02M3. Reference signal for STOI is duration–matched SS. Error bars show standard errors.
5. Subjective Test
While unprocessed OS has several undesired artefacts and lacks a natural fundamental frequency, it is natural speech. On the other hand, although the BLSTMSS outputs are much clearer sounding, they are synthetically produced and may have some limitations because of that. The success of the enrichment depends majorly on whether listeners prefer to listen to the enriched version more than the unprocessed OS. Therefore, we performed a preference test to collect listeners’ opinion on whether they prefer listening to the outputs of the proposed system or the unprocessed OS.
The preference test was a 5-point Comparison Mean Opinion Scores (CMOS) test conducted using a web-based interface (https://aholab.ehu.eus/users/sneha/BLSTMSS_evaluation/preference_test.php (accessed on 25 June 2021)). A web-based test was considered more appropriate owing to COVID restrictions. Participants were sourced by sending emails to speech technology networks in Spain and other local networks. The participants were instructed to perform the test with headphones. They were informed that there are no correct or incorrect answers in the test and that they should state their opinions with full liberty.
The participants listened to 10 pairs of sentences from each of the four speakers, a total of 40 pairs of sentences. Each pair contained one unprocessed OS sentence and the corresponding BLSTMSS output of the same sentence. The chosen 10 pairs were the shortest sentences in the set, as that allowed us to have the maximum number of evaluations while keeping the test under 20 min. The presentation order of all the pairs, as well as the order of the BLSTMSS and OS sentences within each pair was randomised to avoid order bias. After listening to the two stimuli in each pair, the participants were asked to mark the preferred stimulus. To do so, they were given the following options: ‘Prefiero claramente la primera’ (I clearly prefer the first one), ‘Prefiero la primera’ (I prefer the first one), ‘No percibo diferencia/Ninguna suena mejor’ (I do not perceive any difference/Neither one sounds better), ‘Prefiero la segunda’ (I prefer the second one), ‘Prefiero claramente la segunda’ (I clearly prefer the second one).
Apart from the 40 test pairs, there were 4 pairs (presented at regular intervals) where both the samples were the same file, which was a sentence spoken by a healthy speaker. As both the files in these 4 control pairs were the same exact file, we expected the participants to mark the third option (‘I do not perceive any difference/Neither one sounds better’). Only those participants who correctly marked this option for at least 3 of these 4 control pairs, were included in the analysis. This ensured reliability of the participants’ responses.
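The screening rule and the handling of the randomised within-pair order can be sketched as follows. The numeric scale (−2 to +2), the sign convention (positive means preference for BLSTMSS) and all function names are assumptions made for this illustration; the text specifies only the five response options and the "at least 3 of 4" control criterion.

```python
NO_DIFFERENCE = "No percibo diferencia/Ninguna suena mejor"

# Position-based scores: negative favours the first stimulus (assumed scale).
POSITION_SCORES = {
    "Prefiero claramente la primera": -2,
    "Prefiero la primera": -1,
    NO_DIFFERENCE: 0,
    "Prefiero la segunda": 1,
    "Prefiero claramente la segunda": 2,
}


def passes_control(control_answers, min_correct=3):
    """True if the listener marked 'no difference' on enough control pairs."""
    correct = sum(answer == NO_DIFFERENCE for answer in control_answers)
    return correct >= min_correct


def system_score(answer, blstmss_is_second):
    """Map a position-based answer to a system-based score.

    Positive values mean BLSTMSS was preferred, whichever position it had
    in the randomised pair.
    """
    score = POSITION_SCORES[answer]
    return score if blstmss_is_second else -score


# Hypothetical listener: three of four control pairs answered correctly.
controls = [NO_DIFFERENCE, NO_DIFFERENCE, "Prefiero la primera", NO_DIFFERENCE]
print(passes_control(controls))
print(system_score("Prefiero claramente la primera", blstmss_is_second=False))
```

Converting answers to a system-based score before aggregation is what makes the randomised presentation order harmless when the responses are pooled.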
We asked the participants to describe the audio equipment they used during the test. This was to ensure that they were not using any bad equipment. The options were: good headphones, normal headphones, good loudspeakers, normal loudspeakers and bad equipment. We also asked whether the participants had any experience with using speech technologies. The options were: no experience, experts, sporadic users and through perception tests. This was not to study the effect of speech expertise on the evaluations, but to ensure a good mix of all kinds of listeners.
A total of 32 native Spanish participants performed the listening test. Two of them were rejected because they failed the control test. One of the participants described their audio equipment as ‘bad equipment’ and was excluded too. Of the remaining 29 listeners, 16 had no experience with using speech technologies, 5 were speech technology experts, 4 were sporadic users of speech technology and 4 stated that their experience of speech technologies was through perception tests.
Overall, the most chosen option was ‘Preference for BLSTMSS’ as can be observed in Figure 8c. There were more responses in the ‘Clear preference for BLSTMSS’ and ‘Preference for BLSTMSS’ categories compared to ‘Clear preference for OS’ and ‘Preference for OS’ categories, respectively. ‘Clear preference for OS’ was the least chosen option.
When looking at speakers separately, we observed that speaker 16M3 (Figure 8d) has a different trend compared to other speakers. For speakers 02M3 (Figure 8a), 04M3 (Figure 8b) and 25F3 (Figure 8e), the most preferred option was ‘Preference for BLSTMSS’. However, for speaker 16M3, the least intelligible speaker in the dataset, most responses were in the ‘No preference for either’ or the undecided category. The next most preferred option was ‘Preference for BLSTMSS’. Additionally, for the more proficient speakers (25F3 and 02M3), there were fewer instances of the ‘No preference for either’ category compared to the non-proficient speakers.
Figure 8. Histogram plots of the preference scores of the four speakers separately and all together. Panels: (a) Speaker 02M3; (b) Speaker 04M3; (c) All Speakers; (d) Speaker 16M3; (e) Speaker 25F3; (f) Categories.
