Oesophageal Speech: Enrichment and Evaluations
Doctoral thesis presented by Sneha Raman within the Language Analysis and Processing programme under the supervision of Prof. Inmaculada Hernáez Rioja and Dr. Eva Navas
Sneha Raman
Doctor of Philosophy
University of the Basque Country (UPV/EHU)
15 November 2021
(cc)2022 SNEHA RAMAN (cc by-nc-sa 4.0)
Declaration
I declare that this thesis was composed by myself and that the work contained therein is my own, except where explicitly stated otherwise in the text.
(Sneha Raman)
Dedicated to my late father for showing me through words and actions what unconditional support really means. You were one of a kind. May more people have fathers like you.
Acknowledgements
A PhD came to me like an unexpected visitor. I was not prepared for it, nor was I seeking it. However, it has left a very profound impact on my life, unlike anything else I have experienced before. Through this process I have gained many life skills and lessons. To manage and take care of this unexpected visitor, I was also sent a host of people and organisations to whom I express my sincerest gratitude. With your presence and participation, you have elevated my doctoral experience.
First and foremost, I thank you Prof. Inma Hernaez for taking me under your wing and teaching me to fly, even at times when I was not willing to. Your direction and dedication have made me a more confident person and researcher. Thank you for pushing me whenever I have been lazy, and for patting my back whenever I have done well. Long lab lunches, being stranded at airports and a celebration with scotch whisky after a rejected conference paper are all incidents I will remember vividly for years. They have surely helped in extending our relationship outside of work and in adding some sense of humour to an otherwise very serious profession.
I thank you Dr. Eva Navas for your invaluable contributions towards my PhD. The one-semester course you gave us early on in the PhD was one of my finest experiences as a student. Your efficient and crystal clear advice has helped me tackle many conflicts and difficult situations throughout the PhD. I have always admired, and will always admire, the elegance with which you carry yourself and your work.
Thank you Dr. Axel Winneke and Dr. Clara Martin for giving me your knowledge and expertise in neuroimaging, which has improved my research tremendously. Working with you has opened my eyes to a whole new world of research and research methodologies.
I thank my colleagues Xabier Sarasola, Luis Serrano, Itxasne Diez, Ibon Saratxaga, Jon Sanchez, David Tavarez, Victor Garcia, Inge Salomons and Eder del Blanco for being the salt and pepper of my PhD. Without you guys this PhD would not have been so interesting and balanced. Thank you Xabier and Luis for some very important contributions to my PhD. Some of your ideas have been strong foundations of my research. Ibon, your effervescent personality and readiness to help with all my roadblocks are much appreciated. Jon, thanks for being the go-to person for any technical difficulties and ensuring that the office pantry is up to date! Itxasne, you have advised and helped me like a big sister. You never lose a chance to have a good big laugh, a reminder to all of us to take things with a good spirit! Inge, although you appeared in my life in the final stages of my PhD, your impact has been huge. Thanks for your wise company and for making the thesis writing stage more enjoyable.
Thank you Peio for being a strong pillar throughout this process and for keeping me physically and mentally healthy. You declared me a doctor long before I finished my PhD and brought me back into the game whenever I had thoughts of quitting. You believed that I would finish this more than I did myself. You pushed me to achieve goals whether it was a journal paper, 50 burpees or climbing the last hundred steps to reach the mountain top. This PhD is a result of the endurance I built with you. Thanks also to Garbiñe and Javi, your lovely parents who gave me so much love and affection and rushed to help me whenever I needed it.
A big thank you to the ENRICH family for providing me the warmth and nourishment in this journey. All the regular update meetings and exchange of ideas helped me be on track and organised and actually finish this PhD. I will fondly remember the getting together and venting of frustrations, fun outings and fancy dinners that I shared with the ENRICH members.
Amidst all the gains, I experienced a big loss too. I lost my father who was my mentor and advisor and the person from whom I have received the most affection. But he has left me with all of his positive energy which has helped me finish this thesis and will help me in my future endeavours too. I want to thank my sister Shreya for coming into my life and for being by my side to experience all the fun and tragic moments of life. Life would be so bland without you. And finally, but most importantly, I want to thank my mother for being my constant cheerleader whether I realised it or not. Thank you for all the love.
Abstract
After a laryngectomy (i.e. removal of the larynx) a patient can no longer speak in a healthy laryngeal voice. Therefore, they need to adopt alternative methods of speaking such as oesophageal speech. In this method, speech is produced using swallowed air and the vibrations of the pharyngo-oesophageal segment, which introduces several undesired artefacts and an abnormal fundamental frequency. This makes oesophageal speech difficult to process compared to healthy speech, in terms of both auditory processing and signal processing. The aim of this thesis is to find solutions to make oesophageal speech signals easier to process, and to evaluate these solutions by exploring a wide range of evaluation metrics.
First, some preliminary studies were performed to compare oesophageal speech and healthy speech. This revealed significantly lower intelligibility and higher listening effort for oesophageal speech compared to healthy speech. Intelligibility scores were comparable for familiar and non-familiar listeners of oesophageal speech. However, listeners familiar with oesophageal speech reported less effort compared to non-familiar listeners. In another experiment, oesophageal speech was reported to require more listening effort than healthy speech even though its intelligibility was comparable to that of healthy speech. On investigating neural correlates of listening effort (i.e. alpha power) using electroencephalography, a higher alpha power was observed for oesophageal speech compared to healthy speech, indicating higher listening effort. Additionally, participants with poorer cognitive abilities (i.e. working memory capacity) showed higher alpha power.
Next, using several algorithms (preexisting as well as novel approaches), oesophageal speech was transformed with the aim of making it more intelligible and less effortful. The novel approach consisted of a deep neural network based voice conversion system where the source was oesophageal speech and the target was synthetic speech matched in duration with the source oesophageal speech. This helped in eliminating the source-target alignment process, which is particularly prone to errors for disordered speech such as oesophageal speech. Both speaker dependent and speaker independent versions of this system were implemented. The outputs of the speaker dependent system had better short term objective intelligibility scores, automatic speech recognition performance and listener preference scores compared to unprocessed oesophageal speech. The speaker independent system showed improvement in short term objective intelligibility scores but not in automatic speech recognition performance. Some other signal transformations were also performed to enhance oesophageal speech. These included removal of undesired artefacts and methods to improve fundamental frequency. Out of these methods, only removal of undesired silences had success to some degree (an improvement of 1.44 percentage points in automatic speech recognition performance), and that too only for low intelligibility oesophageal speech.
Lastly, the outputs of these transformations were evaluated and compared with previous systems using an ensemble of evaluation metrics such as short term objective intelligibility, automatic speech recognition, subjective listening tests and neural measures obtained using electroencephalography. Results reveal that the proposed neural network based system outperformed previous systems in improving the objective intelligibility and automatic speech recognition performance of oesophageal speech. In the case of subjective evaluations, the results were mixed: some improvement in preference scores, but no improvement in speech intelligibility and listening effort scores. Overall, the results demonstrate several possibilities and new paths to enrich oesophageal speech using modern machine learning algorithms. The outcomes would be beneficial to the disordered speech community.
Resumen
After a laryngectomy (that is, removal of the larynx), the patient can no longer speak with a healthy laryngeal voice. These patients must therefore adopt alternative methods of speaking, such as oesophageal speech. In this method, speech is produced using swallowed air and the vibrations of the pharyngo-oesophageal segment, which introduces unwanted noises and effects and produces an abnormal fundamental frequency. This makes oesophageal speech harder to process than healthy speech, in terms of both auditory processing and automatic signal processing. The aim of this thesis is to find solutions that make oesophageal speech signals easier to process for both humans and computers, and to evaluate these solutions by exploring a wide range of evaluation metrics in order to find the one best suited to this problem.
First, some preliminary studies were carried out to compare oesophageal speech and healthy speech. This revealed significantly lower intelligibility and greater listening effort for oesophageal speech compared to healthy speech. Intelligibility scores were comparable for listeners familiar and not familiar with oesophageal speech. However, listeners familiar with oesophageal speech reported less effort than non-familiar listeners. In another experiment, it was concluded that oesophageal speech demands greater listening effort than healthy speech, even for speakers with a comparable level of intelligibility. On investigating the neural correlates of listening effort (that is, alpha band power) using electroencephalography, a higher alpha power was observed for oesophageal speech compared to healthy speech, which indicates greater listening effort. In addition, participants with poorer cognitive abilities (that is, lower working memory capacity) showed higher power values in this alpha band.
Next, using several algorithms (with both preexisting and novel approaches), oesophageal speech was transformed with the aim of making it more intelligible and less demanding in listening effort. The novel approach consisted of a system for the conversion of
8.7 Other Activities and Achievements
8.7.1 Awards
8.7.2 Workshop and Conference Attendances
8.7.3 Research Visits
8.7.4 Public Engagement
A 30 Sentences Used in the Experiment in Chapter 4
B EEG Data Terminology and Procedures
B.1 EEG Recording Equipment
B.1.1 Cap and Electrodes
B.1.2 Gel
B.1.3 Amplifier
B.1.4 Recording software
B.2 Synchronisation
B.3 EEG Recording Process
B.4 Raw EEG Data
B.5 Cleaning Up Raw EEG Data
B.5.1 Filtering
B.5.2 Independent Component Analysis
B.5.3 Epoching
B.6 Data of Interest in Clean EEG Data
C Contents of the Passage Corpus Described in Section 3.3.2
C.1 Passage 1
C.2 Passage 2
C.3 Passage 3
C.4 Passage 4
C.5 Passage 5
D Contents of the Words Corpus Described in Section 3.3.1
E Resumen
E.1 Description of the problem
E.2 Data collection and preparation
E.3 Preliminary evaluation of the oesophageal speech data
E.4 Enrichment
E.5 Evaluation of the enriched oesophageal speech
E.6 Scientific relevance of the results
E.7 Relevance of the results for society
List of Figures
2.1 Differences in the anatomy of HS and OS
2.2 Differences in the signal and spectrogram of HS and OS
2.3 Differences in the calculated fundamental frequencies of HS (top) and OS (bottom)
2.4 An infographic explaining the distinction of intelligibility and LE. The level of redness in the head represents the level of LE.
2.5 A simplified description of the voice conversion process
2.6 Basic building blocks of a voice conversion system
3.1 Comparison of the automatic labelling, manual labelling and customised automatic labelling of OS
3.2 ASR Results. Mean speaker-wise Word Error Rates (in %) for ASR trained with HS and ASR trained with OS
4.1 Preliminary LE and SI Task Schematic Representation
4.2 Mean speaker-wise 'All words' and 'content words only' Word Error Rates (WER) for 'familiar' and 'not familiar' listeners. OM1, OM2, OM3, OF1 are oesophageal speakers; HM1 and HF1 are healthy speakers. Higher WER corresponds to lower intelligibility. Error bars show 95% confidence intervals.
4.3 Mean speaker-wise LE for oesophageal (OM1, OM2, OM3, OF1) and healthy (HM1, HF1) speakers. On the y-axis, 1 corresponds to least effortful and 5 to most effortful. Error bars show 95% confidence intervals.
4.4 Word Error Rates (WER) for Human Speech Recognition (HSR) and Automatic Speech Recognition (ASR) for oesophageal (OM1, OM2, OM3, OF1) and healthy (HM1, HF1) speakers
4.5 LE Task Schematic Representation
4.6 SI Task Schematic Representation
4.7 WER and LE for oesophageal (OF1) and healthy (HF1) speakers. Error bars show 95% confidence intervals.
5.1 WER scores and subjective LE for HS and OS
5.2 Boxplot of average alpha frequency (8-12 Hz) power for HS and OS for centro-parietal channels (P3, P4, PZ, C3, CZ, C4, CP1, CP2, CP5, CP6, CPZ)
5.3 Topography plot of average alpha frequency (8-12 Hz) power for HS and OS. The channels O1, O2, TP1, TP2, FP1, FP2 were excluded from the analysis due to noise in those channels in the majority of participants
5.4 Scatter plots of alpha power and digit span scores for HS and OS
6.1 The proposed OS-HS VC system: BLSTMSS
6.2 ASR 1 WER and PWC scores for unprocessed OS (source), the BLSTMSS converted outputs and target SS (target). Error bars show standard errors.
6.3 ASR 2 WER and PWC scores for unprocessed OS (source), the BLSTMSS converted outputs and target SS (target). Error bars show standard errors.
6.4 ASR 3 WER and PWC scores for unprocessed OS (source), the BLSTMSS converted outputs and target SS (target). Error bars show standard errors.
6.5 ASR scores for the multi-speaker system containing 11 OS speakers.
6.6 ASR scores comparison of the single speaker and the multi-speaker BLSTMSS system.
6.7 STOI scores for the four OS speakers and the enriched versions. Reference signal for STOI is duration-matched SS. Error bars show standard errors.
6.8 STOI scores for the multi-speaker system containing 11 OS speakers. Reference signal for STOI is duration-matched SS. Error bars show standard deviations.
6.9 STOI scores comparison of the single speaker and the multi-speaker BLSTMSS system. Reference signal for STOI is duration-matched SS. Error bars show standard errors.
6.10 Histogram plots of the preference scores for the four speakers separately and all together
6.11 Unprocessed OS signal (bottom) and OS signal with pauses removed (top)
6.12 Wavenet Synthesis
7.1 LE and SI Task Schematic Representation
7.2 Word Recognition Task Schematic Representation
7.3 Distribution of Alpha Power Value for all the 32 participants.
7.4 Distribution of Alpha Power Value for a subset of 3 participants. Each participant has a separate range of alpha power values.
7.5 Distribution of z-scored alpha power for a subset of 3 participants.
7.6 Distribution of normalised alpha power (normalised by max value) for a subset of 3 participants.
7.7 Distribution of baseline corrected alpha power for a subset of 3 participants.
7.8 Speech Intelligibility (SI) task performance for the three enrichment systems (BLSTMHS, PPG and BLSTMSS), HS and OS. Error bars show 95% confidence intervals.
7.9 SI scores for isolated words.
7.10 Average Listening Effort (LE) for the three enrichment systems, HS and OS. Error bars show 95% confidence intervals.
7.11 Response times (RT) for SI and LE tasks for the three enrichment systems, HS and OS. Error bars show 95% confidence intervals.
7.12 Alpha Power for the five conditions
7.13 Average Alpha power for all participants as the experiment progresses. The x axis represents the presentation order including all conditions.
7.14 Alpha power progression across blocks (time) for OS, PPG, BLSTMHS, BLSTMSS and HS
7.15 WER scores for Unprocessed OS, previous systems (PPG and BLSTMHS) and BLSTMSS as calculated by ASR 3 for speaker 02M3. Error bars show standard errors.
7.16 STOI scores for Unprocessed OS, previous systems (PPG and BLSTMHS) and BLSTMSS for speaker 02M3. Reference signal for STOI is duration-matched SS. Error bars show standard errors.
B.1 EEG recording software showing impedance values of electrodes.
B.2 Raw EEG data
B.3 Raw EEG data
B.4 Band pass filtered EEG data (1-45 Hz)
B.5 ICA components of EEG data from one participant
List of Tables
2.1 LE rating labels, their English translations, and the values assigned.
4.1 Mean Word Error Rate (WER) and Listening Effort (LE) for OS and HS for familiar and not familiar listeners.
6.1 Average stimulus duration, speaking rates and intelligibility of the four OS speakers
6.2 ASR scores for unprocessed and enriched OS for high (02M3) and low intelligibility (16M3) OS. Text marked with * shows numbers where improvement was observed
7.1 Median LE Ratings for OS, HS and the three enrichments
Chapter 1
Introduction
"For millions of years, mankind lived just like the animals. Then something happened which unleashed the power of our imagination. We learned to talk and we learned to listen. Speech has allowed the communication of ideas, enabling human beings to work together to build the impossible. Mankind's greatest achievements have come about by talking, and its greatest failures by not talking. It doesn't have to be like this. Our greatest hopes could become reality in the future. With the technology at our disposal, the possibilities are unbounded. All we need to do is make sure we keep talking."
- Stephen Hawking
I would like to introduce you to this thesis by bringing some questions to your attention: What is the role of verbal communication in our lives? How important is smooth and efficient verbal communication? What is the effect of our speech on its receivers? What importance does our voice play in forming our identity? What if some key characteristics of our speech are lost? How can science and technology restore these lost characteristics? Are these restorations helpful?
These questions are the starting points of the problems dealt with in this thesis. In this introductory chapter, you will read about the importance of speech communication and the difficulties of disordered speech communication. I will also explain the reason for choosing to work on this problem and the possible ways to address it.
Figure 2.1: Differences in the anatomy of HS and OS
laryngectomee with a disability to speak.
In spite of the absence of vocal folds, laryngectomees can still manage to speak using alternative methods. As mentioned before, OS is one of the alternative methods of speaking after a laryngectomy. The other alternative speaking methods are Electrolaryngeal Speech (ELS) and Tracheoesophageal Speech (TOS) [61]. Both ELS and TOS require external aids: a tracheoesophageal prosthesis in the case of TOS, and an electrolarynx (an external electronic substitute for the vocal folds) for ELS. The TOS prosthesis must be changed periodically (around every 6 months) by a physician.
In the case of OS, the vibrations in the food passage, or more precisely the vibrations in the pharyngo-oesophageal segment, are used as a source to generate speech (see Figure 2.1), which is why it is called oesophageal speech. Unlike TOS and ELS, OS does not require any external equipment. It is a skill that is developed with training from a speech therapist and requires several months of practice. It is also of poorer quality compared to TOS or ELS [133]. Nonetheless, OS has the advantage that once the skill is mastered, the laryngectomee is self-sufficient in producing speech, and this makes it a promising option post-laryngectomy. Moreover, for ELS or TOS users, acquiring OS skills is beneficial as it can help them communicate during unexpected situations (lost or broken devices, low battery, etc.).
A comparison of the anatomy of HS and OS is shown in Figure 2.1. We can observe that the HS production system has a larynx and vocal folds with which we produce speech. However, as the anatomy is altered after a laryngectomy, two separate passages are created: one for breathing (light blue line) and the other for food (dark blue lines).
2.1.2 Challenges
Unlike HS, which is produced with vibrations from the vocal folds, OS is produced from the vibrations of the pharyngo-oesophageal segment (see Figure 2.1). Air is swallowed, inhaled, or injected and is introduced into the oesophagus, after which it is expelled with control, thereby producing vibration [133]. This generation mechanism introduces acoustic artefacts and makes OS difficult to understand [155, 92], which greatly affects communication, social activities, and hence, quality of life [90].
Moreover, these less intelligible voices are not well received by machines that are operated by speech input. An increase in the popularity of devices with voice-based interaction means that machine intelligibility is gaining importance too. We know that machine recognition, or Automatic Speech Recognition (ASR) performance, is lower compared to Human Speech Recognition (HSR) performance [73], although increasingly better ASR systems are being built, aiming towards human-like recognition abilities [132]. However, this is for HS and not for disordered speech, like OS.
Figure 2.2 shows the differences in the signal (top) and spectrogram (bottom) characteristics of HS and OS for a voiced phoneme. As we can see in the top figure, HS has clear periodic lines and OS does not. This irregular periodicity in OS affects its fundamental frequency and prosody, which are important characteristics in expressive communication as well as speaker identity. Please note that the OS signal has lines (although unclear) that occur less frequently than in the HS signal. This is indicative of the much lower pitch that an OS speaker appears to have. In Figure 2.3, we can see the results of the pitch estimation process by Praat [9], a software package used to study and process speech signals. The figure contains the signals (top) and the pitch curves (bottom) of the word 'abrigadero' spoken by a male HS speaker and a male OS speaker. The vertical blue lines on the speech signal (top) represent pulses and the corresponding blue curves (bottom) represent the estimated pitch curve, or f0 curve, in Hertz.
For the HS pitch curve, we can see a continuous, unbroken blue line around the 128.9 Hz point, indicating a continuous pitch in the normal range of male fundamental frequency (100 to 150 Hz). On the other hand, for OS, the pulses are few in number and the calculated pitch curve is around 395.3 Hz. This is incorrect, as OS speakers appear to speak at a much lower fundamental frequency and not at such a high frequency as calculated. The results in this figure demonstrate the erroneous calculation of fundamental frequency for OS signals by standard speech software.
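To make the failure mode above concrete, the following minimal sketch estimates fundamental frequency by picking the lag that maximises the raw autocorrelation of a signal. It is only an illustration under stated assumptions: it is not Praat's algorithm (which is considerably more elaborate), and the function name `estimate_f0`, the search range of 75 to 500 Hz, the 8 kHz sampling rate and the synthetic test tone are all hypothetical choices made for this example.

```python
import math


def estimate_f0(samples, sample_rate, f0_min=75.0, f0_max=500.0):
    """Return the F0 (Hz) whose lag maximises the raw autocorrelation.

    Illustrative only: the search range is an assumed default, and no
    voicing decision or peak interpolation is attempted.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation at this lag: sum of products of the signal
        # with a copy of itself shifted by `lag` samples.
        value = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# A clean, strictly periodic tone at the HS pitch quoted in the text.
sample_rate = 8000
periodic = [math.sin(2 * math.pi * 128.9 * n / sample_rate)
            for n in range(2000)]
f0 = estimate_f0(periodic, sample_rate)
```

For a strictly periodic input the autocorrelation has a clear peak at the true period, so the estimate lands close to 128.9 Hz (limited by the integer lag resolution). When the periodicity is irregular, as described for OS, no single lag dominates and a maximum-picking estimator of this kind can settle on a spurious lag, which is one plausible way to arrive at an implausibly high value.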
Figure 2.2: Differences in the signal and spectrogram of HS and OS
Figure 2.3: Differences in the calculated fundamental frequencies of HS (top) and OS (bottom)
The observable differences between OS and HS characteristics in the signal and spectrogram are supported by the following studies. Cervera et al. [11] and Liu et al. [74] observed that the formant frequencies were higher and the duration of vowels was longer for laryngectomees as compared to HS. Fundamental frequency, intensity, and signal-to-noise ratio of HS were significantly higher than those of OS [74]. Jitter and shimmer were significantly lower for HS compared to OS [123, 74]. On average, OS had about 10 dB less intensity compared to HS [156]. Vowel duration is longer for OS compared to HS [74, 13]. Generally speaking, HS has a higher speaking rate compared to OS. However, some OS speakers do have speaking rates matching that of HS [123].
2.2 Evaluation Metrics
For any task aimed at making improvements to an existing situation, it is important to conduct evaluations at the beginning and at every stage of progress. This process can tell us clearly whether we are indeed making improvements, or at least progressing in the right direction. In this section, you will find a summary of several evaluation metrics, how they have been used and on what kind of speech they have been used. This knowledge is useful to identify the metrics that are interesting and not yet explored in the context of OS enrichment or the enrichment of any other disordered speech.
2.2.1 Intelligibility
Intelligibility is the most important and the most commonly evaluated property of speech. Speech is useful as a successful communication tool only if it is correctly understood. Speech intelligibility (SI) is a widely researched field, and several intelligibility measurement metrics (subjective and objective) have been explored and analysed. Subjective measures are based on the responses or opinions of the listeners. They may also be derived from listener responses. Objective measures of intelligibility are calculated using a formula, an algorithm or a software program.
Intelligibility may also be categorised as HSR-based or ASR-based. Both involve getting transcriptions of the speech utterance and calculating transcription errors. A review of HSR and ASR methods can be found in [122].
Objective Measurements of Intelligibility
Objective measurements of intelligibility include Speech Transmission Index (STI), Articulation Index (AI), Signal-to-Noise Ratio (SNR), Harmonics-to-Noise Ratio (HNR), Short Term Objective Intelligibility (STOI) [134] and ESTOI [58].
STI takes into account the noise and reverberation, and measures the reduction of performance due to speech of varying intelligibility [51]. Measuring AI involves dividing the speech spectrum into several frequency bands and a calculation of the speech-to-noise ratio in each band [66]. For both these measures, a value of 0 means poor intelligibility or no speech was heard and a value of 1 means high intelligibility or speech was perfectly heard.
STOI [139] and ESTOI [58] are intrusive objective intelligibility measures which are known to be correlated with subjective intelligibility scores of noisy speech. An intrusive intelligibility measurement requires a degraded signal and an aligned reference signal. The STOI and ESTOI scores range from 0 to 1, where 0 means the least similarity to the clean reference signal (not intelligible) and 1 means totally similar to the clean reference signal (very intelligible).
On the other hand, non-intrusive intelligibility measures do not require a reference signal. Some examples of such measures include the non-intrusive STOI [1], methods based on Deep Neural Networks (DNN) [168, 117] and methods based on auditory-inspired filterbank analysis [35].
The main advantage of objective measurements is that they are easy to replicate and to implement. There is no need for a human subject to evaluate the speech, thereby avoiding hassles usually associated with such perceptual experiments. These metrics measure how a certain speech signal is received by machines or digital devices. Several of these measurements, including STOI and ESTOI, have high correlations with listening test scores [150] and hence are often used in place of conducting an actual listening test. Other newer approaches to intelligibility measurement can be found in [1, 130, 150].
Objective measures of intelligibility can also be obtained from an ASR system. An ASR system, as the name suggests, recognises a speech utterance and gives a text output which is the transcription of the utterance. When the transcription matches the message presented in the utterance, the ASR system is said to have recognised the utterance with 100 percent accuracy, or 0 percent errors. Such an utterance would have 100 percent intelligibility as per that specific ASR system. Currently such a high performance is not obtained with any ASR system. The lowest error (WER) so far is 2 to 3 percent [59] even with clear HS. By running speech utterances through an ASR system and then calculating the errors in transcription, we can assign an objective intelligibility score to the utterance.
The errors in transcription of an ASR system are computed using a metric called Word Error Rate (WER). WER is obtained by calculating the Levenshtein distance [71] between the reference sentence and the hypothesis sentence (the sentence transcribed by the listener). The Levenshtein distance takes into account the insertions, deletions, and substitutions that are observed in the hypothesis sentence. The calculation was performed with the Word Error Rate Matlab toolbox [105]. The formula used is shown in Equation 2.1.
\[
\mathrm{WER} = \frac{\text{Substitutions} + \text{Insertions} + \text{Deletions}}{\text{Total number of words in reference sentence}} \tag{2.1}
\]
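As an illustration of Equation 2.1, the sketch below computes WER from a word-level Levenshtein distance. It is a minimal Python example and not the Matlab toolbox cited above; the function names, the lower-casing and whitespace tokenisation, and the example sentences are assumptions made for this sketch.

```python
def levenshtein(ref_words, hyp_words):
    """Word-level Levenshtein distance: the minimum number of
    substitutions, insertions and deletions turning ref into hyp."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER as in Equation 2.1, returned as a fraction (not a percentage)."""
    # Assumed normalisation: lower-case and split on whitespace.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference sentence must contain at least one word")
    return levenshtein(ref_words, hyp_words) / len(ref_words)


# Hypothetical example: one substitution (sat -> sit) and one deletion (the).
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Here the distance is 2 over a 6-word reference, so the WER is 2/6. Because insertions count as errors, WER can exceed 1 when the hypothesis is much longer than the reference.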
Another way to evaluate the performance of an ASR system is to use the Percentage Words Correct (PWC) method. PWC is the percentage of words from the reference sentence correctly identified in the transcribed sentence. In this case it is not a measure of error, but of accuracy. Higher PWC means higher intelligibility. The PWC formula is shown in Equation 2.2.
\[
\mathrm{PWC} = \frac{\text{Words correctly identified in transcription}}{\text{Total number of words in reference sentence}} \times 100 \tag{2.2}
\]
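Equation 2.2 can be sketched in the same way. The text does not specify how "words correctly identified" are counted, so this example makes an explicit assumption: a reference word counts as correct when it is matched, in order, by the longest common subsequence of words between reference and transcription. The function names and example sentences are likewise hypothetical.

```python
def matched_word_count(ref_words, hyp_words):
    """Length of the longest common subsequence of words.

    Assumption for this sketch: this is used as the number of
    reference words correctly identified in the transcription.
    """
    prev = [0] * (len(hyp_words) + 1)
    for ref_word in ref_words:
        curr = [0]
        for j, hyp_word in enumerate(hyp_words, start=1):
            if ref_word == hyp_word:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def percentage_words_correct(reference, hypothesis):
    """PWC as in Equation 2.2, returned as a percentage."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference sentence must contain at least one word")
    return 100.0 * matched_word_count(ref_words, hyp_words) / len(ref_words)


# Hypothetical example: 4 of the 6 reference words are recovered in order.
pwc = percentage_words_correct("the cat sat on the mat", "the cat sit on mat")
```

Unlike WER, this score is bounded between 0 and 100, since extra inserted words in the transcription do not reduce the number of matched reference words.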
There are some other objective metrics that do not measure intelligibility, but other
important characteristics of intelligible speech. For example, Mel Cepstral Distortion (MCD)
evaluates differences in the mel cepstra by calculating the Euclidean cepstral distance of a
signal from a reference signal. It is often used to evaluate speech synthesis systems [63] and Voice
Conversion (VC) systems [126, 124, 19]. See Section 2.3.1 for details on VC. The lower the
MCD, the more alike the signal is to the reference signal. On the other hand, Perceptual
Evaluation of Speech Quality (PESQ) evaluates the quality of the signal. This is done by aligning the
reference and degraded signals, performing an auditory transform and extracting distortion
parameters from the difference of these two signals. PESQ provides a prediction of a subjective
MOS of the degraded signal [116].
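To make the "Euclidean cepstral distance" concrete, the sketch below computes a frame-averaged MCD in plain Python. The dB scaling constant (10 / ln 10) · √2 is the formulation commonly seen in the speech synthesis literature and is an assumption here, as are the function name and the requirement that both signals are already time-aligned frame by frame (with the energy coefficient removed beforehand).

```python
import math


def mel_cepstral_distortion(ref_frames, test_frames):
    """Mean MCD (in dB) over time-aligned frames of mel-cepstral coefficients.

    Assumptions of this sketch: frames are already aligned one-to-one, the
    0th (energy) coefficient has been dropped by the caller, and the usual
    (10 / ln 10) * sqrt(2) scaling is wanted.
    """
    if not ref_frames or len(ref_frames) != len(test_frames):
        raise ValueError("need two non-empty, equally long frame sequences")

    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, test in zip(ref_frames, test_frames):
        if len(ref) != len(test):
            raise ValueError("frames must have the same number of coefficients")
        # Euclidean distance between the two cepstral vectors of this frame.
        distance = math.sqrt(sum((r - t) ** 2 for r, t in zip(ref, test)))
        total += scale * distance

    return total / len(ref_frames)
```

Identical signals give an MCD of 0, and the value grows as the cepstra diverge. In practice the frames of two utterances of different length would first have to be aligned (for example with dynamic time warping), which this sketch leaves out.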
Subjective Measurements of Intelligibility
Subjective intelligibility involves ratings or responses from an actual human listener. It needs
proper experimental design and setup and well-thought recruitment of participants to obtain
reliable measures. Although this is time consuming, there is an obvious advantage with this
approach as the evaluations come from an actual person, which is the case in face-to-face
communications in the real world. Some metrics in subjective intelligibility tests such as Mean
Opinion Scores (MOS) and Speech Reception Threshold (SRT) are explained below.
A MOS is a very simple and versatile metric. It is very easy to implement and useful to get a
numeric grading of any characteristic that needs to be evaluated. However, MOS values are
uni-dimensional and prone to “misuse and misinterpretation” [137]. Despite these limitations,
MOS is a very popular method of evaluation, be it for speech synthesis systems, VC systems or
for other speech similarity and naturalness evaluations [154, 75, 152]. A variation of MOS is
the Comparative Mean Opinion Scores (CMOS) where the intelligibility (or any other speech
property) is compared for two different signals. This method is useful when comparing the
performance of two different systems.
SRT is defined as the minimum hearing level at which 50 percent of the speech material is
understood by the listener. This metric is used mainly in hearing impaired research [151].
Another popular method is to conduct a sentence recognition task. Sentence transcription
tasks or “human speech recognition” tasks [73] have been widely used for subjective intelligibility
measurements [93, 62]. Yorkston et al. [167] reported the agreement of sentence transcription
tasks with listeners’ estimates and ratings of intelligibility. A variation of the sentence
transcription task is the sentence repetition task where, instead of writing down the sentence they
heard, listeners simply have to repeat it. This is less taxing for the listener.
Moreover, the strengths of sentence repetition tasks are that they are “fairly simple cognitive
tasks” and that they are “consistent throughout the age span” in the area of neurophysiological
tests [84]. For sentence repetition or sentence transcription tasks as well, the WER or PWC that
was explained in the previous section can be used as an intelligibility measure. Therefore,
although the responses in such tests can be subjective, in the sense that they come from a subject,
the error in transcription or recognition can be measured objectively. This method was most
predominantly used to calculate SI in the experiments of this thesis.
Some alternatives to sentence recognition tasks are word [17] or digit recognition tasks
[159], sentence last word recognition tasks [67] and keyword recognition tasks [6]. Both last
word recognition and word recognition tasks were also used for evaluations in the experiments
of this thesis.
Intelligibility of Disordered Speech
Intelligibility measurements for the analysis of disordered speech have been explored in [76,
86, 87]. Some of them are ASR-based [76, 87], while others are not [86]. In an evaluation of
disordered speech resulting from cerebral palsy and Amyotrophic Lateral Sclerosis (ALS), a high
correlation was reported between STOI and subjective intelligibility ratings [55]. Intelligibility
of OS was found to be lower compared to HS, and intelligibility improved for OS in the combined
auditory-visual mode compared to the auditory only mode [52]. In another study, Holley et al.
[50] found higher intelligibility for HS compared to OS as well in the quiet condition.
Some studies have been conducted in measuring the intelligibility of Spanish OS. Authors
of [88] studied the voice intelligibility characteristics of Spanish OS and TOS. This HSR study
was conducted for two-syllable words, and it was reported that nasal sounds resulted in most
transcription confusions for OS. The work in [77] describes a real time recognition system for
vowel segments of Spanish OS.
2.2.2 Listening Effort
While intelligibility is an important property of speech, measures of intelligibility cannot
quantify how much effort was required to correctly understand the message. While poorly intelligible
speech is usually difficult to understand, sometimes even highly intelligible speech is difficult
to understand. For instance, if you are in a noisy room, you can still understand the person
next to you, but the process is much more difficult and tiring. This is where LE is helpful as it
gives an idea of the effort involved in listening to the message, and the resultant fatigue from
prolonged listening to effortful speech.
In literature, LE has been defined as “the mental exertion required to attend to, and
understand, an auditory message” [104]. Another related concept is that of cognitive load. The
accomplishment of a listening task, especially in adverse conditions, involves the use of
cognitive resources of the listener. These resources are present in the listener in the form of working
memory. Moreover, each listening task has three aspects that decide the cognitive load or the
listening effort of that task: intrinsic load (complex sentences, grammar, poor speech
production etc.), the load on the listener for semantic processing, and extraneous load (distracting
speakers or images, noise etc.). Therefore the exerted cognitive load depends on the amount
of available working memory and how much of it goes into each of these kinds of loads [2].
For example, listening to a very complex sentence in a noisy environment would entail more
cognitive load compared to a quieter environment.
Figure 2.4 shows an infographic of the concept of LE. As you can see, the speaker is saying
‘I need some water’. Both listeners understand it as ‘I am some water’. Therefore, they have
both understood 75 percent of the message correctly (1 word wrongly identified out of 4).
However, listener 2 has to exert more effort to decipher the message compared to listener 1,
which is indicated by the redness in the head. This might depend on several factors such as the
cognitive and the hearing capacities of the listener, the listening environment etc. For example,
if listener 1 were to listen to the same speech in noise, he would have needed to exert more
effort. The effort might also depend on the speech itself. We aim to explore all these LE factors
in the context of OS and normal hearing listeners.
LE has been measured in several ways: self-reporting (questionnaires, ratings etc.);
behavioural measures (performance in single tasks or multiple tasks and deriving LE from them);
and physiological measures (electroencephalography, pupillometry etc.). A review of LE and
various methods of measuring LE is presented by McGarrigle et al. [80].
Figure 2.4: An infographic explaining the distinction of intelligibility and LE. The level of
redness in the head represents the level of LE.
LE has been measured in several contexts where the listeners have to put in extra investment
of their neurocognitive resources. This includes understanding of distorted speech signals.
Distortion can come from the speaker (e.g., foreign accent, disordered speech), the listener
(e.g., hearing impairment) or from the environment or channel (e.g., noise). While there is
research on LE in the context of speech in noise [115], non-native or accented speech [10, 149],
and hearing impairment [46], LE of disordered speech is a less researched field, and physiological
LE measurements, even less so.
Self-reported Listening Effort
Studies on LE often involve subjective ratings where participants are asked to indicate the
perceived effort, for example, using Likert scales [112] or visual analogue scales [79, 94].
Self-reporting measures or subjective ratings are based on a set of questions such as
‘Do you have to concentrate very much when listening to someone or something?’; ‘Can you
easily ignore other sounds when trying to listen to something?’; and ‘Do you have to put in
a lot of effort to hear what is being said in conversation with others?’. Based on the response
of participants (say from 0 to 10), the LE can be calculated. While this method is easy and
does not require much expertise, there are limitations as the responses are subjective and the
threshold of effortful listening can be different for different individuals. For example, what
is effortful for one subject may not be as effortful for another subject, as they may have less
tolerance for effort or may base their effort rating on the extent to which they could complete
the task [80].
In the experiments of this thesis, two kinds of self-reported LE are used. The first one is a
very simple 5-point Likert scale response to the question ‘How effortful was the speech to listen
to?’. This method was used in the first preliminary experiment (See Section 4.2). The second
method used in this thesis is the 14-point scale known as the Adaptive Categorical Listening
Effort Scaling (ACALES) procedure [65]. This scale is used widely in speech-in-noise research
[115, 41]. As this scale is designed for speech-in-noise experiments, the topmost level is the
‘only noise’ level. This is the case where the participant hears only noise and no speech. As we
are not studying the effect of environmental noise while listening to OS, but the effort associated
with the disordered speech, we have not used this top level. Therefore our modified ACALES
scale to measure self-reported LE is a 13-point scale that goes from ‘Ningún esfuerzo’ (No effort)
to ‘Muchísimo esfuerzo’ (Extreme effort). There are 7 labels in all, but also intermediate labels
(‘–’) that allow participants to choose in-between options. The LE rating labels, their English
translations, and their values are presented in Table 2.1. This rating scale was used in the
experiments in Chapters 4, 5 and 7.
Table 2.1: LE rating labels, their English translations, and the values assigned.
LE Rating Labels         English Translations     Values Assigned
Muchísimo esfuerzo       Extreme effort           13
–                        –                        12
Mucho esfuerzo           A lot of effort          11
–                        –                        10
Esfuerzo considerable    Considerable effort      9
–                        –                        8
Esfuerzo moderado        Moderate effort          7
–                        –                        6
Poco esfuerzo            Little effort            5
–                        –                        4
Muy poco esfuerzo        Very little effort       3
–                        –                        2
Ningún esfuerzo          No effort                1
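The mapping in Table 2.1 can be written down as a small lookup, which is convenient when converting numeric responses back into categories for reporting. This is only an illustrative Python sketch; the names `LABELLED_LEVELS` and `describe_rating` are hypothetical and not part of any experiment software described in this thesis.

```python
# Labelled levels of the modified 13-point scale (odd values carry a label).
LABELLED_LEVELS = {
    1: ("Ningún esfuerzo", "No effort"),
    3: ("Muy poco esfuerzo", "Very little effort"),
    5: ("Poco esfuerzo", "Little effort"),
    7: ("Esfuerzo moderado", "Moderate effort"),
    9: ("Esfuerzo considerable", "Considerable effort"),
    11: ("Mucho esfuerzo", "A lot of effort"),
    13: ("Muchísimo esfuerzo", "Extreme effort"),
}


def describe_rating(value: int) -> str:
    """Return the English description of a rating between 1 and 13."""
    if not 1 <= value <= 13:
        raise ValueError("rating must lie between 1 and 13")
    if value in LABELLED_LEVELS:
        return LABELLED_LEVELS[value][1]
    # Even values are the unlabelled in-between options ('–' in Table 2.1).
    lower = LABELLED_LEVELS[value - 1][1]
    upper = LABELLED_LEVELS[value + 1][1]
    return f"between {lower} and {upper}"
```

For instance, `describe_rating(7)` gives 'Moderate effort', while an even value such as 12 falls between 'A lot of effort' and 'Extreme effort'.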
Listening Effort from Behavioural Data
In the case of behavioural measures, the effort is measured by behavioural tasks, which can
include single tasks or multi-tasking. In single tasks, the participant is given a listening task
and their responses are recorded. In addition to the responses, the response times may also
indicate LE as they are known to be slower for challenging listening conditions. In a dual task
Chapter 3
Corpus and Stimuli
“Words are the coins making up the currency of sentences, and
there are always too many small coins.”
— Jules Renard
My first introduction to this doctoral research was listening to recorded speech from several
OS speakers of varying severities. As a first time listener of OS, I was both taken aback and
intrigued. The following few months were spent familiarising with a database of OS speakers
that we recorded in our lab. This chapter describes the database in detail and the insights
gained from it which facilitated the design of the enrichment and evaluation experiments.
3.1 OS Database - Already Available
The OS database contained sentences, words and sustained vowels recordings from over 30 OS
speakers. The speakers are 32 laryngectomised patients who are members of ‘Association of
Laryngectomees in Bilbao’. There were two categories of speakers in the database: proficient
and non-proficient speakers. Proficient speakers are those who finished their speech therapy
sessions months before the recording session. Non-proficient speakers on the other hand were
still undergoing the speech therapy sessions. Out of these 32 speakers, 26 of them were proficient
speakers, 2 of them were non-proficient speakers, 2 speakers were recorded when they were
proficient as well as non-proficient and 1 other speaker was recorded in proficient OS and TOS
modes. In total there were 32 speakers and 34 sets of recordings. In this thesis, we have only
used the proficient speakers.
The data was recorded using 4 different microphones in an acoustically isolated room: a studio
microphone (Neumann TLM 103), an instrumentation microphone (Behringer ECM8000), a headphone
microphone (DPA 4066-F), and a condenser microphone (AKG C542BL). However,
the recordings used in the studies of this thesis were from the studio microphone. The protocols
of recordings and the detailed description of the database are available in [39] and [123]. Some
important details of the database that are relevant to this thesis are described below.
3.1.1 Sentences
The database contained recordings of 100 phonetically-balanced sentences selected from a bigger
corpus [121, 33]. The selection of sentences was performed with a greedy-algorithm-based tool
called corpusCRT [128], with the criteria of maximised diphone coverage and a maximum of 15
words per sentence.
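The idea behind a greedy, coverage-driven selection can be sketched as follows. This is not the corpusCRT implementation, whose internals are not described here; it is a minimal Python illustration under two loud assumptions: diphones are approximated by letter pairs within a word (a real tool would work on a phonetic transcription), and the hypothetical function `greedy_select` simply picks, at each step, the sentence that adds the most unseen units.

```python
def diphones(sentence: str) -> set:
    """Approximate diphones as adjacent letter pairs inside each word.

    Assumption: orthographic bigrams stand in for true phonetic diphones.
    """
    units = set()
    for word in sentence.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        units.update(a + b for a, b in zip(letters, letters[1:]))
    return units


def greedy_select(candidates, n_sentences, max_words=15):
    """Greedily pick sentences that maximise newly covered diphones."""
    # Enforce the length criterion first (at most `max_words` words).
    pool = [s for s in candidates if len(s.split()) <= max_words]
    covered, selected = set(), []
    while pool and len(selected) < n_sentences:
        # Sentence contributing the largest number of unseen units.
        best = max(pool, key=lambda s: len(diphones(s) - covered))
        if not diphones(best) - covered:
            break  # nothing new can be covered any more
        selected.append(best)
        covered |= diphones(best)
        pool.remove(best)
    return selected
```

A greedy choice of this kind does not guarantee the globally best subset, but it is a common and inexpensive way to approach coverage problems of this type.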
These 100 sentences were chosen to be phonetically balanced to ensure maximum phonetic
content variability. They were syntactically and semantically predictable, but had some proper
nouns and many unusual words that are hard to guess. This is to be kept in mind while
considering intelligibility measurements. The difficulty of these sentences makes them appropriate for
intelligibility and LE experiments. There are short and long sentences in the database, which
allows the experimenter to choose stimuli as per the need of the experiment. Some examples of
the sentences are the following: ‘¿Qué diferencia hay entre el caucho y la hevea?’ (What is the
difference between rubber and the hevea?) and ‘Unos días de euforia y meses de atonía.’ (A few days of
euphoria and months of despair.). Words such as hevea (specific name of the rubber tree) and atonía
(despair) are not commonly used words and hence are difficult to guess.
3.1.2 Sustained Vowels
Each OS speaker recorded 4 instances of the sustained articulation of all five Spanish vowels.
These data were not used in the experiments described in this thesis. However, they were great
tools to understand the voice characteristics of OS.
3.1.3 Words
The database contained 14 isolated words which included four words containing diphthongs.
These words are useful for spoken term detection tasks. These words were not used in any of
the experiments either.
3.2 HS Database Description - Already Available
HS samples were obtained from an online platform [32] and hence, were recorded in variable
environments. However, some of them were recorded in the aforementioned acoustically isolated
room, although with a different microphone. The number of speakers in the HS database keeps
growing as it is an open online platform¹. There were over 35 speakers in the HS database
at the time of the initial experiments. Some of the speakers in this database were used in the
experiments performed in this thesis.
3.3 Additional OS Data - Newly Recorded
3.3.1 Words
We recorded 150 low frequency difficult words from one OS speaker and one HS speaker. The
words were chosen using an online database². These words were used in an experiment in
Chapter 7. See the list of words in Appendix D.
3.3.2 Continuous Speech
In addition to the 150 words, five continuous speech passages were recorded for the same one
OS and one HS speaker. The passages were chosen from the Cervantes resources for reading for
intermediate level Spanish³. This was meant to be used in the experiment described in Chapter
7, but eventually was not used. However, these passages are useful for future OS studies with
continuous speech.
For one of the passages, a video recording was performed. This passage with the video
recording (duration: 2 minutes and 17 seconds) was used to build an interactive demonstration
of the outcomes of OS enrichment. More details on this demonstration will be presented in
Chapter 6. See the contents of the passage in Appendix C.
3.4 Manual Labelling
In order to create phonetic labels for the speech signals in the database, an automatic forced
aligner which is part of an ASR system was used. However, using a forced aligner which was
trained using HS was unfit for OS. On manual inspection, several errors were encountered in the
labels, likely owing to poor spectral and temporal characteristics of OS. Therefore, new acoustic
models were made using the available OS recordings with the Montreal Forced Alignment tool
[72].
¹ https://aholab.ehu.eus/ahomytts/
² https://www.bcbl.eu/databases/espal/idxword.php
³ https://cvc.cervantes.es/aula/lecturas/intermedio/
Figure 3.1: Comparison of the automatic labelling, manual labelling and customised automatic
labelling of OS
Manual labelling was performed for one speaker (Speaker ID: 05M3) so that these manual
labels can be used to evaluate the accuracy of the automatic aligner. It was not possible to
perform manual labelling for all 32 speakers as it is a time consuming process. The process
involved listening to each and every phoneme in the speech samples and correcting the positions
of the respective phone boundary. This was done using the WaveSurfer software [131].
We assessed the accuracy of the automatic labelling process by comparing the label boundary
positions of the automatic labelling process to those of the manual process. Results showed that
97 percent of the errors were less than 50 ms and 83 percent of the errors were less than 5 ms
[39]. Figure 3.1 shows how the default automatic labelling and the customised automatic
labelling compare to the manual alignment.
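The kind of boundary comparison described above can be sketched in a few lines of Python. The function name `fraction_within` and the assumption that automatic and manual boundaries are already paired one-to-one (same phone sequence, times in seconds) are illustrative choices, not details taken from [39].

```python
def fraction_within(auto_boundaries, manual_boundaries, tolerance_s):
    """Fraction of boundaries whose absolute error is below a tolerance.

    Assumption: both lists hold boundary times in seconds for the same
    phone sequence, so entries can be compared pairwise.
    """
    if not auto_boundaries or len(auto_boundaries) != len(manual_boundaries):
        raise ValueError("need two non-empty, equally long boundary lists")
    errors = [abs(a - m) for a, m in zip(auto_boundaries, manual_boundaries)]
    return sum(err < tolerance_s for err in errors) / len(errors)


# Hypothetical boundary times (seconds), for illustration only.
auto = [0.100, 0.252, 0.400, 0.580]
manual = [0.102, 0.250, 0.430, 0.500]
share_under_5ms = fraction_within(auto, manual, 0.005)
share_under_50ms = fraction_within(auto, manual, 0.050)
```

Reporting the share of boundaries under several tolerances, as done in the text, gives a quick picture of both typical and worst-case alignment errors.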
These new labels were used in training an ASR system adapted to OS [39] as well as for
generating Synthetic Speech (SS) with durations matching with OS (See Section 6.2).
The duration matched SS was created for all the 32 speakers. This parallel SS database can
be useful for future OS enhancement research.
3.5 Intelligibility
The intelligibility of 29 proficient speakers was estimated by calculating ASR WER scores. Two
different types of ASRs were used: ASR trained using HS and ASR trained with OS. Figure 3.2
shows the results of the ASR performance. It can be observed that the system trained with OS
data has fewer errors compared to the system trained with HS data. There was an improvement
of around 16 percentage points when trained with OS data.
Figure 3.2: ASR Results. Mean speaker-wise Word Error Rates (in %) for ASR trained with
HS and ASR trained with OS
3.6 Conclusions
This chapter described all the data that was available to us for performing our experiments.
However, not all of the data was used in the experiments described in this thesis. The
experiments listed in Chapters 4, 5, 6 and 7 describe in detail the chosen subsets of the database in
their respective methods sections.
The manual and the automatic labelling process described in this chapter form an important
step in the enrichment strategies described in Chapter 6.
Additionally, intelligibility of the speakers obtained from an ASR system was described. The
system trained with OS data had fewer errors compared to the one trained with HS. These WER scores
are used as a criterion when selecting speakers to evaluate in the aforementioned experiments.
A complete and extensive description of the database was published as a paper titled ‘A
Spanish Multispeaker Database of Esophageal Speech’ [39].
Chapter 4
Preliminary Measures of
Intelligibility and Listening Effort
“If you think communication is all talking, you haven’t been
listening.”
— Ashleigh Brilliant
In the previous chapter, I described some key features of the OS database. I listed some
differences in the acoustic and linguistic characteristics of OS and HS. In Section 2.1.2 we
have seen that OS has a lower speaking rate, higher jitter and shimmer and lower intensity
compared to HS. All these are factors that affect the ability to perceive and understand speech.
To understand in what ways OS is more difficult to process compared to HS, and how much, it
is necessary to conduct well designed listening experiments.
In this chapter, I describe two listening experiments that collected some preliminary
intelligibility and LE metrics for a small set of speakers from the database and some control
healthy speakers. Both the experiments were designed to evaluate the gaps between OS and
HS in intelligibility and ease of processing. Once these gaps are known, appropriate enrichment
methods can be designed aimed at closing these gaps, i.e. bringing enriched OS closer to the
metrics of HS than unprocessed OS.
The contents of this chapter have featured in a previously published paper titled ‘Intelligibility
and Listening Effort of Spanish Oesophageal Speech’ [112].
4.1 Introduction
Listening to disordered speech is a challenging task, and it demands a lot of attention and
effort. To quantify the challenges of listening to OS we begin by measuring its intelligibility in
comparison to HS. Intelligibility measurements are common and are a useful way to quantify
what percentage of the spoken message has been correctly understood (See Section 2.2.1 for more
details). In this study, we have evaluated intelligibility in human–human (speaker is human,
listener is a human) as well as human–machine (speaker is human, listener is a device/software)
interactions.
In addition to intelligibility, there is growing interest in research measuring LE and other
processing load aspects of speech as it gives an additional dimension to understanding
challenges in speech perception in adverse listening conditions. The motivations for measuring LE are
described in detail in Section 2.2.2. In this study, we have attempted to explore LE in addition
to the intelligibility measurements.
In Chapter 2, we presented some research where HS was found to be more intelligible than
OS (Section 2.2.1) and HS was found to be more acceptable compared to OS (Section 2.2.2).
These studies are the foundations of the hypotheses of this experiment. Significant positive
correlations were observed between ASR and HSR for listening in adverse conditions such as age
related hearing loss [37] and speech disorders [53]. This led us to hypothesise that, like HSR
performance, ASR performance will be lower for OS compared to HS.
The idea of intelligibility differences between experienced and inexperienced listeners of OS
was explored in [16]. The findings were that OS was ranked similarly for intelligibility by both
experienced and inexperienced listeners. This was intriguing, and led us to investigate the effect
of familiarity with OS on its intelligibility. In addition, as we were collecting LE ratings too,
we were interested in seeing if the same was observed for LE ratings, or if they would tell a
different story. We consider friends, family (spouse, siblings, children), and caretakers of OS
speakers as familiar listeners.
This study contains two experiments. The first experiment was web-based, and was focused
on getting preliminary intelligibility and LE metrics of our data. We investigated how
intelligibility (both ASR and HSR) and LE differ for the two speech types (OS and HS). We also
investigated the effect of familiarity and to what extent intelligibility and LE are correlated.
The second experiment (an extension of Experiment 1) was conducted in a laboratory setting,
which allowed us better control of the experiment environment. The aim of this experiment
was to find out if more LE is reported for OS even if the intelligibility of OS is close to that
of HS. Additionally, in this experiment, we also investigated if the participants’ performance in
the speech perception tasks depended on their cognitive abilities.
The hypotheses of Experiment 1 are:
• WER is positively correlated with self-reported LE ratings.
• HS is more intelligible and less effortful, compared to OS.
• Listeners familiar with OS find it less effortful to process OS, compared to listeners that
are not.
• ASR performs worse for OS than for HS.
Our hypotheses for Experiment 2 are:
• For the case that intelligibility of OS is similar to that of HS, there is still more effort in
understanding OS.
• Listeners with better cognitive abilities have better intelligibility scores and report less
effort.
We begin by describing the materials and methods and the results of Experiment 1 in
Section 4.2, followed by those of Experiment 2 in Section 4.3. Finally, a general discussion and
conclusions are presented.
4.2 Experiment 1: Preliminary Word Error Rate and
Listening Effort Measurements
4.2.1 Materials and Methods
Experimental Design
The main task of this web-based experiment was the sentence recall and transcription task.
Participants listened to a sentence and then typed what they had understood. To collect LE
rating measures, we asked the participants to rate the sentences for LE on a 5-point Likert
scale. The sentences were played only once (to avoid any possible memory effect) and in a
random order (to avoid sentence order bias).
Corpus and Stimuli
The corpus we used was a part of the larger database of 32 OS speakers described in detail in
Section 3.1. The chosen stimuli were picked from the 100 sentences described in Section 3.1.1.
4.3 Experiment 2: Listening Effort of Highly Intelligible
Oesophageal Speech
4.3.1 Materials and Methods
Experimental Design
Based on our preliminary intelligibility and self-reported LE experiment (See Experiment 1,
Section 4.2), we had the chance to probe further into our data by designing a study specifically
aimed at measuring LE. The aim was to investigate the differences in LE for a set of HS and
OS speakers that have comparable intelligibility. As pointed out in [95], even highly intelligible
OS speech was found to have different LE ratings. Therefore, this is a methodological decision
in order to rule out that observed effects are due to differences in intelligibility.
The experiment was designed for an EEG-based LE measurement. We aimed to record EEG
data of participants while they listened to OS and HS to investigate if there are differences in
the LE correlates of brain activity. Along with measuring the EEG data, we also collected
subjective LE ratings from the listeners. As the participants had an EEG cap on, the stimuli
were played on a loudspeaker, and not on headphones. Here, we present the LE ratings findings.
The EEG data acquisition process and findings are presented in Chapter 5.
In addition to the LE experiment, a separate intelligibility test was conducted to replicate
the results of Experiment 1 in a laboratory setting. This time we asked the participants to listen
to the sentence and repeat out loud what they heard. This is less taxing for the participant as
they do not have to type their responses. Also, this resulted in speedier responses and hence
less effort from the listeners’ side in memorising the sentence. The advantage of oral response
is that typing errors can be excluded as a confounding factor of WER. However, this involves
post-processing, i.e., the task of transcription of their oral responses to text to calculate WER.
In order to investigate the relationship between LE, SI and the listeners’ cognitive capacities,
we conducted a Flanker task [31] and a backward digit span task [47] after the behavioural tasks.
The Flanker task measures the selective attention ability [31] and the digit span task measures
the working memory capacity of the listener [47]. Both of these processes are relevant in speech
perception. OS has swallowing sounds and undesired artefacts which need to be ignored by the
listener to selectively focus on the speech message. Working memory is also a crucial factor in
speech perception (see phonological loop in [4]), especially for OS which spans a longer duration
than HS.
Stimuli
We picked a subset of one HS speaker and one OS speaker from our dataset of Experiment 1
based on intelligibility similarity. Speakers OF1 and HF1 were the two speakers whose
intelligibility was statistically comparable according to a two-sample Kolmogorov-Smirnov (KS)
test: the null hypothesis that their scores come from the same distribution could not be
rejected at a significance level of α = 0.01.
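As a rough illustration of such a comparison, the sketch below computes the two-sample KS statistic and compares it with the usual large-sample critical value. It is not the analysis code used for this thesis: the function names, the use of the asymptotic critical value c(α)·√((n+m)/(n·m)) with c(α) = √(−ln(α/2)/2), and the idea of feeding it per-sentence scores for each speaker are all assumptions.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_rejects_same_distribution(sample_a, sample_b, alpha=0.01):
    """True if the null hypothesis is rejected (asymptotic approximation)."""
    n, m = len(sample_a), len(sample_b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical
```

A result of `False` means only that the test found no evidence of a difference at the chosen level; it does not prove that the two speakers are equally intelligible. For small samples an exact test (for example from a statistics package) would be preferable to this approximation.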
All 100 sentences mentioned in Section 3.1 were used for this experiment. An intelligibility
test was performed on the same 30 sentence subset described in Experiment 1. For the LE
rating task, we used the other 70 sentences which were longer. This was to ensure that the
participants had a sufficiently long stimulus to respond to, and that the EEG recording of
each stimulus was sufficiently long to process and analyse. The sentences contained several low
frequency words which made them sufficiently difficult for an LE task.
For the 70 LE sentences, the number of words in each sentence ranged between 9 and 18
words (mean = 13.19, SD = 3.66). The mean duration of the OS stimuli was 8.81 seconds (SD
= 1.58, min = 6.00, max = 12.55) and that of HS was 5.27 seconds (SD = 1.045, min = 3.31,
max = 8.28). The lengths of OS stimuli were significantly longer (t(69)=1.66, p<0.001) than
HS stimuli. Average speaking rates (syllables per second) of OS and HS were 4.32 ± 1.79 and
7.36 ± 3.35 respectively.
Listening Test
Sixteen native Spanish speakers (7 female, 9 male; age range: 19–35, mean = 26.56, SD = 4.50)
participated in the study. All participants were native Spanish speakers from South America,
except one who was from Spain. They were given monetary compensation for participating in
the test. Ethics for conducting the experiment was approved by the local ethics committee of
the University of Oldenburg. All participants had normal hearing except one participant with a
55 dB hearing loss in the left ear. The inclusion of this participant did not alter the observations
of the study and hence, we chose to keep this participant. The stimuli were presented with a
loudspeaker placed at 0° in front of the participant at a distance of 1 m, at a comfortable listening
level of 60 dB SPL.
The test began with the LE task first (Figure 4.5), where 60 (30 OS and 30 HS) out of the
70 available LE sentences were played in 3 blocks of 20 sentences (randomised) each. For 15
sentences of each block, participants were prompted to provide LE ratings as per the 13-point
ACALES scale (See Section 2.2.2 for more details). In the other 5 sentences (presented at
random intervals), they were asked to repeat the last word of the sentence, which the experimenter
scored as correct or incorrect. This was to ensure that they were attentive and actively
listening to the stimuli. The LE task lasted for around 20-25 minutes. The average inter-stimulus
interval (time between the response and onset of the next sentence) was 1.59 ± 0.63 seconds.
Figure 4.5: LE Task Schematic Representation
Figure 4.6: SI Task Schematic Representation
After the LE task, the participants got a break (approximately 10 minutes) and then they
proceeded to the intelligibility task (Figure 4.6). In this task, they listened to a sentence and
received a prompt on the screen to repeat the sentence that they heard. They provided oral
responses for the 30 sentences. The whole session of the intelligibility test was recorded with a
microphone so that it could be transcribed later. This task lasted around 15 minutes.
In all, we had 45 subjective LE ratings and 15 last word recognition scores from the LE
task, and 30 SI scores from the SI task for each participant. EEG data was recorded for the
entire duration of the LE task and was therefore available for all the 60 LE task stimuli.
Cognitive Tasks
In the Flanker task, participants were presented 24 congruent (“<<<<<”), 24 incongruent
(“>><<<”), and 24 neutral (“−− <−−”) stimuli. They were asked to focus on the middle
symbol and correctly identify it by pressing “<” or “>” on a keyboard as quickly as possible.
Their response accuracy and reaction times were recorded. The better the performance (i.e.
shorter reaction times on accurate responses), the better the participant’s selective attention.
In the backward digit span task, the experimenter read a list of digits and the participant
was asked to repeat them in reverse (For example, experimenter: ‘3 2 9 5’; participant: ‘5 9
2 3’). The digits were read in an even tone at intervals of approximately one second. The
experimenter started with the set of three-digit lists. If the participant was able to correctly
recall 5 three-digit lists out of 6, they graduated to the set of four-digit lists. This went on
until the participant could no longer recall at least 5 lists in a set or until they reached the final
nine-digit list set. The digit span score was the maximum number of digits in the list where
the participant could recall 5 lists correctly. The larger the correctly recalled digit span, the
larger is the participant’s working memory capacity.
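The scoring rule described above can be sketched as follows. The function names are hypothetical, and returning a score of 0 when the three-digit set is not passed is an assumption, since that case is not covered in the description.

```python
def is_correct_backward(presented, response):
    """True if the response is the presented digit list in reverse order."""
    return list(response) == list(presented)[::-1]


def digit_span_score(results_by_length, pass_mark=5, first=3, last=9):
    """Longest list length whose set was passed, moving up from `first`.

    `results_by_length` maps a list length (3 to 9) to the outcomes of the
    lists in that set (True = recalled correctly, six lists per set in the
    procedure above). Assumption: a participant who fails the first set
    scores 0.
    """
    score = 0
    for length in range(first, last + 1):
        outcomes = results_by_length.get(length)
        if outcomes is None or sum(outcomes) < pass_mark:
            break  # set not attempted or fewer than `pass_mark` correct
        score = length
    return score
```

For example, a participant who passes the three- and four-digit sets but recalls only 4 of the 6 five-digit lists obtains a digit span score of 4.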
The cognitive tasks lasted 10-15 minutes. No EEG was recorded during the cognitive tests.
4.3.2 Analysis and Results
Out of the 16 participants, transcriptions were available for only 13 participants, as we could not record the responses of 3 participants due to technical problems. LE ratings were not available for one of the participants, also due to a technical problem with saving data. The Flanker effect score was not available for one participant. For analysis purposes, these missing data were filled with the mean values of the responses of the other participants. Sphericity and homogeneity checks were performed on the data with the JASP tool to ensure that the assumptions of an ANOVA test were met.
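The mean substitution described above can be illustrated as follows. This is a minimal sketch, assuming that each measure is stored as a list with `None` marking a missing participant; the function name is hypothetical and the thesis does not state how the substitution was implemented.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]


# Example: the second participant has no score for this measure.
print(fill_missing_with_mean([1.0, None, 3.0]))
```

Mean substitution keeps the group mean unchanged but slightly shrinks the variance, which is worth keeping in mind when interpreting the ANOVA results.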
Intelligibility
The audio responses of the sentence recognition task were transcribed by a native Spanish speaker, who was also a speech expert. WER was calculated using the same methods as elaborated in Section 2.2.1. As the WER was found to be highly correlated with all inclusive WERs in Experiment 1, we decided to proceed with all inclusive WERs only.
Figure 4.7: WER and LE for oesophageal (OF1) and healthy (HF1) speakers: (a) Word Error Rates, (b) self-reported LE ratings. Error bars show 95% confidence intervals.
The percentage WER score for OS was 18.88 ± 5.57 and for the healthy speaker it was 11.69 ± 5.07 (Figure 4.7a). ANOVA showed that speaker type had an effect on WER (F(1,15) = 27.20, p < 0.001, η² = 0.645).
Listening Effort Ratings
Mean LE (from a 13-point scale) for the OS speaker was 6.457 ± 3.150 and for the healthy speaker it was 1.994 ± 1.611 (Figure 4.7b). There was a difference of 6 points in median LE between OS and HS. The median LE for HS was 1 (no effort) and for the OS speaker it was 7 (moderate effort). ANOVA showed that speaker type had an effect on LE (F(1,15) = 77.55, p < 0.001, η² = 0.838).
There were 15 responses per participant for the task of repeating the last word in the LE task. The average error made in the recognition of the last word was 1.067 responses per participant. The total last word recognition error across all the 15 participants was 7 percent. We can tell, therefore, that the participants were attentive during the LE rating task.
LE, WER and Cognitive Tasks
The Flanker effect was calculated as shown in Equation 4.1. RT_incong and RT_neutral are the average reaction times taken to respond to an incongruent trial (“>><<<”) and a neutral trial (“−−<−−”), respectively. These reaction times are calculated for correct trials only.
\[
\text{Flanker Effect} = \frac{\log(RT_{\text{incong}}) - \log(RT_{\text{neutral}})}{\log(RT_{\text{neutral}})} \tag{4.1}
\]
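As an illustration of Equation 4.1, the sketch below computes the Flanker effect from a list of trials. The trial format, the function name and the use of milliseconds are assumptions made for the example. The base of the logarithm does not change the result, because it cancels between numerator and denominator, but the time unit does.

```python
import math
from statistics import mean


def flanker_effect(trials):
    """Flanker effect as in Equation 4.1, using correct trials only.

    trials: list of (condition, reaction_time_ms, correct) tuples, where
    condition is "congruent", "incongruent" or "neutral" (assumed format).
    """
    rt_incong = mean(rt for cond, rt, ok in trials if cond == "incongruent" and ok)
    rt_neutral = mean(rt for cond, rt, ok in trials if cond == "neutral" and ok)
    return (math.log(rt_incong) - math.log(rt_neutral)) / math.log(rt_neutral)


# Made-up reaction times, for illustration only.
example_trials = [
    ("incongruent", 620.0, True),
    ("incongruent", 580.0, True),
    ("incongruent", 900.0, False),  # incorrect response, ignored
    ("neutral", 500.0, True),
    ("neutral", 500.0, True),
]
print(flanker_effect(example_trials))
```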
The mean Flanker effect score was 0.413 ± 0.019 and the mean digit span score was 4.125 ± 1.258. No significant correlations were found between digit span scores and Flanker scores, indicating that they measure separate cognitive abilities.
Correlations between Flanker effect and mean LE ratings were not significant (Spearman’s rho = −0.432, p = 0.096). A significant negative correlation (Pearson’s r = −0.554, p = 0.049) was found between digit span scores and mean WERs (average of OS and HS).
4.4 Discussion
In Experiment 1, we were able to show that speaker type (OS or HS) had an effect on both LE and WER. OS speakers had poorer intelligibility compared to HS speakers and also a higher LE. The correlation between LE and WER suggests that more effort was reported as the intelligibility of the speaker worsened. Therefore, a drop in intelligibility was accompanied by an increase in LE. A further step in this direction would be to know what aspects of OS contribute more to LE: its spectral characteristics, lack of fundamental frequency, poor rhythm in speech, or a combination of these.
Our findings about the effect of the listener's familiarity with OS on intelligibility are in a similar vein to a study investigating the effect of the experience of the listener (speech expert vs. novice) on OS intelligibility [16]. However, in this study we were more interested in investigating the experience that comes from constant exposure as family members, close friends, and caretakers. We found that the intelligibility scores were indeed similar for familiar and unfamiliar listeners. Interestingly, however, familiar listeners reported less LE. So LE was able to provide additional insight about listening to OS.
As far as ASR is concerned, we found that ASR WER scores were higher for OS compared to HS. We compared WERs from our ASR system with HSR WERs. ASR WERs were higher compared to HSR WERs, but it could be because our ASR system was based on a unigram language model and focused only on acoustic models. The reason to choose such an ASR was to evaluate the drop in intelligibility owing to acoustic degradations, which is the case of OS.
In Experiment 2, the goal was to measure LE when listening to OS and HS with similar intelligibility scores taken from Experiment 1. Although WER data in Experiment 2 indicate a higher intelligibility of HS than OS, the overall intelligibility of both OS and HS can be considered to be very high. Despite this high intelligibility of both speaker types, we observed a considerable gap in LE, and this suggests that LE is a relevant dimension to be considered in OS evaluation.
The negative correlation of the digit span scores with WER suggests that participants with a poorer working memory (denoted by lower digit span scores) made more errors in recognition. This is understandable, as the ability to hold more information helps in correctly recalling and repeating the stimuli. The correlation with Flanker effect was not significant, suggesting that, in this case, selective inhibition plays a minor role in explaining differences in LE. The Flanker task is a measure of selective inhibition of distracting signals, such as noise added to the signals or signals with distracting speakers. However, the distractions in our stimuli are not of that nature. They are more in the form of undesired pauses and swallowing sounds that appear within and between words in the OS signal. The increase in LE was likely due to poorer quality of speech, rather than due to interfering information that has to be suppressed, as is the case in noisy environments. On the whole, we cannot tell from these results alone whether better cognitive abilities mean better performance (low LE and low WER) in OS speech perception. Future studies using different cognitive test batteries are necessary to help us answer that question better.
Finally, the familiarity effect on LE, as reported in Experiment 1, could mean that OS speakers might find it easier to communicate with family and close friends than with others. Although this was not investigated in this study, it would be interesting to know at what level of familiarisation this effect shows and also whether there is a ceiling effect of this familiarisation. That is, is there a point where, due to familiarity, they find OS as effortful as HS?
4.5 Conclusions
We performed two different experiments to collect intelligibility and LE metrics of OS and HS. The first experiment, a web-based one, was used to collect intelligibility and self-reported LE metrics. The conclusions of this experiment were that speaker type (HS or OS) had an effect on both intelligibility and effort. There was a significant correlation between WER and LE. Listeners familiar with OS obtained the same intelligibility scores as listeners who were not. However, they reported less effort in listening to OS than the non-familiar listeners. The ASR intelligibility was poorer for OS compared to HS.
The second experiment was to measure LE of HS and OS in a laboratory setting. The conclusions were that even if the intelligibility of OS was close to HS, there was a considerable difference in LE.
LE obtained through these experiments is based on the listener’s own interpretation of ‘effort involved in listening’. In the next chapter, we look deeper into LE by investigating brain activity as a physiological measure of LE, and study its relationship with this self-reported measure of LE.
We have built an OS restoration system aimed at better ASR and HSR intelligibility and low LE (see Chapter 6). The methods used in this study will be used to evaluate the outputs of this system (see Chapter 7).
Both HSR intelligibility and ASR intelligibility play different but important roles in OS evaluation. While improved HSR would enable better human–human interactions, an improved ASR performance would enable better human–machine interactions (e.g., digital voice assistants). Lower LE would also contribute towards improved communication with fellow humans. The evaluation of all these three metrics provides an all-round understanding of OS speech perception.
Experiment 1 from this Chapter was presented at a conference [109] and the combination of both Experiment 1 and 2 was published as a paper [112].
Chapter 5
Listening Effort and Oesophageal Speech: An EEG Study
“Not everything that can be counted counts and not everything that counts can be counted”
—Albert Einstein
Recent advancements in technology have enabled us to gain an in-depth understanding of the human body. One such technology is neuroimaging, which allows us to have a peek into the functioning of the brain. In this chapter, I describe an experiment that explored the differences in cerebral activity when listening to HS and OS. This experiment is an attempt to answer the questions: Do listeners’ brain signals reveal any differences while processing OS and HS? What are the factors that influence these differences in the brain signals?
5.1 Introduction
Speech communication requires a great deal of cognitive processing. It involves a vast network of activities such as acoustical processing, linguistic processing and emotion recognition, all performed in a very short span of time [36]. Listening to speech in (acoustically) challenging conditions increases the cognitive demand [80]. Challenging conditions can be attributed to any of the components of speech communication: sender (disordered speech, foreign accent [149]), receiver (hearing impairment [46], non-native listener [10]) or channel (reverberation, background noise [115], poor telephone connection). Moreover, listening to speech in challenging conditions for prolonged periods causes fatigue [80]. To overcome these additional challenges posed on the sensory-cognitive system of the listener, additional LE is required to understand the signal of interest. For more background information on LE, refer to Section 2.2.2.
We know from the previous chapter that OS is less intelligible and more effortful to listen to compared to HS. This result of LE was based on subjective ratings from listeners. As it is subjective in nature, this rating varies from person to person. Moreover, the definition of “effortful” is very subjective. What may be effortful for one person may not be as effortful for another person. Perhaps the other person has better cognitive abilities or a better cognitive load bearing capacity. Our aim is to answer all these questions and to better understand the neural processes involved in speech processing and effortful listening linked to OS, with the help of EEG.
The EEG activity can be decomposed into different frequency bands such as alpha (8-12 Hz), beta (16-31 Hz), gamma (>32 Hz), theta (4-7 Hz) and delta (0.5-4 Hz) bands. Investigating these frequency bands has given clues to understanding problems such as speech processing in adverse conditions [153] and sensing imagined speech [28]. Alpha (8-12 Hz) power (see Section 5.2.1 for the alpha power calculation procedure), particularly in parietal regions, has been found to be related to LE for speech-in-noise and is suggested to reflect the suppression of task-irrelevant information [135, 161]. Additionally, alpha power is known to increase with increasing acoustical degradation of speech [82] as well as with increasing working memory (WM) demands [98]. As OS is acoustically degraded, less intelligible and requires more LE (as per listeners’ subjective ratings), we hypothesise that this will result in a higher alpha power for OS compared to that of HS. Given the importance of cognitive functioning in speech perception, particularly of working memory, we also included individual differences in working memory functioning in the analysis. We assumed that working memory capacity could serve as a buffer to LE, i.e. larger working memory capacity is associated with less LE.
Some studies have looked into differences in brain activity while listening to degraded speech and control HS. Theys et al. [142] performed a study based on ERP components (see Appendix B for details on ERP components). They observed increased N100 amplitude and decreased N100 latency while listening to dysarthric speech compared to HS, indicating that the inherent degradation in dysarthric speech influenced early sensory auditory processing, and that degraded speech requires more neurophysiological resources in early processing stages. In a Near-infrared Spectroscopy (NIRS) study [114], 16 typically developing children listened to whispered speech and normally vocalised speech. A higher haemodynamic response was observed in the left ventral sensorimotor cortex (in the frontal-parietal region) for whispered speech compared to normal speech. This indicated increased cognitive effort while processing whispered speech. The degradation present in OS has certain properties in common with whispered speech, such as low energy and degraded fundamental frequency. As there have not been studies on OS and brain activity, we base our experiment on the aforementioned studies conducted on other kinds of degraded speech.
We expand the research on perception of degraded speech by investigating differences in the neural correlates of LE when listening to HS and OS and how they relate to subjective ratings of LE and behavioural performance (speech intelligibility) as well as the cognitive capacities of the participants.
5.2 Materials and Methods
The materials and methods of this experiment were the same as ‘Experiment 2’ of Chapter 4 (Section 4.3.1), in which the behavioural data results were presented. The current study presents the EEG data obtained from the same setup. Therefore this section only contains the detailed description of EEG acquisition.
I will briefly summarise the experimental procedure here for the benefit of the reader. Sixteen participants listened to sentences spoken by one HS and one OS speaker. The tasks involved an SI task with 30 short sentences and an LE task with 60 longer sentences. EEG was recorded for all the 60 sentences of the LE task. In addition, two cognitive tasks, the backward digit span task and the Flanker task, were conducted. For a more detailed description see Section 4.3.1.
5.2.1 EEG Acquisition and Analysis
A continuous EEG was recorded using a 24-channel wireless Smarting EEG system (mBrainTrain, Belgrade, Serbia) at a sampling rate of 500 Hz, with a low-pass filter of 250 Hz. The 24 electrodes were attached to an elastic EEG cap (EasyCap, Herrsching, Germany) according to the International 10/20 system [57]. To record the EEG data, the software Lab Streaming Layer [64] and Smarting Streamer 3.1 (mBrainTrain, Belgrade, Serbia) were used. EEGLab v.14.1.1 [18] was used offline to process and analyse the EEG data.
EEG recordings were re-referenced offline to an average of all electrodes. The EEG data were then filtered with a 0.1 Hz to 45 Hz bandpass filter. Excessive ocular artefacts, such as eye blinks, and other EEG artefacts were identified and corrected using an independent component analysis as implemented in EEGLab.
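The first preprocessing step above, re-referencing to the average of all electrodes, amounts to subtracting the mean across channels from every channel at each time sample. The sketch below illustrates only this idea in plain Python. It is not the EEGLab routine used in the study, and the data layout (a list of channels, each a list of samples) and the function name are assumptions.

```python
def average_reference(eeg):
    """Re-reference EEG to the average of all channels.

    eeg: list of channels, each a list of samples of equal length
    (hypothetical layout). Returns a new list with the same shape.
    """
    n_channels = len(eeg)
    n_samples = len(eeg[0])
    rereferenced = [[0.0] * n_samples for _ in range(n_channels)]
    for t in range(n_samples):
        # Mean potential across all electrodes at time sample t.
        channel_mean = sum(channel[t] for channel in eeg) / n_channels
        for c in range(n_channels):
            rereferenced[c][t] = eeg[c][t] - channel_mean
    return rereferenced


# Toy example with 3 channels and 2 samples.
toy = [[1.0, 4.0], [2.0, 5.0], [3.0, 9.0]]
print(average_reference(toy))
```

After this step the channels sum to zero at every sample, which is a quick sanity check on re-referenced data.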
Epochs were extracted from the continuous EEG. The lengths of the extracted epochs varied: each epoch was as long as the entire duration of its stimulus (i.e. sentence length).
Chapter 6
Enrichment Systems
“Mend your speech a little, lest it mar your fortunes.”
—William Shakespeare
In the previous chapters, I described characteristics and limitations of OS and the gaps in intelligibility and LE between OS and HS. In this chapter I present some experiments we performed with the aim of enriching OS. The contents of this chapter have featured in previously presented research titled ‘A multifaceted enrichment of oesophageal speech’ [110] and a previously published paper titled ‘Enrichment of Oesophageal Speech: Voice Conversion with Duration-matched Synthetic Speech as Target’ [111].
6.1 Introduction
OS is less intelligible and more effortful to process compared to HS (see Chapters 4 and 5). Poor intelligibility and increased LE hinder the verbal communication possibilities of OS speakers, even in non-noisy environments. Lack of intelligibility means that OS speakers have difficulty in meeting some important needs such as telephonic conversations, calling for a medical appointment, asking for directions and ordering food. Apart from these basic needs, OS speakers find it difficult to engage in family gatherings and public speaking, and to use voice-activated digital devices. All these challenges have been amplified in the COVID era with additional barriers such as wearing masks and the reduction of face-to-face interactions. Therefore, enriching OS with software interventions is a very promising aid to OS speakers to facilitate easier communication.
In this chapter, I present some of my own contributions towards enrichment of OS. The methodologies ranged from simple modifications (Section 6.3) on the OS signal to elaborate DNN-based VC methods (Section 6.2).
6.2 Experiment 1: DNN-based OS Enrichment
As stated in Section 2.3, one of the possible approaches to enrich OS is to use a VC system. The goal of a VC system is to convert the utterances of a source speaker to sound like those of a target speaker. In the OS enrichment context, utterances of an OS speaker can be mapped to a healthy speaker’s utterances, thereby having the OS acquire characteristics of HS. Like our previous approaches [126, 124], this proposed method is also based on VC.
VC systems may be parallel (requiring temporally aligned source-target utterance pairs) or non-parallel (requiring hours of speech data). Due to data limitations (100 sentences per speaker), parallel VC is best suited for our purposes. A parallel VC requires the parallel source and target sentences to be aligned for training. This is primarily done by Dynamic Time Warping (DTW) alignment, which finds an optimal match based on similarities in the two sequences.
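To make the DTW alignment mentioned above concrete, here is a minimal textbook sketch in Python. It is not the implementation used in any of the VC systems discussed here. It aligns two sequences of scalar features, whereas real systems align multi-dimensional spectral frames, and the function name and the absolute-difference distance are assumptions.

```python
def dtw_path(source, target, dist=lambda a, b: abs(a - b)):
    """Classic DTW: return (total_cost, path) for two feature sequences.

    path is a list of (source_index, target_index) pairs from start to end.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end to recover the frame-to-frame mapping.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


# The source holds its first value for three frames; the target for one.
total, path = dtw_path([1, 1, 1, 2, 3], [1, 2, 3])
print(total, path)
```

In this toy example the first target frame is matched to three source frames, which is the many-to-one behaviour that becomes problematic when the source contains long or irregular segments.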
The authors of [45] describe some challenges of DTW in the context of VC. One of them is the presence of silences or extra sounds in the source and not in the target. Another one is the poor estimation of end points of silences and phonemes. A third case is the many-to-one and one-to-many nature of the DTW mapping. For example, if the source contains longer durations of a phoneme, a single frame of the target may be mapped to several frames of the source. OS has undesired silences and artefacts and longer and varying durations of phonemes. These qualities make DTW challenging in the OS-HS VC task.
As a workaround, in our previous attempt [126], we performed alignment in two stages: first aligning the phone boundaries and then applying DTW, anchoring the phone boundaries. In this work, we took advantage of the available phone labels and the possibility of generating SS with explicit phone durations. This resulted in SS that matches the source OS utterances in duration and is therefore a perfectly aligned target. This eliminated the need for DTW and its limitations. We hypothesise that this DTW-free VC would improve the intelligibility and quality of the enriched OS compared to our previous methods.
A robust enrichment system should ideally work with OS speakers of varying speaking proficiency. Therefore, we performed enrichments for OS speakers ranging from very low to very high intelligibility. As the enrichment system is built to improve verbal communication of the OS speaker, it is important that the output of the enrichment system is preferred by listeners over the unprocessed OS. Moreover, given that voice interactions with machines are becoming more and more common, the enriched outputs should be intelligible to machines. Taking these points into consideration, we evaluated the subjective preference for the enriched system amongst human listeners as well as an objective measure of intelligibility and ASR performance.
To sum up, in this section, we present a novel, DTW-free, parallel VC system for OS enrichment which includes an SS target. There is a single-speaker (speaker dependent) system as well as a multi-speaker (speaker independent) system. We evaluate its outputs for ASR performance, STOI and a preference test (for the single-speaker system only) in comparison with unprocessed OS.
6.2.1 Materials and Methods
Data
We chose four OS speakers with a wide range of intelligibility from the original corpus (see Section 3.5). In the original database, the four speakers were identified as ‘02M3’, ‘04M3’, ‘16M3’ and ‘25F3’, and we continue to use these IDs. Average stimulus duration, speaking rates and intelligibility of the four speakers are presented in Table 6.1.
For each speaker, we used a parallel dataset of all the 100 phonetically-balanced Spanish sentences, where the source was OS and the target was SS. The procedure followed to generate the parallel SS will be explained in the next section. As stated before, the sentences were syntactically and semantically predictable but had some low frequency words. The number of words in each sentence ranged between 9 and 18 (mean = 13.19, SD = 3.66).
Speaker   Average duration per stimulus   Average speaking rate     ASR scores
          (seconds)                       (syllables per second)    (WER in %)
02M3      7.48 ± 1.67                     4.32 ± 1.80               56.25
04M3      9.27 ± 2.36                     3.84 ± 1.71               74.34
16M3      12.52 ± 3.61                    2.59 ± 1.19               90.39
25F3      7.85 ± 2.02                     4.24 ± 1.86               43.38
Table 6.1: Average stimulus duration, speaking rates and intelligibility of the four OS speakers
Proposed VC System
The proposed VC system, BLSTM with SS as target (BLSTMSS), is a Neural Network based system with OS as source and SS with matching durations as target (see Figure 6.1). The procedure is described in detail in the following steps.
Labelling of Oesophageal Speech
Segmentation and labelling of OS is a tricky process owing to undesired artefacts, incorrect pronunciations of some consonants and unstable fundamental frequency. The forced alignment feature built into generic Spanish ASR systems such as Kaldi [107] was unsuitable for OS. Therefore, using the Montreal Forced Alignment tool [72], new models were created by using OS as the training material (see Section 3.4). Automatic alignment with this forced aligner gave us the phone labels and their durations for the source OS utterances.
Figure 6.1: The proposed OS-HS VC system: BLSTMSS
Generating Target Synthetic Speech
Using the labels, their durations and the utterance text, SS was generated by explicitly assigning these durations to the phones. The text-to-speech system used was a Hidden Markov Model (HMM) based synthesis system [34] which was originally developed for the Basque language. The Spanish version is described in [119]. This gave us equal-sized, frame-by-frame aligned pairs of OS and SS.
Due to constant swallowing of air to produce speech, OS contains several pauses with artefacts within utterances. During the SS generation, these pauses were replaced with silences.
Voice Conversion Neural Network
Voice conversion was performed with the VC recipe of the Merlin toolkit [165]. Parameterisation and resynthesis were done using the WORLD Vocoder [91]. The extracted parameters included 60 Mel Cepstral Coefficients (MCC), 1 excitation parameter (log F0), 1 Band Aperiodicity Parameter (BAP), the deltas of the MCC, log F0 and BAP, the delta-deltas of the MCC, log F0 and BAP, and a voiced/unvoiced binary parameter. In all, there were 187 parameters extracted every 5 milliseconds.
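The total of 187 parameters follows from the counts above: 62 static parameters, each with a delta and a delta-delta, plus the single voiced/unvoiced flag. A short sketch of this arithmetic, with hypothetical variable and function names:

```python
# Static parameters per frame, as listed in the text.
STATIC_DIMS = {"mcc": 60, "log_f0": 1, "bap": 1}


def frame_dimension(static_dims=STATIC_DIMS, n_dynamic_orders=2, n_vuv=1):
    """Static + delta + delta-delta features, plus the voiced/unvoiced flag."""
    n_static = sum(static_dims.values())        # 60 + 1 + 1 = 62
    return n_static * (1 + n_dynamic_orders) + n_vuv


print(frame_dimension())  # 62 * 3 + 1 = 187
```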
Matrices of size 187 × (number of 5 ms frames) of the OS and SS utterances were the source and target inputs respectively. We split the 100 source-target pairs into 90 train and 10 test pairs. As the source and the target had the same number of frames, the alignment step in the training process was skipped. The train parameters were normalised to zero mean and unit variance and then fed into a 4-layered BLSTM (4 × 1024) training network. After training, the source test utterance parameters were converted using the trained model. A denormalisation of the mean and the variance was applied to the output parameters, followed by a Maximum Likelihood Parameter Generation using the variances from the training data. The resulting converted parameters were fed into the vocoder to synthesise the converted speech. A cross validation was performed 10 times, so that all the 100 sentences were available as test sentences.
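The data handling around the network (the 90/10 split repeated ten times, and the normalisation and denormalisation of parameters) can be sketched as below. This is an illustration only, not the Merlin recipe: the function names are hypothetical, the folds are assumed to be contiguous blocks of ten sentences (the text does not say how they were chosen), and the normaliser is shown for a single parameter track rather than a 187-dimensional matrix.

```python
from statistics import mean, pstdev


def cross_validation_folds(n_sentences=100, n_folds=10):
    """Return (train_ids, test_ids) pairs so each sentence is tested once."""
    fold_size = n_sentences // n_folds
    folds = []
    for k in range(n_folds):
        test_ids = list(range(k * fold_size, (k + 1) * fold_size))
        held_out = set(test_ids)
        train_ids = [i for i in range(n_sentences) if i not in held_out]
        folds.append((train_ids, test_ids))
    return folds


def fit_normaliser(train_values):
    """Mean and standard deviation estimated on the training data only."""
    return mean(train_values), pstdev(train_values)


def normalise(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


def denormalise(values, mu, sigma):
    return [v * sigma + mu for v in values]


folds = cross_validation_folds()
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Estimating the statistics on the training portion only, and reusing them to denormalise the network output, keeps the test sentences unseen in each fold.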
Multi-speaker system
A multi-speaker version of the BLSTMSS method was implemented using the same 100 sentences from 11 high intelligibility OS speakers. They are the most intelligible OS speakers in the database (less than 60% WER) based on the ASR system trained with HS. These 11 speakers are the 11 rightmost speakers¹ in Figure 3.5 (blue bars), not including the TOS speaker 09MT.
For the multi-speaker system, we combined the data from all the 11 speakers instead of performing VC for each speaker separately. Each speaker’s utterance of a certain sentence had its own different durations and hence the corresponding target signal was also different. Therefore, each utterance had a paired SS target, giving a total of 1100 utterances (100 sentences from 11 speakers). Ninety utterances from each speaker (the same 90 sentences from all speakers) and the corresponding target SS were put in the source and the target training set respectively. The training and conversion process was the same as that of the single speaker system described in ‘Voice Conversion Neural Network’ in Section 6.2.1.
6.2.2 Evaluations and Results
Evaluations involved comparing the speaker dependent BLSTMSS outputs to unprocessed OS using three ASR systems, an objective intelligibility measure and a preference test. In addition, we compared ASR scores and STOI scores of BLSTMSS with those of our previous systems. The multi-speaker system was evaluated using one ASR system (ASR 3) and STOI scores. A comparison of the speaker dependent system and the speaker independent multi-speaker system is also made.
ASR Evaluation of the Speaker Dependent Systems
We evaluated the outputs of our proposed enrichment system using three ASR systems: the speech-to-text system from Microsoft Azure using the Python azure-cognitiveservices-speech library (ASR 1) [85], the Elhuyar speech recognition system (ASR 2) [30] and a Kaldi based system (ASR 3) [107, 126, 124]. The input files to these ASR systems were the 100 single channel speech signals sampled at 16000 Hz. The outputs were text files containing the transcriptions.
¹ Speaker IDs: 01M3, 02M3, 03M3, 08M3, 12M3, 19M3, 22M3, 24M3, 25F3, 28F3, 29M3
The reason for using three ASR systems was to have a diverse set of evaluations. ASR 1 is a well known commercial ASR system used worldwide and is therefore easier to compare against in future studies elsewhere. ASR 2 is a commercial system built locally in Spain and is therefore better adapted to the speech style and vocabulary of the speakers involved in this study. ASR 3 is a customisable ASR with full control of all the components such as the language model, dictionary etc. ASR 3, which uses a limited lexicon and a unigram language model, was used in our previous studies [126, 124]. The advantage of this ASR is that it is not prone to updates, as is the case with commercial ASRs. This allows us to make fair and accurate comparisons of our ongoing work with our previous work.
We calculated two metrics from the ASR transcriptions: WER and PWC. WER and PWC were calculated using Equation 2.1 and Equation 2.2 respectively. Both concepts are explained in Section 2.2.1.
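For readers without Section 2.2.1 at hand, the sketch below shows how the two metrics are commonly computed from a reference sentence and an ASR transcription. It assumes the usual definitions, WER = (S + D + I) / N and PWC = (N − S − D) / N, where S, D and I are substitutions, deletions and insertions and N is the number of reference words. Equations 2.1 and 2.2 of the thesis may differ in detail, and the function names are hypothetical.

```python
def edit_counts(reference, hypothesis):
    """Word-level Levenshtein alignment; returns (S, D, I, N)."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # Each cell holds (total_edits, substitutions, deletions, insertions).
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, ins)          # match, no edit
            else:
                diagonal = (e + 1, s + 1, d, ins)  # substitution
            e, s, d, ins = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    _, s, d, ins = table[n][m]
    return s, d, ins, n


def wer_and_pwc(reference, hypothesis):
    """Return (WER, PWC) in percent under the assumed definitions."""
    s, d, ins, n = edit_counts(reference, hypothesis)
    return 100.0 * (s + d + ins) / n, 100.0 * (n - s - d) / n


print(wer_and_pwc("a b c d", "a x c d e"))
```

Because insertions count towards WER but not towards PWC, a system can lower its WER without changing PWC and vice versa, which is why the two scores need not move together.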
Figure 6.2: ASR 1 WER and PWC scores of unprocessed OS (source), the BLSTMSS converted outputs and target SS (target): (a) Word Error Rates, (b) Percentage Words Correct. Error bars show standard errors.
Figure 6.2, Figure 6.3 and Figure 6.4 show mean WER and PWC scores of the 100 sentences obtained from the transcriptions of ASR 1, 2 and 3 respectively. WER scores were lower (i.e. higher intelligibility) for BLSTMSS compared to unprocessed OS for all ASRs and speakers, with 2 exceptions: speaker 04M3 in ASR 1 and speaker 16M3 in ASR 2. In the case of PWC scores, a higher PWC score (i.e. higher intelligibility) was observed for the BLSTMSS samples compared to unprocessed OS samples for all speakers and ASRs.
Figure 6.3: ASR 2 WER and PWC scores of unprocessed OS (source), the BLSTMSS converted outputs and target SS (target): (a) Word Error Rates, (b) Percentage Words Correct. Error bars show standard errors.
Figure 6.4: ASR 3 WER and PWC scores of unprocessed OS (source), the BLSTMSS converted outputs and target SS (target): (a) Word Error Rates, (b) Percentage Words Correct. Error bars show standard errors.
ASR Evaluation of the Multi-speaker Systems
Figure 6.5 shows the ASR scores from ASR 3 of the multi-speaker system. There was ASR improvement for four speakers (02M3, 03M3, 19M3 and 22M3). For the other 7 speakers, the WER increased.
Figure 6.6 shows the comparison of the ASR scores of the single speaker system and the multi-speaker system for the two OS speakers (02M3 and 25F3) that were part of both systems. As can be observed, the multi-speaker version did not have better ASR scores compared to the speaker dependent system.
STOI Scores of Speaker Dependent Systems
We calculated STOI (see Section 2.2.1 for details) for unprocessed OS samples and converted BLSTMSS samples of the four OS speakers using the already aligned, duration-matched SS (target signal) as the reference signal.
Figure 6.5: ASR scores of the multi-speaker system containing 11 OS speakers.
Figure 6.6: ASR scores comparison of the single speaker and the multi-speaker BLSTMSS system.
The STOI results of the single speaker BLSTMSS method are shown in Figure 6.7. We can observe that the STOI scores have improved considerably (at least 15 percentage points) from OS to BLSTMSS for all four speakers. A high STOI score of over 62 percent was observed for all the BLSTMSS samples.
STOI Scores of Multi-speaker Systems
Figure 6.8 shows the STOI scores of the multi-speaker system. We can observe that the STOI scores improved for all the 11 speakers.
Figure 6.9 shows the comparison of the STOI scores of the single speaker system and the multi-speaker system for the two OS speakers (02M3 and 25F3) that were part of both systems. Both enrichment systems have higher STOI scores than unprocessed OS, but there were no significant differences between the two systems.
Figure 6.7: STOI scores of the four OS speakers and the enriched versions. Reference signal for STOI is duration-matched SS. Error bars show standard errors.
Subjective Tests
Subjective tests were performed only for the speaker dependent systems and not for the multi-speaker system.
While unprocessed OS has several undesired artefacts and lacks a natural fundamental frequency, it is natural speech. On the other hand, although the BLSTMSS outputs are much clearer sounding, they are synthetically produced and may have some limitations because of that. The success of the enrichment depends largely on whether listeners prefer to listen to the enriched version more than the unprocessed OS. Therefore, we performed a preference test to collect listeners’ opinion on whether they prefer listening to the outputs of the proposed system or the unprocessed OS.
Participants listened to pairs of samples, one unprocessed OS sentence and the corresponding BLSTMSS enriched output of the same sentence. There were 10 pairs for each speaker, a total of 40 pairs. The chosen 10 pairs were the shortest sentences in the set, as that allowed us to have the maximum number of evaluations while keeping the test under 20 minutes. The presentation order of all the pairs, as well as the order of BLSTMSS and OS within each pair, was randomised to avoid order bias. After listening to the two stimuli in each pair, the participants were asked to mark the stimulus they preferred amongst the two. The options they were given were ‘Prefiero claramente la primera’ (I clearly prefer the first one), ‘Prefiero la primera’ (I prefer the first one), ‘No percibo diferencia/Ninguna suena mejor’ (I do not perceive any difference/Neither one sounds better), ‘Prefiero la segunda’ (I prefer the second one), ‘Prefiero claramente la segunda’ (I clearly prefer the second one).
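The two levels of randomisation described above (order of the pairs, and order of the two stimuli within a pair) could be generated along the following lines. This is a hedged sketch: the function name, the tuple layout and the sentence identifiers are hypothetical, and the thesis does not describe the software actually used to build the test.

```python
import random


def build_preference_trials(sentences_per_speaker, seed=None):
    """Create randomised OS vs. BLSTMSS trials.

    sentences_per_speaker maps a speaker ID to the sentence IDs chosen for
    the test (hypothetical IDs). Each trial is a tuple (first, second) of
    (system, speaker, sentence) triples.
    """
    rng = random.Random(seed)
    trials = []
    for speaker, sentence_ids in sentences_per_speaker.items():
        for sentence_id in sentence_ids:
            pair = [("OS", speaker, sentence_id), ("BLSTMSS", speaker, sentence_id)]
            rng.shuffle(pair)       # randomise the order within the pair
            trials.append(tuple(pair))
    rng.shuffle(trials)             # randomise the presentation order of pairs
    return trials


# Four speakers with ten (made-up) sentence IDs each gives 40 trials.
chosen = {spk: list(range(10)) for spk in ["02M3", "04M3", "16M3", "25F3"]}
trials = build_preference_trials(chosen, seed=1)
print(len(trials))
```

Passing a fixed seed makes a given participant's ordering reproducible, which helps when mapping ‘first’ and ‘second’ responses back to the underlying systems.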
Speech type        ASR Scores (WER in %)
                   02M3       16M3
Unprocessed OS     56.32      90.39
GMM VC             37.93*     NA
BLSTMHS VC         40.35*     NA
BLSTMSS VC         30.89*     61.32*
Silence removal    57.99      88.95*
Wavenet            56.09      91.37
Table 6.2: ASR scores of unprocessed and enriched OS for high (02M3) and low intelligibility (16M3) OS. Text marked with * shows numbers where improvement was observed
6.4 Enrichments Demonstration
I created a web-based interactive demonstration simulating an OS speaker speaking with an enriched voice. For this demonstration, we recorded an OS speaker speaking a passage (see Section 3.3.2). Audio was recorded simultaneously with the same recording equipment as the original database to obtain a better quality recording. This audio was then passed through three different enrichment systems: a GMM based system (Enrichment 1), BLSTMSS (Enrichment 2) and a pitch-modified version of Enrichment 2 to better suit the speaker (Enrichment 3).
In the interactive demo, the user can play the video and choose amongst four options for the audio: the original speech and the 3 enrichment systems. Whenever the user chooses an audio type (by clicking the corresponding button), the video plays with the chosen audio in a synchronised manner. In this way, it is possible to visualise the OS speaker speaking in the original as well as the enriched versions of the speech. The demo can be found at the following link:
h ps://aholab.ehu.eus/use s/sneha/london demo/ es .h ml.
This demo was presented at the Royal Institution, London² and as a part of a show and tell session in a virtual conference³.
6.5 Conclusions
The BLSTMSS system had better ASR scores and objective intelligibility measures compared to unprocessed OS. In recent times, communication with digital assistants and other devices has been on the rise. Therefore, an improvement in this direction is desirable for efficient communication with digital devices and dialogue systems.
The BLSTMSS sys em was p e e ed by lis ene s compa ed o unp ocessed OS. A sligh
excep ion was obse ed in case o he leas in elligible speake o he se . Fo his speake ,
2h ps://www. igb.o g/wha s-on/e en s-2020/ma ch/public-easy-speaking-e o less-lis ening
3h ps://cmswo kshops.com/ICASSP2020/Pape s/ViewPape .asp?Pape Num=6191
72
he e was no a decisi e p e e ence o he BLSTMSS me hod. We p esume ha when his
less in elligible speake gains mo e expe ience in OS wi h he aid o a speech he apis , hei
in elligibili y will imp o e and hey will gain mo e bene i om he en ichmen .
While voice conversion remains the greatest contributor in improving ASR performance, it
was observed that removing undesired silences was beneficial for low intelligibility OS. Synthesis
with a rich vocoder did not help in ASR improvement but it has scope for positive responses
in perceptual evaluations.
Some of the methods to enhance OS that may benefit from further exploring are the
eigenvoices method [23] and a GMM-based open source VC system called sprocket [60]. The results
from the DIFFGMM algorithm in this VC system seemed promising, but a formal evaluation
could not be performed. Therefore it would be useful to explore this method more and perform
formal evaluations.
In addition to ASR scores, STOI and preference tests, we are interested in investigating
subjective and physiological LE for unprocessed OS, enriched OS and HS. This is because,
while intelligibility reveals what percentage of the speech was understood correctly, it does
not tell us how difficult it was to understand it. LE provides useful additional information
about whether enriched OS is easier to perceive and process compared to unprocessed OS.
Therefore, our future studies (Chapter 7) will focus on LE in addition to intelligibility and
listener preferences.
The demonstration in Section 6.4 has helped the general public, researchers and the OS
speakers themselves to visualise the expected outputs of an enrichment system.
The DNN-based system was published as a journal paper [111] and the lightweight
enrichment systems were presented in a conference [110].
Chapter 7
Final Enrichment Evaluations
“The most basic of all human needs is the need to understand and be
understood. The best way to understand people is to listen to them.”
- Ralph G. Nichols
The final step in the OS enrichment problem is the evaluation of the outputs of the enrichment
processes. The aim of OS enrichment was to close the gaps in intelligibility and LE between OS
and HS. In other words, we aimed to have the metrics of the enriched outputs be as close as
possible to HS. Some evaluations were presented in the previous chapter but these evaluations
only compared some machine intelligibility related scores of the particular novel algorithm
with those of OS. In this chapter, I describe some objective, subjective and physiological evaluations
of the intelligibility and LE of all the OS enrichment tasks we have performed so far, including
work from other researchers in the lab. These metrics are compared with those of unprocessed
OS and clear speech (HS and SS). We shall see through the evaluations how the algorithms we
adopted contributed to the enrichment of OS.
7.1 Introduction
When speech is disordered, the deficiency in producing clear speech makes listening to the
speech difficult. We have seen in Chapter 4 and Chapter 5 that listening to OS poses such
challenges to listeners compared to HS. In Chapter 6, we described some algorithms that were
developed to transform OS with the objective that listening to an OS speaker would be easier.
Our findings suggest that we had success in enriching speech in some areas (ASR scores, STOI,
preference scores). These results are encouraging and in many software based enrichment studies
[102, 166, 12], the success of the enrichment methods is evaluated based on these objective
measures or simple Likert scale based MOS scores. But these evaluations are unidimensional
and do not tell us the whole story.
In Chapter 5, we described an in-depth evaluation of OS and HS by conducting a listening
experiment in a laboratory setting and measuring SI, subjective LE and EEG activity. This
helped us attain a deeper understanding of the differences in HS and OS. In this chapter, I
describe a similar in-depth experiment with all the major enrichment methods developed by
us so far in addition to the OS and HS samples. The goal is to know which is the winning
enrichment method. Ideally it would be the one that beats OS and the other enrichment
methods to be more intelligible and less effortful to process.
All the methods used in this experiment and the motivations to use them were explored and
explained in previous chapters. This chapter is just a coming together of all the enrichments
and all the types of evaluations to determine which is (if we do have one) the best enrichment
system that we made.
7.2 Stimuli
Five systems were evaluated in this experiment: OS, HS and three versions of enriched OS. The
three versions of enriched OS were outputs of three different DNN based enrichment systems
(BLSTMHS [126], PPG [124], and the new single speaker BLSTMSS method described in
Chapter 6). A subjective word recognition task also included the multi-speaker version of the
BLSTMSS system.
In this evaluation, we have chosen samples only from speaker 02M3. Using samples from all
speakers was not feasible as that would increase the conditions in the listening tests and would
make them too long, making it difficult for the listeners to sustain attention. Speaker 02M3
had a high intelligibility (ASR WER: 56.25%). He was the only speaker who had performed
additional recordings of words and passages which were useful for more evaluations.
For all the conditions, we used all the 100 phonetically-balanced Spanish sentences described
in Section 3.1. The sentences were syntactically and semantically predictable but had some
difficult low frequency words. They were sufficiently difficult for an SI and LE rating task.
In addition to the sentences, we used 150 low frequency difficult words for a word recognition
task. The details of these words are described in Section 3.3.1 and the list of words in Appendix
D.
Figure 7.1: LE and SI Task Schematic Representation
7.3 Experimental Procedure
32 native Spanish speakers (9 male, 23 female, age: 25.63 ± 5.11) participated in a listening
test, presented with a PsychoPy [103] interface. Each participant listened to 100 sentences, 20
sentences in each of the 5 conditions: OS, PPG, BLSTMHS, BLSTMSS and HS. After listening
to each sentence, they performed an SI task and an LE rating task. No sentences were heard
more than once. All the stimuli underwent loudness normalisation in accordance with the EBU R
128 Standard [29].
The 100 sentences were presented in 5 blocks of 20 sentences where each block contained
stimuli from all the 5 conditions in a randomised order. The stimuli were counterbalanced across
participants and conditions, such that each sentence in a particular condition was listened to
an equal number of times across all participants.
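
The counterbalancing described above can be illustrated with a simple Latin-square rotation. The following is a minimal sketch under stated assumptions and not the actual PsychoPy script used in the experiment: it assumes that each block holds four sentences per condition (the text only says that every block contains all 5 conditions), and the function name build_playlist, the seeding scheme and the 0-based sentence indices are hypothetical.

```python
import random

CONDITIONS = ["OS", "PPG", "BLSTMHS", "BLSTMSS", "HS"]


def build_playlist(participant, n_sentences=100, n_blocks=5, seed=0):
    """Return n_blocks lists of (sentence_id, condition) for one participant."""
    n_cond = len(CONDITIONS)
    # Latin-square rotation: the condition assigned to a sentence shifts by
    # one position for every new participant, so no sentence is ever heard
    # twice by the same listener.
    trials = [(s, CONDITIONS[(s + participant) % n_cond])
              for s in range(n_sentences)]
    rng = random.Random(seed * 100003 + participant)
    blocks = [[] for _ in range(n_blocks)]
    for condition in CONDITIONS:
        items = [t for t in trials if t[1] == condition]
        rng.shuffle(items)
        share = len(items) // n_blocks
        # Deal an equal share of this condition into every block.
        for b in range(n_blocks):
            blocks[b].extend(items[b * share:(b + 1) * share])
    for block in blocks:
        rng.shuffle(block)  # randomised order inside each block
    return blocks


playlist = build_playlist(participant=3)
print(len(playlist), len(playlist[0]))
```

In this sketch the balance across listeners is exact only when the number of participants is a multiple of the number of conditions; otherwise some sentence-condition pairs are heard one time more than others.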
7.3.1 Behavioural Tasks
The behavioural tasks consisted of an SI task and an LE task with sentences and an SI task
with word stimuli.
In the sentences SI task, the participants repeated aloud the last word of the sentence they
just heard. Their response was recorded and later checked for correctness. The SI score per
stimulus was whether or not the listener got the last word right: 1 for correct, 0 for wrong.
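
The binary scoring rule above, and the percentage of words correct derived from it, can be sketched as follows. In the experiment the responses were checked by the experimenter, so this is only a hypothetical automatic version that assumes the spoken responses have already been transcribed; the function names and the example sentence are illustrative.

```python
import unicodedata


def normalise(word):
    """Case-fold a word and drop punctuation so that 'Carne.' matches 'carne'."""
    composed = unicodedata.normalize("NFC", word.casefold())
    return "".join(ch for ch in composed if ch.isalpha())


def si_score(reference_sentence, response):
    """Return 1 if the response equals the last word of the sentence, else 0."""
    target = normalise(reference_sentence.split()[-1])
    return 1 if normalise(response) == target else 0


def percent_words_correct(scores):
    """Average a list of 0/1 scores into a percentage."""
    return 100.0 * sum(scores) / len(scores)


# Hypothetical sentence and responses.
scores = [si_score("El perro come carne.", r) for r in ("Carne", "calle")]
print(scores, percent_words_correct(scores))
```

A human check remains useful for near-misses (for example homophones or inflected forms), which an exact string match like this one would mark as wrong.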
The 13-point ACALES scale (see Section 2.2.2) was used to rate LE. Unlike the previous
experiments, here every stimulus had an SI as well as an LE rating. Figure 7.1 shows a schematic
representation of this task.
An additional SI task was performed with the 150 isolated words. This was a simple test
with the participant listening to the word and then repeating it. The experimenter scored
the responses for correctness later: 1 for correct, 0 for wrong. Figure 7.2 shows a schematic
representation of this task.
In all, the test lasted 1.5-2 hours which included 30 minutes for EEG setup, 30 minutes for the
combined SI-LE task, 15 minutes for the words SI task and 15-30 minutes for post-test
questionnaires and dismounting of the EEG cap. The participants were given monetary compensation.
The experiment was approved by the ethics committee of the Basque Center on Cognition,
Brain and Language.
Figure 7.2: Word Recognition Task Schematic Representation
7.3.2 EEG Acquisition and Analysis
A continuous EEG was recorded while participants were performing the LE rating task. The
brain activity was recorded from 32 electrode sites mounted into an elastic EEG cap (EasyCap,
Herrsching, Germany) and arranged according to the International 10-20 system [57] (see
Appendix B for more details). BrainVision Recorder was used to record EEG data. The EEG
was recorded at a sampling rate of 1000 Hz. EEG data offline processing and analysis was
conducted using EEGLab v.14 [18].
Amongst the 32 electrodes, the scalp electrodes were electrodes 1 to 27. Electrode 28 was
in the right mastoid position and electrodes 29, 30, 31 and 32 were placed on the forehead and
eye area to record ocular activity. An impedance of <5 kOhms was ensured on each electrode
during the EEG cap mounting.
EEG recordings were re-referenced off-line to an average of the right mastoid electrode. The
EEG data was then filtered with a 0.1 Hz to 45 Hz bandpass filter. Excessive ocular artefacts,
such as eye blinks, and other EEG artefacts were identified and corrected, using an independent
component analysis as implemented in EEGLab.
The epoch extraction procedure was the same as in Section 5.2.1. Variable length epochs were
extracted with durations matching the speech signals with an additional 500 ms pre-stimulus
as baseline. Similarly, the frequency analysis was performed as per the process described in
Section 5.2.1. It involved a PSD calculation with a 99 percent overlap. Here, as the sample
rate was 1000 Hz, 2000 points (corresponding to 2 seconds) were used to calculate the Fourier
transform. As before, the alpha power was the mean of the power values between 8 and 12 Hz.
Although electrode layouts and placements were slightly different in this experiment compared
to the experiment in Chapter 5, the analysis here was also focused on the centro-parietal region
(CP1, CP2, CP5, CP6, C3, C4, CZ, P1, P2, P3, P4, P7, P8, PZ).
Figure 7.3: Distribution of Alpha Power Value for all the 32 participants.
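
The alpha power computation described in this section (a PSD from 2000-point windows with 99 percent overlap at a 1000 Hz sampling rate, averaged between 8 and 12 Hz) can be sketched as follows. This is an illustrative NumPy sketch and not the EEGLab code used in the thesis: the Hann window, the per-segment mean removal, the function name alpha_power and its parameters are all assumptions made for the example.

```python
import numpy as np


def alpha_power(epoch, fs=1000, nperseg=2000, overlap=0.99, band=(8.0, 12.0)):
    """Mean PSD in the alpha band of one single-channel epoch (Welch-style)."""
    epoch = np.asarray(epoch, dtype=float)
    nperseg = min(nperseg, epoch.size)       # short epochs use one window
    step = max(1, int(round(nperseg * (1.0 - overlap))))
    window = np.hanning(nperseg)             # assumed window shape
    scale = 1.0 / (fs * np.sum(window ** 2))
    psd = np.zeros(nperseg // 2 + 1)
    starts = range(0, epoch.size - nperseg + 1, step)
    for start in starts:
        segment = epoch[start:start + nperseg]
        segment = segment - segment.mean()   # remove the DC offset
        psd += scale * np.abs(np.fft.rfft(segment * window)) ** 2
    psd /= len(starts)                       # average over all segments
    # One-sided spectrum: double every bin except DC (and Nyquist if present).
    if nperseg % 2 == 0:
        psd[1:-1] *= 2.0
    else:
        psd[1:] *= 2.0
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return float(psd[in_band].mean())


# Synthetic 3-second epoch containing a 10 Hz oscillation (illustration only).
t = np.arange(3000) / 1000.0
print(alpha_power(np.sin(2 * np.pi * 10 * t)))
```

With 2000-point windows the frequency resolution is 0.5 Hz, so the 8 to 12 Hz band is averaged over nine frequency bins; because the epochs have variable length, longer stimuli simply contribute more overlapping segments to the average.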
With 32 participants, 5 conditions, 20 stimuli per condition and 12 electrodes, we had
32*5*20*12 i.e. 38400 data series. Out of these, some trials were rejected due to noisy data
and we were left with 37944 data points.
Alpha power is affected by individual brain activity, environmental factors, the mental and
physiological state of the participant at the time of the experiment and other such factors
[96]. Figure 7.3 shows the distribution of alpha power of the 32 participants for OS. Here,
the ’Participant’ axis represents each of the 32 participants and the ’OS Alpha Power’ axis
represents the 30 bins of alpha power values. The ’Number of Observations’ axis is the number
of alpha power values observed in a particular alpha power bin. Two example points
help in reading the graph. At point A, participant number 1
has 0-50 observations where alpha was in the range of 30-31 µV²/Hz (an outlier). At point
B, participant number 30 has close to 100 observations where alpha was in the range of 1-2
µV²/Hz. As a general observation, we can see that each participant has a different alpha
profile or distribution. A subset of 3 participants provides a simplified representation of the
issue. Figure 7.4 shows the varying power distribution of 3 of the 32 participants.
To remove all these between-subject differences and to focus only on the alpha power
differences between conditions, a power normalisation was performed. There are several ways
to normalise data like this, such as z-scores, dividing by the max value and
dividing by resting state alpha power. Each of these processes is explained in the following
paragraphs.
A z-scored alpha power was generated by using the mean and standard deviation of all the
data points from each participant. The distribution after applying the z-score is shown in Figure
7.5 for the same three participants considered before. As can be observed, the distributions
of z-scored alpha power of the three participants overlap. This eliminated inter-subject
variability.
Figure 7.4: Distribution of Alpha Power Value for a subset of 3 participants. Each participant
has a separate range of alpha power values.
Figure 7.5: Distribution of z-scored alpha power for a subset of 3 participants.
Although the z-score approach to normalising helps us get rid of the problems of
inter-subject variability, it assumes that the data follows a normal distribution. We can see that
the alpha power data does not follow a normal distribution (see Figure 7.4). It has a slight
skewness and it looks more like a log normal distribution.
The second way to normalise the data is by dividing the entire dataset from a participant
by the maximum alpha power of that participant. As alpha power only has positive values,
this process fits all the datapoints from a participant between 0 and 1. See Figure 7.6 for the
normalised alpha power distributions of the 3 participants.
Another typical approach to EEG data normalisation is to divide the alpha power from
one participant by the baseline resting state alpha power. The resting state corresponds to the
time when the participant is not engaged in any task. In this case, the participant was asked
to focus on a cross on the screen for 60 seconds. The idea here is that this normalisation will
take away the alpha power components that pertain to the participant’s inherent brain activity
without the effect of any task performance. We divided the alpha power of each stimulus by
the average alpha power during the 60 seconds of resting state EEG data. This was performed
separately for each of the 12 centro-parietal electrodes. See Figure 7.7 for the normalised alpha
power distributions of the 3 participants.
Figure 7.6: Distribution of normalised alpha power (normalised by max value) for a subset of
3 participants.
Figure 7.7: Distribution of baseline corrected alpha power for a subset of 3 participants.
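
The three per-participant normalisations described in this section can be summarised in a few lines. This is a minimal sketch using only the Python standard library, not the analysis code of the thesis; the function names, the use of the population standard deviation in the z-score and the example values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore(values):
    """Centre one participant's alpha values and scale them to unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def max_normalise(values):
    """Divide by the participant's maximum, mapping positive values into (0, 1]."""
    peak = max(values)
    return [v / peak for v in values]


def baseline_normalise(values, resting_power):
    """Divide by the mean alpha power of the resting-state recording."""
    return [v / resting_power for v in values]


# Hypothetical alpha power values (µV²/Hz) of one participant and electrode.
alpha = [1.2, 2.5, 1.9, 4.0]
print(zscore(alpha))
print(max_normalise(alpha))
print(baseline_normalise(alpha, resting_power=2.0))
```

Each function is meant to be applied to the data of a single participant (and, for the baseline version, a single electrode), which is what removes the between-subject offsets while keeping the differences between conditions.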
7.4 Results
7.4.1 Speech Intelligibility
Figure 7.8 shows the averaged SI scores (% words correct or PWC scores) from the 32
participants. HS had the highest SI score (p<0.001). The SI score of OS was higher than that of all the
enrichments (p<0.001). Within the enrichments, the proposed system had significantly higher
to shorter RTs and lower LE. Our results for LE, SI and their RTs are along the same lines.
The EEG activity data reiterates our findings from Chapter 5: Listening to OS entails higher
alpha power compared to HS. There were no significant differences in alpha power between the
OS condition and the enriched OS conditions. Although the BLSTMHS method had lower
alpha compared to other enrichments, this was not significant and hence we cannot conclude if
this enrichment provided any improvement in cognitive load.
We observed that the alpha power increased as the experiment progressed. This may indicate
an increase in fatigue or a decrease in alertness, as was also observed by Antons et al. [3]. While
shorter experiments would reduce these tendencies, they also reduce the data points necessary to reach
statistical significance. Therefore, such listening tests must be designed by keeping this
trade-off in mind.
In the case of unprocessed OS, there was a drop in alpha from the fourth block to the final
block. A possible explanation for this is that when the LE demands are too high, the listener
often “gives up” at the task of listening and starts exerting lesser LE. For example when you are
in extreme fatigue, say, a jet lag, you cannot meet the required attentional demands and may
exert lesser effort [136]. This seems to have happened with unprocessed OS in the last block,
where the listener was so fatigued that they exerted a lower amount of effort. This pattern of
a drop in physiological LE for stimuli above a certain level of difficulty has been often observed
in many LE studies [99, 100, 164]. However, this same trend was not observed for the other
conditions. A longer experiment with more blocks would be helpful in understanding this
disengagement effect better.
The lack of correlation between LE and alpha power, and the connection of alpha power and
fatigue could mean that alpha power is an indicator of fatigue and not LE in the experiment
that we performed. In general, the participant was fatigued as the experiment progressed and
the fatigue was less evident for HS compared to OS and the synthetic enrichment outputs. The
fatigue associated with OS reached the extent that the participants may have disengaged
from the task of listening to OS.
An evaluation of systems via listening tests is time consuming and cumbersome. However,
it helps us know how our improvements are received by human listeners. In the end, the aim
is that the laryngectomees and the people interacting with them benefit from the enrichments.
Therefore, people’s opinions in this case are crucial and valuable.
In the ASR and the STOI evaluations, BLSTMSS had better scores compared to our
previous systems, as well as unprocessed OS. Therefore, the enriched outputs would be useful
in improving human-machine interactions. As for human-human interactions, more research
would be needed to develop an enrichment system that appeals more to human listeners than
unprocessed OS. Nonetheless, an improvement in SI and LE compared to previous enrichment
systems suggests that we have been able to move some steps forward in the OS enrichment
research.
7.6 Conclusions
The BLSTMSS system (developed as part of this thesis) outperformed our previous systems in
STOI and ASR, as well as SI scores and subjective LE scores. When compared with OS, there
was an improvement in ASR scores and STOI, but not in SI scores and subjective LE. EEG
activity revealed more alpha power when listening to OS compared to HS indicating an effect
of task difficulty. Additionally, alpha power increased as the experiment progressed indicating
fatigue. No conclusions could be made based on the alpha power of enriched OS. Therefore,
although there is an improvement in some objective measures, further improvements are needed
to make the transformed OS more preferable and understandable to human listeners.
The findings of this chapter are in preparation to be published as a paper titled ’Evaluation
of Enriched Oesophageal Speech: an EEG Study’.
Chapter 8
Conclusions
“Begin at the beginning,” the King said gravely, “and go on till you
come to the end: then stop.”
- Lewis Carroll, Alice in Wonderland
I set foot on this journey of enrichments and evaluation of OS with some very clear and
straightforward aims. It was when I got into the problem that the depths and complications of
this task became apparent. All the obstacles and successes in this process have taught me
some important lessons, which will be helpful to me in research as well as any problem solving
activity. I cannot say that I have totally solved the problem that I aimed to solve, but surely I
have progressed a bit and gained a lot of knowledge. As progress is limitless, my journey with
this problem could keep going on and on. But as the King says, the journey has come to an
end and I must stop. Here I summarise all the findings of my thesis and prepare my baton to
be passed to future researchers who wish to take this journey forward.
8.1 A Recap of the Problem Statement and Aims
OS lacks the intelligibility and quality of HS due to physical abnormalities: absence of larynx
and vocal folds and separation of trachea and oesophagus. The lack of intelligibility and quality
also increases the cognitive load when listening to OS. These issues impede OS speakers from
social and familial engagements, expressing their needs, using voice-controlled digital devices
and in general, navigating this world verbally.
The main aim of this thesis was to enrich OS signals by increasing intelligibility and quality
and reducing LE. The other aim was to decide and implement appropriate metrics to evaluate
OS and enriched OS.
8.2 Methods Used and Takeaways
A wide range of methodology was used to do the research in this thesis. All these methods have
expanded my knowledge base and enabled me to make better research decisions.
A thorough understanding of OS was gained by studying the OS database. The variety of
data and some acoustical measures of that data present in the database helped me a great deal
in understanding the characteristics of OS.
For enrichment of OS, the main methods used were machine learning and other statistical
learning tools. Machine learning is being used increasingly in a wide variety of industries such
as healthcare, business, entertainment, social media, transport and other important industries.
The success of machine learning in these fields is encouraging and intriguing. Through this
project, I was able to gain some understanding of the inner workings of machine learning and
neural networks such as learning rates, training data, normalisation, optimisation algorithms
and so on. In the quest of finding a DNN system that could enrich OS, I went through several of
these DNN terminologies as well as new and upcoming architectures that have shown promising
results. In the case of this thesis, a DNN based method was found to be the most effective
in achieving the goals of enrichment. Apart from DNNs and machine learning, a great deal
of signal processing knowledge was applied too. This was essential for the understanding of
speech characteristics such as fundamental frequency and formants as well as during EEG
signal processing for an effective treatment of the signal. My main takeaway was that to solve
any problem it is important to thoroughly understand the data in hand and have updated
information about the tools available to work on that data.
A great deal of expertise in software languages such as MATLAB and Python was needed and
acquired during the processes of this project. Knowing how to program in these languages as
well as knowing how to use programs written by other collaborators in these languages was
crucial in achieving the goals of this project.
Statistical analysis was a key part of interpreting data and drawing conclusions. Providing
statistical analyses made the results scientifically sound and credible. Over the course of the
project I gained a lot of knowledge about various analysis tools and methods such as parametric
and non-parametric tests, ANOVAs and t-tests and open source statistical tools such as JASP.
I learned the importance of applying the right kind of statistical analysis on any given data set.
Although I have not come out of this project as a statistics sage, I can say that I have made
a few dips into the pool of statistical wisdom. And surely they have been great tools to make
sense of the data that I have collected.
As this is a study of speech and LE, listening tests were another major indispensable part of
the research. The only true and correct way to know listeners’ opinions and their response to
the modifications we made was by conducting listening tests. I came across several factors that
are crucial while conducting listening tests such as conducting pilot tests before the actual test.
Some of these are choosing your participant demographic carefully, preparing for a no-show of
participants, meticulous storage of listener data, deciding the mode of the listening test (lab
or web-based), avoiding glitches in the listening test software (especially the ones that do not
save data properly), maintaining anonymity of participants and ethical requirements and so on.
Experiment design, i.e. the number and type of conditions and stimuli to be used, the manner
of presentation (randomised, avoiding memory effect and priming), usage of the right software tool
and audio setup, length of the experiment and experiment breaks and other such factors are
important in a successful listening experiment. To put it simply, conducting a listening test is
like a grand orchestra where each part (small or big) plays an important role in the final result.
An unexpected course in this project was the use of a biomedical tool such as EEG to measure
LE. This decision came in midway through the thesis and as an engineer, I was completely
unprepared to conduct an EEG experiment. However, the experience and insights I have
gained from this method were mind blowing and the biggest learning points. The part objective
part subjective nature of this method leaves a lot of room for continually thinking about the
outcomes. Nonetheless, the fact that this data came from an actual human being makes it
very interesting and valuable. Fitting a participant with an EEG cap is a time consuming
and meticulous process. After testing around 50 participants, I feel I am prepared to conduct
another EEG experiment.
EEG signal interpretation and analysis is a vast world with a lot of possibilities. Most of the
EEG research I have come across uses a standard ensemble of tools to interpret and analyse EEG
data. With the aid of my supervisors who are adept at signal processing, I was able to look at
the EEG data with more depth. For example, I tried different window lengths and windowing
techniques while extracting spectral characteristics of EEG. I found a way to analyse the EEG
data corresponding to the entire length of the signal as opposed to just a fixed length of the
signal, which is the usual practice. I tried to understand the inner workings of the independent
component analysis process which is a sub-process of EEG data analysis. There was so much
more that I could explore in the EEG experiments, but I was constrained by time. In sum, I
appreciate the chance to study EEG signals in the context of disordered speech perception and
gain several insights from it. I hope this tool is used in the future by researchers in this field.
8.3 Research Outcomes
Chapters 4 to 7 described all the experiments performed as part of this thesis. Each experiment
revealed some interesting insights that were useful in shaping the next study and hopefully also
for the readers of this thesis for their future research ideas. Here I will summarise the outcomes
of all the experiments performed in this thesis.
In Chapter 4, we performed two experiments that collected intelligibility and self-reported LE
metrics of OS and HS speakers. These experiments revealed the gaps in intelligibility and
LE between HS and OS. OS was more effortful to process and less intelligible compared to
HS. Even when the intelligibility of OS was high, there was significantly more LE associated
with OS compared to HS. One interesting effect was that of listeners’ familiarity with OS. OS
was not more intelligible to familiar listeners, but it was reportedly less effortful to process for
familiar listeners compared to non-familiar listeners. Machine intelligibility was also poorer for
OS compared to HS.
In Chapter 5 we explored EEG as a way to measure LE. The findings suggested that alpha
power, a neural measure known to measure LE, was higher for OS compared to HS. However,
this measure was not correlated with subjective LE. The reasons for this non-correlation are an
area that can be explored further. An interesting observation was that alpha power was affected
by individual cognitive capacities of the participants - the participants who had poorer working
memory had higher alpha power.
In Chapter 6, we implemented a novel DNN-based voice conversion system which mapped
OS to duration-matched SS. This system outperformed our previous systems in STOI scores
and ASR scores as well as SI scores and LE obtained with a listening test. When compared
with unprocessed OS, there was an improvement in ASR scores, STOI and subjective preference
scores, but not in SI scores and LE. Therefore, although there is an improvement in objective
measures, further improvements are needed to make the transformed OS more preferable and
understandable to human listeners. Some of the quick and lightweight exploratory strategies
provided some benefit for a low intelligibility OS speaker, but not for high intelligibility OS.
As there are infinite possibilities for OS enrichment, several experiments remained either not
properly explored, or not evaluated completely. Therefore, there is definitely a lot of space in
this area to improve results further.
In Chapter 7, three chosen enrichments were evaluated using a wide range of evaluation
metrics. This included subjective metrics such as LE ratings, behavioural measurements such as
SI scores and reaction times, objective measurements such as ASR scores and STOI scores and
alpha power. The evaluations reveal that the proposed novel enrichment method was successful
in improving machine intelligibility metrics compared to previous methods. However, there was
not the same level of success in improving subjective outcomes. Like the experiment in Chapter
5, a higher alpha power was observed for OS compared to HS, but nothing conclusive could be
said about the enrichment methods. Overall, there was a steady increase in alpha power as the
experiment progressed indicating the participants’ fatigue or reduced attention.
8.4 Future Directions
The main limitation of the work described in this thesis is that the enriched outputs did not
outperform unprocessed OS in subjective evaluations. Although there was some success in the
preference tests described in Chapter 6, the HSR and subjective LE scores suggest a win for OS
compared to the enrichments. Therefore, investigating the reasons behind this and developing
enrichments that would appeal to human listeners as well would be a possible and necessary
future direction.
It would be interesting to explore listeners’ EEG signals further in the context of disordered
speech. Some questions that can be asked are: What do the other bands of the EEG signal
(other than alpha band) tell about disordered speech? Can listeners’ EEG signals reveal insights
about other unexplored aspects of disordered speech such as prosody, rhythm and emotion
recognition? A machine learning based analysis of EEG signals instead of the traditional ERP
and frequency band analysis would also be an interesting area to explore.
In the future, it would be interesting to install this enrichment system as a face-to-face
communication aid in a stand-alone device or a smartphone where the device will take the
unprocessed OS input and play out the enriched version in real time or with negligible delay.
Another possible practical application of the proposed system could be in the form of a software
plugin coupled with the microphone of the devices used by an OS speaker. This would convert
any microphone input (unprocessed OS) to an enriched version of the speech in real time or
with minimum delay. Any app which requires a microphone input can use this modified speech
instead. In this way, the OS speaker would be able to use the benefits of the enriched speech
for telephonic conversations, Zoom calls, voice commands to digital assistants and other voice
based apps.
8.5 Contributions
This thesis contributed to the research and development of systems that will aid laryngectomees
speaking in OS and possibly speakers with other pathologies too. Some important contributions
are listed below.
• Database and manual labelling: While automatic phonetic labelling software exists to label
large speech databases, it did not work well for OS. Therefore a set of manual labels
was created for one OS speaker. The manual labels were then used to evaluate the accuracy
of a customised automatic labelling process. These improved automatic labels were useful
in developing an OS-friendly ASR system as well as novel methods for enrichment.
• New enrichment techniques: A voice conversion based approach was used to enrich OS
with a novel idea of using SS with matching durations as target. This eliminated the
need for the alignment of the source OS signal to the target, which can be the cause of
erroneous results.
• Exploration of non-VC techniques for enrichment: Apart from VC, some non-VC
approaches were explored such as removing unwanted artefacts and improving the spectral
characteristics of the signal.
• Wavenet vocoder for Spanish: As part of one of the enrichment processes, a new high
quality vocoder (Wavenet) was developed for Spanish. This vocoder can be used in future
studies to generate speech from acoustic parameters of speech.
• Parallel SS data for each speaker: Also as part of the enrichment process, a set of parallel
SS was created using the phone labels. This synthetic speech dataset which matches OS
in duration can be useful for future OS enrichment studies.
• Behavioural data for OS, HS and enrichments of one speaker: An extensive set of
evaluations was performed for one OS speaker and a control HS speaker. This included
objective and subjective intelligibility measurements, LE measurements and preference
tests.
• EEG data from 12 participants when listening to an OS speaker and an HS speaker.
• EEG data from 32 participants when listening to an OS speaker, an HS speaker and three
OS enrichments.
• ASR, STOI and preference test data for OS and enrichments of 4 speakers.
• Subjective and objective LE (physiological) as an additional evaluation tool.
• Code for listening tests: An interface to conduct listening tests with sentence
transcription and subjective measures was created. Separate code to extract results and
calculate intelligibility (transcription errors) is available too.
8.6 Publications
8.6.1 Peer-reviewed Journal Papers
• Serrano, L., Raman, S., Hernaez, I., Navas, E., Sanchez, J., Saratxaga, I. (2020). A
Spanish Multispeaker Database of Esophageal Speech. Computer Speech and Language,
Volume 66, March 2021, 101168. DOI: 10.1016/j.csl.2020.101168
• Raman, S.; Serrano, L.; Winneke, A.; Navas, E.; Hernaez, I. Intelligibility and Listening
Effort of Spanish OS. Appl. Sci. 2019, 9, 3233. DOI: 10.3390/app9163233
• Raman, S.; Sarasola, X.; Navas, E.; Hernaez, I. Enrichment of Oesophageal Speech:
Voice Conversion with Duration–Matched SS as Target. Appl. Sci. 2021, 11, 5940.
https://doi.org/10.3390/app11135940
8.6.2 Papers in Preparation
• A paper titled ‘Oesophageal Speech and Effortful Listening: an EEG Study’ based on the
findings of Chapter 5.
• A paper titled ‘Evaluation of Enriched Oesophageal Speech: an EEG Study’ based on the
findings of Chapter 7.
8.6.3 Peer-reviewed Conference Papers
• Serrano, L., Raman, S., Tavarez, D., Navas, E., Hernaez, I. (2019) Parallel vs.
Non-Parallel Voice Conversion for Esophageal Speech. Proc. Interspeech 2019, 4549-4553,
DOI: 10.21437/Interspeech.2019-2194.
• Serrano, L., Tavarez, D., Sarasola, X., Raman, S., Saratxaga, I., Navas, E., Hernaez,
I. (2018) LSTM based voice conversion for laryngectomees. Proc. IberSPEECH 2018,
122-126, DOI: 10.21437/IberSPEECH.2018-26
• Raman, S., Hernaez, I., Navas, E., Serrano, L. (2018) Listening to Laryngectomees: A
study of Intelligibility and Self-reported Listening Effort of Spanish Oesophageal Speech.
Proc. IberSPEECH 2018, 107-111, DOI: 10.21437/IberSPEECH.2018-23
8.6.4 Poster Presentations
• Raman, S., Hernaez, I., Navas, E., Serrano, L. A Multifaceted Enrichment of
Oesophageal Speech. ICA 2019 Conference Proceedings, pp. 5739-5741. DOI: 10.18154/RWTH-
ESTOI Extended Short Term Objective Intelligibility.
fMRI functional Magnetic Resonance Imaging.
GAN Generative Adversarial Networks.
GMM Gaussian Mixture Models.
HMM Hidden Markov Models.
HNR Harmonics-to-Noise Ratio.
HS Healthy (Laryngeal) Speech.
HSR Human Speech Recognition.
KS Kolmogorov–Smirnov.
LE Listening Effort.
LSTM Long Short-Term Memory.
MCC Mel Cepstral Coefficients.
MCD Mel Cepstral Distortion.
MFCC Mel-Frequency Cepstral Coefficients.
MOS Mean Opinion Score.
NIRS Near-infrared Spectroscopy.
OOV Out of Vocabulary.
OS Oesophageal Speech.
PESQ Perceptual Evaluation of Speech Quality.
PET Positron Emission Tomography.
PPG Phonetic Posteriorgrams.
PSD Power Spectral Density.
PSM Perceptual Similarity Measure.
PWC Percentage Words Correct.
RT Response Time.
SD Standard Deviation.
SI Speech Intelligibility.
SNR Signal-to-Noise Ratio.
SPL Sound Pressure Level.
SRT Speech Reception Threshold.
SS Synthetic Speech.
STEM Science, Technology, Engineering and Mathematics.
STI Speech Transmission Index.
STOI Short Term Objective Intelligibility.
TOS Tracheoesophageal Speech.
V-RQOL Voice-related Quality of Life.
VC Voice Conversion.
VHI Voice Handicap Index.
WER Word Error Rate.
WM Working Memory.
Appendix A
30 Sentences Used in the
Experiment in Chapter 4
1. Una fiesta en Florida Park con glamur
2. De Filadelfia vino el grupo Judíos por una Paz Justa
3. Ello hacía intuir un duelo en toda regla
4. Deja mucha buena obra hecha pero me rehuye el balance
5. Hoy jueves dieciocho de julio de dos mil trece
6. Ha podido reunirse con Chávez tras su elección
7. Quizá ustedes no lo advirtieron por eso lo refiero ahora
8. Abdulá Abdulá ministro de Exteriores va más allá
9. Da igual no importa de dónde extraiga uno la emoción
10. Sólo el chileno Mark González ponía una pizca de orgullo
11. Pero usted ya conoce por dentro el mundo del cine
12. Hay voces que ya hablan de indulto sería factible
13. Lo que no cree nadie aquí es que cacen a Sadam con vida
14. Unos días de euforia y meses de atonía
15. Blasco Ibáñez hizo alguna vez la misma cosa
16. Tenía la voz alborotada y la amistad ruidosa
17. Apliqué el oído a esta rayita y percibí un murmullo
18. Gastó todo el agua incluyendo el agua de las lluvias
19. Si el club no hubiera cambiado se hubiera ido
20. Goliat estuvo a punto de engullir a David
21. Tal vez fue hace siglos o acaso hace tan sólo unas décadas
22. Aún no sabemos qué fue a hacer a Taiwan
23. El pueblo noruego rechazó vía referéndum la adhesión
24. Con este álbum llegará seguro al número uno de ventas
25. Mi aldea estaba a la orilla de un riachuelo como éste
26. Fui yo por consejo del señor Regueiro Souza
27. Hoy no juega al golf y el traje es azul cielo y oro
28. Occidente y el islam son dos miedos que se acechan
29. Núñez ya tiene a su hijo predilecto en casa
30. Qué diferencia hay entre el caucho y la hevea
Appendix B
EEG Data Terminology and
Procedures
B.1 EEG Recording Equipment
B.1.1 Cap and Electrodes
An EEG cap is a stretchable head covering with holes (for electrodes) that fits the subject's
head snugly. As the head sizes of subjects vary, there are different sizes of caps available,
typically: 54 cm, 56 cm, 58 cm and 60 cm. What size cap fits a subject is decided by measuring
the head circumference with a measuring tape passing through the forehead. It is important
to provide some tolerance as the cap should not be too tight. For example, a person with
head circumference measuring 57 cm would be fitted with a 58 cm cap. In addition, it must be
ensured that the cap is placed in the right position. For this, the centre-most point of the cap
is matched to the centre of the head, which is midway between the centre forehead (nasion)
and the natural bump at the centre back (inion), as well as midway between the 2 ears. The
caps may have 32, 64 or 128 holes depending on the type and requirement.
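The cap selection rule can be sketched as picking the smallest available cap that still leaves some slack. The snippet below is only an illustration: the function name and the default tolerance of 0.5 cm are assumptions, since the text asks for "some tolerance" without fixing a value.

```python
# Cap sizes listed in the text, in centimetres.
CAP_SIZES_CM = (54, 56, 58, 60)


def choose_cap_size(head_circumference_cm: float, tolerance_cm: float = 0.5) -> int:
    """Return the smallest cap that is at least `tolerance_cm` larger than the head.

    The default tolerance is a hypothetical value chosen for this example.
    """
    for size in CAP_SIZES_CM:
        if size >= head_circumference_cm + tolerance_cm:
            return size
    raise ValueError("no available cap is large enough")
```

With these assumptions, a head circumference of 57 cm maps to the 58 cm cap, as in the example given in the text.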
The electrodes are arranged on an EEG cap as per the International 10-20 system of
measuring EEG. It is known as the 10-20 system because the distance between any two adjacent
electrodes is either 10% or 20% of the front-back or right-left distance.
The position and the naming convention of the electrodes follow a [brain area][direction
code] system. The brain areas are Parietal (P), Frontal (F), Central (C), Temporal (T) and
Occipital (O). Intermediate or overlapping areas are named as Centro-parietal (CP),
Fronto-central (FC), Temporo-parietal (TP), Fronto-temporal (FT), Parieto-occipital (PO) and so on.
The direction codes are z for the front-back midline area, odd numbers (1,3,5) for the left
of the midline and even numbers (2,4,6) for the right of the midline. Numbers increase
with distance from the midline. Based on these conventions, the electrodes are named as F1
(frontal left), P6 (parietal right), CPz (centro-parietal midline) etc.
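The naming convention can be made concrete with a small parser that splits an electrode label into its area and its side. This is a sketch covering only the areas named above; the function name and the returned labels are choices made for the example, and labels outside this list (for instance other 10-20 sites) are rejected rather than guessed.

```python
import re

# Area codes mentioned in the text; other 10-20 labels are not handled here.
AREAS = {
    "F": "frontal", "C": "central", "P": "parietal",
    "T": "temporal", "O": "occipital",
    "FC": "fronto-central", "CP": "centro-parietal",
    "TP": "temporo-parietal", "FT": "fronto-temporal",
    "PO": "parieto-occipital",
}


def parse_electrode(name: str) -> tuple[str, str]:
    """Split a label such as 'CPz' or 'P6' into (area, side)."""
    match = re.fullmatch(r"([A-Z]+)(z|\d+)", name)
    if match is None or match.group(1) not in AREAS:
        raise ValueError(f"unrecognised electrode label: {name!r}")
    area, code = match.groups()
    if code == "z":
        side = "midline"
    elif int(code) % 2 == 1:
        side = "left"    # odd numbers lie left of the midline
    else:
        side = "right"   # even numbers lie right of the midline
    return AREAS[area], side
```

For example, `parse_electrode("F1")` gives `("frontal", "left")` and `parse_electrode("CPz")` gives `("centro-parietal", "midline")`.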
Apart from the above-mentioned electrodes, there are reference and ground electrodes. The
EEG signal is a potential difference between the scalp potential and a reference potential. This
reference potential is obtained by placing an electrode away from the scalp, on a relatively
neutral area such as behind the ears (mastoids) or on the nose. As these areas are not impacted
by brain activity, a difference of the scalp electrode potential to this reference potential results
in the potential difference associated with the brain activity in the said scalp electrode. A
ground electrode is usually placed on the cap and its function is mainly to remove power line
noise.
In addition, some electrodes may be placed on the temples and forehead to record ocular
activity. These electrodes can then later be used to remove ocular activity artefacts.
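The role of the reference electrode described above amounts to a sample-by-sample subtraction. The sketch below shows this for one scalp channel and one reference channel held as plain Python lists; the function name and the use of a single reference electrode are assumptions for illustration, and in practice the amplifier or recording software usually performs this step.

```python
def rereference(scalp: list[float], reference: list[float]) -> list[float]:
    """Subtract the reference potential from the scalp potential, sample by sample."""
    if len(scalp) != len(reference):
        raise ValueError("scalp and reference channels must have the same length")
    return [s - r for s, r in zip(scalp, reference)]
```

The values returned are the potential differences attributed to brain activity at that scalp electrode.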
B.1.2 Gel
An electrolyte gel is applied (with needle-less syringes) where the electrodes come in contact
with the scalp. The function of this gel is to increase the electrical connectivity of the scalp
to the electrode. The gel from one electrode site must not seep through to another electrode
site as it would create a short circuit, which is undesirable. Sometimes the gel may contain a
slightly abrasive or exfoliating component that helps in scraping off the dead skin cells in the
top layer of the scalp. This also improves the connectivity.
B.1.3 Amplifier
The EEG amplifier amplifies the EEG signal and also converts it from analogue to digital,
so that the recording software on the computer can receive and store the amplified EEG
signals digitally. The sampling rate of the signals is typically 250 to 2000 Hz.
B.1.4 Recording software
While placing electrodes and recording EEG signals, it is helpful to see the status in real time.
This helps in troubleshooting and ensuring good quality of data. In a recording software (such
as BrainVision), we can visualise the electrodes and their impedance while mounting the cap
and electrodes (see Figure B.1). The aim in this stage is for all the electrodes to have a low
impedance (preferably <5kΩ). Additionally, in the recording software interface, it is possible
to see the EEG signals in real time and that can help the experimenter identify noisy channels,
restless or tired participants etc.
Figure B.1: EEG recording software showing impedance values of electrodes.
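The impedance check can be summarised as flagging every electrode that does not meet the target. The following sketch assumes the impedance readings have been copied by hand into a dictionary; it does not use any BrainVision interface, and the function name is hypothetical.

```python
def high_impedance_channels(impedances_kohm: dict[str, float],
                            limit_kohm: float = 5.0) -> list[str]:
    """Return the electrodes whose impedance is not below the limit (in kΩ)."""
    return sorted(name for name, value in impedances_kohm.items()
                  if value >= limit_kohm)
```

Electrodes returned by this check would need more gel or a better contact before the recording starts.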
B.2 Synchronisation
While conducting an EEG experiment, the experiment software sends out a trigger to the
EEG recording software when the audio starts playing. However, it is possible that there are
a few milliseconds of delay between the experiment software sending the trigger and the EEG
recording software receiving those triggers. It is necessary to ensure that the EEG markers
(or triggers) and the audio stimuli start points are synchronised. This synchronisation may be
achieved using hardware or manually. In the hardware method, a clock device synchronises the
audio input and the EEG markers. Alternatively, it may be done manually by identifying the
delay between the playing of the audio signal and the start of EEG recording and then adding
the appropriate amount of delay later to synchronise the EEG and audio channels.
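The manual method can be sketched as shifting every marker by the measured delay, converted from milliseconds to samples. The function name is hypothetical, and the sign convention (a positive delay means the audio onset falls after the recorded marker) is an assumption that has to be checked against the actual setup.

```python
def correct_markers(marker_samples: list[int], delay_ms: float,
                    sampling_rate_hz: float) -> list[int]:
    """Shift marker positions (in samples) by a fixed delay given in milliseconds."""
    # Convert the delay to a whole number of samples at the EEG sampling rate.
    shift = round(delay_ms * sampling_rate_hz / 1000)
    return [marker + shift for marker in marker_samples]
```

At a sampling rate of 500 Hz, for instance, a delay of 12 ms corresponds to a shift of 6 samples.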
B.3 EEG Recording Process
Once the ground, reference and all the data electrodes are placed correctly with the right
impedance, ocular and muscular activity are checked to ensure that the EEG data are being
recorded correctly. This can be observed as spikes in the data when the participant blinks, or
as dense activity when the participant clenches their jaw or bites.
Figure B.2: Raw EEG data
In case there are noisy electrodes, they must be fixed by either adding more gel or correcting
other connectivity issues. The signals must be constantly monitored to avoid noisy data. To
avoid noisy data from the participant, they are instructed to be as still as possible and to look
straight at the screen.
When the participant is ready and all the electrodes look fine, we can start recording the
EEG data. Then the experiment software, which displays the controls for the behavioural data
and plays the audio files, is run. Once the experiment is over, the EEG data recording is stopped
and the EEG file is appropriately saved. There are several different formats of EEG data such
as .eeg, .xdf etc.
B.4 Raw EEG Data
Raw EEG data is difficult to process as it is and needs to undergo several steps of preprocessing
to extract meaningful data from it. To view and process these raw EEG data files, EEG
processing software is used. One such software is the EEGLab software.
When the EEG file is loaded in EEGLab, we can first observe the raw EEG data. This is a
time series recorded by each one of the electrodes. Figure B.2 shows the raw EEG data from a
participant. The vertical axis is the electrode name and the horizontal axis is time. The vertical
colourful lines are the triggers or the event markings where the participant started hearing a
stimulus or performed some action. It can be observed that the data is quite noisy.
Each data series is assigned a channel location, which is the two-dimensional or
three-dimensional location of that electrode on the head. The channel locations are obtained from
the manual of the EEG cap set. Because of this step, it is possible to view the electrode data
in a 2-D or 3-D topographical plot. For example, in Figure B.3, we can see the topographical
representation of channel 24, which is the O2 (occipital right) electrode. The red dot on the
topographical plot shows the position of the O2 electrode on the head.
Figure B.3: Raw EEG data
Some possible artefacts and noise we can notice in the raw EEG data are noisy channels
(such as channel C3 in Figure B.2). The occasional dips in the Fp1 and Fp2 channels correspond
to the participant blinking. This will be clearer in the data after the filtering process.
Rereferencing is the process of getting the potential difference between the electrodes and
the reference electrode. Typically, the reference electrodes are the mastoid electrodes (the
electrodes behind the ears). The reference may be one mastoid electrode or the average potential
of the two mastoid electrodes. Alternatively, it is also possible to have an average reference. In
this method, the average of the potentials from all the electrodes is calculated and then this
average potential is subtracted from each of the electrodes.
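The two rereferencing options above can be sketched as follows. This is a plain-Python
illustration on toy numbers, not the EEGLab implementation: the function name, the
channels-by-samples list layout and the `reference_rows` parameter are assumptions made for
the example.

```python
def rereference(data, reference_rows=None):
    """Subtract a reference signal from every channel.

    data: list of channels, each a list of samples of equal length.
    reference_rows: indices of the channels averaged to form the reference
    (e.g. the two mastoids). If None, all channels are averaged, which
    gives the average reference.
    """
    rows = list(range(len(data))) if reference_rows is None else list(reference_rows)
    n_samples = len(data[0])
    # Reference potential at each time point: mean over the chosen channels.
    reference = [
        sum(data[r][i] for r in rows) / len(rows) for i in range(n_samples)
    ]
    return [[v - ref for v, ref in zip(channel, reference)] for channel in data]


# Toy data: three channels, two samples each.
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(rereference(toy))        # average reference
print(rereference(toy, [0]))   # single reference channel (row 0)
```

With a single row in `reference_rows` this corresponds to referencing to one mastoid, and
with two rows to the averaged mastoids, assuming those rows hold the mastoid channels.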
B.5 Cleaning Up Raw EEG Data
B.5.1 Filtering
Raw EEG data has a lot of high-frequency noise and power line noise, which is removed by
band-pass filtering. Typically the filtering is performed by band passing between 1 and 45 Hz, as
the EEG signals of interest lie between these frequencies. Sometimes an additional 50 Hz or 60 Hz
notch filter is added to remove power line noise. The filtered EEG signal is shown in Figure
B.4. Here the ocular activity is visible in the form of abrupt dips around the 12th, 13th and
14th seconds.
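To make the filtering step concrete, the sketch below chains a 1 Hz high-pass, a 45 Hz
low-pass and a 50 Hz notch, each built as a second-order (biquad) section from the widely
used Audio EQ Cookbook formulas. It is an assumption-laden illustration, not the filter used
in EEGLab or in this work: the function names, the filter order and the notch Q of 30 are
choices made for the example, and the filter is causal, so it introduces a phase delay that
dedicated EEG toolboxes often avoid.

```python
import math


def biquad_coefficients(kind, f0_hz, fs_hz, q):
    """Return normalised (b, a) coefficients of one second-order section."""
    w0 = 2.0 * math.pi * f0_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    if kind == "highpass":
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    elif kind == "lowpass":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    elif kind == "notch":
        b = [1.0, -2 * cos_w0, 1.0]
    else:
        raise ValueError("unknown filter kind: " + kind)
    a = [1 + alpha, -2 * cos_w0, 1 - alpha]
    return [x / a[0] for x in b], [x / a[0] for x in a]


def apply_biquad(signal, b, a):
    """Filter a list of samples with a direct-form I difference equation."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in signal:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x0
        y2, y1 = y1, y0
        out.append(y0)
    return out


def clean_channel(signal, fs_hz, low_hz=1.0, high_hz=45.0, line_hz=50.0):
    """Band-pass one channel between low_hz and high_hz, then notch line_hz."""
    butterworth_q = 1.0 / math.sqrt(2.0)
    stages = [
        ("highpass", low_hz, butterworth_q),
        ("lowpass", high_hz, butterworth_q),
        ("notch", line_hz, 30.0),  # assumed Q; narrower notch for larger Q
    ]
    out = list(signal)
    for kind, f0_hz, q in stages:
        b, a = biquad_coefficients(kind, f0_hz, fs_hz, q)
        out = apply_biquad(out, b, a)
    return out


# Synthetic channel at 500 Hz: 10 Hz activity + 50 Hz line noise + DC offset.
fs = 500
t = [i / fs for i in range(4 * fs)]
raw = [math.sin(2 * math.pi * 10 * x) + math.sin(2 * math.pi * 50 * x) + 5.0 for x in t]
filtered = clean_channel(raw, fs)
```

After the initial transient, `filtered` should contain mainly the 10 Hz component, with the
offset and the 50 Hz component strongly attenuated. For 60 Hz mains, `line_hz=60.0` would be
passed instead.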