Design and Development of Natural
Eye Typing Interface
vorgelegt von
M.Sc.
Zhe Zeng
an der Fakultät V – Verkehrs- und Maschinensysteme
der Technischen Universität Berlin
zur Erlangung des akademischen Grades
Doktorin der Naturwissenschaften
- Dr. rer. nat. -
genehmigte Dissertation
Promotionsausschuss:
Vorsitzender: Prof. Dr. Henning Jürgen Meyer
Gutachter: Prof. Dr. Matthias Rötting
Gutachter: Prof. Dr. John Paulin Hansen
Tag der wissenschaftlichen Aussprache: 04. Februar 2022
Berlin 2022
Abstract
With the development of eye tracking technology, gaze interaction has shown great po-
tential to assist day-to-day human computer interaction. Gaze input makes it possible
to combine both visual search and action activation into one step, as users can perceive
relevant information and activate desired commands, i.e., by focusing their gaze on a
corresponding item on the interface. However, the usage scenario is limited by eye track-
ing accuracy and user acceptance. The aim of this dissertation therefore is to design a
dynamic eye typing interface that tolerates eye tracking with low accuracy. It should be
easy to understand without extensive training, and can be applied to the use of public
screens. To this end, three studies were conducted, they are an offline gaze data analysis,
the evaluation for the new typing interface and its iterative design.
The first study focused on the effects of the number of objects and object moving
speed in the calibration-free gaze interface based on pursuit movements where objects
move linearly. Offline gaze data of 25 participants were collected and analyzed. The
results indicate that the number of objects significantly influenced the correct and false
detection rates. Participants’ performance was better on the interfaces containing 6 and
8 objects compared to 10, 12 and 15 objects. The detection rate was significantly higher
for interfaces with faster moving speed than for slower ones.
The second study concentrated on developing a new dynamic eye typing interface. The
eye typing interface features two-stage selection and one-point calibration. To activate a
command action, the user needs to first look at the corresponding character cluster and
then follow the movement of the desired character in this cluster with their eye. With the
findings of the first study, this typing interface was designed with eight character clusters,
and there are four characters in each character cluster. A user study with 29 participants
was conducted to compare four types of feedback for this eye typing interface. The results
confirmed that the user can learn how to interact with the system in a short time. The
interface with both visual and auditory feedback achieved the highest typing speed.
In the third study, three eye typing interfaces based on the second study were further
presented and compared. They are interfaces without language-model support (i.e., the
baseline), with letter prediction and with both letter and word prediction, respectively.
The results of the user study showed that the interface with letter and word prediction
achieved the highest typing speed (5.48 words per minute) and improved by 70 % compared
with the baseline. The improvement gradually increased as the user became familiar with
the interface.
In this dissertation, several dynamic eye-typing interfaces were developed based on
a one-point calibration and the learning costs are relatively low. The results of this
dissertation can be used to improve gaze-based human-computer interfaces and support
the design of robustly functioning eye-typing interfaces in public spaces.
Zusammenfassung
Entwicklungen im Feld der Eye-Tracking-Technologie eröffnen neue Möglichkeiten zur
Nutzung der Blickinteraktion für die Unterstützung der Mensch-Computer-Interaktion im
Alltag. Die Blickinteraktion ermöglicht es, visuelle Suche und Auslösen von Aktionen in
einem Schritt zu kombinieren, da die Nutzenden relevante Informationen wahrnehmen und
gleichzeitig gewünschte Aktionen auslösen können, indem sie beispielsweise ihren Blick
auf ein Element im Interface richten. Die Anwendungsmöglichkeiten sind jedoch durch
die Genauigkeit der Blickverfolgung und die Akzeptanz der Nutzer begrenzt. Ziel dieser
Dissertation ist es daher, ein natürliches Eye-Typing-Interface zu entwickeln, das auch
Eye-Tracking mit niedriger Genauigkeit toleriert. Das Interface sollte ohne umfangreiches
Training leicht zu verstehen und für die Nutzung von öffentlichen Bildschirmen geeignet
sein. Zu diesem Zweck wurden drei Studien durchgeführt: eine Offline-Blickdatenanalyse,
eine Evaluationsstudie für ein neues Eye-Typing Interface und die Analyse einer Weiter-
entwicklung des neuen Interface.
Die erste Studie untersuchte die Auswirkungen der Anzahl von Objekten und der
Objektbewegungsgeschwindigkeit in einem augenfolgebewegungsbasierten , kalibrierungs-
freien Eye-Typing Interface, in dem sich die angezeigten Objekte linear bewegen. Die
offline-Blickdaten von 25 wurden gesammelt und analysiert. Die Ergebnisse zeigen, dass
die Anzahl der Objekte im Interface einen signifikanten Einfluss auf die Zahl der korrekten
und falschen Detektionen der gezeigten Objekte hat. Die Leistung von Teilnehmenden war
beim Interface mit 6 und 8 Objekten besser als beim Interface mit 10, 12 und 15 Objekten.
Die Rate der korrekten Detektionen war bei den schnelleren Bewegungsgeschwindigkeiten
signifikant höher als bei den langsameren.
Die zweite Studie beschäftigte sich mit der Entwicklung eines dynamischen Eye-Typing
Interfaces. Das entwickelte Interface verfügt über eine zweistufige Auswahl und eine Ein-
Punkt-Kalibrierung. Um eine Aktion auszulösen, muss der/die Nutzende zunächst auf
die entsprechende Zeichengruppe blicken und dann die Bewegung des gewünschten Ze-
ichens in dieser Gruppe mit den Augen verfolgen. Es wurde eine Benutzerstudie mit 29
Teilnehmenden durchgeführt, und vier verschiedene Arten von Feedback für diese Eye-
Typing-Interface verglichen. Die Ergebnisse bestätigten, dass der Benutzer den Umgang
mit dem Eye-Typing-Interface in kurzer Zeit erlernen kann. Das Interface mit dem kom-
binierten visuellen und auditiven Feedback erreichte die höchste Tippgeschwindigkeit.
Auf der Grundlage der zweiten Studie wurden in der dritten Studie drei Eye-Typing-
Interfaces entwickelt und verglichen. Ein Interface ohne Unterstützung von Sprachmod-
ellen (als Baseline), ein Interface mit Buchstabenvorhersage und ein drittes Interface
mit kombinierter Buchstaben- und Wortvorhersage. Die Ergebnisse der Studie zeigen,
dass das Interface mit Buchstaben- und Wortvorhersage die höchste Tippgeschwindigkeit
erreichte (5.48 Wörter pro Minute), die sich im Vergleich zum Baseline Interface um
70 % verbesserte. Je vertrauter die Nutzenden mit dem Interface wurden, desto stärker
verbesserte sich die Tippgeschwindigkeit mit dem System.
In dieser Dissertation wurden mehrere dynamische Eye-Typing-Interfaces basierend
auf einer Einpunktkalibrierung entwickelt, deren Nutzung nur geringen Lernaufwand
erfordert. Die Ergebnisse dieser Dissertation können genutzt werden um blickbasierte
Mensch-Computer-Schnittstellen zu verbessern und unterstützen das Design von robust
funktionierenden Eye-Typing Interfaces im öffentlichen Raum.
Acknowledgements
Foremost, I would like to express my deepest gratitude to my supervisor, Prof. Dr.-Ing.
Matthias Rötting, for his guidance and support throughout the journey of my doctoral
research. His vision, and sincerity have deeply inspired me. It was very fortunate for
me to work and study under his supervision. I would also like to thank Prof. Dr. John
Paulin Hansen and Prof. Dr.-Ing. Henning Jürgen Meyer for responding immediately and
positively to my request for serving as a member of my doctoral committee.
Beyond my supervisor and committee, I got extensive support from Dr. Felix Wilhelm
Siebert. I want to express thanks to him for the thorough comments and valuable sugges-
tions. I would like to thank my colleagues from the Chair of Human-Machine Systems,
especially Mario Lasch and Stefan Damke, thank you for their support on the hardware.
Thanks also to Elisabeth Lange, who always encouraged me to learn German. In addition,
I would like to thank Dr. Xixie Zhang, Ying Jiang, Yuan Liu, and Dr. Xiaohan Li, we
had endless discussions about research as well as daily life and I will always remember
the time we spent together in Berlin.
I am also thankful for my friends who never make me feel lonely when I am away from
home. Hans-Georg Mosler, Bing Liu, Dr. Junhua Kang, Dr. Lin Chen, Jing Yang, Dr.
Xin Wang, Dr. Chun Yang - thank you!
Finally, the most important thanks go to my family. I would thank my parents for
their constant support, understanding, and love. A special thank goes to my husband,
Dr. Yu Feng, you are the joy of my life!
Contents
1 Introduction 1
1.1 Aimsofthethesis................................ 3
1.2 Structureofthethesis ............................. 4
2 Background 7
2.1 Physiology of eye and eye movements . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Fixations ................................ 7
2.1.2 Saccades................................. 8
2.1.3 Smoothpursuits ............................ 8
2.1.4 Comparison saccadic and smooth pursuit movements . . . . . . . . 9
2.2 Eye tracking technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Evaluationmetrics ............................... 12
2.3.1 Textentryrate ............................. 12
2.3.2 Errorrates................................ 12
2.3.3 Keystrokesavings............................ 14
2.3.4 Subjective workload . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Mathematicalmethods............................. 15
2.4.1 Density-based spatial clustering of applications with noise (DBSCAN) 15
2.4.2 Orthogonal distance regression (ODR) . . . . . . . . . . . . . . . . 16
2.5 Summary .................................... 17
3 Related work 19
3.1 Gazeinteraction................................. 19
3.1.1 Fixation-based gaze interaction . . . . . . . . . . . . . . . . . . . . 20
3.1.2 Gesture-based gaze interaction . . . . . . . . . . . . . . . . . . . . . 21
3.1.3 Pursuit-based gaze interaction . . . . . . . . . . . . . . . . . . . . . 21
3.1.4 Feedback in gaze interaction . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Eyetypinginterface............................... 26
3.2.1 QWERTY typing layout . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Specific typing interface . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.3 Language model in eye typing . . . . . . . . . . . . . . . . . . . . . 31
3.3 Limitations of previous work . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Summary .................................... 35
4 Study 1: Gaze interfaces based on linear smooth pursuit 37
4.1 Introduction................................... 37
4.2 Empiricalevaluation .............................. 38
4.2.1 Experimental stimuli . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2 Experimental design . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.3 Participants............................... 40
4.2.4 Apparatus................................ 40
4.2.5 Procedure................................ 41
4.3 Classification algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.1 Detection based on orthogonal distance regression . . . . . . . . . . 42
4.3.2 Detection based on angle between two vectors . . . . . . . . . . . . 43
4.3.3 Detection based on pearson correlation . . . . . . . . . . . . . . . . 44
4.4 Results...................................... 45
4.4.1 Velocity ................................. 45
4.4.2 Orientationerror ............................ 47
4.4.3 Detectionrates ............................. 50
4.4.4 Subjective evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.5 Comparison of three detection methods . . . . . . . . . . . . . . . . 54
4.5 Discussion.................................... 57
4.6 Summary .................................... 60
5 Study 2: Design eye typing interface based on hybrid eye movements 61
5.1 Introduction................................... 61
5.2 Interfacedesign................................. 62
5.2.1 Layout.................................. 64
5.2.2 Interactiondesign............................ 66
5.2.3 Classification algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2.4 One-point calibration . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Empiricalevaluation .............................. 73
5.3.1 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3.2 Participants............................... 73
5.3.3 Apparatus................................ 74
5.3.4 Procedure................................ 74
5.4 Results...................................... 75
5.4.1 Textentryrate ............................. 75
5.4.2 Errorrates................................ 76
5.4.3 Subjective evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5 Discussion.................................... 80
5.6 Summary .................................... 83
6 Study 3: Evaluating eye typing interfaces with language model 85
6.1 Introduction................................... 85
6.2 Interfacedesign................................. 86
6.3 Empiricalevaluation .............................. 89
6.3.1 Experiment design . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.3.2 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.3.3 Participants............................... 89
6.3.4 Apparatus................................ 90
6.3.5 Procedure................................ 90
6.4 Results...................................... 91
6.4.1 Textentryrate ............................. 91
6.4.2 Errorrates................................ 93
6.4.3 Keystroke savings (KS) . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.4.4 Subjective evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.5 Discussion....................................100
6.6 Summary ....................................102
7 Conclusions and outlook 105
7.1 Conclusions ...................................105
7.2 Outlook .....................................107
References 109
Appendix A 125
A.1 Questionnaire for study 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
A.2 Visualization of moving trajectories from one participant . . . . . . . . . . . 128
Appendix B 134
B.1 Open questions for study 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
B.2Phrasesforstudy2................................138
Appendix C 139
C.1NASATLXscale.................................139
C.2Phrasesforstudy3................................141
List of Figures
2.1 Purkinje reflections. Adapted from (Duchowski, 2017, p. 55) . . . . . . . . 11
2.2 Example of DBSCAN (example image under CC BY-SA 3.0) . . . . . . . . 16
2.3 Illustration of ODR (example image under CC BY-SA 3.0) . . . . . . . . . 16
3.1 Interface of GazeTalk (J. P. Hansen, Johansen, Hansen, Itoh, & Mashino,
2003a)....................................... 28
3.2 Illustration of pEYEs, adapted from Huckauf and Urbina (2007). . . . . . . 28
3.3 Interface of Quikwriting (Bee & André, 2008). . . . . . . . . . . . . . . . . 29
3.4 The gaze gestures of letter “a" and “b" in EyeWrite, adapted from Wob-
brock, Rubinstein, Sawyer, and Duchowski (2008). . . . . . . . . . . . . . . 29
3.5 Interface of Dasher (Ward & MacKay, 2002). . . . . . . . . . . . . . . . . . 30
3.6 Interface of StarGazer (D. W. Hansen, Skovsgaard, Hansen, & Møllenbach,
2008). ...................................... 31
3.7 Interface of SMOOVS (Lutz, Venjakob, & Ruff, 2015). . . . . . . . . . . . 31
4.1 Interface layouts before outward movement of digits. The interfaces contain
6, 8, 10, 12, and 15 moving objects, respectively. . . . . . . . . . . . . . . . 39
4.2 Experimentsettings............................... 41
4.3 Visualization of the gaze and object trajectories . . . . . . . . . . . . . . . 44
4.4 The velocity of eye and moving objects, the black line shows the moving
velocity of the object, the colored lines show the moving velocity of eye
regarding different number of objects in the interface. . . . . . . . . . . . . 46
4.5 The distribution of raw orientation error throughout the experiment . . . . 47
4.6 Orientation error in the experimental groups for filtered data. The green
triangles represent averages. . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.7 Average orientation error for all interfaces. Green dashed line refers to
orientation error of slower moving speed, and red line refers to faster moving
speed. ...................................... 49
4.8 The ODR detection rates regarding different buffer angle range . . . . . . . 51
4.9 Correct detection rate. The green triangles represent averages. . . . . . . . 52
4.10 False detection rate. The green triangles represent averages. . . . . . . . . 53
4.11 The angle-vectors detection rates regarding different buffer angle ranges . . 56
4.12 The detection rates based on different correlation thresholds . . . . . . . . 57
4.13 The comparison of detection rate among detection methods . . . . . . . . . 58
5.1 The hybrid interaction design. The red circle represents the current gaze
position, the gray dashed line represents the moving trajectory of the object. 62
5.2 Differences in active interface areas in three pursuit-based gaze systems. . 63
5.3 The illustration of eye typing interface, the background color is black and
the character color is white in real typing system. The gray circular area
is idle area, i.e., deactivated area. . . . . . . . . . . . . . . . . . . . . . . 64
5.4 Theinteractionflow .............................. 67
5.5 The provision of feedbacks is illustrated, from left to right, they are no
feedback, only visual feedback, only auditory feedback and both visual-
auditoryfeedback. ............................... 68
5.6 Visualization of the angular criterion in first stage. The red dot is gaze
point (G), and blue dot is the midpoint of screen (M). The gray dotted line
presents the identifiable range of each cluster. . . . . . . . . . . . . . . . . 69
5.7 The interface of one-point calibration . . . . . . . . . . . . . . . . . . . . . 70
5.8 Remove outliers using DBSCAN. The black cross is position of the middle
calibration Point. The black points are the gaze points during one-point
calibration, the red points are the reserved gaze points after discarding
outliers. ..................................... 71
5.9 Visualization the gaze data from two participants before and after one-
point calibration in pilot study. Red points are target points, the black
points visualize the raw gaze data, and blue points visualize adjusted gaze
data after one-point calibration. . . . . . . . . . . . . . . . . . . . . . . . . 72
5.10 Words per minute for each type of feedback, black dots visualize the average
values, diamonds are outlier observations. . . . . . . . . . . . . . . . . . . . 75
5.11 Keystrokes per character for each feedback, black dots visualize the average
values, diamonds are outlier observations. . . . . . . . . . . . . . . . . . . . 77
5.12 Minimun string distance error rate for each feedback, black dots visualize
the average values, diamonds are outlier observations. . . . . . . . . . . . . 77
6.1 Screenshot of interfaces of No prediction. (a) when the eyes look at the
ABCD cluster, and this cluster is highlighted as feedback. (b) the charac-
ters in selected cluster moved outward. . . . . . . . . . . . . . . . . . . . . 87
6.2 The default value of moving distance is 94 pixels (right figure), The figure
on the left shows the moving distance when letter prediction activated, i.e.,
68pixels. .................................... 87
6.3 The interface for Letter + word prediction. User could follow the corre-
sponding arrow to enter predicted word. In this example, the up arrow
stands for “and", the left arrow represents “as", and the right arrow is for
“are"........................................ 88
6.4 Mean of words per minute, corrected error rate, and uncorrected error rate
over3sessions. ................................. 92
6.5 Mean of NASA TLX results over sessions, blue circle represents No predic-
tion, orange triangle stands for Letter prediction, and green square indi-
cates Letter + word prediction. . . . . . . . . . . . . . . . . . . . . . . . . 96
7.1 Pointers on the display for different device. . . . . . . . . . . . . . . . . . . 108
List of Tables
2.1 Comparison of saccadic and smooth-pursuit eye movements . . . . . . . . . 10
3.1 Summary the selected studies of pursuit-based gaze interaction. . . . . . . 24
4.1 The angular range for one object . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Detectable range for different interfaces . . . . . . . . . . . . . . . . . . . . 50
4.3 Count for trials which were detected as adjacent digits and all false detected
trials for all experimental conditions. . . . . . . . . . . . . . . . . . . . . . 54
4.4 Study 1 - Statistical significance of main and interaction effects regarding
objectivemeasures................................ 54
4.5 Subjective feedback from participants . . . . . . . . . . . . . . . . . . . . . 55
4.6 The correct detection rate of interface with 8 objects in directions . . . . . 56
5.1 The main content in open questions . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Study 2 - Mean absolute values (M) and standard deviations (SD) for each
experimentalcondition.............................. 76
5.3 Ranking results, 1 represents the most preferred, 4 represents the lowest
ranking. ..................................... 78
5.4 Reasons behind feedback preference . . . . . . . . . . . . . . . . . . . . . . 79
5.5 Perception of the text entry system . . . . . . . . . . . . . . . . . . . . . . 80
6.1 Main topics for open questions . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2 Study 3 - Mean absolute values (M) and standard deviations (SD) for each
experimentalcondition.............................. 93
6.3 Mean and standard deviation for keystroke savings for interface with Letter
+wordprediction. ............................... 95
6.4 Study 3 - Statistical significance of main and interaction effects regarding
performance and subjective measures. . . . . . . . . . . . . . . . . . . . . . 99
List of Abbreviations
HCI: Human-computer interaction
WPM: Words per minute
KSPC: Keystrokes per character
MSD: Minimum string distance
CER: Corrected error rate
UER: Uncorrected error rate
KS: Keystroke savings
NASA TLX: NASA Task Load Index
DBSCAN: Density-based spatial clustering of applications with noise
ODR: Orthogonal distance regression
ART: Aligned rank transform
CNNs: Convolutional neural networks
HMDs: Head-mounted displays
AR: Augmented reality
VR: Virtual reality
POG: Photo-Oculo Graphy
VOG: Video-Oculo Graphy
POR: Point of regard
Chapter 1
Introduction
The eye is an organ for obtaining environmental information. As an important source
of information, the exploration of what the eye focuses on never stops. In the past 100
years, eye-tracking technology has evolved significantly. It enables the detection of eye
positions and different types of eye movements and is widely used in medical, marketing,
and psychological research. Additionally, when a user interacts with an interface, the eye
always looks at the intractable area before activating it. Thus, apart from the registration
of eye movements, person’s gaze has been used to allow users to control interfaces (Sibert
& Jacob, 2000).
Gaze interaction has the potential to become a ubiquitous part of assisting our daily
interaction. Gaze input makes it possible to combine both visual search and action ac-
tivation into one step, as users can perceive relevant information and activate desired
commands, e.g., by focusing their gaze on a corresponding item on the interface. The
interest in gaze-based interaction design is growing rapidly. The applications, such as
eye typing (Majaranta & Räihä, 2002), gaze-controlled web browsing (Menges, Kumar,
Müller, & Sengupta, 2017), wheelchair control (Eid, Giakoumidis, & El Saddik, 2016),
and gaze selection in virtual reality (Blattgerste, Renner, & Pfeiffer, 2018), have been
developed and studied.
As an input modality, gaze has a number of attractive features. Firstly, gaze-based
input enables a system to be more accessible. It can help people with physical limitations
to communicate with each other. Not only people with permanent disabilities, such as
amyotrophic lateral sclerosis (ALS) patients can benefit from this (Hwang, Weng, Wang,
Tsai, & Chang, 2014), but also people with a temporary restriction for physical activity,
for example, users with arm injuries, or those who are carrying belongings. Additionally,
there is an increasing concern about hand hygiene during the Coronavirus Disease 2019
1
(COVID-19). People are encouraged to use contactless interactions to minimize, and if
possible to avoid contacts between users’ hands and displays. Even though there are many
contactless interfaces using gesture or voice, the gaze is a potential complementary input
modality to provide a more hygienic interaction when using public displays/interfaces. In
addition, without bystanders hearing voice commands, gaze interaction has a high degree
of privacy (Katsini, Abdrabou, Raptis, Khamis, & Alt, 2020). As an example, PIN code
input using gaze is much more inconspicuous than voice or gesture control in a public
setting.
Although there is an increasing interest in gaze-based interaction, proposed imple-
mentations have suffered from a number of limitations. One main challenge in the design
of gaze-based interfaces is the lack of accuracy of real-time gaze coordinates, stemming
from both biological characteristics of the eye as well as limitations of the eye tracking
equipment. Due to miniature eye movements, such as tremor, drift, and microsaccades,
there is hardly a moment that the eye is absolutely still. Thus, the use of the gaze as a
cursor cannot be as accurate as e.g., using a computer mouse. Second, most gaze inter-
faces require time-consuming calibration before use. Even if individually calibrated, the
accuracy of the eye tracker will gradually decline during use.
To overcome these challenges, pursuit-based interfaces have been proposed by Vidal,
Bulling, and Gellersen (2013). Using these interfaces, users can activate an action by
following the corresponding moving target with their gaze. An action is activated based
on the fitting of the relative motion trajectory, instead of fixed gaze coordinates on the
interface, which is why a personal calibration is not necessary for pursuit-based interac-
tion. Hence, this calibration-free method has attracted a lot of interest, as it overcomes
the shortcomings arising from low spatial accuracy of the gaze data. Researchers have
explored dynamic gaze interfaces based on pursuit movements and developed different
applications, such as entering PIN code (Almoctar, Irani, Peysakhovich, & Hurter, 2018;
Cymek et al., 2014), eye typing (Lutz et al., 2015), controlling smart watches (Esteves, Vel-
loso, Bulling, & Gellersen, 2015) and selection in VR device (Sidenmark, Clarke, Zhang,
Phu, & Gellersen, 2020).
However, despite the advantages of pursuit-based gaze interaction, designing such a
dynamic interface needs to engage with the following problems. First, Ludvigh and Miller
(1958) reported that visual acuity decreases significantly with the increase in angular
velocity of the moving object. In pursuit-based gaze interaction, users need to search
for and identify an intended target moving at a certain speed from multiple moving
objects to activate a command. Second, when eye gaze is used as an input modality, the
2
basic functions of the eye, such as seeing and perceiving visual information, need to be
differentiated from the selection of an action, i.e., Midas touch problem (Majaranta &
Räihä, 2002). If this differentiation does not work well, users tend to feel annoyed when
their visual search is interrupted by a false activation.
1.1 Aims of the thesis
Designing an eye typing interface for public display is still challenging. Of course, there
is solution available to enter text using eye without personal calibration, for example,
SMOOVS (Lutz et al., 2015). This system works with one-point calibration and features
a circle layout in alphabetical order. However, it lacks user acceptance and its typing speed
is relatively slow. Furthermore, in this research area of pursuit-based gaze interaction, the
use of linear trajectory has not gained much attention, compared with circular trajectory.
But, it is common for objects to move in a straight-like line in daily life, such as a moving
train. That is, linear pursuit movements are very natural motions for the eye.
Therefore, this thesis starts from the investigation of natural linear pursuit eye move-
ments and aims to develop an eye typing interface without individual calibration, which
is simple and easy to understand without extensive training. In particular, the following
research questions are tackled.
•What is the appropriate number of objects and moving speed for the gaze interface
based on linear motion?
•How to design a dynamic eye typing interface that is easy to learn and user-friendly?
•How to use language models to improve the typing performance of the dynamic eye
typing interface?
To address the above research questions, three main studies are conducted.
Firstly, this dissertation investigates the effect of the number of objects and object
moving speed in a linear trajectory. The main goal of this study is to develop a deeper
understanding for the effect of object number and object moving speeds on the detection
performance of gaze interfaces based on linear trajectory smooth pursuit. Besides, three
detection methods are compared regarding the detection rates. Based on the findings,
some guidelines are formulated for the design of future eye typing interface.
Secondly, a new design concept, i.e., hybrid gaze interface, is proposed. This gaze
interface is featured with a two-stage selection and try to combine two selection methods
3
(dwell- and pursuit-based gaze interaction). The interface can be adapted to low-accuracy
equipment environments and also minimizes distractions to users’ attention. A novel eye
typing interface is designed based on this hybrid concept to create a robust gaze-enabled
text entry system. Further, this study is interested in providing feedback to achieve more
efficient typing performance and a better user experience. In the eye typing system, the
information input and output share the visual channel and it’s essential to provide the
user with the system state. This study would like to address the questions that provision
visual feedback is a more direct way or leads to mental overload or that the feedback from
another channel, e.g., auditory, is more efficient, or combined modalities are a better
solution for such a dynamic gaze interface. Thus, to investigate the effect of feedback,
different modalities (visual only, auditory-only, combined visual and auditory, and no
feedback) are compared in a user study.
In the third study, the language model is utilized to further improve the typing per-
formance of the hybrid eye typing interface. Three interfaces are designed: with letter
prediction, letter+word prediction, and no prediction. Those interfaces are compared in
a user study.
1.2 Structure of the thesis
This dissertation comprises seven chapters, and structures as follows:
Chapter 1 briefly introduces this research, which contains the aims of this thesis and
outline.
Chapter 2 provides the background information related to this work, including the phys-
iology of the eye, categorisation of eye movements, eye-tracking technology. Besides, the
related mathematical theories and evaluation metrics for both objective and subjective
measures are described in this chapter.
Chapter 3 presents an overview regarding gaze-based interaction. This chapter intro-
duces different gaze-based applications, and summarizes specific related work about eye
typing.
Chapter 4 introduces the study evaluating effects of differential number of objects and
moving speeds on the detection of pursuit eye movements. Besides, three detection meth-
ods are compared.
Chapter 5 describes the second study, which aims at developing a hybrid eye typing
interface. The typing performance and subjective evaluation are compared with four
forms of feedback.
4
Chapter 6 subsequently tests three typing interfaces to gain deeper insight into provision
word or letter prediction for this hybrid eye typing interface.
Chapter 7 discusses the studies presented in this thesis, and specifically details the
findings regarding design the hybrid eye typing interface. At the same time, the contribu-
tions of this thesis are declared. Besides, the outlook is provided regarding the potential
application of gaze interaction.
5
6
Chapter 2
Background
This chapter briefly explains the physiology of the eye and eye movements (Section 2.1),
then moves on to the introduction of eye-tracking technology (Section 2.2). In addition
to theoretical reviews, the evaluation metrics and mathematical algorithm are introduced
in Section 2.3.
2.1 Physiology of eye and eye movements
The eye is an important body organ, which is used to acquiring visual information in
most of the activities, such as reading, sporting, and walking. Although eye is a robust
source of information, eye moves all time to obtain a clear vision of an interesting object.
The fovea is the area of the retina with the highest visual acuity. Throughout the visible
visual range, only the information that falls in the fovea will be processed precisely. The
eyeballs often rotate unconsciously, allowing the light to focus on the fovea as much as
possible. There are several major patterns of eye movements and they are divided into five
basic types: saccadic, smooth pursuit, vergence, vestibular, and physiological nystagmus,
i.e., miniature movements of fixation (Robinson, 1968). This chapter focuses on the eye
movements related to gaze interactions.
2.1.1 Fixations
Fixations appear at an early stage in the life span (Roucoux, Culee, & Roucoux, 1983),
which are relatively static eye movements and related to the perception of a stationary ob-
ject (Duchowski, 2017, p. 44). The duration of one fixation is generally 200-600 ms (Jacob,
1995).
7
Fixations do not mean absolute stillness, which are accompanying with miniature eye
movements: tremor, drift, and microsaccades (Carpenter, 1988, p. 124-131). Microsac-
cades are responsible for correcting the displacement of the eye’s fixation point caused by
tremor and drift, which is generally less than 5◦visual angle (Martinez-Conde, Macknik,
& Hubel, 2004).
2.1.2 Saccades
Saccades are rapid ballistic eye movements and happen frequently when the fovea re-
positioned to the next location, for example, people scanning the environment during
looking for something in the room. As jerky movements, saccadic movements are consid-
ered to be one of the fastest movements of the body and the velocity which are stable
in the range of 30 to 40 ◦/s (Holmqvist et al., 2011, p. 326-328), and the maximum
saccadic velocity is ranged from 600 to 900 ◦/s (Fisher, Monty, & Senders, 1981, p. 33).
The duration of saccadic movements fluctuates with the amplitude (as shown in Equa-
tion 2.1) (Carpenter, 1988, p. 70-71), which ranged from 30 - 80 ms (Orban de Xivry &
Lefevre, 2007). Saccades can not only be intentionally controlled but also be triggered
involuntarily by stimuli, such as sudden changes in the peripheral area. The latency of
saccades is between 150 and 250 ms (Rashbass, 1961).
Saccade duration (ms) = 2.2∗Saccade amplitude (◦)+ 21 (2.1)
2.1.3 Smooth pursuits
Smooth pursuit is considered as the intervals eye movements between saccadic movements
when the eye follows a moving object (Rashbass, 1961). When people track a moving ob-
ject with their eyes, those eye movements are known as smooth pursuits which enable
people to keep the moving object on the fovea (Robinson, 1965). Meyer, Lasker, and
Robinson (1985) defined “when a visual target moves, humans, and other foveate animals,
track it, if the head is stationary, with an eye movement called smooth pursuit." Unlike
saccades, being both active and inactive response to stimulus, smooth pursuit eye move-
ments happen when eyes proactively follow a moving object or imaginary moving objects
(Fukushima, Fukushima, Warabi, & Barnes, 2013). This movement is performed under
voluntary control, i.e., the observer can choose whether to follow the moving target or
not.
8
For ramp target motion, eye can keep up with target motion when the velocity of
smooth pursuit motion is less than 30 ◦/s (Robinson, 1965). The maximum pursuit
velocity reaches 30 to 40 ◦/s (Fisher et al., 1981, p. 33). When the target is moving
slowly, the pursuit eye movements are more effective and the velocity is closely related to
the velocity of the target (Rashbass, 1961; Robinson, 1965). The visual acuity decreases
when the velocity of the target is more than 3 ◦/s (Westheimer & McKee, 1975). Smooth
pursuit eye movements tend to maintain eye velocity to match the target velocity and
smooth pursuit gain can be defined as the ratio of the eye movement velocity to the
target velocity. If the velocity is less than 10 ◦/s, the eye tends to keep up with the
movement of the target. However, the velocity of smooth pursuit is hard to perfectly fit
the movement of targets, i.e., the smooth pursuit gain is approximately 1 (Collewijn &
Tamminga, 1984; Rashbass, 1961). If the velocity of target motion is too fast (exceed
40 ◦/s), smooth pursuit eye movement will be interrupted by catch-up saccades to keep
up with the moving target (Dodge, 1903; Robinson, 1965).
The latency of evocation smooth pursuit motion ranges from 80 to 150 ms (Lisberger,
Morris, & Tychsen, 1987; Rashbass, 1961; Westheimer & McKee, 1978). When the target
is moving in an unpredictable direction and speed, the latency of smooth pursuit expands
to 100-200 ms. Conversely, when the visual system can anticipate the moving direction of
the target, the latency is not only shorter, even negative, i.e., the eyes start moving before
the target (Burke & Barnes, 2006; de Hemptinne, Lefevre, & Missal, 2006; Stark, Vossius,
& Young, 1962). Due to the latency, the eye accelerates with a ballistic acceleration until
it reaches the target speed, which is considered as “open-loop pursuit". Then the “closed-
loop pursuit" fallowed within 300 ms after the target starts to move (Holmqvist et al.,
2011, p. 432).
Orientations of smooth pursuit eye movements can have an effect on accuracy. Collewijn
and Tamminga (1984) found that horizontal pursuit movement performs slightly smoother
and more precisely than vertical in general. Ke, Lam, Pai, and Spering (2013) investigated
asymmetries in smooth pursuit eye movements and the horizontal and downward moving
direction are the most preferred regarding smooth pursuit eye movements.
2.1.4 Comparison saccadic and smooth pursuit movements
Both saccadic and smooth pursuit eye movements refer to the dynamic visual tracking
and aim to maintain the target on the fovea. Behavioral and neurophysiological studies
have revealed that saccades and pursuit eye movements play a synergistic role in visual
9
tracking. They are two modes of oculomotor control but share a single sensorimotor
process (Orban de Xivry & Lefevre, 2007). Saccades are rapid eye movements to align
the visual axis with the target, where smooth pursuit eye movements refer to track the
relatively slow-moving targets. The latency of pursuit movement is generally shorter
than saccadic latency (Fisher et al., 1981, p. 33). Smooth pursuit eye movements usually
require a moving target to follow (Young, 1971, p. 433-434), while saccades can be made
without stimuli (Duchowski, 2017, p. 40-41). The difference between saccadic and smooth
pursuit eye movements is briefly summarized in Table 2.1.
Table 2.1: Comparison of saccadic and smooth-pursuit eye movements
Dimensions Saccades Smooth Pursuits
Velocity 30 - 500 ◦/s <30 ◦/s
Latency 150 - 250 ms 80 - 150 ms
Gather visual information no yes
Target stimuli not necessary necessary
Sensorimotor process shared
2.2 Eye tracking technology
Eye tracking is known as a technology used to measure eye position and eye movement.
The device used for eye tracking is called eye tracker. The eye-tracking methods could
be divided into four broad categories: Electro-Oculo Graphy (EOG), Scleral contact
lens/Search coil, Photo-Oculo Graphy (POG)/Video-Oculo Graphy (VOG), and video-
based combined pupil and corneal reflection (Duchowski, 2017, p. 49-56).
Electro-Oculo Graphy is based on the recording of the electrical potential difference
between the electrodes around the ocular cavity. The measurements of eye movements
are influenced by head movement in this method (Duchowski, 2017, p. 50).
Scleral contact lens/Search coil is one of the most frequently used methods for measur-
ing eye movements, which is to attach a mechanical or optical reference to a contact lens.
And people wear directly over the eye to measure the eye movements. Head movements
are also involved during measuring (Duchowski, 2017, p. 51).
Photo-Oculo Graphy (POG)/Video-Oculo Graphy (VOG) used real-time image pro-
cessing technology to record the eye movements. Ocular features (such as the apparent
10
shape of the pupil, the position of the corneal limbus and the corneal reflection) are
extracted. Sometimes, chin rest is needed to fix head (Duchowski, 2017, p. 52).
However, the above mentioned methods are hard to measure the point of regard (POR).
Video-based combined pupil/corneal reflection is one of the most popular methods for the
measurement of the point of regard in real-time. Corneal reflection and the pupil center
are two key features of video-based corneal reflection. Corneal reflection was first used for
objective eye measurements in 1901 (Robinson, 1968). When light enters the eye, there are
four reflections, also know as Purkinje reflection (illustrated in Figure 2.1). Video-based
eye tracking is usually based on the first Purkinje reflection (glint). The corneal reflection
of a light source (usually near-infrared light) is measured by the relative position of the
center of the pupil (Duchowski, 2017, p. 52-54). The eye tracker apparatus of video-based
corneal reflection has two forms: table-mounted and head-mounted. The table-mounted
eye tracker i.e., Tobii EyeX was used in the following studies, which provided the mea-
surements of the point of regard and tolerated moderately head movements. Besides, the
price of this eye tracker apparatus is relatively low (about 100 $). Thus, Tobii EyeX was
frequently used in gaze interaction studies (Esteves et al., 2015; Feit et al., 2017; Morimoto
& Amir, 2010).
Iris
Lens
Cornea
Light
1
2
3
4
Purkinje
reflections
Figure 2.1: Purkinje reflections. Adapted from (Duchowski, 2017, p. 55)
11
2.3 Evaluation metrics
The evaluation of eye typing could be divided into two parts: performance measures
and subjective evaluations. The text enter rate (Subsection 2.3.1), text enter error rate
(Subsection 2.3.2) and keystroke savings (Subsection 2.3.3) are the main parts of the per-
formance measures, whereas perceived workload, and open questions are used to evaluate
the subjective feedback from participants. The evaluation metrics used in this thesis are
explained in the following subsections.
2.3.1 Text entry rate
Text entry rate refers to how fast users can enter text in a typing system. Since the aver-
age word length is different in various languages, in this thesis, typing speed is measured
in words per minute (WPM), where a word is regarded as five characters, i.e., regardless
influence of language type on word length. The counting includes letters, spaces, punc-
tuations, etc. (MacKenzie & Tanaka-Ishii, 2010, p. 48). Words per minute are computed
according to the following equation:
WPM =|T| − 1
S×60 ×1
5(2.2)
where |T|refers to the length of total typed characters in the transcribed string. S
represents run time in seconds.
2.3.2 Error rates
Typing error rates mainly include the corrected and uncorrected errors. The corrected
errors are the errors that were made but also corrected during typing, whereas the un-
corrected errors refer to the errors left in the transcribed text. The used metrics are
introduced in the following sections.
Keystrokes per character (KSPC) measures how frequently users correct the entered
characters during typing. Keystrokes per character are calculated by the ratio of the sum
of letters and including spaces to the final number of characters in the entered string
(MacKenzie & Tanaka-Ishii, 2010, p. 52). The formula for computing keystrokes per
character is:
KSPC =|IS|
|T|(2.3)
12
In Eq.(2.3), |IS|stands for the length of all characters, including backspaces. As men-
tioned before, |T|describes the length of total typed characters in the transcribed string.
If the user typed perfectly without using backspace and the keystrokes per character value
is 1.
Minimum string distance (MSD) is a string metric for measuring the minimum num-
ber of error correction operations required to convert a string to another string (Leven-
shtein, 1966; MacKenzie & Tanaka-Ishii, 2010). MSD error rate is the uncorrected error
and reflects how many errors remain in transcribed string. The formula for calculating
MSD error rate is:
MSD error rate =MSD(P, T)
MAX(|P|,|T|)(2.4)
In Eq.(2.4), Prefers the given string. |T|is the length of total typed characters in the
transcribed string.
To evaluate the error rates, the keystrokes per character (errors during entry), mini-
mum string distance (errors after entry) are used as dependent variables in the first study.
Since, the language model is added for the eye typing interface in the third study, and
the number of keystrokes required for typing one word is no longer equal to the length of
the word, i.e., if the next possible word is successfully predicted and selected by the user,
the number of activated keystrokes is smaller than the length of the word.
Thus, three new evaluation metrics are introduced for the third study, they are un-
corrected error rate (UER), corrected error rate (CER), and keystroke savings (KS). One
advantage for corrected error rate and uncorrected error rate is that those two metrics
share the same denominator, which can be combined to a total error rate (CER + UER).
Despite the absence of a description of the total error rate, Chapter 6 analyzes corrected
error rate and uncorrected error rate separately.
Corrected error rate (CER) measures the errors corrected during typing, which is
similar to the keystrokes per character (MacKenzie & Tanaka-Ishii, 2010, p. 55-56). The
formula for calculating the corrected error rate is:
Corrected error rate =IF
C+INF +IF (2.5)
In Eq.(2.5), IF refers to all characters backspaced during entry. Cstands for all correct
characters in the transcribed text. INF represents all incorrect characters in the tran-
13
scribed text.
Uncorrected error rate (UER) refers to the errors left in the transcribed text, which
is similar to the error rate based on minimum string distance (MacKenzie & Tanaka-Ishii,
2010, p. 55-56). The formula for calculating uncorrected error rate is showed in Equation
2.6, the denotation is the same as in Equation 2.5.
Uncorrected error rate =INF
C+INF +IF (2.6)
2.3.3 Keystroke savings
Keystroke savings (KS) reflects how many keystrokes are saved with word prediction/completion
(Trnka & McCoy, 2008). The formula for calculating KS is:
KS =Keysphrase −Keyswith prediction
Keysphrase
∗100% (2.7)
In Eq.(2.7), Keysphrase refers to the number of keystrokes needed to enter without any
word prediction. Keyswith prediction is the number of the keystrokes entered with word
prediction, the backspaces and corrected characters are not included.
2.3.4 Subjective workload
As Rubio, Díaz, Martín, and Puente (2004) writed “The evaluation of mental workload
is a key point in the research and development of human-machine interfaces, in search
of higher levels of comfort, satisfaction, efficiency, and safety in the workplace." Since
the eye is responsible for signal processing of both input and output in the gaze-based
interface, the study regarding gaze interaction needs to pay more attention to mental
workload during designing the interface.
There are a lot of assessment tools for subjective workload, such as NASA Task Load
Index (NASA TLX) (Hart & Staveland, 1988), Subjective Workload Assessment Tech-
nique (SWAT) (Reid & Nygren, 1988) and Workload Profile (WP) (Tsang & Velazquez,
1996).
NASA TLX is a subjective, multidimensional assessment tool including mental de-
mand, physical demand, temporal demand, performance, effort, and frustration, in all six
dimensions. The score for each dimension ranges from 0 to 100, and a lower score means
a better evaluation. SWAT is a workload assessment tool and has three dimensions, they
14
are time load, mental effort load, and psychological stress load. Participants can eval-
uate with three levels, low, medium and high. It has a pre-task, also known as card
sorting procedure, i.e., to develop individual workload scales. Thus, the implementation
of this method is considered relatively complex and time consuming. Workload profile
method is built from the multiple resource framework proposed by Wickens (2002) and
concludes eight workload dimensions. They are perceptual/central processing, response
selection and execution, spatial, processing, verbal processing, visual processing, auditory
processing, manual output, and speech output.
Taking into account the assessment dimensions corresponding to the typing task and
ease of use, the NASA TLX is chosen and used in the third study of this thesis. The
NASA TLX questionnaire can be found in Appendix C.1.
2.4 Mathematical methods
This section briefly introduces the mathematical methods used throughout this disserta-
tion, including density-based spatial clustering of applications with noise (DBSCAN) and
orthogonal distance regression (ODR).
2.4.1 Density-based spatial clustering of applications with noise
(DBSCAN)
DBSCAN is the most common density-based clustering algorithm and is used to remove
the spatial outlier points in the following studies. Ester, Kriegel, Sander, and Xu (1996)
proposed this method which can be used to cluster a set of points in space, that is, to
classify points into core points, reachable points, and outliers. There are two parameters
that need to be given for DBSCAN. One is the maximum distance (also known as ),
which is the distance between two gaze points, one of which is considered to be in the
vicinity of the other. The number of samples (minimum sample size) in the neighborhood
where a point is considered a core point. The example is illustrated in Figure 2.2.
The points near A are core points and form one cluster. Points B and C are not core
points, but are connected by density through the cluster of A and therefore belong to this
cluster. Point N is neither a core point nor density-connected via the cluster of A, so it
is classified as noise.
In this thesis, python package “sklearn.cluster.DBSCAN" is used in the one-point
calibration process. The methods will return the estimated number of clusters and number
15
Figure 2.2: Example of DBSCAN (example image under CC BY-SA 3.0)
of noise gaze points. All gaze data are labeled with either cluster number or with “-1" (for
noise points).
2.4.2 Orthogonal distance regression (ODR)
Orthogonal distance regression is a special case of total least squares. It aims to minimize
the orthogonal distance from data points to a functional or structural model (Boggs &
Rogers, 1990).
Figure 2.3: Illustration of ODR (example image under CC BY-SA 3.0)
16
As shown in Figure 2.3, the error in both X and Y coordinates are taken into account
in orthogonal distance regression, i.e., the sum perpendicular distances of the points with
respect to the estimated line is calculated. The process of estimation can be defined as:
Xi=xi−δi
Yi=yi−i
r2
i=min 2
i+δ2
i
subject to: Yi=f(Xi+δi;βi)−i
(2.8)
where Xi, Yi, i = 1, ..., n, are observed random variables with the true values xi, yi, i =
1, ..., n.δipresents the random error associated with xi, and istands for the random
error associated with yi.ris the orthogonal distance from the point to a linear modal.
The error model is estimated by achieving the minimization of the sum of the squared
distances of r.
2.5 Summary
This chapter provides an overview of eye movements, including the general classification
of eye movements and their characteristics. Besides, the different eye-tracking technol-
ogy is briefly introduced. Last but not least, the evaluation metrics and mathematical
methodologies are described in detail in this chapter.
17
18
Chapter 3
Related work
In this chapter, related work focuses on two parts. First, the different gaze interaction
designs are reviewed briefly in Section 3.1. Then, the studies of eye typing are summarized
in Section 3.2. Based on the previous work, the potential shortcomings of the research
area are highlighted in Section 3.3.
3.1 Gaze interaction
Gaze control has a number of advantages over those input methods that rely on physical
touch, e.g., hands-free interaction for aseptic environments and increased privacy, since
inputs cannot be visually observed by third parties (Cymek et al., 2014). Gaze input
provides a natural and fast method for interacting with computers. In some specific
situations, the reaction speed can exceed the mouse (Sibert & Jacob, 2000).
Hence, a large number of applications for gaze-based interaction have been developed
for everyday human-computer interaction. Gaze typing (Majaranta, MacKenzie, Aula, &
Räihä, 2006; Ward, Blackwell, & MacKay, 2000), password input (Cymek et al., 2014; de
Luca, Weiss, & Drewes, 2007), controlling smart watch (Esteves et al., 2015), map read-
ing (Göbel, Kiefer, Giannopoulos, Duchowski, & Raubal, 2018), controlling telepresence
robots (G. Zhang, Hansen, & Minakata, 2019) and selection in VR/AR device (Esteves
et al., 2017; Khamis, Oechsner, Alt, & Bulling, 2018; Sidenmark et al., 2020) have been
proposed as use cases.
Although there is an increasing interest in gaze-based interaction, proposed implemen-
tations have suffered from a number of limitations. The so-called Midas touch problem
represents one of the main challenges for gaze-based interaction. It describes the difficulty
to distinguish between visual search patterns and the intentional selection of actionable
19
items on an interface.
In addition, most eye trackers require calibration before use, which is time-consuming
and can be inconvenient for spontaneous interaction. And even with individual calibra-
tion, eye trackers are still prone to problems with fluctuating spatial accuracy during use.
Feit et al. (2017) found that eye-tracking systems need to be re-calibrated multiple times
a day to maintain interactable accuracy for most eye typing systems, and the individual
difference is relatively large both in accuracy and precision as well.
Another challenge in the design of gaze-based interfaces is the lack of accuracy of real-
time gaze coordinates, stemming from both biological characteristics of the eye as well
as limitations of the eye-tracking equipment. Due to miniature eye movements, such as
tremor, drift, and microsaccades, there is hardly a moment that the eye is absolutely still.
Thus, similar to the obvious and well-known “Fat Finger” problem in touchscreen (Bi,
Li, & Zhai, 2013), the use of the gaze as a cursor cannot be as accurate as e.g., using a
computer mouse.
According to the above mentioned problems, three mechanisms for the selection are
prevalent to trigger an action in gaze-based interfaces, namely fixation-based, gesture-
based, and smooth-pursuit based selection. This chapter gives a brief overview of the
design of gaze interaction methods mentioned above and introduces how people can in-
teract with the interface using their eyes.
3.1.1 Fixation-based gaze interaction
In fixation-based systems, actionable items are fixed on the interface and the user’s gaze
locates on the presented interface to select it. Actions, e.g., the selection of objects, are
triggered through gaze fixation on an actionable item for a set amount of time (typically
called dwell time). Gaze-based systems suffer from the Midas touch problem, where
actionable items are triggered, although users just look at the actionable item to identify
it (Jacob, 1990; Stampe & Reingold, 1995). To overcome the Midas touch problem, the
duration of dwell time is set to longer than general fixation (typically 200-600 ms) (Jacob,
1995). The dwell time is varied from 600 to 1000 ms in studies related to fixation-based
gaze interface (Majaranta & Räihä, 2002).
Fixation-based systems suffer from two major disadvantages. First, those systems
require an individual calibration of the eye-tracker, as they depend upon a high accuracy
in the registration of the gaze position, to locate the user’s gaze in relation to the objects
on the interface. Second, the size of the actionable item is limited due to the size of the
20
fovea and the size of the gaze jitter. The area covering 1◦visual angle is considered to be
the minimum area required for gaze-based interaction, and this distance corresponds to
approximately 1 cm at 65 cm from the display screen. With those limitations, the users
have to go through the calibration process before use. Besides, relatively large actionable
items with wide separation between objects have to be considered, which limits the number
of actionable items on the interface.
3.1.2 Gesture-based gaze interaction
Drewes and Schmidt (2007) introduced gaze gesture method and an action is triggered
after completion of a fixed sequence of gaze movements. Gesture-based gaze interaction
does not require a graphical interface. This approach is independent of display space and
insensitive to eye-tracking accuracy. However, since users need to learn and remember
available gestures, gaze gesture-based interaction is not practical for walk-up interaction,
where users have no prior knowledge about the interaction.
3.1.3 Pursuit-based gaze interaction
The gaze interfaces based on smooth-pursuit eye movements are composed of moving
actionable objects. Instead of an exact location of a user’s eye-gaze on the interface, gaze
trajectories are used for object identification, which result from the human eye following
a moving object. These eye movements are called smooth-pursuit movements, lending the
name to these categories of gaze interfaces.
To overcome the calibration challenge, pursuit-based interfaces have been first pro-
posed by Vidal et al. (2013). Using these interfaces, users can activate an action by
following the corresponding moving target with their gaze. An action is activated based
on the fitting of the relative motion trajectory, instead of fixed gaze coordinates on the in-
terface, which is why a personal calibration is not necessary for pursuit-based interaction.
Hence, this calibration-free method has been attracting a lot of interest, as it overcomes
the shortcomings arising from the low spatial accuracy of gaze data.
To relate the gaze trajectory to moving objects in the interface, Vidal et al. (2013)
utilized Pearson’s product-moment correlation in their first implementation of a smooth-
pursuit based interface. The parameters, i.e., number of objects, moving trajectory,
moving speed, window size, and correlation threshold, are evaluated. They reported
that the detection rate starts to decrease when there are eight or more objects on the
screen. However, when analyzing the data separately for linear and circular motion, it
21
was found that the detection rate of the circular motion is higher, which allows a more
robust detection performance for up to 15 objects. The larger window size (500 ms) has
a higher detection performance than the smaller one (100 ms). For slow-moving speed, a
larger window size was required. The correlation threshold value is recommended to be
set higher than 0.5.
For gaze interface based on smooth pursuits, many studies focus on the circular tra-
jectories. Esteves et al. (2015) explored the circular gaze trajectories in gaze-based smart
watch interfaces. The moving speed was given by angular speed, but not in pixels/s or
visual angle/s. However, the circular diameters were given (Small: 0.6 cm/0.98◦of visual
angle, Medium: 1.6 cm/2.62◦and Large: 2.6 cm/4.25◦). The results have shown that the
Pearson correlation-based detection method performs well on up to eight moving orbits.
For moving angular speed, the medium angular speed of 120 ◦/s achieves the best perfor-
mance in detection rates, whereas the orbits with large diameters have better detection
performance than others.
Drewes, Khamis, and Alt (2018) compared two detection methods, namely correlation
methods (with or without 5 samples delay) with different moving speeds. The results
found that the moving speeds between 6 ◦/s and 16 ◦/s result in the higher detection rate
with the correlation method than the method based on Euclidean distance, whereas the
euclidean method works better when the object moving speed is relatively slow.
Despite their advantages, Vidal et al. (2013) found that the detection rate was lower
when objects in the interface move in linear trajectories compared to circular trajecto-
ries. The possible reason is that the detection of horizontal and vertical movement using
Pearson’s product-moment correlation can lead to problems. For trajectories that are
purely horizontal or purely vertical, there is zero standard deviation when computing
the Pearson correlation. The condition that both the highest correlation coefficient of
corrxand corryfor the objects need to above a threshold is difficult to meet. Besides,
circular trajectories have been found to be subject to increased gaze deviations compared
to rectangular trajectories consisting of straight lines (Kosch, Hassib, Woundefinedniak,
Buschek, & Alt, 2018).
Thus, apart from circular object movements, researchers have developed interfaces
based on linear pursuit eye movement using other algorithms to improve the detection
performance. Cymek et al. (2014) investigated moving speed (Slow, Medium, Fast) and
minimal object distance (Larger, Small) for gaze interface based on linear smooth pur-
suit eye movements. The gaze interface contained 16 dynamic elements. Each element
moved in three stages, combining horizontal and vertical movements. The detection of
22
targets relies on analyzing the combination and classification of gaze trajectory sequences.
The authors reported that in terms of user-friendliness, the most favorable results were
obtained in the condition with medium speed (218 pixels/s, about 4.9 ◦/s) and large
distance (39 pixels).
This dynamic interface design based on the linear motion was further utilized by Lutz
et al. (2015), who explored an eye typing interface based on two-stage pursuit movement
which contains 32 moving objects and four moving speeds was tested, called SMOOVS.
The best typing performance was achieved with medium movement speed (300 pixels/s,
7.73 ◦/s). Later, Freytag, Venjakob, and Ruff (2017) compared the gaze interface for
PIN input and eye typing (i.e., SMOOVS). The results found that the detection is more
accurate and quick for the gaze interface containing less moving objects, i.e., the detection
performance for PIN input is better than eye typing. So far, there has been no research
specifically analyzing the influence of object number and object moving speed in linear
trajectory smooth pursuit gaze-based interfaces.
In summary, the large orbits size or long moving trajectory facilitate more successful
detection. Besides, a moderate moving speed is appropriate for gaze interface based on
smooth pursuit, as Esteves et al. (2015) reported “if it is too slow it becomes a fixation;
if it is too fast it turns into repeated saccades".
The results from the studies presented above are summarized in the following table.
Some moving speed parameters presented in the table have been estimated based on the
figures and descriptions available in the original papers when the actual values were not
provided by the authors.
3.1.4 Feedback in gaze interaction
In addition to differences in the basic workings of gaze-based spellers, the design of appro-
priate feedback can enhance users’ experience (Nielsen, 1994). First, feedback can provide
information about the state of the interaction, thereby allowing the user to adjust any
erroneous activation before it is registered. Second, feedback can be used to confirm that
a selection is registered. Generally, users are relatively tolerant of interfaces with appro-
priate feedback. However, inappropriate feedback can also be misleading or distracting
to users.
23
Table 3.1: Summary the selected studies of pursuit-based gaze interaction.
Moving speed1Trajectory Window size Methods for detection Number Distance
(threshold) of objects from display
Vidal et al. (2013) 100-850 pixels/s circle/linear 100, 500 ms correlation coefficient 2-20385 cm
(0.2-0.9)
Esteves et al. (2015) 60-240 ◦/s2circle 1000 ms correlation coefficient 2-16 35 cm
(0.8)
Kangas et al. (2016) 5-7 ◦/s circle/linear 500 ms squared distances 2 50-70 cm
(700 pixels)
Drewes et al. (2018) 0.78-25.1 ◦/s circle/square 500 ms correlation coefficient 1 53.5 cm
(0.8)
diamond euclidean distance (30 pixels)
Cymek et al. (2014) 3.3-9.8 ◦/s linear 1000 ms direction-based 16 70 cm
(one direction)
Lutz et al. (2015) 5.67-8.76 ◦/s linear >200 ms angle-based (58◦) 32 60 cm
1The speed parameters given in the literature vary and are divided into three categories (pixels/s, visual angle/s and angular speed/s). All speed
parameters that can be converted to visual angle/s have been converted.
2Angular speed/s.
3To prevent the distraction by other moving objects on the screen, only one moving target is visible
24
Feedback in fixation-based gaze interface
Majaranta, MacKenzie, Aula, and Räihä (2003) found an effect of feedback on typing
performance of the fixation-based system. The results indicated that participants typed
faster when receiving a combined click and visual feedback than other forms of feedback
such as speech only or visual only. Short no-speech sound, like a ’click’, was preferred
by participants over synthetic speech. The speech feedback is limited in some cases, as
it takes time to pronounce a selected letter than just giving a short, no-speech sound as
feedback. In addition, Majaranta et al. (2006) found that the feedback needs to be set
according to the duration of dwell time.
Feedback in gaze gestures
Since the interaction via gaze gestures is usually facilitated without a graphical user
interface, the related studies focus more on vibrotactile feedback (Rantala et al., 2020).
The implementation of vibrotactile feedback was found to improve both the typing speed
and subjective evaluation of users (Kangas et al., 2014). Köpsel, Majaranta, Isokoski,
and Huckauf (2016) compared visual, haptic, and auditory feedback modalities. They
evaluated feedback given both during and after the input of a gaze gesture. It was found
that there are only slight differences in accuracy and user experience between different
feedback modalities in gaze gesture interactions.
Feedback in pursuit-based gaze interface
The influence of feedback has also been considered in research on pursuit-based gaze in-
teraction. Špakov, Isokoski, Kangas, Akkil, and Majaranta (2016) added a “tick" tone
in smooth pursuit-based widgets. The comparison of feedback modalities (no feedback,
visual, auditory, and haptics) showed that feedback conditions do not significantly affect
the performance of pursuits-based gaze interaction. However, most users preferred tactile
and auditory feedback (Kangas et al., 2016). To interact with dynamic interfaces with
eyes, the cognitive workload is higher than the static, and suitable feedback could be con-
sidered in the design of dynamic interfaces. Even though pursuit-based gaze interaction
has been studied for years, there is less research on how to design the feedback of the
interaction to users.
25
3.2 Eye typing interface
Eye tracking technology is used in psychology, marketing, and also as an input device for
human-computer interaction (Jacob, 1990). For gaze interaction, eye typing is the most
widely used application. Using gaze for text entry has several unique benefits, and was
initially developed as an important communication tool for people with certain disabilities
(Majaranta & Räihä, 2002). Furthermore, systems have been developed to enter text
using gaze for public displays (Lutz et al., 2015), and recently have been widely studied
as an input method in virtual reality (Rajanna & Hansen, 2018). Dwell time, gestures,
or pursuit movements are used to replace the click/tap action to activate a character.
The layout of common eye typing interfaces can be divided into two parts: traditional
QWERTY layouts and other specific layouts.
3.2.1 QWERTY typing layout
The QWERTY keyboard layout is familiar to most users, hence users can learn the inter-
action process for gaze-based typing on QWERTY interfaces quickly. There have been a
number of studies involving the implementation of QWERTY layout in eye typing.
Dwell-based Typing Interface
For most eye typing systems, the input and output space share the same space and the
gaze works as an invisible cursor. Designs based on dwell time are popular, as they allow
to control over the Midas touch problem. That is, users can enter a letter by fixating
their gaze on it for a pre-defined duration, i.e. the dwell time.
To further facilitate the typing efficiency, individual differences can be taken into ac-
count and more flexible dwell duration was investigated, e.g., users were allowed to adjust
the dwell time themselves (Majaranta, Ahola, & Špakov, 2009), or systems were capable
of automatically adjusting the dwell time based on users’ performance (Špakov & Min-
iotas, 2004) or previously entered text and location of keys (Mott, Williams, Wobbrock,
& Morris, 2017; Pi, Koljonen, Hu, & Shi, 2020). However, dwell-time based gaze system
depends heavily on the spatial accuracy of the eye tracker, as even slight inaccuracies in
tracking can lead to an unintentional input for characters close to the intended target.
26
Dwell-free typing interface
In dwell-free interfaces, users do not need to fixate their gaze on actionable items for a
pre-defined dwell time. For example, on the context-switching interface (Morimoto &
Amir, 2010), two QWERTY keyboards are displayed on the screen, and the user swipes
between them to enter text. In addition to the character-level text entry method, there is
a word-level text entry method based on the analysis of letter sequence using a language
model. Users can enter text by looking at a sequence of letters they need to enter with
their eyes (Kurauchi, Feng, Joshi, Morimoto, & Betke, 2016; Pedrosa, Pimentel, Wright,
& Truong, 2015). When starting and ending a word, an explicit command needs to be
given out to distinguish between receiving visual information and selection.
The traditional QWERTY layout has both advantages and disadvantages for gaze-
based interfaces. Since it is well known by most users, the QWERTY layout can reduce
the cost of learning. However, it was developed for finger typing and is not well adapted
for gaze-based typing.
3.2.2 Specific typing interface
In addition to the traditional layout, there are a number of novel keyboard layouts de-
signed specifically for eye typing. Due to the unstable accuracy of gaze data, those layouts
try to rearrange the letters’ position or cluster letters to facilitate the user experience.
Static-graphic gaze interface
J. P. Hansen et al. (2003a) developed GazeTalk, which has more than ten active buttons
on the screen (as shown in Figure 3.1). The interface is updated in real-time and what
are more likely to be entered are displayed. The characters or words displayed in each
button are based on the predictions of a language model. To overcome the limitations
of inaccurate gaze data, the size of actionable items is increased, which helps with item
distinction, but reduces the number of displayable actionable items or results in the need
for bigger displays.
Huckauf and Urbina (2008) proposed the concept of two-stage selection. Their pEYEs
application takes the form of popup menus, this pie layout is available for menu selection
and text entry (see in Figure 3.2). Users first select a cluster and then select the intended
character from the selected cluster. In this interface, the displayed characters and the
area for activation are separated. The characters are located near the center of the circle
and the activation area is located at the edge of the circle.
27
Figure 3.1: Interface of GazeTalk (J. P. Hansen et al., 2003a).
Figure 3.2: Illustration of pEYEs, adapted from Huckauf and Urbina (2007).
Additionally, research has been carried out on gaze gestures to overcome space limita-
tions. Gaze gesture systems such as Quikwriting (Perlin, 1998) or EyeWrite (Wobbrock
et al., 2008) require users to spend a certain amount of time learning and remember the
gestures (see in Figure 3.3 and 3.4). Hence, they are not suitable for walk-up-and-use
scenarios.
Dynamic-graphic gaze interface
A number of studies have investigated the use of a dynamic interface for gaze typing, i.e.,
the position of the items is not fixed in such interface. One of the well-known eye typing
28
Figure 3.3: Interface of Quikwriting (Bee & André, 2008).
Figure 3.4: The gaze gestures of letter “a" and “b" in EyeWrite, adapted from Wobbrock
et al. (2008).
systems is Dasher (Ward & MacKay, 2002). In this interface, characters have no fixed
position and are placed in a row, according to the probabilities for the frequency of use for
each character (as shown in Figure 3.5). The character row is continuously moving from
right to left, with the actionable character highlighted. There was one more dynamic eye
typing interface, called StarGazer (D. W. Hansen et al., 2008), where user could select
characters with zooming and panning actions (see in Figure 3.6).
More recently, usage scenarios for gaze interaction have been broadened to include
public screens. Studies are dedicated to design interactive interfaces with reduced cal-
ibration time or without calibration. Lutz et al. (2015) developed the SMOOVS text
entry system, a pursuit-based gaze interface with one-point calibration. To avoid the
Midas touch problem, the interface is divided into interactable areas and a central area
that is not interactable. All elements in the interface are designed to stay static when
the gaze cursor is located in the non-interactable middle area of the interface. When the
29
Figure 3.5: Interface of Dasher (Ward & MacKay, 2002).
gaze position is moving out of the deactivated central area, the first moving stage starts.
In this stage, all clusters move outward simultaneously. In the second moving stage, the
tiles within the selected cluster move apart from each other around the center of the clus-
ter. Although users can quickly understand the interface and learn how to use it, users’
attention can be distracted by objects moving at the same time in the first stage. In
addition, the text entry rate for SMOOVS is relatively low in comparison to other typing
systems. A high error rate exacerbates this problem, as users need to frequently correct
wrong inputs, often resulting from involuntary initiations during the visual search for the
desired character.
In order to reduce the amount of keystrokes for entering a word or sentence, word
prediction function was added for SMOOVS interface to improve the typing speed to
4.5 WPM (Zeng & Roetting, 2018). In addition to move outward design, there is typing
interface featured clockwise or counter-clockwise circular motion, called EyeTell (Bafna,
Bækgaard, & Paulin Hansen, 2021) and the gaze estimation is based on front-facing
camera. The interface of EyeTell consists of an inner circle (moving counter-clockwise)
and an outer circle (moving clockwise), where the inner circle is formed by characters
clusters and the outer circle is the characters in the selected cluster. However, the typing
30
Figure 3.6: Interface of StarGazer (D. W. Hansen et al., 2008).
Figure 3.7: Interface of SMOOVS (Lutz et al., 2015).
speed still slow and needs to be improved to meet the needs of daily life.
3.2.3 Language model in eye typing
There has been a large volume of published studies describing the role of language models
in text entry tasks. Language model is one of the most widely used methods to im-
prove typing performance on smartphone, and it is also a very important complementary
31
function in the brain-computer interface (BCI) system (Orhan et al., 2012; Speier, Chan-
dravadia, Roberts, Pendekanti, & Pouratian, 2017) and eye typing system (Kristensson &
Vertanen, 2012; MacKenzie & Zhang, 2008), where the typing speeds are limited by the
input modality.
Provision next possible word/letter prediction
Many studies have explored adding word prediction to improve eye typing efficiency. With
the word predictions, the number of keystrokes that a user needs to enter for a certain
length of sentence is reduced. The user doesn’t have to enter every letter, as a result, the
times of wrong activated selection are reduced accordingly. MacKenzie and Zhang (2008)
introduced a weight-based algorithm, and compared letter and word prediction. With
weighting factors, the system recognized the input letter based on the combination of word
stem frequency and the distance from the current gaze position to adjacent letters. In
condition letter prediction, the next three possible letters were given and highlighted, and
the word prediction function was tested with five candidate words in the word prediction
condition. They found that letter prediction provided an effective method to advance
the typing performance of the interface with small buttons. And word prediction has an
advantage on the interface with big buttons. In addition to the traditional QWERTY
layout, there are also new gaze typing interfaces containing word predictions, such as
pEYEwrite (Urbina & Huckauf, 2010), GazeTalk (J. P. Hansen, Johansen, Hansen, Itoh,
& Mashino, 2003b), and SliceType (Benligiray, Topal, & Akinlar, 2019).
However, all the above mentioned methods are difficult to deploy in public display
due to the higher requirement for the accuracy of the gaze position data. The eye tracker
needs to be calibrated before use or re-calibrated during use, and this calibration process
is relatively boring and time-consuming. Although word prediction was considered in
pursuit-based typing interface (Abdrabou, Mostafa, Khamis, & Elmougy, 2019; Zeng &
Roetting, 2018). Zeng and Roetting (2018) reported that most participants were willing
to use the word prediction, and think it is helpful, but some of the participants tended
to ignore the existence of predictive words because it was not salient enough. And prior
studies have not been able to achieve a reasonable typing speed.
Word-level eye typing
Due to the limitation of eye-tracking accuracy, adjacent letters are easily activated by
mistake. Furthermore, as the number of predicted words increases, the probability of
32
hitting increases, but so does the time and cognitive effort of visual processing. Koester
and Levine (1997, 1998) found that the visual search took 150 milliseconds more for each
additional predicted word. The time to scan the predicted word list also needs to be taken
into account. On the other hand, the number of candidate words is limited by the size of
the display (Garay-Vitoria & Abascal, 2006; Garay-Vitoria & González-Abascal, 1997).
Thus, in addition, to provide word predictions, research utilizing language models to
improve eye typing performance has focused on the dwell-free eye typing methods (Kris-
tensson & Vertanen, 2012). In such a word-level typing system, users don’t have to select
letter by letter. The current typing word is predicted according to the gaze path that
the user glances at the characters comprising the word. Since the eye is always on, the
swipe-based method requires the user to specify the beginning and the ending of a word.
Kurauchi et al. (2016) used gestures to distinguish them. Some studies tried to distinguish
the beginning and the ending by the bi-modal method, which combined touch and gaze
input (Kumar, Hedeshy, MacKenzie, & Staab, 2020).
Some studies have abandoned the keyboard interface, the most famous of which is
Dasher (Ward et al., 2000; Ward & MacKay, 2002). It is an efficient dynamic typing
interface. According to the prediction of language model, the next possible letters move
from the right to the left of the screen. In this method, participants can reach a high
typing speed of 17.3 wpm after more than two hours of training (Tuisku, Majaranta,
Isokoski, & Räihä, 2008). However, since the typing system relies heavily on the language
model, which featured a word-level input. The user could not enter the words which are
not contained in the corpus.
Adjusting the detectability based language model
Since eyes need to switch between current typing characters and word prediction during
eye typing, there has been a lot of studies describing the role of the position for word
predictions. Diaz-Tula and Morimoto (2016) proposed Augkey, the related typing sug-
gestions were placed closer with the current selecting item to reduce the eye movements.
The results have shown that both typing speed and workload were improved with Augkey.
Conversely, Sengupta, Menges, Kumar, and Staab (2019) reported that they did not find
a significant difference regarding the position of word suggestions.
There was also research focusing on adjusting the detectability based on the language
model to achieve a reasonable text entry speed. In such studies, the dwell time can be
adjusted according to what has been entered (Mott et al., 2017; Pi et al., 2020).
33
3.3 Limitations of previous work
As mentioned in Chapter 1, calibration-free gaze interface has been a research topic of
great interest in gaze interaction fields. Since smooth pursuit eye movements can be
used without calibration in spontaneous gaze interaction, the intuitiveness of the gaze
interface design has been a topic of great interest in the human-computer interaction
field. However, since most related research focuses on curved smooth-pursuit trajectories
(Esteves et al., 2015; Khamis et al., 2018), the design issues of using linear trajectories
in gaze interaction are poorly understood. Hence, this dissertation will first explore the
user performance of gaze interfaces based on linear smooth pursuit eye movements.
Although research on eye typing has been going on for many years, it still lacks a a user-
friendly eye typing interface with high text entry speed. For eye typing interface based on
QWERTY, the learning effort is low, and the text entry rate is high. However, personal
calibration is necessary before use to ensure acceptable tracking accuracy. Additionally,
the accuracy of the gaze cursor is far from the mouse cursor due to the miniature eye
movements. Or if the accuracy of the eye tracker is reduced during use, both will cause
unintentional selections. To overcome the shortcomings, the size of each item of the eye
typing interface need to be designed quite large (Penkar, Lutteroth, & Weber, 2012). For
the alternative eye typing interface, some studies designed larger elements to overcome
the problem of accuracy (D. W. Hansen et al., 2008; J. P. Hansen et al., 2003a). Recently,
a considerable literature has grown up around the theme of designing calibration-free
eye typing interfaces for public displays, that aims to shorten or eliminate the personal
calibration. However, so far, the text entry rates of calibration-free eye typing interfaces
are still too slow. The SMOOVS reported that the typing speed ranged from 2.9 to 3.34
WPM, where the text entry speed was improved to 3.41 WPM with a similar layout based
on circle pursuit eye movements (Abdrabou et al., 2019). Bafna et al. (2021) tested a
new eye typing interface based on pursuit movement only using iPad’s front camera, and
achieved an average typing speed of 1.27 WPM.
Therefore, this dissertation will seek to design an eye typing interface that is more in
line with natural eye movements with high typing efficiency to make up for those short-
comings.
34
3.4 Summary
This chapter presents an overview regarding the main type of gaze interaction and the
design of feedback for gaze interface. Besides, the author summarizes the eye typing
literature from two aspects: eye typing interface and utilizing language model to improve
the typing performance. Lastly, the limitations of previous work are pointed out.
35
36
Chapter 4
Study 1: Gaze interfaces based on
linear smooth pursuit
Zeng, Zhe, Felix Wilhelm Siebert, Antje Christine Venjakob, and Roetting, Matthias.
Reprinted from, Calibration-free gaze interfaces based on linear smooth pursuit, Journal
of Eye Movement Research, 13(1). https://doi.org/10.16910/jemr.13.1.3.
A substantial portion of this chapter is based on the above paper.
4.1 Introduction
This study investigates the effect of number of objects and moving speed on user perfor-
mance in gaze interfaces where objects move linearly. The main goal of this chapter is
to develop a deeper understanding of the effect of number of objects and object moving
speed on the eye behavior and detection rates of gaze interfaces. This experiment was
conducted with the following hypotheses:
H1: The absolute angle difference between the target movement and gaze trajectory,
i.e., orientation error, will increase with an increased number of objects and conversely
decrease with a faster moving speed.
H2: The detection rates for objects will be different regarding the number of objects
and moving speed.
H3: Users prefer the gaze interface with fewer moving objects and a faster object
moving speed.
37
4.2 Empirical evaluation
This study was conducted in the Eye Tracking Laboratory of the Chair of Human-Machine
Systems at the Technische Universität Berlin.
4.2.1 Experimental stimuli
Five interfaces were developed (see Figure 4.1), which were implemented in Python with
the Tkinter GUI package.
The interfaces consist of multiple digits arranged in a circle and vary in the number
of digits presented. The ordering of the numbers was constant throughout the experi-
ment (increasing clockwise). The digits move outward in a linear trajectory with constant
speed. They are systematically placed, in varying degrees in relation to the center point
of the display (see Table 4.1 and Figure 4.1).
Table 4.1: The angular range for one object
Number of objects 6 8 10 12 15
Angular division 60◦45◦36◦30◦24◦
The diameter of the circles around the digits is 44 pixels (1.13◦). The distance from
the center of the display to each circle is 150 pixels (3.87◦). The interaction with the
interface consists of two steps. First, users need to visually perceive the information and
search the digit that they want to select. In the second step, users need to follow the
given digit with their eyes, while all digits move outward.
Since users’ capacity for visual processing is limited, only a certain number of items can
be processed at the same time. Research suggests that humans can visually process 20-50
items per second (Wolfe & Horowitz, 2017). Thus, the digits in the tested interface start to
move after 800 ms, ensuring that users have appropriate time to search for a target digit.
The gaze points were recorded after the objects start to move. Based on similar research
by Vidal et al. (2013) and conventional dwell-time based gaze interactions (J. P. Hansen
et al., 2003a), the duration of the outward movement of digits is set to 500 ms. The
recording was ended when the objects stop to move.
38
Figure 4.1: Interface layouts before outward movement of digits. The interfaces contain
6, 8, 10, 12, and 15 moving objects, respectively.
39
4.2.2 Experimental design
This experiment featured a 5×2 within-subjects design. The first factor (number of
objects) was varied five folds, i.e., 6, 8, 10, 12, or 15 objects in the interface. The
factor (moving speed) was varied two-fold, as objects moved either with 7.73 ◦/s (300
pixel/s) or with 12.89 ◦/s (500 pixel/s). All experimental conditions were repeated 12
times - that is to say, each participant performed 120 trials in total.
As dependent variables, two categories of variables were collected, performance mea-
sures and subjective experience. For users’ performance, orientation error and detection
rates were registered. Orientation error describes the absolute angle difference between
the target movement trajectory and the regressed line calculated from eye movement
data. The correct detection rate is the ratio between the trials of correct detection and
total trials for each participant in each condition. The false detection rate refers to the
percentage of wrongly detected trials for each participant in each condition, e.g., when
the eye trajectory was matched to target “1”, although participants were asked to follow
target “2”. For subjective experience, a semi-structured interview was conducted after the
experiment. Participants were asked about their preferences for the interfaces regarding
the number of objects and object moving speed. They were also asked about possible
reasons for their preference.
4.2.3 Participants
In this study, 25 participants (11 female and 14 male) were recruited. Their age ranged
from 21 to 46 years old, with a mean of 29.56 years. 17 participants had normal vision
while eight used vision aids during the study (four wore contact lenses and four wore
glasses). Six of the participants had previous experience with gaze interaction. Two
participants were left-handed, and 23 participants were right-handed. Participants were
rewarded with five Euro per visit or alternatively a certification of student experimental
hours for attendance.
4.2.4 Apparatus
A Tobii EyeX screen-based remote eye tracker with a sampling rate of 60 Hz was used
to record participants’ eye-movement data. The eye tracker was mounted beneath a 24-
inch Dell monitor with a resolution of 1920 * 1200 pixels (see Figure 4.2). All data were
collected without any prior calibration phase for individual users. The eye tracker was
40
calibrated by a third person and this setting was used for all the participants. Across all
participants, the average gaze estimation error of the eye data was 4◦visual angle (SD =
1.66). The average distance between the participants’ eyes to the display was 60 cm (SD
= 2.67). A chin rest was attached to the edge of the table, corresponding to the horizontal
center of the eye tracker and the monitor. The chin rest was used to prevent participants
from leaning too close to the display and maintain a constant viewing distance throughout
the experiment.
Figure 4.2: Experiment settings
4.2.5 Procedure
The experiment lasted approximately 30 minutes. After being welcomed, participants
were provided with written information about the experiment and asked to read it care-
fully. Then, the experimenter explained the eye tracker to participants. Any questions
about the experiment were clarified before the participants signed the Informed Consent
Form. Participants were asked to complete questionnaires including demographic infor-
mation, experience about eye tracking, and gaze interaction. Afterward, participants were
41
instructed to adjust the chin rest to a comfortable height.
The experiment consisted of one training session and two subsequent test sessions. In
each session, a target number was displayed in the center of the screen before the start
of individual trials. The task for participants was to find the given number in the digits
circle and to follow the outward-moving target number with their eyes. The moving
objects were displayed after the target number was shown 3 seconds. In the training
session, participants could try out the experimental tasks without data being recorded
and familiarize themselves with the interface. Once they fully understood the task, the
test sessions were started. In order to balance practice and fatigue effect, the sequence of
object moving speed and number of objects was fully randomized. For each experimental
condition, i.e., for a given number of objects and a given speed, 12 digits had to be
selected. The sequence of the 12 digits was randomized between participants to prevent
the effects of sequence. For the interface consisting of 15 digits, not all digits presented in
the interface had to be selected in the task. To minimize the potential effects of fatigue,
participants took a short break between the two test sessions. A semi-structured interview
was conducted after the experiment.
4.3 Classification algorithm
A large body of work investigated the detection algorithm used for pursuit-based gaze
interface, they mainly include Pearson correlation, Euclidean distance and calculating the
angle between two vectors (summarized in Section 3.1.3).
Since the Euclidean distance is more appropriate for detecting slower moving objects,
usually slower than 5-6 ◦/s (Drewes et al., 2018), the testing object moving speeds are
relative fast (7.73-12.89 ◦/s) in this study, this method was not included in the evaluation.
In this study, orthogonal distance regression is introduced to analyze orientation error
and detection rates regarding number of objects and moving speed. Besides, the method
based on orthogonal distance regression is compared with other two other frequently used
methods (Pearson correlation, and angle between two vectors) regarding overall correct
detection rate.
4.3.1 Detection based on orthogonal distance regression
A regression model was introduced to detect the linear movements to evaluate the gaze
trajectory based on smooth pursuit eye movements. This algorithm is based on Orthogo-
42
nal Distance Regression (ODR). This study utilized ODR to estimate a linear regression
model from gaze trajectory. The orthogonal distance ris defined as the distance from
the point to a linear model. Using ODR, both errors of Xand Ygaze coordinates were
taken into account. The function for the calculation of orthogonal distance rwith gaze
data is:
r2
i=min 2
i+δ2
i
subject to: Yeye,i =f(Xeye,i +δi;βi)−i
(4.1)
where Xeye = [x1, ..., xn]and Yeye = [y1, ..., yn]are the horizontal and vertical coordinates
of gaze data. The random error δiand iare corresponding to Xeye,i and Yeye,i, respectively.
And frefers to a function of Xeye,i with parameters set βi.
A linear model based on the gaze data was derived using ODR. The mean µand
standard deviation σof the orthogonal distances were calculated. The data points are
removed when riis three standard deviations away from the mean. The program itera-
tively estimate a linear model until no further data points are removed. The angle θwith
regard to the linear model is converted from the following function:
θ= arctan 2 (yend −ystart, xend −xstart)(4.2)
where xend and xstart refer to the horizontal coordinates of end and start gaze points.
And yend and ystart are the values of the function 4.1 with xend and xstart as input of the
function, respectively.
An angular criterion was used for target detection. A visualization of the criterion is
presented in Figure 4.3. The black lines show the trajectories of moving digits. The colored
lines are trajectory examples of pursuit eye movements from one participant. The gray
corridors show the angle ranges that are assigned to individual digits. Gaze trajectories
with angles located between two gray angle areas will be recognized as not detected. If
the detected gaze angle is within a certain range, it will be detected as the corresponding
object. A small angle range (α◦) is defined as a buffer in the middle between two object
classification angle areas. If the detected gaze angle is located in this interval, the system
will recognize it as not detected, i.e., miss detection.
4.3.2 Detection based on angle between two vectors
Similarly to detection method based on orthogonal distance regression, this vector-based
detection refers to the angle between the positive x-axis and the ray to gaze point (xeye, yeye)
43
Figure 4.3: Visualization of the gaze and object trajectories
and middle point (xmid, ymid). In this method, ymid and xmid are the average values of
x, y gaze coordinates when eye looks the middle of the screen. The mode of those an-
gles [θ1, ..., θn] is calculated as the selected angle. The angle is calculated by the function:
θi= arctan 2 (Yeye,i −ymid, Xeye,i −xmid), i = 1, ..., n
θmode = [θ1, ..., θn](4.3)
4.3.3 Detection based on pearson correlation
Pearson correlation is the most frequently used method to detect gaze input based on
smooth pursuit eye movements. To select an object, users need to follow a moving object
in the interface with their eyes, and the resulting gaze trajectory is then compared against
the trajectories of moving objects present on the display. Since the detection of trajectories
is invariant against its origin location, an individual calibration phase is not required
(Vidal et al., 2013). The formula for the Pearson correlation is:
44
rx=Pn
i=1(Xeye,i −Xeye,i)(Xobj,i −Xobj,i)
qPn
i=1 (Xeye,i −Xeye,i)2qPn
i=1 (Xobj,i −Xobj,i)2
(4.4)
where Xeye,i = [xeye,1, ..., xeye,n]and Xobj,i = [xobj,1, ..., xobj,n]are the vertical coordinates
of gaze data and one object, separately. The coefficients rx, ryare calculated separately
for horizontal and vertical positions for all objects, i.e., for the interface containing six
moving objects, six pairs of [rx, ry] are calculated. The closer the coefficient is to 1, the
stronger the correlation between the two time series, and therefore the more similar the
trajectory of the eye is to the trajectory of the object. The object with highest rx, ryand
both rx, ryare above a threshold is detected as following target.
The correlations between gaze and objects coordinates were calculated and the division-
by-zero error occurs when objects move in horizontal or vertical direction (Drewes et al.,
2018). Thus, to both x- and y-coordinates, small random errors were added (M= 0, SD
= 0.00001).
4.4 Results
The gaze data collected during the experiment was analyzed offline. In all, there are 3000
trials (2 moving speeds * 5 number of objects * 12 repeated times * 25 participants).
The figures visualized the moving trajectories from one participant for each condition are
shown in Appendix A.2. The orientation error, correct detection rate and false detection
rate were evaluated with repeated measures ANOVA at a significance level of α= 0.05.
The Mauchly’s test of sphericity was non-significant, so that no correction was needed. A
set of Bonferroni corrected t-tests was conducted for pairwise multiple comparisons.
4.4.1 Velocity
The moving velocity of eye and objects is visualized in Figure 4.4. Approximately 150 ms
latency was recorded for the onset of pursuit eye movements under both faster and slower
moving speed conditions, which contained a certain amount of systematic recording error.
However, the difference in velocity among the number of objects is quite small. Earlier
research has shown that the evocation of smooth pursuit eye movements has a latency
range from 80 to 130 ms in relation to the start of object movement (Lisberger, 2015;
Robinson, 1965). There is a relatively long pursuit latency, and the eye begins to move
later than the moving object. Once the eye starts to move, there are several saccadic
45
(a)
(b)
Figure 4.4: The velocity of eye and moving objects, the black line shows the moving
velocity of the object, the colored lines show the moving velocity of eye regarding different
number of objects in the interface.
46
movements, allowing the eye to catch up with the moving object (de Brouwer, Yuksel,
Blohm, Missal, & Lefèvre, 2002; Rashbass, 1961). In this study, the classification algo-
rithm took this pursuit latency into consideration. Hence, the first 100 ms gaze data of
each trial were discarded, and only the last 400 ms gaze data were analyzed.
4.4.2 Orientation error
Orientation error refers to the absolute angular difference between the target and eye
movement trajectory. The orientation error was calculated based on orthogonal distance
regression (ODR), the analysis focuses on the performance of smooth pursuit eye move-
ments under a low spatial accuracy. Figure 4.5 shows the distribution of orientation errors
for all participants and all trials of the experiment.
Figure 4.5: The distribution of raw orientation error throughout the experiment
In order to gain further understanding of the orientation error, the outlier data which
are three standard deviations away from the mean was excluded. The remaining 2926
trials were used for the analysis for orientation error. The descriptive statistics for all
experimental conditions are visualized as boxplots in Figure 4.6.
Descriptively, the mean orientation error decreased from 6 objects to 12 objects, while
increasing again for 15 objects. Nevertheless, it can be observed that the mean orientation
47
Figure 4.6: Orientation error in the experimental groups for filtered data. The green
triangles represent averages.
error was higher in experimental conditions with slower moving speed than in conditions
with faster moving speed. This descriptive difference can be observed in overall conditions,
irrespective of the number of objects presented in the interface.
The ANOVA confirms this effect, the results show that there was no significant main
effect of the number of objects on participants’ orientation error (F(4,96) = 1.27, p =
.29, η2
p=.05). However, participants’ mean orientation error for slower moving speed
was significantly greater than faster moving speed (F(1,24) = 30.99, p < .001, η2
p=.56).
There was no significant interaction between number of objects and moving speed for
correct detection rate (p>.05).
The mean orientation error for the different objects (i.e., different directions) in each
interface is presented in Figure 4.7. The orientation error for faster moving speed was
generally smaller than the orientation error for slower moving speed. This effect is most
pronounced in interfaces with small numbers of moving objects, but becomes less pro-
nounced in interfaces with a large number of moving objects. In addition, the orientation
error of some diagonal directions was found larger than cardinal directions for interfaces
with 8, 12 and 15 moving objects.
48
Figure 4.7: Average orientation error for all interfaces. Green dashed line refers to orien-
tation error of slower moving speed, and red line refers to faster moving speed.
49
4.4.3 Detection rates
This section reports the evaluation whether the target was detected as being followed,
i.e., correct detection, detected as another object, i.e., false detection, or no object was
detected, i.e., missed detection. The detection rates were analyzed using the raw data set
with the consideration of all human errors, i.e., no data was excluded from the analysis.
There are parameters that affect the target detection in pursuit-based interfaces, such
as moving speed, number of the moving objects, threshold value of the detection method,
and window size for detection. The window size was evaluated in previous study (Vidal et
al., 2013), and 500 ms is considered as a proper size. Besides, due to the latency of smooth
pursuit eye movements, varied from 80 to 150 ms (Lisberger et al., 1987; Rashbass, 1961;
Westheimer & McKee, 1978), the window size is set to all methods 400 ms in this study.
In the detection method based on orthogonal distance regression, a linear model was
estimated from gaze data, gaze angle is defined as the angle between the estimated line
and the x-axis that is calculated using arctan2. When the detected gaze angle is within
a certain range, it will be detected as the corresponding object. There is a buffer angle
range (α◦) in the middle between two object classification angle areas to prevent the false
selection of boundary objects. If the detected gaze angle is located in this interval, no
object will be detected. Five buffer angles were tested in this study, namely 3◦, 5◦, 7◦,
9◦and 11◦. Figure 4.8 shows that the correct and false detection rates decrease with
increasing buffer angle range, whereas the miss detection rate increases with increasing
buffer angle range. Since there are up to 15 objects in the gaze interface, a relatively
small buffer range (α= 5◦) was selected for further analysis. The size of detectable range
for different interfaces is showed in Table 4.2.
Table 4.2: Detectable range for different interfaces
Number of objects 6 8 10 12 15
Angular division 60◦45◦36◦30◦24◦
Size of detectable range 55◦40◦31◦25◦19◦
Correct detection rate
The grand mean (M) for the correct detection rate was 0.82. In Figure 4.9, the descriptive
statistics of the correct detection rate are presented for each condition.
The mean of correct detection rates generally decreased with an increasing number of
50
Figure 4.8: The ODR detection rates regarding different buffer angle range
objects. The conditions with faster moving speed had a greater mean for correct detection
rate than the conditions with slower moving speed.
The ANOVA shows that there was a significant main effect of the number of objects
on participants’ correct detection rate (F(4,96) = 27.62, p < .001, η2
p=.54). Moreover,
participants’ mean correct detection rate for faster moving speed was significantly higher
than the mean correct detection rate for the slower moving speed (F(1,24) = 13.93, p <
.01, η2
p=.37). There was no significant interaction between number of objects and moving
speed for correct detection rate (p>.05).
As the results showed a significant effect of the number of objects upon correct de-
tection rate, a set of post hoc t-tests was conducted to determine differences between
levels. The correct detection rate for 15 objects (M= 0.69, SD = 0.03) was signif-
icantly smaller than those for 6 objects (M= 0.91, SD = 0.02, p < .001), 8 objects
(M= 0.89, SD = 0.02, p < .001), 10 objects (M= 0.82, SD = 0.03, p < .001), 12 objects
(M= 0.80, SD = 0.03, p < .01). The correct detection rate for 6 objects was significantly
higher than those from 10 objects (p<.05) and 12 objects (p<.001). Additionally, there
were significant differences between 8 and 12 objects (p < 0.01).
51
Figure 4.9: Correct detection rate. The green triangles represent averages.
False detection rate
A false detection is registered when a gaze trajectory is detected as an object that is not
the target object. The false detection rate describes the ratio of all false detections of all
presented trials.
The grand mean (M) for false detection rate was 0.12. The descriptive statistics of false
detection rates for all experimental conditions are presented in Figure 4.10. The mean of
false detection rates increased with an increasing number of objects. The conditions with
slower moving speed had a greater mean for false detection rate than the conditions with
faster moving speed.
To gain further understanding of false detected trials, Table 4.3 compares the number
of trials which were falsely detected as adjacent digits (i.e., eye followed “2”, but the gaze
trajectory was detected as adjacent digits “1” or “3”.) and the number of all trials which
were falsely detected.
The ANOVA proved that there was a significant main effect of the number of objects
on participants’ false detection rate (F(4,96) = 8.85, p < .001, η2
p=.27). Meanwhile, the
object moving speed significantly affects the false detection rate, (F(1,24) = 6.99, p <
.05, η2
p=.23). There was no significant interaction between the number of objects and
52
Figure 4.10: False detection rate. The green triangles represent averages.
moving speed for false detection rate (p > .05). The results of statistical significance are
summarized in Table 4.4.
Pairwise t-tests show that the participants had a significantly higher false detection
rate with 15 objects (M= 0.17, SD = 0.03) than with 6 objects (M= 0.07, SD =
0.02, p < .01), and 8 objects (M= 0.08, SD = 0.02, p < .01). Moreover, there were
significant differences between 6 and 12 objects (M= 0.13, SD = 0.02, p < .01).
4.4.4 Subjective evaluation
Participants were asked to indicate their preference regarding the number of objects and
moving speed. They can choose one or more options. Results of participants’ preference
are presented in Table 4.5. The interface with 12 moving objects was preferred by 10
participants. Open answers to their reasoning behind this preference revealed that par-
ticipants felt that this interface is similar to the clock face. Five participants chose the
interface with 6 moving objects. The four participants who indicated a preference for an
interface with less than 6 objects reported that they would find it easier to identify the
target objects and expected that it would be easier to follow the objects on the interface
if there were less than 6 objects on it. In addition, the interfaces with 10 and 15 moving
53
Table 4.3: Count for trials which were detected as adjacent digits and all false detected
trials for all experimental conditions.
Number of objects
6 8 10 12 15
Slower-300 pixels adjacent 23 22 35 29 37
all 31 28 45 39 55
Faster-500 pixels adjacent 9 19 24 36 37
all 12 22 30 41 48
Table 4.4: Study 1 - Statistical significance of main and interaction effects
regarding objective measures.
Main effect of Main effect of Interaction
number of objects moving speed effects
Orientation error n.s.1p<.001 n.s.
Correct detection rate p<.001 p<.01 n.s.
False detection rate p<.001 p<.05 n.s.
1‘n.s.’: not significant
objects were chosen by two participants, respectively.
For the subjective experience of the objects’ moving speed, more than half of the
participants did not report a preference, since they did not perceive a difference in moving
speed. Eight participants preferred slower moving speed while three participants chose
faster moving speed as favorite. When asked for reasons behind their preferences, those
participants with a preference for slower moving speed reported that it easier to follow
slower moving objects and require less effort.
4.4.5 Comparison of three detection methods
In addition to detection method based on orthogonal distance regression, two more de-
tection methods are briefly discussed in this study.
Angle between two vectors
Consistent with method based on orthogonal distance regression, five buffer angles were
tested in this angle-based detection method, namely 3◦, 5◦, 7◦, 9◦and 11◦. Figure 4.11
54
Table 4.5: Subjective feedback from participants
User Preference Count
Speed Faster moving speed 3
Slower moving speed 8
Feel no difference in speed 14
Number of objects Less than 6 objects 4
6 objects 5
8 objects 4
10 objects 2
12 objects 10
15 objects 2
More than 15 objects 0
shows that the correct and false detection rates decrease slightly with increasing buffer
angle range, whereas the miss detection rate increases slightly with increasing buffer angle
range. Same as the detection based on orthogonal distance regression, the buffer range (α
= 5◦) was selected for further analysis. The size of detectable range for different interfaces
is showed in Table 4.2.
Pearson correlation
In the offline data analysis, the detection method was computed with five correlation
threshold levels: 0.5, 0.6, 0.7, 0.8, 0.9, i.e., the object with highest rx, ryis detected as
following target, only when both rx, ryof are above the predefined threshold. The Figure
4.12 illustrated the correct, false and miss detection rates. Lower thresholds result in both
higher correct and false detection rates, whereas higher thresholds lead to higher missing
detection rate. The correlation threshold of 0.8 was chosen in many studies (Drewes et
al., 2018; Esteves et al., 2015), which also applies to the linear moving data-set in this
study. Thus, the threshold of 0.8 was selected for further analysis.
However, this detection method entails major disadvantages that it is hard to detect
when there are the objects in the direction of the axis and diagonal lines simultaneously.
For some objects, the condition that both the highest correlation coefficient of corrxand
corryfor the objects need to above a threshold is difficult to meet. The interface with
8 objects was further analyzed regarding the moving directions (see Table 4.6). The
detection rates are higher than 0.8 for objects moving in diagonal directions, whereas the
55
Figure 4.11: The angle-vectors detection rates regarding different buffer angle ranges
Table 4.6: The correct detection rate of interface with 8 objects in directions
Number in interface Correct detection rate
0 0.0
1 0.87
2 0.0
3 0.82
4 0.0
5 0.88
6 0.0
7 0.82
detection fails to detect the objects moving in cardinal directions.
Comparing the three methods
The correct detection rate was compared regarding the three detection methods, i.e.,
vector-based, ODR-based and Pearson-correlation-based. From the above comparison of
56
Figure 4.12: The detection rates based on different correlation thresholds
the parameter ranges, the buffer range (α= 5◦) was selected for vector-based and ODR-
based detection. The correlation threshold of 0.8 was selected for Pearson correlation.
The results shows that the correct detection rate continues to decrease as the number
of objects increases for all three methods (see Figure 4.13). In particular, the correct
detection rate decreases dramatically from 6 to 10 objects for method based on Pearson
correlation. The vector-based method achieves the highest correct detection rate. Besides,
the correct detection rate of methods based on angle and orthogonal distance regression
are much higher than Pearson correlation method. For angle- and ODR-based methods,
the interfaces with 6 and 8 moving objects achieve a relative higher detection rate.
4.5 Discussion
This study analyzed how user performance is influenced by the number of objects and
object moving speed for gaze interface based linear pursuit movements. Additionally,
three detection methods were compared.
The first hypothesis anticipated that the orientation error would increase with an
increased number of objects and conversely decrease with moving speed. This hypothesis
57
Figure 4.13: The comparison of detection rate among detection methods
was partly confirmed. The results found that the orientation error did not significantly
differ between interfaces with varying number of objects. In other words, the pursuit eye
movements are not distracted by the increasing number of moving objects and the number
of objects has little effect on how well the eye follows the moving target.
At the same time, the orientation error for interfaces with faster moving speed of
objects was significantly smaller than for interfaces with slower moving speed. For slower
moving objects, the gaze trajectory does not follow the object path as closely as with
faster moving objects. A possible explanation for this difference between moving speed
conditions might be that with the same recording time and sample size, the moving
distance of faster moving condition is longer than the slower one, thus the ODR regression
model for trajectories of faster speed performs better than that of the slower ones.
Additionally, most orientation errors are located in an angle range between 0-30◦.
There is only a small number of orientation errors larger than 30◦, which are likely caused
by participants’ distraction or a participant’s inability to locate a target object. Since
the gaze data were collected using an eye tracker without individual calibration, the
orientation errors occurring within this range are mainly due to the accuracy of measuring
equipment.
58
Although there was no individual calibration for each participant, overall correct detec-
tion rates were high. On an individual level, only one participant had a correct detection
rate lower than 50%, most likely caused by very thick glasses. For more than two-thirds
of false detections, an adjacent digit was detected. The second hypothesis expected that
the detection rates for objects will be different regarding number of objects and moving
speed. The results found that the correct detection rate decreased significantly while the
false detection rate increased significantly with increasing number of objects in the inter-
face. On the question of differences in object moving speeds, this study found that the
correct detection rate increased significantly from 300 pixel/s to 500 pixels/s. The false
detection rate is significantly lower for faster moving speed than slower one. Furthermore,
the ratio of trials detected as adjacent digits to all false detection trials is relatively higher
for faster than slower moving speed.
The comparison among levels for the number of objects showed that the decrease in
correct detection rate was slow between 6 and 8 objects as well as between 10 and 12
objects. But the decrease was larger between 8 and 10 objects as well as between 12 and
15 objects. The correct detection rate of 15 objects was significantly different compared
with interfaces with lower number of objects. No significant difference was found between
6 and 8 numbers for both orientation error and detection rates. These differences in
detection rates regarding related to the number of objects in the interface may be caused
by trials in which participants were not able to find the position of a target object among
other objects presented. However, these differences could also be caused by the decrease
in the detectable angle range and the limitation of the spatial accuracy. The detectable
range was gradually decreasing with the increasing of objects number. For example, the
detectable angle range for each object was 55◦for interface with 6 objects, but the range
reduced to 19◦for interface with 15 objects. Although the target was well followed by the
eye, the gaze trajectory could be detected as adjacent objects when the detectable angle
was too small and the spatial accuracy was low.
The third hypothesis expected that participants would prefer interfaces with fewer
objects and fast-moving speed. Our results were inconsistent with the hypothesis. While
some participants preferred interfaces with fewer moving objects, the interface with 12
objects was the most preferred interface. The position of the objects played an important
role in the subjective evaluation of the interfaces. The similarity of the 12-object interface
to a clock face led a number of participants to report a familiarity between the experimen-
tal interface and a clock. Concurrently, this might have helped participants to find the
target more easily. Future studies should investigate this influence of interface-familiarity
59
on user preference and interaction performance. While a number of participants did not
consciously register the difference in moving speed of objects, some of them preferred
slower moving speeds. This is an interesting finding, as it reveals that the subjective
experience of users in gaze-based interfaces is not directly linked to a higher performance
while using the interface.
While researchers investigated gaze interaction based on smooth pursuits eye move-
ments, they were mainly interested in circular and square-shaped target trajectories
(Drewes et al., 2018; Esteves et al., 2015). The detection algorithm used is mainly based
on Pearson’s product-moment correlation in previous researches. In this study, the linear
regression method, vector-based method and Pearson correlation method were compared.
The results shows that the detection rates of linear regression method and vector-based
method outperform the correlation method regarding detection multiple linear moving
objects.
4.6 Summary
In this chapter, a controlled laboratory experiment was conducted to evaluate the effect
of objects number and object moving speed on interaction based on linear smooth pursuit
eye movements with no individual calibrated eye tracker. When comparing the number
of objects, there was only a little difference in orientation error, but the detection rates
decreased with an increasing number of objects. The results found the detection of faster
moving speed was better than the slower ones. Overall, both the 6 and 8 objects interface
with a faster moving speed yielded good user performance. In previous works, six moving
directions were frequently used in linear smooth pursuit based interface (Lutz et al., 2015;
Zeng & Roetting, 2018). This study shows that the difference between 6 and 8 objects
is not significant, both can be well detected by the system. Therefore, it is possible to
extend the moving directions of cluster to improve the flexibility of the gaze interface.
60
Chapter 5
Study 2: Design eye typing interface
based on hybrid eye movements
5.1 Introduction
This chapter proposes a new concept to design a hybrid gaze interface, which is featured
with a two-stage selection and one-point calibration. To select an object, the user needs
first to look at this object for a certain long time to activate it. Then, the user selects
this character by following its movement. Figure 5.1 gives a metaphor in daily life and
illustrates the interaction process. When people observe a bird on a branch, at that
moment, the bird flies off, and people’s eyes will naturally follow the trajectory of the
flying bird. The eye changes quickly from the state of fixation to catch up saccade, and
then to smooth pursuit eye movements. If the bird disappears from the field of view, people
will scan the environment by saccadic eye movements to re-position the bird (shown in
Figure 5.1a). Similar to the scenario in Figure 5.1a, the user selects a charterer by first
looking at the character and then following its movement to select it (shown in Figure
5.1b).
Compared to the previous pursuit-based interfaces, only the objects within the cur-
rent attention area are moving and interactable, and users can select the target item by
following its moving trajectory. The interface can be adapted to low-accuracy equipment
environments and also minimizes distractions to users’ attention. A novel eye typing
interface is designed based on this hybrid concept, to create a robust gaze-enabled text
entry system.
Additionally, previous research has mainly focused on interface design and algorithm
development, but has neglected to provide feedback for dynamic eye typing interface.
61
Gaze point
Trajectory
(a)
AA
A
B
C
D
B
C
D
(b)
Figure 5.1: The hybrid interaction design. The red circle represents the current gaze
position, the gray dashed line represents the moving trajectory of the object.
Thus, this study further investigates the effect of feedback on typing performance and
experience of the interface. Different modalities, i.e., visual only, auditory only, combined
visual and auditory and no feedback are compared in a user study.
5.2 Interface design
As mentioned before, the visual search for information as well as the selection of intended
objects is combined in gaze-based interfaces. In some dynamic gaze interfaces, user can
interact with an element by following the movement of this element with the eyes. In
Pursuits (Vidal et al., 2013), the moving objects are distributed across the screen. The
threshold of the time window and correlation coefficient is used to activate a command.
That is, if the correlation coefficient exceeds the threshold within the set time window,
an element is selected when the eyes gaze at anywhere on the screen. Lutz et al. (2015)
further distinguished two areas of the interface. When the gaze point is registered in
the deactivated area in the center of the screen, all characters remain static, selection
can be triggered only when the gaze is out of the deactivated area. In human-computer
interaction, perception happens before reaction in most cases, and interactive content is
generally located in area of attention. Thus, in this eye typing interface, the area where
the user is currently looking is defined as a dynamic area, which is located in the activated
62
area. Only objects within the attention area move. Other objects stay still to minimize
distractions. The iteration process is shown Figure 5.2.
activated/dynamic area
Pursuits
deactivated
area
activated/dynamic area
SMOOVS
gaze point
deactivated
area
activated area
This Study
dynamic area
Figure 5.2: Differences in active interface areas in three pursuit-based gaze systems.
In this study, a new gaze typing interface is designed with the goal to facilitate an
63
interaction that is perceived as natural as possible by users. Similar to the layout of the
Hex-o-Spell (Blankertz et al., 2007), SMOOVS (Lutz et al., 2015) and pEYEs (Huckauf
& Urbina, 2008) interfaces, this new interface utilizes a two-stage concept, where users
select a cluster of characters in a first step, followed by the selection of an individual
character in a second step.
5.2.1 Layout
The text entry interface is based on an octagon-like layout with eight clusters that allow
users to input different characters (see Figure 5.3).
Figure 5.3: The illustration of eye typing interface, the background color is black and the
character color is white in real typing system. The gray circular area is idle area, i.e.,
deactivated area.
The SMOOVS interface consists of six clusters, five of which contain six characters.
64
As Lutz et al. (2015) reported, the accuracy of gaze data decreased significantly from the
middle to the edge of the screen after one-point calibration, i.e., the interactions in the
central area of the interface are registered with higher accuracy. Hence, a larger number
of objects can be distinguished in the first stage of detection which is placed more central
in the interface, than in the second stage which is placed on the outside of the interface.
Accordingly, the number of clusters could be increased in the first stage, thereby the
number of characters in each cluster could be reduced in the second stage. In addition,
the Study 1 in Chapter 4 found that the detection rates are higher for both interfaces
with the 6 and 8 moving directions. Therefore, in this study, the number of clusters that
are selected in the first stage was increased from 6 to 8. With the increased number of
clusters, the number of character tiles in a cluster is reduced from 6 to 4, decreasing the
required accuracy for character distinction in the second stage. With fewer tiles in the
cluster, users also require a shorter time for searching and identifying the desired letters
in a cluster.
Regarding the direction of movement, studies have shown that the detection accuracy
for smooth pursuit eye movements is influenced by the movement direction. Horizontal
and vertical directions are more robustly detected than diagonal directions (Ke et al.,
2013; Krukowski & Stone, 2005). Thus, in order to achieve a more robust identification
of gaze directions, characters in each cluster are designed to move along the horizontal
and vertical axis, e.g., in cluster “ABCD", “A" moves left, “B" moves up, “C" moves right
and “D" move down. In this study, 1◦visual angle corresponds to 39 pixels at a distance
of 60 cm from the user to the screen.
The alphabetical A-Z layout was used in this study, which contains 26 letters, as well
as delete, space key, and question mark (see Figure 5.4a). According to (LetterFrequency,
2021), the space key is the most frequently used key, hence it is located in a vertically
downward direction of the interface as a single actionable item (not in a cluster). While
the key is mainly used to enter a space between words, it also serves as a way to confirm the
currently entered word and indicate the start of an entry for a new word. Once the space
key is used, typed words are moved from the center to the bottom of the screen (where the
word “YOU" located in Figure 5.4a). The delete key is responsible for correcting typed
characters.
65
5.2.2 Interaction design
When a user looks at the central area of the screen, e.g., when reading the typed charac-
ters, the interface remains static. When a user’s gaze is registered outside of the deacti-
vated central area, the detection of cluster selection is activated.
Stage 0: Idle state.
When the distance between the gaze point and midpoint of the screen is less than the
radius of the idle area (see Figure 5.3), no action will be activated. The detection happens
only when the eye position is out of the idle area. However, whenever the eyes move back
to the idle area during the interaction, all elements in the interface will gradually move
back to the their original positions and the system returns to the initial idle stage until
user looks at the area beyond the inner idle circle.
Stage 1: Searching and cluster selection.
In this stage, all clusters remain static. The user needs to find the cluster which includes
the desired character. The cluster in which the gaze is registered is highlighted (see
example "EFGH" cluster in Figure 5.4a). However, if the user’s gaze moves out of the
area of this cluster, the highlighting is reversed. If the gaze moves to another cluster, that
new cluster is highlighted, allowing users to adjust their selection before the activation of
an input.
Stage 2: Character selection.
After a cluster is selected, the second stage begins (for more details on the classification
algorithm, see Subsection 5.2.3). In this stage, the characters in the selected cluster
move outward. Other clusters that were not selected fade out gradually at the same time
to avoid visual distraction. That is, the active cluster is highlighted and the entailed
characters move outward, while the rest of the clusters fades out. The typing interface
at the end of stage 2 is shown in Figure 5.4b. The moving distance of characters is 75
pixels, if the user looks at another cluster when the characters in the selected cluster are
moving, those characters will move back to their original position. After the characters
in the selected cluster have moved 75 pixels, they fade out and then reappear in their
original position (i.e., where they were before moved). If a character is successfully
detected, feedback is given simultaneously. Detailed information can be found in the
subsection 5.2.2.
The interaction flow is shown in Figure 5.4. When user try to enter the phrase “you are
a wonderful example". The typed word (“YOU") is located at the bottom of the screen.
The current typed characters (“AR") are displayed in the center of the deactivated area.
66
(a) (b)
(c) (d)
(e)
Figure 5.4: The interaction flow
67
To Enter “E", the eyes first start looking at the “EFGH" cluster, and this cluster is
highlighted, illustrated in Figure 5.4a. The characters in selected cluster moved outward
in pursuit stage. Eyes need to follow the movement of letter E to enter it, Figure 5.4b
shows the end of pursuit stage. If Letter E is correctly detected, it will be shown in the
center of the screen (see Figure 5.4c). After that, user can continually follow the space key
to move the current typing word to the bottom area and enter space between words (see
Figure 5.4d and Figure 5.4e).
Feedback design
Feedback was given to confirm that the system is responding to the input, visualize reac-
tions, and confirm an activated action. The feedback factor consists of four levels, namely
no feedback, visual feedback, auditory feedback, and both visual-auditory feedback. Vi-
sual feedback appears behind the selected character in a gray circle (the diameter of the
circle is 44 pixels, approximately 1◦). Auditory feedback was given via computer speak-
ers. The auditory feedback sound was a short continuous beep at 300 Hz. The visual and
auditory feedback is synchronized and lasts approximately 200 ms. Figure 5.5 illustrates
four feedbacks. To avoid the effects of sequence caused by fatigue and practice, the order
of conditions was randomized.
🔊
🔊
(1)
(2)
(3)
(4)
Figure 5.5: The provision of feedbacks is illustrated, from left to right, they are no
feedback, only visual feedback, only auditory feedback and both visual-auditory feedback.
5.2.3 Classification algorithm
In the stage 1 (searching and cluster selection), the detection of a cluster is based on
the midpoint of screen (M), gaze point (G) and the positive direction of the x-axis (as
68
shown in Figure 5.6). The angle between the vector between two points and the positive
x-direction is given by function 5.1 and converted from radian to degree.
θg= arctan 2 (yg−ym, xg−xm)(5.1)
where xgand ygare the coordinates of the eye position, xmand ymare the midpoint
coordinates of a cluster. The detectable range for each cluster is 45◦in first stage. When
the angle meets the angular criterion of one cluster θu< θg< θl, it is recognized as the
corresponding cluster. When there are more than two successive measurements detected
as the same cluster, i.e., θg,i =θg,i+1, this cluster is highlighted to inform the user about
the current selection. When this cluster is further detected for more than 400 ms, then
this cluster is activated and the second stage (dynamic stage) starts.
θg
G (xg, yg)
M (xm, ym)x
y
0°
-90°
180°
90°
θl
θu
Figure 5.6: Visualization of the angular criterion in first stage. The red dot is gaze
point (G), and blue dot is the midpoint of screen (M). The gray dotted line presents the
identifiable range of each cluster.
In the stage 2 (character selection), only characters in the selected cluster begin to
move. The moving speed of characters is 433 pixels/s (approximately 10◦/s visual angle).
Users can select the desired character by following the movement of this character with
their eyes. But the following action itself does not affect the detection. When for two
consecutive gaze points the distance between gaze points and the midpoint of this group
69
exceeds 250 pixels in this stage, the characters in this cluster will move back to the
original position, the detection process returns to the first stage. Similar to the first stage,
character detection is also based on angular range. After the characters of the selected
group have finished moving, 20 gaze points [(xg,1, yg,1), ..., (xg,20, yg,20)] will continue to be
collected. A set of angles [θg,1, ..., θg,20] is calculated according to those 20 gaze points and
the middle point of this cluster. The mode of those angles is calculated and the character
corresponding to this angle is selected.
The detectable range of a character for this stage is 85 degrees. In order to reduce
false detections, an area of five degrees is defined between the adjacent characters. If the
measurement falls into this area, the system skips this input.
5.2.4 One-point calibration
The one-point calibration method developed by Lutz et al. (2015) was used in this study.
A number counting down from three appears in the center of the screen and this process
lasts three seconds. Users were asked to look at the countdown number in the center of
the screen (see in Figure 5.7).
Figure 5.7: The interface of one-point calibration
During the one-point calibration, gaze coordinates were recorded and outliers were
removed using DBSCAN (Ester et al., 1996). After the test in the pilot study, the max-
imum distance between two samples, i.e., one is considered to be in the vicinity of the
other is set to 100 pixels (about 2.56◦visual angle). The number of samples is set to 50,
70
i.e., to consider as a core point, the number of samples round the core point need be more
than 50. The cluster that aggregates the most gaze data is selected for use for one-point
calibration, i.e., other groups and gaze points that are not grouped are considered as
outliers (illustrated in Figure 5.8).
Figure 5.8: Remove outliers using DBSCAN. The black cross is position of the middle
calibration Point. The black points are the gaze points during one-point calibration, the
red points are the reserved gaze points after discarding outliers.
Based on the retained gaze points, the average means of distance for both x-and y-
axis between the midpoint of the screen and gaze positions were calculated. Then, gaze
coordinates were translated with the calculated offsets. In similar experiment settings,
study 1 reported the average gaze estimation error was 4◦(see Chapter 4). Thus if
the offset is greater than 4◦, the system reminds the participant to adjust their position
according to the distance from the screen, and perform the calibration again. According
the records during experiment, all participants were able to successfully complete the
one-point calibration without re-calibration. On average across all participants, the offset
was 55.42 pixels (SD = 35.79).
Figure 5.9 visualizes gaze data from the pilot study before and after one-point cali-
bration, where the participant gazes at a stationary point that appeared in sequence at
five positions for three seconds on the screen.
71
(a)
(b)
Figure 5.9: Visualization the gaze data from two participants before and after one-point
calibration in pilot study. Red points are target points, the black points visualize the raw
gaze data, and blue points visualize adjusted gaze data after one-point calibration.
72
5.3 Empirical evaluation
To investigate how well do the users enter text using the hybrid eye typing interface, a
user study was conducted. In addition, four forms of feedback were compared for their
influence on user performance and experience1.
5.3.1 Evaluation metrics
The words per minute (WPM), keystrokes per character (KSPC) and minimum string
distance (MSD) are used to measure typing performance, the calculated metrics are de-
scribed in detail in Chapter 2.
In addition to the evaluation of typing performance, the participants’ opinions were
collected via short open questions. After completing the typing tasks for each experi-
mental condition, the participants were asked directly to describe how they experienced
the feedback from the system just presented. Once the whole experiment finished, par-
ticipants were asked to fill out a written questionnaire. The main content of the open
questions is shown in Table 5.1.
Table 5.1: The main content in open questions
1. Please rank the four user interfaces and give reasons.
2. How did you perceive the speed of the objects.
3. How did you perceive the size of the objects?
4. Did you have enough time to search for the desired objects?
5. Suggestions for improvement
5.3.2 Participants
In this study, 29 participants were recruited (13 female, 16 male) from an online recruit-
ment system. Their ages ranged from 19 to 38 years old, with a mean of 26.7 years. All
participants had German as their native language. About half of the participants (69%)
reported that they had normal vision and 31% of them wore vision aids during the study.
Most participants (83%) had no previous experience with gaze interaction. 72% of par-
ticipants had no experience with eye tracking.
1This experiment was conducted by a master student under supervision of the author of this disser-
tation (Neuer, 2020)
73
Both participants and supervisor were required to wear masks and keep a social dis-
tance of two meters from each other throughout the experiment due to COVID-19 related
hygiene regulations. Participants were rewarded with ten Euros per visit or alternatively
a certification of student experimental hours for attendance.
5.3.3 Apparatus
In this experiment, the Tobii EyeX eye-tracker was used to register the gaze location in
screen coordinates with a sampling rate of 60 Hz. The eye tracker was mounted beneath
a 24” Dell monitor with a resolution of 1920 * 1200 pixels. Task scenarios were created
with Python and were presented on the monitor. All data were collected without personal
calibration from users. The eye tracker was calibrated once by the experiment supervisor.
The average distance between the participants’ eyes to the display was 62 cm (SD = 7.27).
5.3.4 Procedure
After reading the informed consent form, and hygiene concept, the experiment started
with an introduction of the gaze-based text entry system. The experiment consisted of
one training and one test phase.
In the training phase, the interface of no feedback was given and five phrases were given
for learning how to use the text entry system. If the participants had no further questions
about the experimental task, the test phase began. In the test phase, the four text entry
interfaces with different feedbacks were tested. Participants went through all four feedback
conditions. Each feedback condition was tested with five phrases. Entering a phrase was
considered as a trial, i.e., participant typed 5 phrases for each feedback condition and in
all, a total of 20 phrases were given per participant. The order of feedback conditions and
phrases was randomized. All phrases used in this experiment were selected from a set
of 500 phrases (MacKenzie & Soukoreff, 2003) and translated into German. The phrases
used in this study are showed in Appendix B.2. The participants were instructed to enter
the given phrases as fast and as accurately as possible.
At the beginning of each trial, there was a short one-point calibration, which lasted
three seconds. Then a phrase was displayed in capital letters in the center of the screen.
The participants then pressed the space bar to start typing after they memorized the
phrase. After finishing the tasks for one condition, participants were asked to take short
breaks. When the participants completed all the text entry tasks, they were asked to
fill out the questionnaire consisting of demographic information and subjective questions.
74
The whole experiment duration was between 40 and 60 minutes per participant.
5.4 Results
This study featured a one-factor within-subjects design. The results are based on a total
of 580 phrases (29 participants * 4 feedbacks * 5 phrases). Typing speed, accuracy and
subjective feedback were analyzed.
5.4.1 Text entry rate
The grand mean of typing speed was 4.7 WPM. The fastest typing speed was registered
with 8.38 WPM. The descriptive statistics with regard to typing speed for each feedback
condition are showed in Figure 5.10 and Table 5.2.
Figure 5.10: Words per minute for each type of feedback, black dots visualize the average
values, diamonds are outlier observations.
A repeated-measures ANOVA at a significance level of α= 0.05 was applied to analyze
data. Mauchly’s test of sphericity indicated that no correction was needed. The feedback
conditions significantly affected the text entry rate (F(3,84) = 4.61, p < .01). A set of
75
Bonferroni corrected t-tests were conducted for pairwise multiple comparisons. The Post-
hoc comparisons revealed that the text entry rate of both visual-auditory feedback was
significantly greater than no feedback (p<.001).
5.4.2 Error rates
Keystrokes per character (KSPC)
The grand mean of KSPC was 1.67. The descriptive statistics of KSPC for each feed-
back condition are presented in Figure 5.11 and Table 5.2. The Kolmogorov-Smirnov
test showed that the data did not conform to a normal distribution (p<.001), hence
the Friedman test was used to assess differences in the KSPC error between feedback
conditions. No significant difference was found between the different feedback conditions
(χ2(3) = 6.19, p =.10).
Minimun string distance (MSD)
The grand mean of MSD was 0.1 and the result of the descriptive statistics of MSD
was shown in Figure 5.12 and Table 5.2. The Kolmogorov-Smirnov test showed that the
data did not conform to a normal distribution (p<.001), and the Friedman test was
conducted to assess the MSD error. No significant difference was found in the feedbacks
(χ2(3) = 2.83, p =.42).
Table 5.2: Study 2 - Mean absolute values (M) and standard deviations (SD) for each
experimental condition.
Feedbacks WPM KSPC MSD
M SD M SD M SD
No feedback 4.47 1.13 1.72 0.54 0.10 0.11
Visual only 4.61 0.96 1.70 0.50 0.09 0.06
Auditory only 4.72 0.99 1.69 0.61 0.13 0.16
Both 5.01 0.92 1.58 0.41 0.09 0.08
5.4.3 Subjective evaluation
During and after the experiment, participants were asked to report their perceptions and
preferences for the four feedback conditions. In addition, the general perceptions about
76
Figure 5.11: Keystrokes per character for each feedback, black dots visualize the average
values, diamonds are outlier observations.
Figure 5.12: Minimun string distance error rate for each feedback, black dots visualize
the average values, diamonds are outlier observations.
77
the text entry system and ideas for improvement were collected and summarized.
All participants reported that they noticed differences in feedback, and almost all of
them correctly described these differences. However, the presence of visual feedback was
not perceived by the two participants. More than half (55%) of the participants thought
that the combined visual-auditory feedback is helpful. Approximately one-third (31%)
considered that visual feedback is helpful, and 17% thought that auditory feedback is
helpful. Additionally, participants were asked to rank all four feedback conditions. The
most preferred feedback was ranked first and the least preferred was ranked fourth. Table
5.3 shows the results of this ranking.
Table 5.3: Ranking results, 1 represents the most preferred, 4 represents the lowest rank-
ing.
Ranking count
Feedback N 1 2 3 4
No Feedback 29 1 3 6 19
Visual only 29 15 9 5 0
Auditory only 29 0 6 15 8
Both 29 13 12 3 1
More than half of participants ranked the sole-visual feedback condition first and
nearly half of participants ranked the combined auditory-visual feedback condition first.
A Friedman’s test revealed that there was a significant difference between the ranking of
the four feedback conditions (χ2(3) = 45.68, p < .001). Post-hoc pairwise comparisons
with the Wilcoxon signed-rank test showed that the ranking for combined visual-auditory
feedback condition was significantly higher than no feedback condition (p<.001) and sole-
auditory feedback condition (p < .001). Also, the ranking for the sole-visual feedback
condition was significantly higher than the sole-auditory (p<.001) and no feedback
condition (p<.001).
The reasons for feedback preference were collected (summarized in Table 5.4). The
sole-visual feedback is preferred by most participants because it not only helped the user to
confirm the input but also provided information about which character is selected. Some
participants found the tone used for the auditory feedback was distracting and disturbing.
The answer regarding the moving speed, visual search time, and item size were clus-
tered (shown in Table 5.5) . About one-third of participants considered the movement
78
Table 5.4: Reasons behind feedback preference
Feedback Reasons for ranking Count
No Feedback Confirmation of input detection is missing 13
Need to confirm the input 6
Visual only Provide direct feedback 6
Better concentration on typing task 5
Given confirmation of the input 2
Checking the input is omitted 2
Less mistakes 2
Auditory only Sound is disturbing 6
Eye response to sound causes errors 2
Both More robust system interaction 10
Double support in interaction 2
speed of the interface as too fast which leads to unintentional input. Some participants
felt that the moving speed is fast only at the beginning of the use. And some participants
stated that the characters which were not easy to find were moving too fast. Besides,
there were a few participants who considered the moving speed for some characters as too
slow.
At the end of the questionnaire, participants were asked to give suggestions for po-
tential improvement. Multiple participants suggested that the visual feedback should be
more salient, and e.g., color should be added to the selected character or the font should
be bold, indicating that other visual feedback might be difficult to notice. In addition,
participants suggested that the feedback tone should be more user-friendly. Another sug-
gestion was to use different types of auditory feedback for special keys, such as the delete
key. Some participants indicated that the delete key could be re-positioned to a better
location. And participants were also looking forward to having a QWERTY layout design.
A reduction of movement speed and the addition of a word prediction functionality were
also mentioned as potential advancements.
79
Table 5.5: Perception of the text entry system
Topic Description Count
Speed Too fast 11
Feel fast at the beginning of the use 6
Too fast for unfamiliar letters 6
Appropriate 3
Some keys (e.g., space) are slow 3
Search time Time is enough to search the desired tiles 14
Not enough time for searching 13
After using it for a while, the time is enough 2
Item size Appropriate 23
Some keys (e.g., delete) are small 3
Too small 3
5.5 Discussion
The goal of this study was to develop a novel eye typing interface with a relatively low
requirement for spatial accuracy. An experimental laboratory study was conducted to
evaluate users’ performance and preferences. In addition, multiple feedback stimuli were
compared to facilitate immediate input confirmation during the use of the interface.
The results show that the participants were able to learn how to efficiently use the
system within an hour. This result has to be seen in the light of the nature of participants
in this study, who all were novice users of this novel layout. Most of them had no
previous experience with gaze interaction. The average text entry speed reached 4.7
WPM, considerably higher than the SMOOVS typing systems (typing speed varies in the
range of 2.9 to 3.34 WPM).
However, both correct and uncorrected error rates were relatively high, and partici-
pants reported that the system was too sensitive. A lot of participants commented that
the moving speed was too fast and they didn’t have enough time to search the desired
letter, with one participant stating “It’s a bit fast when selecting letter groups, which is
why it often opens some groups that you don’t want to open." The dwell time for cluster
selection was set as 400 ms. To achieve a more robust detection, a longer dwell time could
be introduced, especially for novices who are not familiar with the layout.
The movement speed of the target is one of the important parameters for the pursuit-
80
based gaze interface and affects the detection performance. When the target moves too
fast, smooth pursuit eye movement will be interrupted by catch-up saccades to keep up
with the moving target (Dodge, 1903; Robinson, 1965). Moreover, the tracking trajectory
is limited for a linear moving target when the size of the display is not big enough, i.e.,
the target reaches the edge of the screen soon (Vidal et al., 2013). The studies reported
that the detection rate for method based on correlation drops when the moving speed
is too high or too low, which needs to be controlled in a reasonable range (Drewes et
al., 2018; Esteves et al., 2015). Medium angular speed around 120 ◦/s is considered
as proper in orbits trajectory (Esteves et al., 2015), where Drewes et al. (2018) came
to a conclusion that the detection rate works well when the moving speed in a range
from 6 to 16◦visual angle/s. In addition, the studies regarding linear trajectory also
came to similar conclusions. Cymek et al. (2014) found that the medium speed (218
pixels/s, about 4.9 ◦/s) has a higher subjective user evaluation. And the text entry
speed is not proportional to moving speed, but peaks at a medium speed (300 pixels/s,
approximately 8 ◦/s). Zeng, Siebert, Venjakob, and Roetting (2020) compared two moving
speeds (7.73 ◦/s and 12.89 ◦/s), and found that the detection rate for faster-moving speed
is higher than slower ones. More important is that there is a general consensus among
pursuit-based studies about moving distance, the detection rate or subjective evaluation
is higher with a longer trajectory (Cymek et al., 2014; Esteves et al., 2015; Zeng et al.,
2020). And our data suggests that the movement speed of the interface in this study
was potentially set too high for novices, they can get used to faster speeds as usage time
grows.
Besides, the input accuracy might be increased with consideration of learning effects
and individual differences. For example, users could be allowed to adjust the parameter
themselves (Špakov & Miniotas, 2004) or the system could automatically adjust to meet
the need based on past typing records (Mott et al., 2017; Pi et al., 2020).
The results also indicated that the type of feedback used in the interface significantly
relates to the typing speed. The highest typing speed was found in the feedback condition
with the combined visual and auditory feedback. No significant difference was found either
on KSPC or MSD error regarding feedback conditions. Hence, it can be concluded that
for typing efficiency in the hybrid eye typing interface, it is optimal to provide combined
audio-visual feedback, as the experimental data suggests that it works better than no
feedback.
Although the evaluation of user performance indicated that the combined visual and
auditory feedback resulted in higher typing speed and lower error rates, the evaluation of
81
user preference showed that 52% of participants ranked the sole-visual feedback to be the
most preferred feedback, and slightly fewer participants (45%) preferred the combined
visual-auditory feedback most. No participants felt that visual feedback caused extra
visual noise that made it hard to concentrate on typing. There was no significant increase
in demand for visual resources. Participants appreciated that visual feedback informed
them that input was registered and which action was activated. The reason that fewer
participants like the auditory feedback were similar to what Špakov et al. (2016) reported
about auditory feedback. Participants found the auditory feedback to be useful but
distracting in both studies.
Our assessment of effectiveness and efficiency variables, as well as user feedback for
the hybrid eye typing interface, shows that the system represents an enjoyable and well-
functioning human-machine interaction approach. While the text entry speed count of
4.7 WPM found in this study is not high in comparison to other gaze-based spellers, there
are two critical advantages over existing spellers, robustness and user experience.
The hybrid eye typing interface alleviates the challenges of a lack of calibration in
walk-up-and-use eye-tracking solutions. Such calibration-free eye typing methods need a
shorter time required for the calibration process of eye tracker. Since the angle of the
gaze-path is mostly unaffected by a gaze point offset that is sometimes found with one-
point calibration, this gaze speller is well suited for immediate interactions. The relatively
simple arrangement of usable objects further supports this.
As a second main advantage over existing systems, the exclusive actionability of select
object groups through visual and auditory feedback increases user feedback and user
understanding during the use of the system. This new eye typing interface is featured
with a hybrid design, which combines both dwell-based and pursuit-based gaze interaction.
The interaction is dived into two-stages. The cluster selection is corresponding to the
static stage, where the cluster looked at by the user is highlighted, but the characters
on this interface remain static which enables the user to search the desired character
more easily. The characters in one cluster begin to move when the user is fixating on
this cluster for a certain time. The selection of characters is done in the dynamic stage.
Compared to dwell-based interfaces which have a high requirement for calibration, the
calibration process for the proposed hybrid design is reduced to three seconds. There is
also no learning of specific eye-movements necessary (like in gesture-based systems), as
users do not need to remember or learn gestures to use the hybrid system. Compared to
the existing pursuit-based eye typing systems (e.g., SMOOVS), the typing efficiency is
slightly improved in the proposed hybrid interface.
82
It can be expected that any introduction of gaze-based interfaces in the public will
initially depend on a high level of detection robustness, the simpleness of the interface,
and the understand-ability of the interaction design. Since this novel eye typing inter-
face combines these factors, it represents a prime candidate for implementation in public
spaces. This hand-free text entry system can help to limit touches and helps to promote
hand hygiene. The system is also considered to be more convenient, e.g., for people with
disabilities. It could be complementary, where eye input can be used as an alternative
interaction method, coexisting with physical buttons and touch screens.
5.6 Summary
This new eye typing interface enables users to enter text by eye after a three seconds cali-
bration process, which shortens the time for calibration. It also provided a hands-free eye
typing system with robust typing performance as well as relatively high user acceptance.
The interface successfully avoids the Midas touch problem, through the implementation
of a static stage, that allows users to search the desired character. The rearrangement
of character-clusters in the interface counteracts the effect of lower gaze detection ac-
curacy in the border areas of interfaces. While the combined visual-auditory feedback
resulted in the highest typing speed, subjective data revealed that users prefer sole-visual
to combined visual-auditory feedback, because the sound was distracting sometimes.
83
84
Chapter 6
Study 3: Evaluating eye typing
interfaces with language model
6.1 Introduction
A circle-layout eye typing interface that enables enter text only by gaze after a short
one-point calibration was proposed in the last chapter. The user could enter a character
by following the movement of the desired character by eye. Although this text entry
system presents an early investigation of calibration-free eye typing, all novice participants
successfully completed the typing task, the maximum typing speed is limited due to the
two-stage selection design and is relatively slow.
In order to achieve an acceptable typing speed, utilizing the language model for this
eye typing interface is formally investigated. In this chapter, three eye typing interfaces
are developed and a user study is conducted to address the following research questions:
1. Whether utilizing language model can significantly improve the typing performance
for this eye typing interface.
2. Whether utilizing language model can significantly reduce the workload and improve
the user experience during typing.
3. What kind of letter/word prediction designs are more appropriate for such a dynamic
eye typing interface.
85
6.2 Interface design
Consistent with study 2, the eye typing interface is arranged in the order of letters A-Z
and contains 26 letters. In addition, there are delete and space keys. In this study, the
position of the space key is different from that of in study 2, where space key is located
in the vertical downward direction of the interface and, and there is only one item (space
key) in this direction. In this study, both delete and space keys are in a group with the
letters Y and Z. The position of the cluster moving downward is reserved for the function
keys of the word prediction. The currently typed letter is still displayed in the middle of
the idle area. When the user checks the typed letters, i.e., the letters displayed in the
middle of idle area, no action will be activated. If the user finishes typing a word, they
can follow the “SP" key, and the entered word will move from the center to the bottom
of the screen.
Before entering the typing interface, there is a one-point calibration process, which
lasts three seconds, as detailed in Subsection 5.2.4. The process of entering one character
is consistent with study 2.
However, according to the results from study 2, participants reflected that the search-
ing time was short and the moving speed is fast. Thus, the dwell time for the stage 1 (search-
ing and cluster selection) was lengthened to 600 ms, i.e., if one cluster is looked over 600
ms, the characters in this cluster start to move outwards. In addition, the moving speed
was slowed which is stable at about 250 pixels/s (equivalent to 6.4◦visual angle/s). To
avoid distraction, when the selected cluster is moving out, other clusters fade out. Figure
6.1b presents the position of the characters when the movement of one cluster is finished.
The detection method is the same as study 2. Auditory and visual feedback was given
after a character is selected.
Eye typing interfaces with language model
In this study, language model was introduced into the design of the hybrid eye typing
interface. Based on the original hybrid eye typing interface proposed in study 2, two
more versions, one with letter prediction and the other one with both letter and word
prediction, were proposed. Thus, three typing interfaces were tested, they are: (1) No
prediction (NoP), served as a baseline, (2) Letter prediction (LP), and (3) Letter + word
prediction (L+WP).
The interface of the letter prediction is basically the same in appearance as the No
prediction, both contain seven clusters (see Figure 6.1). The difference between No pre-
diction and letter prediction is the moving distance of each letter (see Figure 6.2). In the
86
(a) Cluster selection (b) Character selection
Figure 6.1: Screenshot of interfaces of No prediction. (a) when the eyes look at the ABCD
cluster, and this cluster is highlighted as feedback. (b) the characters in selected cluster
moved outward.
Letter prediction, if the letter belongs to the list of the next four most likely letters to
appear, and the cluster containing this letter contains only one letter from the list. The
moving distance of this cluster will be reduced from 94 to 68 pixels.
Figure 6.2: The default value of moving distance is 94 pixels (right figure), The figure on
the left shows the moving distance when letter prediction activated, i.e., 68 pixels.
The interface with Letter + word prediction contains eight clusters and has one more
cluster than No prediction and Letter prediction. In addition to letter prediction function,
the three most probable word candidates are presented around the current typed charac-
ters (see Figure 6.3). The predicted words are located in the central idle area, and when
looking at these predicted words, no action will be triggered. In the cluster containing
87
three arrows, each arrow corresponds to a predicted word in the corresponding position.
Since the entered words will be displayed at the bottom of the screen, no character is
placed in the downward direction of the arrow cluster to prevent the user from triggering
an unintentional selection when reading the entered words.
Figure 6.3: The interface for Letter + word prediction. User could follow the correspond-
ing arrow to enter predicted word. In this example, the up arrow stands for “and", the
left arrow represents “as", and the right arrow is for “are".
A character-to-word model based on Convolutional Neural Networks (CNNs) proposed
by Park (2017) has been chosen for word prediction, which is available to provide both
prediction of the current possible word as well as the next possible word. In Letter + word
prediction, when a user is typing the current words, three suggestions of word completion
are given. And when a word is just entered, three next possible words are provided. The
unintentional activation of word prediction leads to more errors and longer correction
time, thus, 15 more gaze points are collected to calculate the final angle for the arrow
cluster.
88
6.3 Empirical evaluation
This study executed a user study to collect the data of typing performance and subjective
evaluation on the three eye typing interfaces.
6.3.1 Experiment design
The experiment was a 3 * 3 within-subjects design. The Prediction mode factor contained
three levels (No prediction, Letter prediction, Letter + word prediction) and Session is
from 1 - 3. The phrases were chosen from the phrases set proposed by MacKenzie and
Soukoreff (2003). The order of the conditions was counterbalanced across participants.
In summary, 216 trials were retained: 12 participants * 3 interface designs * 3 sessions *
2 phrases = 216 trials.
6.3.2 Evaluation metrics
The words per minute (WPM), uncorrected error rate (UER), corrected error rate (CER)
and keystroke savings (KS) were used to measure the typing performance.
Words per minute were used to quantify the typing speed during text entry. One
word is considered as five characters, including letters, spaces, punctuations. Uncor-
rected error rate refers the how many errors remain in transcribed text. And corrected
error rate presents how frequently backspace is used during text entry (MacKenzie &
Tanaka-Ishii, 2010). Keystroke savings reflects how many keystrokes are saved with word
prediction/completion (Trnka & McCoy, 2008).
In addition, participant’s perceived workload was measured using NASA Task Load
Index (NASA TLX). Besides, subjective feedback about the moving speed, search time,
word prediction, and ideas for improvement was also collected in open questions. The
main questions are showed in Table 6.1.
6.3.3 Participants
Twelve participants (6 males, 6 females) were recruited for this study. The participants’
average age was 28.83 (SD = 4.88). Seven participants wore glasses, one wore contact
lenses, four did not wear any visual aids. Two participants knew about gaze interaction
before, and the others had had no experience with gaze interaction. And all of them were
fluent English speakers, although none of them were native English speakers. Participants
89
Table 6.1: Main topics for open questions
1. How did you perceive the moving speed of the objects?
2. Did you have enough time to search for the desired objects?
3. Can you easily understand that each arrow corresponds to each
predicted word?
4. Can you successfully select the prediction word?
5. Do you think word prediction is helpful?
6. Do you have any suggestions for improvement?
signed an informed consent form before starting the experiment and received monetary
compensation after the experiment.
6.3.4 Apparatus
Consistent with the setup of study 2 in Chapter 5, a Tobii EyeX with 60 Hz was attached
under a 24-inch monitor with a resolution of 1920 * 1200 pixels. The eye tracker was
calibrated beforehand by the experimenter. The participants were asked to sit in front
of the monitor. If the participant’s eye position moves out of the detection range of the
eye tracker device (too close or too far), the participant will be reminded to adjust the
position forward or backward until a detectable distance is reached. No other instructions
were given. The distance between user and monitor ranged from 44 cm to 72.8 cm (M
=58 cm, SD = 6.42). The experimental program were implemented in Python. At a
distance of 60 cm from participant to screen, 1◦is equivalent to 39 pixels.
6.3.5 Procedure
Prior to the experiment, participants were informed the hygiene concept regarding COVID-
19. Both participants and experimenters were required to wear masks and maintain a safe
distance of 1.5 meters.
The experiment started with a short introduction about the eye typing system. After
that, the participants can freely practice how to use this typing system. After under-
standing the basic operations, the test began.
At the beginning of each trial, a phrase was displayed in the center of the screen, and
participants were asked to remember the phase and press the space key to start the typing
task. All selected phrases were easy to understand and remember, such as “time to go
90
shopping", “you must make an appointment", the phrases used in this study are showed
in Appendix C.2. There were two trials in each condition, participants were asked to
complete the NASA-TLX questionnaire to measure their perceived workload after each
condition. They were instructed to enter the given phase as quickly and accurately as
possible and correct errors that occurred in the current typing word. Participants were
told that they can rest between trials. The demographic questionnaire (e.g., gender, age,
and experience about gaze interaction) and open questions were answered after the text
entry tasks. The experiment lasted about 40-60 minutes.
6.4 Results
In this section, the typing speed, error rates, and subjective evaluation are analyzed. The
descriptive statistics with regard to words per minute, corrected error rate and uncorrected
error rate is shown in Figure 6.4 and Table 6.2. The blue circle represents No prediction,
the orange triangle stands for Letter prediction, and the green square indicates Letter +
word prediction. The descriptive data for keystroke savings is presented in Table 6.3.
Repeated measures ANOVA was used for data analysis, and Bonferroni corrections
were used in pair-wise comparisons. Shapiro-Wilk was used to check normality, and for
non-normal data, the aligned rank transform (ART) was performed before ANOVA (Wob-
brock, Findlater, Gergle, & Higgins, 2011).
6.4.1 Text entry rate
Words per minute (WPM) was used to measured text entry speed. The Letter +
word prediction achieved an average text entry rate of 5.48 WPM over sessions, with
4.42 WPM (SD = 1.31) in session 1 and 6.19 WPM (SD = 1.77) in session 3. The
Letter prediction achieved an average text entry rate of 3.42 WPM over sessions, with
3.49 WPM (SD = 0.61) in session 1 and 3.33 WPM (SD = 0.83) in session 3. The
No prediction achieved an average text entry rate of 3.39 WPM over sessions, with 3.19
WPM (SD = 0.68) in session 1 and 3.64 WPM (SD = 0.6) in session 3.
A Shapiro-Wilk normality test shown that the words per minute were normally dis-
tributed (p > .05). Since Mauchly’s test indicated a violation of sphericity (p < .05),
the Greenhouse–Geisser correction was used to correct for violation of the assumption of
sphericity. The results show a significant main effect of Prediction mode on words per
minute (F(1.19,13.12) = 52.10, p < .001). A significant effect of Session on words per
91
Figure 6.4: Mean of words per minute, corrected error rate, and uncorrected error rate
over 3 sessions.
92
Table 6.2: Study 3 - Mean absolute values (M) and standard deviations (SD) for each
experimental condition.
WPM CER UER
Prediction Session M SD M SD M SD
L+WP 1 4.42 1.31 0.08 0.06 0.17 0.21
2 5.82 1.68 0.06 0.05 0.11 0.13
36.19 1.77 0.07 0.04 0.04 0.06
LP 1 3.49 0.61 0.13 0.09 0.08 0.09
2 3.43 0.80 0.14 0.10 0.06 0.11
3 3.33 0.83 0.13 0.06 0.10 0.14
NoP 1 3.19 0.68 0.14 0.09 0.08 0.13
2 3.36 0.45 0.15 0.11 0.05 0.09
3 3.64 0.60 0.11 0.07 0.05 0.10
minute was also found (F(2,22) = 7.34, p < .01). There was a significant interaction
between Prediction mode and Session on words per minute (F(2.2,24.2) = 4.95, p < .05).
The effect for session interacted with Prediction mode, that is, session affected Letter
+ word prediction differently than Letter prediction and No prediction. The green line
(Letter + word prediction) has a steeper increase from session 1 to session 3 whereas the
orange (Letter prediction) and blue (No prediction) lines are much more horizontal. And
the pair-wise comparisons shown that there were significant difference between session 1
and 2 (p<.05) and between session 1 and 3 (p<.01) regarding Letter + word prediction.
In addition, there were significant difference between Letter + word prediction and No
prediction in session 1 (p<.01), between Letter + word prediction and No prediction in
session 2 and 3 (p<.01), and between Letter + word prediction and Letter prediction in
session 2 and 3 (p<.01).
6.4.2 Error rates
Corrected error rate (CER) reflects how frequently users correct the unintended or
false typed characters using backspace. The Letter + word prediction started with a
corrected error rate of 0.08 (SD = 0.06) in session 1 to 0.07 (SD = 0.04) in session 3.
The Letter prediction started with a corrected error rate of 0.13 (SD = 0.09) in session
1 to 0.13 (SD = 0.06) in session 3. The No prediction started with a corrected error rate
of 0.14 (SD = 0.09) in session 1 to 0.11 (SD = 0.07) in session 3.
93
A Shapiro-Wilk normality test shown that the corrected error rate was normally dis-
tributed (p>.05). The Mauchly’s Test for Sphericity was not significant regarding the
corrected error rate (p>.05). A significant main effect of Prediction mode was found
F(2,22) = 7.17, p < .01. And there was no significant effect of Session on corrected error
rate F(2,22) = 0.76, p =.48. The interaction between these terms was not significant
F(4,44) = 0.67, p =.62. From the post-hoc test results, the results found that there were
significant differences (p<.001 ) between Letter + word prediction and Letter prediction,
and between Letter + word prediction and No prediction. But no difference was found
between Letter prediction and No prediction.
Uncorrected error rate (UER) reports how many errors left in the typed sentence.
The Letter + word prediction started with a uncorrected error rate of 0.17 (SD = 0.21)
in session 1 to 0.04 (SD = 0.06) in session 3. The Letter prediction started with a
uncorrected error rate of 0.08 (SD = 0.09) in session 1 to 0.1 (SD = 0.14) in session 3.
The No prediction started with a uncorrected error rate of 0.08 (SD = 0.13) in session 1
to 0.05 (SD = 0.1) in session 3.
A Shapiro-Wilk normality test shown that the uncorrected error rate was not normally
distributed (p<.05), thus, a two-way ART RM-ANOVA was performed. The results
show a significant main effect of Prediction mode on uncorrected error rate (F(2,88) =
3.45, p < .05). A significant effect of Session on uncorrected error rate was also found
(F(2,88) = 3.88, p < .05). There was no significant interaction between Prediction mode
and Session on uncorrected error rate (F(4,88) = 1.94, p =.11). From the post-hoc test
results, the results found that there were significant differences between Letter + word
prediction and No prediction (p<.05) and between session 1 and 3 (p<.05).
6.4.3 Keystroke savings (KS)
Keystroke savings measures the percentage of key saving with language model com-
pared to letter-by-letter text entry. Only the data for the Letter + word prediction was
analyzed. The Letter + word prediction started with a KS of 0.38 (SD = 0.2) in session
1 to 0.44 (SD = 0.08) in session 3.
A Shapiro-Wilk normality test shown that the KS was normally distributed (p>.05).
Since Mauchly’s test indicated a violation of sphericity (p<.05), the Greenhouse–Geisser
correction was used to correct for violation of the assumption of sphericity. The one-
way ANOVA indicated that Session had no significant influence on KS (F(1.27,13.95) =
1.07, p =.34).
94
Table 6.3: Mean and standard deviation for keystroke savings for interface with Letter
+ word prediction.
Keystroke savings Session 1 Session 2 Session 3
Mean (M) 0.38 0.44 0.44
Stand deviation (SD) 0.20 0.08 0.08
6.4.4 Subjective evaluation
After completing the task for each technique in each session, participants were asked to fill
the NASA TLX questionnaire. Figure 6.5 illustrates the mean scores of the NASA Task
Load Index in each dimension. Since these rating results were non-normal, the statistical
significance tests were performed using a two-way ART RM-ANOVA.
NASA Task Load Index
Mental demand
The Letter + word prediction started with a mental demand score of 31.1 (SD = 18.0) in
session 1 to 16.8 (SD = 19.1) in session 3. The Letter prediction started with a mental
demand score of 28.4 (SD = 17.4) in session 1 to 39.6 (SD = 30.8) in session 3. The No
prediction started with a mental demand score of 34.6 (SD = 18.4) in session 1 to 25.8
(SD = 18.5) in session 3.
The results show a significant main effect of Prediction mode on mental demand
(F(2,88) = 3.46, p < .05). No significant effect of Session on mental demand was also
found (F(2,88) = 1.5, p =.22). There was a significant interaction between Prediction
mode and Session on mental demand (F(4,88) = 3.18, p < .05). From the post-hoc test
results, the results found that the mental demand decreased significantly between session
1 and 3 (p<.001) for Letter + word prediction, and decreased also significantly between
Letter + word prediction and Letter prediction in session 3 (p<.001). Besides, significant
differences were found between Letter + word prediction in session 3 and No prediction in
session 1 (p<.001), and between Letter + word prediction in session 3 and No prediction
in session 2 (p<.05).
Physical demand
The Letter + word prediction started with a physical demand score of 32.4 (SD = 17.9)
in session 1 to 19.2 (SD = 19.4) in session 3. The Letter prediction started with a physical
demand score of 33.1 (SD = 16.6) in session 1 to 41.2 (SD = 29.0) in session 3. The No
95
Figure 6.5: Mean of NASA TLX results over sessions, blue circle represents No prediction,
orange triangle stands for Letter prediction, and green square indicates Letter + word
prediction.
96
prediction started with a physical demand score of 40 (SD = 21.3) in session 1 to 36.6
(SD = 25.1) in session 3.
The results show a significant main effect of Prediction mode on physical demand
(F(2,88) = 6.52, p < .001). No significant effect of Session on physical demand was also
found (F(2,88) = 0.85, p =.43). There was no significant interaction between Prediction
mode and Session on physical demand (F(4,88) = 1.43, p =.23). The post-hoc test results
show that there were significant declines between Letter + word prediction and Letter
prediction (p<.05 ), and between Letter + word prediction and No prediction (p<.05).
Temporal demand
The Letter + word prediction started with a temporal demand score of 27.8 (SD = 23.4)
in session 1 to 16.3 (SD = 20.3) in session 3. The Letter prediction started with a
temporal demand score of 27.6 (SD = 20.1) in session 1 to 33.5 (SD = 24.6) in session 3.
The No prediction started with a temporal demand score of 29.2 (SD = 17.1) in session
1 to 32.6 (SD = 25.3) in session 3.
The results show a significant main effect of Prediction mode on temporal demand
(F(2,88) = 3.6, p < .05). No significant effect of Session on temporal demand was also
found (F(2,88) = 0.32, p =.72). There was no significant interaction between Prediction
mode and Session on temporal demand (F(4,88) = 1.24, p =.3). The post-hoc test
results show that there was a significant decline between Letter + word prediction and
No prediction (p<.05).
Performance
The Letter + word prediction started with a overall performance score of 40.7 (SD = 30.3)
in session 1 to 19 (SD = 27.7) in session 3. The Letter prediction started with a overall
performance score of 30.5 (SD = 19.7) in session 1 to 35.2 (SD = 27.8) in session 3. The
No prediction started with a overall performance score of 46.7 (SD = 24.3) in session 1
to 29.5 (SD = 17.2) in session 3.
The results show that there was no significant main effect of Prediction mode on
performance (F(2,88) = 0.88, p =.41). A significant effect of Session on performance
was also found (F(2,88) = 3.3, p < .05). There was no significant interaction between
Prediction mode and Session on performance (F(4,88) = 1.87, p =.12). The post-hoc
test results show that participants felt more successful in session 3 than 1 (p<.05).
Effort
The Letter + word prediction started with a effort score of 46.9 (SD = 24.6) in session
97
1 to 24.5 (SD = 23.9) in session 3. The Letter prediction started with a effort score of
39.4 (SD = 21.7) in session 1 to 44.4 (SD = 32.3) in session 3. The No prediction started
with a effort score of 51.7 (SD = 23.3) in session 1 to 40.2 (SD = 28.5) in session 3.
The results show a significant main effect of Prediction mode on effort (F(2,88) =
5.26, p < .001). A significant effect of Session on effort was also found (F(2,88) =
4.29, p < .05). There was no significant interaction between Prediction mode and Session
on effort (F(4,88) = 1.78, p =.14). The post-hoc test results show that there were
significant declines between session 1 and 2 (p<.05), between session 1 and 3 (p<.05).
Besides, the effort score for Letter + word prediction was significantly lower than No
prediction (p<.05).
Frustration
The Letter + word prediction started with a frustration score of 30.2 (SD = 23.5) in
session 1 to 13.3 (SD = 17.6) in session 3. The Letter prediction started with a frustration
score of 24.6 (SD = 21.1) in session 1 to 36.9 (SD = 30.0) in session 3. The No prediction
started with a frustration score of 38.2 (SD = 31.4) in session 1 to 31.3 (SD = 23.2) in
session 3.
The results show no significant main effect of Prediction mode on frustration (F(2,88) =
2.65, p =.08). No significant effect of Session on frustration was also found (F(2,88) =
0.15, p =.86). There was no significant interaction between Prediction mode and Ses-
sion on frustration (F(4,88) = 1.35, p =.26). The results of statistical significance are
summarized in Table 6.4.
Interview
In addition to quantitative evaluation, participants’ qualitative feedbacks regarding the
moving speed, searching time, word prediction, and improvement suggestions were also
collected.
Perceived moving speed
More than half of the participants found the moving speed is appropriate (P1, P3-6, P9-
10). And four participants (P2, P8, P11-12) thought the speed is too fast, and one of
them (P2) reported that “the moving speed is too fast only at the beginning of the experi-
ment, and I can adapt to this speed after a period of practice." Only one participant (P7)
considered that the speed is too fast.
98
Table 6.4: Study 3 - Statistical significance of main and interaction effects
regarding performance and subjective measures.
Main effect of Main effect of Interaction
prediction mode session effects
Typing performance:
Words per minutes p<.001 p<.01 p<.05
Corrected error rate p<.01 n.s1n.s.
Uncorrected error rate p<.05 p<.05 n.s.
Keystroke savings2- n.s. -
NASA-TLX:
Mental demand p<.05 n.s. p<.05
Physical demand p<.001 n.s. n.s.
Temporal demand p<.05 n.s. n.s.
Performance n.s. p<.05 n.s.
Effort p<.001 p<.05 n.s.
Frustration n.s. n.s. n.s.
1‘n.s.’: not significant
2Only applied for data from letter and word prediction condition.
Searching time
Four participants (P2, P4, P6, P12) felt that they had enough time to search the desired
letter. Other eight participants (P1, P3, P5, P7-11) reported that they did not have
enough time only at the beginning, after practicing a few phrases they could have time
to find the desired letter.
Views on word prediction
Participants were asked three questions regarding their views on word prediction. The
first question about the degree of difficulty to understand the design that arrows in dif-
ferent directions corresponding to the predicted word at the corresponding position. All
participants considered that it is very easy to understand, and one of them (P10) sug-
gested adding color for different predicted words. The second question is about whether
each arrow can be selected successfully. Six participants (P7-12) thought that they can
select the arrows successfully, and one of them (P7) reported that arrows were easier
to be selected than letters. On the other hand, there are six participants (P1-P6) who
considered that predictive words can easily be chosen by mistake. The third question is
99
whether the word prediction is helpful. Almost all users responded that word prediction
function is very helpful. P9 appreciated the word prediction, special for lengthy words.
Suggestions
Five participants (P6, P7, P9, P11-12) suggested separating the space and delete keys
because misselections often occur between these two keys. Those misselections make
them nervous, which leads to more misselections. Four participants (P2-4, P6) proposed
that the sensitivity of the system could be reduced to avoid unintentional selection. P1
commented that “Hope that there can be an automatic calibration mechanism to make
the typing system more and more accurate".
6.5 Discussion
In this study, three prediction modes were compared, the results were analyzed regarding
the text entry performance, NASA Task Load Index, and qualitative feedback.
For the first research question, the results show that there was a significant improve-
ment in typing speed by adding the word predictions. With the provision of the word
prediction function, the typing speed was faster than Letter prediction and No predic-
tion over sessions. The typing speed of Letter + word prediction increased greatly from
session 1 to session 3, whereas the typing speed of Letter prediction and No prediction
grew more modestly. And the typing speed of Letter prediction and No prediction were
similar over sessions in this user study. This suggests that participants were able to learn
the novel interface after a short practice, and Letter + word prediction also exhibits a
more efficient typing speed. However, participants need more time to get familiar with
the interface with word predictions to reach relatively stable text entry speed.
The results also indicate that the corrected error rate of Letter + word prediction
was significantly lower than Letter prediction and No prediction. Using word prediction
reduces the number of keystrokes required for users to enter phrases of a certain length,
and at the same time, reduces the possibility of unintentional input to decrease the number
of corrections correspondingly.
For uncorrected error rate, there was a significant difference in prediction modes,
mainly between Letter + word prediction and No prediction. In this experiment, par-
ticipants were told that they don’t have to correct the errors for word prediction which
was incorrectly entered, because this study wants to know whether the word prediction
can be selected correctly. The results found that the uncorrected error rate is quite high
100
for Letter + word prediction in sessions 1 and 2 compared to Letter prediction and No
prediction. However, the uncorrected error rate of Letter + word prediction decreased
gradually over sessions, The gap with No prediction was already very small in session 3,
and even lower than Letter prediction. For beginners, the rate of unintentional selection
of word prediction is relatively high, thus, the uncorrected error rate is relatively high in
the first two sessions. However, those errors were significantly reduced with short practice
sessions. Besides, although there was no significant difference on KS over session in Letter
+ word prediction, the trend indicates that somewhat more keystrokes were saved from
sessions 1 to 2.
For the second research question, the results of the NASA TLX test reveals that
Letter + word prediction had a lower score than Letter prediction and No prediction
on the mental, physical, temporal demand, and effort dimensions. Those differences
were substantially increased from sessions 1 to 3. And participants evaluated that the
performance was getting better and the efforts devoted to accomplishing the task were
decreased over sessions.
Those results also confirm the findings of the text entry performance. When the user
is not familiar with the text entry system, the participants’ cognitive load is relatively
high, and they tend to focus on the task of finding letters, thereby ignoring the update of
the predicted words. After participants became familiar with the interaction, they could
find the correct word prediction more proficiently and quickly, and therefore relied more
on the word prediction.
For the last research question, the results show that adding word prediction can signif-
icantly improve typing performance and reduce workload level. However, not all designs
embedded language models will bring significant improvements. MacKenzie and Zhang
(2008) introduced weight based on language model to assist the gaze typing, and no sig-
nificant improvement was found compared with baseline. In this study, different from
what was expected, the typing interface containing letter prediction did not show a better
typing performance and lower workload index than No prediction, and even worse in some
situations. For example, in the mental demand, the scores of Letter + word prediction
and No prediction were slightly decreasing with practice, however, the score of Letter
prediction is gradually increasing. Even in the session 3, there was a significant difference
between Letter + word prediction and Letter prediction on mental demand. Two possible
reasons are speculated. First, similar to the findings of several studies, the control of
rhythm is very necessary to be taken into account (Majaranta et al., 2006; Mott et al.,
2017). Maintaining the rhythm of typing brings a smooth typing experience. In the Letter
101
prediction, the moving distance of letter with high probability was reduced by almost one
third. This obvious difference greatly interrupts the typing rhythm of the participants.
For subsequent improvements, a more moderate approach for letter prediction is needed,
such as referencing this literature (Mott et al., 2017) to keep this change (in dwell time or
moving distance) within a certain range and gradually increase or decrease it depending
on the predicted probability. Second, in the case of letter prediction, a letter with a high
probability will be “easier" to be selected. If the letter is what should be entered, the cost
of activating it can be reduced. However, if this letter is not the desired one, it also in-
creases the probability of making a mistake. It can be expected that the prediction results
of the language model will be more accurate, but for some words with lower occurrence
probability, the prediction of the language model has a correspondingly lower probability.
This leads to when users want to enter the letters with low probability, the letters with
higher probability are easier to be unintentionally activated. Therefore, the above two
points need to be considered in the follow-up work.
There are also some limitations to this study. The algorithm of detection needs to
be further optimized to avoid unintentional triggers in order to achieve a better user
experience. Another limitation is that the participants are not native English speakers,
this may have a certain impact on the text entry efficiency. Besides, participants reported
that the space and delete keys are too close to each other and often selected by mistake,
so they hope to separate those two keys. Studies have indicated that cognitive workload
can be measured using eye movement metrics (Duchowski et al., 2018; Kosch et al., 2018).
In this study, participants also reported that when they feel relaxed, the typing speed is
relative faster, but when unintentional activation happened repeatedly, typing efficiency
will decrease. Thus, the position of functional keys should be addressed in subsequent
iterations.
6.6 Summary
In summary, this chapter explored how to use language model to improve typing perfor-
mance and reduce the workload of the eye typing interface. A user study was conducted
and three prediction modes (Letter + word prediction, Letter prediction, No prediction)
were compared.
From the results, this study offers important insights into providing word/letter pre-
diction. First, the provision of word prediction significantly outperformed the other two
prediction modes (Letter prediction, No prediction) in terms of typing speed and error
102
rates. Besides, Letter + word prediction also got a better rating in terms of NASA TLX.
Although adding word prediction is a more efficient way for eye typing, participants
need more time to familiarize themselves with the operation than without word prediction.
For public display, where users do not have a longer time to learn the system, the number
of predicted words can be appropriately reduced to simplify the operation, for example,
only to offer a one-word candidate (Zeng & Roetting, 2018). In addition, the control of
rhythm also needs to be considered when the activation threshold is changed based on the
probability predicted by the language model. In this study, participants used the typing
system only for a relatively short time, in the case of Letter + word prediction, the text
entry speed has not yet reached a plateau.
In further work, a longitudinal study could be executed to provide more comprehensive
results on the learning curve. To find out how long it takes for a novice to learn to reach a
plateau in typing performance, and how fast can it achieve. In addition, a field study can
also be performed in real public display (Freytag, 2020; Khamis, Alt, & Bulling, 2015).
Besides, there is individual difference, e.g., novice, expert, regarding the preference
and typing efficiency. Thus, the parameters (such as moving speed, dwell-time) could
be set to automatically adjust based on the typing performance (Špakov & Miniotas,
2004). The provision of word prediction for this hybrid eye typing interface would also be
considered to further speed up the typing efficiency.
103
104
Chapter 7
Conclusions and outlook
The last chapter summarizes the conclusions that have been derived along the way and
provides an outlook for future research.
7.1 Conclusions
This thesis proposed a circle layout hybrid eye typing interface, including offline data
analysis, detection methods comparison, developing the interface, and improving the typ-
ing performance with a language model. Interpretations of the design issues are scattered
throughout the three studies, i.e., Chapter 4 - 6. The conclusions of this thesis are
discussed and summarized in the following paragraphs.
For the first research question “What is the appropriate number of objects and
movement speed for the gaze interface based on linear motion?" The first study
specifically analyzed the influence of object number (6, 8, 10, 12, or 15) and object moving
speed (7.73 ◦/s vs. 12.89 ◦/s) in linear trajectory smooth pursuit gaze-based interfaces.
Offline gaze data of pursuing linearly moving objects was collected. To estimate the
range of parameters and comparison detection algorithms for robust detection, the gaze
orientation and detection rates were compared with respect to the number of objects
and the speed of object movement. Besides, three detection methods were compared
regarding the correct detection rate. Results indicate that the number and speed of the
displayed objects influence users’ performance with the interface. The number of objects
significantly affected the correct and false detection rates when selecting objects in the
display. Participants performed better on the interfaces containing 6 and 8 objects than on
the interfaces containing 10, 12, and 15 objects. Detection rates and orientation error were
significantly influenced by the moving speed of displayed objects. Faster moving speed
105
(12.89 ◦/s) resulted in higher detection rates and smaller orientation error compared to
slower moving speeds (7.73 ◦/s). When it comes to the comparison among detection
methods, the angle-based method achieved the highest detection rate in this data set.
The findings of this study can help to develop an accessible calibration-free gaze interface
based on linear moving design.
To address the second research question “How to design a dynamic eye typing
interface that is easy to learn and user-friendly?" Based on the findings of the
first study, the second study presented a circle-layout eye typing interface, which is a two-
stage design and combined dwell- and pursuit-based selection. The alphabet is clustered
in groups of four characters, and users select a cluster by gazing at it in the first stage
(all items are static) and then select the desired character by following its movement with
their eyes in the second stage (only the characters in desired cluster are moving). A user
study was conducted to explore the impact of auditory and visual feedback on typing
performance and user experience of this novel interface. Results show that participants
can quickly learn how to use the system, and an average typing speed of 4.7 WPM can be
reached without lengthy training. The subjective evaluation of participants revealed that
users preferred visual feedback over auditory feedback while using the interface. Results
indicate that this eye typing interface can be used for walk-up-and-use interactions, as it
is easily understood and robust to eye-tracking inaccuracies.
The last research question is “How to use language models to improve the typ-
ing performance of the dynamic eye typing interface?" The third study aimed
to improve the typing performance of the circle-layout eye typing interface proposed in
the second study. Language model was utilized in this study to improve the typing per-
formance, and three interfaces were designed and compared, which are letter prediction,
letter+word prediction and no prediction. The results of the user study show that partic-
ipants achieved an average text entry speed of 5.48 WPM for letter+word prediction, 3.42
WPM for letter prediction, 3.39 WPM for no prediction. Typing speed of letter+word
prediction is 70.3% faster than without word prediction. And almost all participants
can easily understand the design for word prediction, and thought this function was very
helpful. And letter+word prediction also received a better grade on the NASA task load
index. However, compared to letter prediction and no prediction, participants needed
more time to familiarize themselves with the interface including letter+word prediction
in order to reach a plateau regarding text entry speed.
106
7.2 Outlook
Gaze is a potential input modality in interaction design because it provides a direct way to
interact with the interface, as Jacob (1990) wrote “what you look at is what you get". The
possibilities for further improvements, as well as potential areas of gaze-based application,
are discussed in this section.
Robust estimation
The gaze-based interaction suffers from some shortcomings. First of all, the accuracy
issue comes from the eye tracker device on the one hand, and from the characteristics of
the eye (miniature eye movements) on the other hand. Most existing eye trackers require
calibration before use. The calibration process is time-consuming which should be simpli-
fied to meet the need for spontaneous interaction. To achieve a more natural calibration
process, future work could investigate using pursuit calibration to further improve the
calibration accuracy (Drewes, Pfeuffer, & Alt, 2019; Pfeuffer, Vidal, Turner, Bulling, &
Gellersen, 2013). Moreover, the detection algorithm for low spatial gaze data could also
be studied, such as using a linear regression-based algorithm (Drewes, Khamis, & Alt,
2019), profile matching and 2D correlation (Velloso, Coutinho, Kurauchi, & Morimoto,
2018), to allow the interface to accommodate more targets and make the detection more
robust.
Gaze cursor in VR/AR
In human-computer interaction, the way of interacting with the interface varies with
the device (see Figure 7.1). The user controls the computer’s graphical user interface
through the computer mouse. When it comes to mobile device, such as tablet computer
and smartphone, touch is most frequently used to control those devices. As technology
advances and devices are updated, more new ways of interaction are emerging and being
accepted by users, such as gesture, voice control. The past decade has seen the rapid
development of head-mounted display (HMDs) for augmented or virtual reality. Hand
gesture and gaze cursor are becoming potential control methods for HMDs. Since the
tracking accuracy is reported that ranged from 2-4.5◦visual angle in VR device (Rajanna
& Hansen, 2018), the size of the character keys should be designed to be large enough to
prevent unintentional activation. In this case, the eye typing interface proposed in this
thesis could be one potential solution for entering text in head-mounted display, which
has a relatively low requirement for the eye-tracking accuracy.
107
+ +
Figure 7.1: Pointers on the display for different device.
Multimodal interaction
Another possible area of future research is to explore the combination with other input
modalities to make up for each other’s shortcomings to get more efficient and natural user
experiences. For example, adding head movement (Feng, Zou, Kurauchi, Morimoto, &
Betke, 2021; Sidenmark, Mardanbegi, Gomez, Clarke, & Gellersen, 2020) and touch (Ku-
mar et al., 2020).
Gaze estimation based on webcam
Additionally, since the eye tracker based gaze estimation is subject to the limitation in
the detectable tracking range, a webcam-based gaze estimation method could be used to
increase the available area for gaze estimation. Compared with the eye tracker used in
this study (Tobii EyeX), a webcam-based gaze estimation extends the maximum tolerance
distance range from 75 cm to 180 cm (X. Zhang, Sugano, & Bulling, 2019). In addition,
it would considerably reduce the price of equipment and also broaden the usage scenarios,
such as Eyetell (Bafna et al., 2021).
108
References
Abdrabou, Y., Mostafa, M., Khamis, M., & Elmougy, A. (2019). Calibration-free text en-
try using smooth pursuit eye movements. In Proceedings of the 11th acm symposium
on eye tracking research & applications. New York, NY, USA: Association for Com-
puting Machinery. Retrieved from https://doi.org/10.1145/3314111.3319838
doi: 10.1145/3314111.3319838
Almoctar, H., Irani, P., Peysakhovich, V., & Hurter, C. (2018). Path word: A multimodal
password entry method for ad-hoc authentication based on digits’ shape and smooth
pursuit eye movements. In Proceedings of the 20th acm international conference on
multimodal interaction (p. 268–277). New York, NY, USA: Association for Com-
puting Machinery. Retrieved from https://doi.org/10.1145/3242969.3243008
doi: 10.1145/3242969.3243008
Bafna, T., Bækgaard, P., & Paulin Hansen, J. P. (2021). Eyetell: Tablet-based calibration-
free eye-typing using smooth-pursuit movements. In Acm symposium on eye track-
ing research and applications. New York, NY, USA: Association for Computing
Machinery. Retrieved from https://doi.org/10.1145/3448018.3458015 doi:
10.1145/3448018.3458015
Bee, N., & André, E. (2008). Writing with your eye: A dwell time free writing system
adapted to the nature of human eye gaze. In International tutorial and research
workshop on perception and interactive technologies for speech-based systems (pp.
111–122). Retrieved from https://doi.org/10.1007/978-3-540-69369-7_13
Benligiray, B., Topal, C., & Akinlar, C. (2019). Slicetype: fast gaze typing with a merging
keyboard. Journal on Multimodal User Interfaces,13(4), 321–334. Retrieved from
https://doi.org/10.1007/s12193-018-0285-z
Bi, X., Li, Y., & Zhai, S. (2013). Ffitts law: Modeling finger touch with fitts’
law. In Proceedings of the sigchi conference on human factors in computing sys-
tems (p. 1363–1372). New York, NY, USA: Association for Computing Machin-
ery. Retrieved from https://doi.org/10.1145/2470654.2466180 doi: 10.1145/
109
2470654.2466180
Blankertz, B., Krauledat, M., Dornhege, G., Williamson, J., Murray-Smith, R., &
Müller, K.-R. (2007). A note on brain actuated spelling with the berlin brain-
computer interface. In International conference on universal access in human-
computer interaction (pp. 759–768). Retrieved from https://doi.org/10.1007/
978-3-540-73281-5_83
Blattgerste, J., Renner, P., & Pfeiffer, T. (2018). Advantages of eye-gaze over head-gaze-
based selection in virtual and augmented reality under varying field of views. In
Proceedings of the workshop on communication by gaze interaction. New York, NY,
USA: Association for Computing Machinery. Retrieved from https://doi.org/
10.1145/3206343.3206349 doi: 10.1145/3206343.3206349
Boggs, P. T., & Rogers, J. E. (1990). Orthogonal distance regression. Contemporary
Mathematics,112 , 183–194.
Burke, M., & Barnes, G. (2006). Quantitative differences in smooth pursuit and saccadic
eye movements. Experimental brain research,175(4), 596–608.
Carpenter, R. H. (1988). Movements of the eyes, 2nd rev. Pion Limited.
Collewijn, H., & Tamminga, E. P. (1984). Human smooth and saccadic eye move-
ments during voluntary pursuit of different target motions on different backgrounds.
The Journal of physiology,351(1), 217–250. Retrieved from https://doi.org/
10.1113/jphysiol.1984.sp015242
Cymek, D. H., Venjakob, A. C., Ruff, S., Lutz, O. H.-M., Hofmann, S., & Roetting,
M. (2014). Entering pin codes by smooth pursuit eye movements. Journal of
Eye Movement Research,7(4), 1. Retrieved from https://doi.org/10.16910/
jemr.7.4.1
de Brouwer, S., Yuksel, D., Blohm, G., Missal, M., & Lefèvre, P. (2002). What triggers
catch-up saccades during visual tracking? Journal of neurophysiology,87(3), 1646–
1650. Retrieved from https://doi.org/10.1152/jn.00432.2001
de Hemptinne, C., Lefevre, P., & Missal, M. (2006). Influence of cognitive expectation
on the initiation of anticipatory and visual pursuit eye movements in the rhesus
monkey. Journal of neurophysiology,95(6), 3770–3782. Retrieved from https://
doi.org/10.1152/jn.00007.2006
de Luca, A., Weiss, R., & Drewes, H. (2007). Evaluation of eye-gaze interaction methods
for security enhanced pin-entry. In Proceedings of the 19th australasian conference on
computer-human interaction: Entertaining user interfaces (pp. 199–202). New York,
NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/1324892.1324932
110
doi: 10.1145/1324892.1324932
Diaz-Tula, A., & Morimoto, C. H. (2016). Augkey: Increasing foveal throughput in eye
typing with augmented keys. In Proceedings of the 2016 chi conference on human
factors in computing systems (p. 3533–3544). New York, NY, USA: Association
for Computing Machinery. Retrieved from https://doi.org/10.1145/2858036
.2858517 doi: 10.1145/2858036.2858517
Dodge, R. (1903). Five types of eye movement in the horizontal meridian plane of the field
of regard. American journal of physiology-legacy content,8(4), 307–329. Retrieved
from https://doi.org/10.1152/ajplegacy.1903.8.4.307
Drewes, H., Khamis, M., & Alt, F. (2018). Smooth pursuit target speeds and tra-
jectories. In Proceedings of the 17th international conference on mobile and ubiq-
uitous multimedia (pp. 139–146). New York, NY, USA: ACM. Retrieved from
http://doi.acm.org/10.1145/3282894.3282913 doi: 10.1145/3282894.3282913
Drewes, H., Khamis, M., & Alt, F. (2019). Dialplates: Enabling pursuits-based user inter-
faces with large target numbers. In Proceedings of the 18th international conference
on mobile and ubiquitous multimedia. New York, NY, USA: Association for Com-
puting Machinery. Retrieved from https://doi.org/10.1145/3365610.3365626
doi: 10.1145/3365610.3365626
Drewes, H., Pfeuffer, K., & Alt, F. (2019). Time- and space-efficient eye tracker calibra-
tion. In Proceedings of the 11th acm symposium on eye tracking research & applica-
tions. New York, NY, USA: Association for Computing Machinery. Retrieved from
https://doi.org/10.1145/3314111.3319818 doi: 10.1145/3314111.3319818
Drewes, H., & Schmidt, A. (2007). Interacting with the computer using gaze gestures.
In Ifip conference on human-computer interaction (pp. 475–488). Retrieved from
https://doi.org/10.1007/978-3-540-74800-7_43
Duchowski, A. T. (2017). Eye tracking methodology: Theory and practice (3rd ed.).
Springer.
Duchowski, A. T., Krejtz, K., Krejtz, I., Biele, C., Niedzielska, A., Kiefer, P., ... Gi-
annopoulos, I. (2018). The index of pupillary activity: Measuring cognitive load
vis-à-vis task difficulty with pupil oscillation. In Proceedings of the 2018 chi con-
ference on human factors in computing systems (p. 1–13). New York, NY, USA:
Association for Computing Machinery. Retrieved from https://doi.org/10.1145/
3173574.3173856 doi: 10.1145/3173574.3173856
Eid, M. A., Giakoumidis, N., & El Saddik, A. (2016). A novel eye-gaze-controlled
wheelchair system for navigating unknown environments: Case study with a per-
111
son with als. IEEE Access,4, 558-573. Retrieved from 10.1109/ACCESS.2016
.2520093
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for
discovering clusters in large spatial databases with noise. In In proceedings of the sec-
ond international conference on knowledge discovery and data mining (p. 226–231).
AAAI Press.
Esteves, A., Velloso, E., Bulling, A., & Gellersen, H. (2015). Orbits: Gaze interaction
for smart watches using smooth pursuit eye movements. In Proceedings of the 28th
annual acm symposium on user interface software & technology (p. 457–466). New
York, NY, USA: Association for Computing Machinery. Retrieved from https://
doi.org/10.1145/2807442.2807499 doi: 10.1145/2807442.2807499
Esteves, A., Verweij, D., Suraiya, L., Islam, R., Lee, Y., & Oakley, I. (2017). Smooth-
moves: Smooth pursuits head movements for augmented reality. In Proceed-
ings of the 30th annual acm symposium on user interface software and technol-
ogy (p. 167–178). New York, NY, USA: Association for Computing Machin-
ery. Retrieved from https://doi.org/10.1145/3126594.3126616 doi: 10.1145/
3126594.3126616
Feit, A. M., Williams, S., Toledo, A., Paradiso, A., Kulkarni, H., Kane, S., & Morris,
M. R. (2017). Toward everyday gaze input: Accuracy and precision of eye tracking
and implications for design. In Proceedings of the 2017 chi conference on human
factors in computing systems (p. 1118–1130). New York, NY, USA: Association
for Computing Machinery. Retrieved from https://doi.org/10.1145/3025453
.3025599 doi: 10.1145/3025453.3025599
Feng, W., Zou, J., Kurauchi, A., Morimoto, C. H., & Betke, M. (2021). Hgaze
typing: Head-gesture assisted gaze typing. In Acm symposium on eye track-
ing research and applications. New York, NY, USA: Association for Computing
Machinery. Retrieved from https://doi.org/10.1145/3448017.3457379 doi:
10.1145/3448017.3457379
Fisher, D. F., Monty, R. A., & Senders, J. W. (1981). Eye movements: Cognition and
visual perception (psychology library editions: Perception) (Vol. 8). Routledge.
Freytag, S.-C. (2020). Sweet pursuit: User acceptance and performance of a smooth pur-
suit controlled candy dispensing machine in a public setting. In Acm symposium on
eye tracking research and applications. New York, NY, USA: Association for Com-
puting Machinery. Retrieved from https://doi.org/10.1145/3379156.3391356
doi: 10.1145/3379156.3391356
112
Freytag, S.-C., Venjakob, A. C., & Ruff, S. (2017). Applicability of smooth-pursuit based
gaze interaction for older users. In Proceedings of the cogain symposium. cogain
(Vol. 17).
Fukushima, K., Fukushima, J., Warabi, T., & Barnes, G. R. (2013). Cognitive processes
involved in smooth pursuit eye movements: behavioral evidence, neural substrate
and clinical correlation. Frontiers in systems neuroscience,7, 4. Retrieved from
https://doi.org/10.3389/fnsys.2013.00004
Garay-Vitoria, N., & Abascal, J. (2006). Text prediction systems: a survey. Universal
Access in the Information Society,4, 188–203. Retrieved from https://doi.org/
10.1007/s10209-005-0005-9
Garay-Vitoria, N., & González-Abascal, J. (1997). Intelligent word-prediction to enhance
text input rate (a syntactic analysis-based word-prediction aid for people with se-
vere motor and speech disability). In Proceedings of the 2nd international conference
on intelligent user interfaces (p. 241–244). New York, NY, USA: Association for
Computing Machinery. Retrieved from https://doi.org/10.1145/238218.238333
doi: 10.1145/238218.238333
Göbel, F., Kiefer, P., Giannopoulos, I., Duchowski, A. T., & Raubal, M. (2018). Improving
map reading with gaze-adaptive legends. In Proceedings of the 2018 acm symposium
on eye tracking research & applications (pp. 29:1–29:9). New York, NY, USA: ACM.
Retrieved from http://doi.acm.org/10.1145/3204493.3204544 doi: 10.1145/
3204493.3204544
Hansen, D. W., Skovsgaard, H. H. T., Hansen, J. P., & Møllenbach, E. (2008). Noise
tolerant selection by gaze-controlled pan and zoom in 3d. In Proceedings of the 2008
symposium on eye tracking research & applications (p. 205–212). New York, NY,
USA: Association for Computing Machinery. Retrieved from https://doi.org/
10.1145/1344471.1344521 doi: 10.1145/1344471.1344521
Hansen, J. P., Johansen, A. S., Hansen, D. W., Itoh, K., & Mashino, S. (2003a). Command
without a click: Dwell time typing by mouse and gaze selections. In Interact (Vol. 3,
pp. 121–128).
Hansen, J. P., Johansen, A. S., Hansen, D. W., Itoh, K., & Mashino, S. (2003b). Language
technology in a predictive, restricted on-screen keyboard with ambiguous layout for
severely disabled people. In Proceedings of eacl workshop on language modeling for
text entry methods (p. 59–66).
Hart, S. G., & Staveland, L. E. (1988). Development of nasa-tlx (task load index): Results
of empirical and theoretical research. In Advances in psychology (Vol. 52, pp. 139–
113
183). Elsevier. Retrieved from https://doi.org/10.1016/S0166-4115(08)62386
-9
Holmqvist, K., Nyström, M., Andersson, R., Dewhurst, R., Jarodzka, H., & Van de Weijer,
J. (2011). Eye tracking: A comprehensive guide to methods and measures. OUP
Oxford.
Huckauf, A., & Urbina, M. (2007). Gazing with peye: New concepts in eye typ-
ing. In Proceedings of the 4th symposium on applied perception in graphics and
visualization (p. 141). New York, NY, USA: Association for Computing Machin-
ery. Retrieved from https://doi.org/10.1145/1272582.1272618 doi: 10.1145/
1272582.1272618
Huckauf, A., & Urbina, M. H. (2008). Gazing with peyes: Towards a universal input
for various applications. In Proceedings of the 2008 symposium on eye tracking
research & applications (p. 51–54). New York, NY, USA: Association for Computing
Machinery. Retrieved from https://doi.org/10.1145/1344471.1344483 doi:
10.1145/1344471.1344483
Hwang, C.-S., Weng, H.-H., Wang, L.-F., Tsai, C.-H., & Chang, H.-T. (2014). An eye-
tracking assistive device improves the quality of life for als patients and reduces
the caregivers’ burden. Journal of motor behavior,46(4), 233–238. Retrieved from
https://doi.org/10.1080/00222895.2014.891970
Jacob, R. J. K. (1990). What you look at is what you get: Eye movement-based interaction
techniques. In Proceedings of the sigchi conference on human factors in computing
systems (p. 11–18). New York, NY, USA: Association for Computing Machinery. Re-
trieved from https://doi.org/10.1145/97243.97246 doi: 10.1145/97243.97246
Jacob, R. J. K. (1995). Eye tracking in advanced interface design. In Virtual environments
and advanced interface design (p. 258–288). USA: Oxford University Press, Inc.
Kangas, J., Akkil, D., Rantala, J., Isokoski, P., Majaranta, P., & Raisamo, R. (2014).
Gaze gestures and haptic feedback in mobile devices. In Proceedings of the sigchi
conference on human factors in computing systems (p. 435–438). New York, NY,
USA: Association for Computing Machinery. Retrieved from https://doi.org/
10.1145/2556288.2557040 doi: 10.1145/2556288.2557040
Kangas, J., Špakov, O., Isokoski, P., Akkil, D., Rantala, J., & Raisamo, R. (2016).
Feedback for smooth pursuit gaze tracking based control. In Proceedings of the 7th
augmented human international conference 2016. New York, NY, USA: Association
for Computing Machinery. Retrieved from https://doi.org/10.1145/2875194
.2875209 doi: 10.1145/2875194.2875209
114
Katsini, C., Abdrabou, Y., Raptis, G. E., Khamis, M., & Alt, F. (2020). The
role of eye gaze in security and privacy applications: Survey and future hci re-
search directions. In Proceedings of the 2020 chi conference on human factors in
computing systems (p. 1–21). New York, NY, USA: Association for Computing
Machinery. Retrieved from https://doi.org/10.1145/3313831.3376840 doi:
10.1145/3313831.3376840
Ke, S. R., Lam, J., Pai, D. K., & Spering, M. (2013). Directional asymmetries in human
smooth pursuit eye movements. Investigative ophthalmology & visual science,54 (6),
4409–4421. Retrieved from https://doi.org/10.1167/iovs.12-11369
Khamis, M., Alt, F., & Bulling, A. (2015). A field study on spontaneous gaze-based inter-
action with a public display using pursuits. In Adjunct proceedings of the 2015 acm
international joint conference on pervasive and ubiquitous computing and proceed-
ings of the 2015 acm international symposium on wearable computers (p. 863–872).
New York, NY, USA: Association for Computing Machinery. Retrieved from
https://doi.org/10.1145/2800835.2804335 doi: 10.1145/2800835.2804335
Khamis, M., Oechsner, C., Alt, F., & Bulling, A. (2018). Vrpursuits: Interaction in
virtual reality using smooth pursuit eye movements. In Proceedings of the 2018
international conference on advanced visual interfaces (pp. 18:1–18:8). New York,
NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/3206505.3206522
doi: 10.1145/3206505.3206522
Koester, H. H., & Levine, S. (1997). Keystroke-level models for user performance with
word prediction. Augmentative and Alternative Communication,13(4), 239–257.
Retrieved from https://doi.org/10.1080/07434619712331278068 doi: 10.1080/
07434619712331278068
Koester, H. H., & Levine, S. (1998). Model simulations of user performance with
word prediction. Augmentative and Alternative Communication,14(1), 25–36. Re-
trieved from https://doi.org/10.1080/07434619812331278176 doi: 10.1080/
07434619812331278176
Köpsel, A., Majaranta, P., Isokoski, P., & Huckauf, A. (2016). Effects of auditory,
haptic and visual feedback on performing gestures by gaze or by hand. Behaviour
& Information Technology,35(12), 1044–1062. Retrieved from https://doi.org/
10.1080/0144929X.2016.1194477 doi: 10.1080/0144929X.2016.1194477
Kosch, T., Hassib, M., Woundefinedniak, P. W., Buschek, D., & Alt, F. (2018). Your eyes
tell: Leveraging smooth pursuit for assessing cognitive workload. In Proceedings of
the 2018 chi conference on human factors in computing systems (p. 1–13). New
115
York, NY, USA: Association for Computing Machinery. Retrieved from https://
doi.org/10.1145/3173574.3174010 doi: 10.1145/3173574.3174010
Kristensson, P. O., & Vertanen, K. (2012). The potential of dwell-free eye-typing for
fast assistive gaze communication. In Proceedings of the symposium on eye tracking
research and applications (p. 241–244). New York, NY, USA: Association for Com-
puting Machinery. Retrieved from https://doi.org/10.1145/2168556.2168605
doi: 10.1145/2168556.2168605
Krukowski, A. E., & Stone, L. S. (2005). Expansion of direction space around the cardinal
axes revealed by smooth pursuit eye movements. Neuron,45(2), 315–323. Retrieved
from https://doi.org/10.1016/j.neuron.2005.01.005
Kumar, C., Hedeshy, R., MacKenzie, I. S., & Staab, S. (2020). Tagswipe: Touch assisted
gaze swipe for text entry. In Proceedings of the 2020 chi conference on human factors
in computing systems (p. 1–12). New York, NY, USA: Association for Computing
Machinery. Retrieved from https://doi.org/10.1145/3313831.3376317 doi:
10.1145/3313831.3376317
Kurauchi, A., Feng, W., Joshi, A., Morimoto, C., & Betke, M. (2016). Eyeswipe: Dwell-
free text entry using gaze paths. In Proceedings of the 2016 chi conference on human
factors in computing systems (p. 1952–1956). New York, NY, USA: Association
for Computing Machinery. Retrieved from https://doi.org/10.1145/2858036
.2858335 doi: 10.1145/2858036.2858335
LetterFrequency. (2021, August). Computer qwerty keyboard key frequency. http://
letterfrequency.org/.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and
reversals. In Soviet physics doklady (Vol. 10, pp. 707–710).
Lisberger, S. G. (2015). Visual guidance of smooth pursuit eye movements. Annual
review of vision science,1, 447–468. Retrieved from https://doi.org/10.1146/
annurev-vision-082114-035349
Lisberger, S. G., Morris, E., & Tychsen, L. (1987). Visual motion processing and sensory-
motor integration for smooth pursuit eye movements. Annual review of neuro-
science,10(1), 97–129. Retrieved from https://doi.org/10.1146/annurev.ne
.10.030187.000525
Ludvigh, E., & Miller, J. W. (1958). Study of visual acuity during the ocular pursuit
of moving test objects. i. introduction. Journal of the Optical Society of America,
48(11), 799–802. Retrieved from https://doi.org/10.1364/JOSA.48.000799
Lutz, O. H.-M., Venjakob, A. C., & Ruff, S. (2015). Smoovs: Towards calibration-free
116
text entry by gaze using smooth pursuit movements. Journal of Eye Movement
Research,8(1), 2. Retrieved from https://doi.org/10.16910/jemr.8.1.2
MacKenzie, I. S., & Soukoreff, R. W. (2003). Phrase sets for evaluating text entry
techniques. In Chi ’03 extended abstracts on human factors in computing systems
(p. 754–755). New York, NY, USA: Association for Computing Machinery. Retrieved
from https://doi.org/10.1145/765891.765971 doi: 10.1145/765891.765971
MacKenzie, I. S., & Tanaka-Ishii, K. (2010). Text entry systems: Mobility, accessibility,
universality. Elsevier.
MacKenzie, I. S., & Zhang, X. (2008). Eye typing using word and letter prediction
and a fixation algorithm. In Proceedings of the 2008 symposium on eye tracking
research & applications (p. 55–58). New York, NY, USA: Association for Computing
Machinery. Retrieved from https://doi.org/10.1145/1344471.1344484 doi:
10.1145/1344471.1344484
Majaranta, P., Ahola, U.-K., & Špakov, O. (2009). Fast gaze typing with an ad-
justable dwell time. In Proceedings of the sigchi conference on human factors in
computing systems (p. 357–360). New York, NY, USA: Association for Comput-
ing Machinery. Retrieved from https://doi.org/10.1145/1518701.1518758 doi:
10.1145/1518701.1518758
Majaranta, P., MacKenzie, I. S., Aula, A., & Räihä, K.-J. (2003). Auditory and visual
feedback during eye typing. In Chi ’03 extended abstracts on human factors in
computing systems (p. 766–767). New York, NY, USA: Association for Computing
Machinery. Retrieved from https://doi.org/10.1145/765891.765979 doi: 10
.1145/765891.765979
Majaranta, P., MacKenzie, I. S., Aula, A., & Räihä, K.-J. (2006). Effects of feedback and
dwell time on eye typing speed and accuracy. Universal Access in the Information
Society,5(2), 199–208. Retrieved from https://doi.org/10.1007/s10209-006
-0034-z
Majaranta, P., & Räihä, K.-J. (2002). Twenty years of eye typing: Systems and design
issues. In Proceedings of the 2002 symposium on eye tracking research & applications
(pp. 15–22). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/
10.1145/507072.507076 doi: 10.1145/507072.507076
Martinez-Conde, S., Macknik, S. L., & Hubel, D. H. (2004). The role of fixational
eye movements in visual perception. Nature reviews neuroscience,5(3), 229–240.
Retrieved from https://doi.org/10.1038/nrn1348
Menges, R., Kumar, C., Müller, D., & Sengupta, K. (2017). Gazetheweb: A gaze-
117
controlled web browser. In Proceedings of the 14th web for all conference on
the future of accessible work. New York, NY, USA: Association for Computing
Machinery. Retrieved from https://doi.org/10.1145/3058555.3058582 doi:
10.1145/3058555.3058582
Meyer, C. H., Lasker, A. G., & Robinson, D. A. (1985). The upper limit of human
smooth pursuit velocity. Vision research,25(4), 561–563. Retrieved from https://
doi.org/10.1016/0042-6989(85)90160-9
Morimoto, C. H., & Amir, A. (2010). Context switching for fast key selection in text
entry applications. In Proceedings of the 2010 symposium on eye-tracking research
& applications (p. 271–274). New York, NY, USA: Association for Computing
Machinery. Retrieved from https://doi.org/10.1145/1743666.1743730 doi:
10.1145/1743666.1743730
Mott, M. E., Williams, S., Wobbrock, J. O., & Morris, M. R. (2017). Improving dwell-
based gaze typing with dynamic, cascading dwell times. In Proceedings of the 2017
chi conference on human factors in computing systems (p. 2558–2570). New York,
NY, USA: Association for Computing Machinery. Retrieved from https://doi
.org/10.1145/3025453.3025517 doi: 10.1145/3025453.3025517
Neuer, E. (2020). Improving a gaze speller based on smooth pursuit eye movements by
implementing multimodal feedback. (unpublished thesis, TU Berlin)
Nielsen, J. (1994). Usability engineering. Morgan Kaufmann.
Orban de Xivry, J.-J., & Lefevre, P. (2007). Saccades and pursuit: two outcomes of a
single sensorimotor process. The Journal of physiology,584(1), 11–23. Retrieved
from https://doi.org/10.1113/jphysiol.2007.139881
Orhan, U., Erdogmus, D., Roark, B., Oken, B., Purwar, S., E. Hild, K., ... Fried-
Oken, M. (2012). Improved accuracy using recursive bayesian estimation based
language model fusion in erp-based bci typing systems. In 2012 annual international
conference of the ieee engineering in medicine and biology society (p. 2497-2500).
doi: 10.1109/EMBC.2012.6346471
Park, K. (2017, October). Word prediction using convolutional neural networks. https://
github.com/Kyubyong/word_prediction.
Pedrosa, D., Pimentel, M. D. G., Wright, A., & Truong, K. N. (2015, March). Filteryed-
ping: Design challenges and user performance of dwell-free eye typing. ACM Trans.
Access. Comput.,6(1). Retrieved from https://doi.org/10.1145/2724728 doi:
10.1145/2724728
Penkar, A. M., Lutteroth, C., & Weber, G. (2012). Designing for the eye: Design
118
parameters for dwell in gaze interaction. In Proceedings of the 24th australian
computer-human interaction conference (pp. 479–488). New York, NY, USA:
ACM. Retrieved from http://doi.acm.org/10.1145/2414536.2414609 doi:
10.1145/2414536.2414609
Perlin, K. (1998). Quikwriting: continuous stylus-based text entry. In Proceedings of the
11th annual acm symposium on user interface software and technology (pp. 215–
216).
Pfeuffer, K., Vidal, M., Turner, J., Bulling, A., & Gellersen, H. (2013). Pursuit calibration:
Making gaze calibration less tedious and more flexible. In Proceedings of the 26th
annual acm symposium on user interface software and technology (p. 261–270). New
York, NY, USA: Association for Computing Machinery. Retrieved from https://
doi.org/10.1145/2501988.2501998 doi: 10.1145/2501988.2501998
Pi, J., Koljonen, P. A., Hu, Y., & Shi, B. E. (2020). Dynamic bayesian adjustment of dwell
time for faster eye typing. IEEE Transactions on Neural Systems and Rehabilitation
Engineering,28(10), 2315-2324. doi: 10.1109/TNSRE.2020.3016747
Rajanna, V., & Hansen, J. P. (2018). Gaze typing in virtual reality: Impact of keyboard
design, selection method, and motion. In Proceedings of the 2018 acm symposium
on eye tracking research & applications. New York, NY, USA: Association for Com-
puting Machinery. Retrieved from https://doi.org/10.1145/3204493.3204541
doi: 10.1145/3204493.3204541
Rantala, J., Majaranta, P., Kangas, J., Isokoski, P., Akkil, D., Špakov, O., & Raisamo, R.
(2020). Gaze interaction with vibrotactile feedback: Review and design guidelines.
Human–Computer Interaction,35(1), 1–39. Retrieved from https://doi.org/
10.1080/07370024.2017.1306444
Rashbass, C. (1961). The relationship between saccadic and smooth tracking eye move-
ments. The Journal of Physiology,159 (2), 326. Retrieved from https://doi.org/
10.1113/jphysiol.1961.sp006811
Reid, G. B., & Nygren, T. E. (1988). The subjective workload assessment technique:
A scaling procedure for measuring mental workload. Advances in psychology,52,
185–218. doi: https://doi.org/10.1016/S0166-4115(08)62387-0
Robinson, D. A. (1965). The mechanics of human smooth pursuit eye movement.
The Journal of Physiology,180(3), 569–591. Retrieved from https://doi.org/
10.1113/jphysiol.1965.sp007718
Robinson, D. A. (1968). The oculomotor control system: A review. Proceedings of the
IEEE,56(6), 1032–1049. doi: 10.1109/PROC.1968.6455
119
Roucoux, A., Culee, C., & Roucoux, M. (1983). Development of fixation and pursuit eye
movements in human infants. Behavioural brain research,10(1), 133–139. Retrieved
from https://doi.org/10.1016/0166-4328(83)90159-6
Rubio, S., Díaz, E., Martín, J., & Puente, J. M. (2004). Evaluation of subjective mental
workload: A comparison of swat, nasa-tlx, and workload profile methods. Applied
psychology,53(1), 61–86. Retrieved from https://doi.org/10.1111/j.1464-0597
.2004.00161.x
Sengupta, K., Menges, R., Kumar, C., & Staab, S. (2019). Impact of variable positioning
of text prediction in gaze-based text entry. In Proceedings of the 11th acm symposium
on eye tracking research & applications. New York, NY, USA: Association for Com-
puting Machinery. Retrieved from https://doi.org/10.1145/3317956.3318152
doi: 10.1145/3317956.3318152
Sibert, L. E., & Jacob, R. J. K. (2000). Evaluation of eye gaze interaction. In Proceedings
of the sigchi conference on human factors in computing systems (p. 281–288). New
York, NY, USA: Association for Computing Machinery. Retrieved from https://
doi.org/10.1145/332040.332445 doi: 10.1145/332040.332445
Sidenmark, L., Clarke, C., Zhang, X., Phu, J., & Gellersen, H. (2020). Outline pursuits:
Gaze-assisted selection of occluded objects in virtual reality. In Proceedings of the
2020 chi conference on human factors in computing systems (p. 1–13). New York,
NY, USA: Association for Computing Machinery. Retrieved from https://doi
.org/10.1145/3313831.3376438 doi: 10.1145/3313831.3376438
Sidenmark, L., Mardanbegi, D., Gomez, A. R., Clarke, C., & Gellersen, H. (2020). Bi-
modalgaze: Seamlessly refined pointing with gaze and filtered gestural head move-
ment. In Acm symposium on eye tracking research and applications. New York, NY,
USA: Association for Computing Machinery. Retrieved from https://doi.org/
10.1145/3379155.3391312 doi: 10.1145/3379155.3391312
Speier, W., Chandravadia, N., Roberts, D., Pendekanti, S., & Pouratian, N. (2017).
Online bci typing using language model classifiers by als patients in their homes.
Brain-Computer Interfaces,4(1-2), 114–121. Retrieved from https://doi.org/
10.1080/2326263X.2016.1252143
Stampe, D. M., & Reingold, E. M. (1995). Selection by looking: A novel com-
puter interface and its application to psychological research. Studies in Visual
Information Processing,6, 467–478. Retrieved from https://doi.org/10.1016/
S0926-907X(05)80039-X
Stark, L., Vossius, G., & Young, L. R. (1962). Predictive control of eye tracking
120
movements. IRE Transactions on human factors in electronics(2), 52–57. doi:
10.1109/THFE2.1962.4503342
Trnka, K., & McCoy, K. F. (2008). Evaluating word prediction: framing keystroke savings.
In Proceedings of acl-08: Hlt, short papers (pp. 261–264).
Tsang, P. S., & Velazquez, V. L. (1996). Diagnosticity and multidimensional subjective
workload ratings. Ergonomics,39(3), 358–381. Retrieved from https://doi.org/
10.1080/00140139608964470
Tuisku, O., Majaranta, P., Isokoski, P., & Räihä, K.-J. (2008). Now dasher! dash
away! longitudinal study of fast text entry by eye gaze. In Proceedings of the 2008
symposium on eye tracking research & applications (p. 19–26). New York, NY,
USA: Association for Computing Machinery. Retrieved from https://doi.org/
10.1145/1344471.1344476 doi: 10.1145/1344471.1344476
Urbina, M. H., & Huckauf, A. (2010). Alternatives to single character entry and dwell
time selection on eye typing. In Proceedings of the 2010 symposium on eye-tracking
research & applications (p. 315–322). New York, NY, USA: Association for Com-
puting Machinery. Retrieved from https://doi.org/10.1145/1743666.1743738
doi: 10.1145/1743666.1743738
Velloso, E., Coutinho, F. L., Kurauchi, A., & Morimoto, C. H. (2018). Circular orbits
detection for gaze interaction using 2d correlation and profile matching algorithms.
In Proceedings of the 2018 acm symposium on eye tracking research & applica-
tions. New York, NY, USA: Association for Computing Machinery. Retrieved from
https://doi.org/10.1145/3204493.3204524 doi: 10.1145/3204493.3204524
Vidal, M., Bulling, A., & Gellersen, H. (2013). Pursuits: Spontaneous interaction with
displays based on smooth pursuit eye movement and moving targets. In Pro-
ceedings of the 2013 acm international joint conference on pervasive and ubiqui-
tous computing (p. 439–448). New York, NY, USA: Association for Computing
Machinery. Retrieved from https://doi.org/10.1145/2493432.2493477 doi:
10.1145/2493432.2493477
Špakov, O., Isokoski, P., Kangas, J., Akkil, D., & Majaranta, P. (2016). Pursuitad-
juster: An exploration into the design space of smooth pursuit based widgets. In
Proceedings of the ninth biennial acm symposium on eye tracking research & appli-
cations (p. 287–290). New York, NY, USA: Association for Computing Machinery.
Retrieved from https://doi.org/10.1145/2857491.2857526
Špakov, O., & Miniotas, D. (2004). On-line adjustment of dwell time for target se-
lection by gaze. In Proceedings of the third nordic conference on human-computer
121
interaction (p. 203–206). New York, NY, USA: Association for Computing Machin-
ery. Retrieved from https://doi.org/10.1145/1028014.1028045 doi: 10.1145/
1028014.1028045
Ward, D. J., Blackwell, A. F., & MacKay, D. J. (2000). Dasher—a data entry interface
using continuous gestures and language models. In Proceedings of the 13th annual
acm symposium on user interface software and technology (pp. 129–137).
Ward, D. J., & MacKay, D. J. (2002). Fast hands-free writing by gaze direction. Nature,
418(6900), 838–838. Retrieved from https://doi.org/10.1038/418838a
Westheimer, G., & McKee, S. P. (1975). Visual acuity in the presence of retinal-image
motion. Journal of the Optical Society of America,65(7), 847–850. Retrieved from
https://doi.org/10.1364/JOSA.65.000847
Westheimer, G., & McKee, S. P. (1978). Stereoscopic acuity for moving retinal images.
Journal of the Optical Society of America,68(4), 450–455. Retrieved from https://
doi.org/10.1364/JOSA.68.000450
Wickens, C. D. (2002). Multiple resources and performance prediction. Theoretical issues
in ergonomics science,3(2), 159–177. Retrieved from https://doi.org/10.1080/
14639220210123806
Wobbrock, J. O., Findlater, L., Gergle, D., & Higgins, J. J. (2011). The aligned
rank transform for nonparametric factorial analyses using only anova procedures.
In Proceedings of the sigchi conference on human factors in computing systems
(p. 143–146). New York, NY, USA: Association for Computing Machinery. Re-
trieved from https://doi.org/10.1145/1978942.1978963
Wobbrock, J. O., Rubinstein, J., Sawyer, M. W., & Duchowski, A. T. (2008). Longitudinal
evaluation of discrete consecutive gaze gestures for text entry. In Proceedings of the
2008 symposium on eye tracking research & applications (p. 11–18). New York, NY,
USA: Association for Computing Machinery. Retrieved from https://doi.org/
10.1145/1344471.1344475 doi: 10.1145/1344471.1344475
Wolfe, J. M., & Horowitz, T. S. (2017). Five factors that guide attention in visual search.
Nature Human Behaviour,1(3), 0058. Retrieved from https://doi.org/10.1038/
s41562-017-0058
Young, L. R. (1971). The control of eye movements. Academic Press New York.
Zeng, Z., & Roetting, M. (2018). A text entry interface using smooth pursuit movements
and language model. In Proceedings of the 2018 acm symposium on eye tracking
research & applications (pp. 69:1–69:2). New York, NY, USA: ACM. Retrieved from
http://doi.acm.org/10.1145/3204493.3207413 doi: 10.1145/3204493.3207413
122
Zeng, Z., Siebert, F. W., Venjakob, A. C., & Roetting, M. (2020). Calibration-free gaze
interfaces based on linear smooth pursuit. Journal of Eye Movement Research,
13(1), 3. doi: 10.16910/jemr.13.1.3
Zhang, G., Hansen, J. P., & Minakata, K. (2019). Hand- and gaze-control of telep-
resence robots. In Proceedings of the 11th acm symposium on eye tracking re-
search & applications (pp. 70:1–70:8). New York, NY, USA: ACM. Retrieved from
http://doi.acm.org/10.1145/3317956.3318149 doi: 10.1145/3317956.3318149
Zhang, X., Sugano, Y., & Bulling, A. (2019). Evaluation of appearance-based methods and
implications for gaze-based applications. In Proceedings of the 2019 chi conference
on human factors in computing systems (pp. 1–13).
123
Appendix A Documents
A.1 Questionnaire for study 1
125
1
Fragebogen zu demographischen Angaben
Personencode: _______________
Bitte füllen Sie einige Angaben zu Ihrer Person aus.
Bitte geben Sie Ihr Alter an:
Beruf
Geschlecht
Weiblich
Männlich
Anderes
Haben Sie bereits Erfahrungen oder Vorkenntnisse im Bereich Blickbewegungs-
messung?
Ja,
Nein
Haben Sie bereits Erfahrungen mit Blickinteraktion?
Ja,
Nein
Tragen Sie Sehhilfe?
Ja,
Nein
Sind Sie von Natur aus Rechts- oder Linkshänder?
Rechtshänder
Linkshänder
Sind Ihre Augen zur Zeit in höherem Maße beansprucht?
nein, gar nicht
eher nein
teils, teils
eher ja
ja, genau
2
Bitte geben Sie Ihre Einschätzung ab. Kreuzen Sie Zutreffendes an.
Wird sich die Anzahl der Objekte auf die Verfolgung der Ziele auswirken?
Nein
Wenn Ja
Welche der folgenden Benutzerschnittstelle bevorzugen Sie? Die
Benutzerschnittstelle besteht aus:
weniger als 6 bewegten Objekte
6 bewegten Objekte
8 bewegten Objekte
10 bewegten Objekte
12 bewegten Objekte
15 bewegten Objekte
mehr als 15 bewegten Objekte
Warum? ______________________________________________________________
Wird sich die Geschwindigkeit von bewegten Objekten auf die Verfolgung der
Ziele auswirken?
Nein
Wenn Ja
Welches gefällt Ihnen besser?
Schnellere Geschwindigkeit
Langsamere Geschwindigkeit
Warum? ______________________________________________________________
Wird sich die Größe von bewegten Objekten auf die Verfolgung der Ziele
auswirken?
Nein
Wenn Ja
Wie empfinden Sie die Größe der Objekte im Experiment?
zu klein
□
□
□
zu groß
Warum? ______________________________________________________________
A.2 Visualization moving trajectories from one partici-
pant
128
Appendix B
B.1 Open questions for study 2 - German version
134
Sind Ihnen Unterschiede in den vier Benutzerschnittstellen aufgefallen?
Nein
Wenn Ja
1. Beschreiben Sie bitte was Ihnen aufgefallen ist.
2. Bringen Sie bitte die vier Benutzerschnittstellen in eine Rangfolge. Beginnen Sie mit der für
Sie präferierten Benutzerschnittstelle:
Kein Feedback
Auditives Feedback
Visuelles Feedback
Auditiv-Visuelles Feedback
Beschreiben Sie bitte warum Ihnen diese Schnittstelle am meisten gefallen hat:
Beschreiben Sie bitte warum Ihnen diese Schnittstelle am wenigsten gefallen hat:
Haben Sie Verbesserungsvorschläge für zukünftige Benutzerschnittstellen dieser Art?
Wenn ja, welche?
Wie haben Sie die Geschwindigkeit der Objekte wahrgenommen? War diese zu schnell /
zu langsam / genau richtig?
Wie haben Sie die Größe der Objekte wahrgenommen? War diese zu groß/klein/genau
richtig?
Hatten Sie genug Zeit die gewünschten Objekte zu suchen?
B.2 Phrases for study 2
Phrases used in study 2.
1. Kanada hat zehn Provinzen
2. Der Bus war sehr voll
3. Du bist ein gutes Beispiel
4. Dein Vortrag war interessant
5. Diese Uhr ist zu teuer
6. Er hat sieben Mal angerufen
7. Morgen scheint die Sonne
8. Du musst einen Termin machen
9. Gehst du gerne Zelten?
10. Das Bad ist gut zum Lesen
11. Ich stimme dir zu
12. Das ist eine sehr gute Idee
13. Bitte folge den Regeln
14. Ich spiele gerne Tennis
15. Willst du das wirklich?
16. Diese Lederjacke ist zu warm
17. Der Benzinpreis ist hoch
18. Die Kinder spielen
19. Ich treffe dich gegen Mittag
20. Heute ist es sehr windig
138
Appendix C Documents
C.1 NASA TLX Scale
139
NASA Task Load Index
Hart and Staveland’s NASA Task Load Index (TLX) method assesses
work load on five 7-point scales. Increments of high, medium and low
estimates for each point result in 21 gradations on the scales.
Code
Task
Date
Mental Demand How mentally demanding was the task?
Very Low Very High
Physical Demand How physically demanding was the task?
Very Low Very High
Temporal Demand How hurried or rushed was the pace of the task?
Very Low Very High
Performance How successful were you in accomplishing what
you were asked to do?
Perfect Failure
Effort How hard did you have to work to accomplish
your level of performance?
Very Low Very High
Frustration How insecure, discouraged, irritated, stressed,
and annoyed were you?
Very Low Very High
C.2 Phrases for study 3
Phrases used in study 3.
1. thank you for your help
2. the trains are always late
3. I can play much better now
4. you must make an appointment
5. what you see is what you get
6. players must know all the rules
7. time to go shopping
8. this is a very good idea
9. this equation is too complicated
10. I like to play tennis
11. can we play cards tonight
12. have a good weekend
13. I agree with you
14. healthy food is good for you
15. I will meet you at noon
16. my fingers are very cold
17. you are a wonderful example
18. the children are playing
141