Detecting when Users Disagree with Generated Captions

Author: Bhatti, Omair Shahzad; Sriram, Harshinee; Mohamed Selim, Abdulrahman; Conati, Cristina; Barz, Michael; Sonntag, Daniel
Publisher: Zenodo
DOI: 10.1145/3686215.3688382
Source: https://zenodo.org/records/17258236/files/3686215.3688382.pdf
Omair Shahzad Bhatti
Interactive Machine Learning, German Research Center for Artificial Intelligence (DFKI), Germany
omair_shahzad.bhatti@dfki.de

Cristina Conati
University of British Columbia, Canada
[email protected]

Harshinee Sriram
University of British Columbia, Canada
[email protected]

Michael Barz
Interactive Machine Learning, German Research Center for Artificial Intelligence (DFKI), Germany
Applied Artificial Intelligence, University of Oldenburg, Germany
michael.barz@dfki.de

Abdulrahman Mohamed Selim
Interactive Machine Learning, German Research Center for Artificial Intelligence (DFKI), Germany
abdulrahman.mohamed@dfki.de

Daniel Sonntag
Interactive Machine Learning, German Research Center for Artificial Intelligence (DFKI), Germany
Applied Artificial Intelligence, University of Oldenburg, Germany
daniel.sonntag@dfki.de
Abstract
The pervasive integration of artificial intelligence (AI) into daily life has led to a growing interest in AI agents that can learn continuously. Interactive Machine Learning (IML) has emerged as a promising approach to meet this need, essentially involving human experts in the model training process, often through iterative user feedback. However, repeated feedback requests can lead to frustration and reduced trust in the system. Hence, there is increasing interest in refining how these systems interact with users to ensure efficiency without compromising user experience. Our research investigates the potential of eye tracking data as an implicit feedback mechanism to detect user disagreement with AI-generated captions in image captioning systems. We conducted a study with 30 participants using a simulated captioning interface and gathered their eye movement data as they assessed caption accuracy. The goal of the study was to determine whether eye tracking data can predict user agreement or disagreement effectively, thereby strengthening IML frameworks. Our findings reveal that, while eye tracking shows promise as a valuable feedback source, ensuring consistent and reliable model performance across diverse users remains a challenge.
CCS Concepts
• Human-centered computing → User studies; User models; • Computing methodologies → Supervised learning by classification.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICMI Companion ’24, November 04–08, 2024, San Jose, Costa Rica
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0463-5/24/11
https://doi.org/10.1145/3686215.3688382
Keywords
disagreement detection, interactive machine learning, eye tracking, gaze, emotion detection, user disagreement
ACM Reference Format:
Omair Shahzad Bhatti, Harshinee Sriram, Abdulrahman Mohamed Selim, Cristina Conati, Michael Barz, and Daniel Sonntag. 2024. Detecting when Users Disagree with Generated Captions. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION (ICMI Companion ’24), November 04–08, 2024, San Jose, Costa Rica. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3686215.3688382
1 Introduction
As the use of Artificial Intelligence (AI) increases in various aspects of daily life, there is also a growing demand for AI-based systems to operate autonomously while considering user preferences and feedback. A promising concept to meet this demand is interactive Machine Learning (IML). This approach allows users, including non-experts, to dynamically and incrementally steer and train models by integrating human feedback into machine learning processes [1]. Much like IML, many AI systems across different fields rely on explicit feedback to learn user preferences and adapt system behavior. However, frequent requests for explicit feedback can quickly become a source of frustration for users [7], reducing their trust in the AI system and negatively impacting their perception of its accuracy [13].
Existing literature has proposed strategies to improve interaction within IML systems [9, 12, 32]. Specifically, works such as Dudley and Kristensson [9] argue for minimizing the frequency of user-AI interactions, suggesting that feedback should only be requested when it is critical to the model’s learning process. Motivated by this perspective, we explore implicit signals, such as eye tracking, as a way to capture user feedback. Eye tracking is an unobtrusive method for capturing a user’s gaze and can provide insight into the user’s cognitive processing and focus of attention. These implicit signals could help identify instances of perceived agreement or disagreement with the output of an AI system, providing valuable feedback for system improvement and potentially helping decide when to trigger explicit feedback requests. This approach could reduce user effort and make interactions with AI systems more user-friendly.
Our work focuses on an image captioning scenario, where AI systems generate textual descriptions (captions) of images. Despite recent advancements, these models can generate incorrect captions, leading to disagreement with the AI system’s output. We aim to predict these occurrences using implicit signals. Therefore, we conducted a user study with thirty participants who interacted with a simulated image captioning system while we captured their eye movements and video-recorded their faces. Participants were instructed to evaluate captions (half of which were intentionally flawed) using images and captions sourced from the FOIL-COCO dataset [28]. By observing participants’ reactions to correct and intentionally incorrect captions, we aimed to identify markers of disagreement. Specifically, the contributions of this work are as follows:
(1) A dataset with 30 participants’ interactions with image-caption pairs, recording eye tracking and pupil dilation data, along with their binary ratings indicating agreement or disagreement with the generated captions.
(2) A cross-user experiment implementing a Leave-One-User-Out 10-fold cross-validation approach to examine the potential for generalizability in disagreement detection models across different users.
(3) A within-user experiment aimed at investigating the effectiveness of personalized model adaptations, assessing whether customized models can enhance the performance of disagreement detection in a user-specific context.
2 Background & Related Work
Central to our research is the notion of user disagreement, which we define as instances where the system’s output during a specific task does not align with a user’s expectation. These instances can serve as potential feedback signals or triggers to request further feedback in machine learning systems. To enhance our understanding of disagreement, we draw on relevant research from affective computing and try to find a connection to affective states.
Existing literature has shown that human gaze and facial expressions can be used for affect recognition [16, 33] and serve as sources of implicit user feedback [2, 3, 27]. For instance, Lallé et al. [15] utilize gaze data to predict states of confusion. According to D’Mello and Graesser [10], confusion “is hypothesized to occur when there is a mismatch between incoming information and prior knowledge [...], thereby initiating cognitive disequilibrium” (p. 292). Thus, we hypothesize that signs of confusion could act as indicators of user disagreement with a model’s output. This state of confusion intersects with our understanding of user disagreement, which is essentially a mismatch between what users expect and what the system delivers. Nonetheless, while confusion might signal disagreement, not every case of disagreement is necessarily tied to confusion. Further, Pollak et al. [22] investigated the use of facial emotion recognition technologies to distinguish between user satisfaction and dissatisfaction, thereby establishing a clear relationship between emotional responses and user feedback. Inspired by their findings, our study is specifically designed to elicit user disagreement and uses eye trackers and cameras to record user behavior and reactions.
Early research on confusion detection originates from the field of educational computing [6, 8], where detection techniques often involve analyzing students’ facial expressions, posture or interface interaction, and studying behavior. Pachman et al. [21] propose using gaze data for predicting confusion in digital learning environments, by tracking the progression of the user’s puzzle-solving tasks. Their goal was to detect the buildup of confusion during the problem-solving process. Our focus shifts from these studies by concentrating on the immediate affective state of confusion that results from the user processing the information of the model’s output. Detecting this type of immediate confusion is especially relevant in Human-Computer Interaction (HCI) as it impacts user experience and satisfaction [19]. Salminen et al. [23] develop a confusion predictor partly derived from gaze data within their persona information visualization tool, using metrics such as the number of fixations, transitions between Areas-of-Interest (AOIs), and users’ demographic information to predict confusion with 80% accuracy. They later enhance this predictor by solely using gaze data, achieving a 70% accuracy rate in identifying confusion, which rises to 99% when demographic details are integrated. This indicates a strong correlation between demographic factors and confusion instances. Notably, they highlight that confusion predominantly affects inexperienced, older male users in contrast to younger participants, a finding that hints at a possible correlation between confusion and demographic traits such as age and gender. However, while demographic features can help model the frequency of confusion across different user groups, they may not be as effective for real-time confusion monitoring.
Lallé et al. [15] created a predictor of user confusion during interaction with ValueChart, an interactive data visualization tool designed to aid users in making well-informed decisions (such as finding rental property) aligned with their preferences. In their study, with 136 participants, gaze and mouse movement were collected as users performed tasks with the tool. Users could indicate confusion by clicking a dedicated button in the top-right corner of the tool’s interface. Using a Random Forest Classifier, the authors’ model achieved a 61% accuracy in predicting confusion. A more recent contribution from the same group [29] uses deep learning based on raw eye movements to predict confusion on the same dataset as [15]. They shifted from pre-processed features to raw sequential gaze data, fed into a Recurrent Neural Network (RNN). According to the authors, this method enabled the RNN to uncover subtle patterns indicative of confusion, outperforming the previous model with an accuracy of 82%, a noteworthy improvement over the initial 61%. The success of this approach supports the potential of combining deep learning with unprocessed sequential gaze data for more accurate affect recognition. However, the dataset is highly imbalanced, with instances of no confusion overwhelmingly outnumbering confusion cases (99% vs 1%). This skew could potentially bias the model’s ability to identify confusion accurately. Additionally, the interface’s confusion self-report button might affect users’ gaze behavior, introducing further data collection complexities. In response to these issues, we propose utilizing a handheld trigger to capture user feedback to minimize disruption to their gaze [5]. To further enhance the reliability of our study, we aim to balance cases of agreement and disagreement, thus addressing in our data collection study the disproportion found in the earlier dataset.

Figure 1: (1) User interacts with an IML system; (2) a predictor picks up that the user disagrees with the output; (3) the IML system reacts by returning an alternative solution or (4) triggers a feedback request. Steps (3) and (4) illustrate possible future integration in an IML system.
2.1 Application in Machine Learning
The concept of implicit feedback for artificial agents is a recent idea, with limited literature on the topic. In an explorative study, Pollak et al. [22] investigated the potential of user emotional feedback serving as a reward signal for a reinforcement learning agent. This feedback, determined by facial emotion recognition, was designed to reflect the user’s satisfaction level, categorizing emotions into negative (such as 'angry', 'disgust', 'fear', 'sad'), positive ('happy'), or neutral ('neutral', 'surprise') categories according to [11]. They enabled a user to control a virtual drone, which, informed by the user’s emotional reactions, adapted its movements to align with correct actions. Their preliminary results indicate that emotional feedback could indeed be integrated as a functional part of a reinforcement learning agent’s reward system. However, they also observed significant variances in the intensity of emotional feedback from participant to participant, which presented challenges in distinguishing between positive and negative reactions accurately.
Krause and Vossen [14] suggest using signs of user confusion or uncertainty as cues to provide explanations in interactions between humans and AI agents. They argue that explanations should be provided not only when the user explicitly asks for them but also when the system identifies signs of the user’s uncertainty or confusion. Further, they identify additional triggers, such as belief conflicts or misunderstandings of the agent’s output, which are in line with the indicators of user disagreement that our research aims to explore. While the approach of Krause and Vossen [14] focuses on delivering explanations, we propose integrating these indicators into the interactive machine learning cycle. This integration might involve providing alternative model outputs and, when necessary, soliciting additional user feedback to facilitate continuous learning (see Figure 1).
3 Data Collection
In this section, we detail our data collection study. This study was designed to explore the potential of eye movement as an implicit signal for detecting user disagreement in machine learning interaction. We are particularly interested in the context of image captioning tasks, as this setting allows for the occurrence of disagreements due to model-generated errors or unsuitable captions. Hence, in the study, we presented participants with a series of images and their associated captions from the FOIL-COCO dataset [28], with half of the captions intentionally containing errors to elicit disagreement. Meanwhile, the participants were recorded using an eye tracker and a camera. Next, we provide information about the participants, the specifics of the task, the stimuli involved, the apparatus setup, and the procedure followed for data collection. Finally, an overview of the collected dataset is presented.
3.1 Participants
We designed and conducted this study following guidelines provided by our Ethics Committee. The experiment was reviewed and approved by the Committee before any recruitment or data collection. We recruited 31 potential participants via email and university postings. However, due to complications with the eye tracker, one participant’s data could not be included, resulting in a final count of 30 participants (21 males, 9 females, avg. age 26.4). Nineteen participants had used an eye tracker before. Each participant was fluent in English and had normal or corrected-to-normal vision. For their contributions, participants were compensated at a rate of 15 Euros per hour. The study took around 60 minutes.
3.2 Task
Participants in our study were primarily asked to rate how accurately each caption described its paired image. They were given a series of images, each linked with a corresponding caption derived from the FOIL-COCO dataset. Half of these captions contained deliberate errors to mitigate the issue of data imbalance prevalent in related research. As participants viewed each image-caption pair, they were instructed to provide a binary rating of 'agree' or 'disagree' based on their perception of the caption’s correctness. Once a rating was provided, they could proceed to the next pair in the series.
3.3 Stimuli
We selected a total of 154 images, 134 from the FOIL-COCO training set and 20 from the FOIL-COCO validation set. The FOIL-COCO dataset builds upon the standard COCO dataset by providing 'foil' captions, which are identical to the original captions but contain one intentional error. To ensure a diverse range of categories, we included two images from each of the dataset’s supercategories. We primarily selected captions around ten words in length and standardized the image resolution across all stimuli. Participants were split into two groups, Group A and Group B. Both groups were presented with the same images to maintain uniform visual stimuli. The key distinction between the two groups’ experiences was the presentation of 'foil' captions: if Group A saw the correct caption for a given image, Group B would see the foil caption for that same image, and vice versa. To reduce order effects, we randomized the image sequence for each participant.
Figure 2: Screenshot of the study interface showcasing an image-caption pair. Here the induced error in the caption is the word 'keyboard' instead of 'phone'.
3.4 Apparatus
The setup included a Tobii Pro Fusion eye tracker (footnote 1), operating at 250 Hz, mounted on a 27-inch monitor. Directly below the eye tracker, a Luxonis OAK-D [17] camera was positioned to record the participant. The interface interaction was facilitated using a Logitech Presenter, selected for its intuitive design and the ability to be used without diverting gaze from the screen, minimizing influence on gaze behavior. Consistent lighting was maintained across sessions to reduce the impact on pupil dilation. A height-adjustable table was used to optimize eye tracker accuracy by accommodating varying participant heights. Moreover, the participant-to-screen distance was controlled at 60 cm.
3.5 Procedure
As participants arrived, they were first presented with a consent form, which they signed to acknowledge their voluntary participation and understanding of the study’s nature. They were also asked to fill out a demographic questionnaire to collect relevant background information. Following the acquisition of consent, participants received a thorough briefing about the task at hand. This briefing included the key detail that the captions they were to evaluate had been generated by an Image Captioning Model. After making sure that participants fully grasped the task and its objectives, we introduced them to the study system. We then proceeded with the calibration of the eye tracking device, utilizing the Tobii Pro Eye Tracker Manager for a precise 9-point calibration process. To confirm the accuracy of the eye tracker, we manually checked the data using the provided gaze visualization tool, performing recalibrations when necessary.
Before beginning the main task, a training phase was conducted to familiarize participants with the study environment and procedure. A countdown on the interface was shown at the center of the screen. This ensured that all participants began their task with their gaze focused on the center. Upon completion of the countdown, an image-caption pair was presented. Participants were instructed to determine the correctness of the caption and then proceed to the next screen to enter their decision. Participants advanced to the main task phase once they confirmed their understanding of these instructions. Upon its completion, a debriefing session was conducted, and participants were compensated for their contribution to the study.

1 https://www.tobii.com/products/eye-trackers/screen-based/tobii-pro-fusion [Accessed 16-08-2024].
3.6 Dataset Overview
Our dataset consists of 4,620 samples (30 participants × 154 image-caption pairs). In line with the study’s design, the dataset was structured to achieve an even distribution of perceived correctness, targeting a 50/50 split between agreement and disagreement with the presented image-caption pairs. This objective is reflected in the dataset, with 50.5% of the samples rated as incorrect (disagree) and 49.5% as correct (agree). The average accuracy compared to ground truth from the FOIL-COCO dataset across all participants was 90.24% with a variance of 8.8%. Additionally, the response times across trials indicated an average duration of 5.16 seconds per trial, a median duration of 4.45 seconds, and a standard deviation of 2.92 seconds. Lastly, the robustness of the eye tracking data was confirmed, with a minimum recognized gaze signal rate of 91% and an average rate of 98.5%. The dataset is available at https://github.com/DFKI-Interactive-Machine-Learning/disagreement-detection-dataset.
4 Method
In the following section, we describe the preprocessing of the eye tracking data and the feature extraction method applied. The preprocessing was aimed at preparing the data for the feature extraction process, while the latter focused on identifying the attributes from the eye tracking data that could be used for classification. Additionally, the classification methods proposed for disagreement detection are outlined. This includes traditional machine learning algorithms as well as a deep learning method based on VTNet [29], designed specifically to process raw gaze data.
4.1 Data Preprocessing
During preprocessing, we addressed inconsistencies in the dataset’s timestamps. Initially, these timestamps varied slightly, with differences ranging from 3.99 to 4.01 milliseconds. These were adjusted to a fixed interval of exactly 4 milliseconds to establish temporal consistency across the dataset. Gaze points were then computed as the average of the gaze coordinates obtained from each eye. In circumstances where data from one eye was missing, the gaze point was inferred from the coordinates of the other eye, ensuring continuous data representation.
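
The two preprocessing steps above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names, the tuple representation of a gaze sample, and the use of None for a missing eye are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: an (x, y) pair per eye, or None if the eye was not tracked.
Point = Optional[Tuple[float, float]]

SAMPLE_INTERVAL_MS = 4.0  # fixed interval for a 250 Hz tracker


def regular_timestamps(n_samples: int, start_ms: float = 0.0) -> List[float]:
    """Replace slightly jittery timestamps with an exact 4 ms grid."""
    return [start_ms + i * SAMPLE_INTERVAL_MS for i in range(n_samples)]


def combine_eyes(left: Point, right: Point) -> Point:
    """Average both eyes; fall back to the available eye if one is missing."""
    if left is not None and right is not None:
        return ((left[0] + right[0]) / 2.0, (left[1] + right[1]) / 2.0)
    return left if left is not None else right
```

If both eyes are missing, `combine_eyes` returns None, so downstream steps still have to decide how to treat such gaps.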
For the categorization of gaze events into fixations and saccades, our implementation closely followed the methodology outlined by Tobii [20]. We classified fixations using the Identification by Velocity Threshold (I-VT) [25] fixation detection algorithm. This method classifies fixations as sequences of the raw gaze signal where gaze velocity stays below a predefined threshold, indicating a relatively stable gaze. We chose the Savitzky-Golay [26] filter for calculating velocities, with an order of 2 and a span of 40 ms, following recommendations from literature [31]. A default velocity threshold of 20 degrees/s was applied. However, after manually inspecting the event detection results, we adjusted the threshold to 30 degrees/s for three participants whose data exhibited higher noise levels. In addition, for each fixation calculated, we determined the location of its occurrence in relation to the predefined AOIs: the image, the caption, or the background.

Table 1: Detailed Gaze, Pupil, and Transition Features. Features marked with an asterisk (*) are calculated per Area of Interest (AOI), including the whole visit, visits on the Image, and visits on the Caption.

Category          Description
Fixation-based*   Total number of fixations
                  Number of fixations per unit time
                  Average duration of fixations
                  Standard deviation of fixation durations
                  Total duration of all fixations
Saccade-based*    Average length of saccades
                  Standard deviation of saccade lengths
                  Average of relative angles of saccades
                  Standard deviation of relative saccade angles
                  Average of absolute saccade angles
                  Standard deviation of absolute saccade angles
Pupil*            Average width of the left pupil
                  Standard deviation of left pupil width
                  Maximum width of the left pupil
                  Minimum width of the left pupil
                  Average width of the right pupil
                  Standard deviation of right pupil width
                  Maximum width of the right pupil
                  Minimum width of the right pupil
                  Pupil width of the left eye at the first fixation
                  Pupil width of the left eye at the last fixation
                  Pupil width of the right eye at the first fixation
                  Pupil width of the right eye at the last fixation
Transition        Number of transitions between AOIs
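
To make the velocity-threshold idea behind I-VT concrete, the following sketch labels samples as fixation or saccade and groups consecutive fixation samples into runs. It is a simplified illustration under stated assumptions: gaze positions are assumed to be already expressed in degrees of visual angle, velocity is a plain point-to-point difference (the Savitzky-Golay filtering described above is omitted), and the function name is hypothetical.

```python
import math
from typing import List, Tuple


def ivt_fixations(
    gaze_deg: List[Tuple[float, float]],
    dt_s: float = 0.004,            # 4 ms between samples (250 Hz)
    threshold_deg_s: float = 20.0,  # default threshold named in the text
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) sample indices of each fixation run."""
    if len(gaze_deg) < 2:
        return []
    # Point-to-point angular velocity in degrees per second.
    # Simplification: no Savitzky-Golay smoothing is applied here.
    vel = [
        math.dist(gaze_deg[i], gaze_deg[i - 1]) / dt_s
        for i in range(1, len(gaze_deg))
    ]
    vel.insert(0, vel[0])  # the first sample reuses the first velocity

    runs: List[Tuple[int, int]] = []
    start = None
    for i, v in enumerate(vel):
        if v < threshold_deg_s:
            if start is None:
                start = i  # a new fixation run begins
        elif start is not None:
            runs.append((start, i - 1))  # a fast sample ends the run
            start = None
    if start is not None:
        runs.append((start, len(vel) - 1))
    return runs
```

Raising `threshold_deg_s` to 30.0 corresponds to the adjustment made for the three noisier recordings. A complete pipeline would typically also merge nearby fixations and discard very short ones, which this sketch does not do.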
4.2 Features
Our features are retrieved from Barz et al. [2] and Lallé et al. [15]. These features are calculated across various segments of the user interface for the entire duration of each task, focusing on areas where users’ attention is most indicative of their decision-making process. We concentrate on three AOIs: the whole screen, the image area, and the caption area, as shown in Figure 2.
The selection of these AOIs is strategic, as they represent key elements where user interaction is most telling of their agreement or disagreement. The whole screen provides a general overview of user engagement, the image area relates to visual content processing, and the caption area pertains to textual content processing.
The features are categorized into three groups: Fixation-based, Saccade-based, and Pupil-based, each providing a different perspective on the user’s gaze behavior. Fixation-based features, for example, might indicate the points of interest or confusion, while saccade-based features could reflect the user’s search patterns or hesitations. Pupil-based features offer an additional layer, potentially correlating with cognitive load or emotional response. Additionally, Transition features are included to capture the dynamic aspect of user interaction, tracking how users move between different AOIs. These movements can be revealing of how users process and evaluate the image-caption pairs.
Table 1 presents a detailed breakdown of these features, including parameters such as fixation count, fixation rate, mean and standard deviation of fixation durations, saccade lengths, angles, and various pupil dimensions. Each feature marked with an asterisk (*) indicates calculation on a per-AOI basis.
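
As an illustration of how the fixation-based and transition features could be derived per AOI, consider the sketch below. The data layout (a fixation as a duration plus an AOI label), the AOI label strings, the use of the population standard deviation, and all function names are assumptions for the example and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Fixation = Tuple[float, str]  # (duration in ms, AOI label), a hypothetical layout


def fixation_features(fixations: List[Fixation], trial_ms: float) -> Dict[str, float]:
    """Fixation-based features for one set of fixations."""
    durations = [d for d, _ in fixations]
    if not durations:
        return {"count": 0, "rate": 0.0, "mean": 0.0, "std": 0.0, "total": 0.0}
    return {
        "count": len(durations),
        "rate": len(durations) / trial_ms,  # fixations per millisecond
        "mean": mean(durations),
        "std": pstdev(durations),           # population SD (assumption)
        "total": sum(durations),
    }


def per_aoi_features(fixations: List[Fixation], trial_ms: float) -> Dict[str, Dict[str, float]]:
    """Compute the same features for the whole screen, the image, and the caption."""
    out = {"whole": fixation_features(fixations, trial_ms)}
    for aoi in ("image", "caption"):
        subset = [f for f in fixations if f[1] == aoi]
        out[aoi] = fixation_features(subset, trial_ms)
    return out


def count_transitions(fixations: List[Fixation]) -> int:
    """Number of times consecutive fixations land in different AOIs."""
    labels = [aoi for _, aoi in fixations]
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)
```

Saccade and pupil features would follow the same pattern, with the per-AOI subsets reused for each feature family.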
4.3 Feature-based predictor
The algorithms Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Logistic Regression (LR) were utilized to develop models for predicting user disagreement from eye-tracking features. These methods are widely recognized for their robust performance in handling eye-tracking data [2, 4, 15, 24]. For each algorithm, hyperparameter tuning was conducted to optimize the models. For XGBoost, the hyperparameters that were considered included the number of estimators (100, 200, 300), learning rate (0.01, 0.1), maximum depth (3, 6), and minimum child weight (1, 2). Random Forest models were tuned using the number of estimators (100, 200, 300), maximum depth (None, 10, 20, 30), and the number of features considered at each split (None, 'sqrt'). For Logistic Regression, the hyperparameters included the regularization strength (C) with values 0.1, 1, 10, 100, and the type of penalty ('l1', 'l2').
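
The search spaces listed above can be written down as plain dictionaries and expanded into concrete configurations. The values come from the text; the dictionary keys follow the parameter names commonly used by scikit-learn and XGBoost, which is an assumption because the paper does not name the libraries, and `expand` is a hypothetical helper.

```python
from itertools import product
from typing import Any, Dict, List

# Candidate values as listed in the text; key names are assumed.
GRIDS: Dict[str, Dict[str, List[Any]]] = {
    "xgboost": {
        "n_estimators": [100, 200, 300],
        "learning_rate": [0.01, 0.1],
        "max_depth": [3, 6],
        "min_child_weight": [1, 2],
    },
    "random_forest": {
        "n_estimators": [100, 200, 300],
        "max_depth": [None, 10, 20, 30],
        "max_features": [None, "sqrt"],
    },
    "logistic_regression": {
        "C": [0.1, 1, 10, 100],
        "penalty": ["l1", "l2"],
    },
}


def expand(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Enumerate every combination of the candidate values in a grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
```

An exhaustive search over these grids would evaluate 24 configurations each for XGBoost and Random Forest and 8 for Logistic Regression per training split.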
4.4 The VTNet and VTNet_att models
In parallel with traditional machine learning algorithm models, we also employed a deep learning method that uses the raw gaze data. The VTNet model, initially presented in [29], was developed to detect user confusion by learning from raw Eye Tracking (ET) data. The model integrates a single-layer Gated Recurrent Unit (GRU) sub-model with a two-layer Convolutional Neural Network (CNN) sub-model, each operating independently. The GRU sub-model is responsible for processing the ET data sequentially, whereas the CNN sub-model processes its corresponding spatial representation by learning from scan path images, which are a representation of the ET samples’ X and Y gaze coordinates and the transitions between them. In [30], a self-attention layer was added before the GRU to allow it to focus on the more important segments within the ET sequences, thereby enhancing the model’s ability to discern long-term dependencies. This model, now termed VTNet_att to indicate the addition of attention, incorporates a self-attention layer with a dimensionality of 6 to match the dimensionality of the input sequential ET data and a single attention head to preserve model simplicity and computational efficiency, which is important to prevent overfitting on small datasets (a typical characteristic of ET datasets). The output from the GRU, characterized by a 256-unit hidden state, is concatenated with a 50-element vector from the CNN to form a combined vector of 306 elements. This vector serves as the input to a simple neural network comprising a hidden layer and a SoftMax output layer, which classifies the input into two categories: Disagreement or Agreement. The hyperparameters of the VTNet_att model remain consistent with those specified in [29] and [30], and the model undergoes end-to-end training as a cohesive unit.
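
The fusion step described above can be written compactly. The notation below is introduced here for illustration only; the hidden-layer width $d$ and the activation $\phi$ are placeholders, since the text does not specify them.

```latex
\begin{aligned}
h &\in \mathbb{R}^{256} && \text{(GRU output)} \\
c &\in \mathbb{R}^{50}  && \text{(CNN output for the scan path image)} \\
z &= [\,h \,;\, c\,] \in \mathbb{R}^{306} && \text{(concatenation)} \\
\hat{y} &= \operatorname{softmax}\bigl(W_2\,\phi(W_1 z + b_1) + b_2\bigr) \in \mathbb{R}^{2},
  && W_1 \in \mathbb{R}^{d \times 306},\; W_2 \in \mathbb{R}^{2 \times d}
\end{aligned}
```

The two entries of $\hat{y}$ correspond to the Disagreement and Agreement classes.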
As a data augmentation method, we cyclically split the eye tracking sequences, following the approach used by Sims and Conati [29] and Sriram et al. [30]. In these works, the cyclical splitting process produced four new data points from each original one by grouping samples collected at 120 Hz that were four steps apart into the same new data point. This method preserved the temporal structure because contiguous samples showed minimal variation due to the high sampling rate, and it expanded the dataset fourfold. However, since our eye tracker had a higher sampling rate of 250 Hz, we cyclically split each data point into eight separate ones, thereby increasing the dataset eightfold.
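
Cyclical splitting amounts to taking every k-th sample starting from each of the k possible offsets. A minimal sketch, with a hypothetical function name and a plain list standing in for a gaze sequence:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def cyclic_split(sequence: Sequence[T], k: int) -> List[List[T]]:
    """Split one sequence into k interleaved subsequences with stride k."""
    return [list(sequence[offset::k]) for offset in range(k)]
```

With k = 8 at 250 Hz, each subsequence has an effective rate of 250 / 8 = 31.25 Hz, close to the 120 / 4 = 30 Hz that results from the fourfold split in the earlier works. Whether this match motivated the choice of eight is our reading and is not stated in the text.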
5 Evaluation
In this section, we present our experiment setup. We aim to explore the possibilities of predicting user disagreement through eye tracking data using traditional machine learning algorithms and a deep learning method. Our exploration is split into two key experiments: the Cross-user experiment and the Within-user experiment.
The Cross-user experiment attempts to determine the generalizability of the predictive model: can it effectively use eye tracking data from a pool of users to predict disagreement for any given user? This experiment will reveal the model’s capability to apply learned patterns of disagreement from the collective data to unseen individuals.
The Within-user experiment, in contrast, focuses on the model’s capacity to predict disagreement when trained and tested on data from the same user. Here, the aim is to understand how well a model can learn individual-specific patterns of eye movements and whether these personalized models lead to improved performance over generalized models.
The fundamental questions guiding our experiments are:
(1) Is eye tracking data a viable source of implicit feedback for predicting user disagreement with an AI-generated caption, and can such a prediction model generalize across different users?
(2) How effective are personalized models, tailored to individual users, when using eye tracking data to predict disagreement?
In both cross-user and within-user experiments, we statistically compared the results using a one-way MANOVA test, where the model type served as the independent variable, while the performance metrics acted as the dependent variables. For post-hoc pairwise comparisons, we used Tukey’s HSD test. We report statistical significance when the p-value is less than 0.05.
5.1 Cross-User evaluation
The Cross-User evaluation aims to assess the generalizability of our models in predicting user disagreement. To achieve this, we implemented a 10-fold Leave-Groups-Out cross-validation (CV) strategy. Under this validation method, the dataset was divided into 10 exclusive groups, with each group acting as a hold-out test set at different iterations. Every iteration ensured that a particular user’s data was included in the test set just once, thus guaranteeing that the training set did not contain any data from the user being tested. This separation is vital to ensuring that our evaluation of the model’s generalizability is not compromised by information leakage. In the context of feature-based predictors, such as Random Forest, Extreme Gradient Boosting, and Logistic Regression, we incorporated recursive feature elimination with cross-validation (RFECV) within the CV framework. This method is intended for feature selection and is executed in tandem with hyperparameter tuning on the training set.
For the VTNet model, which accepts raw gaze data without prior feature selection, no feature selection step was used within its validation loop. The model was evaluated based on its ability to learn from raw data as provided.

Table 2: Average Performance Metrics with Standard Deviation for Each Model

Model      F1 Score      Accuracy      Precision     Recall
LR         0.59 ± 0.04   0.53 ± 0.04   0.53 ± 0.05   0.70 ± 0.17
RF         0.59 ± 0.05   0.54 ± 0.03   0.53 ± 0.03   0.66 ± 0.12
XGBoost    0.63 ± 0.04   0.55 ± 0.01   0.54 ± 0.02   0.76 ± 0.13
VTNet      0.53 ± 0.04   0.54 ± 0.03   0.53 ± 0.06   0.55 ± 0.10
VTNet_att  0.50 ± 0.07   0.55 ± 0.04   0.56 ± 0.07   0.47 ± 0.13
Algorithm 1 Cross-User evaluation
X, y, G ← Dataset, Targets, Groups
M ← Set of Models
H ← Hyperparameters of Models in M
F ← Set of Features
for (train, test) ∈ LeaveGroupsOut(G) do
    X_train, y_train ← X[train], y[train]
    X_test, y_test ← X[test], y[test]
    F_selected ← FeatureSelection(X_train, y_train, F)
    X_train ← X_train[F_selected]
    X_test ← X_test[F_selected]
    for model ∈ M do
        best_model ← ParamTuning(H_model, X_train, y_train, model)
        Scores_model ← Evaluate(best_model, X_test, y_test)
    end for
end for
return Scores_model
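The grouped split at the core of this procedure can be sketched in a few lines. The following is a minimal, standard-library-only illustration, not the authors' code: it assumes 30 hypothetical user IDs with 154 samples each, assigned round-robin to 10 groups (the paper does not state how users were assigned to groups), and `leave_groups_out` is an illustrative helper. It checks the property the text relies on, namely that no user appears in both the training and the test set of a fold.

```python
from collections import defaultdict

def leave_groups_out(groups, n_folds=10):
    """Yield (train_idx, test_idx) so that each user falls in exactly one test fold.

    `groups` holds one user ID per sample. Users are assigned to folds
    round-robin in order of first appearance (an assumption for illustration).
    """
    users = list(dict.fromkeys(groups))  # unique users, order preserved
    fold_of = {user: i % n_folds for i, user in enumerate(users)}
    by_fold = defaultdict(list)
    for idx, user in enumerate(groups):
        by_fold[fold_of[user]].append(idx)
    for fold in range(n_folds):
        test_idx = by_fold[fold]
        held_out = set(test_idx)
        train_idx = [i for i in range(len(groups)) if i not in held_out]
        yield train_idx, test_idx

# Hypothetical data: 30 users with 154 samples each.
groups = [f"U{u:02d}" for u in range(30) for _ in range(154)]

folds = list(leave_groups_out(groups))
for train_idx, test_idx in folds:
    train_users = {groups[i] for i in train_idx}
    test_users = {groups[i] for i in test_idx}
    # No leakage: a held-out user contributes nothing to training.
    assert train_users.isdisjoint(test_users)
```

In a full pipeline, feature selection and hyperparameter tuning would be fitted on `train_idx` only, so that the held-out users influence neither step.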
5.1.1 Results. This section reports on the findings from our experiment designed to test the generalizability of various models in predicting user disagreement using eye tracking data. Table 2 presents the average performance metrics for each model across the 10-fold Leave-Groups-Out cross-validation setup, providing detailed insight into their precision, accuracy, F1 scores, and recall rates, along with the variability of these metrics.
The data in Table 2 highlights the average performance achieved by each model. The Logistic Regression model recorded an average F1 Score of 0.59 with a standard deviation of 0.04 and featured an average Recall of 0.70. Similarly, the Random Forest model demonstrated an average F1 Score of 0.59 and a Recall of 0.66. Among traditional machine learning models, XGBoost performed the best with the highest average F1 Score of 0.63 and the highest average Recall of 0.76.
Detecting when Users Disagree with Generated Captions. ICMI Companion '24, November 04–08, 2024, San Jose, Costa Rica
In contrast, the deep learning approaches, represented by VTNet and VTNet_att, exhibited worse performance metrics compared to the more traditional machine learning models. VTNet achieved an average F1 Score of 0.53 and a Recall of 0.55, while VTNet_att displayed slightly lower scores with an average F1 Score of 0.50 and a Recall of 0.47. Furthermore, the standard deviations of these models indicate a higher variability in performance, particularly for VTNet_att.
The MANOVA found significant effects of the type of model on the F1 Score ($F_{4,45} = 8.620$, partial $\eta^2 = 0.434$) and Recall ($F_{4,45} = 6.254$, partial $\eta^2 = 0.357$). Pairwise comparisons showed that, based on the F1 scores, the XGBoost and Random Forest models were equivalent in performance and they both outperformed the other three models (Logistic Regression, VTNet, and VTNet_att). These three models were found to be equivalent to one another. In terms of Recall, the XGBoost, Logistic Regression, and Random Forest models were found to be equivalent to one another and they all outperformed the VTNet and VTNet_att models which, in turn, were equivalent in performance.
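As a quick consistency check on the reported effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom as F·df1 / (F·df1 + df2). The sketch below assumes that the reported F values are univariate follow-up statistics with degrees of freedom (4, 45); the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic: F*df1 / (F*df1 + df2)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Values reported for the cross-user experiment, df = (4, 45).
eta_f1 = partial_eta_squared(8.620, 4, 45)
eta_recall = partial_eta_squared(6.254, 4, 45)
print(round(eta_f1, 3), round(eta_recall, 3))  # 0.434 0.357
```

Both values agree with the reported effect sizes, and the error degrees of freedom (45) are consistent with 5 models evaluated on 10 folds each.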
5.2 Within-User evaluation
To assess the effectiveness of personalized models for each individual user, we conducted an experiment where a distinct model was trained for each participant using the full array of methods previously introduced, including both feature-based and VTNet algorithms.
For this purpose, we organized the dataset, comprising 154 samples from each user, into training and testing subsets. The division designated 134 samples for training purposes, while 20 samples were set aside for evaluation, adhering to a pre-established split based on the image sets. Subsequently, a separate model of each type, covering both the feature-based predictors and VTNet, was trained for each participant. This approach allows us to explore and compare the performance of models personalized to individual users. The specific steps of the training and evaluation process for each user are detailed in Algorithm 2.
Algorithm 2 Within-User Model Training and Evaluation
P ← Set of Participants
D ← Dataset
M ← Models
for p ∈ P do
    D_p ← getDataOfParticipant(D, p)
    X_train, X_test, y_train, y_test ← Split(D_p)
    for model ∈ M do
        Train(model, X_train, y_train)
        Evaluate(model, X_test, y_test)
    end for
end for
return Evaluation scores for each model
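The per-participant loop of Algorithm 2 might look as follows. This is a sketch under assumptions: the `image_set` field, the `fit` and `predict` callables, and the majority-class baseline are placeholders, since the paper does not specify how samples map to image sets or list its training code. Balanced accuracy, the metric reported per user in Table 4, is used as the score.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class (binary labels 0/1)."""
    recalls = []
    for label in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == label]
        if idx:
            recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def evaluate_within_user(samples_by_user, test_set_id, fit, predict):
    """Train and score one model per participant on a fixed image-set split."""
    scores = {}
    for user, samples in samples_by_user.items():
        train = [s for s in samples if s["image_set"] != test_set_id]
        test = [s for s in samples if s["image_set"] == test_set_id]
        model = fit(train)
        y_pred = [predict(model, s) for s in test]
        scores[user] = balanced_accuracy([s["label"] for s in test], y_pred)
    return scores

# Hypothetical participant: 134 training samples and 20 held-out samples.
samples = [
    {"image_set": "held_out" if i >= 134 else "train", "label": i % 2}
    for i in range(154)
]

def fit_majority(train):
    """Placeholder model: remember the majority label of the training set."""
    labels = [s["label"] for s in train]
    return 1 if 2 * sum(labels) >= len(labels) else 0

def predict_majority(model, sample):
    """Always predict the remembered majority label."""
    return model

scores = evaluate_within_user({"P01": samples}, "held_out", fit_majority, predict_majority)
print(scores)  # {'P01': 0.5}
```

A constant predictor scores exactly 0.5 here, which is why balanced accuracies around or below 0.50 indicate that a personalized model has learned nothing useful for that user.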
5.2.1 Results. The summarized results of the within-user evaluations are presented in Table 3, which illustrates the average performance of all personalized models trained on data of a single user.
Table 3: Average Performance Metrics with Standard Deviation for Each Model
Model      F1 Score      Accuracy      Precision     Recall
LR         0.51 ± 0.19   0.54 ± 0.19   0.52 ± 0.30   0.38 ± 0.18
RF         0.50 ± 0.33   0.55 ± 0.24   0.52 ± 0.20   0.42 ± 0.20
XGBoost    0.57 ± 0.16   0.59 ± 0.12   0.57 ± 0.20   0.62 ± 0.18
VTNet      0.55 ± 0.18   0.57 ± 0.13   0.55 ± 0.19   0.57 ± 0.21
VTNet_att  0.58 ± 0.14   0.58 ± 0.12   0.59 ± 0.15   0.59 ± 0.17
For the Logistic Regression model, an average F1 Score of 0.51 ± 0.19 was obtained. The Random Forest model had an average F1 Score of 0.50 ± 0.33. XGBoost had an average F1 Score of 0.57 ± 0.16, with the highest average Accuracy of 0.59 ± 0.12 and the highest average Precision of 0.57 ± 0.20 among the feature-based models.
In the category of deep learning approaches, the VTNet model achieved an average F1 Score of 0.55 ± 0.18, while the VTNet_att model had an average F1 Score of 0.58 ± 0.14. The VTNet_att model also achieved the highest average Precision of 0.59 ± 0.15 when compared with other models. The standard deviation values indicate variability in model performance across different user data. However, the MANOVA test revealed that there was no significant effect of the type of model on any of the performance metrics. Hence, they were all statistically equivalent to one another.
6 Discussion
The results of the cross-user experiment reveal that among the models evaluated, XGBoost achieved the best performance. However, the overall average accuracy score of 0.55 is quite low. The high recall rate (0.76) suggests that the model is good at identifying most instances where the user disagrees with the caption. This means that most user disagreement instances are captured and can be used as feedback for the IML system. However, the low precision (0.54) also implies that the model may produce false positives, i.e., predict disagreement where there may be none. This can be problematic if the model's predictions are used directly to trigger requests for user feedback, potentially leading to an excessive number of interruptions. Such interruptions can degrade the user experience by prompting users to provide feedback too frequently, particularly in cases where they might perceive the AI system's output as satisfactory.
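To make the cost of low precision concrete, the following sketch derives the share of unnecessary prompts from the XGBoost averages in Table 2 (precision 0.54, recall 0.76). It assumes, for illustration only, that every predicted disagreement triggers a feedback request. Note also that an F1 value computed from averaged precision and recall need not equal the fold-averaged F1 exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.54, 0.76  # XGBoost averages from Table 2

# Share of triggered prompts where the user did not actually disagree.
false_prompt_share = 1 - precision
# Share of true disagreements that never trigger a prompt.
missed_share = 1 - recall

print(f"F1 ~ {f1_score(precision, recall):.2f}")
print(f"{false_prompt_share:.0%} of prompts would be unnecessary")
print(f"{missed_share:.0%} of disagreements would go undetected")
```

Under this assumption, roughly every second prompt would interrupt a user who was in fact satisfied with the caption.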
Therefore, it might be beneficial to explore a different approach that incorporates additional implicit sources of information. By integrating data such as facial expressions, the model could gain insight into a wider range of implicit user feedback signals, possibly allowing for a more accurate determination of when to request explicit user input. With these enhancements, it would be possible to maintain the benefits of detecting disagreement for IML while reducing the risk of unnecessary feedback prompts.
In the within-user experiment, the XGBoost and VTNet_att models achieve the best average accuracies of 0.59 and 0.58, respectively. However, these results were accompanied by high variances in performance between users. Such variances are further emphasized by the considerable standard deviations in precision and all other metrics, as shown in Table 3.
Table 4: Comparative Performance Metrics for Top and Bottom 5 Users Based on Balanced Accuracy of XGBoost and Gaze Robustness Score
User ID   Balanced Accuracy   Gaze Robustness
B13       0.3000              0.97230
A06       0.4000              0.94518
B07       0.4141              0.91343
B11       0.4394              0.99513
B03       0.4596              0.95736
B05       0.7083              0.99408
A04       0.7143              0.97802
B14       0.7473              0.99496
B09       0.7500              0.99438
B01       0.8333              0.99935
XGBoost worked well for certain users and poorly for others, indicating the presence of individual differences in disagreement behavior (see Table 4). For 7 users, the models provided balanced accuracies exceeding 0.70, and more than half of the users had accuracies greater than 0.60. However, for other users we observed balanced accuracies falling below 0.50, highlighting a potential disparity in how effectively models can capture indicative behavior.
To understand the variability in model performance, particularly for users where models underperformed (<.50), we examined metrics such as gaze robustness and noise level. We aimed to determine whether low performance could be associated with identifiable factors, such as a user's fixation on the background instead of the relevant areas of interest or inconsistency in gaze data. An analysis revealed that the accuracy of XGBoost has a moderate correlation (.43) with gaze robustness, a metric that assesses the quality and consistency of the gaze signal. Notably, 4 out of the 5 users with the worst model performance had gaze robustness scores that were lower than the group average (98.5%, i.e., 0.985 on the scale used in Table 4). This finding suggests that gaze robustness may play a role in the effectiveness of the model in predicting user disagreement accurately.
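The observation about the five lowest-performing users can be checked directly against Table 4. The sketch below uses only the ten rows printed there, so the Pearson coefficient it computes is not expected to match the reported .43, which presumably covers all participants. The 0.985 threshold assumes that the reported group average of 98.5 is a percentage.

```python
from math import sqrt

# (user, balanced accuracy, gaze robustness) as listed in Table 4
rows = [
    ("B13", 0.3000, 0.97230), ("A06", 0.4000, 0.94518),
    ("B07", 0.4141, 0.91343), ("B11", 0.4394, 0.99513),
    ("B03", 0.4596, 0.95736), ("B05", 0.7083, 0.99408),
    ("A04", 0.7143, 0.97802), ("B14", 0.7473, 0.99496),
    ("B09", 0.7500, 0.99438), ("B01", 0.8333, 0.99935),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

GROUP_AVERAGE = 0.985  # assumed to correspond to the reported 98.5
bottom5 = sorted(rows, key=lambda r: r[1])[:5]
below_average = [user for user, _, robustness in bottom5 if robustness < GROUP_AVERAGE]
print(below_average)  # ['B13', 'A06', 'B07', 'B03']

r = pearson([acc for _, acc, _ in rows], [rob for _, _, rob in rows])
print(f"r on these ten rows: {r:.2f}")
```

Four of the five lowest-scoring users fall below the threshold, matching the count given in the text; B11 is the exception.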
In addition, we investigated the influence of increasing training data using XGBoost, which has the best average performance. These results are graphically represented in Figure 3, where users' average accuracy is plotted against the increasing number of training samples provided to their personalized model, with error bars showing the standard deviation. We found that, on average, there is a moderate increase in the XGBoost model's performance as the quantity of training data per user increases, indicating that additional training samples improve model accuracy. However, the high standard deviation suggests that the improvement is inconsistent across individual users.
Figure 3: Average balanced accuracy performance of personalized models with increasing training samples
6.1 Limitations
Although our study offers the first insight into the predictive capacity of gaze for user disagreement, it has its limitations. First, the data were collected in a controlled experiment setting which contrasts with natural user interaction with AI, the latter of which can introduce a wide range of variability and complexity that is not fully replicated in a laboratory context. Second, our work does not explore alternative modeling techniques [18] or the integration of additional data sources (e.g., facial expressions), which could offer further dimensions of understanding user disagreement and improve model robustness. Lastly, the scope of the study was restricted to the context of image-caption pairs. The focus on image-caption pairs is context-specific and thus, future research should assess the transferability of these insights across different forms of Human-AI interaction to validate their broader applicability.
7 Conclusion & Future Work
This work investigated the potential of eye tracking signals as an implicit source for prediction of disagreement when interacting with an AI system. We collected a dataset with 30 participants in which they interacted with a simulated image captioning system, while their eye movements as they rated the AI-generated captions were recorded. We investigated the performance of machine learning models in both cross-user and within-user contexts. The findings indicate that while the best model (XGBoost) performs well in detecting instances of disagreement, its precision remains a challenge, underscoring the necessity for models that better capture individual user behaviors. Notably, XGBoost proved effective for some users while failing to capture the disagreement of others, demonstrating a disparity in model performance that should be further investigated. In conclusion, our research provides a first understanding of the relationship between eye movements and user disagreement in AI interactions. The integration of additional modalities and the application of advanced analytical techniques represent promising directions for future research.
Acknowledgments
This work was funded in part by the European Union under grant number 101093079 (MASTER), and the German Federal Ministry of Education and Research (BMBF) under grant number 01IW23002 (No-IDLE).
References
[1] Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. 2014. Power to the People: The Role of Humans in Interactive Machine Learning. AI Magazine 35, 4 (Dec. 2014), 105–120. https://doi.org/10.1609/aimag.v35i4.2513
[2] Michael Barz, Omair Shahzad Bhatti, and Daniel Sonntag. 2021. Implicit Estimation of Paragraph Relevance From Eye Movements. Frontiers Comput. Sci. 3 (2021), 808507. https://doi.org/10.3389/fcomp.2021.808507
[3] Michael Barz, Sven Stauden, and Daniel Sonntag. 2020. Visual Search Target Inference in Natural Interaction Settings with Machine Learning. In ETRA '20: 2020 Symposium on Eye Tracking Research and Applications, Stuttgart, Germany, June 2-5, 2020, Andreas Bulling, Anke Huckauf, Eakta Jain, Ralph Radach, and Daniel Weiskopf (Eds.). ACM, 1:1–1:8. https://doi.org/10.1145/3379155.3391314
[4] Nilavra Bhattacharya, Somnath Rakshit, and Jacek Gwizdka. 2020. Towards Real-time Webpage Relevance Prediction Using Convex Hull Based Eye-tracking Features. In ACM Symposium on Eye Tracking Research and Applications (Stuttgart, Germany) (ETRA '20 Adjunct). Association for Computing Machinery, New York, NY, USA, Article 28, 10 pages. https://doi.org/10.1145/3379157.3391302
[5] Omair Bhatti, Michael Barz, and Daniel Sonntag. 2022. Leveraging Implicit Gaze-Based User Feedback for Interactive Machine Learning. In KI 2022: Advances in Artificial Intelligence, Ralph Bergmann, Lukas Malburg, Stephanie C. Rodermund, and Ingo J. Timm (Eds.). Springer International Publishing, Cham, 9–16.
[6] Nigel Bosch, Yuxuan Chen, and Sidney D'Mello. 2014. It's Written on Your Face: Detecting Affective States from Facial Expressions while Learning Computer Programming. In Intelligent Tutoring Systems, Stefan Trausan-Matu, Kristy Elizabeth Boyer, Martha Crosby, and Kitty Panourgia (Eds.). Springer International Publishing, Cham, 39–44.
[7] Maya Cakmak, Crystal Chao, and Andrea L. Thomaz. 2010. Designing Interactions for Robot Active Learners. IEEE Transactions on Autonomous Mental Development 2, 2 (2010), 108–118. https://doi.org/10.1109/TAMD.2010.2051030
[8] Sidney K. D'Mello, Scotty D. Craig, and Art C. Graesser. 2009. Multimethod Assessment of Affective Experience and Expression during Deep Learning. Int. J. Learn. Technol. 4, 3/4 (Oct. 2009), 165–187. https://doi.org/10.1504/IJLT.2009.028805
[9] John J. Dudley and Per Ola Kristensson. 2018. A Review of User Interface Design for Interactive Machine Learning. ACM Trans. Interact. Intell. Syst. 8, 2, Article 8 (Jun. 2018), 37 pages. https://doi.org/10.1145/3185517
[10] Sidney K. D'Mello and Arthur C. Graesser. 2014. Confusion. In International handbook of emotions in education. Routledge, 299–320.
[11] Paul Ekman, Wallace V. Friesen, Maureen O'Sullivan, Anthony Chan, Irene Diacoyanni-Tarlatzis, Karl Heider, Rainer Krause, William Ayhan LeCompte, Tom Pitcairn, Pio E. Ricci-Bitti, et al. 1987. Universals and cultural differences in the judgments of facial expressions of emotion. Journal of personality and social psychology 53, 4 (1987), 712.
[12] Maliheh Ghajargar, Jan Persson, Jeffrey Bardzell, Lars Holmberg, and Agnes Tegen. 2020. The UX of Interactive Machine Learning. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3419249.3421236
[13] Donald Honeycutt, Mahsan Nourani, and Eric Ragan. 2020. Soliciting Human-in-the-Loop User Feedback for Interactive Machine Learning Reduces User Trust and Impressions of Model Accuracy. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 8, 1 (Oct. 2020), 63–72. https://ojs.aaai.org/index.php/HCOMP/article/view/7464
[14] Lea Krause and Piek Vossen. 2020. When to explain: Identifying explanation triggers in human-agent interaction. In 2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence. 55–60.
[15] Sébastien Lallé, Cristina Conati, and Giuseppe Carenini. 2016. Predicting Confusion in Information Visualization from Eye Tracking and Interaction Data. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (New York, New York, USA) (IJCAI'16). AAAI Press, 2529–2535.
[16] Jia Zheng Lim, James Mountstephens, and Jason Teo. 2020. Emotion Recognition Using Eye-Tracking: Taxonomy, Review and Current Challenges. Sensors 20, 8 (2020). https://doi.org/10.3390/s20082384
[17] luxonis. 2020. OAK-D: Stereo camera with Edge AI. https://luxonis.com/ Stereo Camera with Edge AI capabilities from Luxonis and OpenCV.
[18] Abdulrahman Mohamed Selim, Michael Barz, Omair Shahzad Bhatti, Hasan Md Tusfiqur Alam, and Daniel Sonntag. 2024. A review of machine learning in scanpath analysis for passive gaze-based interaction. Frontiers in Artificial Intelligence 7 (2024). https://doi.org/10.3389/frai.2024.1391745
[19] Sucheta Nadkarni and Reetika Gupta. 2007. A Task-Based Model of Perceived Website Complexity. MIS Quarterly 31, 3 (2007), 501–524. http://www.jstor.org/stable/25148805
[20] Anneli Olsen. 2012. The Tobii IVT Fixation Filter Algorithm description. https://api.semanticscholar.org/CorpusID:52834703
[21] Mariya Pachman, Amaël Arguel, Lori Lockyer, Gregor Kennedy, and Jason Lodge. 2016. Eye tracking and early detection of confusion in digital learning environments: Proof of concept. Australasian Journal of Educational Technology 32, 6 (Dec. 2016). https://doi.org/10.14742/ajet.3060
[22] Manuela Pollak, Andrea Salfinger, and Karin Anna Hummel. 2022. Teaching Drones on the Fly: Can Emotional Feedback Serve as Learning Signal for Training Artificial Agents? arXiv preprint arXiv:2202.09634 (2022).
[23] Joni Salminen, Bernard J. Jansen, Jisun An, Soon-Gyo Jung, Lene Nielsen, and Haewoon Kwak. 2018. Fixation and Confusion: Investigating Eye-Tracking Participants' Exposure to Information in Personas. In Proceedings of the 2018 Conference on Human Information Interaction & Retrieval (New Brunswick, NJ, USA) (CHIIR '18). Association for Computing Machinery, New York, NY, USA, 110–119. https://doi.org/10.1145/3176349.3176391
[24] Joni Salminen, Mridul Nagpal, Haewoon Kwak, Jisun An, Soon-gyo Jung, and Bernard J. Jansen. 2019. Confusion Prediction from Eye-Tracking Data: Experiments with Machine Learning. In Proceedings of the 9th International Conference on Information Systems and Technologies (Cairo, Egypt) (icist 2019). Association for Computing Machinery, New York, NY, USA, Article 5, 9 pages. https://doi.org/10.1145/3361570.3361577
[25] Dario D. Salvucci and Joseph H. Goldberg. 2000. Identifying fixations and saccades in eye-tracking protocols. In Proceedings of the 2000 symposium on Eye tracking research & applications. 71–78.
[26] Abraham Savitzky and M. J. E. Golay. 1964. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Analytical Chemistry 36, 8 (1964), 1627–1639. https://doi.org/10.1021/ac60214a047
[27] Abdulrahman Mohamed Selim, Omair Shahzad Bhatti, Michael Barz, and Daniel Sonntag. 2024. Perceived Text Relevance Estimation Using Scanpaths and GNNs. In Proceedings of the International Conference on Multimodal Interaction (San Jose, Costa Rica) (ICMI '24). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3678957.3685736
[28] Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurelie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017. "FOIL it! Find One mismatch between Image and Language caption". In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers). 255–265.
[29] Shane D. Sims and Cristina Conati. 2020. A neural architecture for detecting user confusion in eye-tracking data. In Proceedings of the 2020 international conference on multimodal interaction (Virtual Event, Netherlands) (ICMI '20). Association for Computing Machinery, New York, NY, USA, 15–23. https://doi.org/10.1145/3382507.3418828
[30] Harshinee Sriram, Cristina Conati, and Thalia Field. 2023. Classification of Alzheimer's Disease with Deep Learning on Eye-tracking Data. In Proceedings of the 25th International Conference on Multimodal Interaction. 104–113.
[31] Benjamin Voloh, Marcus Watson, Seth Konig, and Thilo Womelsdorf. 2020. MAD saccade: statistically robust saccade threshold estimation via the median absolute deviation. Journal of Eye Movement Research 12 (May 2020). https://doi.org/10.16910/jemr.12.8.3
[32] Jan Zacharias, Michael Barz, and Daniel Sonntag. 2018. A Survey on Deep Learning Toolkits and Libraries for Intelligent User Interfaces. arXiv:1803.04818 [cs.HC]
[33] Zhihong Zeng, Maja Pantic, Glenn I. Roisman, and Thomas S. Huang. 2009. A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 1 (2009), 39–58. https://doi.org/10.1109/TPAMI.2008.52