I-MPN: inductive message passing network for efficient human-in-the-loop annotation of mobile eye tracking data
Hoang H. Le1,2,3,7, Duy M. H. Nguyen1,4,5,7, Omair Shahzad Bhatti1, László Kopácsi1, Thinh P. Ngo2, Binh T. Nguyen2, Michael Barz1,6 & Daniel Sonntag1,6
Comprehending how humans process visual information in dynamic settings is crucial for psychology and designing user-centered interactions. While mobile eye-tracking systems combining egocentric video and gaze signals can offer valuable insights, manual analysis of these recordings is time-intensive. In this work, we present a novel human-centered learning algorithm designed for automated object recognition within mobile eye-tracking settings. Our approach seamlessly integrates an object detector with a spatial relation-aware inductive message-passing network (I-MPN), harnessing node profile information and capturing object correlations. Such mechanisms enable us to learn embedding functions capable of generalizing to new object angle views, facilitating rapid adaptation and efficient reasoning in dynamic contexts as users navigate their environment. Through experiments conducted on three distinct video sequences, our interactive-based method showcases significant performance improvements over fixed training/testing algorithms, even when trained on considerably smaller annotated samples collected through user feedback. Furthermore, we demonstrate exceptional efficiency in data annotation processes and surpass prior interactive methods that use complete object detectors, combine detectors with convolutional networks, or employ interactive video segmentation.
Keywords Human-centered AI, Scene Recognition
The advent of mobile eye-tracking technology has significantly expanded the horizons of research in fields such as psychology, marketing, and user interface design by providing a granular view of user visual attention in naturalistic settings1,2. By capturing intricate details of eye movement, this technology provides real-time insights into cognitive processes and user behavior during interactions with physical products or mobile devices. For instance, in educational research, mobile eye-tracking enables the exploration of learners’ gaze behavior in interactive, real-world environments like classrooms and science laboratories3,4. Insights into where students focus their visual attention, therefore, can guide the design of instructional strategies and foster improved learning outcomes5. In this study, we investigate a new approach aimed at enhancing object recognition under interactive mobile eye-tracking, specifically optimizing data annotation efficiency and advancing human-in-the-loop learning models (Fig. 2). Equipped with eye-tracking devices, users generate video streams alongside fixation points, providing visual focus as they navigate through their environment. Our primary aim lies in recognizing specific objects, such as tablet-left, tablet-right, book, device-left, and device-right, with all other elements considered background, as demonstrated in Fig. 1.
However, the manual analysis of these eye-tracking data is challenging due to the extensive volume of data generated and the complexity of dynamic visual environments, where target objects may overlap and be affected by environmental noise6,7. In clinical, educational, or other real-world research contexts, the variability of gaze patterns across participants further complicates the extraction of meaningful insights. Additionally, the dynamic nature of visual scenes demands precise object identification and segmentation, often requiring extensive manual annotations to account for factors such as occlusion, shifting lighting conditions, and rapid scene changes. These challenges highlight the pressing need for autonomous analytical strategies that can leverage advanced computational techniques to streamline data processing, improve accuracy, and reduce the burden of human intervention. Among these strategies, techniques such as gaze-based clustering8,9, fixation heatmaps10,11, and predictive modeling12 are particularly promising, as they can not only enhance data interpretation but also facilitate real-time applications, such as adaptive learning systems or assistive technologies for individuals with visual or cognitive impairments.
1Interactive Machine Learning Department, German Research Center for Artificial Intelligence (DFKI), 66123 Saarbrücken, Germany. 2Mathematics and Computer Science Department, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam. 3Quy Nhon AI Research and Development Center, FPT Software, Quy Nhon, Vietnam. 4Max Planck Research School for Intelligent Systems (IMPRS-IS), 70569 Stuttgart, Germany. 5Machine Learning and Simulation Science Department, University of Stuttgart, 70569 Stuttgart, Germany. 6Applied Artificial Intelligence Department, University of Oldenburg, 26129 Oldenburg, Germany. 7Hoang H. Le and Duy M. H. Nguyen: These authors contributed equally to this work. email: [email protected]; ho_minh_duy[email protected]
Scientific Reports | (2025) 15:14192 | https://doi.org/10.1038/s41598-025-94593-y | www.nature.com/scientificreports
The algorithms behind those methods are largely powered by machine learning, with state-of-the-art architectures such as convolutional neural networks (CNNs)13,14 and recurrent neural networks (RNNs)15,16 achieving remarkable accuracy in predicting gaze trajectories and identifying areas of interest (AOIs) across both static and dynamic environments17,18. Building on these successes, object detection models, particularly those employing multi-scale feature extraction techniques like YOLO19 and Faster R-CNN20, have further enhanced the efficiency of visual attention detection in complex scenes21. Other important directions involve graph neural networks (GNNs)22–24, which utilize graph structures to capture and model the spatial and semantic relationships among objects or regions in images, enabling robust object recognition in dynamic environments. Overall, by automating traditionally manual and time-intensive tasks, these models provide a scalable and robust approach to analyzing eye-tracking data, unlocking broader applications in dynamic and visually intricate environments.
Nevertheless, these methods face notable challenges, many of which arise from the inherent variability in human eye movement patterns and contextual dependencies25,26. For example, gaze behavior is highly dynamic, varying across users, tasks, and environmental factors such as occlusions and changes in lighting conditions. This complex interplay of factors often compromises model robustness, particularly in real-world scenarios where such variability is prevalent. To overcome these obstacles, large-scale training datasets are essential for ensuring effective generalization; however, the process of acquiring such datasets is both labor-intensive and time-consuming, posing an additional hurdle for advancing these methods. Moreover, integrating user feedback with individual preferences and situational contexts into machine learning workflows remains a significant bottleneck27. These personalized adaptations are crucial for improving the usability and accuracy of mobile eye-tracking systems, yet they often conflict with the need for computational efficiency and real-time responsiveness. Bridging this gap thus requires innovative approaches that balance adaptability with resource constraints, paving the way for models that can seamlessly customize to individual differences while remaining practical for real-world deployment.
Fig. 1. Our setup for dataset collection is designed in a laboratory environment, featuring various Areas of Interest (AOIs), such as tablets and experiment stations for electrical circuits. A child equipped with a mobile eye tracker interacts with these stations, allowing for the collection of gaze data essential for research in educational contexts. The study aims to analyze children’s learning processes under both AR-supported and non-AR conditions. To achieve this, our algorithm processes the input data from the eye tracker in a backend service, predicting in real time the objects the user is focusing on and tracking their attention over time.
In this work, we design a new method for interactive mobile eye-tracking as demonstrated in Fig. 2. The training process starts with initial data annotations by leveraging video object segmentation (VoS) techniques28,29. Users are prompted to provide weak scribbles denoting areas of interest and assign corresponding labels in initial frames. Subsequently, the VoS tool autonomously extrapolates segmentation boundaries closest to the scribbled regions, thereby generating predictions for later frames. During this period, users interact with the interface, reviewing and refining results by manipulating scribbles or area-of-effect (AoE) labels whenever they spot annotation errors.
In the next phase, we collect segmentation masks and corresponding annotations provided by the VoS tool to define bounding boxes encompassing AoI and their corresponding labels to train recognition algorithms. Our approach, named I-MPN, consists of two primary components: (i) an object detector tasked with generating proposal candidates within environmental setups and (ii) an Inductive Message-Passing Network30–32 designed to discern object relationships and spatial configurations, thereby determining the labels of objects present in the current frame based on their correlations. It is crucial to highlight that identical objects may bear different labels contingent upon their spatial orientations (e.g., left, right) in our settings (Fig. 1, device left and right). This characteristic often poses challenges for methods reliant on local feature discrimination, such as object detection or convolutional neural networks, due to their inherent lack of global spatial context. I-MPN, instead, can overcome this issue by dynamically formulating graph structures at different frames whose node features are represented by bounding box coordinates and semantic feature representations inside detected boxes derived from the object detector. Nodes then exchange information with their local neighborhoods through a set of trainable aggregator functions, which remain invariant to input permutations and are adaptable to unseen nodes in subsequent frames. Through this mechanism, I-MPN plausibly captures the intricate relationships between objects, thus augmenting its representational capacity for dynamic environmental shifts induced by user movement.
Given the initial trained models, we integrate them into a human-in-the-loop phase to predict outcomes for each frame in a video. If users identify erroneous predictions, they can refine the models by providing feedback through drawing scribbles on the current frame using VoS tools, as shown in Fig. 3. This feedback triggers the generation of updated annotations for subsequent frames, facilitating a rapid refinement process similar to the initial annotation stage but with a reduced time frame. The new annotations are then gathered and used to retrain both the object detector and message-passing network in the backend before being deployed for continued inference. If errors persist, the iterative process continues until the models converge to produce satisfactory results. We illustrate such an iterative loop in Fig. 2.
In summary, we make the following contributions:
• Firstly, we introduce I-MPN, an efficient object recognition framework specifically designed to integrate with mobile eye-tracking systems for analyzing gaze behavior in dynamic environments.
• Secondly, I-MPN demonstrates exceptional adaptability to user feedback within mobile eye-tracking applications. By leveraging only a small fraction of user feedback data (20%-30%), it achieves performance levels comparable to or surpassing conventional methods that rely on fixed data splits (e.g., 70% training data).
Fig. 2. Overview of our human-in-the-loop I-MPN approach. Video frames are processed by an object detector to produce feature maps, bounding boxes, and semantic features, which are then analyzed by the Inductive Message Passing Network for object prediction. A domain expert annotates initial frames, and the video object segmentation module propagates these annotations, reducing manual effort. Users confirm or correct predictions, engaging in a feedback loop (dashed arrow) that updates the model iteratively until it achieves a predefined level of accuracy or reliability performance assessed by the domain expert.
• Thirdly, a comparative analysis with other human-learning approaches, such as object detectors and interactive segmentation methods, highlights the superior performance of I-MPN, especially in dynamic environments influenced by user movement. This underscores I-MPN’s capability to comprehend object relationships in challenging conditions.
• Finally, we measure the average user engagement time needed for initial model training data provision and subsequent feedback updates. Through empirical evaluation of popular annotation tools in segmentation and object classification, we demonstrate I-MPN’s time efficiency, reducing label generation time by 60%-70%. We also investigate factors influencing performance, such as message-passing models. Our findings confirm the adaptability of the proposed framework across diverse network architectures.
We outline the structure of the paper with related work presented in Section "Related work", followed by a detailed description of our methodology in Section "Methodology", experimental results in Section "Experiments & results", and, finally, the conclusion and future directions in Section "Conclusion and future work".
Related work
Eye tracking-related machine learning models
Many methods rely on pre-trained models to analyze localized features around fixation points. For instance, some map fixations to bounding boxes using object detection models33,34, while others classify small image patches around fixation points with image classification models21,35. These approaches, however, are typically limited to controlled settings where the training data closely matches the target domain21,36. Studies highlight substantial discrepancies between manual and automatic annotations for areas of interest in benchmark datasets like COCO37,38, emphasizing the challenges in applying pre-trained models to diverse real-world scenarios34. Some efforts to fine-tune object detection models for specific domains show promise39,40, yet these lack interactivity and cannot dynamically update during annotation.
Global interaction-based methods focus on capturing and utilizing broader contextual information. Traditional semi-automatic annotation strategies often rely on non-learnable feature descriptors like color histograms or bag-of-SIFT features41,42, limiting adaptability. More recently, Kurzhals et al.43 proposed an interactive approach for annotating egocentric eye-tracking data by iteratively searching time sequences based on eye movements and visual features. Their method involves segmenting gaze-focused image patches, clustering them into representative thumbnails, and visualizing these clusters on a 2D plane. While innovative, such methods primarily operate on pre-segmented patches and lack the dynamic modeling capabilities needed for more complex, adaptive environments. Unlike these works, our I-MPN is designed to capture both detailed visual feature representations and broader relational interactions among objects through an inductive message-passing network, enhancing model robustness under occlusion or significant viewpoint changes.
Graph neural networks for object recognition
Graph neural networks (GNNs) are neural models designed for analyzing graph-structured data like social networks, biological networks, and knowledge graphs44. Beyond these domains, GNNs can be applied in object recognition to identify and locate objects in images or videos by leveraging graph structures to encode spatial and semantic relations among objects or regions. Through mechanisms like graph convolution45 or attention mechanisms22, GNNs efficiently aggregate and propagate information across the graph. Methods such as GCN46, GAT22,47, KGN23, SGRN48, and RGRN24 demonstrate the ability of GNNs to incorporate contextual reasoning, spatial relationships, and real-time processing into object recognition workflows.
Fig. 3. The video object segmentation-based interface allows users to annotate frames using weak prompts like clicks and scribbles, then propagate these annotations to subsequent frames.
Other approaches extend GNNs to address specific challenges in spatial reasoning and dynamic scenarios. Relation Networks49 and Scene Graph Generation50 explicitly model object relationships and generate structured scene representations. Hierarchical GNNs, like HGRN51, integrate low-level visual features with high-level semantics for improved interaction modeling. Additionally, dynamic frameworks such as Dynamic Graph Neural Networks (DyGNN)52, Principal Neighbourhood Aggregation (G-PNA)53, Gated Graph Sequence Neural Networks (GatedG)54, and Graph Transformer (TransformerG)55 capture both spatial and temporal dynamics for video-based object recognition.
However, in mobile eye-tracking scenarios, these methods face two significant challenges. Firstly, the message-passing mechanism typically operates on the entire graph structure, necessitating a fixed set of objects during both training and inference. This rigidity implies that the entire model must be updated to accommodate new, unseen objects that may arise later due to user interests. Secondly, certain methods, such as RGRN24 or TransformerG55, depend on estimating the co-occurrence of object pairs in scenes using large amounts of training data. However, in human-in-the-loop settings, where users typically provide only small annotated samples, this information is not readily available. As a result, the co-occurrence matrices between objects evolve dynamically over time as more annotations are provided by the user. I-MPN tackles these issues by performing message passing to aggregate information from neighboring nodes, enabling the model to maintain robustness to variability in the graph structure across different instances. While existing works have exploited this idea for link prediction30, recommendation systems56, or video tracking57, we are the first to propose a formulation for human interaction in eye-tracking setups.
Human-in-the-loop for eye tracking
Recent works on human-in-the-loop methods for mobile eye tracking have utilized CNNs for object detection and classification35,58,59. These methods incorporate user feedback to enhance model performance, making them more adaptive to real-world scenarios. However, they often face challenges such as high computational demands and the need for extensively annotated datasets59. Additionally, these models can struggle with environmental noise and varying object angles, which can reduce their accuracy60. In contrast, our I-MPN framework combines object detectors with inductive message-passing techniques, offering more robust performance in dynamic environments while being less resource-intensive than traditional CNN-based methods.
Methodology
Dataset
We begin by detailing our setup, including the process of dataset recording and the generation of video ground-truth annotations used to evaluate our method. Figure 1 illustrates our experimental setup, where we record three video sequences captured by different users, each lasting two to three minutes (Table 2). The users wear an eye tracker on their forehead, which records what they observe over time while also providing fixation points, showing the user’s focus points at each time frame. We are interested in detecting five objects: tablets (left, right), books, and devices (left, right).
Video ground-truth annotations To generate data for model evaluation, we asked users to annotate objects in each video frame using the video object segmentation tool introduced in Section "User feedback as video object segmentation". Following the cross-entropy memory method as described in29, we interacted with users by displaying segmentation results on a monitor. Users then labeled data and created ground truths by clicking the “Scribble” and “Adding Labels” functions for objects. Subsequently, by clicking the “Forward” button, the VoS tool automatically segmented the objects’ masks in the next frames until the end of the video. If users encountered incorrectly generated annotations, they could click “Stop” to edit the results using the “Scribble” and “Adding Labels” functions again (Fig. 3). Further analysis of the VoS tool is provided in Table 2, which includes a runtime comparison against other methods based on object detection and semantic segmentation.
Overview systems
Figure 2 illustrates the main steps in our pipeline. Given a set of video frames: (i) the user generates annotations by scribbling or drawing boxes around objects of interest, which are then fed into the video object segmentation algorithm to generate segment masks over the time frames. (ii) The outputs are subsequently added to the database to train an object detector, perform spatial reasoning, and generate labels for appearing objects using inductive message-passing mechanisms. The trained models are then utilized to infer the next frames until the user interrupts upon encountering incorrect predictions. At this point, users provide feedback as in step (i) for these frames (Fig. 2 bottom dashed arrow). New annotations are then added to the database, and the models are retrained as in step (ii). This loop is repeated for several rounds until the model achieves satisfactory performance. In the following sections, we describe our efficient strategy for enabling users to quickly generate annotations for video frames (Section "User feedback as video object segmentation") and our robust machine learning models designed to quickly adapt from user feedback to recognize objects in dynamic environments (Section "Dynamic spatial-temporal object recognition").
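The annotate, train, infer, and correct loop described in this overview can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `human_in_the_loop`, `fit_nearest`, `oracle`, and the `feedback_window` parameter are hypothetical names, the "model" is a toy 1-nearest-neighbour lookup over frame indices rather than a detector plus message-passing network, and the user is simulated by an oracle that always knows the true label.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[int, str]                      # (frame index, label)
Model = Callable[[int], str]                  # maps a frame index to a label


def human_in_the_loop(
    frames: Sequence[int],
    fit: Callable[[List[Sample]], Model],
    annotate: Callable[[int], str],
    initial: int,
    feedback_window: int,
) -> Tuple[List[str], int]:
    """Annotate -> train -> infer -> correct, repeated until the video ends.

    Returns the accepted per-frame labels and the number of user-labelled frames.
    """
    if feedback_window < 1:
        raise ValueError("feedback_window must be at least 1")
    # Step (i): the user annotates the first `initial` frames.
    database: List[Sample] = [(f, annotate(f)) for f in frames[:initial]]
    # Step (ii): train on everything currently in the database.
    model = fit(database)
    labels = [label for _, label in database]
    t = initial
    while t < len(frames):
        guess = model(frames[t])
        if guess == annotate(frames[t]):      # simulated user accepts the prediction
            labels.append(guess)
            t += 1
            continue
        # The user interrupts: annotate a short window of frames, then retrain.
        window = frames[t:t + feedback_window]
        feedback = [(f, annotate(f)) for f in window]
        database.extend(feedback)
        labels.extend(label for _, label in feedback)
        model = fit(database)
        t += len(window)
    return labels, len(database)


# Toy stand-ins (assumptions): a 1-nearest-neighbour "model" over frame indices
# and an oracle user whose label switches halfway through the video.
def fit_nearest(database: List[Sample]) -> Model:
    snapshot = list(database)
    return lambda f: min(snapshot, key=lambda s: abs(s[0] - f))[1]


def oracle(f: int) -> str:
    return "tablet-left" if f < 10 else "tablet-right"


labels, labelled = human_in_the_loop(
    range(20), fit_nearest, oracle, initial=3, feedback_window=2
)
```

In this toy run the user labels three initial frames and two more after the single interruption, which mirrors the idea that feedback is requested only where the current model fails.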
User feedback as video object segmentation
Annotating objects in video on a frame-by-frame level presents a considerable time and labor investment, particularly in lengthy videos containing numerous objects. To surmount these challenges, we utilize video object segmentation-based methods61,62, significantly diminishing the manual workload. By using cross-video memory29, this method achieved promising accuracy in various tasks ranging from video understanding63 and robotic manipulation64 to neural rendering65. In this study, we harness this capability as an efficient tool for user interaction in annotation tasks, particularly within mobile eye-tracking, facilitating learning and model update phases. The advantages of using VoS over other prevalent annotation methods in segmentation are presented in Table 2.
Generally, with VoS, users simply mark points or scribble within the Area of Interest (AoI) along with their corresponding labels (Fig. 3). Subsequently, the VoS component infers segmentation masks for successive frames by leveraging spatial-temporal correlations (Fig. 2-left). These annotations are then subject to user verification and, if needed, adjustments, streamlining the process rather than starting from scratch each time. Formally, VoS aims to identify and segment objects across video frames ($\{F_1, F_2, \ldots, F_T\}$), producing a segmentation mask $M_t$ for each frame $F_t$. In the first step, for each frame $F_t$, the model extracts a set of feature vectors $\mathcal{F}_t = \{f_{t1}, f_{t2}, \ldots, f_{tn}\}$, where each $f_{ti}$ corresponds to a region proposal in the frame and $n$ is the total number of proposals. Another memory module maintains a memory $\mathcal{M}_t = \{m_1, m_2, \ldots, m_k\}$ that stores aggregated feature representations of previously identified object instances, where $k$ is the number of unique instances stored up to frame $F_t$. To generate correlation scores $C_t = \{c_{t1}, c_{t2}, \ldots, c_{tn}\}$ among consecutive frames, a memory reading function $R(\mathcal{F}_t, \mathcal{M}_{t-1}) \rightarrow C_t$ is used. The scores in $C_t$ estimate the likelihood of each region proposal in $\mathcal{F}_t$ matching an existing object instance in memory. The memory is then updated via a writing function $W(\mathcal{F}_t, \mathcal{M}_{t-1}, C_t) \rightarrow \mathcal{M}_t$, which modifies $\mathcal{M}_t$ based on the current observations and their correlations to stored instances. Finally, given the updated memory and correlation scores, the model assigns to each pixel in frame $F_t$ a label and an instance ID, represented by $S(\mathcal{F}_t, \mathcal{M}_t, C_t) \rightarrow \{(l_{t1}, i_{t1}), (l_{t2}, i_{t2}), \ldots, (l_{tn}, i_{tn})\}$, where $(l_{ti}, i_{ti})$ indicates the class label and instance ID for the i-th proposal.
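To make the read and write steps concrete, the following Python sketch tracks instances across frames with a small feature memory. It is only an illustration under explicit assumptions: the reading function is taken to be cosine similarity, the writing function a moving average with a hypothetical `threshold` and `momentum`, and proposals are plain feature lists. The VoS method cited above uses learned networks for these roles, so none of these choices should be read as the actual model.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def read(features: List[Vector], memory: List[Vector]) -> List[Tuple[int, float]]:
    """Reading step: best-matching stored instance and its score, per proposal."""
    scores = []
    for f in features:
        if not memory:
            scores.append((-1, 0.0))          # nothing stored yet
            continue
        sims = [cosine(f, m) for m in memory]
        best = max(range(len(sims)), key=sims.__getitem__)
        scores.append((best, sims[best]))
    return scores


def write(
    features: List[Vector],
    memory: List[Vector],
    scores: List[Tuple[int, float]],
    threshold: float = 0.8,                   # assumed matching threshold
    momentum: float = 0.5,                    # assumed moving-average factor
) -> Tuple[List[Vector], List[int]]:
    """Writing step: updated memory plus the instance ID given to each proposal."""
    updated = [list(m) for m in memory]
    ids = []
    for f, (j, score) in zip(features, scores):
        if j >= 0 and score >= threshold:
            # Matched an existing instance: blend the new observation into it.
            updated[j] = [momentum * m + (1 - momentum) * x for m, x in zip(updated[j], f)]
            ids.append(j)
        else:
            # No good match: store the proposal as a new instance.
            updated.append(list(f))
            ids.append(len(updated) - 1)
    return updated, ids


def track(video: List[List[Vector]]) -> Tuple[List[Vector], List[List[int]]]:
    memory: List[Vector] = []
    all_ids = []
    for features in video:
        scores = read(features, memory)
        memory, ids = write(features, memory, scores)
        all_ids.append(ids)
    return memory, all_ids


# Three toy frames: two objects, the same two slightly changed, then a new one.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
    [[-1.0, 0.0]],
]
memory, instance_ids = track(video)
```

The two objects keep their instance IDs in the second frame because their features stay close to the stored ones, while the dissimilar proposal in the third frame is written to memory as a new instance.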
Dynamic spatial-temporal object recognition
Generating candidate proposals
Due to the powerful learning ability of deep convolutional neural networks, object detectors such as Faster R-CNN66 and YOLO19,67 offer high accuracy, end-to-end learning, adaptability to diverse scenes, scalability, and real-time performance. However, they still only propagate the visual features of the objects within the region proposal and ignore complex topologies between objects, leading to difficulties distinguishing difficult samples in complex spaces. Rather than purely using object detector outputs, we leverage their bounding boxes and corresponding semantic feature maps at each frame as candidate proposals, which are then inferred by another relational graph network. In particular, denoting $f_\theta$ as the detector, at the i-th frame $F_i$, we compute a set of k bounding boxes covering AoE regions, $B_i = \{b_{i1}, b_{i2}, \ldots, b_{ik}\}$, and the feature embeddings inside them, $Z_i = \{z_{i1}, z_{i2}, \ldots, z_{ik}\}$, while ignoring $P_i$, the set of class probabilities for each bounding box in $B_i$, where $\{B_i, Z_i, P_i\} \leftarrow f_\theta(F_i)$. The detector $f_\theta$ is trained and updated from user feedback with annotations generated by the VoS tool.
Algorithm 1. I-MPN forward and backward pass.
Inductive message passing network
We propose a graph neural network $g_\epsilon$ using inductive message-passing operations30,31 for reasoning relations among objects detected within each frame in the video. Let $G_i = (V_i, E_i)$ denote the graph at the i-th frame, where $V_i$ is the set of nodes, with each node $v_{ij} \leftarrow b_{ij} \in V_i$ defined from the bounding boxes $B_i$, and $E_i$ is the set of edges, where we permit each node to be fully connected to the remaining nodes in the graph. We initialize the node-feature matrix $X_i$, which associates with each $v_{ij} \in V_i$ a feature embedding $x_{v_{ij}}$. In our setting, we directly use $x_{v_{ij}} = z_{ij} \in Z_i$ taken from the output of the object detector. Most current GNN approaches for object recognition24,48 use the following framework to compute the feature embedding for each node in the input graph $G$ (for the sake of simplicity, we ignore the frame index):
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \tag{1}$$
where $H^{(l)}$ represents all node features at layer $l$, $\tilde{A}$ is the adjacency matrix of the graph $G$ with added self-connections, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $W^{(l)}$ is the learnable weight matrix at layer $l$, $\sigma$ is the activation function, and $H^{(l+1)}$ is the output node features at layer $l+1$. To integrate prior knowledge, Zhao, Jianjun, et al.24 further counted co-occurrence between objects as the adjacency matrix $\tilde{A}$. However, because the adjacency matrix $\tilde{A}$ is fixed during training, the message passing operation in Eq. (1) cannot generate predictions for new nodes that were not part of the training data but appear during inference, i.e., the set of objects in training and inference has to be identical. This obstacle makes the model unsuitable for the mobile eye-tracking setting, where users’ areas of interest may vary over time. We address such problems by changing the way node features are updated, from being dependent on the entire graph structure $\tilde{A}$ to depending on the neighboring nodes $N(v)$ of each node $v$. In particular,
$$h^{(l)}_{N(v)} = \mathrm{AGG}^{(l)}\left(\left\{h^{(l)}_{u}, \forall u \in N(v)\right\}\right) \tag{2}$$
$$h^{(l+1)}_{v} = \sigma\left(W^{(l)} \cdot \mathrm{CONCAT}\left(h^{(l)}_{v}, h^{(l)}_{N(v)}\right)\right) \tag{3}$$
where $h^{(l)}_{v}$ represents the feature vector of node $v$ at layer $l$, $\mathrm{AGG}$ is an aggregation function (e.g., Pooling, LSTM), $\mathrm{CONCAT}$ is the concatenation operation, and $h^{(l+1)}_{v}$ is the updated feature vector of node $v$ at layer $l+1$. In scenarios when a new unseen object $v_{new}$ is added to track by the user, we can aggregate information from neighboring seen nodes $v_{seen} \in N(v_{new})$ by:
$$h^{(l+1)}_{v_{new}} = \sigma\left(W^{(l)} \cdot \mathrm{CONCAT}\left(h^{(l)}_{v_{new}}, \mathrm{AGG}^{(l)}\left(\left\{h^{(l)}_{v_{seen}}\right\}\right)\right)\right) \tag{4}$$
and then update the trained model on this new sample rather than on all nodes in the training data as in Eq. (1). The forward and backward pass of our message-passing algorithm is summarized in Algorithm 1. We found that such operations obtained better results in experiments than other message-passing methods such as attention network22, principal neighbourhood aggregation53 or transformer68 (Fig. 4b).
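The neighbour-based update of Eqs. (2)-(4) can be illustrated with a short pure-Python sketch. It assumes max pooling as AGG and ReLU as $\sigma$, uses made-up 2-dimensional embeddings and a hand-picked weight matrix, and the helper names (`max_pool`, `update_node`, `layer`) are hypothetical; it is meant to show why the same weights apply to a node that was never seen before, not to reproduce the trained network.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def max_pool(vectors: List[Vector]) -> Vector:
    """AGG: element-wise maximum, unaffected by the order of the neighbours."""
    return [max(column) for column in zip(*vectors)]


def update_node(h_v: Vector, neighbours: List[Vector], weight: Matrix) -> Vector:
    """Eqs. (2)-(3) with ReLU as sigma: sigma(W . CONCAT(h_v, AGG({h_u})))."""
    # An isolated node has nothing to aggregate, so fall back to zeros.
    h_nbr = max_pool(neighbours) if neighbours else [0.0] * len(h_v)
    combined = h_v + h_nbr                     # CONCAT of own and neighbour features
    return [max(0.0, sum(w * x for w, x in zip(row, combined))) for row in weight]


def layer(features: Dict[str, Vector], weight: Matrix) -> Dict[str, Vector]:
    """One message-passing layer on a fully connected graph of detected boxes."""
    return {
        v: update_node(h, [features[u] for u in features if u != v], weight)
        for v, h in features.items()
    }


# Toy 2-d detector embeddings for two seen objects; W has shape 2 x 4.
seen = {"tablet-left": [1.0, 0.0], "book": [0.0, 2.0]}
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
hidden = layer(seen, W)

# Eq. (4): a newly tracked object reuses the same W and aggregates from seen nodes.
h_new = update_node([3.0, -1.0], list(seen.values()), W)
```

Because the weight matrix acts on a node's own features and a pooled summary of its neighbours, its shape does not depend on how many objects are in the frame, which is what lets a newly added object be embedded without rebuilding a fixed adjacency matrix.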
Fig. 4. Comparative performance analysis.
Algorithm 2. PyTorch-style I-MLE algorithm.
End-to-end learning from human feedback
In Algorithm 2, we present the proposed human-in-the-loop method for mobile eye-tracking object recognition. This approach integrates user feedback to jointly train the object detector $f_\theta$ and the graph neural network $g_\epsilon$ for spatial reasoning of object positions. Specifically, $f_\theta$ is trained to generate coordinates for proposal object bounding boxes, which are then used as inputs for $g_\epsilon$ (bounding box coordinates and feature embedding inside those regions). The graph neural network $g_\epsilon$ is, on the other hand, trained to generate labels for these objects by considering the correlations among them. Notably, our pipeline operates as an end-to-end framework, optimizing both the object detector and the graph neural network simultaneously rather than as separate components. This lessens the propagation of errors from the object detector to the GNN component, making the system robust to noise in environment setups. The trained models are deployed afterward to infer the next frames and are then refined again at wrong predictions using the user's annotation feedback, over a few loops until the model converges. In the experiment results, we found that such a human-in-the-loop scheme enhances the algorithm’s adaptation ability and yields comparable or superior results to traditional learning methods with a set number of training and testing samples.
Algorithm 3. User feedback propagation algorithm.
Experiments & results
Device Our hardware setup utilizes the Pupil Core eye-tracking device1, which users wear to observe their surroundings during the video recording process. The device outputs both video data and fixation points, capturing the user’s focal attention at each moment in time. The videos are displayed on a monitor, where a backend service powered by the VoS tool facilitates the annotation process.
Dataset statistics Our study utilizes three distinct video sequences, each recorded by a different user within our controlled environment. The first video spans 169 seconds, yielding 3873 extracted frames. The second video, slightly longer at 183 seconds, comprises 3422 frames. The third sequence, shorter in duration at 118 seconds, contains 2340 frames. This diverse range of video lengths and frame counts ensures a comprehensive dataset for evaluating the performance and adaptability of our method.
Metrics The experiment results are measured by the consistency of predicted bounding boxes and their labels with ground-truth ones. In most experiments except the fixation point cases, we evaluate performance for all objects in each video frame. We define $AP@\alpha$ as the Area Under the Precision-Recall Curve (AUC-PR) evaluated at the IoU threshold $\alpha$, $AP@\alpha = \int_0^1 p(r)\,dr$, where $p(r)$ represents the precision at a given recall level $r$. The mean Average Precision69 is computed at different IoU thresholds $\alpha$ ($mAP@\alpha$) and is the average of AP values over all classes, i.e., $mAP@\alpha = \frac{1}{n}\sum_{i=1}^{n}(AP@\alpha)_i$. We provide results for $\alpha \in \{50, 75\}$. Furthermore, we report mAP as an average over different IoU thresholds ranging from $0.5 \rightarrow 0.95$ with a step of 0.05.
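As a worked illustration of these definitions, the sketch below computes box IoU, approximates $AP@\alpha$ by summing precision over recall increments of the ranked detections, and averages AP over classes. It is a simplified, uninterpolated approximation of the integral (common benchmarks typically interpolate the precision-recall curve), and the function names and example numbers are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]        # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections: Sequence[Tuple[float, bool]], num_gt: int) -> float:
    """Approximate AP@alpha, the integral of p(r) dr, by a sum over ranked detections.

    `detections` holds (confidence, is_true_positive) pairs for one class, where a
    true positive is a detection matched to ground truth with IoU >= alpha.
    """
    if num_gt == 0:
        return 0.0
    ap, tp, prev_recall = 0.0, 0, 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    for rank, (_, hit) in enumerate(ranked, start=1):
        tp += int(hit)
        recall = tp / num_gt
        ap += (tp / rank) * (recall - prev_recall)   # precision x recall increment
        prev_recall = recall
    return ap


def mean_ap(per_class_ap: List[float]) -> float:
    """mAP@alpha: the average of AP@alpha over all classes."""
    return sum(per_class_ap) / len(per_class_ap)


# IoU thresholds 0.50, 0.55, ..., 0.95 used for the averaged mAP.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))            # two 2x2 boxes sharing a 1x1 corner
ap_example = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

In the example, the two boxes share one unit of area out of a union of seven, so the pair would not count as a match at either reported threshold, and the three ranked detections give an AP of 1.0 x 0.5 + (2/3) x 0.5.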
To assess the overall accuracy of object classification across the entire scene (Section "Further analysis"), we compute the average accuracy, which reflects the model’s ability to correctly identify and classify objects within a given frame. Specifically, the average accuracy metric is calculated by averaging the classification accuracy of each object in all video frames. This measure is critical in tasks where not only the localization of objects but also their correct classification is important.
Model configurations We use the Faster-RCNN66 as the network backbone for the object detector $f_\theta$ and follow the same training procedure proposed by the authors. The message-passing component $g_\epsilon$ uses the MaxPooling and LSTM aggregator functions to extract and learn embedding features for each node. We use output bounding boxes and feature embedding at the last layer in $f_\theta$ as inputs for $g_\epsilon$. The outputs of $g_\epsilon$ are then fed into the Softmax and trained with cross-entropy loss using Adam optimizer70.
Human-in-the-loop vs. conventional data splitting learning
We investigate I-MPN's ability to interactively adapt to human feedback provided during model learning and compare it with a conventional learning paradigm using a fixed train-test splitting rate.
Baselines setup In the conventional machine learning approach (CML), we employ a fixed partitioning strategy, where the first 70% of video frames, along with their corresponding labels, are utilized for training, while the remaining 30% are reserved for testing purposes. We use I-MPN to learn from these annotations. In the human-in-the-loop (HiL) setting, we still utilize I-MPN but with a different approach. Initially, only the first 10 seconds
1 https://pupil-labs.com/products/core
58. Trajkovska, K., Kljun, M. & Pucihar, K. Č. Gaze2aoi: Open source deep-learning based system for automatic area of interest annotation with eye tracking data. arXiv preprint arXiv:2411.13346 (2024).
59. Mosqueira-Rey, E., Hernández-Pereira, E., Alonso-Ríos, D., Bobes-Bascarán, J. & Fernández-Leal, Á. Human-in-the-loop machine learning: a state of the art. Artificial Intelligence Review 56, 3005–3054 (2023).
60. Saeed, A., Spathis, D., Oh, J., Choi, E. & Etemad, A. Learning under label noise through few-shot human-in-the-loop refinement. arXiv preprint arXiv:2401.14107 (2024).
61. Yao, R., Lin, G., Xia, S., Zhao, J. & Zhou, Y. Video object segmentation and tracking: A survey. ACM Transactions on Intelligent Systems and Technology (TIST) 11, 1–47 (2020).
62. Zhou, T., Porikli, F., Crandall, D. J., Van Gool, L. & Wang, W. A survey on deep learning technique for video segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 7099–7122 (2022).
63. Song, E. et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449 (2023).
64. Huang, W. et al. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973 (2023).
65. Tschernezki, V. et al. Epic fields: Marrying 3d geometry and video understanding. Advances in Neural Information Processing Systems 36 (2024).
66. Girshick, R. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, 1440–1448 (2015).
67. Jiang, P., Ergu, D., Liu, F., Cai, Y. & Ma, B. A review of yolo algorithm developments. Procedia Computer Science 199, 1066–1073 (2022).
68. Shi, Y. et al. Masked label prediction: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509 (2020).
69. Everingham, M., Van Gool, L., Williams, C. K., Winn, J. & Zisserman, A. The pascal visual object classes (voc) challenge. International journal of computer vision 88, 303–338 (2010).
70. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
71. Kelleher, J. D., Mac Namee, B. & D’arcy, A. Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies (MIT press, 2020).
72. Intel. Computer vision annotation tool (2021).
73. Kukkala, V. K., Tunnell, J., Pasricha, S. & Bradley, T. Advanced driver-assistance systems: A path toward autonomous vehicles. IEEE Consumer Electronics Magazine 7, 18–25 (2018).
74. Baldisserotto, F., Krejtz, K. & Krejtz, I. A review of eye tracking in advanced driver assistance systems: An adaptive multi-modal eye tracking interface solution. In Proceedings of the 2023 Symposium on Eye Tracking Research and Applications, 1–3 (2023).
75. Zhang, L. et al. Learning unsupervised world models for autonomous driving via discrete diffusion. International Conference on Learning Representations (2024).
76. Shi, J.-X. et al. Long-tail learning with foundation model: Heavy fine-tuning hurts. International Conference on Machine (2024).
77. Barz, M., Bhatti, O. S., Alam, H. M. T., Nguyen, D. M. H. & Sonntag, D. Interactive Fixation-to-AOI Mapping for Mobile Eye Tracking Data Based on Few-Shot Image Classification. In Companion Proceedings of the 28th International Conference on Intelligent User Interfaces, IUI ’23 Companion, 175–178, https://doi.org/10.1145/3581754.3584179 (Association for Computing Machinery, New York, NY, USA, 2023). Event-place: Sydney, NSW, Australia.
78. Jiang, Y. et al. Ueyes: Understanding visual saliency across user interface types. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1–21 (2023).
79. Yfantidou, S. et al. The state of algorithmic fairness in mobile human-computer interaction. In Proceedings of the 25th International Conference on Mobile Human-Computer Interaction, 1–7 (2023).
80. Shaily, R., Harshit, S. & Asif, S. Fairness without demographics in human-centered federated learning. arXiv preprint arXiv:2404.19725 (2024).
81. Marinó, G. C., Petrini, A., Malchiodi, D. & Frasca, M. Deep neural networks compression: A comparative survey and choice recommendations. Neurocomputing 520, 152–170 (2023).
82. Xu, C. & McAuley, J. A survey on model compression and acceleration for pretrained language models. In Proceedings of the AAAI Conference on Artificial Intelligence 37, 10566–10575 (2023).
83. Bolya, D. et al. Token merging: Your ViT but faster. In International Conference on Learning Representations (2023).
84. Tran, H.-C. et al. Accelerating transformers with spectrum-preserving token merging. arXiv preprint arXiv:2405.16148 (2024).
Acknowledgements
This work was funded, in part, by the European Union under grant number 101093079 (MASTER), the German Federal Ministry of Education and Research (BMBF) under grant number 01IW23002 (No-IDLE), and the Lower Saxony Ministry of Science and Culture (MWK) as part of the project zug.KI. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Duy M. H. Nguyen.
Author contributions
Hoang H. Le and Duy M. H. Nguyen implemented the proposed method and ran experiments on the datasets. Omair, Laszlo, and Michael supported the collection of datasets and provided tools for annotation. Thinh Ngo supported the work by providing data annotations. Binh T. Nguyen, Michael, and Daniel Sonntag guided the project.
Declarations
Competing interests
The authors declare no competing interests.
Ethical approval
We confirm that this study used a dataset sourced with approval from the Ethics Board of the University of Saarland, ensuring adherence to all relevant ethical guidelines and regulations, and we have the right to use this dataset for our research. The dataset did not involve direct interaction with participants by the authors. The human annotations referenced in the study were performed by employed students who were compensated for their work and conducted the annotations as part of their professional duties.
Additional information
Correspondence and requests for materials should be addressed to H.H.L. or D.M.H.N.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
© The Author(s) 2025