Egocentric Vision-based Action Recognition: A survey

Author: Núñez Marcos, Adrián,Azkune Galparsoro, Gorka,Arganda Carreras, Ignacio
Publisher: Elsevier
Year: 2022
DOI: 10.1016/j.neucom.2021.11.081
Source: https://addi.ehu.eus/bitstream/10810/56520/1/1-s2.0-S0925231221017586-main.pdf
Survey paper
Adrián Núñez-Marcos a,⇑, Gorka Azkune b, Ignacio Arganda-Carreras c,d,e
a DeustoTech Institute, University of Deusto, Avenida de las Universidades, No. 24, Bilbao 48007, Basque Country, Spain
b IXA NLP Group, Faculty of Computer Science, Euskal Herriko Unibertsitatea (EHU/UPV), M. Lardizabal 1, Donostia 20008, Basque Country, Spain
c Donostia International Physics Center (DIPC), Manuel Lardizabal 4, Donostia 20018, Basque Country, Spain
d Ikerbasque, Basque Foundation for Science, Plaza Euskadi 5, Bilbao 48009, Basque Country, Spain
e Department of Computer Science and Artificial Intelligence, University of the Basque Country, M. Lardizabal 1, Donostia 20008, Basque Country, Spain
article info
Article history:
Received 6 May 2021
Revised 8 November 2021
Accepted 21 November 2021
Available online 8 December 2021
Keywords:
Deep learning
Computer vision
Human action recognition
Egocentric vision
Few-shot learning
abstract
The egocentric action recognition (EAR) field has recently increased its popularity due to the affordable and
lightweight wearable cameras available nowadays, such as GoPro and similar devices. Therefore, the amount of
egocentric data generated has increased, triggering the interest in the understanding of egocentric videos.
More specifically, the recognition of actions in egocentric videos has gained popularity due to the
challenge that it poses: the wild movement of the camera and the lack of context make it hard to recognise
actions with a performance similar to that of third-person vision solutions. This has ignited the research
interest on the field and, nowadays, many public datasets and competitions can be found in both the
machine learning and the computer vision communities. In this survey, we aim to analyse the literature
on egocentric vision methods and algorithms. For that, we propose a taxonomy to divide the literature
into various categories with subcategories, contributing a more fine-grained classification of the available
methods. We also provide a review of the zero-shot approaches used by the EAR community, a
methodology that could help to transfer EAR algorithms to real-world applications. Finally, we summarise the
datasets used by researchers in the literature.
© 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND
license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
Since the introduction of the first wearable camera [122], commercial
and lightweight cameras such as GoPro and similar devices have
become widely used, producing a vast amount of first-person or
egocentric videos to analyse. These videos are recorded from the
point of view of the wearer of the camera, producing videos with
large, non-linear and unpredictable head and body motion and a
lack of global context, which pose a challenge from a machine
learning standpoint. Hence, the increasing amount of data and
the interesting setting of these types of videos have attracted the
computer vision and the machine learning communities towards
the vision-based EAR research field.
In contrast to third-person or exocentric videos, first-person or
egocentric videos contain rich intrinsic features, motivating the
use of novel approaches, i.e. without relying exclusively on
approaches from the exocentric vision literature. For example,
these features include the occlusion-free interactions with objects,
the focus on the manipulation of objects, the gaze movement and
so forth, which have been identified in the literature [106] and are
helpful to discern actions. These cues make first-person or
egocentric action recognition a research field on its own, apart from
the third-person vision research. In fact, exploiting the intrinsic
features of this type of vision seems to be crucial to correctly
recognise the content of videos [139].
Nowadays, the egocentric vision research line has been adopted
by various research groups and several solutions have been
proposed. Even new features such as the use of sound are being
leveraged in recent works [9,31], as some actions cannot be
distinguished using only visual cues. Even though the field is
advancing, it still has to become as large as the third-person one.
In addition, the results are still far from being acceptable. In fact,
the majority of the research is focused on the supervised learning
setting in which labels are provided in the training stage. This
requires large annotated datasets, which is a laborious task. There
are, however, works that have analysed the use of few-shot [208]
and zero-shot [205] learning frameworks. These require a few
annotated samples at most, being more suitable for real-world
applications than the classic supervised settings. Nevertheless,
more research is required in order to steer new solutions in the
correct direction.
⇑ Corresponding author.
E-mail addresses: [email protected] (A. Núñez-Marcos), gorka.azcune@ehu.eus (G. Azkune), [email protected] (I. Arganda-Carreras).
Neurocomputing 472 (2022) 175–197
1.1. Vision-based Exocentric Action Recognition
In order to settle the basis of the action recognition field (later
focused only on EAR), we briefly describe the evolution of the
exocentric (third-person) vision-based action recognition field over
the last few years.
Before the success of Deep Learning, hand-engineered features
were used for action recognition; for instance, extracting the
foreground (optionally), computing features from the inputs (using,
e.g. traditional algorithms such as LBP [143,144], SIFT [116] and
SURF [18]) and applying a classifier to obtain an action prediction.
The foreground extraction can be done, for example, to segment
hands and objects in egocentric-vision frames. The other two steps
can be applied to both types of action recognition approaches.
Other approaches include computing Optical Flow (OF) features,
computing the skeleton and joints, trajectory-based recognition
and so forth. These solutions are also seen in the EAR literature
with small adjustments to fit better the features that can be found
in egocentric videos (e.g. hands and objects).
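The "compute features, then classify" pipeline described above can be illustrated with a minimal sketch. It assumes the basic 3x3 variant of LBP on a grey-level image given as a list of lists, and a toy nearest-histogram classifier; the function names `lbp_histogram` and `nearest_class` are hypothetical and are not taken from any of the cited works.

```python
def lbp_histogram(image):
    """Return a 256-bin histogram of basic 3x3 LBP codes for a 2D grey-level image."""
    # Clockwise neighbour offsets (row, col), starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    # Border pixels are skipped because they lack a full 3x3 neighbourhood.
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre sets its bit.
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            hist[code] += 1
    return hist


def nearest_class(hist, class_histograms):
    """Toy classifier (an assumption): closest reference histogram under L1 distance."""
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    return min(class_histograms, key=lambda name: l1(hist, class_histograms[name]))
```

In practice the histogram would be computed per frame or per region (for instance on a segmented hand or object) and passed to a trained classifier rather than to this nearest-histogram rule.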
With Deep Learning, features are automatically extracted,
instead of manually. The exocentric action recognition field
switched to three main approaches [232]: multi-stream
Convolutional Neural Networks (CNNs) (the two-stream network being the
most used one [174]), 3D CNN [81] and those based on Recurrent
Neural Networks (RNNs), e.g. the Long Short-Term Memory (LSTM)
[73]. There are other methods such as those using graphs (e.g.
[13]) which can also be found within one of these categories.
Moreover, thanks to the use of Neural Network (NN) architectures,
transfer learning could be applied, allowing large models to be trained
with huge datasets (Imagenet [50] for static images, UCF101
[180] for third-person videos and so forth) before being
fine-tuned on specific tasks and/or smaller datasets (egocentric
datasets, for example).
Multi-stream networks started with two branches (the
two-stream network by [174]), taking RGB and OF frames, to extract
spatial and temporal features and a classifier on top to make the
classification. They later evolved to include more information
(such as gaze [114] or visual rhythm images [40]) or even to add
streams with varying information ([170] includes bones, joints
and their motion as input to their multi-stream setting). Later,
the computation of OF was alleviated by the proposal of [237],
which had a two-stream network learning motion features in an
end-to-end fashion, allowing real-time processing.
3D CNNs (e.g. C3D [199]) learn spatio-temporal features using the 3D
convolution operation. They are computationally heavier than the
multi-stream approaches; there are even works that aimed to divide
the 3D operation into a 2D and a 1D operation (as in the Xception
network [38]). Furthermore, an approach called Two-Stream
Inflated 3D ConvNet or I3D [30] mixed these last two ideas
(two-stream network and the 3D CNN) and became a standard for
extracting spatio-temporal features.
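As an illustration of how the outputs of two streams can be combined, the sketch below applies a weighted average to the class probabilities of an RGB stream and an OF stream (late fusion). This is only one simple fusion choice, assumed here for illustration; the function names and the `flow_weight` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_scores, flow_scores, flow_weight=0.5):
    """Weighted average of the per-class probabilities of the two streams."""
    rgb_p = softmax(rgb_scores)
    flow_p = softmax(flow_scores)
    return [(1.0 - flow_weight) * a + flow_weight * b for a, b in zip(rgb_p, flow_p)]


def predict(rgb_scores, flow_scores, flow_weight=0.5):
    """Index of the class with the highest fused probability."""
    fused = late_fusion(rgb_scores, flow_scores, flow_weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

With `flow_weight=0.5` both streams count equally; other weights, or a learnt classifier on top of the concatenated features, are equally plausible design choices.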
In fact, regarding feature extraction, it was usual to have a
network such as an I3D extracting spatio-temporal features in a
short-term span while having an RNN such as the LSTM extracting
temporal features in a longer temporal span. This type of architecture
was popularised by [56] and, afterwards, many works started
applying it, e.g. [200,65,229].
The majority of these architectures can be directly applied to
egocentric videos. However, as seen in Section 2, there are better
ways to deal with egocentric videos.
1.2. Contributions and Arrangement
In this paper, we contribute the following:
- A taxonomy to classify EAR methods into categories and subcategories.
- A review of the EAR proposals using this taxonomy.
The rest of the paper is arranged as follows: Section 2 presents the
aforementioned taxonomy and reviews the fine-grained classified
literature; Section 3 presents the EAR methods that use or have
the potential to be used within the zero-shot paradigm;
Section 4 summarises the egocentric video datasets and, finally,
Section 6 provides the final conclusions.
2. Egocentric Action Recognition
The idea of using egocentric videos has only started to be
exploited in the last decade thanks to novel, lightweight and
affordable devices such as GoPro and similar devices. In fact, lifelogging
has become widely used. Indeed, the number of datasets in the
state of the art of the EAR field has progressively increased during
this decade, with releases such as the large EPIC Kitchens dataset
[42]. This has also motivated the research on the topic
[84,14,21,49,139,11,12], being mainly divided into three areas: (i)
activity recognition/classification, (ii) video summarisation and
(iii) object detection. In this section, we aim to provide an
extensive review on the action recognition subfield, referred to as EAR
throughout the document. Examples of egocentric actions are
shown in Fig. 1.
First of all, it should be noted that the literature presents two
conflicting terms: actions and activities. [139] discussed that both
terms are semantically different: an action is a short event such as
"opening a jar" while an activity is a semantically higher event in
which various actions are combined, lasting from several minutes
to hours. Nonetheless, part of the literature does not take this
difference into account and uses the word "activity" instead of action.
Moreover, some works even denote the motion using the word
"action", i.e. the movement generated when something is being cut
would be called an action, regardless of the objects present in
the scene. In this survey, we will differentiate between actions
and activities and between actions and motion, where motion
denotes the movement generated by an action independently of
the object.
Reviewing the literature on EAR, it is noticeable that there are
various special cues intrinsic to egocentric videos that drive the
type of approach that researchers use to tackle the EAR challenge.
For example, [106] used (i) the hand pose and its movement [17],
(ii) the head motion and (iii) the gaze direction as egocentric cues
in their work. In addition, they also stressed the importance of
objects in the egocentric setting. In general, from the literature,
we can extract the main egocentric features or cues used,
summarised in Fig. 2. Hence, we can split these characteristics into
two groups: those related to the appearance of objects and those
related to the movement or motion.
Therefore, in this chapter, we split the literature into four
sections depending on the type of modality driving the approaches:
(i) object- or appearance-based approaches, (ii) motion-based
approaches, (iii) hybrid approaches (combining appearance and
motion) and (iv) other approaches that consider other modalities
such as the sound or that are making a contribution not related
to these modalities. The proposed taxonomy used for this section
is illustrated in Fig. 3 and all the references following this
categorisation can be found in Table 1.
We believe that having a taxonomy to divide the literature
allows researchers to have a better perspective of the kinds of
works that have been published or the research lines that are
currently active. For a beginner, this makes it easy to find works of
interest and to explore similar ones. The possible disadvantage that
these kinds of taxonomies may have is that, unless they are very
fine-grained (which is not practical), for some works there are
overlaps between categories, i.e. a specific research may fall into
various categories. There might be better representations of such
a taxonomy (e.g. a graph) that cannot be represented here but
could be beneficial for the EAR community. We hope this
taxonomy proposal enriches the research and motivates researchers to
propose new ways to divide the literature.
2.1. Object-driven Action Recognition
The current literature is highly dominated by works that believe
that objects present in the scene and, specially, objects related to
tasks are the main cues in the recognition of actions. That is,
analysing objects in videos can become a critical hint towards
recognising actions. In fact, [59] argued that the egocentric paradigm
is specially beneficial to analyse actions that involve objects due
to three reasons: (i) object occlusions are minimised, as the space
where these are manipulated is always present; (ii) objects are
often seen at consistent viewing directions with respect to the
egocentric camera, as poses and the displacement of the manipulated
objects are also consistent in workspace coordinates; and (iii) the
camera is usually focusing on objects and actions, that are usually
in the centre of the image or video, thus obtaining high quality
image measurements.
Regarding the classification of objects, there are various ways in
the literature to categorise them. [192], for example, opted for
defining objects by the type of space they are in. That is, the space
observed by the subject (the one wearing the camera) is known as
the observable space. Then, any object that is graspable or can be
reached using the hands is contained within the manipulation
space. Lastly, an object that is grabbed by the subject is said to be
a manipulated object.
In a complementary way, [139] stated that four types of objects
can be observed:
- Active and passive objects: active objects are those relevant to
  actions and passive objects are background or non-important
  items.
- Salient and non-salient objects: the former are those that are
  fixated by the gaze or those in which the focus is put on while
  the latter can be considered background or non-attended
  objects.
- Manipulated objects: objects that are in the hands are said to be
  manipulated.
- Multi-state objects: those that have changes in terms of colour
  or shape.
It is specially important to stress that active objects are considered
important to estimate the action [161], but recognising them is also
a challenging task due to hand occlusion or background clutter. To
diminish the effect of the background clutter, [60,59] proposed to
first detect a Region of Interest (ROI) before localising objects. In fact,
there are authors that aim to detect active objects in an
unsupervised way (without categorising them). Namely, [85] generated a
pool of segmentations, individually searching for instances of
specific objects (one at a time) by enforcing constraints such as
geometric consistency. [44] used a gaze tracker to infer the most important
objects and analysed the interactions with them. [129] made a
segmentation process in two steps: first, they generated a probabilistic
Fig. 1. Examples of egocentric actions (subsampled frames) from the Extended GTEA Gaze+ dataset: (a) "cut bell pepper" action, (b) "wash pan" action and (c) "move bowl" action.
Fig. 2. Intrinsic egocentric cues.
boundary map of the scene and, second, they made use of the
fixation point to get the closed contour that included that point. [190]
presented EYEWATCHME, an integrated vision and state estimation
system that, at the same time, tracked, among others, the position
of hands and active objects. The approaches using the gaze are
specially interesting, as [97,72] showed that the eyes always look
directly at the objects that are being manipulated (active objects).
In fact, these approaches could be integrated in an action
recognition system that aimed to use active objects' information. More
recently, [100] stressed the importance of hands for the detection
of active objects. They proposed to automatically segment hands
first and, then, including this information in an object localisation
network, achieved a more precise localisation of objects. This
highlights the importance of hands in the active object detection
problem.
Bag of Objects approaches. There are several studies in which
the bag of objects approach is used (see Fig. 4 for an example).
Works such as those of [148,124] made use of bags of active and
passive objects to infer actions, being the objects first detected
by an object detector and, then, classified into active or passive.
Fig. 3. The proposed taxonomy used to summarise the literature on EAR.
Table 1
Summary of the literature following the taxonomy proposed in Fig. 3.
Category | Sub-category | References
Object-based approaches | Bag of Objects approaches | [192,148,125,61,124,135,4,88]
Object-based approaches | Hand-Object and-Hand relations | [20,19,33,67,16,120,196,138]
Object-based approaches | Graph representations | [59,133]
Object-based approaches | Temporal dynamics | [230]
Motion-based approaches | Eye movement | [224,225]
Motion-based approaches | Ego-motion | [191,163,178,137,153], [175,154,93,177]
Motion-based approaches | Eye movement and ego-motion | [142,215]
Hybrid approaches | Two-stream architectures | [121,95,211,189,105,202], [234,209,117,187,118,228]
Hybrid approaches | Multi-stream architectures | [195,64,74,207,83,128]
Hybrid approaches | Single-stream, multiple tasks | [176,188,86,149,146,115]
Hybrid approaches | Combination of multiple features | [182,171,216,35,176,233,134,131], [79,155,52,239,238,87,226,227,94]
Hybrid approaches | Knowledge graphs | [212,165]
Hybrid approaches | Hand-based recognition | [236,66,28]
Other approaches | Sound modality | [9,31,32,90]
Other approaches | Task reformulation | [130,213]
Other approaches | Privacy | [152,54,183,198]
Other approaches | Data sampling | [218]
[192] used two complementary sets: one of observable objects and
another one of manipulable objects.
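A bag of objects can be sketched as a fixed-length count vector over an object vocabulary, optionally split into an active and a passive part. The snippet below is a minimal illustration under that assumption; the vocabulary, the `(label, is_active)` detection format and the function names are hypothetical, and the object detector itself is left out.

```python
from collections import Counter


def bag_of_objects(labels, vocabulary):
    """Count how often each vocabulary object appears among the detected labels."""
    counts = Counter(label for label in labels if label in vocabulary)
    # Counter returns 0 for objects that were never detected.
    return [counts[name] for name in vocabulary]


def active_passive_bag(detections, vocabulary):
    """Concatenate one bag for active objects and one for passive objects.

    `detections` is a list of (label, is_active) pairs, as could be produced by
    an object detector followed by an active/passive classifier.
    """
    active = bag_of_objects([label for label, is_active in detections if is_active], vocabulary)
    passive = bag_of_objects([label for label, is_active in detections if not is_active], vocabulary)
    return active + passive
```

The resulting vector can then be given to any standard classifier to predict the action.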
[125] argued that, as an extension of the traditional bag of
objects, spatio-temporal binning approaches could capture
space–time relations and, to solve the issue of inflexible predefined
schemes, they first proposed to learn the spatio-temporal
partitions that were most discriminative. For that, they generated a pool
of randomly generated candidates and used a boosting approach to
select the best ones. Second, to further improve the first
contribution, they aimed to create object-centric partitions, i.e. regions of
videos where active objects are supposed to appear, by creating a
histogram of active objects for each video. For the classification,
they computed features from each proposal in the pool and applied
the boosting operation to get the best proposals that were used to
train the final classifier.
One aspect related to this bag of objects is the object fluents,
i.e. a time-varying attribute of objects or groups of objects whose
values are the specific states of the attribute [113,62,132]. For
example, for a mug, the states or fluents can be empty and full
(binary fluents). Specifically, [113] proposed to represent an action as
concurrent and sequential object fluents. Given an egocentric
video, beam search was used to recognise the fluents per frame
and then infer actions. The bag of objects used in the work of
[61] was composed of sequences of visual patches of objects (a
sequence represented the changes of an object during a video).
[4] also modelled object state transitions as a means of inferring
actions. In their model, a CNN extracted visual features from a
set of frames selected from K segments (uniformly sampled across
each video), one per segment. The network was later divided into
two branches by means of a point-wise convolution: the first one
was in charge of learning nouns, while the second one took care
of learning states. A global average pooling was applied to obtain
a feature vector from each branch, one per frame. For the noun
vectors, a point-wise convolution led to a single feature vector while,
for the states, two channels were left after the same operation. The
two channels of the state branch represented the verb (the type of
change applied from the pre-state to the post-state), learnt using a
Fully-Connected (FC) layer. For the action classification task, another
FC layer was used. [88] analysed the use of object detections from
YOLO [159] as a tool to detect indoor actions and to experiment
with various detection parameters. They observed that the
presence of certain objects was highly correlated with some actions
and that the lack in the detection of those relations hampered
the detection of actions. Thus, they compensated this using the
temporal information of objects, i.e. they gathered detections of
various frames to get a more complete picture of the scene. More
specifically, they trained a NN with a per-frame bag of objects to
infer the location (physical place); they also did the same using an
LSTM network to infer the location using the whole video. Finally,
for action recognition, another LSTM was used, including in the
input the location and shape of the bounding boxes of the detected
objects apart from the presence vectors.
New methodologies to represent the bag of objects approach
are also arising, such as that of [135]. They presented a preliminary
work on object-based action recognition in which they detected
objects using a pre-trained CNN and they recognised the action
without training any other model. Specifically, to estimate the
action, they exploited web data to compute the semantic similarity
between the detected object names and the names of the action
classes.
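The idea of scoring action classes by their semantic similarity to the detected object names can be sketched as follows. This is only an illustration: it assumes that each word already has a vector representation (the `embeddings` dictionary is a hypothetical stand-in for whatever similarity source is used, such as the web data mentioned above) and uses cosine similarity averaged over the detected objects.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_actions(detected_objects, action_names, embeddings):
    """Rank action names by their mean similarity to the detected object names."""
    scores = {}
    for action in action_names:
        sims = [cosine(embeddings[obj], embeddings[action]) for obj in detected_objects]
        scores[action] = sum(sims) / len(sims)
    # Highest mean similarity first; no model is trained at any point.
    return sorted(scores, key=scores.get, reverse=True)
```

For example, with toy vectors in which "knife" and "pepper" lie close to "cut" and far from "wash", `rank_actions(["knife", "pepper"], ["wash", "cut"], embeddings)` places "cut" first.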
Hands, Hand-Objects and Object-Object interactions. The
interaction between humans (using hands mainly) and objects
and also between objects is also a quite analysed topic in the
EAR field. [20] presented their bag of relations, which extended
the idea of the bag of objects including, not only the object itself,
but also the part of the body that interacted with the object
(object-body) and also the object-object relations. With the same
idea of the "bag of interactions", [19] proposed a Histogram of
Oriented Pairwise Relations in which the spatial relations (distances,
orientations and alignments) between visual-words were
represented. Similarly, [33] also aimed to capture hands and the objects
that were being manipulated. For that, they leveraged the R*CNN
presented by [67] to detect the primary region (hands) and the
secondary regions (objects). The output of that module was given to
an LSTM to process the evolution of the video. Going one step
further, [196] presented a unified model which, given a single RGB
image, in a single feed-forward pass, estimated the 3D hand and
object poses, their interactions and the object and action classes.
They extracted features using a Fully Convolutional Network (FCN)
in which each output cell predicted 3D hand poses and object
bounding box coordinates. Then, these cells were associated with
a vector that contained target values for the hand and object pose,
the object and action class and the overall confidence value. Those
predictions with the highest confidence were passed to their
interaction RNN.
In contrast, without the need to include interactions, there is
research about the sole use of the shape and pose of hands to
determine actions. [16] argued that they could infer actions in their
dataset using only that information. To test their hypothesis, they
masked out the region where there were no hands and used a CNN
to infer actions. Even though the results were not perfect, they
showed that there is a high correlation between hands and actions.
Taking into account the temporal domain by applying a simple
majority voting, they concluded that their results improved as a
consequence of the importance that certain hand poses may have,
being more distinctive than others.
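The majority voting over per-frame predictions mentioned above can be sketched in a few lines. The function name and the string labels are hypothetical; the per-frame predictions are assumed to come from a frame-level classifier such as a CNN.

```python
from collections import Counter


def majority_vote(frame_predictions):
    """Return the label predicted for the largest number of frames."""
    if not frame_predictions:
        raise ValueError("need at least one per-frame prediction")
    # most_common(1) returns a list with the single (label, count) pair
    # that has the highest count.
    return Counter(frame_predictions).most_common(1)[0][0]
```

A few frames with an ambiguous hand pose are then outvoted by the frames in which the pose is distinctive.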
While the interactions between hand and objects are important,
the relation between different objects is also a central element of
actions, i.e. in a given scenario, only a subset of objects may be
relevant to the task. That is why [120] proposed a way to model
arbitrary relations between arbitrary subgroups of objects. Their
method was first divided into two parts: (i) in the coarse-grained
part, a CNN extracted features from each frame, these were passed
through a Multi-Layer Perceptron (MLP) and, to join all the features,
the Scaled Dot-Product Attention (SDP-Attention) of [201] was
applied to them; and (ii) in the fine-grained part, the Region
Proposal Network (RPN) proposed by [160] was used to extract object
ROI, which were fed to the Recurrent Higher-Order Interaction
(Recurrent HOI) module they contributed. This module employed
a learnable attention mechanism to decide the set of candidate
objects that were relevant to each action. Finally, the outputs of
both streams were concatenated and a FC layer with a softmax
activation was used.
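For reference, scaled dot-product attention computes softmax(QK^T / sqrt(d_k)) V. The sketch below is a plain-Python illustration of that formula on lists of vectors; it is not the implementation used in [120], and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of numbers."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: softmax(q . k / sqrt(d_k)) weighted sum of values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of the query to every key, scaled by sqrt(d_k).
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(logits)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

When all keys are identical, the weights are uniform and the output is simply the mean of the value vectors; when one key matches the query much better than the rest, the output approaches the corresponding value vector.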
Fig. 4. Bag-of-objects approaches aim at discovering actions using a collection of objects.
[138] investigated the acquisition of additional features that
modelled the interaction between hands and objects. For that, they
followed the bag-of-visual-words (BoVW) approach to model
actions. To infer the class of new samples, Dynamic Time Warping
(DTW) was applied to compare the features from a new sample and
the ones of the rest of samples. Next, they trained an object
detector to recognise left and right hands. With these detections, the
distance to any object could be determined. As active objects should
be in contact with hands, those objects that were being
manipulated (very close to the hands' position) were considered actives
and the distance between both hands and each hand and the active
object were computed. The addition of these features to the
presence of objects boosted the performance on the action recognition.
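DTW-based classification of a new sample against labelled samples can be sketched as below. The sketch assumes, for simplicity, one-dimensional sequences compared with an absolute-difference cost and a nearest-neighbour decision rule; real per-frame features would be vectors with a suitable distance, and the function names are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences (absolute-difference cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the best of: advance in a only, in b only, or in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify_by_dtw(query, labelled_sequences):
    """Return the label of the reference sequence with the smallest DTW distance to the query.

    `labelled_sequences` is a list of (label, sequence) pairs.
    """
    return min(labelled_sequences, key=lambda item: dtw_distance(query, item[1]))[0]
```

Because DTW allows elements to be repeated during alignment, two executions of the same action at different speeds can still obtain a small distance.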
[140] proposed a novel NN based on Symmetric Positive Definite (SPD) manifold learning. This
approach employed skeleton information for hand gesture (action)
recognition and was divided into three stages: (i) a CNN to encode
skeletal data; (ii) a Gaussian embedding to encode first- and
second-order statistics; and (iii) the learning of the SPD matrix
and the mapping of this matrix to a Euclidean space for the
classification of actions. 3D hand pose and gesture (action) recognition
were both the objective in the model proposed by [219]. This
started by learning joint-aware features using a ResNet network
and then the model branched in (i) the action recognition and (ii)
the hand pose estimation parts. These were trained iteratively, as
the output from one of them was the input to the other one and
vice versa. Within these branches, they proposed to use
multi-order multi-stream feature analysis. That is, various features were
computed: static, those representing velocity and those
representing acceleration. For the latter, they took into account the slow and
fast moving joints and proposed to compute them separately. Each
of these features was fed to a multi-scale relation module that
went from fine-grained hand features to more holistic features
and then class scores were computed with a Temporal Convolution
Network (TCN). In a similar fashion (even with the same dataset),
but not specifically intended for egocentric videos, [110] decoupled
hand posture variations and hand movements using a two-stream
network. For the first one, a 3D CNN was employed, taking also the
fingertips' relative position as an extra cue. The other stream was
implemented with another CNN. A FC layer computed the score
per stream before using them for gesture recognition.
Recently, [46] presented a graph architecture to model hand
skeleton data to recognise actions. Specifically, they employed a
spatio-temporal graph CNN. In fact, by exploiting the symmetry
of hand graphs, they proposed to use various sub-graphs to build
separate models for finger movements. In contrast, [111] argued
that, even though graph methods achieved good results, they were
inherently limited in capturing features of hand interactions. To
solve that, they contributed a self-attention based method: the
hierarchical self-attention network (HAN). A joint self-attention
module extracted local features and a finger self-attention module
aggregated them. For temporal reasoning, the temporal
self-attention module was in charge of modelling the dynamics of the
fingers and the entire hand.
Graph representations. Graphs are also used to represent
actions, as in the case of the work of [59], in which they built a
hierarchical graph (a tree-shaped graph) for activity recognition
in which an activity was composed of action nodes. The latter
had some leaf nodes: object and hand nodes. Their goal in
inference time was to be able to predict hands, objects, actions and
activities. To train the system, they employed an algorithm similar
to the Expectation-Conditional Maximization of [127]. Recently,
[133] presented a work in which they built a topological map
(represented by a graph) of the scene (of the physical space) from
egocentric videos. In order to cluster zones, they employed a Siamese
network that took pairs of images and was able to find pairs that
corresponded to the same zone. Then, the graph they constructed
had collections of clips within nodes (representing zones and the
clips in which those zones were visited) and edges represented
weak spatial connectivity between zones based on how people
traversed them. From this graph they could infer the primary places of
interactions and the actions related to those spaces. Moreover, they
showed how to link zones across multiple related environments
(such as kitchens from different datasets). [167] proposed a
method to jointly recognise, localise and summarise actions. First,
they applied a centre-surround model to detect a central region
and its surroundings, obtaining superpixels from which features
were extracted using a GoogleNet [193]. These were used to build
a graph with the superpixels as nodes. By applying a random walk,
all the vertices could be annotated in a single run. Finally, a
fractional knapsack-type formulation was adopted to obtain a
summary of the actions (given that there may be more than one
action occurring at the same time and that many superpixels
may be labelled as background). [96] parameterised left and right
hands and objects as individual graphs to be then joined in a single
multi-graph structure. This allowed their model to learn
interactions between both hands and between each hand and objects.
Temporal dynamics. The appearance in a frame, the local
features, can be extended to model the whole appearance of the video
or, better said, its dynamics and how it evolves. [230] proposed to
model the high level dynamics of the sub-events within an action
by dynamically pooling features of sub-intervals of time series
using a temporal feature pooling function. Specifically, each frame
was encoded using a CNN, in which each activation neuron was
considered a point in the time series, and features were pooled
in determined intervals (sub-events) to model the short-term
changes. Then, these sub-event dynamics were temporally aligned
and a group of Fourier coefficients was extracted in a temporal
pyramid to encode the overall video representation. Transformer
layers can also be employed to model this evolution, transforming
the problem into a sequence-to-sequence task. For example, [103]
presented their Trear, a Transformer-based architecture that took
RGB and depth images. Each modality was fed to an inter-frame
attention encoder (not sharing weights among them), merging
later in the mutual-attentional fusion block, allowing them to
create cross-modal representations. The latter were fed to a linear layer
to obtain a per-frame prediction, averaged at the end across frames
for the final action prediction.
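The general idea of pooling a per-neuron time series over sub-intervals and then summarising it with a few Fourier coefficients can be illustrated as follows. This is a loose sketch under strong simplifications: it uses fixed, equal-length sub-intervals with max pooling and a plain discrete Fourier transform, not the dynamic pooling function or the temporal pyramid of [230]; the function names are hypothetical.

```python
import cmath


def pool_subintervals(series, n_intervals):
    """Max-pool a 1-D time series over contiguous, roughly equal sub-intervals."""
    length = len(series)
    if n_intervals < 1 or length < n_intervals:
        raise ValueError("need 1 <= n_intervals <= len(series)")
    pooled = []
    for k in range(n_intervals):
        start = k * length // n_intervals
        end = (k + 1) * length // n_intervals
        pooled.append(max(series[start:end]))
    return pooled


def fourier_magnitudes(series, n_coeffs):
    """Magnitudes of the first n_coeffs discrete Fourier coefficients of the series."""
    n = len(series)
    mags = []
    for k in range(n_coeffs):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series))
        mags.append(abs(coeff))
    return mags


def video_descriptor(series, n_intervals, n_coeffs):
    """Pool first (short-term changes), then keep low-frequency content (overall dynamics)."""
    return fourier_magnitudes(pool_subintervals(series, n_intervals), n_coeffs)
```

Here `series` stands for the activations of a single CNN neuron across the frames of a video; a full descriptor would repeat this for every neuron and concatenate the results.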
2.2. Motion-driven Action Recognition
Apart from the object cues, which have shown to be relevant in
egocentric contexts, there are also cues related to the motion: eye
movement, hand motion and head motion. There is also a feature
called ego-motion, usually referring to the global motion generated
from objects in the scene, the movement of the body and the head.
Fig. 5 shows an example of the ego-motion of a video.
Eye movement. [98] stated that a person's eye movement is a
valuable source of information to recognise actions. In addition,
as mentioned by [27], the eye movement can be classified into
three types of movements: saccades, fixations and blinks. Saccades
are the constant and simultaneous movements of both eyes that
are aimed at building a mental "map" of the interesting parts of
the scene, fixations are stationary states in which the gaze is fixed
on a specific place and blinks are the regular opening and closing
movements of the eyelids. [224] limited themselves to actions
performed on a table and took hand positions, the locations of the eyes
and the head and the recorded ego-videos. Their aim was to be able
to segment actions. For that, and based on the fact that eye and
head movements are related to the attention as mentioned in the
work of [72], they developed a method to detect attention
switches. The tracking was done using a head-mounted ISCAN
infra-red video based eye tracker. With this, they divided each
Ad ián Núñez-Ma cos, G. Azkune and I. A ganda-Ca e as Neu ocompu ing 472 (2022) 175–197
180
ideo in o ac ion segmen s and used mul isenso y da a o ecog-
nise ac ions. In ano he wo k, [225] explo ed he mo emen
dynamics o some body pa s, namely, he eye (gaze), head and
hand mo emen s. They in eg a ed and modelled he ac ion using
Pa allel Hidden Ma ko Model HMM: body pa s we e p ocessed
in pa allel s eams and in eg a ed a he end. The bene i s we e
ha i allowed di e en sampling a es and di e en lea n opolo-
gies in each s eam and ha he noise o a s eam was isola ed
wi hou co up ing he o he s.
Ego-motion. A large part of the literature aims at capturing the ego-motion, or the general
motion generated from the head movement, and employing it to recognise actions. [92] stated
that there are two types of motion: instantaneous motion (directional component) and periodic
motion (frequency component). Actions such as turning one's head have a strong directional
component, while repetitive actions such as walking have strong periodic components. [191]
aimed to recognise interactions (each one composed of the manipulation, the object and the
location) using low-resolution images and temporal templates or motion history images. These
templates captured any motion detected in a video, using weights inversely proportional to the
temporal distance from the frame in which the motion was detected to the current one. For each
class, they computed a mean template and experimented with simple image matching, finding that
normalised cross-correlation performed the best. To infer the location, objects, interactions,
events and activities, they proposed a Dynamic Bayesian Network. [163] studied
interaction-related actions, i.e. actions that involve interacting with the observer such as "a
person hugging the observer" or "throwing objects to the observer". They went one step beyond
the work of [92] and explored multi-channel kernels to integrate global and local motion
information. They also introduced a methodology that took into account the temporal structure
of egocentric videos. Specifically, their global descriptors were histograms extracted from OF
data and the local descriptors were composed of 3-D XYT data, i.e. computing salient motion in
the video and summarising the gradient values of the detected motion patches. Moreover, they
clustered the motion descriptors and used the visual-word approach to represent the video.
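
The temporal templates and the template matching described for [191] can be sketched as follows. This is only an illustration under stated assumptions: the weight 1 / (1 + age), the binary motion masks and the class names are hypothetical choices, not details taken from the original work.

```python
import math

def motion_history(masks):
    """Build a motion history template from binary motion masks (oldest first).

    Each pixel stores 1 / (1 + age) of its most recent motion, so recent motion
    weighs more than old motion; pixels that never moved stay at 0.
    """
    last = len(masks) - 1
    rows, cols = len(masks[0]), len(masks[0][0])
    template = [[0.0] * cols for _ in range(rows)]
    for t, mask in enumerate(masks):
        weight = 1.0 / (1 + last - t)
        for r in range(rows):
            for c in range(cols):
                if mask[r][c]:
                    template[r][c] = max(template[r][c], weight)
    return template

def ncc(a, b):
    """Zero-mean normalised cross-correlation between two equally sized templates."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    ma, mb = sum(fa) / len(fa), sum(fb) / len(fb)
    num = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    den = math.sqrt(sum((x - ma) ** 2 for x in fa) * sum((y - mb) ** 2 for y in fb))
    return num / den if den else 0.0

def classify(template, class_means):
    """Return the class whose mean template correlates best with the input."""
    return max(class_means, key=lambda name: ncc(template, class_means[name]))

# Toy 1x3 "video": a blob moving from left to right over three frames.
masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
template = motion_history(masks)
class_means = {  # hypothetical mean templates per class
    "move_right": [[0.3, 0.5, 1.0]],
    "move_left": [[1.0, 0.5, 0.3]],
}
predicted = classify(template, class_means)
```

The most recent motion (rightmost pixel) receives the largest weight, which is what lets the template encode the direction of the movement.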
Similarly to the previous work, [137] made use of first-person dense trajectories in their
motion pyramidal structure. The relative strengths of motion along the trajectories were then
used to create various bag-of-words descriptors that were later combined into a single
descriptor of the action. A non-linear Support Vector Machine (SVM) was fed with these
descriptors to classify actions. [153] presented their Cumulative Displacement Curves, a method
based on the assumption that, over a long period of time, the average displacement caused by
the head rotation is practically zero. Therefore, they divided the frames with a fixed grid and
accumulated the displacement up to a certain point within each cell (Cumulative Displacement or
CD). Analysing trends in these displacements allowed them to focus on long-term actions and to
avoid small perturbations due to the head motion. Moreover, for long-term trends, they
convolved the CDs with a Gaussian kernel to smooth them. For classification, they obtained
various features and statistics computed from these motion vectors and applied an SVM. [178]
contributed a new dataset called LENA and provided several experiments on it with various
feature descriptors of trajectories, namely, Histogram of Oriented Gradients (HOG), Histogram
of Optical Flow (HOF) and Motion Boundary Histogram (MBH); Fisher Vector encoding; Principal
Component Analysis (PCA) for dimensionality reduction; and a linear SVM for the classification
step. [175] argued that a method for both short-term (take, put and so forth) and long-term
actions (walking, driving and so on) did not exist and proposed a way to solve the task. Their
solution was based on OF, in which they aimed to identify the dominant motion, i.e. motion
generated by objects and the hands. They compensated the camera motion using a RANSAC-based
homography [63] and applied an extension of HOF. Their classification goal was solely to infer
whether a video showed a short-term or a long-term action, but this could be applied in an EAR
system.
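
The idea behind the Cumulative Displacement Curves of [153] can be illustrated with a short sketch for a single grid cell. It assumes one scalar displacement per frame and uses hypothetical values for the Gaussian width and radius; the original method may differ in these details.

```python
import math
from itertools import accumulate

def cumulative_displacement(per_frame_displacement):
    """Running sum of the per-frame displacement measured inside one grid cell."""
    return list(accumulate(per_frame_displacement))

def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(curve, sigma=2.0, radius=4):
    """Convolve a curve with a Gaussian kernel, replicating the edge values."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(curve)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * curve[j]
        out.append(acc)
    return out

# Head shaking: displacements cancel out over time. Walking: a steady drift.
head_shake = [1.0, -1.0] * 10
walking = [0.5] * 20
cd_head = cumulative_displacement(head_shake)
cd_walk = cumulative_displacement(walking)
smooth_head = smooth(cd_head)
smooth_walk = smooth(cd_walk)
```

The curve for the head shake stays close to zero, while the curve for walking grows steadily, which is the kind of long-term trend the method relies on.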
Fig. 5. Ego-motion example in the EGTEA Gaze+ dataset. The top row shows subsampled RGB frames, the middle row has the horizontal optical flow component and the bottom row presents the vertical optical flow component.
[154] aimed to recognise long-term activities (helpful to segment long and unstructured videos)
with a CNN architecture. They sampled segments of 4 overlapping seconds from videos, spatially
divided each frame into a non-overlapping grid of size 32 × 32 and computed OF features from
two corresponding grid cells in consecutive frames. This led to a cube of size 32 × 32 × 2 (due
to the x and y components of flow), which was used to create a stack of shape 32 × 32 × 120
from the whole video, finally employed as input of a 3D CNN. [93] extracted features such as
Histograms of Oriented Gradients, Motion Boundary Histograms and trajectories, combined all of
them and applied PCA to reduce the dimensionality before applying various classifiers: SVM,
k-Nearest Neighbors (k-NN) and the combination of the previous two (SVMkNN). There are some
works that provide new ways to arrange motion information, such as that of [164], which
presented a new feature representation, called Pooled Time Series (POT), based on the time
series pooling of feature descriptors, particularly designed for motion information in
egocentric videos. However, it could be applied to any feature descriptor such as HOF or CNN
features. POT summarised the short- and long-term changes in the descriptors over time: it
applied various temporal filters (sets of time intervals) that were pooled with various
operators and concatenated to obtain a single feature vector.
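
The time series pooling behind POT can be sketched as follows. The sketch assumes a temporal pyramid as the set of intervals and uses sum, max and positive/negative gradient sums as pooling operators; these concrete choices and all names are illustrative assumptions rather than the exact configuration of [164].

```python
def pyramid_intervals(n_frames, levels):
    """Temporal pyramid: level l splits the video into 2**l equal intervals."""
    intervals = []
    for level in range(levels):
        parts = 2 ** level
        for p in range(parts):
            intervals.append((p * n_frames // parts, (p + 1) * n_frames // parts))
    return intervals

def pool_interval(series):
    """Pool one descriptor dimension over one interval with four operators."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    positive = sum(d for d in diffs if d > 0)   # total increase over time
    negative = sum(-d for d in diffs if d < 0)  # total decrease over time
    return [sum(series), max(series), positive, negative]

def pooled_time_series(descriptors, levels=2):
    """Concatenate the pooled values of every interval and descriptor dimension.

    descriptors: list of per-frame feature vectors; needs at least
    2**(levels - 1) frames so that no interval is empty.
    """
    n_frames, n_dims = len(descriptors), len(descriptors[0])
    feature = []
    for start, end in pyramid_intervals(n_frames, levels):
        for d in range(n_dims):
            series = [descriptors[t][d] for t in range(start, end)]
            feature.extend(pool_interval(series))
    return feature

# Toy input: 4 frames with a 1-dimensional descriptor each.
descriptors = [[1], [3], [2], [5]]
feature = pooled_time_series(descriptors, levels=2)
```

With two pyramid levels there are three intervals (the whole video and its two halves), so the resulting vector has 3 × 4 = 12 values per descriptor dimension.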
[177] proposed to combine several features such as dense trajectories (both forward and
backward), HOG, HOF (with a compensated head motion), MBH and so on. Temporal pyramids were
used to represent features to better capture slow and fast actions. Eventually, each feature
vector was used to build a bag of words, which showed an improvement in the performance of the
proposed solution. One important conclusion of this work was that, even though the hands and
objects are important as egocentric cues, it is not necessary to explicitly segment them.
Moreover, the features used in this work were also applied in third-person proposals, creating
a bridge between both first- and third-person action recognition.
Combining eye movement and ego-motion. Others, such as the work of [142], combined both
approaches and exploited the eye movement and the ego-motion; specifically, [142] analysed the
combination of the eye movement captured with an inside-looking camera and the ego-motion
captured with an outside-looking camera. For the first case, they presented their own encoding
method while for the second one they used global OF values. [215] aimed to recognise actions in
an unsupervised way in an office and a home environment: they employed an encoding of saccade
information (from an inside camera) and an OF encoding obtained from the video frames of an
outside camera. They introduced two variants of Multi-Task Clustering, including data from
different users in their clusters.
2.3. Hybrid approaches for Action Recognition
So far, the most promising approaches have been the object-driven ones. However, motion-driven
methods may add more robustness and, thus, hybrid models are also proposed in the literature.
In particular, the Deep Learning approaches dominate the literature due to their advantage in
automatically extracting features from different information sources.
Two-stream architectures. A highly popularised approach in the DL community is the two-stream
network presented in the work of [174], which employs both RGB and OF information as input.
This model was first used for exocentric vision but it was later adapted to egocentric vision
[95,189,117,187]. In addition, [174] observed that networks perform better when they do not
need to learn to estimate the motion implicitly. Fig. 6 shows an example of a neural two-stream
network that takes RGB and OF images as input. [121] proposed an improvement of the appearance
stream, dividing it into two modules: one for hand segmentation and the other, which took the
output of the first one, for object classification. The hand segmentation part segmented and
localised hands, creating a Gaussian bump in the region where hands were located (or the space
between hands). That part was cropped and fed to the object classification part, which was
trained for object recognition. Both this network and the motion stream had their own loss. At
the end, both network outputs were concatenated and an FC layer with a softmax activation was
used to classify actions. Hence, three different losses were used for training. The fusion of
both branches was done with a concatenation operation; however, this fusion was later revisited
in the work of [95], in which they contributed a long-term fusion pooling to aggregate the
features coming from the two branches and they also analysed the effect of various pooling
methods, namely, sum pooling, max pooling and gradient pooling. A combination of all of them
seemed to provide the best accuracy. An SVM was used as a classifier on top. Instead of
employing the standard hard assignment of a single label, [211] used a soft assignment of
various motion labels, e.g. {open, hold, turn, rotate} can denote the kind of motion used to
open a jar or a bottle instead of just using the open label. This representation can generalise
to unseen actions in which the motion pattern varies in some way, depending on the active
object. Later, [209] presented a multi-label verb-only representation for action recognition
and action retrieval. Their method allowed for an overlap of labels, removing the ambiguity of
previous single-label methods. They observed that a multi-verb approach with hard assignment
was best suited for recognition tasks while an approach with soft assignment was better for
retrieval tasks.
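
The concatenation-based fusion of the two branches followed by an FC layer with softmax, as described above, can be sketched in a few lines. The feature vectors, weights and number of classes below are toy values chosen for illustration; in a real two-stream network the features would come from the CNN branches and the weights would be learnt.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fully_connected(x, weights, bias):
    """One dense layer: weights holds one row of input weights per output class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def late_fusion(rgb_features, flow_features, weights, bias):
    """Concatenate the appearance and motion features and classify them jointly."""
    fused = rgb_features + flow_features  # list concatenation
    return softmax(fully_connected(fused, weights, bias))

# Toy features from the two branches and a hypothetical 3-class FC layer.
rgb_features = [0.9, 0.1]
flow_features = [0.2, 0.8]
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]
bias = [0.0, 0.0, 0.0]
action_probs = late_fusion(rgb_features, flow_features, weights, bias)
```

Because the FC layer sees both feature vectors at once, it can learn interactions between appearance and motion that averaging two separate predictions would miss.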
As the two-stream approaches required an aggregation operation for each clip of the video,
[188,189] proposed to extend the architecture in a CNN-RNN fashion using the Convolutional Long
Short-Term Memory (ConvLSTM) network of [214] as the RNN. Moreover, one of the contributions of
[189] was a spatial attention layer between the CNN and the ConvLSTM in the spatial branch:
they used Class Activation Maps (CAM) [235] from a pretrained CNN to encode the video.
Following the idea of [189] of adding attention mechanisms, [105] developed a NN that jointly
classified actions and learnt attention map distributions using gaze information as supervision
during the training. An attention map was sampled from this distribution and applied spatially
and temporally to the frames in order to guide the action recognition. At test time, using the
received input video, the network could infer both the gaze and the action. The idea of
employing the gaze for an attention mechanism was also exploited in the work of [117], who
implemented a two-stream network whose spatial branch had an attention mechanism on top. This
was composed of a linear transformation supervised by a Gaussian bump created from the gaze
fixation point, i.e. a 2D Gaussian centred on the point the subject of the action was staring
at. After that, both branches had a bidirectional LSTM and, following it, they were fused.
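
The Gaussian bump used as supervision for gaze-based attention can be sketched as below. The map size, the value of sigma and the normalisation to a unit sum are assumptions made for the example; the surveyed works may scale or discretise the target differently.

```python
import math

def gaze_bump(height, width, gaze_row, gaze_col, sigma):
    """2-D Gaussian centred on the gaze fixation point, normalised to sum to 1."""
    bump = [
        [
            math.exp(-((r - gaze_row) ** 2 + (c - gaze_col) ** 2) / (2 * sigma ** 2))
            for c in range(width)
        ]
        for r in range(height)
    ]
    total = sum(sum(row) for row in bump)
    return [[v / total for v in row] for row in bump]

# Target attention map for a 5x7 feature grid with the fixation at row 2, column 4.
target = gaze_bump(5, 7, gaze_row=2, gaze_col=4, sigma=1.0)
peak = max(
    ((r, c) for r in range(5) for c in range(7)),
    key=lambda rc: target[rc[0]][rc[1]],
)
```

A smooth target of this kind tolerates small errors in the recorded fixation point, unlike a one-hot map that marks a single cell.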
[202] aimed at demonstrating that a two-stream approach with an LSTM was suitable for
classifying egocentric actions without any egocentric feature. Moreover, they also showed that
resizing images to adjust the size of objects to those of ImageNet's images could potentially
improve the results. [187] hypothesised how a CNN-RNN structure could focus on ROI to better
discriminate actions and, for that, they analysed the shortcomings of the LSTM and proposed
their alternative Long Short-Term Attention (LSTA) module. This new RNN introduced a built-in
spatial attention and a revised output gating. They deployed their LSTA in a two-stream
architecture and also proposed, for the cross-modality fusion of RGB and OF, a novel control of
the bias parameter of one of the modalities using the other one. [118] aimed to learn
spatio-temporal attention features using human gaze as supervision. For that, they proposed a
two-stream network, in which each of the streams included the spatio-temporal attention module
(STAM) they contributed. This module included a 3D inception module and a 3D convolutional
layer to predict an attention map. This map was combined with the original feature of the
stream to create more informative features. [228] advocated the use of Inertial Measurement
Unit (IMU) data for the motion classification instead of the OF, arguing that the latter's
computation was rather demanding. Instead, they created a layered approach. The classification
of the motion was performed first by an LSTM. Depending on the predicted label, samples were
categorised into different motion groups (for example, "standing", "walking" and so on). Within
each group, various possible actions could be inferred, but the actions not associated with the
motion of the group were discarded (e.g. actions in which it was impossible to be "standing"
were discarded if the sample was categorised as "standing"). If the sample fell within a group
with only one action, then this action was predicted. In case there were various possibilities,
the motion group was used as a prior for the other branch (the appearance branch), whose
objective was to classify the sample among the possible actions of the group using visual
features. To adapt the method to low and high frame-rate photo streams, two branches were used
within the appearance stream. For a low frame-rate, a CNN was used and, for a high frame-rate,
a CNN and an LSTM were employed. Similarly, [119] implemented a two-stream network in which one
of the branches passed IMU data through an LSTM. The other branch employed a Recurrent Capsule
Network (RecCapsNet) and a ConvLSTM to extract spatio-temporal features. Then, both branches'
features were fed to FC layers (separately), then combined by concatenation and, once again,
the result was fed to a single FC layer. A softmax activation was finally used to provide an
action probability distribution.
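
The layered decision of [228], in which the motion group acts as a prior for the appearance branch, can be sketched as follows. The mapping from motion groups to actions, the action names and the scores are all hypothetical; in the original work the motion group comes from an LSTM over IMU data and the scores from a visual classifier.

```python
# Hypothetical mapping from motion groups to the actions compatible with them.
GROUP_ACTIONS = {
    "standing": ["washing dishes", "cooking"],
    "walking": ["walking"],
    "sitting": ["eating", "typing", "reading"],
}

def predict_action(motion_group, appearance_scores):
    """Pick the action for a sample given its motion group and appearance scores.

    Actions outside the motion group are discarded. A group with a single
    action is decided by motion alone; otherwise the appearance scores choose
    among the remaining candidates.
    """
    candidates = GROUP_ACTIONS[motion_group]
    if len(candidates) == 1:
        return candidates[0]
    return max(candidates, key=lambda action: appearance_scores.get(action, float("-inf")))

# "eating" has the highest visual score but is incompatible with "standing".
scores = {"eating": 0.9, "cooking": 0.6, "washing dishes": 0.3}
standing_action = predict_action("standing", scores)
walking_action = predict_action("walking", {})
```

Restricting the candidates first means the appearance branch only has to separate actions that are plausible for the detected motion.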
The two-stream approach started becoming mainstream, as the architecture was being employed as
a baseline. For example, [234] focused on hand-hygiene egocentric actions and proposed a method
that first located the action within an untrimmed video using low-cost hand mask and motion
histogram features. Once the action had been found, the classification was done using a
two-stream network. [102] proposed a two-stream network in which one of the branches was
composed of a self-attention based Graph Convolutional Network and the other one implemented a
residual-connection enhanced bidirectional Independently RNN. [112] implemented a model that
generated a Hierarchical Volumetric Representation (HVR) of the scene and employed a two-stream
network. One branch took the visual input and processed it with an I3D network and the other
one computed environment features. This allowed the model to sample possible action locations
(learnt in a latent space) and to use those local 3D features for the action classification.
Multi-stream architectures. As two-stream architectures became popular, a natural extension of
them arose including more branches and different input modalities. Each modality is assumed to
be complementary to the rest and, thus, helpful to improve the classification of actions. Fig.
7 shows a general schema of a multi-stream architecture. [64], for the action anticipation
task, used three complementary modalities of data: RGB (for appearance, using a Batch
Normalised Inception), OF (for motion, using a TSN) and object features (confidence scores
obtained from an object detector). They introduced their Modality ATTention (MATT) mechanism to
fuse them, weighting each of them in an adaptive way to predict actions. The use of object
detector information was again explored in the work of [207], who detected a shortcoming in the
two-branched architecture (modelling appearance and motion): both failed to exploit local
information as there was no position-aware information. In fact, just looking at the motion
change or the collection of objects in the scene may not be enough for an annotator to
understand the action; that is when position-aware features (referred to as privileged
information) could help to drive the learning of action-relevant motion and objects. In
addition, they contributed a Symbiotic Attention mechanism for Privileged information (SAP)
that allowed for the communication of the three sources of information. A 3D CNN was used to
process appearance and motion (outputting a single feature vector) while a Faster Region-based
Convolutional Neural Network (R-CNN) was employed for the object features (extracted with
RoIAlign). The motion and appearance features were individually fused with the detector's
features and some learnt gate weights (from the opposite branch) were applied to them. One
further attention step was applied using the opposite branch's features before obtaining the
last feature vector of a branch. Both the verb and the noun were inferred separately and the
predictions were combined and re-weighted by the training set's distribution to get the action
prediction.
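
The final step described for [207], combining separate verb and noun predictions and re-weighting them by the training set's distribution, can be illustrated with a small sketch. The combination rule used here (product of the verb probability, the noun probability and the empirical frequency of the pair) is an assumption for the example, as are the labels and values; the original formulation may differ.

```python
from collections import Counter

def action_prior(training_actions):
    """Empirical frequency of each (verb, noun) pair in the training set."""
    counts = Counter(training_actions)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}

def predict_action(verb_probs, noun_probs, prior):
    """Score every (verb, noun) pair and return the best one.

    Pairs never seen during training get a prior of 0 and cannot be selected
    over a seen pair.
    """
    scores = {
        (verb, noun): verb_probs[verb] * noun_probs[noun] * prior.get((verb, noun), 0.0)
        for verb in verb_probs
        for noun in noun_probs
    }
    return max(scores, key=scores.get)

# Toy training annotations and toy branch outputs (illustrative values only).
training_actions = (
    [("open", "fridge")] * 3 + [("cut", "onion")] * 2 + [("open", "jar")]
)
prior = action_prior(training_actions)
verb_probs = {"open": 0.55, "cut": 0.45}
noun_probs = {"onion": 0.45, "fridge": 0.35, "jar": 0.20}
action = predict_action(verb_probs, noun_probs, prior)
```

Taking the most likely verb and the most likely noun independently would give "open onion" here, a pair that never occurs in the toy training set; the prior steers the prediction towards a plausible action.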
[195] leveraged depth information in their multi-stream deep neural network (MDNN), having two
more branches fed with RGB and OF data. The contribution of this approach was that they aimed
to preserve the distinctive characteristics of each stream and to explore the shareable
information. That is, as features extracted from each stream were neither fully independent nor
fully correlated, directly fusing these features was not meaningful. Hence, they proposed a
non-linear fusion strategy in which they mixed the shareable components and the distinctive
components (both obtained with a non-linear mapping of the original features) with a weighted
addition. In the loss function, apart from the categorical cross-entropy loss, they included
two more terms: (i) a term to measure the correlation between the shareable terms (modelled
with a Cauchy estimator) and (ii) a term to enforce the orthogonality constraint on both the
shareable components and the distinctive ones. Moreover, they also included a hand module that
was fed with the RGB frames. Within this module, a binary mask was generated to black out parts
of the original RGB images that were later used for classification. The softmax output of this
module was combined through a weighted fusion with the softmax of the original network.
In fact, multiple streams can arise in an intermediate step of the system, not only at the
beginning, as in the case of the work of [74]. They presented a novel Mutual Context Network
(MCN) that jointly learnt an action-dependent gaze prediction and a gaze-
Fig. 6. Two-stream neural network. It is composed of a feature extractor based on convolutional networks and a classifier based on fully-connected layers.
proposals. Table 2 summarises the most relevant datasets of the literature and their
characteristics. Specifically, we show whether they contain BB annotations, their publication
year, the number of action clips (instances used for training and evaluating machine learning
models) and the number of action, verb, and object classes. In the case of Charades-Ego, the
dataset is partially egocentric, having part of its content filled with third-person videos.
The annotation of actions in all the presented datasets consists of a verb and a set of nouns,
creating an action when combined. That may be one of the reasons why popular methods such as
the two-stream network approach have adapted well to the egocentric vision, i.e. as the motion
and the object features can be decomposed, there are labels to train two separate classifiers
and/or to jointly train two branches.
There are other egocentric datasets that are not suitable for EAR due to their intrinsic
purpose (the task, a focus on activities or interactions rather than on actions and so on)
and/or due to the lack of labels. We only present datasets that are, to the best of our
knowledge, publicly available.
The University of Texas at Austin Egocentric (UT Ego) dataset [101] is composed of 4 videos (10
in total, but only 4 public) with a length of 3-5 h and recorded in an uncontrolled setting.
The videos capture a variety of activities such as eating, shopping, attending a lecture,
driving and cooking.
The JPL First-Person Interaction dataset (JPL-Interaction dataset) [163] is an egocentric
dataset composed of activities or interactions (e.g. shake hands, hug or punch) with the wearer
of the camera.
The NUSFPID - NUS First Person Interaction Dataset [137] is composed of 8 interactions in both
egocentric and exocentric perspectives.
The Stereo Ego-Motion Dataset¹ contains videos that show a person walking around objects or
animals under no special circumstances. The first two objects, a car and a chair, show no
motion whereas the cats and dogs of the next two cases have strong articulated motion.
The LENA (Life-logging Ego-ceNtric Activities) dataset [178] includes 13 activities recorded
with the Google Glass such as read, watch videos, walk straight and so forth.
The EGO-GROUP and EGO-HPE datasets [8,7] are aimed at ego-vision applications: social group
detection and head pose estimation, respectively.
The Egocentric Dataset of the University of Barcelona - Segmentation (EDUB-Seg) [194,53] is a
dataset acquired with a Narrative Clip camera, taking a picture every 30 s, containing 18,735
frames from seven users. For the sake of variety, each user recorded their actions in different
scenarios: attending a conference, on holiday, during the weekend and during the week. It
contains annotations to segment events in time under the condition that those events can be
inferred using visual features, i.e. there is enough visual information in that segment to
infer the event.
The Multimodal Egocentric Activity Dataset [179] contains 20 activities, each of them with
short clips of up to 15 s. For example, it includes writing sentences, organising files and
running. Furthermore, images are accompanied by sensor signals.
The UTokyo collection of datasets, composed of the UTokyo Paired Ego-Video (PEV) dataset [221],
the UTokyo Navigation dataset [222] and the UTokyo Ego-Surf dataset [220,223], is a family of
datasets developed by the University of Tokyo. The first one contains videos from dyadic
(between two persons) conversations, capturing interactions. The second one has videos of
people walking around a university campus to visit landmarks; the videos per se are not
available (due to privacy concerns), but already extracted features can be obtained. The third
one contains 8 groups of videos recorded synchronously during face-to-face conversations.
The EgoFoodPlaces dataset [168] involves 12 users in their daily food-related activities. The
classes of this dataset are the locations where the activities are held.
The Dataset of Multimodal Semantic Egocentric Videos (DoMSEV) [173] is an 80-h dataset
containing information about the scenes that were being recorded. This includes the type of
scene (indoor, urban, crowded environment or nature), the activity performed (walking, running,
standing, browsing, driving, biking, eating, cooking, observing, in conversation, playing or
shopping), whether there was something special that caught the attention of the observer and
also interactions with some objects.
The EGOcentric-Cultural Heritage dataset (EGO-CH) [157] is a dataset for understanding the
behaviour of visitors of cultural sites. The dataset includes 60 videos, 26 environments and
over 200 Points of Interest (POI). Moreover, it is annotated with temporal labels including the
location of the visitor and the observed POI, a BB annotation around POI and the survey
associated with each video filled in by the visitor. The dataset is aimed at providing 4 tasks:
room-based localisation, POI or object recognition, object retrieval and survey prediction.
The EgoK360 dataset [22] is an egocentric 360° video analysis dataset. It contains several
activities with actions within them, being quite challenging due to the distortion and the wide
field of view.
5. Applications of Egocentric Video Analysis
The analysis of egocentric videos serves several purposes. Although the datasets shown in
Section 4 may provide some hints on the kind of applications that can be given, we review the
applications found in the literature. Given that the field is still relatively new, many new
applications may arise in the future.
Ambient Assisted Living. One of the current main challenges of the public administration is to
promote active and healthy ageing for as long as possible. Achieving it would have positive
consequences for society and the socio-sanitary services, such as reducing the costs of
medicines and other treatments. The latter expenses are becoming more and more worrying with
the ageing of society. For example, Spain dedicated 9.8% of its GDP to elderly care in 2014².
Given that reports estimate that the world's older population is going to double by 2050³, the
magnitude of the problem may become unmanageable. Due to this, public administrations are
investing in research projects which may help alleviate or avoid this problem in the future,
creating an active and healthy older population. Although the research projects using computer
vision approaches have mainly focused on third-person vision [29], nowadays the use of wearable
systems is more abundant [37,39]. [126] proposed a system to support clinicians in the care of
dementia patients and [231] used smart glasses with a first-person system that could warn
people with cognitive impairments of dangerous situations. Supporting health professionals is
not the only use: aiding caregivers is also a potential application of first-person systems.
[136] described a method leveraging a first-person camera to evaluate the tender dementia-care
technique. They obtained the 3D facial distance, pose and eye-contact states between caregivers
and receivers and performed statistical analysis to assess the caregiver's skills. These types
of approaches can be grouped in the AAL paradigm, which promotes the use of modern ICT
technologies to assist the elderly in their ADL. The main objective of the AAL is to avoid the
dependence of elderly people on other people in their daily living activities. In particular,
EAR becomes a key enabler for AAL approaches.
¹ https://lmb.informatik.uni-freiburg.de/resources/datasets/StereoEgomotion/
² https://www.imserso.es/InterPresent2/groups/imserso/documents/binario/112017001_informe-2016-persona.pdf
³ https://www.nih.gov/news-events/news-releases/worlds-older-population-grows-dramatically
Hand recognition. Hands are of special importance to humans, allowing us to interact with
objects and environments. As a consequence, the daily life of a person with impaired or reduced
hand functionality may be drastically affected and the recovery of hands should be a priority
[17]. Even though health-related issues may be grouped within the AAL field, this is a special
case for egocentric videos. As seen throughout the document, hands play a key role in
egocentric actions and, therefore, this use case is separated from the aforementioned. The
recognition of hands includes their localisation in the space, their segmentation, their
identification (left or right) and the pose estimation (fingertips, for example). From this
information, it is possible to remotely assess the functioning of hands. Another application of
the recognition of hands is to be able to understand children's visual attention [15], as it
seems that parents' hands drive their attention.
The augmented reality (AR) and the virtual reality (VR) technologies, which are becoming more
popular, require the egocentric recognition of hands for natural user interfaces that need to
know the position and movements of the hands [17]. For example, [71] proposed an interface to
move 3D objects using hands and, thus, they implemented a virtual hand interaction technique.
[77] aimed at simultaneously detecting click actions and estimating occluded fingertip
positions. [197] introduced a solution to allow users to inspect 3D objects using their hands,
which required estimating the 6D palm pose and the gesture performed. [76] focused on the
rotation of 3D objects. By performing the "holding" gesture, virtual objects could be summoned
into the palm, allowing another gesture to trigger their function. [26] argued that it was
difficult to correctly detect hands in cluttered backgrounds with varying illuminations and,
hence, they proposed a solution for indoor and outdoor environments.
Social Interaction Analysis. People's social behaviour can be analysed and classified using
egocentric videos. [58], for example, aimed at detecting social interactions in a day-long
activity. First, the context provided by faces was obtained and used to estimate the location
that was being attended. Second, based on the patterns of people, roles were assigned to them.
By analysing temporal patterns of roles and locations, they were able to detect and recognise
social interactions. They also explored the inclusion of head movement as an extra feature.
[163] focused on interactions with the wearer of the camera, including both friendly and
aggressive interactions. The objective of [217] was the extraction of interaction features
(IF), i.e. features that are common between interactions. These are mainly composed of physical
information of the head, body language and emotional expression. An HMM was used to model the
sequence.
When considering a group, based on the concept of the F-formation [91], [8] tracked a group of
people through a video sequence, estimating their head pose and 3D location, to predict the
affinity of two people in the scene. Again following the F-formation concept, [5] aimed at
detecting when a social interaction took place.
Pedestrian movement anticipation. Using an egocentric camera, it is possible to analyse the
patterns of movements of the pedestrians in front of the wearer and anticipate their movements.
This may even have applications for autonomous vehicles or pedestrian safety [184,36,108].
Nutritional behaviour analysis. The analysis of egocentric videos could be interesting when we
are performing actions related to eating. This could lead to the analysis of our nutritional
behaviours, diet and lifestyle, as proposed by [82]. Moreover, as mentioned by [168], the food
intake and its duration are of major relevance to protect against diseases. That is why they
developed a model to detect the food intake events during the day. [24] aimed at both
localising and recognising food simultaneously.
6. Conclusions
Throughout this survey, four main distinct ways to categorise the EAR proposals have been
introduced: those solutions based on objects or the appearance, the ones employing motion as
their main driver, hybrid approaches that consider both the appearance and the motion, and
other approaches (still not that abundant) that consider more modalities like the sound or
contribute to other topics of the field. Moreover, alternative learning paradigms for the EAR
and potential applications of this research field have been summarised.
Although the advances of the EAR field are still far from being completely transferable to
real-world applications, many steps towards that goal have been taken. There are larger and
larger datasets to train deeper and deeper models, which makes it possible to obtain models
with better performance and generalisation ability. The range of egocentric actions that are
considered in the literature is also increasing with the evolution of datasets, considering
rarer or more difficult events. But this advance does not only come from the data: new
important modalities of data such as sound, crucial for actions that are recognised only by
that feature or in which it may play an important role, are being included in the literature
and the datasets.
Table 2
Summary of the most relevant egocentric action recognition datasets ordered by their publication year. *Only for 4 objects. **Manually computed, there is no official number.
Dataset                               Year  Object BB?  Action clips  Action classes  Verb classes  Object classes
Intel Egocentric Vision [162]         2009              922           42              42            42
CMU [48]                              2009              516           31              16            33
ADL [148]                             2012  ✓           436           32              24            42
GTEA Gaze [60]                        2012              511           94              10            33
GTEA Gaze+ [60]                       2012              3,371         44              9             29
BEOID [44]                            2014              742           34              15            20
EGTEA Gaze+ [105]                     2018              10,325        106             19            53
Charades-Ego [172]                    2018              30,516        157             33            36
First-Person Hand Action (FPHA) [66]  2018  ✓*          1,175         45              27            26
EPIC-Kitchens [43]                    2018  ✓           50,547        2,747           93            272
EPIC-Tent [78]                        2019              921           11              6             9
EPIC-Kitchens-100 [41]                2020  ✓           89,979        4,025           97            300
Meccano [158]                         2021  ✓           8,857         61              12            20
H20 [96]                              2021  ✓           184**         36              11            8
Adrián Núñez-Marcos, G. Azkune and I. Arganda-Carreras, Neurocomputing 472 (2022) 175–197
6.1. Future work
Many ideas to tackle EAR have been presented throughout this document. Many of them are still taking their first steps, while others have a longer trajectory. Nonetheless, their potential is shown when different solutions are compared on standard benchmarking datasets. Among them, the following research lines should be taken into account:
• The use of sound seems promising (see Section 2.4), even though models that use it cannot be compared directly with methods that do not. Beyond the simple comparison between models to achieve the best possible accuracy, solutions that include sound have appeared as a way to handle new action classes that were previously hard to distinguish. For example, consider an action that is not seen by the camera but can be heard, such as a fridge closing while the camera wearer is turning back (performing the action while looking away from the fridge). Including sound makes it possible to recognise this action. Clever ways to use sound information together with RGB, OF and so on still need to be proposed to push the real-world recognition of egocentric actions.
• The use of complementary information, apart from sound, in addition to the traditional RGB and OF setting. One example is the object-centric features extracted from RPN modules in hybrid approaches, which seem to lead to competitive results [206] while exploiting one of the most important features in egocentric vision: objects. There are also works that include hand information. Including hands, just as objects are included, could lead to an improvement thanks to the hands' shape, trajectory and so on, as some actions can only be distinguished by discerning those cues. As an example, imagine trying to distinguish between turning a burner on and turning it off. Visually, both actions look the same; there is only a variation in the motion of the hands. There should also be more research on left- and right-hand variations, as so far the field has focused on right-handed actions when only one hand is necessary.
• Creating attention mechanisms that are specific to the egocentric setting. These may be a suitable way to improve the results and the information captured by models without making networks bigger and deeper. In fact, the scaling of networks towards bigger and bigger versions is reaching hardware limitations and, thus, alternative ways to increase performance are even more necessary.
• Multi-tasking approaches such as [121,86,89] have obtained the best results among many EAR solutions on the GTEA Gaze+ and EGTEA Gaze+ datasets. This type of approach may be a key enabler for breaking the performance barrier of single-objective methods. This includes, for example, learning egocentric features and/or verb, object and action labels at the same time, following the literature of the EAR field. The results obtained by these works suggest that a stronger generalisation may be achieved when more than a single objective is considered.
• Alternative paradigms for learning egocentric actions should also be considered in order to apply EAR systems in the real world, including zero-, one- and few-shot learning. These require none, one or a few samples, respectively, related to the task, and they usually extract the information required for learning (if any) from prior knowledge or auxiliary datasets. They may also exploit characteristics of the data (hands or objects) or use unsupervised algorithms such as clustering, i.e. grouping data points by specific features. This allows the creation of models that may generalise better when data for a given task is scarce, making them more suitable for real-world purposes.
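As a rough illustration of the first research line (combining sound with RGB and OF), the following minimal sketch averages per-modality class probabilities, a simple form of late fusion. It is not the method of any surveyed work: the function names `softmax` and `late_fusion`, the equal weights and the toy scores are all assumptions made for this example, and it presumes each modality already produces one raw score per class.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(modality_scores, weights=None):
    """Weighted average of per-modality class probabilities.

    modality_scores: dict mapping a modality name to a list of raw class scores.
    weights: optional dict with one non-negative weight per modality
             (hypothetical values; equal weights are used by default).
    """
    names = list(modality_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    n_classes = len(modality_scores[names[0]])
    fused = [0.0] * n_classes
    for name in names:
        probs = softmax(modality_scores[name])
        for c, p in enumerate(probs):
            fused[c] += weights[name] * p / total_weight
    return fused

# Toy example with two classes: index 0 = "open fridge", index 1 = "close fridge".
# The wearer looks away, so RGB and OF are nearly uninformative; audio is not.
scores = {
    "rgb": [0.1, 0.0],
    "flow": [0.0, 0.1],
    "audio": [-1.0, 3.0],
}
fused = late_fusion(scores)
predicted = max(range(len(fused)), key=fused.__getitem__)  # index of the top class
```

In this toy case the audio stream decides the prediction, which mirrors the fridge example above. Learned fusion weights or earlier (feature-level) fusion are alternatives that this sketch does not cover.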
6.2. Challenges
One of the major challenges that needs to be addressed in future work is how authors disseminate their models and results. It is already known that there is an issue with the reproducibility of Deep Learning results [55]. This also applies to the EAR community: there is a need for a better description of the models, the datasets employed, the data splits created and so on. It is also especially important to establish appropriate metrics for the sake of comparison, as accuracy is extensively used on its own. Due to the accuracy paradox and the unbalanced nature of EAR datasets, accuracy alone is not a suitable metric and does not allow different solutions to be compared correctly. Moreover, how the results are obtained is still not usually specified. Given the randomness associated with Deep Learning, providing a single result may be misleading, and how this result has been computed should be specified. This problem is described by [55], whose authors propose to compare models using a budget (i.e. time to train, number of hyper-parameters and so forth).
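The accuracy paradox mentioned above can be made concrete with a small sketch. It assumes a hypothetical two-class, heavily unbalanced label set and a degenerate model that always predicts the majority class; the class names, the counts and the per-seed accuracies are invented for illustration. The mean per-class recall used here is one possible complement to accuracy, not a metric prescribed by the survey.

```python
from collections import Counter
from statistics import mean, stdev

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_class_recall(y_true, y_pred):
    """Average of per-class recalls (often called balanced accuracy)."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return mean(hits[c] / support[c] for c in support)

# Hypothetical unbalanced label set: 95 "take" clips and 5 "pour" clips.
y_true = ["take"] * 95 + ["pour"] * 5
# A degenerate model that always predicts the majority class.
y_pred = ["take"] * 100

acc = accuracy(y_true, y_pred)           # 0.95, looks strong
mcr = mean_class_recall(y_true, y_pred)  # 0.5, reveals the ignored class

# Reporting several runs instead of one (hypothetical accuracies from 5 seeds).
runs = [0.612, 0.598, 0.605, 0.620, 0.601]
summary = (mean(runs), stdev(runs))      # mean and sample standard deviation
```

Reporting the mean and spread over several seeds, together with a class-balanced metric, is one simple way to make results easier to compare across papers.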
Another aspect to improve is the collection of egocentric datasets available, which is an important issue to address in order to push the research forward. The available datasets were analysed in Section 4. Among them, the largest and most complete is the EPIC-Kitchens dataset. In contrast to exocentric vision, this community did not have a very large dataset for pre-training, or even a common dataset for benchmarking, until the appearance of EPIC-Kitchens. This limited the research and the performance that could be obtained, as EAR models had to be pre-trained with exocentric datasets. Nonetheless, even larger datasets need to be created (or the existing ones need to be extended), as video datasets are known to be still small in comparison to static image datasets. The egocentric community also needs variety. The most used datasets, the GTEA family and the EPIC-Kitchens dataset, target kitchen-related actions. This limits the scope of actions and the possibility of applying models that learn from them to the real world. Moreover, this could also lead to a data bias, as models trained on these datasets can be considered specialists in kitchen actions that neglect other tasks.
CRediT authorship contribution statement
Adrián Núñez-Marcos: Conceptualization, Methodology, Investigation, Writing - original draft. Gorka Azkune: Conceptualization, Supervision, Writing - review & editing. Ignacio Arganda-Carreras: Conceptualization, Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
We gratefully acknowledge the support of the Basque Government's Department of Education for the predoctoral funding of the first author. This work has been supported by the Spanish Government under the FuturAAL-Context project (RTI2018-101045-B-C21) and by the Basque Government under the Deustek project (IT-1078–16-D).
References
[1] Sathyanarayanan Aakur, Fillipe de Souza, Sudeep Sarkar, Generating open world descriptions of video using common sense knowledge in a pattern theory framework, Quarterly of Applied Mathematics 77 (2) (2019) 323–356.
[2] Sathyanarayanan N. Aakur, Sanjoy Kundu, Nikhil Gunti, Knowledge guided learning: Towards open domain egocentric action recognition with zero supervision, arXiv preprint arXiv:2009.07470 (2020).
[3] Girmaw Abebe, Andrea Cavallaro, Xavier Parra, Robust multi-dimensional motion features for first-person vision activity recognition, Computer Vision and Image Understanding 149 (2016) 229–248.
[4] Nachwa Aboubakr, James L. Crowley, Rémi Ronfard, Recognizing manipulation actions from state-transformations, arXiv preprint arXiv:1906.05147 (2019).
[5] Maedeh Aghaei, Mariella Dimiccoli, Petia Radeva, With whom do I interact? Detecting social interactions in egocentric photo-streams, in: 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE, 2016, pp. 2959–2964.
[6] Mohammad Al-Naser, Hiroki Ohashi, Sheraz Ahmed, Katsuyuki Nakamura, Takayuki Akiyama, Takuto Sato, Phong Xuan Nguyen, Andreas Dengel, Hierarchical model for zero-shot activity recognition using wearable sensors, in: ICAART (2), 2018, pp. 478–485.
[7] Stefano Alletto, Giuseppe Serra, Simone Calderara, Rita Cucchiara, Understanding social relationships in egocentric vision, Pattern Recognition 48 (12) (2015) 4082–4096.
[8] Stefano Alletto, Giuseppe Serra, Simone Calderara, Francesco Solera, Rita Cucchiara, From ego to nos-vision: Detecting social relationships in first-person views, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 580–585.
[9] Mehmet Ali Arabacı, Fatih Özkan, Elif Surer, Peter Jančovič, Alptekin Temizel, Multi-modal egocentric activity recognition using audio-visual features, arXiv preprint arXiv:1807.00612 (2018).
[10] Relja Arandjelović, Andrew Zisserman, Three things everyone should know to improve object retrieval, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 2911–2918.
[11] Maryam Asadi-Aghbolaghi, Albert Clapés, Marco Bellantonio, Hugo Jair Escalante, Víctor Ponce-López, Xavier Baró, Isabelle Guyon, Shohreh Kasaei, Sergio Escalera, Deep learning for action and gesture recognition in image sequences: A survey, in: Gesture Recognition, Springer, 2017, pp. 539–578.
[12] Khalid E.L. Asnaoui, Aksasse Hamid, Aksasse Brahim, Ouanan Mohammed, A survey of activity recognition in egocentric lifelogging datasets, in: 2017 International Conference on Wireless Technologies, Embedded and Intelligent Systems (WITS), IEEE, 2017, pp. 1–8.
[13] Sikai Bai, Qi Wang, Xuelong Li, Mfi: Multi-range feature interchange for video action recognition, in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 6664–6671.
[14] Sven Bambach, A survey on recent advances of computer vision algorithms for egocentric video, arXiv preprint arXiv:1501.02825 (2015).
[15] Sven Bambach, John Franchak, David Crandall, Chen Yu, Detecting hands in children's egocentric views to understand embodied attention during social interaction, in: Proceedings of the Annual Meeting of the Cognitive Science Society, volume 36, 2014.
[16] Sven Bambach, Stefan Lee, David J. Crandall, Chen Yu, Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1949–1957.
[17] Andrea Bandini, José Zariffa, Analysis of the hands in egocentric vision: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[18] Herbert Bay, Tinne Tuytelaars, Luc Van Gool, Surf: Speeded up robust features, in: European Conference on Computer Vision, Springer, 2006, pp. 404–417.
[19] Ardhendu Behera, Matthew Chapman, Anthony G. Cohn, David C. Hogg, Egocentric activity recognition using histograms of oriented pairwise relations, in: 2014 International Conference on Computer Vision Theory and Applications (VISAPP), volume 2, IEEE, 2014, pp. 22–30.
[20] Ardhendu Behera, David C. Hogg, Anthony G. Cohn, Egocentric activity monitoring and recovery, in: Asian Conference on Computer Vision, Springer, 2012, pp. 519–532.
[21] Alejandro Betancourt, Pietro Morerio, Carlo S. Regazzoni, Matthias Rauterberg, The evolution of first person vision methods: A survey, IEEE Transactions on Circuits and Systems for Video Technology 25 (5) (2015) 744–760.
[22] Keshav Bhandari, Mario A. DeLaGarza, Ziliang Zong, Hugo Latapie, Yan Yan, Egok360: A 360 egocentric kinetic human activity video dataset, in: 2020 IEEE International Conference on Image Processing (ICIP), IEEE, 2020, pp. 266–270.
[23] Bharat Lal Bhatnagar, Suriya Singh, Chetan Arora, C.V. Jawahar, KCIS CVIT, Unsupervised learning of deep feature representation for clustering egocentric actions, in: IJCAI, 2017, pp. 1447–1453.
[24] Marc Bolaños, Petia Radeva, Simultaneous food localization and recognition, in: 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE, 2016, pp. 3140–3145.
[25] Anna Bosch, Andrew Zisserman, Xavier Munoz, Representing shape with a spatial pyramid kernel, in: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, 2007, pp. 401–408.
[26] Nadia Brancati, Giuseppe Caggianese, Maria Frucci, Luigi Gallo, Pietro Neroni, Robust fingertip detection in egocentric vision under varying illumination conditions, in: 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), IEEE, 2015, pp. 1–6.
[27] Andreas Bulling, Jamie A. Ward, Hans Gellersen, Gerhard Tröster, Eye movement analysis for activity recognition using electrooculography, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (4) (2010) 741–753.
[28] Minjie Cai, Feng Lu, Yue Gao, Desktop action recognition from first-person point-of-view, IEEE Transactions on Cybernetics 49 (5) (2018) 1616–1628.
[29] Fabien Cardinaux, Deepayan Bhowmik, Charith Abhayaratne, Mark S. Hawley, Video based technology for ambient assisted living: A review of the literature, Journal of Ambient Intelligence and Smart Environments 3 (3) (2011) 253–269.
[30] Joao Carreira, Andrew Zisserman, Quo vadis, action recognition? A new model and the kinetics dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[31] Alejandro Cartas, Jordi Luque, Petia Radeva, Carlos Segura, Mariella Dimiccoli, How much does audio matter to recognize egocentric object interactions?, arXiv preprint arXiv:1906.00634 (2019).
[32] Alejandro Cartas, Jordi Luque, Petia Radeva, Carlos Segura, Mariella Dimiccoli, Seeing and hearing egocentric actions: How much can we learn?, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[33] Alejandro Cartas, Petia Radeva, Mariella Dimiccoli, Contextually driven first-person action recognition from videos.
[34] Alejandro Cartas, Petia Radeva, Mariella Dimiccoli, Modeling long-term interactions to enhance action recognition, in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 10351–10358.
[35] Daniel Castro, Steven Hickson, Vinay Bettadapura, Edison Thomaz, Gregory Abowd, Henrik Christensen, Irfan Essa, Predicting daily activities from egocentric images using deep learning, in: Proceedings of the 2015 ACM International Symposium on Wearable Computers, 2015, pp. 75–82.
[36] Mohamed Chaabane, Ameni Trabelsi, Nathaniel Blanchard, Ross Beveridge, Looking ahead: Anticipating pedestrians crossing with future frames prediction, in: The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 2297–2306.
[37] Alexandros André Chaaraoui, Pau Climent-Pérez, Francisco Flórez-Revuelta, A review on vision techniques applied to human behaviour analysis for ambient-assisted living, Expert Systems with Applications 39 (12) (2012) 10873–10888.
[38] François Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[39] Pau Climent-Pérez, Susanna Spinsante, Alex Mihailidis, Francisco Florez-Revuelta, A review on video-based active and assisted living technologies for automated lifelogging, Expert Systems with Applications 139 (2020) 112847.
[40] Darwin Ttito Concha, Helena De Almeida Maia, Helio Pedrini, Hemerson Tacon, André De Souza Brito, Hugo De Lima Chaves, Marcelo Bernardes Vieira, Multi-stream convolutional neural networks for action recognition in video sequences based on adaptive visual rhythms, in: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2018, pp. 473–480.
[41] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray, Rescaling egocentric vision, CoRR abs/2006.13256 (2020).
[42] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray, Scaling egocentric vision: The epic-kitchens dataset, in: European Conference on Computer Vision (ECCV), 2018.
[43] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al., Scaling egocentric vision: The epic-kitchens dataset, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 720–736.
[44] Dima Damen, Teesid Leelasawassuk, Osian Haines, Andrew Calway, Walterio W. Mayol-Cuevas, You-do, I-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video, in: BMVC, volume 2, 2014, p. 3.
[45] Dima Damen, Teesid Leelasawassuk, Walterio Mayol-Cuevas, You-do, I-learn: Egocentric unsupervised discovery of objects and their modes of interaction towards video-based guidance, Computer Vision and Image Understanding 149 (2016) 98–112.
[46] Pratyusha Das, Antonio Ortega, Symmetric sub-graph spatio-temporal graph convolution and its application in complex activity recognition, in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 3215–3219.
[47] Steven Davis, Paul Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (4) (1980) 357–366.
[48] Fernando De la Torre, Jessica Hodgins, Adam Bargteil, Xavier Martin, Justin Macey, Alex Collado, Pep Beltran, Guide to the Carnegie Mellon University multimodal activity (CMU-MMAC) database, 2009.
[49] Ana Garcia Del Molino, Cheston Tan, Joo-Hwee Lim, Ah-Hwee Tan, Summarization of egocentric videos: A comprehensive survey, IEEE Transactions on Human-Machine Systems 47 (1) (2016) 65–76.
[50] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[51] Jean Dezert, Florentin Smarandache, Advances and applications of DSmT for information fusion, Am. Res. Press, Rehoboth, 1, 2004.
[52] Alexander Diete, Timo Sztyler, Lydia Weiland, Heiner Stuckenschmidt, Improving motion-based activity recognition with ego-centric vision, in: 2018 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), IEEE, 2018, pp. 488–491.
[53] Mariella Dimiccoli, Marc Bolaños, Estefania Talavera, Maedeh Aghaei, Stavri G. Nikolov, Petia Radeva, Sr-clustering: Semantic regularized clustering for egocentric photo streams segmentation, Computer Vision and Image Understanding 155 (2017) 55–69.
[54] Mariella Dimiccoli, Juan Marín, Edison Thomaz, Mitigating bystander privacy concerns in egocentric activity recognition with deep learning and intentional image degradation, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1 (4) (2018) 1–18.
[55] Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, Noah A. Smith, Show your work: Improved reporting of experimental results, arXiv preprint arXiv:1909.03004 (2019).
[56] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell, Long-term recurrent convolutional networks for visual recognition and description, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
[57] Chen Fang, Lorenzo Torresani, Measuring image distances via embedding in a semantic manifold, in: European Conference on Computer Vision, Springer, 2012, pp. 402–415.
[58] Alireza Fathi, Jessica K. Hodgins, James M. Rehg, Social interactions: A first-person perspective, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 1226–1233.
[59] Alireza Fathi, Ali Farhadi, James M. Rehg, Understanding egocentric activities, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 407–414.
[60] Alireza Fathi, Yin Li, James M. Rehg, Learning to recognize daily actions using gaze, in: European Conference on Computer Vision, Springer, 2012, pp. 314–327.
[61] Alireza Fathi, James M. Rehg, Modeling actions through state changes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2579–2586.
[62] Amy Fire, Song-Chun Zhu, Learning perceptual causality from video, ACM Transactions on Intelligent Systems and Technology (TIST) 7 (2) (2015) 1–22.
[63] Martin A. Fischler, Robert C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM 24 (6) (1981) 381–395.
[64] Antonino Furnari, Giovanni Maria Farinella, What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6252–6261.
[65] Harshala Gammulle, Simon Denman, Sridha Sridharan, Clinton Fookes, Two stream LSTM: A deep fusion framework for human action recognition, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2017, pp. 177–186.
[66] Guillermo Garcia-Hernando, Shanxin Yuan, Seungryul Baek, Tae-Kyun Kim, First-person hand action benchmark with RGB-D videos and 3D hand pose annotations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 409–419.
[67] Georgia Gkioxari, Ross Girshick, Jitendra Malik, Contextual action recognition with R*CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1080–1088.
[68] Peter M. Gollwitzer, Action phases and mind-sets, Handbook of Motivation and Cognition: Foundations of Social Behavior 2 (1990) 53–92.
[69] Ulf Grenander, Elements of pattern theory, JHU Press (1996).
[70] Kai Guo, Prakash Ishwar, Janusz Konrad, Action recognition from video using feature covariance matrices, IEEE Transactions on Image Processing 22 (6) (2013) 2479–2494.
[71] Taejin Ha, Steven Feiner, Woontack Woo, Wearhand: Head-worn, RGB-D camera-based, bare-hand user interface with visually enhanced depth perception, in: 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), IEEE, 2014, pp. 219–228.
[72] Mary Hayhoe, Vision using routines: A functional account of vision, Visual Cognition 7 (1–3) (2000) 43–64.
[73] Sepp Hochreiter, Jürgen Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
[74] Yifei Huang, Zhenqiang Li, Minjie Cai, Yoichi Sato, Mutual context network for jointly estimating egocentric gaze and actions, arXiv preprint arXiv:1901.01874 (2019).
[75] Javed Imran, Balasubramanian Raman, Three-stream spatio-temporal attention network for first-person action and interaction recognition, Journal of Ambient Intelligence and Humanized Computing (2021) 1–16.
[76] Youngkyoon Jang, Ikbeom Jeon, Tae-Kyun Kim, Woontack Woo, Metaphoric hand gestures for orientation-aware VR object manipulation with an egocentric viewpoint, IEEE Transactions on Human-Machine Systems 47 (1) (2016) 113–127.
[77] Youngkyoon Jang, Seung-Tak Noh, Hyung Jin Chang, Tae-Kyun Kim, Woontack Woo, 3D finger cape: Clicking action and position estimation under self-occlusions in egocentric viewpoint, IEEE Transactions on Visualization and Computer Graphics 21 (4) (2015) 501–510.
[78] Youngkyoon Jang, Brian Sullivan, Casimir Ludwig, Iain Gilchrist, Dima Damen, Walterio Mayol-Cuevas, Epic-tent: An egocentric video dataset for camping tent assembly, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[79] Ali Javidani, Ahmad Mahmoudi-Aznaveh, A unified method for first and third person action recognition, in: Iranian Conference on Electrical Engineering (ICEE), IEEE, 2018, pp. 1629–1633.
[80] Herve Jegou, Florent Perronnin, Matthijs Douze, Jorge Sánchez, Patrick Perez, Cordelia Schmid, Aggregating local image descriptors into compact codes, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (9) (2011) 1704–1716.
[81] Shuiwang Ji, Wei Xu, Ming Yang, Kai Yu, 3D convolutional neural networks for human action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1) (2012) 221–231.
[82] Wenyan Jia, Yuecheng Li, Ruowei Qu, Thomas Baranowski, Lora E. Burke, Hong Zhang, Yicheng Bai, Juliet M. Mancino, Guizhi Xu, Zhi-Hong Mao, et al., Automatic food detection in egocentric images using artificial intelligence technology, Public Health Nutrition 22 (7) (2019) 1168–1179.
[83] Haiyu Jiang, Yan Song, Jiang He, Xiangbo Shu, Cross fusion for egocentric interactive action recognition, in: International Conference on Multimedia Modeling, Springer, 2020, pp. 714–726.
[84] Takeo Kanade, Martial Hebert, First-person vision, Proceedings of the IEEE 100 (8) (2012) 2442–2453.
[85] Hongwen Kang, Martial Hebert, Takeo Kanade, Discovering object instances from scenes of daily living, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 762–769.
[86] Georgios Kapidis, Ronald Poppe, Elsbeth van Dam, Lucas Noldus, Remco Veltkamp, Multitask learning to improve egocentric action recognition, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[87] Georgios Kapidis, Ronald Poppe, Elsbeth van Dam, Lucas P.J.J. Noldus, Remco C. Veltkamp, Egocentric hand track and object-based human action recognition, arXiv preprint arXiv:1905.00742 (2019).
[88] Georgios Kapidis, Ronald Poppe, Elsbeth van Dam, Lucas P.J.J. Noldus, Remco C. Veltkamp, Object detection-based location and activity classification from egocentric videos: A systematic analysis, in: Smart Assisted Living, Springer, 2020, pp. 119–145.
[89] Georgios Kapidis, Ronald Poppe, Remco C. Veltkamp, Multi-dataset, multitask learning of egocentric vision tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[90] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen, Epic-fusion: Audio-visual temporal binding for egocentric action recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5492–5501.
[91] Adam Kendon, Studies in the behavior of social interaction, volume 6, Humanities Press International, 1977.
[92] Kris M. Kitani, Takahiro Okabe, Yoichi Sato, Akihiro Sugimoto, Fast unsupervised ego-action learning for first-person sports videos, in: CVPR 2011, IEEE, 2011, pp. 3241–3248.
[93] K.P. Sanal Kumar, Activity recognition in egocentric video using svm, knn and combined svmknn classifiers, IOP Conference Series: Materials Science and Engineering, volume 225, IOP Publishing, 2017, 012226.
[94] K.P. Sanal Kumar, R. Bhavani, Human activity recognition in egocentric video using hog, gist and color features, Multimedia Tools and Applications 79 (5) (2020) 3543–3559.
[95] Heeseung Kwon, Yeonho Kim, Jin S. Lee, Minsu Cho, First person action recognition via two-stream convnet with long-term fusion pooling, Pattern Recognition Letters 112 (2018) 161–167.
[96] Taein Kwon, Bugra Tekin, Jan Stühmer, Federica Bogo, Marc Pollefeys, H2O: Two hands manipulating objects for first person interaction recognition, arXiv preprint arXiv:2104.11181 (2021).
[97] Michael Land, Neil Mennie, Jennifer Rusted, The roles of vision and eye movements in the control of activities of daily living, Perception 28 (11) (1999) 1311–1328.
[98] Michael Land, Benjamin Tatler, Looking and acting: vision and eye movements in natural behaviour, Oxford University Press, 2009.
[99] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, Benjamin Rozenfeld, Learning realistic human actions from movies, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2008, pp. 1–8.
[100] Kyungjun Lee, Abhinav Shrivastava, Hernisa Kacorri, Hand-priming in object localization for assistive egocentric vision, in: The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 3422–3432.
[101] Yong Jae Lee, Joydeep Ghosh, Kristen Grauman, Discovering important people and objects for egocentric video summarization, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 1346–1353.
[102] Chuankun Li, Shuai Li, Yanbo Gao, Xiang Zhang, Wanqing Li, A two-stream neural network for pose-based hand gesture recognition, arXiv preprint arXiv:2101.08926 (2021).
[103] Xiangyu Li, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu, Wanqing Li, Trear: Transformer-based RGB-D egocentric action recognition, IEEE Transactions on Cognitive and Developmental Systems (2021).
[104] Yanghao Li, Tushar Nagarajan, Bo Xiong, Kristen Grauman, Ego-exo: Transferring visual representations from third-person to first-person videos, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6943–6953.
[105] Yin Li, Miao Liu, James M. Rehg, In the eye of beholder: Joint learning of gaze and actions in first person video, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 619–635.
[106] Yin Li, Zhefan Ye, James M. Rehg, Delving into egocentric actions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 287–295.
[107] Ji Lin, Chuang Gan, Song Han, TSM: Temporal shift module for efficient video understanding, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7083–7093.
[108] Bingbin Liu, Ehsan Adeli, Zhangjie Cao, Kuan-Hui Lee, Abhijeet Shenoi, Adrien Gaidon, Juan Carlos Niebles, Spatiotemporal relationship reasoning for pedestrian intent prediction, IEEE Robotics and Automation Letters 5 (2) (2020) 3485–3492.
[109] Hugo Liu, Push Singh, ConceptNet - a practical commonsense reasoning tool-kit, BT Technology Journal 22 (4) (2004) 211–226.
[110] Jianbo Liu, Yongcheng Liu, Ying Wang, Veronique Prinet, Shiming Xiang, Chunhong Pan, Decoupled representation learning for skeleton-based gesture recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5751–5760.
[111] Jianbo Liu, Ying Wang, Shiming Xiang, Chunhong Pan, Han: An efficient hierarchical self-attention network for skeleton-based gesture recognition, arXiv preprint arXiv:2106.13391 (2021).
[112] Miao Liu, Lingni Ma, Kiran Somasundaram, Yin Li, Kristen Grauman, James M. Rehg, Chao Li, Egocentric activity recognition and localization on a 3D map, arXiv preprint arXiv:2105.09544 (2021).
[113] Yang Liu, Ping Wei, Song-Chun Zhu, Jointly recognizing object fluents and tasks in egocentric videos, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2924–2932.
[114] Yinan Liu, Qingbo Wu, Liangzhi Tang, Hengcan Shi, Gaze-assisted multi-stream deep neural network for action recognition, IEEE Access 5 (2017) 19432–19441.
[115] Alejandro López-Cifuentes, Marcos Escudero-Viñolo, Jesús Bescós, A prospective study on sequence-driven temporal sampling and ego-motion compensation for action recognition in the epic-kitchens dataset, arXiv preprint arXiv:2008.11588 (2020).
[116] David G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[117] Minlong Lu, Ze-Nian Li, Yueming Wang, Gang Pan, Deep attention network for egocentric action recognition, IEEE Transactions on Image Processing 28 (8) (2019) 3703–3713.
[118] Minlong Lu, Danping Liao, Ze-Nian Li, Learning spatiotemporal attention for egocentric action recognition, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[119] Yantao Lu, Senem Velipasalar, Human activity classification incorporating egocentric video and inertial measurement unit data, in: 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP), IEEE, 2018, pp. 429–433.
[120] Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, Hans Peter Graf, Attend and interact: Higher-order object interactions for video understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6790–6800.
[121] Minghuang Ma, Haoqi Fan, Kris M. Kitani, Going deeper into first-person activity recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1894–1903.
[122] Steve Mann, 'WearCam' (the wearable camera): personal imaging systems for long-term use in wearable tetherless computer-mediated reality and personal photo/videographic memory prosthesis, in: Digest of Papers. Second International Symposium on Wearable Computers (Cat. No. 98EX215), IEEE, 1998, pp. 124–131.
[123] Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, Trevor Darrell, Something-else: Compositional action recognition with spatial-temporal interaction networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1049–1059.
[124] Kenji Matsuo, Kentaro Yamada, Satoshi Ueno, Sei Naito, An attention-based activity recognition for egocentric video, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 551–556.
[125] Tomas McCandless, Kristen Grauman, Object-centric spatio-temporal pyramids for egocentric activity recognition, in: BMVC, volume 2, Citeseer, 2013, p. 3.
[126] Geo gios Medi skos, Pie e-Ma ie Plans, Thanos G. S a opoulos, Jenny
Benois-Pineau, Vincen Buso, Ioannis Kompa sia is, Mul i-modal ac i i y
ecogni ion om egocen ic ision, seman ic en ichmen and li elogging
applica ions o he ca e o demen ia, Jou nal o Visual Communica ion and
Image Rep esen a ion 51 (2018) 169–190.
[127] Xiao-Li Meng, Donald B Rubin, Maximum likelihood es ima ion ia he ecm
algo i hm: A gene al amewo k, Biome ika 80 (2) (1993) 267–278.
[128] Shinya Michiba a, Ka su umi Inoue, Michi umi Yoshioka, A sushi Hashimo o,
Cooking ac i i y ecogni ion in egocen ic ideos wi h a hand mask image
b anch in he mul i-s eam cnn, in: P oceedings o he 2020 Mul imedia on
Cooking and Ea ing Ac i i ies Wo kshop, 2020, pp. 1–6.
[129] Ajay K Mish a, Yiannis Aloimonos, Loong Fah Cheong, Ash a Kassim, Ac i e
isual segmen a ion, IEEE T ansac ions on Pa e n Analysis and Machine
In elligence 34 (4) (2011) 639–653.
[130] Da ide Mol isan i, Michael W ay, Wal e io Mayol-Cue as, Dima Damen,
T espassing he bounda ies: Labeling empo al bounds o objec in e ac ions
in egocen ic ideo, in: P oceedings o he IEEE In e na ional Con e ence on
Compu e Vision, 2017, pp. 2886–2894.
[131] Thie y Pinhei o Mo ei a, Da id Meno i, Helio Ped ini, Fi s -pe son ac ion
ecogni ion h ough isual hy hm ex u e desc ip ion, in: 2017 IEEE
In e na ional Con e ence on Acous ics, Speech and Signal P ocessing
(ICASSP), IEEE, 2017, pp. 2627–2631.
[132] E ik T Muelle , Commonsense easoning: an e en calculus based app oach,
Mo gan Kau mann, 2014.
[133] Tusha Naga ajan, Yanghao Li, Ch is oph Feich enho e , and K is en
G auman. Ego- opo: En i onmen a o dances om egocen ic ideo. a Xi
p ep in a Xi :2001.04583, 2020..
[134] Ka suyuki Nakamu a, Se ena Yeung, Alexand e Alahi, Li Fei-Fei, Join ly
lea ning ene gy expendi u es and ac i i ies using egocen ic mul imodal
signals, in: P oceedings o he IEEE Con e ence on Compu e Vision and
Pa e n Recogni ion, 2017, pp. 1868–1877.
[135] Tomoya Naka ani, Ryohei Kuga, Takuya Maekawa, P elimina y in es iga ion
o objec -based ac i i y ecogni ion using egocen ic ideo based on web
knowledge, in: P oceedings o he 17 h In e na ional Con e ence on Mobile
and Ubiqui ous Mul imedia, 2018, pp. 375–381.
[136] A sushi Nakazawa, Miwako Honda, Fi s -pe son came a sys em o e alua e
ende demen ia-ca e skill, in: P oceedings o he IEEE In e na ional
Con e ence on Compu e Vision Wo kshops, 2019.
[137] Sana h Na ayan, Mohan S Kankanhalli, Kalpa hi R Ramak ishnan, Ac ion and
in e ac ion ecogni ion in i s -pe son ideos, in: P oceedings o he IEEE
Con e ence on Compu e Vision and Pa e n Recogni ion Wo kshops, 2014,
pp. 512–518.
[138] Jean-Ch is ophe Nebel, F ancisco Flo ez-Re uel a, e al., Recogni ion o
ac i i ies o daily li ing om egocen ic ideos using hands de ec ed by a
deep con olu ional ne wo k, in: In e na ional Con e ence Image Analysis and
Recogni ion, Sp inge , 2018, pp. 390–398.
[139] Thi-Hoa-Cuc Nguyen, Jean-Ch is ophe Nebel, F ancisco Flo ez-Re uel a,
e al., Recogni ion o ac i i ies o daily li ing wi h egocen ic ision: A
e iew, Senso s 16 (1) (2016) 72.
[140] Xuan Son Nguyen, Luc B un, Oli ie Lézo ay, Sébas ien Bougleux, A neu al
ne wo k based on spd mani old lea ning o skele on-based hand ges u e
ecogni ion, in: P oceedings o he IEEE/CVF Con e ence on Compu e Vision
and Pa e n Recogni ion, 2019, pp. 12036–12045.
[141] Ad ián Núñez-Ma cos, Go ka Azkune, Eneko Agi e, Diego López-de Ipiña,
and Ignacio A ganda-Ca e as. Using ex e nal knowledge o imp o e ze o-
sho ac ion ecogni ion in egocen ic ideos. In In e na ional Con e ence on
Image Analysis and Recogni ion, pages 174–185. Sp inge , 2020..
[142] Keisuke Ogaki, K is M Ki ani, Yusuke Sugano, Yoichi Sa o, Coupling eye-
mo ion and ego-mo ion ea u es o i s -pe son ac i i y ecogni ion, in:
2012 IEEE Compu e Socie y Con e ence on Compu e Vision and Pa e n
Recogni ion Wo kshops, IEEE, 2012, pp. 1–7.
[143] Timo Ojala, Ma i Pie ikainen, Da id Ha wood, Pe o mance e alua ion o
ex u e measu es wi h classi ica ion based on kullback disc imina ion o
dis ibu ions, P oceedings o 12 h in e na ional con e ence on pa e n
ecogni ion, olume 1, IEEE, 1994, pp. 582–585.
[144] Timo Ojala, Ma i Pie ikäinen, Da id Ha wood, A compa a i e s udy o
ex u e measu es wi h classi ica ion based on ea u ed dis ibu ions, Pa e n
ecogni ion 29 (1) (1996) 51–59.
[145] Juan-Manuel Pe ez-Rua, B ais Ma inez, Xia ian Zhu, An oine Toisoul, Vic o
Esco cia, and Tao Xiang. Knowing wha , whe e and when o look: E icien
ideo ac ion modeling wi h a en ion. a Xi p ep in a Xi :2004.01278,
2020..
[146] Juan-Manuel Pe ez-Rua, An oine Toisoul, B ais Ma inez, Vic o Esco cia, Li
Zhang, Xia ian Zhu, and Tao Xiang. Egocen ic ac ion ecogni ion by ideo
a en ion and empo al con ex . a Xi p ep in a Xi :2007.01883, 2020..
[147] Flo en Pe onnin, Ch is ophe Dance, Fishe ke nels on isual ocabula ies
o image ca ego iza ion, in: 2007 IEEE Con e ence on Compu e Vision and
Pa e n Recogni ion, IEEE, 2007, pp. 1–8.
[148] Hamed Pi sia ash, De a Ramanan, De ec ing ac i i ies o daily li ing in i s -
pe son came a iews, in: 2012 IEEE Con e ence on Compu e Vision and
Pa e n Recogni ion, IEEE, 2012, pp. 2847–2854.
[149] Mi co Planamen e, And ea Bo ino, and Ba ba a Capu o. Join encoding o
appea ance and mo ion ea u es wi h sel -supe ision o i s pe son ac ion
ecogni ion. a Xi p ep in a Xi :2002.03982, 2020..
[150] Mi co Planamen e, And ea Bo ino, Ba ba a Capu o, Sel -supe ised join
encoding o mo ion and appea ance o i s pe son ac ion ecogni ion, in:
2020 25 h In e na ional Con e ence on Pa e n Recogni ion (ICPR), IEEE,
2021, pp. 8751–8758.
[151] Mi co Planamen e, Chia a Plizza i, Emanuele Albe i, and Ba ba a Capu o.
C oss-domain i s pe son audio- isual ac ion ecogni ion h ough ela i e
no m alignmen . a Xi p ep in a Xi :2106.01689, 2021..
Ad ián Núñez-Ma cos, G. Azkune and I. A ganda-Ca e as Neu ocompu ing 472 (2022) 175–197
195

[152] Yai Poleg, Che an A o a, and Shmuel Peleg. Head mo ion signa u es om
egocen ic ideos. In Asian Con e ence on Compu e Vision, pages 315–329.
Sp inge , 2014..
[153] Yai Poleg, Che an A o a, Shmuel Peleg, Tempo al segmen a ion o egocen ic
ideos, in: P oceedings o he IEEE Con e ence on Compu e Vision and
Pa e n Recogni ion, 2014, pp. 2537–2544.
[154] Yai Poleg, A iel Eph a , Shmuel Peleg, Che an A o a, Compac cnn o
indexing egocen ic ideos, in: 2016 IEEE Win e Con e ence on Applica ions
o Compu e Vision (WACV), IEEE, 2016, pp. 1–9.
[155] Ra ael Possas, Sheila Pin o Cace es, Fabio Ramos, Egocen ic ac i i y
ecogni ion on a budge , in: P oceedings o he IEEE Con e ence on
Compu e Vision and Pa e n Recogni ion, 2018, pp. 5967–5976.
[156] Didik Pu wan o, Yie-Ta ng Chen, Wen-Hsien Fang, Fi s -pe son ac ion
ecogni ion wi h empo al pooling and hilbe –huang ans o m, IEEE
T ansac ions on Mul imedia 21 (12) (2019) 3122–3135.
[157] F ancesco Ragusa, An onino Fu na i, Sebas iano Ba ia o, Gio anni Signo ello,
and Gio anni Ma ia Fa inella. Ego-ch: Da ase and undamen al asks o
isi o s beha io al unde s anding using egocen ic ision. Pa e n
Recogni ion Le e s, 131:150–157, 2020..
[158] F ancesco Ragusa, An onino Fu na i, Sal a o e Li a ino, Gio anni Ma ia
Fa inella, The meccano da ase : Unde s anding human-objec in e ac ions
om egocen ic ideos in an indus ial-like domain, in: P oceedings o he
IEEE/CVF Win e Con e ence on Applica ions o Compu e Vision, 2021, pp.
1569–1578.
[159] Joseph Redmon, San osh Di ala, Ross Gi shick, Ali Fa hadi, You only look
once: Uni ied, eal- ime objec de ec ion, in: P oceedings o he IEEE
Con e ence on Compu e Vision and Pa e n Recogni ion, 2016, pp. 779–788.
[160] Shaoqing Ren, Kaiming He, Ross Gi shick, and Jian Sun. Fas e -cnn: Towa ds
eal- ime objec de ec ion wi h egion p oposal ne wo ks. In Ad ances in
Neu al In o ma ion P ocessing Sys ems, pages 91–99, 2015..
[161] Xiao eng Ren, Gu. Chunhui, Figu e-g ound segmen a ion imp o es handled
objec ecogni ion in egocen ic ideo, in: 2010 IEEE Compu e Socie y
Con e ence on Compu e Vision and Pa e n Recogni ion, IEEE, 2010, pp.
3137–3144.
[162] Xiao eng Ren, Ma hai Philipose, Egocen ic ecogni ion o handled objec s:
Benchma k and analysis, in: 2009 IEEE Compu e Socie y Con e ence on
Compu e Vision and Pa e n Recogni ion Wo kshops, IEEE, 2009, pp. 1–8.
[163] Michael S Ryoo, La y Ma hies, Fi s -pe son ac i i y ecogni ion: Wha a e
hey doing o me?, in: P oceedings o he IEEE Con e ence on Compu e
ision and Pa e n Recogni ion, 2013, pp 2730–2737.
[164] Michael S Ryoo, B andon Ro h ock, La y Ma hies, Pooled mo ion ea u es
o i s -pe son ideos, in: P oceedings o he IEEE Con e ence on Compu e
Vision and Pa e n Recogni ion, 2015, pp. 896–904.
[165] Abhimanyu Sahu, Raji Bha acha ya, Pallabh Bhu a, Ananda S Chowdhu y,
in: Ac ion ecogni ion om egocen ic ideos using andom walks In
P oceedings o 3 d In e na ional Con e ence on Compu e Vision and Image
P ocessing, Sp inge , 2020, pp. 389–402.
[166] Abhimanyu Sahu, Ananda S Chowdhu y, Sho le el egocen ic ideo co-
summa iza ion, in: 2018 24 h In e na ional Con e ence on Pa e n
Recogni ion (ICPR), IEEE, 2018, pp. 2887–2892.
[167] Abhimanyu Sahu, Ananda S Chowdhu y, Toge he ecognizing, localizing and
summa izing ac ions in egocen ic ideos, IEEE T ansac ions on Image
P ocessing 30 (2021) 4330–4340.
[168] Mos a a Kamal Sa ke , Ha em A. Rashwan, Es e ania Tala e a, Syeda Fu uka
Banu, Pe ia Rade a, Domenec Puig, e al., Macne : Mul i-scale a ous
con olu ion ne wo ks o ood places classi ica ion in egocen ic pho o-
s eams, in: P oceedings o he Eu opean Con e ence on Compu e Vision
(ECCV), 2018.
[169] Tyle R Sco , Michael Sh a sman, and Ka l Ridgeway. Uni ying ew-and
ze o-sho egocen ic ac ion ecogni ion. a Xi p ep in a Xi :2006.11393,
2020..
[170] Lei Shi, Yi an Zhang, Jian Cheng, Lu. Hanqing, Skele on-based ac ion
ecogni ion wi h mul i-s eam adap i e g aph con olu ional ne wo ks, IEEE
T ansac ions on Image P ocessing 29 (2020) 9532–9545.
[171] Yuki Shiga, Takumi Toyama, Yuzuko U sumi, Koichi Kise, And eas Dengel,
Daily ac i i y ecogni ion combining gaze mo ion and isual ea u es, in:
P oceedings o he 2014 ACM In e na ional Join Con e ence on Pe asi e and
Ubiqui ous Compu ing: Adjunc Publica ion, 2014, pp. 1103–1111.
[172] Gunna A Sigu dsson, Abhina Gup a, Co delia Schmid, Ali Fa hadi, and
Ka eek Alaha i. Cha ades-ego: A la ge-scale da ase o pai ed hi d and i s
pe son ideos. a Xi p ep in a Xi :1804.09626, 2018..
[173] Michel Sil a, Washing on Ramos, João Fe ei a, Felipe Chamone, Ma io
Campos, and E ickson R. Nascimen o. A weigh ed spa se sampling and
smoo hing ame ansi ion app oach o seman ic as - o wa d i s -pe son
ideos. In 2018 IEEE/CVF Con e ence on Compu e Vision and Pa e n
Recogni ion (CVPR), pages 2383–2392, Sal Lake Ci y, USA, Jun. 2018..
[174] Ka en Simonyan and And ew Zisse man. Two-s eam con olu ional
ne wo ks o ac ion ecogni ion in ideos. In Ad ances in Neu al
In o ma ion P ocessing Sys ems, pages 568–576, 2014..
[175] Su iya Singh, Che an A o a, C.V. Jawaha , Gene ic ac ion ecogni ion om
egocen ic ideos, in: 2015 Fi h Na ional Con e ence on Compu e Vision,
Pa e n Recogni ion, Image P ocessing and G aphics (NCVPRIPG), IEEE, 2015,
pp. 1–4.
[176] Su iya Singh, Che an A o a, C.V. Jawaha , Fi s pe son ac ion ecogni ion
using deep lea ned desc ip o s, in: P oceedings o he IEEE Con e ence on
Compu e Vision and Pa e n Recogni ion, 2016, pp. 2620–2628.
[177] Su iya Singh, Che an A o a, and CV Jawaha . T ajec o y aligned ea u es o
i s pe son ac ion ecogni ion. Pa e n Recogni ion, 62:45–55, 2017..
[178] Sibo Song, Vijay Chand asekha , Ngai-Man Cheung, Sana h Na ayan, Liyuan
Li, and Joo-Hwee Lim. Ac i i y ecogni ion in egocen ic li e-logging ideos.
In Asian Con e ence on Compu e Vision, pages 445–458. Sp inge , 2014..
[179] Sibo Song, Ngai-Man Cheung, Vijay Chand asekha , Bappadi ya Mandal, Jie
Li i, Egocen ic ac i i y ecogni ion wi h mul imodal ishe ec o , in: 2016
IEEE In e na ional Con e ence on Acous ics, Speech and Signal P ocessing
(ICASSP), IEEE, 2016, pp. 2717–2721.
[180] Khu am Soom o, Ami Roshan Zami , and Muba ak Shah. Uc 101: A da ase
o 101 human ac ions classes om ideos in he wild. a Xi p ep in
a Xi :1212.0402, 2012..
[181] Robe Spee , Ca he ine Ha asi, Concep ne 5: A la ge seman ic ne wo k o
ela ional knowledge, in: The People’s Web Mee s NLP, Sp inge , 2013, pp. 161–176.
[182] Eka e ina H Sp iggs, Fe nando De La To e, Ma ial Hebe , Tempo al
segmen a ion and ac i i y classi ica ion om i s -pe son sensing, in: 2009
IEEE Compu e Socie y Con e ence on Compu e Vision and Pa e n
Recogni ion Wo kshops, IEEE, 2009, pp. 17–24.
[183] Julian S eil, Ma ion Koelle, Wilko Heu en, Susanne Boll, And eas Bulling,
P i aceye: p i acy-p ese ing head-moun ed eye acking using egocen ic
scene image and eye mo emen ea u es, in: P oceedings o he 11 h ACM
Symposium on Eye T acking Resea ch & Applica ions, 2019, pp. 1–10.
[184] Oily S yles, A un Ross, Vic o Sanchez, Fo ecas ing pedes ian ajec o y wi h
machine-anno a ed aining da a, in: 2019 IEEE In elligen Vehicles
Symposium (IV), IEEE, 2019, pp. 716–721.
[185] Swa hiki an Sudhaka an, Se gio Escale a, and Oswald Lanz. Fbk-hupba
submission o he epic-ki chens 2019 ac ion ecogni ion challenge. a Xi
p ep in a Xi :1906.08960, 2019..
[186] Swa hiki an Sudhaka an, Se gio Escale a, and Oswald Lanz. Hie a chical
ea u e agg ega ion ne wo ks o ideo ac ion ecogni ion. a Xi p ep in
a Xi :1905.12462, 2019..
[187] Swa hiki an Sudhaka an, Se gio Escale a, Oswald Lanz, Ls a: Long sho - e m
a en ion o egocen ic ac ion ecogni ion, in: P oceedings o he IEEE
Con e ence on Compu e Vision and Pa e n Recogni ion, 2019, pp. 9954–9963.
[188] Swa hiki an Sudhaka an, Oswald Lanz, Con olu ional long sho - e m
memo y ne wo ks o ecognizing i s pe son in e ac ions, in: P oceedings
o he IEEE In e na ional Con e ence on Compu e Vision Wo kshops, 2017,
pp. 2339–2346.
[189] Swa hiki an Sudhaka an and Oswald Lanz. A en ion is all we need: Nailing
down objec -cen ic a en ion o egocen ic ac i i y ecogni ion. a Xi
p ep in a Xi :1807.11794, 2018..
[190] Li Sun, Ul ich Klank, Michael Bee z, Eyewa chme-3d hand and objec acking
o inside ou ac i i y analysis, in: 2009 IEEE Compu e Socie y Con e ence on
Compu e Vision and Pa e n Recogni ion Wo kshops, IEEE, 2009, pp. 9–16.
[191] Sudeep Sunda am, Wal e io W Mayol, Cue as, High le el ac i i y ecogni ion
using low esolu ion wea able ision, in: 2009 IEEE Compu e Socie y
Con e ence on Compu e Vision and Pa e n Recogni ion Wo kshops, IEEE,
2009, pp. 25–32.
[192] Dipak Su ie, Thomas Pede son, Fabien Lag i oul, La s-E ik Janle , Daniel
Sjölie, in: Ac i i y ecogni ion using an egocen ic pe spec i e o e e yday
objec s In In e na ional Con e ence on Ubiqui ous In elligence and
Compu ing, Sp inge , 2007, pp. 246–257.
[193] Ch is ian Szegedy, Wei Liu, Yangqing Jia, Pie e Se mane , Sco Reed,
D agomi Anguelo , Dumi u E han, Vincen Vanhoucke, and And ew
Rabino ich. Going deepe wi h con olu ions. In P oceedings o he IEEE
Con e ence on Compu e Vision and Pa e n Recogni ion, pages 1–9, 2015..
[194] Es e ania Tala e a, Ma iella Dimiccoli, Ma c Bolanos, Maedeh Aghaei, Pe ia
Rade a, R-clus e ing o egocen ic ideo segmen a ion, in: Ibe ian Con e ence
on Pa e n Recogni ion and Image Analysis, Sp inge , 2015, pp. 327–336.
[195] Yansong Tang, Zian Wang, Lu. Jiwen, Jianjiang Feng, Jie Zhou, Mul i-s eam
deep neu al ne wo ks o gb-d egocen ic ac ion ecogni ion, IEEE
T ansac ions on Ci cui s and Sys ems o Video Technology 29 (10) (2018)
3001–3015.
[196] Bug a Tekin, Fede ica Bogo, Ma c Polle eys, H+ o, Uni ied egocen ic ecogni ion o
3d hand-objec poses and in e ac ions, in: P oceedings o he IEEE Con e ence on
Compu e Vision and Pa e n Recogni ion, 2019, pp. 4511–4520.
[197] Daniel Thalmann, Hui Liang, Junsong Yuan, Fi s -pe son palm pose acking
and ges u e ecogni ion in augmen ed eali y, in: In e na ional Join
Con e ence on Compu e Vision, Imaging and Compu e G aphics, Sp inge ,
2015, pp. 3–15.
[198] Daksh Thapa , Che an A o a, and Adi ya Nigam. Is sha ing o egocen ic ideo
gi ing away you biome ic signa u e? 2020..
[199] Du. T an, Lubomi Bou de , Rob Fe gus, Lo enzo To esani, Manoha Palu i,
Lea ning spa io empo al ea u es wi h 3d con olu ional ne wo ks, in:
P oceedings o he IEEE In e na ional Con e ence on Compu e Vision,
2015, pp. 4489–4497.
[200] Amin Ullah, Jamil Ahmad, Khan Muhammad, Muhammad Sajjad, and Sung
Wook Baik. Ac ion ecogni ion in ideo sequences using deep bi-di ec ional
ls m wi h cnn ea u es. IEEE access, 6:1155–1166, 2017..
[201] Ashish Vaswani, Noam Shazee , Niki Pa ma , Jakob Uszko ei , Llion Jones, Aidan
N Gomez, Łukasz Kaise , and Illia Polosukhin. A en ion is all you need. In
Ad ances in Neu al In o ma ion P ocessing Sys ems, pages 5998–6008, 2017..
[202] Saga Ve ma, P a in Naga , Di am Gup a, Che an A o a, Making hi d pe son
echniques ecognize i s -pe son ac ions in egocen ic ideos, in: 2018 25 h
IEEE In e na ional Con e ence on Image P ocessing (ICIP), IEEE, 2018, pp.
2301–2305.
Ad ián Núñez-Ma cos, G. Azkune and I. A ganda-Ca e as Neu ocompu ing 472 (2022) 175–197
196
[203] Théo Voillemin, Hazem Wannous, Jean-Philippe Vandebo e, 2d deep ideo
capsule ne wo k wi h empo al shi o ac ion ecogni ion, in: 2020 25 h
In e na ional Con e ence on Pa e n Recogni ion (ICPR), IEEE, 2021, pp. 3513–
3519.
[204] Heng Wang, Co delia Schmid, Ac ion ecogni ion wi h imp o ed ajec o ies,
in: P oceedings o he IEEE In e na ional Con e ence on Compu e Vision,
2013, pp. 3551–3558.
[205] Wei Wang, Vincen W Zheng, Han Yu, and Chunyan Miao. A su ey o ze o-
sho lea ning: Se ings, me hods, and applica ions. ACM T ansac ions on
In elligen Sys ems and Technology (TIST), 10(2):1–37, 2019..
[206] Xiaohan Wang, Yu Wu, Linchao Zhu, and Yi Yang. Baidu-u s submission o he
epic-ki chens ac ion ecogni ion challenge 2019. a Xi p ep in
a Xi :1906.09383, 2019..
[207] Xiaohan Wang, Yu Wu, Linchao Zhu, and Yi Yang. Symbio ic a en ion wi h
p i ileged in o ma ion o egocen ic ac ion ecogni ion. a Xi p ep in
a Xi :2002.03137, 2020..
[208] Yaqing Wang, Quanming Yao, James T Kwok, Lionel M Ni, Gene alizing om a
ew examples: A su ey on ew-sho lea ning, ACM Compu ing Su eys
(CSUR) 53 (3) (2020) 1–34.
[209] Michael W ay and Dima Damen. Lea ning isual ac ions using mul iple e b-
only labels. a Xi p ep in a Xi :1907.11117, 2019..
[210] Michael W ay, Diane La lus, Gab iela Csu ka, Dima Damen, Fine-g ained
ac ion e ie al h ough mul iple pa s-o -speech embeddings, in:
P oceedings o he IEEE In e na ional Con e ence on Compu e Vision,
2019, pp. 450–459.
[211] Michael W ay, Da ide Mol isan i, Dima Damen, Towa ds an unequi ocal
ep esen a ion o ac ions, in: P oceedings o he IEEE Con e ence on
Compu e Vision and Pa e n Recogni ion Wo kshops, 2018, pp. 1127–1131.
[212] Michael W ay, Da ide Mol isan i, Wal e io Mayol-Cue as, Dima Damen, in:
Sembed: Seman ic embedding o egocen ic ac ion ideos In Eu opean
Con e ence on Compu e Vision, Sp inge , 2016, pp. 532–545.
[213] Michael W ay, Da ide Mol isan i, Wal e io Mayol-Cue as, and Dima Damen.
Imp o ing classi ica ion by imp o ing labelling: In oducing p obabilis ic
mul i-label objec in e ac ion ecogni ion. a Xi p ep in a Xi :1703.08338,
2017..
[214] SHI Xingjian, Zhou ong Chen, Hao Wang, Di -Yan Yeung, Wai-Kin Wong, and
Wang-chun Woo. Con olu ional ls m ne wo k: A machine lea ning app oach
o p ecipi a ion nowcas ing. In Ad ances in Neu al In o ma ion P ocessing
Sys ems, pages 802–810, 2015..
[215] Yan Yan, Elisa Ricci, Gaowen Liu, Nicu Sebe, Recognizing daily ac i i ies om
i s -pe son ideos wi h mul i- ask clus e ing, in: Asian Con e ence on
Compu e Vision, Sp inge , 2014, pp. 522–537.
[216] Yan Yan, Elisa Ricci, Gaowen Liu, Nicu Sebe, Egocen ic daily ac i i y
ecogni ion ia mul i ask clus e ing, IEEE T ansac ions on Image P ocessing
24 (10) (2015) 2984–2995.
[217] Jen-An Yang, Chia-Han Lee, V. Shao-Wen Yang, S ini asa Somayazulu, Yen-
Kuang Chen, Shao-Yi Chien, Wea able social came a: Egocen ic ideo
summa iza ion o social in e ac ion, in: 2016 IEEE In e na ional
Con e ence on Mul imedia & Expo Wo kshops (ICMEW), IEEE, 2016, pp. 1–6.
[218] Lijin Yang. Egocen ic ac ion ecogni ion om noisy ideos. 2020..
[219] Siyuan Yang, Jun Liu, Lu. Shijian, Meng Hwa E , Alex C Ko , Collabo a i e
lea ning o ges u e ecogni ion and 3d hand pose es ima ion wi h mul i-
o de ea u e analysis, in: Eu opean Con e ence on Compu e Vision,
Sp inge , 2020, pp. 769–786.
[220] Ryo Yone ani, K is M Ki ani, Yoichi Sa o, Ego-su ing i s -pe son ideos, in:
P oceedings o he IEEE Con e ence on Compu e Vision and Pa e n
Recogni ion, 2015, pp. 5445–5454.
[221] Ryo Yone ani, K is M Ki ani, Yoichi Sa o, Recognizing mic o-ac ions and
eac ions om pai ed egocen ic ideos, in: P oceedings o he IEEE
Con e ence on Compu e Vision and Pa e n Recogni ion, 2016, pp. 2629–
2638.
[222] Ryo Yone ani, K is M Ki ani, Yoichi Sa o, Visual mo i disco e y ia i s -
pe son ision, in: Eu opean Con e ence on Compu e Vision, Sp inge , 2016,
pp. 187–203.
[223] Ryo Yone ani, K is M Ki ani, and Yoichi Sa o. Ego-su ing: Pe son localiza ion
in i s -pe son ideos using ego-mo ion signa u es. IEEE ansac ions on
pa e n analysis and machine in elligence, 40(11):2749–2761, 2017..
[224] Chen Yu and Dana H Balla d. Lea ning o ecognize human ac ion sequences.
In P oceedings 2nd In e na ional Con e ence on De elopmen and Lea ning.
ICDL 2002, pages 28–33. IEEE, 2002..
[225] Yu. Chen, Dana H Balla d, Unde s anding human beha io s based on eye-
head-hand coo dina ion, in: In e na ional Wo kshop on Biologically
Mo i a ed Compu e Vision, Sp inge , 2002, pp. 611–619.
[226] Yu. Haibin, Wenyan Jia, Zhen Li, Feixiang Gong, Ding Yuan, Hong Zhang,
Mingui Sun, A mul isou ce usion amewo k d i en by use -de ined
knowledge o egocen ic ac i i y ecogni ion, EURASIP Jou nal on
Ad ances in Signal P ocessing 2019 (1) (2019) 14.
[227] Yu. Haibin, Wenyan Jia, Li Zhang, Mian Pan, Yuanyuan Liu, and Mingui Sun. A
hie a chical pa allel usion amewo k o egocen ic adl ecogni ion based
on disce nmen ame pa i ioning and belie coa sening. Jou nal o Ambien
In elligence and Humanized, Compu ing (2020) 1–23.
[228] Yu. Haibin, Guoxiong Pan, Mian Pan, Chong Li, Wenyan Jia, Li Zhang, Mingui
Sun, A hie a chical deep usion amewo k o egocen ic ac i i y ecogni ion
using a wea able hyb id senso sys em, Senso s 19 (3) (2019) 546.
[229] Yuan Yuan, Yang Zhao, Qi Wang, Ac ion ecogni ion using spa ial-op ical da a
o ganiza ion and sequen ial lea ning amewo k, Neu ocompu ing 315
(2018) 221–233.
[230] Hasan FM Zaki, Faisal Sha ai , and Ajmal Mian. Modeling sub-e en dynamics
in i s -pe son ac ion ecogni ion, in: P oceedings o he IEEE Con e ence on
Compu e Vision and Pa e n Recogni ion, 2017, pp. 7253–7262.
[231] Kai Zhan, S e en Faux, Fabio Ramos, Mul i-scale condi ional andom ields o
i s -pe son ac i i y ecogni ion, in: 2014 IEEE in e na ional con e ence on
pe asi e compu ing and communica ions (Pe Com), IEEE, 2014, pp. 51–59.
[232] Hong-Bo Zhang, Yi-Xiang Zhang, Bineng Zhong, Qing Lei, Lijie Yang, Du. Ji-
Xiang, Duan-Sheng Chen, A comp ehensi e su ey o ision-based human
ac ion ecogni ion me hods, Senso s 19 (5) (2019) 1005.
[233] Yun C Zhang, Yin Li, James M Rehg, Fi s -pe son ac ion decomposi ion and
ze o-sho lea ning, in: 2017 IEEE e ence on Applica ions o Compu e Vision
(WACV), IEEE, 2017, pp. 121–129.
[234] Chengzhang Zhong, Amy R Reibman, Hansel Mina Co doba, Amanda J
Dee ing, Hand-hygiene ac i i y ecogni ion in egocen ic ideo, in: 2019
IEEE 21s In e na ional Wo kshop on Mul imedia Signal P ocessing (MMSP),
IEEE, 2019, pp. 1–6.
[235] Bolei Zhou, Adi ya Khosla, Aga a Laped iza, Aude Oli a, An onio To alba,
Lea ning deep ea u es o disc imina i e localiza ion, in: P oceedings o he IEEE
Con e ence on Compu e Vision and Pa e n Recogni ion, 2016, pp. 2921–2929.
[236] Yang Zhou, Bingbing Ni, Richang Hong, Xiaokang Yang, Qi Tian, Cascaded
in e ac ional a ge ing ne wo k o egocen ic ideo analysis, in: P oceedings
o he IEEE Con e ence on Compu e Vision and Pa e n Recogni ion, 2016,
pp. 1904–1913.
[237] Yi Zhu, Zhenzhong Lan, Shawn Newsam, Alexande Haup mann, Hidden wo-
s eam con olu ional ne wo ks o ac ion ecogni ion, in: Asian con e ence
on compu e ision, Sp inge , 2018, pp. 363–378.
[238] Zheming Zuo, Bo Wei, Fei Chao, Qu. Yanpeng, Yonghong Peng, Longzhi Yang,
Enhanced g adien -based local ea u e desc ip o s by saliency map o
egocen ic ac ion ecogni ion, Applied Sys em Inno a ion 2 (1) (2019) 7.
[239] Zheming Zuo, Longzhi Yang, Yonghong Peng, Fei Chao, Qu. Yanpeng, Gaze-
in o med egocen ic ac ion ecogni ion o memo y aid sys ems, IEEE Access
6 (2018) 12894–12904.
Adrián Núñez-Marcos is a PhD student at the University
of Deusto. He holds a BSc in Computer Science from the
University of the Basque Country (UPV/EHU), where he also
obtained the MSc degree in Computational Engineering
and Intelligent Systems. His research interests include
computer vision and deep learning.
Gorka Azkune is an assistant professor at the University
of the Basque Country (UPV/EHU). He has published over 20
international peer-reviewed articles in journals and
international conferences. He is a member of the IXA
NLP group. His research interests include machine
learning and multimodal deep learning. He received a
PhD in Computer Science from the University of Deusto.
Ignacio Arganda-Carreras is an Ikerbasque Research
Associate at the University of the Basque Country (UPV/EHU),
in San Sebastian, Spain. His research interests
include computer vision and bioimage analysis. He
received a Ph.D. in computer science and electrical
engineering from the Universidad Autonoma de Madrid,
Spain.
Adrián Núñez-Marcos, G. Azkune and I. Arganda-Carreras Neurocomputing 472 (2022) 175–197