Results in Engineering 24 (2024) 103498
Available online 26 November 2024
2590-1230/© 2024 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Contents lists available at ScienceDirect
Results in Engineering
journal homepage: www.sciencedirect.com/journal/results-in-engineering
Research paper
On the black-box explainability of object detection models for safe and trustworthy industrial applications
Alain Andres a,b,∗, Aitor Martinez-Seras a, Ibai Laña a,b, Javier Del Ser a,c
a TECNALIA, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua 2, Donostia-San Sebastian, 20009, Spain
b University of Deusto, 20012, Donostia-San Sebastián, Spain
c University of the Basque Country (UPV/EHU), Bilbao, 48013, Spain
A R T I C L E  I N F O
Keywords:
Explainable Artificial Intelligence
Safe Artificial Intelligence
Trustworthy Artificial Intelligence
Object detection
Single-stage object detection
Industrial robotics
A B S T R A C T
In the realm of human-machine interaction, artificial intelligence has become a powerful tool for accelerating data modeling tasks. Object detection methods have achieved outstanding results and are widely used in critical domains like autonomous driving and video surveillance. However, their adoption in high-risk applications, where errors may cause severe consequences, remains limited. Explainable Artificial Intelligence methods aim to address this issue, but many existing techniques are model-specific and designed for classification tasks, making them less effective for object detection and difficult for non-specialists to interpret. In this work we focus on model-agnostic explainability methods for object detection models and propose D-MFPP, an extension of the Morphological Fragmental Perturbation Pyramid (MFPP) technique based on segmentation-based masks to generate explanations. Additionally, we introduce D-Deletion, a novel metric combining faithfulness and localization, adapted specifically to meet the unique demands of object detectors. We evaluate these methods on real-world industrial and robotic datasets, examining the influence of parameters such as the number of masks, model size, and image resolution on the quality of explanations. Our experiments use single-stage object detection models applied to two safety-critical robotic environments: i) a shared human-robot workspace where safety is of paramount importance, and ii) an assembly area of battery kits, where safety is critical due to the potential for damage among high-risk components. Our findings evince that D-Deletion effectively gauges the performance of explanations when multiple elements of the same class appear in a scene, while D-MFPP provides a promising alternative to D-RISE when fewer masks are used.
1. Introduction
In recent years, Artificial Intelligence (AI) has emerged as a transformative force across various domains, especially in human-machine interaction, where it has enabled significant advancements in data-driven decision-making processes. Among these advances, object detection has become a key component, finding application in critical areas such as autonomous driving, security surveillance, industrial automation, and robotics [45,22]. State-of-the-art object detection models, including Faster-RCNN [28], DETR [6], and the YOLO series [37], have demonstrated impressive performance in identifying and localizing objects within images. Despite their success, the adoption of these models in highly sensitive environments remains limited, particularly in domains where errors could result in serious consequences such as injury, equipment damage, or operational failures. One of the primary reasons for this hesitancy is the black-box nature of object detectors implemented as Deep Learning models, which to date amount to the majority of proposals in the literature. The internal activations of these models are not inherently interpretable, making it challenging for end-users to trust the predictions issued by object detectors, especially in high-risk, open-world settings such as autonomous vehicles and industrial robotics.
* Corresponding author at: TECNALIA, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua 2, Donostia-San Sebastian, 20009, Spain.
E-mail address: [email protected] (Alain Andres).
In this context, the field of Explainable AI (XAI) [5] aims to enhance the interpretability of AI systems by their audience and, ultimately, to enhance the user’s trust in the output of AI-based systems. Leaving aside the category of transparent AI models (which are inherently interpretable and do not require any explanations for a user to understand how they work), explainability methods in XAI can be broadly categorized into white-box and black-box approaches. White-box XAI methods require access to the internal workings of the model, such as weights, activations, or gradients (e.g., Grad-CAM [34]). While these methods can provide powerful insights, they are often limited by their dependence on specific model architectures, making them difficult to generalize across different models and less accessible to users unfamiliar with AI research/tools. In contrast, black-box XAI methods treat the model as an opaque entity, providing explanations based solely on the model’s input-output behavior without requiring any access to its internal components. However, most black-box XAI methods are designed for classification tasks rather than for object detection [29,19,25,3].
https://doi.org/10.1016/j.rineng.2024.103498
Received 24 October 2024; Received in revised form 14 November 2024; Accepted 21 November 2024
While classification models produce a single label per image, object detection models must identify and localize multiple objects within an image. Therefore, they need to explain not only the class prediction of each detected object (what they detect) but also the spatial reasoning behind the bounding boxes that define the object’s location (where the object is positioned within the image). Balancing these dual aspects complicates the explanation process and requires more sophisticated techniques than those used for classification tasks.
In this paper, we address the gap in XAI methods for object detection by focusing on model-agnostic, black-box XAI techniques. We propose and evaluate novel black-box XAI methods and XAI metrics that are specifically tailored for object detection models, without requiring access to internal model details. Our proposed methods are generalizable to object detection frameworks beyond those utilized in our experiments. Specifically, the contributions of this work can be summarized as follows:
• We formally define a quantitative evaluation metric, D-Deletion, which extends the existing Deletion metric [4,25] proposed for classification tasks. This metric is adapted to handle the unique challenges of object detection, including localization (as seen in Fig. 4), which is of utmost importance when multiple instances of the same object appear in the same scene.
• By using the similarity score of D-RISE [26], we analyze the performance of multiple mask generation methods and introduce D-MFPP, an extension of MFPP [42], which was originally developed for classification tasks. D-MFPP utilizes segmentation-based mask generation to improve explanations for object detection models.
• We analyze the impact of key parameters, such as image dimensions and the model sizes within the YOLOv8 architecture utilized in our experiments, which can significantly influence the quality of the resulting explanations.
• Last but not least, we facilitate the broader adoption of the developed techniques for object detection in real-world use cases by releasing the code publicly in a repository: https://github.com/aklein1995/drise_dmfpp_ddeletion.
The remainder of this paper is structured as follows: in Section 2, we first review literature related to XAI for object detection. In Section 3, we provide the necessary background on object detection and XAI to familiarize the reader with the key concepts used in the definitions of D-RISE and Deletion. Next, Section 4 presents the experimental setup, including datasets, object detection training configuration, employed XAI methods, and evaluation metrics. In this section we also introduce our proposed D-MFPP method and D-Deletion metric. We discuss our results in Section 5. Finally, Section 6 concludes the paper with a summary of our key findings and directions for future research.
2. Related work
Before proceeding with the materials and novel methods introduced in this work, we first pause briefly at XAI methods, focusing on those used for object detection tasks and put to practice in industrial applications:
XAI methods. As stated in the introduction, XAI offers insights into the procedure followed by an AI-based system to elicit its outputs, enabling end-users to understand and eventually trust the decisions output by the AI-based system grounded on objective data [3]. To date, the majority of XAI methods are designed for models learned to address classification tasks. For instance, CAM-based methods like GradCAM [34], GradCAM++ [7] and Integrated Gradients [36] quantify and attribute the pixel-wise importance of a given input according to the gradients with respect to a target class. Moreover, making use of backpropagation, LRP [20] calculates the contribution that a neuron has with neurons in consecutive layers to get relevance scores. In contrast, perturbation-based techniques work by occluding certain parts of the input and analyzing the impact on the predictions. Within this type of techniques, LIME [29] approximates a NN with an interpretable model; SHAP [19] assigns importance values to each input feature based on Shapley values; RISE [25] generates saliency maps by probing the model with randomly masked versions of the input image; and MFPP [42] generates masks by dividing the input image into multi-scale superpixels. Nonetheless, none of them has been explicitly extended to object detection tasks (with the exception of RISE, which has been adapted for this purpose), although techniques like SHAP can also be utilized for regression problems.
XAI methods for object detection models. In recent times, only a few XAI approaches have been proposed to support the interpretability of complex object detection models. SODEx [33] is a method capable of explaining any object detection algorithm using classification explainers, demonstrating how LIME can be integrated within YOLOv4, a variant of the YOLO family of single-stage object detectors. Similarly, D-RISE [26] extends RISE’s mask generation technique by introducing a new similarity score that assesses both the localization and classification aspects of object detection models. More recently, D-CLOSE [38] enhances D-RISE by producing less noisy explanations. Along with other methodological improvements, D-CLOSE uses multiple levels of segmentation in the mask generation phase. Other approaches focusing on hierarchical masking have been proposed. Concretely, GSM-NH [41] evaluates the saliency maps at multiple levels based on the information of previous, less fine-grained saliency maps, whereas BODEM [21] further extends this idea but focuses on an extreme black-box scenario where only object coordinates are available.
XAI methods for industrial applications. Although XAI is increasingly important in industrial settings to ensure safety, reliability and compliance, the adoption of XAI for object detection methods in industrial use cases has been limited to date [17,15,9]. The vast majority of the works focus either on image classification, like [8], which utilizes Grad-CAM to interpret vibration signal images in the classification of bearing faults; time-series data, e.g. [35], which presents the implementation and explanations of a remaining life estimator model; or tabular data, as in [31], where SHAP is used to interpret and study the influence of soil and climate features on crop recommendations. Regarding XAI and object detection for industrial applications, we can find a few exemplary studies that expose the shortage of real-world use cases currently noted in this technological crossroads. In [23], various object detection models are evaluated for their effectiveness in detecting weld characteristics in radiography images, with an emphasis on explainability and deployment on edge devices to assist workers. In the same sense, [32] provides a comprehensive review and analysis of various XAI techniques applied to object detection tasks in computerized tomography imaging for medical purposes. Finally, [14] demonstrates how to integrate Grad-CAM into the YOLO architecture and performs experiments in both public and private datasets of vehicle front collision and rear-view cameras.
3. Background
We now proceed by elaborating on key concepts needed to properly understand the details of the proposed D-MFPP technique and the D-Deletion metric that lie at the core of this work. Concretely, we provide fundamentals of object detection models (Section 3.1) and XAI, with a focus on model-agnostic black-box methods to explain the predictions of object detection models (Section 3.2).
3.1. Object detectors
Object detectors are crucial components in computer vision tasks, capable of identifying and localizing objects within an image. They can be broadly categorized into single-stage and two-stage detectors.
Single-stage detectors. They directly predict bounding boxes and class probabilities from input images in a single pass. Popular single-stage detectors, such as YOLO [37], SSD [18] and RetinaNet [30], treat object detection as a simple regression problem, straight from image pixels to bounding box coordinates and class probabilities. To this end, they produce a dense grid of bounding box proposals and class probabilities in one step. Specifically, YOLO [37] divides the input image into a grid and predicts bounding boxes and class probabilities for each grid cell. Although this efficiency is beneficial for real-time applications, it often comes at the cost of accuracy when compared to two-stage detectors.
Two-stage detectors. These models, among which Faster R-CNN [28] can be considered to be the most representative one, follow a more complex approach that divides the detection process into two stages. In the first stage, a Region Proposal Network (RPN) generates a set of candidate object proposals (bounding boxes) from the input image. In the second stage, these proposals are refined and classified into different object categories by a second network. This second stage typically involves a more complex network, such as a convolutional neural network (CNN), which performs classification and further refinement of the bounding box coordinates. This two-step process boosts accuracy by allowing for a more refined feature analysis, though it also slows down processing, making two-stage detectors less suited for applications that require high-speed performance.
Most detector networks, including Faster R-CNN and YOLO, produce a large number of bounding box proposals which are subsequently refined using confidence thresholding and Non-Maximum Suppression (NMS) to produce a set of finally detected objects in the image. Each bounding box proposal $\mathbf{d}_i$ can be defined as follows:
$$\mathbf{d}_i = [\mathbf{L}_i, O_i, \mathbf{P}_i] = [(x_1^i, y_1^i, x_2^i, y_2^i), O_i, (p_1^i, \ldots, p_C^i)], \tag{1}$$
where $\mathbf{L}_i$ defines the bounding box corners $(x_1^i, y_1^i)$ and $(x_2^i, y_2^i)$; $O_i \in [0, 1]$ refers to the probability that bounding box $\mathbf{L}_i$ contains an object of any class; and $\mathbf{P}_i$ is a vector of probabilities $(p_1^i, \ldots, p_C^i)$ representing the probability that region $\mathbf{L}_i$ belongs to each of $C$ classes. Unlike traditional classifiers, which assign a single class label to an entire image, object detectors must handle both classification and localization simultaneously. This dual task, predicting the class and precise location of each object, increases the complexity of making these models interpretable.
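To make the detection vector of Expression (1) and the post-processing step more concrete, the following is a minimal sketch in plain Python. The names `Detection`, `confidence`, `iou` and `filter_proposals`, the default thresholds, the scoring rule (objectness times the highest class probability) and the class-agnostic greedy NMS are all assumptions made for illustration; they are not taken from any specific detector implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner coordinates


@dataclass
class Detection:
    box: Box                  # L_i in Expression (1)
    objectness: float         # O_i in [0, 1]
    class_probs: List[float]  # P_i, one probability per class

    def confidence(self) -> float:
        # Assumed scoring rule: objectness times the best class probability.
        return self.objectness * max(self.class_probs)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_proposals(proposals: List[Detection],
                     conf_thr: float = 0.25,
                     iou_thr: float = 0.5) -> List[Detection]:
    """Confidence thresholding followed by greedy, class-agnostic NMS."""
    candidates = sorted(
        (p for p in proposals if p.confidence() >= conf_thr),
        key=lambda p: p.confidence(),
        reverse=True,
    )
    kept: List[Detection] = []
    for cand in candidates:
        # Keep a candidate only if it does not overlap a stronger kept box.
        if all(iou(cand.box, k.box) < iou_thr for k in kept):
            kept.append(cand)
    return kept
```

In this sketch, two heavily overlapping proposals collapse into the stronger one, while low-confidence proposals are discarded before NMS is applied.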
3.2. Explainable Artificial Intelligence (XAI)
Despite the great performance exhibited by object detectors in manifold applications, their adoption in risk-sensitive scenarios is often hindered by a lack of transparency and, consequently, of trust on the part of the user making decisions based on the detections issued by these models. As introduced previously, research on XAI produces techniques and methods that make the behavior and predictions of AI models understandable to humans without sacrificing performance [11]. To this end, multiple XAI techniques have been proposed, which can be classified into four broad categories [3]:
• Scope-based techniques focus on the extent of the explanation, providing either local explanations for specific predictions or global explanations of the overall model behavior.
• Complexity-based methods consider the complexity of the model, with simpler, interpretable models offering intrinsic interpretability and more complex models requiring post-hoc explanations.
• Model-based approaches distinguish between XAI methods that are specific to particular types of models, and those that are model-agnostic, capable of being applied to any model regardless of the specifics of its internals.
• Methodology-based techniques are categorized by their methodological approach, such as backpropagation-based methods that trace input influences, or perturbation-based methods that alter inputs to observe changes in the output of the model.
Given that object detectors are typically complex neural networks, they fall under the complexity-based category, thereby requiring post-hoc explainability methods to explain their decisions. Among the various methodology-based techniques, attribution methods are commonly used to estimate the relevance of each pixel in an image for the detection task. Attribution methods are particularly important for object detection, where both localization and classification need to be explained.
Traditional attribution methods have been primarily developed for image classifiers [1], which produce a single categorical output, making them less suited for object detectors. Object detectors, unlike classifiers, generate multiple detection vectors that encode not only class probabilities, but also localization information and additional metrics, such as objectness scores (see Section 3.1). Furthermore, techniques like NMS and confidence threshold filtering, which are used to refine bounding box proposals, add complexities that require a deeper understanding of the model’s internal workings, complicating the use of certain XAI methods, such as gradient-based approaches. Therefore, we focus on model-agnostic black-box XAI approaches, which are designed to be architecture-independent and do not depend at all on the specifics of the target model.
Among model-agnostic XAI methods, perturbation-based approaches are commonly used due to their simplicity and effectiveness in revealing which parts of the input are most influential to the model’s predictions. Perturbation-based techniques offer a direct way to assess how changes to the input image affect the model’s output. By systematically altering or masking parts of the input image (using masks to generate perturbed samples), these methods allow inferring the importance of different regions based on the model’s input-output behavior.
The typical pipeline of perturbation-based XAI methods can be divided into three stages: (1) Data Preparation, (2) Model Assessment, and (3) Importance Computation. In the Data Preparation stage, masks are generated and applied to the image to create perturbed samples. The Model Assessment stage involves passing these perturbed images through the model to observe the changes in output. Finally, in the Importance Computation stage, the importance of each pixel is calculated by comparing the model’s outputs for the original and perturbed images. While the Model Assessment stage remains consistent across methods, with each perturbed image passed through the model, the Importance Computation varies depending on the XAI approach used. This can range from simple techniques like retraining a model (e.g., LIME) to more complex approaches. Since the effectiveness of these methods largely depends on how the perturbed images are generated, three mask generation algorithms are next described (Fig. 1):
• Sliding Window: This method, which is similar to the Occlusion technique proposed in [43], systematically moves a window of fixed size across the image and sets the region within the window to a constant value (e.g., zero) to occlude that part of the image. By iteratively sliding the window across the entire image, we can assess the impact of each occluded region on the model’s output. The method requires specifying the window size, which determines the area of the image being occluded at each step, and the stride, which sets how much the window moves between iterations.
• RISE: Randomized Input Sampling for Explanation (RISE) [25] involves sampling $N$ binary masks of size $h \times w$, which are smaller than the original image size $H \times W$. Each element in the mask is independently set to 1 with probability $p$ and to 0 with the remaining probability $1 - p$. These masks are then upsampled to size $(h+1) \cdot C_H \times (w+1) \cdot C_W$ using bilinear interpolation, where $C_H \times C_W = \lfloor H/h \rfloor \times \lfloor W/w \rfloor$. The upsampled masks are cropped to the original image size $H \times W$ with uniformly random offsets ranging from $(0, 0)$ to $(C_H, C_W)$. This method creates a diverse set of masks that cover different parts of the image, allowing for a comprehensive evaluation of the importance of various regions.
• MFPP: The so-called Morphological Fragmental Perturbation Pyramid (MFPP) [42] method divides the input image into multi-scale fragments and perturbs them randomly. In this sense, it is similar to RISE, but instead of perturbing elements of the generated masks with dimension $h \times w$, MFPP defines regions according to segmentations at different scales. The larger the number of defined fragments, the more fine-grained the regions become, at the cost of a longer computation time. The segments are dependent on each image, requiring the creation of new masks for every image.
Fig. 1. Example of three masks generated using Sliding Window (top), RISE (middle), and MFPP (bottom). MFPP masks are dependent on the image at the input of the model. In this case, we consider a sample from the battery assembly dataset detailed in Section 4.
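The first two mask generators described above can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not the implementation used in the paper: images and masks are plain nested lists, the function names (`sliding_window_masks`, `bilinear_resize`, `rise_masks`) are hypothetical, the bilinear interpolation is corner-aligned, and the cell size is rounded up (ceiling) so that the random crop always fits inside the upsampled mask.

```python
import math
import random


def sliding_window_masks(H, W, window, stride):
    """One H x W mask per window position; occluded pixels are set to 0."""
    masks = []
    for top in range(0, H - window + 1, stride):
        for left in range(0, W - window + 1, stride):
            mask = [[1.0] * W for _ in range(H)]
            for y in range(top, top + window):
                for x in range(left, left + window):
                    mask[y][x] = 0.0
            masks.append(mask)
    return masks


def bilinear_resize(grid, out_h, out_w):
    """Corner-aligned bilinear upsampling of a 2D grid (list of lists)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for y in range(out_h):
        fy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(fy))
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for x in range(out_w):
            fx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(fx))
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def rise_masks(N, H, W, h, w, p, rng):
    """N soft masks of size H x W from random h x w binary grids."""
    cell_h, cell_w = math.ceil(H / h), math.ceil(W / w)  # assumption: ceiling
    masks = []
    for _ in range(N):
        # Each grid element is kept (1) with probability p, dropped (0) otherwise.
        grid = [[1.0 if rng.random() < p else 0.0 for _ in range(w)]
                for _ in range(h)]
        up = bilinear_resize(grid, (h + 1) * cell_h, (w + 1) * cell_w)
        # Random crop offset in [0, C_H] x [0, C_W].
        dy, dx = rng.randint(0, cell_h), rng.randint(0, cell_w)
        masks.append([r[dx:dx + W] for r in up[dy:dy + H]])
    return masks
```

Passing an explicit `random.Random` instance keeps the sampled masks reproducible, which is convenient when comparing explanations obtained with different numbers of masks.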
4. Materials and methods
This section describes the industrial robotics use cases in what refers to the datasets (Section 4.1), object detection model (Section 4.2), XAI methods (Section 4.3) and the explanation quality metrics (Section 4.4) considered in our work. The novel XAI technique and quality metrics proposed in this manuscript are also described in Section 4.3.
4.1. Industrial robotics datasets under consideration
The datasets used in this manuscript have been collected during the course of the ULTIMATE project, https://ultimate-project.eu/, which features two distinct real robotics use cases [16]. The first dataset, from PIAP https://piap.lukasiewicz.gov.pl/, involves a collaborative workspace where a human and a robotic arm work together. The second dataset, provided by Robotnik https://robotnik.eu/, focuses on a battery assembly area, where a robotic arm assembles components of a battery kit.¹
¹ While the datasets contain a relatively small number of images, this data shortage is typically encountered in real-world industrial scenarios subject to data availability constraints. Nevertheless, in the use cases under consideration the contextual and scene variability is minimal, yielding short-tailed distributions of the objects to be detected. Therefore, the small datasets described in the paper sufficiently capture the relevant features for the specific object detection tasks addressed by the models.
Dataset 1: Human-Robot Dataset. This dataset consists of 96 images captured from three different cameras, as exemplified in Fig. 2, with 32 images taken from each camera. The dataset includes two object classes: human and gripper. Importantly, each image in this dataset contains only a single object of each class, meaning a maximum of one human and one gripper per image. To ensure a diverse and representative sample, we applied feature extraction using ResNet [12] to obtain embeddings of the entire dataset. The dimensionality of these embeddings was reduced using Principal Component Analysis (PCA), followed by K-means clustering (with $k = 8$ clusters). From each cluster, four images were randomly selected, resulting in a final subset. The data were split into three sets: 72 images for training (75%), 6 for validation (6.25%), and 18 for testing (18.75%). To maintain consistency, we applied the same partitioning to the data from each camera. This resulted in 24 images for training, 2 for validation, and 6 for testing from each camera.
Fig. 2. Dataset 1 (Human-Robot collaboration): Data are captured from cameras located in 3 different positions. Faces are blurred in all the images belonging to this dataset to preserve anonymity.
Fig. 3. Dataset 2 (Battery Assembly kit): The setup where a robotic arm would assemble the kit based on a bird’s-eye view of the table where all components are expected to be; (left) a theoretical setup; (right) an actual sample.
Dataset 2: Battery Assembly Dataset. This dataset consists of 7 images, all captured from a bird’s-eye (top-down) view, showing a robotic arm assembling a battery kit, as shown in Fig. 3. The dataset includes five distinct object types: individual battery, bms_a, bms_b, battery holder, and unknown object. In contrast to the Human-Robot Dataset, each image in the Battery Assembly Dataset may contain multiple objects of the same class, such as several individual batteries in a single scene.
It is worth noting that XAI techniques can be applied to any type of data. When applied to training data, they help reveal what the model has learned to focus on during training. When applied to test data, they provide insight into how well the model generalizes to new, unseen examples. For the Human-Robot Dataset, XAI explanations were applied exclusively to the test images, allowing us to assess the model’s behavior on unseen data. However, for the Battery Assembly Dataset, given the limited number of images (only 7), XAI explanations were applied to the entire dataset.
4.2. Object detection model: YOLOv8
Among the possible object detector models, we selected one of the state-of-the-art options, YOLOv8, due to its numerous advancements over previous versions and its robust performance in object detection tasks [37]. YOLOv8 [27] integrates a novel combination of Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) architectures, enhancing its ability to detect objects at various scales and resolutions. The FPN gradually reduces the spatial resolution of the input image while increasing feature channels, facilitating multi-scale object detection. The PAN architecture further aggregates features from different levels through skip connections, improving the detection of objects with diverse sizes and shapes. Additionally, YOLOv8 introduces an anchor-free detection mechanism that directly predicts the center of an object (instead of the offset from a known anchor box), reducing the number of box proposals and speeding up the post-processing. Furthermore, it was trained with larger and more diverse datasets including the popular COCO dataset, improving its performance across a wide range of images.
YOLOv8 was developed and released by Ultralytics, and although the model and its weights are open-source, most users are expected to utilize the Ultralytics framework for its enhanced usability. However, unlike previous YOLO releases where the probability of each class per predicted box was accessible, in YOLOv8, the Ultralytics API outputs only the probability of the class with the highest confidence in each box.² Consequently, by default, YOLOv8 outputs:
$$\mathbf{d}_i = [\mathbf{L}_i, O_i, C_i] = [(x_1^i, y_1^i, x_2^i, y_2^i), O_i, C_i], \tag{2}$$
where $\mathbf{L}_i = (x_1^i, y_1^i, x_2^i, y_2^i)$ represents the coordinates of the bounding box, $O_i$ denotes the objectness score, and $C_i$ corresponds to the predicted class label of the object within the bounding box, which differs with respect to the outputs shown in Expression (1).
4.3. Explainability methods
We evaluate four methods for generating visual explanations of black-box models: LIME, RISE, D-RISE, and D-MFPP. The first two methods, LIME and RISE,³ were originally developed for image classifiers but can be adapted for object detectors. However, they primarily focus on explaining classification aspects and are not capable of addressing localization characteristics. In contrast, D-RISE is one of the first XAI methods specifically designed for object detectors, providing explanations that encompass both classification and localization. Additionally, we extend the existing MFPP method (originally tailored for classifiers) into a version suitable for object detection, which we refer to as D-MFPP. In what follows we briefly describe them, flowing into a description of the proposed D-MFPP approach:
• LIME was originally designed to explain the predictions of any classifier by approximating it locally with an interpretable model. To explain the prediction for an input image $I$, LIME fits an interpretable model $g$ (e.g., a linear model) to approximate the behavior of the black-box model $f$ locally around $I$. The similarity between the original image and the perturbed samples is measured using a kernel function $\pi_I(z)$. When image explanations are targeted, LIME groups contiguous pixels into superpixels based on the similar features they represent. This approach allows LIME to measure the importance of regions in the image rather than individual pixels, making the explanations more interpretable.
• As introduced in the previous section, RISE [25] was originally designed for deep neural networks that take images as input and output a class probability (e.g., a classifier like ResNet-50). It generates saliency maps that indicate the importance of each pixel by applying randomly generated binary masks $M_i$ to the input image $I$ and observing the changes in the model’s output $f(I \odot M_i)$. In RISE, $N$ binary masks $M_i \in \{0, 1\}^{h \times w}$ are generated and upsampled to the image size (as explained in Section 3.2). These masks are then applied to the input image $I$ to generate masked images $I'_i = I \odot M_i$, where $\odot$ denotes element-wise multiplication. The model is evaluated on each masked image $I'_i$ to obtain the outputs $f(I \odot M_i)$. The importance score for each pixel $(x, y)$ is then calculated as the weighted sum of the outputs:
$$S_{I,f}(x, y) = \frac{1}{N} \sum_{i=1}^{N} f(I \odot M_i) \cdot M_i(x, y) \tag{3}$$
where the weights $M_i(x, y)$ represent the value of mask $i$ at pixel $(x, y)$. The intuition behind RISE is that $f(I \odot M_i)$ would be high when the pixels preserved by mask $M_i$ are important. Although this holds exactly only with an infinite number of diverse masks, in practice RISE estimates each pixel’s importance empirically by Monte Carlo sampling. Therefore, RISE largely depends on the number of masks ($N$) and how they are generated (i.e., it is sensitive to the selected probability $p$ and resolution $s$).
² https://github.com/ultralytics/ultralytics/issues/2863; https://github.com/ultralytics/ultralytics/issues/4908.
³ These XAI methods have been chosen due to their perturbation-based nature, which aligns closely with the methodology followed by D-RISE and by the D-MFPP method proposed in this work. Both D-RISE and D-MFPP generate explanations through perturbations.
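The Monte Carlo estimate of Expression (3) can be sketched in a few lines of plain Python. The sketch below is illustrative only: `rise_saliency` is a hypothetical name, the image is a single-channel nested list, the masks are assumed to be already upsampled to the image size, and `model` stands for any callable that returns a scalar score for a masked image.

```python
def rise_saliency(image, masks, model):
    """S(x, y) = (1/N) * sum_i f(I ⊙ M_i) * M_i(x, y), as in Expression (3)."""
    H, W = len(image), len(image[0])
    saliency = [[0.0] * W for _ in range(H)]
    for mask in masks:
        # Element-wise product I ⊙ M_i yields the perturbed sample.
        masked = [[image[y][x] * mask[y][x] for x in range(W)]
                  for y in range(H)]
        score = model(masked)  # f(I ⊙ M_i), a scalar
        # Each preserved pixel is credited with the score of this mask.
        for y in range(H):
            for x in range(W):
                saliency[y][x] += score * mask[y][x]
    n = len(masks)
    return [[value / n for value in row] for row in saliency]
```

As a toy check, a model that only looks at the top-left pixel assigns the highest saliency to that pixel, since the score is non-zero only for the masks that preserve it.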
4.3.1. D-RISE and proposed D-MFPP approach
Unlike the other two approaches, originally designed for classifiers, which measure solely classification aspects, D-RISE (Detector Randomized Input Sampling for Explanation) [26] was designed to explain both the classification and localization of a detection. In this sense, D-RISE extends RISE by producing saliency maps specifically for object detectors. As previously seen in Section 3.1, the output given by an object detector differs from the probability vector given by a classifier: it comprises localization information $\mathbf{L}_i$, an objectness score $O_i$ and the probabilities $\mathbf{P}_i$ of each bounding box belonging to any of the considered classes. As a consequence, Expression (3) used by RISE is replaced in D-RISE with a new similarity score, given by:
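$$S_{I,f}(\mathbf{d}_t, \mathbf{d}_j) = s_L(\mathbf{d}_t, \mathbf{d}_j) \cdot s_P(\mathbf{d}_t, \mathbf{d}_j) \cdot s_O(\mathbf{d}_t, \mathbf{d}_j), \tag{4}$$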
where $s_L = IoU(\mathbf{L}_t, \mathbf{L}_j)$, $s_P = \mathbf{P}_t \cdot \mathbf{P}_j / (\|\mathbf{P}_t\| \cdot \|\mathbf{P}_j\|)$, and $s_O = O_j$. In this formulation, $s_L$ represents the spatial proximity of the bounding boxes encoded by the target detection $\mathbf{d}_t$ and the proposal $\mathbf{d}_j$, measured using the Intersection over Union (IoU); the term $s_P$ evaluates the similarity between the class probabilities of the target detection and the proposal using cosine similarity; and $s_O$ incorporates the objectness score of the proposal $O_j$. It is important to note that for a detection target $\mathbf{d}_t$ there may be more than one detection proposal $\mathbf{d}_j$, and therefore multiple values of $S_{I,f}(\mathbf{d}_t, \mathbf{d}_j)$. As explained in D-RISE, the explanations consider only the detection with maximal score for each mask:
$$S_{I,f}(\mathbf{d}_t, f(M_i \odot I)) = \max_{\mathbf{d}_j \in f(M_i \odot I)} S_{I,f}(\mathbf{d}_t, \mathbf{d}_j). \tag{5}$$
Given the YOLOv8 outputs explained in Section 4.2, which do not provide the class probability vector $\mathbf{P}_i$ without modifying its architecture (an approach we want to avoid within the scope of this paper), we must adapt the similarity score to only consider $s_L$ and $s_O$. Consequently, the modified similarity score can be expressed as:
𝑆𝐼,𝑓 (𝐝𝑡,𝐝𝑗)=𝑠𝐿(𝐝𝑡,𝐝𝑗)⋅𝑠𝑂(𝐝𝑡,𝐝𝑗)=𝐼𝑜𝑈(𝐋𝑡,𝐋𝑗)⋅𝑂𝑗.(6)
This adjus men allows s ill u ilizing D-RISE effec i ely o gene -
a ing saliency maps wi h he de aul YOLO 8 model, ocusing on he
spa ial and objec ness aspec s o de ec ions, while main aining he in-
eg i y o he model’s o iginal a chi ec u e.
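As an illustration, the similarity scores of Equations (4) and (6), together with the maximum over proposals of Equation (5), can be sketched as follows. The dictionary layout of a detection (`box`, `objectness`, `probs`), the corner box format and the function names are assumptions made for this example only.

```python
import numpy as np


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def similarity(target, proposal, use_class_probs=True):
    """Eq. (4) when class probabilities are available, Eq. (6) otherwise."""
    s_l = iou(target["box"], proposal["box"])
    s_o = proposal["objectness"]
    if not use_class_probs:
        return s_l * s_o
    p_t = np.asarray(target["probs"], dtype=float)
    p_j = np.asarray(proposal["probs"], dtype=float)
    s_p = float(p_t @ p_j / (np.linalg.norm(p_t) * np.linalg.norm(p_j)))
    return s_l * s_p * s_o


def mask_score(target, proposals, use_class_probs=True):
    """Eq. (5): best similarity over all proposals; 0 if nothing is detected."""
    return max((similarity(target, d, use_class_probs) for d in proposals),
               default=0.0)
```

Returning 0 when the masked image yields no proposals is a design choice of this sketch: a mask that suppresses the detection entirely then contributes no weight to the saliency map.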
Similarly, we can adopt this similarity score but apply it with a different mask generation process. The MFPP method introduced in Section 3.2, originally designed for classification tasks, can be extended by applying Equation (6), resulting in D-MFPP. To the best of our knowledge, no previous work has proposed this variant of MFPP for object detection tasks.
4.4. Metrics
Evaluating the performance of attribution-based explainability methods for image data involves assessing how well the generated relevance heatmaps highlight important regions of the input image that contribute to the model's decision. Generally, according to [13], explanation quality metrics can be grouped into six categories based on their logical similarity: faithfulness, robustness, localization, complexity, randomization, and axiomatic metrics. In this study, we focus on two of these categories that are particularly relevant for object detection: localization (Section 4.4.1) and faithfulness (Section 4.4.2).
4.4.1. Localization
Localization metrics evaluate whether the explainable evidence is centered around a region of interest (RoI) defined by a bounding box, segmentation mask, or a cell within a grid. These metrics aim to verify whether the saliency maps correctly highlight the areas in the image that contain the object of interest. Among them, our experiments will consider:
• Pointing Game (PG), which is a human evaluation metric introduced in [44]. If the highest saliency point lies inside the human-annotated bounding box of an object, it is counted as a hit. The PG accuracy is given by:

$$\mathrm{PG} = \frac{\#\mathit{Hits}}{\#\mathit{Hits} + \#\mathit{Misses}}, \tag{7}$$

which is averaged over all categories in the dataset.
• Energy-based Pointing Game (EBPG) [39], which measures the proportion of activations within the given bounding box relative to the whole activation in the image. It assesses how much of the model's activation energy is concentrated within the predefined region of interest. Formally:

$$\mathrm{EBPG} = \frac{\sum_{(x,y) \in \mathrm{bbox}} S_{I,f}(x, y)}{\sum_{(x,y) \in \mathrm{bbox}} S_{I,f}(x, y) + \sum_{(x,y) \notin \mathrm{bbox}} S_{I,f}(x, y)}, \tag{8}$$

where $S_{I,f}(x, y)$ represents the saliency score at pixel $(x, y)$, $\sum_{(x,y) \in \mathrm{bbox}} S_{I,f}(x, y)$ represents the sum of activation values within the bounding box, and $\sum_{(x,y) \notin \mathrm{bbox}} S_{I,f}(x, y)$ represents the sum of activation values outside the bounding box.
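Both localization metrics reduce to a few array operations. The sketch below is illustrative and rests on assumptions not stated in the text: the saliency map is a non-negative 2D array, boxes are given as (x1, y1, x2, y2) in pixel indices with exclusive upper bounds, and the function names are hypothetical.

```python
import numpy as np


def pointing_game_hit(saliency, box):
    """True if the most salient pixel falls inside the annotated box."""
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    x1, y1, x2, y2 = box
    return bool(x1 <= x < x2 and y1 <= y < y2)


def pointing_game_accuracy(hits):
    """Eq. (7): #Hits / (#Hits + #Misses) over a collection of boolean outcomes."""
    hits = list(hits)
    return sum(hits) / len(hits)


def ebpg(saliency, box):
    """Eq. (8): share of the total saliency energy that lies inside the box."""
    x1, y1, x2, y2 = box
    total = saliency.sum()
    inside = saliency[y1:y2, x1:x2].sum()
    return float(inside / total) if total > 0 else 0.0
```

PG only inspects the single maximum of the map, whereas EBPG integrates over the whole map, so a method can score a perfect PG while concentrating little of its energy inside the box.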
4.4.2. Faithfulness
Metrics accounting for faithfulness quantify to what extent explanations follow the predictive behavior of the model, asserting that more important features play a larger role in model outcomes. These metrics focus on understanding the causal relationship between input features and the model's output by systematically altering the features and observing the changes in predictions. Among them:
• Deletion: Inspired by the work in [4], the Deletion metric was proposed in RISE [25]. This metric measures the decrease in the probability of the predicted class as more and more important pixels are removed, where the importance is obtained from the saliency map. A sharp drop, and thus a low Area Under the probability Curve (AUC, as a function of the fraction of removed pixels), indicates a good explanation. Given the importance score of each pixel calculated by any XAI method, $S_{I,f}$, we can formulate the Deletion metric as:

$$\mathrm{Deletion}(I, S, c) = \mathrm{AUC}\left(\left\{\Pr\left(f(I \odot M_k) = c\right)\right\}_{k=1}^{K}\right), \tag{9}$$

where $I$ is the original image, $M_k$ represents a mask with the top $k$ most important pixels (ranked by $S_{I,f}$) removed, $\Pr(f(I \odot M_k) = c)$ represents the probability of model $f$ predicting that the bounding box belongs to class $c$, and $\mathrm{AUC}(\cdot)$ computes the area under the curve of the $K$ predictions.
• Minimum Subset: It follows the same logic as Deletion, but instead of determining the AUC, it considers the number of pixels that must be removed for the prediction to change [10]. Given the importance score of each pixel ($S_{I,f}$), Min-Subset is defined as the smallest subset of pixels that needs to be removed to change the model's prediction. Mathematically:

$$\text{Min-Subset}(I, S, c) = \min\left\{k \in \{1, 2, \ldots, K\} : f(I \odot M_k) \neq f(I)\right\}, \tag{10}$$

where $f(I \odot M_k)$ represents the class label assigned by the model $f$ after passing the image $I$ with the top $k$ most important pixels removed, and $f(I)$ is the class label predicted for the original image.
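For a classifier, Equations (9) and (10) can be sketched as below. This is a simplified illustration under several assumptions: `prob_fn` and `label_fn` are hypothetical callables standing in for the model, removed pixels are set to zero, each of the $K$ steps removes an equal share of pixels, and the AUC is computed with the trapezoidal rule over the fraction of removed pixels.

```python
import numpy as np


def deletion_masks(saliency, n_steps):
    """Yield masks M_1..M_K that zero out progressively more top-ranked pixels."""
    order = np.argsort(saliency, axis=None)[::-1]  # most important first
    per_step = int(np.ceil(order.size / n_steps))
    for k in range(1, n_steps + 1):
        mask = np.ones(saliency.size, dtype=np.float32)
        mask[order[:k * per_step]] = 0.0
        yield mask.reshape(saliency.shape)


def deletion_auc(image, saliency, prob_fn, n_steps):
    """Eq. (9): area under the class-probability curve over the K steps."""
    y = np.array([prob_fn(image * m) for m in deletion_masks(saliency, n_steps)])
    x = np.arange(1, n_steps + 1) / n_steps  # fraction of removed pixels
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))  # trapezoidal rule


def min_subset(image, saliency, label_fn, n_steps):
    """Eq. (10): first step k at which the predicted label changes (None if never)."""
    original = label_fn(image)
    for k, m in enumerate(deletion_masks(saliency, n_steps), start=1):
        if label_fn(image * m) != original:
            return k
    return None
```

On a 4 × 4 image of ones with a mean-intensity toy "model", the probability decays linearly with the removed fraction, which makes the resulting AUC easy to verify by hand.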
4.4.3. Proposed D-Deletion and D-Minimal Subset metrics
Originally, Deletion was designed for classifiers. However, with object detectors, multiple detections in a single image can occur. Although D-RISE stated the necessity to adapt this metric to object detectors [26], no formal definition can be found in the literature. Therefore, considering the importance of this issue in real use cases, we formally re-define Equation (9) in two manners:
1. Deletion. Measures the explanation given the target class label $C_t$ (regardless of whether there is more than one element of a class) and iteratively removes the top $k$ most important pixels:

$$\mathrm{Deletion}(I, S, C_t) = \mathrm{AUC}\left(\left\{\max_{\mathbf{d}_j^k}\left[O_j^k \cdot \mathbb{I}\{C_j^k = C_t\}\right]\right\}_{k=1}^{K}\right). \tag{11}$$

The model $f(\cdot)$ takes as input the masked image $I \odot M_k$ and outputs a set of bounding box proposals $\mathbf{d}_j^k = [\mathbf{L}_j^k, O_j^k, C_j^k]$. The indicator function $\mathbb{I}\{C_j^k = C_t\}$ equals 1 if the predicted class $C_j^k$ matches the target class $C_t$, and 0 otherwise. The term $\max_{\mathbf{d}_j^k}[O_j^k \cdot \mathbb{I}\{C_j^k = C_t\}]$ selects the maximum objectness score $O_j^k$ of the bounding boxes where the predicted class matches the target class. The AUC is then computed over the set of prediction scores for the $K$ steps, where at each step the most important pixels are progressively removed.
2. D-Deletion. While the standard Deletion metric evaluates the impact of pixel removal on a class prediction, it lacks the ability to account for spatial localization, which is essential in object detection tasks where multiple instances of the same class can appear. D-Deletion addresses this limitation by focusing on a specific target bounding box $\mathbf{d}_t$, considering both the class information, $C_t$, and the IoU between the target and the other detected proposals, $\mathbf{d}_j^k$. This ensures that the metric not only measures faithfulness but also takes localization into account, providing more precise explanations in situations where different objects of the same class coexist. Mathematically, it is expressed as:

$$\text{D-Deletion}(I, S, C_t) = \mathrm{AUC}\left(\left\{\max_{\mathbf{d}_j^k}\left[O_j^k \cdot \mathbb{I}\{C_j^k = C_t\} \cdot \mathbb{I}\{\operatorname{IoU}(\mathbf{d}_t, \mathbf{d}_j^k) > \gamma\}\right]\right\}_{k=1}^{K}\right), \tag{12}$$

where $\gamma$ is a threshold. As a consequence, when multiple elements of the same class are in an image, D-Deletion will only consider those proposals $\mathbf{d}_j^k$ predicted by the model whose IoU with the target bounding box $\mathbf{d}_t$ exceeds the predefined threshold $\gamma$.
The difference between Deletion and D-Deletion is illustrated in Fig. 4. This figure highlights how D-Deletion distinguishes between different objects of the same class by incorporating localization information, leading to more refined and accurate explanations (↓AUC in the figure's last row) when multiple objects of the same class are detected in an image. For the sake of clarity, we provide the pseudocode of Deletion in Algorithm 1, where the main difference with respect to D-Deletion lies in lines 10 to 12.
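A Python rendering of Algorithm 1 may help to see how little separates the two metrics. The following is a sketch under stated assumptions, not the implementation used in the experiments: `detect_fn` is a hypothetical callable returning a list of dictionaries with keys `box` (x1, y1, x2, y2), `objectness` and `cls`, removed pixels are set to zero, and passing `target_box=None` yields the class-only Deletion of Equation (11) while a target box yields the D-Deletion of Equation (12).

```python
import numpy as np


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def step_score(proposals, target_class, target_box=None, gamma=0.5):
    """Inner max of Eq. (11), or of Eq. (12) when a target box is given."""
    best = 0.0
    for d in proposals:
        if d["cls"] != target_class:
            continue  # class indicator is 0
        if target_box is not None and iou(target_box, d["box"]) <= gamma:
            continue  # IoU indicator is 0 (D-Deletion only)
        best = max(best, d["objectness"])
    return best


def detector_deletion(image, saliency, detect_fn, target_class, n_steps,
                      target_box=None, gamma=0.5):
    """AUC of the per-step scores while top-ranked pixels are removed."""
    order = np.argsort(saliency, axis=None)[::-1]  # most important first
    per_step = int(np.ceil(order.size / n_steps))
    scores = []
    for k in range(1, n_steps + 1):
        mask = np.ones(saliency.size, dtype=np.float32)
        mask[order[:k * per_step]] = 0.0
        proposals = detect_fn(image * mask.reshape(saliency.shape))
        scores.append(step_score(proposals, target_class, target_box, gamma))
    y = np.asarray(scores, dtype=np.float64)
    x = np.arange(1, n_steps + 1) / n_steps
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))  # trapezoidal rule
```

When a second instance of the target class stays visible, the class-only score remains high until that instance is also removed, whereas the score tied to the target box drops as soon as the explained object disappears.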
Fig. 4. Illustration of a collaborative workspace featuring two humans and a robotic arm. The first row shows the original image. The second row displays the image with the 10% most important pixels removed for each human, as identified by an XAI method. In the third row, the Deletion metric curve, which only considers class type, shows a high probability score even when the primary human is largely occluded by the other person. The fourth row presents the D-Deletion metric curve, which incorporates a localization component, providing a more accurate measure of explanation importance by considering the positions of entities within the image. A lower area under the curve indicates a better explanation.
Algorithm 1 Deletion Metric's Pseudocode for Object Detector.
Require: Image $I$, saliency map $S_{I,f}$, number of steps $K$, target detection $\mathbf{d}_t$, target class $C_t$
 1: Initialize $S \leftarrow [\,]$
 2: for $k = 1$ to $K$ do
 3:     $M_k \leftarrow S_{I,f}$ removing the top $k$ most important pixels
 4:     Apply mask $M_k$ to image $I$
 5:     Forward pass through the model $f$ and obtain the bounding box proposals $\mathbf{d}_j = [\mathbf{L}_j, O_j, C_j] = f(I \odot M_k)$
 6:     Initialize list of proposals: proposals $\leftarrow [\,]$
 7:     for each bounding box $\mathbf{d}_j$ predicted by the model $f$ do
 8:         if $C_j = C_t$ then
 9:             proposals.append($O_j$)
                % For D-Deletion (in place of line 9)
10:             if $\operatorname{IoU}(\mathbf{d}_t, \mathbf{d}_j) > \gamma$ then
11:                 proposals.append($O_j$)
12:             end if
13:         else
14:             proposals.append(0)
15:         end if
16:     end for
17:     Insert the maximum score within the deletion buffer: $S \leftarrow S \cup \max(\text{proposals})$
18: end for
19: Calculate the Deletion metric as the area under the curve: $D = \mathrm{AUC}(S)$
20: return Deletion score $D$

Lastly, akin to how Min-Subset is related to Deletion, D-Min-Subset is associated with D-Deletion. Consequently, D-Min-Subset considers both the class type and the IoU to determine the number of pixels required to make the prediction change:

$$\text{D-Min-Subset}(I, S, C_t) = \min\left\{k \in \{1, 2, \ldots, K\} : C_j^k \neq C_t \ \text{or}\ \operatorname{IoU}(\mathbf{d}_t, \mathbf{d}_j^k) < \gamma\right\}, \tag{13}$$

where $C_j^k$ represents the predicted class label for detection $j$ when passing the masked image $I \odot M_k$ through the model $f$, with $\mathbf{d}_j^k = f(I \odot M_k)$ being the set of detections after removing the top $k$ most important pixels. In this context, D-Min-Subset depends on two conditions: (1) the class label $C_j^k$ of the predicted bounding box $\mathbf{d}_j^k$ must no longer match the target class $C_t$, or (2) the IoU between the target bounding box $\mathbf{d}_t$ and the predicted bounding box $\mathbf{d}_j^k$ falls below the threshold $\gamma$. The minimum $k$ is identified as the step where either of these conditions is first met.
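Equation (13) does not state how the condition is aggregated over the proposals $j$. The sketch below adopts one possible reading, flagged here as an assumption: the detection counts as changed at the first step where no remaining proposal matches the target in both class and IoU. The `detect_fn` interface (a list of dictionaries with keys `box`, `objectness`, `cls`) and the function names are hypothetical, and `target_box=None` falls back to a class-only Min-Subset.

```python
import numpy as np


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def d_min_subset(image, saliency, detect_fn, target_class, n_steps,
                 target_box=None, gamma=0.5):
    """First step k at which the target detection is lost (None if never)."""
    order = np.argsort(saliency, axis=None)[::-1]  # most important first
    per_step = int(np.ceil(order.size / n_steps))
    for k in range(1, n_steps + 1):
        mask = np.ones(saliency.size, dtype=np.float32)
        mask[order[:k * per_step]] = 0.0
        proposals = detect_fn(image * mask.reshape(saliency.shape))
        still_detected = any(
            d["cls"] == target_class
            and (target_box is None or iou(target_box, d["box"]) >= gamma)
            for d in proposals
        )
        if not still_detected:
            return k
    return None
```

With two instances of the same class in the scene, the class-only variant keeps counting until the second instance also disappears, so it returns a much larger step than the variant tied to the target box.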
5. Experiments and results
Contrary to most studies in the XAI literature, which primarily focus on benchmark datasets, our research work focuses on assessing the explainability of object detectors on real-world industrial data. In this context, to evaluate the effectiveness of explanations, we formulate four key research questions and answer them with empirical evidence:
• RQ1: Which XAI method provides the most reliable and insightful explanations for object detection models?
• RQ2: Does the D-Deletion metric enhance the trustworthiness of XAI outputs when multiple objects of the same class are present in the image?
• RQ3: How does the mask generation process influence the quality of explanations, particularly when using similarity scores for object detection? How does D-MFPP behave?
• RQ4: Do different image dimensions impact the explanations generated by XAI methods? Do models of varying sizes (large, medium, small, nano) focus on different regions of the image in their explanations?
Next, we outline the hyperparameters used across our experiments to ensure consistency in training and evaluation. For both datasets, models were trained using the YOLOv8 architecture for a total of 100 epochs. The image size (imgsize) was set to the largest dimension of the input image (e.g., 720 × 1280 → 1280), and data augmentation techniques such as random horizontal flipping and color jitter were applied. For consistency, the default Ultralytics settings were used wherever applicable. In the case of LIME, we use the baseline implementation of [29], adopting the SLIC segmentation algorithm [2] (with 100 segments) and generating 1000 samples to assess the quality of the produced explanations. For RISE and D-RISE, we employed 5000 masks with a probability of 0.25 and a resolution of 16 × 16 to produce the saliency maps. Lastly, for all object detection predictions, a confidence threshold of 0.7 was set to determine the validity of each detection.

Table 1
Quantitative metrics of LIME, RISE and D-RISE for the Human-Robot dataset. The table presents the performance of each XAI technique in terms of classification (Deletion, D-Deletion, Min-Subset, D-Min-Subset) and localization metrics (PG and EBPG), with scores reported for each class (Human, Gripper) and the overall average. Lower values are better for metrics marked with ↓, while higher values are better for those marked with ↑. Bold values indicate the best average scores across all objects, highlighting the best-performing XAI method for each metric. Values highlighted in gray represent the best scores for each object category (Human or Gripper) and should be interpreted vertically.

| XAI Method | Object | Deletion (↓) | D-Deletion (↓) | Min-Subset (↓) | D-Min-Subset (↓) | PG (↑) | EBPG (↑) |
|---|---|---|---|---|---|---|---|
| LIME | Human | 0.0759 | 0.0632 | 4.2703 | 4.2703 | 1.0000 | 34.984 |
| LIME | Gripper | 0.4688 | 0.0324 | 1.3859 | 1.3859 | 1.0000 | 2.2270 |
| LIME | Average | 0.2723 | 0.0478 | 2.8281 | 2.8281 | 1.0000 | 18.6060 |
| RISE | Human | 0.1827 | 0.1241 | 9.0108 | 9.0108 | 0.7500 | 19.0824 |
| RISE | Gripper | 0.2637 | 0.0060 | 0.6355 | 0.63 | 1.0000 | 1.0542 |
| RISE | Average | 0.2232 | 0.0651 | 4.8232 | 4.8232 | 0.875 | 10.0683 |
| D-RISE | Human | 0.1255 | 0.0818 | 5.7335 | 5.7335 | 0.8750 | 20.4061 |
| D-RISE | Gripper | 0.2777 | 0.0059 | 0.6091 | 0.6091 | 1.0000 | 1.0815 |
| D-RISE | Average | 0.2016 | 0.0438 | 3.1713 | 3.1713 | 0.9375 | 10.7438 |

Fig. 5. Heatmaps obtained by applying RISE (left) and D-RISE (right) to the detection of a human in the Human-Robot Dataset.
In what follows we present and discuss the results obtained to answer each of the RQs formulated above:
RQ1: Comparison between XAI methods
In the Human-Robot dataset, the comparison between LIME, RISE, and D-RISE, as shown in Table 1, reveals distinct strengths across different metrics (Section 4.4). LIME performs well in terms of localization, with higher PG and EBPG scores (100% and 18.60%, respectively) compared to D-RISE (93.75% and 10.74%). This indicates that LIME generates more localized saliency maps, focusing closely on the bounding boxes of detected objects. However, this superior performance is partly due to the size of the object being analyzed. LIME's superpixel generation is better suited to large objects (e.g., human), as large regions of the image can be grouped effectively into meaningful segments, leading to higher localization scores. This advantage also applies to classification, where large objects allow LIME to better preserve relevant features for detection. Conversely, for smaller objects (e.g., gripper), LIME struggles when compared to the other methods, as reflected by its worse performance metrics in those cases.
In contrast, RISE and D-RISE are less sensitive to object size, making them more robust across different object scales, which is evident in their better performance on smaller objects like the gripper. They achieve Deletion scores of 0.2637 and 0.2777, respectively, compared to LIME's 0.4688. When considering the overall performance across classes, RISE, with a Deletion score of 0.2232 and D-Deletion of 0.0651, shows improvement over LIME in classification-related tasks but still lags behind D-RISE, which achieves the lowest Deletion (0.2016) and D-Deletion (0.0438) scores. Although D-RISE offers the best balance between classification and localization, the difference between RISE and D-RISE is minimal in this dataset, where each image contains only a single object per class. As a result, as shown in Fig. 5, their heatmaps are very similar to each other, both highlighting the human head. However, D-RISE eliminates less relevant areas more effectively.

Fig. 6. Explanations of a scene using different stride and window size configurations when using D-Sliding Window (combination of mask generation explained in Section 3.2 and Equation (6)). (Left) Stride of 16; (Right) Stride of 8; (Top) Window size of 32; (Bottom) Window size of 64.
In the results obtained over the Battery Assembly dataset (Table 2), a similar pattern can be noticed. LIME excels at localization with an average EBPG of 16.03%, while RISE and D-RISE perform better in retaining key classification features. Since this dataset includes multiple objects of the same class (e.g., multiple batteries), both LIME and RISE, which are not designed to handle multiple detections of the same class, expose severe limitations. RISE, with a D-Deletion score of 0.1474, preserves key features better than LIME, but is outperformed by D-RISE, which achieves a score of 0.0344. D-RISE also shows the highest PG score (97.14%), performing significantly better than LIME (76.85%) and RISE (66.95%).
Overall, when dealing with datasets containing only one object per class, the differences between LIME, RISE, and D-RISE are relatively small in quantitative terms. However, when multiple objects of the same class appear in a given input image, D-RISE clearly dominates over the rest of the techniques. As illustrated in Fig. 7, D-RISE generates coherent heatmaps for each detected object in the Battery Assembly dataset, whereas LIME and RISE provide a global saliency map for the entire class. By combining the individual saliency maps from D-RISE, a more accurate and object-specific explanation can be produced. This also highlights the limitations of LIME and RISE when applied to multiple objects, as their global saliency maps do not differentiate between individual instances.

Table 2
Quantitative metrics of LIME, RISE and D-RISE for the Battery Assembly dataset. The table presents the performance of each XAI technique in terms of classification (Deletion, D-Deletion, Min-Subset, D-Min-Subset) and localization metrics (PG and EBPG), with scores reported for each object (indiv bat, bms a, bms b, unknown object, bat holder) and the overall average. Lower values are better for metrics marked with ↓, while higher values are better for those marked with ↑. Bold values indicate the best average scores across all objects, highlighting the best-performing XAI method for each metric. Gray-highlighted values represent the best scores for each object category (indiv bat, bms a, bms b, unknown object or bat holder) and should be interpreted vertically.

| XAI Method | Object | Deletion (↓) | D-Deletion (↓) | Min-Subset (↓) | D-Min-Subset (↓) | PG (↑) | EBPG (↑) |
|---|---|---|---|---|---|---|---|
| LIME | indiv bat | 0.7549 | 0.2806 | 44.3412 | 14.6756 | 0.3188 | 2.5347 |
| LIME | bms a | 0.0184 | 0.0181 | 0.8784 | 0.8784 | 1.0000 | 30.7573 |
| LIME | bms b | 0.0125 | 0.0125 | 0.8321 | 0.8321 | 1.0000 | 16.5440 |
| LIME | unknown object | 0.0624 | 0.0245 | 2.0342 | 2.0342 | 1.0000 | 20.1071 |
| LIME | bat holder | 0.1498 | 0.0849 | 6.8115 | 4.3766 | 0.5238 | 10.2087 |
| LIME | Average | 0.1996 | 0.0841 | 10.9795 | 4.5594 | 0.7685 | 16.0304 |
| RISE | indiv bat | 0.7659 | 0.4359 | 81.2344 | 32.0482 | 0.0144 | 1.2558 |
| RISE | bms a | 0.0190 | 0.0190 | 1.9417 | 1.9417 | 1.0000 | 3.0409 |
| RISE | bms b | 0.0217 | 0.0217 | 2.1266 | 2.1266 | 1.0000 | 2.4561 |
| RISE | unknown object | 0.5333 | 0.2008 | 3.8372 | 3.8372 | 1.0000 | 5.5780 |
| RISE | bat holder | 0.0902 | 0.0595 | 7.9519 | 5.5632 | 0.3333 | 3.3681 |
| RISE | Average | 0.2860 | 0.1474 | 19.4184 | 9.1034 | 0.6695 | 3.1398 |
| D-RISE | indiv bat | 0.6214 | 0.0311 | 35.7678 | 2.7725 | 1.0000 | 2.0546 |
| D-RISE | bms a | 0.0181 | 0.0181 | 1.9880 | 1.9880 | 1.0000 | 3.2955 |
| D-RISE | bms b | 0.0128 | 0.0116 | 1.3407 | 1.3407 | 1.0000 | 2.8876 |
| D-RISE | unknown object | 0.0879 | 0.0485 | 3.7448 | 4.2071 | 0.8571 | 7.6831 |
| D-RISE | bat holder | 0.4839 | 0.0626 | 21.7291 | 5.1009 | 1.0000 | 4.9635 |
| D-RISE | Average | 0.2448 | 0.0344 | 12.9141 | 3.0819 | 0.9714 | 4.1768 |
RQ2: D-Deletion metric for scenes with multiple objects of the same class
As a secondary observation in the experiments for RQ1, the D-Deletion metric is specifically designed to overcome the limitations of traditional deletion metrics, particularly when multiple objects of the same class are present in an image.
In the Battery Assembly dataset, where several instances of the same class (e.g., indiv bat) appear, D-Deletion demonstrates clear advantages. As Table 2 shows, RISE performs reasonably well with an average Deletion score of 0.2860 but still obtains a relatively high D-Deletion score of 0.1474, suggesting that it struggles to differentiate between the contributions of individual objects. In contrast, D-RISE, which obtains an average Deletion score of 0.2448, outperforms RISE with a D-Deletion score of 0.0344. This highlights D-RISE's ability to isolate and preserve key features of each object, providing more trustworthy, object-specific explanations rather than broad, class-level insights.
The Min-Subset and D-Min-Subset metrics, which measure the minimal proportion of pixels needed to disrupt a detection, reinforce these findings. In the Human-Robot dataset (Table 1), where only one object per class appears, the differences between Deletion and D-Deletion scores are minor, and the Min-Subset and D-Min-Subset values are close to each other. However, in the Battery Assembly dataset, where the differences between Deletion and D-Deletion are more substantial and multiple objects of the same class co-occur in the same image, the average Min-Subset (12.9141) and D-Min-Subset (3.0819) values of D-RISE also diverge significantly.
RQ3: Influence of the mask generation strategy
When comparing XAI approaches for object detection tasks configured with different mask generation techniques, the results in Tables 3 and 4 initially suggest that D-Sliding Window performs the best in almost all metrics. However, as noted in the captions, this is only in cases where explanations were provided. For the Human-Robot dataset (Table 3), regardless of the window size and stride, D-Sliding Window failed to provide explanations for large objects, such as humans, and only provided meaningful explanations for smaller objects like gripper instances. Similarly, in the Battery Assembly dataset (Table 4), D-Sliding Window struggled with large objects when using a smaller window size (w=32), which led to higher scores in classification metrics. Even with
Fig. 7. Heatmaps generated in a scene of the Battery Assembly dataset for two target classes: individual battery (top row) and battery holder (bottom row). The first and second columns show the saliency maps obtained using LIME and RISE, independent of the number of objects of the same class in the image. Columns 3, 4 and 5 display heatmaps generated using D-RISE for three different individual elements of the same class.