CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
Songlong Xing¹, Zhengyu Zhao²*, Nicu Sebe¹
¹University of Trento, Italy ²Xi'an Jiaotong University, China
{songlong.xing, niculae.sebe}@unitn.it [email protected]
Abstract
Despite its prevalent use in image-text matching tasks in a zero-shot manner, CLIP has been shown to be highly vulnerable to adversarial perturbations added onto images. Recent studies propose to finetune the vision encoder of CLIP with adversarial samples generated on the fly, and show improved robustness against adversarial attacks on a spectrum of downstream datasets, a property termed zero-shot robustness. In this paper, we show that malicious perturbations that seek to maximise the classification loss lead to ‘falsely stable’ images, and propose to leverage the pre-trained vision encoder of CLIP to counterattack such adversarial images during inference to achieve robustness. Our paradigm is simple and training-free, providing the first method to defend CLIP from adversarial attacks at test time, which is orthogonal to existing methods aiming to boost zero-shot adversarial robustness of CLIP. We conduct experiments across 16 classification datasets, and demonstrate stable and consistent gains compared to test-time defence methods adapted from existing adversarial robustness studies that do not rely on external networks, without noticeably impairing performance on clean images. We also show that our paradigm can be employed on CLIP models that have been adversarially finetuned to further enhance their robustness at test time. Our code is available here.
1. Introduction
With the increasing availability of image-text data and the advancement of self-supervised learning techniques [7,8,16], vision-language models (VLMs) have continued to spark research interest in both academia and industry [21,28,39,40,42,56]. As a representative VLM, CLIP [39] has shown impressive abilities to match an image with its descriptive text in a zero-shot manner. However, recent studies have shown that adding small imperceptible perturbations to an image can cause CLIP to misclassify it [26,27,32,44,50,59,63], a common problem plaguing nearly all neural networks [2,6,25,29,33,36,48,57,58]. As foundational models are deployed in real-world applications, their safety and reliability have become a pressing concern. In this paper, we focus on the robustness of CLIP against adversarial perturbations.
*Corresponding author
Figure 1. Test-time counterattacks harness the expressive power of CLIP to generate a counterattack to defend CLIP against adversaries without finetuning the vision encoder.
Unlike conventional models for which adversarial robustness has been extensively studied [2,6,11,33,47,48], CLIP is a pre-trained foundation model that has learned massive amounts of real-world knowledge, and should be dealt with carefully to minimise damage to its generalisation abilities. Adversarial robustness of CLIP has just started to garner research attention [26,32,44,50] in recent years. Existing efforts fall into two categories. The first type is based on adversarial training [3,58], which alternately generates adversarial images on one dataset and uses them to finetune the vision encoder of CLIP [32,50]. This type of method, known as adversarial finetuning (AFT), dynamically mimics a min-max game between CLIP and the threat model in the finetuning phase, and deploys the finetuned model in a wide variety of downstream classification tasks without further training. This method shows transferable robustness to downstream datasets, a property termed zero-shot robustness [32,50]. The other type of method resorts to prompt tuning [61,62], which inserts learnable text tokens in the embedding space, aligns the corresponding text prompts with adversarial images, and tunes the learnable tokens by propagating gradients to the text embeddings. This approach is known as adversarial prompt tuning (APT) [26,59]. Although these methods have shown improved robustness over the original CLIP, there are apparent limitations. Firstly, they require time-consuming training, especially adversarial finetuning, which involves generating adversarial images on the fly. Secondly, the model overfits to the training data, which compromises generalisation on clean and adversarial images from other data distributions. In the case of adversarial finetuning, the adversarially finetuned CLIP outperforms the original CLIP on clean images from the dataset used for finetuning, indicating that the model has overfitted to the distribution of the dataset (see Tab. 1 in Sec. 4). Thirdly, for adversarial finetuning methods [32,50], robustness to adversarial samples comes at the cost of a significant decline of classification accuracy on clean images.
This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.
To address these limitations, we draw inspiration from existing adversarial robustness studies that robustify non-foundational target models at test time [1,17,31,52], and propose a test-time paradigm that utilises the expressive power of CLIP to defend itself from adversarial attacks (Fig. 1). Following previous studies, we focus on adversaries that aim to maximise the classification loss of CLIP given a test image. We observe that such adversaries cause images to be more stable when a small random noise is added, compared to clean images. We term this behaviour of adversarial images ‘false stability’, which can be interpreted as the images being trapped in a toxic surrounding in the latent space by an adversary. Intuitively, the pre-trained vision encoder of CLIP is highly expressive, and can be leveraged to push the adversarial image away from its toxic embedding. Therefore, we propose to employ the vision encoder of CLIP to counteract the false stability of adversarial images, thereby achieving robustness to these attacks. Since no label information is available at test time, we formulate the test image as an anchor, and iteratively update the counterattack perturbation such that it maximises the $L_2$ distance with the anchor in the embedding space [44]. However, pushing the test image away from its embedding risks hurting performance on clean images. To address this, we propose τ-thresholded weighted counterattacks, which employ a threshold to prevent further counterattacking if the test image does not exhibit false stability, thus preserving performance on clean images. To the best of our knowledge, our paradigm is the first work to utilise the expressive power of CLIP to defend itself from adversarial attacks, and can be categorised as a test-time input purification method for VLMs. We conduct extensive experiments and analyses on 16 classification datasets, establishing our method as an effective test-time defence for CLIP. We summarise the main contributions of this paper as follows:
• We propose the first method that harnesses the power of CLIP to defend itself from adversarial attacks at inference time without relying on any auxiliary networks. Our method is simple and training-free, and can be easily employed in other VLMs.
• We propose a test-time counterattack method based on Projected Gradient Descent (PGD). We show that our method can defend CLIP with a small number of counterattack steps without significantly impacting performance on clean images.
• We conduct experiments across 16 classification datasets, and demonstrate superior performance compared to test-time defences adapted from existing adversarial robustness literature. Our paradigm can be employed on adversarially finetuned CLIP to further enhance robustness performance at test time.
2. Related Work
Adversarial Robustness. Since the early development of deep neural networks [18,24,49], it has been found that they are vulnerable to adversarial attacks. Specifically, a small adversarial perturbation bounded by an $L_p$-radius ball, usually imperceptible to humans, can cause the network to misclassify the sample entirely [6,48]. To address this vulnerability, adversarial training (AT) [29,41,58] alternately generates adversarial samples with the target model on the fly, and trains the network with these adversarial samples. This practice has shown significantly improved robustness to adversarial attacks, and has become a de facto standard in adversarial machine learning, despite the presence of limitations such as expensive training [45,51]. Other types of methods have also been proposed, of which the most related to this work is test-time defence. This can be achieved by purifying the test image with an auxiliary generative model [34,43,55], or by adjusting the test image based on an objective [1,20,31,52]. However, subsequent studies show that test-time defence can be circumvented by adaptive attacks specially designed for the defence [12]. Amongst these methods, Hedge Defense (HD) [52] is the most closely related to our work. They attack the test image by maximising the cross-entropy loss with respect to all classes, based on the finding that the loss surface is smoother around the ground-truth label. An important difference between HD and this work is that they apply the defence method on an adversarially trained model. In contrast, we focus on CLIP and show that foundation models like CLIP possess the inherent ability to defend themselves against attacks that seek to maximise the classification loss, by producing a counterattack perturbation that leads the adversarial image away from its original embedding in the latent space, with no need of an adversarially trained model.
Figure 2. Pipeline to generate an adversarial perturbation $\delta$ given an image $x$ and its ground-truth label based on CLIP. Black and red arrows denote the forward and backward pass, respectively.
VLMs and Their Adversarial Robustness. Recently, adversarial robustness of foundation models has garnered increasing research attention [46,60]. This paper focuses on enhancing the adversarial robustness of CLIP [39] since it is a representative foundation model that aligns images and text. Existing methods can be divided into two types: (1) Adversarial finetuning. Mao et al. [32] propose TeCoA, which finetunes the vision encoder of CLIP using adversarial samples generated on the fly on one dataset, and transfers the learned robustness to downstream classification datasets. Based on this pipeline, Wang et al. [50] further propose to employ the original CLIP to guide adversarial finetuning by imposing two regularisation terms, showing improved generalisation on clean and adversarial images across downstream datasets. (2) Adversarial prompt tuning. This line of research is built on prompt tuning of CLIP [61,62], where model weights are kept frozen. Li et al. [26] show that textual prompts play an important role in the effectiveness of both adversarial attacks and robustness. They propose to insert learnable tokens and tune the tokens by aligning the textual prompt with adversarial images. Zhang et al. [59] propose a similar pipeline by training robust text tokens, assuming that the attacker has access to the model but not to the text prompts employed by the end user. In this work, we discard any training and show that CLIP possesses the ability to defend itself from adversarial attacks by counterattacking adversarial images. Our method is the first test-time defence method for CLIP.
3. Methodology
In this section, we first provide preliminaries regarding CLIP and adversarial robustness of CLIP in an image classification context. Then we proceed to introduce our test-time counterattack paradigm.
3.1. Preliminaries and Setup
Figure 3. Our test-time counterattack paradigm. We craft a counterattack perturbation $\delta_{tc}$ to lead an adversarial image away from its original embedding at test time without finetuning.
Zero-shot classification of CLIP. CLIP [39] is a vision-language model that matches images with their descriptive text. It has been contrastively pre-trained on 400 million image-text pairs. Specifically, it aligns an image $x$ with its corresponding text through the cosine similarity between their representations produced by a vision encoder $f_\theta(\cdot)$ and a text encoder $g_\phi(\cdot)$, respectively, where $\theta$ and $\phi$ are model weights. At inference time, CLIP performs classification in a zero-shot manner. Given a set of classes defined by their textual names $c_1, \dots, c_K$, CLIP matches a test image $x$ against the textual prompts corresponding to candidate class names wrapped by a template $T$ (usually ‘a photo of [CLASS]’):
$$s_i = \frac{f_\theta(x)^{\top} g_\phi(T(c_i))}{\|f_\theta(x)\| \cdot \|g_\phi(T(c_i))\|} \tag{1}$$
The probability of $x$ belonging to class $c_i$ is calculated as the normalized similarity $p_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)}$, and the candidate class with the highest probability is the predicted class.
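The matching rule in Eq. (1) and the normalisation that follows can be sketched on toy embeddings as below. This is a minimal illustration, not CLIP itself: the two-dimensional vectors stand in for the outputs of $f_\theta$ and $g_\phi$, and the helper names `cosine` and `zero_shot_probs` are hypothetical. Released CLIP models also scale the similarities by a learned temperature before the softmax, which the formula above omits and which this sketch leaves out as well.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors, as in Eq. (1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_emb, text_embs):
    """Softmax over the similarities s_i between one image and K class prompts."""
    sims = [cosine(image_emb, t) for t in text_embs]
    top = max(sims)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins for f_theta(x) and g_phi(T(c_i)), i = 1..3 (assumed values).
image_emb = [1.0, 0.0]
text_embs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

probs = zero_shot_probs(image_emb, text_embs)
pred = max(range(len(probs)), key=probs.__getitem__)  # index of the predicted class
```

Because the similarity is a cosine, the length of each embedding does not matter: the first prompt wins here even though its vector is twice as long as the image embedding.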
Adversarial attacks to CLIP. CLIP is highly vulnerable to adversarial perturbations [32]. In a setting where the attacker has full knowledge of the model weights and gradients of CLIP, a small perturbation $\delta$ bounded by an $L_p$-radius can be maliciously designed to cause CLIP to misclassify:
$$\delta_a = \arg\max_{\delta} L(x+\delta, c), \quad \text{s.t. } \|\delta\|_p \le \epsilon_a \tag{2}$$
where $c$ is the ground-truth label of $x$, $L$ is a loss function which is usually a cross-entropy loss, and $\epsilon_a$ is the attack budget, which ensures that the manipulation is subtle and imperceptible to humans. Eq. (2) can be approximated by Projected Gradient Descent (PGD) [6]. The adversarial image is the addition of the image and the perturbation, $x' := x + \delta_a$. This process is illustrated in Fig. 2.
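For an $L_\infty$ budget, one common way to approximate Eq. (2) is the iterative sign-gradient update below. It is given as a sketch of the standard PGD step rather than a detail stated in this section: the stepsize $\alpha$, the iteration index $t$ and the projection $\Pi$ onto the $\epsilon_a$-ball are notation assumed here, and the final iterate plays the role of $\delta_a$.

```latex
\delta^{(t+1)} = \Pi_{\|\delta\|_\infty \le \epsilon_a}
  \left( \delta^{(t)} + \alpha \cdot \mathrm{sign}\left( \nabla_{\delta} L\left(x + \delta^{(t)}, c\right) \right) \right)
```

For the $L_\infty$ ball, the projection reduces to clipping every coordinate of the perturbation to $[-\epsilon_a, \epsilon_a]$.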
Adversarial finetuning strengthens the adversarial robustness of CLIP [32,50] by alternately generating adversarial images $x'$ following Eq. (2) and using them to finetune $f_\theta$. TeCoA [32] performs finetuning by aligning $x'$ with the ground-truth text $T(c)$ on one dataset. PMG-AFT [50] imposes two CLIP-guided regularization terms on top of TeCoA to improve the generalization on clean and adversarial images. After the finetuning phase, CLIP has learned robustness to adversarial attacks, which transfers to downstream classification datasets without further training [32].
3.2. Test-time Counterattacks
Although adversarial finetuning has been shown to significantly improve CLIP’s adversarial robustness, limitations are apparent, such as cumbersome training involving the generation of adversarial samples and finetuning of the vision encoder weights $\theta$. In this paper, we investigate the ability of CLIP to defend itself at test time, with no need of any training, providing the first test-time defence method for CLIP. Our proposed paradigm, termed Test-time Counterattacks (TTC), is illustrated in Fig. 3. Following previous studies [26,32,50,59], we focus on attacks that aim to maximise the classification loss of CLIP.
Intuitively, a pre-trained vision encoder $f_\theta$ is highly expressive in capturing the nuanced pixel pattern in an image. In this sense, we speculate that an adversarial image that successfully fools CLIP is trapped in a toxic surrounding induced by the adversary, and that the pre-trained $f_\theta$ is able to lead the image away from this surrounding by utilising its expressiveness. Recently, Schlarmann et al. [44] propose an unsupervised attack method for CLIP, where a perturbation is updated such that it maximises the $L_2$ distance between the image and its original embedding. Inspired by this label-free attack method, we employ the same loss function in our paradigm. Specifically, we employ the original image embedding $f_\theta(x)$ as the anchor, and craft a test-time counterattack perturbation $\delta_{tc}$ such that the $L_2$ distance between the embedding of the counterattacked image $f_\theta(x+\delta_{tc})$ and the anchor $f_\theta(x)$ is maximised:
$$\delta_{tc} = \arg\max_{\delta} \|f_\theta(x+\delta) - f_\theta(x)\|, \quad \text{s.t. } \|\delta\|_p \le \epsilon_{tc} \tag{3}$$
This counterattack can also be approximated by PGD [6]. Since this counterattack is performed by the end user at test time, it does not need to be imperceptible, so the user could in principle choose a large counterattack budget $\epsilon_{tc}$. However, we hope to maintain a consistent attack style with existing studies, and still keep the counterattack budget low, bounded by an $L_p$-radius. In the experiments (Sec. 4), we show that a budget of $\epsilon_{tc} = 4/255$ is able to improve CLIP’s adversarial robustness significantly. Note that the vision encoder weights $\theta$ are kept frozen throughout. Among existing methods for non-foundational models, the most closely related to ours is hedge defense (HD) [52]. An important difference is that they employ HD on adversarially-trained models, whilst we show that CLIP without adversarial finetuning can harness the expressiveness of its vision encoder to defend itself.
3.3. τ-thresholded Weighted Counterattacks
Sec. 3.2 has discussed the idea of defending CLIP with its pre-trained vision encoder. An undesirable risk is that the counterattacks can hurt natural images as well. Based on the idea of TTC, we further propose τ-thresholded weighted counterattacks to counterattack adversaries effectively while reducing the impact on clean images.
Figure 4. Ratio of $L_2$ drift due to a random noise, shown for (a) CIFAR10 and (b) ImageNet. The value of $\tau$ is the average $\tau$ across 100 randomly selected samples.
Wu et al. [52] show that adversarial images are more vulnerable to a small noise than clean images. In this study, we find that adversarial images are actually more robust to small random noises, and are only vulnerable to sufficiently large noises, based on our analysis of adversarial images obtained by iterative attack methods (PGD [6] in our case). Specifically, we define a stochastic variable $\tau$ induced by a random noise $n \in \mathbb{R}^{C \times W \times H}$, $n \sim U(-\epsilon_{random}, \epsilon_{random})$, conditioned on an image $x \in \mathbb{R}^{C \times W \times H}$:
$$\tau = \frac{\|f_\theta(x+n) - f_\theta(x)\|}{\|f_\theta(x)\|} \tag{4}$$
which can be interpreted as the ratio of the $L_2$ drift in the latent space when a random noise $n$ is applied on an image. The values of $\tau$ are reported in Fig. 4 for ImageNet and CIFAR10. We report more results and analysis on $\tau$ for other datasets in Appendix (Sec. 7). As can be seen from Fig. 4, when a small random noise ($\epsilon_{random} = 1/255, 4/255$) is imposed, the ratio of $L_2$ drift in the latent space is unusually small for adversarial images, showing that they are trapped in a toxic surrounding and rendered ‘falsely stable’ by an adversary. Adversarial images only become vulnerable when the strength of random noise is increased, as evidenced by the disproportionately rising values of $\tau$. We term this behaviour of adversarial images obtained by maximising CLIP’s classification loss ‘false stability’, and provide more theoretical analysis in Appendix (Sec. 7).
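The drift ratio in Eq. (4) can be computed as in the sketch below. It uses a small hand-written function in place of the CLIP vision encoder, so the numbers say nothing about real images; `toy_encoder`, `l2` and `tau` are hypothetical names, and the four-value "image" is an assumption made to keep the example self-contained.

```python
import math
import random

def l2(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))

def toy_encoder(x):
    # Hypothetical stand-in for f_theta: maps 4 "pixels" to a 2-d embedding.
    return [math.tanh(x[0] + 2.0 * x[1]), math.tanh(x[2] - x[3]) + 1.0]

def tau(encoder, x, eps_random, rng):
    """Eq. (4): relative L2 drift of the embedding under uniform random noise."""
    noise = [rng.uniform(-eps_random, eps_random) for _ in x]
    anchor = encoder(x)
    drifted = encoder([xi + ni for xi, ni in zip(x, noise)])
    drift = l2([d - a for d, a in zip(drifted, anchor)])
    return drift / l2(anchor)

rng = random.Random(0)           # fixed seed so the sketch is reproducible
x = [0.2, 0.4, 0.6, 0.8]
tau_small = tau(toy_encoder, x, 1 / 255, rng)
tau_zero = tau(toy_encoder, x, 0.0, rng)   # no noise, so no drift
```

With a real encoder, averaging `tau` over many samples (100 in Fig. 4) and over several noise strengths gives the kind of curves from which a threshold can be read off.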
Built upon the analysis above, we propose τ-thresholded weighted counterattacks based on PGD [6]. Specifically, we follow a standard pipeline of PGD iterations, with the attack objective being Eq. (3). At the zero-th iteration, a random perturbation without any update, $\delta_{tc}^{0}$, is applied, and we compute the $\tau$ value based on Eq. (4) as an indicator. If it is higher than a user-defined threshold $\tau_{thres}$, meaning that the image is not ‘falsely stable’, we halt the counterattack and return the random noise $\delta_{tc}^{0}$. Otherwise, the counterattack is resumed. Note that the selection of $\tau_{thres}$ is dependent only on the $\tau$ value of clean images and the strength of random noise, irrespective of types and strengths of attacks. Since employing only one $\delta_{tc}$ may be suboptimal, we weight the counterattack perturbation vectors across all steps:
$$w_j = \frac{\exp(\beta \cdot j)}{\sum_{k=0}^{N} \exp(\beta \cdot k)} \tag{5}$$
$$\delta_{tc} = \sum_{j=0}^{N} w_j \delta_{tc}^{j} \tag{6}$$
where $\beta > 0$ is a hyperparameter controlling the ascending rate of weights, $N$ is the number of steps for performing the counterattack, and $\delta_{tc}^{j}$ is the counterattack perturbation obtained after $j$ steps. We summarise our τ-thresholded weighted counterattacks in Algorithm 1.

Algorithm 1 τ-thresholded weighted counterattacks.
Require: Test image $x$, pre-trained CLIP vision encoder $f_\theta$, counterattack budget $\epsilon_{tc}$, stepsize $\alpha$, number of steps $N$, user-defined parameters $\tau_{thres}$ and $\beta$.
procedure TEST-TIME COUNTERATTACKS
    $\delta_{tc}^{0} \sim U(-\epsilon_{tc}, \epsilon_{tc})$
    Compute $\tau$ based on Eq. (4) using $\delta_{tc}^{0}$
    if $\tau \ge \tau_{thres}$ then
        $w_0 = 1$
        return $\delta_{tc} = \delta_{tc}^{0}$
    else
        $w, \delta_{tc} := \{w_0\}, \{\delta_{tc}^{0}\}$, with $w_0$ given by Eq. (5)
        for $i = 1, 2, \dots, N$ do
            $\delta_{tc}^{i} = \Pi(\delta_{tc}^{i-1} + \alpha \nabla_{\delta} \|f_\theta(x+\delta_{tc}^{i-1}) - f_\theta(x)\|)$, where $\Pi$ projects onto the $\epsilon_{tc}$-ball
            $w_i = \exp(\beta \cdot i) / \sum_{j=0}^{N} \exp(\beta \cdot j)$ (Eq. (5))
            append $w_i$ to $w$ and $\delta_{tc}^{i}$ to $\delta_{tc}$
        end for
        return $\delta_{tc} = \sum_{i=0}^{N} w_i \cdot \delta_{tc}^{i}$ (Eq. (6))
    end if
end procedure
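The thresholding and weighting steps can be sketched end to end on a toy problem. The code below is an illustration under stated assumptions, not the authors' implementation: the encoder is a hypothetical linear map $f(x) = Wx$, chosen because the gradient of $\|f(x+\delta) - f(x)\| = \|W\delta\|$ has a closed form; the update uses the sign of the gradient, as is common for $L_\infty$ PGD, whereas Algorithm 1 writes the raw gradient; and the stepsize `alpha` and all function names are placeholders.

```python
import math
import random

def softmax_step_weights(n_steps, beta):
    """Eq. (5): w_j = exp(beta*j) / sum_k exp(beta*k) for j = 0..N."""
    exps = [math.exp(beta * j) for j in range(n_steps + 1)]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def l2(v):
    return math.sqrt(sum(a * a for a in v))

def drift_and_grad(W, delta):
    """For the toy encoder f(x) = W x: the drift ||W delta|| and its gradient."""
    Wd = matvec(W, delta)
    norm = l2(Wd)
    if norm == 0.0:
        return 0.0, [0.0] * len(delta)
    grad = [sum(W[r][c] * Wd[r] for r in range(len(W))) / norm
            for c in range(len(delta))]
    return norm, grad

def sign(a):
    return (a > 0) - (a < 0)

def ttc(W, x, eps_tc, alpha, n_steps, tau_thres, beta, rng):
    """Toy version of the tau-thresholded weighted counterattack."""
    delta0 = [rng.uniform(-eps_tc, eps_tc) for _ in x]        # delta^0
    drift, _ = drift_and_grad(W, delta0)
    tau = drift / l2(matvec(W, x))                             # Eq. (4), n = delta^0
    if tau >= tau_thres:
        return delta0, tau                                     # not falsely stable
    weights = softmax_step_weights(n_steps, beta)
    steps = [delta0]
    for _ in range(n_steps):
        _, grad = drift_and_grad(W, steps[-1])
        moved = [d + alpha * sign(g) for d, g in zip(steps[-1], grad)]
        # Projection onto the L-infinity ball of radius eps_tc.
        steps.append([max(-eps_tc, min(eps_tc, m)) for m in moved])
    # Eq. (6): weighted sum of the perturbations from every step.
    combined = [sum(w * s[c] for w, s in zip(weights, steps))
                for c in range(len(x))]
    return combined, tau

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]   # assumed toy encoder weights
x = [0.5, 0.5, 0.5]
eps = 4 / 255
delta, tau_value = ttc(W, x, eps, alpha=eps, n_steps=2, tau_thres=0.2,
                       beta=2.0, rng=random.Random(0))
weights = softmax_step_weights(2, 2.0)
```

With $N = 2$ and $\beta = 2.0$, the values used in Sec. 4, the weights come out at roughly 0.016, 0.117 and 0.867, so the final step dominates while the earlier perturbations still contribute. Because the weights sum to one and every step is clipped to the budget, the combined perturbation also stays within $\epsilon_{tc}$.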
4. Experiments
In this section, we conduct extensive experiments to verify the effectiveness of our test-time counterattack paradigm.
4.1. Experiment setup
Datasets. Following previous work [32,50], we conduct our experiments on 16 datasets, which include general object recognition datasets CIFAR10 [23], CIFAR100 [23], STL10 [10], ImageNet [13], Caltech101 [14] and Caltech256 [15]; fine-grained recognition datasets OxfordPets [37], Flowers102 [35], Food101 [5], StanfordCars [22]; scene recognition datasets SUN397 [53], Country211 [39]; domain-specific datasets FGVCAircraft [30], EuroSAT [19], DTD [9], and PCAM [4].
Implementation Details. We use a counterattack budget of $\epsilon_{tc} = 4/255$ and a threshold $\tau_{thres} = 0.2$. We set the number of steps for counterattacks as $N = 2$, unless otherwise stated. $\beta$ is set to 2.0. All attacks and counterattacks in experiments are bounded by an $L_\infty$ radius. Note that the selection of $\tau_{thres}$ is dependent on $\epsilon_{tc}$, and is determined based on the $\tau$ behaviour of clean images (Fig. 4). A higher $\tau_{thres}$ trades off more clean accuracy for robustness. We provide more details in Appendix (Sec. 9 and Sec. 11).
Baselines. Since there are no test-time defence methods for CLIP, we implement several test-time methods from existing adversarial robustness studies that do not rely on auxiliary networks. Specifically, we implement Anti-adversary [1] and Hedge Defense (HD) [52], which are the most closely related to our method. Anti-adversary [1] generates a perturbation to reinforce the confidence of the classifier given a test image. HD [52] performs a counterattack on the test image by increasing the cross-entropy w.r.t. all candidate classes, based on their finding that the loss function surface is smoother around the ground-truth class. We also adapt their method in our experiments with CLIP. For these two methods, we employ a test-time perturbation budget of 4/255, which is equal to the counterattack budget of our TTC. Following the original papers, the numbers of steps are 2 and 20 for Anti-adversary and HD, respectively. Considering that previous studies establish image transformations as a simple and effective defence method [17,38,54], we also implement test-time transformation ensembling (TTE) [38], which ensembles image transformations as a defence. We implement TTE with image flip, 4 crops, and image flip for all crops, totaling 9 augmentative views. As the simplest baseline, we also include random noise (RN), which adds a random perturbation noise with the same strength as our $\epsilon_{tc}$, i.e., $n \sim U(-\epsilon_{tc}, \epsilon_{tc})$. As a useful reference, we also implement adversarial finetuning methods TeCoA [32], PMG-AFT [50] and FARE [44] by finetuning the CLIP vision encoder with adversarial images on TinyImageNet, based on the objective functions proposed in their papers¹. We also finetune CLIP with clean images (CLIP-FT) on TinyImageNet. In the phase of finetuning, we use a 2-step PGD attack, with the stepsize $\alpha = 1/255$ and attack budget $\epsilon = 1/255$, following [32,50]. The learning rate of finetuning is 5e−5. After the finetuning phase, the finetuned models are deployed on 16 downstream datasets.
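The nine TTE views mentioned above can be enumerated as in the sketch below. It follows one reading of the description (one flipped image, four crops, and the flips of those four crops); the use of corner crops, the crop size and the helper names `hflip`, `crop` and `tte_views` are assumptions, and a real pipeline would resize every view to the encoder's input resolution before ensembling the predictions.

```python
def hflip(img):
    """Horizontally flip an image stored as a list of pixel rows."""
    return [row[::-1] for row in img]

def crop(img, top, left, size):
    """Square crop of side `size` whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]

def tte_views(img, crop_size):
    """Flip + 4 corner crops + flips of those crops = 9 views (assumed layout)."""
    h, w = len(img), len(img[0])
    corners = [(0, 0), (0, w - crop_size),
               (h - crop_size, 0), (h - crop_size, w - crop_size)]
    crops = [crop(img, t, l, crop_size) for t, l in corners]
    return [hflip(img)] + crops + [hflip(c) for c in crops]

# A 4x4 single-channel toy image with distinct pixel values.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
views = tte_views(img, crop_size=3)
```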
4.2. TTC on Original CLIP
Robustness under $\epsilon_a = 1/255$. We first test the robustness of all methods under the attack budget of $\epsilon_a = 1/255$. Following previous studies on CLIP’s adversarial robustness
¹Unlike original implementations, we hold out 10% of the training set of TinyImageNet for evaluation in our implementation, without consulting downstream datasets. We also find preprocessing significantly affects the performance of finetuned models on CIFAR10, CIFAR100, and STL10. We follow the preprocessing pipeline recommended by CLIP (Tab. 1).
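(%) Column groups: CLIP (original) | Adversarial Finetuning (CLIP-FT, TeCoA, PMG-AFT, FARE) | Test-time Defence (RN, TTE, Anti-adv, HD, TTC (ours)) | ∆
Dataset Metric CLIP CLIP-FT TeCoA PMG-AFT FARE RN TTE Anti-adv HD TTC (ours) ∆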
TinyImageNet Rob. 0.19 2.19 48.64 46.12 25.47 0.28±0.02 19.52±4.21 4.46±0.23 3.11±0.05 20.64±0.17 +20.45
Acc. 57.64 77.06 70.86 66.85 73.63 51.83±0.16 56.74±0.22 52.55±0.06 51.37±0.15 51.84±0.17 -5.80
CIFAR10 Rob. 0.74 3.34 33.61 40.66 19.65 2.01±0.08 41.35±6.14 12.39±0.07 17.22±0.45 28.75±0.18 +28.01
Acc. 85.12 84.90 64.61 70.69 74.44 81.18±0.07 84.74±0.40 83.52±0.09 78.23±0.16 81.18±0.07 -3.94
CIFAR100 Rob. 0.26 0.90 18.95 22.52 11.40 0.67±0.05 20.06±4.03 5.73±0.04 3.86±0.10 14.31±0.25 +14.05
Acc. 57.14 59.51 35.96 40.32 46.67 56.34±0.20 58.61±0.25 53.95±0.15 52.86±0.16 56.34±0.20 -0.80
STL10 Rob. 11.0 12.73 70.08 73.08 59.06 16.23±0.08 78.48±3.83 37.42±0.40 39.02±0.30 76.70±0.23 +65.70
Acc. 96.40 94.49 87.40 88.56 91.72 95.85±0.04 96.26±0.04 95.45±0.08 89.50±0.07 95.85±0.04 -0.55
ImageNet Rob. 1.15 0.93 18.89 21.43 14.00 1.77±0.03 31.01±4.40 8.67±0.05 6.63±0.05 38.41±0.07 +37.26
Acc. 59.69 54.24 34.89 36.12 48.79 59.34±0.06 60.02±0.12 54.27±0.14 54.54±0.05 49.39±0.00 -10.30
Caltech101 Rob. 14.67 14.21 55.51 61.08 50.74 18.90±0.14 67.56±3.88 34.81±0.16 31.53±0.22 65.78±0.07 +51.11
Acc. 85.66 83.63 71.68 75.45 80.95 86.61±0.10 85.84±0.09 84.02±0.10 82.33±0.04 86.53±0.07 +0.87
Caltech256 Rob. 8.47 6.76 43.19 45.91 38.79 11.33±0.04 60.09±4.03 25.36±0.17 23.48±0.10 60.11±0.04 +51.64
Acc. 81.72 78.53 61.14 62.24 73.32 81.25±0.03 82.49±0.08 79.38±0.12 79.12±0.01 79.66±0.04 -2.06
OxfordPets Rob. 1.04 2.10 38.35 41.18 31.07 1.86±0.01 50.33±7.30 20.42±0.22 12.04±0.16 57.87±0.15 +56.83
Acc. 87.44 84.14 62.12 65.88 79.37 87.41±0.12 88.13±0.13 80.62±0.35 80.91±0.05 83.35±0.21 -4.09
Flowers102 Rob. 1.14 0.54 21.94 23.43 17.14 1.52±0.01 35.88±4.72 7.16±0.41 7.29±0.06 39.14±0.28 +38.00
Acc. 65.46 53.37 36.80 37.00 47.98 64.62±0.19 65.18±0.22 62.66±0.14 58.22±0.12 64.16±0.19 -1.30
FGVCAircraft Rob. 0.00 0.00 2.49 2.22 1.35 0.00±0.00 6.23±1.37 1.27±0.07 1.26±0.07 13.77±0.38 +13.77
Acc. 20.10 14.04 5.31 5.55 10.86 19.25±0.18 20.19±0.36 15.88±0.23 16.36±0.03 18.00±0.16 -2.10
StanfordCars Rob. 0.02 0.06 8.76 11.65 6.75 0.16±0.02 22.36±4.17 4.40±0.30 2.71±0.09 33.01±0.07 +32.99
Acc. 52.02 42.11 20.91 25.44 38.68 52.14±0.09 52.73±0.31 36.21±0.27 44.28±0.02 48.16±0.16 -3.86
SUN397 Rob. 1.14 0.94 19.39 22.58 14.91 1.72±0.01 30.79±4.43 8.05±0.04 6.40±0.06 41.52±0.04 +40.38
Acc. 58.50 55.73 36.69 37.98 52.42 59.69±0.06 59.12±0.08 56.00±0.04 53.17±0.02 55.13±0.06 -3.37
Country211 Rob. 0.04 0.03 1.78 2.12 0.85 0.06±0.00 3.05±0.89 0.67±0.05 0.47±0.02 7.09±0.04 +7.05
Acc. 15.25 12.07 4.75 4.64 9.26 14.80±0.02 14.66±0.16 11.58±0.12 11.72±0.07 13.08±0.05 -2.17
Food101 Rob. 0.70 0.42 13.90 18.57 11.65 1.20±0.01 43.94±6.97 13.12±0.16 8.03±0.11 57.84±0.15 +57.14
Acc. 83.88 64.86 29.98 36.61 55.31 83.44±0.04 83.96±0.02 75.81±0.22 80.30±0.05 82.18±0.02 -1.70
EuroSAT Rob. 0.03 0.04 11.96 12.60 10.67 0.15±0.01 6.91±2.13 2.15±0.04 4.57±0.09 12.19±0.24 +12.16
Acc. 42.59 27.64 16.58 18.53 21.88 53.24±0.09 44.38±1.60 36.78±0.18 39.08±0.06 53.24±0.09 +10.65
DTD Rob. 2.98 2.39 17.61 14.95 15.64 3.71±0.09 23.90±2.34 5.62±0.07 11.63±0.17 27.32±0.25 +24.34
Acc. 40.64 36.49 25.16 21.76 32.07 37.96±0.13 41.33±0.32 38.92±0.22 34.89±0.35 36.98±0.21 -3.66
PCAM Rob. 0.08 1.11 48.24 46.18 16.23 0.41±0.01 10.62±3.22 4.97±0.12 44.74±0.17 52.85±0.20 +52.77
Acc. 52.02 47.21 49.96 50.03 52.54 52.73±0.07 51.01±0.08 52.49±0.02 50.38±0.04 52.73±0.07 +0.71
Avg. Rob. 2.70 2.91 26.54 28.76 20.00 3.86±0.02 33.28±3.98 12.01±0.04 13.81±0.06 39.17±0.02 +36.47
Acc. 61.51 55.80 40.25 42.30 51.02 61.61±0.03 61.79±0.13 57.35±0.03 56.62±0.02 59.75±0.06 -1.76
Table 1. Classification accuracy (%) on both adversarial images (Rob.) under 10-step PGD attack at $\epsilon_a = 1/255$ and clean images (Acc.) across 16 datasets. We include the results on TinyImageNet because it is used to finetune the model for CLIP-FT, TeCoA [32], PMG-AFT [50], and FARE [44]. Comparison is made among our paradigm and test-time defences adapted from existing adversarial studies, with finetuning-based models implemented as a reference. We report the mean and standard deviation for test-time methods over 3 runs. The last column reports the gains w.r.t. original CLIP without any finetuning or test-time operations.
[32,50], we test all baselines under 10-step PGD attacks across 16 datasets, assuming that the attacker has full access to the weights and gradients of the deployed model, but not to the test-time operations made by the end user. We report the accuracy on both adversarial images and clean images in Tab. 1. It can be seen that all finetuning-based methods overfit to the dataset used for adversarial finetuning to varying extents, as evidenced by the higher accuracy on clean images than the original CLIP on TinyImageNet. The improved robustness on downstream datasets comes at a cost of a noticeable clean accuracy drop. Among test-time methods, both Anti-adversary and HD, which generate an additive perturbation based on an objective, lead to limited improvement of robust accuracy. Our TTC, which utilises the pre-trained vision encoder of CLIP to produce counterattacks, shows the best robust accuracy on most downstream datasets, usually with a large gain. We also retain the best clean accuracy compared to these two perturbation update methods. Adding random noise (RN) brings little robustness, even though the added noise is four times larger than the attack budget, i.e., $\epsilon_{tc} \gg \epsilon_a$. RN can be viewed as a special case of TTC with the number of steps $N$ being 0. By exploiting the pre-trained model $f_\theta$ to optimize the noise, TTC significantly improves robustness. TTE ensembles a number of image transformations, which improves CLIP’s robustness at test time to an average accuracy of 33.28%.
(%) Rob. Acc.
CLIP 0.09 61.51
CLIP-FT 0.96 55.80
TeCoA^1 [32] 6.51 40.25
TeCoA^4 [32] 10.03 35.57
PMG-AFT^1 [50] 7.03 42.30
PMG-AFT^4 [50] 10.70 37.58
FARE^1 [44] 1.50 51.02
FARE^4 [44] 3.67 46.17
RN 0.06±0.00 61.61±0.03
TTE [38] 7.79±3.23 61.79±0.13
Anti-adv [1] 0.53±0.00 57.32±0.03
HD [52] 1.19±0.01 56.62±0.02
TTC (ours) 20.63±0.05 55.99±0.06
∆ +20.54 -5.52
Table 2. Classification accuracy (%) on adversarial images (Rob.) under 10-step PGD at $\epsilon_a = 4/255$ and clean images (Acc.) averaged on 16 datasets. Superscripts indicate the attack budget used in the finetuning phase. The last row reports the gains compared to the original CLIP.
However, this gain is generally unstable across runs, as indicated by the high standard deviation of robust accuracy. Overall, our proposed TTC leads to consistent gains on robust accuracy (+36.47%) averaged on downstream datasets with a slight loss (-1.76%) on clean accuracy compared to the original CLIP, serving as a stable defence at inference time. We test the robustness under CW attacks [6] in Appendix (Sec. 8.1) due to limited space. It can also be seen that the robustness gains come at a cost of accuracy reduction on clean images to varying extents across datasets. We provide more analysis in Appendix Sec. 9.
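The two headline figures can be recomputed directly from the average rows of Tab. 1. The short check below is only an arithmetic illustration of how the ∆ column is obtained; the variable names are placeholders.

```python
# Average rows of Tab. 1 (original CLIP vs. TTC), in percent.
clip_rob, clip_acc = 2.70, 61.51
ttc_rob, ttc_acc = 39.17, 59.75

rob_gain = round(ttc_rob - clip_rob, 2)   # change in robust accuracy
acc_gain = round(ttc_acc - clip_acc, 2)   # change in clean accuracy (negative = loss)
print(f"{rob_gain:+.2f} {acc_gain:+.2f}")
```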
Robustness under $\epsilon_a = 4/255$. We further test the robustness of all methods under a high attack budget $\epsilon_a = 4/255$. For this setting, we increase the number of steps $N$ to 5 for more effective counterattacks, while other hyperparameters are unchanged. We also implement finetuning-based methods with attack budget $\epsilon = 4/255$ during finetuning in this setting. We report the average accuracy across 16 datasets in Tab. 2 and provide the full table in Appendix (Tab. 5). It can be seen that a high attack budget at $\epsilon_a = 4/255$ degrades the accuracy of all models to a very low level. Anti-adversary [1] and HD [52] provide little to no robustness under this setting. TTE defends the model to a limited extent, but still with low reliability as indicated by the high standard deviation. In comparison, our proposed TTC provides a stable robustness gain averaged on 16 datasets.
4.3. TTC on Adversarially Finetuned CLIP
Since our method performs counterattacks using the victim model at test time, it can also be employed on adversarially finetuned models in a plug-in manner. In this section, we apply TTC to finetuning-based models, assuming that the attacker has full access to the deployed model, but not to the operations by the end user. Note that we still employ the original vision encoder $f_\theta$ of CLIP to compute $\tau$ (Eq. (4)), because the sensitivity of adversarially finetuned vision encoders is largely reduced. We report the results in Tab. 3. It can be seen that TTC can further boost adversarial robustness by exploiting the finetuned model to perform counterattacks at test time. Specifically, TTC achieves a robust accuracy of 29.06% and 30.81% when employed on TeCoA and PMG-AFT, surpassing the original finetuned models by 2.52 and 2.05 points, respectively. A significant gain of 13.85 points is achieved when we employ TTC on top of FARE, an unsupervised adversarially finetuned model. Interestingly, we find that adversarial finetuning greatly reduces the sensitivity of CLIP to variations in the pixel space, thus hurting the expressive power of the pre-trained encoder. We provide more in-depth analyses of such loss in Appendix (Sec. 10). Since our counterattacks rely heavily on the expressiveness of the pre-trained vision encoder $f_\theta$, this also explains the smaller gains achieved on adversarially finetuned models, compared to the original CLIP. The large increase of robust accuracy on FARE implies that adversarially finetuning CLIP in an unsupervised manner better retains the expressiveness of the model.
4.4. Ablation studies
We experimentally find that the number of steps N of our TTC greatly affects performance on both adversarial and clean images. A general rule of thumb is that an attack with a higher budget ϵa requires more counterattack steps. In this section, we investigate the effect of N and keep the other hyperparameters unchanged. We provide analysis on other hyperparameters in Appendix (Sec. 11).
Fig. 5 provides the performance of CLIP employing TTC on 14 datasets as N varies. For smaller attacks at ϵa = 1/255, it takes fewer than three steps for CLIP to defend itself effectively on most datasets. Excessive counterattacks can impair the images, as evidenced by the decline after a certain number of steps. In comparison, a strong attack at ϵa = 4/255 requires a larger number of counterattack steps to reach a reasonable accuracy, showing that stronger attacks are more resilient to counterattacks by the user side. TTC does not impact accuracy on clean images significantly on most datasets, except for SUN397 (Fig. 5d), OxfordPets (Fig. 5j), StanfordCars (Fig. 5n) and ImageNet (Fig. 5l), where clean images are found sensitive to the increase of N.
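To make the role of N concrete, the following toy sketch runs an N-step counterattack that pushes the embedding of an input away from where it started, using signed ascent steps projected onto an ℓ∞ ball. It is an illustration under stated assumptions and not the paper's implementation: the encoder is any plain function on flat lists, gradients are approximated by finite differences instead of autograd, pixel-range clipping is omitted, and `counterattack`, `num_steps`, `eps`, `step_size` and `seed` are hypothetical names.

```python
import random


def sign(value):
    """Return -1, 0 or 1 depending on the sign of `value`."""
    return (value > 0) - (value < 0)


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def numerical_gradient(objective, point, h=1e-4):
    """Central finite differences; stands in for autograd in this toy."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((objective(up) - objective(down)) / (2 * h))
    return grad


def counterattack(encoder, image, num_steps, eps, step_size, seed=0):
    """Push the embedding of `image` away from its starting point in N steps."""
    if num_steps == 0:
        return list(image)
    anchor = encoder(image)  # embedding of the incoming image
    rng = random.Random(seed)
    # Random start: at delta = 0 the distance objective has zero gradient.
    delta = [rng.uniform(-eps, eps) for _ in image]

    def objective(d):
        moved = encoder([p + q for p, q in zip(image, d)])
        return squared_distance(moved, anchor)

    for _ in range(num_steps):
        grad = numerical_gradient(objective, delta)
        # Signed ascent step, then projection onto the L-infinity ball.
        delta = [max(-eps, min(eps, d + step_size * sign(g)))
                 for d, g in zip(delta, grad)]
    return [p + d for p, d in zip(image, delta)]
```

With a fixed budget `eps`, each additional step can only move the perturbation closer to the boundary of the ball, which is one way to see why a larger N helps against strong attacks yet risks over-perturbing images that needed little or no correction.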
5. Limitations
Although we show that TTC improves the robustness of CLIP against adversaries that maximise the classification loss, there are limitations as discussed below. Firstly, the robustness gain of applying TTC on TeCoA [32] and PMG-AFT [50] is less obvious. This is due to the reduced expressiveness of CLIP caused by adversarial finetuning. We argue that for large pre-trained models like CLIP, adversarial finetuning should
Dataset        TeCoA    +TTC     ∆        PMG-AFT  +TTC     ∆        FARE     +TTC     ∆
CIFAR10        33.61    34.68    +1.07    40.66    42.17    +1.51    19.65    35.55    +15.90
CIFAR100       18.95    20.00    +1.05    22.52    24.09    +1.57    11.40    22.34    +10.94
STL10          70.08    71.65    +1.57    73.08    73.55    +0.47    59.06    76.65    +17.59
ImageNet       18.89    23.14    +4.25    21.43    24.36    +2.93    14.00    30.52    +16.52
Caltech101     55.51    59.44    +3.93    61.08    63.72    +2.64    50.74    67.39    +16.65
Caltech256     43.19    48.49    +5.30    45.91    50.37    +4.46    38.79    59.20    +20.41
OxfordPets     38.35    42.66    +4.31    41.18    43.96    +2.78    31.07    51.53    +20.46
Flower102      21.94    25.13    +3.19    23.43    25.94    +2.51    17.14    29.85    +12.71
FGVCAircraft   2.49     2.78     +0.29    2.22     2.51     +0.29    1.35     5.03     +3.68
StanfordCars   8.76     12.09    +3.33    11.65    14.97    +3.32    6.85     20.46    +13.61
SUN397         19.39    23.91    +4.52    22.58    25.70    +3.12    14.90    33.42    +18.52
Country211     1.78     2.49     +0.71    2.12     2.57     +0.45    0.85     4.04     +3.19
Food101        13.90    17.79    +3.89    18.57    22.33    +3.76    11.65    31.76    +20.11
EuroSAT        11.96    12.75    +0.79    12.60    13.94    +1.34    10.67    15.49    +4.82
DTD            17.61    18.87    +1.26    14.95    15.98    +1.03    15.64    23.17    +7.53
PCAM           48.24    48.44    +0.20    46.18    46.52    +0.34    16.23    35.79    +19.56
Avg. Rob.      26.54    29.02    +2.48    28.76    30.79    +2.03    20.00    33.89    +13.89
Avg. Acc.      40.25    39.85    -0.40    42.30    41.89    -0.41    51.02    49.91    -1.11
Table 3. TTC employed on adversarially finetuned models at test time. We report the robust accuracy (%) at ϵa = 1/255 and the robustness gain (∆) by employing TTC for each dataset.
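As a worked check of how the Avg. Rob. row of Table 3 relates to the per-dataset columns, the short script below recomputes the 16-dataset averages and the average gains from the values listed in the table. The dictionary `ROBUST_ACC` and the helper names are hypothetical; the numbers are copied from Table 3 in its dataset order.

```python
# Robust accuracy (%) at eps_a = 1/255, copied from Table 3 (16 datasets).
ROBUST_ACC = {
    "TeCoA": [33.61, 18.95, 70.08, 18.89, 55.51, 43.19, 38.35, 21.94,
              2.49, 8.76, 19.39, 1.78, 13.90, 11.96, 17.61, 48.24],
    "TeCoA+TTC": [34.68, 20.00, 71.65, 23.14, 59.44, 48.49, 42.66, 25.13,
                  2.78, 12.09, 23.91, 2.49, 17.79, 12.75, 18.87, 48.44],
    "PMG-AFT": [40.66, 22.52, 73.08, 21.43, 61.08, 45.91, 41.18, 23.43,
                2.22, 11.65, 22.58, 2.12, 18.57, 12.60, 14.95, 46.18],
    "PMG-AFT+TTC": [42.17, 24.09, 73.55, 24.36, 63.72, 50.37, 43.96, 25.94,
                    2.51, 14.97, 25.70, 2.57, 22.33, 13.94, 15.98, 46.52],
    "FARE": [19.65, 11.40, 59.06, 14.00, 50.74, 38.79, 31.07, 17.14,
             1.35, 6.85, 14.90, 0.85, 11.65, 10.67, 15.64, 16.23],
    "FARE+TTC": [35.55, 22.34, 76.65, 30.52, 67.39, 59.20, 51.53, 29.85,
                 5.03, 20.46, 33.42, 4.04, 31.76, 15.49, 23.17, 35.79],
}


def average(values):
    """Unweighted mean over datasets."""
    return sum(values) / len(values)


def average_gain(base, with_ttc):
    """Difference between the averaged robust accuracies of two rows."""
    return average(ROBUST_ACC[with_ttc]) - average(ROBUST_ACC[base])


if __name__ == "__main__":
    for base in ("TeCoA", "PMG-AFT", "FARE"):
        ttc = base + "+TTC"
        print(f"{base}: {average(ROBUST_ACC[base]):.2f} -> "
              f"{average(ROBUST_ACC[ttc]):.2f} "
              f"(gain {average_gain(base, ttc):+.2f})")
```

The average is unweighted, so small datasets count as much as ImageNet; a per-dataset view such as the ∆ columns is the better guide when the deployment domain is known.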
Figure 5. Effects of the number of steps N of counterattacks performed on CLIP. The green lines represent accuracy on clean images, and red and blue lines accuracy on adversarial images at ϵa = 1/255 and ϵa = 4/255, respectively. Panels: (a) CIFAR10, (b) DTD, (c) STL10, (d) SUN397, (e) EuroSAT, (f) Caltech101, (g) PCAM, (h) CIFAR100, (i) Food101, (j) OxfordPets, (k) Flower102, (l) ImageNet, (m) Caltech256, (n) StanfordCars.
be employed sparingly, considering that a fundamental difference from adversarial studies on non-foundational models is that they have learned massive amounts of real-world knowledge. Secondly, although TTC does not involve training on adversarial images, it incurs more computation expenses at inference time. Additionally, the number of counterattack steps affects robustness performance. It can be difficult to tune for the most suitable N if the attack strength ϵa is not known a priori (Fig. 5). We recommend fewer steps (no more than three) if the attack is unknown, to avoid excessive counterattacks and unproductive computational overhead. In the future, we intend to explore methods to adjust the number of steps based on the test image. Thirdly, according to adversarial robustness studies on conventional models, test-time defence can be circumvented by adaptive attacks [12]. We discuss in Appendix (Sec. 12) possible adaptive attacks to break our counterattacks, assuming the worst scenario where the attacker has access to the weights of the deployed CLIP model and TTC performed by the end user.
6. Conclusion
We show that CLIP can leverage its own pre-trained vision encoder to defend against adversarial images maliciously manipulated to maximise its loss by performing counterattacks at test time, without relying on any auxiliary networks. Based on the finding that adversarial images are ‘falsely stable’, we propose τ-thresholded counterattacks to guide the adversarial image away from its original embedding in the latent space. Experiments on 16 datasets show that TTC employed on CLIP achieves stable and promising accuracy on adversarial images. TTC is also shown to further enhance the robustness of adversarially finetuned CLIP models. We also find that finetuning CLIP with adversarial images compromises its own expressiveness, and recommend cautious use of adversarial finetuning as the only approach to robustifying large pre-trained models. Our paradigm is the first test-time method to defend CLIP at inference time without any finetuning. We hope this study will encourage future research on robustifying approaches for CLIP alternative to adversarial finetuning.
Acknowledgement
This work was supported by the MUR PNRR project FAIR (PE00000013) funded by the NextGenerationEU and the EU Horizon projects ELIAS (No. 101120237) and AI4Trust (No. 101070190).
References
[1] Motasem Alfarra, Juan C Pérez, Ali Thabet, Adel Bibi, Philip HS Torr, and Bernard Ghanem. Combating adversaries with anti-adversaries. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5992–6000, 2022.
[2] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pages 274–283. PMLR, 2018.
[3] Tao Bai, Jinqi Luo, Jun Zhao, Bihan Wen, and Qian Wang. Recent advances in adversarial training for adversarial robustness. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 4312–4321. International Joint Conferences on Artificial Intelligence Organization, 2021. Survey Track.
[4] Babak Ehteshami Bejnordi, Mitko Veta, Paul Johannes Van Diest, Bram Van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen AWM Van Der Laak, Meyke Hermsen, Quirine F Manson, Maschenka Balkenhol, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama, 318(22):2199–2210, 2017.
[5] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Computer vision–ECCV 2014: 13th European conference, Zurich, Switzerland, September 6-12, 2014, proceedings, part VI 13, pages 446–461. Springer, 2014.
[6] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE symposium on security and privacy (SP), pages 39–57. IEEE, 2017.
[7] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
[9] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014.
[10] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
[11] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning, pages 2206–2216. PMLR, 2020.
[12] Francesco Croce, Sven Gowal, Thomas Brunner, Evan Shelhamer, Matthias Hein, and Taylan Cemgil. Evaluating the adversarial robustness of adaptive test-time defenses. In International Conference on Machine Learning, pages 4421–4435. PMLR, 2022.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[14] Li Fei-Fei, Robert Fergus, and Pietro Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
[15] Gregory Griffin, Alex Holub, Pietro Perona, et al. Caltech-256 object category dataset. Technical report, Technical Report 7694, California Institute of Technology Pasadena, 2007.
[16] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
[17] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. Countering adversarial images using input transformations. In International Conference on Learning Representations, 2018.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[19] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
[20] Duhun Hwang, Eunjung Lee, and Wonjong Rhee. Aid-purifier: A light auxiliary network for boosting adversarial defense. Neurocomputing, 541:126251, 2023.
[21] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
[22] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
[23] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
[25] Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In Artificial intelligence safety and security, pages 99–112. Chapman and Hall/CRC, 2018.
[26] Lin Li, Haoyan Guan, Jianing Qiu, and Michael Spratling. One prompt word is enough to boost adversarial robustness