CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP

Author: Xing, Songlong; Zhao, Zhengyu; Sebe, Niculae

Publisher: Zenodo

DOI: 10.1109/CVPR52734.2025.01413

Source: https://zenodo.org/records/17688752/files/Xing_CLIP_is_Strong_Enough_to_Fight_Back_Test-time_Counterattacks_towards_CVPR_2025_paper.pdf

CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
Songlong Xing¹, Zhengyu Zhao²*, Nicu Sebe¹
¹University of Trento, Italy  ²Xi’an Jiaotong University, China
{songlong.xing, niculae.sebe}@unitn.it [email protected]
Abstract
Despite its prevalent use in image-text matching tasks in a zero-shot manner, CLIP has been shown to be highly vulnerable to adversarial perturbations added onto images. Recent studies propose to finetune the vision encoder of CLIP with adversarial samples generated on the fly, and show improved robustness against adversarial attacks on a spectrum of downstream datasets, a property termed as zero-shot robustness. In this paper, we show that malicious perturbations that seek to maximise the classification loss lead to ‘falsely stable’ images, and propose to leverage the pre-trained vision encoder of CLIP to counterattack such adversarial images during inference to achieve robustness. Our paradigm is simple and training-free, providing the first method to defend CLIP from adversarial attacks at test time, which is orthogonal to existing methods aiming to boost zero-shot adversarial robustness of CLIP. We conduct experiments across 16 classification datasets, and demonstrate stable and consistent gains compared to test-time defence methods adapted from existing adversarial robustness studies that do not rely on external networks, without noticeably impairing performance on clean images. We also show that our paradigm can be employed on CLIP models that have been adversarially finetuned to further enhance their robustness at test time. Our code is available here.
1. Introduction
With the increasing availability of image-text data and the advancement of self-supervised learning techniques [7,8,16], vision-language models (VLM) have continued to spark research interests in both academia and industry [21,28,39,40,42,56]. As a representative VLM, CLIP [39] has shown impressive abilities to match an image with its descriptive text in a zero-shot manner. However, recent studies have shown that adding small imperceptible perturbations to an image can cause CLIP to misclassify it [26,27,32,44,50,59,63], a common problem plaguing nearly all neural networks [2,6,25,29,33,36,48,57,58]. As foundational models are deployed in real-world applications, their safety and reliability have become a pressing concern. In this paper, we focus on the robustness of CLIP against adversarial perturbations.

*Corresponding author

Figure 1. Test-time counterattacks harness the expressive power of CLIP to generate a counterattack to defend CLIP against adversaries without finetuning the vision encoder.
Unlike conventional models for which adversarial robustness has been extensively studied [2,6,11,33,47,48], CLIP is a pre-trained foundation model that has learned massive amounts of real-world knowledge, and should be dealt with carefully to minimise damage to its generalisation abilities. Adversarial robustness of CLIP has just started to garner research attention [26,32,44,50] in recent years. Existing efforts fall into two categories. The first type is based on adversarial training [3,58], which alternately generates adversarial images on one dataset and uses them to finetune the vision encoder of CLIP [32,50]. This type of methods, known as adversarial finetuning (AFT), dynamically mimics a min-max game between CLIP and the threat model in the finetuning phase, and deploys the finetuned model in a wide variety of downstream classification tasks without further training. This method shows transferable robustness to downstream datasets, a property termed as zero-shot robustness [32,50]. The other type of methods resorts to prompt tuning [61,62], which inserts learnable text tokens in the embedding space, aligns the corresponding text prompts with adversarial images, and tunes the learnable tokens by propagating gradients to the text embeddings. This approach is known as adversarial prompt tuning (APT) [26,59]. Although these methods have shown improved robustness over the original CLIP, there are apparent limitations. Firstly, they require time-consuming training, especially adversarial finetuning which involves generating adversarial images on the fly. Secondly, the model overfits to the training data, which compromises generalisation on clean and adversarial images from other data distributions. In the case of adversarial finetuning, the adversarially finetuned CLIP outperforms the original CLIP on clean images from the dataset used for finetuning, indicating that the model has overfitted to the distribution of the dataset (see Tab. 1 in Sec. 4). Thirdly, for adversarial finetuning methods [32,50], robustness to adversarial samples comes at the cost of a significant decline of classification accuracy on clean images.

This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.
To address these limitations, we draw inspiration from existing adversarial robustness studies that robustify non-foundational target models at test time [1,17,31,52], and propose a test-time paradigm that utilises the expressive power of CLIP to defend itself from adversarial attacks (Fig. 1). Following previous studies, we focus on adversaries that aim to maximise the classification loss of CLIP given a test image. We observe that such adversaries cause images to be more stable when a small random noise is added, compared to clean images. We term this behaviour of adversarial images ‘false stability’, which can be interpreted as the images being trapped in a toxic surrounding in the latent space by an adversary. Intuitively, the pre-trained vision encoder of CLIP is highly expressive, and can be leveraged to push the adversarial image away from its toxic embedding. Therefore, we propose to employ the vision encoder of CLIP to counteract the false stability of adversarial images, thereby achieving robustness to these attacks. Since no label information is available at test time, we formulate the test image as an anchor, and iteratively update the counterattack perturbation such that it maximises the $L_2$ distance with the anchor in the embedding space [44]. However, pushing the test image away from its embedding risks hurting performance on clean images. To address this, we propose $\tau$-thresholded weighted counterattacks, which employ a threshold to prevent further counterattacking if the test image does not exhibit false stability, thus preserving performance on clean images. To our best knowledge, our paradigm is the first work to utilise the expressive power of CLIP to defend itself from adversarial attacks, and can be categorised as a test-time input purification method for VLMs. We conduct extensive experiments and analyses on 16 classification datasets, establishing our method as an effective test-time defence for CLIP. We summarise the main contributions of this paper as follows:
• We propose the first method that harnesses the power of CLIP to defend itself from adversarial attacks at inference time without relying on any auxiliary networks. Our method is simple and training-free, and can be easily employed in other VLMs.
• We propose a test-time counterattack method based on Projected Gradient Descent (PGD). We show that our method can defend CLIP with a small number of counterattack steps without significantly impacting performance on clean images.
• We conduct experiments across 16 classification datasets, and demonstrate superior performance compared to test-time defences adapted from existing adversarial robustness literature. Our paradigm can be employed on adversarially finetuned CLIP to further enhance robustness performance at test time.
2. Related Work
Adversarial Robustness. Since the early development of deep neural networks [18,24,49], it has been found that they are vulnerable to adversarial attacks. Specifically, a small adversarial perturbation bounded by an $L_p$-radius ball, usually imperceptible by humans, can cause the network to misclassify the sample entirely [6,48]. To address this vulnerability, adversarial training (AT) [29,41,58] alternately generates adversarial samples with the target model on the fly, and trains the network with these adversarial samples. This practice has shown significantly improved robustness to adversarial attacks, and has become a de facto standard in adversarial machine learning, despite the presence of limitations such as expensive training [45,51]. Other types of methods are also proposed, of which the most related to this work is test-time defence. This can be achieved by purifying the test image with an auxiliary generative model [34,43,55], or by adjusting the test image based on an objective [1,20,31,52]. However, subsequent studies show that test-time defence can be circumvented by adaptive attacks specially designed for the defence [12]. Amongst these methods, Hedge Defense (HD) [52] is the most closely related to our work. They attack the test image by maximising the cross-entropy loss with respect to all classes, based on the finding that the loss surface is smoother around the ground-truth label. An important difference between HD and this work is that they apply the defence method on an adversarially trained model. In contrast, we focus on CLIP and show that foundation models like CLIP possess the inherent ability to defend themselves against attacks that seek to maximise the classification loss, by producing a counterattack perturbation that leads the adversarial image away from its original embedding in the latent space, with no need for an adversarially trained model.
Figure 2. Pipeline to generate an adversarial perturbation $\delta$ given an image $x$ and its ground-truth label based on CLIP. Black and red arrows denote the forward and backward pass, respectively.
VLMs and Their Adversarial Robustness. Recently, adversarial robustness of foundation models has garnered increasing research attention [46,60]. This paper focuses on enhancing the adversarial robustness of CLIP [39] since it is a representative foundation model that aligns images and text. Existing methods can be divided into two types: (1) Adversarial finetuning. Mao et al. [32] propose TeCoA, which finetunes the vision encoder of CLIP using adversarial samples generated on the fly on one dataset, and transfers the learned robustness to downstream classification datasets. Based on this pipeline, Wang et al. [50] further propose to employ the original CLIP to guide adversarial finetuning by imposing two regularisation terms, showing improved generalization on clean and adversarial images across downstream datasets. (2) Adversarial prompt tuning. This line of research is built on prompt tuning of CLIP [61,62], where model weights are kept frozen. Li et al. [26] show that textual prompts play an important role in the effectiveness of both adversarial attacks and robustness. They propose to insert learnable tokens and tune the tokens by aligning the textual prompt with adversarial images. Zhang et al. [59] propose a similar pipeline by training robust text tokens, assuming that the attacker has access to the model but not to the text prompts employed by the end user. In this work, we discard any training and show that CLIP possesses the ability to defend itself from adversarial attacks by counterattacking adversarial images. Our method is the first test-time defence method for CLIP.
3. Methodology
In this section, we first provide preliminaries regarding CLIP and adversarial robustness of CLIP in an image classification context. Then we proceed to introduce our test-time counterattack paradigm.
Figure 3. Our test-time counterattack paradigm. We craft a counterattack perturbation $\delta_{\text{ttc}}$ to lead an adversarial image away from its original embedding at test time without finetuning.

3.1. Preliminaries and Setup
Zero-shot classification of CLIP. CLIP [39] is a vision-language model that matches images with their descriptive text. It has been contrastively pre-trained on 400 million image-text pairs. Specifically, it aligns an image $x$ with its corresponding text through the cosine similarity between their representations produced by a vision encoder $f_\theta(\cdot)$ and a text encoder $g_\phi(\cdot)$, respectively, where $\theta$ and $\phi$ are model weights. At inference time, CLIP performs classification in a zero-shot manner. Given a set of classes defined in their textual names $c_1, \ldots, c_K$, CLIP matches a test image $x$ against the textual prompts corresponding to candidate class names wrapped by a template $T$ (usually ‘a photo of [CLASS]’):

$$s_i = \frac{f_\theta(x)^\top g_\phi(T(c_i))}{\|f_\theta(x)\| \cdot \|g_\phi(T(c_i))\|} \qquad (1)$$

The probability of $x$ belonging to class $c_i$ is calculated as the normalized similarity $p_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)}$, and the candidate class with the highest probability is the predicted class.
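As an illustration of Eq. (1) and the softmax that follows it, the short Python sketch below scores one image embedding against a few class-prompt embeddings. The embeddings are hypothetical toy vectors rather than outputs of CLIP's encoders, and the function names are introduced here for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors, as in Eq. (1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_emb, text_embs):
    """Softmax over the cosine similarities s_1, ..., s_K."""
    sims = [cosine_similarity(image_emb, t) for t in text_embs]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy embeddings: one image and K = 3 class prompts.
image_emb = [0.9, 0.1, 0.2]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = zero_shot_probs(image_emb, text_embs)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

Practical CLIP implementations typically scale the similarities by a learned temperature before the softmax; that factor is left out here so that the sketch mirrors Eq. (1) as written.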
Adversarial attacks on CLIP. CLIP is highly vulnerable to adversarial perturbations [32]. In a setting where the attacker has full knowledge of the model weights and gradients of CLIP, a small perturbation $\delta$ bounded by an $L_p$-radius can be maliciously designed to cause CLIP to misclassify:

$$\delta_a = \arg\max_{\delta} L(x+\delta, c), \quad \text{s.t. } \|\delta\|_p \le \epsilon_a \qquad (2)$$

where $c$ is the ground-truth label of $x$, $L$ is a loss function which is usually a cross-entropy loss, and $\epsilon_a$ is the attack budget, which ensures that the manipulation is subtle and imperceptible to humans. Eq. (2) can be approximated by Projected Gradient Descent (PGD) [6]. The adversarial image is the addition of the image and the perturbation, $x' := x + \delta_a$. This process is illustrated in Fig. 2.
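To make the PGD approximation of Eq. (2) concrete, the following is a minimal sketch on a toy linear softmax classifier with an analytic input gradient. It assumes the common sign-gradient update with projection onto an $L_\infty$ ball; the classifier weights, the budget and the step size are hypothetical toy values, and a real attack on CLIP would obtain gradients by automatic differentiation through the encoders. Clamping to the valid pixel range is omitted for brevity.

```python
import math

def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def logits(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ce_loss(W, x, c):
    """Cross-entropy loss of the toy classifier for ground-truth class c."""
    return -math.log(softmax(logits(W, x))[c])

def ce_grad_x(W, x, c):
    """Analytic gradient of the loss w.r.t. the input: W^T (p - onehot(c))."""
    p = softmax(logits(W, x))
    p[c] -= 1.0
    return [sum(p[k] * W[k][d] for k in range(len(W))) for d in range(len(x))]

def pgd_linf(W, x, c, eps, alpha, steps):
    """Sign-gradient ascent on the loss, projected onto the L_inf ball."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        grad = ce_grad_x(W, x_adv, c)
        delta = [
            max(-eps, min(eps, di + alpha * ((g > 0) - (g < 0))))
            for di, g in zip(delta, grad)
        ]
    return delta

# Hypothetical toy setup: two classes, two input features.
W = [[1.0, -1.0], [-1.0, 1.0]]
x, c = [0.6, 0.4], 0
eps, alpha, steps = 0.3, 0.1, 10  # toy values, far larger than 1/255

delta_a = pgd_linf(W, x, c, eps, alpha, steps)
x_adv = [xi + di for xi, di in zip(x, delta_a)]
print(ce_loss(W, x, c), ce_loss(W, x_adv, c))
```

In this toy setting the budget is deliberately large so that the perturbation is able to flip the prediction; the budgets used in the paper are on the scale of a few 255ths of the pixel range.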
Adversarial finetuning strengthens the adversarial robustness of CLIP [32,50] by alternately generating adversarial images $x'$ following Eq. (2) and using them to finetune $f_\theta$. TeCoA [32] performs finetuning by aligning $x'$ with the ground-truth text $T(c)$ on one dataset. PMG-AFT [50] imposes two CLIP-guided regularization terms on top of TeCoA to improve the generalization on clean and adversarial images. After the finetuning phase, CLIP has learned robustness to adversarial attacks, which transfers to downstream classification datasets without further training [32].
3.2. Test-time Counterattacks
Although adversarial finetuning has been shown to significantly improve CLIP’s adversarial robustness, limitations are apparent, such as cumbersome training involving the generation of adversarial samples and finetuning of the vision encoder weights $\theta$. In this paper, we investigate the ability of CLIP to defend itself at test time, with no need for any training, providing the first test-time defence method for CLIP. Our proposed paradigm, termed as Test-time Counterattacks (TTC), is illustrated in Fig. 3. Following previous studies [26,32,50,59], we focus on attacks that aim to maximise the classification loss of CLIP.
Intuitively, a pre-trained vision encoder $f_\theta$ is highly expressive in capturing the nuanced pixel pattern in an image. In this sense, we speculate that an adversarial image that successfully fools CLIP is trapped in a toxic surrounding induced by the adversary, and that the pre-trained $f_\theta$ is able to lead the image away from this surrounding by utilising its expressiveness. Recently, Schlarmann et al. [44] propose an unsupervised attack method for CLIP, where a perturbation is updated such that it maximises the $L_2$ distance between the image and its original embedding. Inspired by this label-free attack method, we employ the same loss function in our paradigm. Specifically, we employ the original image embedding $f_\theta(x)$ as the anchor, and craft a test-time counterattack perturbation $\delta_{\text{ttc}}$ such that the $L_2$ distance between the embedding of the counterattacked image $f_\theta(x+\delta_{\text{ttc}})$ and the anchor $f_\theta(x)$ is maximised:

$$\delta_{\text{ttc}} = \arg\max_{\delta} \|f_\theta(x+\delta) - f_\theta(x)\|, \quad \text{s.t. } \|\delta\|_p \le \epsilon_{\text{ttc}} \qquad (3)$$

This counterattack can also be approximated by PGD [6]. Since this counterattack is performed by the end user at test time, it does not need to be imperceptible, so a large user-defined counterattack budget $\epsilon_{\text{ttc}}$ is permissible. However, we hope to maintain a consistent attack style with existing studies, and still keep the counterattack budget low, bounded by an $L_p$-radius. In the experiments (Sec. 4), we show that a budget at $\epsilon_{\text{ttc}} = 4/255$ is able to improve CLIP’s adversarial robustness significantly. Note that the vision encoder weights $\theta$ are kept frozen throughout. Among existing methods for non-foundational models, the most closely related to ours is Hedge Defense (HD) [52]. An important difference is that they employ HD on adversarially trained models, whilst we show that CLIP without adversarial finetuning can harness the expressiveness of its vision encoder to defend itself.
3.3. $\tau$-thresholded Weighted Counterattacks
Sec. 3.2 has discussed the idea of defending CLIP with its pre-trained vision encoder. An undesirable risk is that the counterattacks can hurt natural images as well. Based on the idea of TTC, we further propose $\tau$-thresholded weighted counterattacks to counterattack adversaries effectively while reducing the impact on clean images.

Figure 4. Ratio of $L_2$ drift due to a random noise on (a) CIFAR10 and (b) ImageNet. The value of $\tau$ is the average $\tau$ across 100 randomly selected samples.
Wu et al. [52] show that adversarial images are more vulnerable to a small noise than clean images. In this study, we find that adversarial images are actually more robust to small random noises, and are only vulnerable to sufficiently large noises, based on our analysis of adversarial images obtained by iterative attack methods (PGD [6] in our case). Specifically, we define a stochastic variable $\tau$ induced by a random noise $n \in \mathbb{R}^{C \times W \times H} \sim U(-\epsilon_{\text{random}}, \epsilon_{\text{random}})$, conditioned on an image $x \in \mathbb{R}^{C \times W \times H}$:

$$\tau = \frac{\|f_\theta(x+n) - f_\theta(x)\|}{\|f_\theta(x)\|} \qquad (4)$$

which can be interpreted as the ratio of the $L_2$ drift in the latent space when a random noise $n$ is applied on an image. The values of $\tau$ are reported in Fig. 4 for ImageNet and CIFAR10. We report more results and analysis on $\tau$ for other datasets in Appendix (Sec. 7). As can be seen from Fig. 4, when a small random noise ($\epsilon_{\text{random}} = 1/255, 4/255$) is imposed on adversarial images, the ratio of $L_2$ drift in the latent space is unusually small, showing that they are trapped in a toxic surrounding and rendered ‘falsely stable’ by an adversary. Adversarial images only become vulnerable when the strength of random noise is increased, as evidenced by the disproportionately rising values of $\tau$. We term this behaviour of adversarial images obtained by maximising CLIP’s classification loss ‘false stability’, and provide more theoretical analysis in Appendix (Sec. 7).
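The ratio in Eq. (4) can be sketched in a few lines of Python. The encoder below is a hypothetical stand-in (a tiny tanh layer) rather than CLIP's vision encoder, and the sketch averages over 100 noise draws on a single toy input, whereas Fig. 4 averages over 100 images; the noise strengths are illustrative.

```python
import math
import random

# Hypothetical stand-in for the vision encoder f_theta: a tiny tanh layer.
W = [[0.8, -0.3, 0.5, 0.1],
     [-0.2, 0.7, 0.4, -0.6],
     [0.3, 0.2, -0.9, 0.5]]

def toy_encoder(x):
    return [math.tanh(sum(w * a for w, a in zip(row, x))) for row in W]

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def tau(encoder, x, eps_random, rng):
    """Eq. (4): relative L2 drift of the embedding under uniform noise."""
    noise = [rng.uniform(-eps_random, eps_random) for _ in x]
    fx = encoder(x)
    fxn = encoder([a + n for a, n in zip(x, noise)])
    return l2_norm([p - q for p, q in zip(fxn, fx)]) / l2_norm(fx)

rng = random.Random(0)
x = [0.2, 0.5, 0.7, 0.4]  # toy "image" with four pixels
for eps_random in (1 / 255, 4 / 255, 16 / 255):
    mean_tau = sum(tau(toy_encoder, x, eps_random, rng) for _ in range(100)) / 100
    print(f"eps_random={eps_random:.4f}  mean tau={mean_tau:.4f}")
```

With a real encoder, comparing these averages between clean and adversarial inputs is what would reveal the false-stability gap described above; the toy encoder only demonstrates the computation.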
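Built upon the analysis above, we propose $\tau$-thresholded weighted counterattacks based on PGD [6]. Specifically, we follow a standard pipeline of PGD iterations, with the attack objective being Eq. (3). At the zero-th iteration, a random perturbation without any update $\delta^0_{\text{ttc}}$ is applied, where we compute the $\tau$ value based on Eq. (4) as an indicator. If it is higher than a user-defined threshold $\tau_{\text{thres}}$, meaning that it is not ‘falsely stable’, we halt the counterattack and return the random noise $\delta^0_{\text{ttc}}$. Otherwise, the counterattack is resumed. Note that the selection of $\tau_{\text{thres}}$ is dependent only on the $\tau$ value of clean images and the strength of random noise, irrespective of types and strengths of attacks. Since employing only one $\delta_{\text{ttc}}$ may be suboptimal, we weight the counterattack perturbation vectors across all steps:

$$w_j = \frac{\exp(\beta \cdot j)}{\sum_{k=0}^{N} \exp(\beta \cdot k)} \qquad (5)$$

$$\delta_{\text{ttc}} = \sum_{j=0}^{N} w_j \, \delta^{j}_{\text{ttc}} \qquad (6)$$

where $\beta > 0$ is a hyperparameter controlling the ascending rate of weights, $N$ is the number of steps for performing the counterattack, and $\delta^{j}_{\text{ttc}}$ is the counterattack perturbation obtained after $j$ steps. We summarise our $\tau$-thresholded weighted counterattacks in Algorithm 1.

Algorithm 1 $\tau$-thresholded weighted counterattacks.
Require: Test image $x$, pre-trained CLIP vision encoder $f_\theta$, counterattack budget $\epsilon_{\text{ttc}}$, stepsize $\alpha$, number of steps $N$, user-defined parameters $\tau_{\text{thres}}$ and $\beta$.
1: procedure TEST-TIME COUNTERATTACKS
2:   $\delta^0_{\text{ttc}} \sim U(-\epsilon_{\text{ttc}}, \epsilon_{\text{ttc}})$.
3:   Compute $\tau$ based on Eq. (4) using $\delta^0_{\text{ttc}}$.
4:   if $\tau \ge \tau_{\text{thres}}$ then
5:     $w_0 = 1$
6:     return $\delta_{\text{ttc}} = \delta^0_{\text{ttc}}$
7:   else if $\tau < \tau_{\text{thres}}$ then
8:     $w, \delta_{\text{ttc}} := \{\}, \{\}$
9:     for $i = 1, 2, \ldots, N$ do
10:      $\delta^i_{\text{ttc}} = \Pi(\delta^{i-1}_{\text{ttc}} + \alpha \nabla_\delta \|f_\theta(x+\delta^{i-1}_{\text{ttc}}) - f_\theta(x)\|)$
11:      $w_i = \exp(\beta \cdot i) / \sum_{j=0}^{N} \exp(\beta \cdot j)$ (Eq. (5))
12:      $w \leftarrow w_i$, $\delta_{\text{ttc}} \leftarrow \delta^i_{\text{ttc}}$
13:    end for
14:    return $\delta_{\text{ttc}} = \sum_{i=0}^{N} w_i \cdot \delta^i_{\text{ttc}}$ (Eq. (6))
15:  end if
16: end procedure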
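The procedure of Algorithm 1 can be sketched end to end in plain Python as follows. This is an illustrative sketch under several assumptions that are not taken from the paper: the encoder is a hypothetical toy tanh layer, gradients are estimated by central finite differences instead of automatic differentiation, the update uses the sign of the gradient (the common $L_\infty$ PGD choice, whereas the listing writes a plain gradient step), and the projection $\Pi$ is taken to be clipping to $[-\epsilon_{\text{ttc}}, \epsilon_{\text{ttc}}]$. The step size in the usage lines is likewise a toy value.

```python
import math
import random

W = [[0.8, -0.3, 0.5, 0.1],
     [-0.2, 0.7, 0.4, -0.6],
     [0.3, 0.2, -0.9, 0.5]]

def toy_encoder(x):
    """Hypothetical stand-in for the CLIP vision encoder f_theta."""
    return [math.tanh(sum(w * a for w, a in zip(row, x))) for row in W]

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def drift(encoder, x, delta):
    """Objective of Eq. (3): ||f(x + delta) - f(x)||."""
    fx = encoder(x)
    fxd = encoder([a + d for a, d in zip(x, delta)])
    return l2_norm([p - q for p, q in zip(fxd, fx)])

def numeric_grad(encoder, x, delta, h=1e-5):
    """Central finite-difference gradient of the drift w.r.t. delta."""
    grad = []
    for i in range(len(delta)):
        plus, minus = list(delta), list(delta)
        plus[i] += h
        minus[i] -= h
        grad.append((drift(encoder, x, plus) - drift(encoder, x, minus)) / (2 * h))
    return grad

def step_weights(n_steps, beta):
    """Eq. (5): softmax-style weights over steps 0..N, increasing with the step."""
    exps = [math.exp(beta * j) for j in range(n_steps + 1)]
    total = sum(exps)
    return [e / total for e in exps]

def ttc(encoder, x, eps_ttc, alpha, n_steps, tau_thres, beta, rng):
    delta = [rng.uniform(-eps_ttc, eps_ttc) for _ in x]    # step 0: random noise
    tau = drift(encoder, x, delta) / l2_norm(encoder(x))   # Eq. (4)
    if tau >= tau_thres:                                   # not falsely stable
        return delta
    deltas = [delta]
    for _ in range(n_steps):
        grad = numeric_grad(encoder, x, delta)
        delta = [
            max(-eps_ttc, min(eps_ttc, d + alpha * ((g > 0) - (g < 0))))
            for d, g in zip(delta, grad)
        ]
        deltas.append(delta)
    w = step_weights(n_steps, beta)
    # Eq. (6): weighted sum of the perturbations from all steps.
    return [sum(w[i] * deltas[i][k] for i in range(n_steps + 1)) for k in range(len(x))]

x = [0.2, 0.5, 0.7, 0.4]
eps_ttc, alpha = 4 / 255, 1 / 255
delta_ttc = ttc(toy_encoder, x, eps_ttc, alpha, n_steps=2,
                tau_thres=float("inf"), beta=2.0, rng=random.Random(0))
print(delta_ttc)
```

Because the weights in Eq. (5) sum to one and every per-step perturbation lies inside the $L_\infty$ ball, the weighted sum in Eq. (6) stays within the same budget, so no further projection is needed after the final step. The threshold is set to infinity in the usage lines only to force the counterattack branch on the toy encoder.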
4. Experiments
In this section, we conduct extensive experiments to verify the effectiveness of our test-time counterattack paradigm.
4.1. Experiment setup
Datasets. Following previous work [32,50], we conduct our experiments on 16 datasets, which include general object recognition datasets CIFAR10 [23], CIFAR100 [23], STL10 [10], ImageNet [13], Caltech101 [14] and Caltech256 [15]; fine-grained recognition datasets OxfordPets [37], Flowers102 [35], Food101 [5], StanfordCars [22]; scene recognition datasets SUN397 [53], Country211 [39]; domain-specific datasets FGVCAircraft [30], EuroSAT [19], DTD [9], and PCAM [4].
Implementation Details. We use a counterattack budget of $\epsilon_{\text{ttc}} = 4/255$ and a threshold $\tau_{\text{thres}} = 0.2$. We set the number of steps for counterattacks as $N = 2$, unless otherwise stated. $\beta$ is set to 2.0. All attacks and counterattacks in experiments are bounded by an $L_\infty$ radius. Note that the selection of $\tau_{\text{thres}}$ is dependent on $\epsilon_{\text{ttc}}$, and is determined based on the $\tau$ behaviour of clean images (Fig. 4). A higher $\tau_{\text{thres}}$ trades off more clean accuracy for robustness. We provide more details in Appendix (Sec. 9 and Sec. 11).
Baselines. Since there are no test-time defence methods for CLIP, we implement several test-time methods from existing adversarial robustness studies that do not rely on auxiliary networks. Specifically, we implement Anti-adversary [1] and Hedge Defense (HD) [52], which are the most closely related to our method. Anti-adversary [1] generates a perturbation to reinforce the confidence of the classifier given a test image. HD [52] performs a counterattack on the test image by increasing the cross-entropy w.r.t. all candidate classes, based on their finding that the loss function surface is smoother around the ground-truth class. We also adapt their method in our experiments with CLIP. For these two methods, we employ a test-time perturbation budget of 4/255, which is equal to the counterattack budget of our TTC. Following the original papers, the numbers of steps are 2 and 20 for Anti-adversary and HD, respectively. Considering that previous studies establish image transformations as a simple and effective defence method [17,38,54], we also implement test-time transformation ensembling (TTE) [38], which ensembles image transformations as a defence. We implement TTE with image flip, 4 crops, and image flip for all crops, totaling 9 augmentative views. As the simplest baseline, we also include random noise (RN), which adds a random perturbation noise with the same strength as our $\epsilon_{\text{ttc}}$, i.e., $n \sim U(-\epsilon_{\text{ttc}}, \epsilon_{\text{ttc}})$. As a useful reference, we also implement the adversarial finetuning methods TeCoA [32], PMG-AFT [50] and FARE [44] by finetuning the CLIP vision encoder with adversarial images on TinyImageNet, based on the objective functions proposed in their papers¹. We also finetune CLIP with clean images (CLIP-FT) on TinyImageNet. In the phase of finetuning, we use a 2-step PGD attack, with the stepsize $\alpha = 1/255$ and attack budget $\epsilon = 1/255$, following [32,50]. The learning rate of finetuning is 5e−5. After the finetuning phase, the finetuned models are deployed on 16 downstream datasets.
4.2. TTC on Original CLIP
Robustness under $\epsilon_a = 1/255$. We first test the robustness of all methods under the attack budget of $\epsilon_a = 1/255$. Following previous studies on CLIP’s adversarial robustness [32,50], we test all baselines under 10-step PGD attacks across 16 datasets, assuming that the attacker has full access to the weights and gradients of the deployed model, but not to the test-time operations made by the end user. We report the accuracy on both adversarial images and clean images in Tab. 1. It can be seen that all finetuning-based methods overfit to the dataset used for adversarial finetuning to varying extents, as evidenced by the higher accuracy of clean images than the original CLIP on TinyImageNet. The improved robustness on downstream datasets comes at a cost of a noticeable clean accuracy drop. Among test-time methods, both Anti-adversary and HD, which generate an additive perturbation based on an objective, lead to limited improvement of robust accuracy. Our TTC, which utilises the pre-trained vision encoder of CLIP to produce counterattacks, shows the best robust accuracy on most downstream datasets, usually with a large gain. We also retain the best clean accuracy compared to these two perturbation update methods. Adding random noise (RN) brings little robustness, even though the added noise is four times larger than the attack budget, i.e., $\epsilon_{\text{ttc}} \gg \epsilon_a$. RN can be viewed as a special case of TTC with the number of steps $N$ being 0. By exploiting the pre-trained model $f_\theta$ to optimize the noise, TTC significantly improves robustness. TTE ensembles a number of image transformations, which improves CLIP’s robustness at test time to an average accuracy of 33.28%. However, this gain is generally unstable across runs, as indicated by the high standard deviation of robust accuracy. Overall, our proposed TTC leads to consistent gains on robust accuracy (+36.47%) averaged on downstream datasets with a slight loss (-1.76%) on clean accuracy compared to the original CLIP, serving as a stable defence at inference time. We test the robustness under CW attacks [6] in Appendix (Sec. 8.1) for limited space. It can also be seen that the robustness gains come at a cost of accuracy reduction on clean images to varying extents across datasets. We provide more analysis in Appendix Sec. 9.

¹Unlike original implementations, we hold out 10% of the training set of TinyImageNet for evaluation in our implementation, without consulting downstream datasets. We also find preprocessing significantly affects the performance of finetuned models on CIFAR10, CIFAR100, and STL10. We follow the preprocessing pipeline recommended by CLIP (Tab. 1).

                            Adversarial Finetuning          Test-time Defence
(%)           Metric CLIP   CLIP-FT  TeCoA  PMG-AFT  FARE   RN          TTE         Anti-adv    HD          TTC (ours)  ∆
TinyImageNet  Rob.   0.19   2.19     48.64  46.12    25.47  0.28±0.02   19.52±4.21  4.46±0.23   3.11±0.05   20.64±0.17  +20.45
              Acc.   57.64  77.06    70.86  66.85    73.63  51.83±0.16  56.74±0.22  52.55±0.06  51.37±0.15  51.84±0.17  -5.80
CIFAR10       Rob.   0.74   3.34     33.61  40.66    19.65  2.01±0.08   41.35±6.14  12.39±0.07  17.22±0.45  28.75±0.18  +28.01
              Acc.   85.12  84.90    64.61  70.69    74.44  81.18±0.07  84.74±0.40  83.52±0.09  78.23±0.16  81.18±0.07  -3.94
CIFAR100      Rob.   0.26   0.90     18.95  22.52    11.40  0.67±0.05   20.06±4.03  5.73±0.04   3.86±0.10   14.31±0.25  +14.05
              Acc.   57.14  59.51    35.96  40.32    46.67  56.34±0.20  58.61±0.25  53.95±0.15  52.86±0.16  56.34±0.20  -0.80
STL10         Rob.   11.0   12.73    70.08  73.08    59.06  16.23±0.08  78.48±3.83  37.42±0.40  39.02±0.30  76.70±0.23  +65.70
              Acc.   96.40  94.49    87.40  88.56    91.72  95.85±0.04  96.26±0.04  95.45±0.08  89.50±0.07  95.85±0.04  -0.55
ImageNet      Rob.   1.15   0.93     18.89  21.43    14.00  1.77±0.03   31.01±4.40  8.67±0.05   6.63±0.05   38.41±0.07  +37.26
              Acc.   59.69  54.24    34.89  36.12    48.79  59.34±0.06  60.02±0.12  54.27±0.14  54.54±0.05  49.39±0.00  -10.30
Caltech101    Rob.   14.67  14.21    55.51  61.08    50.74  18.90±0.14  67.56±3.88  34.81±0.16  31.53±0.22  65.78±0.07  +51.11
              Acc.   85.66  83.63    71.68  75.45    80.95  86.61±0.10  85.84±0.09  84.02±0.10  82.33±0.04  86.53±0.07  +0.87
Caltech256    Rob.   8.47   6.76     43.19  45.91    38.79  11.33±0.04  60.09±4.03  25.36±0.17  23.48±0.10  60.11±0.04  +51.64
              Acc.   81.72  78.53    61.14  62.24    73.32  81.25±0.03  82.49±0.08  79.38±0.12  79.12±0.01  79.66±0.04  -2.06
OxfordPets    Rob.   1.04   2.10     38.35  41.18    31.07  1.86±0.01   50.33±7.30  20.42±0.22  12.04±0.16  57.87±0.15  +56.83
              Acc.   87.44  84.14    62.12  65.88    79.37  87.41±0.12  88.13±0.13  80.62±0.35  80.91±0.05  83.35±0.21  -4.09
Flowers102    Rob.   1.14   0.54     21.94  23.43    17.14  1.52±0.01   35.88±4.72  7.16±0.41   7.29±0.06   39.14±0.28  +38.00
              Acc.   65.46  53.37    36.80  37.00    47.98  64.62±0.19  65.18±0.22  62.66±0.14  58.22±0.12  64.16±0.19  -1.30
FGVCAircraft  Rob.   0.00   0.00     2.49   2.22     1.35   0.00±0.00   6.23±1.37   1.27±0.07   1.26±0.07   13.77±0.38  +13.77
              Acc.   20.10  14.04    5.31   5.55     10.86  19.25±0.18  20.19±0.36  15.88±0.23  16.36±0.03  18.00±0.16  -2.10
StanfordCars  Rob.   0.02   0.06     8.76   11.65    6.75   0.16±0.02   22.36±4.17  4.40±0.30   2.71±0.09   33.01±0.07  +32.99
              Acc.   52.02  42.11    20.91  25.44    38.68  52.14±0.09  52.73±0.31  36.21±0.27  44.28±0.02  48.16±0.16  -3.86
SUN397        Rob.   1.14   0.94     19.39  22.58    14.91  1.72±0.01   30.79±4.43  8.05±0.04   6.40±0.06   41.52±0.04  +40.38
              Acc.   58.50  55.73    36.69  37.98    52.42  59.69±0.06  59.12±0.08  56.00±0.04  53.17±0.02  55.13±0.06  -3.37
Country211    Rob.   0.04   0.03     1.78   2.12     0.85   0.06±0.00   3.05±0.89   0.67±0.05   0.47±0.02   7.09±0.04   +7.05
              Acc.   15.25  12.07    4.75   4.64     9.26   14.80±0.02  14.66±0.16  11.58±0.12  11.72±0.07  13.08±0.05  -2.17
Food101       Rob.   0.70   0.42     13.90  18.57    11.65  1.20±0.01   43.94±6.97  13.12±0.16  8.03±0.11   57.84±0.15  +57.14
              Acc.   83.88  64.86    29.98  36.61    55.31  83.44±0.04  83.96±0.02  75.81±0.22  80.30±0.05  82.18±0.02  -1.70
EuroSAT       Rob.   0.03   0.04     11.96  12.60    10.67  0.15±0.01   6.91±2.13   2.15±0.04   4.57±0.09   12.19±0.24  +12.16
              Acc.   42.59  27.64    16.58  18.53    21.88  53.24±0.09  44.38±1.60  36.78±0.18  39.08±0.06  53.24±0.09  +10.65
DTD           Rob.   2.98   2.39     17.61  14.95    15.64  3.71±0.09   23.90±2.34  5.62±0.07   11.63±0.17  27.32±0.25  +24.34
              Acc.   40.64  36.49    25.16  21.76    32.07  37.96±0.13  41.33±0.32  38.92±0.22  34.89±0.35  36.98±0.21  -3.66
PCAM          Rob.   0.08   1.11     48.24  46.18    16.23  0.41±0.01   10.62±3.22  4.97±0.12   44.74±0.17  52.85±0.20  +52.77
              Acc.   52.02  47.21    49.96  50.03    52.54  52.73±0.07  51.01±0.08  52.49±0.02  50.38±0.04  52.73±0.07  +0.71
Avg.          Rob.   2.70   2.91     26.54  28.76    20.00  3.86±0.02   33.28±3.98  12.01±0.04  13.81±0.06  39.17±0.02  +36.47
              Acc.   61.51  55.80    40.25  42.30    51.02  61.61±0.03  61.79±0.13  57.35±0.03  56.62±0.02  59.75±0.06  -1.76

Table 1. Classification accuracy (%) on both adversarial images (Rob.) under 10-step PGD attack at $\epsilon_a = 1/255$ and clean images (Acc.) across 16 datasets. We include the results on TinyImageNet because it is used to finetune the model for CLIP-FT, TeCoA [32], PMG-AFT [50], and FARE [44]. Comparison is made among our paradigm and test-time defences adapted from existing adversarial studies, with finetuning-based models implemented as a reference. We report the mean and standard deviation of test-time methods over 3 runs. The last column reports the gains w.r.t. original CLIP without any finetuning or test-time operations.

(%)              Rob.         Acc.
CLIP             0.09         61.51
CLIP-FT          0.96         55.80
TeCoA^1 [32]     6.51         40.25
TeCoA^4 [32]     10.03        35.57
PMG-AFT^1 [50]   7.03         42.30
PMG-AFT^4 [50]   10.70        37.58
FARE^1 [44]      1.50         51.02
FARE^4 [44]      3.67         46.17
RN               0.06±0.00    61.61±0.03
TTE [38]         7.79±3.23    61.79±0.13
Anti-adv [1]     0.53±0.00    57.32±0.03
HD [52]          1.19±0.01    56.62±0.02
TTC (ours)       20.63±0.05   55.99±0.06
∆                +20.54       -5.52

Table 2. Classification accuracy (%) on adversarial images (Rob.) under 10-step PGD at $\epsilon_a = 4/255$ and clean images (Acc.) averaged on 16 datasets. Superscripts indicate the attack budget used in the finetuning phase. The last row reports the gains compared to the original CLIP.
Robustness under $\epsilon_a = 4/255$. We further test the robustness of all methods under a high attack budget $\epsilon_a = 4/255$. For this setting, we increase the number of steps $N$ to 5 for more effective counterattacks, while other hyperparameters are unchanged. We also implement finetuning-based methods with attack budget $\epsilon = 4/255$ during finetuning, in this setting. We report the average accuracy across 16 datasets in Tab. 2 and provide the full table in Appendix (Tab. 5). It can be seen that a high attack budget at $\epsilon_a = 4/255$ degrades the accuracy of all models to a very low level. Anti-adversary [1] and HD [52] provide little to no robustness under this setting. TTE defends the model to a limited extent, but still with low reliability as indicated by the high standard deviation. In comparison, our proposed TTC provides a stable robustness gain averaged on 16 datasets.
4.3. TTC on Adversarially Finetuned CLIP
Since our method performs counterattacks using the victim
model at test time, it can also be employed on adversarially
finetuned models in a plug-in manner. In this section,
we apply TTC to finetuning-based models, assuming that
the attacker has full access to the deployed model, but not
to the operations by the end user. Note that we still employ
the original vision encoder θ of CLIP to compute τ
(Eq. (4)), because the sensitivity of adversarial finetuned
vision encoders is largely reduced. We report the results
in Tab. 3. It can be seen that TTC can further boost adversarial
robustness by exploiting the finetuned model to perform
counterattacks at test time. Specifically, TTC achieves
a robustness accuracy of 29.06% and 30.81% when employed
on TeCoA and PMG-AFT, surpassing the original
finetuned models by 2.52 and 2.05 points, respectively. A
significant gain of 13.85 points is achieved when we employ
TTC on top of FARE, an unsupervised adversarially
finetuned model. Interestingly, we find that adversarial finetuning
greatly reduces the sensitivity of CLIP to variations
in the pixel space, thus hurting the expressive power of the
pre-trained encoder. We provide more in-depth analyses of
such loss in Appendix (Sec. 10). Since our counterattacks
rely heavily on the expressiveness of the pre-trained vision
encoder θ, this also explains the smaller gains achieved
on adversarially finetuned models, compared to the original
CLIP. The large increase of robust accuracy on FARE implies
that adversarially finetuning CLIP in an unsupervised
manner better retains the expressiveness of the model.
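The averages and gains reported in Tab. 3 can be rechecked with a few lines of arithmetic. The sketch below copies the per-dataset robust accuracies at ϵa = 1/255 from Tab. 3 and recomputes the 16-dataset mean for each finetuned model with and without TTC; the names `TABLE3`, `mean` and `summarise` are illustrative and not part of any released code.

```python
# Robust accuracy (%) at eps_a = 1/255, copied from Tab. 3 in this dataset order:
# CIFAR10, CIFAR100, STL10, ImageNet, Caltech101, Caltech256, OxfordPets, Flower102,
# FGVCAircraft, StanfordCars, SUN397, Country211, Food101, EuroSAT, DTD, PCAM
TABLE3 = {
    "TeCoA": (
        [33.61, 18.95, 70.08, 18.89, 55.51, 43.19, 38.35, 21.94,
         2.49, 8.76, 19.39, 1.78, 13.90, 11.96, 17.61, 48.24],
        [34.68, 20.00, 71.65, 23.14, 59.44, 48.49, 42.66, 25.13,
         2.78, 12.09, 23.91, 2.49, 17.79, 12.75, 18.87, 48.44],
    ),
    "PMG-AFT": (
        [40.66, 22.52, 73.08, 21.43, 61.08, 45.91, 41.18, 23.43,
         2.22, 11.65, 22.58, 2.12, 18.57, 12.60, 14.95, 46.18],
        [42.17, 24.09, 73.55, 24.36, 63.72, 50.37, 43.96, 25.94,
         2.51, 14.97, 25.70, 2.57, 22.33, 13.94, 15.98, 46.52],
    ),
    "FARE": (
        [19.65, 11.40, 59.06, 14.00, 50.74, 38.79, 31.07, 17.14,
         1.35, 6.85, 14.90, 0.85, 11.65, 10.67, 15.64, 16.23],
        [35.55, 22.34, 76.65, 30.52, 67.39, 59.20, 51.53, 29.85,
         5.03, 20.46, 33.42, 4.04, 31.76, 15.49, 23.17, 35.79],
    ),
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def summarise(table):
    """Return {model: (avg without TTC, avg with TTC, gain)}, rounded to 2 decimals."""
    summary = {}
    for model, (base, with_ttc) in table.items():
        base_avg, ttc_avg = mean(base), mean(with_ttc)
        summary[model] = (
            round(base_avg, 2),
            round(ttc_avg, 2),
            round(ttc_avg - base_avg, 2),
        )
    return summary


SUMMARY = summarise(TABLE3)
for model, (base_avg, ttc_avg, gain) in SUMMARY.items():
    print(f"{model}: {base_avg:.2f} -> {ttc_avg:.2f} (gain {gain:+.2f})")
```

The averages recomputed this way match the Avg. Rob. column of Tab. 3; they differ by a few hundredths of a point from the figures quoted in the text above.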
4.4. Ablation studies
We experimentally find that the number of steps N of our
TTC greatly affects performance on both adversarial and
clean images. A general rule of thumb is that an attack with
a higher budget ϵa would require more steps of counterattacks.
In this section, we investigate the effect of N and
keep the other hyperparameters unchanged. We provide
analysis on other hyperparameters in Appendix (Sec. 11).
Fig. 5 provides the performance of CLIP employing TTC
on 14 datasets as N varies. For smaller attacks at ϵa =
1/255, it takes fewer than three steps for CLIP to defend
itself effectively on most datasets. Excessive counterattacks
can impair the images, as evidenced by the decline
after a certain number of steps. In comparison, a strong attack
ϵa = 4/255 requires a larger number of counterattack
steps to reach a reasonable accuracy, showing that they are
more resilient to counterattacks by the user side. TTC does
not impact accuracy on clean images significantly on most
datasets, except for SUN397 (Fig. 5d), OxfordPets (Fig. 5j),
StanfordCars (Fig. 5n) and ImageNet (Fig. 5l), where clean
images are found sensitive to the increase of N.
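The role of N can be made concrete with a toy sketch. The following is not the authors' implementation: it replaces CLIP's vision encoder with a small linear map (so the gradient is available in closed form), omits the τ-thresholding of Eq. (4), and uses hypothetical names (`encode`, `drift`, `counterattack`) together with assumed values for the counterattack budget and step size. It only illustrates the general shape of an N-step, L∞-bounded sign-gradient ascent that pushes an image's embedding away from its starting embedding.

```python
import random


def encode(weights, image):
    """Toy linear stand-in for a vision encoder: f(x) = W x."""
    return [sum(w * p for w, p in zip(row, image)) for row in weights]


def drift(weights, original, perturbed):
    """Squared L2 distance between the embeddings of two images."""
    f_orig, f_pert = encode(weights, original), encode(weights, perturbed)
    return sum((a - b) ** 2 for a, b in zip(f_orig, f_pert))


def drift_gradient(weights, delta):
    """Gradient of ||W (x + delta) - W x||^2 = ||W delta||^2, i.e. 2 W^T W delta."""
    w_delta = encode(weights, delta)
    return [
        2.0 * sum(row[j] * wd for row, wd in zip(weights, w_delta))
        for j in range(len(delta))
    ]


def counterattack(weights, image, eps, steps, step_size, seed=0):
    """Run `steps` sign-gradient ascent steps on the embedding drift.

    The perturbation starts from random noise (the gradient is zero at
    delta = 0) and is clipped to the L-infinity ball of radius `eps`
    after every step. Pixel-range clipping is left out for brevity.
    """
    rng = random.Random(seed)
    delta = [rng.uniform(-eps, eps) for _ in image]
    for _ in range(steps):
        grad = drift_gradient(weights, delta)
        delta = [
            max(-eps, min(eps, d + step_size * ((g > 0) - (g < 0))))
            for d, g in zip(delta, grad)
        ]
    return [p + d for p, d in zip(image, delta)]


# Hypothetical 2x3 "encoder" and 3-pixel "image", for illustration only.
W = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
x = [0.2, 0.6, 0.4]
EPS = 4 / 255  # assumed counterattack budget

for n in (0, 1, 3, 5):
    x_n = counterattack(W, x, EPS, steps=n, step_size=EPS / 2)
    print(n, drift(W, x, x_n))
```

On this toy encoder the embedding drift never decreases as N grows, because the objective is convex and each clipped step moves along the sign of the gradient. The decline after too many steps seen in Fig. 5 concerns classification accuracy on real images and is not captured by this sketch.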
Dataset          TeCoA    +TTC       ∆ PMG-AFT    +TTC       ∆    FARE    +TTC       ∆
CIFAR10          33.61   34.68   +1.07   40.66   42.17   +1.51   19.65   35.55  +15.90
CIFAR100         18.95   20.00   +1.05   22.52   24.09   +1.57   11.40   22.34  +10.94
STL10            70.08   71.65   +1.57   73.08   73.55   +0.47   59.06   76.65  +17.59
ImageNet         18.89   23.14   +4.25   21.43   24.36   +2.93   14.00   30.52  +16.52
Caltech101       55.51   59.44   +3.93   61.08   63.72   +2.64   50.74   67.39  +16.65
Caltech256       43.19   48.49   +5.30   45.91   50.37   +4.46   38.79   59.20  +20.41
OxfordPets       38.35   42.66   +4.31   41.18   43.96   +2.78   31.07   51.53  +20.46
Flower102        21.94   25.13   +3.19   23.43   25.94   +2.51   17.14   29.85  +12.71
FGVCAircraft      2.49    2.78   +0.29    2.22    2.51   +0.29    1.35    5.03   +3.68
StanfordCars      8.76   12.09   +3.33   11.65   14.97   +3.32    6.85   20.46  +13.61
SUN397           19.39   23.91   +4.52   22.58   25.70   +3.12   14.90   33.42  +18.52
Country211        1.78    2.49   +0.71    2.12    2.57   +0.45    0.85    4.04   +3.19
Food101          13.90   17.79   +3.89   18.57   22.33   +3.76   11.65   31.76  +20.11
EuroSAT          11.96   12.75   +0.79   12.60   13.94   +1.34   10.67   15.49   +4.82
DTD              17.61   18.87   +1.26   14.95   15.98   +1.03   15.64   23.17   +7.53
PCAM             48.24   48.44   +0.20   46.18   46.52   +0.34   16.23   35.79  +19.56
Avg. Rob.        26.54   29.02   +2.48   28.76   30.79   +2.03   20.00   33.89  +13.89
Avg. Acc.        40.25   39.85   −0.40   42.30   41.89   −0.41   51.02   49.91   −1.11
Table 3. TTC employed on adversarially finetuned models at test time. We report the robust accuracy (%) at ϵa = 1/255 and the robustness
gain (∆) by employing TTC for each dataset.

(a) CIFAR10 (b) DTD (c) STL10 (d) SUN397 (e) EuroSAT (f) Caltech101 (g) PCAM
(h) CIFAR100 (i) Food101 (j) OxfordPets (k) Flower102 (l) ImageNet (m) Caltech256 (n) StanfordCars
Figure 5. Effects of the number of steps N of counterattacks performed on CLIP. The green lines represent accuracy on clean images, and
red and blue lines accuracy on adversarial images at ϵa = 1/255 and ϵa = 4/255, respectively.

5. Limitations
Although we show TTC improves robustness of CLIP to
adversary that maximises the classification loss, there are
limitations as discussed below. Firstly, the robustness gain
of applying TTC on TeCoA [32] and PMG-AFT [50] is less
obvious. This is due to the reduced expressiveness of CLIP
caused by adversarial finetuning. We argue that for large
pre-trained models like CLIP, adversarial finetuning should
be employed sparingly, considering that a fundamental difference
from adversarial studies on non-foundational models
is that they have learned massive amounts of real-world
knowledge. Secondly, although TTC does not involve
training on adversarial images, it incurs more computation
expenses at inference time. Additionally, the number of
counterattack steps affects robustness performance. It can
be difficult to tune for the most suitable N, if the attack
strength ϵa is not known a priori (Fig. 5). We recommend
fewer steps (no more than three) if the attack is unknown
to avoid excessive counterattacks and unproductive computational
overhead. In the future, we intend to explore
methods to adjust the number of steps based on the test image.
Thirdly, according to adversarial robustness studies
on conventional models, test-time defence can be circumvented
by adaptive attacks [12]. We discuss in Appendix
(Sec. 12) possible adaptive attacks to break our counterattacks
assuming the worst scenario where the attacker has
access to the weights of the deployed CLIP model and TTC
performed by the end user.
6. Conclusion
We show that CLIP can leverage its own pre-trained vision
encoder to defend against adversary maliciously manipulated
to maximise its loss by performing counterattacks at
test time, without relying on any auxiliary networks. Based
on the finding that adversarial images are ‘falsely stable’,
we propose τ-thresholded counterattacks to guide the adversarial
image away from its original embedding in the latent
space. Experiments on 16 datasets show that TTC employed
on CLIP achieves stable and promising accuracy on
adversarial images. TTC is also shown to further enhance
robustness of adversarially finetuned CLIP models. We also
find that finetuning CLIP with adversarial images compromises
its own expressiveness, and recommend cautious use
of adversarial finetuning as the only approach to robustifying
large pre-trained models. Our paradigm is the first
test-time method to defend CLIP at inference time without
any finetuning. We hope this study will encourage future
research of robustifying approaches for CLIP alternative to
adversarial finetuning.
Acknowledgement
This work was supported by the MUR PNRR
project FAIR (PE00000013) funded by the NextGenerationEU
and the EU Horizon projects ELIAS
(No. 101120237) and AI4Trust (No. 101070190).
References
[1] Motasem Alfarra, Juan C Pérez, Ali Thabet, Adel Bibi,
Philip HS Torr, and Bernard Ghanem. Combating adversaries
with anti-adversaries. In Proceedings of the AAAI
Conference on Artificial Intelligence, pages 5992–6000,
2022. 2, 5, 7
[2] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated
gradients give a false sense of security: Circumventing
defenses to adversarial examples. In International conference
on machine learning, pages 274–283. PMLR, 2018. 1
[3] Tao Bai, Jinqi Luo, Jun Zhao, Bihan Wen, and Qian Wang.
Recent advances in adversarial training for adversarial robustness.
In Proceedings of the Thirtieth International Joint
Conference on Artificial Intelligence, pages 4312–4321.
International Joint Conferences on Artificial Intelligence
Organization, 2021. Survey Track. 1
[4] Babak Ehteshami Bejnordi, Mitko Veta, Paul Johannes
Van Diest, Bram Van Ginneken, Nico Karssemeijer, Geert
Litjens, Jeroen AWM Van Der Laak, Meyke Hermsen,
Quirine F Manson, Maschenka Balkenhol, et al. Diagnostic
assessment of deep learning algorithms for detection of
lymph node metastases in women with breast cancer. JAMA,
318(22):2199–2210, 2017. 5
[5] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool.
Food-101–mining discriminative components with random
forests. In Computer vision–ECCV 2014: 13th European
conference, Zurich, Switzerland, September 6-12, 2014,
proceedings, part VI 13, pages 446–461. Springer, 2014. 5
[6] Nicholas Carlini and David Wagner. Towards evaluating the
robustness of neural networks. In 2017 IEEE symposium on
security and privacy (SP), pages 39–57. IEEE, 2017. 1, 2, 3,
4, 7
[7] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou,
Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging
properties in self-supervised vision transformers. In
Proceedings of the IEEE/CVF international conference on
computer vision, pages 9650–9660, 2021. 1
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and
Geoffrey Hinton. A simple framework for contrastive learning
of visual representations. In International conference on
machine learning, pages 1597–1607. PMLR, 2020. 1
[9] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy
Mohamed, and Andrea Vedaldi. Describing textures in the
wild. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 3606–3613, 2014. 5
[10] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of
single-layer networks in unsupervised feature learning. In
Proceedings of the fourteenth international conference on
artificial intelligence and statistics, pages 215–223. JMLR
Workshop and Conference Proceedings, 2011. 5
[11] Francesco Croce and Matthias Hein. Reliable evaluation
of adversarial robustness with an ensemble of diverse
parameter-free attacks. In International conference on
machine learning, pages 2206–2216. PMLR, 2020. 1
[12] Francesco Croce, Sven Gowal, Thomas Brunner, Evan
Shelhamer, Matthias Hein, and Taylan Cemgil. Evaluating the
adversarial robustness of adaptive test-time defenses. In
International Conference on Machine Learning, pages 4421–
4435. PMLR, 2022. 2, 8
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
and Li Fei-Fei. Imagenet: A large-scale hierarchical image
database. In 2009 IEEE conference on computer vision and
pattern recognition, pages 248–255. IEEE, 2009. 5
[14] Li Fei-Fei, Robert Fergus, and Pietro Perona. One-shot
learning of object categories. IEEE transactions on pattern
analysis and machine intelligence, 28(4):594–611, 2006. 5
[15] Gregory Griffin, Alex Holub, Pietro Perona, et al. Caltech-
256 object category dataset. Technical report, Technical
Report 7694, California Institute of Technology Pasadena,
2007. 5
[16] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin
Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch,
Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi
Azar, et al. Bootstrap your own latent-a new approach
to self-supervised learning. Advances in neural information
processing systems, 33:21271–21284, 2020. 1
[17] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens
van der Maaten. Countering adversarial images using input
transformations. In International Conference on Learning
Representations, 2018. 2, 5
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceedings
of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016. 2
[19] Patrick Helber, Benjamin Bischke, Andreas Dengel, and
Damian Borth. Eurosat: A novel dataset and deep learning
benchmark for land use and land cover classification. IEEE
Journal of Selected Topics in Applied Earth Observations
and Remote Sensing, 12(7):2217–2226, 2019. 5
[20] Duhun Hwang, Eunjung Lee, and Wonjong Rhee. Aid-purifier:
A light auxiliary network for boosting adversarial
defense. Neurocomputing, 541:126251, 2023. 2
[21] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh,
Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom
Duerig. Scaling up visual and vision-language representation
learning with noisy text supervision. In International
conference on machine learning, pages 4904–4916. PMLR,
2021. 1
[22] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei.
3d object representations for fine-grained categorization. In
Proceedings of the IEEE international conference on computer
vision workshops, pages 554–561, 2013. 5
[23] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple
layers of features from tiny images. 2009. 5
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Imagenet classification with deep convolutional neural networks.
Advances in neural information processing systems,
25, 2012. 2
[25] Alexey Kurakin, Ian J Goodfellow, and Samy Bengio.
Adversarial examples in the physical world. In Artificial
intelligence safety and security, pages 99–112. Chapman and
Hall/CRC, 2018. 1
[26] Lin Li, Haoyan Guan, Jianing Qiu, and Michael Spratling.
One prompt word is enough to boost adversarial robustness
