FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for
Domain Generalized Segmentation
Dong Zhao1, Jinlong Li2, Shuang Wang1B, Mengyao Wu, Qi Zang1B, Nicu Sebe2, Zhun Zhong3
1School of Artificial Intelligence, Xidian University, Shaanxi, China
2Department of Information Engineering and Computer Science, University of Trento, Italy
3School of Computer Science and Information Engineering, Hefei University of Technology, China
Abstract
Vision Foundation Models (VFMs) excel in generalization due to large-scale pretraining, but fine-tuning them for Domain Generalized Semantic Segmentation (DGSS) while maintaining this ability remains a challenge. Existing approaches either selectively fine-tune parameters or freeze the VFMs and update only the adapters, both of which may underutilize the VFMs' full potential in DGSS tasks. We observe that domain-sensitive parameters in VFMs, arising from task and distribution differences, can hinder generalization. To address this, we propose FisherTune, a robust fine-tuning method guided by the Domain-Related Fisher Information Matrix (DR-FIM). DR-FIM measures parameter sensitivity across tasks and domains, enabling selective updates that preserve generalization and enhance DGSS adaptability. To stabilize DR-FIM estimation, FisherTune incorporates variational inference, treating parameters as Gaussian-distributed variables and leveraging pretrained priors. Extensive experiments show that FisherTune achieves superior cross-domain segmentation while maintaining generalization, outperforming both selective-parameter and adapter-based methods.
1. Introduction
Vision Foundation Models (VFMs), such as CLIP [46], DINOv2 [5], and EVA02 [15], have emerged as powerful tools in computer vision, achieving remarkable generalization across diverse downstream tasks, including cross-domain perception [35,53,56,71,76], few-shot [37,65,70,79] and zero-shot perception [26,31,54,72]. Pre-trained on massive data fields, VFMs encapsulate rich visual representations that can be transferred to numerous applications with minor adaptation [61,66]. Despite this, when it comes to Domain Generalized Semantic Segmentation (DGSS), where the goal is to segment unseen domain images without explicit access to their training data, effectively fine-tuning VFMs while preserving their strong generalization capabilities remains an open challenge.

[Figure 1 panels: (a) Adapter Tuning, (b) Selective Tuning, (c) Fisher Tuning (Ours). Each panel shows source data fed to a VFM. Legend: extra adapters; handmade parameters; domain-sensitive parameters.]
Figure 1. Comparison of principles of different VFM adjustment methods: (a) tuning by adapter insertion [60,66], (b) tuning by manually selected [57] or automatically selected parameters [55], (c) our method of tuning domain-sensitive parameters.

This work was supported in part by the National Natural Science Foundation of China No. 62271377, the Key Research and Development Program of Shaanxi Program No. 2021ZDLGY0106 and No. 2022ZDLGY0112, the Key Scientific Technological Innovation Research Project by Ministry of Education, the MUR PNRR project FAIR (PE00000013) funded by the NextGenerationEU and the EU Horizon projects ELIAS (No. 101120237) and AI4Trust (No. 101070190).
B Co-corresponding author.
Existing methods for adapting VFMs to DGSS tasks typically involve fine-tuning via adapter layers to remap pretrained tokens, as shown in Fig. 1(a) [60,66]. While this approach reduces overfitting, it does not fully leverage the internal representations of the VFM, as the core content of the model remains unchanged. Furthermore, when the pretraining tasks of the VFM (e.g., MAE, DINOv2) significantly deviate from the DGSS task, the adaptation improvement is limited. An alternative solution is to fine-tune a subset of parameters related to the target DGSS task, which activates the representations of VFMs, as illustrated in Fig. 1(b). However, we observe that traditional parameter selection methods, whether manually defined [57] or automatically chosen [36,62], fail to guarantee the generalization ability of the VFM, and in fact, perform even worse than simply adding adapters, as shown in Fig. 2.
This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.
Except for this watermark, it is identical to the accepted version;
the final published version of the proceedings is available on IEEE Xplore.
Figure 2. Comparison of average performance across multiple VFMs in DGSS experiments on GTA → Cityscapes + BDD100K + Mapillary using different fine-tuning methods, including adapter-based Rein [60], manually selected parameter-based VQT [57], adaptively selected parameter-based ChildTune [62], and our FisherTune.
Our experiments reveal that certain parameters of Vision Foundation Models (VFMs) are crucial for maintaining generalization, while others are key to adapting to new domains and tasks. Traditional selective fine-tuning methods focus solely on task-sensitive parameters, risking the disruption of the VFM's generalization ability. Therefore, we propose identifying and fine-tuning these domain-sensitive parameters. To address this, we introduce FisherTune, a novel robust fine-tuning method based on the Domain-Related Fisher Information Matrix (DR-FIM). This method allows us to preserve the generalization capability of the pre-trained VFM while activating its adaptability to DGSS tasks. Specifically, we first introduce the DR-FIM metric, which measures domain sensitivity by evaluating the fluctuation of parameters across different domains. Unlike FIM metrics, DR-FIM accounts for domain shifts, extending FIM to cross-domain tasks. To mitigate potential degradation in DR-FIM estimation, FisherTune employs variational inference. By treating model parameters as random variables following a Gaussian distribution and incorporating prior information from the pre-trained VFM, FisherTune stabilizes the DR-FIM estimation process, ensuring accuracy and robustness in cross-domain tasks. This novel estimation method improves computational efficiency and enhances the scalability of FisherTune to large-scale VFM models. Through extensive experiments on multiple DGSS benchmarks, we demonstrate that FisherTune consistently outperforms both selective-parameter and adapter-based methods. In summary, our contributions are:
• We propose FisherTune, a novel fine-tuning strategy that leverages the Fisher Information Matrix to selectively fine-tune VFMs for DGSS, preserving the generalization capabilities of VFMs while improving domain adaptability.
• We introduce Domain-Related FIM (DR-FIM), a novel metric matrix that quantifies the sensitivity of parameters to domain shifts.
• We employ variational inference to treat model parameters as Gaussian-distributed, ensuring stable and accurate DR-FIM estimation.
• We validate the effectiveness of FisherTune through extensive experiments, showing superior generalization compared to state-of-the-art methods.
2. Related Work
Domain Generalized Semantic Segmentation (DGSS) focuses on enhancing a model's ability to generalize to unseen domains by training on source domains [8,43,74]. Common strategies include domain-invariant representation learning methods and domain augmentation techniques. Domain-invariant representation learning approaches involve splitting learned features into domain-invariant [68,69] and domain-specific components [32,58,59,75], or employing meta-learning to develop more robust models [11,30,73]. Additionally, several methods have succeeded by learning feature normalization or whitening schemes [9,39,44]. Domain augmentation techniques, on the other hand, improve segmentation results through style transfer at image-level [23,43,45,78] or feature-level [6,8,77] and the introduction of additional data [29,38,67]. Some recent work has shown that text-guided feature enhancement [12,13] through CLIP [47] or synthetic data from diffusion models can also benefit model generalization [3,40,76].
Parameter-Efficient Fine-Tuning (PEFT) [18] customizes pre-trained models by fine-tuning a subset of parameters [17], improving performance and generalization with lower computational cost. The dominance of ViT [10] in vision tasks has spurred the development of PEFT methods. Adaptor-based Prompt Tuning [34,55] has shown strong performance in vision transfer tasks by adding learnable prompts. For instance, Visual Prompt Tuning (VPT) [25] introduces learnable prompts to each Transformer layer's input embeddings, AdaptFormer [7] adds a bottleneck fully connected layer parallel to the MLP block, and VQT [57] optimizes prompts through bypassing to effectively leverage intermediate features of VFMs.
In semantic segmentation, several works have applied visual prompts for model transfer. [35] uses frequency and spatial prompts to transfer pre-trained ViTs to low-level segmentation tasks, while [64] applies mask prompts to aid continual adaptation. [71] uses weak supervision for LoRA adapter [21] to adapt VFMs across domains. Rein [60] adds LoRA adapters to transform tokens and activate VFMs for DGSS. [66] enhances cross-domain adaptation by adding Fourier transform prompts to intermediate tokens of VFMs. While most visual semantic segmentation methods rely on adapter-based prompt fine-tuning, our work pioneers selective fine-tuning methods for visual model adaptation. Closely related to our approach are selective fine-tuning methods in NLP, like ChildTune [62] and Fishermask [36], which use Fisher matrices to identify task-sensitive parameters. In contrast, our method introduces domain-sensitive parameters and a stable estimation method to identify parameters highly sensitive to both tasks and domains, making it particularly suited for DGSS tasks.
3. Methodology
3.1. Preliminaries
Domain Generalized Semantic Segmentation (DGSS) aims to train models that can generalize across unseen domains. Formally, given a set of labeled source domains $D_s = \{(x_i, y_i)\}_{i=1}^{N_s}$, where $x_i$ represents the input image and $y_i$ represents the corresponding pixel-wise label, the goal is to train a model $f_\theta$ parameterized by $\theta$ that performs well on unseen target domains $D_t = \{x_j\}_{j=1}^{N_t}$, where the labels of $D_t$ are not available during training. The optimization objective for DGSS can be written as:
$$\min_{\theta}\ \mathbb{E}_{(x_i, y_i) \sim D_s}\left[\mathcal{L}(f_\theta(x_i), y_i)\right], \tag{1}$$
where $\mathcal{L}$ is the segmentation loss (e.g., cross-entropy) that evaluates the difference between the predicted segmentation map $f_\theta(x_i)$ and the ground truth $y_i$. The challenge lies in ensuring that the learned model $f_\theta$ generalizes well to unseen target domains $D_t$, which can be written as a generalization objective:
$$\min_{\theta}\ \mathbb{E}_{x_j \sim D_t}\ \mathcal{L}(f_\theta(x_j), y_j^*), \tag{2}$$
where $y_j^*$ represents the true (but unknown) labels of the target domain $D_t$. Since the labels $y_j^*$ are not available, the optimization focuses on learning domain-invariant representations in $f_\theta$, enabling strong performance across both seen and unseen domains.
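Vision Foundation Models (VFMs), such as CLIP [46], MAE [19], SAM [28], EVA02 [14], and DINOv2 [41], almost all use the Vision Transformer (ViT) architecture. ViT typically consists of $L$ stacked blocks, each containing two main submodules: multi-head attention (MHA) and a feed-forward network (FFN). Specifically, the attention score for each head is calculated as:
$$\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,\theta_o, \qquad \mathrm{head}_i = \mathrm{Softmax}\!\left(\frac{X\theta_{q_i}(X\theta_{k_i})^{\top}}{\sqrt{d_h}}\right) X\theta_{v_i},$$
where $\theta_o$ is the output projection matrix, and $\theta_{q_i}$, $\theta_{k_i}$, and $\theta_{v_i}$ represent the query, key, and value projections for head $i$. The FFN consists of two linear layers with a ReLU activation: $\mathrm{FFN}(X) = \mathrm{ReLU}(X\theta_{fn1} + \theta_{b1})\,\theta_{fn2} + \theta_{b2}$. Both the MHA and FFN are followed by residual connections and layer normalization. Let $\theta$ denote the set of the parameters of those VFMs that we aim to fine-tune:
$$\theta = \left[\theta_Q^{(1)}, \theta_K^{(1)}, \theta_V^{(1)}, \theta_{FFN}^{(1)}, \ldots, \theta_Q^{(L)}, \theta_K^{(L)}, \theta_V^{(L)}, \theta_{FFN}^{(L)}\right]^{\top}, \tag{3}$$
where $\theta_Q^{(l)} = [\theta_{q_1}, \ldots, \theta_{q_h}]$, and $\theta_K^{(l)}$ and $\theta_V^{(l)}$ are defined analogously.

[Figure 3 plot, with Freeze and Full as reference baselines. Embedded note (translated from Chinese): Average DGSS performance using DINOv2-Large on GTA → (Cityscapes, BDD, Mapillary). We observe that fine-tuning only some of the key parameters in Vision Foundation Models (VFMs), compared with full fine-tuning or completely freezing the other parameters, can significantly improve the model's generalization performance. This phenomenon inspired a hypothesis: some pre-trained parameters in VFMs are closely tied to adaptation to specific tasks and domains, while other parameters are more general and adapt better to different domains and tasks. Based on this observation, we propose identifying the "domain-sensitive parameters" in VFMs that are closely related to task and domain adaptability and fine-tuning them carefully, thereby improving adaptability on Domain Generalized Semantic Segmentation (DGSS) tasks while preserving the pre-trained generalization ability of VFMs.]
Figure 3. Observations of fine-tuning different VFM layers for DGSS experiments using DINOv2-Large under GTA → Cityscapes + BDD100K + Mapillary. It shows that fine-tuning different layers has different effects on the generalization performance of the VFMs. B means blocks.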
Fisher Information Matrix (FIM) is a fundamental concept in statistical estimation theory [16], which measures the amount of information that an observable random variable carries about an unknown parameter [2]. In the context of neural networks, the FIM provides insights into the sensitivity of the loss function with respect to the model parameters [36], reflecting the curvature of the loss landscape [48]. Mathematically, the FIM $F_\theta$ is defined as:
$$F_\theta = \mathbb{E}_{x}\,\mathbb{E}_{y \sim f_\theta(y|x)}\left[\nabla_\theta \mathcal{L}(f_\theta(x), y) \cdot \nabla_\theta \mathcal{L}(f_\theta(x), y)^{\top}\right], \tag{4}$$
where $F_\theta \in \mathbb{R}^{|\theta| \times |\theta|}$ is a symmetric matrix and $\nabla_\theta \mathcal{L}(\cdot)$ denotes the gradient of the loss function with respect to the parameters. Intuitively, the Fisher Information Matrix captures how much changing the parameters affects the model's output, thus quantifying the "informativeness" of the parameters.
3.2. Motivation
This study is motivated by an intriguing experimental observation. We grouped the $\theta_Q$, $\theta_K$, $\theta_V$, and $\theta_{FFN}$ components from different blocks of DINOv2 separately, fine-tuned all possible pairwise combinations of these groups, and analyzed their impact on model generalization, as shown in Fig. 3. We found that tuning specific layers led to different levels of generalization in VFMs, with some configurations even outperforming a fully tuned model. This suggests that certain parameters are critical for maintaining generalization, while others are key to adapting to new domains and tasks. Based on this, we hypothesize that VFMs contain domain-sensitive parameters suited to specific tasks and domains, while other parameters remain broadly generalizable. Consequently, we propose identifying and fine-tuning these domain-sensitive parameters, enhancing adaptability in DGSS for improved cross-domain performance.
3.3. FisherTune
In this section, we present our FisherTune, which fine-tunes ViT-based Vision Foundation Models (VFMs) guided by the Fisher Information Matrix (FIM) while preserving their generalization strengths. Our idea is to use FIM to find task- and domain-sensitive parameters in $\theta$ and fine-tune these sensitive parameters to improve the generalization ability of the model on unseen domains while maintaining the pre-trained knowledge of VFMs.
3.3.1 Domain-Related FIM
Domain-Related FIM. In Eq. 4, FIM quantifies the importance of model parameters for the current task by measuring the sensitivity of parameters to model output. However, FIM cannot sufficiently capture the behavior of parameters in cross-domain scenes, especially in DGSS tasks, where the sensitivity of different parameters to varying data distributions may differ significantly. For the DGSS task, we need a parameter estimation metric that can capture the variation of parameters across different domains.
To address this issue, we propose to calculate the Fisher information change $\Delta F_\theta$ between different data domains (seen domain and simulated unseen domain) to measure the sensitivity difference of parameters across domains. Formally, given the seen single-source domain $D_s = \{(x, y)\}$, $\Delta F_\theta$ is calculated as:
$$\Delta F_\theta = \frac{\left|F_\theta(x, y) - F_\theta(x', y)\right|}{\min\left(F_{\theta_i}(x), F_{\theta_i}(x')\right) + \epsilon}, \tag{5}$$
where $F_\theta(x, y)$ and $F_\theta(x', y)$ are the FIMs of the model $f_\theta$ in the seen and simulated unseen domains, and $\epsilon = 1 \times 10^{-8}$ is a small constant to prevent division by zero. The numerator, $|F_\theta(x, y) - F_\theta(x', y)|$, computes the difference between the FIM across domains, reflecting the model's varying sensitivity to parameter changes in different environments or data distributions. The denominator, $\min(F_{\theta_i}(x), F_{\theta_i}(x'))$, normalizes this difference to ensure the relative nature of the metric. A higher $\Delta F_{\theta_i}$ indicates that the parameter is more sensitive to domain changes.
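As a quick sanity check of Eq. (5), consider a single parameter with hypothetical diagonal Fisher values in the seen and simulated unseen domains. The numbers are illustrative assumptions, not values from the experiments.

```latex
F_{\theta_i}(x) = 0.02, \quad F_{\theta_i}(x') = 0.05
\;\Longrightarrow\;
\Delta F_{\theta_i}
  = \frac{|0.02 - 0.05|}{\min(0.02,\, 0.05) + 10^{-8}}
  \approx \frac{0.03}{0.02} = 1.5 .
```

A parameter whose Fisher value is unchanged by the simulated shift gets $\Delta F_{\theta_i} = 0$ regardless of its absolute magnitude, which is what makes the metric relative.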
To simulate an unseen domain sample $x'$ from a seen domain $x$, we leverage the uncertainty modeling method inspired by [33]. Specifically, the unseen domain feature $x'$ is simulated by modifying the feature statistics (mean and variance) of the seen domain sample $x$. The perturbed mean is generated as $\alpha(x) = \mu(x) + \epsilon_\mu \Sigma_\mu(x)$, where $\mu(x)$ represents the mean of the feature, $\Sigma_\mu(x)$ is the uncertainty estimation of the mean, and $\epsilon_\mu \sim \mathcal{N}(0, 1)$ is noise sampled from a standard normal distribution. Next, the perturbed variance is generated as $\beta(x) = \sigma(x) + \epsilon_\sigma \Sigma_\sigma(x)$, where $\sigma(x)$ is the standard deviation of the feature, $\Sigma_\sigma(x)$ is the uncertainty estimation of the standard deviation, and $\epsilon_\sigma \sim \mathcal{N}(0, 1)$. Using the perturbed mean $\alpha(x)$ and variance $\beta(x)$, the unseen domain sample $x'$ is generated with the following formula,
$$x' = \beta(x) \cdot \frac{x - \mu(x)}{\sigma(x)} + \alpha(x). \tag{6}$$

[Figure 4 diagram: circles comparing FIM and DR-FIM for the unshifted case $x = x'$ (FIM equal to DR-FIM) and for simulated samples $x'_1$, $x'_2$, $x'_3$ with increasing domain shift (FIM smaller than DR-FIM).]
Figure 4. Comparison of FIM and DR-FIM under different degrees of domain shift. The size of the circle indicates the value. It shows that DR-FIM is a generalization of FIM as it additionally considers the cross-domain sensitivity of parameters.
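The perturbation in Eq. (6) can be sketched for a single feature channel as follows. This is a minimal illustration in plain Python rather than the authors' implementation: the helper names `channel_stats` and `simulate_unseen` are hypothetical, and the uncertainty estimates $\Sigma_\mu$ and $\Sigma_\sigma$ are assumed to be supplied by the caller (in [33] they are derived from batch statistics).

```python
import math
import random


def channel_stats(x):
    """Mean and (population) standard deviation of one feature channel."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return mu, math.sqrt(var + 1e-12)  # small constant avoids division by zero


def simulate_unseen(x, unc_mu, unc_sigma, eps_mu=None, eps_sigma=None):
    """Eq. (6): re-normalize x, then re-scale and shift with perturbed statistics.

    unc_mu / unc_sigma play the role of Sigma_mu(x) / Sigma_sigma(x) and are
    assumed inputs here. The sampled noise values are returned because the
    DR-FIM weighting in Eq. (7) also depends on them.
    """
    mu, sigma = channel_stats(x)
    if eps_mu is None:
        eps_mu = random.gauss(0.0, 1.0)
    if eps_sigma is None:
        eps_sigma = random.gauss(0.0, 1.0)
    alpha = mu + eps_mu * unc_mu            # perturbed mean
    beta = sigma + eps_sigma * unc_sigma    # perturbed standard deviation
    x_new = [beta * (v - mu) / sigma + alpha for v in x]
    return x_new, eps_mu, eps_sigma
```

With both noise values set to zero the function returns the input unchanged, so the seen domain is the zero-shift special case of the simulated domain.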
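Using $\Delta F_\theta$, we introduce a unified metric, Domain-Related FIM (DR-FIM), to account for both task-sensitive and domain-sensitive parameters as,
$$\mathrm{DRF}_\theta = \underbrace{F_\theta(x, y)}_{\text{task-sensitive}} + \underbrace{e^{-(\epsilon_\mu + \epsilon_\sigma)}\,\frac{\left|F_\theta(x, y) - F_\theta(x', y)\right|}{\min\left(F_{\theta_i}(x), F_{\theta_i}(x')\right) + \epsilon}}_{\text{domain-sensitive}}. \tag{7}$$
$\mathrm{DRF}_\theta$ is a linear combination of $F_\theta$ and $\Delta F_\theta$, and the combination coefficients are determined by the domain shift control factors $\epsilon_\mu$ and $\epsilon_\sigma$. When $\epsilon_\mu$ and $\epsilon_\sigma$ are large, the simulated domain shift is significant, and $\Delta F_\theta$ is scaled appropriately to balance with $F_\theta$. The relationship between the simulated domain shift and the numerical values of DR-FIM and FIM is shown in Fig. 4.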
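Given per-parameter (diagonal) Fisher values for the seen and simulated unseen inputs, Eqs. (5) and (7) reduce to elementwise arithmetic. The sketch below assumes those Fisher values are already available as plain lists; the function names `delta_fisher` and `dr_fim` are hypothetical and chosen only for illustration.

```python
import math

EPS = 1e-8  # the small constant epsilon from Eq. (5)


def delta_fisher(f_seen, f_unseen, eps=EPS):
    """Eq. (5): relative change of each diagonal Fisher entry across domains."""
    return [abs(a - b) / (min(a, b) + eps) for a, b in zip(f_seen, f_unseen)]


def dr_fim(f_seen, f_unseen, eps_mu, eps_sigma, eps=EPS):
    """Eq. (7): task-sensitive term plus the weighted domain-sensitive term."""
    weight = math.exp(-(eps_mu + eps_sigma))
    delta = delta_fisher(f_seen, f_unseen, eps)
    return [f + weight * d for f, d in zip(f_seen, delta)]
```

For a parameter whose Fisher value does not change between the two inputs, the second term vanishes and DR-FIM equals the ordinary FIM, matching the unshifted case in Fig. 4.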
3.3.2 Stable Estimation of DR-FIM
Although $\mathrm{DRF}_\theta$ provides an estimation of the domain sensitivity of VFM parameters, the dimension of the parameter list $\theta$ is very high, which makes direct calculation impractical in both computation and storage, i.e., $O(|\theta|^2)$. Therefore, it is necessary to approximate the FIM to reduce the computational complexity.
Diagonal Approximation. Following [36], by assuming that the off-diagonal elements are negligible, the FIM can be efficiently approximated by using a diagonal approximation, i.e.,
$$\hat{F}_\theta = \mathrm{diag}\left(F_{\theta_1}, F_{\theta_2}, \ldots, F_{\theta_{|\theta|}}\right), \tag{8}$$
where each individual $F_{\theta_n}$ can be calculated as,
$$F_{\theta_n} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{y_i \sim f_\theta(y_i|x_i)}\left(\nabla_{\theta_n} \mathcal{L}(f_\theta(x_i), y_i)\right)^2. \tag{9}$$
In the above diagonal approximation, only the individual contribution of each parameter to the loss is considered, while the interaction terms between different parameters are ignored. This diagonal approximation effectively simplifies an $O(|\theta|^2)$ matrix to a vector of length $O(|\theta|)$, greatly reducing the computational complexity.
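To make Eq. (9) concrete, the following sketch computes the diagonal Fisher for a toy binary logistic-regression model, used here purely as a stand-in for the segmentation network. For this model the gradient of the negative log-likelihood with respect to $\theta_n$ is $(p - y)\,x_n$, and the expectation over $y \sim f_\theta(y|x)$ is taken exactly by enumerating both labels. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def diagonal_fisher(theta, inputs):
    """Eq. (9) for a toy logistic model p(y=1|x) = sigmoid(theta . x).

    Labels are drawn from the model's own predictive distribution, as in
    Eq. (9); with two classes the expectation is an exact weighted sum.
    """
    n_params = len(theta)
    fisher = [0.0] * n_params
    for x in inputs:
        p1 = sigmoid(sum(t * v for t, v in zip(theta, x)))
        for y, prob in ((1, p1), (0, 1.0 - p1)):
            for n in range(n_params):
                grad_n = (p1 - y) * x[n]        # d(NLL)/d(theta_n)
                fisher[n] += prob * grad_n ** 2
    return [f / len(inputs) for f in fisher]
```

The result is a vector with one entry per parameter, which is the storage saving that the diagonal approximation buys relative to the full matrix in Eq. (4).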
Variational Estimation of DR-FIM. Diagonalization provides an efficient approximation method for computing the FIM. However, due to the differences between the pretraining tasks of the VFM and the DGSS task, the estimated FIM parameters often exhibit high sensitivity, leading to inaccuracies (See Fig. 6). To address this issue, we introduce a variational inference approach [4], treating the fine-tuning model's parameters $\theta$ as random variables following a Gaussian distribution. This introduces an additional regularization term into the FIM estimation, helping to learn a smoother prior distribution. Consequently, variational inference stabilizes the gradient update process during FIM calculation, mitigating the instability caused by high gradient noise.
Specifically, we assume that the posterior distribution of the model parameters follows a Gaussian distribution: $q(\theta) = \mathcal{N}(\hat{\theta}, \Lambda^{-1})$, where $\hat{\theta}$ is the mean of the current parameter estimation and $\Lambda^{-1}$ is the covariance matrix of the parameters. To preserve the pre-trained knowledge of VFMs, we introduce the prior parameter distribution as a regularizer to prevent degradation during prediction:
$$p(\theta) = \mathcal{N}(\theta_{pt}, \tau^2 I), \tag{10}$$
where $\theta_{pt}$ denotes the pre-trained parameters of VFMs, $\tau^2$ is the variance controlling the flexibility of the fine-tuned parameters, and $I$ is the identity matrix. We utilize the variational free energy (also called the evidence lower bound, ELBO [20]) as the loss function for optimizing $\Lambda$,
$$\mathcal{L}(\hat{\theta}, \Lambda^{-1}) = \mathbb{E}_{\theta \sim q(\theta)}\left[\mathcal{L}(\theta)\right] + \gamma\,\mathrm{KL}\left(q(\theta)\,\|\,p(\theta)\right), \tag{11}$$
where $\gamma$ is the regularization coefficient, controlling the influence of the prior, and $\mathrm{KL}(q(\theta)\,\|\,p(\theta))$ is the Kullback-Leibler divergence between the posterior $q(\theta)$ and the prior $p(\theta)$.
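Connection with DR-FIM. To simplify the first term in Eq. (11), we perform a second-order Taylor expansion of the loss function $\mathcal{L}(\theta)$ around the current parameter estimate $\theta = \hat{\theta}$. Taking the expectation over the weight distribution $q(\theta)$, $\mathbb{E}_{\theta \sim q(\theta)}[\mathcal{L}(\theta)] \approx \mathcal{L}(\hat{\theta}) + \frac{1}{2}\mathrm{Tr}\left(\nabla^2_\theta \mathcal{L}(\hat{\theta})\,\Lambda^{-1}\right)$. According to the definition of FIM and its connection with the Hessian matrix [16], the FIM can be approximated by the Hessian matrix near $\hat{\theta}$, $\nabla^2_\theta \mathcal{L}(\hat{\theta}) \approx F_\theta$. Thus,
$$\mathbb{E}_{\theta \sim q(\theta)}\left[\mathcal{L}(\theta)\right] \approx \mathcal{L}(\hat{\theta}) + \frac{1}{2}\mathrm{Tr}\left(F_\theta \Lambda^{-1}\right). \tag{12}$$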
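The second term in Eq. (11), the KL divergence between two Gaussian distributions, simplifies to
$$\mathrm{KL}\left(q(\theta)\,\|\,p(\theta)\right) = \frac{1}{2}\left[\tau^{-2}\,\mathrm{Tr}(\Lambda^{-1}) + \tau^{-2}\,\|\hat{\theta} - \theta_{pt}\|^2 - k + k \ln \tau^2 + \ln \det \Lambda\right], \tag{13}$$
where $k = |\theta|$ is the number of parameters.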
Substituting Eq. (12) and Eq. (13) back into Eq. (11), and taking the derivative of the loss function with respect to $\Lambda$, we obtain (see Appendix A for the detailed derivation),
$$F_\theta = \gamma\Lambda - \gamma\tau^{-2} I. \tag{14}$$
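The step from Eqs. (11) to (13) to Eq. (14) can be outlined as follows. This is only a sketch (the paper's full derivation is in its Appendix A): it treats $\Sigma = \Lambda^{-1}$ as the free variable, uses $\ln\det\Lambda = -\ln\det\Sigma$, and drops terms that do not depend on $\Sigma$.

```latex
\mathcal{L}(\hat\theta, \Sigma) \approx \mathcal{L}(\hat\theta)
  + \tfrac{1}{2}\operatorname{Tr}(F_\theta \Sigma)
  + \tfrac{\gamma}{2}\left[\tau^{-2}\operatorname{Tr}(\Sigma) - \ln\det\Sigma\right]
  + \text{const},
\qquad
\frac{\partial \mathcal{L}}{\partial \Sigma}
  = \tfrac{1}{2}F_\theta + \tfrac{\gamma}{2}\left(\tau^{-2} I - \Sigma^{-1}\right) = 0
\;\Longrightarrow\;
F_\theta = \gamma\,\Lambda - \gamma\,\tau^{-2} I .
```

The identities used are $\partial\operatorname{Tr}(F\Sigma)/\partial\Sigma = F$ for symmetric $F$ and $\partial\ln\det\Sigma/\partial\Sigma = \Sigma^{-1}$; differentiating with respect to $\Lambda$ directly leads to the same stationarity condition.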
Then, the DR-FIM defined in Eq. (7) is updated as,
$$\mathrm{DRF}_\theta = \gamma\left(\Lambda_x - \tau^{-2} I + e^{-(\epsilon_\mu + \epsilon_\sigma)}\,\frac{\left|\Lambda_x - \Lambda_{x'}\right|}{\min\left(\Lambda_x, \Lambda_{x'}\right) + \frac{\epsilon}{\gamma}}\right). \tag{15}$$
It shows that the DR-FIM can be estimated from the matrix $\Lambda$ (whose inverse $\Lambda^{-1}$ is the posterior covariance), with $\gamma$ and $\tau$ as hyperparameters. Using Eq. (15) to estimate the DR-FIM has several advantages over using Eq. (7) and Eq. (9). 1) Stability in Estimation: Eq. (15) introduces a more stable estimation mechanism by incorporating prior knowledge from the pre-trained VFMs $p(\theta)$ in Eq. (10) and the posterior distribution $q(\theta)$. This approach helps prevent the degradation of FIM estimation caused by the task shift between VFM pre-training tasks and DGSS tasks, ensuring more robust performance across unseen domains. 2) Computational Efficiency: $\Lambda$ can be efficiently computed by directly minimizing the loss $\mathcal{L}(\hat{\theta}, \Lambda^{-1})$ in Eq. (11) using the reparameterization trick [27] and stochastic gradient variational Bayes [1], reducing both computational and memory overhead compared to traditional FIM estimation in Eq. (9).
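Once diagonal entries of $\Lambda$ have been fitted on seen inputs ($\Lambda_x$) and on simulated unseen inputs ($\Lambda_{x'}$), Eqs. (14) and (15) are again elementwise. The sketch below assumes those entries are given as lists and reads the denominator of Eq. (15) as $\min(\Lambda_x, \Lambda_{x'}) + \epsilon/\gamma$; the function names are hypothetical, and the optimization of Eq. (11) that would produce $\Lambda$ is not shown.

```python
import math


def fisher_from_precision(precision, gamma, tau):
    """Eq. (14), elementwise: F = gamma * Lambda - gamma * tau^-2."""
    return [gamma * (lam - tau ** -2) for lam in precision]


def dr_fim_from_precision(prec_seen, prec_unseen, gamma, tau,
                          eps_mu, eps_sigma, eps=1e-8):
    """Eq. (15), elementwise, for diagonal Lambda_x and Lambda_x'."""
    weight = math.exp(-(eps_mu + eps_sigma))
    scores = []
    for lam_x, lam_xp in zip(prec_seen, prec_unseen):
        domain_term = weight * abs(lam_x - lam_xp) / (min(lam_x, lam_xp) + eps / gamma)
        scores.append(gamma * (lam_x - tau ** -2 + domain_term))
    return scores
```

If the fitted entries agree across the two domains, the score reduces to the task-sensitive term of Eq. (14), so the domain term only contributes for parameters whose posterior precision moves under the simulated shift.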
3.3.3 Training Schedule of FisherTune
We follow Rein [60], which adds a mask decoder to the backbone network of VFMs as a segmentation model for DGSS. Different from Rein, we do not modify the backbone structure or add additional adapters. During training, we first fix the backbone network of VFMs and use the original data to warm up the decoder to adapt the whole segmentation model to the DGSS task. After that, we tune the VFMs and decoder by our FisherTune as follows.
Algorithm 1 FisherTune Process
1: Input: source dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$; hyperparameters: regularization coefficient $\gamma$, variance coefficient $\tau$, warm-up iterations $T_1$, FIM estimation iterations $T_2$, number of tune iterations $T_3$; pretrained VFM $\theta_{VFM}$; segmentation decoder $\theta_{dec}$.
2: Step 1: Warm-up decoder:
3:   Train the decoder $\theta_{dec}$ on $D$ for $T_1$ steps, with $\theta_{VFM}$ frozen.
4: Step 2: Sampling and DR-FIM Calculation:
5:   for $t = 1$ to $T_2$ do
6:     Sample a batch $(x, y) \sim D$
7:     Simulate unseen domain data $x'$ via Eq. 6.
8:     Optimize $\Lambda$ via Eq. 11
9:     Estimate DR-FIM using the optimized $\Lambda$ via Eq. 15
10: Step 3: Parameter Fine-Tuning:
11:   for $t = 1$ to $T_3$ do
12:     Sample a batch $(x, y) \sim D$
13:     Select parameters $\hat{\theta}_{VFM}$ via Eq. 16
14:     Update the selected $\hat{\theta}_{VFM}$ and $\theta_{dec}$ via Eq. 1.
15: Output: Fine-tuned $\theta_{VFM}$ and $\theta_{dec}$.
In FisherTune, the selection of parameters for fine-tuning is guided by the DR-FIM ($\mathrm{DRF}_\theta$), which quantifies the sensitivity of parameters to task and domain shifts. To optimize the fine-tuning process, we propose a dynamic training schedule that adjusts the number of trainable parameters based on their DR-FIM values. At the beginning of training, we fine-tune only the most sensitive $\delta_{min}\%$ of parameters, as ranked by $\mathrm{DRF}_{\theta_i}$. As training progresses, we gradually increase the percentage of fine-tuned parameters, reaching $\delta_{max}\%$ by the end. This ensures that the model starts with a focused fine-tuning process, targeting only the most critical parameters, and progressively expands the fine-tuning scope as the model becomes more stable. Formally, at each training step $t$, the dynamic threshold $\mathrm{DRF}_{thresh}(t)$ is updated as follows:
$$\mathrm{DRF}_{thresh}(t) = \delta_{min} + (\delta_{max} - \delta_{min}) \cdot \exp\left(-\frac{t}{T}\right), \tag{16}$$
where $T$ is the total number of training steps. Parameters with $\mathrm{DRF}_\theta$ values higher than the threshold $\mathrm{DRF}_{thresh}(t)$ will be selected for training. The detailed fine-tuning process of our FisherTune is in Algorithm 1.
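The schedule in Eq. (16) and the thresholding rule can be sketched as below. The function names are hypothetical, and the sketch assumes the DR-FIM scores have been expressed on the same scale as $\delta_{min}$ and $\delta_{max}$ (for example as percentile ranks), which the text does not specify. As written, Eq. (16) evaluates to $\delta_{max}$ at $t = 0$ and to $\delta_{min} + (\delta_{max} - \delta_{min})/e$ at $t = T$; the code simply evaluates the formula.

```python
import math


def drf_threshold(t, total_steps, delta_min, delta_max):
    """Eq. (16): exponentially decaying threshold over training steps."""
    return delta_min + (delta_max - delta_min) * math.exp(-t / total_steps)


def select_mask(drf_scores, threshold):
    """Mark parameters whose DR-FIM score exceeds the current threshold."""
    return [score > threshold for score in drf_scores]
```

Because the threshold decreases monotonically in $t$, the set of selected parameters can only grow as training proceeds, which matches the described expansion of the fine-tuning scope.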
4. Experiments
4.1. Datasets & Setup
See Appendix B.
4.2. Comparison with State-of-the-art Alternatives
GTAV → C, B, M. Table 1 demonstrates that our approach significantly outperforms other fine-tuning methods across multiple vision foundation models (VFMs). Compared to adapter-based methods (e.g., LoRA and Rein), our approach achieves an average of 4.3% higher mIoU than Rein across five VFM models. Additionally, it surpasses the self-focused parametric fine-tuning method VQT by 3.1% on average. Notably, for models with a substantial gap between pre-training and downstream tasks, such as MAE and EVA02, adapter methods yielded modest improvements of 1.3% and 1.7% mIoU, respectively, whereas our approach achieved 4.6% and 6.6% improvements. Besides, we added comparisons with the state-of-the-art methods using VFMs, and our method remains competitive. The tqdm [42] and VLTSeg [24] methods leverage features of the language model, while Rein-series methods and ours focus on visual models. These results highlight our method's enhanced adaptability to downstream tasks and its significant improvement in model generalization.

GTAV → Cityscapes (Citys) + BDD100K (BDD) + Mapillary (Map)
VFM type                Fine-tune Method   Params    Citys  BDD    Map    Avg.
CLIP [46] (ViT-Large)   Full               304.20M   51.3   47.6   54.3   51.1
                        Freeze             0M        53.7   48.7   55.0   52.5
                        LoRA [22]          0.79M     54.0   49.8   55.1   53.0
                        VPT [25]           3.69M     54.0   51.8   57.5   54.4
                        Rein [60]          2.99M     57.1   54.7   60.5   57.4
                        VQT [57]           3.01M     54.3   51.2   56.7   55.3
                        ChildTune [63]     15.21M    57.9   53.4   58.2   56.5
                        Ours               15.21M    59.2   57.5   61.0   59.2
MAE [19] (Huge)         Full               304.20M   53.7   50.8   58.1   54.2
                        Freeze             0M        43.3   37.8   48.0   43.0
                        LoRA [22]          0.79M     44.6   38.4   52.5   45.2
                        VPT [25]           3.69M     52.7   50.2   57.6   53.5
                        Rein [60]          2.99M     55.0   49.3   58.6   54.3
                        VQT [57]           3.01M     53.3   50.3   57.7   53.8
                        ChildTune [63]     15.21M    55.4   50.6   58.1   54.7
                        Ours               15.21M    56.6   51.9   59.7   56.1
SAM [28] (Huge)         Full               632.18M   57.6   51.7   61.5   56.9
                        Freeze             0M        57.0   47.1   58.4   54.2
                        LoRA [22]          0.79M     57.4   47.7   58.4   54.5
                        VPT [25]           3.69M     56.3   52.7   57.8   55.6
                        Rein [60]          2.99M     59.6   52.0   62.1   57.9
                        VQT [57]           3.01M     56.7   53.9   59.3   56.6
                        ChildTune [63]     15.21M    60.8   49.6   61.2   57.2
                        Ours               15.21M    60.9   54.4   63.9   59.7
EVA02 [15] (Large)      Full               304.20M   62.1   56.2   64.6   60.9
                        LoRA [22]          0.79M     55.5   52.7   58.3   55.5
                        AdaptFormer [7]    3.17M     63.7   59.9   64.2   62.6
                        VPT [25]           3.69M     62.2   57.7   62.5   60.8
                        Rein [60]          2.99M     65.3   61.1   63.9   63.4
                        VQT [57]           3.01M     61.3   55.1   62.2   59.5
                        ChildTune [63]     15.21M    61.6   59.3   62.3   61.1
                        Ours               15.21M    65.8   61.5   66.0   64.4
DINOv2 [41] (ViT-Large) Full               304.20M   63.7   57.4   64.2   61.7
                        LoRA [22]          0.79M     65.2   58.3   64.6   62.7
                        AdaptFormer [7]    3.17M     64.9   59.0   64.2   62.7
                        VQT [25]           3.01M     64.6   59.0   65.7   63.1
                        Rein [60]          2.99M     66.4   60.4   66.1   64.3
                        ChildTune [63]     15.21M    65.6   59.3   65.3   63.4
                        Ours               15.21M    68.2   63.3   68.0   66.5
Comparison with state-of-the-art methods using VFMs:
EVA02                   VLTSeg [24]        304.2M    65.3   58.3   66.0   63.2
DINOv2                  SDT [66]           6.94M     68.1   61.6   67.7   65.8
CLIP+SAM                CLOUDS [42]        304.2M    60.2   57.4   67.0   61.5
EVA02                   tqdm [42]          304.2M    68.9   59.2   70.1   66.1
EVA02                   Ours               15.21M    65.8   61.5   66.0   64.4
DINOv2                  Ours               15.21M    68.2   63.3   68.7   66.6
Table 1. Performance and trainable parameters comparison with the proposed FisherTune across multiple VFMs as backbones under the GTAV → Cityscapes (Citys) + BDD100K (BDD) + Mapillary (Map) generalization setting.

Fine-tune Method Params    road   side.  build. wall   fence  pole   light  sign   vege.  terr.  sky    pers.  rider  car    truck  bus    train  moto.  bicy.  mIoU
Cityscapes → BDD100K
DINOv2 [5] (Large)
Full             304.20M   89.0   44.5   89.6   51.1   46.4   49.2   60.0   38.9   89.1   47.5   91.7   75.8   48.2   91.7   52.5   82.9   81.0   30.4   49.9   63.7
Freeze           0M        92.1   55.2   90.2   57.2   48.5   49.5   56.7   47.7   89.3   47.8   91.1   74.2   46.7   92.2   62.6   77.5   47.7   29.6   47.2   63.3
REIN [60]        2.99M     92.4   59.1   90.7   58.3   53.7   51.8   58.2   46.4   89.8   49.4   90.8   73.9   43.3   92.3   64.3   81.6   70.9   40.4   54.0   66.4
VQT [57]         3.01M     88.3   49.9   85.9   50.7   47.9   44.3   55.6   39.2   86.1   42.8   87.5   71.3   45.4   89.4   53.5   82.6   74.9   46.1   57.4   63.1
ChildTune [62]   15.21M    92.1   56.1   91.0   58.8   46.9   52.0   58.6   47.2   90.8   47.9   93.3   72.0   47.1   93.0   63.9   76.2   47.9   28.8   48.3   63.8
Ours             15.21M    92.1   55.4   90.2   58.9   50.9   54.5   59.8   49.1   92.5   52.8   91.0   73.7   51.5   92.7   67.4   82.9   72.8   44.3   54.1   67.7
EVA02 [15] (Large)
Full             304.20M   89.3   46.9   89.9   47.7   45.6   50.1   56.8   42.2   88.8   48.4   89.9   75.8   49.0   90.5   45.3   69.2   55.9   44.4   55.1   62.1
Freeze           0M        93.1   52.7   88.0   47.4   31.1   41.7   46.0   39.6   85.7   41.4   89.5   67.5   39.7   89.0   47.0   72.8   46.3   19.2   35.2   56.5
REIN [60]        2.99M     91.7   51.8   90.1   52.8   48.4   48.2   56.0   42.0   89.1   44.1   90.2   74.2   47.0   91.1   54.5   84.1   78.9   47.2   59.4   65.3
VQT [57]         3.01M     90.1   46.6   91.1   46.9   46.4   51.7   56.5   43.2   89.3   49.6   92.3   75.0   50.3   90.3   44.6   71.8   57.4   44.0   55.8   62.8
ChildTune [62]   15.21M    87.9   46.5   88.1   46.5   46.1   46.1   56.0   41.5   87.9   50.3   89.6   77.7   45.6   91.4   42.4   68.1   54.7   46.0   56.8   61.5
Ours             15.21M    92.6   49.9   95.9   51.1   53.0   50.8   59.8   45.7   92.9   54.6   94.0   83.5   52.2   93.9   45.1   69.4   57.1   47.2   62.4   65.8
Cityscapes → ACDC
DINOv2 [5] (Large)
Full             304.20M   92.8   75.0   87.4   55.7   54.1   55.6   71.2   69.6   82.4   56.0   92.2   66.8   45.6   89.0   79.7   87.9   87.5   51.4   62.7   71.7
Freeze           0M        86.0   68.1   80.2   52.4   47.8   48.2   65.5   65.3   80.0   54.7   86.2   65.0   44.9   86.4   73.3   80.5   86.9   50.1   60.9   67.5
REIN [60]        2.99M     94.6   78.3   92.0   61.9   55.0   64.8   73.8   72.7   88.4   67.4   95.4   77.1   60.2   92.6   84.1   86.9   92.5   67.6   68.6   77.6
VQT [57]         3.01M     93.3   76.4   89.2   55.0   53.9   53.9   72.0   67.3   83.4   55.3   95.1   67.7   47.0   90.5   81.6   86.3   88.2   50.1   61.9   72.0
ChildTune [62]   15.21M    92.9   72.8   84.7   56.6   54.1   56.8   70.9   67.7   82.3   55.7   93.6   65.9   45.3   89.6   77.6   87.8   87.0   52.5   62.2   71.4
Ours             15.21M    95.6   79.0   96.5   60.5   58.3   64.9   75.6   77.7   85.0   61.3   98.6   73.6   51.5   94.8   85.4   94.7   93.8   59.0   66.7   77.5
EVA02 [15] (Large)
Full             304.20M   90.2   68.8   81.0   53.7   49.9   48.1   68.7   64.2   80.1   57.4   88.1   68.8   41.8   89.7   74.1   82.1   89.7   50.0   56.8   68.6
Freeze           0M        86.0   60.5   76.3   49.0   41.7   46.1   60.5   61.0   72.1   49.8   77.7   56.7   40.6   80.3   68.3   77.2   85.5   46.7   56.4   62.8
REIN [60]        2.99M     88.7   71.8   81.7   55.2   51.7   50.5   70.5   64.9   83.7   59.0   90.3   72.0   48.3   93.0   79.3   83.3   91.3   50.8   62.0   70.9
VQT [57]         3.01M     90.3   71.2   81.4   54.3   53.1   49.1   67.9   64.3   82.0   60.5   86.9   66.8   41.3   89.3   76.6   81.7   91.3   47.2   55.7   69.0
ChildTune [62]   15.21M    86.4   68.8   81.0   54.4   50.6   48.9   69.6   64.5   83.2   57.8   88.2   69.0   47.9   90.2   74.8   82.8   90.3   51.0   61.4   69.5
Ours             15.21M    90.5   75.2   83.6   58.8   54.6   52.2   73.1   66.6   85.7   60.5   90.2   70.7   51.5   92.3   82.6   88.2   91.9   54.0   62.4   72.9
Table 2. DGSS generalization performance of each category from the Cityscapes source domain to mixed-domain BDD100K and ACDC, with comparison methods including adaptor-based Rein [60] and selective parameter fine-tuning methods VQT [57] and ChildTune [62].

Cityscapes → Adverse Weather, DINOv2 [5] (Large)
Fine-tune Method   Params    Foggy Zurich [49]   Foggy Driving [49]   Dark Zurich [50]   Nighttime Driving [52]   ACDC-Rain [51]   ACDC-Snow [51]   mIoU
Full               304.20M   50.4                55.3                 62.7               47.7                     75.2             76.8             61.3
Freeze             0M        50.3                43.7                 54.3               40.8                     66.1             71.7             54.5
REIN [60]          2.99M     55.5                58.2                 64.3               50.3                     78.2             79.5             64.3
VQT [57]           3.01M     54.1                57.1                 61.9               47.4                     76.1             75.3             62.0
ChildTune [62]     15.21M    55.2                56.9                 64.5               50.7                     77.7             78.3             63.9
Ours               15.21M    56.9                60.0                 66.6               53.2                     78.6             82.2             66.3
Table 3. DGSS performance comparison for Cityscapes as the source domain under diverse weather conditions.

Backbone             Selection                 Cityscapes → BDD100K   Cityscapes → ACDC
EVA02 [5] (Large)    Full                      62.1                   68.6
                     Freeze                    56.5                   62.8
                     Random                    61.1                   67.6
                     Random Q                  62.8                   69.1
                     Random K                  61.9                   68.1
                     Random V                  62.9                   69.2
                     $F_\theta$                63.8                   69.5
                     $\Delta F_\theta$         63.1                   71.3
                     $\mathrm{DRF}_\theta$     65.8 (+3.7)            72.9 (+5.3)
DINOv2 [5] (Large)   Full                      63.7                   71.7
                     Freeze                    63.3                   67.5
                     Random                    62.7                   71.0
                     Random Q                  63.2                   72.0
                     Random K                  63.5                   72.3
                     Random V                  63.2                   72.9
                     $F_\theta$                63.8                   71.4
                     $\Delta F_\theta$         64.5                   76.1
                     $\mathrm{DRF}_\theta$     67.7 (+4.0)            77.5 (+5.8)
Table 4. Ablation study on generalization with 5% fine-tunable parameters in terms of mIoU.
Ci yscapes →BDD100K, ACDC. In mig a ing om
Ci yscapes o BDD100K and ACDC, ou me hod achie ed
s ong esul s, wi h a e age mIoU sco es o 67.7% and
77.5%. As shown in Table 2, ou me hod’s a e age mIoU
on BDD100K is 2.4% highe han REIN. Compa ed o
VQT and ChildTune, ou me hod imp o ed mIoU by 4.4%
and 2.6% on he espec i e da ase s, add essing issues in
pa ame e uning, da a adap a ion, and mig a ion s a egy.
These esul s highligh ou me hod’s supe io adap abili y
and gene aliza ion in complex scenes.
Cityscapes → Adverse Weather. We evaluated various fine-tuning strategies for DINOv2 models across challenging weather conditions, as shown in Table 3. Our approach achieved an average of 2.0% higher mIoU than the adapter-based REIN method and 4.3% higher than the self-focused VQT approach. This improvement likely stems from the substantial difference between pre-training and downstream tasks. Besides, ChildTune showed limited performance gains, and our method surpassed ChildTune by an average of 2.4% mIoU, demonstrating superior adaptability and generalization under complex weather scenarios.
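The averages and gaps quoted for Table 3 can be rechecked with a few lines of arithmetic. The sketch below copies the per-dataset scores from Table 3 and recomputes the mean over the six target datasets; the variable and function names are illustrative. Because the table reports rounded averages, the recomputed values agree with the printed mIoU column only to within about 0.1.

```python
# Per-dataset mIoU (%) copied from Table 3 (DINOv2-Large, Cityscapes -> Adverse Weather).
# Column order: Foggy Zurich, Foggy Driving, Dark Zurich, Nighttime Driving,
# ACDC-Rain, ACDC-Snow.
scores = {
    "Full":      [50.4, 55.3, 62.7, 47.7, 75.2, 76.8],
    "Freeze":    [50.3, 43.7, 54.3, 40.8, 66.1, 71.7],
    "REIN":      [55.5, 58.2, 64.3, 50.3, 78.2, 79.5],
    "VQT":       [54.1, 57.1, 61.9, 47.4, 76.1, 75.3],
    "ChildTune": [55.2, 56.9, 64.5, 50.7, 77.7, 78.3],
    "Ours":      [56.9, 60.0, 66.6, 53.2, 78.6, 82.2],
}


def mean_miou(values):
    """Unweighted mean over the target datasets."""
    return sum(values) / len(values)


# Average mIoU per method, then the gap between "Ours" and every baseline.
averages = {name: mean_miou(vals) for name, vals in scores.items()}
gaps = {
    name: averages["Ours"] - avg
    for name, avg in averages.items()
    if name != "Ours"
}

for name, gap in gaps.items():
    print(f"Ours vs {name}: {gap:+.2f} mIoU")
```

The reported gaps (2.0, 4.3, and 2.4 over REIN, VQT, and ChildTune) are differences of the rounded averages, so the unrounded gaps printed here can differ in the last digit.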
4.3. Ablation Studies
Ablation of DR-FIM effectiveness. As shown in Table 4, randomly selecting Q, K, and V parameters for fine-tuning does not fully leverage the generalization ability of VFMs, leading to lower mIoU. Using FIM (F_θ) for parameter selection improves performance over random choice. Further gains are achieved with ∆F_θ, which better identifies domain-sensitive parameters, especially on ACDC, where severe weather differences pose greater challenges. Our proposed DR-FIM, combining F and ∆F, delivers the best
Method        EVA02  EVA02+FP     DINOv2  DINOv2+FP
AdaptFormer   62.6   63.3 (+0.7)  62.7    63.7 (+1.0)
VPT           60.8   61.8 (+1.0)  63.3    64.1 (+0.8)
Rein          63.6   63.9 (+0.3)  64.3    65.0 (+0.7)
Ours          64.4   64.5 (+0.1)  66.3    66.5 (+0.2)
Table 5. Ablation study on Feature Perturbation (FP) using [33].
results, boosting mIoU by +3.7% and +5.3% on Cityscapes → BDD100K and Cityscapes → ACDC for EVA02 (Large), and by +4.0% and +5.8% for DINOv2 (Large), respectively. These results highlight the effectiveness of our method.
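To make the selection step in this ablation concrete, here is a minimal sketch of ranking parameters by a Fisher-based score and keeping the top fraction for fine-tuning. It rests on assumptions that this section does not state: the diagonal Fisher is approximated by the mean squared per-sample gradient, the domain term is the absolute change in that value under a perturbed domain, and the two are combined additively with a weight `lam`. All function names are hypothetical, and this should not be read as the exact DR-FIM definition.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def dr_fim_scores(grads_orig, grads_perturbed, lam=1.0):
    """Combine task sensitivity F with the domain-induced change |dF|."""
    f_orig = diagonal_fisher(grads_orig)
    f_pert = diagonal_fisher(grads_perturbed)
    # ASSUMPTION: additive combination; the paper's exact formula may differ.
    return [f + lam * abs(fp - f) for f, fp in zip(f_orig, f_pert)]


def top_fraction_mask(scores, fraction=0.05):
    """Boolean mask that marks the highest-scoring fraction of parameters."""
    k = max(1, int(round(fraction * len(scores))))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(scores))]


# Toy example with 20 scalar "parameters" and two samples per domain.
dim = 20
g_orig = [[0.1] * dim, [0.1] * dim]
g_pert = [[0.1] * dim, [0.1] * dim]
for g in g_orig + g_pert:
    g[3] = 1.0   # large gradient in both domains (task-relevant)
for g in g_pert:
    g[7] = 2.0   # large gradient only under the perturbed domain

dr_scores = dr_fim_scores(g_orig, g_pert)
mask = top_fraction_mask(dr_scores, fraction=0.10)
selected = [i for i, keep in enumerate(mask) if keep]
print(selected)
```

In this toy setting the mask picks parameter 3 (task-relevant in both domains) and parameter 7 (sensitive only to the domain change), which is the kind of behavior the combination of F and ∆F is meant to capture. The default `fraction=0.05` mirrors the 5% budget used in Table 4.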
Ablation of DR-FIM Estimation. Fig. 5 presents the ablation study on the proposed stable estimation method. The results show that although DR-FIM outperforms FIM in parameter evaluation, its effectiveness is limited by traditional estimation methods. The stable estimation method significantly enhances the accuracy of parameter evaluation for both FIM and DR-FIM. Notably, applying stable estimation to DR-FIM results in an average improvement of 2.6% mIoU, demonstrating superior overall generalization performance.
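One way to see why a point estimate of the Fisher information can be unreliable is to compare it with an estimate averaged over a Gaussian around the pretrained parameter. The sketch below uses a one-parameter least-squares model and plain Monte Carlo sampling. It is an illustration under assumed settings (toy model, fixed `sigma`, hypothetical function names) and is not the paper's variational procedure.

```python
import random


def grad_squared_loss(theta, x, y):
    """Gradient of 0.5 * (theta * x - y) ** 2 with respect to theta."""
    return (theta * x - y) * x


def point_fisher(theta, data):
    """Mean squared gradient evaluated at a single parameter value."""
    return sum(grad_squared_loss(theta, x, y) ** 2 for x, y in data) / len(data)


def gaussian_smoothed_fisher(theta_mean, sigma, data, num_samples=256, seed=0):
    """Average the point estimate over theta ~ N(theta_mean, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(theta_mean, sigma)
        total += point_fisher(theta, data)
    return total / num_samples


# Toy data that theta = 2.0 fits exactly, so every gradient vanishes there.
data = [(1.0, 2.0), (2.0, 4.0)]
point = point_fisher(2.0, data)
smoothed = gaussian_smoothed_fisher(2.0, sigma=0.1, data=data)
print(point, smoothed)
```

At the exactly fitted parameter the point estimate collapses to zero and carries no ranking information, while the Gaussian-averaged estimate stays positive and reflects the local curvature. With `sigma=0` the two estimates coincide.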
Ablation of Feature Perturbation. Since we adopt domain simulation augmentation from [32], which is generally considered effective for DG, we also apply it to existing VFM methods for a fair comparison. FisherTune uses feature perturbation (FP) solely for identifying domain-sensitive parameters, not during fine-tuning. As shown in Table 5, FP yields a modest improvement (up to +1.0% mIoU) on GTA→Avg., yet our method still outperforms the others.
4.4. Discussion
Captured domain-sensitive parameters. Fig. 6 illustrates the impact of different estimation methods on parameter sensitivity estimation. (a) shows that parameter sensitivity estimated from the original FIM is generally high, making it difficult to identify the most valuable parameters. (b) demonstrates that incorporating ∆F_θ redefines parameter sensitivity by comprehensively considering both task relevance and domain sensitivity. (c) presents the DR-FIM obtained with stable estimation, which highlights important parameters more effectively, aiding in the selection of valuable parameters. Additionally, (c) reveals that important parameters tend to be concentrated in the Q, K, V, and FFN parameters of deeper blocks. Furthermore, the overall sensitivity of Q and K is higher than that of V.
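A per-block, per-group summary like the one discussed for Fig. 6 can be produced by averaging element-wise sensitivity scores within each parameter group. The following is a small sketch with made-up numbers; the `block.<index>.<group>` naming scheme and the function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_sensitivity(named_scores):
    """Average sensitivity per (block index, parameter group)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for name, values in named_scores.items():
        # Hypothetical naming scheme: "block.<index>.<group>".
        _, block, group = name.split(".")
        key = (int(block), group)
        sums[key] += sum(values)
        counts[key] += len(values)
    # Sorting the keys orders the summary by ascending block index.
    return {key: sums[key] / counts[key] for key in sorted(sums)}


# Illustrative scores only; these are not values from the paper.
named_scores = {
    "block.0.q": [0.1, 0.3],
    "block.0.v": [0.1, 0.1],
    "block.23.q": [0.8, 1.0],
    "block.23.ffn": [0.6, 0.6],
}

summary = group_sensitivity(named_scores)
for (block, group), value in summary.items():
    print(f"block {block:2d} {group:>3}: {value:.2f}")
```

Ordering the keys by block index matches the layout described in the Fig. 6 caption, where Q, K, V, and FFN parameters are arranged in ascending block order.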
Feature Visualization. Fig. 7 compares the t-SNE visualizations of feature distributions between Rein [60] and FisherTune. FisherTune exhibits a more balanced feature distribution across multiple unseen domains, indicating reduced domain bias and improved generalization.
The Ratio of Fine-tuned Parameters. See Appendix C.
Segmentation Result Visualization. See Appendix D.
Influence of Hyper-parameters. See Appendix E.
Figure 5. Ablation study of estimation methods on Cityscapes → BDD100K (C2B), → ACDC (C2A), and GTAV → Cityscapes (G2C), → BDD100K (G2B), and → Mapillary (G2M).
Panels: (a) FIM; (b) DR-FIM without Stable Estimation; (c) DR-FIM with Stable Estimation.
Figure 6. Diagram of parameter sensitivity estimated by FIM and our DR-FIM using DINOv2-Large, trained on GTAV for DGSS experiments. The Q, K, V, and FFN parameters are arranged in ascending order according to their block indices.
Legend: Night, Rainy, Snow, Foggy.
Figure 7. Comparison of t-SNE feature visualizations: Rein [60] (left) and the proposed FisherTune (right). The model is trained on the Cityscapes → ACDC DGSS task. FisherTune shows a more balanced feature distribution across multiple unseen domains.
5. Conclusion
We propose FisherTune, a fine-tuning method for Vision Foundation Models (VFMs) in DGSS. It introduces the Domain-Related Fisher Information Matrix (DR-FIM) to measure parameter sensitivity to domain shifts, using variational inference for stable estimation. FisherTune enhances domain adaptability while maintaining generalization. We hope it encourages further research on selective fine-tuning to better unlock the generalization potential of VFMs in DGSS and beyond.
References
[1] Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6430–6439, 2019. 5
[2] Alessandro Achille, Giovanni Paolini, and Stefano Soatto. Where is the information in a deep neural network? arXiv preprint arXiv:1905.12213, 2019. 3
[3] Yasser Benigmim, Subhankar Roy, Slim Essid, Vicky Kalogeiton, and Stéphane Lathuilière. Collaborating foundation models for domain generalized semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3108–3119, 2024. 2
[4] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017. 5
[5] Mathilde Caron et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1, 7
[6] Prithvijit Chattopadhyay, Kartik Sarangmath, Vivek Vijaykumar, and Judy Hoffman. Pasta: Proportional amplitude spectrum training augmentation for syn-to-real domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19288–19300, 2023. 2
[7] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, 35:16664–16678, 2022. 2, 6
[8] Ziyuan Cheng, Ruinian Wan, Meng Li, Feiyu Wang, Chao Xu, and Xiaofei He. Domain generalization via style-efficient perturbation and clustering for intra-domain heterogeneous data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3938–3947, 2022. 2
[9] Sungha Choi, Sanghun Jung, Huiwon Yun, Joanne T Kim, Seungryong Kim, and Jaegul Choo. Robustnet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11580–11590, 2021. 2
[10] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 2
[11] Qiong Dou, Daniel Ca o de Castro, Konstantinos Kamnitsas, and Ben Glocker. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems, pages 6450–6461, 2019. 2
[12] Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, and Raoul De Charette. Poda: Prompt-driven zero-shot domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18623–18633, 2023. 2
[13] Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, and Raoul de Charette. A simple recipe for language-guided domain generalized segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23428–23437, 2024. 2
[14] Hao Fang et al. Eva-02: A visual learner for more generalized visual representation learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3
[15] Hao Fang et al. Eva-clip: Improving vision-language models with masked modeling. arXiv preprint arXiv:2303.13495, 2023. 1, 6, 7
[16] Ronald A Fisher. On the mathematical foundations of theoretical statistics. Philosophical transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character, 222(594-604):309–368, 1922. 3, 5
[17] Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance. Visual Intelligence, 2(1):1–17, 2024. 2
[18] Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024. 2
[19] Kaiming He et al. Masked autoencoders are scalable vision learners. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 3, 6
[20] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 2013. 5
[21] Edward J Hu et al. Lora: Low-rank adaptation of large language models. International Conference on Learning Representations (ICLR), 2022. 2
[22] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 6
[23] Jiaxing Huang, Dayan Guan, Aoran Xiao, and Shijian Lu. Fsdr: Frequency space domain randomization for domain generalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6891–6902, 2021. 2
[24] Christoph Hümmer, Manuel Schwonberg, Liangwei Zhou, Hu Cao, Alois Knoll, and Hanno Gottschalk. Strong but simple: A baseline for domain generalized dense perception by clip-based transfer learning. In Proceedings of the Asian Conference on Computer Vision, pages 4223–4244, 2024. 6
[25] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022. 2, 6
[26] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023. 1