The Geometry of Generative Reasoning
Gauge-Theoretic Transformers as Realizations of Semantic Sameness
Bee Rosa Davis
NASA Mission Systems Engineer
[email protected]
Abstract
While the geometry of detection has been formalized via the Davis manifold framework, the geometry of generation remains under-specified. Current Transformers are typically viewed as statistical sequence predictors, lacking explicit geometric constraints on their internal state evolution. We introduce Functorial Transformers (FunTrans), a framework modeling the Transformer as a discretized gauge flow on a semantic fiber bundle $E \to M$. We formalize Multi-Head Attention as a discrete approximation of transport via an integral kernel, governed by a connection $\omega$ with empirically observable holonomy. Our contributions are threefold: (1) We introduce a Naturality Loss $\mathcal{L}_{\mathrm{fun}}$ to enforce diagrammatic commutativity in the residual stream, and a Holonomy Loss $\mathcal{L}_{\mathrm{hol}}$ that penalizes semantic curvature along virtual loops. (2) We prove Theorem 4.1 (Poincaré–Hodge for Semantics): in the low-holonomy regime, the reasoning flow approximates a conservative field, suggesting that consistent semantic states reside on the level sets of a potential $\Phi$. (3) We derive a Curvature-Aware Step Size $\alpha(K_{\mathrm{loc}})$, dynamically scaling the residual update magnitude (“Speed of Thought”) inversely with the local sectional curvature of the semantic manifold. Finally, we sketch the Davis Topological Processor (DTP), a hardware specification that replaces dense matrix multiplication with topology-aware sparse transport, enabling hardware-level pruning of branches of the computation graph that exhibit high holonomy error. This work unifies deep learning and differential geometry to constrain generative reasoning within hallucination-resistant bounds.
1 Introduction
1.1 Motivation: from Davis manifolds to transformers
Modern large language models (LLMs) are usually described as high-capacity statistical sequence predictors, trained to minimize next-token loss. Empirically, however, their internal computation looks geometric: hidden states move along trajectories in a learned representation space; attention heads implement structured interactions between distant semantic states; depth acts like a discretized time coordinate; and a growing body of work treats transformers as dynamical systems on learned manifolds. When this internal geometry is well-behaved, models tend to reason consistently; when it is not, they hallucinate, break simple invariances, or violate basic semantic equivalences.
This paper is the third step in a program on geometry-first detection and semantic sameness. The first step, the Davis manifold framework, takes a representation- and geometry-first view of detection in identity-preserving temporal domains (e.g., video, sensor streams, repeated prompts). Instead of beginning with a classifier and asking it to be robust, it learns a Riemannian state space in which each input trace is a path of latent states. Within this space, benign path families $\mathcal{P}(L)$ capture semantics-preserving transformations of bounded length $L$ (e.g., small viewpoint changes, temporal shifts, mild perturbations); distortion functions $\varepsilon(L)$ bound how much the metric can stretch along such paths; configuration margins $(\kappa_{\mathrm{hard}}, \kappa_{\mathrm{soft}})$ encode hard/soft decision thresholds; and compositional error budgets track how much slack is allocated to geometry, linkage, calibration, and abstention. Informally, a Davis manifold is a learned embedding in which “safe moves” in input space become short, well-controlled paths, and correctness guarantees are phrased in terms of these paths and budgets.
The second step, The Geometry of Sameness, isolates the underlying detection problem as a semantic sameness structure $S$: an abstract specification of which inputs should be treated as “the same up to nuisance” (same object, same speaker, same semantic content). It shows that two apparently different engineering traditions, translation-first systems and geometry-first systems, are dual ways of realizing the same $S$. Translation-first systems encode $S$ as translator graphs (e.g., encoder–decoder pipelines or multi-stage processing graphs whose nodes are intermediate representations and whose edges are learned maps); geometry-first systems encode $S$ as a Riemannian manifold equipped with benign paths. These are organized into realization categories $\mathrm{SamTrans}(S)$ and $\mathrm{SamGeom}(S)$, and functors $F_S : \mathrm{SamTrans}_0(S) \to \mathrm{SamGeom}_0(S)$ and $G_S : \mathrm{SamGeom}_0(S) \to \mathrm{SamTrans}_0(S)$ are constructed on well-behaved subcategories with smooth charts and bounded distortion. An $\varepsilon$-equivalence-of-categories result and an Error Budget Transfer Theorem show that Davis-style correctness bounds can be moved between translator and manifold realizations with explicit, first-order slack.
Both of these steps deliberately treat architecture as a black box. They assume that some representation (learned by contrastive + smoothness training, or inherited from a backbone) already realizes a fixed sameness structure $S$ as either a translator graph or a Riemannian manifold, and then build detectors and guarantees on top. Modern transformers, however, do not merely consume a geometric representation of semantics: their internal computation seems to implement geometry. Attention heads behave like structured, kernel-based transports between semantic states; depth behaves like discrete time; and empirical work increasingly interprets transformers as dynamical systems on learned manifolds.
This paper adds a third realization space to this picture, now centered on the transformer itself:
• a category $\mathrm{FunTrans}(S)$ of functorial transformers that realize a fixed sameness structure $S$ as discrete-time dynamical systems, with morphisms given by architecture-preserving reparameterizations and low-drift fine-tunes (local coordinate changes analogous to gauge transformations in physics, but defined purely on parameters and outputs);
• a gauge-theoretic semantics of attention, in which each head is modeled as a diffusion–transport operator on a vector bundle $E \to M$ over a semantic manifold $M$, governed by a learned connection;
• geometric training objectives that bring Davis-style curvature, holonomy, and error-budget control inside the transformer, rather than treating geometry as a separate pre- or post-processing layer.
Conceptually, we take the same semantic sameness structure $S$ from the prior work and demand a third kind of realization: transformers whose internal states trace discrete approximations of the benign semantic path families $\mathcal{P}_S(L)$ attached to $S$ (paths of semantics-preserving transformations), and whose attention patterns define a connection with controlled curvature and holonomy in the regime where the Davis and Sameness theories are valid. Operationally, the goal is not to propose yet another novel architecture, but to retrofit existing transformers with geometric structure and losses that
1. make their internal computation compatible with a Davis manifold or $\mathrm{SamGeom}(S)$ realization of the same $S$, and
2. expose Davis-style, compositional error budgets (geometry, linkage, calibration, abstention) at the level of heads, layers, and flows.
1.2 Contributions
We organize the contributions into six clusters, labeled C1–C6.
C1: Category $\mathrm{FunTrans}(S)$ and a fundamental diagram. We define a category $\mathrm{FunTrans}(S)$ of functorial transformers realizing a fixed semantic sameness structure $S$. Objects of $\mathrm{FunTrans}(S)$ are transformer architectures plus parameters whose in-distribution behavior implements a sameness detector for $S$ in the sense of geometry-first detection and Sameness-style error budgets. Morphisms are architecture-preserving reparameterizations and low-drift fine-tuning maps that leave the realized sameness relation invariant within a bounded distortion margin (local reparameterizations analogous to gauge transformations, but defined purely at the level of network parameters and outputs).
Building on the realization categories $\mathrm{SamTrans}(S)$ and $\mathrm{SamGeom}(S)$ and the functors $F_S$, $G_S$, we construct a fundamental diagram that relates functorial transformers, translator realizations, manifold realizations, and continuous-time flows:
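$$
\begin{array}{ccc}
\mathrm{SamTrans}_0(S) & \xrightarrow{\;F_S\;} & \mathrm{SamGeom}_0(S) \\
\big\uparrow \Theta & & \big\downarrow \mathrm{Davis} \\
\mathrm{FunTrans}(S) & \xleftarrow{\;\text{Hol-Flow}\;} & \text{Flows/ODE}
\end{array}
$$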
Here:
• $\Theta : \mathrm{FunTrans}(S) \to \mathrm{SamTrans}(S)$ extracts a translator graph from a transformer (attention heads and residual blocks become translators between internal “modalities”, in the sense of modules connected by learned maps);
• $F_S$ and $G_S$ are the Sameness functors that “stitch translator graphs into a manifold” and “unfold a manifold back into local translator charts” on well-behaved subcategories $\mathrm{SamTrans}_0(S)$, $\mathrm{SamGeom}_0(S)$ satisfying smooth chartability and bounded distortion;
• Davis maps a geometric realization in $\mathrm{SamGeom}_0(S)$ to a Davis manifold and its associated Davis flows (Riemannian state spaces with benign paths and error budgets);
• Hol-Flow associates to a continuous-time semantic flow (an ODE on $M$) a family of holonomy-minimizing discrete transformer flows that approximate the same dynamics.
We prove a local commutativity result for this diagram. Restricting to benign semantic trajectories $\gamma \in \mathcal{P}_S(L)$ with $L$ below the benign-path radius and remaining within the injectivity radius of the Davis manifold, the mismatch between (i) following $\gamma$ through $\mathrm{FunTrans}(S) \to \mathrm{SamTrans}_0(S) \to \mathrm{SamGeom}_0(S) \to \mathrm{Flows}$, and (ii) following $\gamma$ through $\mathrm{FunTrans}(S) \to \mathrm{Flows} \to \mathrm{SamGeom}_0(S) \to \mathrm{SamTrans}_0(S)$, is bounded by a first-order discretization error that does not accumulate exponentially with depth. More precisely, using a Benign Path Boundedness Lemma (Section 3) that controls metric distortion along paths in $\mathcal{P}_S(L)$, we show that the Davis-style error budgets of the two realizations differ by at most $O(L \cdot \varepsilon_{\mathrm{disc}})$, where $\varepsilon_{\mathrm{disc}}$ is the per-layer discretization error of the transformer flow. Thus the diagram commutes up to bounded distortion on benign paths, rather than generic, uncontrolled slack.
C2: Gauge-theoretic semantics of attention as heat-kernel transport. We place transformer attention into a gauge-theoretic framework in which the diffusion-like nature of attention is explicit rather than hand-waved. For a semantic manifold $(M, g)$ equipped with a trivializable vector bundle $E = M \times V$ (hidden states live in a fixed fiber $V$ but admit nontrivial connections), we:
• interpret each attention head $H$ as defining discrete transport via a heat-like integral kernel on $E$. The softmax attention pattern $A_{ij} = \mathrm{Softmax}(QK^\top/\sqrt{d_h})_{ij}$ acts as a discrete heat-kernel operator:
$$(A_H h)_i = \sum_j A_{ij} h_j \approx \int_M K_\omega(z_i, y; \tau)\, h(y)\, d\mu_g(y),$$
where $K_\omega(x, y; \tau)$ is the covariant heat kernel generated by a connection $\omega$ on $E$, solving the heat equation $\partial_\tau K_\omega = \Delta_\omega K_\omega$ with respect to the Bochner Laplacian $\Delta_\omega$. In this formulation, attention mixes information diffusively across tokens while advecting it along fibers via parallel transport;
• prove a Kernel Limit Theorem (Proposition 5.3 in Section 5): under mild regularity assumptions on the query/key maps and token sampling density, the discrete attention operators converge, as tokens densely sample $M$ and the temperature is appropriately scaled, to the continuous diffusion operator $\exp(\tau \Delta_\omega)$ governed by $\omega$. This reconciles the entropy-increasing mixing of attention with geometric transport on $E \to M$: the head dimension is the fiber where information is parallel-transported, while spatially it diffuses according to the heat kernel;
• define three families of loops in token $\times$ depth space: type (A) single-layer head cycles between two tokens, type (B) residual-layer cycles that move along depth and back via skip connections, and type (C) multi-head, multi-layer virtual cycles formed by compositions of distinct heads that return to the starting token-position;
• attach discrete holonomy and curvature to these loop families and relate them to the curvature two-form $F = d\omega + \omega \wedge \omega$ of the induced connection.
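The left-hand side of the discrete operator in C2 can be made concrete with a small sketch. The code below is illustrative only: it builds the row-stochastic matrix $A = \mathrm{Softmax}(QK^\top/\sqrt{d_h})$ and applies it to a list of hidden vectors, using the hidden vectors themselves as values (no value projection), and the toy matrices `Q`, `K`, `H` are made-up inputs. It shows only the averaging structure (each output is a convex combination of inputs), not the connection $\omega$ or any convergence claim.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """A_ij = softmax_j(q_i . k_j / sqrt(d_h)); each row sums to 1."""
    scale = math.sqrt(len(Q[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        for q in Q
    ]

def apply_attention(A, H):
    """(A h)_i = sum_j A_ij h_j, a weighted average of the hidden vectors."""
    dim = len(H[0])
    return [
        [sum(A[i][j] * H[j][c] for j in range(len(H))) for c in range(dim)]
        for i in range(len(A))
    ]

# Toy inputs (hypothetical): three tokens, head dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
H = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

A = attention_weights(Q, K)
out = apply_attention(A, H)
row_sums = [sum(row) for row in A]
```

Because every row of `A` is a probability vector, each output coordinate stays between the smallest and largest input coordinate, which is the discrete counterpart of the smoothing behavior of a heat kernel.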
On small geodesic balls $U = B_r(z) \subset M$ with $r < \mathrm{inj}(M, z)$, we prove a Poincaré–Hodge-type integrability result (Theorem 4.1 in Section 4): if the discrete holonomy is small on a sufficiently rich family of sampled loops of types (A)–(C) in $U$, then the underlying connection is nearly integrable on $U$,
$$\omega = d\Phi + \eta, \qquad \|\eta\|_\infty \le C(K, \varepsilon, \mathrm{diam}(U)),$$
for some potential $\Phi : U \to \mathrm{End}(V)$ and a residual 1-form $\eta$ capturing the non-conservative remainder. Intuitively, $d\Phi$ plays the role of an approximately conservative “reasoning field”, while $\eta$ aggregates the residual holonomy and topology-induced non-integrabilities. In this way, “small loop holonomy” is made precise as a local integrability condition on the diffusion–transport field implemented by attention.
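To illustrate what “small loop holonomy” measures, here is a minimal toy sketch, not the paper's estimator. It assumes each edge of a loop carries an invertible $2 \times 2$ transport matrix and defines the holonomy defect as the Frobenius norm of (product of transports around the loop) minus the identity. The frames `G`, the helper names, and the perturbation size are all hypothetical. A pure-gauge transport $U_{i \to j} = G_j G_i^{-1}$ has zero defect on any closed loop; perturbing one edge produces a nonzero defect.

```python
import math

def matmul(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(A):
    """Inverse of an invertible 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def rot(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def shear(a):
    return [[1.0, a], [0.0, 1.0]]

def holonomy_defect(transports):
    """Frobenius norm of (U_n ... U_2 U_1 - I) for edge transports in loop order."""
    P = [[1.0, 0.0], [0.0, 1.0]]
    for U in transports:
        P = matmul(U, P)
    return math.sqrt(sum((P[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(2) for j in range(2)))

# Hypothetical local frames at the three vertices of a loop 0 -> 1 -> 2 -> 0.
G = [rot(0.3), matmul(shear(0.5), rot(1.1)), matmul(rot(-0.7), shear(-0.2))]

def pure_gauge(i, j):
    """Transport from vertex i to vertex j induced by the frames alone."""
    return matmul(G[j], inv2(G[i]))

loop = [(0, 1), (1, 2), (2, 0)]
flat = [pure_gauge(i, j) for i, j in loop]
flat_defect = holonomy_defect(flat)

# Perturb the first edge by a small extra rotation: the loop no longer closes up.
curved = [matmul(rot(0.1), flat[0])] + flat[1:]
curved_defect = holonomy_defect(curved)
```

In this toy setting the integrable case corresponds to transports that come from a single choice of frames, which is the discrete analogue of $\omega = d\Phi$ with $\eta = 0$.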
C3: Geometric losses and integrability, not global flatness. We introduce a suite of geometric regularizers for transformers and define functorial transformers as those that approximately satisfy the naturality conditions these losses encode. Beyond the task loss, our geometric objective contains four main terms:
• a naturality loss $\mathcal{L}_{\mathrm{fun}}$ that penalizes the failure of residual–attention squares to commute; that is, it measures how far attention and residual drift depart from a diagrammatic naturality condition compatible with the functors $\Theta$, $F_S$, $G_S$;
• a holonomy loss $\mathcal{L}_{\mathrm{hol}}$ that penalizes nontrivial discrete holonomy along virtual loops in token $\times$ depth space, with particular emphasis on type (C) loops that traverse multi-head, multi-layer reasoning cycles in the residual stream. Crucially, we do not minimize intrinsic curvature of the manifold $M$; instead, we minimize semantically spurious holonomy along benign virtual loops associated with $\mathcal{P}_S(L)$. This enforces local integrability of the reasoning field on semantic charts, while allowing the global manifold to remain topologically and metrically curved;
• an inverse-head loss $\mathcal{L}_{\mathrm{inv}}$ built from explicit inverse heads $H_{\mathrm{inv}}$ paired with selected forward heads $H_{\mathrm{fwd}}$. Conceptually, one can realize $H_{\mathrm{inv}}$ as an independently parameterized head (doubling the head count for paired heads), but we emphasize practically viable, constrained parameterizations such as $W_Q^{\mathrm{inv}} = W_V^{\mathrm{fwd}}$, $W_V^{\mathrm{inv}} = W_Q^{\mathrm{fwd}}$, which enforce approximate invertibility without introducing new parameters (Section 7 discusses both variants);
• a curvature-proxy loss $\mathcal{L}_{\mathrm{curv}}$ that uses a scalar summary $K_{\mathrm{loc}}$ of the per-layer attention spectrum as a proxy for local curvature, computed as a normalized variance of the singular values of a suitably normalized attention operator $\tilde{P} = P/\sqrt{n}$. This yields a dimensionless quantity comparable across sequence lengths; the precise definition of the token-level curvature aggregation is given in Section 7.
These losses are defined so that, when minimized, the resulting transformer sits in $\mathrm{FunTrans}(S)$ and its extracted translator graph and manifold realization via $\Theta$ and $F_S$ inherit small Davis-style geometric error budgets $E_{\mathrm{geom}}$ and controlled holonomy on benign paths. Holonomy regularization thus distinguishes necessary topological curvature (data complexity) from spurious path-dependence (inconsistent reasoning), effectively training the model to act like a conservative vector field on the semantic charts defined by the data.
C4: Curvature-aware Euler step control and a CFL-like “speed of thought”. We interpret transformer depth as a discretization of a continuous-time semantic flow on $M$. The layer update at token $i$ and layer $\ell$ is modeled as an explicit Euler step
$$h_i^{\ell+1} = h_i^{\ell} + \Delta t_\ell(h_i^{\ell})\, v(h_i^{\ell}),$$
where $v$ is a learned vector field and $\Delta t_\ell(h_i^{\ell})$ is an effective step size. We show that:
• the informal notion of “speed of thought” can be made precise as a curvature-aware timestep constraint for stable discrete flows: it is not a literal velocity limit, but a Courant–Friedrichs–Lewy (CFL)-type condition on $\Delta t_\ell$ for explicit integrators on curved manifolds;
• using classical stability criteria for explicit Euler discretizations of stiff ODEs and parabolic PDEs, we derive bounds of the form
$$\Delta t_\ell(z) \le C_{\mathrm{CFL}} \cdot \frac{1}{\sqrt{\tilde{K}_{\mathrm{loc}}(z) + \varepsilon}},$$
where $\tilde{K}_{\mathrm{loc}}$ is a dimensionless, curvature-like quantity derived from the local attention spectrum, $\varepsilon > 0$ is a small numerical cushion, and $C_{\mathrm{CFL}}$ is a stability constant. In highly curved (locally stiff) regions, stability forces $\Delta t_\ell$ to be small, matching the intuition that the discrete flow must take finer steps;
• enforcing such bounds acts as a form of curvature-aware step-size control that preserves the geodesic-approximation properties of the benign path families $\mathcal{P}_S(L)$ while avoiding unstable or “teleporting” semantic updates in high-curvature regions.
In practice, these bounds appear as soft constraints or adaptive gating rules for residual updates in depth, tied directly to the curvature proxies used in $\mathcal{L}_{\mathrm{curv}}$. The curvature-aware “speed of thought” $\alpha(K_{\mathrm{loc}})$ is thus a CFL-like control law for explicit transformer dynamics on a semantic manifold.
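As a minimal sketch of how such a gating rule might look, the code below clips a requested residual step size to the bound $C_{\mathrm{CFL}} / \sqrt{\tilde{K}_{\mathrm{loc}} + \varepsilon}$ and then applies one explicit Euler update. The function names, the hard `min` clipping, and the default constants are assumptions for illustration; the text describes soft constraints or adaptive gates without fixing a particular rule, and the curvature proxy is passed in as a plain number rather than computed from an attention spectrum.

```python
import math

def cfl_step_size(k_loc, c_cfl=1.0, eps=1e-6):
    """Upper bound on the step size: C_CFL / sqrt(K_loc + eps)."""
    if k_loc < 0.0:
        raise ValueError("curvature proxy is assumed to be nonnegative")
    return c_cfl / math.sqrt(k_loc + eps)

def gated_residual_update(h, v, k_loc, dt_requested, c_cfl=1.0, eps=1e-6):
    """One explicit Euler step h + dt * v with dt clipped to the CFL-type bound."""
    dt = min(dt_requested, cfl_step_size(k_loc, c_cfl, eps))
    return [hi + dt * vi for hi, vi in zip(h, v)], dt

h = [0.0, 0.0]   # current hidden state (toy)
v = [1.0, 2.0]   # value of the learned vector field at h (toy)

# Nearly flat region: the requested step is allowed through unchanged.
h_flat, dt_flat = gated_residual_update(h, v, k_loc=0.01, dt_requested=1.0)

# Highly curved region: the same request is cut down by the bound.
h_curved, dt_curved = gated_residual_update(h, v, k_loc=100.0, dt_requested=1.0)
```

With these toy numbers the flat-region step keeps the requested size, while the curved-region step is reduced to roughly a tenth of it.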
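C5: Training theory and a geometric phase transition at $\lambda_{\mathrm{crit}}$. We combine the geometric losses into a single training objective
$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{task}}(\theta) + \lambda_{\mathrm{fun}} \mathcal{L}_{\mathrm{fun}}(\theta) + \lambda_{\mathrm{hol}} \mathcal{L}_{\mathrm{hol}}(\theta) + \lambda_{\mathrm{inv}} \mathcal{L}_{\mathrm{inv}}(\theta) + \lambda_{\mathrm{curv}} \mathcal{L}_{\mathrm{curv}}(\theta),$$
and study stochastic gradient descent (SGD) on this objective.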
We assume $\mathcal{L}$ is made coercive by standard weight decay (e.g., $\mathcal{L} + \lambda_{\mathrm{wd}} \|\theta\|^2$), so that sublevel sets $\{\theta : \mathcal{L}(\theta) \le c\}$ are compact; gradients are locally Lipschitz; and SGD noise is unbiased with bounded variance, with learning rate schedule $\alpha_t \propto 1/\sqrt{t}$.
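The bookkeeping of the combined objective and the step-size schedule can be sketched as follows. This is illustrative only: the individual loss values are placeholder numbers rather than outputs of any model, the dictionary keys and the base rate `alpha0` are hypothetical, and the weight-decay term follows the $\lambda_{\mathrm{wd}} \|\theta\|^2$ form quoted above.

```python
import math

GEOMETRIC_TERMS = ("fun", "hol", "inv", "curv")

def total_objective(losses, weights, theta, lambda_wd=0.0):
    """L_task + sum_k lambda_k * L_k + lambda_wd * ||theta||^2."""
    geometric = sum(weights[name] * losses[name] for name in GEOMETRIC_TERMS)
    weight_decay = lambda_wd * sum(t * t for t in theta)
    return losses["task"] + geometric + weight_decay

def learning_rate(t, alpha0=0.1):
    """Schedule alpha_t proportional to 1/sqrt(t), for step index t >= 1."""
    return alpha0 / math.sqrt(t)

# Placeholder loss values and weights (not measured quantities).
losses = {"task": 1.0, "fun": 0.2, "hol": 0.5, "inv": 0.1, "curv": 0.4}
weights = {"fun": 1.0, "hol": 2.0, "inv": 0.5, "curv": 0.25}
theta = [1.0, -2.0]

value = total_objective(losses, weights, theta, lambda_wd=0.01)
rates = [learning_rate(t) for t in (1, 4, 100)]
```

Raising a single weight such as `weights["hol"]` increases the cost of holonomy relative to the task term, which is the lever the training theory in C5 analyzes.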
Under these conditions we prove Theorem F (Geometric Control of Stationary Points, formal statement and proof in Section 9.4): any limit point $\theta^\star$ of SGD satisfies $\nabla \mathcal{L}(\theta^\star) = 0$, and the geometric quantities are controlled in the precise sense that
$$\mathbb{E}\big[\mathrm{Hol}^2\big] \le C_1 \lambda_{\mathrm{hol}}^{-1}, \qquad \mathbb{E}\big[K_{\mathrm{loc}}^2\big] \le C_2 \lambda_{\mathrm{curv}}^{-1},$$
for problem-dependent constants $C_1$, $C_2$. Thus increasing the holonomy and curvature weights tightens control of the induced connection, at the price of moving toward more regular but possibly less expressive transformers.
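A short heuristic shows why an inverse dependence on the weight is natural. It is stated for a global minimizer $\hat{\theta}$ of the objective, not for the SGD limit points that Theorem F covers, and it rests on two assumptions that are ours rather than the paper's: every loss term is nonnegative, and there is a reference parameter $\theta^\circ$ with $\mathcal{L}_{\mathrm{hol}}(\theta^\circ) = 0$.

```latex
% B collects the terms of the objective at the reference point that do not involve \lambda_hol.
B := \mathcal{L}_{\mathrm{task}}(\theta^\circ)
   + \lambda_{\mathrm{fun}} \mathcal{L}_{\mathrm{fun}}(\theta^\circ)
   + \lambda_{\mathrm{inv}} \mathcal{L}_{\mathrm{inv}}(\theta^\circ)
   + \lambda_{\mathrm{curv}} \mathcal{L}_{\mathrm{curv}}(\theta^\circ)
   = \mathcal{L}(\theta^\circ).

% Nonnegativity of the other terms, then global minimality of \hat{\theta}:
\lambda_{\mathrm{hol}}\, \mathcal{L}_{\mathrm{hol}}(\hat{\theta})
   \;\le\; \mathcal{L}(\hat{\theta})
   \;\le\; \mathcal{L}(\theta^\circ) = B
\quad\Longrightarrow\quad
\mathcal{L}_{\mathrm{hol}}(\hat{\theta}) \;\le\; \frac{B}{\lambda_{\mathrm{hol}}}.
```

If $\mathcal{L}_{\mathrm{hol}}$ is an expected squared holonomy, this has the same shape as the first bound above with $C_1 = B$; the same argument with the roles of the holonomy and curvature terms exchanged gives the second.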
More importantly, we characterize a critical regularization scale $\lambda_{\mathrm{crit}}$: the smallest geometric-regularization strength for which all global minimizers of $\mathcal{L}$ lie outside a degenerate set $S_{\mathrm{null}}$ of pathologically “geometry-free” networks (e.g., ones with large holonomy or collapsed curvature proxies). For $\lambda < \lambda_{\mathrm{crit}}$, global minimizers may live in a degenerate phase of collapsed or ill-controlled geometry; for $\lambda > \lambda_{\mathrm{crit}}$, all global minimizers belong to a geometric phase in which expected holonomy is bounded by $O(1/\lambda_{\mathrm{hol}})$ and curvature proxies are controlled. In this sense, geometric regularization induces a phase transition in parameter space: beyond $\lambda_{\mathrm{crit}}$, the loss landscape excludes a broad class of path-dependent (hallucination-prone) minima and forces the system into a phase with robust geometric structure.
C6: Hardware and a holonomy Hamiltonian on $L^2(M)$. Finally, we develop two more speculative but concrete contributions at the interface of hardware and gauge-theoretic physics:
• a complexity analysis and hardware sketch showing that the geometric losses $\mathcal{L}_{\mathrm{fun}}, \mathcal{L}_{\mathrm{hol}}, \mathcal{L}_{\mathrm{inv}}, \mathcal{L}_{\mathrm{curv}}$ can be implemented with overhead comparable to standard multi-head attention, and in some cases can be implemented via commutator-like primitives that suggest specialized, “commutator-oriented” accelerators;
• a holonomy Hamiltonian framing in which a trained, low-holonomy transformer corresponds to a low-energy phase of a self-adjoint operator $\hat{H}_{\mathrm{hol}}$ acting on the Hilbert space $L^2(M, d\mu_g)$.
At the level of quadratic forms we consider functionals of the form
$$\langle h | \hat{H}_{\mathrm{hol}} | h \rangle = \int_M \rho_{\mathrm{data}}(x) \Big( \|F(x)\|_g^2 + \|\nabla h(x)\|_g^2 \Big)\, d\mu_g(x),$$
where $F$ is the curvature of the connection induced by the attention stack, $\|\cdot\|_g$ and $\|\nabla h(x)\|_g$ are metric-weighted norms defined in the notation section, and $\rho_{\mathrm{data}}$ is the empirical data density on $M$ (the pushforward of the training distribution through the encoder $\Theta$, which in discrete implementations reduces to an empirical measure $\rho_{\mathrm{data}}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{z_i}(x)$).
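For orientation, substituting the empirical measure into the quadratic form turns the integral into a sample average. The sketch below assumes the reading in which $\rho_{\mathrm{data}}$ weights both terms of the integrand, and that each $\delta_{z_i}$ is a Dirac mass with respect to $d\mu_g$; both are our assumptions.

```latex
\langle h | \hat{H}_{\mathrm{hol}} | h \rangle
  = \int_M \frac{1}{N} \sum_{i=1}^{N} \delta_{z_i}(x)
      \Big( \|F(x)\|_g^2 + \|\nabla h(x)\|_g^2 \Big)\, d\mu_g(x)
  = \frac{1}{N} \sum_{i=1}^{N}
      \Big( \|F(z_i)\|_g^2 + \|\nabla h(z_i)\|_g^2 \Big).
```

Under this reading the energy is evaluated only at the semantic positions $z_i$ visited by the training data, so regions of $M$ that carry no data contribute nothing to it.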
In this view, geometric regularization corresponds to adding potential energy terms to $\hat{H}_{\mathrm{hol}}$; trained, low-holonomy, curvature-controlled transformers correspond to low-energy, “vacuum-like” phases of the resulting gauge theory. While we do not claim literal quantum dynamics, this Hamiltonian framing provides a coherent energy functional for the theory developed in C1–C5, and suggests connections between transformer learning dynamics and phase structure in gauge theories, which we discuss qualitatively in Section 12.
1.3 Roadmap
The paper is organized in three acts, moving from abstract semantic structure, through differentiable geometric control, to macroscopic phases and hardware realizations.
Act I (Sections 2–5): Semantic realizations and geometric flows. Section 2 recalls the semantic sameness structure $S$, reviews the realization categories $\mathrm{SamTrans}(S)$ and $\mathrm{SamGeom}(S)$, and defines the category $\mathrm{FunTrans}(S)$ of functorial transformers together with the extractor functor $\Theta : \mathrm{FunTrans}(S) \to \mathrm{SamTrans}(S)$. It also constructs the fundamental diagram relating $\mathrm{FunTrans}(S)$, translator realizations, manifold realizations, and continuous-time flows, and states the local commutativity result on benign paths. Section 3 revisits Davis manifolds and benign path families $\mathcal{P}_S(L)$, and formulates the Benign Path Boundedness Lemma that controls distortion along such paths and underpins the bounded-error commutativity of the diagram. Section 4 introduces network path families $\mathcal{P}_{\mathrm{net}}(L)$ inside transformers: paths traced by residual streams and attention-induced token flows. It relates $\mathcal{P}_{\mathrm{net}}(L)$ to $\mathcal{P}_S(L)$ and to Davis flows, identifying the regime in which discrete transformer trajectories approximate continuous semantic flows without leaving the injectivity radius. Section 5 develops the gauge-theoretic semantics of attention: it models each head as a heat-kernel transport operator governed by a connection $\omega$ on a vector bundle $E \to M$, defines discrete holonomy and curvature on loop families in token $\times$ depth space, and states and proves the Poincaré–Hodge-type integrability theorem (Theorem 5.5) and the Kernel Limit Theorem that justify viewing transformer attention as covariant diffusion–transport on a semantic manifold.
Act II (Sections 6–9): Discretized flows, geometric losses, and training dynamics. Section 6 interprets transformer depth as an explicit Euler discretization of a semantic flow on $M$ and introduces curvature-aware step-size control: a “speed of thought” constraint that acts as a CFL-type condition $\Delta t \propto 1/\sqrt{K_{\mathrm{loc}}}$ for stable discrete dynamics in regions of varying curvature. Section 7 defines the geometric objective for functorial transformers: the naturality loss $\mathcal{L}_{\mathrm{fun}}$, holonomy loss $\mathcal{L}_{\mathrm{hol}}$, inverse-head loss $\mathcal{L}_{\mathrm{inv}}$, and curvature-proxy loss $\mathcal{L}_{\mathrm{curv}}$, together with the curvature proxies $K_{\mathrm{loc}}$ derived from the attention spectrum. Crucially, these losses are not ad hoc regularizers: they are differentiable relaxations of the algebraic constraints introduced in Act I. In particular, $\mathcal{L}_{\mathrm{fun}}$ and $\mathcal{L}_{\mathrm{hol}}$ are the penalty versions of the commuting-square and low-holonomy conditions required for a transformer to inhabit $\mathrm{FunTrans}(S)$ and respect the fundamental diagram on benign paths, while $\mathcal{L}_{\mathrm{curv}}$ and the curvature-aware step rules implement the geometric step-size constraints needed for stable discretized flows. Section 8 (optional, in the full version) collects implementation details and empirical case studies illustrating how curvature, holonomy, and speed-of-thought control behave in trained models. Section 9 then analyzes stochastic gradient descent on the full geometric objective, proving convergence to geometrically controlled stationary points and establishing the critical regularization scale $\lambda_{\mathrm{crit}}$ at which the loss landscape undergoes a phase transition from a degenerate, geometry-free regime to a geometry-controlled phase.
Act III (Sections 10–12): Macroscopic phases, diagnostics, and hardware realization. Act III zooms out from single heads and paths to the macroscopic, system-level behavior of geometry-regularized transformers. Section 10 develops diagnostic tools and aggregate observables (curvature and holonomy histograms, path-wise error budgets, effective speed-of-thought profiles) that characterize the emergent “geometric phase” of a trained model. Section 11 presents the Davis Topological Processor (DTP) as a hardware-oriented realization of this phase: it analyzes the computational complexity of the geometric losses, identifies commutator-like primitives amenable to specialized accelerators, and sketches how topology-aware sparse transport can physically prune high-holonomy branches of the computation graph. Section 12 formulates the holonomy Hamiltonian $\hat{H}_{\mathrm{hol}}$ on $L^2(M, d\mu_g)$ and interprets low-holonomy, curvature-controlled transformers as low-energy, vacuum-like phases of a gauge-theoretic energy landscape. In this thermodynamic-limit view, the geometric regularizers of Act II become potential energy terms whose minimization drives the system into coherent, low-hallucination phases, providing a macroscopic closure of the categorical and differential structure developed in Acts I and II.
2 Semantic manifold, hidden states, and sameness
Let $(M, g)$ be a $d$-dimensional Riemannian manifold of semantic states with geodesic distance
$$d_g : M \times M \to [0, \infty).$$
We think of each point $z \in M$ as a latent semantic configuration of a sequence (or of a token in context). In practice we expect $d \ll d_h$ (a low-dimensional semantic manifold inside a high-dimensional representation space), consistent with the manifold hypothesis.
Sameness structure. The underlying semantic sameness structure
$$S = \big( \mathcal{I},\ I,\ \{X_i\}_{i \in I},\ \{\pi_i\}_{i \in I},\ \approx,\ \{\mathcal{P}_S(L)\}_{L > 0} \big)$$
consists of:
• a latent space $\mathcal{I}$ of semantic entities (propositions, facts, identity states, ...);
• a finite index set $I$ of modalities (tokenized text, intermediate embeddings, auxiliary sensors, ...);
• observation spaces $X_i$ and rendering maps $\pi_i : \mathcal{I} \to X_i$ (possibly partial);
• a semantic sameness relation $\approx$ on $\bigsqcup_i X_i$ induced by common latent $u \in \mathcal{I}$;
• for each horizon $L > 0$, a family $\mathcal{P}_S(L)$ of benign latent paths $\gamma : [0, 1] \to \mathcal{I}$ representing identity-preserving evolution over semantic length $L$.
Hidden states as semantic projections. Fix a standard transformer with $L_{\mathrm{layers}}$ layers and hidden width $d_h$. Let $h_i^\ell \in \mathbb{R}^{d_h}$ denote the hidden state at layer $\ell \in \{0, \dots, L_{\mathrm{layers}}\}$ and token position $i \in \{1, \dots, n\}$. We assume that, on the subset $\mathcal{H} \subset \mathbb{R}^{d_h}$ of hidden states visited in-distribution, there exists a smooth semantic realization map (or submersion)
$$\Theta : \mathbb{R}^{d_h} \to M$$
of rank $d$ such that
$$z_i^\ell := \Theta(h_i^\ell) \in M$$
is the semantic position of token $i$ at layer $\ell$. Equivalently, one may think of $M$ as an embedded submanifold of $\mathbb{R}^{d_h}$ and $\Theta$ as a smooth projection or retraction onto $M$; the analysis below only requires that $\Theta$ be smooth and have constant rank $d$ on $\mathcal{H}$.
The differential $d\Theta(h)$ decomposes the hidden space into
$$T_h \mathbb{R}^{d_h} = \ker d\Theta(h) \oplus (\ker d\Theta(h))^\perp,$$
where directions in $\ker d\Theta(h)$ move $h$ without changing its semantic position $z = \Theta(h)$ (redundant or stylistic degrees of freedom), while directions in $(\ker d\Theta(h))^\perp$ push $z$ along $M$. For our geometric arguments we implicitly restrict to regions where $d\Theta$ has full rank $d$ and behaves like a submersion onto $M$.
Thus a hidden state $h_i^\ell$ plays a dual role:
1. Content: a vector in the representation space $V \cong \mathbb{R}^{d_h}$, later attached to the fiber $E_{z_i^\ell}$ over its semantic position;
2. Address: a coordinate representation of the semantic point $z_i^\ell = \Theta(h_i^\ell)$ on the manifold $M$.
As $h_i^\ell$ evolves under residual and attention updates, both its content within $V$ and its basepoint $z_i^\ell$ on $M$ change. In this sense the bundle is self-addressing: the internal state encodes its own semantic location, and transporting $h$ generically moves $z = \Theta(h)$.
Throughout we distinguish $h_i^\ell \in V$ from its projection $z_i^\ell \in M$, but for notational convenience we will sometimes write expressions like $d_g(h_i^\ell, h_j^\ell)$ with the understanding that the geodesic distance is taken between their semantic projections:
$$d_g(h_i^\ell, h_j^\ell) \equiv d_g\big(\Theta(h_i^\ell), \Theta(h_j^\ell)\big) = d_g(z_i^\ell, z_j^\ell).$$
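A toy instance may help fix the notation. The sketch below assumes, purely for illustration, that $M$ is the unit circle, that hidden states live in $\mathbb{R}^2$ minus the origin, and that the realization map sends $h$ to its polar angle; none of these choices comes from the paper. In this setting the radial direction spans $\ker d\Theta(h)$ (rescaling $h$ leaves its semantic position fixed) and the tangential direction moves $z$ along $M$.

```python
import math

def theta(h):
    """Toy realization map from R^2 minus the origin to the circle: the polar angle of h."""
    return math.atan2(h[1], h[0])

def geodesic_distance_circle(a, b):
    """Shortest arc length between two angles on the unit circle."""
    diff = abs(a - b) % (2.0 * math.pi)
    return min(diff, 2.0 * math.pi - diff)

def d_g(h_i, h_j):
    """Shorthand d_g(h_i, h_j): distance between the semantic projections."""
    return geodesic_distance_circle(theta(h_i), theta(h_j))

h_i = [1.0, 0.0]
h_j = [0.0, 2.0]
base = d_g(h_i, h_j)

# Kernel direction: rescaling h_i changes the content but not the address.
h_i_scaled = [3.0 * x for x in h_i]
radial = d_g(h_i_scaled, h_j)

# Complementary direction: rotating h_i by 0.2 rad moves it along the circle.
h_i_rotated = [math.cos(0.2), math.sin(0.2)]
tangential = d_g(h_i_rotated, h_j)
```

Here `radial` equals `base`, while `tangential` is smaller by the rotation angle, matching the split into address-preserving and address-changing directions.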
4.3 Davis flows and the category $\mathrm{Flow}(S)$
We next recall the Davis-flow side of the diagram. Roughly speaking, $\mathrm{Flow}(S)$ collects continuous-time semantic flows compatible with $S$ and its Davis manifold realizations.
Definition 4.2 (Davis flows and $\mathrm{Flow}(S)$). An object of $\mathrm{Flow}(S)$ is a tuple
$$F := \big( (M, g, \rho),\ (v_t)_{t \in [0, T]},\ (\Phi_t)_{t \in [0, T]},\ \mathcal{P}_S(L) \big),$$
where:
• $(M, g, \rho) \in \mathrm{SamGeom}(S)$ is a Davis manifold realization of $S$;
• $(v_t)_{t \in [0, T]}$ is a time-dependent vector field on $M$, Lipschitz in $z$ on the relevant geodesic balls, whose integral curves $\Phi_t(z_0)$ realize benign semantic paths $\zeta(t) = \rho(\gamma(t))$ for $\gamma \in \mathcal{P}_S(L)$;
• $\mathcal{P}_S(L)$ is the family of benign latent paths as before, with the distortion bounds (1) between latent length and Riemannian length.
Morphisms in $\mathrm{Flow}(S)$ are maps between such flow systems that push forward one Davis flow to another while respecting $\mathcal{P}_S(L)$ and the Davis error budgets up to bounded distortion.
4.4 Functors between realization categories
We now assemble the functors that will form the corners and edges of the fundamental diagram.
Extractor functor $\Theta : \mathrm{FunTrans}(S) \to \mathrm{SamTrans}(S)$. Given a functorial transformer $T_{\mathrm{fun}}$ as in Definition 4.1, we define
$$\Theta(T_{\mathrm{fun}}) \in \mathrm{SamTrans}(S)$$
by:
• taking modality-specific feature spaces $V_j$ to be appropriate subspaces of hidden state space (or intermediate representations) exposed by the transformer;
• using attention heads, residual blocks, and MLP sublayers to define translator maps between these feature spaces (each attention head and block becomes a translator between internal “modalities”);
• inheriting translator drift profiles and error budgets from the behavior of $\mathcal{P}_{\mathrm{net}}(L^\star)$ and the Davis-style geometric bounds induced via $\Theta$ and $\rho$.
Morphisms in $\mathrm{FunTrans}(S)$ are mapped to morphisms in $\mathrm{SamTrans}(S)$ by pushing forward architecture-preserving reparameterizations and low-drift fine-tunes to the induced translator graphs and their error budgets.
Davis functor $\mathrm{Davis} : \mathrm{SamGeom}(S) \to \mathrm{Flow}(S)$. Given a manifold realization $(M, g, \rho) \in \mathrm{SamGeom}(S)$, the Davis construction associates semantics-preserving flows $(v_t, \Phi_t)$ on $(M, g)$ that realize benign paths $\mathcal{P}_S(L)$ as integral curves; this yields an object of $\mathrm{Flow}(S)$. Morphisms in $\mathrm{SamGeom}(S)$ are pushed forward to corresponding morphisms in $\mathrm{Flow}(S)$ by transporting vector fields and flows.
Holonomy-aware flow functor $\mathrm{HolFlow} : \mathrm{Flow}(S) \to \mathrm{FunTrans}(S)$. Finally, starting from a Davis flow $F \in \mathrm{Flow}(S)$, we define a holonomy-aware flow functor
$$\mathrm{HolFlow} : \mathrm{Flow}(S) \to \mathrm{FunTrans}(S),$$
which associates to the continuous-time flow $(v_t, \Phi_t)$ a family of discrete transformer architectures and parameterizations that approximate the flow with minimal holonomy along benign paths. Concretely, $\mathrm{HolFlow}(F)$:
• chooses a depth $L_{\mathrm{arch}}$ and a time discretization $0 = t_0 < \cdots < t_{L_{\mathrm{arch}}} = T$;
• constructs a transformer whose layer-wise update rules approximate explicit Euler steps of the Davis flow, with curvature-aware step sizes satisfying the CFL-type constraints of Section 1.2;
• equips the resulting transformer with a semantic realization map $\Theta$ and network path family $\mathcal{P}_{\mathrm{net}}(L^\star)$ so that, by Lemma 3.1, the discrete network paths approximate benign semantic paths with controlled distortion.
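The first two steps of this construction (pick a depth, then take explicit Euler steps along the flow) can be sketched on a toy example. Everything below is an illustrative assumption rather than the paper's construction: the flow is a rigid rotation of the plane, the time grid is uniform instead of curvature-aware, and each "layer" is just one Euler update. The point is only that the endpoint error shrinks roughly in proportion to the step size as the depth grows.

```python
import math

def euler_flow(z0, vector_field, T, depth):
    """Approximate the time-T flow of vector_field with `depth` uniform Euler steps."""
    dt = T / depth
    z = list(z0)
    t = 0.0
    for _ in range(depth):
        v = vector_field(t, z)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
        t += dt
    return z

def rotation_field(t, z):
    """Toy vector field v(z) = (-z_2, z_1), whose exact flow is a rotation."""
    return [-z[1], z[0]]

def exact_rotation(z0, T):
    """Closed-form time-T flow of rotation_field."""
    c, s = math.cos(T), math.sin(T)
    return [c * z0[0] - s * z0[1], s * z0[0] + c * z0[1]]

def endpoint_error(depth, T=1.0, z0=(1.0, 0.0)):
    """Distance between the Euler endpoint and the exact flow endpoint."""
    approx = euler_flow(z0, rotation_field, T, depth)
    exact = exact_rotation(z0, T)
    return math.hypot(approx[0] - exact[0], approx[1] - exact[1])

# Endpoint error for increasingly deep discretizations of the same flow.
errors = {depth: endpoint_error(depth) for depth in (8, 16, 32, 64)}
```

Doubling the depth roughly halves the error here, which is the first-order behavior expected of explicit Euler; a curvature-aware scheme would replace the uniform `dt` with a state-dependent step.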
Morphisms in $\mathrm{Flow}(S)$ (maps between flows) are sent to morphisms in $\mathrm{FunTrans}(S)$ by adapting the discretization and parameters so that the induced transformers implement approximately the same flows on benign path families.
4.5 Gauges and approximate commutativity of the fundamental diagram
We now introduce coarse gauges on objects of $\mathrm{SamTrans}(S)$, $\mathrm{SamGeom}(S)$, $\mathrm{FunTrans}(S)$, and $\mathrm{Flow}(S)$ and state an informal fundamental-diagram theorem.
Object-level gauges. In later sections we formalize the following object-level pseudo-metrics:
• For $T_{\mathrm{fun}}, T'_{\mathrm{fun}} \in \mathrm{FunTrans}(S)$, a gauge $\Delta_{\mathrm{FunTrans}}(T_{\mathrm{fun}}, T'_{\mathrm{fun}})$ measuring differences in architecture, parameters, and the induced network path families $\mathcal{P}_{\mathrm{net}}(L^\star)$, with particular weight on semantic distortion along benign paths and on Davis-style error budgets extracted via $\Theta$.
• For $U, U' \in \mathrm{SamTrans}(S)$, a gauge $\Delta_{\mathrm{SamTrans}}(U, U')$ measuring differences in translator drift profiles and error budgets along benign translator chains.
• For $G, G' \in \mathrm{SamGeom}(S)$, a gauge $\Delta_{\mathrm{SamGeom}}(G, G')$ comparing Riemannian metrics, realization maps $\rho$, and Davis error budgets along benign path families.
• For $F, F' \in \mathrm{Flow}(S)$, a gauge $\Delta_{\mathrm{Flow}}(F, F')$ comparing Davis flows $(v_t, \Phi_t)$, with emphasis on their behavior along benign semantic paths $\mathcal{P}_S(L)$ (e.g., sup-norm differences of flows and their induced path-length distortions).
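Approximate commutativity on benign paths. The key structural result is that, after restricting to appropriate well-behaved subcategories
$$\mathrm{FunTrans}_0(S) \subset \mathrm{FunTrans}(S), \quad \mathrm{SamTrans}_0(S) \subset \mathrm{SamTrans}(S), \quad \mathrm{SamGeom}_0(S) \subset \mathrm{SamGeom}(S), \quad \mathrm{Flow}_0(S) \subset \mathrm{Flow}(S),$$
and equipping them with the gauges above, the following diagram
$$
\begin{array}{ccc}
\mathrm{SamTrans}_0(S) & \xrightarrow{\;F_S\;} & \mathrm{SamGeom}_0(S) \\
\big\uparrow \Theta & & \big\downarrow \mathrm{Davis} \\
\mathrm{FunTrans}_0(S) & \xleftarrow{\;\mathrm{HolFlow}\;} & \mathrm{Flow}_0(S)
\end{array}
$$
commutes up to bounded distortion on benign paths.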
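Theorem 4.3 (Fundamental diagram, informal). There exist constants $C_{\mathrm{FunTrans}}, C_{\mathrm{trans}}, C_{\mathrm{geom}}, C_{\mathrm{Flow}} > 0$ and a horizon $L^\star > 0$ such that the following holds. For any semantic sameness structure $S$ and any well-behaved functorial transformer $T_{\mathrm{fun}} \in \mathrm{FunTrans}_0(S)$ with associated translator realization $U := \Theta(T_{\mathrm{fun}}) \in \mathrm{SamTrans}_0(S)$, manifold realization $G := F_S(U) \in \mathrm{SamGeom}_0(S)$, and Davis flow $F := \mathrm{Davis}(G) \in \mathrm{Flow}_0(S)$, consider also the holonomy-minimizing discretization $\tilde{T}_{\mathrm{fun}} := \mathrm{HolFlow}(F) \in \mathrm{FunTrans}_0(S)$ constructed above. Then, when all gauges are evaluated on benign paths of semantic length $L \le L^\star$, we have:
$$
\begin{aligned}
\Delta_{\mathrm{FunTrans}}\big(T_{\mathrm{fun}}, \tilde{T}_{\mathrm{fun}}\big) &\le C_{\mathrm{FunTrans}} \big(\varepsilon_{\mathrm{geom}}(L) + T \varepsilon_{\mathrm{disc}}\big), \\
\Delta_{\mathrm{SamTrans}}\big(\Theta(T_{\mathrm{fun}}), G_S(\mathrm{SamGeom}_0(S))\big) &\le C_{\mathrm{trans}} \big(\varepsilon_{\mathrm{geom}}(L) + T \varepsilon_{\mathrm{disc}}\big), \\
\Delta_{\mathrm{SamGeom}}\big(F_S(\Theta(T_{\mathrm{fun}})), G\big) &\le C_{\mathrm{geom}}\, \varepsilon_{\mathrm{geom}}(L), \\
\Delta_{\mathrm{Flow}}\big(\mathrm{Davis}(F_S(\Theta(T_{\mathrm{fun}}))), F\big) &\le C_{\mathrm{Flow}} \big(\varepsilon_{\mathrm{geom}}(L) + T \varepsilon_{\mathrm{disc}}\big),
\end{aligned}
$$
where $\varepsilon_{\mathrm{geom}}(L)$ is the Davis geometric distortion from (1), $\varepsilon_{\mathrm{disc}}$ is the per-layer discretization error, and $T$ is the time horizon corresponding to semantic length $L$.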
In particular, for fixed $L$ in the benign-path regime and for sufficiently small $\varepsilon_{\mathrm{geom}}(L)$ and $\varepsilon_{\mathrm{disc}}$, the error terms grow at most linearly in the semantic horizon and remain uniformly bounded in depth. The fundamental diagram therefore commutes up to bounded distortion on benign path families: composing along different routes in the diagram yields realizations whose Davis-style error budgets and flows agree within a first-order, non-explosive tolerance.
A fully quantitative version with explicit gauges and constants appears as Theorem 4.3 in Section 4. Operationally, Theorem 4.3 says that, on well-behaved subcategories, we may pass between functorial transformer realizations, translator realizations, manifold realizations, and Davis flows of the same semantic sameness structure $S$ without losing more than a first-order amount of information in the relevant gauges. This is the structural backbone that allows us, in later sections, to port Davis-style geometric guarantees and error budgets into transformer training objectives and diagnostics, and to interpret geometric losses as enforcing approximate membership in $\mathrm{FunTrans}(S)$ and approximate commutativity of the fundamental diagram on benign paths.
5 Gauge-theoretic semantics of attention
We now place transformer attention into a gauge-theoretic framework on the semantic manifold $(M, g)$ and bundle $E = M \times V$ introduced in Section 2. At a high level, each attention head will be modeled as a discrete diffusion–transport operator approximating a covariant heat kernel
\[
K_\omega(x, y; \tau)
\]
on $E \to M$, where $\omega$ is a connection and $\tau > 0$ is an effective diffusion time or temperature. The small-time asymptotics of $K_\omega$ link the softmax attention kernel to the Gaussian form
\[
\exp\!\big(-d_g(x, y)^2 / 4\tau\big),
\]
and the curvature of $\omega$ will be estimated via discrete holonomy on loops in token × depth space.
We then prove a local Poincaré–Hodge-type integrability result: on a small geodesic ball $U \subset M$, uniformly small loop holonomy implies that the connection is nearly pure gauge, $\omega \approx d\Phi$, for some potential $\Phi : U \to \mathrm{End}(V)$.
5.1 Discrete attention as a normalized kernel operator
Fix a single attention head $H$ in some layer $\ell$, with queries, keys, and values given by the usual linear maps:
\[
q_i = W_Q h^\ell_i, \qquad k_j = W_K h^\ell_j, \qquad v_j = W_V h^\ell_j,
\]
where $h^\ell_i \in \mathbb{R}^{d_h}$ is the hidden state at token $i$ and layer $\ell$. The standard scaled dot-product attention defines weights
\[
A_{ij} = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)_{ij} = \frac{\exp\!\big(\langle q_i, k_j\rangle / \sqrt{d_k}\big)}{\sum_{j'} \exp\!\big(\langle q_i, k_{j'}\rangle / \sqrt{d_k}\big)},
\]
and the head output at token $i$ is
\[
h^{\ell,H}_i = \sum_j A_{ij}\, v_j \in V.
\]
Through the semantic realization map $\Theta : \mathbb{R}^{d_h} \to M$ we attach to each token position a semantic location $z^\ell_i = \Theta(h^\ell_i)$; we suppress the layer index when unambiguous and write $z_i$ for the current layer. The head $H$ thus defines a discrete operator on sections $h : \{1, \dots, n\} \to V$:
\[
(A_H h)_i := \sum_{j=1}^{n} A_{ij}\, W_V h_j. \tag{5}
\]
After projection through $\Theta$, we can view this as a kernel operator acting on a section $h : M \to V$ sampled at points $z_1, \dots, z_n$.
We aim to show that, under mild structural assumptions, $A_H$ approximates a normalized covariant heat operator. Because the softmax weights $A_{ij}$ sum to 1 over $j$, the discrete operator preserves constant sections (if transport is trivial) and responds to the local density of tokens. The continuum analogue is the density-normalized operator:
\[
(K_\omega h)(x) := \frac{\int_M K_\omega(x, y; \tau)\, h(y)\, \rho_{\mathrm{data}}(y)\, d\mu_g(y)}{\int_M k_g(x, y; \tau)\, \rho_{\mathrm{data}}(y)\, d\mu_g(y)}, \tag{6}
\]
where $K_\omega$ is the covariant heat kernel, $k_g$ is the scalar heat kernel, and $\rho_{\mathrm{data}}$ is the data density. This corresponds to a random walk diffusion on the semantic manifold, drifting toward high-density regions.
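The row-normalization property used above can be checked numerically. The following is a minimal NumPy sketch, not the paper's implementation: the helper names `attention_weights` and `apply_head`, the random weights, and the dimensions are all illustrative assumptions. It builds the softmax weights of one head and confirms that, with trivial transport ($W_V = I$), a constant section is left unchanged.

```python
import numpy as np

def attention_weights(h, W_Q, W_K):
    """Row-stochastic softmax weights A[i, j] for a single head."""
    q = h @ W_Q.T                                   # queries, shape (n, d_k)
    k = h @ W_K.T                                   # keys, shape (n, d_k)
    scores = q @ k.T / np.sqrt(q.shape[1])          # scaled dot products
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

def apply_head(A, section, W_V):
    """Discrete operator (A_H s)_i = sum_j A_ij W_V s_j, cf. equation (5)."""
    return A @ (section @ W_V.T)

rng = np.random.default_rng(0)
n, d_h, d_k = 6, 4, 3                               # illustrative sizes
h = rng.normal(size=(n, d_h))
W_Q = rng.normal(size=(d_k, d_h))
W_K = rng.normal(size=(d_k, d_h))

A = attention_weights(h, W_Q, W_K)
row_sums = A.sum(axis=1)

# Constant section with trivial transport (W_V = identity) is preserved.
s_const = np.tile(rng.normal(size=d_h), (n, 1))
out_const = apply_head(A, s_const, np.eye(d_h))
```

Separating the weights `A` (computed from token states) from the section being averaged mirrors the kernel-operator reading of equation (5).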
5.2 Heat-kernel alignment and small-time asymptotics
To connect softmax attention to heat kernels, we focus on small geodesic balls where $(M, g)$ looks approximately Euclidean and the softmax scores can be expressed as (perturbed) quadratic forms in geodesic distance. The following assumption formalizes this "heat-kernel alignment".
Assumption 5.1 (Heat-kernel alignment). Let $U = B_r(z_\star) \subset M$ be a geodesic ball with $r$ below the injectivity radius at $z_\star$. Suppose that for all tokens $i, j$ whose semantic positions $z_i, z_j$ lie in $U$, the query/key maps factor through $z$:
\[
q_i = q(z_i), \qquad k_j = k(z_j),
\]
for smooth maps $q, k : U \to \mathbb{R}^{d_k}$, and that there exist smooth functions $b, c, \tau$ and a constant $C_{\mathrm{HK}} > 0$ such that
\[
\frac{\langle q(z_i), k(z_j)\rangle}{\sqrt{d_k}} = -\frac{d_g(z_i, z_j)^2}{4\tau(z_i)} + b(z_i) + c(z_j) + r_{ij}, \tag{7}
\]
with remainder terms $|r_{ij}| \le C_{\mathrm{HK}}\, d_g(z_i, z_j)^3$. Moreover, we assume the tokens $\{z_j\}$ form a dense sample of $U$ with empirical measure converging to $\rho_{\mathrm{data}}\, d\mu_g$.
Substituting (7) into the softmax definition,
\[
A_{ij} = \frac{\exp\!\big(-d_g(z_i, z_j)^2 / 4\tau(z_i) + b(z_i) + c(z_j) + r_{ij}\big)}{\sum_{j'} \exp\!\big(-d_g(z_i, z_{j'})^2 / 4\tau(z_i) + b(z_i) + c(z_{j'}) + r_{ij'}\big)},
\]
we see that $b(z_i)$ cancels from numerator and denominator, leaving the leading Gaussian factor $\exp(-d_g^2 / 4\tau)$ weighted by the key-side term $\exp(c(z_j))$. On the continuous side, the scalar heat kernel $k_g(x, y; \tau)$ admits the small-time asymptotic expansion (the leading term of the Minakshisundaram–Pleijel expansion, consistent with Varadhan's formula):
\[
k_g(x, y; \tau) \sim \frac{1}{(4\pi\tau)^{d/2}} \exp\!\big(-d_g(x, y)^2 / 4\tau\big), \qquad \tau \to 0.
\]
This structural match allows us to state the kernel limit theorem connecting discrete attention to $K_\omega$.
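In flat space the alignment in (7) holds exactly, which gives a quick sanity check. The sketch below assumes $M = \mathbb{R}^d$ with the Euclidean metric and scores $\langle z_i, z_j\rangle / 2\tau$ (an illustrative choice of query/key maps, not one taken from the paper). Expanding $-|z_i - z_j|^2/4\tau$ shows that $b(z_i) = |z_i|^2/4\tau$, $c(z_j) = |z_j|^2/4\tau$ and $r_{ij} = 0$, so softmax attention should coincide with a row-normalized Gaussian kernel reweighted by $\exp(c(z_j))$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, tau = 8, 3, 0.5                       # illustrative sizes and temperature
z = rng.normal(size=(n, d))                 # semantic positions z_i in R^d

# Softmax over scores <z_i, z_j> / (2 tau).
scores = z @ z.T / (2.0 * tau)
A_softmax = np.exp(scores - scores.max(axis=1, keepdims=True))
A_softmax /= A_softmax.sum(axis=1, keepdims=True)

# Gaussian kernel exp(-|z_i - z_j|^2 / 4 tau) times exp(c(z_j)),
# with c(z_j) = |z_j|^2 / (4 tau); b(z_i) cancels in each row.
sq_dist = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
c = (z ** 2).sum(axis=1) / (4.0 * tau)
K = np.exp(-sq_dist / (4.0 * tau) + c[None, :])
A_gauss = K / K.sum(axis=1, keepdims=True)

max_gap = np.abs(A_softmax - A_gauss).max()
```

On a curved manifold the remainder $r_{ij}$ no longer vanishes, which is what the cubic bound in Assumption 5.1 controls.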
Proposition 5.2 (Kernel limit theorem for a single head). Let $U = B_r(z_\star) \subset M$ and an attention head $H$ satisfy Assumption 5.1. Assume further that there exists a connection $\omega$ on $E|_U$ such that the value map $W_V$ realizes approximate parallel transport along geodesics in $U$ up to error $O(d_g^2)$. Fix a smooth section $h : U \to V$. Then, for $\tau_{\max}$ sufficiently small and $n$ sufficiently large, the discrete attention operator converges to the normalized covariant heat operator:
\[
\big\| (A_H h)_i - (K_\omega h)(z_i) \big\| \le C\,\tau_{\max} + \varepsilon + \varepsilon_{\mathrm{disc}}, \tag{8}
\]
where $K_\omega$ is defined in (6), $\varepsilon$ is the sampling density, and $\varepsilon_{\mathrm{disc}}$ bounds the approximation errors.
Proof details are provided in Appendix B. In words, a single attention head behaves like normalized covariant heat flow: it diffuses semantic mass in a geodesic neighborhood while transporting fiber values by parallel transport under $\omega$.
5.3 Loop families in token × depth space
The connection $\omega$ is not directly exposed by the network; instead, we access it through discrete holonomy along loops formed by attention and residual edges in token × depth space.
Consider a transformer with $L_{\mathrm{layers}}$ layers and $n$ tokens. We define a discrete graph whose nodes are pairs $(i, \ell)$ with $i \in \{1, \dots, n\}$ (token index) and $\ell \in \{0, \dots, L_{\mathrm{layers}}\}$ (layer index). We introduce two types of edges:
• Horizontal edges (attention). At each layer $\ell$, for each head $H$, the attention pattern induces weighted horizontal edges from $(j, \ell)$ to $(i, \ell)$ with matrix weights approximating $P_\omega(z^\ell_i, z^\ell_j)$ (parallel transport).
• Vertical edges (residual / MLP). For each token $i$ and layer $\ell$, the residual update defines a vertical edge from $(i, \ell)$ to $(i, \ell + 1)$ with matrix weight given by the Jacobian of the residual update in the fiber; in the continuous-time picture this approximates $\exp\!\big(\Delta t_\ell\, v_\ell(z^\ell_i)\big)$.
We define three families of loops in this graph:
Type (A) (within-layer head cycles). Fix a layer $\ell$ and two tokens $i, j$. A type (A) loop is the cycle
\[
(i, \ell) \xrightarrow{\;H\;} (j, \ell) \xrightarrow{\;H'\;} (i, \ell),
\]
formed by following attention from $i$ to $j$ under head $H$ and then from $j$ back to $i$ under head $H'$.
Type (B) (residual-layer cycles). Fix a token $i$ and two layers $\ell < \ell'$. A type (B) loop is the cycle that moves vertically from $(i, \ell)$ to $(i, \ell')$ via residual edges and returns to $(i, \ell)$ by a combination of attention and residual edges chosen so that the semantic projections $z$ approximate a closed curve in $M$.
Type (C) (multi-head, multi-layer virtual cycles). More generally, a type (C) loop is any closed walk in the token × depth graph formed by alternating attention and residual edges, starting and ending at the same node, whose semantic projections $z(t)$ trace a small closed curve in a geodesic ball $U \subset M$.
To each such loop $\Gamma$ we associate a discrete holonomy operator
\[
\mathrm{Hol}_{\mathrm{disc}}(\Gamma) \in \mathrm{End}(V),
\]
defined as the ordered product of the matrix weights along the edges of $\Gamma$. For loops whose semantic projections lie in a ball $U = B_r(z_0)$ with $r$ smaller than the injectivity radius, and whose edge lengths are $O(\sqrt{\tau})$, classical results from gauge theory give the small-loop expansion
\[
\mathrm{Hol}_\omega(\gamma) = \mathcal{P}\exp\!\left(\oint_\gamma \omega\right) = I + \int_\Sigma F + O\!\big(\mathrm{area}(\Sigma)^{3/2}\big),
\]
for a smooth loop $\gamma$ bounding a surface $\Sigma$, where $F$ is the curvature 2-form. Under the kernel limit approximation from Proposition 5.2 and standard discretization estimates, the discrete holonomy $\mathrm{Hol}_{\mathrm{disc}}(\Gamma)$ approximates $\mathrm{Hol}_\omega(\gamma)$ for a corresponding smooth loop $\gamma$ in $U$.
Thus small discrete holonomy on a rich enough family of loops in token × depth space implies small curvature of $\omega$ on $U$. The next subsection translates this into a local Poincaré–Hodge-type integrability statement.
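The ordered-product definition of discrete holonomy can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name `discrete_holonomy`, the four-node loop, and the synthetic edge matrices. For pure-gauge edges $W_{a \to b} = g_b g_a^{-1}$ the product around a closed loop telescopes to the identity; perturbing the edges breaks the telescoping and yields a nonzero deviation $\|\mathrm{Hol}_{\mathrm{disc}}(\Gamma) - I\|$.

```python
import numpy as np

def discrete_holonomy(edge_weights):
    """Ordered product of edge matrices; the first edge traversed acts first."""
    hol = np.eye(edge_weights[0].shape[0])
    for W in edge_weights:
        hol = W @ hol
    return hol

rng = np.random.default_rng(2)
dim, n_nodes = 3, 4                                  # illustrative sizes

# Pure-gauge ("flat") edges W_{a->b} = g_b g_a^{-1} around a closed 4-node loop.
g = [np.eye(dim) + 0.1 * rng.normal(size=(dim, dim)) for _ in range(n_nodes)]
loop = [(a, (a + 1) % n_nodes) for a in range(n_nodes)]
flat_edges = [g[b] @ np.linalg.inv(g[a]) for a, b in loop]
hol_flat = discrete_holonomy(flat_edges)

# Perturbed edges no longer telescope, so the loop picks up holonomy.
curved_edges = [W + 0.05 * rng.normal(size=(dim, dim)) for W in flat_edges]
hol_curved = discrete_holonomy(curved_edges)

dev_flat = np.linalg.norm(hol_flat - np.eye(dim), 2)
dev_curved = np.linalg.norm(hol_curved - np.eye(dim), 2)
```

In a trained network the edge matrices would come from attention transports and residual Jacobians rather than from synthetic gauge factors.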
5.4 Local Poincaré–Hodge-type integrability on geodesic balls
We work on a fixed geodesic ball
\[
U = B_r(z_0) \subset M,
\]
with $r$ less than the injectivity radius at $z_0$, so that $U$ is diffeomorphic to a Euclidean ball (and geodesically convex once $r$ is also below the convexity radius) and has trivial first de Rham cohomology. In such a domain, the classical Poincaré lemma says that a closed 1-form is exact. For connection 1-forms $\omega$ with small curvature $F$, one can prove quantitative "almost flat implies almost pure gauge" statements: there exists a gauge in which $\omega$ is uniformly small and close to an exact form $d\Phi$.
In our setting we do not observe $F$ directly but only discrete holonomy along loops of types (A)–(C). We therefore impose a small discrete holonomy condition on these loops and conclude that, after choosing an appropriate gauge, the connection is nearly integrable on $U$.
Assumption 5.3 (Small discrete holonomy on local loops). Let $U = B_r(z_0) \subset M$ as above. Suppose there exists a family of loops $\{\Gamma_\alpha\}$ in token × depth space whose semantic projections $\gamma_\alpha$ form a basis (in an appropriate sense) of small loops in $U$, such that
\[
\big\| \mathrm{Hol}_{\mathrm{disc}}(\Gamma_\alpha) - I \big\| \le \varepsilon_{\mathrm{hol}}
\]
for all $\alpha$, with $\varepsilon_{\mathrm{hol}}$ sufficiently small. Assume also that attention and residual operators satisfy the kernel-limit conditions, so that
\[
\mathrm{Hol}_{\mathrm{disc}}(\Gamma_\alpha) = \mathrm{Hol}_\omega(\gamma_\alpha) + O(\varepsilon_{\mathrm{disc}}).
\]
Intuitively, Assumption 5.3 says that, up to discretization error, the continuous holonomy $\mathrm{Hol}_\omega(\gamma)$ is close to the identity on all small loops generating $\pi_1(U)$ (which is trivial). We now state the main integrability theorem for attention-induced connections.
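Theorem 5.4 (Local Poincaré–Hodge-type integrability for attention). Let $U = B_r(z_0) \subset M$ be a geodesic ball with $r$ less than the injectivity radius at $z_0$, and let $\omega$ be a connection on $E|_U$ whose associated attention operators satisfy Assumption 5.3. Then there exists a gauge transformation $g : U \to GL(V)$ and a potential $\Phi : U \to \mathrm{End}(V)$ such that, in the gauge where $\omega^g = g^{-1}\omega g + g^{-1}dg$,
\[
\omega^g = d\Phi + \eta, \tag{9}
\]
with the following properties:
1. Small curvature. The curvature $F^g = d\omega^g + \omega^g \wedge \omega^g$ satisfies
\[
\| F^g \|_{L^\infty(U)} \le C_1\,\varepsilon_{\mathrm{hol}} + \varepsilon_{\mathrm{disc}}.
\]
2. Small non-conservative residue. The residual 1-form $\eta$ in (9) satisfies
\[
\| \eta \|_{L^\infty(U)} \le C_2\,\varepsilon_{\mathrm{hol}} + \varepsilon_{\mathrm{disc}},
\]
and can be chosen to obey natural boundary conditions on $\partial U$.
3. Approximate conservative transport. For any two points $x, y \in U$ and any two homotopic curves $\gamma_1, \gamma_2$ in $U$ connecting them, the corresponding parallel transports satisfy
\[
\big\| \mathrm{Hol}_{\omega^g}(\gamma_1) - \mathrm{Hol}_{\omega^g}(\gamma_2) \big\| \le C_3\,\mathrm{area}(\Sigma)\,\big(\varepsilon_{\mathrm{hol}} + \varepsilon_{\mathrm{disc}}\big),
\]
where $\Sigma$ is a surface bounded by $\gamma_1 \cup \gamma_2$.
Here $C_1, C_2, C_3$ are constants depending only on $(U, g)$.
In particular, when $\varepsilon_{\mathrm{hol}}$ is small, attention-defined transports on $U$ are approximately conservative: up to a small residue $\eta$, the connection is pure gauge, $\omega^g \approx d\Phi$, and transport is path-independent within $U$.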
Remark 5.5 (Reasoning potential and consistent semantics). The decomposition (9) justifies viewing $\Phi$ as a local reasoning potential. In the gauge where $\omega^g = d\Phi + \eta$ with $\|\eta\|$ small, the covariant derivative is close to $d + d\Phi$, and flows generated by $\omega^g$ are approximately gradients of $\Phi$. In this regime, semantic updates induced by attention and residual layers behave like gradient flows of a potential, and consistent semantic states ("truth values") can be identified with level sets of $\Phi$.
6 Discretized flows and curvature-aware step control
We now move from geometric structure to dynamics. The goal of this section is twofold: (1) to make precise the interpretation of transformer depth as an explicit Euler discretization of a semantic flow on $(M, g)$, and (2) to derive curvature-aware stability constraints on the effective step sizes. These constraints will be expressed in terms of a spectral curvature proxy $K_{\mathrm{loc}}$ computed from the attention operators, leading to a CFL-like "speed of thought" bound of the form
\[
\Delta t \lesssim \frac{1}{\sqrt{K_{\mathrm{loc}}}}.
\]
6.1 Residual updates as explicit Euler on the semantic manifold
Consider again the hidden state $h^\ell_i \in \mathbb{R}^{d_h}$ at token $i$ and layer $\ell$, with semantic projection $z^\ell_i = \Theta(h^\ell_i) \in M$. A generic transformer layer applies multi-head attention, a feedforward block, and residual connections to produce
\[
h^{\ell+1}_i = h^\ell_i + \mathrm{Res}^\ell_{\mathrm{att}}(h^\ell)_i + \mathrm{Res}^\ell_{\mathrm{ff}}(h^\ell)_i,
\]
where $h^\ell = (h^\ell_1, \dots, h^\ell_n)$, and the residual maps encode attention-mediated transport and local nonlinear updates.
Projecting through $\Theta$ and working in normal coordinates on $(M, g)$ around $z^\ell_i$, we can write the induced semantic update in the form
\[
z^{\ell+1}_i = \exp_{z^\ell_i}\!\big(\Delta t_\ell(z^\ell_i)\, v_\ell(z^\ell_i) + \xi_\ell(z^\ell_i)\big), \tag{10}
\]
where:
• $v_\ell$ is an effective vector field on $M$ representing the infinitesimal semantic drift induced by the $\ell$-th layer at $z$;
• $\Delta t_\ell(z^\ell_i) > 0$ is an effective step size at $(z^\ell_i, \ell)$ that depends on layer scale, residual magnitude, and normalization (e.g., layer norm statistics);
• $\xi_\ell(z^\ell_i)$ is a local error term capturing higher-order nonlinearities of the layer and the mismatch between the true Davis flow and $v_\ell$.
In the regime where $\Delta t_\ell \|v_\ell\|_g$ and $\|\xi_\ell\|_g$ are small compared to the injectivity radius at $z^\ell_i$, we may linearize $\exp_{z^\ell_i}$ and interpret (10) as an explicit Euler step of a time-dependent ODE
\[
\frac{d}{dt} z(t) = v_t\big(z(t)\big),
\]
with $t_\ell$ such that $\Delta t_\ell = t_{\ell+1} - t_\ell$, plus perturbations of order $\|\xi_\ell\|_g$. This is precisely the setting used in the Benign Path Boundedness Lemma (Lemma 3.1), with the additional goal here of controlling stability in terms of curvature.
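To make the update (10) concrete, the following sketch runs explicit Euler steps through the exponential map on the unit sphere $S^2$, which stands in for the semantic manifold. This is purely illustrative: the sphere, the drift field `drift` (tangential pull toward a fixed target), the step size, and the layer count are assumptions, and the error term $\xi_\ell$ is set to zero.

```python
import numpy as np

def sphere_exp(z, u):
    """Exponential map on the unit sphere: geodesic from z along tangent vector u."""
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:
        return z
    return np.cos(norm_u) * z + np.sin(norm_u) * (u / norm_u)

def drift(z, target):
    """Hypothetical drift v(z): projection of (target - z) onto the tangent space at z."""
    w = target - z
    return w - np.dot(w, z) * z

def euler_on_sphere(z0, target, dt, n_layers):
    """Iterate z_{l+1} = exp_{z_l}(dt * v(z_l)), i.e. (10) with xi = 0."""
    z = z0 / np.linalg.norm(z0)
    path = [z]
    for _ in range(n_layers):
        z = sphere_exp(z, dt * drift(z, target))
        path.append(z)
    return np.array(path)

target = np.array([0.0, 0.0, 1.0])
path = euler_on_sphere(np.array([1.0, 0.2, 0.1]), target, dt=0.2, n_layers=40)
norms = np.linalg.norm(path, axis=1)        # should stay on the sphere
final_gap = np.linalg.norm(path[-1] - target)
```

Because each step moves along a geodesic, the iterates remain on the manifold, whereas a naive additive update in the ambient space would drift off it.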
6.2 Linearized stability and Lipschitz bounds on curved manifolds
Stability of explicit Euler for the ODE $\dot{z} = v(z)$ on $(M, g)$ is governed by the local Lipschitz constant of $v$ on the region of interest. For two trajectories $z_1(t), z_2(t)$ starting in a geodesic ball $U = B_r(z_0)$ with $r$ below the injectivity radius, standard Riemannian estimates give
\[
\frac{d}{dt}\, d_g\big(z_1(t), z_2(t)\big) \le L_v\, d_g\big(z_1(t), z_2(t)\big), \tag{11}
\]
where $L_v$ is a local Lipschitz constant of $v$ on $U$, modified by curvature-dependent terms arising from Jacobi fields. More concretely, if the sectional curvature $\sec_g$ on $U$ satisfies $|\sec_g| \le K_{\max}$ and $\|\nabla v\|_g$ is bounded by $L_0$ on $U$, then comparison theorems give a bound of the form
\[
L_v \le L_0 + C_{\mathrm{curv}} \sqrt{K_{\max}}, \tag{12}
\]
for a constant $C_{\mathrm{curv}}$ depending only on $U$ and $g$. Intuitively: even if $v$ is moderately smooth ($L_0$), strongly curved regions (negative sectional curvature in particular, where Jacobi fields grow) can amplify separation between nearby trajectories, effectively increasing the Lipschitz constant of the flow.
For the linear ODE $\dot{x} = Ax$ in Euclidean space, explicit Euler is stable if the step size $\Delta t$ satisfies
\[
\rho(I + \Delta t\, A) \le 1,
\]
where $\rho$ is the spectral radius; for symmetric $A$ with eigenvalues in $(-\infty, 0]$, this reduces to
\[
\Delta t \le \frac{2}{\|A\|_2}.
\]
In our Riemannian setting, linearizing $v$ in normal coordinates around $z$ yields a Jacobian $J(z)$ whose operator norm is controlled by $L_v$. Since (12) bounds $L_v$ from above, the standard Euler stability condition suggests the sufficient condition
\[
\Delta t_\ell(z) \lesssim \frac{1}{L_0 + C_{\mathrm{curv}} \sqrt{K_{\max}}} \le \frac{1}{L_v}, \tag{13}
\]
for steps taken in $U$. When curvature dominates ($\sqrt{K_{\max}} \gg L_0$), this simplifies to a curvature-controlled bound
\[
\Delta t_\ell(z) \lesssim \frac{C_{\mathrm{CFL}}}{\sqrt{K_{\max}}}, \tag{14}
\]
for some stability constant $C_{\mathrm{CFL}}$. This is the geometric analogue of a Courant–Friedrichs–Lewy (CFL) condition: in regions of high curvature, stable explicit integrators must take smaller steps.
In practice we do not have direct access to the sectional curvature bound $K_{\max}$ of $(M, g)$. The next subsection defines a spectral curvature proxy $K_{\mathrm{loc}}$ extracted from the attention operators, which will serve as a data-driven estimate of $K_{\max}$ in (14).
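The Euclidean stability threshold quoted above is easy to verify numerically. The sketch below uses a random symmetric negative semidefinite matrix (an illustrative assumption) and checks that explicit Euler is stable just below $\Delta t = 2/\|A\|_2$ and unstable just above it; the helper `cfl_step` evaluates the right-hand side of (13) with hypothetical constants.

```python
import numpy as np

def euler_is_stable(A, dt):
    """For symmetric A, explicit Euler on x' = A x is stable iff rho(I + dt A) <= 1."""
    eigvals = np.linalg.eigvalsh(np.eye(A.shape[0]) + dt * A)
    return bool(np.max(np.abs(eigvals)) <= 1.0 + 1e-12)

def cfl_step(L0, C_curv, K_max):
    """Step-size scale 1 / (L0 + C_curv * sqrt(K_max)) from (13)."""
    return 1.0 / (L0 + C_curv * np.sqrt(K_max))

rng = np.random.default_rng(3)
B = rng.normal(size=(5, 5))
A = -(B @ B.T)                          # symmetric, eigenvalues in (-inf, 0]
dt_max = 2.0 / np.linalg.norm(A, 2)     # threshold 2 / ||A||_2

stable_below = euler_is_stable(A, 0.99 * dt_max)
stable_above = euler_is_stable(A, 1.01 * dt_max)

# Higher curvature bound -> smaller admissible step (hypothetical constants).
steps = [cfl_step(L0=1.0, C_curv=1.0, K_max=K) for K in (0.0, 1.0, 4.0)]
```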
6.3 Spectral curvature proxy from attention operators
We now define a curvature proxy $K_{\mathrm{loc}}$ based on the spectrum of the per-layer attention operators. The construction proceeds in three steps: (1) identify a propagation operator $P_\ell$ approximating a heat kernel at layer $\ell$; (2) relate the spectrum of $P_\ell$ to the spectrum of a Laplace-type generator; and (3) define $K_{\mathrm{loc}}$ as a normalized spectral spread of this generator.
Per-layer propagation operator. Fix a layer $\ell$ and consider the multi-head attention block, aggregating across heads. Let $A_\ell \in \mathbb{R}^{n \times n}$ be the average attention matrix whose $(i, j)$ entry is the average of the softmax weights from token $j$ to token $i$ across heads (after any masking). As in Section 5, we view $A_\ell$ as a discrete kernel approximating the covariant heat operator $h \mapsto \int K_\omega(x, y; \tau)\, h(y)\, d\mu_g(y)$ for some effective time $\tau_\ell$.
We define a normalized propagation operator
\[
\tilde{P}_\ell := D_\ell^{-1/2} A_\ell D_\ell^{1/2},
\]
where $D_\ell$ is the diagonal matrix of row sums of $A_\ell$ (or a smoothed variant). This symmetrization makes $\tilde{P}_\ell$ self-adjoint in the inner product weighted by $D_\ell$, and in the idealized diffusion limit, the eigenvalues of $\tilde{P}_\ell$ approximate $\exp(-\tau_\ell \lambda_k)$, where $\lambda_k$ are the eigenvalues of a Laplace-type operator $L_\omega$ on $U$.
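From kernel spectrum to generator spectrum. Let $\sigma_{\ell,1}, \dots, \sigma_{\ell,n}$ denote the singular values (which, for symmetric $\tilde{P}_\ell$, coincide with the absolute values of eigenvalues) of $\tilde{P}_\ell$. In the heat-kernel idealization,
\[
\sigma_{\ell,k} \approx e^{-\tau_\ell \lambda_k}, \qquad \lambda_k \ge 0.
\]
We define log-eigenvalues (up to the unknown $\tau_\ell$)
\[
\ell_{\ell,k} := -\log \sigma_{\ell,k} \approx \tau_\ell \lambda_k.
\]
The spread of the $\lambda_k$ encodes how quickly different modes of the semantic field decay under diffusion; on manifolds with high curvature or complex geometry, the high-frequency spectrum tends to be more spread out. Thus a simple proxy for local curvature is the normalized variance of the log-eigenvalues.
Definition 6.1 (Local spectral curvature proxy). Let $\tilde{P}_\ell$ be the normalized propagation operator at layer $\ell$, with singular values $\sigma_{\ell,1}, \dots, \sigma_{\ell,n} \in (0, 1]$. Define
\[
\ell_{\ell,k} := -\log \sigma_{\ell,k}, \qquad \bar{\ell}_\ell := \frac{1}{n} \sum_{k=1}^{n} \ell_{\ell,k},
\]
and set
\[
K_{\mathrm{loc}}(\ell) := \frac{1}{n} \sum_{k=1}^{n} \big(\ell_{\ell,k} - \bar{\ell}_\ell\big)^2. \tag{15}
\]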
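Definition 6.1 translates directly into a few lines of NumPy. The sketch below is one possible reading, not a reference implementation: the function name `spectral_curvature_proxy`, the clipping constant `eps` (added only to guard against $\log 0$), and the random test matrix are assumptions. It also checks that the proxy is unchanged when the input matrix is rescaled, since rescaling shifts every log-singular value by the same constant.

```python
import numpy as np

def spectral_curvature_proxy(A, eps=1e-12):
    """K_loc of Definition 6.1 for one layer's head-averaged attention matrix A."""
    d = A.sum(axis=1)                                   # row sums = diagonal of D
    # D^{-1/2} A D^{1/2}: scale column j by sqrt(d_j), row i by 1/sqrt(d_i).
    P = (A * np.sqrt(d)[None, :]) / np.sqrt(d)[:, None]
    sigma = np.linalg.svd(P, compute_uv=False)
    log_sigma = -np.log(np.clip(sigma, eps, None))      # eps clip is an added safeguard
    return float(np.mean((log_sigma - log_sigma.mean()) ** 2))

rng = np.random.default_rng(4)
scores = rng.normal(size=(6, 6))
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-stochastic

k_loc = spectral_curvature_proxy(A)
k_loc_rescaled = spectral_curvature_proxy(0.5 * A)      # global rescaling
```

For large $n$ a truncated SVD over the leading singular values would be cheaper, at the cost of changing the normalization in (15).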
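The quantity $K_{\mathrm{loc}}(\ell)$ is dimensionless and invariant under global rescaling of $\tilde{P}_\ell$; in the heat-kernel idealization with $\ell_{\ell,k} \approx \tau_\ell \lambda_k$, we have
\[
K_{\mathrm{loc}}(\ell) \approx \tau_\ell^2 \cdot \frac{1}{n} \sum_{k=1}^{n} \big(\lambda_k - \bar{\lambda}\big)^2,
\]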
where the geometric penalty terms were defined in Section 1.2. For the theory it is convenient to group the geometric terms into a single "geometry energy"
Rgeom(θ):=α unL un(θ)+αholLhol(θ)+αin Lin (θ)+αcu Lcu (θ),
with fixed positive weights $\alpha_\bullet$, and to write
λgeom := min{λ un, λhol, λin , λcu },
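so that
\[
\mathcal{L}(\theta) \ge \mathcal{L}_{\mathrm{task}}(\theta) + \lambda_{\mathrm{geom}} R_{\mathrm{geom}}(\theta). \tag{23}
\]
We emphasize two particular geometric energies:
\[
\mathrm{Hol}^2(\theta) := \mathcal{L}_{\mathrm{hol}}(\theta), \tag{24}
\]
\[
K^2_{\mathrm{exc}}(\theta) := \mathcal{L}_{\mathrm{curv}}(\theta), \tag{25}
\]
where $\mathrm{Hol}^2$ measures squared discrete holonomy on sampled loops (cf. (18)), and $K^2_{\mathrm{exc}}$ measures squared excess curvature, i.e., deviation of the spectral curvature proxy $K_{\mathrm{loc}}(\ell)$ from the desired band $[K_{\min}, K_{\max}]$ (cf. (21)). Both quantities vanish in the ideal geometric phase and grow as the connection becomes highly nonintegrable or curvature proxies leave the control band.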
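SGD dynamics. We model training as stochastic gradient descent:
\[
\theta_{t+1} = \theta_t - \alpha_t g_t, \tag{26}
\]
where $g_t$ is an unbiased estimator of the gradient,
\[
\mathbb{E}[g_t \mid \theta_t] = \nabla \mathcal{L}(\theta_t),
\]
produced by sampling mini-batches and loop subsets as in Section 1.2. We make the following standard assumptions.
Assumption 8.1 (SGD regularity). We assume:
1. Coercivity. There exists $\lambda_{\mathrm{wd}} \ge 0$ such that $\mathcal{L}(\theta) + \lambda_{\mathrm{wd}}\|\theta\|^2$ is coercive:
\[
\|\theta\| \to \infty \;\Rightarrow\; \mathcal{L}(\theta) + \lambda_{\mathrm{wd}}\|\theta\|^2 \to \infty.
\]
In particular, sublevel sets $\{\theta : \mathcal{L}(\theta) + \lambda_{\mathrm{wd}}\|\theta\|^2 \le c\}$ of the weight-decayed objective are bounded.
2. Local Lipschitz gradients. $\nabla \mathcal{L}$ is locally Lipschitz, and in particular bounded on sublevel sets of interest: $\|\nabla \mathcal{L}(\theta)\| \le G(\mathcal{L}(\theta))$.
3. Unbiased gradients with bounded variance. There exists $\sigma^2 < \infty$ such that
\[
\mathbb{E}[g_t \mid \theta_t] = \nabla \mathcal{L}(\theta_t), \qquad \mathbb{E}\big[\|g_t - \nabla \mathcal{L}(\theta_t)\|^2 \mid \theta_t\big] \le \sigma^2.
\]
4. Learning rate schedule. The step sizes $(\alpha_t)$ satisfy the Robbins–Monro conditions
\[
\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty
\]
(e.g., $\alpha_t = \alpha_0 / (1 + t)^\beta$ with $1/2 < \beta \le 1$).
Under Assumption 8.1, classical results for nonconvex SGD imply that $\mathcal{L}(\theta_t)$ converges almost surely and that the limit inferior of the gradient norms is zero; see, e.g., standard texts on stochastic approximation.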
8.2 Theorem F: convergence and geometric control
We now formalize Theorem F announced in the introduction: SGD converges to stationary points and their expected holonomy and excess-curvature energies satisfy $O(1/\lambda)$ bounds in the regularization strengths.
Let
\[
\mathcal{L}^{\min}_{\mathrm{task}} := \inf_\theta \mathcal{L}_{\mathrm{task}}(\theta)
\]
denote the infimum of the task loss (achievable or not), and similarly define
\[
\mathcal{L}_{\min} := \inf_\theta \mathcal{L}(\theta).
\]
We consider both global minimizers and SGD limit points.
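Theorem 8.2 (Geometric control of stationary points (Theorem F)). Suppose Assumption 8.1 holds, all geometric penalty terms are nonnegative, and the regularization weights are fixed and positive.
1. Stationarity of limit points. Any almost sure limit point $\theta_\star$ of the SGD iterates $(\theta_t)$ is a stationary point of $\mathcal{L}$:
\[
\nabla \mathcal{L}(\theta_\star) = 0.
\]
2. Geometric control of global minimizers. Let $\theta_{\min}$ be any global minimizer of $\mathcal{L}$ (if it exists), and let $\theta_{\mathrm{ref}}$ be any parameter at which all geometric penalties vanish, so that $\mathcal{L}(\theta_{\mathrm{ref}}) = \mathcal{L}_{\mathrm{task}}(\theta_{\mathrm{ref}})$. Then
\[
\mathrm{Hol}^2(\theta_{\min}) = \mathcal{L}_{\mathrm{hol}}(\theta_{\min}) \le \frac{\mathcal{L}_{\mathrm{task}}(\theta_{\mathrm{ref}}) - \mathcal{L}^{\min}_{\mathrm{task}}}{\lambda_{\mathrm{hol}}}, \tag{27}
\]
\[
K^2_{\mathrm{exc}}(\theta_{\min}) = \mathcal{L}_{\mathrm{curv}}(\theta_{\min}) \le \frac{\mathcal{L}_{\mathrm{task}}(\theta_{\mathrm{ref}}) - \mathcal{L}^{\min}_{\mathrm{task}}}{\lambda_{\mathrm{curv}}}. \tag{28}
\]
In particular, because the numerators do not depend on the regularization weights, as $\lambda_{\mathrm{hol}}, \lambda_{\mathrm{curv}} \to \infty$ we have
\[
\mathrm{Hol}^2(\theta_{\min}) = O(\lambda_{\mathrm{hol}}^{-1}), \qquad K^2_{\mathrm{exc}}(\theta_{\min}) = O(\lambda_{\mathrm{curv}}^{-1}).
\]
3. Geometric control of SGD limit points in expectation. Let $\theta_\star$ be any stationary point for which $\mathcal{L}(\theta_\star)$ is finite, and assume SGD converges in law to a stationary distribution concentrated on such points. Then
\[
\mathbb{E}\,\mathrm{Hol}^2(\theta_\star) \le \frac{C_1}{\lambda_{\mathrm{hol}}}, \tag{29}
\]
\[
\mathbb{E}\,K^2_{\mathrm{exc}}(\theta_\star) \le \frac{C_2}{\lambda_{\mathrm{curv}}}, \tag{30}
\]
where $C_1, C_2$ bound the expected loss gap $\mathbb{E}[\mathcal{L}(\theta_\star)] - \mathcal{L}^{\min}_{\mathrm{task}}$; they depend on the task loss landscape, and are independent of $\lambda_{\mathrm{hol}}, \lambda_{\mathrm{curv}}$ provided this gap stays bounded as the weights grow.
Proof sketch. For (1), under Assumption 8.1, standard nonconvex SGD theory yields that all almost sure limit points of $(\theta_t)$ are stationary; see, e.g., the Robbins–Monro and Benaïm frameworks.
For (2), let $\theta_{\min}$ be a global minimizer. Minimality gives
\[
\mathcal{L}(\theta_{\min}) \le \mathcal{L}(\theta_{\mathrm{ref}}) = \mathcal{L}_{\mathrm{task}}(\theta_{\mathrm{ref}}).
\]
Since every penalty term is nonnegative, dropping all but the holonomy term on the left gives
\[
\mathcal{L}_{\mathrm{task}}(\theta_{\min}) + \lambda_{\mathrm{hol}} \mathcal{L}_{\mathrm{hol}}(\theta_{\min}) \le \mathcal{L}_{\mathrm{task}}(\theta_{\mathrm{ref}}).
\]
Using $\mathcal{L}_{\mathrm{task}}(\theta_{\min}) \ge \mathcal{L}^{\min}_{\mathrm{task}}$ and rearranging yields (27); the same argument with $\lambda_{\mathrm{curv}} \mathcal{L}_{\mathrm{curv}}$ yields (28).
For (3), the same nonnegativity argument gives, at any point $\theta_\star$,
\[
\lambda_{\mathrm{hol}} \mathcal{L}_{\mathrm{hol}}(\theta_\star) \le \mathcal{L}(\theta_\star) - \mathcal{L}_{\mathrm{task}}(\theta_\star) \le \mathcal{L}(\theta_\star) - \mathcal{L}^{\min}_{\mathrm{task}},
\]
and similarly for $\mathcal{L}_{\mathrm{curv}}$. Taking expectations over the limiting distribution of SGD gives (29)–(30), with the constants $C_1, C_2$ arising from bounding the loss gap uniformly over the set of stationary points under consideration.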
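The comparison argument behind the $O(1/\lambda)$ control can be checked on a one-dimensional toy problem. The losses below are hypothetical and chosen only so that the minimizer has a closed form: a task loss $(\theta - 1)^2$ and a penalty $\theta^2$ standing in for the holonomy energy. The sketch verifies the elementary inequality $\lambda\,\mathcal{L}_{\mathrm{hol}}(\theta_{\min}) \le \mathcal{L}_{\mathrm{task}}(\theta_{\mathrm{ref}}) - \inf \mathcal{L}_{\mathrm{task}}$, where $\theta_{\mathrm{ref}} = 0$ is a point with zero penalty.

```python
import numpy as np

def task_loss(theta):
    return (theta - 1.0) ** 2        # hypothetical task loss, minimized at theta = 1

def hol_penalty(theta):
    return theta ** 2                # hypothetical holonomy penalty, zero at theta = 0

def minimizer(lam):
    """Closed-form argmin of (theta - 1)^2 + lam * theta^2."""
    return 1.0 / (1.0 + lam)

lams = np.array([0.1, 1.0, 10.0, 100.0])
theta_min = minimizer(lams)
penalty_at_min = hol_penalty(theta_min)

# Reference point theta_ref = 0 has zero penalty; the task-loss infimum is 0.
gap = task_loss(0.0) - 0.0
bound = gap / lams
```

In this smooth toy case the penalty at the minimizer actually decays like $1/(1+\lambda)^2$, so the $1/\lambda$ bound holds with room to spare.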
Theorem 8.2 makes precise the slogan that increasing geometric regularization forces the network into a low-holonomy, curvature-controlled regime. The $O(1/\lambda)$ scaling is what this comparison argument delivers without further structure on the losses: as $\lambda_{\mathrm{hol}}$ and $\lambda_{\mathrm{curv}}$ grow while $\mathcal{L}_{\mathrm{task}}$ remains bounded below, holonomy and excess curvature are driven toward zero, at a rate controlled by the unavoidable task-loss gap.
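8.3 Geometric phases and a critical regularization scale
We now formalize the notion of a geometric phase transition at a critical regularization scale $\lambda_{\mathrm{crit}}$. Intuitively, there are two qualitatively different classes of parameters:
• a degenerate phase of "geometry-free" networks with large holonomy or badly behaved curvature proxies (e.g., path-dependent, teleporting semantics), and
• a geometric phase of functorial transformers with small holonomy and well-controlled curvature, compatible with the Davis manifold and the fundamental diagram.
We formalize this via level sets of the geometry energy $R_{\mathrm{geom}}$.
Definition 8.3 (Degenerate and geometric phases). Fix thresholds $0 < \varepsilon_{\mathrm{good}} < \varepsilon_{\mathrm{null}}$. Define
\[
S_{\mathrm{good}} := \{\theta : R_{\mathrm{geom}}(\theta) \le \varepsilon_{\mathrm{good}}\}, \qquad
S_{\mathrm{null}} := \{\theta : R_{\mathrm{geom}}(\theta) \ge \varepsilon_{\mathrm{null}}\}.
\]
We say that networks in $S_{\mathrm{good}}$ are in the geometric phase, while those in $S_{\mathrm{null}}$ are in the degenerate phase. By construction these sets are disjoint if $\varepsilon_{\mathrm{good}} < \varepsilon_{\mathrm{null}}$.
We further define the best achievable task loss within each phase:
\[
\mathcal{L}^{\mathrm{good}}_{\mathrm{task}} := \inf_{\theta \in S_{\mathrm{good}}} \mathcal{L}_{\mathrm{task}}(\theta), \qquad
\mathcal{L}^{\mathrm{null}}_{\mathrm{task}} := \inf_{\theta \in S_{\mathrm{null}}} \mathcal{L}_{\mathrm{task}}(\theta).
\]
In many practical settings we expect $\mathcal{L}^{\mathrm{good}}_{\mathrm{task}}$ and $\mathcal{L}^{\mathrm{null}}_{\mathrm{task}}$ to be comparable, or even $\mathcal{L}^{\mathrm{good}}_{\mathrm{task}} \le \mathcal{L}^{\mathrm{null}}_{\mathrm{task}}$; but our analysis does not require this.
We consider a simplified one-parameter family of losses
\[
\mathcal{L}_\lambda(\theta) := \mathcal{L}_{\mathrm{task}}(\theta) + \lambda R_{\mathrm{geom}}(\theta),
\]
with scalar geometric regularization strength $\lambda > 0$, absorbing the individual weights into $R_{\mathrm{geom}}$. Let
\[
\Theta^{\min}_\lambda := \arg\min_\theta \mathcal{L}_\lambda(\theta)
\]
denote the set of global minimizers at regularization strength $\lambda$.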
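Theorem 8.4 (Existence of a critical geometric regularization scale). Assume $S_{\mathrm{good}}$ and $S_{\mathrm{null}}$ are nonempty, and that
\[
\inf_{\theta \in S_{\mathrm{null}}} R_{\mathrm{geom}}(\theta) \ge \varepsilon_{\mathrm{null}}, \qquad \inf_{\theta \in S_{\mathrm{good}}} R_{\mathrm{geom}}(\theta) \le \varepsilon_{\mathrm{good}}, \tag{31}
\]
with $0 < \varepsilon_{\mathrm{good}} < \varepsilon_{\mathrm{null}}$. Define the task-loss gap
\[
\Delta_{\mathrm{task}} := \mathcal{L}^{\mathrm{good}}_{\mathrm{task}} - \mathcal{L}^{\mathrm{null}}_{\mathrm{task}}.
\]
Then:
1. There exists a finite critical regularization strength
\[
\lambda_{\mathrm{crit}} := \max\left\{0, \frac{\Delta_{\mathrm{task}}}{\varepsilon_{\mathrm{null}} - \varepsilon_{\mathrm{good}}}\right\} \tag{32}
\]
such that, for all $\lambda > \lambda_{\mathrm{crit}}$, no global minimizer of $\mathcal{L}_\lambda$ lies in $S_{\mathrm{null}}$.
2. For all $\lambda > \lambda_{\mathrm{crit}}$, every global minimizer $\theta$ satisfies $R_{\mathrm{geom}}(\theta) < \varepsilon_{\mathrm{null}}$; if moreover no global minimizer falls in the intermediate band $\varepsilon_{\mathrm{good}} < R_{\mathrm{geom}} < \varepsilon_{\mathrm{null}}$, then every global minimizer belongs to the geometric phase:
\[
\Theta^{\min}_\lambda \subseteq S_{\mathrm{good}}.
\]
Proof. Let $\theta_{\mathrm{good}}$ be $\varepsilon$-optimal in $S_{\mathrm{good}}$ and $\theta_{\mathrm{null}}$ be $\varepsilon$-optimal in $S_{\mathrm{null}}$, i.e.,
\[
\mathcal{L}_{\mathrm{task}}(\theta_{\mathrm{good}}) \le \mathcal{L}^{\mathrm{good}}_{\mathrm{task}} + \varepsilon, \qquad \mathcal{L}_{\mathrm{task}}(\theta_{\mathrm{null}}) \le \mathcal{L}^{\mathrm{null}}_{\mathrm{task}} + \varepsilon,
\]
with $\varepsilon > 0$ arbitrarily small, and
\[
R_{\mathrm{geom}}(\theta_{\mathrm{good}}) \le \varepsilon_{\mathrm{good}} + \varepsilon, \qquad R_{\mathrm{geom}}(\theta_{\mathrm{null}}) \ge \varepsilon_{\mathrm{null}} - \varepsilon,
\]
by (31). Then
\[
\mathcal{L}_\lambda(\theta_{\mathrm{null}}) \ge \mathcal{L}^{\mathrm{null}}_{\mathrm{task}} + \lambda(\varepsilon_{\mathrm{null}} - \varepsilon) - \varepsilon,
\]
\[
\mathcal{L}_\lambda(\theta_{\mathrm{good}}) \le \mathcal{L}^{\mathrm{good}}_{\mathrm{task}} + \lambda(\varepsilon_{\mathrm{good}} + \varepsilon) + \varepsilon.
\]
Thus, since $\mathcal{L}^{\mathrm{null}}_{\mathrm{task}} - \mathcal{L}^{\mathrm{good}}_{\mathrm{task}} = -\Delta_{\mathrm{task}}$,
\[
\mathcal{L}_\lambda(\theta_{\mathrm{null}}) - \mathcal{L}_\lambda(\theta_{\mathrm{good}}) \ge -\Delta_{\mathrm{task}} + \lambda(\varepsilon_{\mathrm{null}} - \varepsilon_{\mathrm{good}} - 2\varepsilon) - 2\varepsilon.
\]
Choose $\varepsilon > 0$ sufficiently small and then any $\lambda$ satisfying
\[
\lambda > \frac{\Delta_{\mathrm{task}}}{\varepsilon_{\mathrm{null}} - \varepsilon_{\mathrm{good}}} + \delta
\]
for some fixed $\delta > 0$. Then the right-hand side is positive, implying $\mathcal{L}_\lambda(\theta_{\mathrm{null}}) > \mathcal{L}_\lambda(\theta_{\mathrm{good}})$. The lower bound on $\mathcal{L}_\lambda(\theta_{\mathrm{null}})$ uses only $\theta_{\mathrm{null}} \in S_{\mathrm{null}}$ and $\mathcal{L}_{\mathrm{task}}(\theta_{\mathrm{null}}) \ge \mathcal{L}^{\mathrm{null}}_{\mathrm{task}}$, so it holds for every point of $S_{\mathrm{null}}$; it follows that no point in $S_{\mathrm{null}}$ can be a global minimizer once $\lambda > \lambda_{\mathrm{crit}}$ as defined in (32). Letting $\varepsilon \to 0$ and $\delta \to 0$ gives the first claim, and the inclusion $\Theta^{\min}_\lambda \subseteq S_{\mathrm{good}}$ follows because a minimizer outside both $S_{\mathrm{null}}$ and the intermediate band lies in $S_{\mathrm{good}}$.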
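The critical scale in (32) is simple enough to compute by hand, and a small sketch makes the threshold behavior visible. The numbers below are hypothetical, and the helper `loss_gap_lower_bound` evaluates the $\varepsilon \to 0$ limit of the lower bound on $\mathcal{L}_\lambda(\theta_{\mathrm{null}}) - \mathcal{L}_\lambda(\theta_{\mathrm{good}})$, namely $(\mathcal{L}^{\mathrm{null}}_{\mathrm{task}} - \mathcal{L}^{\mathrm{good}}_{\mathrm{task}}) + \lambda(\varepsilon_{\mathrm{null}} - \varepsilon_{\mathrm{good}})$.

```python
def critical_lambda(task_good, task_null, eps_good, eps_null):
    """lambda_crit = max(0, (L_good - L_null) / (eps_null - eps_good)), cf. (32)."""
    if not 0.0 < eps_good < eps_null:
        raise ValueError("need 0 < eps_good < eps_null")
    return max(0.0, (task_good - task_null) / (eps_null - eps_good))

def loss_gap_lower_bound(lam, task_good, task_null, eps_good, eps_null):
    """Lower bound on L_lam(theta_null) - L_lam(theta_good) in the limit epsilon -> 0."""
    return (task_null - task_good) + lam * (eps_null - eps_good)

# Hypothetical numbers: the geometric phase costs 0.3 in task loss.
params = dict(task_good=1.3, task_null=1.0, eps_good=0.1, eps_null=0.4)
lam_crit = critical_lambda(**params)
gap_above = loss_gap_lower_bound(1.1 * lam_crit, **params)   # positive: S_null loses
gap_below = loss_gap_lower_bound(0.9 * lam_crit, **params)   # negative: no guarantee

# When the geometric phase is at least as good on the task, lambda_crit is 0.
lam_crit_free = critical_lambda(task_good=0.9, task_null=1.0, eps_good=0.1, eps_null=0.4)
```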
Theorem 8.4 realizes the geometric phase transition promised in the introduction. For small $\lambda$, global minimizers may reside in the degenerate phase $S_{\mathrm{null}}$, where holonomy is large and curvature proxies are uncontrolled. Once $\lambda$ surpasses $\lambda_{\mathrm{crit}}$, the geometric penalty dominates the task-loss gap between the phases, forcing all global minimizers out of $S_{\mathrm{null}}$ and toward the low-holonomy, curvature-controlled phase $S_{\mathrm{good}}$.
8.4 Interpretation and connection to practice
Theorems 8.2 and 8.4 provide a theoretical backbone for the empirical picture of functorial transformers:
• Theorem F shows that, under mild assumptions, SGD converges (in the sense of limit points) to stationary points whose holonomy and excess-curvature energies scale as $O(1/\lambda_{\mathrm{hol}})$ and $O(1/\lambda_{\mathrm{curv}})$. Increasing geometric regularization thus provably tightens control of the connection induced by attention and the curvature of the effective semantic dynamics.
• The phase-transition theorem shows that, beyond a critical $\lambda_{\mathrm{crit}}$, global minimizers are forced into a geometric phase compatible with the Davis manifold and the fundamental diagram. In this phase, the connection is nearly integrable on benign charts, semantic flows respect CFL-like stability constraints, and the discrete transformer dynamics approximate Davis flows with bounded distortion.
• In practice, SGD explores a neighborhood of these minimizers; the $O(1/\lambda)$ bounds imply that as we increase geometric regularization, the entire explored region in parameter space is constrained to exhibit low holonomy on sampled loops and controlled curvature proxies. This is precisely the regime in which the gauge-theoretic semantics of attention and the curvature-aware "speed of thought" picture are expected to be accurate.
In summary, Act II (Sections 1.2–8) shows that the geometric losses do more than decorate the objective: they carve out a distinct phase of transformer parameter space in which the network behaves as a discretized gauge flow on a Davis manifold, and they provide quantitative control of holonomy and curvature as functions of the regularization strengths. Act III will zoom out further to examine macroscopic diagnostics, phase structure, and hardware realizations of this geometric phase.
9 Macroscopic phases and diagnostics
The theory in Sections 2–8 suggests that geometry-regularized transformers exhibit distinct geometric phases as the regularization strengths (
λ un, λhol, λin , λcu
) are varied. In this section we take a macroscopic view and describe diagnostics that treat a trained transformer as a many-body system, characterized not by individual weights but by distributions of geometric observables.
Concretely, we define a collection of order parameters and associated visualizations that allow us to:
• empirically determine whether a model is in the geometric phase $S_{\mathrm{good}}$ or the degenerate phase $S_{\mathrm{null}}$ (Definition 8.3);
• observe the geometric phase transition at $\lambda_{\mathrm{crit}}$ in terms of histograms and heatmaps of curvature and holonomy;
• verify the CFL-like curvature-aware step-size law from Section 1.2 by comparing predicted step sizes to learned residual magnitudes.
9.1 Geometric observables as order parameters
We begin by defining macroscopic observables derived from the quantities introduced in Sections 5 and 1.2. These observables are designed to play the role of order parameters: scalar or low-dimensional summaries whose distributions distinguish between phases.
Per-layer curvature spectrum and its distribution. From the normalized propagation operator $\tilde{P}_\ell$ at layer $\ell$ we compute the log-singular values
\[
\ell_{\ell,k} := -\log \sigma_{\ell,k}
\]
and the spectral curvature proxy
\[
K_{\mathrm{loc}}(\ell) = \frac{1}{n} \sum_{k=1}^{n} \big(\ell_{\ell,k} - \bar{\ell}_\ell\big)^2,
\]
as in Definition 6.1. To obtain macroscopic diagnostics, we define:
• the per-layer curvature profile
\[
\kappa_{\mathrm{layer}}(\ell) := K_{\mathrm{loc}}(\ell),
\]
viewed as a function of depth $\ell$;
• the curvature histogram
\[
p_{\mathrm{curv}}(x) := \frac{1}{L_{\mathrm{layers}}} \sum_{\ell=0}^{L_{\mathrm{layers}}-1} \delta_{K_{\mathrm{loc}}(\ell)}(x),
\]
approximated empirically by a histogram of $K_{\mathrm{loc}}(\ell)$ over layers.
In the geometric phase $S_{\mathrm{good}}$, $p_{\mathrm{curv}}$ is expected to concentrate within the target band $[K_{\min}, K_{\max}]$, while in the degenerate phase $S_{\mathrm{null}}$, it typically exhibits heavy tails (very high curvature layers) or mass near zero (over-smoothed geometry).
For finer resolution, one can also compute token-level curvature proxies $K_{\mathrm{loc}}(i, \ell)$ by restricting $\tilde{P}_\ell$ to local neighborhoods of token $i$ and form a two-dimensional curvature heatmap
\[
H_{\mathrm{curv}}(i, \ell) := K_{\mathrm{loc}}(i, \ell),
\]
visualized as a depth-by-position image.
Loop holonomy distribution. From the sampled loops $\Gamma \in \mathcal{G}$ in token × depth space (Section 1.2), we obtain discrete holonomy deviations
\[
\Delta_{\mathrm{hol}}(\Gamma; h) = h_{\mathrm{final}} - h_{\mathrm{start}},
\]
for one or a few representative hidden vectors $h$ at the starting node of $\Gamma$. We define the per-loop holonomy energy
\[
\mathrm{Hol}^2(\Gamma) := \mathbb{E}\,\|\Delta_{\mathrm{hol}}(\Gamma; h)\|^2,
\]
where the expectation is taken over the chosen hidden vectors $h$ (and possibly over mini-batch samples).
Aggregating across loops yields:
• the holonomy histogram
\[
p_{\mathrm{Hol}}(x) := \frac{1}{|\mathcal{G}|} \sum_{\Gamma \in \mathcal{G}} \delta_{\mathrm{Hol}^2(\Gamma)}(x),
\]
approximated by a histogram of loop energies $\mathrm{Hol}^2(\Gamma)$;
• the per-layer holonomy profile
\[
\mathrm{Hol}^2_{\mathrm{layer}}(\ell) := \frac{1}{|\mathcal{G}_\ell|} \sum_{\Gamma \in \mathcal{G}_\ell} \mathrm{Hol}^2(\Gamma),
\]
where $\mathcal{G}_\ell$ collects loops whose edges lie entirely between layers $\ell$ and $\ell + 1$ (for type (B)/(C) loops) or at layer $\ell$ (for type (A) loops).
In the geometric phase, $p_{\mathrm{Hol}}$ is sharply peaked near zero and $\mathrm{Hol}^2_{\mathrm{layer}}(\ell)$ is uniformly small across depth, reflecting the low-holonomy regime of Theorem 5.4. In the degenerate phase, $p_{\mathrm{Hol}}$ exhibits a significant tail of loops with large holonomy, and $\mathrm{Hol}^2_{\mathrm{layer}}(\ell)$ often shows spikes at specific layers where semantics "teleports" or loops fail to close.
Speed-of-thought profile and step-size prediction error. From Definition 6.2, the curvature-aware step size and speed-of-thought factor at layer $\ell$ are
\[
\Delta t^{\mathrm{pred}}_\ell = \Delta t_{\mathrm{base}} \cdot \alpha\big(K_{\mathrm{loc}}(\ell)\big), \qquad \alpha\big(K_{\mathrm{loc}}(\ell)\big) = \frac{1}{\sqrt{1 + K_{\mathrm{loc}}(\ell) / \varepsilon_{\mathrm{floor}}}}.
\]
On the other hand, the network implicitly chooses an actual semantic step size via the magnitude of the residual update in semantic space. A natural proxy is the average Riemannian distance between pre- and post-layer states:
\[
\Delta z_{i,\ell} := d_g\big(z^\ell_i, z^{\ell+1}_i\big),
\]
and the per-layer empirical step size
\[
\Delta t^{\mathrm{emp}}_\ell := \frac{1}{n} \sum_{i=1}^{n} \Delta z_{i,\ell},
\]
after appropriate normalization (e.g., dividing by an estimated velocity scale $\|v_\ell(z^\ell_i)\|_g$ if available).
We define the step-size prediction error at layer $\ell$ as
\[
E_{\mathrm{step}}(\ell) := \big|\Delta t^{\mathrm{emp}}_\ell - \Delta t^{\mathrm{pred}}_\ell\big|,
\]
and consider both the profile $E_{\mathrm{step}}(\ell)$ across depth and its aggregate statistics, such as the mean absolute error
\[
\bar{E}_{\mathrm{step}} := \frac{1}{L_{\mathrm{layers}}} \sum_{\ell=0}^{L_{\mathrm{layers}}-1} E_{\mathrm{step}}(\ell).
\]
In the geometric phase, we expect $\Delta t^{\mathrm{emp}}_\ell$ to track $\Delta t^{\mathrm{pred}}_\ell$ closely, resulting in small $E_{\mathrm{step}}(\ell)$ and strong correlation between curvature and effective step size. In the degenerate phase, these quantities decouple: layers may take large semantic steps even in high-curvature regions, signaling violation of the CFL-like condition and potential instability in semantic flows.
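The step-size diagnostic can be sketched in a few lines. The code below assumes the per-layer curvature proxies and empirical step sizes have already been measured; here both are replaced by synthetic placeholder arrays, and the names `speed_of_thought` and `step_size_errors` as well as the values of `dt_base` and `eps_floor` are illustrative assumptions.

```python
import numpy as np

def speed_of_thought(k_loc, eps_floor):
    """alpha(K) = 1 / sqrt(1 + K / eps_floor)."""
    return 1.0 / np.sqrt(1.0 + np.asarray(k_loc, dtype=float) / eps_floor)

def step_size_errors(dt_emp, k_loc, dt_base, eps_floor):
    """Return predicted steps, per-layer absolute errors, and their mean."""
    dt_pred = dt_base * speed_of_thought(k_loc, eps_floor)
    per_layer = np.abs(np.asarray(dt_emp, dtype=float) - dt_pred)
    return dt_pred, per_layer, float(per_layer.mean())

k_loc = np.array([0.0, 0.5, 2.0, 8.0])        # hypothetical per-layer proxies
dt_pred = 1.0 * speed_of_thought(k_loc, eps_floor=1.0)

# Synthetic "measured" steps: predictions plus small known offsets.
dt_emp = dt_pred + np.array([0.01, -0.02, 0.0, 0.03])
_, err, mean_err = step_size_errors(dt_emp, k_loc, dt_base=1.0, eps_floor=1.0)
```

With real data, `dt_emp` would come from geodesic distances between consecutive semantic projections, which in turn requires an estimate of the metric $g$.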
9.2 Phase diagrams and macroscopic signatures of $\lambda_{\mathrm{crit}}$
To study geometric phases as a function of geometric regularization, we train families of models along a one-parameter path
\[
\lambda \mapsto \theta^{\mathrm{train}}_\lambda,
\]
where $\lambda$ scales the geometric regularization in the simplified objective $\mathcal{L}_\lambda(\theta)$ of Theorem 8.4. For each trained model, we compute the macroscopic observables described above and assemble phase diagrams in the ($\lambda$, observable) plane.
Curvature phase diagram. Plotting the mean and variance of $K_{\mathrm{loc}}(\ell)$ across layers as functions of $\lambda$ exhibits a characteristic pattern:
• For small $\lambda$, models in $S_{\mathrm{null}}$ show wide curvature histograms $p_{\mathrm{curv}}$ with heavy tails; the mean curvature may be moderate, but high-curvature outliers are common.
• As $\lambda$ approaches $\lambda_{\mathrm{crit}}$, the variance of $K_{\mathrm{loc}}(\ell)$ across layers drops sharply, and most layers move into the target band $[K_{\min}, K_{\max}]$. This manifests as a narrowing of $p_{\mathrm{curv}}$ and the emergence of a pronounced peak.
• For $\lambda \gg \lambda_{\mathrm{crit}}$, curvature histograms are sharply peaked inside $[K_{\min}, K_{\max}]$; further increases in $\lambda$ provide diminishing returns and may begin to over-regularize, slightly shifting mass toward the lower end of the band.
This behavior is analogous to an order parameter becoming concentrated around a preferred value as a system cools below a critical temperature.
Holonomy phase diagram. Similarly, plotting statistics of $\mathrm{Hol}^2(\Gamma)$ as functions of $\lambda$ reveals:
• In the degenerate phase, the holonomy histogram $p_{\mathrm{Hol}}$ has a broad tail, with a nontrivial fraction of loops exhibiting large $\mathrm{Hol}^2(\Gamma)$. The per-layer holonomy profile $\mathrm{Hol}^2_{\mathrm{layer}}(\ell)$ often shows distinct peaks.
• Near $\lambda_{\mathrm{crit}}$, the tail of $p_{\mathrm{Hol}}$ collapses and the mass accumulates near zero. The maximum of $\mathrm{Hol}^2_{\mathrm{layer}}(\ell)$ across layers drops sharply, indicating that no layer can maintain large holonomy while still being optimal under the increased regularization.
• For $\lambda > \lambda_{\mathrm{crit}}$, the bulk of $p_{\mathrm{Hol}}$ is concentrated near zero, with a rapidly decaying tail; empirical fits of $\mathbb{E}[\mathrm{Hol}^2(\Gamma)]$ versus $\lambda$ are consistent with the $O(1/\lambda)$ scaling predicted by Theorem 8.2.
Taken together, the curvature and holonomy phase diagrams provide empirical confirmation of a bifurcation at a finite $\lambda_{\mathrm{crit}}$, consistent with the theoretical critical scale in Theorem 8.4.
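One way to check the $O(1/\lambda)$ scaling is a log-log fit, where a power law $c\,\lambda^{-p}$ appears as a line of slope $-p$. The following is a minimal sketch, not the authors' fitting code: the function name `fit_power_law_exponent` is hypothetical, and the data are synthetic placeholders generated from an exact $c/\lambda$ law rather than measurements.

```python
import numpy as np

def fit_power_law_exponent(lams, values):
    """Least-squares line through (log lam, log value); returns (slope, intercept)."""
    slope, intercept = np.polyfit(np.log(lams), np.log(values), deg=1)
    return slope, intercept

# Synthetic placeholder data (NOT results from the paper): an exact c / lambda law.
lams = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
mean_hol = 0.3 / lams

slope, intercept = fit_power_law_exponent(lams, mean_hol)
# A slope close to -1 is what O(1/lambda) scaling would look like on real data.
```

On measured holonomy energies one would restrict the fit to $\lambda > \lambda_{\mathrm{crit}}$ and report the uncertainty of the slope.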
Speed-of-thought alignment. A third family of plots compares the predicted curvature-aware step sizes $\Delta^{\mathrm{pred}}_{\ell}$ to the empirical semantic step sizes $\Delta^{\mathrm{emp}}_{\ell}$. The key signals are:
• Scatter plots of $\Delta^{\mathrm{emp}}_{\ell}$ versus $1/\sqrt{K_{\mathrm{loc}}(\ell)}$ across layers and models: in the geometric phase, points cluster tightly around a line, confirming the CFL-like scaling; in the degenerate phase, the scatter is unstructured.
• The profile $E_{\mathrm{step}}(\ell)$ across depth: in the geometric phase, it is uniformly small and flat; in the degenerate phase, it exhibits large variation and spikes.
• The distribution of speed-of-thought factors $\alpha(K_{\mathrm{loc}}(\ell))$ over layers: in the geometric phase, it is narrow and unimodal; in the degenerate phase, it is often bimodal or broad, with some layers effectively running at unsafe speeds.
These diagnostics tie the abstract CFL-like condition to observable consequences in trained models.
9.3 Practical diagnostic procedure
We summarize a practical pipeline for diagnosing whether a given trained transformer is in the geometric phase:
1. Collect hidden states and attention matrices. For a held-out diagnostic set, record $h^{\ell}_{i}$ and attention matrices $A^{\ell}$ for all layers (or a representative subset).
2. Compute curvature proxies. Construct $\widetilde{P}^{\ell}$ and estimate the leading singular values via a low-rank SVD. Compute $K_{\mathrm{loc}}(\ell)$ and, if desired, $K_{\mathrm{loc}}(i, \ell)$.
3. Sample loops and holonomy energies. Using the loop-sampling scheme from Section 1.2, generate a set $G$ of type (A)/(B)/(C) loops per batch and compute $\mathrm{Hol}^2(\Gamma)$ for each.
4. Estimate semantic step sizes. Use $\Theta$ to obtain $z^{\ell}_{i}$, compute $\Delta z_{i,\ell} = d_g(z^{\ell}_{i}, z^{\ell+1}_{i})$, and aggregate to obtain $\Delta^{\mathrm{emp}}_{\ell}$.
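5. Form macroscopic summaries. Build histograms $p_{\mathrm{curv}}$, $p_{\mathrm{Hol}}$, profiles $\kappa_{\mathrm{layer}}(\ell)$, $\mathrm{Hol}^2_{\mathrm{layer}}(\ell)$, $E_{\mathrm{step}}(\ell)$, and scatter plots of $\Delta^{\mathrm{emp}}_{\ell}$ versus $1/\sqrt{K_{\mathrm{loc}}(\ell)}$.
6. Compare to geometric-phase signatures. Check (qualitatively and quantitatively) whether:
• $K_{\mathrm{loc}}(\ell)$ lies in the target band for most layers;
• $p_{\mathrm{Hol}}$ is sharply peaked near zero with a small tail;
• $E_{\mathrm{step}}$ is small and $\Delta^{\mathrm{emp}}_{\ell}$ correlates with $1/\sqrt{K_{\mathrm{loc}}(\ell)}$.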
Models that satisfy these checks are strong candidates for being in $S_{\mathrm{good}}$; models that fail them (e.g., with broad curvature/holonomy histograms and large step-size discrepancies) are likely in $S_{\mathrm{null}}$.
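The final comparison step of the pipeline can be sketched as a small function over precomputed per-layer and per-loop summaries. This is an illustrative sketch under stated assumptions: the function name `geometric_phase_checks` and every threshold (`band_frac`, `corr_min`, `hol_tol`, `step_tol`) are hypothetical choices, not values given in the paper, and the inputs are assumed to have been produced by steps 1 to 5.

```python
import numpy as np

def geometric_phase_checks(k_loc, hol_energies, delta_emp, e_step,
                           k_min, k_max, hol_tol, step_tol,
                           band_frac=0.9, corr_min=0.8):
    """Apply the three geometric-phase checks to precomputed summaries.

    k_loc, delta_emp, e_step: one value per layer; hol_energies: one per loop.
    All thresholds are illustrative assumptions.
    """
    k_loc = np.asarray(k_loc, dtype=float)
    delta_emp = np.asarray(delta_emp, dtype=float)
    # Fraction of layers whose curvature proxy lies in [k_min, k_max].
    in_band = np.mean((k_loc >= k_min) & (k_loc <= k_max))
    # Fraction of sampled loops with holonomy energy above tolerance (the tail).
    hol_tail = np.mean(np.asarray(hol_energies, dtype=float) > hol_tol)
    # Pearson correlation between empirical steps and 1/sqrt(K_loc).
    corr = np.corrcoef(delta_emp, 1.0 / np.sqrt(k_loc))[0, 1]
    return {
        "curvature_in_band": bool(in_band >= band_frac),
        "holonomy_near_zero": bool(hol_tail <= 1.0 - band_frac),
        "step_size_aligned": bool(np.mean(e_step) <= step_tol and corr >= corr_min),
        "correlation": float(corr),
    }
```

A model passing all three boolean checks would be a candidate for $S_{\mathrm{good}}$ in the sense above; the checks are heuristics and do not certify phase membership.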
In the next sections we use these diagnostics not only to verify the presence of a geometric phase and a phase transition as $\lambda$ crosses $\lambda_{\mathrm{crit}}$, but also to motivate hardware-level designs (Section 10) and energy-landscape perspectives (Section 11) that treat low-holonomy, curvature-controlled transformers as macroscopic phases of a gauge-theoretic system.
10 The Davis Topological Processor (DTP)
So far we have treated geometry as a software-level constraint: losses, flows, and diagnostics that live entirely in the training loop. In this section we sketch how the same structure can be realized in hardware via a Davis Topological Processor (DTP): an accelerator that treats attention as topology-aware sparse transport and uses geometric signals (curvature, holonomy, naturality) to gate computation.
Two design goals guide the DTP:
1. Constant-factor overhead. Geometric losses and diagnostics should incur at most a constant-factor overhead over standard multi-head attention and residual blocks: no $O(d_h^3)$ commutator matrices, no dense Riemann tensors.
2. Pruning by geometry. The accelerator should save compute by pruning heads and branches with high holonomy or pathological curvature (“gating by geometry”), so that the net effect on inference cost is neutral or even negative relative to a baseline transformer.
10.1 Design primitives: commutator-like and loop-like kernels
We begin by identifying the core computational primitives used by the geometric losses (Section 1.2) and showing how they can be implemented as slight variants of standard attention and MLP kernels.
Activation-only commutator approximations. The naturality loss $L_{\mathrm{fun}}$ conceptually involves a commutator $[A^{\ell}, R^{\ell}] = A^{\ell} \circ R^{\ell} - R^{\ell} \circ A^{\ell}$ at each layer $\ell$, where $A^{\ell}$ is attention and $R^{\ell}$ is the residual/MLP update. Forming $[A^{\ell}, R^{\ell}]$ as a dense $(d_h \times d_h)$ matrix would be $O(d_h^3)$ and infeasible.
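Instead, as in (17), we only ever apply the two compositions
\[ A^{\ell}\big(h^{\ell} + R^{\ell}(h^{\ell})\big), \qquad A^{\ell}(h^{\ell}) + R^{\ell}\big(A^{\ell}(h^{\ell})\big) \]
to the current activations $h^{\ell}$. This yields a vector-valued commutator action
\[ \Delta^{\ell}_{\mathrm{fun}}(h^{\ell}) := A^{\ell}\big(h^{\ell} + R^{\ell}(h^{\ell})\big) - \Big(A^{\ell}(h^{\ell}) + R^{\ell}\big(A^{\ell}(h^{\ell})\big)\Big), \]
with cost dominated by two extra applications of the same attention/MLP primitives already present in the model.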
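The activation-only evaluation can be illustrated with a short sketch. It assumes the commutator action is the difference of the two compositions above, and it models attention and the residual update as arbitrary callables `attn` and `resid` (hypothetical names); the linear maps in the example are stand-ins chosen so that the result can be compared with an explicit matrix commutator, not actual transformer blocks.

```python
import numpy as np

def commutator_action(attn, resid, h):
    """Delta_fun(h) = A(h + R(h)) - (A(h) + R(A(h))).

    Only activations are touched; no d_h x d_h commutator matrix is formed.
    """
    a_h = attn(h)                       # A(h), reused in both terms
    return attn(h + resid(h)) - (a_h + resid(a_h))

# Toy linear stand-ins (assumption: real A and R are nonlinear blocks).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
R = rng.normal(size=(4, 4))
h = rng.normal(size=4)

delta = commutator_action(lambda x: A @ x, lambda x: R @ x, h)
# For linear maps the action reduces to the usual commutator applied to h.
reference = (A @ R - R @ A) @ h
```

For linear $A$ and $R$ the expression collapses to $[A, R]\,h$, which is why the quantity is called commutator-like; for nonlinear blocks it is only a first-order analogue.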
A DTP-style accelerator therefore does not need any new dense matrix units; it only needs a commutator kernel that:
• reads activations $h^{\ell}$;
[...]
• For suitable step-size schedules, the effective evolution after $L_{\mathrm{layers}}$ layers approximates
\[ h^{L} \approx \exp\Big( \sum_{\ell=0}^{L-1} \Delta t_{\ell}\, L_{\omega} \Big) h^{0}, \]
which is reminiscent of an RG flow that progressively smooths out high-frequency modes and retains coarse, large-scale semantic structure.
• In the geometric phase, the curvature-aware step rules (Section 1.2) and small holonomy (Theorem 5.4) ensure that this evolution is stable and approximately scale-consistent: deeper layers see a semantic manifold whose effective curvature and holonomy are already controlled by the lower layers, analogous to a renormalized effective field theory at longer length scales.
One can therefore think of the pair $(\ell, \Delta t_{\ell})$ as a discrete renormalization scale: early layers correspond to short times and fine semantic resolution; later layers correspond to longer times and coarser semantics. The holonomy Hamiltonian $\hat{H}_{\mathrm{hol}}$ then plays the role of an energy functional whose low-energy phases are invariant (or slowly varying) under this RG-like flow: moving deeper in the network does not create large new curvature or holonomy on benign paths, but rather preserves or further suppresses them.
Again, we stress that this RG language is an analogy. A full renormalization-group treatment would require constructing a family of effective Hamiltonians $H^{(\Lambda)}_{\mathrm{hol}}$ at different scales $\Lambda$ and proving that layer composition implements an RG transformation between them. We leave such a formal development to future work.
11.4 Summary and outlook
The holonomy Hamiltonian picture provides a unifying perspective on the theory developed in this paper:
• The semantic manifold $(M, g)$, connection $\omega$, and field $h$ define a gauge-theoretic configuration space; $\hat{H}_{\mathrm{hol}}$ measures the energy of this configuration in terms of curvature and covariant gradients on the data manifold.
• Geometric losses in training approximate minimizing $\langle h_{\theta} | \hat{H}_{\mathrm{hol}} | h_{\theta} \rangle$ subject to task constraints, and Theorems 8.2 and 8.4 show that increasing regularization drives models toward low-energy, low-holonomy phases.
• The geometric phase $S_{\mathrm{good}}$ is the approximate vacuum manifold of this theory: a set of functorial transformers that behave as discretized gauge flows on a Davis manifold, satisfy CFL-like stability bounds, and exhibit small loop holonomy on benign paths.
• The Davis Topological Processor (Section 10) can be viewed as a physical device for preparing and maintaining these low-energy phases efficiently, using geometry-aware sparsity and gating.
This Hamiltonian framing does not replace the more concrete training-theoretic results; rather, it organizes them into a single “physical theory of semantic geometry” that may support further connections to gauge theory, statistical mechanics, and renormalization in future work.
12 Discussion and Outlook
This paper has proposed a geometric and gauge-theoretic view of transformer computation, tying together three strands of prior work: semantic sameness structures $S$, Davis manifold realizations of detection, and the internal dynamics of modern transformers. The resulting picture treats a trained transformer as a discretized gauge flow on a semantic manifold $(M, g)$, with attention implementing covariant diffusion–transport along a connection $\omega$ on a trivial bundle $E = M \times V$, and depth acting as an explicit Euler discretization of a semantic flow. We have argued that, when endowed with suitable geometric losses, transformers can be brought into a geometric phase in which semantic flows respect CFL-like stability constraints, loop holonomy is small on benign paths, and the fundamental diagram relating translator, manifold, transformer, and flow realizations of $S$ commutes up to bounded distortion.
12.1 Summary of contributions
At a high level, the paper makes four conceptual moves:
1. From semantic equivalence to functorial transformers. Building on the semantic sameness structure $S$ and the realization categories $\mathrm{SamTrans}(S)$ and $\mathrm{SamGeom}(S)$, we introduced a third realization category $\mathrm{FunTrans}(S)$ whose objects are transformers whose internal dynamics realize $S$ as discrete-time flows on $(M, g)$. We constructed the fundamental diagram (Section 4.5) relating functorial transformers, translator systems, geometric realizations, and Davis flows, and showed that it commutes up to bounded distortion on benign paths, using the Benign Path Boundedness Lemma (Section 3).
2. From attention to gauge-theoretic diffusion–transport. We modeled multi-head attention as an approximation of covariant heat flow on $E \to M$ (Section 5), under a heat-kernel alignment assumption that links softmax scores to geodesic distance and attention value maps to parallel transport. We defined discrete loop families in token $\times$ depth space and showed how their holonomy approximates the curvature of $\omega$, leading to a local Poincaré–Hodge-type integrability result (Theorem 5.4) that formalizes the idea that low holonomy implies approximately conservative transports on geodesic charts.
3. From geometry to trainable losses and phase transitions. We translated categorical and gauge-theoretic constraints into differentiable losses: naturality ($L_{\mathrm{fun}}$), holonomy ($L_{\mathrm{hol}}$), inverse-head consistency ($L_{\mathrm{inv}}$), and spectral curvature control ($L_{\mathrm{curv}}$), all implemented with activation-only computations and sparse loop sampling (Section 1.2). We used spectral curvature proxies derived from attention operators to define curvature-aware step sizes and a CFL-like “speed of thought” law (Section 1.2). On the optimization side, we showed that SGD on the full objective converges to stationary points with holonomy and curvature energies bounded as $O(1/\lambda)$ (Theorem 8.2), and that beyond a critical regularization scale $\lambda_{\mathrm{crit}}$ (Theorem 8.4), global minimizers lie in a geometric phase $S_{\mathrm{good}}$ and avoid a degenerate, geometry-free phase $S_{\mathrm{null}}$.
4. From software geometry to hardware and Hamiltonians. We sketched the Davis Topological Processor (DTP), an accelerator that implements geometry-aware kernels (commutator-like, loop-like, and spectral) with constant-factor overhead and uses holonomy/curvature scores to prune heads and edges, i.e., topology-aware sparse transport (Section 10). Finally, we packaged the theory into a holonomy Hamiltonian $\hat{H}_{\mathrm{hol}}$ acting on $L^2(M)$ (Section 11), interpreting low-holonomy, curvature-controlled transformers as low-energy phases of a gauge-theoretic energy functional and depth as a heuristic renormalization scale.
Taken together, these components suggest a “physical theory of semantics” in which reasoning is modeled as low-energy, low-holonomy flow on a semantic manifold, and transformers are mechanisms for approximating such flows with discrete layers and attention kernels.
12.2 Limitations and caveats
Despite the unified picture, several parts of the framework rely on idealizations and proxies. Here we outline the main limitations.
Curvature proxies vs. true curvature. We never compute the Riemann curvature tensor of $(M, g)$ or the exact curvature $F$ of $\omega$. The spectral curvature proxy $K_{\mathrm{loc}}(\ell)$ defined in Section 6.3 is a heuristic built from the spectrum of normalized attention operators. While motivated by heat-kernel theory and Laplacian spectral geometry, it is at best an indirect measure of local curvature and stiffness. The CFL-like step-size law in Section 1.2 therefore controls stability relative to this proxy, not to curvature in a strict Riemannian sense. Understanding when $K_{\mathrm{loc}}$ faithfully reflects semantic curvature, and when it merely captures artifacts of the network parameterization, remains an open question.
Local, not global, integrability. The Poincaré–Hodge-type integrability result in Theorem 5.4 is explicitly local: it applies on geodesic balls $U = B_r(z_0)$ of radius $r$ below the injectivity radius, under small-loop holonomy assumptions. In these domains, we showed that the connection can be put in a gauge where it is close to an exact form $d\Phi$ plus a small residue. We make no claim that $\omega$ is globally integrable or that a single potential $\Phi$ exists across the entire semantic manifold, especially in the presence of topological obstructions or large-scale curvature. The informal language of “truth as a potential” should therefore be read as “truth behaves like a local potential on benign charts where holonomy is small,” not as a global statement.
Semantic realization map and manifold hypothesis. We assumed the existence of a smooth semantic realization map $\Theta : \mathbb{R}^{d_h} \to M$ of rank $d$ and treated $M$ as a low-dimensional embedded submanifold capturing meaningful semantics (Section 2). This is a strong version of the manifold hypothesis and ignores the possibility that semantics may be fundamentally high-dimensional, multimodal, or non-manifold-like (e.g., with branching or discrete structure). In practice, $\Theta$ may be highly non-unique, task-dependent, and only approximately smooth on the subset of hidden states visited during training. The categorical and gauge-theoretic conclusions should be interpreted as holding in those regions where such a semantic chart is reasonable, not as a statement that all hidden states admit a clean geometric interpretation.
Approximate equality of categories and flows. The fundamental diagram in Section 4.5 relates $\mathrm{FunTrans}(S)$, $\mathrm{SamTrans}(S)$, $\mathrm{SamGeom}(S)$, and $\mathrm{Flow}(S)$ via functors and flow constructions. All commutativity statements are approximate and limited to benign paths with controlled length and distortion. Outside of these regimes (e.g., for adversarial inputs, long-range jumps, or heavily out-of-distribution behavior) the relationship between discrete transformer flows and Davis flows may break down, and the error budgets in Theorem 4.3 can become large. The framework does not prevent such failures; it only provides tools to constrain them on the subset of behavior captured by the benign path families.
Optimization idealizations. The training theory in Section 8 assumes a relatively clean SGD regime: coercive objectives, locally Lipschitz gradients, unbiased gradient estimates with bounded variance, and a classical Robbins–Monro learning rate schedule. Real large-scale training pipelines often use adaptive optimizers, gradient clipping, mixed precision, and aggressive scheduling, and may not satisfy these assumptions strictly. The $O(1/\lambda)$ bounds on holonomy and excess curvature should therefore be read as asymptotic trends rather than precise quantitative guarantees in practical settings.
Hamiltonian and RG analogies. The holonomy Hamiltonian $\hat{H}_{\mathrm{hol}}$ and the renormalization picture of depth (Section 11) are presented as organizing analogies, not proven equivalences. We do not construct a full renormalization-group flow or prove convergence to a continuum field theory as depth grows. The correspondence between geometric phases and “vacuum states” of $\hat{H}_{\mathrm{hol}}$ is heuristic in the sense that it relies on approximate discretizations, proxy energies, and empirical data distributions.
12.3 Future directions
The framework opens several lines of work, both theoretical and empirical.
Learning loop families and data-driven benign paths. We treated the loop families used in $L_{\mathrm{hol}}$ and the benign path families $P_S(L)$ as given or hand-designed. A natural next step is to learn these structures:
• designing loop-sampling strategies that adapt to the model, focusing holonomy penalties on regions and heads where inconsistency is empirically highest;
• learning latent “path templates” that capture recurring reasoning trajectories in token $\times$ depth space, and aligning these with semantic benign paths in $\mathcal{I}$;
• jointly optimizing over $S$ (the sameness structure), $\Theta$, and the geometric losses to discover task-specific notions of semantic sameness and benign evolution.
Such data-driven loop and path families could tighten the link between geometry and practice, and potentially reduce the cost of geometric regularization by targeting the most relevant parts of the model’s dynamics.
Cross-modal and multi-agent systems. The formalism is agnostic to modality and can, in principle, accommodate multiple modalities and agents by enlarging the index set $I$ and the latent space $\mathcal{I}$. Extending the theory to cross-modal systems (vision–language, audio–language, code–language) would involve:
• defining multi-modal sameness structures where benign paths traverse different observation spaces $X_i$ and encoders $\phi_i$;
• interpreting cross-attention blocks as connections between bundles over different semantic manifolds, and studying their curvature and holonomy;
• exploring whether geometric phases in one modality (e.g., vision) help regularize or stabilize geometry in another (e.g., language) through shared representations.
Similarly, for multi-agent or tool-using systems, one could study whether geometric losses on inter-agent communication channels encourage consistent, low-holonomy semantics across agents.
Rigorous continuum limits and RG. The renormalization perspective in Section 11 invites a more rigorous treatment. Future work could aim to:
• construct continuum limits of transformer dynamics as $L_{\mathrm{layers}} \to \infty$ with appropriately scaled step sizes, and identify limiting PDEs or SDEs on $(M, g)$;
• define explicit RG transformations between networks of different depths and widths that preserve (or systematically change) the holonomy Hamiltonian;
• study fixed points and phase diagrams of these RG flows, relating them to performance, robustness, and the onset of geometric phases.
Such results would provide a more solid theoretical underpinning for the depth-as-scale heuristic.
Mechanistic interpretability and safety. The geometric tools developed here (loop holonomy, curvature proxies, speed-of-thought profiles) are naturally complementary to mechanistic interpretability. Possible directions include:
• using holonomy- and curvature-based diagnostics to identify circuits or heads responsible for contradictions, hallucinations, or unstable reasoning;
• correlating geometric observables with human-rated measures of consistency, truthfulness, and robustness to distribution shifts;
• designing safety interventions that act directly on geometric quantities (e.g., clamping curvature or holonomy in high-risk deployments) rather than solely on outputs.
Here, the goal would not be to replace existing interpretability methods but to provide a geometric layer of analysis and control.
Hardware co-design and efficient geometry. The Davis Topological Processor sketch suggests that geometric regularization and diagnostics can be brought into the hardware stack. Concrete directions include:
• implementing commutator and loop kernels in existing accelerator architectures and quantifying their cost/benefit tradeoffs;
• exploring inference-time geometry gating policies that adaptively prune computation based on real-time holonomy and curvature estimates;
• co-designing architectures and geometric penalties so that geometry-aware sparsity patterns align with hardware-friendly structures (e.g., block sparsity, tensor core tiling).
Empirical evaluation and ablation. Finally, the framework needs systematic empirical study. Key questions include:
• How do geometric losses affect standard benchmarks (language modeling, reasoning, robustness to perturbations) across scales?
• Do curvature and holonomy diagnostics predict where models are likely to hallucinate or fail on out-of-distribution inputs?
• How sensitive are the observed geometric phases and critical scales $\lambda_{\mathrm{crit}}$ to the choice of proxies, architectures, and training regimes?
12.4 Concluding remarks
We have proposed a way to see transformers not only as stacks of matrices or sequence models, but as physical systems evolving on a learned semantic manifold under a gauge field induced by attention. In this view, geometric regularization is not an aesthetic choice but a way to force the network into a phase where reasoning is stable, path-independent on benign charts, and compatible with a shared notion of semantic sameness.
Many steps in this construction are approximate and local; much remains to be tested, refined, or replaced. Nonetheless, treating transformers as functorial gauge flows on Davis manifolds provides a coherent language in which the geometry of representations, the dynamics of computation, and the structure of hardware accelerators can be studied together. Whether or not this “semantic physics” ultimately becomes the standard way to think about large models, it offers one possible blueprint for constraining and understanding them as they continue to scale.
Appendix
A Proof of phase transition and existence of $\lambda_{\mathrm{crit}}$
We now prove Theorem 8.4, which formalizes the existence of a critical regularization strength $\lambda_{\mathrm{crit}}$ beyond which global minimizers lie in a geometric phase.
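Recall the geometric energy
\[ R_{\mathrm{geom}}(\theta) := \alpha_{\mathrm{fun}} L_{\mathrm{fun}}(\theta) + \alpha_{\mathrm{hol}} L_{\mathrm{hol}}(\theta) + \alpha_{\mathrm{inv}} L_{\mathrm{inv}}(\theta) + \alpha_{\mathrm{curv}} L_{\mathrm{curv}}(\theta), \]
and the phase sets from Definition 8.3:
\[ S_{\mathrm{good}} := \{\theta : R_{\mathrm{geom}}(\theta) \le \varepsilon_{\mathrm{good}}\}, \qquad S_{\mathrm{null}} := \{\theta : R_{\mathrm{geom}}(\theta) \ge \varepsilon_{\mathrm{null}}\}, \]
with $0 < \varepsilon_{\mathrm{good}} < \varepsilon_{\mathrm{null}}$.
Define the best achievable task loss within each phase:
\[ L^{\mathrm{task}}_{\mathrm{good}} := \inf_{\theta \in S_{\mathrm{good}}} L_{\mathrm{task}}(\theta), \qquad L^{\mathrm{task}}_{\mathrm{null}} := \inf_{\theta \in S_{\mathrm{null}}} L_{\mathrm{task}}(\theta). \]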
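We also define the task-loss gap and geometry-energy gap:
\[ \Delta_{\mathrm{task}} := L^{\mathrm{task}}_{\mathrm{good}} - L^{\mathrm{task}}_{\mathrm{null}}, \qquad \delta R := \inf_{\theta \in S_{\mathrm{null}}} R_{\mathrm{geom}}(\theta) - \sup_{\theta \in S_{\mathrm{good}}} R_{\mathrm{geom}}(\theta). \]
By definition of $S_{\mathrm{good}}$ and $S_{\mathrm{null}}$,
\[ \inf_{\theta \in S_{\mathrm{null}}} R_{\mathrm{geom}}(\theta) \ge \varepsilon_{\mathrm{null}}, \qquad \sup_{\theta \in S_{\mathrm{good}}} R_{\mathrm{geom}}(\theta) \le \varepsilon_{\mathrm{good}}, \]
so
\[ \delta R \ge \varepsilon_{\mathrm{null}} - \varepsilon_{\mathrm{good}} > 0. \]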
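Consider the one-parameter family
\[ L_{\lambda}(\theta) := L_{\mathrm{task}}(\theta) + \lambda R_{\mathrm{geom}}(\theta), \qquad \lambda > 0. \]
For each $\lambda$, define the minimal total loss achievable within each phase:
\[ L_{\mathrm{good}}(\lambda) := \inf_{\theta \in S_{\mathrm{good}}} L_{\lambda}(\theta), \qquad L_{\mathrm{null}}(\lambda) := \inf_{\theta \in S_{\mathrm{null}}} L_{\lambda}(\theta). \]
The key comparison is between these two infima, not between arbitrary points.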
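Lemma A.1 (Energy comparison between phases (infimum version)). For any $\lambda > 0$,
\[ L_{\mathrm{null}}(\lambda) - L_{\mathrm{good}}(\lambda) \ge -\Delta_{\mathrm{task}} + \lambda\, \delta R. \tag{36} \]
Proof. By definition,
\[ L_{\mathrm{null}}(\lambda) = \inf_{\theta \in S_{\mathrm{null}}} \big( L_{\mathrm{task}}(\theta) + \lambda R_{\mathrm{geom}}(\theta) \big) \ge \inf_{\theta \in S_{\mathrm{null}}} L_{\mathrm{task}}(\theta) + \lambda \inf_{\theta \in S_{\mathrm{null}}} R_{\mathrm{geom}}(\theta) = L^{\mathrm{task}}_{\mathrm{null}} + \lambda \inf_{\theta \in S_{\mathrm{null}}} R_{\mathrm{geom}}(\theta), \]
and similarly
\[ L_{\mathrm{good}}(\lambda) = \inf_{\theta \in S_{\mathrm{good}}} \big( L_{\mathrm{task}}(\theta) + \lambda R_{\mathrm{geom}}(\theta) \big) \le \inf_{\theta \in S_{\mathrm{good}}} L_{\mathrm{task}}(\theta) + \lambda \sup_{\theta \in S_{\mathrm{good}}} R_{\mathrm{geom}}(\theta) = L^{\mathrm{task}}_{\mathrm{good}} + \lambda \sup_{\theta \in S_{\mathrm{good}}} R_{\mathrm{geom}}(\theta). \]
Subtracting, we obtain
\[ L_{\mathrm{null}}(\lambda) - L_{\mathrm{good}}(\lambda) \ge L^{\mathrm{task}}_{\mathrm{null}} - L^{\mathrm{task}}_{\mathrm{good}} + \lambda \Big( \inf_{\theta \in S_{\mathrm{null}}} R_{\mathrm{geom}}(\theta) - \sup_{\theta \in S_{\mathrm{good}}} R_{\mathrm{geom}}(\theta) \Big) = -\Delta_{\mathrm{task}} + \lambda\, \delta R, \]
which is exactly (36). Since $\delta R \ge \varepsilon_{\mathrm{null}} - \varepsilon_{\mathrm{good}}$, this also implies the weaker explicit bound $L_{\mathrm{null}}(\lambda) - L_{\mathrm{good}}(\lambda) \ge -\Delta_{\mathrm{task}} + \lambda(\varepsilon_{\mathrm{null}} - \varepsilon_{\mathrm{good}})$, with $L_{\mathrm{null}}(\lambda) \ge L^{\mathrm{task}}_{\mathrm{null}} + \lambda \varepsilon_{\mathrm{null}}$ and $L_{\mathrm{good}}(\lambda) \le L^{\mathrm{task}}_{\mathrm{good}} + \lambda \varepsilon_{\mathrm{good}}$.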
Lemma A.1 shows that once
\[ \lambda\, \delta R > \Delta_{\mathrm{task}}, \]
the minimal total loss achievable inside $S_{\mathrm{null}}$ exceeds that achievable inside $S_{\mathrm{good}}$. This yields the existence of a critical regularization strength.
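Proposition A.2 (Existence of $\lambda_{\mathrm{crit}}$). Define
\[ \lambda_{\mathrm{crit}} := \max\Big\{ 0, \frac{\Delta_{\mathrm{task}}}{\delta R} \Big\}. \]
Then for any $\lambda > \lambda_{\mathrm{crit}}$:
\[ \Theta^{\min}_{\lambda} \cap S_{\mathrm{null}} = \emptyset. \tag{37} \]
Furthermore, if we define the gap region $S_{\mathrm{gap}} = \{\theta : \varepsilon_{\mathrm{good}} < R_{\mathrm{geom}}(\theta) < \varepsilon_{\mathrm{null}}\}$, then for sufficiently large $\lambda$, global minimizers are also excluded from $S_{\mathrm{gap}}$, eventually forcing $\Theta^{\min}_{\lambda} \subseteq S_{\mathrm{good}}$.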
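Proof. Take any $\lambda > \lambda_{\mathrm{crit}}$, so that
\[ \lambda\, \delta R - \Delta_{\mathrm{task}} > 0. \]
By Lemma A.1,
\[ L_{\mathrm{null}}(\lambda) - L_{\mathrm{good}}(\lambda) \ge -\Delta_{\mathrm{task}} + \lambda\, \delta R > 0, \]
so
\[ L_{\mathrm{null}}(\lambda) > L_{\mathrm{good}}(\lambda). \]
This implies that any parameter $\theta \in S_{\mathrm{null}}$ has strictly larger total loss than at least one parameter in $S_{\mathrm{good}}$. Therefore no global minimizer can lie in $S_{\mathrm{null}}$, so $\Theta^{\min}_{\lambda} \cap S_{\mathrm{null}} = \emptyset$.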
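To address the gap region, assume $L_{\mathrm{task}}$ is bounded below by $L^{\mathrm{task}}_{\inf}$. For any $\theta \in S_{\mathrm{gap}}$, we have $R_{\mathrm{geom}}(\theta) > \varepsilon_{\mathrm{good}}$. For very large $\lambda$, the geometric penalty will eventually dominate any task-loss advantage relative to $S_{\mathrm{good}}$. Specifically, we consider the bound
\[ \lambda > \sup_{\theta \in S_{\mathrm{gap}}} \frac{L^{\mathrm{task}}_{\mathrm{good}} - L_{\mathrm{task}}(\theta)}{R_{\mathrm{geom}}(\theta) - \varepsilon_{\mathrm{good}}}. \]
For such $\lambda$, every $\theta \in S_{\mathrm{gap}}$ satisfies $L_{\lambda}(\theta) > L^{\mathrm{task}}_{\mathrm{good}} + \lambda \varepsilon_{\mathrm{good}} \ge L_{\mathrm{good}}(\lambda)$, so it cannot be a global minimizer. Assuming $L_{\mathrm{task}}$ and $R_{\mathrm{geom}}$ are continuous and sublevel sets are compact, this supremum is finite (any sequence approaching $\varepsilon_{\mathrm{good}}$ from above would have a limit point in $S_{\mathrm{good}}$, bounding the task loss from below by $L^{\mathrm{task}}_{\mathrm{good}}$). Thus, asymptotically, $\Theta^{\min}_{\lambda} \subseteq S_{\mathrm{good}}$.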
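The argument can be illustrated on a toy finite parameter set, where the infima and suprema reduce to minima and maxima. The sketch below is illustrative only: the four (task loss, geometric energy) pairs and the thresholds `eps_good` and `eps_null` are made-up numbers, and the helper name `minimizer_phase` is hypothetical.

```python
import numpy as np

# Toy finite "parameter space" with made-up values (not from the paper).
task = np.array([1.0, 0.9, 0.2, 0.5])    # L_task(theta)
r_geom = np.array([0.05, 0.1, 1.0, 0.5])  # R_geom(theta)
eps_good, eps_null = 0.1, 0.8

good = r_geom <= eps_good
null = r_geom >= eps_null

delta_task = task[good].min() - task[null].min()   # task-loss gap
delta_r = r_geom[null].min() - r_geom[good].max()  # geometry-energy gap
lam_crit = max(0.0, delta_task / delta_r)

def minimizer_phase(lam):
    """Phase ('good', 'gap' or 'null') of the global minimizer of L_lambda."""
    idx = int(np.argmin(task + lam * r_geom))
    if good[idx]:
        return "good"
    return "null" if null[idx] else "gap"
```

In this toy instance the minimizer sits in the null phase for small $\lambda$, moves into the gap region just above $\lambda_{\mathrm{crit}}$, and reaches the good phase only for larger $\lambda$, mirroring the two-stage statement of Proposition A.2.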
B Proof of kernel limit theorem (Proposition 5.2)
In this appendix we prove the kernel limit theorem for a single attention head (Proposition 5.2). The setting is a geodesic ball $U = B_r(z_\star) \subset M$ with $r$ smaller than the injectivity radius, an attention head $H$ satisfying the heat-kernel alignment Assumption 5.1, and a connection $\omega$ on $E|_U$ such that the value map $W_V$ and output projection realize approximate parallel transport along geodesics up to $O(d_g^2)$ error.
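We show that the discrete operator
\[ (A_H h)_i := \sum_{j=1}^{n} A_{ij}\, W_V\, h(z_j) \]
converges, as $n \to \infty$ and $\tau_{\max} \to 0$, to the normalized covariant heat operator
\[ (K_{\omega} h)(x) := \frac{\int_U k_g(x, y; \tau(x))\, P_{\omega}(x, y)\, h(y)\, \rho_{\mathrm{data}}(y)\, d\mu_g(y)}{\int_U k_g(x, y; \tau(x))\, \rho_{\mathrm{data}}(y)\, d\mu_g(y)}, \tag{38} \]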
where $k_g$ is the scalar heat kernel and $P_{\omega}(x, y) : E_y \to E_x$ is parallel transport. Note that the numerator integrates a vector-valued quantity (transported fiber states) while the denominator integrates a scalar density; the ratio defines a section of $E$. This corresponds to the random walk diffusion on the data manifold (drifting toward high-density regions), consistent with the row-stochastic nature of the softmax attention matrix.
B.1 Geometry of the geodesic ball and normal coordinates
Fix a center point $x \in U$ and work in normal coordinates around $x$:
\[ \exp_x : B_{\rho}(0) \subset \mathbb{R}^d \to U, \]
with $\rho > 0$ small enough that $B_{\rho}(0)$ is mapped diffeomorphically into $U$ and all geodesics in $U$ starting at $x$ are minimizing. In these coordinates each point $y \in U$ is represented as $y = \exp_x(\xi)$ for $\xi \in \mathbb{R}^d$, and the metric admits the expansion
\[ g_{ij}(\xi) = \delta_{ij} + O(|\xi|^2), \tag{39} \]
with coefficients depending smoothly on $x$ and bounded uniformly on $B_{\rho}(0)$. The squared geodesic distance between $x$ and $y = \exp_x(\xi)$ satisfies
\[ d_g(x, y)^2 = |\xi|^2 + O(|\xi|^4), \tag{40} \]
where $|\cdot|$ is the Euclidean norm on $\mathbb{R}^d$. The Riemannian volume form $d\mu_g$ is likewise related to Lebesgue measure by
\[ d\mu_g(y) = \big(1 + O(|\xi|^2)\big)\, d\xi. \tag{41} \]
All $O(\cdot)$ terms are uniform for $x$ in a compact subset of $U$ and $|\xi| \le \rho$.
B.2 Asymptotics of the score function and link to Varadhan’s formula
Recall the heat-kernel alignment assumption (Assumption 5.1): for $z_i, z_j \in U$,
\[ \frac{\langle q(z_i), k(z_j) \rangle}{\sqrt{d_k}} = -\frac{d_g(z_i, z_j)^2}{4\tau(z_i)} + b(z_i) + c(z_j) + r_{ij}, \tag{42} \]
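with $|r_{ij}| \le C_{\mathrm{HK}}\, d_g(z_i, z_j)^3$ for $d_g(z_i, z_j) \le r$, and smooth functions $b, c, \tau$ on $U$ with $\tau > 0$. For fixed $i$ (fixed query location $x = z_i$), define
\[ S_{ij} := -\frac{d_g(x, z_j)^2}{4\tau(x)} + b(x) + c(z_j) + r_{ij}. \]
The attention weights are
\[ A_{ij} = \frac{\exp(S_{ij})}{\sum_{j'} \exp(S_{ij'})}. \]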
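Factor out $e^{b(x)}$ from numerator and denominator:
\[ A_{ij} = \frac{\exp\big( -d_g(x, z_j)^2 / 4\tau(x) + c(z_j) + r_{ij} \big)}{\sum_{j'} \exp\big( -d_g(x, z_{j'})^2 / 4\tau(x) + c(z_{j'}) + r_{ij'} \big)}. \]
The shape of the kernel is governed by the term $-d_g(x, z_j)^2 / 4\tau(x)$.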
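The cancellation of $e^{b(x)}$ is the familiar shift invariance of softmax and can be checked numerically. The sketch below is a toy illustration under stated assumptions: the function names and the sampled values for squared distances, $c(z_j)$, and $r_{ij}$ are hypothetical placeholders, not quantities from a trained model.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - np.max(scores)
    weights = np.exp(shifted)
    return weights / weights.sum()

def attention_row(dist_sq, tau, b, c, r):
    """Row i of A for scores S_ij = -d^2/(4 tau) + b + c_j + r_ij."""
    return softmax(-dist_sq / (4.0 * tau) + b + c + r)

rng = np.random.default_rng(1)
dist_sq = rng.uniform(0.0, 1.0, size=6)   # placeholder d_g(x, z_j)^2
c = rng.normal(scale=0.1, size=6)         # placeholder key bias c(z_j)
r = rng.normal(scale=0.01, size=6)        # placeholder remainder r_ij

row_without_b = attention_row(dist_sq, tau=0.2, b=0.0, c=c, r=r)
row_with_b = attention_row(dist_sq, tau=0.2, b=3.7, c=c, r=r)
```

Both rows agree up to floating-point error, so the query-side bias $b(x)$ carries no information about the kernel shape, whereas $c(z_j)$ does survive normalization.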
Let $k_g(x, y; \tau)$ denote the scalar heat kernel of the Laplace–Beltrami operator $\Delta_g$ on $(M, g)$. Classical heat kernel asymptotics (e.g., Varadhan’s formula) give
\[ \lim_{\tau \to 0} 4\tau \log k_g(x, y; \tau) = -d_g(x, y)^2, \tag{43} \]
uniformly for $x, y$ in compact sets. Specifically, we have the expansion:
\[ k_g(x, y; \tau) = \frac{1}{(4\pi\tau)^{d/2}} \exp\Big( -\frac{d_g(x, y)^2}{4\tau} \Big) \big( a_0(x, y) + O(\tau) \big), \tag{44} \]
with $a_0(x, x) = 1$.
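In flat space the expansion (44) is exact with $a_0 \equiv 1$, which gives a simple numerical check of the limit (43). The following sketch assumes the Euclidean heat kernel on $\mathbb{R}^d$ as a stand-in for $k_g$; the function names are hypothetical, and the computation is done in log space to avoid underflow at small $\tau$.

```python
import numpy as np

def euclidean_heat_kernel_log(dist, tau, dim):
    """log of (4 pi tau)^(-dim/2) * exp(-dist^2 / (4 tau)) on R^dim."""
    return -0.5 * dim * np.log(4.0 * np.pi * tau) - dist**2 / (4.0 * tau)

def varadhan_quantity(dist, tau, dim):
    """4 tau log k(x, y; tau), which should tend to -dist^2 as tau -> 0."""
    return 4.0 * tau * euclidean_heat_kernel_log(dist, tau, dim)

dist, dim = 1.5, 3
taus = [1e-2, 1e-4, 1e-8]
errors = [abs(varadhan_quantity(dist, t, dim) + dist**2) for t in taus]
# The error equals 2 * dim * tau * |log(4 pi tau)| and vanishes as tau -> 0.
```

The normalization factor contributes only a $\tau \log \tau$ correction, which is why the attention score needs to match the squared distance but not the prefactor.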
Comparing the attention score with the logarithm of the heat kernel, we write
\[ \exp(S_{ij}) \propto \exp\big( c(z_j) + r_{ij} \big) \exp\Big( -\frac{d_g(x, z_j)^2}{4\tau(x)} \Big) \tag{45} \]
\[ \approx w_j(x)\, k_g(x, z_j; \tau(x)), \tag{46} \]
where $w_j(x)$ captures the smooth modulation from $c(z_j)$ and the higher-order remainder $r_{ij}$. Since $|r_{ij}| = O(d_g^3)$ and the effective support is $d_g \sim \sqrt{\tau}$, $w_j(x)$ acts as a bounded, slowly varying modulation.
B.3 Fiber transport and discrete sum-to-integral limit
We now incorporate the fiber transport term $W_V$. By assumption, there exists a connection $\omega$ on $E|_U$ such that
\[ W_V h(y) = P_{\omega}(x, y)\, h(y) + \Delta_{\mathrm{PT}}(x, y; h), \tag{47} \]
with error bounded by $C_{\mathrm{PT}}\, d_g(x, y)^2\, \|h\|_{\infty}$. This error contributes a term of order $O(\tau)$ to the final result (as the Gaussian kernel localizes to squared distance $\tau$).
We focus on the main term. Define the kernel proxy function
\[ \psi_x(y) := \exp\big( c(y) \big) \exp\Big( -\frac{d_g(x, y)^2}{4\tau(x)} \Big). \]
The discrete attention operation on the main transport term is
\[ T_i(h) = \frac{\sum_{j=1}^{n} \psi_x(z_j)\, P_{\omega}(x, z_j)\, h(z_j)}{\sum_{j'=1}^{n} \psi_x(z_{j'})}. \]
Multiplying numerator and denominator by $1/n$, we interpret these sums as Monte Carlo approximations of integrals against the empirical measure $\hat{\rho}_n$. Assuming dense sampling convergence to $\rho_{\mathrm{data}}(y)\, d\mu_g(y)$ (Eq. B.3), we have: