FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors
Chenxi Li
Tianjin University
Tianjin, China
[email protected]
Weijie Wang†
University of Trento
Trento, Italy
weijie[email protected]
Qiang Li
Tianjin University
Tianjin, China
[email protected]
Nicu Sebe
University of Trento
Trento, Italy
[email protected]
Bruno Lepri
Fondazione Bruno Kessler
Trento, Italy
lepri@fbk.eu
Weizhi Nie
Tianjin University
Tianjin, China
[email protected]
Abstract
Text-driven object insertion in 3D scenes is an emerging task that enables intuitive scene editing through natural language. Despite its potential, existing 2D editing-based methods often rely on spatial priors such as 2D masks or 3D bounding boxes, and they struggle to keep the inserted object consistent. These limitations hinder flexibility and scalability in real-world applications. In this paper, we propose FreeInsert, a novel framework that leverages foundation models (MLLMs, LGM, and diffusion models) to disentangle object generation and spatial placement, enabling unsupervised and flexible object insertion in 3D scenes without spatial priors. FreeInsert begins with an MLLM-based parser that extracts structured semantics (object types, spatial relationships, and attachment regions) from user instructions. These semantics guide both the reconstruction of the inserted object for 3D consistency and the learning of its degrees of freedom. We first leverage the spatial reasoning capabilities of MLLMs to initialize the object's pose and scale. To further enhance natural integration with the scene, a hierarchical spatially-aware stage is employed to refine the object's placement, incorporating both the spatial semantics and priors inferred by the MLLM. Finally, the object's appearance is enhanced using the inserted-object image to improve visual fidelity. Experimental results demonstrate that FreeInsert enables semantically coherent, spatially precise, and visually realistic 3D insertions without requiring any spatial priors, offering a user-friendly and flexible editing experience. Project page: https://tjulcx.github.io/FreeInsert/.
CCS Concepts
• Computing methodologies → Computer vision.
Weijie Wang† is the corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM '25, Dublin, Ireland
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-2035-2/2025/10
https://doi.org/10.1145/3746027.3755072
[Figure 1 image: text-driven object insertion maps an original scene to an inserted scene from a text prompt and an optional image prompt, with no user-provided spatial priors (2D mask or 3D bounding box). Panel labels: “A courtyard”, “A man”, “A doll”, “A garden”. Example prompts: “A doll wearing a bowtie”, “A man with a moustache”, “Alter the horse to a frog”, “An apple on the table”.]
Figure 1: “No Spatial Priors, Just Prompts.” Existing methods require user-provided spatial priors, which limits their practicality. Our method enables flexible text-driven object insertion without any such priors (e.g., 2D masks or 3D bounding boxes). Given only a text prompt (the image prompt is optional), FreeInsert naturally inserts objects across diverse scenes.
Keywords
Text-Driven 3D Scene Editing; Object Insertion; Diffusion Models; Multimodal Large Language Models; Gaussian Splatting
ACM Reference Format:
Chenxi Li, Weijie Wang†, Qiang Li, Nicu Sebe, Bruno Lepri, and Weizhi Nie. 2025. FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors. In Proceedings of the 33rd ACM International Conference on Multimedia (MM '25), October 27–31, 2025, Dublin, Ireland. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3746027.3755072
1 Introduction
Text-driven 3D generation [3, 24, 28, 40, 47] and editing [2, 43, 46] are gaining traction for enabling the intuitive customization of digital content with a few words. Despite recent advances [9, 10, 12, 17, 23, 25, 32, 37–39, 49] that have made significant progress in editing the geometry and appearance of scene components, flexibly inserting new objects into the scene remains challenging due to difficulties in precise placement and seamless integration.
Recent 3D editing methods leverage diffusion models by first performing text-guided 2D edits on single [6, 30] or multi-view images [12, 49], then lifting them to 3D. Relying solely on textual descriptions for object insertion often leads to insertion failure and suboptimal results due to misinterpretation of the text [48, 49], as shown in Figure 3 for methods like Instruct-NeRF2NeRF [12] and GaussCtrl [41]. Some methods introduce attention mechanisms to capture the spatial relationship between the inserted object and the scene. However, these methods still struggle to accurately determine the inserted object's pose and scale [33, 46, 49]. To address this limitation, other methods leverage user-provided 2D masks [6, 30] or 3D bounding boxes [31, 48] as strong constraints to achieve more controllable and precise insertion. Nevertheless, they often demand specialized expertise [48] and considerable manual effort, limiting their practical usability. In addition, they still face challenges with inaccurate depth estimation [6] and inconsistent 3D multi-view reconstruction due to the modality gap between 2D and 3D. Table 1 summarizes the above discussion.
Motivated by these limitations, we observe that flexible, high-quality object insertion into 3D scenes without manual supervision remains underexplored. The advent of large-scale models [1, 8, 11, 21, 42], which have acquired human commonsense knowledge, has made unsupervised learning increasingly promising. In this paper, we propose FreeInsert, a method that leverages foundation models (MLLMs [1, 8], LGM [34], and a diffusion model [29]) to assist object insertion in 3D scenes without relying on any spatial priors, as shown in Figure 1. Our method removes the need for spatial priors by inferring object insertion directly from high-level textual cues (e.g., “Add [object] to/on [target]”). We argue that the insertion process can essentially be viewed as first generating an object, followed by estimating the transformation, which defines the inserted object's degrees of freedom (pose and scale) relative to the scene.
Specifically, we disentangle the object insertion process into object generation and the estimation of its parameterized degrees of freedom (DoF), both guided by textual descriptions with foundation models. We first obtain a text instruction from the user and parse it into structured semantics (e.g., object type, spatial relation, attachment region) using an MLLM-based [1] object insertion parser. This enables precise, controllable object insertion that aligns with the user's intent. We then employ a 3D-consistent reconstruction model [34] to obtain an initial Gaussian-based object model, which is coarsely inserted into the scene guided by the visual and spatial reasoning capabilities of MLLMs [1, 8]. This step inherently circumvents the inconsistency issues associated with 2D editing-based methods. While the feed-forward procedure provides a lightweight 3D layout, it often suffers from suboptimal placement and imperfect geometry. To address these issues, we propose a two-stage refinement. First, the Hierarchical Spatial-Aware Refinement stage optimizes the object's DoF via spatially-aware score distillation sampling (SSDS) [7] from a pretrained diffusion model [29]. This stage leverages MLLM-derived reasoning results to align the object's DoF with both local and global spatial semantics, enabling more precise and controllable placement. These reasoning results also help the model handle rare spatial compositions, e.g., “Add a pair of sunglasses on the forehead”, thereby improving robustness. In the final appearance refinement stage, we fine-tune a pretrained diffusion model on multi-view renderings of the optimized inserted object and its corresponding inserted-object image, and use it to enhance the object's appearance. By disentangling placement from object generation, our method enables flexible semantic control over insertion while preserving object quality and ensuring coherent, plausible 3D scenes.

Table 1: Comparison of existing methods for object insertion in 3D scenes. Ours achieves high-quality object insertion without manual supervision while keeping 3D consistency.

Method               No Manual Supervision Required   3D View Consistency   Supports Image Prompts
Instruct-N2N [12]    ✓                                ✗                     ✗
GaussCtrl [41]       ✓                                ✓                     ✗
GaussianEditor [6]   ✗                                ✗                     ✗
TIP-Editor [48]      ✗                                ✗                     ✓
FreeInsert           ✓                                ✓                     ✓
To evaluate the proposed method, we applied it to various scenarios, including object-centric, human-centric, and complex outdoor scenes. Our experimental results demonstrate that the proposed approach can insert diverse objects into 3D scenes without requiring manual supervision while achieving multi-view consistent object quality. In summary, our contributions are as follows:
• We address consistent object insertion in diverse 3D scenes using only textual input, removing the need for spatial priors and outperforming existing methods through a framework that disentangles object generation and spatial placement.
• We propose a DoF optimization method for object insertion, using the reasoning capabilities of MLLMs and diffusion models in place of manual supervision. The MLLM's semantic and spatial priors further support SSDS in enhancing precision and robustness.
• We ensure high-quality object generation by maintaining 3D shape consistency via a reconstruction model and refining visual appearance.
• We present the first baseline for evaluating unsupervised 3D scene insertion, with experiments showing competitive performance against state-of-the-art methods.
2 Related Works
Text-Guided 3D Scene Editing. Text-driven 3D scene editing has seen rapid progress, thanks to the rise of diffusion models [13, 29]. Most methods [9, 12, 18, 19, 23, 27, 36] focus on modifying existing content, either globally or locally. Local editing requires precise localization to avoid affecting unrelated regions, which remains challenging. While some works use implicit cues from models like InstructPix2Pix [4] or ControlNet [44], others incorporate explicit constraints such as segmentation masks [37] or cross-attention maps [49]. However, these methods struggle with 3D object insertion, which demands reasoning about semantically appropriate yet physically unoccupied regions for placement. In this work, we primarily focus on the task of object insertion in 3D scenes.
Object Insertion in 3D Scene. In contrast to modifying existing scene content, object insertion remains underexplored. MVInpainter [5] leverages segmentation to identify support regions like table surfaces, but struggles with fine-grained insertions on objects or humans. GaussianEditor [6] and InseRF [30] insert objects using 3D reconstruction models, yet still require user-provided 2D masks and suffer from depth-related localization issues. FocalDreamer [20] attaches parts to base shapes but depends on user-specified 3D parameters (e.g., rotation, translation, scale), and lacks generalization to complex scenes. Other methods [31, 48] guide object generation via diffusion using 3D bounding boxes, which imposes a burden on users. In this work, we aim for unsupervised and broadly applicable 3D object insertion, removing the need for manual annotations or spatial priors.

[Figure 2 image: pipeline diagram with four panels. (a) MLLM-based Object Insertion Parser; (b) Initialization via Large Models (Text2Image, 3D Object Reconstruction, Scene Attachment Region Detection, Object DoF Initialization over scale, rotation, and translation); (c) Hierarchical Spatial Aware Refinement (LoRA, UNet, timestep adjustment); (d) Object Appearance Refinement. Example instruction: “Add a red heart-shaped glasses to the doll”. Example parser outputs shown in the figure: “A red heart-shaped glasses”, “A doll wears a red heart-shaped glasses”, “Glasses align with the eyes”, “Eyes”, “Wears”, “align with”.]
Figure 2: Overview of FreeInsert. Given a text prompt $\mathcal{T}$ and optionally an image prompt $\mathcal{I}_O$, the object insertion process includes four stages: (a) The MLLM-based Object Insertion Parser (see Section 3.2) first extracts structured semantics to support the subsequent stages. (b) The Initialization via Large Models stage (see Section 3.3) generates the object and initializes its $\mathrm{DoF}_O$ in the scene. (c) The Hierarchical Spatial Aware Refinement stage (see Section 3.4) refines the $\mathrm{DoF}_O$. (d) The final stage, Object Appearance Refinement (see Section 3.5), enhances the object's visual quality using the object image $\mathcal{I}_O$.
Large Language Models in 3D Generation and Editing. LLMs, like the GPT [1] and Llama [11] series, have exhibited outstanding efficacy in many text-related tasks. Zhou et al. [47] and Zhou et al. [45] utilize LLMs to provide coarse compositional spatial information from textual descriptions to construct the 3D scene. The multi-modal variants of LLMs [1, 8] incorporate images and are additionally trained on image-text pairs, showing impressive results for visual captioning and visual question answering (VQA). Notably, Molmo [8] can perform pixel-level localization mainly because it was trained with richly annotated image data. This capability is crucial for robust spatial grounding in vision-language tasks. GG-Editor [43] first exploits GPT-4V [1] to better understand both the textual and 3D visual inputs and then infers reasonable local regions for 3D editing. However, it primarily targets object editing, and its preliminary use of the MLLM struggles to ensure spatial precision. In this work, we leverage the text reasoning capabilities and spatial relationship understanding of GPT-4 [1] and Molmo [8], supplemented by a basic detection model [42], to eliminate the reliance on manually provided priors in object insertion.
3 Method
3.1 Problem Statement
Figure 2 illustrates the overall framework of FreeInsert. Given a group of 3D Gaussians $\mathcal{G}_S$ of an input scene and a text prompt $\mathcal{T}$ guiding the insertion of an object into the scene, our algorithm performs high-quality, semantically consistent object insertion without any manual supervision (e.g., 3D bounding boxes or masks). We decouple the object insertion task into object generation and the optimization of the object's 3D degrees of freedom $\mathrm{DoF}_O$ (rotation, translation, scale), both guided by semantic alignment between the resulting scene and the user prompts. The resulting scene is represented by a new set of Gaussians, $\mathcal{G}_{\text{inserted}}$. Moreover, our method allows an image prompt $\mathcal{I}_O$ as input to specify the object's appearance. Formally, the insertion process is defined as:
$$\mathcal{G}_{\text{inserted}} = \mathcal{E}(\mathcal{G}_S, \mathcal{T}, \mathcal{I}_O), \tag{1}$$
where $\mathcal{E}$ denotes the process applied to the inserted object, including its generation, DoF learning in the context of the scene $\mathcal{G}_S$, and appearance refinement.
3.2 MLLM-based Object Insertion Parser
A key challenge in unsupervised object insertion is converting high-level user intent into structured, fine-grained guidance. To address this, we introduce an MLLM-based Object Insertion Parser (MLLM-OIP) that utilizes the MLLM's spatial understanding capability to parse the instruction $\mathcal{T}$ into semantic prompts, providing essential guidance for the subsequent object insertion. Specifically, we provide a prompt template $\mathcal{T}_{\text{parser}}$ and a sampled scene image $\mathcal{I}_S$ as input to the multimodal LLM $\mathcal{M}_{\text{MLLM}}$ [1] to obtain structured outputs. The prompt generation process is formalized as:
$$\mathcal{T}_O, \mathcal{T}_{AR}, \mathcal{T}_{GT}, \mathcal{T}_{IW}, \mathcal{T}_{LT}, \mathcal{T}_{SW} = \mathcal{M}_{\text{MLLM}}(\mathcal{T}, \mathcal{T}_{\text{parser}}, \mathcal{I}_S) \tag{2}$$
Here, the Object Prompt ($\mathcal{T}_O$) is used for the 3D object generation and appearance refinement stages. The Attachment Region Prompt ($\mathcal{T}_{AR}$) plays a crucial role in the initialization of the object's degrees of freedom $\mathrm{DoF}_O$. The remaining four prompts, namely the Global Target Prompt ($\mathcal{T}_{GT}$) with its Object Interaction Word ($\mathcal{T}_{IW}$) and the Local Target Prompt ($\mathcal{T}_{LT}$) with its Spatial Relationship Word ($\mathcal{T}_{SW}$), are employed during the hierarchical spatial-aware refinement stage to refine the $\mathrm{DoF}_O$, supporting global-local semantic alignment.
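To make the structure of the parser output in Eq. (2) concrete, the following is a minimal sketch of how the six prompts could be held and validated once the MLLM has replied. The class name `InsertionPrompts`, the function `parse_mllm_reply`, the JSON keys, and the assumption that the MLLM replies in JSON are all hypothetical and not taken from the paper. The example values reuse the strings shown in Figure 2, and their assignment to fields is illustrative.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InsertionPrompts:
    """Hypothetical container for the six prompts on the left of Eq. (2)."""
    object_prompt: str       # T_O: what to generate
    attachment_region: str   # T_AR: where the object attaches
    global_target: str       # T_GT: description of the whole edited scene
    interaction_word: str    # T_IW: interaction word inside T_GT
    local_target: str        # T_LT: fine-grained placement description
    spatial_word: str        # T_SW: spatial relation inside T_LT


# Assumed JSON keys; a real parser template may use different ones.
_KEYS = {
    "T_O": "object_prompt",
    "T_AR": "attachment_region",
    "T_GT": "global_target",
    "T_IW": "interaction_word",
    "T_LT": "local_target",
    "T_SW": "spatial_word",
}


def parse_mllm_reply(reply: str) -> InsertionPrompts:
    """Turn a JSON reply into InsertionPrompts, failing loudly on missing keys."""
    data = json.loads(reply)
    missing = [key for key in _KEYS if key not in data]
    if missing:
        raise ValueError(f"MLLM reply is missing fields: {missing}")
    return InsertionPrompts(**{field: str(data[key]) for key, field in _KEYS.items()})


# Illustrative reply for "Add a red heart-shaped glasses to the doll".
example_reply = json.dumps({
    "T_O": "A red heart-shaped glasses",
    "T_AR": "Eyes",
    "T_GT": "A doll wears a red heart-shaped glasses",
    "T_IW": "wears",
    "T_LT": "Glasses align with the eyes",
    "T_SW": "align with",
})
prompts = parse_mllm_reply(example_reply)
```

Keeping the output in a frozen, fully typed record makes it easy to reject incomplete MLLM replies before any 3D optimization starts.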
3.3 Initialization via Large Models
Object from Prompts. To avoid 3D inconsistency, we first use a text-to-image (T2I) model [29] to synthesize a text-generated image $\mathcal{I}_O$ of the object from the object description prompt $\mathcal{T}_O$. The synthesized image is then used to recover the 3D geometry $\mathcal{G}_O$ via LGM [34], a single-view reconstruction model that achieves a trade-off between reconstruction quality and efficiency. Other lightweight 3D reconstruction methods [14, 22] can also be adopted. In addition, $\mathcal{I}_O$ can be directly specified by the user, allowing for more precise control over the object's appearance.
Scene's Attachment Region Detection. Intuitively, an object's placement is influenced by the attachment region of the scene and the degrees of freedom within that region. Driven by this, we extract an attachment region $\mathcal{G}_{AR}$ and its associated degrees of freedom $\mathrm{DoF}_{AR}$ $(s_{AR}, r_{AR}, t_{AR})$ from the 3D scene based on the Attachment Region Prompt $\mathcal{T}_{AR}$. This region serves as a crucial spatial reference, guiding the initialization of the inserted object's $\mathrm{DoF}_O$. Specifically, we employ an open-vocabulary detection model, Florence-2 [42], to localize 2D bounding boxes across sampled views of the scene with camera poses $\mathcal{C}_{\text{cam}}$, guided by $\mathcal{T}_{AR}$. For each view, the detected box is converted into a binary mask $\mathcal{I}_{B_{AR}}$, representing the candidate attachment region. The 3D attachment area is parameterized by the degrees of freedom of an initial 3D bounding box $B_{\text{init}}$. We optimize the attachment region by computing the binary cross-entropy between the projected transformed bounding box and the detected attachment region mask $\mathcal{I}_{B_{AR}}$ across all camera views. Thus, $\mathrm{DoF}_{AR}$ is calculated as follows:
$$\mathrm{DoF}_{AR} = \arg\min_{\theta} \sum_{T_{\text{cam}} \in \mathcal{C}_{\text{cam}}} \mathcal{L}_{\text{BCE}}\left(\mathrm{Proj}(B, T_{\text{cam}}),\ \mathcal{I}^{(T_{\text{cam}})}_{B_{AR}}\right), \tag{3}$$
where $\theta = (s, r, t)$ denotes the transformation parameters of the canonical box $B_{\text{init}}$, $F_{\text{affine}}$ is the affine transformation function, and the transformed 3D bounding box is computed as $B = F_{\text{affine}}(B_{\text{init}}, \theta)$. The function $\mathrm{Proj}(B, T_{\text{cam}})$ denotes the 2D projection of the 3D bounding box $B$ onto the image plane under the camera pose $T_{\text{cam}}$. After obtaining $\mathrm{DoF}_{AR}$, the attachment region $\mathcal{G}_{AR}$ is extracted by selecting the 3D Gaussians of the scene representation $\mathcal{G}_S$ that lie within the transformed bounding box $B_{AR} = F_{\text{affine}}(B_{\text{init}}, \mathrm{DoF}_{AR})$. Formally, the attachment region is defined as:
$$\mathcal{G}_{AR} = \{\, g \in \mathcal{G}_S \mid g \in B_{AR} \,\}$$
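The selection of attachment-region Gaussians inside the transformed box can be sketched with NumPy as below. This is only an illustration under stated assumptions: the canonical box is taken to be the unit cube centred at the origin, the affine map is assumed to be scale, then rotation, then translation, and membership is tested on Gaussian centres only. The function name `select_in_box` is hypothetical.

```python
import numpy as np


def select_in_box(centers, scale, rotation, translation):
    """Return a boolean mask of Gaussian centres that fall inside the box.

    centers:     (N, 3) world-space Gaussian centres
    scale:       (3,) per-axis box size (assumed canonical box: [-0.5, 0.5]^3)
    rotation:    (3, 3) rotation matrix R of the box
    translation: (3,) box centre t
    The box maps canonical points q to world points p = R (s * q) + t.
    """
    centers = np.asarray(centers, dtype=float)
    # Invert the affine map: q = R^T (p - t) / s (row-vector form uses "@ R").
    local = (centers - translation) @ rotation / scale
    return np.all(np.abs(local) <= 0.5, axis=1)


# Usage: a box stretched along its local x-axis, rotated 90 degrees about z,
# so its long side lies along the world y-axis.
rot_z90 = np.array([[0.0, -1.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.9, 0.0],   # along the long side: inside
                [0.9, 0.0, 0.0]])  # along the short side: outside
mask = select_in_box(pts, np.array([2.0, 1.0, 1.0]), rot_z90, np.zeros(3))
```

Testing membership in the box's local frame avoids constructing the eight box corners and works for any rotation.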
Object's DoF Initialization. Once the attachment region $\mathcal{G}_{AR}$ and its associated transformation $\mathrm{DoF}_{AR}$ are obtained, we initialize the inserted object's degrees of freedom $\mathrm{DoF}_O$ $(s_O, r_O, t_O)$ accordingly. For the $s_O$ initialization, we assume an intuitive real-world prior: there exists a reasonable relative scale ratio $\lambda_{\text{rel}}$ between the inserted object and the attachment region, which helps ensure a plausible insertion. This ratio is implicitly understood by large-scale language models. Therefore, we leverage $\mathcal{M}_{\text{MLLM}}$ to predict $\lambda_{\text{rel}}$, and compute the object scale as $s_O = s_{AR} \cdot \lambda_{\text{rel}}$. Considering the uncertainty in MLLM predictions and the influence of scale initialization quality on subsequent refinement, we adopt an iterative strategy. After initializing $r_O$ and $t_O$, we render the scene with the inserted object and iteratively interact with the MLLM, using visual feedback to adjust $s_O$ and improve realism and integration.
For the $r_O$ initialization, we leverage $\mathcal{M}_{\text{MLLM}}$ to initialize a semantically appropriate object rotation. Given an MLLM-suggested primary scene viewpoint, we render a scene image $\mathcal{I}_S$ and sample a set of object-centric renderings $\{\mathcal{I}_O^{(r)}\}_{r \in \mathcal{R}}$ of the inserted object, where each $r \in \mathcal{R}$ corresponds to a unique azimuth-elevation rotation $(\phi, \theta) \in [0, 2\pi) \times [0, \pi)$. Based on $\mathcal{I}_S$, the rendering set $\{\mathcal{I}_O^{(r)}\}$, and the Global Target Prompt $\mathcal{T}_{GT}$, the model selects the optimal rotation $r_O$ that maximizes a semantic alignment score:
$$r_O = \arg\max_{r \in \mathcal{R}} \mathcal{M}_{\text{MLLM}}\left(\mathcal{I}_S, \mathcal{I}_O^{(r)}, \mathcal{T}_{GT}\right) \tag{4}$$
where $\mathcal{M}_{\text{MLLM}}$ evaluates how semantically plausible the placement is with respect to the scene.
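The discrete search in Eq. (4) can be sketched as an argmax over a grid of candidate rotations. The sketch below assumes that both azimuth and elevation are sampled on a regular grid (the step size is a free parameter here), and it replaces the MLLM with an arbitrary scoring callable. The names `candidate_rotations`, `select_rotation`, and `score_fn` are hypothetical.

```python
import itertools
import math


def candidate_rotations(step_deg=10):
    """Enumerate (azimuth, elevation) pairs in [0, 2*pi) x [0, pi), in radians."""
    azimuths = range(0, 360, step_deg)
    elevations = range(0, 180, step_deg)
    return [(math.radians(a), math.radians(e))
            for a, e in itertools.product(azimuths, elevations)]


def select_rotation(candidates, score_fn):
    """Return the candidate with the highest score, i.e. the argmax of Eq. (4).

    score_fn stands in for the MLLM judgement of a rendering at that rotation;
    here it is any callable mapping (azimuth, elevation) to a float.
    """
    return max(candidates, key=score_fn)


# Usage with a toy score that prefers rotations near (90 deg, 90 deg).
target = (math.radians(90), math.radians(90))
toy_score = lambda r: -abs(r[0] - target[0]) - abs(r[1] - target[1])
grid = candidate_rotations(step_deg=10)
best = select_rotation(grid, toy_score)
```

In practice each call to the scorer would involve rendering the object and querying the MLLM, so the grid resolution directly controls the number of queries.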
To initialize $t_O$, we use the strong pixel-level semantic spatial localization capability of Molmo [8] to predict a set of 2D object centers $\{c_O^{(v)}\}_{v \in \mathcal{V}}$ across multiple scene views with the Local Target Prompt $\mathcal{T}_{LT}$, using a prompt like “Point the position to add <$\mathcal{T}_{LT}$>”. Let $\hat{\mathcal{G}}_O^{(t)}$ denote the object geometry after applying the transformation $(s_O, r_O, t)$, where $t$ is the parameter being optimized. For each view $v$, we project the transformed object and compute the 2D centroid of its projection. $t_O$ is obtained by optimizing $t$ to minimize the discrepancy between the projected and the predicted centroids:
$$t_O = \arg\min_{t} \sum_{v \in \mathcal{V}} \left\| \mathrm{Centroid}\left(\pi_v\left(\hat{\mathcal{G}}_O^{(t)}\right)\right) - c_O^{(v)} \right\|_2^2 + \mathcal{L}_{\text{coll}}(\mathcal{G}_{AR}, O_c) \tag{5}$$
Here, $\pi_v(\cdot)$ is the camera projection function of view $v$, and $\mathrm{Centroid}(\cdot)$ computes the 2D projected center of the object. To ensure physical plausibility during object insertion, we introduce the collision loss [45] $\mathcal{L}_{\text{coll}}$, which penalizes interpenetration between the object centroid $O_c$ and the scene attachment region $\mathcal{G}_{AR}$.
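The reprojection term of Eq. (5) can be illustrated with a small NumPy sketch. It assumes a standard pinhole camera given by intrinsics `K` and a world-to-camera rotation and translation, represents the object by its Gaussian centres, and omits the collision term $\mathcal{L}_{\text{coll}}$ entirely. The function names `project` and `centroid_loss` are hypothetical.

```python
import numpy as np


def project(points, K, R, t_cam):
    """Pinhole projection of (N, 3) world points to (N, 2) pixel coordinates."""
    cam = points @ R.T + t_cam        # world -> camera frame
    uv = cam @ K.T                    # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]     # perspective divide


def centroid_loss(t, obj_points, views, targets):
    """Sum over views of the squared distance between the projected object
    centroid and the predicted 2D centre (first term of Eq. (5) only).

    t:          (3,) candidate translation
    obj_points: (N, 3) object points after scale and rotation were applied
    views:      list of (K, R, t_cam) tuples, one per view
    targets:    list of (2,) predicted 2D centres, one per view
    """
    loss = 0.0
    for (K, R, t_cam), center in zip(views, targets):
        uv = project(obj_points + t, K, R, t_cam)
        loss += float(np.sum((uv.mean(axis=0) - center) ** 2))
    return loss
```

Because the loss is a smooth function of `t`, it can be minimized with any gradient-based optimizer. With two or more views the 2D centres constrain all three translation components, which a single view cannot do because of depth ambiguity.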
3.4 Hierarchical Spatial Aware Refinement
The initial $\mathrm{DoF}_O$ from $\mathcal{M}_{\text{MLLM}}$ often lacks spatial accuracy, hindering seamless scene integration. We therefore optimize the $\mathrm{DoF}_O$ using the SSDS loss [7], refining the object's placement in the scene. The loss is defined as:
$$\nabla_\theta \mathcal{L}_{\text{SSDS}}(\phi^\star, x) = \mathbb{E}_{t,\epsilon}\left[ w(t)\left(\hat{\epsilon}_{\phi^\star}(x_t; y, t) - \epsilon\right)\frac{\partial x}{\partial \theta} \right] \tag{6}$$
Here, $\theta$, $x$, $\phi^\star$, and $\hat{\epsilon}_{\phi^\star}(x_t; y, t)$ denote the 3D representation, the rendered image, the spatial attention map, and the score function predicting the noise $\epsilon$ from the noised image $x_t$ with text prompt $y$, respectively. Unlike the original design for multi-object composition, which uses high timesteps, we find that lower timesteps are more effective for fine-grained DoF refinement in our setting, as they emphasize local spatial details critical for precise alignment.
Global-Local Collaborative Spatial Awareness. Diffusion models often exhibit spatial biases due to training data imbalance, e.g., generating moustaches at a relatively large scale across the lower face. Large models trained on data-driven priors often fail to meet human expectations in spatial reasoning (see Figure 6). To address spatial ambiguity, we leverage spatial relation terms (e.g., “on”, “in front of”) to impose explicit constraints on object localization. Compared to general verbs like “wearing” or “with”, these relations encode more precise spatial priors, leading to more effective supervision for optimizing object placement. We leverage spatial prompts inferred from MLLM-OIP, which offers both global semantic grounding $\mathcal{T}_{GT}$ with interaction word $\mathcal{T}_{IW}$ and fine-grained positional cues $\mathcal{T}_{LT}$ with spatial relationship word $\mathcal{T}_{SW}$. We define a hierarchical spatial loss that jointly supervises local and global alignment:
$$\mathcal{L}_{\text{spatial}} = \beta \cdot \mathcal{L}_{\text{ssds-global}}(\mathcal{T}_{GT}, \mathcal{T}_{IW}) + (1 - \beta) \cdot \mathcal{L}_{\text{ssds-local}}(\mathcal{T}_{LT}, \mathcal{T}_{SW}), \tag{7}$$
Attention-based Localization. We found that, due to the bias in the T2I model's training data, the SSDS loss exhibits limitations when handling rare spatial relationships. These inherent limitations restrict the effectiveness of prompt instructions. To enable stronger spatial conditioning, we adopt an attention-based localization loss [48], enforcing tighter regional constraints as follows:
$$\mathcal{L}_{\text{loc}} = \left(1 - \max_{s \in \mathcal{S}} A_t^s\right) + \lambda \sum_{s \in \tilde{\mathcal{S}}} \left\| A_t^s \right\|_2^2 \tag{8}$$
where $\lambda$ balances the two terms, $\mathcal{S}$ denotes the multi-view mask region projected from the 3D bounding box $B$, obtained by tightly enclosing the object after DoF initialization, and $\tilde{\mathcal{S}}$ denotes the complementary region. As shown in our ablation (Figure 7), this loss is essential for precise object placement within the designated area.
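A small numerical sketch of Eq. (8) may help clarify what the two terms reward. It assumes that $A_t^s$ is available as a 2D array of attention values in $[0, 1]$ for one view and that the region $\mathcal{S}$ is given as a boolean mask of the same shape. The function name `localization_loss` is hypothetical, and the default $\lambda = 0.1$ follows the value reported in the implementation details.

```python
import numpy as np


def localization_loss(attn, mask, lam=0.1):
    """Attention localization loss for one view, following Eq. (8).

    attn: (H, W) attention map, assumed to lie in [0, 1]
    mask: (H, W) boolean array, True inside the projected box region S
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask must contain at least one pixel inside S")
    inside_term = 1.0 - attn[mask].max()          # push the peak inside S to 1
    outside_term = np.sum(attn[~mask] ** 2)       # suppress attention outside S
    return float(inside_term + lam * outside_term)


# Usage: attention fully concentrated inside the region gives zero loss.
region = np.array([[True, False], [False, False]])
perfect = localization_loss(np.array([[1.0, 0.0], [0.0, 0.0]]), region)
leaky = localization_loss(np.array([[0.5, 0.5], [0.0, 0.0]]), region)
```

The first term only needs one strong response inside the region, while the second penalizes every pixel outside it, which is what confines the object to the designated area.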
3.5 Object Appearance Refinement
Once the object's degrees of freedom are determined, a refinement module is introduced to enhance the visual quality of the inserted object $\mathcal{G}_O$. Specifically, we refine $\mathcal{G}_O$ using the high-quality appearance from the inserted-object image $\mathcal{I}_O$ via LoRA [15].
Viewpoint Frequency Balancing. Single-view optimization causes overfitting, which often leads to 3D inconsistencies and missing object parts (e.g., a side view causing missing legs), as shown in Figure 9 (b). To avoid this, we perform multi-view sampling of the inserted object. Specifically, given a set of views $\{I_i\}_{i=1}^{N}$ rendered from the inserted object $\mathcal{G}_O$, we estimate the pose $P^*$ of the object image $\mathcal{I}_O$ by selecting the most similar view based on DINO feature similarity [26]. To ensure both appearance fidelity and geometric consistency, we construct the training set $\mathcal{D}_{\text{ref}}$ by combining the rendered multi-view images with repeated samples of the inserted-object image $\mathcal{I}_O$ and its estimated pose $P^*$, as follows:
$$\mathcal{D}_{\text{ref}} = \{(I_i, P_i)\}_{i=1}^{N} \cup \{(\mathcal{I}_O, P^*)\}_{j=1}^{M} \quad (\text{repeated},\ M > N) \tag{9}$$
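The construction of the training set in Eq. (9) can be sketched as follows. The example assumes that image features (for instance DINO features) have already been extracted as plain vectors, that the pose of the reference image is taken from the most similar rendered view by cosine similarity, and that the reference pair is repeated $M = \text{ratio} \cdot N$ times. The names `estimate_pose_index`, `build_training_set`, and `ratio` are hypothetical.

```python
import numpy as np


def estimate_pose_index(ref_feat, view_feats):
    """Index of the rendered view whose feature is most similar to the reference.

    ref_feat:   (D,) feature of the inserted-object image
    view_feats: (N, D) features of the N rendered views
    """
    ref = ref_feat / np.linalg.norm(ref_feat)
    views = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)
    return int(np.argmax(views @ ref))  # highest cosine similarity


def build_training_set(views, poses, ref_image, ref_pose, ratio=3):
    """Combine N rendered (view, pose) pairs with M = ratio * N repeated
    copies of the reference (image, pose) pair, as in Eq. (9)."""
    rendered = list(zip(views, poses))
    repeated = [(ref_image, ref_pose)] * (ratio * len(rendered))
    return rendered + repeated


# Usage with toy 2D features and string placeholders for images and poses.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx = estimate_pose_index(np.array([0.1, 0.9]), feats)
poses = ["pose_0", "pose_1", "pose_2"]
dataset = build_training_set(["view_0", "view_1", "view_2"], poses,
                             "ref_image", poses[idx], ratio=3)
```

Repeating the reference pair more often than the rendered views biases fine-tuning towards the reference appearance, while the rendered views keep the model aware of the other viewpoints.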
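The following objective is used to fine-tune the LoRA layers:
$$\mathcal{L}_{\text{ref}} = \mathbb{E}_{z_i, I_i, P_i, y^*, \epsilon, t}\left[ \left\| \epsilon_{\phi_2}(z_i, t, P_i, I_i, y^*) - \epsilon \right\|_2^2 \right], \quad (I_i, P_i) \sim \mathcal{D}_{\text{ref}}. \tag{10}$$
Here, $z_i$ is the noisy latent of image $I_i$, $t$ is the diffusion timestep, and $\epsilon$ is the target noise. The denoising network $\epsilon_{\phi_2}$, augmented with LoRA [15], is conditioned on $I_i$, its pose $P_i$, and the object-specific prompt $y^*$, e.g., “A <token> dog”, which is formatted from $\mathcal{T}_O$.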
Appearance-Focused Refinement. We employ the fine-tuned diffusion model to update the object Gaussians $\mathcal{G}_O$, guided by $\mathcal{L}_{sds}$ [28]. In our setting, we sample from a lower range of timesteps during optimization to reduce the impact on the object's geometry (e.g., shape and scale), thereby encouraging the model to focus more on refining appearance details. The corresponding objective is defined as follows:
$$\nabla_\theta \hat{\mathcal{L}}_{sds}(\phi, x) = \mathbb{E}_{\hat{t},\epsilon}\left[ w(\hat{t})\left(\hat{\epsilon}_{\phi}(x_i; y_i, \hat{t}) - \epsilon\right)\frac{\partial x}{\partial \theta} \right], \tag{11}$$
where $\hat{t}$ denotes the adjusted (lower) diffusion timestep.
3.6 Object Replacement
In addition, our method can be naturally extended to object replacement in the scene. Specifically, given a user prompt such as “Add a [new object] to replace [existing object]”, the object to be replaced is identified through the Attachment Region $\mathcal{G}_{AR}$. We remove $\mathcal{G}_{AR}$ and then execute the standard insertion pipeline, enabling replacement without being constrained by the original object's geometry or structure.
4 Experiments
4.1 Experimental Setup
Implementation Details. During initialization, we use a learning rate of $5 \times 10^{-3}$ for optimizing both $\mathcal{G}_{AR}$ and $t_O$ of the inserted object. When estimating the coarse rotation $r_O$, we render the object at 10-degree intervals. During the Hierarchical Spatial Aware Refinement stage, we apply a learning rate of $5 \times 10^{-4}$ with diffusion timesteps in the range [0.02, 0.2]; $\lambda = 0.1$ is set in $\mathcal{L}_{\text{loc}}$, and $\beta$ is linearly increased from 0 to 1 during training. For appearance refinement, we optimize the object appearance using timesteps in [0.02, 0.5∼0.25]. The object image $\mathcal{I}_O$ is upsampled with a sampling ratio of $M/N = 3$ relative to the multi-view inputs. All experiments are conducted on a single NVIDIA A40 GPU. More details are provided in the Appendix.
Dataset. To comprehensively evaluate our method, we follow prior works [6, 12, 48] and select representative scenes of varying complexity, including simple backgrounds, human faces, and complex outdoor environments. In these scenes, we insert commonly associated objects (e.g., glasses, giraffes) and evaluate diverse categories such as bowties and moustaches to assess generalization. For GaussianEditor [6], we manually annotate masks, while for TIP-Editor [48], we use the author-provided bounding boxes and object images for comparison.
Baselines. We compare our method with state-of-the-art 3D scene editing approaches that support object insertion and replacement, under two types of guidance: text prompt and text-image prompt. The text-guided baselines include three methods: Instruct-GS2GS [35], which extends Instruct-NeRF2NeRF (IN2N) [12] by replacing the NeRF in IN2N with a 3DGS model; GaussCtrl [41]; and GaussianEditor [6]. For text-image prompt methods, we compare with TIP-Editor [48], which uses an example image to specify object appearance. As TIP-Editor provides only limited insertion scripts (e.g., “A doll wearing sunglasses”, “A man with beard”), we use its official code and pre-trained weights for fairness.
[Figure 3 image: rows show the original scene and the results of each method, with the prompt given to that method. IN2N(GS): “Give the doll a pair of glasses”, “Give the man a pair of glasses”, “Turn the stone horse into a giraffe”. GaussCtrl: “A photo of a doll wearing a pair of glasses”, “A photo of a man wearing a pair of glasses”, “A photo of a giraffe in front of the museum”. GaussianEditor: “A doll wearing a pair of glasses”, “A man wearing a pair of glasses”, “A giraffe standing on the stone platform”. FreeInsert: “Add a pair of glasses to the doll”, “Add a pair of glasses to the man”, “Add a giraffe to replace the stone horse”.]
Figure 3: Visual comparison with state-of-the-art methods on text-guided object insertion (Cols 1–2) and replacement (Col 3). Our method generates higher-quality results while preserving scene integrity. IN2N (GS) and GaussCtrl sometimes misunderstand the prompt and fail to complete insertion (e.g., “Give the doll a pair of glasses”), and struggle to produce clear shape changes in replacement (Col 3, Rows 2–3). GaussianEditor requires manual masks and depth adjustment, and suffers from artifacts and low-quality objects due to post-inpainting and 3D reconstruction limitations.

Evaluation Criteria. We use CLIP Text-Image directional similarity following [6, 41, 48, 49] to assess the alignment between the text and the editing results. For appearance-specified cases, we further employ DINO similarity [26] following [48] to assess appearance preservation. We also conducted a user study with 50 participants, who rated the 3D editing results (presented with prompts in shuffled order) on four criteria: Semantic Alignment, Object Integrity, Geometric Consistency, and Detail Preservation, using a 1–10 scale.
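For readers unfamiliar with the CLIP directional similarity metric, the sketch below shows the commonly used definition: the cosine similarity between the change in image embedding (edited minus original) and the change in text embedding (target caption minus source caption). It assumes the four CLIP embeddings have already been computed and are passed in as plain vectors. The function name `directional_similarity` is hypothetical, and the exact captions and averaging used in this paper's evaluation are not specified here.

```python
import numpy as np


def directional_similarity(img_src, img_edit, txt_src, txt_tgt):
    """Cosine similarity between the image-space and text-space edit directions.

    img_src, img_edit: embeddings of the original and edited renderings
    txt_src, txt_tgt:  embeddings of the source and target captions
    """
    d_img = np.asarray(img_edit, dtype=float) - np.asarray(img_src, dtype=float)
    d_txt = np.asarray(txt_tgt, dtype=float) - np.asarray(txt_src, dtype=float)
    denom = np.linalg.norm(d_img) * np.linalg.norm(d_txt)
    if denom == 0.0:
        raise ValueError("edit direction is zero; similarity is undefined")
    return float(d_img @ d_txt / denom)


# Usage with toy 2D embeddings: the image moved in the same direction as the text.
aligned = directional_similarity([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 1.0])
unrelated = directional_similarity([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 3.0])
```

A score near 1 means the visual change follows the direction described by the text change, while a score near 0 means the edit altered the image in a direction unrelated to the prompt.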
4.2 Comparisons with State-of-the-Art Methods
4.2.1 Qualitative Comparisons. In this part, we conduct a qualitative comparison with different baselines under two types of input settings (text prompt and text-image prompt) to evaluate their performance under identical conditions. Video demonstrations are included in the supplementary material.
Text Prompt Comparisons. Figure 3 shows visual comparisons between our method and three baselines. Both IN2N(GS) and GaussCtrl, which rely solely on semantic guidance, struggle to complete insertions in some scene-object combinations, such as “Add a pair of glasses to the doll”. Although GaussCtrl improves consistency, the replacements remain too similar to the original (e.g., a horse), failing to convincingly resemble the target (e.g., a giraffe). GaussianEditor relies on user-provided 2D masks for object insertion but struggles in object- or human-centric scenes due to inaccurate post-inpainting segmentation, leading to artifacts like foreground overlaps. While it performs better in outdoor scenes (e.g., the giraffe example), its depth estimation is often imprecise and requires manual adjustment. By contrast, our method achieves high-quality object insertion and replacement results in both scene preservation and object completeness without requiring any manual annotations.
Text-Image Prompt Comparisons. Besides the text-prompt methods, we further evaluate object insertion and replacement with a given image prompt of the specified object in Figure 4, comparing against TIP-Editor [48]. Although TIP-Editor supports flexible insertion via 3D bounding boxes, it suffers from inconsistent multi-view appearances due to its reliance on 2D editing techniques. Most critically, achieving such results remains dependent on carefully specified, user-provided 3D bounding boxes, which significantly hinders scalability and practicality. In contrast, our method delivers the most complete geometry and better appearance fidelity to the image prompt without relying on any annotation.
4.2.2 Quan i a i e Compa isons. Table 2 p esen s he quan i a-
i e compa ison o ou me hod agains o he baseline me hods.
Ou me hod achie es CLIP ex –image seman ic alignmen sco es
compa able o he s a e-o - he-a me hods wi hou equi ing any
manual anno a ions o he inse ion egion. Mo eo e , compa ed
10920
F eeInse : Disen angled Tex -Guided Objec Inse ion in 3D Gaussian Scene wi hou Spa ial P io s MM ’25, Oc obe 27–31, 2025, Dublin, I eland
[Figure 4 panels. Columns: Original/Ref, TIP-Editor, FreeInsert. Prompts per row (TIP-Editor / FreeInsert):
row 1: “A doll wearing a pair of 𝑉1 sunglasses” / “Add a pair of glasses to the doll”
row 2: “A doll wearing a 𝑉1 hat” / “Add a hat to the doll”
row 3: “A man with a 𝑉1 beard” / “Add a beard to the man”
row 4: “A 𝑉1 giraffe in a garden” / “Add a giraffe to replace the horse”
row 5: “A 𝑉1 dog in woods” / “Add a dog to replace the stone bear”]
Figure 4: Visual comparisons with TIP-Editor using text-image prompts. Our method achieves competitive results with
TIP-Editor, without relying on 3D bounding boxes. TIP-Editor struggles to maintain the 3D consistency of the inserted object
(e.g., the misaligned hat across views in row 2, column 2, and the right front paw intersecting with the left in row 5, column 2),
as its 2D editing process lacks cross-view constraints. Our method produces clearly more 3D-consistent results and more
closely resembles the reference image.
to the approach that specifies object appearances (TIP-Editor), our
method exhibits higher DINO similarity to the image prompt. The
User_vote ratings clearly demonstrate that users prefer our method
over the baselines. FreeInsert is at least an hour faster than
TIP-Editor, requires no manual priors, and outperforms faster methods.
Table 2: Quantitative comparisons to SOTA. CLIP_dir denotes
the CLIP Text-Image directional similarity. DINO_sim is the
DINO similarity.
Method                  CLIP_dir ↑   DINO_sim ↑   User_vote ↑   Time_cost ↓
Instruct N2N(GS) [35]   26.76%       -            18.3          15 mins
GaussianEditor [6]      27.36%       -            24.7          20 mins
GaussCtrl [41]          25.39%       -            26.3          15 mins
TIP-Editor [48]         30.01%       83.30%       32.3          2.5 h
FreeInsert              29.48%       83.45%       36.9          1.1 h
4.3 Ablation Studies
Ablation Visualization Across Stages. To better demonstrate
how each stage in FreeInsert contributes to the final outcome, we
visualize the intermediate results at each step, as shown in Figure 5.
Col 2 highlights the attachment region Gaussians within the scene,
marked in red. Col 3 shows the initial degrees of freedom (DoF),
which are typically sub-optimal. After SSDS refinement, the object
achieves a more accurate DoF (Col 4). Finally, Col 5 demonstrates
the enhanced object appearance.
Effectiveness of Global-Local Collaborative Spatial Awareness.
To verify the effectiveness of the global-local collaborative
spatial-aware strategy, we conduct ablation studies comparing the
following variants: no SSDS, only L_ssds-global, only L_ssds-local,
and the combination of both. As shown in Figure 6, global prompts like
“A man with moustache” often lead to ambiguous placements, while local
prompts such as “A moustache is under the nose and above the upper
lip” provide precise constraints but may ignore global plausibility.
Our method balances both by reweighting key spatial terms and
progressively shifting from local to global focus, resulting in more
accurate and semantically coherent placements.
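The progressive shift from local to global focus can be pictured as a time-varying blend of the two SSDS terms. The sketch below is only an illustration under stated assumptions: the linear schedule, the function names, and the scalar loss inputs are hypothetical and are not taken from this section, which does not specify the actual weighting.

```python
def local_global_weights(step, total_steps):
    """Linearly move weight from the local term (start) to the global term (end).

    Hypothetical schedule: returns (w_local, w_global) with w_local + w_global == 1.
    """
    if total_steps <= 1:
        return 0.0, 1.0
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return 1.0 - progress, progress


def combined_ssds_loss(loss_local, loss_global, step, total_steps):
    """Blend precomputed local and global loss values with the schedule above."""
    w_local, w_global = local_global_weights(step, total_steps)
    return w_local * loss_local + w_global * loss_global


# Early steps follow the local prompt, late steps follow the global prompt.
start = combined_ssds_loss(2.0, 4.0, step=0, total_steps=11)
end = combined_ssds_loss(2.0, 4.0, step=10, total_steps=11)
```

Dropping either term from the blend corresponds to the "only L_ssds-global" and "only L_ssds-local" variants compared in the ablation.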
Effectiveness of Attention-based Localization. We evaluate the
contribution of L_loc to enhancing spatial awareness, particularly
[Figure 5 panels. Columns: Input, Attachment Region, DoF Init, DoF Refine, Final Output. Rows: “Add a dog to replace the stone bear”, “Add a beard to the man”, “Add a pair of glasses to the doll”.]
Figure 5: Visualization of different stages in FreeInsert.
[Figure 6 panels: w/o SSDS, w/o L_ssds-local, w/o L_ssds-global, FreeInsert.]
Figure 6: Ablation of Global-Local Collaborative Spatial
Awareness.
[Figure 7 panels. Prompts: “Add a pair of sunglasses on the forehead”, “Add an apple on the center of table”. For each prompt: w/o L_loc, FreeInsert.]
Figure 7: Ablation of Attention-based Localization
for rare or ambiguous placements. As shown in Figure 7, L_loc
encourages object position within the bounding box region inferred
by the large model, while allowing flexibility to adjust the DoF.
Unlike fixed constraints, it softly guides attention toward intended
regions, mitigating semantic control failures caused by training
data biases.
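One way to read "softly guides attention toward intended regions" is as a penalty on the share of attention that falls outside the inferred box. The following sketch shows that idea only; the loss form, the function name, and the box convention (x0, y0, x1, y1 with exclusive upper bounds) are assumptions for illustration and are not the definition of L_loc used by the method.

```python
def localization_loss(attn, box):
    """Fraction of total attention mass lying outside the box (0.0 = all inside).

    attn: 2D list of non-negative attention values indexed as attn[y][x].
    box:  (x0, y0, x1, y1) in pixel indices, upper bounds exclusive (assumed convention).
    """
    x0, y0, x1, y1 = box
    total = sum(sum(row) for row in attn)
    if total <= 0:
        # No attention anywhere: treat as fully unlocalized.
        return 1.0
    inside = sum(attn[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return 1.0 - inside / total


# Uniform attention over a 2x2 map, box covering one cell: 75% of the mass is outside.
uniform = [[1.0, 1.0], [1.0, 1.0]]
loss = localization_loss(uniform, (0, 0, 1, 1))
```

Because the penalty depends only on where attention mass lies, the object pose itself stays free to change, which matches the soft (rather than fixed) constraint described above.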
Comparison of DoF Learning Directly by Different Multi-Modal
Large Language Models vs. FreeInsert. To evaluate our
DoF learning method, we compare it with state-of-the-art
multi-modal LLMs, including GPT-4V [1], Molmo-7B [8], and GPT-o1 [16],
which directly predict object DoFs. Translation and scale are
derived from multi-view prompts (e.g., “Point the four coordinates of a
bounding box to add [object] to/on [target]”) and lifted to 3D, while
rotation follows our initialization. As shown in Figure 8 (“Add a
pair of glasses to the doll”), our predictions are more plausible.
Quantitatively, our method achieves the highest alignment with human
preferences in projected mIoU across all cases (Table 3).
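For readers unfamiliar with the metric, a projected mIoU can be computed by comparing each predicted 2D box (the projection of the placed object) with a human-preferred reference box and averaging the intersection-over-union over cases. The sketch below assumes axis-aligned boxes given as (x0, y0, x1, y1); the function names and the example boxes are hypothetical, and the actual projection and annotation procedure is not described in this section.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predicted, reference):
    """Average IoU over paired predicted and reference boxes."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need the same non-zero number of predicted and reference boxes")
    return sum(box_iou(p, r) for p, r in zip(predicted, reference)) / len(predicted)


# Two toy cases: one perfect match and one partial overlap.
miou = mean_iou([(0, 0, 2, 2), (0, 0, 2, 2)], [(0, 0, 2, 2), (1, 1, 3, 3)])
```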
Effectiveness of Viewpoint Frequency Balancing. Figure 9 compares
object appearance optimization when fine-tuning LoRA with
a single object image versus combining it with multi-view images
[Figure 8 panels: GPT-4V [1], Molmo-7B [8], GPT-o1 [16], FreeInsert.]
Figure 8: Qualitative comparison of different MLLMs of DoF
learning on “Add a pair of sunglasses to the doll”.
Table 3: Quantitative analysis of DoF Optimization in FreeInsert.
Metric                   GPT-4V   Molmo-7B   GPT-o1   FreeInsert
mIoU over 15 cases (%)   68.2     74.7       78.9     89.5
[Figure 9 panels: (a) Original, (b) Single Image, (c) FR=1, (d) FR=3.]
Figure 9: Ablation of Viewpoint Frequency Balancing
at different sampling frequency ratios (FR). Using only the
inserted-object image leads to shape inconsistencies and artifacts across
views (item (b)), while incorporating multi-view data improves
consistency but may reduce detail. Our experiments show that
FR = 3 offers the best trade-off between multi-view consistency
and single-view image quality.
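To make the sampling ratio concrete, the sketch below builds a training schedule that interleaves the single inserted-object image with multi-view images. It assumes FR means "number of inserted-object samples per multi-view sample"; this section does not state the direction of the ratio, so that reading, the function name, and the string placeholders for images are all hypothetical.

```python
from itertools import cycle


def build_schedule(object_image, multiview_images, fr, num_steps):
    """Return a list of num_steps training samples.

    Assumed meaning of FR: `fr` draws of the inserted-object image for every
    one draw from the multi-view set, which is cycled in order.
    """
    if fr < 1 or not multiview_images:
        raise ValueError("fr must be >= 1 and multiview_images must be non-empty")
    views = cycle(multiview_images)
    schedule = []
    while len(schedule) < num_steps:
        schedule.extend([object_image] * fr)  # single-view detail
        schedule.append(next(views))          # multi-view consistency
    return schedule[:num_steps]


# With FR = 3, three object-image steps are followed by one multi-view step.
steps = build_schedule("obj", ["view0", "view1"], fr=3, num_steps=8)
```

Under this reading, a larger FR favors single-view image quality and a smaller FR favors multi-view consistency, which is the trade-off the ablation explores.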
5 Conclusion and Limitations
In this work, we presented FreeInsert, a novel framework for
text-driven object insertion in 3D scenes that eliminates the need for
spatial priors such as 2D masks or 3D bounding boxes. By
disentangling object generation from spatial placement, FreeInsert enables
unsupervised and semantically guided editing through natural
language. Leveraging the reasoning capabilities of foundation models,
our method extracts structured semantics from user instructions
to guide 3D reconstruction and spatial integration, achieving
accurate placement and high visual fidelity. Extensive experiments
confirm the effectiveness of our approach in enabling precise and
user-friendly 3D object insertions, paving the way for more scalable
and intuitive scene editing in open-world scenarios.
While promising, FreeInsert still faces some limitations. It may
fail when the underlying 3D reconstruction suffers from severe
geometric inconsistencies, such as duplicated limbs or the Janus
problem, which cannot be fully compensated by our object-specific
refinement. Complex spatial instructions that require hierarchical
or relational reasoning (e.g., “Add ...... to the second layer from the
top of the shelf”) may also exceed the capacity of current MLLMs.
These challenges are expected to diminish as foundation models for
3D reconstruction and multi-modal reasoning continue to advance.
Additionally, in replacement tasks, mismatched contact regions or
imprecise 3D bounding boxes can introduce artifacts, which can be
alleviated by integrating instance segmentation and local geometry
refinement or inpainting.
6 Acknowledgement
This work has been partially supported by the European Union’s
Horizon Europe research and innovation program under grant
agreement No. 101120237 (ELIAS). Bruno Lepri and Nicu Sebe also
acknowledge the support of the PNRR project FAIR - Future AI
Research (PE00000013), under the NRRP MUR program funded
by the NextGenerationEU. This work was also supported by the
Tianjin Natural Science Foundation, Key Project, under Grant No.
22JCZDJC00220, “Ultrasound Imaging Algorithm Research for HIFU
Thermal Therapy Monitoring” (2022.10–2025.9).
References
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya,
Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal
Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
(2023).
[2] Amir Barda, Matheus Gadelha, Vladimir G Kim, Noam Aigerman, Amit H
Bermano, and Thibault Groueix. 2025. Instant3dit: Multiview Inpainting for
Fast Editing of 3D Objects. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition.
[3] Alexey Bokhovkin, Quan Meng, Shubham Tulsiani, and Angela Dai. 2025.
SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition.
[4] Tim Brooks, Aleksander Holynski, and Alexei A Efros. 2023. Instructpix2pix:
Learning to follow image editing instructions. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. 18392–18402.
[5] Chenjie Cao, Chaohui Yu, Fan Wang, Xiangyang Xue, and Yanwei Fu. [n. d.].
MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D
Editing. In The Thirty-eighth Annual Conference on Neural Information Processing
Systems.
[6] Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang,
Zhongang Cai, Lei Yang, Huaping Liu, and Guosheng Lin. 2024. Gaussianeditor:
Swift and controllable 3d editing with gaussian splatting. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21476–21485.
[7] Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, and Ziwei Liu. 2024.
Comboverse: Compositional 3d assets creation using spatially-aware diffusion
guidance. In European Conference on Computer Vision. Springer, 128–146.
[8] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung
Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al.
2024. Molmo and pixmo: Open weights and open data for state-of-the-art
multimodal models. arXiv preprint arXiv:2409.17146 (2024).
[9] Jiahua Dong and Yu-Xiong Wang. 2023. Vica-nerf: View-consistency-aware
3d editing of neural radiance fields. Advances in Neural Information Processing
Systems 36 (2023), 61466–61477.
[10] Songxue Gao, Chuanqi Jiao, Ruidong Chen, Weijie Wang, and Weizhi Nie. 2023.
Point Cloud Completion Guided by Prior Knowledge via Causal Inference. arXiv
preprint arXiv:2305.17770 (2023).
[11] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek
Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex
Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783
(2024).
[12] Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander Holynski, and Angjoo
Kanazawa. 2023. Instruct-nerf2nerf: Editing 3d scenes with instructions. In
Proceedings of the IEEE/CVF International Conference on Computer Vision.
19740–19750.
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic
models. Advances in neural information processing systems 33 (2020), 6840–6851.
[14] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu,
Kalyan Sunkavalli, Trung Bui, and Hao Tan. 2023. Lrm: Large reconstruction
model for single image to 3d. In Proceedings of the International Conference on
Learning Representations (ICLR).
[15] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean
Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large
Language Models. In International Conference on Learning Representations.
[16] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky,
Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024.
Openai o1 system card. arXiv preprint arXiv:2412.16720 (2024).
[17] Umar Khalid, Hasan Iqbal, Nazmul Karim, Muhammad Tayyab, Jing Hua, and
Chen Chen. 2024. LatentEditor: text driven local editing of 3D scenes. In European
Conference on Computer Vision. Springer, 364–380.
[18] Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, and
Jinwoo Shin. 2023. Collaborative score distillation for consistent visual editing.
Advances in Neural Information Processing Systems 36 (2023), 73232–73257.
[19] Juil Koo, Chanho Park, and Minhyuk Sung. 2024. Posterior distillation
sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 13352–13361.
[20] Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen, Yi Zhang, Peng Zhou,
and Bingbing Ni. 2024. Focaldreamer: Text-driven 3d editing via focal-fusion
assembly. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38.
3279–3287.
[21] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu,
Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3
technical report. arXiv preprint arXiv:2412.19437 (2024).
[22] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu,
Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. 2024.
Wonder3d: Single image to 3d using cross-domain diffusion. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9970–9980.
[23] Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A Brubaker, Jonathan
Kelly, Alex Levinshtein, Konstantinos G Derpanis, and Igor Gilitschenski. 2025.
Watch your steps: Local image and scene editing by text instructions. In European
Conference on Computer Vision. Springer, 111–129.
[24] Weizhi Nie, Ruidong Chen, Weijie Wang, Bruno Lepri, and Nicu Sebe. 2024. T2TD:
Text-3D generation model based on prior knowledge guidance. IEEE Transactions
on Pattern Analysis and Machine Intelligence (2024).
[25] Weizhi Nie, Weijie Wang, Anan Liu, Jie Nie, and Yuting Su. 2019. HGAN:
Holistic generative adversarial networks for two-dimensional image-based
three-dimensional object retrieval. ACM Transactions on Multimedia Computing,
Communications, and Applications (TOMM) 15, 4 (2019), 1–24.
[26] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec,
Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin
El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes,
Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel
Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin,
and Piotr Bojanowski. 2024. DINOv2: Learning Robust Visual Features without
Supervision. Transactions on Machine Learning Research (2024).
[27] JangHo Park, Gihyun Kwon, and Jong Chul Ye. 2024. ED-NeRF: Efficient
Text-Guided Editing of 3D Scene With Latent Space NeRF. In The Twelfth International
Conference on Learning Representations.
[28] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion:
Text-to-3D using 2D Diffusion. In Proceedings of the International Conference on
Learning Representations (ICLR).
[29] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn
Ommer. 2022. High-resolution image synthesis with latent diffusion models. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
10684–10695.
[30] Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio
Tonioni, Luc Van Gool, and Federico Tombari. 2024. InseRF: Text-Driven
Generative Object Insertion in Neural 3D Scenes. arXiv preprint arXiv:2401.05335
(2024).
[31] Ka Chun Shum, Jaeyeon Kim, Binh-Son Hua, Duc Thanh Nguyen, and Sai-Kit
Yeung. 2024. Language-driven Object Fusion into Neural Radiance Fields with
Pose-Conditioned Dataset Updates. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 5176–5187.
[32] Hyeonseop Song, Seokhun Choi, Hoseok Do, Chul Lee, and Taehyeong Kim.
2023. Blending-nerf: Text-driven localized editing in neural radiance fields. In
Proceedings of the IEEE/CVF international conference on computer vision.
14383–14393.
[33] Yanhao Sun, Runze Tian, Xiao Han, XinYao Liu, Yan Zhang, and Kai Xu. 2024.
GSEditPro: 3D Gaussian Splatting Editing with Attention-based Progressive
Localization. In Computer Graphics Forum, Vol. 43. Wiley Online Library, e15215.
[34] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and
Ziwei Liu. 2024. Lgm: Large multi-view gaussian model for high-resolution 3d
content creation. In European Conference on Computer Vision. Springer, 1–18.
[35] Cyrus Vachha and Ayaan Haque. [n. d.]. Instruct-gs2gs: Editing 3d gaussian
splats with instructions (2024). URL https://instruct-gs2gs.github.io ([n. d.]).
[36] Dongqing Wang, Tong Zhang, Alaa Abboud, and Sabine Süsstrunk. 2024.
Innerf360: Text-guided 3d-consistent object inpainting on 360-degree neural
radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 12677–12686.
[37] Junjie Wang, Jiemin Fang, Xiaopeng Zhang, Lingxi Xie, and Qi Tian. 2024.
Gaussianeditor: Editing 3d gaussians delicately with text instructions. In Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition. 20902–20911.
[38] Weijie Wang, Guofeng Mei, Bin Ren, Xiaoshui Huang, Fabio Poiesi, Luc Van Gool,
Nicu Sebe, and Bruno Lepri. 2023. Zero-shot point cloud registration. arXiv
preprint arXiv:2312.03032 (2023).
[39] Weijie Wang, Guofeng Mei, Jian Zhang, Nicu Sebe, Bruno Lepri, and Fabio Poiesi.
2025. Fully-Geometric Cross-Attention for Point Cloud Registration. arXiv
preprint arXiv:2502.08285 (2025).
[40] Weijie Wang, Jichao Zhang, Chang Liu, Xia Li, Xingqian Xu, Humphrey Shi, Nicu
Sebe, and Bruno Lepri. 2024. UVMap-ID: A Controllable and Personalized UV