FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors

Author: Li, Chenxi; Wang, Weijie; Li, Qiang; Sebe, Niculae; Lepri, Bruno; Nie, Weizhi

Publisher: Zenodo

DOI: 10.1145/3746027.3755072

Source: https://zenodo.org/records/17689184/files/3746027.3755072.pdf

FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors

Chenxi Li, Tianjin University, Tianjin, China, [email protected]
Weijie Wang†, University of Trento, Trento, Italy, weijie[email protected]
Qiang Li, Tianjin University, Tianjin, China, [email protected]
Nicu Sebe, University of Trento, Trento, Italy, [email protected]
Bruno Lepri, Fondazione Bruno Kessler, Trento, Italy, lepri@fbk.eu
Weizhi Nie, Tianjin University, Tianjin, China, [email protected]
Abstract
Text-driven object insertion in 3D scenes is an emerging task that enables intuitive scene editing through natural language. Despite its potential, existing 2D editing-based methods often rely on spatial priors such as 2D masks or 3D bounding boxes, and they struggle to keep the inserted object consistent. These limitations hinder flexibility and scalability in real-world applications. In this paper, we propose FreeInsert, a novel framework that leverages foundation models (MLLMs, LGM, and diffusion models) to disentangle object generation and spatial placement, enabling unsupervised and flexible object insertion in 3D scenes without spatial priors. FreeInsert begins with an MLLM-based parser that extracts structured semantics (object types, spatial relationships, and attachment regions) from user instructions. These semantics guide both the reconstruction of the inserted object for 3D consistency and the learning of its degrees of freedom. We first leverage the spatial reasoning capabilities of MLLMs to initialize the object's pose and scale. To further enhance natural integration with the scene, a hierarchical spatially-aware stage is employed to refine the object's placement, incorporating both the spatial semantics and priors inferred by the MLLM. Finally, the object's appearance is enhanced using the inserted-object image to improve visual fidelity. Experimental results demonstrate that FreeInsert enables semantically coherent, spatially precise, and visually realistic 3D insertions without requiring any spatial priors, offering a user-friendly and flexible editing experience. Project page: https://tjulcx.github.io/FreeInsert/.
CCS Concepts
• Computing methodologies → Computer vision.
Weijie Wang† is the corresponding author.
MM ’25, Dublin, Ireland
©2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-2035-2/2025/10
https://doi.org/10.1145/3746027.3755072
[Figure 1 panels: original and inserted scenes (a courtyard, a man, a doll, a garden) with text prompts such as “A doll wearing a bow tie”, “A man with a moustache”, “Alter the horse to a frog”, and “An apple on the table”; an optional image prompt; spatial priors (2D mask or 3D bounding box) are not provided.]
Figure 1: “No Spatial Priors, Just Prompts.” Compared to existing methods that require user-provided spatial priors, limiting their practicality, our method enables flexible text-driven object insertion without any need for such priors (e.g., 2D masks or 3D bounding boxes). Given only a text prompt (the image prompt is optional), FreeInsert naturally inserts objects across diverse scenes.
Keywords
Text-Driven 3D Scene Editing; Object Insertion; Diffusion Models; Multimodal Large Language Models; Gaussian Splatting
ACM Reference Format:
Chenxi Li, Weijie Wang†, Qiang Li, Nicu Sebe, Bruno Lepri, and Weizhi Nie. 2025. FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25), October 27–31, 2025, Dublin, Ireland. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3746027.3755072
1 Introduction
Text-driven 3D generation [3, 24, 28, 40, 47] and editing [2, 43, 46] are gaining traction for enabling the intuitive customization of digital content with a few words. Despite recent advances [9, 10, 12, 17, 23, 25, 32, 37–39, 49] that have made significant progress in editing the geometry and appearance of scene components, flexibly inserting new objects into the scene remains challenging due to difficulties in precise placement and seamless integration.
Recent 3D editing methods leverage diffusion models by first performing text-guided 2D edits on single [6, 30] or multi-view images [12, 49], then lifting them to 3D. Relying solely on textual descriptions for object insertion often leads to insertion failure and suboptimal results due to misinterpretation of the text [48, 49], as shown in Figure 3 for methods like Instruct-NeRF2NeRF [12] and GaussCtrl [41]. Some methods introduce attention mechanisms to capture the spatial relationship between the inserted object and the scene. However, these methods still struggle to accurately determine the inserted object's pose and scale [33, 46, 49]. To address this limitation, other methods leverage user-provided 2D masks [6, 30] or 3D bounding boxes [31, 48] as strong constraints to achieve more controllable and concise insertion. Nevertheless, they often demand specialized expertise [48] and considerable manual effort, limiting their practical usability. In addition, they still face challenges with inaccurate depth estimation [6] and inconsistent 3D multi-view reconstruction due to the modality gap between 2D and 3D. We summarize the above discussion in Table 1.
As this discussion shows, achieving flexible and high-quality object insertion into 3D scenes without manual supervision remains underexplored. The advent of large-scale models [1, 8, 11, 21, 42], which have acquired human commonsense knowledge, has made unsupervised learning increasingly promising. In this paper, we propose FreeInsert, a method that leverages foundation models (MLLMs [1, 8], LGM [34], and a diffusion model [29]) to assist object insertion in 3D scenes without relying on any spatial priors, as illustrated in Figure 1. Our method removes the need for spatial priors by inferring object insertion directly from high-level textual cues (e.g., “Add [object] to/on [target]”). We argue that the insertion process can essentially be viewed as first generating an object, followed by estimating the transformation, which defines the inserted object's degrees of freedom (pose and scale) relative to the scene.
Specifically, we disentangle the object insertion process into object generation and the estimation of its parameterized degrees of freedom (DoF), both guided by textual descriptions with foundation models. We first obtain a text instruction from the user and parse it into structured semantics (e.g., object type, spatial relation, attachment region) using an MLLM-based [1] object insertion parser. This enables precise, controllable object insertion that aligns with the user's intent. We then employ a 3D-consistent reconstruction model [34] to obtain an initial Gaussian-based object model, which is coarsely inserted into the scene guided by the visual and spatial reasoning capabilities of MLLMs [1, 8]. This step inherently circumvents the inconsistency issues associated with 2D editing-based methods. While the feed-forward procedure provides a lightweight 3D layout, it often suffers from suboptimal placement and imperfect geometry. To address these issues, we propose a two-stage refinement. First, the Hierarchical Spatial-Aware Refinement stage optimizes the object's DoF via spatially-aware score distillation sampling (SSDS) [7] from a pretrained diffusion model [29]. This stage leverages MLLM-derived reasoning results to align the object's DoF with both local and global spatial semantics, enabling more precise and controllable placement. These reasoning results also help the model handle rare spatial compositions, e.g., “Add a pair of sunglasses on the forehead”, thereby improving robustness. In the final appearance refinement stage, we fine-tune a pretrained diffusion model on multi-view renderings of the optimized inserted object and its corresponding inserted-object image, and use it to enhance the object's appearance. By disentangling placement from object generation, our method enables flexible semantic control over insertion while preserving object quality and ensuring coherent, plausible 3D scenes.

Table 1: Comparison of existing methods for object insertion in 3D scenes. Ours can achieve high-quality object insertion without manual supervision while keeping 3D consistency.

Method               No Required Manual Supervision   3D View Consistency   Support Image-Prompts
Instruct-N2N [12]    ✓                                ✗                     ✗
GaussCtrl [41]       ✓                                ✓                     ✗
GaussianEditor [6]   ✗                                ✗                     ✗
TIP-Editor [48]      ✗                                ✗                     ✓
FreeInsert           ✓                                ✓                     ✓
To evaluate the proposed method, we applied it to various scenarios, including object-centric, human-centric, and complex outdoor scenes. Our experimental results demonstrate that the proposed approach can insert diverse objects into 3D scenes without requiring manual supervision while achieving multi-view consistent object quality. In summary, our contributions are as follows:
• We address consistent object insertion in diverse 3D scenes using only textual input, removing the need for spatial priors and outperforming existing methods through a framework that disentangles object generation and spatial placement.
• We propose a DoF optimization method for object insertion, using the reasoning capabilities of MLLMs and diffusion models in place of manual supervision. The MLLM's semantic and spatial priors further support SSDS in enhancing precision and robustness.
• We ensure high-quality object generation by maintaining 3D shape consistency via a reconstruction model and refining visual appearance.
• We present the first baseline for evaluating unsupervised 3D scene insertion, with experiments showing competitive performance against state-of-the-art methods.
2 Related Works
Text-Guided 3D Scene Editing. Text-driven 3D scene editing has seen rapid progress, thanks to the rise of diffusion models [13, 29]. Most methods [9, 12, 18, 19, 23, 27, 36] focus on modifying existing content, either globally or locally. Local editing requires precise localization to avoid affecting unrelated regions, which remains challenging. While some works use implicit cues from models like InstructPix2Pix [4] or ControlNet [44], others incorporate explicit constraints such as segmentation masks [37] or cross-attention maps [49]. However, these methods struggle with 3D object insertion, which demands reasoning about semantically appropriate yet physically unoccupied regions for placement. In this work, we primarily focus on the task of object insertion in 3D scenes.
Object Insertion in 3D Scene. In contrast to modifying existing scene content, object insertion remains underexplored. MVInpainter [5] leverages segmentation to identify support regions like table surfaces, but struggles with fine-grained insertions on objects or humans. GaussianEditor [6] and InseRF [30] insert objects using 3D reconstruction models, yet still require user-provided 2D masks and suffer from depth-related localization issues. FocalDreamer [20] attaches parts to base shapes but depends on user-specified 3D parameters (e.g., rotation, translation, scale), and lacks generalization to complex scenes. Other methods [31, 48] guide object generation via diffusion using 3D bounding boxes, which imposes a burden on users. In this work, we aim for unsupervised and broadly applicable 3D object insertion, removing the need for manual annotations or spatial priors.

[Figure 2 panels: (a) MLLM-based Object Insertion Parser; (b) Initialization via Large Models (3D Object Reconstruction, Scene Attachment Region Detection, Object DoF Initialization with scale $s$, rotation $\mathbf{r}$, translation $\mathbf{t}$); (c) Hierarchical Spatial Aware Refinement; (d) Object Appearance Refinement. The running example is the instruction “Add a red heart-shaped glasses to the doll”, parsed into “A red heart-shaped glasses”, “A doll wears a red heart-shaped glasses”, “Glasses align with the eyes”, “Eyes”, “Wears”, and “align with”.]
Figure 2: Overview of FreeInsert. Given a text prompt $\mathcal{T}$ and optionally an image prompt $\mathcal{I}_O$, the object insertion process includes four stages: (a) The MLLM-based Object Insertion Parser (see Section 3.2) first extracts structured semantics to support the subsequent stages. (b) The Initialization via Large Models stage (see Section 3.3) generates the object and initializes its $\mathrm{DoF}_O$ in the scene. (c) The Hierarchical Spatial Aware Refinement stage (see Section 3.4) refines the $\mathrm{DoF}_O$. (d) The final stage, Object Appearance Refinement (see Section 3.5), enhances the object's visual quality using the object image $\mathcal{I}_O$.
Large Language Models in 3D Generation and Editing. LLMs, like the GPT [1] and Llama [11] series, have exhibited outstanding efficacy in many text-related tasks. Zhou et al. [47] and Zhou et al. [45] utilize LLMs to provide coarse compositional spatial information from textual descriptions to construct the 3D scene. The multi-modal variants of LLMs [1, 8] incorporate images and are additionally trained on image-text pairs, showing impressive results for visual captioning and visual question answering (VQA). Notably, Molmo [8] can perform pixel-level localization mainly because it was trained with richly annotated image data. This capability is crucial for robust spatial grounding in vision-language tasks. GG-Editor [43] first exploits GPT-4V [1] to better understand both the textual and 3D visual inputs and then infers reasonable local regions for 3D editing. However, it primarily targets object editing, and its preliminary use of an MLLM struggles to ensure spatial precision. In this work, we leverage the text reasoning capabilities and spatial relationship understanding of GPT-4 [1] and Molmo [8], supplemented by a basic detection model [42], to eliminate the reliance on manually provided priors in object insertion.
3 Method
3.1 Problem Statement
Figure 2 illustrates the overall framework of FreeInsert. Given a group of 3D Gaussians $\mathcal{G}_S$ of an input scene and a text prompt $\mathcal{T}$ guiding the insertion of an object into the scene, our algorithm performs high-quality, semantically consistent object insertion without any manual supervision (e.g., 3D bounding boxes or masks). We decouple the object insertion task into object generation and the optimization of the object's 3D degrees of freedom $\mathrm{DoF}_O$ (rotation, translation, scale), both guided by semantic alignment between the resulting scene and the user prompts. The resulting scene is represented by a new set of Gaussians, $\mathcal{G}_{\mathrm{inserted}}$. Moreover, our method allows an image prompt $\mathcal{I}_O$ as input to specify the object's appearance. Formally, the insertion process is defined as:
$$\mathcal{G}_{\mathrm{inserted}} = \mathcal{E}(\mathcal{G}_S, \mathcal{T}, \mathcal{I}_O), \qquad (1)$$
where $\mathcal{E}$ denotes the process applied to the inserted object, including its generation, DoF learning in the context of the scene $\mathcal{G}_S$, and appearance refinement.
3.2 MLLM-based Object Insertion Parser
A key challenge in unsupervised object insertion is converting high-level user intent into structured, fine-grained guidance. To address this, we introduce an MLLM-based Object Insertion Parser (MLLM-OIP) that utilizes the MLLM's spatial understanding capability to parse the instruction $\mathcal{T}$ into semantically structured prompts, providing essential guidance for the subsequent object insertion. Specifically, we provide a prompt template $\mathcal{T}_{\mathrm{parse}}$ and a sampled scene image $\mathcal{I}_S$ as input to the multimodal LLM $\mathcal{M}_{\mathrm{MLLM}}$ [1] to obtain structured outputs. The prompt generation process is formalized as:
$$\mathcal{T}_O, \mathcal{T}_{AR}, \mathcal{T}_{GT}, \mathcal{T}_{IW}, \mathcal{T}_{LT}, \mathcal{T}_{SW} = \mathcal{M}_{\mathrm{MLLM}}(\mathcal{T}, \mathcal{T}_{\mathrm{parse}}, \mathcal{I}_S) \qquad (2)$$
Here, the Object Prompt ($\mathcal{T}_O$) is used for 3D object generation and the appearance refinement stage. The Attachment Region Prompt ($\mathcal{T}_{AR}$) plays a crucial role in the initialization of the object's degrees of freedom $\mathrm{DoF}_O$. The remaining four prompts, namely the Global Target Prompt ($\mathcal{T}_{GT}$) and its Object Interaction Word ($\mathcal{T}_{IW}$), and the Local Target Prompt ($\mathcal{T}_{LT}$) and its Spatial Relationship Word ($\mathcal{T}_{SW}$), are employed during the hierarchical spatial-aware refinement stage to refine the $\mathrm{DoF}_O$, supporting global-local semantic alignment.
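As an illustration of the structured output in Eq. (2), the six prompts can be held in a small typed record. The sketch below is only one plausible shape: the JSON reply format, the key names, and the helper `parse_mllm_reply` are assumptions made for illustration and are not taken from the paper, and the example values follow the running example of Figure 2 under our reading of which string belongs to which prompt.

```python
import json
from dataclasses import dataclass

# Key names are an assumed reply format, chosen to mirror the symbols of Eq. (2).
REQUIRED_KEYS = ("T_O", "T_AR", "T_GT", "T_IW", "T_LT", "T_SW")


@dataclass(frozen=True)
class ParsedPrompts:
    object_prompt: str      # T_O: object generation and appearance refinement
    attachment_region: str  # T_AR: used to initialize the object's DoF
    global_target: str      # T_GT: global semantic alignment
    interaction_word: str   # T_IW: interaction word paired with T_GT
    local_target: str       # T_LT: local semantic alignment
    spatial_word: str       # T_SW: spatial relation paired with T_LT


def parse_mllm_reply(reply: str) -> ParsedPrompts:
    """Validate a JSON reply (hypothetical format) and map it onto the six prompts."""
    data = json.loads(reply)
    missing = [k for k in REQUIRED_KEYS if not str(data.get(k) or "").strip()]
    if missing:
        raise ValueError(f"reply is missing fields: {missing}")
    # Field order of ParsedPrompts matches the order of REQUIRED_KEYS.
    return ParsedPrompts(*(str(data[k]).strip() for k in REQUIRED_KEYS))


# Illustrative reply for "Add a red heart-shaped glasses to the doll".
example_reply = json.dumps({
    "T_O": "A red heart-shaped glasses",
    "T_AR": "Eyes",
    "T_GT": "A doll wears a red heart-shaped glasses",
    "T_IW": "Wears",
    "T_LT": "Glasses align with the eyes",
    "T_SW": "align with",
})
prompts = parse_mllm_reply(example_reply)
```

Rejecting incomplete replies early keeps a malformed MLLM answer from silently propagating into the later initialization and refinement stages.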
3.3 Initialization via Large Models
Object from Prompts. To avoid 3D inconsistency, we first use a text-to-image (T2I) model [29] to synthesize a text-generated image $\mathcal{I}_O$ of the object from the object description prompt $\mathcal{T}_O$. The synthesized image is then used to recover the 3D geometry $\mathcal{G}_O$ via LGM [34], a single-view reconstruction model that achieves a trade-off between reconstruction quality and efficiency. Other lightweight 3D reconstruction methods [14, 22] can also be adopted. In addition, $\mathcal{I}_O$ can be directly specified by the user, allowing for more precise control over the object's appearance.
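Scene's Attachment Region Detection. Intuitively, an object's placement is influenced by the attachment region of the scene and the degrees of freedom within that region. Driven by this, we extract an attachment region $\mathcal{G}_{AR}$ and its associated degrees of freedom $\mathrm{DoF}_{AR}$ $(s_{AR}, \mathbf{r}_{AR}, \mathbf{t}_{AR})$ from the 3D scene based on the Attachment Region Prompt $\mathcal{T}_{AR}$. This region serves as a crucial spatial reference, guiding the initialization of the inserted object's $\mathrm{DoF}_O$. Specifically, we employ an open-vocabulary detection model, Florence2 [42], to localize 2D bounding boxes across sampled views of the scene with camera poses $\mathcal{C}_{\mathrm{cam}}$, guided by $\mathcal{T}_{AR}$. For each view, the detected box is converted into a binary mask $\mathcal{I}_{B_{AR}}$, representing the candidate attachment region. The 3D attachment area is parameterized by the degrees of freedom of an initial 3D bounding box $\mathcal{B}_{\mathrm{init}}$. We optimize the attachment by computing the binary cross-entropy between the projected transformed bounding box and the detected attachment region mask $\mathcal{I}_{B_{AR}}$ across all camera views. Thus, $\mathrm{DoF}_{AR}$ is calculated as follows:
$$\mathrm{DoF}_{AR} = \arg\min_{\theta} \sum_{T_{\mathrm{cam}} \in \mathcal{C}_{\mathrm{cam}}} \mathcal{L}_{\mathrm{BCE}}\left(\mathrm{Proj}(\mathcal{B}, T_{\mathrm{cam}}),\ \mathcal{I}^{(T_{\mathrm{cam}})}_{B_{AR}}\right), \qquad (3)$$
where $\theta = (s, \mathbf{r}, \mathbf{t})$ denotes the transformation parameters of the canonical box $\mathcal{B}_{\mathrm{init}}$, $\mathcal{F}_{\mathrm{affine}}$ is the affine transformation function, and the transformed 3D bounding box is computed as $\mathcal{B} = \mathcal{F}_{\mathrm{affine}}(\mathcal{B}_{\mathrm{init}}, \theta)$. The function $\mathrm{Proj}(\mathcal{B}, T_{\mathrm{cam}})$ denotes the 2D projection of the 3D bounding box $\mathcal{B}$ onto the image plane under the camera pose $T_{\mathrm{cam}}$. After obtaining $\mathrm{DoF}_{AR}$, the attachment region $\mathcal{G}_{AR}$ is extracted by selecting 3D Gaussians from the scene representation $\mathcal{G}_S$ within the transformed bounding box $\mathcal{B}_{AR} = \mathcal{F}_{\mathrm{affine}}(\mathcal{B}_{\mathrm{init}}, \mathrm{DoF}_{AR})$. Formally, the attachment region is defined as:
$$\mathcal{G}_{AR} = \{\, g \in \mathcal{G}_S \mid g \in \mathcal{B}_{AR} \,\}$$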
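The selection of the attachment region can be sketched as a point-in-oriented-box test on the Gaussian centers. The code below is a minimal illustration under stated assumptions: the canonical box is taken to be the unit cube centered at the origin, a Gaussian is considered inside the box when its center is, and the rotation is passed as a 3x3 matrix. None of these conventions are specified in the text, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def yaw_rotation(angle: float) -> List[List[float]]:
    """Rotation about the z axis, used here only to build a test rotation."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def inside_box(p: Vec3, scale: Vec3, rot: Mat3, trans: Vec3) -> bool:
    """Test whether p lies in the box obtained from the unit cube by x -> R (s * x) + t."""
    d = [p[i] - trans[i] for i in range(3)]
    # Undo the rotation: local = R^T d (component j uses column j of R).
    local = [sum(rot[i][j] * d[i] for i in range(3)) for j in range(3)]
    # The unit cube has half-extent 0.5 per axis before scaling (assumed convention).
    return all(abs(local[j]) <= 0.5 * scale[j] for j in range(3))


def select_attachment_region(centers: Sequence[Vec3], scale: Vec3,
                             rot: Mat3, trans: Vec3) -> List[int]:
    """Return indices of Gaussians whose centers fall inside the transformed box."""
    return [i for i, p in enumerate(centers) if inside_box(p, scale, rot, trans)]


centers = [(0.0, 0.0, 0.0), (0.9, 0.0, 0.0), (0.0, 0.9, 0.0), (2.0, 2.0, 2.0)]
# A box of size 2 x 1 x 1, rotated by 90 degrees so its long side lies along world y.
region = select_attachment_region(centers, (2.0, 1.0, 1.0),
                                  yaw_rotation(math.pi / 2), (0.0, 0.0, 0.0))
```

Working in the box's local frame reduces the oriented test to three axis-aligned comparisons, which is why the rotation is inverted rather than the box corners being transformed.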
Object's DoF Initialization. Once we have obtained the attachment region $\mathcal{G}_{AR}$ and its associated transformation $\mathrm{DoF}_{AR}$, we initialize the inserted object's degrees of freedom $\mathrm{DoF}_O$ $(s_O, \mathbf{r}_O, \mathbf{t}_O)$ accordingly. For the $s_O$ initialization, we assume an intuitive real-world prior: there exists a reasonable relative scale ratio $\lambda_{\mathrm{rel}}$ between the inserted object and the attachment region, which helps ensure a plausible insertion. This ratio is implicitly understood by large-scale language models. Therefore, we leverage $\mathcal{M}_{\mathrm{MLLM}}$ to predict $\lambda_{\mathrm{rel}}$, and compute the object scale as $s_O = s_{AR} \cdot \lambda_{\mathrm{rel}}$. Considering the uncertainty in MLLM predictions and the influence of scale initialization quality on subsequent refinement, we adopt an iterative strategy. After initializing $\mathbf{r}_O$ and $\mathbf{t}_O$, we render the scene with the inserted object and iteratively interact with the MLLM, using visual feedback to adjust $s_O$ and improve realism and integration.
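For the $\mathbf{r}_O$ initialization, we leverage $\mathcal{M}_{\mathrm{MLLM}}$ to initialize a semantically appropriate object rotation. Given an MLLM-suggested primary scene viewpoint, we render a scene image $\mathcal{I}_S$ and sample a set of object-centric renderings $\{\mathcal{I}_O^{(r)}\}_{r \in \mathcal{R}}$ of the inserted object, where each $r \in \mathcal{R}$ corresponds to a unique azimuth-elevation rotation $(\phi, \theta) \in [0, 2\pi) \times [0, \pi)$. Based on $\mathcal{I}_S$, the rendering set $\{\mathcal{I}_O^{(r)}\}$, and the Global Target Prompt $\mathcal{T}_{GT}$, the model can select the optimal rotation $\mathbf{r}_O$ that maximizes a semantic alignment score:
$$\mathbf{r}_O = \arg\max_{r \in \mathcal{R}} \mathcal{M}_{\mathrm{MLLM}}\left(\mathcal{I}_S, \mathcal{I}_O^{(r)}, \mathcal{T}_{GT}\right) \qquad (4)$$
where $\mathcal{M}_{\mathrm{MLLM}}$ evaluates the semantic plausibility of the placement, i.e., how well it aligns with the scene.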
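The discrete search in Eq. (4) amounts to scoring a grid of candidate rotations and keeping the best one. The sketch below assumes a regular grid over both azimuth and elevation and a caller-supplied scoring function standing in for the MLLM; both are illustrative assumptions, since the text does not spell out how the candidates are enumerated or how the MLLM score is obtained. The 10-degree default matches the interval reported in the implementation details.

```python
import math
from typing import Callable, Iterator, Tuple

Rotation = Tuple[float, float]  # (azimuth phi, elevation theta) in radians


def candidate_rotations(step_deg: int = 10) -> Iterator[Rotation]:
    """Enumerate (phi, theta) on a regular grid over [0, 2*pi) x [0, pi)."""
    for az in range(0, 360, step_deg):
        for el in range(0, 180, step_deg):
            yield math.radians(az), math.radians(el)


def select_rotation(score: Callable[[Rotation], float],
                    step_deg: int = 10) -> Rotation:
    """Return the candidate with the highest score (the arg max of Eq. 4)."""
    return max(candidate_rotations(step_deg), key=score)


# Toy stand-in for the MLLM score: peaks at azimuth 90 and elevation 30 degrees.
target = (math.radians(90), math.radians(30))


def toy_score(r: Rotation) -> float:
    return -((r[0] - target[0]) ** 2 + (r[1] - target[1]) ** 2)


best = select_rotation(toy_score)
```

In practice each candidate would be rendered and judged together with the scene image and the global target prompt; the toy score only demonstrates the selection logic.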
To initialize $\mathbf{t}_O$, we use the strong pixel-level semantic spatial localization capability of Molmo [8] to predict a set of 2D object centers $\{c_O^{(v)}\}_{v \in \mathcal{V}}$ across multiple scene views with the Local Target Prompt $\mathcal{T}_{LT}$, using a prompt like “Point the position to add <$\mathcal{T}_{LT}$>”. Let $\hat{\mathcal{G}}_O^{(t)}$ denote the object geometry after applying the transformation $(s_O, \mathbf{r}_O, t)$, where $t$ is the parameter being optimized. For each view $v$, we project the transformed object and compute the 2D centroid of its projection. $\mathbf{t}_O$ is obtained by optimizing $t$ to minimize the discrepancy between the projected and the predicted centroids:
$$\mathbf{t}_O = \arg\min_{t} \sum_{v \in \mathcal{V}} \left\| \mathrm{Centroid}\left(\pi_v\left(\hat{\mathcal{G}}_O^{(t)}\right)\right) - c_O^{(v)} \right\|_2^2 + \mathcal{L}_{\mathrm{coll}}(\mathcal{G}_{AR}, O_c) \qquad (5)$$
Here, $\pi_v(\cdot)$ is the camera projection function of view $v$, and $\mathrm{Centroid}(\cdot)$ computes the 2D projected center of the object. To ensure physical plausibility during object insertion, we introduce the collision loss $\mathcal{L}_{\mathrm{coll}}$ [45], which penalizes interpenetration between the object centroid $O_c$ and the scene attachment region $\mathcal{G}_{AR}$.
3.4 Hierarchical Spatial Aware Refinement
The initial $\mathrm{DoF}_O$ from $\mathcal{M}_{\mathrm{MLLM}}$ often lacks spatial accuracy, hindering seamless scene integration. We therefore optimize the $\mathrm{DoF}_O$ using the SSDS loss [7], refining the object's placement in the scene. The loss is defined as:
$$\nabla_\theta \mathcal{L}_{\mathrm{SSDS}}(\phi^*, x) = \mathbb{E}_{t,\epsilon}\left[ w(t)\left(\hat{\epsilon}_{\phi^*}(x_t; y, t) - \epsilon\right) \frac{\partial x}{\partial \theta} \right] \qquad (6)$$
Here, $\theta$, $x$, $\phi^*$, and $\hat{\epsilon}_{\phi^*}(x_t; y, t)$ denote the 3D representation, the rendered image, the spatial attention map, and the score function predicting the noise $\epsilon$ from the noised image $x_t$ with text prompt $y$. Unlike the original design for multi-object composition with high timesteps, we find that lower timesteps are more effective for fine-grained DoF refinement in our setting, as they emphasize local spatial details critical for precise alignment.
Global-Local Collaborative Spatial Awareness. Diffusion models often exhibit spatial biases due to training data imbalance, e.g., generating moustaches at a relatively large scale across the lower face. Large models trained on data-driven priors often fail to meet human expectations in spatial reasoning (see Figure 6). To address spatial ambiguity, we leverage spatial relation terms (e.g., “on”, “in front of”) to impose explicit constraints on object localization. Compared to general verbs like “wearing” or “with”, these relations encode more precise spatial priors, leading to more effective supervision for optimizing object placement. We leverage spatial prompts inferred from MLLM-OIP, which offers both global semantic grounding $\mathcal{T}_{GT}$ with interaction word $\mathcal{T}_{IW}$ and fine-grained positional cues $\mathcal{T}_{LT}$ with spatial relationship word $\mathcal{T}_{SW}$. We define a hierarchical spatial loss that jointly supervises local and global alignment:
$$\mathcal{L}_{\mathrm{spatial}} = \beta \cdot \mathcal{L}_{\mathrm{ssds\text{-}global}}(\mathcal{T}_{GT}, \mathcal{T}_{IW}) + (1 - \beta) \cdot \mathcal{L}_{\mathrm{ssds\text{-}local}}(\mathcal{T}_{LT}, \mathcal{T}_{SW}), \qquad (7)$$
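The weighting in Eq. (7) can be illustrated with scalar stand-ins for the two SSDS terms. The implementation details report that $\beta$ is increased linearly from 0 to 1 during training, so under that schedule the optimization starts from the local term and ends on the global one. The function names and the step-based schedule below are assumptions made for this sketch; the actual loss terms are produced by the diffusion model.

```python
def beta_schedule(step: int, total_steps: int) -> float:
    """Linear ramp of beta from 0 (first step) to 1 (last step)."""
    if total_steps <= 1:
        return 1.0
    return min(max(step / (total_steps - 1), 0.0), 1.0)


def spatial_loss(loss_global: float, loss_local: float, beta: float) -> float:
    """Convex combination of the global and local terms, as in Eq. (7)."""
    return beta * loss_global + (1.0 - beta) * loss_local


# With 101 steps the weight moves from purely local to purely global.
weights = [beta_schedule(s, 101) for s in (0, 50, 100)]
mixed = spatial_loss(loss_global=2.0, loss_local=4.0, beta=0.25)
```

Because the two weights always sum to one, the overall loss scale stays comparable across training while the emphasis shifts between the two levels.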
Attention-based Localization. We found that, due to the bias in the T2I model's training data, the SSDS loss exhibits limitations when handling rare spatial relationships. These inherent limitations restrict the effectiveness of prompt instructions. To enable stronger spatial conditioning, we adopt an attention-based localization loss [48], enforcing tighter regional constraints as follows:
$$\mathcal{L}_{loc} = \left(1 - \max_{s \in \mathcal{S}} A_t^s\right) + \lambda \sum_{s \in \tilde{\mathcal{S}}} \left\| A_t^s \right\|_2^2 \qquad (8)$$
where $\lambda$ balances the two terms, $\mathcal{S}$ denotes the multi-view mask region projected from the 3D bounding box $\mathcal{B}$, obtained by tightly enclosing the object after DoF initialization, and $\tilde{\mathcal{S}}$ denotes the complementary region. As shown in our ablation (Figure 7), this loss is essential for precise object placement within the designated area.
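Eq. (8) can be read as two competing pressures: the attention inside the projected box should reach a high peak, and the attention outside it should vanish. The following sketch evaluates the loss on a single 2D attention map represented as nested lists; the single-view, single-token setting and the function name are simplifications assumed here, while the default $\lambda = 0.1$ is the value reported in the implementation details.

```python
from typing import Sequence


def localization_loss(attn: Sequence[Sequence[float]],
                      mask: Sequence[Sequence[bool]],
                      lam: float = 0.1) -> float:
    """(1 - max attention inside the mask) + lam * sum of squared attention outside."""
    inside = [a for row_a, row_m in zip(attn, mask)
              for a, m in zip(row_a, row_m) if m]
    outside = [a for row_a, row_m in zip(attn, mask)
               for a, m in zip(row_a, row_m) if not m]
    if not inside:
        raise ValueError("mask selects no pixels")
    return (1.0 - max(inside)) + lam * sum(a * a for a in outside)


attn = [[0.1, 0.8],
        [0.2, 0.0]]
mask = [[False, True],
        [False, False]]
# Inside term: 1 - 0.8; outside term: 0.1 * (0.1^2 + 0.2^2 + 0.0^2).
loss = localization_loss(attn, mask)
```

The first term only rewards the strongest response inside the region, so it does not force the whole box to light up, whereas the second term penalizes every pixel that leaks outside.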
3.5 Object Appearance Refinement
Once the object's degrees of freedom are determined, a refinement module is introduced to enhance the visual quality of the inserted object $\mathcal{G}_O$. Specifically, we refine $\mathcal{G}_O$ using the high-quality appearance from the inserted-object image $\mathcal{I}_O$ via LoRA [15].
Viewpoint Frequency Balancing. Single-view optimization tends to overfit, which often leads to 3D inconsistencies and missing object parts (e.g., a side view causing missing legs), as shown in Figure 9 (b). To avoid this, we perform multi-view sampling of the inserted object. Specifically, given a set of views $\{I_i\}_{i=1}^{N}$ rendered from the inserted object $\mathcal{G}_O$, we estimate the pose $P^*$ of the object image $\mathcal{I}_O$ by selecting the most similar view based on DINO feature similarity [26]. To ensure both appearance fidelity and geometric consistency, we construct the training set $\mathcal{D}_{\mathrm{ref}}$ by combining the rendered multi-view images with repeated samples of the inserted-object image $\mathcal{I}_O$ and its estimated pose $P^*$, as follows:
$$\mathcal{D}_{\mathrm{ref}} = \{(I_i, P_i)\}_{i=1}^{N} \cup \{(\mathcal{I}_O, P^*)\}_{j=1}^{M} \quad (\text{repeated},\ M > N) \qquad (9)$$
The following objective is used to fine-tune the LoRA layers:
$$\mathcal{L}_{\mathrm{ref}} = \mathbb{E}_{z_i, I_i, P_i, y^*, \epsilon, t}\left[ \left\| \epsilon_{\phi_2}(z_i, t, P_i, I_i, y^*) - \epsilon \right\|_2^2 \right], \quad (I_i, P_i) \sim \mathcal{D}_{\mathrm{ref}}. \qquad (10)$$
Here, $z_i$ is the noisy latent of image $I_i$, $t$ is the diffusion timestep, and $\epsilon$ is the target noise. The denoising network $\epsilon_{\phi_2}$, augmented with LoRA [15], is conditioned on $I_i$, its pose $P_i$, and the object-specific prompt $y^*$, e.g., “A <token> dog”, which is formatted from $\mathcal{T}_O$.
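The construction of the training set in Eq. (9) can be sketched in two steps: pick the rendered view closest to the reference image in feature space, then repeat the reference sample so that it outnumbers the rendered views. In the sketch below the feature vectors are assumed to be precomputed (for instance by a DINO encoder, which is not invoked here), cosine similarity is an assumed choice of similarity measure, and the helper names are hypothetical. The default ratio of 3 matches the reported $M/N = 3$.

```python
import math
from typing import Any, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def estimate_reference_pose(ref_feat: Sequence[float],
                            view_feats: Sequence[Sequence[float]],
                            poses: Sequence[Any]) -> Any:
    """Return the pose of the rendered view most similar to the reference image."""
    best = max(range(len(view_feats)), key=lambda i: cosine(ref_feat, view_feats[i]))
    return poses[best]


def build_refinement_set(rendered: Sequence[Tuple[Any, Any]],
                         ref_image: Any, ref_pose: Any,
                         ratio: int = 3) -> List[Tuple[Any, Any]]:
    """N rendered (image, pose) pairs plus M = ratio * N copies of the reference pair."""
    m = ratio * len(rendered)
    return list(rendered) + [(ref_image, ref_pose)] * m


# Toy data: three rendered views with 2D features and symbolic poses.
view_feats = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
poses = ["front", "side", "three-quarter"]
p_star = estimate_reference_pose((0.1, 0.9), view_feats, poses)
rendered = list(zip(["I1", "I2", "I3"], poses))
dataset = build_refinement_set(rendered, "I_O", p_star)
```

Repeating the reference pair biases sampling toward the high-quality appearance, while the rendered views keep the fine-tuned model anchored to the object's geometry from other viewpoints.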
Appearance-Focused Refinement. We employ the fine-tuned diffusion model to update the object Gaussians $\mathcal{G}_O$, guided by $\mathcal{L}_{sds}$ [28]. In our setting, we sample from a lower range of timesteps during optimization to reduce the impact on the object's geometry (e.g., shape and scale), thereby encouraging the model to focus more on refining appearance details. The corresponding objective is defined as follows:
$$\nabla_\theta \hat{\mathcal{L}}_{sds}(\phi, x) = \mathbb{E}_{\hat{t},\epsilon}\left[ w(\hat{t})\left(\hat{\epsilon}_{\phi}(x_i; y_i, \hat{t}) - \epsilon\right) \frac{\partial x}{\partial \theta} \right], \qquad (11)$$
where $\hat{t}$ denotes the adjusted (lower) diffusion timestep.
3.6 Object Replacement
In addition, our method can be naturally extended to object replacement in the scene. Specifically, given a user prompt such as “Add a [new object] to replace [existing object]”, the corresponding object to be replaced is identified through the Attachment Region $\mathcal{G}_{AR}$. We remove $\mathcal{G}_{AR}$ and then execute the standard insertion pipeline, enabling replacement without being constrained by the original object's geometry or structure.
4 Experiments
4.1 Experiments Setup
Implementation Details. During initialization, we use a learning rate of $5 \times 10^{-3}$ for optimizing both $\mathcal{G}_{AR}$ and $\mathbf{t}_O$ of the inserted object. When estimating the coarse rotation $\mathbf{r}_O$, we render the object at 10-degree intervals. During the Hierarchical Spatial Aware Refinement stage, we apply a learning rate of $5 \times 10^{-4}$ with diffusion timesteps in the range of [0.02, 0.2]; $\lambda = 0.1$ is set in $\mathcal{L}_{loc}$, and $\beta$ is linearly increased from 0 to 1 during training. For appearance refinement, we optimize the object appearance using timesteps in [0.02, 0.5∼0.25]. The object image $\mathcal{I}_O$ is upsampled with a sampling ratio of $M/N = 3$ relative to multi-view inputs. All experiments are conducted on a single NVIDIA A40 GPU. More details are provided in the Appendix.
Dataset. To comprehensively evaluate our method, we follow prior works [6, 12, 48] and select representative scenes of varying complexity, including simple backgrounds, human faces, and complex outdoor environments. In these scenes, we insert commonly associated objects (e.g., glasses, giraffes) and evaluate diverse categories such as bow ties and moustaches to assess generalization. For GaussianEditor [6], we manually annotate masks, while for TIP-Editor [48], we use the author-provided bounding boxes and object images for comparison.
Baselines. We compare our method with state-of-the-art 3D scene editing approaches that support object insertion and replacement, under two types of guidance: text prompt and text-image prompt. The text-guided baselines include three methods: Instruct-GS2GS [35], which extends Instruct-NeRF2NeRF (IN2N) [12] by replacing the NeRF in IN2N with a 3DGS model; GaussCtrl [41]; and GaussianEditor [6]. For text-image prompt methods, we compare with TIP-Editor [48], which uses an example image to specify object appearance. As TIP-Editor provides only limited insertion scripts (e.g., “A doll wearing sunglasses”, “A man with beard”), we use its official code and pre-trained weights for fairness.
Evaluation Criteria. We use CLIP Text-Image directional similarity following [6, 41, 48, 49] to assess the alignment between the text and the editing results. For appearance-specified cases, we further employ DINO similarity [26] following [48] to assess appearance preservation. We also conducted a user study with 50 participants, who rated the 3D editing results (presented with prompts in shuffled order) on four criteria: Semantic Alignment, Object Integrity, Geometric Consistency, and Detail Preservation, using a 1–10 scale.

[Figure 3 panels: rows show the original scene and the results of each method; the prompts per column are:
IN2N(GS): “Give the doll a pair of glasses”, “Give the man a pair of glasses”, “Turn the stone horse into a giraffe”
GaussCtrl: “A photo of a doll wearing a pair of glasses”, “A photo of a man wearing a pair of glasses”, “A photo of a giraffe in front of the museum”
GaussianEditor: “A doll wearing a pair of glasses”, “A man wearing a pair of glasses”, “A giraffe standing on the stone platform”
FreeInsert: “Add a pair of glasses to the doll”, “Add a pair of glasses to the man”, “Add a giraffe to replace the stone horse”]
Figure 3: Visual comparison with state-of-the-art methods on text-guided object insertion (Cols 1–2) and replacement (Col 3). Our method generates higher-quality results while preserving scene integrity. IN2N (GS) and GaussCtrl sometimes misunderstand the prompt and fail to complete insertion (e.g., “Give the doll a pair of glasses to the doll”), and struggle to produce clear shape changes in replacement (Col 3, Rows 2–3). GaussianEditor requires manual masks and depth adjustment, and suffers from artifacts and low-quality objects due to post-inpainting and 3D reconstruction limitations.
4.2 Comparisons with State-of-the-Art Methods
4.2.1 Qualitative comparisons. In this part, we conduct a qualitative comparison with different baselines under two types of input settings (text prompt and text-image prompt) to evaluate their performance under identical conditions. Video demonstrations are included in the supplementary material.
Text Prompt Comparisons. Figure 3 shows visual comparisons between our method and three baselines. Both IN2N(GS) and GaussCtrl, which rely solely on semantic guidance, struggle to successfully complete insertions in some scene-object combinations, such as “Add a pair of glasses to the doll”. Although GaussCtrl improves consistency, the replacements remain too similar to the original (e.g., a horse), failing to convincingly resemble the target (e.g., a giraffe). GaussianEditor relies on user-provided 2D masks for object insertion but struggles in object- or human-centric scenes due to inaccurate post-inpainting segmentation, leading to artifacts like foreground overlaps. While it performs better in outdoor scenes (e.g., the giraffe example), its depth estimation is often imprecise and requires manual adjustment. By contrast, our method achieves high-quality object insertion and replacement results in both scene preservation and object completeness without requiring any manual annotations.
Text-Image Prompt Comparisons. Besides the text-prompt methods, we further evaluate object insertion and replacement with a given image prompt of the specified object in Figure 4, comparing against TIP-Editor [48]. Although TIP-Editor supports flexible insertion via 3D bounding boxes, it suffers from inconsistent multi-view appearances due to its reliance on 2D editing techniques. Most critically, achieving such results remains dependent on finely specified, user-provided 3D bounding boxes, which significantly hinders scalability and practicality. In contrast, our method delivers the most complete geometry and better appearance fidelity to the image prompt without relying on any annotation.
4.2.2 Quantitative Comparisons. Table 2 presents the quantitative comparison of our method against other baseline methods. Our method achieves CLIP text–image semantic alignment scores comparable to the state-of-the-art methods without requiring any manual annotations of the insertion region. Moreover, compared to the approach that specifies object appearances (TIP-Editor), our method exhibits higher DINO similarity to the image prompt. The User vote ratings clearly demonstrate that users prefer our method over the baselines. FreeInsert is at least an hour faster than TIP-Editor, requires no manual priors, and outperforms faster methods.

[Figure 4 image. Columns: Original/Ref, TIP-Editor, FreeInsert. Rows, given as TIP-Editor prompt / FreeInsert prompt:
“A doll wearing a pair of $V_1$ sunglasses” / “Add a pair of glasses to the doll”
“A doll wearing a $V_1$ hat” / “Add a hat to the doll”
“A man with a $V_1$ beard” / “Add a beard to the man”
“A $V_1$ giraffe in a garden” / “Add a giraffe to replace the horse”
“A $V_1$ dog in woods” / “Add a dog to replace the stone bear”]
Figure 4: Visual comparisons with TIP-Editor using text-image prompts. Our method achieves competitive results with TIP-Editor, without relying on 3D bounding boxes. TIP-Editor struggles to maintain the 3D consistency of the inserted object (e.g., the misaligned hat across views in row 2, column 2, and the right front paw intersecting with the left in row 5, column 2), as its 2D editing process lacks cross-view constraints. Our method produces clearly more 3D-consistent results and more closely resembles the reference image.
Table 2: Quantitative comparisons to SOTA. $\mathrm{CLIP}_{dir}$ denotes the CLIP Text-Image directional similarity. $\mathrm{DINO}_{sim}$ is the DINO similarity.

Method                 CLIP_dir ↑   DINO_sim ↑   User_vote ↑   Time_cost ↓
InstructN2N(GS) [35]   26.76%       -            18.3          15 mins
GaussianEditor [6]     27.36%       -            24.7          20 mins
GaussCtrl [41]         25.39%       -            26.3          15 mins
TIP-Editor [48]        30.01%       83.30%       32.3          2.5 h
FreeInsert             29.48%       83.45%       36.9          1.1 h
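The two automatic metrics in Table 2 can be illustrated with a small sketch. This section does not spell out the formulas, so the code below assumes the commonly used definitions: the directional similarity is the cosine between the change in image embedding and the change in text embedding, and the DINO similarity is the cosine between two image features. The function names and the use of precomputed embedding vectors are illustrative assumptions, not the authors' evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def clip_directional_similarity(img_src, img_edit, txt_src, txt_edit):
    """Assumed definition: cosine between the image-space edit direction
    and the text-space edit direction (all inputs are precomputed
    CLIP embeddings given as plain lists of floats)."""
    d_img = [e - s for s, e in zip(img_src, img_edit)]
    d_txt = [e - s for s, e in zip(txt_src, txt_edit)]
    return cosine(d_img, d_txt)


def dino_similarity(feat_render, feat_ref):
    """Assumed definition: cosine between the DINO feature of a rendered
    view and the DINO feature of the reference image prompt."""
    return cosine(feat_render, feat_ref)
```

In practice the embeddings would come from a CLIP or DINO encoder and the scores would be averaged over rendered views; both steps are omitted here.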
4.3 Ablation Studies
Ablation Visualization Across Stages. To better demonstrate how each stage in FreeInsert contributes to the final outcome, we visualize the intermediate results at each step, as shown in Figure 5. Col 2 highlights the attachment region Gaussians within the scene, marked in red. Col 3 shows the initial degrees of freedom (DoF), which are typically sub-optimal. After SSDS refinement, the object achieves a more accurate DoF (Col 4). Finally, Col 5 demonstrates the enhanced object appearance.
Effectiveness of Global-Local Collaborative Spatial Awareness. To verify the effectiveness of the global-local collaborative spatial-aware strategy, we conduct ablation studies comparing the following variants: no SSDS, only $\mathcal{L}_{\text{ssds-global}}$, only $\mathcal{L}_{\text{ssds-local}}$, and the combination of both. As shown in Figure 6, global prompts like “A man with moustache” often lead to ambiguous placements, while local prompts such as “A moustache is under the nose and above the upper lip” provide precise constraints but may ignore global plausibility. Our method balances both by reweighting key spatial terms and progressively shifting from local to global focus, resulting in more accurate and semantically coherent placements.
Effectiveness of Attention-based Localization. We evaluate the contribution of $\mathcal{L}_{loc}$ to enhancing spatial awareness, particularly for rare or ambiguous placements. As shown in Figure 7, $\mathcal{L}_{loc}$ encourages object position within the bounding box region inferred by the large model, while allowing flexibility to adjust the DoF. Unlike fixed constraints, it softly guides attention toward intended regions, mitigating semantic control failures caused by training data biases.

[Figure 5 image. Columns: Input, Attachment Region, DoF Init, DoF Refine, Final Output. Rows: “Add a dog to replace the stone bear”, “Add a beard to the man”, “Add a pair of glasses to the doll”.]
Figure 5: Visualization of different stages in FreeInsert.

[Figure 6 image. Panels: w/o ssds, w/o $\mathcal{L}_{\text{ssds-local}}$, w/o $\mathcal{L}_{\text{ssds-global}}$, FreeInsert.]
Figure 6: Ablation of Global-Local Collaborative Spatial Awareness.

[Figure 7 image. Prompts: “Add a pair of sunglasses on the forehead”, “Add an apple on the center of table”. Panels for each prompt: w/o $\mathcal{L}_{loc}$, FreeInsert.]
Figure 7: Ablation of Attention-based Localization.
Comparison of DoF learning directly by Different Multi-Modal Large Language Models vs. FreeInsert. To evaluate our DoF learning method, we compare it with state-of-the-art multi-modal LLMs, including GPT-4V [1], Molmo-7B [8], and GPT-o1 [16], which directly predict object DoFs. Translation and scale are derived from multi-view prompts (e.g., “Point the four coordinates of a bounding box to add [object] to/on [target]”) and lifted to 3D, while rotation follows our initialization. As shown in Figure 8 (“Add a pair of glasses to the doll”), our predictions are more plausible. Quantitatively, our method achieves the highest alignment with human preferences in projected mIoU across all cases (Table 3).
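A projected mIoU of the kind reported in Table 3 can be sketched as follows. The section does not say whether the overlap is measured on boxes or masks, so this example assumes axis-aligned 2D boxes in `(x_min, y_min, x_max, y_max)` form: one box for the projected object per case, and one human-preferred reference box. The function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes
    given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predicted, reference):
    """Average IoU over paired boxes, one pair per evaluation case."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty box lists of equal length")
    total = sum(box_iou(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)
```

With 15 cases, `mean_iou` would be called on two lists of 15 boxes each; multiplying the result by 100 gives a percentage comparable in form to the table entries.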
[Figure 8 image. Panels: GPT-4V [1], Molmo-7B [8], GPT-o1 [16], FreeInsert.]
Figure 8: Qualitative comparison of different MLLMs for DoF learning on “Add a pair of sunglasses to the doll”.

Table 3: Quantitative analysis of DoF Optimization in FreeInsert.

Metric                    GPT-4V   Molmo-7B   GPT-o1   FreeInsert
mIoU over 15 cases (%)    68.2     74.7       78.9     89.5

[Figure 9 image. Panels: (a) Original, (b) Single Image, (c) FR=1, (d) FR=3.]
Figure 9: Ablation of Viewpoint Frequency Balancing.

Effectiveness of Viewpoint Frequency Balancing. Figure 9 compares object appearance optimization when fine-tuning LoRA with a single object image versus combining it with multi-view images at different sampling frequency ratios ($FR$). Using only the inserted-object image leads to shape inconsistencies and artifacts across views (item (b)), while incorporating multi-view data improves consistency but may reduce detail. Our experiments show that $FR = 3$ offers the best trade-off between multi-view consistency and single-view image quality.
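One way to picture a sampling frequency ratio is as an interleaving schedule over fine-tuning steps. The sketch below is only an illustration: this section does not define the direction of the ratio, so the code assumes that $FR$ counts multi-view samples drawn per single-image sample. The function name and the "multi"/"single" labels are hypothetical.

```python
def build_sampling_schedule(num_steps, fr):
    """Label each fine-tuning step as drawing from the multi-view set
    or from the single object image.

    Assumption (not stated in the paper): fr multi-view steps are taken
    for every one single-image step.
    """
    if fr < 1:
        raise ValueError("fr must be a positive integer")
    schedule = []
    for step in range(num_steps):
        # Within each block of (fr + 1) steps, the last one uses the
        # single image and the first fr use multi-view images.
        if step % (fr + 1) == fr:
            schedule.append("single")
        else:
            schedule.append("multi")
    return schedule
```

Under this assumed reading, `build_sampling_schedule(8, 3)` yields three multi-view steps followed by one single-image step, repeated twice. If the ratio is defined the other way round, the two labels simply swap.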
5 Conclusion and Limitations
In this work, we presented FreeInsert, a novel framework for text-driven object insertion in 3D scenes that eliminates the need for spatial priors such as 2D masks or 3D bounding boxes. By disentangling object generation from spatial placement, FreeInsert enables unsupervised and semantically guided editing through natural language. Leveraging the reasoning capabilities of foundation models, our method extracts structured semantics from user instructions to guide 3D reconstruction and spatial integration, achieving accurate placement and high visual fidelity. Extensive experiments confirm the effectiveness of our approach in enabling precise and user-friendly 3D object insertions, paving the way for more scalable and intuitive scene editing in open-world scenarios.
While promising, FreeInsert still faces some limitations. It may fail when the underlying 3D reconstruction suffers from severe geometric inconsistencies, such as duplicated limbs or the Janus problem, which cannot be fully compensated by our object-specific refinement. Complex spatial instructions that require hierarchical or relational reasoning (e.g., “Add ...... to the second layer from the top of the shelf”) may also exceed the capacity of current MLLMs. These challenges are expected to diminish as foundation models for 3D reconstruction and multi-modal reasoning continue to advance. Additionally, in replacement tasks, mismatched contact regions or imprecise 3D bounding boxes can introduce artifacts, which can be alleviated by integrating instance segmentation and local geometry refinement or inpainting.
6 Acknowledgement
This work has been partially supported by the European Union’s Horizon Europe research and innovation program under grant agreement No. 101120237 (ELIAS). Bruno Lepri and Nicu Sebe also acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU. This work was also supported by the Tianjin Natural Science Foundation, Key Project, under Grant No. 22JCZDJC00220, “Ultrasound Imaging Algorithm Research for HIFU Thermal Therapy Monitoring” (2022.10–2025.9).
References
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[2] Amir Barda, Matheus Gadelha, Vladimir G Kim, Noam Aigerman, Amit H Bermano, and Thibault Groueix. 2025. Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[3] Alexey Bokhovkin, Quan Meng, Shubham Tulsiani, and Angela Dai. 2025. SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[4] Tim Brooks, Aleksander Holynski, and Alexei A Efros. 2023. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18392–18402.
[5] Chenjie Cao, Chaohui Yu, Fan Wang, Xiangyang Xue, and Yanwei Fu. [n. d.]. MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
[6] Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, and Guosheng Lin. 2024. Gaussianeditor: Swift and controllable 3d editing with gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21476–21485.
[7] Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, and Ziwei Liu. 2024. Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance. In European Conference on Computer Vision. Springer, 128–146.
[8] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. 2024. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146 (2024).
[9] Jiahua Dong and Yu-Xiong Wang. 2023. Vica-nerf: View-consistency-aware 3d editing of neural radiance fields. Advances in Neural Information Processing Systems 36 (2023), 61466–61477.
[10] Songxue Gao, Chuanqi Jiao, Ruidong Chen, Weijie Wang, and Weizhi Nie. 2023. Point Cloud Completion Guided by Prior Knowledge via Causal Inference. arXiv preprint arXiv:2305.17770 (2023).
[11] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).
[12] Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander Holynski, and Angjoo Kanazawa. 2023. Instruct-nerf2nerf: Editing 3d scenes with instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19740–19750.
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851.
[14] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. 2023. Lrm: Large reconstruction model for single image to 3d. In Proceedings of the International Conference on Learning Representations (ICLR).
[15] Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
[16] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720 (2024).
[17] Umar Khalid, Hasan Iqbal, Nazmul Karim, Muhammad Tayyab, Jing Hua, and Chen Chen. 2024. LatentEditor: text driven local editing of 3D scenes. In European Conference on Computer Vision. Springer, 364–380.
[18] Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, and Jinwoo Shin. 2023. Collaborative score distillation for consistent visual editing. Advances in Neural Information Processing Systems 36 (2023), 73232–73257.
[19] Juil Koo, Chanho Park, and Minhyuk Sung. 2024. Posterior distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13352–13361.
[20] Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen, Yi Zhang, Peng Zhou, and Bingbing Ni. 2024. Focaldreamer: Text-driven 3d editing via focal-fusion assembly. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 3279–3287.
[21] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024).
[22] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. 2024. Wonder3d: Single image to 3d using cross-domain diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9970–9980.
[23] Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G Derpanis, and Igor Gilitschenski. 2025. Watch your steps: Local image and scene editing by text instructions. In European Conference on Computer Vision. Springer, 111–129.
[24] Weizhi Nie, Ruidong Chen, Weijie Wang, Bruno Lepri, and Nicu Sebe. 2024. T2TD: Text-3D generation model based on prior knowledge guidance. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
[25] Weizhi Nie, Weijie Wang, Anan Liu, Jie Nie, and Yuting Su. 2019. HGAN: Holistic generative adversarial networks for two-dimensional image-based three-dimensional object retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15, 4 (2019), 1–24.
[26] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. 2024. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research (2024).
[27] JangHo Park, Gihyun Kwon, and Jong Chul Ye. 2024. ED-NeRF: Efficient Text-Guided Editing of 3D Scene With Latent Space NeRF. In The Twelfth International Conference on Learning Representations.
[28] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D Diffusion. In Proceedings of the International Conference on Learning Representations (ICLR).
[29] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
[30] Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, and Federico Tombari. 2024. InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes. arXiv preprint arXiv:2401.05335 (2024).
[31] Ka Chun Shum, Jaeyeon Kim, Binh-Son Hua, Duc Thanh Nguyen, and Sai-Kit Yeung. 2024. Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5176–5187.
[32] Hyeonseop Song, Seokhun Choi, Hoseok Do, Chul Lee, and Taehyeong Kim. 2023. Blending-nerf: Text-driven localized editing in neural radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision. 14383–14393.
[33] Yanhao Sun, Runze Tian, Xiao Han, XinYao Liu, Yan Zhang, and Kai Xu. 2024. GSEditPro: 3D Gaussian Splatting Editing with Attention-based Progressive Localization. In Computer Graphics Forum, Vol. 43. Wiley Online Library, e15215.
[34] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. 2024. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In European Conference on Computer Vision. Springer, 1–18.
[35] Cyrus Vachha and Ayaan Haque. [n. d.]. Instruct-gs2gs: Editing 3d gaussian splats with instructions (2024). URL https://instruct-gs2gs.github.io ([n. d.]).
[36] Dongqing Wang, Tong Zhang, Alaa Abboud, and Sabine Süsstrunk. 2024. Innerf360: Text-guided 3d-consistent object inpainting on 360-degree neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12677–12686.
[37] Junjie Wang, Jiemin Fang, Xiaopeng Zhang, Lingxi Xie, and Qi Tian. 2024. Gaussianeditor: Editing 3d gaussians delicately with text instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20902–20911.
[38] Weijie Wang, Guofeng Mei, Bin Ren, Xiaoshui Huang, Fabio Poiesi, Luc Van Gool, Nicu Sebe, and Bruno Lepri. 2023. Zero-shot point cloud registration. arXiv preprint arXiv:2312.03032 (2023).
[39] Weijie Wang, Guofeng Mei, Jian Zhang, Nicu Sebe, Bruno Lepri, and Fabio Poiesi. 2025. Fully-Geometric Cross-Attention for Point Cloud Registration. arXiv preprint arXiv:2502.08285 (2025).
[40] Weijie Wang, Jichao Zhang, Chang Liu, Xia Li, Xingqian Xu, Humphrey Shi, Nicu Sebe, and Bruno Lepri. 2024. UVMap-ID: A Controllable and Personalized UV