A Dataset and Metric for Textual Video Content Description
Stefan J. Arzberger
[email protected]
JOANNEUM RESEARCH – DIGITAL
Graz, Austria
Paul Raith∗
[email protected]
JOANNEUM RESEARCH – DIGITAL
Graz, Austria
Werner Bailer
werner [email protected]
JOANNEUM RESEARCH – DIGITAL
Graz, Austria
Marion Jaks
[email protected]
Austrian Mediathek
Vienna, Austria
[Figure 1: two plots. Left panel categories: InternLM-XComposer-2.5, Tarsier, VideoLLaMa2. Right panel categories: Long-200, Medium-50, Short-25, Brief-10. y-axis: value (0.0 to 1.0). Legend (Direction): ENT and CONT, each for matching reference against prediction and prediction against reference.]
Figure 1: Left: Distribution of entailment and contradiction when matching approx. 200-word caption predictions from state-of-the-art MLLM-based models with reference captions; right: comparison of matching captions of different lengths predicted using InternLM against the references.
Abstract
Obtaining textual descriptions of the visual content of images and videos is often required in multimedia analysis and retrieval. Traditional video captioning approaches are usually evaluated on very short captions using rather simple metrics from NLP, while multimodal large language model (MLLM)-based approaches are mostly evaluated with question answering, which is query specific. We provide a dataset (FM-V2T) with 258 video clips from a media archive, annotated with detailed manually curated descriptions in English and German (long and short). We propose an LLM-based metric, which assesses the entailment and contradiction of facts extracted from a description with a reference, addressing shortcomings of existing metrics in handling small changes with semantic impact and in comparing descriptions with substantially different lengths. We provide experimental results on the reliability of the metric, and apply it to baseline results of three MLLM-based approaches on the FM-V2T dataset, comparing it with other metrics.
∗This work was done while Paul was working at JOANNEUM RESEARCH.
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. Request permissions from owner/author(s).
MM ’25, Dublin, Ireland
©2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-2035-2/2025/10
https://doi.org/10.1145/3746027.3758224
CCS Concepts
• Information systems → Multimedia information systems; • Computing methodologies → Information extraction; Language resources.
Keywords
Benchmarking, video to text, video captioning, evaluation, metrics
ACM Reference Format:
Stefan J. Arzberger, Paul Raith, Werner Bailer, and Marion Jaks. 2025. A Dataset and Metric for Textual Video Content Description. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25), October 27–31, 2025, Dublin, Ireland. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3746027.3758224
1 Introduction
Enabling the exploration, retrieval, and understanding of multimedia data using text queries has been a common task since the early days of research on multimedia databases. Generating textual descriptions of multimedia content not only allows feeding them into text processing and indexing pipelines, but also has the advantage of a long-term interoperable representation, which may not be the case for multimodal embeddings such as CLIP [19]. Video captioning or video to text (V2T) methods address this task, and as for many other multimedia analysis tasks, the approaches have recently moved from task-specific approaches to methods relying on multimodal large language models (MLLMs).
MM ’25, October 27–31, 2025, Dublin, Ireland Stefan J. Arzberger, Paul Raith, Werner Bailer, and Marion Jaks
MLLMs are commonly evaluated with tasks such as visual question answering (VQA). This not only hinders the comparability with specific captioning methods, but also excludes many practically relevant applications. For example, in multimedia retrieval, it is not feasible to run an MLLM on the fly against a large database to answer a specific query. Instead, information generated from the multimedia content must be indexed at ingest time. If a textual description is used for this purpose, it needs to be a query-agnostic description, capturing the key elements of the video content. Traditional video captioning benchmarks, such as MSR-VTT [32], provide very short captions that do not cover all relevant content of the video. We thus observe a lack of datasets with longer reference captions, as well as datasets covering other genres than news. In addition, most datasets are only available in English and some in Chinese, but other languages are not well covered. The fast-paced advances in MLLMs make general purpose models capable of reaching state of the art performance for the video to text task, when prompted appropriately. However, the literature on how these prompts are crafted is sparse, and the choices seem rather ad hoc.
This paper proposes a dataset for the evaluation of state of the art approaches for the video to text task, and a metric for comprehensive descriptions with different levels of detail and length. In particular, the contributions of this paper are:
• We provide FM-V2T, a bilingual dataset (English, German) for video to text evaluation, created from archival material with detailed text descriptions, and derived short annotations to allow compatibility with traditional captioning benchmarks.
• We propose an LLM-based metric (LLMFactsF1) to compare descriptions based on entailed and contradicting facts.
• We provide experimental results for assessing the reliability of the proposed metric.
• We run three state of the art MLLMs on the FM-V2T dataset to generate baseline results, testing different prompts, and evaluate the results with established metrics and LLMFactsF1.
We provide a brief overview of video to text methods as well as benchmarks and metrics for this task in Section 2, and introduce the FM-V2T dataset in Section 3. Section 4 proposes the LLMFactsF1 metric, which is used in the baseline experiments in Section 5. Section 6 concludes the paper.
2 Related work
We briefly review related work on video to text methods, including specific video caption approaches (Section 2.1) and MLLM-based ones (Section 2.2). A full review of approaches is out of scope of this paper, but there are recent surveys (e.g. [24]). In addition, we provide an overview of evaluation approaches for this task. Video captions may vary in language, format (e.g., natural language, keywords), and focus (e.g., sentiment, dynamic activities, static descriptions of objects/subjects), introducing uncertainty in defining ground truth. The performance metrics used to evaluate textual correspondence can vary widely based on these priorities and formats, and thus must be aligned accordingly. Unlike image captioning, which focuses on visual information within single frames (spatial information), video captioning must also account for the semantics over time (temporal information), making it a more complex challenge [1].
2.1 Video captioning models
For several years, the video captioning task was addressed by specific models, for example, built on LSTMs [21]. Early approaches were based on image captioning work, and often provide single sentence descriptions. Around 10 years ago, approaches aiming at longer descriptions emerged. This strand of work is often referred to as “dense video captioning”, and approaches include segmenting the video into single actions [22] or events [13] to be turned into sentences. The emergence of multimodal embeddings has led to video captioning models making use of their capabilities, for example, VideoBERT [23]. Further advances in language and vision-language models have been adopted in recent transformer-based methods, such as mPLUG-2 [31] and VAST [6]. Despite the good performance of these models, keeping up with the fast-paced advances of general purpose MLLMs has become challenging.
2.2 MLLM-based video to text models
The introduction of multimodal large language models (MLLMs) has accelerated advances in video captioning, both in methodology and evaluation approaches [24]. In contrast to the specific approaches, generic MLLMs can be fine-tuned to the video captioning task, or simply prompted. Most of these models are the vision/video variant of a large family of LLMs, such as VideoLLaMA [7], InternVideo [29], Deepseek-VL [30] or Pixtral [2]. We provide an overview of some recent methods in Table 1, focusing on open source methods with permissive licenses.
We observe that it is hard to compare them with specific video captioning approaches because MLLM-based methods are evaluated on different task settings such as question answering. In addition, the choice of prompts seems ad hoc, and there is hardly any literature on best practices for prompting MLLMs for video to text tasks.
2.3 Benchmarks and metrics
Evaluating video captioning involves determining how closely the generated captions match a reference, provided as one or more possible captions, each generated or at least reviewed/revised by a human assessor. This introduces a subjective challenge: relevance may vary depending on the context and focus. For instance, ‘a thief is pursued’ and ‘a German shepherd is sprinting’ are both valid descriptions but emphasize different aspects of the scene. Similarly, ‘a running dog’ may be considered a correct but less detailed description. Due to these challenges, human evaluation remains the gold standard for assessing semantic correspondence [26].
Quantitative evaluation can be categorized into text-based metrics and benchmarks. Text-based metrics include traditional NLP measures, which struggle with synonymy and multi-sentence texts. Traditional NLP metrics such as BLEU [18] and ROUGE [15] exhibit the limitation of word-wise matching ($n$-gram-based), whereas improvements such as CIDEr+ [25] or METEOR [3] also consider synonyms or word stems. Embedding-based metrics, such as BERTScore [35], assess semantic similarity in embedding space and handle paraphrasing and longer texts better (but may, e.g., fail to handle negations properly, as they may be projected close to the positive statement). Recent LLM-based evaluations have the potential to approximate human understanding of text similarity [26].
Method | Type | Year | License | Comment
mPLUG-2 [31] | Specific | 2023 | Apache-2.0 | Combines pre-trained vision and language transformers for video to text tasks.
VAST [6] | Specific | 2023 | MIT | Fuses video, audio, and subtitle information via a transformer architecture.
Tarsier [26] | MLLM | 2024 | Apache-2.0 | Temporal modeling through LLMs with the DREAM evaluation benchmark.
VideoLLaMA2 [7] | MLLM | 2024 | Apache-2.0 | Enhances spatial-temporal understanding with a convolutional connector.
InternLM-XC-2.5 [34] | MLLM | 2023 | Apache-2.0 | Supports long-context inputs/outputs and high-resolution inputs.
VideoChat2 [14] | MLLM | 2024 | Apache-2.0 | MVBench benchmark for spatio-temporal understanding with VideoChat2 as a baseline.
InternVideo2 [29] | MLLM | 2024 | Apache-2.0 | Leader on MVBench for fine-grained action description; long-context inputs.
Deepseek-VL2 [30] | MLLM | 2024 | MIT & DsML | Efficient MLLM with competitive performance.
Pixtral 12B [2] | MLLM | 2024 | Apache-2.0 | Efficient model from the mistral family with image support.
Table 1: A selection of recent video to text methods with permissive licenses.
Benchmarks can be classified into traditional and LLM-based types. Traditional datasets typically contain multiple single-sentence captions with simple textual structure and are benchmarked using traditional NLP metrics. For example, MSR-VTT [32] provides 20 single-sentence reference captions for a single video, each addressing subjective focus, and has an average length of 9 words per caption. Other traditional dataset benchmarks are often specialized for specific video domains, like MSVD [5], ActivityNet [4] and YouCook [9]. LLM-driven benchmarks, however, are often tailored to specific priorities, with varied annotation structure, and are frequently evaluated with the aid of LLMs. For example, the Tarsier method introduced the DREAM-1k benchmark [26] with a two-step LLM-based evaluation using single multi-sentence captions and key event extraction. MVBench [14], designed to emphasize temporal understanding, also introduced the VideoChat2 model. Other examples include VATEX [27] and MMBench [16]. There are large datasets with longer descriptions such as Vript [33], InternVid [28], MiraData [12] and HowTo100M [17], which are only fully automatically annotated.
We observe a gap between the traditional captioning metrics, which make a comparison against a reference (often addressing brief captions, with limited ability to capture complex semantics), and approaches from LLM benchmarks.
3 FM-V2T Dataset
We provide the FAIRmedia Video to Text (FM-V2T) dataset, containing detailed textual descriptions of the visual content of video shots. The dataset uses content of Österreichische Mediathek, the Austrian audio and video archive, and was developed in the context of the project FAIRmedia¹. The dataset uses a selection of content of the Wiener Video Rekorder collection², containing content created with consumer video cameras and documenting everyday life. The content covers events in public space, festivities, and family events of people living in or around Vienna, but is quite diverse beyond that, including travel to nearby and far away places. The content has been cleared for publication in Mediathek’s online catalog, addressing the potential copyright and privacy issues.
¹ Fair and trusted datasets for media computing https://www.joanneum.at/digital/en/projects/fairmedia/
² https://www.mediathek.at/wiener-videorekorder/english-information (content available at https://www.mediathek.at/digitale-sammlung)
From the collection, 268 videos have been selected by archive experts, aiming at diversity of the content. These videos have been temporally segmented using the shot boundary detector described in [11], and key frame extraction based on visual activity in the content has been performed with that algorithm. As annotation of the entire collection is not feasible, but we aim at a heterogeneous and representative content set, we select one shot per video manually. As shot boundaries may delineate completely different content, describing multiple shots together does not generally provide added value. In order not to select very short or long clips, the shots are required to fall into a length interval of $[5; 35]$ s. As we also want to exclude almost static or extremely dynamic content, we require the number of key frames to be in $[2; 100]$ and the number of key frames/s to be in $[0.5; 4]$. After this process, 258 clips remain, with mean length 15.45 ± 7.79 s (min 5.12 s, max 34.92 s).
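The selection constraints above can be sketched as a simple predicate. This is only an illustration under assumptions: the `Shot` class, its field names and the function name are hypothetical and not taken from the released dataset code; only the three intervals come from the text.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """Hypothetical container for one detected shot."""
    duration_s: float   # shot length in seconds
    n_keyframes: int    # number of key frames extracted for the shot


def is_candidate(shot: Shot) -> bool:
    """Apply the three interval constraints stated in the text."""
    # Length interval [5; 35] s: excludes very short or long clips.
    if not 5.0 <= shot.duration_s <= 35.0:
        return False
    # Key frame count in [2; 100].
    if not 2 <= shot.n_keyframes <= 100:
        return False
    # Key frames per second in [0.5; 4]: excludes almost static
    # or extremely dynamic content.
    rate = shot.n_keyframes / shot.duration_s
    return 0.5 <= rate <= 4.0


shots = [Shot(10.0, 8), Shot(3.0, 4), Shot(20.0, 5), Shot(10.0, 60)]
candidates = [s for s in shots if is_candidate(s)]
```

In this toy list only the first shot passes: the second is too short, the third has too few key frames per second, and the fourth has too many.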
For these clips, textual annotations in English and German are generated. The English description is generated using VideoLLaMa2 [7], using the prompt ‘Describe this video, exactly and only focus on what is visible, without imagining any details that are not visible! Answer what can be seen, where the video was shot, what persons, animals or buildings etc. can be seen. What is happening in the video? When was the video filmed at day or night for example . . . Is there something unique to this video? Limit the description to 200 words!’. The resulting text is then manually checked, corrected and amended as needed. As preliminary tests with German video to text models showed inferior quality in comparison to English models, we translate the revised annotation to German using NLLB [8]. Again, the resulting annotation is manually checked and revised.
In order to establish interoperability with widely used benchmarks such as MSR-VTT [32], we derive 20 single sentence short captions (similar to the “gold captions” in MSR-VTT). This has been done using ChatGPT-4o mini³, using the prompt ‘I will give you a video caption and you have to extract the most important information into a short 10 word description and make different variations of the "pred_caption". These "short_caption" are variations of the "pred_caption" and have the same meaning and maybe focus on a few of the details from the original video caption and are all formulated in other words but without inventing any other details that where not in the original video caption. So similar like in the MSR_VTT dataset. Here this is a example please also stay in this json format: { ... }’. The resulting captions were again manually refined.
³ https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence
This dataset, based on diverse archive video content, is primarily intended for assessing and comparing video to text methods, supporting two languages. The provision of long detailed annotations and a set of short captions enables the application of a wide range of metrics and benchmarking approaches. In addition, the dataset can serve related downstream tasks, such as fact extraction or visual question answering. A release of the dataset is available at https://github.com/FAIRmedia-AT/FM-V2T. In addition to the annotations (published under a CC-BY 4.0 license), it contains the references and metadata of the video, including code for downloading the clips for which annotations are available.
In addition to established NLP metrics, we provide two variants of cosine similarity of text embeddings produced by SentenceBERT [20]⁴ for comparison. These scores rely on dense representations produced by SentenceBERT. For the input text span $t$, we obtain an embedding $e(t) = \mathrm{MeanPool}_{\mathrm{MiniLM}}(t) \in \mathbb{R}^{384}$, where the hidden states of all tokens are averaged (mean pooling). Given a predicted caption $c_p$ and a reference caption $c_r$, we report:
CosPar (paragraph_cosine_similarity). A single embedding is computed for the entire caption; the score is the cosine between the two paragraph vectors:
$$\mathrm{CosPar}(c_p, c_r) = \cos\big(e_{\mathrm{para}}(c_p),\, e_{\mathrm{para}}(c_r)\big).$$
CosSent (sentence_median_similarity). Both captions are first segmented into sentences. Sentence embeddings $e_{\mathrm{sent}}(\cdot)$ are paired one-to-one by greedily choosing the highest remaining cosine similarity, producing a matching set $\mathcal{M}(c_r, c_p)$. The score is the median of these pairwise cosines, which is robust to outliers such as very dissimilar or unmatched sentences:
$$\mathrm{CosSent}(c_p, c_r) = \operatorname{median}_{(s_r, s_c) \in \mathcal{M}(c_r, c_p)} \cos\big(e_{\mathrm{sent}}(s_r),\, e_{\mathrm{sent}}(s_c)\big).$$
Both metrics theoretically range from -1 to 1, but in our experiments on natural language captions, scores fell in the practical range of 0 (minimal semantic similarity) to 1 (identical embeddings). Both use the same embedding-pooling configuration {mean, median_similarity} specified in the experimental settings.
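A minimal sketch of the greedy one-to-one matching and median aggregation behind CosSent is given below. It operates on precomputed sentence vectors (plain Python lists) rather than calling SentenceBERT, so it stays self-contained. The function names are hypothetical, and leaving unmatched sentences out of the median is an assumption, since the text does not spell out how they are treated.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_matching(ref_vecs, pred_vecs):
    """Pair sentences one-to-one, always taking the highest remaining cosine."""
    scored = sorted(
        ((cosine(r, p), i, j)
         for i, r in enumerate(ref_vecs)
         for j, p in enumerate(pred_vecs)),
        reverse=True,
    )
    used_ref, used_pred, matched = set(), set(), []
    for sim, i, j in scored:
        if i in used_ref or j in used_pred:
            continue  # each sentence may be used in at most one pair
        used_ref.add(i)
        used_pred.add(j)
        matched.append(sim)
    return matched


def cos_sent(ref_vecs, pred_vecs):
    """Median cosine over the greedy matching (assumption: unmatched ignored)."""
    matched = greedy_matching(ref_vecs, pred_vecs)
    return median(matched) if matched else 0.0


# Toy 2-D "embeddings": two reference sentences, three predicted sentences.
ref = [[1.0, 0.0], [0.0, 1.0]]
pred = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
score = cos_sent(ref, pred)
```

In the toy data each reference vector has an identical counterpart among the predictions, so both matched pairs have cosine 1 and the third predicted sentence stays unmatched.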
4 Fact-based description metric
Traditional metrics for assessing video captioning, such as BLEU, ROUGE or CIDEr, are not well suited for handling longer and more expressive descriptions. Using text embeddings such as SentenceBERT [20] in order to measure distances in the embedding space addresses some of the issues of these metrics. However, embeddings may still be similar when many of the concepts in the sentences align, ignoring that e.g. a part of a sentence has been negated. In addition, these metrics do not perform reliably when comparing texts of different lengths. This issue can partly be addressed by breaking each of the texts into sentences, raising the question how to aggregate the set of their pairwise comparison scores.
We thus propose a metric based on the overlap of factual statements between two texts, e.g. generated and reference description of a video clip. We discuss in this section the design of the metric, describe the concrete implementation and present the experiments performed to validate the robustness of the metric.
⁴ sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
4.1 Design
The basic idea of the metric is to extract single sentence statements contained in a text $T_A$ (“facts”), and check them against another text $T_B$. $T_B$ may support these facts (entailment), be in conflict with them (contradiction) or not contain information related to this statement (neutral). This two step pipeline is realized using two prompts to LLMs: the fact extraction model $\mathcal{M}_{fex}(T) \rightarrow F$, where $F$ is the set of facts $F = \{f_1, \ldots, f_k\}$, and the checking model $\mathcal{M}_{chk}(T, F) \rightarrow (E, C, N)$, where $E$, $C$ and $N$ are binary vectors of size $k$, encoding for each fact whether it is entailed, contradicting or neutral ($|E| + |C| + |N| = k$, $|\cdot|$ denotes the $L_1$ norm).
We obtain $(E_{AB}, C_{AB}, N_{AB}) = \mathcal{M}_{chk}(T_B, \mathcal{M}_{fex}(T_A))$, and calculate the rates
$$ENT_{AB} = \frac{|E_{AB}|}{k_A}, \qquad CONT_{AB} = \frac{|C_{AB}|}{k_A}.$$
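The directional rates can be sketched with the two models passed in as plain callables. Everything below is illustrative: `directional_rates`, the label strings and the toy stand-ins for the extraction and checking models are assumptions made for this sketch, not the authors' implementation, which prompts an LLM for both steps.

```python
from typing import Callable, List, Tuple

LABELS = ("entailment", "contradiction", "neutral")


def directional_rates(
    text_a: str,
    text_b: str,
    extract_facts: Callable[[str], List[str]],
    check_facts: Callable[[str, List[str]], List[str]],
) -> Tuple[float, float]:
    """Return (ENT_AB, CONT_AB): facts from text_a checked against text_b."""
    facts = extract_facts(text_a)            # plays the role of M_fex(T_A)
    labels = check_facts(text_b, facts)      # plays the role of M_chk(T_B, F)
    if len(labels) != len(facts) or any(l not in LABELS for l in labels):
        raise ValueError("checker must return one valid label per fact")
    k = len(facts)
    if k == 0:
        return 0.0, 0.0                      # assumption: no facts, zero rates
    return labels.count("entailment") / k, labels.count("contradiction") / k


# Toy stand-ins for the two LLM prompts, for demonstration only.
def toy_extract(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_check(text: str, facts: List[str]) -> List[str]:
    known = set(toy_extract(text))
    return ["entailment" if f in known else "neutral" for f in facts]


ent_ab, cont_ab = directional_rates(
    "A plane flies. It tows a banner.", "A plane flies.", toy_extract, toy_check
)
```

With the toy callables, one of the two extracted statements is found in the second text, so the entailment rate is 0.5 and the contradiction rate is 0.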
By including entailments and contradictions normalised by the number of facts in the metric, the amount of neutral statements is implicitly included. In order to compensate for effects of different lengths of $T_A$ and $T_B$, we perform the extraction and checking process in both directions, and calculate the harmonic mean
$$ENT_{F1} = \frac{2\,|E_{AB}|\,|E_{BA}|}{|E_{AB}| + |E_{BA}|}, \qquad CONT_{F1} = \frac{2\,|C_{AB}|\,|C_{BA}|}{|C_{AB}| + |C_{BA}|}.$$
In order to express semantic overlap as a single number, we define
$$LLMFactsF1 = \frac{2\,ENT_{F1}\,(1 - CONT_{F1})}{ENT_{F1} + (1 - CONT_{F1})}.$$
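As a worked illustration, the combination of the two directions can be sketched as follows. One assumption is made explicit here: the harmonic means are taken over the normalised rates $ENT_{AB}$, $ENT_{BA}$, $CONT_{AB}$ and $CONT_{BA}$, so that all intermediate values stay in $[0, 1]$, which the final formula requires. The function names are hypothetical.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative values; 0.0 if both are zero."""
    return 2.0 * a * b / (a + b) if (a + b) > 0 else 0.0


def llm_facts_f1(ent_ab: float, ent_ba: float,
                 cont_ab: float, cont_ba: float) -> float:
    """Combine directional entailment/contradiction rates into one score."""
    ent_f1 = harmonic_mean(ent_ab, ent_ba)     # ENT_F1
    cont_f1 = harmonic_mean(cont_ab, cont_ba)  # CONT_F1
    # Harmonic mean of entailment and absence of contradiction.
    return harmonic_mean(ent_f1, 1.0 - cont_f1)


# Example: 80% / 50% entailment, contradictions only in one direction.
score = llm_facts_f1(ent_ab=0.8, ent_ba=0.5, cont_ab=0.1, cont_ba=0.0)
```

In this example $ENT_{F1} = 8/13$ and $CONT_{F1} = 0$, giving a score of $16/21 \approx 0.762$. Note that with a harmonic mean, contradictions found in only one direction lead to $CONT_{F1} = 0$.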
4.2 Implementation
We implement the proposed LLMFactsF1 metric as a two-stage pipeline comprising fact extraction and relational classification using LLMs. The components are integrated into a locally hosted inference system using the HuggingFace Transformers interface.
In the fact extraction stage, the LLM is prompted to identify atomic factual statements from paragraph-length input. Extracted facts must be self-contained, including subject, predicate, and object, and must retain linguistic modifiers such as modality, negation, and quantification.
In the classification stage, each fact is evaluated for entailment, contradiction, or neutrality with respect to a comparison paragraph. This step is performed collectively: the entire fact set is assessed against the full reference or hypothesis text, rather than individually. If parsing fails due to syntactic inconsistencies, an auxiliary LLM call attempts structural correction. Manual inspection was occasionally needed to fix formatting, such as removing spurious quotation marks around named entities, but did not modify the statements themselves.
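The parse-then-repair step might look roughly like the sketch below. It assumes, purely for illustration, that the classifier is asked to return a JSON list with one label string per fact; the actual output format and prompts are documented in the dataset repository and may differ. The `repair` callable stands in for the auxiliary LLM call, and all names are hypothetical.

```python
import json
from typing import Callable, List, Optional

ALLOWED = {"entailment", "contradiction", "neutral"}


def _try_parse(raw: str, n_facts: int) -> Optional[List[str]]:
    """Return the label list if raw is valid JSON of the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if (isinstance(data, list) and len(data) == n_facts
            and all(isinstance(x, str) and x in ALLOWED for x in data)):
        return data
    return None


def parse_labels(raw: str, n_facts: int,
                 repair: Optional[Callable[[str], str]] = None) -> List[str]:
    """Parse classifier output; on failure, try one structural repair pass."""
    labels = _try_parse(raw, n_facts)
    if labels is None and repair is not None:
        # In the described system this would be an auxiliary LLM call.
        labels = _try_parse(repair(raw), n_facts)
    if labels is None:
        raise ValueError("could not parse classifier output")
    return labels


# Toy repair: turn single quotes into double quotes.
fixed = parse_labels("['entailment', 'neutral']", 2,
                     repair=lambda s: s.replace("'", '"'))
```

Validating the number and vocabulary of labels before accepting the output keeps malformed responses from silently skewing the entailment and contradiction counts.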
The system uses the Llama 3.1 8B⁵ [10] model with deterministic hyperparameters⁶ and a high token limit to preserve context integrity. Smaller models (e.g., 3B variants) showed deficiencies in meaningful fact extraction. On a single NVIDIA A6000 GPU (48GB RAM), the fact extraction step takes 16.8 ± 4.9 s and the fact alignment check takes 35.9 ± 18.5 s.
⁵ https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
⁶ do_sample=False, temperature=None, top_p=None
[Figure 2: two panels, (a) Ref2Ref and (b) InternLM2InternLM. y-axis: value (0.0 to 1.0). Legend (Direction): ENT and CONT for each of the two matching directions.]
Figure 2: Bidirectional validation of LLM Metric for ENT (green) & CONT (red) for (a) reference-against-reference and (b) InternLM-to-InternLM.
Prompts were iteratively refined for robustness; the final versions are documented in the GitHub repository of the dataset. The pipeline provides counts of entailments, contradictions, and neutral relations in both directions, which are used to compute the final LLMFactsF1 score as detailed above.
4.3 Validation of the metric
In order to assess the reliability of the metric, we test the metric on self-matching the references in the FM-V2T dataset. This should result in $ENT_{F1}$ close to 1, and $CONT_{F1}$ close to 0. We perform the experiment twice, matching forward and backward, in order to test the reproducibility of the models. In addition, we repeat the same experiment with self-matching the output of InternLM-XComposer-2.5 [34] on the FM-V2T dataset. Figure 2 shows the results, indicating that the metric is very reliable in terms of contradictions, which are almost 0. The rate of entailments is very high, though with more outliers, i.e. captions resulting in lower entailment rates. This means that support for some extracted statements could not be verified, and they are thus considered neutral. Matching the InternLM video to text model output against itself turns out to be even more reliable, with a lower number of outliers. This is probably due to the simpler and shorter nature of the outputs compared to the references. We also perform a test with randomly misaligned references, shown in Figure 3. As expected, the entailment is very low, while contradictions are quite high. As the randomly aligned references may describe different content, and do not necessarily contradict each other, the contradictions are almost uniformly distributed, also across different lengths.
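One way to produce such randomly misaligned pairs is to draw a permutation in which no clip keeps its own reference. The text does not specify how the misalignment was generated, so the rejection-sampling approach, the function name and the fixed seed below are assumptions for illustration only.

```python
import random
from typing import List


def misaligned_indices(n: int, seed: int = 0) -> List[int]:
    """Return a permutation of range(n) with no index mapped to itself."""
    if n < 2:
        raise ValueError("need at least two items to misalign")
    rng = random.Random(seed)  # fixed seed for reproducible pairings
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        # Accept only if every prediction gets a different clip's reference.
        if all(i != p for i, p in enumerate(perm)):
            return perm


# Pair prediction i with the reference of clip perm[i] for the 258 clips.
perm = misaligned_indices(258)
```

Roughly one in three random permutations has no fixed point, so the loop typically ends after a few attempts.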
4.4 Qualitative example
In order to illustrate the advantages of the proposed metric, we provide a qualitative example. Table 3 provides an example of a video caption, including the reference, outputs of three models and two manually changed versions of the reference (one changing a name, one introducing a negation).
Table 4 shows the results of the proposed LLMFactsF1 and a set of other metrics. It becomes evident that the proposed metric scores the variants of the reference clearly higher than the predictions, and penalizes the negation more than the name change, which is not the case for most of the other metrics. The cosine similarity metrics of text embeddings come closest, but fail to make a clear distinction from the prediction with a wrong text on the banner and the one omitting the text entirely.
[Figure 3: four panels, Ref2InternLM (Long-200), Ref2InternLM (Medium-50), Ref2InternLM (Short-25), Ref2InternLM (Brief-10). y-axis: value (0.0 to 1.0). Legend (Direction): ENT and CONT, each for matching reference against prediction and prediction against reference.]
Figure 3: Results with randomly misaligned references across different caption lengths.
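To see why word-overlap metrics barely react to the negation, a clipped unigram precision can be computed by hand for the negated variant of the reference caption. This is a simplified stand-in for BLEU-1 (no brevity penalty and a naive tokenizer, both assumptions of this sketch), not the evaluation code used for Table 4.

```python
import re
from collections import Counter


def tokens(text: str):
    """Lowercase and keep alphanumeric runs (hyphenated words stay whole)."""
    return re.findall(r"[a-z0-9-]+", text.lower())


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision of the candidate against one reference."""
    cand = Counter(tokens(candidate))
    ref = Counter(tokens(reference))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate word counts at most as often as it occurs in the reference.
    matched = sum(min(count, ref[word]) for word, count in cand.items())
    return matched / total


reference = ('A small airplane is flying across the sky, with a banner '
             'reading "BUSSI SUSI-LEO" trailing behind it.')
negated = ('A small airplane is flying across the sky, with no banner '
           'reading "BUSSI SUSI-LEO" trailing behind it.')
score = unigram_precision(negated, reference)
```

Only the word "no" is unmatched, so 16 of 17 tokens agree (about 0.94), which is consistent with the BLEU-1 value listed for the negated variant in Table 4, even though the meaning of the statement is reversed.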
5 Experiments
In order to obtain baseline results, we run three state-of-the-art models on the FM-V2T dataset: Tarsier [26], VideoLLaMA2 [7] and InternLM-XComposer-2.5 [34].
Figure 1 (left) shows the distribution of entailment and contradiction of the model outputs against the reference on the FM-V2T dataset. These outputs are obtained by prompting the models for detailed ($\leq$ 200 words) captions. The results show that the InternLM model has the highest fraction of entailment and the lowest fraction of contradictions, while the results of the other two models are worse. The fact that InternLM outperforms VideoLLaMa suggests that there is no bias in terms of description quality stemming from the fact that VideoLLaMa was used as starting point for the human annotation. However, the VideoLLaMa model provides a more similar level of detail to the description, as evident from the similar distribution of entailment; this might be a slight bias from the process. For the other two models, the entailment of facts extracted from the prediction is higher than from the reference, indicating that the reference is more comprehensive, and thus it is easier for the LLM to align facts with it. The differences in matching directions for some models also confirm the decision to use the harmonic mean of both directions as an integrated score.
We also analyse the impact of the description length on the score. We use InternLM to predict descriptions of 200, 50, 25 and 10 words, and compare to the references. The results are shown in Figure 1 (right). As expected, the shorter descriptions result in a lower number of entailed and contradicting facts when matched against the longer reference. The extraction of facts from the shorter prediction results in higher entailment when matched against the longer one, as the related information can be found there; however, the absolute number of facts is lower than in the opposite direction.
Table 2 provides an overview of results with prompting the baseline models for descriptions of different lengths, and matching against the long reference or the short captions. These results also show that the lengths asked for in the prompts are followed to a different degree by the different models. The results indicate that state of the art models provide usable results on this task, but also that the dataset is challenging enough to leave room for improvement. The results also show that shorter descriptions are able to capture the most relevant aspects of the content. The last two lines of the table provide results for the derived short captions, showing that the dataset provides compatibility with MSR-VTT style benchmarks.
We also analyse the correlation of the proposed metric with baseline metrics for InternLM results on the FM-V2T dataset (Figure 4). Generally, the correlation is low, showing that the metrics assess different aspects of the descriptions. There is a weak correlation with the metrics using cosine similarity of the text embeddings, while there is almost no correlation with some of the traditional NLP metrics. Based on samples we assume that this is due to cases where the syntactic and semantic impact of differences diverges.
We have made experiments on selecting prompts for the MLLMs. As often in prompt engineering, it is hard to predict which changes will have substantial effects. We have thus created 30 prompts and ran experiments with them. We provide the list of prompts and an interactive plot on the dataset’s GitHub repository.
Model | Ref | Pred | Words | LLMFactsF1 | METEOR | BLEU-1 | BLEU-2 | ROUGE-L | CosPar | CosSent
VideoLLama | Ref | Long-200 | 148±31 | 0.496±0.146 | 0.212±0.035 | 0.390±0.091 | 0.245±0.073 | 0.296±0.052 | 0.751±0.113 | 0.670±0.087
InternLM | Ref | Long-200 | 136±22 | 0.529±0.184 | 0.198±0.047 | 0.380±0.085 | 0.223±0.083 | 0.283±0.072 | 0.711±0.112 | 0.648±0.099
Tarsier | Ref | Long-200 | 59±14 | 0.476±0.175 | 0.135±0.032 | 0.257±0.107 | 0.150±0.069 | 0.257±0.047 | 0.771±0.092 | 0.644±0.094
InternLM | Ref | Medium-50 | 61±22 | 0.508±0.186 | 0.129±0.044 | 0.248±0.125 | 0.142±0.086 | 0.246±0.065 | 0.729±0.108 | 0.640±0.101
InternLM | Ref | Short-25 | 49±30 | 0.473±0.207 | 0.106±0.051 | 0.173±0.153 | 0.097±0.092 | 0.205±0.072 | 0.682±0.125 | 0.653±0.116
InternLM | Ref | Brief-10 | 10±3 | 0.431±0.230 | 0.033±0.015 | 0.002±0.014 | 0.001±0.007 | 0.085±0.034 | 0.629±0.121 | 0.686±0.124
InternLM | Short Refs | Brief-10 | 10±3 | - | 0.213±0.076 | 0.720±0.182 | 0.450±0.226 | 0.415±0.144 | - | -
Tarsier | Short Refs | Brief-10 | 20±35 | - | 0.204±0.066 | 0.590±0.160 | 0.353±0.183 | 0.373±0.120 | - | -
Table 2: Metrics obtained for differently prompted model outputs using the three baseline methods on the FM-V2T dataset. Short Ref refers to the short version of the reference caption.
Type | Text: Short-25
Reference | A small airplane is flying across the sky, with a banner reading "BUSSI SUSI-LEO" trailing behind it.
Negation | A small airplane is flying across the sky, with [-a-] {+no+} banner reading "BUSSI SUSI-LEO" trailing behind it.
NameChange | A small airplane is flying across the sky, with a banner reading "[-BUSSI SUSI-LEO-] {+BUSSI JOSEF-MARIA+}" trailing behind it.
InternLM-XC | A helicopter is flying across a clear blue sky, with a banner reading "BUSSI SUSHI -LEO" trailing behind it.
Tarsier | A plane flies across the sky with a sign displaying ’BUSSI SUJU-LEON’.
VideoLLaMa | A small airplane flies over a runway, followed by a helicopter, both seen from a distance against a gray sky.
Table 3: Generated descriptions of the video across models and edits. Text in red indicates contradicting and in green entailed statements. For the edited variants, red text marks deletions, and blue text insertions. (Colors are not reproduced in this text version; deletions are shown as [-…-] and insertions as {+…+}.)
Short-25 | LLMFactsF1 | METEOR | BLEU-1 | BLEU-2 | ROUGE-L | CosPar | CosSent
Negation | 0.585 | 0.594 | 0.941 | 0.907 | 0.941 | 0.967 | 0.967
NameChange | 0.750 | 0.518 | 0.941 | 0.907 | 0.941 | 0.981 | 0.981
InternLM | 0.400 | 0.402 | 0.722 | 0.618 | 0.747 | 0.722 | 0.722
Tarsier | 0.635 | 0.186 | 0.385 | 0.304 | 0.468 | 0.908 | 0.908
VideoLLaMa | 0.273 | 0.127 | 0.250 | 0.162 | 0.219 | 0.728 | 0.728
Table 4: Metrics obtained for the example in Table 3.
[Figure 4: plot with x-axis: Baseline metric (0.0 to 1.0) and y-axis: LLMFactsF1 (0.0 to 1.0). Series: BLEU-1, BLEU-2, METEOR, ROUGE-L, CosPar, CosSent.]
Figure 4: Comparison of the proposed metric with baseline metrics for InternLM results on the FM-V2T dataset.
6 Conclusion
We have provided a bilingual dataset for the evaluation of video to text methods, compatible with the evaluation methods used for traditional captioning as well as MLLM-based methods. We have also proposed a novel LLM-based metric, and validated the metric in a range of experiments on the dataset. This also includes comparisons with other metrics on the outputs of three state of the art MLLMs on the proposed dataset. The dataset provides the annotations in two languages, but the machine translation and revision workflow we have used to obtain the German annotations can be efficiently replicated for other languages. In a similar way, the visual content description could be amended by speech to text.
Acknowledgments
This work has been funded partially by the Austrian Research Promotion Agency (FFG) under the Digital Technologies project FAIRmedia (https://www.joanneum.at/digital/en/projects/fairmedia/), and by European Union’s Horizon Europe programme under grant agreement n° 101070250 XRECO (https://xreco.eu/). The authors thank Georg Thallinger for his feedback on the paper.
References
[1] Moloud Abdar, Meenakshi Kollati, Swaraja Kuraparthi, Farhad Pourpanah, Daniel McDuff, Mohammad Ghavamzadeh, Shuicheng Yan, Abduallah Mohamed, Abbas Khosravi, Erik Cambria, et al. 2024. A review of deep learning for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
[2] Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, et al. 2024. Pixtral 12B. arXiv preprint arXiv:2410.07073 (2024).
[3] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72.
[4] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition. 961–970.
[5] David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. 190–200.
[6] Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu. 2023. Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset. Advances in Neural Information Processing Systems 36 (2023), 72842–72866.
[7] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. 2024. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476 (2024).
[8] Marta R Costa-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672 (2022).
[9] Pradipto Das, Chenliang Xu, Richard F Doell, and Jason J Corso. 2013. A thousand frames in just a few words: Lingual description of videos through latent topics
and spa se objec s i ching. In P oceedings o he IEEE con e ence on compu e
ision and pa e n ecogni ion. 2634–2641.
[10]
Aa on G a a io i e al. 2024. The Llama 3 He d o Models.
a Xi :2407.21783 [cs.AI] h ps://a xi .o g/abs/2407.21783
[11]
Hannes Fassold. 2024. Fas e han eal- ime de ec ion o sho bounda ies, sam-
pling s uc u e and dynamic key ames in ideo. In 2024 8 h In e na ional Con e -
ence on Imaging, Signal P ocessing and Communica ions (ICISPC). IEEE, 33–36.
[12]
Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xin ao Wang, Ailing Zeng,
Yu Xiong, Qiang Xu, and Ying Shan. 2024. Mi ada a: A la ge-scale ideo da ase
wi h long du a ions and s uc u ed cap ions. Ad ances in Neu al In o ma ion
P ocessing Sys ems 37 (2024), 48955–48970.
[13]
Ranjay K ishna, Kenji Ha a, F ede ic Ren, Li Fei-Fei, and Juan Ca los Niebles.
2017. Dense-cap ioning e en s in ideos. In P oceedings o he IEEE in e na ional
con e ence on compu e ision. 706–715.
[14]
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan
Xu, Guo Chen, Ping Luo, e al
.
2024. M bench: A comp ehensi e mul i-modal
ideo unde s anding benchma k. In P oceedings o he IEEE/CVF Con e ence on
Compu e Vision and Pa e n Recogni ion. 22195–22206.
[15]
Chin-Yew Lin. 2004. Rouge: A package o au oma ic e alua ion o summa ies.
In Tex summa iza ion b anches ou . 74–81.
[16]
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo
Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, e al
.
2024. Mmbench:
Is you mul i-modal model an all-a ound playe ?. In Eu opean con e ence on
compu e ision. Sp inge , 216–233.
[17]
An oine Miech, Dimi i Zhuko , Jean-Bap is e Alay ac, Maka and Tapaswi, I an
Lap e , and Jose Si ic. 2019. How o100m: Lea ning a ex - ideo embedding by
wa ching hund ed million na a ed ideo clips. In P oceedings o he IEEE/CVF
in e na ional con e ence on compu e ision. 2630–2640.
[18]
Kisho e Papineni, Salim Roukos, Todd Wa d, and Wei-Jing Zhu. 2002. Bleu: a
me hod o au oma ic e alua ion o machine ansla ion. In P oceedings o he
40 h annual mee ing o he Associa ion o Compu a ional Linguis ics. 311–318.
[19]
Alec Rad o d, Jong Wook Kim, Ch is Hallacy, Adi ya Ramesh, Gab iel Goh,
Sandhini Aga wal, Gi ish Sas y, Amanda Askell, Pamela Mishkin, Jack Cla k,
e al
.
2021. Lea ning ans e able isual models om na u al language supe ision.
In In e na ional con e ence on machine lea ning. PmLR, 8748–8763.
[20]
Nils Reime s and I yna Gu e ych. 2019. Sen ence-BERT: Sen ence Embeddings
using Siamese BERT-Ne wo ks. In P oceedings o he 2019 Con e ence on Empi ical
Me hods in Na u al Language P ocessing and he 9 h In e na ional Join Con e ence
on Na u al Language P ocessing (EMNLP-IJCNLP). 3982–3992.
[21]
Anna Roh bach, Ma cus Roh bach, and Be n Schiele. 2015. The long-sho s o y
o mo ie desc ip ion. In Pa e n Recogni ion: 37 h Ge man Con e ence, GCPR 2015,
Aachen, Ge many, Oc obe 7-10, 2015, P oceedings 37. Sp inge , 209–221.
[22]
And ew Shin, Ka suno i Ohnishi, and Ta suya Ha ada. 2016. Beyond cap ion o
na a i e: Video cap ioning wi h mul iple sen ences. In 2016 IEEE In e na ional
con e ence on image p ocessing (ICIP). IEEE, 3364–3368.
[23]
Chen Sun, Aus in Mye s, Ca l Vond ick, Ke in Mu phy, and Co delia Schmid.
2019. Videobe : A join model o ideo and language ep esen a ion lea ning.
In P oceedings o he IEEE/CVF in e na ional con e ence on compu e ision. 7464–
7473.
[24]
Yunlong Tang, Jing Bi, Si ing Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan
Zhang, Jie An, Jingyang Lin, Rongyi Zhu, e al
.
2025. Video unde s anding wi h
la ge language models: A su ey. IEEE T ansac ions on Ci cui s and Sys ems o
Video Technology (2025).
[25]
Ramak ishna Vedan am, C Law ence Zi nick, and De i Pa ikh. 2015. Cide :
Consensus-based image desc ip ion e alua ion. In P oceedings o he IEEE con e -
ence on compu e ision and pa e n ecogni ion. 4566–4575.
[26]
Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. [n. d.]. Ta sie :
Recipes o aining and e alua ing la ge ideo desc ip ion models, 2024. URL
h ps://a xi . o g/abs/2407.00634 8 ([n. d.]).
[27]
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang
Wang. 2019. Va ex: A la ge-scale, high-quali y mul ilingual da ase o ideo-
and-language esea ch. In P oceedings o he IEEE/CVF in e na ional con e ence on
compu e ision. 4581–4591.
[28]
Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li,
Guo Chen, Xinyuan Chen, Yaohui Wang, e al
.
2023. In e n id: A la ge-scale
ideo- ex da ase o mul imodal unde s anding and gene a ion. a Xi p ep in
a Xi :2307.06942 (2023).
[29]
Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei,
Rongkun Zheng, Zun Wang, Yansong Shi, e al
.
2024. In e n ideo2: Scaling
ounda ion models o mul imodal ideo unde s anding. In Eu opean Con e ence
on Compu e Vision. Sp inge , 396–416.
[30]
Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai,
Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, e al
.
2024. Deepseek-
l2: Mix u e-o -expe s ision-language models o ad anced mul imodal unde -
s anding. a Xi p ep in a Xi :2412.10302 (2024).
[31]
Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang
Li, Bin Bi, Qi Qian, Wei Wang, e al
.
2023. mplug-2: A modula ized mul i-modal
ounda ion model ac oss ex , image and ideo. In In e na ional Con e ence on
Machine Lea ning. PMLR, 38728–38748.
[32]
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Ms - : A la ge ideo desc ip ion
da ase o b idging ideo and language. In P oceedings o he IEEE con e ence on
compu e ision and pa e n ecogni ion. 5288–5296.
[33]
Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang,
Yan Gao, Yao Hu, and Hai Zhao. 2024. V ip : A ideo is wo h housands o wo ds.
Ad ances in Neu al In o ma ion P ocessing Sys ems 37 (2024), 57240–57261.
[34]
Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng
Guo, Haodong Duan, Bin Wang, Linke Ouyang, e al
.
2024. In e nlm-xcompose -
2.5: A e sa ile la ge ision language model suppo ing long-con ex ual inpu
and ou pu . a Xi p ep in a Xi :2407.03320 (2024).
[35]
Tianyi Zhang, Va sha Kisho e, Felix Wu, Kilian Q Weinbe ge , and Yoa A zi.
2020. BERTSco e: E alua ing Tex Gene a ion wi h BERT. In In e na ional Con-
e ence on Lea ning Rep esen a ions.