A Dataset and Metric for Textual Video Content Description

Author: Arzberger, Stefan J.; Raith, Paul; Marion, Jaks; Bailer, Werner

Publisher: Zenodo

DOI: 10.1145/3746027.3758224

Source: https://zenodo.org/records/17287642/files/ACM_MM2025_Video2Text-5.pdf

A Dataset and Metric for Textual Video Content Description
Stefan J. Arzberger
JOANNEUM RESEARCH – DIGITAL
Graz, Austria
Paul Raith∗
JOANNEUM RESEARCH – DIGITAL
Graz, Austria
Werner Bailer
JOANNEUM RESEARCH – DIGITAL
Graz, Austria
Marion Jaks
Austrian Mediathek
Vienna, Austria
[Figure 1 (plot not reproduced). Left panel groups: InternLM-XComposer-2.5, Tarsier, VideoLLaMa2. Right panel groups: Long-200, Medium-50, Short-25, Brief-10. Vertical axis: value (0.0 to 1.0). Legend (direction): ENT(ref→pred), ENT(pred→ref), CONT(ref→pred), CONT(pred→ref).]
Figure 1: Left: Distribution of entailment and contradiction when matching approx. 200-word caption predictions from state-of-the-art MLLM-based models with reference captions; right: comparison of matching captions of different lengths predicted using InternLM against the references.
Abstract
Obtaining textual descriptions of the visual content of images and videos is often required in multimedia analysis and retrieval. Traditional video captioning approaches are usually evaluated on very short captions using rather simple metrics from NLP, while multimodal large language model (MLLM)-based approaches are mostly evaluated with question answering, which is query specific. We provide a dataset (FM-V2T) with 258 video clips from a media archive, annotated with detailed manually curated descriptions in English and German (long and short). We propose an LLM-based metric, which assesses the entailment and contradiction of facts extracted from a description with a reference. It addresses two shortcomings of existing metrics: handling small changes with semantic impact, and comparing descriptions with substantially different lengths. We provide experimental results on the reliability of the metric, and apply it to baseline results of three MLLM-based approaches on the FM-V2T dataset, comparing it with other metrics.
∗This work was done while Paul was working at JOANNEUM RESEARCH.
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. Request permissions from owner/author(s).
MM ’25, Dublin, Ireland
©2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-2035-2/2025/10
https://doi.org/10.1145/3746027.3758224
CCS Concepts
• Information systems → Multimedia information systems; • Computing methodologies → Information extraction; Language resources.
Keywords
Benchmarking, video to text, video captioning, evaluation, metrics
ACM Reference Format:
Stefan J. Arzberger, Paul Raith, Werner Bailer, and Marion Jaks. 2025. A Dataset and Metric for Textual Video Content Description. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25), October 27–31, 2025, Dublin, Ireland. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3746027.3758224
1 Introduction
Enabling the exploration, retrieval, and understanding of multimedia data using text queries has been a common task since the early days of research on multimedia databases. Generating textual descriptions of multimedia content not only allows feeding them into text processing and indexing pipelines, but also has the advantage of a long-term interoperable representation, which may not be the case for multimodal embeddings such as CLIP [19]. Video captioning or video to text (V2T) methods address this task, and as for many other multimedia analysis tasks, the approaches have recently moved from task-specific approaches to methods relying on multimodal large language models (MLLMs).
MLLMs are commonly evaluated with tasks such as visual question answering (VQA). This not only hinders the comparability with specific captioning methods, but also excludes many practically relevant applications. For example, in multimedia retrieval, it is not feasible to run an MLLM on the fly against a large database to answer a specific query. Instead, information generated from the multimedia content has to be indexed at ingest time. If a textual description is used for this purpose, it needs to be a query-agnostic description, capturing the key elements of the video content. Traditional video captioning benchmarks, such as MSR-VTT [32], provide very short captions that do not cover all relevant content of the video. We thus observe a lack of datasets with longer reference captions, as well as datasets covering genres other than news. In addition, most datasets are only available in English and some in Chinese, but other languages are not well covered. The fast-paced advances in MLLMs make general purpose models capable of reaching state of the art performance for the video to text task, when prompted appropriately. However, the literature on how these prompts are crafted is sparse, and the choices seem rather ad-hoc.
This paper proposes a dataset for the evaluation of state of the art approaches for the video to text task, and a metric for comprehensive descriptions with different levels of detail and length. In particular, the contributions of this paper are:
• We provide FM-V2T, a bilingual dataset (English, German) for video to text evaluation, created from archival material with detailed text descriptions, and derived short annotations to allow compatibility with traditional captioning benchmarks.
• We propose an LLM-based metric (LLMFactsF1) to compare descriptions based on entailed and contradicting facts.
• We provide experimental results for assessing the reliability of the proposed metric.
• We run three state of the art MLLMs on the FM-V2T dataset to generate baseline results, testing different prompts, and evaluate the results with established metrics and LLMFactsF1.
We provide a brief overview of video to text methods as well as benchmarks and metrics for this task in Section 2, and introduce the FM-V2T dataset in Section 3. Section 4 proposes the LLMFactsF1 metric, which is used in the baseline experiments in Section 5. Section 6 concludes the paper.
2 Related work
We briefly review related work on video to text methods, including specific video caption approaches (Section 2.1) and MLLM-based ones (Section 2.2). A full review of approaches is out of scope of this paper, but there are recent surveys (e.g. [24]). In addition, we provide an overview on evaluation approaches for this task. Video captions may vary in language, format (e.g., natural language, keywords), and focus (e.g., sentiment, dynamic activities, static descriptions of objects/subjects), introducing uncertainty in defining ground truth. The performance metrics used to evaluate textual correspondence can vary widely based on these priorities and formats, and thus must be aligned accordingly. Unlike image captioning, which focuses on visual information within single frames (spatial information), video captioning must also account for the semantics over time (temporal information), making it a more complex challenge [1].
2.1 Video captioning models
For several years, the video captioning task was addressed by specific models, for example, built on LSTMs [21]. Early approaches were based on image captioning work, and often provide single sentence descriptions. Around 10 years ago, approaches aiming at longer descriptions emerged. This strand of work is often referred to as “dense video captioning”, and approaches include segmenting the video into single actions [22] or events [13] to be turned into sentences. The emergence of multimodal embeddings has led to video captioning models making use of their capabilities, for example, VideoBERT [23]. Further advances in language and vision-language models have been adopted in recent transformer-based methods, such as mPLUG-2 [31] and VAST [6]. Despite the good performance of these models, keeping up with the fast-paced advances of general purpose MLLMs has become challenging.
2.2 MLLM-based video to text models
The introduction of multimodal large language models (MLLMs) has accelerated advances in video captioning, both in methodology and evaluation approaches [24]. In contrast to the specific approaches, generic MLLMs can be fine-tuned for the video captioning task, or simply prompted. Most of these models are the vision/video variant of a large family of LLMs, such as VideoLLaMA [7], InternVideo [29], Deepseek-VL [30] or Pixtral [2]. We provide an overview of some recent methods in Table 1, focusing on open source methods with permissive licenses.
We observe that it is hard to compare them with specific video captioning approaches because MLLM-based methods are evaluated on different task settings such as question answering. In addition, the choice of prompts seems ad-hoc, and there is hardly any literature on best practices for prompting MLLMs for video to text tasks.
2.3 Benchmarks and metrics
Evaluating video captioning involves determining how closely the generated captions match a reference, provided as one or more possible captions, each generated or at least reviewed/revised by a human assessor. This introduces a subjective challenge: relevance may vary depending on the context and focus. For instance, ‘a thief is pursued’ and ‘a German shepherd is sprinting’ are both valid descriptions but emphasize different aspects of the scene. Similarly, ‘a running dog’ may be considered a correct but less detailed description. Due to these challenges, human evaluation remains the gold standard for assessing semantic correspondence [26].
Quantitative evaluation can be categorized into text-based metrics and benchmarks. Text-based metrics include traditional NLP measures which struggle with synonymy and multi-sentence texts. Traditional NLP metrics such as BLEU [18] and ROUGE [15] exhibit the limitation of word-wise matching (n-gram-based), whereas improvements such as CIDEr+ [25] or METEOR [3] also consider synonyms or word stems. Embedding-based metrics, such as BERTScore [35], assess semantic similarity in embedding space and handle paraphrasing and longer texts better (but may, for example, fail to handle negations properly, as a negated statement may be projected close to the positive statement). Recent LLM-based evaluations have the potential to approximate human understanding of text similarity [26].
| Method | Type | Year | License | Comment |
|---|---|---|---|---|
| mPLUG-2 [31] | Specific | 2023 | Apache-2.0 | Combines pre-trained vision and language transformers for video to text tasks. |
| VAST [6] | Specific | 2023 | MIT | Fuses video, audio, and subtitle information via a transformer architecture. |
| Tarsier [26] | MLLM | 2024 | Apache-2.0 | Temporal modeling through LLMs with the DREAM evaluation benchmark. |
| VideoLLaMA2 [7] | MLLM | 2024 | Apache-2.0 | Enhances spatial-temporal understanding with a convolutional connector. |
| InternLM-XC-2.5 [34] | MLLM | 2023 | Apache-2.0 | Supports long-context inputs/outputs and high-resolution inputs. |
| VideoChat2 [14] | MLLM | 2024 | Apache-2.0 | MVBench benchmark for spatio-temporal understanding with VideoChat2 as a baseline. |
| InternVideo2 [29] | MLLM | 2024 | Apache-2.0 | Leader on MVBench for fine-grained action description; long-context inputs. |
| Deepseek-VL2 [30] | MLLM | 2024 | MIT & DsML | Efficient MLLM with competitive performance. |
| Pixtral 12B [2] | MLLM | 2024 | Apache-2.0 | Efficient model from the mistral family with image support. |

Table 1: A selection of recent video to text methods with permissive licenses.
Benchmarks can be classified into traditional and LLM-based types. Traditional datasets typically contain multiple single-sentence captions with simple textual structure and are benchmarked using traditional NLP metrics. For example, MSR-VTT [32] provides 20 single-sentence reference captions for a single video, each addressing subjective focus, and has an average length of 9 words per caption. Other traditional dataset benchmarks are often specialized for specific video domains like MSVD [5], ActivityNet [4] and YouCook [9]. LLM-driven benchmarks, however, are often tailored to specific priorities, with varied annotation structure, and are frequently evaluated with the aid of LLMs. For example, the Tarsier method introduced the DREAM-1k benchmark [26] with a two-step LLM-based evaluation using single multi-sentence captions and key event extraction. MVBench [14], designed to emphasize temporal understanding, also introduced the VideoChat2 model. Other examples include VATEX [27] and MMBench [16]. There are large datasets with longer descriptions such as Vript [33], InternVid [28], MiraData [12] and HowTo100M [17], which are only fully automatically annotated.
We observe a gap between the traditional captioning metrics, which compare against a reference (often addressing brief captions, and with limited ability to capture complex semantics), and approaches from LLM benchmarks.
3 FM-V2T Dataset
We provide the FAIRmedia Video to Text (FM-V2T) dataset, containing detailed textual descriptions of the visual content of video shots. The dataset uses content of Österreichische Mediathek, the Austrian audio and video archive, and was developed in the context of the project FAIRmedia¹. The dataset uses a selection of content of the Wiener Video Rekorder collection², containing content created with consumer video cameras and documenting everyday life. The content mainly covers events in public space, festivities and family events of people living in or around Vienna, but is quite diverse, including travel to nearby and far away places. The content has been cleared for publication in Mediathek’s online catalog, addressing the potential copyright and privacy issues.
¹ Fair and trusted datasets for media computing, https://www.joanneum.at/digital/en/projects/fairmedia/
² https://www.mediathek.at/wiener-videorekorder/english-information (content available at https://www.mediathek.at/digitale-sammlung)
From the collection, 268 videos have been selected by archive experts, aiming at diversity of the content. These videos have been temporally segmented using the shot boundary detector described in [11], and key frame extraction based on visual activity in the content has been performed with that algorithm. As annotation of the entire collection is not feasible, but we aim at a heterogeneous and representative content set, we select one shot per video manually. As shot boundaries may delineate completely different content, describing multiple shots together does not generally provide added value. In order not to select very short or long clips, the shots are required to fall into a length interval of $[5; 35]$ s. As we also want to exclude almost static or extremely dynamic content, we require the number of key frames to be in $[2; 100]$ and the number of key frames per second to be in $[0.5; 4]$. After this process, 258 clips remain, with mean length 15.45 ± 7.79 s (min 5.12 s, max 34.92 s).
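The three selection criteria above can be expressed as a simple predicate. The following is a minimal sketch under stated assumptions: the `Shot` record and the function name `is_candidate` are hypothetical and not part of the released dataset code, and the interval bounds are treated as inclusive.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """Hypothetical record for one detected shot (not from the dataset code)."""
    duration_s: float      # shot length in seconds
    num_key_frames: int    # key frames extracted for this shot


def is_candidate(shot: Shot,
                 len_range=(5.0, 35.0),
                 key_frame_range=(2, 100),
                 rate_range=(0.5, 4.0)) -> bool:
    """Return True if the shot satisfies all three interval constraints."""
    # Check the length first, so the division below never sees a zero duration.
    if not (len_range[0] <= shot.duration_s <= len_range[1]):
        return False
    if not (key_frame_range[0] <= shot.num_key_frames <= key_frame_range[1]):
        return False
    key_frames_per_second = shot.num_key_frames / shot.duration_s
    return rate_range[0] <= key_frames_per_second <= rate_range[1]
```

A shot of 20 s with 5 key frames passes the first two checks but is rejected as almost static, because its rate of 0.25 key frames per second lies below 0.5.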
For these clips, textual annotations in English and German are generated. The English description is generated using VideoLLaMa2 [7], using the prompt ‘Describe this video, exactly and only focus on what is visible, without imagining any details that are not visible! Answer what can be seen, where the video was shot, what persons, animals or buildings etc. can be seen. What is happening in the video? When was the video filmed at day or night for example . . . Is there something unique to this video? Limit the description to 200 words!’. The resulting text is then manually checked, corrected and amended as needed. As preliminary tests with German video to text models showed inferior quality in comparison to English models, we translate the revised annotation to German using NLLB [8]. Again, the resulting annotation is manually checked and revised.
In order to establish interoperability with widely used benchmarks such as MSR-VTT [32], we derive 20 single sentence short captions (similar to the “gold captions” in MSR-VTT). This has been done using ChatGPT-4o mini³, using the prompt ‘I will give you a video caption and you have to extract the most important information into a short 10 word description and make different variations of the "pred_caption". These "short_caption" are variations of the "pred_caption" and have the same meaning and maybe focus on a few of the details from the original video caption and are all formulated in other words but without inventing any other details that where not in the original video caption. So similar like in the MSR_VTT dataset. Here this is a example please also stay in this json format: { ... }’. The resulting captions were again manually refined.
³ https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence
This dataset, based on diverse archive video content, is primarily intended for assessing and comparing video to text methods, supporting two languages. The provision of long detailed annotations and a set of short captions enables the application of a wide range of metrics and benchmarking approaches. In addition, the dataset can serve related downstream tasks, such as fact extraction or visual question answering. A release of the dataset is available at https://github.com/FAIRmedia-AT/FM-V2T. In addition to the annotations (published under a CC-BY 4.0 license), it contains the references and metadata of the video, including code for downloading the clips for which annotations are available.
In addition to established NLP metrics, we provide two variants of cosine similarity of text embeddings produced by SentenceBERT [20]⁴ for comparison. These scores rely on dense representations produced by SentenceBERT. For the input text span $t$, we obtain an embedding $e(t) = \mathrm{MeanPool}_{\mathrm{MiniLM}}(t) \in \mathbb{R}^{384}$, where the hidden states of all tokens are averaged (mean pooling). Given a predicted caption $c_p$ and a reference caption $c_r$, we report:
CosPar (paragraph_cosine_similarity). A single embedding is computed for the entire caption; the score is the cosine between the two paragraph vectors:
$$\mathrm{CosPar}(c_p, c_r) = \cos\big(e_{\mathrm{para}}(c_p),\, e_{\mathrm{para}}(c_r)\big).$$
CosSent (sentence_median_similarity). Both captions are first segmented into sentences. Sentence embeddings $e_{\mathrm{sent}}(\cdot)$ are paired one-to-one by greedily choosing the highest remaining cosine similarity, producing a matching set $M(c_r, c_p)$. The score is the median of these pairwise cosines, which is robust to outliers such as very dissimilar or unmatched sentences:
$$\mathrm{CosSent}(c_p, c_r) = \operatorname*{median}_{(s_r, s_c) \in M(c_r, c_p)} \cos\big(e_{\mathrm{sent}}(s_r),\, e_{\mathrm{sent}}(s_c)\big).$$
Both metrics theoretically range from -1 to 1, but in our experiments on natural language captions, scores fell in the practical range of 0 (minimal semantic similarity) to 1 (identical embeddings). Both use the same embedding-pooling configuration {mean, median_similarity} specified in the experimental settings.
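To make the two cosine scores concrete, the sketch below computes them from embeddings that are already available as plain lists of floats. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the embedding model is not called, sentences left without a partner after greedy matching are simply dropped (the text does not specify how they are treated), and all vectors are assumed to be non-zero.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cos_par(pred_embedding, ref_embedding):
    """CosPar: cosine between the two paragraph-level embeddings."""
    return cosine(pred_embedding, ref_embedding)


def cos_sent(pred_sentence_embs, ref_sentence_embs):
    """CosSent: median cosine over a greedy one-to-one sentence matching.

    Both inputs must contain at least one sentence embedding.
    """
    # Score every reference/prediction sentence pair, best pairs first.
    scored = sorted(
        ((cosine(r, p), i, j)
         for i, r in enumerate(ref_sentence_embs)
         for j, p in enumerate(pred_sentence_embs)),
        reverse=True,
    )
    used_ref, used_pred, matched = set(), set(), []
    for sim, i, j in scored:
        if i in used_ref or j in used_pred:
            continue  # each sentence may be matched only once
        used_ref.add(i)
        used_pred.add(j)
        matched.append(sim)
    return median(matched)
```

With two reference sentences and three predicted sentences, only two pairs are formed, so under this assumption the third predicted sentence does not influence the score.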
4 Fact-based description metric
Traditional metrics for assessing video captioning, such as BLEU, ROUGE or CIDEr, are not well suited for handling longer and more expressive descriptions. Using text embeddings such as SentenceBERT [20] in order to measure distances in the embedding space addresses some of the issues of these metrics. However, embeddings may still be similar when many of the concepts in the sentences align, ignoring that e.g. a part of a sentence has been negated. In addition, these metrics do not perform reliably when comparing texts of different lengths. This issue can partly be addressed by breaking each of the texts into sentences, raising the question how to aggregate the set of their pairwise comparison scores.
We thus propose a metric based on the overlap of factual statements between two texts, e.g. generated and reference description of a video clip. We discuss in this section the design of the metric, describe the concrete implementation and present the experiments performed to validate the robustness of the metric.
⁴ sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
4.1 Design
The basic idea of the metric is to extract single sentence statements contained in a text $T_A$ (“facts”), and check them against another text $T_B$. $T_B$ may support these facts (entailment), be in conflict with them (contradiction) or not contain information related to this statement (neutral). This two step pipeline is realized using two prompts to LLMs: the fact extraction model $\mathcal{M}_{fex}(T) \rightarrow F$, where $F$ is the set of facts $F = \{f_1, \ldots, f_k\}$, and the checking model $\mathcal{M}_{chk}(T, F) \rightarrow (E, C, N)$, where $E$, $C$ and $N$ are binary vectors of size $k$, encoding for each fact whether it is entailed, contradicting or neutral ($|E| + |C| + |N| = k$, $|\cdot|$ denotes the $L_1$ norm).
We obtain $(E_{AB}, C_{AB}, N_{AB}) = \mathcal{M}_{chk}(T_B, \mathcal{M}_{fex}(T_A))$, and calculate the rates
$$ENT_{AB} = \frac{|E_{AB}|}{k_A}, \qquad CONT_{AB} = \frac{|C_{AB}|}{k_A}.$$
By including entailments and contradictions normalised by the number of facts in the metric, the amount of neutral statements is implicitly included. In order to compensate for effects of different lengths of $T_A$ and $T_B$, we perform the extraction and checking process in both directions, and calculate the harmonic mean
$$ENT_{F1} = \frac{2\,|E_{AB}|\,|E_{BA}|}{|E_{AB}| + |E_{BA}|}, \qquad CONT_{F1} = \frac{2\,|C_{AB}|\,|C_{BA}|}{|C_{AB}| + |C_{BA}|}.$$
In order to express semantic overlap as a single number, we define
$$LLMFactsF1 = \frac{2\,ENT_{F1}\,(1 - CONT_{F1})}{ENT_{F1} + (1 - CONT_{F1})}.$$
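The aggregation can be illustrated with a short sketch that starts from the per-direction counts of entailed and contradicting facts. Two points are assumptions of this sketch and not statements from the paper: the harmonic means are taken over the normalised rates $ENT_{AB}$, $ENT_{BA}$ (and likewise for contradictions), so that all intermediate values stay in [0, 1] as the final formula requires, and a harmonic mean with a zero denominator is defined as 0. The function names are hypothetical.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative numbers; 0.0 if both are zero."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def rates(num_entailed: int, num_contradicted: int, num_facts: int):
    """Entailment and contradiction rates for one matching direction."""
    if num_facts == 0:
        return 0.0, 0.0  # assumption: no extracted facts yields zero rates
    return num_entailed / num_facts, num_contradicted / num_facts


def llm_facts_f1(e_ab: int, c_ab: int, k_a: int,
                 e_ba: int, c_ba: int, k_b: int) -> float:
    """Combine both matching directions into a single score.

    e_ab / c_ab: facts from text A entailed / contradicted by text B,
    k_a: number of facts extracted from A (and symmetrically for B).
    """
    ent_ab, cont_ab = rates(e_ab, c_ab, k_a)
    ent_ba, cont_ba = rates(e_ba, c_ba, k_b)
    ent_f1 = harmonic_mean(ent_ab, ent_ba)
    cont_f1 = harmonic_mean(cont_ab, cont_ba)
    return harmonic_mean(ent_f1, 1.0 - cont_f1)
```

For example, with 8 of 10 facts entailed and 1 contradicted in one direction, and 4 of 5 entailed and none contradicted in the other, both entailment rates are 0.8 and the contradiction term is 0, giving a score of 1.6 / 1.8. Note that, as a property of the harmonic mean, the contradiction term vanishes whenever one direction contains no contradictions.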
4.2 Implementation
We implement the proposed LLMFactsF1 metric as a two-stage pipeline comprising fact extraction and relational classification using LLMs. The components are integrated into a locally hosted inference system using the HuggingFace Transformers interface.
In the fact extraction stage, the LLM is prompted to identify atomic factual statements from paragraph-length input. Extracted facts must be self-contained, including subject, predicate, and object, and must retain linguistic modifiers such as modality, negation, and quantification.
In the classification stage, each fact is evaluated for entailment, contradiction, or neutrality with respect to a comparison paragraph. This step is performed collectively: the entire fact set is assessed against the full reference or hypothesis text, rather than individually. If parsing fails due to syntactic inconsistencies, an auxiliary LLM call attempts structural correction. Manual inspection was occasionally needed to fix formatting, such as removing spurious quotation marks around named entities, but did not modify the statements themselves.
The system uses the Llama 3.1 8B⁵ [10] model with deterministic hyperparameters⁶ and a high token limit to preserve context integrity. Smaller models (e.g., 3B variants) showed deficiencies in meaningful fact extraction. On a single NVIDIA A6000 GPU (48GB RAM), the fact extraction step takes 16.8 ± 4.9 s and the fact alignment check takes 35.9 ± 18.5 s.
⁵ https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
⁶ do_sample=False, temperature=None, top_p=None
[Figure 2 (plot not reproduced). Panels: (a) Ref2Ref, (b) InternLM2InternLM. Vertical axis: value (0.0 to 1.0). Legend (direction): ENT and CONT, each shown for both matching directions.]
Figure 2: Bidirectional validation of LLM Metric for ENT (green) & CONT (red) for (a) reference-against-reference and (b) InternLM-to-InternLM.
Prompts were iteratively refined for robustness; the final versions are documented in the GitHub repository of the dataset. The pipeline provides counts of entailments, contradictions, and neutral relations in both directions, which are used to compute the final LLMFactsF1 score as detailed above.
4.3 Validation of the metric
In order to assess the reliability of the metric, we test the metric on self-matching the references in the FM-V2T dataset. This should result in $ENT_{F1}$ close to 1, and $CONT_{F1}$ close to 0. We perform the experiment twice, matching forward and backward, in order to test the reproducibility of the models. In addition, we repeat the same experiment with self-matching the output of InternLM-XComposer-2.5 [34] on the FM-V2T dataset. Figure 2 shows the results, indicating that the metric is very reliable in terms of contradictions, which are almost 0. The rate of entailments is very high, though with more outliers, i.e. captions resulting in lower entailment rates. This means that support for some extracted statements could not be verified, and they are thus considered neutral. Matching the InternLM video to text model output against itself turns out to be even more reliable, with a lower number of outliers. This is probably due to the simpler and shorter nature of the outputs compared to the references. We also perform a test with randomly misaligned references, shown in Figure 3. As expected, the entailment is very low, while contradictions are quite high. As the randomly aligned references may describe different content, and do not necessarily contradict each other, the contradictions are almost uniformly distributed, also across different lengths.
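The paper does not describe how the random misalignment was produced. One plausible way to build such a control set, shown here purely as an assumption, is to shuffle the clip identifiers until no prediction is paired with the reference of its own clip. The function name `misalign` and the use of integer clip identifiers are hypothetical.

```python
import random


def misalign(clip_ids, seed=0):
    """Map each clip id to the id of a different clip's reference.

    Assumes the ids are unique. Reshuffles until no clip is paired with
    itself, which needs at least two clips.
    """
    if len(clip_ids) < 2:
        raise ValueError("need at least two clips to misalign references")
    rng = random.Random(seed)  # fixed seed keeps the control set reproducible
    shuffled = list(clip_ids)
    while any(a == b for a, b in zip(clip_ids, shuffled)):
        rng.shuffle(shuffled)
    return dict(zip(clip_ids, shuffled))
```

Each prediction is then scored against the reference of the clip it was mapped to, so that low entailment is the expected outcome.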
4.4 Qualitative example
In order to illustrate the advantages of the proposed metric, we provide a qualitative example. Table 3 provides an example of a video caption, including the reference, outputs of three models and two manually changed versions of the reference (one changing a name, one introducing a negation).
Table 4 shows the results of the proposed LLMFactsF1 and a set of other metrics. It becomes evident that the proposed metric scores the variants of the reference clearly higher than the predictions, and penalizes the negation more than the name change, which is not the case for most of the other metrics. The cosine similarity metrics of text embeddings come closest, but fail to make a clear distinction from the prediction with a wrong text on the banner and the one omitting the text entirely.
[Figure 3 (plot not reproduced). Groups: Ref2InternLM (Long-200), Ref2InternLM (Medium-50), Ref2InternLM (Short-25), Ref2InternLM (Brief-10). Vertical axis: value (0.0 to 1.0). Legend (direction): ENT(ref→pred), ENT(pred→ref), CONT(ref→pred), CONT(pred→ref).]
Figure 3: Results with randomly misaligned references across different caption lengths.
5 Experiments
In order to obtain baseline results, we run three state-of-the-art models on the FM-V2T dataset: Tarsier [26], VideoLLaMA2 [7] and InternLM-XComposer-2.5 [34].
Figure 1 (left) shows the distribution of entailment and contradiction of the model outputs against the reference on the FM-V2T dataset. These outputs are obtained by prompting the models for detailed (≤ 200 words) captions. The results show that the InternLM model has the highest fraction of entailment and the lowest fraction of contradictions, while the results of the other two models are worse. The fact that InternLM outperforms VideoLLaMa suggests that there is no bias in terms of description quality stemming from the fact that VideoLLaMa was used as starting point for the human annotation. However, the VideoLLaMa model provides a level of detail more similar to that of the reference description, as evident from the similar distribution of entailment; this might be a slight bias from the process. For the other two models, the entailment of facts extracted from the prediction is higher than from the reference, indicating that the reference is more comprehensive, and thus it is easier for the LLM to align facts with it. The differences between matching directions for some models also confirm the decision to use the harmonic mean of both directions as an integrated score.
We also analyse the impact of the description length on the score. We use InternLM to predict descriptions of 200, 50, 25 and 10 words, and compare them to the references. The results are shown in Figure 1 (right). As expected, the shorter descriptions result in a lower number of entailed and contradicting facts when matched against the longer reference. The extraction of facts from the shorter prediction results in higher entailment when matched against the longer one, as the related information can be found there; however, the absolute number of facts is lower than in the opposite direction.
Table 2 provides an overview of results with prompting the baseline models for descriptions of different lengths, and matching against the long reference or the short captions. These results also show that the lengths asked in the prompts are followed to a different degree by the different models. The results indicate that state of the art models provide usable results on this task, but also that the dataset is challenging enough to leave room for improvement. The results also show that shorter descriptions are able to capture the most relevant aspects of the content. The last two lines of the table provide results for the derived short captions, showing that the dataset provides compatibility with MSR-VTT style benchmarks.

| Model | Ref | Pred | Words | LLMFactsF1 | METEOR | BLEU-1 | BLEU-2 | ROUGE-L | CosPar | CosSent |
|---|---|---|---|---|---|---|---|---|---|---|
| VideoLLama | Ref | Long-200 | 148±31 | 0.496±0.146 | 0.212±0.035 | 0.390±0.091 | 0.245±0.073 | 0.296±0.052 | 0.751±0.113 | 0.670±0.087 |
| InternLM | Ref | Long-200 | 136±22 | 0.529±0.184 | 0.198±0.047 | 0.380±0.085 | 0.223±0.083 | 0.283±0.072 | 0.711±0.112 | 0.648±0.099 |
| Tarsier | Ref | Long-200 | 59±14 | 0.476±0.175 | 0.135±0.032 | 0.257±0.107 | 0.150±0.069 | 0.257±0.047 | 0.771±0.092 | 0.644±0.094 |
| InternLM | Ref | Medium-50 | 61±22 | 0.508±0.186 | 0.129±0.044 | 0.248±0.125 | 0.142±0.086 | 0.246±0.065 | 0.729±0.108 | 0.640±0.101 |
| InternLM | Ref | Short-25 | 49±30 | 0.473±0.207 | 0.106±0.051 | 0.173±0.153 | 0.097±0.092 | 0.205±0.072 | 0.682±0.125 | 0.653±0.116 |
| InternLM | Ref | Brief-10 | 10±3 | 0.431±0.230 | 0.033±0.015 | 0.002±0.014 | 0.001±0.007 | 0.085±0.034 | 0.629±0.121 | 0.686±0.124 |
| InternLM | ShortRefs | Brief-10 | 10±3 | - | 0.213±0.076 | 0.720±0.182 | 0.450±0.226 | 0.415±0.144 | - | - |
| Tarsier | ShortRefs | Brief-10 | 20±35 | - | 0.204±0.066 | 0.590±0.160 | 0.353±0.183 | 0.373±0.120 | - | - |

Table 2: Metrics obtained for differently prompted model outputs using the three baseline methods on the FM-V2T dataset. ShortRefs refers to the short version of the reference caption.

| Type | Text: Short-25 |
|---|---|
| Reference | A small airplane is flying across the sky, with a banner reading "BUSSI SUSI-LEO" trailing behind it. |
| Negation | A small airplane is flying across the sky, with [-a-] {+no+} banner reading "BUSSI SUSI-LEO" trailing behind it. |
| NameChange | A small airplane is flying across the sky, with a banner reading "[-BUSSI SUSI-LEO-] {+BUSSI JOSEF-MARIA+}" trailing behind it. |
| InternLM-XC | A helicopter is flying across a clear blue sky, with a banner reading "BUSSI SUSHI -LEO" trailing behind it. |
| Tarsier | A plane flies across the sky with a sign displaying 'BUSSI SUJU-LEON'. |
| VideoLLaMa | A small airplane flies over a runway, followed by a helicopter, both seen from a distance against a gray sky. |

Table 3: Generated descriptions of the video across models and edits. Text in red indicates contradicting and in green entailed statements. For the edited variants, red text marks deletions, and blue text insertions (rendered here without colour as [-deleted-] and {+inserted+}).

| Short-25 | LLMFactsF1 | METEOR | BLEU-1 | BLEU-2 | ROUGE-L | CosPar | CosSent |
|---|---|---|---|---|---|---|---|
| Negation | 0.585 | 0.594 | 0.941 | 0.907 | 0.941 | 0.967 | 0.967 |
| NameChange | 0.750 | 0.518 | 0.941 | 0.907 | 0.941 | 0.981 | 0.981 |
| InternLM | 0.400 | 0.402 | 0.722 | 0.618 | 0.747 | 0.722 | 0.722 |
| Tarsier | 0.635 | 0.186 | 0.385 | 0.304 | 0.468 | 0.908 | 0.908 |
| VideoLLaMa | 0.273 | 0.127 | 0.250 | 0.162 | 0.219 | 0.728 | 0.728 |

Table 4: Metrics obtained for the example in Table 3.
We also analyse the correlation of the proposed metric with baseline metrics for InternLM results on the FM-V2T dataset (Figure 4). Generally, the correlation is low, showing that the metrics assess different aspects of the descriptions. There is a weak correlation with the metrics using cosine similarity of the text embeddings, while there is almost no correlation with some of the traditional NLP metrics. Based on samples, we assume that this is due to cases where the syntactic and semantic impact of differences diverges.
[Figure 4 (scatter plot not reproduced). Horizontal axis: baseline metric (0.0 to 1.0). Vertical axis: LLMFactsF1 (0.0 to 1.0). Series: BLEU-1, BLEU-2, METEOR, ROUGE-L, CosPar, CosSent.]
Figure 4: Comparison of the proposed metric with baseline metrics for InternLM results on the FM-V2T dataset.
We have also run experiments for selecting prompts for the MLLMs. As is often the case in prompt engineering, it is hard to predict which changes will have substantial effects. We have thus created 30 prompts and ran experiments with them. We provide the list of prompts and an interactive plot in the dataset’s GitHub repository.
6 Conclusion
We have provided a bilingual dataset for the evaluation of video to text methods, compatible with the evaluation methods used for traditional captioning as well as MLLM-based methods. We have also proposed a novel LLM-based metric, and validated the metric in a range of experiments on the dataset. This also includes comparisons with other metrics on the outputs of three state of the art MLLMs on the proposed dataset. The dataset provides the annotations in two languages, but the machine translation and revision workflow we have used to obtain the German annotations can be efficiently replicated for other languages. In a similar way, the visual content description could be amended by speech to text.
Acknowledgments
This work has been funded partially by the Austrian Research Promotion Agency (FFG) under the Digital Technologies project FAIRmedia (https://www.joanneum.at/digital/en/projects/fairmedia/), and by European Union’s Horizon Europe programme under grant agreement n° 101070250 XRECO (https://xreco.eu/). The authors thank Georg Thallinger for his feedback on the paper.
A Dataset and Metric for Textual Video Content Description. MM '25, October 27–31, 2025, Dublin, Ireland