Universal CEFR: Enabling Open Multilingual Research on Language Proficiency Assessment

Author: Imperial, Joseph Marvin; Baraean, Abdullah; Stodden, Regina; Wilkens, Rodrigo; Muñoz Sánchez, Ricardo; Gao, Lingyun; R. Toribio, Melissa Esther; Reynolds, Robert; Ribeiro, Eugénio; Saggion, Horacio; Volodina, Elena; Vajjala, Sowmya; François, Thomas; Alv

Publisher: Zenodo

DOI: 10.18653/v1/2025.emnlp-main.491

Source: https://zenodo.org/records/17698633/files/2025.emnlp-main.491.pdf

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9715–9767
November 4-9, 2025 ©2025 Association for Computational Linguistics
UNIVERSALCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment
Joseph Marvin Imperial1,3, Abdullah Barayan2,14, Regina Stodden4,
Rodrigo Wilkens5, Ricardo Muñoz Sánchez6, Lingyun Gao7, Melissa Torgbi1,
Dawn Knight2, Gail Forey1, Reka R. Jablonkai1, Ekaterina Kochmar8,
Robert Reynolds9, Eugénio Ribeiro10,11, Horacio Saggion12,
Elena Volodina6, Sowmya Vajjala13, Thomas François7,
Fernando Alva-Manchego2, Harish Tayyar Madabushi1
1University of Bath, 2Cardiff University, 3National University Philippines,
4Bielefeld University, 5University of Exeter, 6University of Gothenburg, 7UCLouvain,
8MBZUAI, 9Brigham Young University, 10INESC-ID Lisboa,
11Instituto Universitário de Lisboa (ISCTE-IUL), 12Universitat Pompeu Fabra,
13National Research Council, Canada, 14King Abdulaziz University
CORRESPONDENCE: [email protected], [email protected]
Abstract
We introduce UNIVERSALCEFR, a large-scale multilingual and multidimensional dataset of texts annotated with CEFR (Common European Framework of Reference) levels in 13 languages. To enable open research in automated readability and language proficiency assessment, UNIVERSALCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modelling across tasks and languages. To demonstrate its utility, we conduct benchmarking experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UNIVERSALCEFR aims to establish best practices in data distribution for language proficiency research by standardising dataset formats, and promoting their accessibility to the global research community.
universalcefr.github.io
huggingface.co/UniversalCEFR
github.com/UniversalCEFR
1 Introduction
Language proficiency research plays a central role in education, and often intersects with advances in linguistics and artificial intelligence (AI). In natural language processing (NLP), language proficiency has been approached through well-established tasks such as automated readability assessment (ARA) and automated essay scoring (AES). ARA focuses on determining whether a given text matches the expected reading skills of language learners according to their level, whereas AES evaluates the writing skills of the learners as reflected in a text they have written. In this paper, we combine these tasks under the more generic term of language proficiency assessment, as it has varied practical applications in educational assessment and calibration of reading materials for learners (Xia et al., 2016; Harsch, 2014; Figueras, 2012) as well as for various NLP tasks (see use cases in Figure 1). A widely recognized standard for measuring second language (L2) proficiency is the Common European Framework of Reference for Languages (CEFR),1 developed by the Council of Europe. CEFR offers a language-independent guide for evaluating learners' abilities in reading, writing, listening, and speaking. It defines a six-level scale (A1, A2, B1, B2, C1, and C2) denoting increasing language competency (North, 2014, 2007).
1https://www.coe.int/en/web/common-european-framework-reference-languages
[Figure 1: The UniversalCEFR Dataset (Open · Multilingual · Multiformat · Multicategory · Multilevel · Multipurpose). Panels: Languages (13): Arabic, Czech, Dutch, English, Estonian, French, German, Hindi, Italian, Portuguese, Russian, Spanish, Welsh. Categories (2): Learner Text, Reference Text. Formats (4): Sentence-Level, Paragraph-Level, Document-Level, Dialogue-Level. Levels (6, CEFR-recognized): A1, A2, B1, B2, C1, C2. Use Cases: Readability Assessment, Feature Analysis, Essay Scoring, Text Simplification, Corpus Analysis, Story Generation, Learner Simulation, Style Transfer, Accessibility. Open: Permissive Licenses, Standardised Format, Machine-Readable.]
Figure 1: Overview of the contributions of the UNIVERSALCEFR dataset, highlighting its diverse structural coverage (spanning language, format, category, and CEFR level) as well as its accessibility and interoperability for downstream tasks and use cases enabled by permissive licenses and standardized data formats.
Resource | # Datasets Indexed | # Languages Covered | Data Types | Data Accessibility | Standard Format | Geographic Restrictions
CEFRLex | 7† | 6 | text | unrestricted | no | none
Corpora @ UCLouvain | 31† | 9 | text, audio, video | request per corpus | no | yes
CLARIN L2 Learner Corpora | 75† | 34 | text, video | request per corpus | no | yes
Learner Language (Språkbanken) | 15† | 13 | text, audio | request per corpus | no | yes
UNIVERSALCEFR | 26 | 13 | text | unrestricted | yes | none
Table 1: Comparison of existing language learning and language proficiency dataset collections with UNIVERSALCEFR. † indicates that only a subset of the corresponding resource in that repository contains CEFR labels. Among the five repositories, UNIVERSALCEFR is the only non-geo-locked and standardized collection, allowing seamless, unrestricted use for non-commercial research with proper attribution.
Recent advances in language proficiency assessment have moved from models relying on hand-crafted linguistic features to large language models (LLMs), which achieve high performance across diverse predictive and generative tasks through post-training techniques such as supervised fine-tuning (Devlin et al., 2019; Vaswani et al., 2017) or instruction tuning (Wei et al., 2022). This form of task generalization enables the modelling of complex linguistic patterns (e.g., features that make a text complex) within unified frameworks for assessing language proficiency on standardized scales like CEFR. Moreover, these models can also be extended to low-resource languages, potentially improving automatic assessment through techniques such as cross-lingual transfer (He and Li, 2024; Imperial and Kochmar, 2023a,b; Vajjala and Rama, 2018).
To fully leverage the potential of modern approaches to CEFR-level prediction, researchers require access to high-quality datasets with broad coverage across languages, proficiency levels, and text granularity. However, despite the long-standing use of CEFR in educational and NLP research, there are very limited standardized, machine-readable, and openly accessible collections of CEFR-annotated corpora, especially in terms of language coverage and granularity beyond sentence level (Naous et al., 2024). Moreover, most existing single-language resources are available in inconsistent or outdated formats (e.g., unprocessed text files, XML), which require extensive preprocessing and normalization. Finally, many datasets are restricted by copyright or licensing terms, limiting their accessibility for open research.
To this end, our work addresses the resource gap in CEFR-based language proficiency assessment research through the following contributions:
• We introduce UNIVERSALCEFR, a large-scale multilingual multidimensional open dataset composed of 505K CEFR-labeled texts across 13 languages, designed to advance multilingual research in language proficiency assessment.
• We propose a data standardization pipeline and annotation template to homogenize available CEFR-labeled texts, enhancing their interoperability and accessibility for researchers across domains.
• We provide a critical reflection of current practices in data sharing for language proficiency assessment resources and suggest pathways towards improvement using UNIVERSALCEFR as a case study for a more open, standardized initiative for resource development.
2 Background
Language Learning Databases and Resources. Language learning and language proficiency are research areas driven by the collection of two main types of data: reference-based data created by experts (e.g. reference reading materials) and learner-based data created by language learners (e.g. essays, conversations, and dialogue snippets). If a task requires it, such as in proficiency assessment, these corpora may undergo examination by language proficiency experts who will grade them based on a scale (e.g. CEFR). We list four community-recognized databanks and resource collections in the domain of language learning and proficiency assessment in Table 1. CEFRLex is a collection of machine-readable multilingual lexicon-based datasets in 6 European languages. The Corpora Hub hosted by UCLouvain, the Learner Language from Språkbanken Text (SBX), and the L2 Learning Corpora hosted by the Common Language Resources and Technology Infrastructure (CLARIN) are all large collections of general multilingual and multimodal language learner datasets. Not all corpora in these databases are annotated with CEFR labels, and each corpus is associated with a publication detailing how they were collected and built and their specific purpose in language learning research.
Access Restrictions and Data Privacy Regulations. Despite the existence of L2 resource collections as listed in Table 1, researchers cannot freely and openly use all datasets hosted in these repositories. CEFRLex,2 Corpora @ UCLouvain,3 CLARIN,4 and Språkbanken Text5 are hosted under European universities and institutions which means they are under the jurisdiction of EU Data Privacy Laws, particularly the General Data Protection Regulation (GDPR).6 Thus, learner texts from these collections, written based on personal interactions and containing Personally Identifiable Information (PII), can only be accessed through special legal coordination with the data maintainers. If access is granted, the licensee may also need to provide a proof of PII anonymization that produces a derivation distinct from the original dataset as done in Jentoft and Samuel (2023) for the ASK Corpus (Tenfjord et al., 2006) containing L2 Norwegian CEFR-labeled texts and the International Corpus of Learner Finnish (ICLFI) (Jantunen et al., 2013) containing L2 Finnish CEFR-labeled texts. Moreover, some datasets such as the SweLL Corpus (Volodina, 2024; Volodina et al., 2019, 2016) from Språkbanken Text, composed of Swedish L2 texts with CEFR levels, are geographically licensed and can only be used by institutions within the EU and EEA region. As such, these datasets remain off-limits to any researcher outside of Europe.
2https://cental.uclouvain.be/cefrlex/
3https://corpora.uclouvain.be/catalog/
4https://www.clarin.eu/resource-families/L2-corpora
5https://spraakbanken.gu.se/en/resources/learner-language
6https://gdpr-info.eu/
CEFR Assessment and Standardization. The majority of research on automatic classification (or ranking) of texts based on the CEFR scale tends to focus on single-language model evaluations (Ribeiro et al., 2024a; Wilkens et al., 2024, 2023, 2018; Tack et al., 2017; Volodina et al., 2016; Pilán et al., 2016; Vajjala and Lõo, 2014; Xia et al., 2016; Yancey et al., 2021; Vásquez-Rodríguez et al., 2022). This allows deeper investigation of language-specific nuances and intricacies connected to measuring text complexity. Meanwhile, other works have explored universal, language-agnostic features such as Azpiazu and Pera (2019); Arhiliuc et al. (2020); Caines and Buttery (2020); Vajjala and Rama (2018) where they used traditional word and PoS-ngram features to build a multi- and cross-lingual CEFR proficiency classifier for German, Czech, Italian, Spanish, and English, among others. He and Li (2024), on the other hand, focused on cross-lingual automatic essay scoring anchored on the CEFR scale, covering six languages (Czech, English, German, Italian, Portuguese, and Spanish).
In parallel with the rise of benchmarking studies of LLMs, similar efforts are growing in the CEFR-based language proficiency community. Two works in this direction include Naous et al. (2024), which introduced ReadMe++, a multilingual, multidomain dataset for sentence-level readability assessment on a CEFR scale covering five languages, while the iRead4Skills Project by Pintard et al. (2024) released a collection of written texts in French, Portuguese, and Spanish across multiple genres and levels patterned to CEFR. Likewise, in data collection standardization, CLARIN released the Core Metadata Schema for Learner Corpora (LC-meta), which aims to provide a structured method with a specific emphasis on capturing metadata of collected learner texts, focusing on learner background, context, and individual differences (Paquot et al., 2024).
3 The UNIVERSALCEFR Dataset
To support multilingual language proficiency research, we introduce UNIVERSALCEFR, a large-scale initiative that curates and standardizes open human-annotated CEFR-labeled corpora. Unifying diverse resources under a consistent format enables reproducible and scalable research across linguistics, NLP, and education. In this section, we outline the dataset's design principles, detail the data collection and standardization pipeline, provide key statistics, and present a linguistic feature analysis that supports downstream modelling.
3.1 Design Principles
Our methodology was guided by three key design principles.
Openness and Accessibility. In building UNIVERSALCEFR, we aim to demonstrate how data-driven research in language proficiency and assessment benefits from standardized, unified data formats. This enables portability and interoperability across domains with evolving data pipelines, such as language model pre-training in NLP. All corpora included in UNIVERSALCEFR are publicly available for non-commercial research through permissive licenses (e.g. Creative Commons). However, significant effort was required to collate and standardize these datasets, highlighting the need for standardization and improved accessibility.
Multilinguality and Structure Diversity. Although CEFR originated in Europe, it has been increasingly adopted as a reference framework for language proficiency assessment worldwide. Accordingly, UNIVERSALCEFR extends beyond European languages. Its current version includes 13 languages, spanning high-resource (English, Spanish, French, German, Italian, Portuguese), mid-resource (Dutch, Russian, Arabic), and low-resource (Czech, Estonian, Hindi, Welsh) languages. It also captures structural diversity by annotating each corpus with its production category (learner or reference), granularity (sentence, paragraph, document, or discourse), and label coverage (standard CEFR or CEFR plus levels).
Global Collaboration. From its conceptualization and planning, the UNIVERSALCEFR initiative involved close collaboration among 20 researchers in language proficiency assessment, NLP, and education from 13 institutions across nine countries (UK, Canada, USA, Germany, Sweden, UAE, Spain, Belgium, and Portugal).7 They all played a key role in defining the standardization protocol, designing evaluation experiments, and discussing future research directions. These collaborative decisions are detailed in the following sections.
7As CEFR is a European framework, most active researchers in the field are based in Europe.
3.2 Data Collection
This section outlines the corpus selection criteria and the standardization methods used in UNIVERSALCEFR for acquiring and consolidating a large and diverse collection of resources.
Corpora Selection. The inclusion of datasets in UNIVERSALCEFR is guided by three criteria:
1. Public Accessibility: Datasets must be available under a permissive license for non-commercial research (e.g., Creative Commons, CC-BY-NC), or be in the public domain and acquirable through direct download or via a request form for usage tracking.
2. Gold-Standard CEFR Labels: Datasets must include CEFR annotations produced or validated by domain experts, such as language teachers or proficiency researchers, particularly in the case of learner texts.
3. Human Authorship: All texts must be written by humans to ensure suitability for research involving creative, multilingual, multi-level, and multi-genre content. As of this writing, UNIVERSALCEFR does not include machine-generated texts.
The full list of consolidated corpora that meet all three UNIVERSALCEFR inclusion criteria is provided in Table 22 in the Appendix.
Standardization Process. To ensure interoperability, transformation, and machine readability, we standardized the collected datasets by preprocessing their varied source formats into a unified structure. We adopted JSON as the per-instance format and defined eight metadata fields considered essential for each CEFR-labeled text. These fields include the source dataset, language, granularity (document, paragraph, sentence, discourse), production category (learner or reference), and license. Full descriptions and predetermined values used for each field are provided in Table 15. The final standardized dataset is available from the HuggingFace Dataset repository.8 A key challenge was the lack of a unified format across the language proficiency community. Source corpora came in various formats, including plain text (e.g., csv, tsv, txt), spreadsheets (e.g., XLSX, XLS), markup (e.g., XML), and PDFs requiring manual extraction. This challenge further motivates the need for unified data aggregation initiatives that UNIVERSALCEFR aims to help establish.
8https://huggingface.co/UniversalCEFR/datasets
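The per-instance JSON idea described in the Standardization Process paragraph can be illustrated with a minimal sketch. The field names below (`text`, `cefr_level`, `source_dataset`, and so on) are illustrative assumptions and are not the schema of Table 15; only the kinds of metadata (source dataset, language, granularity, production category, license) come from the text.

```python
import json

# Hypothetical field names and value sets; the actual schema is defined in Table 15.
ALLOWED_GRANULARITY = {"document", "paragraph", "sentence", "discourse"}
ALLOWED_CATEGORY = {"learner", "reference"}

def standardize(text, cefr_level, source, language, granularity, category, license_name):
    """Return one JSON-serializable record in a unified per-instance structure."""
    granularity = granularity.strip().lower()
    category = category.strip().lower()
    if granularity not in ALLOWED_GRANULARITY:
        raise ValueError(f"unknown granularity: {granularity!r}")
    if category not in ALLOWED_CATEGORY:
        raise ValueError(f"unknown category: {category!r}")
    return {
        "text": text.strip(),
        "cefr_level": cefr_level.strip().upper(),
        "source_dataset": source,
        "language": language.strip().lower(),
        "granularity": granularity,
        "category": category,
        "license": license_name,
    }

# A made-up source row, normalized and serialized as one JSON line.
record = standardize("  Das ist ein Haus. ", "a1", "example-corpus", "DE",
                     "Sentence", "Reference", "CC-BY-NC 4.0")
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Validating the closed-vocabulary fields at conversion time is one way to keep heterogeneous sources (CSV, XLSX, XML) from leaking inconsistent values into the unified collection.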
UNIVERSALCEFR | # of Instances
-FULL* | 505,807
(out-of-scope instances) | 11,316
-FULL | 494,491
-TRAIN | 435,919
-DEV | 54,107
-TEST | 4,465
Table 2: Data splits of UNIVERSALCEFR. FULL* denotes all instances, including those with CEFR labels that we currently do not recognize for the task (e.g., NA, A+, B). These were excluded from the TRAIN, DEV, and TEST sets used in our experiments.
3.3 Dataset Statistics
The final UNIVERSALCEFR collection comprises 505,807 CEFR-labeled texts across 13 languages and 4 scripts (Latin, Arabic, Devanagari, and Cyrillic). Tables 2 and 3 show the overall dataset size, its splits and breakdown per CEFR level per language. We identified 11,316 instances with invalid or out-of-scope labels (e.g., NA, A+, B) outside the six recognized CEFR labels (A1–C2) and duplicates, which were removed before splitting UNIVERSALCEFR into TRAIN, DEV, and TEST. For the TEST, we set a cap of 200 instances per language and per granularity level. Additional dataset statistics can be found in Appendix A.
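The filtering and splitting steps described above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' procedure: the text specifies only the removal of out-of-scope labels and duplicates and the cap of 200 TEST instances per language and granularity. The record keys, the duplicate criterion (same language and text), and the `test_one_in` and `dev_one_in` ratios are hypothetical.

```python
import random
from collections import defaultdict

VALID_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}

def clean(records):
    """Drop out-of-scope labels (e.g. NA, A+, B) and exact duplicate texts."""
    seen, kept = set(), []
    for rec in records:
        key = (rec["language"], rec["text"])
        if rec["cefr_level"] not in VALID_LEVELS or key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept

def split(records, test_cap=200, test_one_in=5, dev_one_in=10, seed=0):
    """Split per (language, granularity) group, capping TEST at `test_cap`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["language"], rec["granularity"])].append(rec)
    train, dev, test = [], [], []
    for key in sorted(groups):
        items = groups[key][:]
        rng.shuffle(items)
        # Assumed rule: at most one fifth of a group, never more than the cap.
        n_test = min(test_cap, len(items) // test_one_in)
        rest = items[n_test:]
        n_dev = len(rest) // dev_one_in
        test += items[:n_test]
        dev += rest[:n_dev]
        train += rest[n_dev:]
    return train, dev, test

# Toy data: 1,500 unique English sentences, one duplicate, one out-of-scope label.
records = [
    {"text": f"text {i}", "language": "en", "granularity": "sentence", "cefr_level": "A1"}
    for i in range(1500)
]
records.append(dict(records[0]))
records.append({"text": "odd label", "language": "en",
                "granularity": "sentence", "cefr_level": "A+"})

cleaned = clean(records)
train, dev, test = split(cleaned)
print(len(cleaned), len(train), len(dev), len(test))  # 1500 1170 130 200
```

Grouping by language and granularity before sampling keeps the TEST split from being dominated by the largest subsets, which matters given how uneven the per-language counts are.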
LANG A1 A2 B1 B2 C1 C2
EN 192,596 132,614 66,425 23,266 8,004 795
ES 8,282 8,648 6,835 5,061 3,224 0
DE 319 15,970 15,630 474 130 426
NL 51 216 782 738 219 85
CS 1 188 165 81 4 0
IT 29 381 394 2 0 0
FR 151 390 575 478 293 126
ET 0 395 588 407 307 0
PT 314 325 367 233 112 72
AR 81 259 625 645 361 183
HI 263 283 286 263 222 174
RU 402 293 409 326 237 91
CY 764 608 0 0 0 0
Total 203,253 160,570 93,081 31,974 13,113 1,952
Table 3: Data statistics of UNIVERSALCEFR-FULL in terms of recognized CEFR levels (A1, A2, B1, B2, C1, C2) across the 13 target languages.
4 Linguistic Feature Analysis
We aim to examine how well a broad set of linguistic features aligns with CEFR proficiency levels across languages in UNIVERSALCEFR. We extracted a set of 100 linguistic features, grouped into morphosyntactic (62), syntactic (18), length-based (11), lexical (4), readability (2), psycholinguistic (2), and discourse (1) categories. A complete and detailed list is available in Appendix E.
[Figure 2: heatmap of feature-by-language correlations. Feature labels include concreteness, imageability, get_type_token_ratio, get_ratio_of_punctuation, count_characters_per_sentence, count_syllables_in_sentence, count_characters, doc_num_tokens, and count_words_per_sentence; the other axis lists language codes.]
Figure 2: Highly correlated linguistic features occurring in at least three languages. Blue features lean towards positive correlation, while red features denote negative correlation. For brevity, these top features are those lying at the extreme ends of the correlation spectrum.
4.1 Correlation Across All Languages
Considering the absolute Spearman correlation between the features and the CEFR level (selecting values with p < 0.05 and ρ > 0.3 on average across all languages), the strongest associations were found in length-based measures, such as characters per sentence and syllables per sentence. Several grammatical complexity features, including parse tree height and phrase length, showed moderate correlations. Readability indices (FKGL and Flesch Reading Ease) also displayed moderate correlations in the expected direction. Psycholinguistic features, such as concreteness and imageability, were negatively correlated with proficiency, indicating a shift toward more abstract language at higher levels. Finally, morphosyntactic features regarding voice, tense, and number showed moderate but consistent correlations, supporting their relevance in reflecting syntactic development.
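The correlation analysis above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: CEFR levels are mapped to ordinal ranks 1 to 6, the feature values are made-up toy numbers, and the significance test (p < 0.05) is omitted. In practice a library routine such as `scipy.stats.spearmanr`, which also returns a p-value, would likely be used.

```python
from statistics import mean

# Assumed ordinal encoding of the six CEFR levels.
LEVEL_TO_RANK = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy values only: one text per level, with two illustrative features.
levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
y = [LEVEL_TO_RANK[lvl] for lvl in levels]
features = {
    "count_characters_per_sentence": [38, 52, 61, 80, 97, 120],
    "concreteness": [4.1, 3.9, 3.6, 3.2, 3.0, 2.7],
}
rhos = {name: spearman(vals, y) for name, vals in features.items()}
# Keep features whose absolute correlation exceeds the 0.3 threshold.
selected = [name for name, rho in rhos.items() if abs(rho) > 0.3]
print(rhos, selected)
```

Thresholding on the absolute value keeps negatively correlated features such as concreteness, whose sign matches the shift toward abstract language described above.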

4.2 Correlation By CEFR Level
To assess the consistency of feature relevance across languages, we examined the number of features with significant correlations (p < 0.05) with CEFR levels per language as visualized in Figure 2. The results revealed notable variations. Languages such as Czech (CS), Estonian (ET), and Italian (IT) showed a high number of relevant features, suggesting strong alignment between the selected linguistic features and CEFR progression in these languages. English (EN), Spanish (ES), French (FR), Hindi (HI), and Russian (RU) showed moderate coverage, with a reasonable number of features exceeding the 0.3 correlation threshold. In contrast, Arabic (AR), Dutch (NL), and Portuguese (PT) exhibited weak coverage, while Welsh (CY) and German (DE) had very few or no features with relevant correlations, indicating a limited match between the current feature set and CEFR levels for those languages. Furthermore, a few features are only relevant for a few languages, e.g., the translative case for only Estonian, negative verb polarity for only Czech, or genitive case for only Czech, Estonian, and Russian. This variability highlights the influence of language-specific properties on the effectiveness of general feature-based models for proficiency prediction.
5 CEFR Level Classification
Given the availability of gold-standard CEFR labels and the linguistic diversity of the UNIVERSALCEFR dataset, we define our primary experimental task as multiclass, multilingual CEFR level classification. The goal is to predict one of the six CEFR levels (A1–C2) for a given text instance in any of the 13 supported languages. We evaluate three modeling paradigms: feature-based classification, fine-tuning of multilingual pre-trained models, and prompting LLMs.
5.1 Feature-Based Models
We evaluated two widely-used classification models from Scikit-Learn (Pedregosa et al., 2011): Random Forest (RANDFOREST) and Logistic Regression (LOGREGR). Both models were trained on the linguistic features described in Section 4, using Scikit-Learn's default hyperparameter settings. We experimented with two feature configurations: one using all 100 features (ALLFEATS) and another using an automatically selected subset of top-performing features across all languages (TOPFEATS). Appendices E.1 and E.2 detail the linguistic feature information for both setups.
5.2 Fine-tuned Models
We used three BERT-based models with varying degrees of multilingual coverage: ModernBERT (Warner et al., 2024), a monolingual English model with 395M parameters; EuroBERT (Boizard et al., 2025), a multilingual model trained on 15 diverse European and non-European languages, with 210M parameters; and XLM-R (Conneau et al., 2020), a massively multilingual model supporting 100 languages, with 279M parameters. Each model was fine-tuned for three epochs, with the best checkpoint selected based on the highest weighted F1 score on the validation set. Additional details can be found in Appendix Table 17.
5.3 Descriptor-Based Prompting
We evaluated three instruction-tuned models: Gemma 1 (Gemma Team, 2024), an English-centric model with 7B parameters; Gemma 3 (Gemma Team, 2025), a multilingual model trained on 140+ global languages with 12B parameters; and EuroLLM (Martins et al., 2024), a multilingual model trained on 15 European-centric languages with 9B parameters. We explored five prompting strategies, ranging from no context to setups using CEFR level descriptors for reading comprehension and written production, either in English or in specific languages. The prompt configurations are as follows:
• BASE. Generic prompting with no CEFR level descriptors as context.
• EN-READ. CEFR level descriptors for reading comprehension in English used as context.
• EN-WRITE. CEFR level descriptors for written production in English used as context.
• LANG-READ. CEFR level descriptors for reading comprehension, translated to the target language being assessed, used as context.
• LANG-WRITE. CEFR level descriptors for written production, translated to the target language being assessed, used as context.
All CEFR descriptors were retrieved from the official CEFR website. Prompt templates and hyperparameter values for each setup are detailed in Table 18 and Appendix I.
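The five prompt configurations listed above differ only in which descriptor table, if any, is placed in the context. A minimal sketch of that selection logic follows. The instruction wording, the `DESCRIPTORS` store, and the placeholder descriptor strings are all hypothetical; the actual templates are those referenced in Table 18 and Appendix I, and the real descriptor text is not reproduced here.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
SETUPS = ("BASE", "EN-READ", "EN-WRITE", "LANG-READ", "LANG-WRITE")

def placeholder_descriptors(lang, skill):
    """Stand-in for real descriptor text, which is not reproduced here."""
    return {lvl: f"<{lang} {skill} descriptor for {lvl}>" for lvl in LEVELS}

# Hypothetical store keyed by (language code, skill).
DESCRIPTORS = {
    (lang, skill): placeholder_descriptors(lang, skill)
    for lang in ("en", "de")
    for skill in ("read", "write")
}

def build_prompt(text, setup, target_lang, descriptors=DESCRIPTORS):
    """Assemble a prompt for one of the five setups named in SETUPS."""
    if setup not in SETUPS:
        raise ValueError(f"unknown setup: {setup!r}")
    parts = ["Classify the CEFR level of the text below. "
             "Answer with one of: " + ", ".join(LEVELS) + "."]
    if setup != "BASE":
        scope, skill = setup.split("-")  # e.g. "LANG-READ" -> ("LANG", "READ")
        lang = "en" if scope == "EN" else target_lang
        table = descriptors[(lang, skill.lower())]
        listing = "\n".join(f"{lvl}: {table[lvl]}" for lvl in LEVELS)
        parts.append("CEFR level descriptors:\n" + listing)
    parts.append("Text:\n" + text)
    return "\n\n".join(parts)

prompt = build_prompt("Ich wohne in Berlin.", "LANG-READ", target_lang="de")
print(prompt)
```

Keeping the descriptor lookup separate from the template makes it straightforward to compare setups on the same text, since only the context block changes between runs.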
5.4 Evaluation Metrics
We use weighted F1 as the primary evaluation metric across all experiments. This accounts for the class imbalance in CEFR level distribution and granularity across language subsets in UNIVERSALCEFR-TEST. Using accuracy in the experiments would produce misleading performance in favor of any majority class.
MODEL & SETUP EN ES DE NL CS IT FR ET PT AR HI RU CY Avg
BASELINE
MOST FREQUENT CLASS 7.39 18.1 26.8 21.4 23.8 35.5 16.3 15.9 10.0 23.3 7.28 10.7 33.4 19.3
GEMMA1-7B (ENGLISH)
BASE 21.8 26.0 40.6 32.1 44.0 57.3 32.2 39.0 14.0 28.9 25.0 34.8 48.7 34.2
EN-READ 20.5 28.3 31.0 23.5 53.6 41.0 22.7 24.9 27.2 29.5 8.4 18.0 55.7 29.6
EN-WRITE 19.8 24.5 34.5 29.3 51.9 57.7 27.7 42.7 22.2 20.8 14.0 27.6 52.1 32.9
LANG-READ 20.5 29.3 35.1 37.8 55.3 48.0 27.1 44.6 20.2 32.2 12.8 26.2 52.8 34.0
LANG-WRITE 19.8 29.8 32.6 34.0 49.9 61.7 26.3 46.3 21.2 36.7 12.7 26.9 53.6 34.7
GEMMA3-12B (MULTI)
BASE 28.8 35.0 42.2 47.0 42.6 65.2 38.1 39.5 24.6 41.8 28.7 29.7 40.9 38.8
EN-READ 19.3 25.5 35.8 25.5 18.5 22.9 29.3 26.0 9.8 33.3 14.8 21.2 20.5 23.3
EN-WRITE 26.6 36.7 46.4 46.7 50.1 77.4 40.5 43.8 27.3 48.6 24.0 37.4 52.4 43.2
LANG-READ 19.3 28.1 35.2 37.6 50.9 64.8 35.0 30.4 26.1 29.5 20.5 32.5 61.6 36.3
LANG-WRITE 26.6 33.2 38.3 39.6 55.0 76.4 37.7 42.4 25.4 38.0 24.6 31.5 53.7 40.2
EUROLLM-9B (MULTI)
BASE 18.6 25.4 28.0 29.1 25.0 39.9 25.9 32.0 16.4 34.3 12.7 15.1 14.4 24.4
EN-READ 23.1 26.9 38.1 30.2 33.3 41.9 24.5 33.6 19.9 33.8 18.0 21.8 26.4 28.6
EN-WRITE 21.5 26.2 29.8 32.0 32.4 33.1 26.8 32.8 21.1 31.8 17.7 17.5 24.5 26.7
LANG-READ 23.1 27.0 32.7 31.8 29.8 32.9 28.3 28.6 16.8 32.4 14.3 16.2 17.3 25.5
LANG-WRITE 21.5 28.5 35.1 30.1 30.8 30.6 27.6 29.9 16.5 35.2 21.0 16.1 8.80 25.5
FINE-TUNED MODELS
MODERNBERT (ENGLISH) 75.8 71.8 72.1 54.2 66.9 82.7 47.2 88.3 33.5 30.8 51.6 48.9 73.2 61.3
EUROBERT (MULTI) 74.6 72.0 70.6 53.2 63.9 79.7 42.0 86.6 32.1 35.4 44.7 45.9 79.9 60.0
XLM-R (MULTI) 75.5 69.6 73.2 59.0 68.8 83.2 51.6 88.8 29.2 43.0 52.8 49.6 72.6 62.8
FEATURE-BASED MODELS
RANDFOREST (TOPFEATS) 62.0 57.6 64.9 54.5 69.5 79.9 44.1 84.2 27.8 43.8 44.1 47.2 72.9 57.9
RANDFOREST (ALLFEATS) 63.4 60.6 65.4 53.0 69.2 79.3 41.4 84.2 26.4 42.8 46.8 47.8 78.2 58.3
LOGREGR (ALLFEATS) 32.1 28.2 50.9 47.1 62.9 81.9 41.7 67.5 23.1 34.1 47.8 41.1 63.8 47.9
LOGREGR (TOPFEATS) 30.4 29.7 52.5 44.1 62.7 82.7 40.3 67.5 22.7 33.5 48.4 41.1 59.2 47.3
Table 4: Full weighted F1 performance results from the multilingual and English-centric model evaluation experiments using three setups (feature-based, fine-tuning, and prompting) and using UNIVERSALCEFR-TEST split across the 13 languages. Boldfaced values indicate the highest scores overall per model setup, while underlined values highlight the highest scores for each model setup within each language.
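The preference for weighted F1 over accuracy stated in Section 5.4 can be illustrated with a small worked example. The sketch below implements the usual support-weighted average of per-class F1 in plain Python (the same definition as the `average="weighted"` option of Scikit-Learn's `f1_score`) and applies it to a made-up imbalanced label set with a majority-class predictor; the toy labels are assumptions for illustration only.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is > 0 because n > 0.
        total += n * (2 * tp / (2 * tp + fp + fn))
    return total / len(y_true)

def accuracy(y_true, y_pred):
    """Share of instances whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy imbalanced test set and a predictor that always outputs the majority class.
y_true = ["A1"] * 4 + ["A2"] * 2 + ["B1"] * 2 + ["B2"] * 2
majority = ["A1"] * len(y_true)

print(accuracy(y_true, majority))     # 0.4
print(weighted_f1(y_true, majority))  # 8/35, roughly 0.229
```

The majority-class predictor gets 40% accuracy while learning nothing about three of the four levels; weighted F1 penalizes it through the precision of the A1 class and the zero F1 of every other class.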
6 Results
6.1 Model-Based Performance Comparison
Table 4 shows that, in terms of overall average performance across languages, the fine-tuned setup with ModernBERT, EuroBERT, and XLM-R achieved the highest weighted F1 score range (≈60%-62.8%) outperforming feature-based models (≈47%-58%) and prompting (≈23%-43%). Among the LLM-based approaches (prompting and fine-tuning), models trained on broader multilingual corpora generally performed better. For instance, XLM-R, which supports 100 languages, was the top performer, followed by EuroBERT (15 languages) and ModernBERT (English-only). A similar trend was observed in prompting: Gemma 3, trained on 140+ languages, outperformed EuroLLM (15 languages) and the English-centric Gemma 1, achieving the best prompting score of 43.2. These findings are consistent with previous work (Naous et al., 2024; Shardlow et al., 2024; Colla et al., 2023; Yuan and Strohmaier, 2021), reinforcing the usefulness of multilingual models for language proficiency assessment tasks. One limitation of our experimental setup, however, is that we did not include language-specific pre-trained models for languages other than English, which may have further improved performance for low- and mid-resource languages.
MODEL SENT PARA DOC ALL
GEMMA1 19.41 42.74 30.81 33.63
GEMMA3 38.71 43.12 39.62 42.33
XLM-R 62.67 66.38 71.12 65.92
RANDFOREST-ALL 56.88 62.77 64.58 61.38
RANDFOREST-TOP 53.89 62.98 64.94 60.50
Table 5: Weighted F1 scores of top-performing unique model evaluation setups across granularities available for all languages.
6.2 Granularity-Level Comparison
Table 5 highlights clear performance differences across text granularities (sentence, paragraph, and document) for all models, but more prominently for the Gemma models under prompting. Gemma 1, in particular, tends to over-predict lower CEFR levels (A1–B1) on sentence-level data, whereas its predictions on document-level subsets are more evenly distributed and better aligned with ground truth distributions. This suggests that prompt-based methods may require longer texts to make more accurate predictions, unlike models trained or fine-tuned on the respective datasets. Other models, such as XLM-R and Random Forest, show better results on document (≈64%-71%) and paragraph-level data (≈62%-66%) than sentence-level data (≈53%-62%), which was shown to be a more difficult task in previous work on readability (Dell'Orletta et al., 2011; Vajjala and Meurers, 2014). Regarding language-specific differences, among English, German, Welsh, and French, the best performance with the fine-tuned XLM-R model is seen with the paragraph-level dataset for English, the document-level dataset for German, and the sentence-level dataset for Welsh and French. Similar variations can be observed for other languages with more than one level of granularity (see Table 19). No single granularity or model shows consistently better performance across all tested languages. These results are likely due to the distribution of excerpts across granularity levels in each language (see Table 7 in Appendix A).
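The weighted F1 scores compared in Table 5 can be illustrated with a minimal sketch. It assumes the standard definition (per-class F1 averaged with weights proportional to each class's gold support); the function name `weighted_f1` and the toy labels are hypothetical and are not taken from the released evaluation code.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Per-class F1 averaged with weights equal to each class's share of gold labels."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, n_gold in support.items():
        true_pos = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        n_pred = sum(1 for p in pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_gold
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_gold / total  # weight by gold support
    return score


# Toy example (illustrative labels only, not data from the paper).
gold = ["A1", "A1", "A2", "B1"]
pred = ["A1", "A2", "A2", "B1"]
print(round(weighted_f1(gold, pred), 4))
```

Because the weights follow the gold distribution, frequent levels dominate the score, which is one reason results on skewed subsets should be read alongside the per-level counts in Appendix A.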
6.3 Learner-Reference Comparison
Four languages in UNIVERSALCEFR contain both learner and reference texts: Arabic, German, English, and Spanish. Table 6 reports the average weighted F1 performance difference between the two categories across the four languages. For German, performance is comparable between learner and reference texts (≈71–74%). In contrast, English and Spanish show higher performance on learner texts (83% and 98%) than on reference texts (58% and 42%, respectively). Arabic displays the opposite trend: results on reference texts (54%) are much higher than those for learner texts, where the best results were obtained by Gemma 3 (41%). One possible explanation is that Gemma 3 may have been exposed to more Arabic content in its pre- and post-training phases.

LANGUAGE   LEARNER   REFERENCE
AR         41.92†    54.69
DE         71.14     74.39
EN         83.41     58.24
ES         97.99     42.72
Table 6: Average performances of the best models on learner text versus reference text across languages. † indicates performance with Gemma 3, and the rest refer to performance of the XLM-R model. Only these four languages have both learner and reference texts.
7 Discussion
We discuss potential pathways through which UNIVERSALCEFR can serve as a model, and offer key considerations for advancing data accessibility in language proficiency research.
Critical Reflections of Current Practices. The multiregional and multidisciplinary effort behind UNIVERSALCEFR exposed significant inconsistencies and critical gaps in building CEFR-labeled language proficiency assessment corpora. Upon examination of annotation practices, there appears to be no standard method for conducting expert annotations, including inconsistent use of inter-annotator agreement metrics and unclear guidelines on the number of annotators required to achieve reliable agreement. This is reflected in the UNIVERSALCEFR dataset itself, where nearly half of the corpora lack information on the annotators involved and their agreement scores. We posit that this may be due to diverse judgments of what constitutes high-quality data that does not require further human annotations.
In terms of language coverage, UNIVERSALCEFR includes nine (EN, ES, DE, NL, CS, IT, FR, ET, PT) of the 24 recognized European languages. As a result, researchers working on these nine languages now have access to open, standardized data for CEFR-based language proficiency assessment. The remaining 15 languages represent valuable opportunities for future expansion through collaborative efforts. While our open data and standardization initiative is a step towards addressing current challenges in interoperability and accessibility of resources, similar parallel efforts are needed in areas such as annotation and evaluation practices to ensure sustained progress in the language proficiency assessment community.
Need for Pro-Research Data Sharing Policies. As generative AI, particularly LLMs, becomes more ubiquitous, organizations that create valuable data for language proficiency assessment, such as publishers, educational institutions, and media outlets, are growing more cautious about how their resources are used. A major concern is the risk of data being used to train proprietary generative models, especially when such models are only accessible via commercial APIs that require transferring evaluation corpora to external servers. An example is the TCFLE-8 corpus (Wilkens et al., 2023) containing CEFR-labeled essays hosted by France Education International. Researchers seeking access to this dataset must explicitly specify that the resource will not be processed through commercial APIs to prevent potential data harvesting. To address these concerns, we believe the community needs to agree on a unified pro-research data sharing policy with clear usage guidelines for academic, non-commercial studies that require analysis of protected data with generative AI models without training on them.
Linguistic Features and Fine-tuning Still Matter. While recent advances in LLMs keep transforming NLP research, our multilingual and multidimensional experiments in Section 6 reaffirm the continued value of linguistic features for traditional ML classifiers and fine-tuning pre-trained models in language proficiency assessment. We observe common patterns where higher distribution and instance count lead to better results using these two setups (see performances on Spanish, English, and German subsets in Table 4) over prompting with CEFR descriptors. Moreover, using linguistic features in language proficiency assessment allows deeper analysis of language interactions with variables such as complexity, as seen in Appendix C. Given these insights, we encourage further efforts in the expansion of existing but low-resource language datasets with CEFR labels, as well as the exploration of features to better model morphologically-rich languages (e.g., Estonian and Portuguese). Together, these recommendations bridge current observed model failures to practical approaches in improving multilingual CEFR proficiency assessment.
8 Conclusion and Future Directions
In this work, we introduced UNIVERSALCEFR, a large-scale, open, multilingual, multidimensional dataset comprising 505,807 CEFR-annotated texts across 13 languages developed through global collaboration. Our findings from diverse model experiments with CEFR level prediction provide strong support for the utility of linguistic features and fine-tuning multilingual models in language proficiency assessment. Similarly, our critical analysis of the current data and resource-building practices emphasized the need for similar initiatives from the community, and pro-research data sharing policies in the advent of generative AI to remove barriers to accessibility without compromising data privacy and intellectual property.
Beyond its data and technical contributions, UNIVERSALCEFR also carries broader sociolinguistic significance. UNIVERSALCEFR addresses the growing linguistic inequality in modern AI development by focusing on underrepresented languages alongside English. We hope this initiative can lead to more responsible AI development that actively resists the growing linguistic centralization around English in global AI research, a modern Matthew effect (Merton, 1988) in which well-resourced languages receive disproportionate technological attention while smaller languages (like Czech or Welsh) are left behind (Masciolini et al., 2025). UNIVERSALCEFR is a strong step towards mitigating the Matthew effect in language proficiency assessment research.
Limitations
We discuss several limitations of our work on UNIVERSALCEFR and how researchers can consider these directions to develop the resource further.
Natural Data Disparity in Experiments. From the statistics presented in Tables 3 and 7 of UNIVERSALCEFR, it is expected that not all languages have the exact same distribution of data across dimensions, including formats (sentence-, paragraph-, document-, and dialogue-level) and category (reference and learner texts). Hence, our main experi-
LANG    SENT     PARA      DOC      DIAG
EN      12,826   409,362   1,837    0
ES      0        713       31,355   0
DE      26,244   1,033     5,673    0
NL      0        0         3,596    0
CS      0        441       0        0
IT      0        813       0        0
FR      1,669    0         344      0
ET      0        420       1,277    0
PT      0        1,423     0        0
AR      1,945    215       0        0
HI      1,491    0         0        0
RU      1,758    0         0        0
CY      1,107    109       41       115
Total   47,040   414,529   44,123   115
Table 7: Data statistics of UNIVERSALCEFR-FULL in terms of levels (sentence, paragraph, document, dialogue) across the 13 target languages.
A Full Data Statistics
Tables 7, 9, 11 and 13 report the quantity of CEFR-labeled texts across granularity levels per language, and Tables 3, 8, 10 and 12 reflect their counterparts in terms of CEFR level coverage. In forming the TEST split, we randomly sampled CEFR-labeled text instances per language per granularity level, while setting a cap of 200. This allows us to have a sizeable representation of UNIVERSALCEFR while maintaining efficiency for inference with LLMs. In total, we have 4,465 CEFR-labeled instances for UNIVERSALCEFR-TEST, which is comparable to the general sizes of benchmark test sets from previous works related to language proficiency (Naous et al., 2024; Zhang et al., 2024; Imperial and Tayyar Madabushi, 2024). For the TRAIN and DEV sets for fine-tuning and feature-based classification, we split the FULL subset (minus the TEST set) into a 90%-10% partition, respectively.
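The splitting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released script: the function name `make_splits`, the `seed` value, and the choice to shuffle before taking the first 200 items per group are hypothetical; only the cap of 200 per language and granularity and the 90%-10% TRAIN/DEV partition come from the text.

```python
import random
from collections import defaultdict


def make_splits(instances, cap=200, dev_ratio=0.1, seed=42):
    """Sample up to `cap` TEST items per (language, format) group, then split the rest 90/10."""
    rng = random.Random(seed)  # seed is an assumption made for reproducibility
    groups = defaultdict(list)
    for inst in instances:
        groups[(inst["lang"], inst["format"])].append(inst)

    test, rest = [], []
    for key in sorted(groups):
        items = groups[key][:]
        rng.shuffle(items)
        test.extend(items[:cap])   # groups smaller than the cap go entirely to TEST
        rest.extend(items[cap:])

    rng.shuffle(rest)
    n_dev = round(len(rest) * dev_ratio)
    dev, train = rest[:n_dev], rest[n_dev:]
    return train, dev, test


# Toy corpus: one large group and one group below the cap (illustrative only).
corpus = [{"lang": "en", "format": "sentence-level", "text": f"s{i}"} for i in range(1000)]
corpus += [{"lang": "cy", "format": "dialogue-level", "text": f"d{i}"} for i in range(50)]
train, dev, test = make_splits(corpus)
print(len(train), len(dev), len(test))
```

Under this reading, a group with fewer than 200 instances contributes nothing to TRAIN or DEV, which is consistent with the Welsh paragraph, document, and dialogue counts appearing only in the TEST tables.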
B Coverage of Large Language Models
In Table 14, we map each model's language coverage or language support based on its respective release papers and publications. Language support means what specific languages have been added and in substantial quantities in a model's training data (e.g., multilingual Wikipedia data dumps for pretraining XLM-R (Conneau et al., 2020)).
LANG    A1        A2        B1       B2       C1       C2
EN      173,005   119,335   59,634   20,746   7,122    675
ES      4,577     4,989     4,051    3,007    1,707    0
DE      273       13,208    12,996   346      108      308
NL      18        93        323      277      84       33
CS      1         92        77       38       2        0
IT      17        261       267      1        0        0
FR      106       302       404      335      210      98
ET      0         266       406      293      215      0
PT      204       62        270      59       80       0
AR      62        207       407      445      285      153
HI      203       219       223      203      182      145
RU      327       234       331      256      192      69
CY      463       332       0        0        0        0
Total   179,256   139,600   79,389   26,006   10,187   1,481
Table 8: Data statistics of UNIVERSALCEFR-TRAIN in terms of recognized CEFR levels (A1, A2, B1, B2, C1, C2) across the 13 target languages.
C Language-Specific Analysis
We provide in-depth analysis of model performances from the experiments in Section 5 across multiple dimensions of UNIVERSALCEFR on results of selected languages that we are qualified to interpret.
English. Analysis of model performance shows that using fine-tuned models and linguistic feature-based classification (62%-75%) obtains the best performance compared to prompting with instruction-tuned LLMs (19%-28%). However, these models tend to provide distinct patterns of specific CEFR labels. For the prompting setup, Gemma1, Gemma3, and EuroLLM models tend to give labels within the A1 and B1 range, while fine-tuned and feature-based models tend to lean towards the B1 and B2 range. For the pre-trained and instruction-tuned models, this finding may be tied to A1 and B2 being the most common CEFR level band of most general-purpose texts found online, which are the sources of the data on which these models are trained. For feature-based models, we note the potential effect of training and test data having higher instance counts for these level bands than A1, C1, and C2. Regarding model scale, upgraded versions from similar model families perform better than their previous versions, echoing previous findings in literature (Imperial and Tayyar Madabushi, 2024). This is particularly evident in Gemma3, which is 12B in size, trained with massively multilingual data in 140+ languages, and obtains 28% in weighted F1, compared to Gemma1, which is 7B in size, English-centric, and obtains 21.8%. We note a potential default effect in using these models where additional specific CEFR descriptor information is not needed if the texts being evaluated are in English, due to the majority of data in the context of CEFR that is reflected in the training data being English.

LANG    SENT     PARA      DOC      DIAG
EN      12,826   409,362   1,837    0
ES      0        713       31,355   0
DE      26,244   1,033     5,673    0
NL      0        0         3,596    0
CS      0        441       0        0
IT      0        813       0        0
FR      1,669    0         344      0
ET      0        420       1,277    0
PT      0        1,423     0        0
AR      1,945    215       0        0
HI      1,491    0         0        0
RU      1,758    0         0        0
CY      1,107    109       41       115
Total   47,040   414,529   44,123   115
Table 9: Data statistics of UNIVERSALCEFR-TRAIN in terms of levels (sentence, paragraph, document, dialogue) across the 13 target languages.
Spanish. Fine-tuned models outperform other setups, with feature-based approaches, especially Random Forest, achieving reasonable comparative performance. Moreover, multilingual models provide noticeable performance gains when compared to the English-only model. As for the prompting strategy, the language-specific prompt seems to play a role in improving the performance of smaller multilingual models, as it also does for the English-only Gemma1 model. However, Gemma3 with 12B parameters is not affected by this, and it has been able to produce the best results of the LLMs (plus more sophisticated prompting strategies). As for the granularity of the input, models perform noticeably better at the document level than at the paragraph level, indicating that longer contexts are easier to classify than short ones. Finally, it is worth reporting a noticeable error of Gemma1: the prediction of the C2 grade level, which does not exist in the Spanish dataset.
LANG    A1       A2       B1       B2      C1      C2
EN      19,449   13,151   6,643    2,384   797     85
ES      1,535    1,226    904      471     285     0
DE      32       2,494    2,392    60      13      41
NL      6        70       235      230     99      32
CS      0        14       9        6       0       0
IT      3        33       23       1       0       0
FR      13       30       39       43      20      12
ET      0        19       52       21      25      0
PT      61       213      50       144     19      61
AR      7        26       56       53      35      15
HI      22       30       20       16      12      13
RU      34       23       25       34      21      9
CY      67       44       0        0       0       0
Total   21,229   17,373   10,448   3,463   1,326   268
Table 10: Data statistics of UNIVERSALCEFR-DEV in terms of recognized CEFR levels (A1, A2, B1, B2, C1, C2) across the 13 target languages.

LANG    SENT    PARA     DOC     DIAG
EN      1,274   40,980   255     0
ES      0       51       4,370   0
DE      4,168   79       785     0
NL      0       0        672     0
CS      0       29       0       0
IT      0       60       0       0
FR      146     0        11      0
ET      0       19       98      0
PT      0       548      0       0
AR      188     4        0       0
HI      113     0        0       0
RU      146     0        0       0
CY      111     0        0       0
Total   6,146   41,770   6,191   0
Table 11: Data statistics of UNIVERSALCEFR-DEV in terms of levels (sentence, paragraph, document, dialogue) across the 13 target languages.

Hindi. Both the Gemma models perform poorly compared to the fine-tuned XLM-R and the Random Forest variants and tend to classify most Hindi test items as A1 or A2. For example, Gemma1 puts 57% of Hindi test samples as A1, whereas there are only 19% of the test samples labeled as A1 in the gold standard labels. This is in line with the general trend noticed in Section 6.2, as the Hindi subset is entirely sentence-level. The distribution is closer to the gold distribution for the fine-tuned and feature-engineered models. XLM-R fine-tuned models give the best performance amongst all models for Hindi, both in terms of exact category prediction and in terms of the degree of error (i.e., being within 1 level above or below the correct level). Finally, we looked at the correlation between a simple approximation of text length (calculated as the number of space-separated tokens), a commonly used variable in such automated language assessment approaches in NLP research, and the CEFR gold labels, as well as model-predicted labels, after converting them to a numeric scale. There was a high correlation between text length and the gold labels (0.7), which was also seen with the XLM-R model (0.74) and the Random Forest models (0.77). However, the Gemma models only had correlations of 0.44 and 0.54, respectively, with text length. Still, considering that the Hindi subset only has sentence-level annotations without a larger context, it may be challenging to achieve further consistency with the gold standard labels, given the size of the annotated dataset. Future research should expand the available CEFR-graded resources both in terms of quantity as well as granularity for the language.
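The two diagnostics used in the Hindi analysis (correlation of text length with CEFR labels on a numeric scale, and agreement within one level) can be sketched as below. The text does not state which correlation coefficient was used, so Pearson's r is an assumption here; the A1=1 to C2=6 mapping and all function names are likewise hypothetical.

```python
import math

# Assumed numeric scale for CEFR labels (A1 lowest, C2 highest).
CEFR_SCALE = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}


def text_length(text):
    """Approximate text length as the number of whitespace-separated tokens."""
    return len(text.split())


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def within_one_level(gold, pred):
    """Share of predictions that are at most one CEFR level away from the gold label."""
    hits = sum(1 for g, p in zip(gold, pred) if abs(CEFR_SCALE[g] - CEFR_SCALE[p]) <= 1)
    return hits / len(gold)


# Toy data (illustrative only, not drawn from the Hindi subset).
texts = ["a b", "a b c d", "a b c d e f", "a b c d e f g h"]
labels = ["A1", "A2", "B1", "B2"]
lengths = [text_length(t) for t in texts]
print(pearson(lengths, [CEFR_SCALE[l] for l in labels]))
```

Running the same correlation on gold labels and on each model's predicted labels gives the comparison reported above: a model whose correlation with length is far below the gold correlation is not tracking a signal the annotations do contain.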
Russian. The Russian results follow the broad patterns reported in the paper, but the language's rich inflectional morphology and comparatively limited training data amplify several effects. Gemma1 (34.8%) greatly over-predicts texts as beginner-level (only 5% of texts had predictions above B1), confirming the overall trend that small, English-centric LLMs struggle most with morphologically rich languages. Gemma3 (37.4%) partially corrects this, but still massively under-predicts B2 and C2. XLM-R (49.6%) mirrors the gold distribution most faithfully, possibly because its multilingual vocabulary gives it better coverage of Russian inflectional morphology, a pattern also seen for other highly inflected languages such as Czech. The two Random Forest models (47.2% and 47.8%) under-predict A2 and C2 but otherwise match the gold shape, showing that handcrafted lexical and morpho-syntactic features capture useful Russian-specific signals even with limited data. Subword-level multilingual models (XLM-R) or explicit morpho-syntactic features (RF) are best suited to capture the meanings and relations between Russian words. Text length appears to be a false friend: although it does correlate highly with readability (r=0.65), it also appears to be the source of many errors, and top-performing model outputs had text length correlations as high as 0.73. Since this experiment with Russian is limited to sentence-level readability, comparison with previous research on Russian readability assessment is not straightforward. However, the weighted F1 (49.6%) of the best-performing model (XLM-R) is below state-of-the-art results for longer texts, including 67% (Reynolds, 2016), 74% (Solnyshkina et al., 2018), and 78% (Blinova and Tarasov, 2022). Most likely, this difference is partly due to the absence of Russian-specific morphosyntactic features that have been highly informative in previous studies' models.

LANG    A1    A2      B1      B2    C1    C2
EN      107   114     132     129   83    35
ES      49    58      140     108   45    0
DE      14    264     238     67    9     8
NL      4     21      69      77    22    7
CS      0     82      79      37    2     0
IT      9     87      104     0     0     0
FR      32    57      132     100   63    16
ET      0     110     130     93    67    0
PT      49    50      47      30    13    11
AR      12    26      162     145   40    15
HI      38    34      42      42    28    16
RU      41    36      52      35    24    12
CY      233   232     0       0     0     0
Total   588   1,171   1,327   863   396   120
Table 12: Data statistics of UNIVERSALCEFR-TEST in terms of recognized CEFR levels (A1, A2, B1, B2, C1, C2) across the 13 target languages.
Portuguese. Comparing the different setups, we can see that the results for Portuguese follow the global tendency, with fine-tuned models achieving the highest performance, followed by feature-based models, and with prompting taking the last place. Although this study only covers paragraph-level learner data for Portuguese, similar patterns were observed on reference data (Ribeiro et al., 2024b). However, comparing the results with those of other languages and, particularly, those with paragraph-level learner data, we can see that Portuguese is the language with the lowest performance (≈33.5%). Several factors may contribute to this outcome. For instance, Portuguese is one of the languages with the least available training data, and the distribution of proficiency labels is right-skewed (especially in COPLE2). Furthermore, the data consists of texts written by learners from a wide range of L1 backgrounds with generally low proficiency. This makes it more difficult for models to identify consistent patterns due to strong L1 interference and low coverage. Overall, both fine-tuned and feature-based models seem to be unable to distinguish between sublevels, with most examples of both A levels being predicted as A1, and the remainder (mostly examples of the B levels) as B1. On the positive side, contrary to what was observed for other languages, the models do not seem to be influenced by text length, with the predictions of XLM-R having a correlation of just 0.39 with that feature. The prompting approaches lead to a bias towards the prediction of levels A2 and B1, with the top performer among these approaches (Gemma3 with EN-WRITE prompt) predicting A2 for 28% of the examples and B1 for 62%. Notably, when using the more descriptive prompts, the Gemma 1 model outperformed EuroLLM, in spite of having fewer parameters and not being specifically trained on Portuguese data.

LANG    SENT    PARA    DOC     DIAG
EN      200     200     200     0
ES      0       200     200     0
DE      200     200     200     0
NL      0       0       200     0
CS      0       200     0       0
IT      0       200     0       0
FR      200     0       200     0
ET      0       200     200     0
PT      0       200     0       0
AR      200     200     0       0
HI      200     0       0       0
RU      200     0       0       0
CY      200     109     41      115
Total   1,400   1,709   1,241   115
Table 13: Data statistics of UNIVERSALCEFR-TEST in terms of levels (sentence, paragraph, document, dialogue) across the 13 target languages.
French. The French corpus and our analysis are divided into sentence-level and document-level data. The sentence-level set contains 1,668 sentences ranging from A1 to C2, while the document-level set includes 344 documents from A1 to C1, with an intense concentration at the B levels (75% of the data falls within B1 and B2). In line with the other languages, XLM-R is the most consistent model and achieves the best global performance in every setting. Random Forest (RF) with all features fluctuates more in overall performance, dropping notably in the document-level task, but retains some consistency in terms of which proficiency levels it performs best or worst on. RF with top features performs inconsistently overall but achieves its best results on the document-level task. However, it shows instability in class-level performance, with changes in which levels are most accurately predicted. Among the prompt-based models, Gemma3 is more stable than Gemma1, but both remain below the performance of XLM-R and RF. Gemma1, in particular, is the least consistent model, with highly variable class-level performance and occasional zero F1 scores for some levels in specific setups. The Gemma1 results are likely due to the lack of French documents during the training of this model. Across all models, prediction is generally more reliable for intermediate levels (A2–B2), while C-level predictions remain the most challenging. Fine-tuning has the clear advantage: the fine-tuned XLM-R achieves the highest accuracy across all evaluation set-ups, making it the most reliable in correctly predicting gold labels. It consistently outperforms all other models, both at the sentence and document levels. This is consistent with previous experiments on French (Yancey et al., 2021; Ngo and Parmentier, 2023; Wilkens et al., 2024), although our performance is slightly lower than in those studies. Prompting is the least effective: both Gemma1 and Gemma3, used in a prompt-based setting, show the lowest prediction accuracy, often failing to identify the correct labels, especially at the extremes of the proficiency scale (A1, C1, and C2 levels). Traditional supervised classifiers (Random Forest) perform moderately well, consistently outperforming the prompt-based models but still lagging behind the fine-tuned model. The feature-based models had a particularly poor performance on C1 and C2 levels. This is likely due to a lack of specialized features for those proficiency levels. Moreover, their performance varies by set-up, with some gains at the document level but noticeable drops elsewhere. Nevertheless, the two RF layouts had similar results. In summary, fine-tuning yields the best predictions, followed by traditional supervised learning, while prompting underperforms in this task.
Model EN ES DE NL CS IT FR ET PT AR HI RU CY Tally
GEMMA1 ✓ 1/1
GEMMA3 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 13/140
EUROLLM ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 12/35
MODERNBERT ✓ 1/1
EUROBERT ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 10/15
XLM-R ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 13/100
Table 14: Mapping of language coverage of training data used for the six large, pretrained language models in the model evaluation paradigm in Section 5. Models in teal are English-centric (trained primarily with English data), and models in purple are multilingual (trained with massive multilingual data). We referred to each model's corresponding release papers and publications for information on their supported languages. Note that the documentation of GEMMA3 indicates it has been trained with 140+ languages. Thus, we loosely consider it to cover all 13 languages in UNIVERSALCEFR. The tally column indicates {lang_covered / lang_seen}. For example, EUROBERT covers 10 of the languages in the current UNIVERSALCEFR out of the 15 languages it supports.
German. For German, the fine-tuned models (>70%) have been shown to outperform all other approaches, such as feature-based (≈50%-65%) and prompting (≈38%-46%), despite the presence of unbalanced CEFR levels in both the training and test data. The findings derived from the English-only and multilingual models, including fine-tuning and prompting methodologies, exhibit no notable difference. This may be due to the similarities between English and German, both of which are West Germanic languages. Alternatively, the great transferability of the fine-tuned English-only model may also be due to the large amount of German training data available (27,000 training samples). The feature-based models performed second best and were still able to compete with the fine-tuned models to some extent. This is surprising, given that a previous analysis showed that the features only exhibited low correlations with CEFR levels (see Section E.3). Proficiency assessment for German appears to require certain idiosyncratic features. For example, the feature covering the maximum distance between words in a dependency tree showed a high feature importance only for German, reflecting the language's free word order and long-distance dependencies. For the prompting setup, the multilingual Gemma3 model performed better, achieving good results for lower CEFR levels but underpredicting higher levels. By contrast, Gemma1 significantly overpredicts level A1 (250 against 14 from the gold labels), resulting in poorer performance on average and across the other levels. One deceptive indicator might be the length of the texts to be classified, as reflected by the strong correlation between text length and Gemma1's predictions (r=0.61). When comparing the prompting setups with regard to language-specific task descriptions, no clear trend emerges across all three LLMs, mirroring the difficulty of prompt engineering for a complex task such as multilingual proficiency classification.
Arabic. Across the 400 Arabic test items, Gemma1 tends to over-predict lower CEFR levels, assigning 31 items to A1 while only 12 are from the true labels, and 90 to A2 against 26. There is also a tendency to under-predict C1, with 18 predictions against 40 from the true labels, resulting in the highest average grade deviation of 1.0. In contrast, XLM-R and both Random Forest variants distributed their predictions more evenly overall, with XLM-R achieving the smallest average grade deviation of 0.75. In terms of granularity, the Arabic subset is split into sentence-level reference data and paragraph-level learner data. For the sentence-level reference texts, XLM-R (≈55%) and Random Forest models from the two linguistic feature setups (≈49.3%-51.2%) outperform both Gemma1 and Gemma3 models through prompting (≈16.5%-32%). However, with paragraph-level learner texts, Gemma3 leads the evaluation (≈41%). At the same time, XLM-R and the Random Forest models fall behind (≈32%), possibly due to the Arabic data used in the training split, which are entirely sentence-level. In contrast, the Gemma3 model has most likely seen diverse online Arabic data.
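The over- and under-prediction counts and the average grade deviation reported for Arabic can be computed along the following lines. The text does not define the average grade deviation, so this sketch assumes it is the mean absolute difference between predicted and gold levels on an ordinal scale; the function names are hypothetical.

```python
from collections import Counter

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
LEVEL_INDEX = {level: i for i, level in enumerate(LEVELS)}


def prediction_shift(gold, pred):
    """Predicted count minus gold count per level (positive means over-prediction)."""
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    return {level: pred_counts[level] - gold_counts[level] for level in LEVELS}


def average_grade_deviation(gold, pred):
    """Assumed definition: mean absolute distance in CEFR levels between prediction and gold."""
    total = sum(abs(LEVEL_INDEX[g] - LEVEL_INDEX[p]) for g, p in zip(gold, pred))
    return total / len(gold)


# Toy labels (illustrative only, not the Arabic test set).
gold = ["A1", "A2", "C1", "C1"]
pred = ["A1", "A1", "B2", "C1"]
print(prediction_shift(gold, pred))
print(average_grade_deviation(gold, pred))
```

The count difference shows which levels a model favors, while the deviation summarizes how far off its errors are, so the two are complementary: a model can match the gold distribution in aggregate and still have a large deviation.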
D Standardized Dataset Fields
We present the standardized JSON format used as a template when processing all qualified datasets in UNIVERSALCEFR. This structured format ensures flexibility and interoperability into other formats accepted and used by the AI community, including Hugging Face and Croissant. Moreover, this format captures the dimensions that are essential to each instance of CEFR-labeled text, including format or granularity, category, license, and language.
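A single instance in this standardized format might look like the record below, with a small validator alongside it. This is a sketch under stated assumptions: the field names and allowed values follow the field descriptions in Table 15, while the example values, the `validate_record` helper, and its messages are hypothetical and not part of the released tooling.

```python
import json

REQUIRED_FIELDS = (
    "title", "lang", "source_name", "format",
    "category", "cefr_level", "license", "text",
)
ALLOWED_VALUES = {
    "format": {"document-level", "paragraph-level", "discourse-level", "sentence-level"},
    "category": {"reference", "learner"},
    "cefr_level": {"A1", "A2", "B1", "B2", "C1", "C2"},
}


def validate_record(record):
    """Return a list of problems; an empty list means the record matches the template."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    for name, allowed in ALLOWED_VALUES.items():
        if name in record and record[name] not in allowed:
            # Labels such as "A1+" or "A" occur in a small share of the data.
            problems.append(f"non-standard {name}: {record[name]!r}")
    return problems


# Hypothetical instance; the text and label are made up for illustration.
example = {
    "title": "NA",
    "lang": "en",
    "source_name": "cambridge-exams",
    "format": "document-level",
    "category": "reference",
    "cefr_level": "B1",
    "license": "Unknown",
    "text": "A short illustrative passage.",
}
print(json.dumps(example, indent=2))
print(validate_record(example))
```

Reporting non-standard labels instead of rejecting the record keeps instances with labels like A1+ available for later filtering, which seems to match how the dataset documents them.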
E Full Linguistic Feature Analysis
E.1 All Linguistic Features
Overall, we have extracted 100 diverse linguistic features which can be grouped into morphosyntactic (62), syntactic (18), length-based (11), lexical (4), readability (2), psycholinguistic (2), and discourse (1). The full list of features, including short descriptions, is available in Appendix E. The features are based on sentence-based linguistic annotation with spacy (Montani et al., 2023) and stanza (Qi et al., 2020), including tokenization, part-of-speech tagging, and dependency parsing. Additionally, we use fastText embeddings (Grave et al., 2018), pyphen for hyphenation (Berendsen and Kozea, 2025) and the MEGA.HR crossling lexicon (footnote 9) for imageability and concreteness (Ljubešić, 2018). Most of the features have already been implemented in the text-simplification-evaluation (TSEval) package (footnote 10; see Martin et al. (2018) for the original version and Stodden and Kallmeyer (2020) for the multilingual version).
In Table 21, we provide an overview of all features including a short description, resources used, and correlation with the CEFR level.
E.2 Top Linguistic Features
To extract the top linguistic features (TOPFEATS), we selected those that are present in the top 10 ranked most important features for at least three languages. Using this criterion, we came up with a list of 23 linguistic features as reported in Table 16, which was then used in the experiment result in Table 4.
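The selection rule above (features ranked in the top 10 for at least three languages) can be expressed as a short routine. It is a sketch only: the function name `select_top_features`, the input layout (a mapping from language to feature importance scores), and the toy numbers are hypothetical, and the source of the importance scores is not specified here.

```python
from collections import Counter


def select_top_features(importance_by_lang, top_k=10, min_langs=3):
    """Keep features that rank in the top_k by importance in at least min_langs languages."""
    counts = Counter()
    for importances in importance_by_lang.values():
        ranked = sorted(importances, key=importances.get, reverse=True)[:top_k]
        counts.update(ranked)
    return sorted(name for name, n in counts.items() if n >= min_langs)


# Toy importances (illustrative only), using smaller thresholds to keep the example short.
toy = {
    "en": {"num_words": 0.5, "parse_tree_height": 0.3, "concreteness": 0.1},
    "de": {"num_words": 0.4, "concreteness": 0.35, "parse_tree_height": 0.05},
    "cs": {"parse_tree_height": 0.6, "num_words": 0.2, "concreteness": 0.1},
}
selected = select_top_features(toy, top_k=2, min_langs=2)
print(selected)
```

With the defaults `top_k=10` and `min_langs=3` the function applies the thresholds stated in the text; the smaller values in the toy call only make the behavior visible on three languages.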
E.3 Linguistic Correlation Analysis
In the following, we describe some insights into linguistic diversity of the UniversalCEFR data by correlation analysis between the features and the CEFR levels.
9 https://www.clarin.si/repository/xmlui/handle/11356/1187
10 https://github.com/facebookresearch/text-simplification-evaluation
Correlation Across All Languages. Considering the absolute Spearman correlation between the features and the CEFR level (selecting values with p < 0.05 and ρ > 0.3 on average across all languages), the strongest associations were found in length-based measures, such as characters per sentence and syllables per sentence. Several grammatical complexity features, including parse tree height and phrase length, showed moderate correlations. Readability indices (FKGL and Flesch Reading Ease) also displayed moderate correlations in the expected direction. Psycholinguistic features, such as concreteness and imageability, were negatively correlated with proficiency, indicating a shift toward more abstract language at higher levels. Finally, morphosyntactic features regarding voice, tense, and number showed moderate but consistent correlations, supporting their relevance in reflecting syntactic development.
Correlation By CEFR Level. To assess the consistency of feature relevance across languages, we examined the number of features with significant correlations (p < 0.05) with CEFR levels per language. The results revealed notable variations. Languages such as Czech (CS), Estonian (ET), and Italian (IT) showed a high number of relevant features, suggesting strong alignment between the selected linguistic features and CEFR progression in these languages. English (EN), Spanish (ES), French (FR), Hindi (HI), and Russian (RU) showed moderate coverage, with a reasonable number of features exceeding the 0.3 correlation threshold. In contrast, Arabic (AR), Dutch (NL), and Portuguese (PT) exhibited weak coverage, while Welsh (CY) and German (DE) had very few or no features with relevant correlations, indicating a limited match between the current feature set and CEFR levels for those languages. Furthermore, a few features are only relevant for a few languages, e.g., the translative case for only Estonian, negative verb polarity for only Czech, or genitive case for only Czech, Estonian, and Russian. This variability highlights the influence of language-specific properties on the effectiveness of general feature-based models for proficiency prediction.
Poin -Bise ial Co ela ion. A poin -bise ial co -
ela ion analysis by CEFR le el e ealed ha mos
ea u es exhibi only weak co ela ions, sugges -
ing limi ed disc imina i e powe when isola ing
indi idual CEFR bands. In e es ingly, he abso-
lu e co ela ion alues end o be s onges a he
9735

Field Desc ip ion
i le
The unique i le o he ex e ie ed om i s o iginal co pus (
NA
i he e
a e no i les such as CEFR-assessed sen ences o pa ag aphs).
lang
The sou ce language o he ex in ISO 638-1 o ma (e.g.,
en
o English).
sou ce_name
The sou ce da ase name whe e he ex is collec ed as indica ed om
hei sou ce da ase , pape , and/o documen a ion (e.g.,
camb idge-exams
om Xia e al. (2016)).
o ma
The o ma o he ex in e ms o le el o g anula i y as indica ed om hei
sou ce da ase , pape , and/o documen a ion. The ecognized o ma s a e
he ollowing: [
documen -le el
,
pa ag aph-le el
,
discou se-le el
,
sen ence-le el].
ca ego y
The classi ica ion o he ex in e ms o who c ea ed he ma e ial. The
ecognized ca ego ies a e
e e ence
o ex s c ea ed by expe s, eache s,
and language lea ning p o essionals and
lea ne
o ex s w i en by
language lea ne s and s uden s.
ce _le el
The CEFR le el associa ed wi h he ex . The six ecognized CEFR le els
a e he ollowing: [
A1, A2, B1, B2, C1, C2
]. A small ac ion (<1%) o
ex in UNIVERSALCEFR con ains unlabelled ex , ex s wi h plus signs
(e.g., A1+), and ex s wi h no le el indica o (e.g., A, B).
license
The licensing in o ma ion associa ed wi h he ex (
Unknown
i no s a ed).
ex The ac ual con en o he ex i sel .
Table 15: The s uc u ed JSON ields wi h desc ip ions and examples used as he s anda dized uni o m o ma o
building he UNIVERSALCEFR da ase . All ins ances alida ed om he collec ion o CEFR-labelled co po a
con o m o his o ma .
9736
CATEGORY | FEATURE NAME
Length | doc_num_sents, doc_num_tokens, num_characters, num_characters_per_sentence, num_characters_per_word, num_syllables_in_sentence, num_syllables_per_sentence, num_syllables_per_word, num_words
Lexical | average_pos_in_freq_table, lexical_complexity_score
Morphosyntactic | ratio_Tense_Past, ratio_of_determiners, ratio_of_numerals, ratio_of_pronouns
Psycholinguistic | concreteness, imagebility
Readability | sentence_fkgl, sentence_fre
Syntactic | avg_distance_between_words, average_length_VP, parse_tree_height, ratio_of_coordinating_clauses
Table 16: List of linguistic features occurring in the top 10 of at least three languages. We use this list for the TOPFEATURES subset used in the experiment results in Table 4.
A1 level, particularly for psycholinguistic features such as imageability (ρ = 0.48) and concreteness (ρ = 0.46), as well as punctuation-related measures. This suggests that certain surface-level and lexical-semantic features may be especially informative at the lowest proficiency level. A notable case is the feature of word length in characters, which shows a negative correlation at A1 (ρ = −0.45), becomes neutral at A2, and shifts to a positive correlation at B1 and higher levels. This pattern may reflect increasing lexical complexity with proficiency. Similarly, features related to syntactic structure, such as the ratio of past tense verbs and phrase length, generally shift from weak negative to weak positive correlations as proficiency increases, indicating progressive syntactic development. Overall, the directionality of several features suggests dynamic usage patterns across CEFR bands, even if the correlation strengths remain modest.
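The selection rule stated in the caption of Table 16 (features that occur in the top 10 of at least three languages) can be sketched as follows. This is an illustrative sketch only: how the per-language rankings are produced is not specified in this section, so the input format and the function name `select_top_features` are assumptions.

```python
from collections import Counter


def select_top_features(ranked_by_language, top_k=10, min_languages=3):
    """Keep features that appear in the top_k of at least min_languages rankings.

    ranked_by_language maps a language code to a list of feature names,
    ordered from most to least important (assumed input format).
    """
    counts = Counter()
    for ranking in ranked_by_language.values():
        # A set ensures each language contributes at most one vote per feature.
        counts.update(set(ranking[:top_k]))
    return sorted(name for name, n in counts.items() if n >= min_languages)
```

With the defaults `top_k=10` and `min_languages=3`, the function mirrors the thresholds named in the caption.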
F Hyperparameter Values
We detail the hyperparameter values used for fine-tuning pretrained (MODERNBERT, EUROBERT, and XLM-R) and instruction-tuned language models (GEMMA1, GEMMA3, and EUROLLM) in Tables 17 and 18, respectively.
HYPERPARAMETER | VALUE
Learning rate | 3.6 × 10⁻⁵
Train batch size | 2
Evaluation batch size | 3
Random seed | 42
Gradient accumulation steps | 16
Total effective batch size | 32
Optimizer | adamw_torch_fused
Betas | (0.9, 0.999)
Epsilon | 10⁻⁸
Learning-rate scheduler | linear
Warm-up ratio | 0.1
Table 17: Hyperparameter values used for fine-tuning pretrained language models.
HYPERPARAMETER | VALUE
Sampling | False
Max New Tokens | 10
Data Type | torch.bfloat16
GPU | 4 x NVIDIA RTX A5000 (24GB)
Table 18: Hyperparameter values and GPU information used for prompting instruction-tuned models.
LANG SENT PARA DOC OVERALL
AR 55.7 32.6 - 43.1
CY 86.9 72.5 61.5 72.7
CS - 68.8 - 68.8
DE 65.4 71.1 83.4 73.2
EN 68.3 100.0 57.6 75.5
ES - 40.6 98.0 69.69
ET - 93.6 84.0 88.9
FR 57.6 - 44.2 51.7
HI 52.9 - - 52.9
IT - 83.3 - 83.3
NL - 59.0 59.0
PT - 29.2 - 29.2
RU 49.6 - - 49.6
Table 19: Weighted F1 scores of the fine-tuned XLM-R (top model across all setups) on the UNIVERSALCEFR-TEST, broken down by the granularity levels of the data.
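The values in Tables 17 and 18 can be collected into plain configuration dictionaries. The sketch below is illustrative: the key names follow Hugging Face `TrainingArguments` and `generate()` conventions, which is an assumption about the tooling rather than something this appendix states, and `effective_batch_size` is a hypothetical helper. The reported total effective batch size of 32 is reproduced with a single device (2 × 16 × 1); the number of devices used for fine-tuning is likewise an assumption.

```python
# Values copied from Table 17; key names mirror Hugging Face TrainingArguments (assumed).
FINE_TUNING = {
    "learning_rate": 3.6e-5,
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 3,
    "seed": 42,
    "gradient_accumulation_steps": 16,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.1,
}

# Values copied from Table 18; key names mirror generate() keyword arguments (assumed).
PROMPTING = {
    "do_sample": False,        # "Sampling: False", i.e. greedy decoding
    "max_new_tokens": 10,
    "torch_dtype": "bfloat16",
}


def effective_batch_size(config, num_devices=1):
    """Per-device batch size times accumulation steps times number of devices."""
    return (config["per_device_train_batch_size"]
            * config["gradient_accumulation_steps"]
            * num_devices)
```

Calling `effective_batch_size(FINE_TUNING)` gives a quick consistency check against the "Total effective batch size" row of Table 17.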
G Additional Context on Restrictions of GDPR-Protected Datasets
The critical aspect of the GDPR is that it gives data subjects (e.g., L2 learners of CEFR) the right to withdraw their personal information from processing, which requires data processors to store both the signed consents and the ID mappings (i.e., mappings between the names of the real people and their IDs in a released corpus). As long as these documents exist and reidentification is theoretically possible, the data falls under the scope of the GDPR. Further complicating factors are national legislations and ethical regulations, such as archival laws, that treat any data produced at universities (including those used for language proficiency assessment, such as essays, recorded dialogues, and written texts from personal experiences) as the property of the state, which makes destruction of the ID mappings a non-trivial act (European Parliament and Council, 2016).
Yet another upcoming challenge is the EU AI Act (European Parliament and Council, 2024), which implies that AI models trained on personal data should inherit the same license as the data they have been trained on, meaning that the models will be under the scope of the GDPR. We hypothesize that the non-restricted datasets included in UNIVERSALCEFR either do not contain personal information or were collected before the GDPR, since they are already openly accessible to the public. We further hypothesize that the datasets currently under the GDPR will eventually have their ID mappings destroyed and will no longer be subject to the GDPR. This may mean that the learner corpora that can be added to UNIVERSALCEFR will grow with time.
H Full Dataset Directory of UniversalCEFR
We provide the complete information of qualified corpora included in the current UNIVERSALCEFR collection to form a directory of datasets. Aside from the eight per-instance fields included in the standardized JSON format in Table 15, we also report five per-corpus fields, as listed below:
• Annotation method used (manual, computer-assisted, or NA).
• Total number of expert annotators.
• Distinct L1 learners per language for learner corpora.
• Inter-annotator agreement (IAA) metric and score.
• Reference to published paper or repository.
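The five per-corpus items listed above could be stored alongside each dataset as a small record. This is a hypothetical sketch: the class name `CorpusInfo`, its attribute names, and the choice of types (for example, storing the distinct L1 learners as a count) are assumptions made for illustration and do not come from the released directory.

```python
from dataclasses import dataclass
from typing import Optional

ANNOTATION_METHODS = {"manual", "computer-assisted", "NA"}


@dataclass(frozen=True)
class CorpusInfo:
    """Per-corpus metadata mirroring the five listed items (names are hypothetical)."""
    annotation_method: str                 # "manual", "computer-assisted", or "NA"
    num_expert_annotators: Optional[int]   # None when not reported
    distinct_l1_learners: Optional[int]    # only meaningful for learner corpora
    iaa_metric: Optional[str]              # name of the agreement metric, if any
    iaa_score: Optional[float]
    reference: str                         # published paper or repository

    def __post_init__(self):
        if self.annotation_method not in ANNOTATION_METHODS:
            raise ValueError(f"unknown annotation method: {self.annotation_method!r}")
```

Optional fields are left as `None` rather than guessed when a corpus does not report them.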
I Prompt Templates
We provide the complete copies of the prompt templates used in prompting experiments with instruction-tuned LLMs as described in Section 5. The prompt templates are categorized by color based on the setup: BASE, EN-READ, LANG-READ, EN-WRITE, LANG-WRITE.
J Welsh Data Collection
One of the contributions of UNIVERSALCEFR is the release of the first-ever open dataset for the Welsh language (CY) with gold-standard CEFR labels of A1 and A2. To obtain this data, we corresponded with data maintainers from Learn Welsh (https://learnwelsh.cymru/), which is a compilation of expert-created books (reference texts), and acquired PDF versions. This resource can be shared in any format for non-commercial research, which fits the goal of UNIVERSALCEFR. We then manually extracted qualified texts according to the four levels of granularity: sentence, paragraph, dialogue, and document. The distribution of CEFR levels and text granularity of this new Welsh dataset can be found in Tables 3 and 7, respectively.
Category | Feature | Short Description | Resource | Corr. ρ (avg.) | Corr. ρ (SD)
Discourse | ratio_referential | Ratio of referential tokens to all tokens based on dependency tree relations | SpaCy, Stanza, TSEval | 0.1278 | 0.09
Length | doc_num_sents | Number of sentences per document/text | SpaCy, Stanza, | 0.1819 | 0.18
Length | doc_num_tokens | Number of tokens per document/text | SpaCy, Stanza, | 0.5041 | 0.22
Length | num_characters | Number of characters per document/text | SpaCy, Stanza, TSEval | 0.5224 | 0.19
Length | num_characters_per_sentence | Number of characters per sentence | SpaCy, Stanza, TSEval | 0.5301 | 0.17
Length | num_characters_per_word | Number of characters per word | SpaCy, Stanza, TSEval | 0.3895 | 0.16
Length | num_sentences | Number of sentences per document/text | SpaCy, Stanza, TSEval | -0.0324 | 0.11
Length | num_syllables_in_sentence | Number of syllables in document/text | SpaCy, Stanza, pyphen, TSEval | 0.525 | 0.19
Length | num_syllables_per_sentence | Number of syllables per sentence | SpaCy, Stanza, pyphen, TSEval | 0.4634 | 0.23
Length | num_syllables_per_word | Number of syllables per word | SpaCy, Stanza, pyphen, TSEval | 0.3924 | 0.16
Length | num_words | Number of tokens per document/text | SpaCy, Stanza, TSEval | 0.479 | 0.18
Length | num_words_per_sentence | Number of tokens per sentence | SpaCy, Stanza, TSEval | 0.4863 | 0.16
Lexical | average_pos_in_freq_table | Average frequency rank of tokens in FastText embeddings | SpaCy, Stanza, TSEval | -0.0907 | 0.22
Lexical | lexical_complexity_score | Lexical complexity based on ranks of FastText embeddings | SpaCy, Stanza, FastText, TSEval | 0.0649 | 0.13
Lexical | type_token_ratio | Type-token ratio | SpaCy, Stanza, TSEval | -0.3364 | 0.16
Lexical | max_pos_in_freq_table | Maximum frequency rank of tokens in FastText embeddings | SpaCy, Stanza, TSEval | 0.2014 | 0.07
Psycholinguistic | concreteness | Concreteness of words based on MEGAHR and FastText embeddings | MEGA.HR crossling | -0.4254 | 0.24
Psycholinguistic | imagebility | Imageability of words based on MEGAHR and FastText embeddings | MEGA.HR crossling | -0.3962 | 0.24
Readability | sentence_fkgl | Flesch-Kincaid Grade Level, designed for English | SpaCy, Stanza, TSEval | 0.4738 | 0.19
Readability | sentence_fre | Flesch Reading Ease, designed for English | SpaCy, Stanza, TSEval | -0.3976 | 0.19
Syntactic | avg_distance_betweeen_verb_particle | Average distance between verb and particle based on dependency tree | SpaCy, Stanza, | 0.0385 |
Syntactic | avg_distance_betweeen_words | Average distance between words based on dependency tree | SpaCy, Stanza, | 0.3934 | 0.17
Syntactic | max_distance_betweeen_verb_particles | Maximum distance between verb and particle based on dependency tree | SpaCy, Stanza, | 0.0385 |
Syntactic | max_distance_betweeen_words | Maximum distance between words based on dependency tree | SpaCy, Stanza, | 0.2109 | 0.14
Syntactic | check_if_head_is_noun | Whether the head of the dependency tree is a noun | SpaCy, Stanza, TSEval | -0.1147 | 0.13
Syntactic | check_if_head_is_verb | Whether the head of the dependency tree is a verb | SpaCy, Stanza, TSEval | 0.0954 | 0.13
Syntactic | check_if_one_child_of_root_is_subject | Whether a child of a root is a subject (not a verb) | SpaCy, Stanza, TSEval | 0.0144 | 0.15
Syntactic | check_passive_voice | Whether a sentence is in passive voice | SpaCy, Stanza, TSEval | |
Syntactic | average_length_NP | Average length of noun phrase in tokens | SpaCy, Stanza, TSEval | 0.3485 | 0.13
Syntactic | average_length_VP | Average length of verb phrase in tokens | SpaCy, Stanza, TSEval | 0.4324 | 0.16
Syntactic | avg_length_PP | Average length of prepositional phrase in tokens | SpaCy, Stanza, TSEval | 0.1406 | 0.07
Syntactic | parse_tree_height | Depth or height of the dependency tree | SpaCy, Stanza, TSEval | 0.4657 | 0.15
Syntactic | ratio_clauses | Ratio of tokens associated to a clause to all tokens based on dependency tree relations | SpaCy, Stanza, TSEval | 0.1918 | 0.12
Syntactic | ratio_of_coordinating_clauses | Ratio of tokens associated to a coordinating clause to all tokens based on dependency tree relations | SpaCy, Stanza, TSEval | 0.2503 | 0.16
Syntactic | ratio_of_subordinate_clauses | Ratio of tokens associated to a subordinating clause to all tokens based on dependency tree relations | SpaCy, Stanza, TSEval | 0.2361 | 0.14
Syntactic | ratio_prepositional_phrases | Ratio of tokens associated to a prepositional phrase to all tokens based on dependency tree relations | SpaCy, Stanza, TSEval | 0.1797 | 0.14
Syntactic | ratio_relative_phrases | Ratio of tokens associated to a relative clause to all tokens based on dependency tree relations | SpaCy, Stanza, TSEval | 0.3262 | 0.15
Syntactic | is_non_projective | Whether a dependency tree is non-projective | SpaCy, Stanza, TSEval | 0.0883 | 0.13
Morphosyntactic | ratio_Abbr_Yes | Ratio of nouns which are an abbreviation to all nouns | SpaCy, Stanza, UniversalDependencies | -0.0685 | 0.07
Morphosyntactic | ratio_Case_Abe | Ratio of nouns in abessive case to all nouns | SpaCy, Stanza, UniversalDependencies | 0.2394 |
Morphosyntactic | ratio_Case_Acc | Ratio of nouns in accusative case to all nouns | SpaCy, Stanza, UniversalDependencies | 0.0658 | 0.29
Morphosyntactic | ratio_Case_Ben | Ratio of nouns in benefactive case to all nouns | SpaCy, Stanza, UniversalDependencies | |
Morphosyntactic | ratio_Case_Cau | Ratio of nouns in causative case to all nouns | SpaCy, Stanza, UniversalDependencies | |
Morphosyntactic | ratio_Case_Cmp | Ratio of nouns in comparative case to all nouns | SpaCy, Stanza, UniversalDependencies | |
Morphosyntactic | ratio_Case_Cns | Ratio of nouns in considerative case to all nouns | SpaCy, Stanza, UniversalDependencies | |
Morphosyntactic | ratio_Case_Com | Ratio of nouns in comitative case to all nouns | SpaCy, Stanza, UniversalDependencies | 0.1293 |
Morphosyntactic | ratio_Case_Dat | Ratio of nouns in dative case to all nouns | SpaCy, Stanza, UniversalDependencies | 0.1822 | 0.09
Morphosyntactic | ratio_Case_Dis | Ratio of nouns in distributive case to all nouns | SpaCy, Stanza, UniversalDependencies | |
Morphosyntactic | ratio_Case_Equ | Ratio of nouns in equative case to all nouns | SpaCy, Stanza, UniversalDependencies | |
Morphosyntactic | ratio_Case_Erg | Ratio of nouns in ergative case to all nouns | SpaCy, Stanza, UniversalDependencies | |
Table 20: Overview of all 100 features, including correlation coefficient with CEFR level across all languages.
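A correlation between one feature and the CEFR level, as reported in Table 20, could be computed along the following lines. This is a sketch under stated assumptions: it takes ρ to be Spearman's rank correlation and maps the six levels to the ordinals 1 to 6, neither of which is spelled out in this section; the function names are hypothetical. A constant input makes the correlation undefined and raises a `ZeroDivisionError` here.

```python
from math import sqrt

# Assumed ordinal encoding of the six CEFR levels.
CEFR_ORDER = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_with_cefr(feature_values, levels):
    """Spearman's rho between a numeric feature and CEFR labels such as 'B1'."""
    level_codes = [CEFR_ORDER[level] for level in levels]
    return pearson(average_ranks(feature_values), average_ranks(level_codes))
```

Under this reading, the "avg." and "SD" columns would summarize such per-language coefficients, although the exact aggregation is not described here.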
CEFR specifications for reading comprehension in Czech (CS)
You are an expert in language proficiency classification based on the Common European Framework of Reference for Languages (CEFR). Your task is to analyze the given Czech text or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR descriptors of reading comprehension for learners below:
A1 – Studenti této úrovně dokážou porozumět velmi krátkým jednoduchým textům po jedné frázi, pochytají známá jména, slova a základní fráze a přečtou si je podle potřeby.
A2 – Studenti této úrovně dokážou porozumět krátkým jednoduchým textům obsahujícím nejfrekventovanější slovní zásobu, včetně části sdílené mezinárodní slovní zásoby.
B1 – Studenti této úrovně dokážou číst přímočaré věcné texty na témata související s jejich oblastí zájmu s uspokojivou úrovní porozumění.
B2 – Studenti této úrovně dokážou číst s velkou mírou nezávislosti, přizpůsobují styl a rychlost čtení různým textům a účelům a selektivně používají vhodné referenční zdroje. Má širokou slovní zásobu aktivního čtení, ale může mít potíže s nízkofrekvenčními idiomy.
C1 – Studenti této úrovně dokážou podrobně porozumět dlouhým a složitým textům, ať už se týkají nebo netýkají jejich vlastní oblasti specializace, za předpokladu, že dokážou znovu přečíst obtížné části. Mohou také porozumět široké škále textů, včetně literárních textů, článků v novinách nebo časopisech a specializovaných akademických nebo odborných publikací, za předpokladu, že mají příležitosti k opakovanému čtení a mají přístup k referenčním nástrojům.
C2 – Studenti této úrovně mohou porozumět prakticky všem typům textů včetně abstraktních, strukturálně složitých nebo vysoce hovorových literárních a neliterárních spisů. Dokážou také porozumět široké škále dlouhých a složitých textů, ocenit jemné rozdíly ve stylu a implicitní i explicitní význam.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
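Each template in this appendix ends with a `«TEXT»` slot followed by `Answer:`. A minimal sketch of how such a template might be filled before it is sent to a model is shown below; the function name `build_prompt` and the strictness check are assumptions for illustration, not the authors' code.

```python
PLACEHOLDER = "«TEXT»"


def build_prompt(template: str, text: str) -> str:
    """Insert the passage to be classified into the template's «TEXT» slot."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError("template must contain the «TEXT» placeholder exactly once")
    return template.replace(PLACEHOLDER, text)
```

For instance, `build_prompt("Text: «TEXT»\nAnswer:", "Dobrý den.")` returns the same two lines with the Czech sentence in place of the placeholder.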

CEFR specifications for reading comprehension in Italian (IT)
You are an expert in language proficiency classification based on the Common European Framework of Reference for Languages (CEFR). Your task is to analyze the given Italian text or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR descriptors of reading comprehension for learners below:
A1 - Gli studenti di questo livello riescono a comprendere testi molto brevi e semplici, una frase alla volta, cogliendo nomi familiari, parole e frasi di base e rileggendo quando necessario.
A2 - Gli studenti di questo livello riescono a comprendere testi brevi e semplici contenenti il vocabolario più frequente, inclusa una parte di elementi di vocabolario internazionale condiviso.
B1 - Gli studenti di questo livello riescono a leggere testi fattuali semplici su argomenti correlati al loro campo di interesse con un livello di comprensione soddisfacente.
B2 - Gli studenti di questo livello riescono a leggere con un ampio grado di indipendenza, adattando stile e velocità di lettura a testi e scopi diversi e utilizzando fonti di riferimento appropriate in modo selettivo. Ha un ampio vocabolario di lettura attiva, ma può avere qualche difficoltà con idiomi a bassa frequenza.
C1 - Gli studenti di questo livello riescono a comprendere in dettaglio testi lunghi e complessi, indipendentemente dal fatto che siano correlati o meno alla propria area di specializzazione, a condizione che riescano a rileggere sezioni difficili. Possono anche comprendere un’ampia varietà di testi, tra cui scritti letterari, articoli di giornali o riviste e pubblicazioni accademiche o professionali specializzate, a condizione che vi siano opportunità di rilettura e abbiano accesso a strumenti di riferimento.
C2 - Gli studenti di questo livello possono comprendere praticamente tutti i tipi di testi, tra cui scritti letterari e non letterari astratti, strutturalmente complessi o altamente colloquiali. Possono anche comprendere un’ampia gamma di testi lunghi e complessi, apprezzando sottili distinzioni di stile e significato implicito ed esplicito.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
CEFR specifications for reading comprehension in French (FR)
You are an expert in language proficiency classification based on the Common European Framework of Reference for Languages (CEFR). Your task is to analyze the given French text or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR descriptors of reading comprehension for learners below:
A1 - Les apprenants de ce niveau peuvent comprendre des textes très courts et simples, phrase par phrase, en reprenant des noms, des mots et des expressions de base familiers et en les relisant si nécessaire.
A2 - Les apprenants de ce niveau peuvent comprendre des textes courts et simples contenant le vocabulaire le plus courant, y compris une partie du vocabulaire international commun.
B1 - Les apprenants de ce niveau peuvent lire des textes factuels simples sur des sujets liés à leur domaine d’intérêt avec un niveau de compréhension satisfaisant.
B2 - Les apprenants de ce niveau peuvent lire avec une grande autonomie, en adaptant leur style et leur vitesse de lecture à différents textes et objectifs, et en utilisant sélectivement des sources de référence appropriées. Possède un vocabulaire de lecture active étendu, mais peut éprouver des difficultés avec les expressions idiomatiques peu fréquentes.
C1 - Les apprenants de ce niveau peuvent comprendre en détail des textes longs et complexes, qu’ils relèvent ou non de leur domaine de spécialité, à condition de pouvoir relire les passages difficiles. Ils peuvent également comprendre une grande variété de textes, notamment des écrits littéraires, des articles de journaux ou de magazines, ainsi que des publications universitaires ou professionnelles spécialisées, à condition de disposer d’opportunités de relecture et d’outils de référence.
C2 - Les apprenants de ce niveau peuvent comprendre pratiquement tous les types de textes, y compris les écrits littéraires et non littéraires abstraits, structurellement complexes ou très familiers. Ils peuvent également comprendre un large éventail de textes longs et complexes, en appréciant les subtilités stylistiques et le sens implicite et explicite.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
CEFR specifications for reading comprehension in Estonian (ET)
You are an expert in language proficiency classification based on the Common European Framework of Reference for Languages (CEFR). Your task is to analyze the given Estonian text or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR descriptors of reading comprehension for learners below:
A1 – selle taseme õppijad saavad aru väga lühikestest lihtsatest tekstidest ühe fraasi kaupa, korjavad üles tuttavad nimed, sõnad ja põhifraasid ning loevad vajaduse korral uuesti läbi.
A2 – selle taseme õppijad saavad aru lühikestest lihtsatest tekstidest, mis sisaldavad kõige sagedamini kasutatavat sõnavara, sealhulgas osa jagatud rahvusvahelistest sõnavaraüksustest.
B1 – selle taseme õppijad oskavad rahuldaval mõistustasemel lugeda otsekoheseid faktitekste nende huvivaldkonnaga seotud teemadel.
B2 – selle taseme õppijad oskavad lugeda suurel määral iseseisvalt, kohandades lugemisstiili ja -kiirust erinevate tekstide ja eesmärkidega ning kasutades valikuliselt sobivaid viiteallikaid. Tal on lai aktiivse lugemise sõnavara, kuid tal võib esineda raskusi madala sagedusega idioomidega.
C1 – selle taseme õppijad saavad üksikasjalikult aru pikkadest ja keerukatest tekstidest, olenemata sellest, kas need on seotud nende enda erialaga või mitte, eeldusel, et nad suudavad raskeid lõike uuesti lugeda. Nad saavad aru ka paljudest erinevatest tekstidest, sealhulgas kirjanduslikest kirjutistest, ajalehtede või ajakirjade artiklitest ning erialastest akadeemilistest või erialastest väljaannetest, eeldusel, et neil on võimalus uuesti lugeda ja neil on juurdepääs viitevahenditele.
C2 – selle taseme õppijad saavad aru peaaegu iga tüüpi tekstidest, sealhulgas abstraktsetest, struktuurselt keerukatest või väga kõnekeelsetest kirjanduslikest ja mittekirjanduslikest kirjutistest. Samuti saavad nad aru paljudest pikkadest ja keerulistest tekstidest, mõistes peent stiilieritlust ning kaudset ja selgesõnalist tähendust.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
CEFR specifications for reading comprehension in Portuguese (PT)
You are an expert in language proficiency classification based on the Common European Framework of Reference for Languages (CEFR). Your task is to analyze the given Portuguese text or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR descriptors of reading comprehension for learners below:
A1 - Os alunos deste nível podem entender textos muito curtos e simples, uma única frase de cada vez, pegando nomes, palavras e frases básicas familiares e relendo conforme necessário.
A2 - Os alunos deste nível podem entender textos curtos e simples contendo o vocabulário de maior frequência, incluindo uma proporção de itens de vocabulário internacional compartilhados.
B1 - Os alunos deste nível podem ler textos factuais diretos sobre assuntos relacionados ao seu campo de interesse com um nível satisfatório de compreensão.
B2 - Os alunos deste nível podem ler com um alto grau de independência, adaptando o estilo e a velocidade de leitura a diferentes textos e propósitos, e usando fontes de referência apropriadas seletivamente. Tem um amplo vocabulário de leitura ativa, mas pode ter alguma dificuldade com expressões idiomáticas de baixa frequência.
C1 - Os alunos deste nível podem entender em detalhes textos longos e complexos, estejam eles relacionados ou não à sua própria área de especialidade, desde que possam reler seções difíceis. Eles também podem entender uma grande variedade de textos, incluindo escritos literários, artigos de jornais ou revistas e publicações acadêmicas ou profissionais especializadas, desde que haja oportunidades de releitura e tenham acesso a ferramentas de referência.
C2 - Alunos deste nível podem entender virtualmente todos os tipos de textos, incluindo escritos abstratos, estruturalmente complexos ou altamente coloquiais, literários e não literários. Eles também podem entender uma grande variedade de textos longos e complexos, apreciando sutis distinções de estilo e significado implícito e explícito.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
CEFR specifications for reading comprehension in Arabic (ar)
You are an expert in language proficiency classification based on the Common European Framework of Reference for Languages (CEFR). Your task is to analyze the given Arabic text or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR descriptors of reading comprehension for learners below:
󰁀 󰆳󰆎 ﺓﺍﻭ 󰃎󰃆󰄨󰄚 ،ﻭ ﺍً ﺓ ﺹ  ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  – A1
.ﺍ  ﺓﺀﺍﺍ ﺓﺩ󰌣ﻭ ،ﺍ ﺍ ﺕﺍﺭﺍﻭ ﺕ󰀖󰀞ﺍﻭ ﺀ󰄺󰄫ﺍ ﻁﺍﻭ ،ﺓ
ﺕﺍﺩﺍ ﺃ  ﻱ ﻭ ﺓ ﺹ  ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  – A2
.󰂯󰂋ﺍ ﻭ󰃕󰃀ﺍ ﺕﺍﺩﺍ  ﺀ 󰊜󰊈ﺫ 󰆳󰆎  ،ً
ﻝ  ﺍ ﻝ ﺓ ﺍﻭ ﺹ ﺓﺀﺍ ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  – B1
.ﺍ  ٍﺽ ﻯ 󰉼󰓜ﺍ
ﺏﺃ ﻭ ،ﺍ   ﺭ ﺓﺀﺍﺍ ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  – B2
󰁀󰀞 ﺍ ﺍﺍ ﺭﺩ ﻡﺍﺍﻭ ،ﺽﺍﺍﻭ ﺹﺍ ﻑ ًﻭ ﺓﺀﺍﺍ ﻭ
ﺍ  ﺍ  ﻥﺍ  ـ ،ﺍﻭ  ﺓﺀﺍ ﺕﺍﺩ 󰉼󰈸󰃕󰃀 .󰆳󰅗ﺍ
.ﻡﺍﺍ ﺓﺭﺩ ﺍ
󰁅 ﺀﺍ ، ﺓﻭ 󰃎󰃆 ﺹ  ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  – C1
.ﺍ ﺍ ﺓﺀﺍ ﺓﺩﺇ  ﺭﺩ ﺍ ﻥﺃ ﻁ ، ﻡﺃ 󰉾 ﻝ 
ﺍ ﺕﻭ ،ﺩﺍ ﻝ󰄨󰄇ﺍ 󰊜󰊈ﺫ 󰆳󰆎  ،ﺹﺍ  ﺍﻭ 󰄨  ًﺃ 󰉼󰉫
ﺓﺀﺍﺍ ﺓﺩ ﺹ  ﻁ ،ﺍ ﺍ ﻭﺃ ﺩ󰁅ﺍ ﺕﺍﺭﺍﻭ ،ﺕﺍ ﻭﺃ
.ﺍﺍ ﺕﺍﻭﺩﺃ 󰆭󰆠ﺇ ﻝﺍﻭ
ﺕﺍ 󰊜󰊈ﺫ 󰆳󰆎  ،ً ﺹﺍ ﻉﺍﺃ 󰄨󰄚  ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  – C2
ًﺃ 󰉼󰉫 .ﺍ ﺓ ﻭﺃ ﺍ   ﺓ ﻭﺃ ﺓﺩﻥ ﺍ ﺩﺍ ﻭ ﺩﺍ
󰆳󰅤ﺍﻭﺏﺍ 󰆳󰆎 󰃕󰃀ﺍ ﺕﻭﺍ ﻭ،ﺓﺍﻭ󰃎󰃆ﺍﺹﺍ  ﺓ 󰄨
.ﺍﻭ ﺍ
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:

CEFR specifications for reading comprehension in Hindi (hi)
You are an expert in language proficiency classification based on the Common European Framework of Reference for Languages (CEFR). Your task is to analyze the given Hindi text or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR descriptors of reading comprehension for learners below:
[Hindi-language reading comprehension descriptors for levels A1 to C2; the Devanagari script is not recoverable from this extraction.]
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
CEFR specifications for reading comprehension in Russian (ru)
You are an expert in language proficiency classification based on the Common European Framework of Reference for Languages (CEFR). Your task is to analyze the given Russian text or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR descriptors of reading comprehension for learners below:
A1 - учащиеся этого уровня могут понимать очень короткие, простые тексты по одной фразе за раз, подбирая знакомые имена, слова и основные фразы и перечитывая их по мере необходимости.
A2 - учащиеся этого уровня могут понимать короткие, простые тексты, содержащие наиболее частотную лексику, включая часть общих международных словарных единиц.
B1 - учащиеся этого уровня могут читать простые фактические тексты по темам, связанным с их областью интересов, с удовлетворительным уровнем понимания.
B2 - учащиеся этого уровня могут читать с большой степенью независимости, адаптируя стиль и скорость чтения к различным текстам и целям и выборочно используя соответствующие справочные источники. Имеет широкий активный словарный запас чтения, но может испытывать некоторые трудности с редко встречающимися идиомами.
C1 - учащиеся этого уровня могут понимать в деталях длинные, сложные тексты, независимо от того, относятся ли они к их собственной области специализации, при условии, что они могут перечитывать сложные разделы. Они также могут понимать широкий спектр текстов, включая литературные произведения, газетные или журнальные статьи, а также специализированные академические или профессиональные публикации, при условии, что есть возможности для перечитывания и у них есть доступ к справочным материалам.
C2 - Учащиеся этого уровня могут понимать практически все типы текстов, включая абстрактные, структурно сложные или очень разговорные литературные и нелитературные произведения. Они также могут понимать широкий спектр длинных и сложных текстов, оценивая тонкие различия стиля и неявного, а также явного значения.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
CEFR specifications for reading comprehension in Welsh (CY)
You are an expert in language proficiency classification based on the Common European Framework of Reference for Languages (CEFR). Your task is to analyze the given Welsh text or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR descriptors of reading comprehension for learners below:
A1 - Gall dysgwyr y lefel hon ddeall testunau byr iawn, syml un cymal ar y tro, gan godi enwau, geiriau ac ymadroddion sylfaenol cyfarwydd ac ailddarllen yn ôl yr angen.
A2 - Gall dysgwyr y lefel hon ddeall testunau byr, syml sy’n cynnwys yr eirfa fwyaf aml, gan gynnwys cyfran o eitemau geirfa ryngwladol a rennir.
B1 - Gall dysgwyr y lefel hon ddarllen testunau ffeithiol syml ar bynciau sy’n ymwneud â’u maes diddordeb gyda lefel foddhaol o ddealltwriaeth.
B2 - Gall dysgwyr y lefel hon ddarllen yn annibynnol iawn, gan addasu arddull a chyflymder darllen i wahanol destunau a dibenion, a defnyddio ffynonellau cyfeirio priodol yn ddetholus. Yn meddu ar eirfa ddarllen weithredol eang, ond gall brofi peth anhawster gydag idiomau amledd isel.
C1 - Gall dysgwyr y lefel hon ddeall yn fanwl destunau hir a chymhleth, p’un a yw’r rhain yn ymwneud â’u maes arbenigedd eu hunain ai peidio, ar yr amod eu bod yn gallu ailddarllen adrannau anodd. Gallant hefyd ddeall amrywiaeth eang o destunau gan gynnwys ysgrifau llenyddol, erthyglau papur newydd neu gylchgronau, a chyhoeddiadau academaidd neu broffesiynol arbenigol, ar yr amod bod cyfleoedd i’w hail-ddarllen a bod offer cyfeirio ar gael iddynt.
C2 - Gall dysgwyr ar y lefel hon ddeall bron bob math o destunau gan gynnwys ysgrifau llenyddol ac anllenyddol haniaethol, strwythurol gymhleth, neu ysgrifau llenyddol ac anllenyddol hynod lafar. Gallant hefyd ddeall ystod eang o destunau hir a chymhleth, gan werthfawrogi gwahaniaethau cynnil o ran arddull ac ystyr ymhlyg yn ogystal ag ystyr amlwg.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
CEFR specifications for written production in English (EN)
You are an expert in language proficiency classification based on the Common European Framework of Reference for Languages (CEFR). Your task is to analyze the given text or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR descriptors of reading comprehension for learners below:
A1 - Learners of this level can give information about matters of personal relevance (e.g. likes and dislikes, family, pets) using simple words/signs and basic expressions. Learners can also produce simple isolated phrases and sentences.
A2 - Learners of this level can produce a series of simple phrases and sentences linked with simple connectors like “and”, “but” and “because”. Learners have sufficient vocabulary for the expression of basic communicative needs and for coping with simple survival needs.
B1 - Learners of this level can produce straightforward connected texts on a range of familiar subjects within their field of interest, by linking a series of shorter discrete elements into a linear sequence. Learners have a good range of vocabulary related to familiar topics and everyday situations.
B2 - Learners of this level can produce clear, detailed texts on a variety of subjects related to their field of interest, synthesising and evaluating information and arguments from a number of sources. Learners have a good range of vocabulary for matters connected to their field and most general topics.
C1 - Learners of this level can produce clear, well-structured texts of complex subjects, underlining the relevant salient issues, expanding and supporting points of view at some length with subsidiary points, reasons and relevant examples, and rounding off with an appropriate conclusion. Learners can also employ the structure and conventions of a variety of genres, varying the tone, style and register according to addressee, text type and theme.
C2 - Learners of this level can produce clear, smoothly flowing, complex texts in an appropriate and effective style and a logical structure which helps the reader identify significant points. Learners have a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
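Because the templates ask the model to output only the CEFR level, the generated string still has to be mapped to one of the six labels. The sketch below shows one way this could be done; how the outputs were actually post-processed is not described in this appendix, so the regular expression and the function name `parse_cefr_level` are assumptions.

```python
import re
from typing import Optional

# Matches a standalone level such as "A1" or "C2" (case-insensitive via upper()).
LEVEL_PATTERN = re.compile(r"\b([ABC][12])\b")


def parse_cefr_level(generation: str) -> Optional[str]:
    """Return the first CEFR level found in the model output, or None."""
    match = LEVEL_PATTERN.search(generation.upper())
    return match.group(1) if match else None
```

Outputs that contain no recognizable level return `None`, which lets the caller decide whether to count them as errors or to re-prompt.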
CEFR specifications for written production in Estonian (ET)
You are an expert in language proficiency classification based on the Common European Framework of Reference for Languages (CEFR). Your task is to analyze the given Estonian text or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR descriptors of reading comprehension for learners below:
A1 – Selle taseme õppijad suudavad anda teavet isiklikult olulistel teemadel (nt meeldimised ja mittemeeldimised, perekond, lemmikloomad), kasutades lihtsaid sõnu/viipeid ja põhilisi väljendeid. Õppijad suudavad moodustada ka lihtsaid üksikuid fraase ja lauseid.
A2 – Selle taseme õppijad suudavad toota lihtsate fraaside ja lausete jada, mis on seotud lihtsate sidesõnadega nagu „ja“, „aga“ ja „sest“. Neil on piisav sõnavara põhiliste suhtlusvajaduste ja lihtsate ellujäämisvajaduste rahuldamiseks.
B1 – Selle taseme õppijad suudavad koostada arusaadavaid, seotud tekste tuttavatel teemadel oma huvivaldkonnas, sidudes lühemaid üksikuid elemente lineaarseks järjestuseks. Neil on hea sõnavara tuttavate teemade ja igapäevaste olukordade kirjeldamiseks.
B2 – Selle taseme õppijad suudavad koostada selgeid ja üksikasjalikke tekste erinevatel nende huvivaldkonnaga seotud teemadel, sünteesides ja hinnates teavet ja argumente mitmest allikast. Neil on hea sõnavara oma valdkonnaga seotud teemadeks ning enamike üldiste teemade jaoks.
C1 – Selle taseme õppijad suudavad koostada selgeid ja hästi struktureeritud tekste keerukatel teemadel, tuues esile olulised küsimused, laiendades ja toetades seisukohti üksikasjalikult koos täiendavate punktide, põhjuste ja asjakohaste näidetega ning lõpetades sobiva järeldusega. Samuti suudavad nad kasutada erinevate žanrite struktuuri ja konventsioone ning varieerida tooni, stiili ja registrit vastavalt adressaadile, tekstiliigile ja teemale.
C2 – Selle taseme õppijad suudavad koostada selgeid, sujuvaid ja keerukaid tekste sobivas ja tõhusas stiilis ning loogilises struktuuris, mis aitab lugejal tuvastada olulisi punkte. Neil on väga lai sõnavara, mis sisaldab idioome ja kõnekeelseid väljendeid; nad tunnetavad ka tähenduse konnotatiivseid tasandeid.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:

CEFR specifications for written production in Portuguese (PT)
You are an expert in language proficiency classification based on the Common European
Framework of Reference for Languages (CEFR). Your task is to analyze the given Portuguese text
or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR
descriptors for reading comprehension of learners below:
A1 – Os aprendentes deste nível conseguem fornecer informações sobre assuntos de relevância
pessoal (por exemplo, gostos e preferências, família, animais de estimação) usando palavras/sinais
simples e expressões básicas. Também conseguem produzir frases e expressões simples e isoladas.
A2 – Os aprendentes deste nível conseguem produzir uma série de frases e expressões simples
ligadas por conectores básicos como “e”, “mas” e “porque”. Têm vocabulário suficiente para
expressar necessidades comunicativas básicas e lidar com necessidades simples de sobrevivência.
B1 – Os aprendentes deste nível conseguem produzir textos simples e coerentes sobre uma
variedade de temas familiares dentro de seu campo de interesse, ligando uma série de elementos
mais curtos em sequência linear. Possuem um bom repertório de vocabulário relacionado a temas
familiares e situações do cotidiano.
B2 – Os aprendentes deste nível conseguem produzir textos claros e detalhados sobre uma
variedade de assuntos relacionados ao seu campo de interesse, sintetizando e avaliando informações
e argumentos de várias fontes. Têm um bom vocabulário para assuntos relacionados à sua área e à
maioria dos temas gerais.
C1 – Os aprendentes deste nível conseguem produzir textos claros e bem estruturados sobre
temas complexos, destacando os pontos relevantes, desenvolvendo e sustentando pontos de vista
com argumentos secundários, razões e exemplos pertinentes, e encerrando com uma conclusão
apropriada. Também conseguem empregar a estrutura e as convenções de diferentes gêneros
textuais, variando o tom, o estilo e o registro conforme o destinatário, o tipo de texto e o tema.
C2 – Os aprendentes deste nível conseguem produzir textos claros, fluidos e complexos em um
estilo apropriado e eficaz, com uma estrutura lógica que ajuda o leitor a identificar os pontos
significativos. Têm um excelente domínio de um repertório lexical muito amplo, incluindo
expressões idiomáticas e coloquialismos, e demonstram consciência dos níveis conotativos de
significado.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
CEFR specifications for written production in Arabic (ar)
You are an expert in language proficiency classification based on the
Common European Framework of Reference for Languages (CEFR).
Your task is to analyze the given Arabic text or narrative and
determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on
the CEFR descriptors for reading comprehension of learners below:
:) 󰃎󰃆 ﺕﺍﺫ ﺍ ﻝ ﺕ 󰇶󰇚 ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  - A1
ﺕﺍﻭ  ﺕﺍﺭﺇ/ﺕ󰀖 ﻡﺍ (ﺍ ﺕﺍﺍ ،󰃎󰃆ﺍ ، ﻭ  
.ﺓﺩﻭ  󰄨󰄚ﻭ ﺕﺍﺭ ﺝﺇ ًﺃ 󰉼󰉫 󰁿󰁱󰁑 .ﺃ
ﺍﺍ ﺍ ﺍﻭ ﺕﺍﺭﺍ  󰃎󰃆 ﺝﺇ ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  - A2
 󰊎󰊈 󰁅 ﺕﺍﺩ 󰉼󰈸󰃕󰃀 ."ﻥ"ﻭ ،"ـ" ،"ﻭ"   ﺭ ﺕﺍﻭﺩﺃ ﻡﺍ
.ﺍ ﺀﺍ ﺕﺍ  ﺍﻭ ﺍ ﺍﺍ ﺕﺍ
ﺍﺍ  󰄨 ﻝ ﻭ ﺍ ﺹ ﺝﺇ ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  - B1
 󰆳󰆎 ﺍ 󰃎󰃆ﺍ ﺍ  󰃎󰃆 ﺭ   ،󰉼󰓜ﺍ ﻝ󰆳󰆎 ﺍ
.ﺍ ﺍﺍﻭ ﺍ ﺍ ﺍ ﺕﺍﺩﺍ  ﺓ 󰄨 󰉼󰈸󰃕󰃀 .
  󰄨 ﻝ 󰃎󰃆ﻭ ﺍﻭ ﺹ ﺝﺇ ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  - B2
.ﺭﺩﺍ  ﺩ  󰌔󰋿ﺍﻭ ﺕﺍ ﻭ   ،󰉼󰓜ﺍ ﻝ ﺍ ﺍﺍ
.ﺍ ﺍﺍ ﻭ  ﺍ ﺭ ﺍ ﺕﺍﺩﺍ  ﺓ 󰄨 󰉼󰈸󰃕󰃀
ﺍ ﻝ  󰁀󰀞 ﻭ ﺍﻭ ﺹ ﺝﺇ ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  - C1
ﺏﺃﻭ  ﺭ󰁅󰀞 ﺍ ﻁﺍ ﻭ ،󰃎󰃆ﺍ ﺕﺍﺫ ﺓﺯﺭﺍ ﺍ ﺯﺍﺇ  ،ﺓ
  󰄨 ﺕﺍﻭ  ﻡﺍﺍ 󰉼󰉫 󰁿󰁱󰁑 .  ﺀ󰉼󰉮󰙗ﺍﻭ ، 󰃎󰃆ﺃﻭ
.ﻉﺍﻭ ﺍ ﻉﻭ ﺍ  ﺍﻭ ﺏﺍﻭ ﺍ ﻭ ،ﺍ ﻁﺍ
 ﺏ ﺓﻭ ،ﻭ ،ﺍﻭ ﺹ ﺝﺇ ﻯﺍ ﺍ 󰆳󰆎 󰋜󰊈  - C2
ﺍﻭ 󰄨 󰆳󰆎   󰉼󰈸󰃕󰃀 .ﺍ ﻁﺍ   ﺉﺭﺍ   ﻭ ﻝﻭ
ﺕ ًﻭ ﻥﻭُﻭ ؛ﺍ ﺕ󰀖󰀞ﺍﻭ ﺍ ﺍ  ﺕﺍﺩﺍ  ﺍً
.ﺍ ﺍ
Provide only the CEFR level as output directly, without explanation
or justification.
Text: «TEXT»
Answer:
CEFR specifications for written production in Hindi (hi)
You are an expert in language proficiency classification based on the Common European
Framework of Reference for Languages (CEFR). Your task is to analyze the given Hindi text or
narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR
descriptors for reading comprehension of learners below:
[Hindi-language CEFR descriptors for levels A1 to C2; the Devanagari text did not survive extraction.]
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
CEFR specifications for written production in Russian (ru)
You are an expert in language proficiency classification based on the Common European
Framework of Reference for Languages (CEFR). Your task is to analyze the given Russian text
or narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR
descriptors for reading comprehension of learners below:
A1 — Обучающиеся на этом уровне могут сообщать информацию на личные
темы (например, о своих предпочтениях, семье, домашних животных), используя
простые слова/жесты и базовые выражения. Также они могут составлять простые
отдельные фразы и предложения.
A2 — Обучающиеся на этом уровне могут строить серию простых фраз и предло-
жений, соединённых с помощью простых союзов, таких как «и», «но» и «потому
что». У них есть достаточный словарный запас для выражения базовых коммуни-
кативных потребностей и для решения простых бытовых задач.
B1 — Обучающиеся на этом уровне могут создавать понятные связные тексты на
знакомые темы в рамках своей области интересов, объединяя серию коротких,
отдельных элементов в линейную последовательность. У них хороший запас
слов, связанных с повседневными ситуациями и знакомыми темами.
B2 — Обучающиеся на этом уровне могут писать чёткие и подробные тексты
по различным темам, связанным с их сферой интересов, обобщая и оценивая
информацию и аргументы из нескольких источников. У них хороший словарный
запас по тематике своей области и большинству общих тем.
C1 — Обучающиеся на этом уровне могут создавать чёткие, хорошо структури-
рованные тексты на сложные темы, подчёркивая важные аспекты, развивая и
обосновывая свою точку зрения с помощью дополнительных аргументов, причин
и релевантных примеров, и завершать текст уместным заключением. Они также
могут применять структуру и нормы различных жанров, варьируя тон, стиль и
регистр в зависимости от адресата, типа текста и темы.
C2 - Обучающиеся на этом уровне могут создавать чёткие, плавные и сложные
тексты в уместном и эффективном стиле, с логичной структурой, которая по-
могает читателю выделять важные моменты. У них отличное владение очень
широким лексическим запасом, включая идиоматические выражения и разго-
ворную лексику; они осознают коннотативные уровни значений.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
CEFR specifications for written production in Welsh (CY)
You are an expert in language proficiency classification based on the Common European
Framework of Reference for Languages (CEFR). Your task is to analyze the given Welsh text or
narrative and determine the best CEFR level [A1, A2, B1, B2, C1, or C2] based on the CEFR
descriptors for reading comprehension of learners below:
A1 – Gall dysgwyr ar y lefel hon roi gwybodaeth am faterion o berthnasedd personol (e.e. pethau
maen nhw’n eu hoffi a’u casáu, teulu, anifeiliaid anwes) gan ddefnyddio geiriau/arwyddion syml
ac ymadroddion sylfaenol. Gall dysgwyr hefyd gynhyrchu brawddegau ac ymadroddion syml,
a wahanol.
A2 – Gall dysgwyr ar y lefel hon gynhyrchu cyfres o ymadroddion a brawddegau syml wedi’u
cysylltu gan gysyllteiriau syml fel “a”, “ond” a “oherwydd”. Mae gan ddysgwyr eirfa ddigonol i
fynegi anghenion cyfathrebu sylfaenol ac i ymdopi ag anghenion goroesi syml.
B1 – Gall dysgwyr ar y lefel hon gynhyrchu testunau cysylltiedig, uniongyrchol ar ystod o bynciau
cyfarwydd o fewn eu maes diddordeb, drwy gysylltu cyfres o elfennau byrrach ar wahân i mewn
i ddilyniannol linol. Mae ganddynt ystod dda o eirfa sy’n ymwneud â phethau cyfarwydd a
sefyllfaoedd bob dydd.
B2 – Gall dysgwyr ar y lefel hon gynhyrchu testunau clir, manwl ar amrywiaeth o bynciau
sy’n gysylltiedig â’u maes diddordeb, gan gyfuno a gwerthuso gwybodaeth a dadleuon o sawl
ffynhonnell. Mae ganddynt ystod dda o eirfa ar gyfer materion sy’n gysylltiedig â’u maes ac ar
gyfer y rhan fwyaf o bynciau cyffredinol.
C1 – Gall dysgwyr ar y lefel hon gynhyrchu testunau clir, wedi’u strwythuro’n dda ar bynciau
cymhleth, gan amlygu’r materion perthnasol, ehangu a chefnogi safbwyntiau’n fanwl gyda
phwyntiau ategol, rhesymau ac enghreifftiau perthnasol, a gorffen gyda chasgliad priodol. Gallant
hefyd ddefnyddio strwythur a chonfensiynau amrywiaeth o genres, gan amrywio’r naws, arddull a
chofrestr yn ôl y derbynnydd, math y testun a’r thema.
C2 – Gall dysgwyr ar y lefel hon gynhyrchu testunau clir, esmwyth a chymhleth mewn arddull
briodol ac effeithiol ac mewn strwythur rhesymegol sy’n helpu’r darllenydd i nodi pwyntiau
arwyddocaol. Mae ganddynt reolaeth dda dros eirfa eang iawn gan gynnwys ymadroddion
idiomatig a llafariad; maent yn dangos ymwybyddiaeth o lefelau ystyron cynhennus.
Provide only the CEFR level as output directly, without explanation or justification.
Text: «TEXT»
Answer:
