Democracy made easy: simplifying complex topics to enable democratic participation

Author: Khallaf, Nouran; Bott, Stefan Markus; Eugeni, Carlo; O'Flaherty, John; Sharoff, Serge; Saggion, Horacio
Publisher: Zenodo
DOI: 10.5281/zenodo.17700966
Source: https://zenodo.org/records/17700966/files/2025.aielpl-1.10.pdf
Proceedings of the 1st Workshop on Artificial Intelligence and Easy and Plain Language in Institutional Contexts (AI & EL/PL), pages 108–124, June 23, 2025
Democracy Made Easy: Simplifying Complex Topics to Enable Democratic Participation
Nouran Khallaf1, Stefan Bott2, Carlo Eugeni1, John O’Flaherty3, Serge Sharoff1, Horacio Saggion2
1 School of Languages, Cultures and Societies, University of Leeds, United Kingdom
2 Department of Engineering, Universitat Pompeu Fabra, Spain
3 The National Microelectronics Applications Centre (MAC) Ltd, Ireland
Abstract
Several groups of people are excluded from democratic deliberation because the language used in this context may be too difficult for them to understand. Our iDEM project aims to reduce existing linguistic barriers in deliberative processes by developing technology to facilitate the translation of complicated text into Easy-to-Read formats that are more suitable for many people. In this paper, we describe classification experiments for detecting different types of difficulties which should be amended in order to make texts easier to understand. We focus on a lexical simplification system that can achieve state-of-the-art results with the use of a free and open-weight large language model for the Romance Languages in our project. Moreover, a sentence segmentation system is introduced to produce text segmentation for long sentences based on training data. Finally, we describe the iDEM mobile app, which will make our technology available as a service to end-users of our target populations.
1 Introduction
Representative democracy is based on delegating policy matters to elected representatives, while the deliberative democratic process aims at involving the stakeholders directly (Bächtiger et al., 2018). Modern democratic institutions aim at a greater focus on stakeholders’ involvement. However, this requires clearer language that is accessible to the stakeholders, especially where they face challenges in understanding, as is the case for people with intellectual disabilities or non-native speakers. The demand for better communication is also reflected in international treaties; in particular, Article 19 of the Universal Declaration of Human Rights affirms everyone’s right to seek and receive information.
© 2025 The authors. This article is licensed under a Creative Commons 4.0 licence, no derivative works, attribution, CC-BY-ND.
Moreover, particularly important in this context is the United Nations Convention on the Rights of Persons with Disabilities (CRPD), which includes accessibility as one key enabler for a more inclusive society. That is, the ability of any product, service, content, environment, etc., to be used by people with the widest range of abilities (including linguistic and cognitive abilities). The CRPD also considers accessibility as, for example, an enabler for democratic participation rights such as freedom of expression and self-determination. Consequently, a lack of accessibility can be linked to a risk of exclusion of persons who cannot participate equally due to linguistic barriers.
The focus of our paper is on providing an introduction into technologies developed in the context of our project, iDEM [1][2]: in the area of intersectionality and equality in deliberative and participatory democratic spaces, iDEM aims at making information more accessible and inclusive in the context of democracy and in particular in deliberative and participatory processes. More specifically, in this paper, we will discuss:
1. A tool for assessing sentence-level complexity and predicting appropriate simplification strategies;
2. Applications of these tools to real-world corpora, including the United Nations Parallel Corpus and Europarl;
3. A text simplification pipeline powered by Large Language Models (LLMs), focusing on lexical simplification and Easy-to-Read (E2R) sentence segmentation.
The rest of the paper is structured as follows: Section 2 provides an overview of the iDEM project. Section 3 reviews related work on complexity assessment and text simplification. Section 4 details our sentence complexity classifier and simplification approach, including the evaluation results. Section 5 outlines the mobile application in development, while Section 6 discusses current limitations. Finally, Section 7 offers concluding remarks.
[1] Innovative and Inclusive Democratic Spaces for Deliberation and Participation
[2] https://idemproject.eu/
2 Project Overview
The iDEM Project in the area of intersectionality and equality in deliberative and participatory democratic spaces aims at making information more accessible and inclusive in the context of democracy and, in particular, in deliberative and participatory processes. In the first phase of the project we have investigated, using a theoretical approach, current marginalization from deliberative processes of diverse underrepresented groups due to language skills, in order to understand which linguistic barriers hamper their participation. By working with different associations, iDEM adopts a user-centered approach in use case design and data collection and annotation to ensure maximum impact in the community, thus contributing to making democracy more accessible and inclusive. An innovative iDEM service (e.g., mobile app) is being implemented to host the developed language technologies to support on-demand simplification in Catalan, English, Italian, and Spanish. In the current phase of the project, we are developing the underlying natural language processing technology as well as fine-tuning the use cases to test and evaluate our proposed approach for a more inclusive deliberative democracy. The interested reader is referred to Saggion et al. (2024b) for an overview of the project.
3 Related Work
3.1 Easy-to-Read
Since the late nineties, many organisations have raised awareness about fundamental information being written in a way that is too difficult to understand for many people. Initiatives to palliate this deficit include “Plain Language” (U.S. Government, 2011) and “Easy-to-read” (Inclusion Europe, 2009). They are two different methods whose overall objective is to make information more accessible. They proposed guidelines for how to write more accessible texts; however, applying them to produce accessible material heavily depends on well-trained human editors and, therefore, considerably limits the production of easy-to-read or plain language texts.
Compared to standard language, easy-to-read language is a simplified version for the sake of readability for specific audiences (Ca o, 2017). In this paper, we adopt E2R over Plain Language because its structured guidelines form the foundation for diverse and adaptable translation strategies designed to make information accessible to people with reading difficulties, including people with intellectual disabilities. These readers have little command of the language and poor literacy.
Empirical research in the field is uncommon (González-Sordé and Matamala, 2024), especially when compared to fields such as automatic text simplification. Although the topic has gained greater scholarly attention in recent years, research sometimes reports apparently contradictory findings (Fajardo et al., 2014) between guidelines and actual text understanding by target E2R populations; moreover, even guidelines appear to take on different aspects with little agreement between them (Maaß, 2020). With the advances that natural language processing has achieved in recent years, interest in the automatic adaptation of texts to plain language or E2R has intensified (Alarcon et al., 2021; Da Cunha Fanego, 2021; Saggion, 2024).
3.2 Complexity Identification
The first focus of our research within iDEM is on theoretically understanding the factors that contribute to the complexity of a text or the sentences within this text. The guidelines described in the previous section are directed to human editors; they often leave much room for interpretation and are hard to operationalise, for example the instruction to avoid difficult words. We are interested in combining theoretical insights with data-driven analysis and classification.
In previous work, computational studies typically overlook insights from translation studies, particularly the various strategies proposed (Vinay and Darbelnet, 1971; Newmark, 1988; Chesterman, 1997; Zabalbeascoa, 2000; Molina and Hurtado Albir, 2002; Gambier, 2006), focusing on the systematic processes involved in translating a source text into a target text across languages. Translation studies provide a complementary approach in examining strategies used in intra-lingual translation, where a source text is translated into a target text in the same language. Eugeni and Gambier (2023) argue that such transfers habitually achieve a complete correspondence between source and target texts. One key task in order to transform sentences into E2R is lexical simplification, i.e., simplifying individual words or short phrases independent of the effect of such simplifications on the overall sentence coherence. For instance, Paetzold and Specia (2016) developed methods that specifically targeted complex word identification (CWI), which detects difficult words and suggests simple alternatives. These techniques usually ignore how such simple words would fit the general sentence structure.
Datasets developed to evaluate lexical simplification, e.g., SemEval-2012 Task 1 (Specia et al., 2012), ALEXSIS (Ferrés and Saggion, 2022), TSAR 2022 (Saggion et al., 2022) or MLSP 2024 (Shardlow et al., 2024), have encouraged a focus on single word-level replacements. Though helpful, these datasets primarily cover single word substitutions in isolation rather than more general context-sensitive simplifications. As a result, simplifications generated with the assistance of these tools sometimes sound unnatural and need a post-generation model to refine sentence coherence. This issue was also highlighted by Shardlow (2014), who reviewed various lexical simplification approaches and noted that, while effective for readability, they frequently ignore sentence coherence and grammatical correctness.
Corpora for sentence simplification include ASSET (Alva-Manchego et al., 2020), which provides multiple quality simplifications per sentence. However, ASSET still focuses to some extent on fine-grained lexical or phrase-level modifications and lacks annotations for deeper grammatical or discourse-level modifications. Similarly, WikiLarge (Zhang and Lapata, 2017) provides a large set of parallel sentence pairs for training simplification models but does not explicitly annotate the simplification strategies, making it difficult to study in detail exactly how sentences are simplified. The Simplext corpus (Saggion et al., 2015) provides full document simplifications following E2R guidelines for the Spanish language without indication of transformation type, while PorSimples (Aluísio and Gasperin, 2010) provides document simplification in Portuguese covering several operations.
3.3 Text Simplification
Our focus for this paper is on lexical simplification; for an overview of full text simplification approaches and methods, the reader is referred to Saggion (2017). Several past approaches to lexical simplification used traditional count-based word vectors and available dictionaries for modelling word semantics and to select simple word replacements for complex words (Biran et al., 2011; Bott et al., 2012); in later work, word embeddings learned from huge text collections were used (Glavaš and Štajner, 2015). More recently, large-scale language models such as BERT and its variations have been applied to predict substitution candidates for complex words. For example, LS-BERT (Qiang et al., 2020) uses the masked language model (MLM) of BERT to predict a set of candidate substitution words and their associated probability. In this context, the MLM predicts substitute words which are ranked for simplicity using: probabilities, a language model, a paraphrase database, word frequency and word semantic similarity with the target word. Very recent work presented in the TSAR 2022 (Saggion et al., 2022) and MLSP 2024 (Shardlow et al., 2024) evaluation frameworks has demonstrated that Large Language Models (LLMs) are in fact the best performing models for lexical simplification. Techniques such as “prompting” are used to condition the LLMs to produce a simplification or to suggest alternative words. Note, however, that these models underperform when simplifying low-resourced languages. We define ‘low-resourced languages’ as those with limited digital text corpora (e.g., Catalan vs. English), impacting LLM performance as noted in Section 4.4.
Despite advances in lexical simplification (e.g., TSAR 2022, MLSP 2024), key gaps remain: (1) How can simplification strategies be systematically categorised beyond lexical substitution? (2) What taxonomies exist for intralingual translation, and how do they apply to automation? Section 4.2 addresses these by proposing a strategy taxonomy, testing it on institutional corpora, and leveraging LLMs without prompt engineering, a less-explored approach due to its complexity (Shardlow et al., 2024).
4 Natural Language Processing for Easy-to-Read Translation
4.1 Datasets
We use a range of datasets across different components of our system [3]. The primary dataset used for complexity assessment and simplification strategy classification comprises 76 parallel texts collected from Scottish care services, UK political manifestos (2024), and Disability Equality Scotland newsletters. These cover diverse topics such as healthcare, environmental policies, disability advocacy, and accessibility. The texts were manually aligned at the sentence level, resulting in 4,166 words in 206 original (“complex”) sentences and 3,259 words in 210 simplified counterparts. Despite the reduction in word count, the number of sentences increased slightly, reflecting a key simplification strategy, namely splitting longer sentences to improve readability. We additionally use a French dataset of 370 manually aligned sentence pairs. The original texts were retrieved from the Réfugiés.info website and were anonymised to remove any personally identifiable information (PII) (Team, 2025). These parallel sentence pairs provide training data for our simplification strategy classifier (Section 4.2). [4]
[3] Where applicable, datasets are available on request from the authors or are publicly accessible through the cited sources.
For evaluating our system on larger, multilingual corpora, we use the European Parliament corpus (Koehn, 2005) and the United Nations Parallel Corpus (Ziemski et al., 2016). These are publicly available and provide high-quality sentence-aligned translations in English, Spanish, and Italian. We applied our multilingual classifier to these datasets to analyse simplification needs across languages (Section 4.3).
For lexical simplification, we use few-shot prompting on pre-trained Salamandra models with trial data from the MLSP 2024 shared task (Shardlow et al., 2024), covering English, Spanish, Italian, and Catalan (Section 4.4).
Finally, for sentence segmentation in Spanish according to E2R standards, we accessed a private annotated dataset provided by Calleja et al. (2024). This dataset includes 3,826 training, 484 validation, and 1,452 development sentences, each annotated with E2R-compatible cut points (Section 4.6).
4.2 Complexity Assessment
The simplification strategy prediction task aims to determine the types of transformations needed to make a sentence more accessible. Table 1 provides examples of these transformations.
[4] The English dataset was annotated by a linguist with expertise in translation and text simplification, using the same predefined set of simplification strategy categories described in Appendix B; the French dataset was labelled by the Réfugiés.info editorial team following the same guidelines and category definitions.
Our taxonomy is informed by Inclusion Europe’s guidelines (Inclusion Europe, 2009), intralingual translation practices into E2R (Hansen-Schirra et al., 2020), and a qualitative analysis of our dataset. While previous taxonomies in Translation Studies have offered valuable models for interlingual and diamesic translation, they lack the granularity needed to describe all strategies observed in E2R practice. On the other hand, typologies in Automatic Text Simplification (ATS) are based on corpus analysis (Bott and Saggion, 2014) or on edit operations that mainly deal with adding, deleting, replacing, and moving words (Cardon et al., 2022). However, texts translated into E2R language clearly show that professionals in the field apply many more operations that pertain to the field of pragmatics and semiotics, focused on how concepts are distributed and/or explained to help the user understand them.
To address this gap, our framework adapts insights from both domains. Based on Inclusion Europe’s three levels of simplification (lexical, syntactic, and semantic), we define eight macro-strategies that range along a continuum from additive operations (e.g., Explanation) to reductive ones (e.g., Omission). These are outlined in Table 2. The taxonomy comprises 8 macro-strategies (excluding Transcript, since it is a non-simplification operation), 8 strategies, and 30 micro-strategies. For the full set of strategies, see Table 10 (Appendix E).
Cross-linguistic differences in simplification strategies are also relevant. In our multilingual experiments, we observed variations in dominant strategies across English, Spanish, and Italian, which suggests that language-specific features influence how simplification is operationalised. This will be further explored in Section 4.3.
The classifier is built by applying pre-trained transformer-based models (such as multilingual BERT (Devlin et al., 2019)) to multiclass text classification, focusing on the prediction of the simplification strategies needed to simplify the respective sentences. We employed Stratified 5-fold Cross-Validation for rigorous evaluation and generalisation. We took the average of the validation scores across all the folds to determine the final scores. Early stopping was also employed, wherein the training was halted if the validation loss did not see an improvement for the patience period.
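The evaluation protocol described above can be illustrated with a minimal sketch in plain Python. The function names, the round-robin fold assignment, and the `patience` default are assumptions made for illustration only; the paper does not specify its implementation, and the actual hyper-parameters are those listed in Table 8.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that roughly preserve label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training stops.

    Training halts once the validation loss has failed to improve for
    `patience` consecutive epochs; otherwise it runs to the last epoch.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses) - 1


def cross_validated_score(fold_scores):
    """Average the per-fold validation scores into the final score."""
    return sum(fold_scores) / len(fold_scores)
```

Each fold serves once as the validation set while the remaining folds are used for training; the final score is the mean over the five validation scores.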

Original sentence: In 2018-20 life expectancy at birth in Scotland was 76.8 years for males and 81.0 years for females.

| Simplified segment | Strategy |
|---|---|
| From 2018 to 2020 | Modulation |
| babies born in Scotland were expected to live | Explanation |
| 77 years if they were boys | Synonymy, Syntax |
| and 81 years if they were girls. | Synonymy, Syntax |

Table 1: Segment alignment of the original (top) and simplified (bottom) sentences

| Strategy | Description | Example |
|---|---|---|
| Omission | Removing unnecessary rhetorical or diamesic constructs. | “Sir Keir Rodney Starmer KCB KC” → “Starmer” |
| Compression | Simplifying grammatical/semantic constructs. | “to guide the group” → “to the group” |
| Syntactic Change | Adjustments between syntactic levels. | “citizens” → “people in Scotland” |
| Transcript | No changes made to the text. | “I love music” |
| Transposition | Word class change. | “our aim is” → “we want” |
| Synonymy | Simplifying technical or abstract words. | “conversation” → “talk” |
| Modulation | Redistributing information linearly. | “joins in activities... supported by family” → “He joins activities. His family helps.” |
| Explanation | Making hidden content or terms explicit. | “co-design services...” → “co-design means sharing your ideas” |
| Illocutionary Change | Making implied meaning explicit. | “know your body’s library” → “know your body” |

Table 2: Simplification strategies required for a sentence, with examples

Class imbalance in the data, with certain strategies being underrepresented, was a problem during training. To counter this, we used a weighted cross-entropy loss function. Class weights were calculated as the inverse frequency of each class:

$$ w_c = \frac{1}{\mathrm{freq}_c} \cdot \frac{N}{2} \tag{1} $$

where $w_c$ is the weight assigned to class $c$, $\mathrm{freq}_c$ is the frequency of class $c$, and $N$ is the number of samples. This approach ensured that underrepresented classes contributed more to the overall loss, so the model became more capable of predicting the minority classes.
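As an illustration of Equation 1, the following sketch computes inverse-frequency class weights and a weighted cross-entropy over a toy batch. It assumes that the class frequency is the raw count of samples with that label and that the loss is normalised by the total weight of the batch; neither detail is stated in the paper, and the function names are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Class weights w_c = (1 / freq_c) * (N / 2), with freq_c taken as a raw count (assumption)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {c: (1.0 / freq) * (n_samples / 2.0) for c, freq in counts.items()}


def weighted_cross_entropy(probabilities, gold_labels, weights):
    """Weighted negative log-likelihood of the gold classes.

    `probabilities` is a list of dicts mapping class -> predicted probability.
    The sum is normalised by the total weight of the batch (assumption).
    """
    total_loss = 0.0
    total_weight = 0.0
    for probs, gold in zip(probabilities, gold_labels):
        total_loss += weights[gold] * -math.log(probs[gold])
        total_weight += weights[gold]
    return total_loss / total_weight


# Toy label distribution: 6 majority-class and 2 minority-class sentences.
labels = ["Synonymy"] * 6 + ["Omission"] * 2
weights = inverse_frequency_weights(labels)
```

With this toy distribution the minority class Omission receives a larger weight than Synonymy, so a misclassified Omission sentence contributes more to the loss.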
Additionally, gradient clipping was applied during training to stabilise the optimisation. Gradient clipping limits the maximum value of gradients during backpropagation to prevent extremely large updates to model parameters that could destabilise training or lead to divergence. Mathematically, gradient clipping can be expressed as:

$$ g_{\text{clipped}} = \min\left(g, \frac{g_{\text{threshold}}}{\|g\|}\right), \tag{2} $$

where $g$ represents the original gradient vector, $g_{\text{threshold}}$ is the clipping threshold, and $\|g\|$ is the norm of the gradient vector. Gradient clipping ensures consistent updates to model parameters, improving training stability.
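Equation 2 is stated compactly. A common way to realise norm-based clipping is to rescale the gradient by min(1, threshold / norm), so that its direction is kept and its norm never exceeds the threshold. The sketch below follows that widely used form; it is an assumption about the intended operation, not a confirmed description of the training code, and the function name is hypothetical.

```python
import math


def clip_gradient_by_norm(gradient, threshold):
    """Rescale `gradient` so that its L2 norm does not exceed `threshold`.

    Uses the scaling factor min(1, threshold / ||g||), which leaves small
    gradients untouched and shrinks large ones while keeping their direction.
    """
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm == 0.0:
        return list(gradient)
    scale = min(1.0, threshold / norm)
    return [g * scale for g in gradient]


# A gradient of norm 5 is scaled down to norm 1; a gradient of norm 0.5 is unchanged.
large = clip_gradient_by_norm([3.0, 4.0], threshold=1.0)
small = clip_gradient_by_norm([0.3, 0.4], threshold=1.0)
```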
See the summary of hyper-parameters in Table 8 (Appendix B). Using medium-sized PLMs (such as multilingual BERT) instead of LLMs makes it feasible to apply the classifiers to large institutional datasets (such as the entirety of Europarl or the United Nations Corpus) and to deploy the classifiers to guide the corrections.
We used standard precision, recall, and F1-score metrics (Manning et al., 2008) to evaluate model performance. Given the class imbalance, we report the weighted macro F1-score (Sokolova and Lapalme, 2009), which better reflects the classifier’s ability to handle both frequent and rare simplification strategies. The fine-tuned classifier model achieved a weighted macro F1-score of 0.8089, demonstrating its ability to generalize across majority and minority classes. In particular, it outperformed the baseline majority-class strategy, which corresponds to a weighted macro F1-score of 0.096.
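To make the metric concrete, the sketch below computes a support-weighted average of per-class F1-scores, which is one standard reading of "weighted macro F1". The exact averaging used in the experiments is not spelled out in the paper, so this is an assumption, and the function names and toy labels are illustrative.

```python
from collections import Counter


def per_class_f1(gold, predicted):
    """F1-score for every class that appears in the gold or predicted labels."""
    scores = {}
    for c in set(gold) | set(predicted):
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, predicted) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, predicted) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[c] = 2 * precision * recall / (precision + recall)
        else:
            scores[c] = 0.0
    return scores


def weighted_f1(gold, predicted):
    """Average the per-class F1-scores, weighting each class by its gold support."""
    support = Counter(gold)
    scores = per_class_f1(gold, predicted)
    return sum(scores[c] * support[c] for c in support) / len(gold)


# A majority-class baseline predicts the most frequent label for every sentence.
gold = ["Synonymy", "Synonymy", "Synonymy", "Omission"]
majority_baseline = ["Synonymy"] * len(gold)
baseline_score = weighted_f1(gold, majority_baseline)
```

A majority-class baseline scores zero on every class it never predicts, which is why it is a natural lower bound for a classifier evaluated on imbalanced strategy labels.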
The F1 score of the multilingual model (trained on English, tested on French) is 0.6339, thus reflecting the need to improve its ability to generalize across languages. However, its errors are balanced, i.e., the model confuses Synonymy with Explanation and vice versa (see the confusion matrix in Figure 2, Appendix C). Omission and Compression categories tend to be confused with one another, with Omission commonly predicted as Explanation or Transcript, mirroring the need to enhance the separation between removal and rewriting strategies. Modulation is also commonly confused with Synonymy, mirroring the need to strengthen sentence restructuring cues in training.

| Category | English # Sent. | English % | Spanish # Sent. | Spanish % | Italian # Sent. | Italian % |
|---|---|---|---|---|---|---|
| Europarl | | | | | | |
| Total Sentences | 2,005,688 | 100 | 1,788,913 | 100 | 1,928,874 | 100 |
| Complex | 1,932,492 | 96.3 | 1,660,631 | 92.8 | 1,868,714 | 96.8 |
| Omission | 59,065 | 3.1 | 23 | 0.001 | 57 | 0.003 |
| Syntactic Change | 254,483 | 13.2 | 11,777 | 0.7 | 21,321 | 1.1 |
| Transposition | 13,075 | 0.7 | 35,053 | 2.1 | 40,633 | 2.2 |
| Synonymy | 1,104,564 | 57.2 | 37,259 | 2.2 | 81,468 | 4.4 |
| Modulation | 41,802 | 2.2 | 724,469 | 43.6 | 1,004,438 | 54.2 |
| Explanation | 459,503 | 23.8 | 852,050 | 51.3 | 702,526 | 37.9 |
| UN Corpus | | | | | | |
| Total Sentences | 10,600,000 | 100 | 10,665,709 | 100 | – | – |
| Complex | 9,628,533 | 96.2 | 9,987,750 | 93.6 | – | – |
| Omission | 75,217 | 0.7 | 62 | 0.0006 | – | – |
| Syntactic Change | 181,228 | 1.8 | 503,047 | 5.0 | – | – |
| Transposition | 39,356 | 0.4 | 68,878 | 0.7 | – | – |
| Synonymy | 4,587,340 | 45.0 | 198,479 | 1.9 | – | – |
| Modulation | 445,095 | 4.3 | 5,345,515 | 53.5 | – | – |
| Explanation | 4,878,679 | 47.7 | 3,871,769 | 38.7 | – | – |

Table 3: Sentence counts and proportions of simplification strategies in institutional datasets

4.3 Experiments with assessing institutional repositories
We experimented with two institutional repositories that include English, Italian and Spanish (some of the languages of our project): the corpus of the European Parliament (Koehn, 2005) and the United Nations Parallel Corpus (Ziemski et al., 2016). Both resources include high-quality translations, so the content of each sentence is the same. However, we can expect that the three languages differ in their traditions of maintaining linguistic complexity in such formal texts as the parliamentary proceedings. The Total Sentences row in Table 3 presents the amount of data in each dataset. We used sentence-aligned versions from the respective repositories and applied the multilingual classifiers described in the previous section to make predictions. If the complexity classifier detected the need to simplify the sentence, i.e., it was predicted as "Complex", we estimated the likely strategy needed for this task. As the classification model is limited to the one-label setup, out of several edit operations required for a sentence (see the example in Table 1), our current version of the model predicts the single most likely operation (Explanation in this example).
Table 3 shows that the majority of sentences in both datasets and across all the languages considered (English, Spanish, and Italian) require some form of simplification. For English sentences, the most common simplification operations found are (1) lexical substitution (Synonymy), primarily through the choice of simple synonyms, and (2) Explanation, which provides more explanation to facilitate reading.
Conversely, for both datasets of Spanish and Italian sentences, the predominant simplification strategy is Modulation, with a particular emphasis on sentence restructuring for the purpose of achieving a more linear and straightforward reading experience.
4.4 Simplifying Complex Words
As reported in recent lexical simplification challenges (i.e. TSAR 2022 (Saggion et al., 2022) and MLSP 2024 (Shardlow et al., 2024)), most recent state-of-the-art lexical simplification systems rely on decoder-only autoregressive LLMs like GPT-4 (Enomoto et al., 2024). These systems seem to systematically outperform other systems, like encoder-only language models (e.g. BERT), also because recent developments of LLMs have mostly concentrated on decoder models. Decoders are generally more flexible and have strong zero-shot or few-shot abilities. Commercial closed-weight models like GPT-4, however, raise concerns for the purpose of our project since they lack guarantees of privacy protection and generate costs by using the API. In addition, their closed nature does not usually allow us to fine-tune them.
In preliminary experiments, we found out that the LLMs of the Salamandra family [5] (Gonzalez-Agirre et al., 2025) perform very well on European Languages, especially on Romance languages, and within the last group, they especially excel at the performance of Catalan. This can be explained because Salamandra models are part of the Alia initiative (Government of Spain) funded by the Spanish government with a strong focus on languages spoken in Spain. Salamandra models were trained as decoder-only, and they are also provided as instruction-tuned versions. With this, we decided to use a simple few-shot system as our first approach.
Few-shot prediction from a pre-trained model refers to the process where a model that has already been trained on a large dataset (a pre-trained model) is used to make predictions or perform tasks with no or only a few labeled examples for a new task. The shots are examples provided in the prompt, as opposed to being used as training data for fine-tuning. Zero-shot prediction does not provide any example. The pre-trained model is typically a decoder-only model, which produces output based on an input prompt that conditions the output. In essence, few-shot prediction from a pre-trained model means leveraging a model’s prior knowledge from a large dataset to perform well on a new task or dataset, even with very few labeled examples. As our pre-trained models, we used the 2 billion parameter (2B) and the 7 billion parameter (7B) versions of Salamandra.
[5] https://huggingface.co/collections/BSC-LT/salamandra-66fc171485944df79469043a
We used the following prompt without doing refinement through prompt engineering:
    Given the context and the specified target word in {LANGUAGE}, answer 10 simple alternative words. Do not give less than 10 alternative words. Give different words as alternatives. {SHOT_EXAMPLES} Context: {CONTEXT} Target Word: {TARGET} Alternatives Words:
Here LANGUAGE is a variable which is set according to the language (Catalan, English, Italian, Spanish) in which we want to produce predicted solutions. For few-shot prediction we used examples from the trial section of the MLSP data. The shot examples were selected randomly, but we made sure that unique contexts were selected. An instance of a SHOT_EXAMPLE is given here:
    Given the context .... Context: A continue statement will skip the remainder of the block and start at the controlling conditional statement again. Target Word: remainder Alternative Words: rest, restrictive, remaining, remainder, balance
For a 2-shot or 4-shot prompt, 2 or 4 of these different examples would be included in the prompt given to the system. The CONTEXT and TARGET variables have the same form as in the provided shot examples.
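The prompt construction described above can be sketched as follows. The instruction text is taken from the prompt template; everything else is an assumption made for illustration: the function names, the representation of a trial example as a (context, target, alternatives) tuple, the choice to repeat the full instruction inside every shot, and the use of single spaces as separators.

```python
import random

INSTRUCTION = (
    "Given the context and the specified target word in {language}, "
    "answer 10 simple alternative words. Do not give less than 10 "
    "alternative words. Give different words as alternatives."
)


def sample_shots(trial_examples, n_shots, seed=0):
    """Randomly pick up to n_shots trial examples whose contexts are all different."""
    rng = random.Random(seed)
    shuffled = list(trial_examples)
    rng.shuffle(shuffled)
    chosen, seen_contexts = [], set()
    for context, target, alternatives in shuffled:
        if len(chosen) == n_shots:
            break
        if context not in seen_contexts:
            chosen.append((context, target, alternatives))
            seen_contexts.add(context)
    return chosen


def build_prompt(language, shots, context, target):
    """Fill the template: instruction, shot examples, then the instance to simplify."""
    instruction = INSTRUCTION.format(language=language)
    parts = [instruction]
    for shot_context, shot_target, shot_alternatives in shots:
        # Each shot repeats the instruction and ends with its gold alternatives (assumption).
        parts.append(
            f"{instruction} Context: {shot_context} Target Word: {shot_target} "
            f"Alternative Words: {', '.join(shot_alternatives)}"
        )
    parts.append(f"Context: {context} Target Word: {target} Alternatives Words:")
    return " ".join(parts)
```

Passing an empty list of shots yields the zero-shot prompt; passing the output of `sample_shots(..., 2)` or `sample_shots(..., 4)` yields the 2-shot and 4-shot variants.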
As evaluation measures, we used the same as in the MLSP shared task (see Section 3.3). Accuracy (ACC) expresses the percentage of right solutions given out of all given solutions. Here we use Accuracy@1@top1, which is defined as the percentage of instances where the first top-ranked substitute matches the most frequently suggested synonym in the gold data (top1). MAP@k (Mean Average Precision) uses a ranked list of generated substitutes, which can either be matched or not matched against the set of the gold-standard substitutes. The first k solutions of the ranked list are considered.
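The two measures can be sketched as follows. This is one common formulation, written under stated assumptions: gold annotations are given as a list of annotator suggestions with repetitions, ties for the most frequent gold substitute are broken by first occurrence, and average precision is normalised by k. The official shared-task scorer may differ in such details, and the function names are hypothetical.

```python
from collections import Counter


def accuracy_at_1_top1(predictions, gold_annotations):
    """Share of instances whose first prediction equals the most frequent gold substitute."""
    correct = 0
    for ranked, gold in zip(predictions, gold_annotations):
        top_gold = Counter(gold).most_common(1)[0][0]
        if ranked and ranked[0] == top_gold:
            correct += 1
    return correct / len(predictions)


def average_precision_at_k(ranked, gold_set, k):
    """Sum precision@i over the matched positions i <= k, normalised by k (assumption)."""
    hits = 0
    total = 0.0
    for i, candidate in enumerate(ranked[:k], start=1):
        if candidate in gold_set:
            hits += 1
            total += hits / i
    return total / k


def map_at_k(predictions, gold_annotations, k):
    """Mean of average_precision_at_k over all instances."""
    scores = [
        average_precision_at_k(ranked, set(gold), k)
        for ranked, gold in zip(predictions, gold_annotations)
    ]
    return sum(scores) / len(scores)
```

Accuracy@1@top1 only looks at the first candidate, whereas MAP@k rewards systems that place several acceptable substitutes near the top of the ranked list.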
The results can be seen in Table 4. We use the same baseline here as was used in the MLSP shared task. It has to be noted that the baseline used here was very strong, since it used zero-shot prompting with the chat-fine-tuned version of Llama-2-70B. This is a version with 70 billion parameters and thus ten times larger than the Salamandra-7B model we use here. In fact, many participating systems in the MLSP shared task could not outperform this baseline. In the tables, we mark those results with an asterisk that are higher than this baseline. As a further reference we also list the performance of the different winning systems of the shared task. These winners, however, use GPT-3 for Catalan (Dutilleul et al., 2024) and GPT-4 (Enomoto et al., 2024) for the rest of the languages, and for reasons we describe above, we cannot use them for the iDEM project.
As expected, the 2B version of Salamandra could not outperform the baseline (Table 9 in Appendix D). We attribute this to the fact that this model is too small to produce reliable results in a task that requires quite a large amount of general knowledge about language, such as synonymy and simplicity. The results from this table are still interesting because we want to use fine-tuning on Salamandra-2B in future work. The 7B version of Salamandra, on the other hand, could outperform the baseline nearly systematically in few-shot settings. Interestingly, the difference between 2-shot and 4-shot predictions is not very large. In some cases, the 4-shot predictions perform even worse than 2-shot predictions. Another observation that can be made is that Salamandra mostly excels at the three Romance languages Spanish, Catalan and Italian, while for English, it performs very close to the baseline. In this case, it means that the baseline is higher and harder to beat for English than for the other languages because of the multilingual capabilities of the baseline system or the lack thereof. These observations confirm our assumption that Salamandra is a good choice for the set of languages that we have to treat in iDEM.

| Language | 0-Shot ACC | 0-Shot MAP@3 | 2-Shot ACC | 2-Shot MAP@3 | 4-Shot ACC | 4-Shot MAP@3 | MLSP Baseline ACC | MLSP Baseline MAP@3 | MLSP Winner ACC | MLSP Winner MAP@3 |
|---|---|---|---|---|---|---|---|---|---|---|
| English | 0.1280 | 0.1912 | 0.4017* | 0.3868 | 0.4035* | 0.4242* | 0.3877 | 0.4241 | 0.5245 | 0.5762 |
| Spanish | 0.0286 | 0.1213 | 0.3541* | 0.5148* | 0.3608* | 0.3644 | 0.3254 | 0.4157 | 0.4536 | 0.6763 |
| Catalan | 0.0426 | 0.1390 | 0.2292* | 0.3742* | 0.2022* | 0.3357* | 0.1977 | 0.3024 | 0.2719 | 0.5003 |
| Italian | 0.035 | 0.1419 | 0.3596* | 0.4108* | 0.3315* | 0.3868* | 0.2964 | 0.3310 | 0.4762 | 0.5661 |

Table 4: Results of Zero and Few Shot Lexical Simplification Performance for a big model (Salamandra-7B). Results are compared to the state of the art as reported in the recent MLSP 2024 lexical simplification shared task. Asterisks (*) indicate the model outperformed the strong baseline of the competition.

| Sentence | Easy to Read Segmentation |
|---|---|
| The way this sentence is cut is easy to read. | The way this sentence is cut / is easy to read. |
| Validar es comprobar si un documento es fácil de comprender. | Validar es comprobar si un documento / es fácil de comprender. |

Table 5: Examples of segmented sentences in English and Spanish taken from Easy-to-Read guidelines. A slash (/) marks the line break.

4.5 Integration of Complexity Assessment and Lexical Simplification
This section presents ongoing work toward integrating two core modules of our system: complexity assessment (Section 4.2) and lexical simplification (Section 4.4). The classifier first detects complex lexical items in a sentence, and the simplification module then proposes easier alternatives. While the full pipeline has not yet been formally evaluated, we have implemented a proof-of-concept to illustrate its feasibility. Table 6 provides multilingual examples where the complexity classifier flags difficult words, which are then simplified by the Salamandra-7B lexical simplifier. For instance, in the English sentence “The reason why hypothalamic lesions affect body fat ...,” the words ‘hypothalamic’ and ‘lesions’ are identified as complex and replaced with ‘brain’ and ‘damage,’ respectively; these substitutions significantly enhance readability.
In he con ex o he iDEM p ojec , his in eg a-
ion is in ended o deploymen wi hin he mobile
applica ion cu en ly unde de elopmen (see Sec-
ion 5), whe e use s wi h cogni i e o linguis ic ba -
ie s can ecei e eal- ime suppo in unde s anding
complex in o ma ion. Fu u e wo k will in ol e
o mal e alua ion, expansion o ull sen ences, and
deepe c oss-linguis ic adap a ion.
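
The two-stage flow described in this section (detect complex words, then substitute them) can be sketched as follows. This is only an illustration under stated assumptions: the word set and the substitution dictionary are hypothetical stand-ins for the complexity classifier and the Salamandra-7B simplifier, filled with the English word pairs listed in Table 6, and the function names are invented for this sketch.

```python
import re

# Hypothetical stand-in for the complexity classifier (Section 4.2):
# a fixed set of words that we pretend the classifier flags as complex.
COMPLEX_WORDS = {"hypothalamic", "lesions", "affect"}

# Hypothetical stand-in for the lexical simplifier (Section 4.4):
# substitutes copied from the English row of Table 6.
SUBSTITUTES = {
    "hypothalamic": "brain",
    "lesions": "damage",
    "affect": "influence",
}


def detect_complex(tokens):
    """Stage 1: return the indices of tokens flagged as complex."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in COMPLEX_WORDS]


def propose_substitute(word):
    """Stage 2: return an easier alternative, or the word itself if none is known."""
    return SUBSTITUTES.get(word.lower(), word)


def simplify_sentence(sentence):
    """Run detection, then substitution, leaving all other text untouched."""
    # Split into word and non-word chunks so spacing and punctuation survive.
    tokens = re.findall(r"\w+|\W+", sentence)
    for i in detect_complex(tokens):
        tokens[i] = propose_substitute(tokens[i])
    return "".join(tokens)


print(simplify_sentence("hypothalamic lesions affect body fat"))
```

Keeping the two stages behind separate functions mirrors the modular design described above: either stand-in could be swapped for a real model without touching the other.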
4.6 Segmenting Sentences for Easy-to-Read
According to E2R standards (Inclusion Europe, 2009; 153101, 2018), sentences in E2R should be short and should fit on one line on the printed page (or screen). Since this is not always possible, guidelines recommend cutting the sentence where people would pause when reading out loud. Research on sentence segmentation is somewhat related to the prediction of prosodic markers in text-to-speech systems, where syntactic structure and word/token information are used (Fitzpatrick and Bachenko, 1989). Examples of how sentences should be segmented in E2R in English and Spanish are presented in Table 5.

Lang | Context (Sentence) | Complex word (by CA) | Substitute (by LS)
Eng | The reason why hypothalamic lesions affect body fat and feeding behavior has in fact much to do with leptin signaling. | hypothalamic | brain
 | | lesions | damage
 | | affect | influence
Spa | Si este indicador baja de 1, implicaría que la empresa no está en capacidad de cubrir sus obligaciones de corto plazo con los activos líquidos que posee. (If this indicator is below 1, this implies that the enterprise is not in conditions to cover its obligations in the short run with the liquid assets it possesses.) | implicaría | significaría
 | | indicador | medida
 | | plazo | tiempo
 | | activos | bienes
Cat | La formació sosté que "els posicionaments excloents en vers a altres realitats educatives fonamentades amb idees polítiques distorsionen la realitat del model" català. (The formation maintains that "exclusionary positions towards other educational realities based on political ideas distort the reality of the Catalan model".) | sosté | defensa
 | | posicionaments | posicións
 | | vers | contra
Table 6: Examples of cases where the Complexity Assessment (CA) system identifies a word that needs simplification and the Lexical Simplification (LS) system simplifies it.

Although datasets for sentence and lexical simplification exist (as reported above), there is a lack of publicly available datasets for E2R segmentation. We have gained private access to a dataset of segmented E2R texts in Spanish (Calleja et al., 2024). This dataset is organized into three files corresponding to train (3,826 sentences), validation (484 sentences), and development (1,452 sentences). Each sentence is explicitly marked to indicate where it should be segmented following E2R standards. We adopt a machine learning approach to sentence segmentation, developing a classifier based on linguistic information and other features such as the position of the token in the sequence (first, second, etc.) or the distance to the previous cut. We process the dataset in order to convert the original sentences into learning instances. The learning instances are based on the tokens (words, punctuation, numbers, etc.) in each sentence; our aim is to classify every token as cut-point or not. In order to create the learning instances, we linguistically analyze each sentence using a Spanish model from the SpaCy library (Honnibal et al., 2020), which produces information on parts of speech, syntactic dependencies, and named entities. We extract several features including the Part-Of-Speech (POS) tag of the token, the case of the token (lowercased, uppercased), whether the token is a punctuation mark, whether the token is part of a named entity (begin, inside, outside), the position of the token in the sentence, the distance to the previous cut point (or -1 if there is none), and the distance to the end of the sentence. The learning instances (one per token) are stored in a CSV file for use by a machine learning algorithm. We report results using a Decision Tree algorithm (Steinberg, 2009) due to its simplicity and explanatory power (i.e. it yields a set of rules). Other algorithms were less successful on our data. The learning algorithm is an instance of the Decision Tree implementation provided by the Scikit-Learn library⁶ (Pedregosa et al., 2011).
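
As an illustration of the instance-building step, the sketch below turns one marked sentence into one feature row per token and serializes the rows as CSV. It is a minimal sketch under several assumptions: the slash cut marker, the whitespace tokenization, the convention that the token before the marker is labelled as the cut point, and all column names are hypothetical, since the dataset's annotation format is not described here. The POS and named-entity features are left out because they require a SpaCy model.

```python
import csv
import io
import string

CUT_MARK = "/"  # hypothetical marker, not the dataset's real annotation format


def build_instances(marked_sentence):
    """Turn one marked, pre-tokenized sentence into one feature row per token."""
    tokens, is_cut = [], []
    for piece in marked_sentence.split():
        if piece == CUT_MARK:
            if is_cut:
                is_cut[-1] = True  # assumption: the token before the marker is the cut point
        else:
            tokens.append(piece)
            is_cut.append(False)

    rows, prev_cut = [], None
    for i, tok in enumerate(tokens):
        rows.append({
            "token": tok,
            "is_upper_initial": tok[0].isupper(),
            "is_punct": all(ch in string.punctuation for ch in tok),
            "position": i,
            # -1 when no cut has been seen yet, as in the feature list above
            "dist_prev_cut": -1 if prev_cut is None else i - prev_cut,
            "dist_end": len(tokens) - 1 - i,
            "label": "cut" if is_cut[i] else "no_cut",
        })
        if is_cut[i]:
            prev_cut = i
    return rows


def to_csv(rows):
    """Serialize the feature rows as CSV text (one line per token)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = build_instances("The way this sentence is cut / is easy to read .")
print(to_csv(rows))
```

Rows of this shape could then be passed to any tabular classifier after encoding the categorical columns.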
Table 7 reports segmentation results for the decision tree classifier and two baselines. The baselines are based on (i) the Part-of-Speech (POS) tag that, on the training data, is the best predictor of the token where the sentence should be segmented, and (ii) the most common length of the segment. As for the decision tree, two configurations are evaluated: the oracle configuration knows the true previous cuts, while the blind configuration only has access to the predicted previous cuts. The difference between the oracle and blind configurations is expected, since the blind configuration derives the distance to the previous cut from its own, possibly wrong, predictions. The difference in performance between the decision tree and the baselines is an indication that the features are contributing to the classification performance. Future work should look at analyzing feature contribution and improving the models, and providing segmentation support for Catalan, English and Italian.
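
The distinction between the oracle and blind configurations can be made concrete with a small sketch. The decision rule below is a hypothetical toy stand-in for the trained decision tree (it cuts when exactly four tokens have passed since the last known cut or the sentence start); only the way the distance feature is fed, from gold cuts or from the model's own earlier predictions, reflects the two configurations.

```python
def distance_feature(i, prev_cut):
    """Distance from token i to the previous cut, or -1 if there is none."""
    return -1 if prev_cut is None else i - prev_cut


def toy_rule(position, dist_prev_cut):
    """Hypothetical stand-in for the trained classifier."""
    since = position + 1 if dist_prev_cut == -1 else dist_prev_cut
    return since == 4


def segment(tokens, gold_cuts=None):
    """Predict a cut / no-cut flag per token.

    gold_cuts given -> oracle: the distance feature uses the true previous cuts.
    gold_cuts None  -> blind: the distance feature uses the predicted previous cuts.
    """
    predicted, prev_pred, prev_gold = [], None, None
    for i in range(len(tokens)):
        prev = prev_gold if gold_cuts is not None else prev_pred
        cut = toy_rule(i, distance_feature(i, prev))
        predicted.append(cut)
        # Update the "previous cut" trackers only after predicting token i.
        if cut:
            prev_pred = i
        if gold_cuts is not None and gold_cuts[i]:
            prev_gold = i
    return predicted


tokens = "The way this sentence is cut is easy to read .".split()
gold = [i == 5 for i in range(len(tokens))]  # true cut after "cut"

blind = [i for i, c in enumerate(segment(tokens)) if c]
oracle = [i for i, c in enumerate(segment(tokens, gold)) if c]
print("blind cuts at", blind)
print("oracle cuts at", oracle)
```

In the blind run, an early wrong cut shifts the distance feature for every later token, which is one plausible source of the gap between the two configurations.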

Algorithm | F1 (Cut) | F1 (No Cut) | Avg. F1
Decision Tree (Oracle) | 0.43 | 0.89 | 0.66
Decision Tree (Blind) | 0.26 | 0.91 | 0.58
POS Tag Baseline | 0.17 | 0.95 | 0.56
Seg. Length Baseline | 0.12 | 0.91 | 0.52
Table 7: Segmentation results (based on F1 measure) into Easy to Read (Spanish data). Comparison of a Decision Tree with baselines.
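
How the "Avg. F1" column is computed is not stated. The sketch below assumes it is the unweighted (macro) mean of the two per-class F1 scores and checks that the reported figures are consistent with that reading up to rounding. The tolerance is an assumption: each value is rounded to two decimals, so the mean of the rounded inputs may differ from the rounded true mean by up to about 0.01.

```python
# Values copied from Table 7: (F1 cut, F1 no cut, reported average).
TABLE_7 = {
    "Decision Tree (Oracle)": (0.43, 0.89, 0.66),
    "Decision Tree (Blind)": (0.26, 0.91, 0.58),
    "POS Tag Baseline": (0.17, 0.95, 0.56),
    "Seg. Length Baseline": (0.12, 0.91, 0.52),
}


def macro_f1(f1_cut, f1_no_cut):
    """Unweighted mean of the two per-class F1 scores."""
    return (f1_cut + f1_no_cut) / 2


def is_consistent(f1_cut, f1_no_cut, reported, tol=0.0101):
    """True if the reported average matches the macro mean up to rounding."""
    return abs(macro_f1(f1_cut, f1_no_cut) - reported) <= tol


for name, (f1_cut, f1_no_cut, reported) in TABLE_7.items():
    mean = macro_f1(f1_cut, f1_no_cut)
    print(f"{name}: macro mean {mean:.3f}, reported {reported:.2f}, "
          f"consistent={is_consistent(f1_cut, f1_no_cut, reported)}")
```

Under this reading, the "No Cut" class dominates the average even though the "Cut" class is the one of practical interest, so the F1 (Cut) column is the more informative one for comparing systems.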
5 Accessing Simplification Technology through the iDEM App
The iDEM project implements and deploys a cloud-based, open-API iDEM platform to deliver text-simplification services, integrating components for the detection of complex language phenomena (Section 4.2) and adaptation through text simplification (Section 4.4). It supports diverse audiences, languages, and domains, and solutions are made available to deliberative participatory spaces as open-source products. The current version of the app supports interaction via typed text, speech, OCR or PDF. A participation functionality allows the user to check proposals currently being discussed and simplify them for better understanding. For example, the Decidim platform (Aragón et al., 2017) can be directly accessed from the app to translate or simplify active participatory processes. Examples of the app in action can be seen in Figures 1a and 1b. Note that the current simplification technology supported by the app is not yet the one described in the paper; it still serves as a demonstrator of what it will look like in the coming months.
⁶ https://scikit-learn.org/stable/modules/tree.html
6 Limitations and Ethical Considerations
The studies on Complexity Assessment in Sections 4.2 and 4.3 argue for an analysis and simplification of a large array of factors, one of which is lexical simplification. We are aware that addressing only this factor is a current limitation, but future versions of the iDEM simplification tools will include a full treatment of sentence simplification. Our current simplification model, although achieving good performance in comparison with a strong baseline, does not do so with respect to the state of the art. This can be explained by our aim to keep models open and accessible to a broader community of stakeholders, i.e. lighter, open models could be afforded by more disadvantaged communities in the spirit of our project. Since our project deals with pro-
(continued from previous page)
Strategy | MacroStrategy | Explanation and Examples | Total
ModLin | Modulation | Redistribution of sentence components: -ModWord: “...collaboration and information sharing...” → “...working together and sharing information...”; -ModGrou: “Accessible Museums is a topic...” → “Our members think it is important to talk about Accessible Museums”; -ModClau: “To improve community health... the Government works...” → “The Government works... to improve...” | 2
PraSyn | Synonymy | Pragmatic synonyms: -PraProp: UN → United Nations, Nutella → chocolate cream; -PraCont: “Sir Keir Starmer” → “the new Prime Minister” |
SemSyn | Synonymy | Semantic synonyms: -SemStere: ponder → think; -SemHyper: lecturers → teachers; -SemHypo: flora → trees and flowers | 3
GraSyn | Synonymy | Grammatical synonyms: -GraPron: “you don’t see it” → “you don’t see the mistake”; -GraTens: “we have been doing” → “we have done”; -GraPass: passive → active; -GraNega: “not an obstacle” → “facilitated” |
TraNou | Transposition | Noun transposition. e.g. “our aim” → “we want” |
TraVer | Transposition | Verb transposition. e.g. “listening to music” → “music” |
TrAdje | Transposition | Adjective transposition. e.g. “mountainous landscapes” → “mountains” | 4
TrAdve | Transposition | Adverb transposition. e.g. “behaving happily” → “was happy” |
Transcript | Transcript | A sentence is left unchanged. |
SynW2G/S/C | Syntactic Change | Word to group/clause/sentence |
SynG2W/C/S | Syntactic Change | Group to word/clause/sentence | 12
SynC2W/G/S | Syntactic Change | Clause to word/group/sentence |
SynS2W/G/C | Syntactic Change | Sentence to word/group/clause |
Illocutionary Change | Illocutionary Change | Making implied meaning explicit. | 1
GraSim | Compression | Grammatical simplification. e.g. “so as to” → “to” |
SemSim | Compression | Semantic simplification. e.g. condensing explanations | 2
OmiEle | Omission | Omission of elements: -OmiSubj: “Sir Keir Rodney Starmer...” → “Starmer is...”; -OmiVerb, OmiComp, OmiClau, OmiSent (e.g. full sentence removed) |
OmiDia | Omission | Omission of discourse elements: -OmiFil: “you know” → removed; -OmiRe : “I was right... right when...” → “I was right when...”; -OmiRhe: “wasn’t I?” → removed | 2
Total | | | 30
Table 10: Macro-strategies, Strategies, Micro-strategies, and Examples with Annotated Totals