Proceedings of the 18th International Natural Language Generation Conference, pages 215–231
October 29–November 2, 2025. ©2025 Association for Computational Linguistics
Towards Trustworthy Lexical Simplification: Exploring Safety and
Efficiency with Small LLMs
Akio Hayakawa, Stefan Bott, Horacio Saggion
Department of Engineering, Universitat Pompeu Fabra
Barcelona, Spain
{akio.hayakawa, stefan.bott, horacio.saggion}@upf.edu
Abstract
Despite their strong performance, large language models (LLMs) face challenges in real-world application of lexical simplification (LS), particularly in privacy-sensitive and resource-constrained environments. Moreover, since vulnerable user groups (e.g., people with disabilities) are one of the key target groups of this technology, it is crucial to ensure the safety and correctness of the output of LS systems. To address these issues, we propose an efficient framework for LS systems that utilizes small LLMs deployable in local environments. Within this framework, we explore knowledge distillation with synthesized data and in-context learning as baselines. Our experiments in five languages evaluate model outputs both automatically and manually. Our manual analysis reveals that while knowledge distillation boosts automatic metric scores, it also introduces a safety trade-off by increasing harmful simplifications. Importantly, we find that the model’s output probability is a useful signal for detecting harmful simplifications. Leveraging this, we propose a filtering strategy that suppresses harmful simplifications while largely preserving beneficial ones. This work establishes a benchmark for efficient and safe LS with small LLMs. It highlights the key trade-offs between performance, efficiency, and safety, and demonstrates a promising approach for safe real-world deployment.
1 Introduction
Text Simplification (TS) aims to make texts more accessible by rewriting them in simple language. TS holds the potential to alleviate reading and understanding difficulties, particularly for individuals with dyslexia (Rello et al., 2013), intellectual disabilities (Säuberli et al., 2024), and Deaf and hard-of-hearing adults (Alonzo et al., 2021). TS is a task strongly oriented towards real-world scenarios, aiming to promote social participation and inclusion among people who face challenges in text comprehension.
Recent advancements in large language models (LLMs) have revolutionized natural language processing and achieved state-of-the-art performance across various tasks (OpenAI, 2024). TS is no exception, as LLMs have outperformed existing TS systems (Feng et al., 2023; Wu and Arase, 2024; Qiang et al., 2025).
However, applying LLMs to TS in real-world scenarios, particularly for vulnerable user groups, faces critical challenges. First, prompts provided to LLMs and texts requiring simplification may contain sensitive personal information, such as data related to cognitive impairments. The use of API-based LLMs involves transmitting that sensitive data over the internet, raising significant privacy concerns. For instance, given that individuals with dyslexia often hesitate to disclose their condition due to concerns about stigma and negative perceptions (Hamilton Clark, 2024), it can be problematic to design prompts for TS such as "I have dyslexia; Can you simplify this diagnosis result for me?". Thus, TS systems capable of running locally are highly desirable.
Open-access LLMs address this privacy concern. However, high-performing open-access LLMs typically require substantial computational resources for inference. Deploying such large models directly on resource-constrained devices, such as smartphones and tablets that are commonly used by the target users (Söderström et al., 2021), is currently impractical. This highlights the need for developing smaller models that can perform effectively within these limited hardware environments.
Building on these challenges, we investigate how to develop efficient TS systems that can operate under constrained computational resources. This approach is essential for supporting information access for all while respecting user privacy.
Utilizing small LLMs is a promising approach, as ~3B models are often explicitly engineered for on-device deployment (Meta AI, 2024), thereby addressing privacy and efficiency issues. However, particular attention must be paid to safety when employing small LLMs, as their limited capacity compared to larger counterparts introduces critical considerations regarding the reliability and harmfulness of the generated simplifications. Poor or inaccurate simplifications can be detrimental, as they may actively provide misinformation or cause confusion, which are more serious issues than leaving the text unchanged (Rello et al., 2013; Säuberli et al., 2024). Therefore, in practice, it is crucial not only to simplify texts effectively, but also to minimize harmful outputs and ensure safety.
As a first step towards addressing these challenges, this paper focuses specifically on lexical simplification (LS), a subtask of TS that replaces complex words in a context sentence with simple alternatives. LS can be considered a relatively conservative and safe subtask compared to sentence- or document-level simplification, which often involves operations such as information deletion (Al-Thanyyan and Azmi, 2021).
We adopted small LLMs and explored two approaches: in-context learning, which requires no training, and knowledge distillation, which transfers knowledge from a large teacher model to a smaller student model. Our approach also considers extensibility to diverse languages, as supporting a broad user group requires simplification across multiple languages.
To evaluate the safety of simplification outputs, particularly in suppressing harmful content, we conducted manual evaluations alongside automatic metrics. Manual analysis revealed that, while knowledge distillation generally boosted automatic metric scores, it did not reduce harmful outputs and sometimes even increased them. Furthermore, we observed that, especially in models trained via knowledge distillation, the output probability provided by LLMs may serve as a useful signal for identifying harmful simplifications.¹
Our contributions are summarized as follows:
• We investigated the potential and challenges of using small LLMs for lexical simplification with respect to safety and efficiency, and we establish a benchmark in this important research area.
• We demonstrated that small LLMs offer significant inference speedups, which highlights their efficiency.
• We found that standard approaches such as in-context learning and knowledge distillation can produce beneficial simplifications, but they inherently risk generating harmful outputs.
• We identified that the model’s log-probability serves as a useful signal for detecting harmful simplifications, suggesting a promising filtering strategy to ensure safety towards real-world applications.

¹ Our code will be available at https://github.com/ahaya3776/safe-efficient-ls.
2 Related Work
Lexical Simplification. LSBert (Qiang et al., 2021) established itself as a strong baseline for LS by leveraging BERT’s unmasking capabilities and contextual understanding, outperforming earlier systems based on paraphrase databases and word embeddings (Biran et al., 2011; Glavaš and Štajner, 2015). However, such systems based on masked language models (MLMs) were limited in generating multi-token words (Przybyła and Shardlow, 2020), and their effectiveness outside English has been questioned (Stajner et al., 2023). Furthermore, MLM-based systems often require multi-stage pipelines involving candidate ranking, which introduces significant latency that conflicts with our goal of on-device efficiency. Their multilingual applicability is also hindered by the inconsistent availability of monolingual models across languages.
More recent auto-regressive approaches, using T5 (Sheang and Saggion, 2021) and GPT-3 (Aumiller and Gertz, 2022), have outperformed MLM-based methods, leading to the widespread adoption of LLMs as the predominant solution for LS (Shardlow et al., 2024b). Notably, a GPT-4-based LS system (Enomoto et al., 2024) achieved remarkable performance across multiple languages.
Smaller LLMs and Efficiency. The use of high-performing versatile LLMs poses several challenges in real-world scenarios, including resource limitations, privacy concerns, and high operational costs. To address these issues, various efforts have been made to develop LLMs capable of running on local devices. These include techniques such as quantization (Zhou et al., 2024) and the GPT-Generated Unified Format (GGUF),² both of which aim to enable efficient inference without high-end hardware, as well as the development of small LLMs (Qwen Team, 2024; Gemma Team, 2024; Meta AI, 2024).

² https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
Small LLMs can be further trained to improve performance on specific tasks (Xu et al., 2024), including LS (Baez and Saggion, 2023; Xiao et al., 2024). Baez and Saggion (2023) proposed LSLlama, a LLAMA-7B model fine-tuned on an existing LS dataset, which achieved performance comparable to a GPT-3-based approach. Xiao et al. (2024) introduced the PivotKD framework, which trained Chinese-centric small LLMs using pseudo-instances generated by GPT-4, and built a cost-effective Chinese LS system by incorporating web-based synonym and word sense retrieval during inference. These studies demonstrated the potential of task-specific training of small LLMs for LS. However, their applicability to languages beyond English and Chinese remains uncertain, especially given morphological complexity and disparities in pre-training resources.
Safety and Reliability of Text Simplification. While TS supports reading and understanding, it also carries the risk of causing confusion or misinterpretation. In practice, outputs from automatic TS systems often suffer from low factuality (Devaraj et al., 2022) and information loss (Agrawal and Carpuat, 2024), which can negatively affect readers’ reading time and accuracy on comprehension questions (Rello et al., 2013; Säuberli et al., 2024). In such cases, leaving the original text unchanged may be preferable to applying a harmful simplification. Therefore, adopting a strategy that accepts simplification only when certain criteria are met offers a practical approach in real-world scenarios. In this regard, Trienes et al. (2024) presented one of the few efforts to assess the potential harm of TS by detecting information loss. However, its reliance on LLMs makes it unsuitable for use in constrained environments.
3 Experimental Setup
Figure 1 illustrates the overall flow of our system development and evaluation. We used the HuggingFace Transformers library³ for the development of our LS models. A single Tesla T4 GPU with 16 GB of memory was used for the development. To enable high-speed inference on CPUs, the models were converted into the GGUF format using llama.cpp.⁴

³ https://huggingface.co/docs/transformers/

[Figure 1 (diagram). Data synthesis in the computing environment: (1) pick Wikipedia sentences 10-100 words long, e.g. "The women and children make Guernica the image of innocent, defenseless humanity victimized."; (2) randomly select a target word from the top-5 infrequent words (except proper nouns/OOVs); (3) get an alternative word for the target from the teacher model (in the example, "humanity" is replaced with "people"). Small LLMs are either fine-tuned (KD) or prompted 5-shot (ICL), then evaluated on MultiLS in the constrained environment for latency, ACC/POT, and manual evaluation.]
Figure 1: Overall flow of our experiments. We developed and evaluated systems for each language separately.
3.1 Task Formulation
The term Lexical Simplification (LS) has been used with varying scopes. In some cases, it refers to a sentence-level simplification pipeline consisting of complex word identification, substitution generation, and ranking (Paetzold and Specia, 2017). However, in this paper, we adopt a narrower definition of LS, focusing solely on the substitution generation. Specifically, we define LS as generating a simple alternative for a single target word that appears in a given context sentence. An alternative should make the context easier to understand than the original while preserving its meaning. Therefore, an LS system takes a context and target word as input and outputs a single alternative word.
3.2 Dataset
We used MultiLS (Shardlow et al., 2024c), an LS dataset covering 10 languages, to evaluate system performance. We selected five languages, English, Spanish, Catalan, German, and Japanese, to account for differences in language family, morphological structure, and resource availability.

⁴ https://github.com/ggml-org/llama.cpp
Table 1 shows an example LS instance, consisting of a context sentence, a target word, and alternative words suggested by multiple human annotators. MultiLS allowed annotators to use a target as an alternative when they could not identify a valid simplification, which often occurred when the target was already simple enough (Shardlow et al., 2024a). This annotation scheme enables us to exclude instances where LS is inherently difficult. We removed such instances where the top-ranked alternative was unchanged from the target word. This process resulted in the number of instances per language shown in Table 2. We randomly split the selected instances into two parts, assigning 90 instances to development and the rest to testing.⁵
3.3 LS Systems
We employed two small LLMs: Qwen 2.5 1.5B (Qwen for short) (Qwen Team, 2024) and Llama 3.2 1B (Llama for short) (Meta AI, 2024). Both models were trained on multiple languages from their larger counterparts.⁶ To make these models perform LS, we adopted two approaches: in-context learning and knowledge distillation.⁷
3.3.1 In-Context Learning
In-context learning (Brown et al., 2020), which provides several few-shot examples to guide model behavior, is a common technique to improve output quality. We used five fixed examples in the prompt (5-shot) following the template in Appendix A. These examples were sampled from the pilot split of MultiLS, which was separated from the main evaluation data.
3.3.2 Knowledge Distillation
Knowledge distillation, which involves transferring knowledge of large teacher models to smaller student models, has been widely used to adapt LLMs to specific tasks, including LS (Baez and Saggion, 2023; Xiao et al., 2024). Recent approaches commonly employ simple supervised fine-tuning of student models with hard labels derived from teacher model outputs, due to the advanced capabilities of closed-source LLMs (Xu et al., 2024). Following this framework, we performed knowledge distillation (fine-tuned) by synthesizing LS instances.

⁵ As up to three instances share the same context, we assign 90 instances with 30 unique contexts to the development data.
⁶ We used base LLMs instead of instruction-tuned versions. See Appendix C for details.
⁷ See Appendix B for the hyperparameter settings.

Context: Electronically controlled motorized zoom lenses are placed on both camera and projector, and synchronized with one another so that both lenses zoom together and at the same focal length at all times.
Target Word: focal
Gold Alternatives: main, main, central, central, basic, primary, focal
Table 1: Example from the MultiLS English subset. For this instance, ACC is met if the output alternative is "main" or "central", which are the most suggested alternatives. POT is met if the output alternative is one of "main", "central", "basic", and "primary". If the output alternative is "focal", which is unchanged from the target word, it does not meet either metric.

Language   # Original Instances   # Selected Instances   Avg. Context Length
English    570                    515                    25.4
Spanish    593                    502                    29.3
Catalan    445                    261                    45.0
German     570                    547                    37.7
Japanese   570                    562                    20.3
Table 2: Statistics of MultiLS instances per language.
Synthesizing Context and Targets. We randomly extracted context sentences from Wikipedia for each language. Sentences were parsed using MeCab⁸ for Japanese and spaCy⁹ for the other languages. We retained only those containing between 10 and 100 words as contexts.¹⁰
To ensure that target words were simplifiable, we excluded proper nouns and out-of-vocabulary words from the set of candidate words within each context sentence. From the remaining candidates, we randomly selected one of the five least frequent words as the target word, based on Zipf frequency.¹¹
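The context and target selection described above can be sketched as follows. This is a minimal illustration rather than our released implementation: the function names, the inclusive length bounds, and the toy frequency table are assumptions, proper nouns are assumed to have been removed beforehand (e.g., using part-of-speech tags), and out-of-vocabulary words are assumed to be those with a Zipf frequency of 0. In practice, the `zipf` callable could wrap `wordfreq.zipf_frequency(word, lang)`.

```python
import random
from typing import Callable, Optional, Sequence


def keep_context(tokens: Sequence[str], min_len: int = 10, max_len: int = 100) -> bool:
    """Keep a sentence only if it has between 10 and 100 words (bounds assumed inclusive)."""
    return min_len <= len(tokens) <= max_len


def pick_target(
    candidates: Sequence[str],
    zipf: Callable[[str], float],
    rng: random.Random,
    k: int = 5,
) -> Optional[str]:
    """Randomly pick one of the k least frequent in-vocabulary candidate words."""
    # Deduplicate while preserving order, then score each word once.
    scored = [(zipf(word), word) for word in dict.fromkeys(candidates)]
    # Assumption: a Zipf frequency of 0 marks an out-of-vocabulary word.
    scored = [(freq, word) for freq, word in scored if freq > 0]
    if not scored:
        return None
    scored.sort()  # ascending frequency: rarest words first
    pool = [word for _, word in scored[:k]]
    return rng.choice(pool)


# Toy Zipf frequencies (hypothetical values, for illustration only).
TOY_ZIPF = {
    "women": 5.2, "children": 5.3, "make": 6.0, "image": 5.0, "innocent": 4.3,
    "defenseless": 2.5, "humanity": 4.4, "victimized": 2.9,
}

if __name__ == "__main__":
    rng = random.Random(0)
    words = list(TOY_ZIPF) + ["zzzq"]  # "zzzq" plays the role of an OOV word
    print(pick_target(words, lambda w: TOY_ZIPF.get(w, 0.0), rng))
```

Passing the frequency lookup as a callable keeps the sketch independent of any particular frequency resource.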
Synthesizing Alternative Words. To obtain alternative words for the context-target pairs described above, we employed the instruction-tuned Gemma 2 9B (Gemma Team, 2024) as a teacher model, an LLM known for its strong performance across diverse languages. The model was prompted to generate a single alternative word using the same 5-shot setting described in § 3.3.1.
⁸ https://taku910.github.io/mecab/
⁹ https://spacy.io/
¹⁰ For Japanese, simple tokenization rules were applied. See Appendix D for details.
¹¹ Calculated using the wordfreq Python library: https://github.com/rspeer/wordfreq/
The performance of fine-tuned student models can often be improved by removing low-quality outputs from the teacher (Jung et al., 2023; Huang et al., 2023). Therefore, we filtered out low-confidence alternatives, approximating confidence using output probabilities (described later in § 3.4.4). For each language, we generated alternatives for 60,000 synthesized context-target pairs and selected the top 30,000 high-confidence instances for training.
Fine-tuning Models. We fine-tuned each student model for each language separately, using the corresponding 30,000 instances for up to five epochs. To reduce memory consumption, we adopted the QLoRA framework (Dettmers et al., 2023). In this setup, the weights of base models were quantized to 4-bit precision using the bitsandbytes¹² library. Fine-tuning was then performed via 16-bit LoRA adapters. Following Dettmers et al. (2023), we only fine-tuned Query and Key projection layers within the attention modules. Each type of student model was fine-tuned with three different random seeds. We saved a checkpoint every 0.2 epochs and selected the one that achieved the highest Potential@1 (described later in § 3.4.1) on the development set. The prompt template in Appendix A was used for training and inference.
3.3.3 Baselines
As a baseline, we employed the instruction-tuned Gemma 2 9B (Gemma for short) in the same 5-shot setting used for the teacher model.
3.4 Evaluation
3.4.1 Automatic LS Metrics
To automatically evaluate the performance of LS systems, we used Accuracy@1@top1 (ACC) and Potential@1 (POT), as defined in Saggion et al. (2022). As shown in Table 1, ACC is the percentage of predictions matching the most frequently suggested alternative. POT is the percentage of predictions matching any suggested alternative. Given that all instances were assumed simplifiable after the selection process in § 3.3.1, any predictions unchanged from the target word were not considered a match for either ACC or POT, even if the target word was included in the gold alternatives.
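As an illustration of these two metrics, the following sketch checks a single prediction against the gold alternatives of the Table 1 instance. The function name and the case-insensitive comparison are assumptions made for this sketch; ties among the most suggested alternatives are all treated as ACC matches, as in the Table 1 caption.

```python
from collections import Counter
from typing import Sequence, Tuple


def acc_pot_match(prediction: str, target: str, gold: Sequence[str]) -> Tuple[bool, bool]:
    """Return (ACC match, POT match) for one prediction.

    A prediction identical to the target never matches, even when the
    target itself appears among the gold alternatives.
    """
    pred = prediction.strip().lower()
    if pred == target.strip().lower():
        return False, False
    counts = Counter(g.strip().lower() for g in gold)
    top_count = max(counts.values())
    # All alternatives tied for the highest count are "most suggested".
    most_suggested = {word for word, c in counts.items() if c == top_count}
    return pred in most_suggested, pred in counts


if __name__ == "__main__":
    gold = ["main", "main", "central", "central", "basic", "primary", "focal"]
    for pred in ["main", "basic", "focal", "sharp"]:
        print(pred, acc_pot_match(pred, "focal", gold))
```

ACC and POT over a test set are then the fractions of instances for which the first and second flags are true, respectively.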
¹² https://github.com/bitsandbytes-foundation/bitsandbytes
3.4.2 Latency Evaluation
To estimate model response time in resource-constrained environments, we constructed a virtual small-scale infrastructure using computing instances from Amazon Web Services (AWS). We selected m6g.large and m6g.xlarge computing instances from AWS Elastic Compute Cloud, which provide 2 and 4 virtual CPUs and 8 GB and 16 GB of memory, respectively. These configurations reflect the hardware commonly found in smartphones and tablets. Both computing instances are based on Graviton processors, which are widely applied in mobile devices.¹³
Total latency mainly consists of prompt processing time and inference time. As both depend on the number of tokens in the prompt and the generated output, we measured the average per-token prompt processing and inference times for each model using llama.cpp. Notably, llama.cpp caches the initial fixed portion of the prompt (i.e., few-shot examples), so its processing latency is not incurred on subsequent inferences. While this caching is key to the efficiency, it makes dynamic prompting strategies impractical, as they would require frequent cache invalidation.
3.4.3 Manual LS Evaluation
To gain a more nuanced understanding of LS quality and safety from a user perspective, we conducted a manual evaluation. We randomly sampled 100 instances per language and assigned harmfulness tags to the alternatives generated by each system. Our manual evaluation focused on instances that were not covered by our automatic metrics. For this purpose, we only assigned tags to alternatives that were neither unchanged from the target nor included in the gold alternatives.
Taking into account the standard human evaluation criteria of fluency, adequacy, and simplicity in TS, we defined the following four harmful tags:
• Grammar Error: The alternative is grammatically incorrect, including inflection and conjugation errors.
• Change of Meaning: Replacing the target with the alternative drastically changes the meaning of context.
• More Difficult: The alternative is clearly more difficult than the target, even though it preserves the meaning to some extent.
• Gibberish: The alternative does not make sense at all.

¹³ https://aws.amazon.com/ec2/instance-types/m6g/

LS performance (ACC, POT) per language; latency in msec/token (read, pred) per environment.
                          English      Spanish      Catalan      German       Japanese     m6g.large    m6g.xlarge
Model        Settings     ACC   POT    ACC   POT    ACC   POT    ACC   POT    ACC   POT    read  pred   read  pred
Gemma (9B)   5-shot       .529  .751   .427  .774   .333  .690   .405  .643   .252  .494   652   581    326   292
Qwen (1.5B)  5-shot       .358  .534   .274  .473   .076  .205   .186  .298   .064  .150   91    275    45    139
Qwen (1.5B)  fine-tuned   .382  .574   .318  .537   .129  .265   .119  .206   .076  .154   86    274    43    138
Llama (1B)   5-shot       .202  .278   .053  .092   .047  .105   .090  .142   .023  .042   70    219    35    110
Llama (1B)   fine-tuned   .370  .544   .293  .529   .160  .292   .138  .217   .058  .145   66    221    33    107
Table 3: Performance of models on the MultiLS dataset. Gemma was quantized to 4-bit due to memory constraints.
For each language, annotation was performed by a single in-house annotator, all of whom were native speakers except for Catalan. The Catalan annotation was conducted by a CEFR C1 level speaker with over ten years of experience. The task was designed as a simple binary decision to minimize subjectivity, ensuring the evaluation framework is easily extensible to other languages and domains.
Based on the automatically and manually assigned tags, alternatives were categorized into the following three groups. Tags determined by automatic metrics are marked with A, while those requiring manual annotation are marked with M.
• Beneficial
  – ACC (A): equivalent to Accuracy@1@top1.
  – POT (A): Potential@1 but not ACC.
  – Good (M): no harmful tags were assigned.
• Unchanged (A): alternative was identical to target.
• Harmful
  – Degraded (M): one or more non-Gibberish harmful tags were assigned.
  – Gibberish (M): Gibberish was assigned.
See Appendix E for detailed examples of the harmful tags and groups.
3.4.4 Filtering Strategy
To address the risk of introducing harmful simplifications discussed above, we propose and evaluate a filtering strategy. This strategy leverages the output probability score as a reliability signal in a threshold-based decision mechanism to determine whether to perform LS.
Probability Score. We computed the probability score as the sum of the log-probabilities of the tokens forming the alternative word, including the token indicating the end of the word (e.g., a newline or EOS token). We considered the probability scores of individual alternatives as candidate thresholds. For each threshold value, alternatives with scores above the threshold were accepted, while the others were rejected, in which case no simplification was applied.
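The score and the accept/reject decision can be sketched as below. The function names and the example log-probabilities are hypothetical; the per-token log-probabilities are assumed to be supplied by the inference engine, and the threshold is assumed to have been chosen beforehand.

```python
from typing import Sequence


def probability_score(token_logprobs: Sequence[float]) -> float:
    """Sum of token log-probabilities, including the end-of-word token."""
    return sum(token_logprobs)


def apply_filter(
    target: str,
    alternative: str,
    token_logprobs: Sequence[float],
    threshold: float,
) -> str:
    """Return the alternative if its score is above the threshold, else keep the target."""
    if probability_score(token_logprobs) > threshold:
        return alternative
    return target  # rejected: the original word is left unchanged


if __name__ == "__main__":
    # Hypothetical log-probabilities for two word-piece tokens plus a newline token.
    logprobs = [-0.5, -0.25, -0.25]
    print(probability_score(logprobs))                        # -1.0
    print(apply_filter("editing", "writing", logprobs, -2.0))  # accepted
    print(apply_filter("editing", "writing", logprobs, -0.5))  # rejected
```

Falling back to the target word on rejection reflects the view that leaving the text unchanged is preferable to applying a harmful simplification.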
Evaluation. To quantitatively evaluate the effectiveness of the proposed strategy, we defined the following metrics:
• AUC (Beneficial vs. Harmful): To assess how well the probability score functions as a safety signal, we computed the Area Under the ROC Curve (Bradley, 1997) for classifying alternatives as Beneficial vs. Harmful, excluding Unchanged alternatives.
• BH0.1 (Beneficial Rate at 10% Harmful): To quantify practical benefit under a safety constraint, we reported the rate of Beneficial achieved when the rate of Harmful introduced was limited to 10% of total instances. We chose the 10% threshold to balance safety and utility by offering a practical reference point for comparison that remains adaptable to different needs.
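Both metrics can be computed from per-instance scores and category labels, as in the following sketch. It reflects one reading of the definitions above and is not our released code: the labels "B", "H", and "U" (Beneficial, Harmful, Unchanged) and the function names are assumptions, AUC is computed as the fraction of Beneficial-Harmful pairs in which the Beneficial alternative has the higher score (ties count one half), and the threshold sweep accepts the top-scoring alternatives down to each distinct score.

```python
from typing import Sequence


def auc_beneficial_vs_harmful(scores: Sequence[float], labels: Sequence[str]) -> float:
    """AUC as the probability that a Beneficial score exceeds a Harmful one."""
    beneficial = [s for s, l in zip(scores, labels) if l == "B"]
    harmful = [s for s, l in zip(scores, labels) if l == "H"]
    if not beneficial or not harmful:
        raise ValueError("need at least one Beneficial and one Harmful instance")
    wins = sum(
        1.0 if b > h else 0.5 if b == h else 0.0
        for b in beneficial
        for h in harmful
    )
    return wins / (len(beneficial) * len(harmful))


def beneficial_rate_at_harmful(
    scores: Sequence[float], labels: Sequence[str], max_harmful: float = 0.10
) -> float:
    """Highest Beneficial rate reachable while the Harmful rate stays within the limit.

    Rates are relative to all instances, including Unchanged ones.
    """
    n = len(scores)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = 0.0
    n_beneficial = n_harmful = 0
    for k, (score, label) in enumerate(ranked, start=1):
        n_beneficial += label == "B"
        n_harmful += label == "H"
        # Evaluate only where the score changes, so tied scores are never split.
        at_boundary = k == n or ranked[k][0] != score
        if at_boundary and n_harmful / n <= max_harmful:
            best = max(best, n_beneficial / n)
    return best


if __name__ == "__main__":
    # Hypothetical scores and labels for ten instances.
    scores = [-0.5, -1.0, -1.5, -2.0, -2.5, -3.0, -3.5, -4.0, -4.5, -5.0]
    labels = ["B", "B", "H", "B", "B", "U", "H", "B", "H", "H"]
    print(auc_beneficial_vs_harmful(scores, labels))
    print(beneficial_rate_at_harmful(scores, labels))
```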
4 Results
4.1 Automatic Evaluation
Table 3 shows the automatic metric scores of our LS systems. The results confirm our hypothesis that fine-tuning, as part of the knowledge distillation, improved the performance of small LLMs. For example, fine-tuned Llama achieved 0.370 ACC on English, significantly higher than the 5-shot score (0.202). Similar gains were observed for both Llama and Qwen across most languages.
The fine-tuned Llama performed comparably to Qwen despite its smaller size, suggesting that the 1B model can approach the 1.5B model in performance after training. However, neither student model reached the teacher’s level.

[Figure 2 (stacked bar charts). One panel per language (English, Spanish, Catalan, German, Japanese); the y-axis is the proportion (0 to 1) of outputs in each category (Beneficial: ACC, POT, Good; Unchanged; Harmful: Degraded, Gibberish) for G-5, Q-5, Q-f, L-5, and L-f.]
Figure 2: Distribution of output alternative categories. G: Gemma, Q: Qwen, L: Llama. -5: 5-shot, -f: fine-tuned.
Table 3 also reports the latency (ms/token) of prompt reading (read) and output generation (pred). Both student models showed substantially lower latency than the teacher model. On m6g.large, Llama’s read latency (66 msec/token) was nearly 10 times faster than Gemma’s (652 msec/token), with similar trends across environments.
4.2 Manual Evaluation
Figure 2 shows the distribution of alternative categories, as judged by human evaluators, across models, settings, and languages. Each stacked bar represents the proportion of output alternatives falling into the categories.
Under 5-shot settings, small LLMs, especially Llama for English and Spanish, produced a high proportion of Unchanged outputs, indicating safe but less proactive simplification behavior. Fine-tuning reduced Unchanged outputs, with a corresponding rise in Beneficial simplifications, reflecting a general improvement in LS capability. However, fine-tuning also introduced a safety trade-off, as it increased the proportions of Harmful alternatives.
In contrast, such a trade-off was not observed for German and Japanese. For these languages, performance remained low across both 5-shot and fine-tuned settings, with Harmful alternatives consistently dominating the results. This suggests a more fundamental challenge stemming from the inherent difficulty for current small LLMs to perform LS effectively in these languages.
Lang  Model  Settings     B     H     AUC    BH0.1
En    Qwen   5-shot       .63   .30   .679   .41
En    Qwen   fine-tuned   .63   .27   .707   .46
En    Llama  5-shot       .30   .28   .510   .12
En    Llama  fine-tuned   .61   .20   .797   .54
Es    Qwen   5-shot       .51   .19   .737   .46
Es    Qwen   fine-tuned   .63   .19   .850   .61
Es    Llama  5-shot       .09   .12   .907   .09
Es    Llama  fine-tuned   .56   .21   .804   .50
Ca    Qwen   5-shot       .34   .52   .735   .18
Ca    Qwen   fine-tuned   .38   .49   .904   .34
Ca    Llama  5-shot       .15   .52   .614   .03
Ca    Llama  fine-tuned   .46   .42   .813   .36
De    Qwen   5-shot       .41   .38   .841   .34
De    Qwen   fine-tuned   .38   .51   .721   .16
De    Llama  5-shot       .19   .40   .730   .11
De    Llama  fine-tuned   .35   .55   .737   .16
Ja    Qwen   5-shot       .33   .58   .807   .16
Ja    Qwen   fine-tuned   .21   .73   .799   .13
Ja    Llama  5-shot       .16   .64   .745   .04
Ja    Llama  fine-tuned   .28   .67   .845   .19
Table 4: Evaluation of Filtering Strategy. B and H refer to the original rate of Beneficial and Harmful outputs.
4.3 Filtering Strategy
Table 4 presents the results of the filtering strategy. First, the AUC scores are notably high, especially under fine-tuned settings, suggesting that log-probability serves as an effective signal for detecting Harmful alternatives. Moreover, the fine-tuned models generally show higher AUC across model types and languages, which indicates that knowledge distillation enhances the quality of probability as a safety indicator.
The BH0.1 metric shows the practical value of this strategy. For example, in Spanish, fine-tuned Qwen reduced the Harmful rate from 19% to 10% with only a slight drop in Beneficial from 63% to 61%. BH0.1 also highlights the superiority of fine-tuning over 5-shot settings.

[Figure 3 (two plots). Top: violin plot of the sum of log-probability (0 to -10) for Beneficial (B) and Harmful (H) alternatives under 5-shot and fine-tuned settings. Bottom: line plot of the rate of Beneficial (B) or Harmful (H) outputs (0.0 to 0.5) against the percentile of the log-probability threshold (0 to 100%), for 5-shot and fine-tuned settings, with marked points (70, 0.18) and (51, 0.34).]
Figure 3: Beneficial and Harmful alternatives and their probability for Qwen in Catalan. (Top) Distribution of raw probability scores. (Bottom) Rate of Beneficial and Harmful alternatives after filtering at each percentile threshold. Dotted lines are plotted on thresholds where Harmful becomes 10%.
To further explore these findings, we focus on the behavior of Qwen models in Catalan. Here, while the original Beneficial and Harmful rates are close between 5-shot and fine-tuned settings, the impact of the filtering strategy differs significantly. In Figure 3, the violin plot (top) visualizes the distribution of log-probability scores, where fine-tuning leads to a clear separation between Beneficial and Harmful alternatives.
The line plot (bottom) tracks Beneficial and Harmful rates across threshold percentiles. For the fine-tuned model, increasing the threshold reduces Harmful rapidly, while Beneficial declines more gradually. As a result, the Harmful rate is reduced from nearly 50% to 10%, with most Beneficial simplifications preserved.
Context: There are also different editing styles in the sense of how bold people are willing to be.
Target Word: editing
Gold Alternatives: changing, modifying, altering ...
Gemma 5-shot (4%): writing (Change of Meaning)
Qwen 5-shot (92%): writing (Change of Meaning)
Qwen fine-tuned (3%): proofreading (More Difficult)
Table 5: Example outputs from the LS systems. Percentages next to system names indicate log-probability percentiles within each system.
5 Discussion
5.1 Case Study
To better understand the characteristics of model outputs, particularly harmful simplifications overlooked by automatic metrics and the potential of the log-probability signal, we present an example in Table 5. In this example, model output alternatives "writing" and "proofreading" were categorized as Harmful, with the tags "Change of Meaning" and "More Difficult", respectively. Crucially, these alternatives were associated with lower log-probability percentiles for Gemma (5-shot) and fine-tuned Qwen, while they were much higher for Qwen under the 5-shot setting. This case confirms our findings that fine-tuned models effectively leverage log-probability to identify harmful alternatives. It also shows that log-probability is a useful signal for the teacher model, even without fine-tuning. This validates the filtering process used during data synthesis. Examples in other languages are described in Appendix F.
5.2 Safety
As exemplified by the case study above, harmful LS alternatives pose a serious risk in real-world scenarios. Our manual evaluation revealed key limitations of standard automatic evaluation metrics based on human-provided alternatives. They fail to identify acceptable simplifications not in the gold alternatives, and they do not expose harmful alternatives. Although manual evaluation is costly and not scalable, our harmfulness annotations provide a valuable basis for building automatic detection methods, such as LLM-as-a-judge, to support more practical safety assessment.
Harmful alternatives were particularly pronounced in German and Japanese. In these languages, complex morphology may hinder the consistent generation of correct and simple single-word alternatives by small LLMs. Our error analysis highlights a critical challenge related to this: alternatives with the Grammar Error tag in German and Japanese often received high probability scores from small LLMs (both few-shot and fine-tuned), making them difficult to distinguish from beneficial alternatives or other harmful types. For instance, the average log-probability score of Grammar Error from the fine-tuned Llama model in Japanese was -2.992, which was notably higher than that of Change of Meaning (-3.762) and Gibberish (-4.457). This suggests that our filtering strategy had limited effectiveness in mitigating grammar errors.
Interestingly, this issue was less prevalent in the teacher model (see Appendix G for details across all tags and models). This disparity implies that non-small LLMs can better leverage output probability as a signal of grammatical correctness even in morphologically complex languages. In contrast, small LLMs may struggle to capture these fine-grained grammatical nuances with simple approaches such as in-context learning and knowledge distillation. Incorporating instances with grammatical errors as negative examples in contrastive learning may help student models learn to avoid them, enhancing the reliability of threshold-based filtering.
While log-probability is effective for filtering harmful alternatives,
selecting an appropriate threshold for real-world use requires careful tuning
based on human evaluation, taking into account domain- and language-specific
considerations and practical application needs, to ensure both safety and
utility.
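One way to make this tuning step concrete is sketched below. Given outputs that have been manually annotated as harmful or not, together with their log-probabilities, it searches for the loosest threshold whose retained outputs stay within a harmful-rate budget. The function name `pick_threshold`, the `max_harmful_rate` parameter, and the toy annotations are illustrative assumptions, not the exact procedure used in this work.

```python
from typing import List, Optional, Tuple

def pick_threshold(
    scored: List[Tuple[float, bool]],  # (log-probability, is_harmful) per output
    max_harmful_rate: float,
) -> Optional[float]:
    """Return the lowest log-prob threshold whose kept outputs meet the harm budget."""
    best = None
    # Try each observed score as a candidate threshold, from strictest to loosest.
    for threshold in sorted({lp for lp, _ in scored}, reverse=True):
        kept = [harmful for lp, harmful in scored if lp >= threshold]
        harmful_rate = sum(kept) / len(kept)
        if harmful_rate <= max_harmful_rate:
            best = threshold  # a looser threshold retains more outputs
    return best

# Toy annotations (hypothetical values, for illustration only).
annotated = [(-0.4, False), (-0.9, False), (-1.5, True),
             (-2.1, False), (-3.8, True), (-4.5, True)]
threshold = pick_threshold(annotated, max_harmful_rate=0.25)
print(threshold)
```

In practice the budget would be set per language and domain, and the threshold re-validated on held-out annotations before deployment.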
5.3 Latency
While the smaller models offer substantial speed improvements, their practical
inference speed for real-time and on-device LS needs further consideration.
Assuming that a standard input consists of 30 tokens and the output
alternative word is composed of two tokens, the overall inference time of
fine-tuned Llama on the faster m6g.xlarge environment would be about 1.2
seconds: (30 tokens * 33 ms/token [read]) + (2 tokens * 107 ms/token [pred])
= 1204 ms.
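The estimate above can be reproduced with a small helper, which also makes it easy to try other prompt lengths. The function and parameter names are illustrative; the per-token timings are the ones quoted in the text.

```python
def estimate_latency_ms(prompt_tokens: int, output_tokens: int,
                        read_ms_per_token: float, pred_ms_per_token: float) -> float:
    """Prompt-reading time plus token-generation time, in milliseconds."""
    return prompt_tokens * read_ms_per_token + output_tokens * pred_ms_per_token

# 30 prompt tokens at 33 ms/token, 2 output tokens at 107 ms/token.
latency = estimate_latency_ms(30, 2, 33, 107)
print(latency)  # 1204
```

Under these timings the prompt-reading term (990 ms) dominates, which is why shortening the prompt is the natural lever for reducing latency.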
Although a response time of around one second may be tolerable in some cases,
further reduction would likely improve the user experience on mobile devices.
One possible approach is to reduce the prompt size by including only a limited
window of words surrounding the target, rather than the full sentence.
Naturally, this strategy would require careful safety assessment.
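A minimal sketch of such a windowing step is given below. It assumes whitespace tokenization, which would not apply to Japanese without a word segmenter, and the name `context_window` and the `radius` parameter are hypothetical.

```python
from typing import List

def context_window(tokens: List[str], target_index: int, radius: int) -> List[str]:
    """Keep the target word plus up to `radius` words on each side."""
    if not 0 <= target_index < len(tokens):
        raise IndexError("target_index is outside the sentence")
    start = max(0, target_index - radius)
    end = min(len(tokens), target_index + radius + 1)
    return tokens[start:end]

sentence = ("we recommend utilizing the consultation service "
            "to ensure a prompt resolution").split()
window = context_window(sentence, sentence.index("utilizing"), radius=2)
print(window)  # ['we', 'recommend', 'utilizing', 'the', 'consultation']
```

A smaller radius shortens the prompt but removes context that may be needed to preserve meaning, so any chosen value would need the safety assessment noted above.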
6 Conclusion
This study addressed the challenge of building efficient and safe LS systems
using small LLMs, motivated by real-world needs. We proposed benchmark systems
in five languages based on in-context learning and knowledge distillation, and
introduced a filtering strategy using log-probability as a safety signal.
Experiments showed that small LLMs offer significant efficiency gains, but
that knowledge distillation, while improving automatic metric scores,
increases harmful outputs.
We demonstrated that output log-probability serves as an effective signal for
detecting harmful simplifications. This signal enables a filtering strategy
that reduces harmful outputs while retaining beneficial ones. These findings
lay the foundation for lightweight LS systems that remain safe and effective
across languages.
Future work should improve training methods to reduce harmfulness and explore
real-time LS for mobile environments. Ultimately, this research advances
deployable, trustworthy LS tools that support inclusive information access.
Limitations
Our study, while demonstrating the potential of small LLMs for efficient and
safe lexical simplification, has several limitations that highlight directions
for further investigation.
First, the manual evaluation of harmfulness was conducted by a single
annotator per language. While the annotation task was designed as a simple
binary decision to minimize subjectivity, we were unable to assess
inter-annotator agreement, which may affect the generalizability of the
harmfulness evaluations. Establishing a more robust evaluation protocol with
multiple annotators would be a valuable next step to create a gold-standard
dataset for harmfulness detection in LS.
Next, we employed relatively simple prompt engineering, using fixed 5-shot
examples and prompt templates to ensure reproducibility and establish baseline
performance. We did not explore advanced prompt engineering techniques, which
could potentially enhance the models’ performance. Future work could
investigate how more sophisticated prompting strategies impact the trade-off
between performance and safety explored in this study.
Spanish
Context: Pero si eso ocurre habitualmente, tienes un flujo de fondos negativo y tu presupuesto está desequilibrado.
(But if that happens habitually, you have a negative cash flow and your budget is not in equilibrium.)
Target Word: desequilibrado (not in equilibrium)
Gold Alternatives: inestable (unstable), desnivelado (uneven), desbalanceado (unbalanced) ...
Gemma 5-shot (5%): desbalanceado (unbalanced) (Beneficial (POT))
Llama 5-shot (55%): equilibrado (in equilibrium) (Change of Meaning)
Llama fine-tuned (77%): equilibrado (in equilibrium) (Change of Meaning)
Catalan
Context: En el manifest s’ha qualificat "d’escandalosa" la sentència contra els membres de "la Manada" ja que "se’n riu i menysprea una dona jove" que va ser agredia "brutalment per un grup de salvatges".
(In the statement, the sentence against the members of "la Manada" was described as "scandalous" since "laughs at and despises a young woman" who was assaulted "brutally by a group of savages".)
Target Word: brutalment (brutally)
Gold Alternatives: violentament (violently), fortament (strongly), durament (severely) ...
Gemma 5-shot (51%): violentament (violently) (Beneficial (POT))
Llama 5-shot (41%): brutalment (brutally) (Unchanged)
Llama fine-tuned (8%): malamentamentamentamentamentamentamentamentament (Gibberish)
German
Context: Salzborn nennt als in die moderne Begriffsgenese von Demokratie eingeschriebene Werte: (...) und die Gewähr elementarer Rechte der Menschen gegen den Staat.
(Salzborn names as values inscribed into the modern conceptual genesis of democracy: (...) and the guarantee of elementary rights of human beings against the state.)
Target Word: elementarer (elementary)
Gold Alternatives: grundlegende (fundamental), wichtige (important), essentielle (essential) ...
Gemma 5-shot (57%): grundlegende (fundamental) (Grammar Error)
Llama 5-shot (35%): Grundrecht (fundamental right) (Grammar Error)
Llama fine-tuned (9%): grundstehende (ground-standing) (More Difficult)
Japanese
Context: 迅速に適切な解決を図るために、相談窓口を活用することをお奨めします。
(To ensure a prompt and appropriate resolution, we recommend utilizing the consultation service.)
Target Word: 活用する (utilizing)
Gold Alternatives: 使う (use), 利用する (make use of), ...
Gemma 5-shot (97%): 利用する (make use of) (Beneficial (ACC))
Qwen 5-shot (63%): 使おう (Grammar Error)
Qwen fine-tuned (76%): 使 (Grammar Error)
Table 11: Example outputs from the LS systems. Percentages next to system names indicate log-probability
percentiles within each system.
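To illustrate how such a percentile might be obtained, the sketch below ranks one output's log-probability against all outputs of the same system. The definition used here (share of outputs scoring at or below the given value) and the name `logprob_percentile` are assumptions made for illustration; the exact convention behind the table is not spelled out here.

```python
from bisect import bisect_right
from typing import List

def logprob_percentile(score: float, system_scores: List[float]) -> float:
    """Percentage of the system's outputs whose log-probability is <= `score`."""
    ranked = sorted(system_scores)
    return 100.0 * bisect_right(ranked, score) / len(ranked)

# Toy scores for a single system (hypothetical values).
scores = [-4.0, -3.0, -2.0, -1.0]
pct = logprob_percentile(-2.0, scores)
print(pct)  # 75.0
```

Under this reading, a low percentile means the system itself assigned the output a comparatively low probability, as with the Gibberish example at 8%.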
                    English       Spanish       Catalan       German        Japanese
Tags                  #  Logprob    #  Logprob    #  Logprob    #  Logprob    #  Logprob
Gemma-5shot
  (All)             100  -1.615   100  -1.567   100  -1.679   100  -1.588   100  -2.268
  More Difficult      4  -1.905     2  -1.989     0  -          1  -1.300     4  -2.031
  Change of Meaning  14  -1.874     2  -1.266     6  -1.907     6  -1.620     4  -2.447
  Grammar Error       1  -2.158     2  -1.409     3  -1.836     7  -1.835    10  -2.528
  Gibberish           3  -2.056     1  -2.944     3  -2.227     1  -1.975     2  -3.617
Qwen-5shot
  (All)             100  -1.884   100  -2.129   100  -3.592   100  -2.754   100  -4.132
  More Difficult      2  -2.203     1  -3.013     0  -          3  -3.217     3  -3.882
  Change of Meaning  20  -2.088     8  -2.957    24  -3.976    20  -3.834    23  -4.766
  Grammar Error       3  -2.339    12  -2.469    24  -3.927    20  -3.254    15  -4.220
  Gibberish           5  -1.991     0  -         13  -4.440     2  -4.253     2  -5.209
Qwen-fine-tuned
  (All)             100  -1.297   100  -2.063   100  -4.431   100  -3.667   100  -3.337
  More Difficult      2  -2.033     0  -          0  -          1  -4.934     1  -3.421
  Change of Meaning  14  -1.617    12  -3.001    18  -5.692    34  -4.250    21  -3.697
  Grammar Error       5  -1.514     9  -3.390    17  -5.161    10  -3.296    17  -3.018
  Gibberish           7  -1.408     4  -5.029    21  -6.047     9  -5.206    37  -4.021
Llama-5shot
  (All)             100  -1.807   100  -1.244   100  -2.873   100  -3.135   100  -4.204
  More Difficult      3  -1.603     1  -1.802     0  -          2  -4.501     0  -
  Change of Meaning  14  -2.045     8  -1.573    32  -3.246    13  -3.537    16  -4.520
  Grammar Error       0  -          3  -1.851    28  -3.374    14  -3.275    19  -3.011
  Gibberish          12  -1.604     1  -2.686     6  -3.517    13  -3.375    31  -6.016
Llama-fine-tuned
  (All)             100  -1.161   100  -1.862   100  -2.880   100  -3.645   100  -3.360
  More Difficult      0  -          0  -          0  -          4  -4.867     3  -4.091
  Change of Meaning  11  -1.465    16  -2.720    17  -3.012    25  -4.147    20  -3.762
  Grammar Error       3  -1.764     6  -2.834    13  -3.260    14  -3.304    12  -2.992
  Gibberish           6  -1.697     3  -2.219    18  -4.415    15  -4.918    35  -4.457
Table 12: Average log-probability scores for each language and harmful tag.
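Per-tag averages of this kind can be computed from annotated outputs with a simple grouping step, sketched below. The input layout (one tag and one log-probability per row) and the name `mean_logprob_by_tag` are assumptions for illustration, and the toy rows are not taken from the annotated data.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

def mean_logprob_by_tag(rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average the log-probabilities of outputs that share a harmful tag."""
    grouped = defaultdict(list)
    for tag, logprob in rows:
        grouped[tag].append(logprob)
    return {tag: mean(values) for tag, values in grouped.items()}

# Toy rows (hypothetical values).
rows = [("Grammar Error", -3.0), ("Grammar Error", -2.0), ("Gibberish", -5.0)]
averages = mean_logprob_by_tag(rows)
print(averages)
```

Tags with no annotated outputs simply do not appear in the result, which corresponds to the "-" cells in the table.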