Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs

Author: Hayakawa, Akio; Bott, Stefan Markus; Saggion, Horacio

Publisher: Zenodo

DOI: 10.5281/zenodo.17700807

Source: https://zenodo.org/records/17700807/files/2025.inlg-main.15.pdf

Proceedings of the 18th International Natural Language Generation Conference, pages 215–231
October 29–November 2, 2025. ©2025 Association for Computational Linguistics
Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs
Akio Hayakawa, Stefan Bott, Horacio Saggion
Department of Engineering, Universitat Pompeu Fabra
Barcelona, Spain
{akio.hayakawa, stefan.bott, horacio.saggion}@upf.edu
Abstract
Despite their strong performance, large language models (LLMs) face challenges in real-world application of lexical simplification (LS), particularly in privacy-sensitive and resource-constrained environments. Moreover, since vulnerable user groups (e.g., people with disabilities) are one of the key target groups of this technology, it is crucial to ensure the safety and correctness of the output of LS systems. To address these issues, we propose an efficient framework for LS systems that utilizes small LLMs deployable in local environments. Within this framework, we explore knowledge distillation with synthesized data and in-context learning as baselines. Our experiments in five languages evaluate model outputs both automatically and manually. Our manual analysis reveals that while knowledge distillation boosts automatic metric scores, it also introduces a safety trade-off by increasing harmful simplifications. Importantly, we find that the model’s output probability is a useful signal for detecting harmful simplifications. Leveraging this, we propose a filtering strategy that suppresses harmful simplifications while largely preserving beneficial ones. This work establishes a benchmark for efficient and safe LS with small LLMs. It highlights the key trade-offs between performance, efficiency, and safety, and demonstrates a promising approach for safe real-world deployment.
1 Introduction
Text Simplification (TS) aims to make texts more accessible by rewriting them in simple language. TS holds the potential to alleviate reading and understanding difficulties, particularly for individuals with dyslexia (Rello et al., 2013), intellectual disabilities (Säuberli et al., 2024), and Deaf and hard-of-hearing adults (Alonzo et al., 2021). TS is a task strongly oriented towards real-world scenarios, aiming to promote social participation and inclusion among people who face challenges in text comprehension.
Recent advancements in large language models (LLMs) have revolutionized natural language processing and achieved state-of-the-art performance across various tasks (OpenAI, 2024). TS is no exception, as LLMs have outperformed existing TS systems (Feng et al., 2023; Wu and Arase, 2024; Qiang et al., 2025).
However, applying LLMs to TS in real-world scenarios, particularly for vulnerable user groups, faces critical challenges. First, prompts provided to LLMs and texts requiring simplification may contain sensitive personal information, such as data related to cognitive impairments. The use of API-based LLMs involves transmitting that sensitive data over the internet, raising significant privacy concerns. For instance, given that individuals with dyslexia often hesitate to disclose their condition due to concerns about stigma and negative perceptions (Hamilton Clark, 2024), it can be problematic to design prompts for TS such as "I have dyslexia; Can you simplify this diagnosis result for me?". Thus, TS systems capable of running locally are highly desirable.
Open-access LLMs address this privacy concern. However, high-performing open-access LLMs typically require substantial computational resources for inference. Deploying such large models directly on resource-constrained devices, such as smartphones and tablets that are commonly used by the target users (Söderström et al., 2021), is currently impractical. This highlights the need for developing smaller models that can perform effectively within these limited hardware environments.
Building on these challenges, we investigate how to develop efficient TS systems that can operate under constrained computational resources. This approach is essential for supporting information access for all while respecting user privacy.
Utilizing small LLMs is a promising approach, as ~3B models are often explicitly engineered for on-device deployment (Meta AI, 2024), thereby addressing privacy and efficiency issues. However, particular attention must be paid to safety when employing small LLMs, as their limited capacity compared to larger counterparts introduces critical considerations regarding the reliability and harmfulness of the generated simplifications. Poor or inaccurate simplifications can be detrimental, as they may actively provide misinformation or cause confusion, which are more serious issues than leaving the text unchanged (Rello et al., 2013; Säuberli et al., 2024). Therefore, in practice, it is crucial not only to simplify texts effectively, but also to minimize harmful outputs and ensure safety.
As a first step towards addressing these challenges, this paper focuses specifically on lexical simplification (LS), a subtask of TS that replaces complex words in a context sentence with simple alternatives. LS can be considered a relatively conservative and safe subtask compared to sentence- or document-level simplification, which often involves operations such as information deletion (Al-Thanyyan and Azmi, 2021).
We adopted small LLMs and explored two approaches: in-context learning, which requires no training, and knowledge distillation, which transfers knowledge from a larger teacher model to a smaller student model. Our approach also considers extensibility to diverse languages, as supporting a broad user group requires simplification across multiple languages.
To evaluate the safety of simplification outputs, particularly in suppressing harmful content, we conducted manual evaluations alongside automatic metrics. Manual analysis revealed that, while knowledge distillation generally boosted automatic metric scores, it did not reduce harmful outputs and sometimes even increased them. Furthermore, we observed that, especially in models trained via knowledge distillation, the output probability provided by LLMs may serve as a useful signal for identifying harmful simplifications.[1]
Our contributions are summarized as follows:
• We investigated the potential and challenges of using small LLMs for lexical simplification with respect to safety and efficiency, and we establish a benchmark in this important research area.
• We demonstrated that small LLMs offer significant inference speedups, which highlights their efficiency.
• We found that standard approaches such as in-context learning and knowledge distillation can produce beneficial simplifications, but they inherently risk generating harmful outputs.
• We identified that the model’s log-probability serves as a useful signal for detecting harmful simplifications, suggesting a promising filtering strategy to ensure safety towards real-world applications.
[1] Our code will be available at https://github.com/ahaya3776/safe-efficient-ls.
2 Related Work
Lexical Simplification. LSBert (Qiang et al., 2021) established itself as a strong baseline for LS by leveraging BERT’s unmasking capabilities and contextual understanding, outperforming earlier systems based on paraphrase databases and word embeddings (Biran et al., 2011; Glavaš and Štajner, 2015). However, such systems based on masked language models (MLMs) were limited in generating multi-token words (Przybyła and Shardlow, 2020), and their effectiveness outside English has been questioned (Stajner et al., 2023). Furthermore, MLM-based systems often require multi-stage pipelines involving candidate ranking, which introduces significant latency that conflicts with our goal of on-device efficiency. Their multilingual applicability is also hindered by the inconsistent availability of monolingual models across languages.
More recent auto-regressive approaches, using T5 (Sheang and Saggion, 2021) and GPT-3 (Aumiller and Gertz, 2022), have outperformed MLM-based methods, leading to the widespread adoption of LLMs as the predominant solution for LS (Shardlow et al., 2024b). Notably, a GPT-4-based LS system (Enomoto et al., 2024) achieved remarkable performance across multiple languages.
Smaller LLMs and Efficiency. The use of high-performing versatile LLMs poses several challenges in real-world scenarios, including resource limitations, privacy concerns, and high operational costs. To address these issues, various efforts have been made to develop LLMs capable of running on local devices. These include techniques such as quantization (Zhou et al., 2024) and the GPT-Generated Unified Format (GGUF),[2] both of which aim to enable efficient inference without high-end hardware, as well as the development of small LLMs (Qwen Team, 2024; Gemma Team, 2024; Meta AI, 2024).
[2] https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
Small LLMs can be further trained to improve performance on specific tasks (Xu et al., 2024), including LS (Baez and Saggion, 2023; Xiao et al., 2024). Baez and Saggion (2023) proposed LSLlama, a LLAMA-7B model fine-tuned on an existing LS dataset, which achieved performance comparable to a GPT-3-based approach. Xiao et al. (2024) introduced the PivotKD framework, which trained Chinese-centric small LLMs using pseudo-instances generated by GPT-4, and built a cost-effective Chinese LS system by incorporating web-based synonym and word sense retrieval during inference. These studies demonstrated the potential of task-specific training of small LLMs for LS. However, their applicability to languages beyond English and Chinese remains uncertain, especially given morphological complexity and disparities in pre-training resources.
Safety and Reliability of Text Simplification. While TS supports reading and understanding, it also carries the risk of causing confusion or misinterpretation. In practice, outputs from automatic TS systems often suffer from low factuality (Devaraj et al., 2022) and information loss (Agrawal and Carpuat, 2024), which can negatively affect readers’ reading time and accuracy on comprehension questions (Rello et al., 2013; Säuberli et al., 2024). In such cases, leaving the original text unchanged may be preferable to applying a harmful simplification. Therefore, adopting a strategy that accepts simplification only when certain criteria are met offers a practical approach in real-world scenarios. In this regard, Trienes et al. (2024) presented one of the few efforts to assess the potential harm of TS by detecting information loss. However, its reliance on LLMs makes it unsuitable for use in constrained environments.
3 Experimental Setup
Figure 1 illustrates the overall flow of our system development and evaluation. We used the HuggingFace Transformers library[3] for the development of our LS models. A single Tesla T4 GPU with 16 GB of memory was used for the development. To enable high-speed inference on CPUs, the models were converted into the GGUF format using llama.cpp.[4]
[3] https://huggingface.co/docs/transformers/

[Figure 1: flowchart. Data synthesis: sentences of 10-100 words are picked from Wikipedia (example: "The women and children make Guernica the image of innocent, defenseless humanity victimized."); a target word is randomly selected from the top-5 infrequent words, excluding proper nouns and OOVs (here "humanity"); an alternative word for the target is obtained from the teacher model (here "people"). Small LLMs are either fine-tuned (KD) or used 5-shot (ICL), then evaluated on MultiLS in a constrained computing environment for latency, ACC/POT, and manual evaluation.]
Figure 1: Overall flow of our experiments. We developed and evaluated systems for each language separately.

3.1 Task Formulation
The term Lexical Simplification (LS) has been used with varying scopes. In some cases, it refers to a sentence-level simplification pipeline consisting of complex word identification, substitution generation, and ranking (Paetzold and Specia, 2017). However, in this paper, we adopt a narrower definition of LS, focusing solely on the substitution generation. Specifically, we define LS as generating a simple alternative for a single target word that appears in a given context sentence. An alternative should make the context easier to understand than the original while preserving its meaning. Therefore, an LS system takes a context and target word as input and outputs a single alternative word.
3.2 Dataset
We used MultiLS (Shardlow et al., 2024c), an LS dataset covering 10 languages, to evaluate system performance. We selected five languages, English, Spanish, Catalan, German, and Japanese, to account for differences in language family, morphological structure, and resource availability.
[4] https://github.com/ggml-org/llama.cpp
Table 1 shows an example LS instance, consisting of a context sentence, a target word, and alternative words suggested by multiple human annotators. MultiLS allowed annotators to use a target as an alternative when they could not identify a valid simplification, which often occurred when the target was already simple enough (Shardlow et al., 2024a). This annotation scheme enables us to exclude instances where LS is inherently difficult. We removed such instances where the top-ranked alternative was unchanged from the target word. This process resulted in the number of instances per language shown in Table 2. We randomly split the selected instances into two parts, assigning 90 instances to development and the rest to testing.[5]
3.3 LS Systems
We employed two small LLMs: Qwen 2.5 1.5B (Qwen for short) (Qwen Team, 2024) and Llama 3.2 1B (Llama for short) (Meta AI, 2024). Both models were trained on multiple languages from their larger counterparts.[6] To make these models perform LS, we adopted two approaches: in-context learning and knowledge distillation.[7]
3.3.1 In-Context Learning
In-context learning (Brown et al., 2020), which provides several few-shot examples to guide model behavior, is a common technique to improve output quality. We used five fixed examples in the prompt (5-shot) following the template in Appendix A. These examples were sampled from the pilot split of MultiLS, which was separated from the main evaluation data.
3.3.2 Knowledge Distillation
Knowledge distillation, which involves transferring knowledge of large teacher models to smaller student models, has been widely used to adapt LLMs to specific tasks, including LS (Baez and Saggion, 2023; Xiao et al., 2024). Recent approaches commonly employ simple supervised fine-tuning of student models with hard labels derived from teacher model outputs, due to the advanced capabilities of closed-source LLMs (Xu et al., 2024). Following this framework, we performed knowledge distillation (fine-tuned) by synthesizing LS instances.
[5] As up to three instances share the same context, we assign 90 instances with 30 unique contexts to the development data.
[6] We used base LLMs rather than their instruction-tuned versions. See Appendix C for details.
[7] See Appendix B for the hyperparameter settings.

Context: Electronically controlled motorized zoom lenses are placed on both camera and projector, and synchronized with one another so that both lenses zoom together and at the same focal length at all times.
Target Word: focal
Gold Alternatives: main, main, central, central, basic, primary, focal
Table 1: Example from the MultiLS English subset. For this instance, ACC is met if the output alternative is "main" or "central", which are the most suggested alternatives. POT is met if the output alternative is one of "main", "central", "basic", and "primary". If the output alternative is "focal", which is unchanged from the target word, it does not meet either metric.

Language  # Original Instances  # Selected Instances  Avg. Context Length
English   570                   515                   25.4
Spanish   593                   502                   29.3
Catalan   445                   261                   45.0
German    570                   547                   37.7
Japanese  570                   562                   20.3
Table 2: Statistics of MultiLS instances per language.

Synthesizing Context and Targets. We randomly extracted context sentences from Wikipedia for each language. Sentences were parsed using MeCab[8] for Japanese and spaCy[9] for the other languages. We retained only those containing between 10 and 100 words as contexts.[10]
To ensure that target words were simplifiable, we excluded proper nouns and out-of-vocabulary words from the set of candidate words within each context sentence. From the remaining candidates, we randomly selected one of the five least frequent words as the target word, based on Zipf frequency.[11]
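The target selection step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `pick_target` and `zipf_of` are hypothetical names, the frequency values in `toy_zipf` are made up, and in practice `zipf_of` could wrap a real frequency lookup such as the wordfreq library cited in the footnote.

```python
import random

def pick_target(candidates, zipf_of, k=5, rng=random):
    """Pick one of the k least frequent candidate words at random.

    candidates: words left after removing proper nouns and OOV words.
    zipf_of:    callable mapping a word to its Zipf frequency
                (hypothetical hook, assumed to be supplied by the caller).
    """
    unique = sorted(set(candidates))            # fixed order before ranking
    rarest = sorted(unique, key=zipf_of)[:k]    # lowest Zipf frequency first
    return rng.choice(rarest)

# Toy frequency table standing in for a real lookup (values are invented).
toy_zipf = {"women": 5.4, "children": 5.6, "make": 6.3, "image": 5.2,
            "innocent": 4.4, "defenseless": 2.6, "humanity": 4.3,
            "victimized": 2.9}

target = pick_target(list(toy_zipf), toy_zipf.get, k=5, rng=random.Random(0))
rarest_only = pick_target(list(toy_zipf), toy_zipf.get, k=1)
```

Passing the frequency lookup as a parameter keeps the sketch independent of any particular library and makes the random choice easy to seed.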
Synthesizing Alternative Words. To obtain alternative words for the context-target pairs described above, we employed the instruction-tuned Gemma 2 9B (Gemma Team, 2024) as a teacher model, an LLM known for its strong performance across diverse languages. The model was prompted to generate a single alternative word using the same 5-shot setting described in § 3.3.1.
[8] https://taku910.github.io/mecab/
[9] https://spacy.io/
[10] For Japanese, simple tokenization rules were applied. See Appendix D for details.
[11] Calculated using the wordfreq Python library: https://github.com/rspeer/wordfreq/
The performance of fine-tuned student models can often be improved by removing low-quality outputs from the teacher (Jung et al., 2023; Huang et al., 2023). Therefore, we filtered out low-confidence alternatives, approximating confidence using output probabilities (described later in § 3.4.4). For each language, we generated alternatives for 60,000 synthesized context-target pairs and selected the top 30,000 high-confidence instances for training.
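As a rough sketch of this selection step, the synthesized instances can be ranked by the teacher's probability score and the top half kept. The function name, the dictionary layout, and the scores below are all illustrative assumptions; the paper keeps 30,000 of 60,000 instances per language.

```python
def keep_most_confident(instances, keep):
    """Return the `keep` instances with the highest teacher score.

    instances: list of dicts with a "score" key holding the summed
               log-probability of the teacher's alternative (assumed layout).
    """
    ranked = sorted(instances, key=lambda inst: inst["score"], reverse=True)
    return ranked[:keep]

# Invented toy data; higher (less negative) scores mean higher confidence.
pairs = [
    {"target": "humanity", "alternative": "people", "score": -0.4},
    {"target": "focal", "alternative": "main", "score": -1.1},
    {"target": "editing", "alternative": "writing", "score": -5.2},
    {"target": "zoom", "alternative": "lens", "score": -3.0},
]
kept = keep_most_confident(pairs, keep=len(pairs) // 2)
```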
Fine-tuning Models. We fine-tuned each student model for each language separately, using the corresponding 30,000 instances for up to five epochs. To reduce memory consumption, we adopted the QLoRA framework (Dettmers et al., 2023). In this setup, the weights of base models were quantized to 4-bit precision using the bitsandbytes[12] library. Fine-tuning was then performed via 16-bit LoRA adapters. Following Dettmers et al. (2023), we only fine-tuned Query and Key projections layers within the attention modules. Each type of student model was fine-tuned with three different random seeds. We saved a checkpoint every 0.2 epochs and selected the one that achieved the highest Potential@1 (described later in § 3.4.1) on the development set. The prompt template in Appendix A was used for training and inference.
3.3.3 Baselines
As a baseline, we employed the instruction-tuned Gemma 2 9B (Gemma for short) in the same 5-shot setting used for the teacher model.
3.4 Evaluation
3.4.1 Automatic LS Metrics
To automatically evaluate the performance of LS systems, we used Accuracy@1@top1 (ACC) and Potential@1 (POT), as defined in Saggion et al. (2022). As shown in Table 1, ACC is the percentage of predictions matching the most frequently suggested alternative. POT is the percentage of predictions matching any suggested alternative. Given that all instances were assumed simplifiable after the selection process in § 3.2, any predictions unchanged from the target word were not considered a match for either ACC or POT, even if the target word was included in the gold alternatives.
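The per-instance logic of the two metrics can be sketched as follows, using the Table 1 instance as the example. This is an illustrative reading of the definitions above rather than the official scorer: `acc_pot` is a hypothetical helper, and treating tied top suggestions as all counting for ACC follows the Table 1 caption.

```python
from collections import Counter

def acc_pot(prediction, target, gold):
    """Return (acc_hit, pot_hit) for a single instance.

    gold: list of annotator suggestions, repeats included (non-empty).
    A prediction identical to the target never counts as a hit.
    """
    if prediction == target:
        return False, False
    counts = Counter(gold)
    top = max(counts.values())
    most_suggested = {word for word, c in counts.items() if c == top}
    return prediction in most_suggested, prediction in counts

gold = ["main", "main", "central", "central", "basic", "primary", "focal"]
hits_main = acc_pot("main", "focal", gold)
hits_basic = acc_pot("basic", "focal", gold)
hits_unchanged = acc_pot("focal", "focal", gold)
```

Corpus-level ACC and POT are then the fractions of instances for which the respective flag is true.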
[12] https://github.com/bitsandbytes-foundation/bitsandbytes
3.4.2 Latency Evaluation
To estimate model response time in resource-constrained environments, we constructed a virtual small-scale infrastructure using computing instances from Amazon Web Services (AWS). We selected m6g.large and m6g.xlarge computing instances from AWS Elastic Compute Cloud (EC2), which provide 2 and 4 virtual CPUs and 8 GB and 16 GB of memory, respectively. These configurations reflect the hardware commonly found in smartphones and tablets. Both computing instances are based on Graviton processors, which use the Arm architecture that is widely applied in mobile devices.[13]
Total latency mainly consists of prompt processing time and inference time. As both depend on the number of tokens in the prompt and the generated output, we measured the average per-token prompt processing and inference times for each model using llama.cpp. Notably, llama.cpp caches the initial fixed portion of the prompt (i.e., few-shot examples), so its processing latency is not incurred on subsequent inferences. While this caching is key to the efficiency, it makes dynamic prompting strategies impractical, as they would require frequent cache invalidation.
3.4.3 Manual LS Evaluation
To gain a more nuanced understanding of LS quality and safety from a user perspective, we conducted a manual evaluation. We randomly sampled 100 instances per language and assigned harmfulness tags to the alternatives generated by each system. Our manual evaluation focused on instances that were not covered by our automatic metrics. For this purpose, we only assigned tags to alternatives that were neither unchanged from the target nor included in the gold alternatives.
Taking into account the standard human evaluation criteria of fluency, adequacy, and simplicity in TS, we defined the following four harmful tags:
• Grammar Error: The alternative is grammatically incorrect, including inflection and conjugation errors.
• Change of Meaning: Replacing the target with the alternative drastically changes the meaning of context.
• More Difficult: The alternative is clearly more difficult than the target, even though it preserves the meaning to some extent.
• Gibberish: The alternative does not make sense at all.
[13] https://aws.amazon.com/ec2/instance-types/m6g/

                        LS Performance                                              Latency (msec / token)
                        English     Spanish     Catalan     German      Japanese    m6g.large   m6g.xlarge
Model       Settings    ACC   POT   ACC   POT   ACC   POT   ACC   POT   ACC   POT   read  pred  read  pred
Gemma(9B)   5-shot      .529  .751  .427  .774  .333  .690  .405  .643  .252  .494  652   581   326   292
Qwen(1.5B)  5-shot      .358  .534  .274  .473  .076  .205  .186  .298  .064  .150  91    275   45    139
            fine-tuned  .382  .574  .318  .537  .129  .265  .119  .206  .076  .154  86    274   43    138
Llama(1B)   5-shot      .202  .278  .053  .092  .047  .105  .090  .142  .023  .042  70    219   35    110
            fine-tuned  .370  .544  .293  .529  .160  .292  .138  .217  .058  .145  66    221   33    107
Table 3: Performance of models on the MultiLS dataset. Gemma was quantized to 4-bit due to memory constraints.

For each language, annotation was performed by a single in-house annotator, all of whom were native speakers except for Catalan. The Catalan annotation was conducted by a CEFR C1 level speaker with over ten years of experience. The task was designed as a simple binary decision to minimize subjectivity, ensuring the evaluation framework is easily extensible to other languages and domains.
Based on the automatically and manually assigned tags, alternatives were categorized into the following three groups. Tags determined by automatic metrics are marked with A, while those requiring manual annotation are marked with M.
• Beneficial
  – ACC (A): equivalent to Accuracy@1@top1.
  – POT (A): Potential@1 but not ACC.
  – Good (M): no harmful tags were assigned.
• Unchanged (A): alternative was identical to target.
• Harmful
  – Degraded (M): one or more non-Gibberish harmful tags were assigned.
  – Gibberish (M): Gibberish was assigned.
See Appendix E for detailed examples of the harmful tags and groups.
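The grouping above amounts to a small decision procedure, sketched here for illustration. The function and argument names are hypothetical, and letting Gibberish take precedence when it co-occurs with another tag is an assumption, since the text does not spell out that case.

```python
def categorize(prediction, target, acc_hit, pot_hit, harmful_tags):
    """Map one system output to a (group, label) pair.

    acc_hit / pot_hit: results of the automatic metrics (A).
    harmful_tags:      set of manually assigned tags (M); only consulted
                       when the output is neither unchanged nor a gold match.
    """
    if prediction == target:
        return ("Unchanged", "Unchanged")
    if acc_hit:
        return ("Beneficial", "ACC")
    if pot_hit:
        return ("Beneficial", "POT")
    if "Gibberish" in harmful_tags:      # assumed precedence over other tags
        return ("Harmful", "Gibberish")
    if harmful_tags:
        return ("Harmful", "Degraded")
    return ("Beneficial", "Good")

example = categorize("writing", "editing", False, False, {"Change of Meaning"})
```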
3.4.4 Filtering Strategy
To address the risk of introducing harmful simplifications discussed above, we propose and evaluate a filtering strategy. This strategy leverages the output probability score as a reliability signal in a threshold-based decision mechanism to determine whether to perform LS.
Probability Score. We computed the probability score as the sum of the log-probabilities of the tokens forming the alternative word, including the token indicating the end of the word (e.g., a newline or EOS token). We considered the probability scores of individual alternatives as candidate thresholds. For each threshold value, alternatives with scores above the threshold were accepted, while the others were rejected, in which case no simplification was applied.
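A minimal sketch of the score and the accept/reject decision is given below. The helper names and the token log-probabilities are invented for illustration; in a real system the per-token values would come from the model's output distribution.

```python
def probability_score(token_logprobs):
    """Sum of token log-probabilities, end-of-word token included."""
    return sum(token_logprobs)

def apply_filter(target, alternative, token_logprobs, threshold):
    """Return the alternative if its score is above the threshold,
    otherwise keep the original target word (no simplification)."""
    if probability_score(token_logprobs) > threshold:
        return alternative
    return target

# Invented log-probabilities: two confident tokens vs. three uncertain ones.
accepted = apply_filter("editing", "changing", [-0.2, -0.1], threshold=-1.0)
rejected = apply_filter("editing", "proofreading", [-2.5, -1.5, -0.4], threshold=-1.0)
```

Because the score is a sum rather than a mean, alternatives spanning more tokens tend to score lower, which is worth keeping in mind when a threshold is chosen.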
Evaluation. To quantitatively evaluate the effectiveness of the proposed strategy, we defined the following metrics:
• AUC (Beneficial vs. Harmful): To assess how well the probability score functions as a safety signal, we computed the Area Under the ROC Curve (Bradley, 1997) for classifying alternatives as Beneficial vs. Harmful, excluding Unchanged alternatives.
• BH0.1 (Beneficial Rate at 10% Harmful): To quantify practical benefit under a safety constraint, we reported the rate of Beneficial achieved when the rate of Harmful introduced was limited to 10% of total instances. We chose the 10% threshold to balance safety and utility by offering a practical reference point for comparison that remains adaptable to different needs.
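Both metrics can be computed directly from the per-alternative scores, as in the following sketch. It is one possible reading of the definitions above: AUC is computed in its pairwise (rank-based) form, `bh_at` is a hypothetical name, the scores are invented, and adding negative infinity as a "no filtering" threshold is an assumption.

```python
def auc(beneficial, harmful):
    """Probability that a Beneficial score exceeds a Harmful one (ties count 0.5)."""
    wins = sum((b > h) + 0.5 * (b == h) for b in beneficial for h in harmful)
    return wins / (len(beneficial) * len(harmful))

def bh_at(beneficial, harmful, n_total, max_harmful=0.10):
    """Best Beneficial rate over thresholds whose Harmful rate stays within the limit.

    Rates are taken over all n_total instances, Unchanged ones included.
    """
    thresholds = set(beneficial) | set(harmful) | {float("-inf")}
    best = 0.0
    for t in thresholds:
        harmful_rate = sum(s > t for s in harmful) / n_total
        if harmful_rate <= max_harmful:
            best = max(best, sum(s > t for s in beneficial) / n_total)
    return best

# Invented scores: 4 Beneficial, 4 Harmful, plus 2 Unchanged outputs (10 total).
ben = [-0.5, -1.0, -2.0, -3.5]
harm = [-1.5, -3.0, -4.0, -6.0]
auc_value = auc(ben, harm)
bh_value = bh_at(ben, harm, n_total=10)
```

In this toy case 13 of the 16 Beneficial-Harmful pairs are ordered correctly, and the best threshold that admits at most one Harmful output out of ten keeps three Beneficial ones.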
4 Results
4.1 Automatic Evaluation
Table 3 shows the automatic metric scores of our LS systems. The results confirm our hypothesis that fine-tuning, as part of the knowledge distillation, improved the performance of small LLMs. For example, fine-tuned Llama achieved 0.370 ACC on English, significantly higher than the 5-shot score (0.202). Similar gains were observed for both Llama and Qwen across most languages.
The fine-tuned Llama performed comparably to Qwen despite its smaller size, suggesting that the 1B model can approach the 1.5B model in performance after training. However, neither student model reached the teacher’s level.

[Figure 2: stacked bar charts, one panel per language (English, Spanish, Catalan, German, Japanese). x-axis: systems G-5, Q-5, Q-f, L-5, L-f; y-axis: proportion (0 to 1). Categories: ACC, POT, Good (Beneficial); Unchanged; Degraded, Gibberish (Harmful).]
Figure 2: Distribution of output alternative categories. G: Gemma, Q: Qwen, L: Llama. -5: 5-shot, -f: fine-tuned.

Table 3 also reports the latency (ms/token) of prompt reading (read) and output generation (pred). Both student models showed substantially lower latency than the teacher model. On m6g.large, Llama’s read latency (66 msec/token) was nearly 10 times faster than Gemma’s (652 msec/token), with similar trends across environments.
4.2 Manual Evaluation
Figure 2 shows the distribution of alternative categories, as judged by human evaluators, across models, settings, and languages. Each stacked bar represents the proportion of output alternatives falling into the categories.
Under 5-shot settings, small LLMs, especially Llama for English and Spanish, produced a high proportion of Unchanged outputs, indicating safe but less proactive simplification behavior. Fine-tuning reduced Unchanged outputs, with a corresponding rise in Beneficial simplifications, reflecting a general improvement in LS capability. However, fine-tuning also introduced a safety trade-off, as it increased the proportions of Harmful alternatives.
In contrast, such a trade-off was not observed for German and Japanese. For these languages, performance remained low across both 5-shot and fine-tuned settings, with Harmful alternatives consistently dominating the results. This suggests a more fundamental challenge stemming from the inherent difficulty of current small LLMs to perform LS effectively in these languages.

Lang  Model  Settings    B    H    AUC   BH0.1
En    Qwen   5-shot      .63  .30  .679  .41
             fine-tuned  .63  .27  .707  .46
      Llama  5-shot      .30  .28  .510  .12
             fine-tuned  .61  .20  .797  .54
Es    Qwen   5-shot      .51  .19  .737  .46
             fine-tuned  .63  .19  .850  .61
      Llama  5-shot      .09  .12  .907  .09
             fine-tuned  .56  .21  .804  .50
Ca    Qwen   5-shot      .34  .52  .735  .18
             fine-tuned  .38  .49  .904  .34
      Llama  5-shot      .15  .52  .614  .03
             fine-tuned  .46  .42  .813  .36
De    Qwen   5-shot      .41  .38  .841  .34
             fine-tuned  .38  .51  .721  .16
      Llama  5-shot      .19  .40  .730  .11
             fine-tuned  .35  .55  .737  .16
Ja    Qwen   5-shot      .33  .58  .807  .16
             fine-tuned  .21  .73  .799  .13
      Llama  5-shot      .16  .64  .745  .04
             fine-tuned  .28  .67  .845  .19
Table 4: Evaluation of Filtering Strategy. B and H refer to the original rate of Beneficial and Harmful outputs.

4.3 Filtering Strategy
Table 4 presents the results of the filtering strategy. First, the AUC scores are notably high, especially under fine-tuned settings, suggesting that log-probability serves as an effective signal for detecting Harmful alternatives. Moreover, the fine-tuned models generally show higher AUC across model types and languages, which indicates that knowledge distillation enhances the quality of probability as a safety indicator.
The BH0.1 metric shows the practical value of this strategy. For example, in Spanish, fine-tuned Qwen reduced the Harmful rate from 19% to 10% with only a slight drop in Beneficial from 63% to 61%. BH0.1 also highlights the superiority of fine-tuning over 5-shot settings.

[Figure 3: (top) violin plots of the sum of log-probability (x-axis, -10 to -2) for Beneficial (B) and Harmful (H) alternatives under 5-shot and fine-tuned settings; (bottom) line plot of the rate of Beneficial (B) or Harmful (H) outputs (y-axis, 0.0 to 0.5) against the percentile of the log-probability threshold in % (x-axis, 0 to 100), with marked points (70, 0.18) and (51, 0.34).]
Figure 3: Beneficial and Harmful alternatives and their probability of Qwen in Catalan. (Top) Distribution of raw probability scores. (Bottom) Rate of Beneficial and Harmful alternatives after filtering at each percentile threshold. Dotted lines are plotted on thresholds where Harmful becomes 10%.

To further explore these findings, we focus on the behavior of Qwen models in Catalan. Here, while the original Beneficial and Harmful rates are close between 5-shot and fine-tuned settings, the impact of the filtering strategy differs significantly. In Figure 3, the violin plot (top) visualizes the distribution of log-probability scores, where fine-tuning leads to a clear separation between Beneficial and Harmful alternatives.
The line plot (bottom) tracks Beneficial and Harmful rates across threshold percentiles. For the fine-tuned model, increasing the threshold reduces Harmful rapidly, while Beneficial declines more gradually. As a result, the Harmful rate is reduced from nearly 50% to 10%, with most Beneficial simplifications preserved.

Context: There are also different editing styles in the sense of how bold people are willing to be.
Target Word: editing
Gold Alternatives: changing, modifying, altering ...
Gemma 5-shot (4%): writing (Change of Meaning)
Qwen 5-shot (92%): writing (Change of Meaning)
Qwen fine-tuned (3%): proofreading (More Difficult)
Table 5: Example outputs from the LS systems. Percentages next to system names indicate log-probability percentiles within each system.

5 Discussion
5.1 Case Study
To better understand the characteristics of model outputs, particularly harmful simplifications overlooked by automatic metrics and the potential of the log-probability signal, we present an example in Table 5. In this example, model output alternatives "writing" and "proofreading" were categorized as Harmful, with the tags "Change of Meaning" and "More Difficult", respectively. Crucially, these alternatives were associated with lower log-probability percentiles for Gemma (5-shot) and fine-tuned Qwen, while they were much higher for Qwen under the 5-shot setting. This case confirms our findings that fine-tuned models effectively leverage log-probability to identify harmful alternatives. It also shows that log-probability is a useful signal for the teacher model, even without fine-tuning. This validates the filtering process used during data synthesis. Examples in other languages are described in Appendix F.
5.2 Safety
As exemplified by the case study above, harmful LS alternatives pose a serious risk in real-world scenarios. Our manual evaluation revealed key limitations of standard automatic evaluation metrics based on human-provided alternatives. They fail to identify acceptable simplifications not in the gold alternatives, and they do not expose harmful alternatives. Although manual evaluation is costly and not scalable, our harmfulness annotations provide a valuable basis for building automatic detection methods, such as LLM-as-a-judge, to support more practical safety assessment.
Harmful alternatives were particularly pronounced in German and Japanese. In these languages, complex morphology may hinder the consistent generation of correct and simple single-word alternatives by small LLMs. Our error analysis highlights a critical challenge related to this: alternatives with the Grammar Error tag in German and Japanese often received high probability scores from small LLMs (both few-shot and fine-tuned), making them difficult to distinguish from beneficial alternatives or other harmful types. For instance, the average log-probability score of Grammar Error from the fine-tuned Llama model in Japanese was -2.992, which was notably higher than that of Change of Meaning (-3.762) and Gibberish (-4.457). This suggests that our filtering strategy had limited effectiveness in mitigating grammar errors.
Interestingly, this issue was less prevalent in the teacher model (see Appendix G for details across all tags and models). This disparity implies that non-small LLMs can better leverage output probability as a signal for grammatical correctness even in morphologically complex languages. In contrast, small LLMs may struggle to capture these fine-grained grammatical nuances with simple approaches such as in-context learning and knowledge distillation. Incorporating instances with grammatical errors as negative examples in contrastive learning may help student models learn to avoid them, enhancing the reliability of threshold-based filtering.
While log-probability is effective for filtering harmful alternatives, selecting an appropriate threshold for real-world use requires careful tuning based on human evaluation, taking into account domain- and language-specific considerations and practical application needs, to ensure both safety and utility.
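The threshold-based filtering discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` record, the function names, and the example scores are hypothetical, and any threshold would need to be tuned on human-annotated data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record: one generated alternative and its annotation."""
    word: str
    logprob: float  # average log-probability assigned by the model
    harmful: bool   # binary human harmfulness annotation


def keep(candidate: Candidate, threshold: float) -> bool:
    """Keep an alternative only if its log-probability reaches the threshold."""
    return candidate.logprob >= threshold


def evaluate_threshold(candidates, threshold):
    """Return (harmful kept, beneficial kept) counts for one threshold."""
    kept = [c for c in candidates if keep(c, threshold)]
    harmful_kept = sum(c.harmful for c in kept)
    return harmful_kept, len(kept) - harmful_kept


# Illustrative values only (not taken from the paper's data).
annotated = [
    Candidate("use", -0.8, False),
    Candidate("apply", -1.9, False),
    Candidate("usee", -4.5, True),
    Candidate("avoid", -3.1, True),
]

# Sweep a few candidate thresholds to inspect the safety/utility trade-off.
for t in (-1.0, -2.0, -4.0):
    print(t, evaluate_threshold(annotated, t))
```

A stricter (higher) threshold removes more harmful alternatives but also discards beneficial ones, which is why the choice depends on the application.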
5.3 Latency
While the smaller models offer substantial speed improvements, their practical inference speed for real-time and on-device LS needs further consideration. Assuming that a standard input consists of 30 tokens and the output alternative word is composed of two tokens, the overall inference time of fine-tuned Llama on the faster m6g.xlarge environment would be about 1.2 seconds: (30 tokens * 33 ms/token [read]) + (2 tokens * 107 ms/token [pred]) = 1204 ms.
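The estimate above can be reproduced with a short calculation. The function name and parameter names are illustrative; the per-token figures are the ones quoted in the text.

```python
def inference_time_ms(n_input_tokens, n_output_tokens,
                      read_ms_per_token, pred_ms_per_token):
    """Estimate latency as prompt-reading time plus token-generation time."""
    return (n_input_tokens * read_ms_per_token
            + n_output_tokens * pred_ms_per_token)


# 30 input tokens at 33 ms/token, 2 output tokens at 107 ms/token.
total_ms = inference_time_ms(30, 2, 33, 107)
print(total_ms)  # 1204
```

Because the input term dominates (990 ms of the 1204 ms), the number of prompt tokens is the main lever for reducing latency under this simple model.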
Although a response time of around one second may be tolerable in some cases, further reduction would likely improve the user experience on mobile devices. One possible approach is to reduce the prompt size by including only a limited window of words surrounding the target, rather than the full sentence. Naturally, this strategy would require careful safety assessment.
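A context window of this kind could be extracted as in the sketch below. The function name, the `radius` parameter, and whitespace splitting are assumptions made for illustration; languages without whitespace word boundaries, such as Japanese, would need a proper segmenter, and model tokens do not correspond one-to-one to words.

```python
def context_window(words, target_index, radius):
    """Return the words within `radius` positions of the target,
    together with the target's index inside the returned window."""
    start = max(0, target_index - radius)
    end = min(len(words), target_index + radius + 1)
    return words[start:end], target_index - start


sentence = ("To ensure a prompt and appropriate resolution we recommend "
            "utilizing the consultation service").split()
window, idx = context_window(sentence, sentence.index("utilizing"), radius=3)
print(" ".join(window))  # resolution we recommend utilizing the consultation service
print(window[idx])       # utilizing
```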
6 Conclusion
This study addressed the challenge of building efficient and safe LS systems using small LLMs, motivated by real-world needs. We proposed benchmark systems in five languages based on in-context learning and knowledge distillation, and introduced a filtering strategy using log-probability as a safety signal. Experiments showed that small LLMs offer significant efficiency gains, but that knowledge distillation, while improving automatic metric scores, increases harmful outputs.
We demonstrated that output log-probability serves as an effective signal for detecting harmful simplifications. This signal enables a filtering strategy that reduces harmful outputs while retaining beneficial ones. These findings lay the foundation for lightweight LS systems that remain safe and effective across languages.
Future work should improve training methods to reduce harmfulness and explore real-time LS for mobile environments. Ultimately, this research advances deployable, trustworthy LS tools that support inclusive information access.
Limitations
Our study, while demonstrating the potential of small LLMs for efficient and safe lexical simplification, has several limitations that highlight directions for further investigation.
First, the manual evaluation of harmfulness was conducted by a single annotator per language. While the annotation task was designed as a simple binary decision to minimize subjectivity, we were unable to assess inter-annotator agreement, which may affect the generalizability of the harmfulness evaluations. Establishing a more robust evaluation protocol with multiple annotators would be a valuable next step to create a gold-standard dataset for harmfulness detection in LS.
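With a second annotator, agreement on the binary harmfulness decision could be quantified. The sketch below uses Cohen's kappa, one common agreement measure; the study does not name a specific statistic, so this choice, the function name, and the example labels are assumptions.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Illustrative labels only: 1 = harmful, 0 = not harmful.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```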
Next, we employed relatively simple prompt engineering, using fixed 5-shot examples and prompt templates to ensure reproducibility and establish baseline performance. We did not explore advanced prompt engineering techniques, which could potentially enhance the models' performance. Future work could investigate how more sophisticated prompting strategies impact the trade-off between performance and safety explored in this study.
Spanish
Context: Pero si eso ocurre habitualmente, tienes un flujo de fondos negativo y tu presupuesto está desequilibrado.
(But if that happens habitually, you have a negative cash flow and your budget is not in equilibrium.)
Target Word: desequilibrado (not in equilibrium)
Gold Alternatives: inestable (unstable), desnivelado (uneven), desbalanceado (unbalanced) ...
Gemma 5-shot (5%): desbalanceado (unbalanced) (Beneficial (POT))
Llama 5-shot (55%): equilibrado (in equilibrium) (Change of Meaning)
Llama fine-tuned (77%): equilibrado (in equilibrium) (Change of Meaning)
Ca alan
Context: En el manifest s'ha qualificat "d'escandalosa" la sentència contra els membres de "la Manada" ja que "se'n riu i menysprea una dona jove" que va ser agredia "brutalment per un grup de salvatges".
(In the statement, the sentence against the members of "la Manada" was described as "scandalous" since "laughs at and despises a young woman" who was assaulted "brutally by a group of savages".)
Target Word: brutalment (brutally)
Gold Alternatives: violentament (violently), fortament (strongly), durament (severely) ...
Gemma 5-shot (51%): violentament (violently) (Beneficial (POT))
Llama 5-shot (41%): brutalment (brutally) (Unchanged)
Llama fine-tuned (8%): malamentamentamentamentamentamentamentamentament (Gibberish)
German
Context: Salzborn nennt als in die moderne Begriffsgenese von Demokratie eingeschriebene Werte: (...) und die Gewähr elementarer Rechte der Menschen gegen den Staat.
(Salzborn names as values inscribed into the modern conceptual genesis of democracy: (...) and the guarantee of elementary rights of human beings against the state.)
Target Word: elementarer (elementary)
Gold Alternatives: grundlegende (fundamental), wichtige (important), essentielle (essential) ...
Gemma 5-shot (57%): grundlegende (fundamental) (Grammar Error)
Llama 5-shot (35%): Grundrecht (fundamental right) (Grammar Error)
Llama fine-tuned (9%): grundstehende (ground-standing) (More Difficult)
Japanese
Context: 迅速に適切な解決を図るために、相談窓口を活用することをお奨めします。
(To ensure a prompt and appropriate resolution, we recommend utilizing the consultation service.)
Target Word: 活用する (utilizing)
Gold Alternatives: 使う (use), 利用する (make use of), ...
Gemma 5-shot (97%): 利用する (make use of) (Beneficial (ACC))
Qwen 5-shot (63%): 使おう (Grammar Error)
Qwen fine-tuned (76%): 使 (Grammar Error)
Table 11: Example outputs from the LS systems. Percentages next to system names indicate log-probability percentiles within each system.
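A log-probability percentile within a system could be computed as in the sketch below. The exact percentile definition used for Table 11 is not stated here, so the "at or below" convention, the function name, and the example scores are assumptions.

```python
def logprob_percentile(score, system_scores):
    """Percentage of a system's outputs whose log-probability
    is at or below `score` (assumed convention)."""
    at_or_below = sum(s <= score for s in system_scores)
    return 100.0 * at_or_below / len(system_scores)


# Illustrative scores only (not taken from the paper's data).
scores = [-4.0, -3.0, -2.0, -1.0]
print(logprob_percentile(-2.0, scores))  # 75.0
```

Under this convention, a low percentile means the system was comparatively unconfident about that output relative to its other outputs.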

                    English       Spanish       Catalan       German        Japanese
Tags                  #  Logprob    #  Logprob    #  Logprob    #  Logprob    #  Logprob
Gemma-5shot
(All)               100   -1.615  100   -1.567  100   -1.679  100   -1.588  100   -2.268
More Difficult        4   -1.905    2   -1.989    0        -    1   -1.300    4   -2.031
Change of Meaning    14   -1.874    2   -1.266    6   -1.907    6   -1.620    4   -2.447
Grammar Error         1   -2.158    2   -1.409    3   -1.836    7   -1.835   10   -2.528
Gibberish             3   -2.056    1   -2.944    3   -2.227    1   -1.975    2   -3.617
Qwen-5shot
(All)               100   -1.884  100   -2.129  100   -3.592  100   -2.754  100   -4.132
More Difficult        2   -2.203    1   -3.013    0        -    3   -3.217    3   -3.882
Change of Meaning    20   -2.088    8   -2.957   24   -3.976   20   -3.834   23   -4.766
Grammar Error         3   -2.339   12   -2.469   24   -3.927   20   -3.254   15   -4.220
Gibberish             5   -1.991    0        -   13   -4.440    2   -4.253    2   -5.209
Qwen-fine-tuned
(All)               100   -1.297  100   -2.063  100   -4.431  100   -3.667  100   -3.337
More Difficult        2   -2.033    0        -    0        -    1   -4.934    1   -3.421
Change of Meaning    14   -1.617   12   -3.001   18   -5.692   34   -4.250   21   -3.697
Grammar Error         5   -1.514    9   -3.390   17   -5.161   10   -3.296   17   -3.018
Gibberish             7   -1.408    4   -5.029   21   -6.047    9   -5.206   37   -4.021
Llama-5shot
(All)               100   -1.807  100   -1.244  100   -2.873  100   -3.135  100   -4.204
More Difficult        3   -1.603    1   -1.802    0        -    2   -4.501    0        -
Change of Meaning    14   -2.045    8   -1.573   32   -3.246   13   -3.537   16   -4.520
Grammar Error         0        -    3   -1.851   28   -3.374   14   -3.275   19   -3.011
Gibberish            12   -1.604    1   -2.686    6   -3.517   13   -3.375   31   -6.016
Llama-fine-tuned
(All)               100   -1.161  100   -1.862  100   -2.880  100   -3.645  100   -3.360
More Difficult        0        -    0        -    0        -    4   -4.867    3   -4.091
Change of Meaning    11   -1.465   16   -2.720   17   -3.012   25   -4.147   20   -3.762
Grammar Error         3   -1.764    6   -2.834   13   -3.260   14   -3.304   12   -2.992
Gibberish             6   -1.697    3   -2.219   18   -4.415   15   -4.918   35   -4.457
Table 12: Average log-probability scores for each language and harmful tag.
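Per-tag averages of the kind reported in Table 12 can be obtained by grouping annotated outputs by tag. The sketch below assumes each output is available as a hypothetical `(tag, logprob)` pair; the record layout, function name, and example values are illustrative and not taken from the paper's data.

```python
from collections import defaultdict


def average_logprob_by_tag(records):
    """Return {tag: (count, mean log-probability)} for (tag, logprob) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tag, logprob in records:
        sums[tag] += logprob
        counts[tag] += 1
    return {tag: (counts[tag], sums[tag] / counts[tag]) for tag in sums}


# Illustrative records only.
records = [
    ("Grammar Error", -1.0),
    ("Grammar Error", -3.0),
    ("Gibberish", -5.0),
]
print(average_logprob_by_tag(records))
```

Tags with no instances do not appear in the result, which corresponds to the "0" and "-" cells in the table.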