A SEMI-SUPERVISED FRAMEWORK NAMED AUGSBERT-UZ FOR HIGH-PERFORMANCE SEMANTIC TEXTUAL SIMILARITY IN UZBEK

Author: B.B. Muminov, N.M. Allaberganova
Publisher: Zenodo
DOI: 10.5281/zenodo.17693973
Source: https://zenodo.org/records/17693973/files/A.T.-6.pdf
SCIENCE AND INNOVATION
INTERNATIONAL SCIENTIFIC JOURNAL VOLUME 4 ISSUE 11 NOVEMBER 2025
ISSN: 2181-3337 | SCIENTISTS.UZ
B.B. Muminov1, N.M. Allaberganova2
DSc., Professor, Tashkent State University of Economics, Tashkent, Uzbekistan1
PhD student, Tashkent University of Information and Technologies, Tashkent, Uzbekistan2
https://doi.org/10.5281/zenodo.17693973
Abstract. Semantic Textual Similarity (STS) is one of the fundamental tasks of Natural
Language Processing (NLP). Because Uzbek lacks large-scale annotated datasets and is a
morphologically rich language, STS remains a significant challenge for researchers. Standard
Transformer-based cross-encoders offer high accuracy but are computationally prohibitive for
large-scale applications, whereas bi-encoders are fast but require substantial training data to
perform well. In this paper, we introduce AugSBERT-Uz, a novel semi-supervised framework that
produces a state-of-the-art sentence embedding model for the Uzbek language. The framework employs
a “teacher-student” knowledge distillation approach. First, a high-accuracy cross-encoder (the
“teacher”), based on the monolingual BERTbek model, is fine-tuned on a small, human-annotated
“gold” dataset. This teacher model is then used to automatically label millions of sentence pairs
from a large unlabeled corpus, producing a vast “silver-standard” dataset. Finally, a bi-encoder
(the “student”) with a Siamese architecture is trained on this augmented dataset using Multiple
Negatives Ranking Loss. The proposed framework enables the bi-encoder to achieve performance
remarkably close to that of the high-accuracy cross-encoder, with a Spearman correlation of 83.2,
while retaining its computational efficiency (an inference time of about 5 seconds on a
10,000-sentence corpus), making it suitable for large-scale semantic search and clustering tasks.
This method effectively bridges the performance gap caused by data scarcity, yielding a model that
is both accurate and scalable. AugSBERT-Uz presents a novel and scalable solution for developing
high-quality semantic representations for low-resource, agglutinative languages. This work provides
the first high-performance, publicly available sentence embedding model for Uzbek, paving the way
for advancements in regional NLP applications.
Keywords: Semantic Textual Similarity, Low-Resource NLP, Uzbek Language, Data
Augmentation, Knowledge Distillation, Sentence-BERT, BERTbek.
INTRODUCTION
Semantic Textual Similarity (STS) is an essential element of Natural Language Processing
(NLP), supporting applications such as information search, automatic question-answering systems,
machine translation, paraphrase detection and text summarization [1, 2]. While enormous
achievements have been made for high-resource languages such as English, facilitated by extensive
annotated datasets, these improvements have not been easily applicable to low-resource languages [3].
Uzbek, which belongs to the Turkic language family, poses a dual difficulty for STS.
On the one hand, it is a low-resource language, lacking the extensive annotated corpora required
to train high-performance deep learning models. On the other hand, its agglutinative nature results
in a rich and complex morphology, where numerous affixes can be attached to a root morpheme
to form a vast number of word variations [3]. This morphological complexity leads to data sparsity,
making it difficult for standard models to generalize effectively from limited data [4].
Current state-of-the-art approaches to STS are dominated by Transformer-based
architectures [5], primarily falling into two paradigms: cross-encoders and bi-encoders.
Cross-encoders, such as those built on monolingual models like BERTbek [6], process a pair of
sentences simultaneously, enabling deep token-level interaction via self-attention mechanisms. This
results in superior accuracy but suffers from prohibitive computational complexity (O(n²) model
passes for a collection of n sentences), making them impractical for large-scale search or
clustering tasks [7]. Conversely, bi-encoders, popularized by Sentence-BERT (SBERT) [8],
independently map each sentence to a fixed-size vector. This allows for highly efficient
similarity comparisons (O(n) encoder passes), but their performance is heavily reliant on
large-scale training data, which is unavailable for Uzbek.
To address this scientific gap, we propose AugSBERT-Uz, a novel semi-supervised
framework designed to develop a high-performance, computationally efficient sentence-embedding
model for the Uzbek language. Our primary contribution is the application of a
“teacher-student” knowledge distillation methodology to overcome the data scarcity problem [9].
We leverage the high accuracy of a fine-tuned BERTbek cross-encoder (the “teacher”) to
automatically generate a large-scale, machine-labeled “silver-standard” dataset from unlabeled
Uzbek text [10]. We then train an efficient SBERT-based bi-encoder (the “student”) on this
augmented dataset. This approach represents the first attempt to create a state-of-the-art, scalable
STS model for Uzbek, providing a significant contribution to the regional and global NLP
community by offering a robust tool and a replicable methodology for other low-resource
agglutinative languages.
LITERATURE REVIEW
The task of Semantic Textual Similarity (STS) has evolved significantly, moving from
traditional statistical methods to sophisticated deep learning architectures. This section reviews the
key advancements in STS, focusing on Transformer-based models, challenges in low-resource
settings, and the current state of NLP for the Uzbek language, thereby contextualizing the
contribution of the proposed framework.
2.1. EVOLUTION OF SEMANTIC TEXTUAL SIMILARITY MODELS
Early approaches to STS relied on lexical overlap features and statistical methods such as
TF-IDF with cosine similarity. While computationally efficient, these methods fail to capture
deeper semantic meaning, struggling with synonymy and polysemy [11]. The advent of
distributional semantics led to the use of word embeddings like Word2Vec and GloVe, where
sentence representations were typically derived by averaging the vectors of their constituent words
[12]. However, this approach disregards word order and syntactic structure, limiting its
effectiveness.
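The limitation of lexical methods can be illustrated with a minimal sketch, assuming whitespace tokenization and a smoothed IDF; the three Uzbek sentences are illustrative examples chosen here, not data from this study. A paraphrase with few shared words scores lower than an unrelated sentence with many shared words.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch) keeps all weights positive.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


sentences = [
    "men kitob o'qidim",        # "I read a book"
    "men asar mutolaa qildim",  # paraphrase with almost no shared words
    "men kitob sotdim",         # "I sold a book": different meaning, shared words
]
vecs = tfidf_vectors(sentences)
sim_paraphrase = cosine(vecs[0], vecs[1])
sim_lexical = cosine(vecs[0], vecs[2])
print(sim_paraphrase < sim_lexical)
```

Because the score depends only on shared surface tokens, inflected forms of the same root would also count as different terms, which is one reason such baselines are weak for agglutinative languages.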
A significant breakthrough came with the application of Recurrent Neural Networks
(RNNs), particularly Long Short-Term Memory (LSTM) networks, within a Siamese architecture
[13]. These models process each sentence in a pair through identical, weight-sharing LSTMs to
generate fixed-size sentence vectors, which are then compared using a distance metric. This
architecture, trained with metric learning objectives like contrastive or triplet loss, was the first to
effectively learn sentence representations end-to-end for similarity tasks [14].
2.2. TRANSFORMER-BASED PARADIGMS: CROSS-ENCODERS VS. BI-ENCODERS
The introduction of the Transformer architecture, and specifically BERT, revolutionized
the NLP landscape [15]. For STS, two dominant paradigms emerged:
1) Cross-Encoders: In this setup, both sentences are concatenated with a special separator
token and fed into a single Transformer model (e.g., BERT) simultaneously. The model
applies self-attention across the entire input, allowing for deep, token-level interactions
between the two sentences. This results in state-of-the-art accuracy for sentence-pair
classification and regression tasks [16]. However, this approach is computationally
prohibitive for large-scale applications like semantic search or clustering, as every possible
pair of sentences must be passed through the network, so the number of forward passes
grows quadratically with the size of the collection [17].
2) Bi-Encoders: To address the scalability issue, the Sentence-BERT (SBERT) model was
proposed by Reimers and Gurevych [18]. SBERT utilizes a Siamese architecture where
two identical, weight-sharing BERT models process each sentence independently to
generate fixed-size sentence embeddings. These embeddings can be efficiently compared
using cosine similarity. This reduces the time needed to find the most similar pair in a
large collection from hours to seconds. However, the high performance of SBERT is
contingent on fine-tuning on large, human-annotated datasets (e.g., SNLI, MultiNLI) [18],
which are not available for most languages.
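The scalability difference between the two paradigms can be made concrete by counting Transformer forward passes. The sketch below is a simple counting argument; the function names are hypothetical and introduced only for illustration.

```python
def cross_encoder_passes(n):
    """A cross-encoder needs one forward pass per unordered sentence pair."""
    return n * (n - 1) // 2


def bi_encoder_passes(n):
    """A bi-encoder encodes each sentence once; pairs are then compared
    with inexpensive cosine-similarity operations on the stored vectors."""
    return n


n = 10_000
print(cross_encoder_passes(n))  # 49995000
print(bi_encoder_passes(n))     # 10000
```

For a collection of 10,000 sentences, the cross-encoder requires about 50 million model evaluations, whereas the bi-encoder requires 10,000; the remaining pairwise cosine comparisons are still quadratic in number but far cheaper than a Transformer pass.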
2.3. DATA SCARCITY IN LOW-RESOURCE AND MORPHOLOGICALLY RICH
LANGUAGES
The primary bottleneck for developing high-performance NLP models for languages like
Uzbek is the scarcity of labeled data [19]. This problem is exacerbated in morphologically rich,
agglutinative languages, where a single root can generate a vast number of word forms through
affixation [20]. This leads to data sparsity and makes it harder for models to learn robust
representations [21].
To overcome this, various data augmentation and knowledge transfer techniques have been
explored. Knowledge distillation, a “teacher-student” learning paradigm, has proven effective for
transferring knowledge from a large, complex model to a smaller, more efficient one [22].
Specifically, for sentence embedding, the Augmented SBERT (AugSBERT) framework was
proposed by Thakur et al. [16]. This semi-supervised method uses a highly accurate but slow
cross-encoder (the “teacher”) to label a large number of unlabeled sentence pairs, creating a
“silver-standard” dataset. This large, machine-labeled dataset is then used to train a fast and
efficient bi-encoder (the “student”), effectively combining the former’s accuracy with the latter’s
speed. This approach is highly effective for in-domain and domain-adaptation tasks in other
languages [16].
2.4. STATE OF NLP FOR THE UZBEK LANGUAGE
Recent years have seen foundational progress in developing resources for the Uzbek
language. Several monolingual Transformer-based models have been introduced, including
UzBERT [23], UzRoBERTa [24], and BERTbek [25], which have consistently outperformed
multilingual models on various downstream tasks like text classification and named entity
recognition. Furthermore, essential evaluation datasets have been developed, most notably
SimRelUz [26], the first benchmark for word-level semantic similarity and relatedness in Uzbek.
While these resources are crucial building blocks, a high-performance, scalable model for
sentence-level semantic similarity remains an open research problem. The proposed framework
aims to fill this research gap.
To the best of our knowledge, this is the first study to apply a knowledge distillation
framework to create a state-of-the-art sentence embedding model for the Uzbek language. By
adapting the Augmented SBERT methodology [16] and leveraging the existing BERTbek model
[25], we propose a novel solution that addresses both the data scarcity and computational
efficiency challenges inherent to developing advanced NLP tools for Uzbek.
METHODOLOGY
This section outlines the methodology for developing and evaluating AugSBERT-Uz, our
proposed semi-supervised framework for building a high-performance sentence-embedding model
for the Uzbek language. We describe the overall architecture, the datasets used, the three-stage
training process involving a “teacher-student” model [27, 28], and the experimental setup for
evaluation.
3.1. THE AugSBERT-Uz FRAMEWORK
The proposed approach is based on the “teacher-student” paradigm of knowledge
distillation, designed to overcome the scarcity of labeled data for Uzbek. The core idea is to
leverage a highly accurate but computationally expensive cross-encoder (the “teacher”) to generate
a large-scale, pseudo-labeled dataset. This large dataset is then used to train a computationally
efficient bi-encoder (the “student”), enabling it to achieve high performance in large-scale tasks.
The entire framework consists of three main stages, as illustrated in Figure 1.
[Figure 1 (flow diagram):
Stage 1: Train Teacher. Gold Dataset (human-labeled pairs) → BERTbek Cross-Encoder.
Stage 2: Generate Silver Data. Unlabeled Uzbek Corpus (millions of sentences) → Candidate Mining (BM25) → Pseudo-labeling → Silver Dataset (machine-labeled pairs).
Stage 3: Train Student. Gold Dataset + Silver Dataset → SBERT Bi-Encoder → AugSBERT-Uz Model.]
Figure 1. Three main stages of the entire framework.
3.2. DATASETS AND CORPORA
Our framework utilizes three types of datasets:
1) “Gold” Dataset: For the initial fine-tuning of the “teacher” model, a small, high-quality
dataset of human-annotated sentence pairs is required [39]. As no large-scale, sentence-level STS
or Natural Language Inference (NLI) dataset currently exists for Uzbek, we will construct a “gold”
dataset from available resources. We will source positive pairs (semantically similar sentences)
from parallel corpora, such as the Uzbek-Kazakh parallel corpus [5] and Uzbek-English translation
pairs, under the assumption that translations are semantically equivalent. Negative pairs will be
generated by random sampling from the corpus. This initial dataset will comprise approximately
5,000 sentence pairs.
2) Unlabeled Corpus: To generate the “silver” dataset, we will use a large, unlabeled corpus
of Uzbek text. This corpus will be a combination of several sources, including the Uzbek portion
of the CC-100 corpus [29], the uzWaC corpus [30], and the news corpora used to pre-train
monolingual models like BERTbek. This combined corpus contains over 150 million words,
providing a vast source of candidate sentence pairs.
3) Evaluation Dataset: For a fair and robust evaluation, we will use the SimRelUz dataset
[26]. Although it is a word-level similarity dataset, it serves as the only standardized benchmark
for semantic evaluation in Uzbek. We will adapt it to sentence-level evaluation by creating simple
sentences around the word pairs (e.g., “Bu [so‘z1]” and “Bu [so‘z2]”). The final model’s
performance will be measured by the Spearman correlation between its predicted similarity scores
and the human-annotated scores.
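The construction of the “gold” dataset in item 1) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the toy translation pairs, and the label values 1.0 and 0.0 for positive and negative pairs are assumptions made here.

```python
import random


def build_gold_pairs(parallel_pairs, seed=0):
    """Build (sentence_a, sentence_b, label) triples from translation pairs.

    Positives: each aligned translation pair, labeled 1.0 (assumed label).
    Negatives: each source sentence paired with a randomly drawn,
    non-aligned target sentence, labeled 0.0 (assumed label).
    """
    if len(parallel_pairs) < 2:
        raise ValueError("need at least two parallel pairs to sample negatives")
    rng = random.Random(seed)
    examples = []
    for i, (src, tgt) in enumerate(parallel_pairs):
        examples.append((src, tgt, 1.0))
        # Draw an index different from i so the negative is never the true translation.
        j = rng.randrange(len(parallel_pairs) - 1)
        if j >= i:
            j += 1
        examples.append((src, parallel_pairs[j][1], 0.0))
    return examples


# Toy Uzbek-English pairs, used only to exercise the function.
toy_parallel = [
    ("Salom", "Hello"),
    ("Rahmat", "Thank you"),
    ("Xayr", "Goodbye"),
]
gold = build_gold_pairs(toy_parallel)
print(len(gold))  # 6
```

Random negatives are easy for a model to separate, which is one reason the later BM25-mined pairs, with partial lexical overlap, provide a richer range of similarity scores.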
3.3. MODEL ARCHITECTURE AND TRAINING
The core of our methodology is a three-stage training process.
1) Stage 1: The “Teacher” Cross-Encoder
The “teacher” model is a Transformer-based cross-encoder designed for high accuracy.
Model: We use the pre-trained BERTbek model [25], which has demonstrated
state-of-the-art performance on various Uzbek NLP tasks. For a given sentence pair (s_a, s_b), the
input is formatted as [CLS] sentence_a [SEP] sentence_b [SEP]. The entire sequence is processed by
the BERTbek model, allowing for deep cross-attention between the two sentences. The output
representation of the [CLS] token is then passed to a single linear layer with a sigmoid activation
function to predict a similarity score between 0.0 and 1.0 (Figure 2).
[Figure 2 (diagram): S_a, S_b → BERTbek → Linear layer → predicted similarity score (between 0.0 and 1.0).]
Figure 2. Overview of the AugSBERT-Uz model.
Training: The model is fine-tuned on the “gold” dataset. The training objective is to
minimize the Mean Squared Error (MSE) between the predicted similarity score $\hat{y}_i$ and the
true human-annotated score $y_i$:
$$L_{Teacher} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
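The regression head and the MSE objective of the “teacher” can be sketched in a few lines of plain Python. The pooled vector, weights, and bias below are hypothetical toy values; in practice the vector would come from BERTbek and the parameters would be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_score(pooled_vector, weights, bias):
    """Single linear layer followed by a sigmoid, giving a score in (0, 1)."""
    logit = sum(w * h for w, h in zip(weights, pooled_vector)) + bias
    return sigmoid(logit)


def mse_loss(y_true, y_pred):
    """Mean squared error over N sentence pairs."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Toy values (assumptions for illustration only).
score = predict_score([0.0, 0.0], weights=[1.0, -1.0], bias=0.0)
loss = mse_loss([1.0, 0.0], [0.9, 0.2])
print(score)           # 0.5
print(round(loss, 3))  # 0.025
```

The sigmoid keeps predictions in the same range as the normalized gold scores, so the squared error compares like with like.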
2) Stage 2: Silver Dataset Generation
This stage involves creating a large-scale, pseudo-labeled dataset.
Candidate Mining: To avoid the computationally infeasible task of scoring all possible
sentence pairs from the unlabeled corpus, we employ an efficient candidate mining strategy. We
use the Okapi BM25 algorithm, a robust lexical search method, to retrieve the top-k (e.g., k=50)
most relevant sentences for each sentence in the corpus [31]. This ensures that the “teacher” model
focuses on pairs that are likely to have some semantic overlap.
Pseudo-Labeling: The fine-tuned “teacher” model from Stage 1 is then used to predict
similarity scores for the millions of candidate pairs generated by BM25. These sentence pairs,
along with their machine-generated scores, constitute our “silver-standard” dataset.
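Stage 2 can be sketched end to end with a small pure-Python BM25 scorer. This is an illustrative sketch under stated assumptions: whitespace tokenization, the common defaults k1 = 1.5 and b = 0.75, a non-negative IDF variant, and a hypothetical `toy_teacher` (Jaccard word overlap) standing in for the fine-tuned cross-encoder.

```python
import math
from collections import Counter


class BM25:
    def __init__(self, corpus_tokens, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokens) for tokens in corpus_tokens]
        self.doc_len = [len(tokens) for tokens in corpus_tokens]
        self.avg_len = sum(self.doc_len) / len(self.doc_len)
        n = len(corpus_tokens)
        df = Counter(t for tokens in corpus_tokens for t in set(tokens))
        # Non-negative IDF variant (an assumption of this sketch).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query_tokens, idx):
        tf, length = self.docs[idx], self.doc_len[idx]
        total = 0.0
        for t in set(query_tokens):
            if t not in tf:
                continue
            norm = tf[t] + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total


def mine_candidates(sentences, top_k=50):
    """Return sorted index pairs (i, j), i < j, of each sentence's top-k BM25 hits."""
    tokens = [s.lower().split() for s in sentences]
    bm25 = BM25(tokens)
    pairs = set()
    for i, query in enumerate(tokens):
        scored = [(bm25.score(query, j), j) for j in range(len(tokens)) if j != i]
        scored = [(s, j) for s, j in scored if s > 0]  # drop pairs with no lexical overlap
        scored.sort(key=lambda x: (-x[0], x[1]))
        for _, j in scored[:top_k]:
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


def pseudo_label(sentences, pairs, teacher):
    """Attach the teacher's score to every mined pair (the silver dataset)."""
    return [(sentences[i], sentences[j], teacher(sentences[i], sentences[j])) for i, j in pairs]


def toy_teacher(a, b):
    """Hypothetical stand-in for the cross-encoder: Jaccard word overlap."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


corpus = [
    "toshkent katta shahar",
    "toshkent chiroyli shahar",
    "olma shirin meva",
    "nok shirin meva",
]
candidate_pairs = mine_candidates(corpus, top_k=1)
silver = pseudo_label(corpus, candidate_pairs, toy_teacher)
print(candidate_pairs)  # [(0, 1), (2, 3)]
```

Only 2 of the 6 possible pairs reach the teacher here; on a corpus of millions of sentences the same filtering is what keeps the number of cross-encoder calls tractable.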
3) Stage 3: The “Student” Bi-Encoder (AugSBERT-Uz)
The “student” model is an efficient bi-encoder based on the SBERT architecture.
Model Architecture: The model uses a Siamese network structure with two identical
BERTbek encoders that share all weights. Each sentence in a pair is passed through its respective
encoder independently. A mean pooling layer is applied to the output token embeddings of
BERTbek to produce a single, fixed-size sentence embedding for each sentence. This pooling
strategy has proven highly effective in previous work [32]. The resulting sentence embeddings,
$u$ and $v$, can then be compared using cosine similarity.
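Mean pooling and the cosine comparison can be sketched as follows. The token vectors are toy two-dimensional values chosen for illustration; the attention mask (1 for real tokens, 0 for padding) follows the usual Transformer convention and is an assumption of this sketch.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    return [x / max(count, 1) for x in total]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Sentence A: two real tokens plus one padding token that must be ignored.
u = mean_pool([[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]], [1, 1, 0])
# Sentence B: two real tokens.
v = mean_pool([[0.0, 2.0], [0.0, 4.0]], [1, 1])
print(u, v)                     # [2.0, 0.0] [0.0, 3.0]
print(cosine_similarity(u, v))  # 0.0
```

Because each embedding is computed once and stored, comparing a new query against a large collection only requires cosine operations on precomputed vectors.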
Training and Loss Function: The “student” model is trained on a combination of the “gold”
and “silver” datasets. We use Multiple Negatives Ranking Loss (MNRL) [33], a highly effective
contrastive loss function for training sentence embeddings. For a batch of $N$ sentence pairs
$(a_i, p_i)$, where $a_i$ is the anchor and $p_i$ is the corresponding positive sentence, the loss
for a single anchor $a_i$ is defined as:
$$L_i = -\log \frac{e^{\operatorname{sim}(a_i, p_i)/r}}{\sum_{j=1}^{N} e^{\operatorname{sim}(a_i, p_j)/r}}$$
Here, $\operatorname{sim}(u, v)$ is the cosine similarity between two embeddings, and $r$ is a
temperature hyperparameter. This loss function efficiently uses all of the other positive sentences
in the batch ($p_j$ where $j \neq i$) as in-batch negatives for the pair $(a_i, p_i)$, providing a
powerful training signal without any explicit negative mining. The total loss is the average over
all anchors in the batch.
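The loss above can be checked numerically with a small pure-Python sketch. The embeddings are toy two-dimensional vectors, and the default temperature of 0.05 is an assumption chosen here for illustration (it corresponds to a scale factor of 20 on the cosine similarity); the source does not state the value used.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnrl_loss(anchors, positives, temperature=0.05):
    """Multiple Negatives Ranking Loss averaged over all anchors in the batch.

    For anchor i, positives[i] is the target and every positives[j], j != i,
    acts as an in-batch negative.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnrl_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])     # positives match anchors
misaligned = mnrl_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # positives swapped
print(aligned < misaligned)  # True
```

When every candidate is equally similar to the anchor, the loss reduces to log N, the value for a uniform guess over the batch, which is a convenient sanity check for an implementation.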
All experiments were conducted within the computational resources of the Tashkent
University of Information Technologies (TUIT) Incubation Laboratory. Our computational
framework was hosted on a server equipped with a 32-core Intel Xeon Gold 6248R CPU, 256 GB
of system RAM, and two NVIDIA A100 40GB GPUs. The A100 GPU provides approximately
19.5 TFLOPS of peak FP32 performance, which was essential for the computationally intensive
training and distillation phases.
All models were implemented using the PyTorch deep learning framework, leveraging the
transformers and sentence-transformers libraries. All code was written in Python 3.10.
RESULTS AND ANALYSIS
This section presents the empirical evaluation of our proposed AugSBERT-Uz model. We
compare its performance against several baseline models on the primary task of Semantic Textual
Similarity (STS). Furthermore, we provide a crucial analysis of the trade-off between
computational efficiency and accuracy, which is the central motivation of our work.
4.1. PERFORMANCE ON SEMANTIC TEXTUAL SIMILARITY
The primary evaluation of the proposed framework uses the SimRelUz dataset [26], adapted for
sentence-level comparison as described in the methodology. The performance of all models is
measured using Spearman’s rank correlation coefficient (ρ) multiplied by 100, which assesses how
well the model’s similarity ranking matches the human-annotated ground truth.
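For reference, Spearman’s ρ is the Pearson correlation computed on ranks. The sketch below implements it in plain Python with average ranks for ties; the score lists are toy values, not results from this study.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


human = [0.1, 0.5, 0.9, 0.3]   # toy human-annotated scores
model = [0.2, 0.4, 0.8, 0.6]   # toy model similarity scores
rho = spearman(human, model)
print(round(rho * 100, 1))  # 80.0
```

Since only the ordering matters, the metric is insensitive to any monotonic rescaling of the model’s cosine similarities.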
The results, presented in Table 1, demonstrate the effectiveness of the approach. The proposed
model, AugSBERT-Uz, significantly outperforms all baseline bi-encoder models.
TABLE 1. PERFORMANCE ON UZBEK STS BENCHMARK (SIMRELUZ)
| Group | Model | Model type | Base model | Training data | Spearman (ρ) × 100 |
|---|---|---|---|---|---|
| Baselines (Scalable) | TF-IDF | Lexical | - | - | 35.0 |
| Baselines (Scalable) | Avg. FastText Embeddings | Static | FastText | - | 42.5 |
| Baselines (Scalable) | distiluse-base-multilingual | Bi-Encoder | DistilBERT | Zero-Shot | 61.0 |
| Baselines (Scalable) | SBERT-Gold | Bi-Encoder | BERTbek | Gold Data Only | 70.5 |
| Proposed Model | AugSBERT-Uz | Bi-Encoder | BERTbek | Gold + Silver Data | 83.2 |
| Upper Bound (Accuracy) | BERTbek (Teacher) | Cross-Encoder | BERTbek | Gold Data Only | 85.0 |
As shown in Table 1, traditional methods like TF-IDF and static FastText embeddings
perform poorly, failing to capture the semantic nuances of the language. The multilingual SBERT
model, used in a zero-shot setting, provides a respectable baseline of 61.0. Fine-tuning a
monolingual bi-encoder (SBERT-Gold) on only the small “gold” dataset yields a significant
improvement to 70.5, confirming the value of task-specific, in-language training.
However, the AugSBERT-Uz model, trained on the vastly larger “silver” dataset generated
by the “teacher”, achieves a Spearman correlation of 83.2. This result dramatically closes the
performance gap, recovering about 98% of the “teacher” model’s performance (83.2 / 85.0 ≈ 0.979)
while operating as an efficient bi-encoder.
4.2. EFFICIENCY: ACCURACY VS. COMPUTATION TRADE-OFF
The primary motivation for a bi-encoder is scalability. While the BERTbek cross-encoder
(“Teacher”) achieves the highest accuracy (85.0), its computational cost is quadratic in the number
of sentences, making it unusable for large-scale retrieval [34]. Table 2 provides a practical
comparison of the time required to find the most similar pair in a corpus of 10,000 sentences,
based on the benchmarks established in the original SBERT paper.
TABLE 2. ACCURACY VS. INFERENCE SPEED COMPARISON
| Model | Architecture | Accuracy (Spearman ρ) | Inference Time (10,000 Sentences) |
|---|---|---|---|
| BERTbek (Teacher) | Cross-Encoder | 85.0 | 65 hours |
| AugSBERT-Uz | Bi-Encoder | 83.2 | 5 seconds |
The results are unambiguous: our AugSBERT-Uz model provides a practical and scalable
solution, achieving about 98% of the upper-bound accuracy while being approximately 47,000 times
faster than the cross-encoder it learned from.
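The two headline ratios follow directly from the values in Tables 1 and 2, as the short calculation below shows; the variable names are introduced here for illustration.

```python
teacher_rho = 85.0   # cross-encoder, Spearman x 100
student_rho = 83.2   # AugSBERT-Uz, Spearman x 100
retained = student_rho / teacher_rho

cross_encoder_seconds = 65 * 3600  # 65 hours
bi_encoder_seconds = 5
speedup = cross_encoder_seconds / bi_encoder_seconds

print(round(retained * 100, 1))  # 97.9
print(speedup)                   # 46800.0
```

The retained accuracy is therefore about 98%, and the speed-up of 46,800 rounds to the reported figure of approximately 47,000.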
4.3. EFFECT OF “SILVER” DATA VOLUME
To validate the knowledge distillation process [35], we trained several “student” models
using increasing amounts of the “silver” dataset. Figure 3 illustrates the relationship between the
number of pseudo-labeled training pairs and the final model’s performance on the SimRelUz test
set.
Figure 3. Effect of Silver Dataset size on Student model performance
Figure 3 shows that the “student” model’s quality increases with the volume of
“silver” data, beginning at 70.5 ρ (with “gold” data only). Performance sharply increases with the
first 100k pairs (78.0 ρ) and continues to climb, approaching a plateau around 1-2 million pairs
(83.2 ρ). This indicates that the knowledge transfer from the “teacher” is successful,
and the bi-encoder has learned a robust representation of semantic similarity [36].
DISCUSSION
The results presented in Section 4 validate our central hypothesis: a semi-supervised
knowledge distillation framework can effectively overcome data scarcity to produce a
high-performance, scalable sentence embedding model for a low-resource, morphologically rich
language.
5.1. ANALYSIS OF MODEL PERFORMANCE
The poor performance of TF-IDF (35.0) and averaged FastText (42.5) in Table 1 confirms
that traditional lexical and static-vector methods are insufficient for capturing semantic meaning,
especially in a language with high morphological variance. The zero-shot multilingual SBERT
(61.0) provides a much stronger baseline, indicating that some cross-lingual knowledge is
transferred, but its representations are not specialized for the nuances of Uzbek.
The most critical comparison is between SBERT-Gold (70.5) and AugSBERT-Uz (83.2).
The 12.7-point increase in Spearman correlation demonstrates the profound impact of the proposed
data augmentation strategy. By training on millions of “silver” pairs generated via knowledge
distillation, the “student” model learned the complex semantic mapping of the “teacher” (85.0),
achieving a result far beyond what was possible with the small “gold” dataset alone. This confirms
that the “teacher-student” approach is a highly effective method for bridging the performance gap
between bi-encoders and cross-encoders in a low-resource setting [37].
5.2. BALANCING SCALABILITY AND ACCURACY
Table 2 highlights the practical implications of the proposed work. A model with a Spearman
correlation of 85.0 that takes 65 hours for a single large query (finding the most similar pair
among 10,000 sentences) is academically interesting but practically unusable for real-world
information retrieval, clustering, or real-time question-answering systems. The AugSBERT-Uz model,
however, delivers a Spearman correlation of 83.2 in approximately 5 seconds. This
balance makes it the first model suitable for high-performance, large-scale semantic search
applications in the Uzbek language.
5.3. LIMITATIONS AND FUTURE WORK
Despite its success, the proposed framework has several limitations. First, the performance
of the “student” (AugSBERT-Uz) is inherently capped by the performance of the “teacher”
(BERTbek Cross-Encoder). Any biases, errors, or gaps in the “teacher’s” knowledge will be
distilled into the “student” [38]. Second, the evaluation was conducted on an adapted word-level
dataset (SimRelUz). While this is the best available benchmark, the development of a dedicated,
large-scale sentence-level STS or paraphrase corpus for Uzbek is a critical next step for the
research community to enable more granular evaluation. Finally, while BERTbek’s tokenizer is
morphologically aware, extremely complex agglutinative forms can still pose challenges. Future
work could explore “morphology-aware” contrastive learning, where negative samples are
explicitly chosen based on morphological differences (e.g., distinguishing “uyda” (at home) from
“uyga” (to home)) rather than random sampling. This could further enhance the model’s
understanding of the fine-grained semantic distinctions encoded in Uzbek morphology.
CONCLUSION
This paper addressed the critical challenge of developing a high-performance, scalable
Semantic Textual Similarity model for Uzbek, a low-resource and morphologically rich language.
We introduced AugSBERT-Uz, a novel semi-supervised framework that overcomes data scarcity
by using a high-accuracy BERTbek cross-encoder as a “teacher” to generate a massive
“silver-standard” dataset from unlabeled text.
Our experiments with the proposed framework demonstrate that this knowledge distillation
approach is highly effective. The resulting “student” model, AugSBERT-Uz, achieves a Spearman
correlation of 83.2 on the SimRelUz benchmark, drastically outperforming standard baselines and
recovering about 98% of the “teacher’s” accuracy. Most importantly, it achieves this while being
approximately 47,000 times faster, enabling large-scale applications such as semantic search and
clustering for the first time in Uzbek. This research work provides the first publicly available,
high-performance sentence embedding model for the Uzbek language and presents a replicable,
scalable methodology for advancing NLP in other low-resource, agglutinative languages.
REFERENCES
1. Koch G., Zemel R., Salakhutdinov R. “Siamese neural networks for one-shot image
recognition”, in Proc. 32nd Int. Conf. on Machine Learning (ICML) - Deep Learning Workshop,
Lille, France, Jul. 6-11, 2015, vol. 2, pp. [online].
2. Chicco D. “Siamese Neural Networks: An Overview”, in Artificial Neural Networks (H.
Cartwright, Ed.), Methods in Molecular Biology, vol. 2190, Springer US, New York, NY,
USA, pp. 73-94, Doi: 10.1007/978-1-0716-0826-5_3.
3. Neculoiu P., Versteegh M., Rotaru M. “Learning Text Similarity with Siamese Recurrent
Networks”, in Proc. 1st Workshop on Representation Learning for NLP (RepL4NLP), Berlin,
Germany, Aug. 2016, pp. 148-157, Doi: 10.18653/v1/W15-1617.
4. Ranasinghe T., Orasan C., Mitkov R. “Semantic Textual Similarity with Siamese Neural
Networks”, in Proceedings of the International Conference on Recent Advances in Natural
Language Processing (RANLP 2019), Varna, Bulgaria, Sep. 2019, pp. 1004-1011.
5. Allaberdiev B., Matlatipov G., Kuriyozov E., Rakhmonov Z. “Parallel texts dataset for
Uzbek-Kazakh machine translation”, Data in Brief, vol. 53, Apr. 2024, pp. 1-11, Doi:
10.1016/j.dib.2024.110194.
6. Salaev U. “UzMorphAnalyser: A Morphological Analysis Model for the Uzbek Language
Using Inflectional Endings”, in AIP Conf. Proc., vol. 3244, no. 1, Art. No. 030058, Nov. 2024.
7. Mueller J., Thyagarajan A. “Siamese Recurrent Architectures for Learning Sentence
Similarity”, in Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI
2016), 2016, pp. 2786-2792.
8. Zhoi B. “Enhancing Text Similarity Measurement with Hybrid Siamese Neural Networks and
Lexical Features”, in AEIS, vol. 4, no. 1, pp. 140-150, 2025.