
Ontology Coverage Analysis through Language Models

Author: Abad-Navarro, Francisco; Martínez Costa, Catalina; Fernández Breis, Jesualdo Tomás
Publisher: Zenodo
DOI: 10.1016/j.engappai.2025.112671
Source: https://zenodo.org/records/17708895/files/1-s2.0-S0952197625027022-main.pdf
Contents lists available at ScienceDirect
Engineering Applications of Artificial Intelligence
journal homepage: www.elsevier.com/locate/engappai
Francisco Abad-Navarro, Catalina Martínez-Costa, Jesualdo Tomás Fernández-Breis ∗
Departamento de Informática y Sistemas, Universidad de Murcia, CEIR Campus Mare Nostrum, IMIB-Arrixaca, 30100, Murcia, Spain
ARTICLE INFO
Dataset link: https://github.com/fanavarro/ocalm, https://doi.org/10.5281/zenodo.11220686
Keywords: Ontologies, Language Models, Domain coverage, Knowledge representation, FastText, Bidirectional Encoder Representations from Transformers
ABSTRACT
Ontologies play a crucial role in supporting data and knowledge management in industry by promoting data standardization and interoperability and by facilitating knowledge sharing. However, the growing number of ontologies available in the repositories has made it challenging for developers to select the appropriate ontology for ensuring data interoperability in specific domains. To address this, we propose a method based on artificial intelligence to evaluate how well an ontology covers a particular domain. The method begins by using a text corpus, which represents the domain knowledge. It first identifies noun phrases within the text, which are then matched with the classes in the ontology under evaluation. The alignment process uses a scoring function that combines a Levenshtein-based similarity metric together with FastText and Bidirectional Encoder Representations from Transformers (BERT). These models capture the contextual meaning of both the noun phrases and the ontology classes. The method was tested across four health engineering subdomains (genetics, food, medicine, and law) by selecting domain-specific ontologies and text corpora for each. The results demonstrated that our method effectively identified the most appropriate ontology for each domain. The incorporation of language models in the method enabled it to overcome the limitations of traditional approaches, which often depend on exact string matches. Our method has proven to be an effective tool for assessing how well ontologies cover specific domains, thereby supporting the identification and selection of the most suitable ontologies for intelligent engineering applications.
1. Introduction
Ontologies are one of the pillars of Knowledge Graphs, providing the formal meaning of the entities and properties used for the description of domains' knowledge (Elnagar et al., 2020; Hogan et al., 2021). An ontology is a formal, explicit specification of a shared conceptualization (Studer et al., 1998). In other words, ontologies define concepts of a particular domain by using a formal language that facilitates the data understanding by automated agents, and these definitions must be agreed and shared by the community. In this context, the ontology community promotes the open access to the knowledge and the collaborative development of ontologies, focusing on reusing what has already been done (Katsumi and Grüninger, 2016). Here, the W3C consortium recommended the Web Ontology Language version 2 (OWL2) as a standard language for defining ontologies, and it has been adopted by most ontology developers. These facts are in line with the Findable, Accessible, Interoperable, Reusable (FAIR) principles, which put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals (Wilkinson et al., 2016).
As a consequence, the use of ontologies and knowledge graphs has increased for a variety of purposes in health engineering, such as data harmonization frameworks for secondary data reuse (Abad-Navarro and Martínez-Costa, 2024), which is the analysis of existing data collected by others (Donnellan and Lucas, 2013); or event detection and classification in biomedicine (Alrefaie et al., 2024). In addition to this, practical applications can also be found in bridge health monitoring (Ndinga Okina et al., 2023) or software engineering (Bhushan et al., 2024). They have also been applied regularly to data integration (Mulero-Hernández et al., 2024; Gutiérrez et al., 2024).
∗ Corresponding author.
The development of intelligent applications in engineering fields requires a careful selection of the ontologies to be included. This is a difficult task given the number of ontologies available. For example, at the time of writing, the developers of health applications have more than one thousand semantic resources available in BioPortal (Whetzel et al., 2011). Recently, the repository of ontologies for industry (IndustryPortal) (Amdouni et al., 2023) was launched, and it already contains more than one hundred ontologies.
Typically, ontology selection is based on how an ontology covers the domain in which it will be used. Here, we define domain coverage as the degree to which the entities and definitions included in the ontology represent the corresponding domain, the most appropriate ontology being the one that contains the concepts, relationships, and definitions necessary to represent the data of the given domain. However, the perfect ontology for a given purpose rarely exists, so it is necessary to identify suboptimal ontologies that describe part of the given domain in order to extend them by adding new ontology entities.
https://doi.org/10.1016/j.engappai.2025.112671
Received 18 September 2024; Received in revised form 16 April 2025; Accepted 6 October 2025. Available online 10 October 2025.
F. Abad-Navarro et al., Engineering Applications of Artificial Intelligence 162 (2025) 112671
0952-1976/© 2025 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
The research question of our work is whether an embedding-based approach can help to measure ontology coverage in a domain given by text corpora. To this end, we propose a method, called OCALM (Ontology Coverage Analysis through Language Models), to measure the coverage of an ontology within a specific domain. This approach leverages a corpus of natural language documents, as text serves as the primary knowledge source in engineering domains. OCALM searches for the matches between the noun phrases detected in the text and the ontology classes. Our hypothesis is that ontologies and text corpora covering the same domain will have the highest scores for their (noun phrase, ontology class) matches. A noun phrase consists of a noun or pronoun, which is called the head, and any dependent words before or after the head (Peters, 2013). Each noun phrase is matched with an ontology class by maximizing a score function that takes into account their lexical similarity and their semantic similarity. On the one hand, lexical similarity accounts for the similarity of the words composing the noun phrase and the annotations of the ontology classes. Annotations are information attached to ontology entities, usually in natural language text, to improve their readability. On the other hand, the semantic similarity takes into account the nearest neighbors of the noun phrase and the class in their respective contexts in order to use them to obtain their similarity in a general language model. We tested OCALM across four domains, namely food, genetics, law and medicine, for which the method has been proven effective. Therefore, we believe that OCALM is a valuable tool for guiding the selection of the most appropriate ontology for a given domain.
The main contributions of this work are:
• A novel method for supporting the decision of which ontology to use based on the assessment of domain coverage
• A text to ontology alignment method, which relies on a score function that considers lexical and semantic similarity and allows non-exact matches
• The first use of language models for computing semantic similarities with the purpose of coverage assessment
The article is organized as follows: Section 2 describes the state of the art in domain coverage measurement; Section 3 explains the method proposed and the experimental approach; Section 4 depicts the results obtained for the experiment, including an ablation study and a comparison with the state of the art; Section 5 discusses our findings and proposes future work; and finally, the conclusions are presented in Section 6.
2. State of the art
In ontology engineering, the relevant characteristics of ontologies have traditionally been measured by quantitative metrics, which provide objective measurements that help to profile ontology characteristics. The survey published by Wilson et al. (2021) provides a bibliography review on this aspect. Methods for evaluating the behavior of a set of metrics over a set of ontologies have also been proposed (Bernabé-Díaz et al., 2022). One of the perspectives from which an ontology can be evaluated is estimating the degree to which it provides a good coverage of certain domain knowledge. Domain coverage is defined by Wilson et al. (2023) as 'the degree to which an ontology covers the axioms which have been specified (i.e., requirement specifications, standard ontologies, standard corpus) with respect to the domain knowledge that the ontology was developed to represent'. Traditionally, domain experts were responsible for assessing the domain coverage of an ontology through expert judgment. Nevertheless, the development of automated methods for assessing it would reduce the effort of domain experts and permit a systematic evaluation of the domain coverage of ontologies.
In Zhu et al. (2017), the metric 'vocabulary coverage' is used to measure the content of ontologies. This metric is calculated by comparing the ontology being evaluated with another one that acts as gold standard. This is in line with the precision metric proposed in McDaniel et al. (2018), which also compares ontologies with a gold standard dictionary. Another example is the OQuaRE framework (Duque-Ramos et al., 2011), which proposed different metrics linked to the characteristics proposed in its quality model, such as the number of properties per class to measure the functional adequacy, and the HURON framework (Abad-Navarro et al., 2023), which proposed a set of metrics, such as names per class or descriptions per class, to capture the readability of ontologies.
Most of the solutions proposed to automatically measure ontology coverage are based on recognizing ontology entities in free text. For instance, BioPortal (Whetzel et al., 2011) and IndustryPortal (Amdouni et al., 2023) are built over OntoPortal (Yang, 2009), which provides a recommendation system (Martínez-Romero et al., 2017) to select the most suitable ontology in their repository given a free text. This system takes into account ontology coverage by identifying ontology entities in the input text through a Named Entity Recognition (NER) step (Jonquet et al., 2009), which uses the ontologies stored in the system as a dictionary. Nonetheless, most of the NER systems rely on exact matching between free text and ontology class annotations, which could lead to a poor recall.
In addition to dictionary-based approaches, the development of artificial intelligence techniques applied to natural language, such as embedding algorithms (Mikolov et al., 2013a; Bojanowski et al., 2017a; Joulin et al., 2017; Kenton and Toutanova, 2019) or Large Language Models (LLM) (Achiam et al., 2023; Touvron et al., 2023; Jiang et al., 2023; Qiu et al., 2023), is leading to new paradigms in ontology engineering. In this line, there are several works that use LLMs for ontology learning and generation (Babaei Giglou et al., 2023; Saeedizade and Blomqvist, 2024; Behr et al., 2023; Val-Calvo et al., 2025); however, their application to ontology evaluation needs to be further investigated. For example, in Tsaneva et al. (2024) GPT-4 was used for verifying ontology restrictions. Moving into the domain coverage, Zaitoun et al. (2023) proposed a combination of NER and LLM methods trained over a corpus of documents, which were considered as an authoritative source of truth, and whose results are compared to the input ontology. In that work, the coverage is mainly based on the NER, but it is complemented by other metrics supported by the LLM, such as child to parent similarity.
The present work deepens the use of LLM methods to create a metric to compute the domain coverage of a given ontology, where the domain is given as a natural language text corpus. In contrast to traditional NER systems, which are typically based on exact matches, the use of LLM techniques can enhance the measurement of domain coverage in ontologies by considering not only exact matches but also related concepts between free text and ontology entities.
3. Materials and methods
This section explains the OCALM methodology for measuring the extent to which an ontology covers a domain. The input of OCALM is a corpus of natural language text, written in English, representing the domain knowledge; and the OWL ontology we want to evaluate. OCALM matches each noun phrase identified in the natural language text to an ontology class by using its annotations. A noun phrase is a word or group of words containing a noun and functioning in a sentence as subject, object, or prepositional object, such as 'Sport drinks' or 'coronary artery diseases'; whereas a class annotation is information attached to an ontology class that is primarily intended to be informative for humans, including names, synonyms, or descriptions, among others. This matching is based on maximizing a score function that takes into account the lexical and the semantic similarity between the noun phrase and the ontology class. OCALM returns the best ontology class match for each noun phrase, together with the score achieved. Fig. 1 depicts the overview of the method, which consists of the following steps:
1. Normalization of the natural language input text.
2. Generation of a vector space model ($M_t$) from the normalized natural language input text.
3. Normalization of the input ontology.
4. Generation of a vector space model ($M_o$) from the normalized ontology.
5. Application of the score function to obtain the best ontology class match for each noun phrase identified in the normalized natural language input text.
The following sections describe each step of the method in detail.

Table 1
Example of noun phrases detected in the text 'Diabetes mellitus is another endocrine system disease that affects many people in the United States' and their normalized form.

| Noun phrase | Normalized noun phrase |
|---|---|
| Diabetes mellitus | Diabetes mellitus |
| Another endocrine system disease | Endocrine system disease |
| Many people | People |
| The United States | United States |

3.1. Text normalization
Text normalization consists of replacing each noun phrase in the text by its canonical form. The purpose of this step is to identify different lexical forms of the same concept in order to transform them into the same lexical form. This simplifies further steps by avoiding dealing with different variations of noun phrases that actually refer to the same concept.
The first step of text normalization is to identify the noun phrases that appear in the text. This is achieved by applying a standard pipeline of Natural Language Processing (NLP) consisting of a tokenizer, a part of speech tagger, a dependency parser, a lemmatizer, and a named entity recognizer. In this work, we have used the NLP pipeline offered by SpaCy (Montani et al., 2023), an open source NLP library for Python, and the model 'en_core_web_sm', which contains the definition of the pipeline and its components. This model was released under an MIT license, and was trained using texts written in English found on the Web (Explosion, 2023). After applying this pipeline to natural language text, the noun phrases are annotated, thus becoming directly available.
Once the noun phrases are identified within the input text, they are transformed into their canonical forms by applying the following rules:
• Remove language dependent stop words, such as 'another', 'otherwise', 'what', etc.
• Remove special characters, such as quotes or backslashes.
• Replace carriage returns by blank spaces.
• Replace plural nouns by their lemma.
• Convert to lowercase.
Noun phrases are stored in a list together with their canonical form. Those referring to proper nouns or single numbers are removed from the list. Then, noun phrases in the list are replaced by their canonical forms in the input text, conforming the normalized text. Finally, the output of the text normalization step is the normalized text together with the list of the identified noun phrases, which will be used in further steps.
For example, the input text 'Diabetes mellitus is another endocrine system disease that affects many people in the United States' results in the list of noun phrases depicted in Table 1. The normalized text is the result of replacing the noun phrases by their normalized form in the original text, resulting in 'diabetes mellitus is endocrine system disease that affects people in united states'.
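The canonicalization rules can be illustrated with a minimal sketch. It is not the OCALM implementation: the noun phrases are assumed to be already detected (the paper obtains them with SpaCy), and the stop-word set `STOP_WORDS`, the lemma table `LEMMAS`, and the function names are toy assumptions standing in for SpaCy's resources.

```python
# Toy resources, assumed for illustration only; the paper relies on SpaCy's
# English stop-word list and lemmatizer instead.
STOP_WORDS = {"the", "a", "an", "another", "otherwise", "what", "many"}
LEMMAS = {"diseases": "disease", "drinks": "drink"}

# Characters treated as "special": double quote, single quote, backtick, backslash.
SPECIAL_CHARS = str.maketrans("", "", "\"'`\\")


def normalize_phrase(phrase: str) -> str:
    """Apply the canonicalization rules to one noun phrase."""
    text = phrase.replace("\r", " ").replace("\n", " ")  # carriage returns -> spaces
    text = text.translate(SPECIAL_CHARS)                 # remove special characters
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(LEMMAS.get(t, t) for t in tokens)    # plural -> lemma (toy table)


def normalize_text(text: str, noun_phrases: list[str]) -> str:
    """Replace each detected noun phrase in the text by its canonical form."""
    # Longest phrases first, so that a long phrase is replaced before any
    # shorter phrase it may contain.
    for phrase in sorted(noun_phrases, key=len, reverse=True):
        text = text.replace(phrase, normalize_phrase(phrase))
    return text


sentence = ("Diabetes mellitus is another endocrine system disease "
            "that affects many people in the United States")
phrases = ["Diabetes mellitus", "another endocrine system disease",
           "many people", "the United States"]
print(normalize_text(sentence, phrases))
```

With these toy resources the sketch reproduces the normalized sentence of the example above; filtering out proper nouns and single numbers from the phrase list is left out for brevity.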
3.2. Vector space model creation for the natural language text corpus
The vector space model creation for the natural language text takes as input the normalized text and the set of noun phrases identified in the text normalization step (see Section 3.1). The method generates a vector space model $M_t$ and returns an embedding for each noun phrase. For this purpose, we applied the fastText model (Bojanowski et al., 2017b) to the normalized natural language text with the following parameters:
• Vector size = 100 (number of dimensions of the word vectors).
• Window = 5 (the maximum distance between the current and predicted word within a sentence).
• Min count = 1 (the model ignores all words with total frequency lower than this).
• Negative = 5 (how many 'noise words' should be drawn).
• Iter = 10 (number of epochs over the corpus).
• Seed = 1 (seed for random number generator, for reproducibility).
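As an illustration, the parameters above could be passed to a fastText implementation as follows. The sketch assumes the gensim library (version 4 or later), whose keyword names `vector_size` and `epochs` are taken here to correspond to 'Vector size' and 'Iter'; the text does not state which implementation was used, and `train_fasttext` is a hypothetical helper.

```python
# Keyword names follow gensim 4.x; mapping the listed parameters onto them
# is an assumption of this sketch.
FASTTEXT_PARAMS = {
    "vector_size": 100,  # Vector size
    "window": 5,         # Window
    "min_count": 1,      # Min count
    "negative": 5,       # Negative
    "epochs": 10,        # Iter
    "seed": 1,           # Seed
}


def train_fasttext(sentences):
    """Train a fastText model on tokenized sentences (a list of token lists)."""
    # Imported lazily so that the parameter table can be inspected without gensim.
    from gensim.models import FastText
    return FastText(sentences=sentences, **FASTTEXT_PARAMS)
```

A call such as `train_fasttext([["diabetes", "mellitus", "is", "endocrine", "system", "disease"]])` would return a model whose `wv` attribute holds the word vectors.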
The fastText model (Bojanowski et al., 2017b) is derived from the skipgram model with negative sampling introduced in Mikolov et al. (2013b). The goal of the skipgram model is to learn a vectorial representation for each word $w$ belonging to a vocabulary of size $W$ by predicting words appearing in the context of a given word. For this, the model uses a scoring function to give a score to pairs of (word, context), where the context is a set of words surrounding the given word. Finally, the skipgram model computes the score function as the scalar product between word and context vectors. Nonetheless, this model provides a distinct vector representation for each word, ignoring the internal structure of words. Here, fastText proposed to represent each word as a bag of n-grams that also includes the word itself and to modify the score function to take into account the internal structure of words. In particular, given a dictionary of n-grams of size $G$ and a word $w$, $\mathcal{G}_w \subset \{1, \dots, G\}$ is the set of n-grams appearing in $w$; then, fastText associates a vector representation $\mathbf{z}_g$ to each n-gram $g$. Finally, fastText represents a word by the sum of the vector representations of its n-grams, obtaining the scoring function defined in Eq. (1), where $\mathbf{v}_c$ is the vector of the context word $c$.

$$s(w, c) = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g^{\top} \mathbf{v}_c \tag{1}$$

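Eq. (1) can be made concrete with a small sketch. It assumes the usual fastText conventions (boundary symbols '<' and '>', n-grams of 3 to 6 characters, plus the whole word); the vectors are plain Python lists supplied by the caller, and the function names are illustrative.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' plus the whole padded word."""
    padded = f"<{word}>"
    grams = {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }
    grams.add(padded)  # the word itself is also part of the bag
    return grams


def score(word, context_vec, ngram_vecs):
    """s(w, c): sum over the n-grams g of w of the dot product z_g . v_c."""
    return sum(
        sum(z * v for z, v in zip(ngram_vecs[g], context_vec))
        for g in char_ngrams(word)
        if g in ngram_vecs  # n-grams without a vector contribute nothing
    )


# Toy vectors: every n-gram of 'cat' gets the same 2-dimensional vector.
toy_vectors = {g: [1.0, 0.0] for g in char_ngrams("cat")}
print(sorted(char_ngrams("where", 3, 3)))
print(score("cat", [2.0, 5.0], toy_vectors))
```

Because a word vector is a sum over shared n-grams, words that never occurred in the corpus still receive a representation, which is what makes embeddings for unseen or multi-word forms obtainable.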
Then, we used the fastText model to assign an embedding to each word in the normalized text. It takes into account not only the words appearing in the text but also the n-grams conforming these words. This facilitates obtaining vectors for multi-word phrases, which is needed to detect the embeddings associated with our noun phrases. Finally, we filter the vector space to keep only the noun phrases detected in the text normalization step. Therefore, the output of this step is a vector space model $M_t$ containing an embedding for each noun phrase of the normalized text.
3.3. Ontology normalization
The ontology normalization step takes an OWL ontology as input and returns its normalized version. This is similar to the text normalization, shown in Section 3.1, as it also consists of identifying and normalizing noun phrases. Here, as the input is an OWL ontology, normalization is applied to the text of each ontology class annotation, which includes labels, synonyms, descriptions, or comments, among others, for each ontology class.

Fig. 1. Overview of the main steps of OCALM. Both the natural language text corpus and the ontology are normalized in order to generate a vector space model for each one. The score function uses both vector spaces together with a general purpose language model, and the lexical forms of the noun phrases from the text and the annotations of ontology classes to get similarity. The score function is used to get the best ontology class match for each noun phrase detected in the text.

Table 2
Example of noun phrases detected in the class GO:0019012 (virion) from Gene Ontology and their normalized form.

| Noun phrase | Normalized noun phrase |
|---|---|
| Wikipedia | Wikipedia |
| Virus | Virus |
| GO:0019012 | Go:0019012 |
| Complete virus particle | Complete virus particle |
| The complete fully infectious extracellular virus particle | Complete fully infectious extracellular virus particle |

Noun phrases from ontology class annotations are obtained and normalized according to the already described rules (see Section 3.1). Similarly to the text normalization step, the identified noun phrases are also stored in a list. Then, the ontology is normalized by replacing the detected noun phrases in the class annotations by their normalized form. Finally, the output of this step is the normalized ontology together with the noun phrases list.
For example, the class GO:0019012 (virion), from Gene Ontology, contained the annotations depicted in Fig. 2(a). Table 2 shows the noun phrases detected in the annotations of the class, together with their normalized forms. Then, the class is normalized by replacing the original noun phrases by their normalized ones, resulting in the class shown in Fig. 2(b).

Fig. 2. Example of the normalization process of the class GO:0019012 (virion).

3.4. Vector space model creation for the ontology
The vector space model creation for the ontology takes the normalized ontology and the noun phrases identified in Section 3.3 as input and returns a vector space model $M_o$ with a set of embeddings representing each noun phrase.
First, the normalized ontology is used as input of the OWL2Vec* embedding algorithm (Chen et al., 2021), which generates embeddings for OWL ontologies. OWL2Vec* generates text documents from the input ontology to apply a text embedding algorithm to them. OWL2Vec* originally applies the Word2Vec algorithm (Mikolov et al., 2013a) to the generated text documents to obtain the ontology embeddings; however, we modified this behavior to use fastText instead, as we used it in the creation of the vector space model for the natural language text, as described in Section 3.2. We used the same fastText parameters as in Section 3.2, whereas the parameters used for OWL2Vec* are shown in Table 3. Since our method is applicable to any ontology, the parameters were selected to take into account most of the information stored in the ontology while preserving good performance.
In particular, with the selected parameters, OWL2Vec* first transforms the input OWL ontology into an RDF graph, which is called the 'projected ontology'. This projection process takes into account all the relationships stated in the ontology and simplifies some OWL constructs, preventing the appearance of blank nodes derived from complex class expressions. The projected ontology also includes inverse relationships for rdf:type or rdfs:subClassOf to enable bidirectional walks between entities linked through these relationships. An OWL reasoner could be used to infer new RDF triples from the stated ones; however, this reasoning was disabled for performance reasons. Then, OWL2Vec* performs random walks to traverse the RDF graph containing the projected ontology, where the length of the random walks was set to 3 steps for performance reasons. During these walks, OWL2Vec* generates three text documents by taking the information of the ontology entities that are traversed. On the one hand, the structure document contains sentences resulting from printing the IRIs of the entities being traversed, aiming at capturing both the graph structure and the logical constructors of the ontology. On the other hand, the lexical document contains sentences resulting from printing the annotations (e.g. labels and synonyms) corresponding to the entities. Finally, the combined document contains sentences that mix entity IRIs and annotations with the aim of capturing the correlation between entity IRIs and the lexical information. Then, these three documents are merged and used as input for the fastText algorithm, previously commented in Section 3.2, to generate a vector space for the tokens contained in them.
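The walk-based document generation can be sketched on a toy graph. This is a strong simplification and not the OWL2Vec* code: the graph is a plain adjacency dictionary, predicates are omitted from the sentences, the combined document is not built, and the IRIs, labels, and the `walks_per_node` parameter are hypothetical.

```python
import random


def random_walks(graph, walk_depth=3, walks_per_node=2, seed=1):
    """Yield random walks (lists of node IRIs) of at most `walk_depth` hops."""
    rng = random.Random(seed)
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_depth):
                neighbours = graph.get(walk[-1], [])
                if not neighbours:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbours))
            yield walk


def to_sentences(walks, labels):
    """Build structure (IRI) and lexical (label) sentences from the walks."""
    walks = list(walks)
    structure = [" ".join(walk) for walk in walks]
    lexical = [" ".join(labels.get(node, node) for node in walk) for walk in walks]
    return structure, lexical


# Toy projected ontology; the inverse edge allows bidirectional walks.
toy_graph = {
    "ex:Virion": ["ex:CellularComponent"],
    "ex:CellularComponent": ["ex:Virion"],
}
toy_labels = {"ex:Virion": "virion", "ex:CellularComponent": "cellular component"}

structure_doc, lexical_doc = to_sentences(random_walks(toy_graph), toy_labels)
print(structure_doc[0])
print(lexical_doc[0])
```

The resulting sentence lists play the role of the structure and lexical documents; in the method they would be merged and passed to the embedding algorithm.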
After OWL2Vec* has generated the vector space for the normalized ontology, we keep only the embeddings that refer to the noun phrases identified in the ontology normalization step (see Section 3.3). Thus, the outcome of this phase is a vector space model $M_o$ containing the noun phrases identified in the normalized ontology.
3.5. Score function
A score function is used to measure the degree of relation between a noun phrase identified in the natural language text and an ontology class. The score function returns a matching score for two input strings, one for the noun phrase, and one for the ontology class. Thus, to obtain the matching score between them, the noun phrase is compared to the annotations of the ontology class, which may include labels or synonyms, selecting the one that provides the highest score. In this way, each noun phrase from the natural language text is compared to each ontology class, assigning the class that maximizes the score function to the noun phrase.
The score function is the result of a weighted mean between the lexical (lexSim) and the semantic similarity (semSim) of the compared input strings, as described in Eq. (2), where $a$ and $b$ are the input strings, extracted from the natural language text and from the ontology, respectively; $\alpha$ and $\beta$ are the weights given to the lexical and the semantic similarity, respectively; and $M_t$ and $M_o$ are the vector space models generated from the natural language text and the ontology, respectively (see Sections 3.2 and 3.4), whereas $M_g$ is a general vector space model, used for computing the semantic similarity. The details of these similarities are described in the next sections.

$$score(a, b) = \frac{\alpha \cdot lexSim(a, b) + \beta \cdot semSim(a, b, M_t, M_o, M_g)}{\alpha + \beta} \tag{2}$$

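The weighted mean of Eq. (2) and the selection of the best class can be sketched as follows. The similarity functions are passed in as callables, the default weights of 1.0 are an assumption of this sketch rather than values taken from the method, and `best_match` is an illustrative name.

```python
def score(lex_sim, sem_sim, alpha=1.0, beta=1.0):
    """Weighted mean of the lexical and semantic similarity (Eq. 2)."""
    if alpha + beta == 0:
        raise ValueError("alpha + beta must be non-zero")
    return (alpha * lex_sim + beta * sem_sim) / (alpha + beta)


def best_match(noun_phrase, classes, lex, sem, alpha=1.0, beta=1.0):
    """Return the (class id, score) pair maximizing the score over all annotations."""
    candidates = (
        (class_id, score(lex(noun_phrase, ann), sem(noun_phrase, ann), alpha, beta))
        for class_id, annotations in classes.items()
        for ann in annotations
    )
    return max(candidates, key=lambda pair: pair[1])


# Toy similarity functions, for illustration only.
toy_lex = lambda a, b: 1.0 if a == b else 0.0
toy_sem = lambda a, b: 0.5
toy_classes = {"C1": ["virus"], "C2": ["diabetes mellitus", "dm"]}
print(best_match("diabetes mellitus", toy_classes, toy_lex, toy_sem))
```

Taking the maximum over every annotation of every class mirrors the description above: a class is represented by whichever of its labels or synonyms matches the noun phrase best.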
3.5.1. Lexical similarity
The lexical similarity compares two text strings, $a$ and $b$, by considering only their lexical forms. For this, we compare the tokens of both strings, generating pairs of tokens maximizing their Levenshtein similarity (Levenshtein, 1966). Changes in the order or the number of tokens between both penalize the score. Next, we explain how the lexical similarity is computed with examples. An overview of how the lexical similarity is calculated is shown in Fig. 3.
First, strings $a$ and $b$ are divided into tokens by splitting them using the blank space character as separator. Then, the Levenshtein similarity between each token pair between $a$ and $b$ is calculated, obtaining a matrix. For example, the comparison between 'melitus diabetis' and 'diabetes mellitus' results in the matrix depicted in Table 4. Table 5 shows another example between the strings 'diabetes type I' and 'diabetes mellitus'.
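The first step can be sketched in a few lines. The exact normalization of the Levenshtein similarity is not given in the text; the sketch assumes an insertion/deletion-based ratio, 2·LCS(s, t) / (|s| + |t|), where LCS is the length of the longest common subsequence. This assumption reproduces the values of Table 4, but it remains an assumption, and the function names are illustrative.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0]
        for j, other in enumerate(t, start=1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def token_similarity(s, t):
    """Normalized similarity in [0, 1]: 2 * LCS / (len(s) + len(t))."""
    total = len(s) + len(t)
    return 2 * lcs_length(s, t) / total if total else 1.0


def similarity_matrix(a, b):
    """Rows are the tokens of a, columns are the tokens of b."""
    return [[token_similarity(x, y) for y in b.split()] for x in a.split()]


for row in similarity_matrix("melitus diabetis", "diabetes mellitus"):
    print([round(value, 2) for value in row])
```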

Table 3
OWL2Vec* parameters used for building the ontology vector space.

| Parameter | Parameter explanation | Value |
|---|---|---|
| Ontology projection | Use or not use the projected ontology | Yes |
| Projection only taxonomy | Projection of only the taxonomy of the ontology without other relationships | No |
| Multiple labels | Using or not multiple labels/synonyms for the literal/mixed sentences | Yes |
| Avoid owl construct | Skip OWL constructs like rdfs:subclassof in the document | No |
| Axiom reasoner | Reasoner to use for inferring axioms | None |
| Walker | Algorithm to traverse the ontology | Random |
| Walk depth | Number of hops when traversing the ontology | 3 |
| URI Doc | Create a document with entities IRIs when traversing the ontology | Yes |
| Lit Doc | Create a document with entities annotations when traversing the ontology | Yes |
| Mix Doc | Create a document mixing entities IRIs and annotations when traversing the ontology | Yes |
| Mix Type | The type for generating the mixture document - all or random | All |

Fig. 3. Overview of the lexical similarity computation. The input strings to be compared are tokenized, and a Levenshtein similarity matrix is computed by generating the Levenshtein similarity for each token pair between the input strings. Token pairs that maximize the Levenshtein similarity between the input strings are obtained. This similarity is used, in addition to an order factor that penalizes changes in the order of the tokens between the input strings, to compute the lexical similarity.

Table 4
Example of Levenshtein similarity matrix for the strings 'melitus diabetis' and 'diabetes mellitus'.

| | Diabetes | Mellitus |
|---|---|---|
| Melitus | 0.40 | 0.93 |
| Diabetis | 0.88 | 0.38 |

Table 5
Example of Levenshtein similarity matrix for the strings 'diabetes type I' and 'diabetes mellitus'.

| | Diabetes | Mellitus |
|---|---|---|
| Diabetes | 1 | 0.38 |
| Type | 0.17 | 0.33 |
| I | 0 | 0 |

Table 6
Example of token pair assignment matrix for the strings 'melitus diabetis' and 'diabetes mellitus'.

| | Diabetes | Mellitus |
|---|---|---|
| Melitus | 0 | 1 |
| Diabetis | 1 | 0 |

Table 7
Example of token pair assignment matrix for the strings 'diabetes type I' and 'diabetes mellitus'.

| | Diabetes | Mellitus |
|---|---|---|
| Diabetes | 1 | 0 |
| Type | 0 | 1 |

The second step is to match the tokens of $a$ to the tokens of $b$, obtaining token pairs that maximize the Levenshtein similarity. This is a linear sum assignment problem, where tokens from the first string are assigned to the tokens of the second string maximizing the Levenshtein similarity, giving a token pair assignment matrix, where 1 means that the concerning tokens are selected as a pair. Table 6 shows the token assignment matrix obtained for the strings 'melitus diabetis' and 'diabetes mellitus', whose Levenshtein similarity matrix is described in Table 4, which gives the following token pairs and the corresponding maximized Levenshtein similarity:
• Levenshtein similarity (melitus, mellitus) = 0.93
• Levenshtein similarity (diabetis, diabetes) = 0.88
• Maximized Levenshtein similarity = 0.93 + 0.88 = 1.81
For its part, the strings 'diabetes type I' and 'diabetes mellitus', whose Levenshtein similarity matrix is shown in Table 5, result in the following token pairs with the corresponding maximized Levenshtein similarity, derived from the token assignment matrix depicted in Table 7. Note that here, the token 'I' remains unassigned because the compared strings have different lengths:
• Levenshtein similarity (diabetes, diabetes) = 1
• Levenshtein similarity (type, mellitus) = 0.33
• Maximized Levenshtein similarity = 1 + 0.33 = 1.33
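The assignment step can be checked with a brute-force sketch that enumerates every one-to-one pairing of tokens; this is only practical for short phrases and is not necessarily how the method solves the problem. The input matrices are the values of Tables 4 and 5, and `best_assignment` is an illustrative name.

```python
from itertools import permutations


def best_assignment(sim):
    """Return (pairs, total) maximizing the summed similarity.

    `sim` is a rows-by-columns matrix; `pairs` is a sorted list of
    (row, column) index pairs with each row and column used at most once.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    if n_rows <= n_cols:
        # Each row picks a distinct column.
        candidates = (
            [(r, c) for r, c in enumerate(p)]
            for p in permutations(range(n_cols), n_rows)
        )
    else:
        # More rows than columns: each column picks a distinct row.
        candidates = (
            [(r, c) for c, r in enumerate(p)]
            for p in permutations(range(n_rows), n_cols)
        )
    best = max(candidates, key=lambda pairs: sum(sim[r][c] for r, c in pairs))
    return sorted(best), sum(sim[r][c] for r, c in best)


table_4 = [[0.40, 0.93], [0.88, 0.38]]
table_5 = [[1, 0.38], [0.17, 0.33], [0, 0]]
print(best_assignment(table_4))
print(best_assignment(table_5))
```

For longer inputs, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` with `maximize=True` would be the natural replacement for the enumeration.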
The third step is to calculate the order factor, which penalizes the scores of those strings that have a different order in their tokens. This factor is calculated according to Eq. (3). It returns a number ranging from 0 (maximum penalization) to 1 (no penalization).

$$orderFactor(a, b) = 1 - \frac{numberOfUnsortedTokens(a, b)}{\max(length(a), length(b))} \cdot K \tag{3}$$

where $K$ is a configurable parameter ranging from 0 to 1 denoting the weight of the order penalization, $length$ is the number of tokens of a string, and $numberOfUnsortedTokens$ is the number of order changes of consecutive tokens of the string $a$ with respect to the string $b$. The value of $numberOfUnsortedTokens$ is calculated from the token pair assignment matrix by counting how many consecutive tokens do not have a diagonal of value 1. For example, the token pair assignment matrix for the strings 'melitus diabetis' and 'diabetes mellitus' (Table 6) has only two consecutive tokens for each input string, so we check that these two tokens do not form a diagonal of ones, thus having a value of 1 for the $numberOfUnsortedTokens$ variable. Conversely, the token pair assignment matrix for the strings 'diabetes type I' and 'diabetes mellitus' (Table 7) has two consecutive tokens forming a diagonal matrix of ones, thus having a value of 0 for the $numberOfUnsortedTokens$ variable. Thus, the order factor value for the strings 'melitus diabetis' and 'diabetes mellitus' is $1 - \frac{1}{\max(2,2)} \cdot K = 1 - 0.5K$, where the maximum penalization is given if $K = 1$ and no penalization is given if $K = 0$. On the other hand, the order factor value for the strings 'diabetes type I' and 'diabetes mellitus' is $1 - \frac{0}{\max(3,2)} \cdot K = 1$. In this case, the order of the tokens matches, thus giving a value of 1 (no penalization) to the order factor independently of the $K$ parameter.
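Eq. (3) can be sketched from the assignment pairs. The counting rule below is one possible reading of 'consecutive tokens that do not form a diagonal of ones': two adjacent tokens of $a$ count as unsorted when their partners in $b$ are not adjacent in the same order. Both this reading and the function names are assumptions of the sketch.

```python
def unsorted_tokens(pairs):
    """Count adjacent tokens of a whose partners in b do not follow each other."""
    ordered = sorted(pairs)
    return sum(
        1
        for (r1, c1), (r2, c2) in zip(ordered, ordered[1:])
        if r2 == r1 + 1 and c2 != c1 + 1
    )


def order_factor(pairs, n_tokens_a, n_tokens_b, k):
    """Eq. (3): 1 - unsorted / max(len(a), len(b)) * K, with K in [0, 1]."""
    return 1 - unsorted_tokens(pairs) / max(n_tokens_a, n_tokens_b) * k


# (row, column) pairs read from Table 6 and Table 7.
print(order_factor([(0, 1), (1, 0)], 2, 2, k=1.0))  # 'melitus diabetis' case
print(order_factor([(0, 0), (1, 1)], 3, 2, k=1.0))  # 'diabetes type I' case
```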
Finally, the lexical similarity is given by Eq. (4), where $\mathit{maxLevenshtein}(a, b)$ is the maximized Levenshtein similarity found for the strings $a$ and $b$. Therefore, the lexical similarity between ‘melitus diabetis’ and ‘diabetes mellitus’ is given by $\frac{1.81}{\max(2,2)} \cdot (1 - 0.5K) = 0.905 \cdot (1 - 0.5K)$. Note that, in this case, the lexical similarity depends on $K$: higher values of $K$ lower the lexical similarity. On the other hand, the lexical similarity of ‘diabetes type I’ and ‘diabetes mellitus’ is given by $\frac{1.33}{\max(3,2)} \cdot 1 = 0.443$.
$$\mathit{lexSim}(a, b) = \frac{\mathit{maxLevenshtein}(a, b)}{\max(\mathit{length}(a), \mathit{length}(b))} \cdot \mathit{orderFactor}(a, b) \qquad (4)$$
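As a numeric check of Eq. (4), the following sketch plugs in the values of the two worked examples. The function name and its signature are hypothetical; the maximized Levenshtein similarity and the number of unsorted tokens are passed in directly rather than computed, and $K = 0.25$ is just one possible setting.

```python
def lex_sim(max_levenshtein: float, len_a: int, len_b: int,
            unsorted_tokens: int, k: float) -> float:
    """Eq. (4): normalized maximized Levenshtein similarity times the order factor."""
    longest = max(len_a, len_b)
    order = 1.0 - unsorted_tokens / longest * k  # Eq. (3)
    return max_levenshtein / longest * order


# 'melitus diabetis' vs 'diabetes mellitus': 0.905 * (1 - 0.5 * K)
print(round(lex_sim(1.81, 2, 2, unsorted_tokens=1, k=0.25), 3))
# 'diabetes type I' vs 'diabetes mellitus': 1.33 / 3, independent of K
print(round(lex_sim(1.33, 3, 2, unsorted_tokens=0, k=0.25), 3))
```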
3.5.2. Semantic similarity
This method measures the similarity between noun phrases from the corpus and the annotations associated with the classes of the ontology. The inputs are the two strings to compare: string $a$, extracted from the natural language text, and string $b$, extracted from the ontology, along with their respective vector space models $M_t$ and $M_o$. The models $M_t$ and $M_o$, created as described in Sections 3.2 and 3.4, provide the semantic context of the input strings $a$ and $b$. Additionally, a general semantic context $M_g$ is used to compute the semantic similarity, which is given by Phrase-BERT (Wang et al., 2021a), a general BERT model (Kenton and Toutanova, 2019) focused on phrase embeddings. This general model serves to integrate the specific models $M_t$ and $M_o$, created for the natural language corpus and the ontology. The semantic similarity method returns a score in the range 0 (lowest similarity) to 1 (highest similarity).
Calculating the semantic similarity requires executing the following steps. First, we obtain the 10 nearest noun phrases ($np$) of each input string by using the corresponding vector space model. For this purpose we apply the cosine distance. In particular, the neighbors of the string $a$, $N_a$, are calculated by using the model $M_t$, whereas the neighbors of $b$, $N_b$, are extracted from $M_o$. Fig. 4 depicts an example of the neighbors of ‘diabetes mellitus’ and ‘type 2 diabetes mellitus’, taken from the natural language text and from the ontology vector space models, respectively. For readability reasons, only the 5 nearest neighbors are shown.
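The neighbor lookup of this first step can be illustrated with a small, library-free sketch. The vector space model is represented here as a plain dictionary from phrase to vector, and the 2-D vectors are invented for illustration; the helper names are hypothetical and non-zero vectors are assumed.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def nearest_neighbors(query: str, model: Dict[str, Sequence[float]], n: int = 10) -> List[str]:
    """Return the n phrases of the model closest to the query phrase."""
    target = model[query]
    candidates = [phrase for phrase in model if phrase != query]
    candidates.sort(key=lambda phrase: cosine_distance(target, model[phrase]))
    return candidates[:n]


# Toy stand-in for a vector space model such as M_t (values are made up).
toy_model = {
    "diabetes mellitus": [1.0, 0.1],
    "insulin": [0.9, 0.2],
    "hyperglycemia": [0.8, 0.3],
    "court ruling": [0.0, 1.0],
}
print(nearest_neighbors("diabetes mellitus", toy_model, n=2))
```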
Second, the 10 nearest neighbors of each input string are embedded with the general Phrase-BERT model $M_g$, obtaining their vectors in this vector space. This results in two sets of vectors, $V_{ga}$ and $V_{gb}$, encoding the semantic representation of the neighbors of $a$ and $b$ in this general, multipurpose vector space model. This makes it possible to represent the input strings in a common vector space by using their nearest neighbors as context, thus avoiding polysemy-related problems. In addition, this BERT model is specialized for phrases, which is consistent with our phrase-focused method. Fig. 5 shows the neighbors of ‘diabetes mellitus’ and ‘type 2 diabetes mellitus’ represented in the Phrase-BERT model.
Third, each set of vectors, $V_{ga}$ and $V_{gb}$, previously generated, is summarized into a single point in the Phrase-BERT general vector space by taking the mean of all points in the set. As a result, we have two average points, $ap_a$ (see Eq. (5)) and $ap_b$ (see Eq. (6)), representing the input strings in the Phrase-BERT vector space model in terms of their neighbors, which were previously extracted from their particular vector space models. Fig. 6 shows these average points for ‘diabetes mellitus’ and ‘type 2 diabetes mellitus’, together with their neighbors.
$$ap_a = \frac{\sum_{i=1}^{10} v_i}{10}, \quad v_i \in V_{ga} \qquad (5)$$
$$ap_b = \frac{\sum_{i=1}^{10} v_i}{10}, \quad v_i \in V_{gb} \qquad (6)$$
Finally, the semantic similarity, $\mathit{semSim}$, between the input strings $a$ and $b$ is given by the cosine similarity ($\mathit{cosSim}$) of the computed mean points, $ap_a$ and $ap_b$, as depicted in Eq. (7). Although the cosine similarity ranges from −1 to 1, where −1 is a perfect negative similarity, 1 is a perfect positive similarity, and 0 is no similarity, negative similarities are not considered in this work, so only positive values are taken into account.
$$\mathit{semSim}(a, b) = \mathit{cosSim}(ap_a, ap_b) = \frac{ap_a \cdot ap_b}{|ap_a|\,|ap_b|} \qquad (7)$$
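Eqs. (5) to (7) can be sketched as follows. This is a minimal, library-free illustration: the function names are hypothetical and the toy 2-D vectors stand in for the Phrase-BERT embeddings of the neighbors. How negative cosine values are handled is not specified in the text, so the sketch returns the raw cosine.

```python
import math
from typing import List, Sequence


def mean_point(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Eqs. (5)-(6): component-wise mean of the neighbor vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cos_sim(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def sem_sim(v_ga: Sequence[Sequence[float]], v_gb: Sequence[Sequence[float]]) -> float:
    """Eq. (7): cosine similarity of the two average points."""
    return cos_sim(mean_point(v_ga), mean_point(v_gb))


# Invented embeddings of the neighbors of a and b in the general model.
v_ga = [[1.0, 0.0], [0.8, 0.2]]
v_gb = [[0.9, 0.1], [0.7, 0.3]]
print(sem_sim(v_ga, v_gb))
```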
3.6. Coverage score
The score function described in Section 3.5 is used to find the best ontology class match for each noun phrase detected in the natural language text. To achieve this, for each noun phrase $a_i$ appearing in the natural language text, we selected the noun phrase $b_j$ appearing in the ontology class annotations (restricted to labels and synonyms) that maximizes the score function defined in Eq. (2). More formally, the pairs $(a_i, b_j)$ are found according to Eq. (8).
$$\mathit{pair}(a_i, b_j) : \mathit{score}(a_i, b_j) = \max_{k}\big(\mathit{score}(a_i, b_k)\big) \qquad (8)$$
Fig. 4. Example of the nearest neighbors of ‘diabetes mellitus’ and ‘type 2 diabetes mellitus’, taken from the natural language text and the ontology class annotations, respectively. Only the 5 nearest neighbors are shown for readability.
Fig. 5. Example of the nearest neighbors of ‘diabetes mellitus’ (blue dots) and ‘type 2 diabetes mellitus’ (red dots), taken from the natural language and the ontology vector models, respectively, represented in the Phrase-BERT general model. Only the 5 nearest neighbors are shown for readability.
The pair $(a_i, b_j)$ means that $b_j$ is the best ontology class annotation for the noun phrase $a_i$. Furthermore, since the method tracks the source class of the selected ontology class annotations, we can easily translate the pairs $(a_i, b_j)$ into pairs of type $(a_i, c_j)$, where $c_j$ is the ontology class that contains the annotation $b_j$. Thus, the mean of the scores achieved by the pairs $(a_i, c_j)$ is used as a global score for the coverage of the ontology in the domain given by the natural language text.
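The selection of Eq. (8) and the global coverage score can be sketched as below. The scoring function is a deliberately simple stand-in (token-set Jaccard overlap), not the score of Eq. (2), and the class identifiers and helper names are invented for illustration.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, Tuple


def best_matches(
    noun_phrases: Iterable[str],
    annotations: Dict[str, str],
    score: Callable[[str, str], float],
) -> Dict[str, Tuple[str, float]]:
    """Map each noun phrase a_i to (class c_j, score) of its best annotation b_j."""
    matches = {}
    for phrase in noun_phrases:
        best = max(annotations, key=lambda b: score(phrase, b))  # Eq. (8)
        matches[phrase] = (annotations[best], score(phrase, best))
    return matches


def coverage_score(matches: Dict[str, Tuple[str, float]]) -> float:
    """Global coverage: mean of the best-match scores."""
    return mean(s for _, s in matches.values())


def toy_score(a: str, b: str) -> float:
    """Hypothetical stand-in for Eq. (2): Jaccard overlap of token sets."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Annotation b_j -> source class c_j (identifiers invented for illustration).
annotations = {"diabetes mellitus": "C1", "bread slice": "C2"}
matches = best_matches(["type 2 diabetes mellitus", "slice of bread"], annotations, toy_score)
print(matches)
print(coverage_score(matches))
```

Keeping the annotation-to-class mapping alongside the annotations is what allows the $(a_i, b_j)$ pairs to be reported as $(a_i, c_j)$ pairs.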
3.7. Implementation
OCALM has been implemented as a Python 3.8 script, available on GitHub.1 It mainly uses the following libraries:
•SpaCy (Montani et al., 2023), for natural language processing and noun phrase identification.
•SciPy (Virtanen et al., 2020), for solving linear sum assignment problems.
•Gensim (Rehurek and Sojka, 2011), which provides the fastText implementation.
•OWL2Vec* (Chen et al., 2021), for ontology embeddings. The original OWL2Vec* code was modified to use fastText instead of Word2Vec; this modification is included in our GitHub repository.1
•Owlready2 (Lamy, 2017), for managing ontologies.
•SentenceTransformers (Reimers and Gurevych, 2019), which allows BERT models to be queried.
1 https://github.com/fanavarro/ocalm.
Fig. 6. Example of the average points computed in the general Phrase-BERT vector model from the neighbors of ‘diabetes mellitus’ (blue dots) and ‘type 2 diabetes mellitus’ (red dots) extracted from the natural language text and the ontology vector space models, respectively. Only the 5 nearest neighbors are shown for readability.
4. Results
4.1. Experimental setting
We have applied OCALM to four different domains, namely the food, genetics, legal and medical domains. For each domain, we selected a domain ontology and a natural language text corpus covering the corresponding domain. Tables 8 and 9 describe the natural language text corpora and the ontologies selected, respectively. We extracted subsets from the original natural language text corpora due to their size; the column ‘Word count’ indicates the size of the corpus used in the experiment. These files are available at Zenodo (Abad-Navarro, 2024), except for the SNOMED CT ontology and the clinical text corpus, due to licensing restrictions.
Table 8
Description of the corpora used in the experiment.
| Domain | Corpus | Word count | Description |
| --- | --- | --- | --- |
| Food | Amazon fine food reviews (a) | 10,099 | Reviews of fine foods from Amazon. |
| Genetics | BioC-BioGRID (b) (Islamaj Doğan et al., 2017) | 18,610 | Full text articles annotated for curation of protein–protein and genetic interactions. |
| Legal | Legal case reports (c) (Galgani, 2010) | 5,801 | Australian legal cases from the Federal Court of Australia (FCA). |
| Medical | i2b2 clinical records (d) | 13,885 | Clinical records used as training datasets in the 2009 medication challenge at the i2b2 workshop (Uzuner et al., 2010). |
(a) https://www.kaggle.com/snap/amazon-fine-food-reviews/version/2.
(b) https://bioc.sourceforge.net/BioC-BioGRID.html.
(c) https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports.
(d) https://portal.dbmi.hms.harvard.edu/projects/download_dataset/?file_uuid=3e6f6a8e-7b22-4d7a-8ddd-3cc5e1ab8c08 (login required).
Table 9
Description of the ontologies used in the experiment.
| Domain | Ontology | Number of classes | Class annotations | Description |
| --- | --- | --- | --- | --- |
| Food | FoodOn | 29,906 | 203,221 | Farm-to-fork ontology about food that describes foods commonly known in cultures from around the world. |
| Genetics | GeneOntology | 44,085 | 365,293 | Provides a framework and set of concepts for describing the functions of gene products from all organisms. |
| Legal | LKIF | 206 | 421 | Part of the European project for Standardized Transparent Representations in order to Extend Legal Accessibility. |
| Medical | SNOMED CT | 364,712 | 974,828 | Clinical healthcare terminology, enabling consistent representation of clinical content in clinical information systems. |
We ran an experiment for every possible (ontology, text corpus) pair as input to our method. The configuration of the experiment was the following:
•The lexical and the semantic similarity had the same weight in the score function; in particular, $\alpha = \beta = 1$.
•The parameter $K$, which indicates the weight of the order penalization when computing the lexical similarity (see Section 3.5.1), was set to 0.25. The low value of $K$ is intended to prioritize exact string matches over matches where the tokens are in a different order; for example, the match (‘diabetes mellitus’, ‘diabetes mellitus’) will have a higher lexical similarity than (‘mellitus diabetes’, ‘diabetes mellitus’). At the same time, the low $K$ value does not heavily penalize matches like ‘slice of bread’ and ‘bread slice’, which will still have a high similarity.
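As a worked illustration of this setting, assume that each exactly matching token pair contributes a Levenshtein similarity of 1, so that the maximized Levenshtein similarity of two two-token strings made of the same tokens is 2. Under that assumption, Eqs. (3) and (4) with $K = 0.25$ give:

$$\mathit{lexSim}(\text{‘diabetes mellitus’}, \text{‘diabetes mellitus’}) = \frac{2}{\max(2,2)} \cdot \left(1 - \frac{0}{\max(2,2)} \cdot 0.25\right) = 1$$

$$\mathit{lexSim}(\text{‘mellitus diabetes’}, \text{‘diabetes mellitus’}) = \frac{2}{\max(2,2)} \cdot \left(1 - \frac{1}{\max(2,2)} \cdot 0.25\right) = 0.875$$

The reordered match is therefore ranked below the exact match while still keeping a high lexical similarity.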
Finally, the distributions of the scores for each (noun phrase, ontology class) match provided by the method were compared to find the ontology that better represents each natural language text corpus.
4.2. Findings
The output of the experiment described in Section 4.1 consisted of a tabular file for each (ontology, text corpus) pair, available at Zenodo (Abad-Navarro, 2024). Each tabular file contains a list of the noun phrases detected in the corresponding text, together with the best ontology class match found by our method for that noun phrase, and the corresponding score.
These files have been summarized in Fig. 7, where the ontologies are compared with each other for each text corpus by using the distribution of the scores of the (noun phrase, ontology class) pairs found by OCALM. The figure also shows the p-values returned by the Wilcoxon Rank Sum Test (Wilcoxon, 1947) for each comparison. The Wilcoxon Rank Sum Test is a non-parametric statistical test used to compare the medians of two independent populations. The figure shows that SNOMED CT obtained high scores for all of the natural language text corpora, which means that this ontology covers the considered domains well, namely, the food, genetics, legal and medical domains. In addition, if we leave SNOMED CT aside for the non-medical texts, OCALM found that the best scores are reached by the ontologies in the same domain as the corresponding natural language text. Thus, the highest scores were reached by the pairs (FoodOn, food text), (GeneOntology, genes text), (LKIF, legal text) and (SNOMED CT, medical text). Nevertheless, it is noteworthy that, for the legal text, the LKIF ontology did not stand out against the others, especially FoodOn, whose comparison did not reach statistical significance at a significance level of $\alpha = 0.05$.
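For readers unfamiliar with the test, the following library-free sketch computes a two-sided rank-sum p-value with the normal approximation. It is an illustration under simplifying assumptions (average ranks for ties, no tie or continuity correction), not necessarily the exact procedure used to produce Fig. 7, and the function name is hypothetical. SciPy's `scipy.stats.ranksums` provides a library implementation of the same test.

```python
import math
from typing import Sequence


def rank_sum_p_value(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided rank-sum p-value via the normal approximation.

    Simplifications of this sketch: ties receive average ranks, but no
    tie or continuity correction is applied.
    """
    pooled = sorted((value, group) for group, sample in enumerate((x, y)) for value in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        rank_sum_x += average_rank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up score samples for two ontologies on the same corpus.
scores_a = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.69, 0.77]
scores_b = [0.41, 0.38, 0.52, 0.47, 0.35, 0.44, 0.50, 0.39]
print(rank_sum_p_value(scores_a, scores_b) < 0.05)
```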
Additionally, Fig. 8 shows a comparison between the text corpora for each ontology. In other words, this figure depicts which text corpus fits better with each ontology. In this case, each text corpus had the highest score when the corresponding ontology had the same domain, with no exception. Thus, the food text was better represented by FoodOn; the genetic text was better represented by GeneOntology; the legal text was better represented by LKIF; and the medical text was better represented by SNOMED CT.
Finally, Table 10 shows the mean of the scores of the (noun phrase, ontology class) matches, which serves as the coverage score of each ontology for each domain, together with the standard deviation (SD).
4.3. Comparison with the OntoPortal recommender
The ontologies and text corpora used as input for OCALM, described in Section 4.1, were also used to test the OntoPortal Recommender system (Martínez-Romero et al., 2017). In particular, a virtual machine was deployed with OntoPortal virtual appliance version 3.2.2. This virtual appliance deploys the OntoPortal web platform together with its services, including the Annotator and the Recommender. Then,
Tsaneva, S., Vasic, S., Sabou, M., 2024. LLM-driven ontology evaluation: Verifying ontology restrictions with ChatGPT. In: Data Quality Meets Machine Learning and Knowledge Graphs. In: CEUR Workshop Proceedings, RWTH, URL https://ceur-ws.org/Vol-3747/dqmlkg_paper3.pdf.
Uzuner, Ö., Solti, I., Cadag, E., 2010. Extracting medication information from clinical text. J. Am. Med. Inform. Assoc. 17 (5), 514–518. http://dx.doi.org/10.1136/jamia.2010.003947, arXiv:https://academic.oup.com/jamia/article-pdf/17/5/514/9713258/17-5-514.pdf.
Val-Calvo, M., Egaña Aranguren, M., Mulero-Hernández, J., Almagro-Hernández, G., Deshmukh, P., Bernabé-Díaz, J.A., Espinoza-Arias, P., Sánchez-Fernández, J.L., Mueller, J., Fernández-Breis, J.T., 2025. OntoGenix: Leveraging large language models for enhanced ontology engineering from datasets. Inf. Process. Manage. 62 (3), 104042. http://dx.doi.org/10.1016/j.ipm.2024.104042, URL https://www.sciencedirect.com/science/article/pii/S0306457324004011.
Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S.J., Brett, M., Wilson, J., Millman, K.J., Mayorov, N., Nelson, A.R.J., Jones, E., Kern, R., Larson, E., Carey, C.J., Polat, İ., Feng, Y., Moore, E.W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E.A., Harris, C.R., Archibald, A.M., Ribeiro, A.H., Pedregosa, F., van Mulbregt, P., SciPy 1.0 Contributors, 2020. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods 17, 261–272. http://dx.doi.org/10.1038/s41592-019-0686-2.
Wang, S., Thompson, L., Iyyer, M., 2021a. Phrase-BERT: Improved phrase embeddings from BERT with an application to corpus exploration. In: Moens, M.-F., Huang, X., Specia, L., Yih, S.W.-t. (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp. 10837–10851. http://dx.doi.org/10.18653/v1/2021.emnlp-main.846, URL https://aclanthology.org/2021.emnlp-main.846.
Wang, J., Yi, X., Guo, R., Jin, H., Xu, P., Li, S., Wang, X., Guo, X., Li, C., Xu, X., et al., 2021b. Milvus: A Purpose-Built vector data management system. In: Proceedings of the 2021 International Conference on Management of Data. pp. 2614–2627.
Whetzel, P.L., Noy, N.F., Shah, N.H., Alexander, P.R., Nyulas, C., Tudorache, T., Musen, M.A., 2011. BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res. 39 (suppl_2), W541–W545.
Wilcoxon, F., 1947. Probability tables for individual comparisons by ranking methods. Biometrics 3 (3), 119–122.
Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L.B., Bourne, P.E., et al., 2016. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3 (1), 1–9.
Wilson, R., Goonetillake, J.S., Indika, W., Ginige, A., 2021. Analysis of ontology quality dimensions, criteria and metrics. In: International Conference on Computational Science and its Applications. Springer, pp. 320–337.
Wilson, R., Goonetillake, J., Indika, W., Ginige, A., 2023. A conceptual model for ontology quality assessment. Semant. Web 14, 1051–1097. http://dx.doi.org/10.3233/SW-233393, 6.
Yang, S.-Y., 2009. OntoPortal: An ontology-supported portal architecture with linguistically enhanced and focused crawler technologies. Expert Syst. Appl. 36 (6), 10148–10157.
Zaitoun, A., Sagi, T., Hose, K., 2023. Automated ontology evaluation: Evaluating coverage and correctness using a Domain Corpus. In: Companion Proceedings of the ACM Web Conference 2023. pp. 1127–1137.
Zhu, H., Liu, D., Bayley, I., Aldea, A., Yang, Y., Chen, Y., 2017. Quality model and metrics of ontology for semantic descriptions of web services. Tsinghua Sci. Technol. 22 (3), 254–272.