SentiTegi: Semi-manually Created Semantic Oriented Basque
Lexicon for Sentiment Analysis
Jon Alkorta, Koldo Gojenola, Mikel Iruskieta
IXA NLP Group, University of the Basque Country (UPV/EHU), Vizcaya,
Spain
jon.alko [email protected], koldo[email protected], mikel.i [email protected]
Abstract. The creation of a semantic oriented lexicon
of positive and negative words is often the first step in
analyzing the sentiment of a corpus. Various methods
can be employed to create a lexicon: supervised
and unsupervised. Until now, the methods employed to
create Basque polarity lexicons were unsupervised. The
aim of this paper is to present the construction and
evaluation of the first semantic oriented supervised
Basque lexicon, ranging from +5 to −5. Due to the
lack of resources, the Basque lexicon was created by
translating the SO-CAL Spanish dictionary by means
of two bilingual dictionaries following specific criteria,
and then slightly corrected with the SO-CAL English
dictionary and frequency data obtained from the Basque
Opinion Corpus. Evaluation results show that the
correlation between human annotators is slightly better
than between a gold standard lexicon (obtained from
human annotation) and the translated dictionary. This
shows that the quality of the translated lexicon is
satisfactory, although there is room to improve it.
Keywords. Semantic oriented lexicon, manual
translation method, Basque, sentiment analysis.
1 Introduction
Sentiment analysis is a task that classifies
documents according to their polarity. This
research area has developed considerably in
recent years due to social networks and the Internet,
which have increased the quantity of opinions and
other types of text with emotion, and it is in demand
of methods for automatic processing.
There are many resources for sentiment analysis
for the most used languages such as English [9],
Chinese [15] and Spanish [5].
Additionally, competitions like SemEval [10]
have greatly contributed to the development
of resources and tools for sentiment analysis.
However, the development is not symmetric
for lesser used languages or languages in a
normalization process, like Basque.
Semantic oriented lexicons are related to
the lexical level and, so, they are useful and
important in sentiment analysis. If the semantic
orientation of the words is known, opportunities
open up to calculate the semantic orientation of
sentences and, therefore, the semantic orientation
of texts, taking into account syntax and discourse
constraints.
The creation of the semantic oriented Basque
lexicon has been semi-manual: translating from the
SO-CAL Spanish dictionary, and then enriching
it with corpus analysis and the English SO-CAL
dictionary. In the translation process, different
bilingual dictionaries have been used. We have
decided to use a semi-manual procedure to create
our lexicon in order to take into account some
idiosyncratic characteristics of the Basque language.
The aim of this paper is to present a semantic
oriented lexicon for Basque. We will emphasize
the process of creating this lexicon, and particularly
the solutions adopted to solve the problems
encountered.
The main contributions of this work are: i) the
creation of a domain-specific semantic oriented
Basque lexicon, ii) a description of a semi-manual
technique to create the lexicon and iii) a thorough
evaluation.
Computación y Sistemas, Vol. 22, No. 4, 2018, pp. 1295–1306
ISSN 1405-5546
doi: 10.13053/CyS-22-4-3075
This paper is organized as follows: after
presenting related work in Section 2, Section
3 describes the methodology of the translation
process. Then, Section 4 discusses the
design decisions, while Section 5 describes the
characteristics of the created lexicon in two stages.
In Section 6 the quality of the lexicon is evaluated
and, finally, Section 7 concludes the paper, also
proposing directions for future work.
2 Related Work
There are various approaches to the creation
of polarity lexicons, based on knowledge or on
automatic methods. Each of the approaches has
its advantages and drawbacks.
SO-CAL [14] is a dictionary-based tool to extract
sentiment from texts. The dictionary was created
manually, and its words are annotated with polarity
(positive or negative) and strength (semantic
orientation: from ±1 to ±5). There are two versions
of the SO-CAL tool. The original version is the English
SO-CAL; the Spanish version, the second one,
is based on the previous version. The English
and Spanish dictionaries (V1.11) contain 6,610 and
4,880 words, respectively.
A disadvantage of manually created lexicons
is the hard work needed to make modifications. In
contrast, they can be tailored to be domain-specific
and, depending on the linguistic information used,
they can treat a variety of different linguistic
phenomena.
ML-SentiCon [6] is a multilingual polarity lexicon,
where the lexicons have been automatically
generated from an improved version of
SentiWordNet. It includes a Basque lexicon that
contains 4,323 lemmas. The polarity values are
situated between −1 and +1, on a continuous
scale. Additionally, the QWN-PPV tool [11] is able
to generate multilingual polarity lexicons, including
Basque. This unsupervised tool makes use of a
corpus and WordNet.
The main disadvantage of these lexicons is that
they are not domain-specific, so their results could
vary from one domain to another. In contrast, their
main advantage lies in the ease of creating them.
Another characteristic of the previous three works
is that the sentiment value of words is on a
scale, although the scale dimensions are different.
However, there are works in which the sentiment
value of words is not on a scale. For example, in
some works like [13], there are two non-numerical
tags: positive and negative. Consequently, two
words with different intensity are expressed with
the same tag.
Methods to evaluate lexicons differ
depending on each technique. Some works [3] use
intrinsic methods, where the result of the system is
compared to a gold standard data set predefined
by evaluators. In contrast, there are other systems
[4] which use extrinsic methods, where the system
is evaluated in an applied setting. Finally, some
works [7] use both extrinsic and intrinsic methods.
The lexicon presented in this work differs from
previous ones in several respects. SO-CAL
dictionaries have also been manually created but,
until now, they have dealt with languages which are
not morphologically rich (Spanish and English), in
contrast with Basque. Another relevant difference
of this study is the evaluation. We will
apply an intrinsic evaluation and measure, using
Pearson correlation, the agreement between two
human annotators, and the reliability between the
gold standard (based on human annotation) and
the translated dictionary. Finally, the characteristics
of the created lexicon are another interesting aspect.
The words of the lexicon have a sentiment value
on a scale from −5 to +5. This allows us to
study how sentiment shifters of different linguistic
levels (morphology, syntax and discourse) affect
sentiment analysis.
3 Methodology
In order to create a semantic oriented lexicon for
Basque, we have made several decisions, taking
different factors into account:
i) Time. The creation of a semantic oriented
lexicon for Basque is related to the project of
linguistics-based Basque sentiment analysis
and, for that reason, the time to create the
lexicon is limited.
ii) Resources. The Basque language is still in
a normalization process, and this limits the
creation of corpora and the reuse of
computational resources. On the one hand,
it is difficult to create a large opinion corpus of
different topics. This situation could affect
the quality of the lexicon if the corpus is used
for that. On the other hand, the collaboration of
lexicographers would be ideal, but it is a costly
resource that is not available. This situation makes
it more difficult to create a semantic oriented
Basque lexicon from zero.
iii) Quality. We want to develop the lexicon
with the best possible quality (and in the least
time possible). With that aim, we will first
translate the lexicon, after that evaluate it and
then improve our semantic oriented lexicon
following specific criteria.
3.1 Resources for Translation
We have used mainly four resources in the
translation process.
i) The SO-CAL Spanish Dictionary [14]. This
dictionary is the source to create the Basque
semantic oriented lexicon. It contains 4,880
words of five grammatical categories (noun,
adjective, adverb, verb and intensifier).
ii) Two Bilingual Dictionaries: Spanish-Basque:
Elhuyar dictionary [16] and Zehazki
[12]. These dictionaries have been used
to translate the Spanish SO-CAL dictionary.
Moreover, they have also been used to check
whether the translated word is an entry of such
dictionaries, since we will work only with words
which are entries of one of these dictionaries.
Dealing with collocations and expressions is
necessary, but it is out of the scope of this
work.
iii) The Basque Opinion Corpus [1]. After
getting the first version of the lexicon, each
entry has been checked in the corpus to create
a domain-based lexicon. The corpus contains
240 texts from six different domains.
iv) The SO-CAL English dictionary [14]. This
version, which contains 6,610 words, has been
used to verify and enrich the already created
domain-based lexicon.
Taking all the factors explained above into
account and using the mentioned resources, we
have decided to translate the SO-CAL Spanish
dictionary to create the Basque SO-lexicon
Sentitegi, following the methodology explained in
Figure 1.
3.2 Translation Steps
Figure 1 shows the steps followed in the translation
process. To begin with, a first version of a semantic
oriented Basque lexicon has been created from
the Spanish version of the SO-CAL dictionary.
After that, the second version has been created by
enriching it with the English lexicon version (V1.11)
and limiting it to the domains of the Basque Opinion
Corpus.
Some interesting phenomena have been
detected in the translation process of the SO-CAL
dictionaries from the Spanish and English versions
(V1.11) to Basque. Table 1 shows these five
phenomena.
− Phenomenon 1 (P1): The Spanish word is
translated but the translation is not an entry of
the Elhuyar [16] and Zehazki [12] dictionaries, so
we do not take it into account.
− Phenomenon 2 (P2): The Spanish word is
translated, it is an entry of Elhuyar but the
translation does not appear in the Basque
Opinion Corpus. Consequently, it will appear
in the first version (V1.0) but not in the second
one (V2.0).
− Phenomenon 3 (P3): The Spanish word is
translated, it is an entry, it appears in the
corpus but it is not in the SO-CAL English
dictionary. So, it will appear in the first version
of the dictionary, but not in the second one.
− Phenomenon 4 (P4): The Spanish word is
translated, it is an entry, it appears in the
corpus and it is not present in the SO-CAL
English dictionary. Then, it will be included in
the (first and) second version.
Fig. 1. Steps of the translation process. The enumeration in blue on the left indicates methodological steps. The blue
code on the right (P1 to P5) indicates different phenomena in the translation process
− Phenomenon 5 (P5): The Spanish word is
translated, it is an entry, it appears in the
corpus and it is also a word of the SO-CAL
English dictionary. It will appear in the first and
second versions. These last two phenomena
are the same, but the decision differs depending
on the characteristics of each word.
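The five phenomena can be read as a small decision procedure. The following is a minimal sketch, assuming each check is available as a boolean; the function name, the parameter names and the `spanish_value_accepted` flag (which stands in for the manual, per-word decision that separates P3 from P4) are hypothetical and are not part of the original resource.

```python
def classify_phenomenon(is_dictionary_entry: bool,
                        in_corpus: bool,
                        in_english_socal: bool,
                        spanish_value_accepted: bool = False) -> str:
    """Return the phenomenon label (P1 to P5) for one translated Basque candidate."""
    if not is_dictionary_entry:
        return "P1"  # e.g. the collocation "ospea kendu"
    if not in_corpus:
        return "P2"  # e.g. "atrofiatu"
    if in_english_socal:
        return "P5"  # e.g. "zuzen"
    # P3 and P4 share the same conditions; the outcome is a manual decision.
    return "P4" if spanish_value_accepted else "P3"


# Lexicon versions in which each phenomenon ends up, following the list above.
VERSIONS = {
    "P1": (),
    "P2": ("V1.0",),
    "P3": ("V1.0",),
    "P4": ("V1.0", "V2.0"),
    "P5": ("V1.0", "V2.0"),
}
```

For instance, a candidate that is a dictionary entry and occurs in the corpus, but has no English SO-CAL counterpart, is labelled P3 or P4 depending only on the manual flag.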
The translation process has been the following
(see Figure 1):
i) Automatic translation from Spanish into
Basque. The Spanish sentiment dictionary
of SO-CAL has been translated using the Elhuyar
[16] and Zehazki [12] dictionaries. When one
word of the dictionary has more than one
entry, all the entries have been taken into
account. The sentiment value of the Spanish
word has been assigned to all the correlated
elements in Basque.
For example, the Spanish word desacreditar
−2 “discredit” has been translated into Basque
in different forms: izena kendu, ospea kendu
and sona kendu “discredit”, with the same
meaning. This example shows how one
Spanish word could be translated in different
forms to Basque. But these translations are
not entries of the dictionary. Consequently,
they have not been taken into account.
ii) Filtering and grouping. After translating
all the words and transferring their sentiment
values, the repeated words in Basque have
been filtered and grouped.
Table 1 shows how words in Basque (fourth
column) can have one or more translations
in Spanish (third column). The phenomena
numbered 1, 2 and 4 have one translated word
in Spanish whereas 3 and 5 have more than
one.
This phenomenon occurred because those
words are polysemic. There are cases where
two or more words in Spanish correspond to
the same word in Basque and vice versa.
Consequently, in some cases, each word in
Basque has several meanings and sentiment
values in Spanish.

Table 1. Words that belong to the five phenomena related to the translation process

Phenomenon  SPA                       SPA grouping                 EUS                         ENG         Value
P1          desacreditar “discredit”  desacreditar -2 “discredit”  ospea kendu -2 “discredit”  -           -
P2          atrofia “atrophy”         atrofia -1 “atrophy”         atrofiatu -1 “atrophy”      -           -
P3          amago “feint”             amago “feint” -1             seinale “signal” -1         -           -
                                      cicatriz “scar” -2
P4          franquismo “Francoism”    franquismo -2 “francoism”    frankismo -2 “francoism”    -           -2
P5          correcto “correct”        acertado “correct” +3        zuzen +3 “correct”          right +1    +3
                                      correcto “correct” +3                                    correct +3
                                      decente “decent” -2
iii) Dictionary entry: Check if the Basque
translation is an entry in the Elhuyar [16]
and Zehazki [12] dictionaries. We have
only accepted the translations which are
entries of the Elhuyar and Zehazki dictionaries.
Consequently, Phenomenon 1 in Table 1
has occurred: ospea kendu “discredit” is a
collocation and not an entry, so we will not take
it into account. In contrast, the other words in the
table are entries in the dictionary and they are
maintained.
iv) Sentiment value selection. The value (and
meaning in Spanish) of each word in Basque
will be selected.
In order to choose the value, we have followed
these criteria:
- If the word in Basque has one translation
(and value) in Spanish and if that
translation is correct, the translation is
selected. This is the case of Phenomena
2 and 4 in Table 1. Sometimes the
translation is not “correct” or “direct”, as
we will observe in Section 4.
- If the word in Basque has many
translations (and values) in Spanish, the
translation has been selected according
to which translation best fits its use
in the Basque Opinion Corpus [1]. We
have analyzed the context of the words
in the corpus using the Key Word In Context
(KWIC) concordance format. This
is the case of Phenomena 3 and 5 in
Table 1.
- In the creation of the first version of
the lexicon, there have also been cases
where the word in Basque has no
instances in the corpus. In these
cases, the meanings that are used more
frequently have been selected.
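Steps i) to iii) above can be sketched as a single pass over the Spanish dictionary. This is only an illustration under stated assumptions: the data structures, the function name `translate_and_group` and the toy bilingual mapping are hypothetical (the words and values are taken from Table 1), and step iv) stays manual, so the function only returns the candidate (Spanish word, value) pairs for each Basque entry.

```python
from collections import defaultdict

# Toy data based on Table 1 (illustrative only, not the real resources).
SPANISH_SOCAL = {"desacreditar": -2, "amago": -1, "cicatriz": -2, "franquismo": -2}
BILINGUAL = {
    "desacreditar": ["izena kendu", "ospea kendu", "sona kendu"],
    "amago": ["seinale"],
    "cicatriz": ["seinale"],
    "franquismo": ["frankismo"],
}
BASQUE_ENTRIES = {"seinale", "frankismo"}  # headwords of the bilingual dictionaries


def translate_and_group(spanish_lexicon, bilingual, basque_entries):
    """Transfer Spanish values to Basque translations and group repeated words."""
    grouped = defaultdict(list)
    for es_word, value in spanish_lexicon.items():
        # Step i: every translation inherits the value of the Spanish word.
        for eu_word in bilingual.get(es_word, []):
            # Step iii: keep only translations that are dictionary entries.
            if eu_word in basque_entries:
                # Step ii: repeated Basque words are grouped under one key.
                grouped[eu_word].append((es_word, value))
    return dict(grouped)


candidates = translate_and_group(SPANISH_SOCAL, BILINGUAL, BASQUE_ENTRIES)
```

With this toy input, seinale keeps two candidate pairs (from amago and cicatriz), which is the situation that the corpus-based selection of step iv) has to resolve.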
After these four steps, the first version of the
Basque lexicon (V1.0) has been created. However,
we detected some inconsistencies and felt the need
to add more information and, for
that reason, we followed new steps to create the
Basque lexicon (V2.0):
v) Domain and corpus adaptation: New
lexicon based on the Basque Opinion
Corpus [1]. We have curated the first lexicon
(Basque V1.0) and created the second version
of this lexicon (Basque V2.0). This new
lexicon has been curated with the information
obtained from the word frequencies we have
extracted from the Basque Opinion Corpus.
The effects of this step are shown in
Phenomenon 2 in Table 1. The word atrofiatu
“to atrophy” does not appear in the corpus,
so it is not related to the domains of the
corpus and, consequently, we do not take it
into account. We leave such words out
because our work is limited to our
corpus and we want to maintain the coherence
of SO values as much as possible and
avoid complexities that we do not consider useful.
In Table 1, Phenomena 3, 4 and 5 are not
affected by this limitation while Phenomenon 2
is. With this procedure, the number of entries
in the lexicon was reduced from 8,140 to 1,813
words, because it was manually checked and
reviewed.
vi) Curate and check SO values of each entry:
Find the English translations of each Basque
entry in the SO-CAL English dictionary. Using
the Elhuyar dictionary [16], we have translated
the words in Basque to English and, after that,
we have checked if the translated words are in
the SO-CAL English dictionary. If the word is in
this dictionary, we have maintained the dictionary
entry and its value in the second version of
the Basque dictionary. If the word is not in
the English dictionary, in almost all cases it
was excluded from the second version of the
Basque dictionary.
In Table 1, Phenomena 3 and 4 do not
have any translation in the English dictionary
and, consequently, their (English) column in
Table 1 is empty. In contrast, Phenomenon 5
has two translations according to the English
dictionary: right and correct.
vii) Evaluation and correction: Compare and
choose the best translation and value. In
this step, each word in Basque has the same
value, most of the time, in Spanish and
English (Basque V1.0).
There are 3 different cases in this situation:
– Phenomenon 3. There is no word in
the English version corresponding to the
Basque word and the previous Spanish
one is not accepted. In Phenomenon 3,
the word seinale “sign” has been
assigned the value −1 (Table 1, fourth column)
but there is no corresponding value
in the English version and, consequently,
we have removed that value.
– Phenomenon 4. There is no
corresponding word in the English version
for the Basque word and the previous Spanish
translation and value are accepted. The
word frankismo “francoism” is related to
Spain and, for that reason, it appears in
the Spanish version and not in English.
In this case, we have maintained the
assigned value.
– Phenomenon 5. The English translation
and value are of the same or better quality
than the Spanish ones. Phenomenon
5 shows that the Spanish and English
values agree, so we have assigned the
value +3 to zuzen “correct”. In other
cases, the English and Spanish values
differ. When this happens, we decided
that the English value will prevail over the
Spanish one in the second version of the
Basque dictionary, because the quality is
slightly better in English, as previously
reported.
Phenomena 3 and 5 show how we have
decided to give more relevance to the English
version.¹
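The three cases of step vii) amount to a simple precedence rule between the Spanish and English values. The sketch below is an assumption-laden illustration: the function name `reconcile_value` and the `accept_spanish_only` flag (standing in for the manual decision that keeps a Spanish-only value, as with frankismo) are hypothetical.

```python
from typing import Optional


def reconcile_value(spanish_value: int,
                    english_value: Optional[int],
                    accept_spanish_only: bool = False) -> Optional[int]:
    """Return the V2.0 value of a Basque entry, or None if the value is removed."""
    if english_value is not None:
        # Phenomenon 5: the English value is kept, whether it agrees with
        # the Spanish one or differs from it.
        return english_value
    # No English counterpart: Phenomenon 4 keeps the Spanish value,
    # Phenomenon 3 drops it.
    return spanish_value if accept_spanish_only else None
```

Applied to the rows of Table 1, zuzen (+3 in both languages) keeps +3, frankismo keeps −2 when the Spanish value is accepted, and seinale loses its value.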
4 Discussion
In this section we explain how we have solved the
most fundamental problems we have found during
the translation process:
i) Source language is not always the preferred
language. English and Spanish could both
be the source language, but we have chosen
Spanish for several reasons. The overall
accuracy of the English SO-CAL is 76.62%
while that of the Spanish version is 71.81% [2].
In other words, the difference between them
is not big enough. On the other hand,
there are many more resources to translate
the dictionary from Spanish to Basque than
to translate from English to Basque. So,
the translation from Spanish is more reliable
and extended, as shown in Table 1, where
Phenomenon 4 (frankismo
“francoism”) shows that, although the English
dictionary contains more items, there are
some words in the Spanish dictionary that are
not present in the English one.
In contrast, the English version has helped
to check whether the value assigned to the Basque
word in the first version from Spanish is
correct. In the cases where the values
of the Spanish and English versions are
different, we have preferred the English one,
as Phenomenon 3 (seinale “signal”) shows.
Due to this decision, the number of words of
the lexicon has decreased from 1,813 to 1,237
entries.

¹ Sometimes there is no corresponding word in the English
dictionary [16]; an example, and an explanation of what we have
done in such cases, is given in Section 4.

Table 2. Examples of translations applying the coherence criteria

Criteria  EUS                      Value  EUS                             Value
A         errukigabe “ruthless”    −4     errukigabeko “(with) ruthless”  −4
B         tonto “stupid”           −3     tuntun “stupid”                 −3
C         arduradun “responsible”  +2     arduragabe “irresponsible”      −2
ii) Not one to one translation. Another problem
arose when, in the translation, a
Spanish word could be translated into Basque
in different forms but with the same sense. We
have decided to use all the translated words in
Basque so as to get the highest recall possible.
The first step, the automatic translation from
Spanish into Basque, shows that one or more
entries have been taken in Basque.
For example, the Spanish word aparatoso
“showy, spectacular” has been translated into
Basque in two different ways: arranditsu
“spectacular” and deigarri “showy”.
iii) Domain adaptation of polysemic words.
There are some words that have opposite
meanings according to their context. The best
solution would be to create two entries, but
then it would be difficult to implement it in
a system that does not distinguish between
word senses. In this situation, we have
decided to take only one meaning and we
have used the Basque Opinion Corpus [1] to
choose the meaning with the appropriate SO
value.
For example, the Basque word deigarri
“showy, spectacular” comes from Spanish
aparatoso −3 “spectacular” or llamativo +3
“showy”. Taking the context of the word in the
corpus into account, we have disambiguated
the word manually and chosen the value +3
for this word.
iv) Coherence consistency. In the process of
choosing the value, we have to try (when the
values match) to maintain the coherence of
the values, taking these criteria into account.
Examples of the criteria are shown in Table 2.
A) Sometimes, the same word appears
in different forms. For example, in
the creation of the first version of the
lexicon, it is usual that one word appears
sometimes with the genitive -ko “with” and
other times with an elided genitive, and in
both cases it is a dictionary entry. In these
cases, we decided to assign the same
value. One of the cases is the adjective
berehala “immediate”. It appears with the
genitive suffix, berehalako “immediately”,
and without it, berehala “immediate”. We
have assigned the same sentiment value
(+2) to both.
B) We assign (when the values match)
the same value to words with similar
meanings. For example, tonto “stupid” is
used for a man while tuntun, with the same
meaning, is used for a woman. We assign
the value −3 to both.
Table 3. The semantic oriented Basque lexicons (V1.0 and V2.0)

                      V1.0            V2.0
Grammatical category  Words  %        Words  %
Noun                  2,282  28.06    461    37.27
Adjectives            3,162  38.85    446    36.05
Adverbs               652    7.98     54     4.36
Verbs                 1,657  20.36    276    22.32
Intensifiers          387    4.75     -      -
Total                 8,140  100      1,237  100
C) We also assign the same intensity (1 to
5), but the opposite value (positive/negative),
to antonymic words when the values
coincide in the Basque dictionary entry. In
Basque, some prefixes (des- and ez-
“dis-”) and suffixes (-ezin “impossibility”,
“inability” and -gabe “without”) are used
to invert the meaning of the words and we
have paid special attention to these.
v) “Incorrect” translations. There have been
some translations which are incorrect because
of different factors. The Spanish word
provinciano “backward” (−1) is employed to
refer to people of the Bizkaia and Gipuzkoa
provinces. The Elhuyar dictionary [16] has
defined the word as “inhabitant of Bizkaia or
Gipuzkoa”, a translation which is not useful for
our purpose.
vi) “Indirect” translations. There have been
some translations that we have considered
as indirect. They are correct translations but,
since they have an extensive meaning and
they are used in limited situations, they are not
useful for us.
For example, the word beltz “black” could
have two meanings: i) a color, ii) “black,
sad; gloomy, depressing” (figurative meaning).
The figurative use of that word is less
usual and there are other words with the same
meaning. Taking into account that the word
could complicate the correct assignment of
sentiment values to texts, we have decided not to
assign any SO value.
The problems explained above show how difficult it
is to translate a semantic oriented lexicon
semi-automatically. The translation process is long and
very detailed, and the translation of the lexicon
involves different phenomena.
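The coherence criteria A to C of point iv) can be expressed as a check over pairs of values. The following is a minimal sketch, assuming the pairs have already been identified by hand; the function name `pair_is_coherent` and the tuple layout are hypothetical, and the data are the three rows of Table 2.

```python
def pair_is_coherent(criterion: str, value_a: int, value_b: int) -> bool:
    """Check one pair of SO values against coherence criterion A, B or C."""
    if criterion in ("A", "B"):
        # A: same word with and without the genitive; B: similar meanings.
        return value_a == value_b
    if criterion == "C":
        # Antonyms: same intensity, opposite polarity.
        return value_a != 0 and value_a == -value_b
    raise ValueError(f"unknown criterion: {criterion!r}")


# Rows of Table 2: (criterion, word, value, related word, value).
TABLE_2 = [
    ("A", "errukigabe", -4, "errukigabeko", -4),
    ("B", "tonto", -3, "tuntun", -3),
    ("C", "arduradun", 2, "arduragabe", -2),
]

all_coherent = all(
    pair_is_coherent(criterion, value_a, value_b)
    for criterion, _, value_a, _, value_b in TABLE_2
)
```

A check of this kind could flag pairs that deserve a second look, but it does not replace the manual decision described in the text.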
5 Results
As a result of the translation process, two versions
of the semantic oriented Basque lexicon have been
created. Table 3 shows the characteristics of these
two versions.
The first version (V1.0) is the result of the first
four steps in the translation process (Figure 1).
It is translated directly from the Spanish SO-CAL
dictionary with strict criteria. But, unlike the
second version (V2.0), the first version is not
subject to the restriction of being an entry
of the Basque bilingual dictionaries and it was
not improved taking into account the English
SO-CAL dictionary and the Basque Opinion Corpus;
in addition, other kinds of features that work
differently, such as intensifiers, are considered as
dictionary entries.
As a result of these considerations, the first
version has 8,140 entries and the second
version 1,237. In both cases, nouns
and adjectives are the grammatical categories with
the most entries. Verbs and adverbs are the least
frequent entries, whereas intensifiers have not been
taken into account in the second version because they
affect other words, so we think that it is better
to analyze them differently, assigning them different
values that do not go from −5 to +5.
Table 4. Examples of parallel lexicon

Word in lexicon  Value  SPA          Value  ENG           Value
bikain           +5     excepcional  +5     excellent     +5
on               +2     buen         +2     -             -
eskas            −1     escaso       −2     insufficient  −1
txar             −3     adverso      −3     bad           −3
Another interesting characteristic of the created
lexicon is that it is parallel. That means that
each word of the lexicon has its translations in
English and Spanish, and the sentiment values in
each language are also included. This information
appears in an orderly manner in the resource.
In Table 4, there are four examples showing the
parallel lexicon. Sometimes, the sentiment values
do not match because the Spanish and English
SO-CAL lexicons have been created in different
ways. But the Basque word always matches
one of them. The examples of Table 4 are
adjectives and they show how the sentiment values
are on a scale.
Once we have implemented this lexicon in the
Basque SO-CAL preliminary version, the created
semantic oriented lexicon is useful to assign
sentiment values to words as well as sentences, as
is shown in the following examples:
(1) [Halere, pentsa litekeenaren aurka, gaien
urritasunak eta diskurtso errepikakorrak−6
ez dakarte ñabardura aberastasunik, are
gutxiago argumentu-mailako sakontasunik.]−6
(However, contrary to what is thought, the
scarcity of problems and the repetitive−6
discourses do not imply rich nuances, much
less a plot depth.)−6
(2) [Arazo nagusia+2, nire ustez, gaien+4
emankortasun zalantzazkoan eta ekintzaren
bilakaera eskasean−3 datza.]+3
(The main+2 problem is, I believe, the
uncertain fertility of the topics+4 and the
slow−3 evolution of the action.)+3
(3) (...) [Emaitza ezustekorik−1.5 gabeko istorio
bat da, irakurlea epel−1.5 uzteko arrisku
dezente duen tonu arras moderatu batean
emana.]−3
(The result is an unsurprising−1.5 story, given
in a moderate tone with a risk to leave the
reader cold−1.5.)−3
As the three examples show, the words of
the dictionary have an SO value at the end of the
word. To mention one, in Example 1, the Basque
version of the SO-CAL tool assigns the value −6 to the
word errepikakor “repetitive”. There is no other
word with a sentiment value according to the lexicon,
so the sentiment value of the sentence is also
−6.² The methodology to calculate the semantic
orientation of the sentence is similar in Examples 2
and 3.
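In the three examples above, the value printed after each closing bracket coincides with the sum of the word-level values inside it. The sketch below reproduces only that arithmetic from the annotated English glosses; plain summation is an assumption made for illustration, since the SO-CAL tool also applies the linguistic operations mentioned in footnote 2 before the word values are reached. The function names and the regular expression are hypothetical.

```python
import re

# A signed number glued to the end of a word, e.g. "slow−3" or "main+2".
# Both the ASCII hyphen and the Unicode minus sign (U+2212) are accepted.
_ANNOTATION = re.compile(r"(?<=\w)([+\-\u2212])(\d+(?:\.\d+)?)")


def word_values(annotated: str) -> list:
    """Extract the signed SO values attached to words in an annotated sentence."""
    values = []
    for sign, digits in _ANNOTATION.findall(annotated):
        magnitude = float(digits)
        values.append(magnitude if sign == "+" else -magnitude)
    return values


def sentence_so(annotated: str) -> float:
    """Sentence-level value under the simple-sum assumption."""
    return sum(word_values(annotated))


EXAMPLE_2 = ("The main+2 problem is, I believe, the uncertain fertility "
             "of the topics+4 and the slow−3 evolution of the action.")
EXAMPLE_3 = ("The result is an unsurprising−1.5 story, given in a moderate "
             "tone with a risk to leave the reader cold−1.5.")
```

Under this assumption, Example 2 gives 2 + 4 − 3 = +3 and Example 3 gives −1.5 − 1.5 = −3, matching the sentence-level values shown in the examples.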
6 Evaluation
In this section, we want to evaluate two aspects
of the translation task. On the one hand, we
want to evaluate the difficulty of the task. We
think that the annotation of sentiment polarity is
a difficult task because there is no guide to
follow and subjective perceptions must be, first,
measured and, then, corrected if possible. For
this purpose, the inter-annotator agreement of
SO value annotation has been evaluated between
two linguist annotators. On the other hand, we
also want to measure the quality of the translated
lexicon. With this in mind, a gold standard
annotation has been created from the previous
annotation and discussion by both annotators.
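Section 2 states that agreement is measured with Pearson correlation. As a reminder of what that coefficient computes, here is a minimal sketch in plain Python; the function name `pearson` is hypothetical, and any annotator scores passed to it would have to come from the actual annotation, which is not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances (unnormalized; the factor 1/n cancels out).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / sqrt(var_x * var_y)
```

The same function would serve for both comparisons described in the paper: annotator against annotator, and gold standard against translated dictionary, each as a pair of value lists over the same words.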
² In this sentence, the sentiment value of the word
errepikakor “repetitive” in the lexicon is −4. But in the SO-CAL tool,
there are some mathematical operations related to linguistic
phenomena that increase or decrease the sentiment value of
the words. In this case, the sentiment value has increased to
−6.