
Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation

Author: Cer, Daniel,Diab, Mona,Agirre Bengoa, Eneko,López Gazpio, Iñigo,Specia, Lucia
Publisher: ACL
Year: 2017
DOI: 10.18653/v1/S17-2001
Source: https://addi.ehu.eus/bitstream/10810/68989/4/S17-2001.pdf
Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval-2017), pages 1–14,
Vancouver, Canada, August 3 - 4, 2017. © 2017 Association for Computational Linguistics

SemEval-2017 Task 1: Semantic Textual Similarity
Multilingual and Cross-lingual Focused Evaluation

Daniel Cer (a), Mona Diab (b), Eneko Agirre (c), Iñigo Lopez-Gazpio (c), and Lucia Specia (d)

(a) Google Research, Mountain View, CA
(b) George Washington University, Washington, DC
(c) University of the Basque Country, Donostia, Basque Country
(d) University of Sheffield, Sheffield, UK

Abstract

Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).

1 Introduction

Semantic Textual Similarity (STS) assesses the degree to which two sentences are semantically equivalent to each other. The STS task is motivated by the observation that accurately modeling the meaning similarity of sentences is a foundational language understanding problem relevant to numerous applications including: machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. STS enables the evaluation of techniques from a diverse set of domains against a shared interpretable performance criterion. Semantic inference tasks related to STS include textual entailment (Bentivogli et al., 2016; Bowman et al., 2015; Dagan et al., 2010), semantic relatedness (Bentivogli et al., 2016) and paraphrase detection (Xu et al., 2015; Ganitkevitch et al., 2013; Dolan et al., 2004). STS differs from both textual entailment and paraphrase detection in that it captures gradations of meaning overlap rather than making binary classifications of particular relationships. While semantic relatedness expresses a graded semantic relationship as well, it is non-specific about the nature of the relationship, with contradictory material still being a candidate for a high score (e.g., “night” and “day” are highly related but not particularly similar).

To encourage and support research in this area, the STS shared task has been held annually since 2012, providing a venue for evaluation of state-of-the-art algorithms and models (Agirre et al., 2012, 2013, 2014, 2015, 2016). During this time, diverse similarity methods and data sets[1] have been explored. Early methods focused on lexical semantics, surface form matching and basic syntactic similarity (Bär et al., 2012; Šarić et al., 2012a; Jimenez et al., 2012a). During subsequent evaluations, strong new similarity signals emerged, such as Sultan et al. (2015)’s alignment based method. More recently, deep learning became competitive with top performing feature engineered systems (He et al., 2016). The best performance tends to be obtained by ensembling feature engineered and deep learning models (Rychalska et al., 2016).

Significant research effort has focused on STS over English sentence pairs.[2] English STS is a well-studied problem, with state-of-the-art systems often achieving 70 to 80% correlation with human judgment. To promote progress in other languages, the 2017 task emphasizes performance on Arabic and Spanish as well as cross-lingual pairings of English with material in Arabic, Spanish and Turkish. The primary evaluation criterion combines performance on all of the different language conditions except English-Turkish, which was run as a surprise language track. Even with this departure from prior years, the task attracted 31 teams producing 84 submissions.

[1] i.a., news headlines, video and image descriptions, glosses from lexical resources including WordNet (Miller, 1995; Fellbaum, 1998), FrameNet (Baker et al., 1998), OntoNotes (Hovy et al., 2006), web discussion fora, plagiarism, MT post-editing and Q&A data sets. Data sets are summarized on: http://ixa2.si.ehu.es/stswiki.
[2] The 2012 and 2013 STS tasks were English only. The 2014 and 2015 task included a Spanish track and 2016 had a

STS shared task data sets have been used extensively for research on sentence level similarity and semantic representations (i.a., Arora et al. (2017); Conneau et al. (2017); Mu et al. (2017); Pagliardini et al. (2017); Wieting and Gimpel (2017); He and Lin (2016); Hill et al. (2016); Kenter et al. (2016); Lau and Baldwin (2016); Wieting et al. (2016b,a); He et al. (2015); Pham et al. (2015)). To encourage the use of a common evaluation set for assessing new methods, we present the STS Benchmark, a publicly available selection of data from English STS shared tasks (2012-2017).

2 Task Overview

STS is the assessment of pairs of sentences according to their degree of semantic similarity. The task involves producing real-valued similarity scores for sentence pairs. Performance is measured by the Pearson correlation of machine scores with human judgments. The ordinal scale in Table 1 guides human annotation, ranging from 0 for no meaning overlap to 5 for meaning equivalence. Intermediate values reflect interpretable levels of partial overlap in meaning. The annotation scale is designed to be accessible by reasonable human judges without any formal expertise in linguistics. Using reasonable human interpretations of natural language semantics was popularized by the related textual entailment task (Dagan et al., 2010). The resulting annotations reflect both pragmatic and world knowledge and are more interpretable and useful within downstream systems.
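
As an illustration of the evaluation measure described above, the following minimal sketch computes the Pearson correlation between system scores and gold labels. The function name `pearson` and the example scores are hypothetical and are not taken from the task's scoring script. Because Pearson correlation is invariant to positive linear rescaling, a system that scores on a 0-1 scale can be compared directly against 0-5 gold labels.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold labels (0-5 scale) and system scores (0-1 scale).
gold = [5.0, 3.8, 2.0, 1.0, 0.0]
system = [0.95, 0.80, 0.35, 0.30, 0.05]
print(round(100 * pearson(system, gold), 2))  # reported by convention as r x 100
```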

3 Evaluation Data

The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) is the primary evaluation data source with the exception that one of the cross-lingual tracks explores data from the WMT 2014 quality estimation task (Bojar et al., 2014).[3]

Sentence pairs in SNLI derive from Flickr30k image captions (Young et al., 2014) and are labeled with the entailment relations: entailment, neutral, and contradiction. Drawing from SNLI allows STS models to be evaluated on the type of data used to assess textual entailment methods. However, since entailment strongly cues for semantic relatedness (Marelli et al., 2014), we construct our own sentence pairings to deter gold entailment labels from informing evaluation set STS scores.

Track 4b investigates the relationship between STS and MT quality estimation by providing STS labels for WMT quality estimation data. The data includes Spanish translations of English sentences from a variety of methods including RBMT, SMT, hybrid-MT and human translation. Translations are annotated with the time required for human correction by post-editing and Human-targeted Translation Error Rate (HTER) (Snover et al., 2006).[4] Participants are not allowed to use the gold quality estimation annotations to inform STS scores.

5  The two sentences are completely equivalent, as they mean the same thing.
     The bird is bathing in the sink.
     Birdie is washing itself in the water basin.
4  The two sentences are mostly equivalent, but some unimportant details differ.
     Two boys on a couch are playing video games.
     Two boys are playing a video game.
3  The two sentences are roughly equivalent, but some important information differs/missing.
     John said he is considered a witness but not a suspect.
     “He is not a suspect anymore.” John said.
2  The two sentences are not equivalent, but share some details.
     They flew out of the nest in groups.
     They flew into the nest together.
1  The two sentences are not equivalent, but are on the same topic.
     The woman is playing the violin.
     The young lady enjoys listening to the guitar.
0  The two sentences are completely dissimilar.
     The black dog is running through the snow.
     A race car driver is driving his car through the mud.
Table 1: Similarity scores with explanations and English examples from Agirre et al. (2013).

[2, continued] pilot track on cross-lingual Spanish-English STS. The English tracks attracted the most participation and have the largest use of the evaluation data in ongoing research.
[3] Previous years of the STS shared task include more data sources. This year the task draws from two data sources and includes a diverse set of languages and language-pairs.
[4] HTER is the minimal number of edits required for correction of a translation divided by its length after correction.
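
The HTER definition in footnote 4 can be approximated with a word-level edit distance, as in the rough sketch below. This is a simplification and an assumption on our part: TER as defined by Snover et al. (2006) also counts block shifts as single edits, which plain Levenshtein distance does not, so the value here is an upper bound on the edit count. The function names and whitespace tokenization are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over tokens (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis token
                            curr[j - 1] + 1,      # insert a reference token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def approx_hter(mt_output, post_edited):
    """Edits needed to reach the post-edited text, divided by its length.

    Assumption: whitespace tokenization and no shift operations.
    """
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        raise ValueError("post-edited translation must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)


print(approx_hter("the cat sat on mat", "the cat sat on the mat"))
```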
Track | Language(s) | Pairs | Source
1 | Arabic (ar-ar) | 250 | SNLI
2 | Arabic-English (ar-en) | 250 | SNLI
3 | Spanish (es-es) | 250 | SNLI
4a | Spanish-English (es-en) | 250 | SNLI
4b | Spanish-English (es-en) | 250 | WMT QE
5 | English (en-en) | 250 | SNLI
6 | Turkish-English (tr-en) | 250 | SNLI
Total | | 1750 |
Table 2: STS 2017 evaluation data.

3.1 Tracks

Table 2 summarizes the evaluation data by track. The six tracks span four languages: Arabic, English, Spanish and Turkish. Track 4 has subtracks with 4a drawing from SNLI and 4b pulling from WMT’s quality estimation task. Track 6 is a surprise language track with no annotated training data and the identity of the language pair first announced when the evaluation data was released.

3.2 Data Preparation

This section describes the preparation of the evaluation data. For SNLI data, this includes the selection of sentence pairs, annotation of pairs with STS labels and the translation of the original English sentences. WMT quality estimation data is directly annotated with STS labels.

3.3 Arabic, Spanish and Turkish Translation

Sentences from SNLI are human translated into Arabic, Spanish and Turkish. Sentences are translated independently from their pairs. Arabic translation is provided by CMU-Qatar by native Arabic speakers with strong English skills. Translators are given an English sentence and its Arabic machine translation[5] where they perform post-editing to correct errors. Spanish translation is completed by a University of Sheffield graduate student who is a native Spanish speaker and fluent in English. Turkish translations are obtained from SDL.[6]

[5] Produced by the Google Translate API.
[6] http://www.sdl.com/languagecloud/managed-translation/

3.4 Embedding Space Pair Selection

We construct our own pairings of the SNLI sentences to deter gold entailment labels being used to inform STS scores. The word embedding similarity selection heuristic from STS 2016 (Agirre et al., 2016) is used to find interesting pairs. Sentence embeddings are computed as the sum of individual word embeddings, $v(s) = \sum_{w \in s} v(w)$.[7] Sentences with likely meaning overlap are identified using cosine similarity, Eq. (1).

$$\mathrm{sim}_v(s_1, s_2) = \frac{v(s_1) \cdot v(s_2)}{\lVert v(s_1) \rVert_2 \, \lVert v(s_2) \rVert_2} \qquad (1)$$
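
The pair selection heuristic above can be sketched as follows: sum the word vectors of each sentence, then apply Eq. (1). The toy 3-dimensional vectors, the whitespace tokenization and the choice to skip out-of-vocabulary words are all assumptions made for illustration; the paper uses 50-dimensional GloVe embeddings.

```python
import math

# Toy 3-dimensional word vectors; purely illustrative, not GloVe values.
EMBEDDINGS = {
    "a": [0.1, 0.0, 0.1],
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "runs": [0.0, 0.9, 0.1],
    "sleeps": [0.0, 0.1, 0.9],
}


def sentence_embedding(tokens, embeddings=EMBEDDINGS):
    """v(s): sum of the word vectors of the in-vocabulary tokens."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are skipped
        total = [t + x for t, x in zip(total, vec)]
    return total


def cosine(u, v):
    """Eq. (1): dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: empty or all-OOV sentences score 0
    return dot / (norm_u * norm_v)


def sim(s1, s2):
    return cosine(sentence_embedding(s1.lower().split()),
                  sentence_embedding(s2.lower().split()))


print(sim("A dog runs", "A puppy runs") > sim("A dog runs", "A dog sleeps"))
```

In a selection setting, candidate pairings would be ranked by `sim` and the higher-scoring ones passed on for annotation; the ranking and thresholding details are not specified in the text above.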

4 Annotation

Annotation of pairs with STS labels is performed using crowdsourcing, with the exception of Track 4b that uses a single expert annotator.

4.1 Crowdsourced Annotations

Crowdsourced annotation is performed on Amazon Mechanical Turk.[8] Annotators examine the STS pairings of English SNLI sentences. STS labels are then transferred to the translated pairs for cross-lingual and non-English tracks. The annotation instructions and template are identical to Agirre et al. (2016). Labels are collected in batches of 20 pairs with annotators paid $1 USD per batch. Five annotations are collected per pair. The MTurk master[9] qualification is required to perform the task. Gold scores average the five individual annotations.

4.2 Expert Annotation

English-Spanish WMT quality estimation pairs for Track 4b are annotated for STS by a University of Sheffield graduate student who is a native speaker of Spanish and fluent in English. This track differs significantly in label distribution and the complexity of the annotation task. Sentences in a pair are translations of each other and tend to be more semantically similar. Interpreting the potentially subtle meaning differences introduced by MT errors is challenging. To accurately assess STS performance on MT quality estimation data, no attempt is made to balance the data by similarity scores.

5 Training Data

The following summarizes the training data: Table 3 English; Table 4 Spanish;[10] Table 5 Spanish-English; Table 6 Arabic; and Table 7 Arabic-English. Arabic-English parallel data is supplied by translating English training data, Table 8.

[7] We use 50-dimensional GloVe word embeddings (Pennington et al., 2014) trained on a combination of Gigaword 5 (Parker et al., 2011) and English Wikipedia available at http://nlp.stanford.edu/projects/glove/.
[8] https://www.mturk.com/
[9] A designation that statistically identifies workers who perform high quality work across a diverse set of tasks.
[10] Spanish data from 2015 and 2014 uses a 5 point scale that collapses STS labels 4 and 3, removing the distinction between unimportant and important details.

Year | Data set | Pairs | Source
2012 | MSRpar | 1500 | newswire
2012 | MSRvid | 1500 | videos
2012 | OnWN | 750 | glosses
2012 | SMTnews | 750 | WMT eval.
2012 | SMTeuroparl | 750 | WMT eval.
2013 | HDL | 750 | newswire
2013 | FNWN | 189 | glosses
2013 | OnWN | 561 | glosses
2013 | SMT | 750 | MT eval.
2014 | HDL | 750 | newswire headlines
2014 | OnWN | 750 | glosses
2014 | Deft-forum | 450 | forum posts
2014 | Deft-news | 300 | news summary
2014 | Images | 750 | image descriptions
2014 | Tweet-news | 750 | tweet-news pairs
2015 | HDL | 750 | newswire headlines
2015 | Images | 750 | image descriptions
2015 | Ans.-student | 750 | student answers
2015 | Ans.-forum | 375 | Q&A forum answers
2015 | Belief | 375 | committed belief
2016 | HDL | 249 | newswire headlines
2016 | Plagiarism | 230 | short-answer plag.
2016 | post-editing | 244 | MT postedits
2016 | Ans.-Ans. | 254 | Q&A forum answers
2016 | Quest.-Quest. | 209 | Q&A forum questions
2017 | Trial | 23 | Mixed STS 2016
Table 3: English training data.

Year | Data set | Pairs | Source
2014 | Trial | 56 |
2014 | Wiki | 324 | Spanish Wikipedia
2014 | News | 480 | Newswire
2015 | Wiki | 251 | Spanish Wikipedia
2015 | News | 500 | Newswire
2017 | Trial | 23 | Mixed STS 2016
Table 4: Spanish training data.

English, Spanish and English-Spanish training data pulls from prior STS evaluations. Arabic and Arabic-English training data is produced by translating a subset of the English training data and transferring the similarity scores. For the MT quality estimation data in track 4b, Spanish sentences are translations of their English counterparts, differing substantially from existing Spanish-English STS data. We release one thousand new Spanish-English STS pairs sourced from the 2013 WMT translation task and produced by a phrase-based Moses SMT system (Bojar et al., 2013). The data is expert annotated and has a similar label distribution to the track 4b test data with 17% of the pairs scoring an STS score of less than 3, 23% scoring 3, 7% achieving a score of 4 and 53% scoring 5.

Year | Data set | Pairs | Source
2016 | Trial | 103 | Sampled ≤2015 STS
2016 | News | 301 | en-es news articles
2016 | Multi-source | 294 | en news headlines, short-answer plag., MT postedits, Q&A forum answers, Q&A forum questions
2017 | Trial | 23 | Mixed STS 2016
2017 | MT | 1000 | WMT13 Translation Task
Table 5: Spanish-English training data.

Year | Data set | Pairs | Source
2017 | Trial | 23 | Mixed STS 2016
2017 | MSRpar | 510 | newswire
2017 | MSRvid | 368 | videos
2017 | SMTeuroparl | 203 | WMT eval.
Table 6: Arabic training data.

5.1 Training vs. Evaluation Data Analysis

Evaluation data from SNLI tend to have sentences that are slightly shorter than those from prior years of the STS shared task, while the track 4b MT quality estimation data has sentences that are much longer. The track 5 English data has an average sentence length of 8.7 words, while the English sentences from track 4b have an average length of 19.4. The English training data has the following average lengths: 2012 10.8 words; 2013 8.8 words (excludes restricted SMT data); 2014 9.1 words; 2015 11.5 words; 2016 13.8 words.

Similarity scores for our pairings of the SNLI sentences are slightly lower than recent shared task years and much lower than early years. The change is attributed to differences in data selection and filtering. The average 2017 similarity score is 2.2 overall and 2.3 on the track 5 English data. Prior English data has the following average similarity scores: 2016 2.4; 2015 2.4; 2014 2.8; 2013 3.0; 2012 3.5. Translation quality estimation data from track 4b has an average similarity score of 4.0.

6 System Evaluation

This section reports participant evaluation results for the SemEval-2017 STS shared task.

6.1 Participation

The task saw strong participation with 31 teams producing 84 submissions. 17 teams provided 44 systems that participated in all tracks. Table 9 summarizes participation by track. Traces of the focus on English are seen in 12 teams participating just in track 5, English. Two teams participated exclusively in tracks 4a and 4b, English-Spanish. One team took part solely in track 1, Arabic.

Year | Data set | Pairs | Source
2017 | Trial | 23 | Mixed STS 2016
2017 | MSRpar | 1020 | newswire
2017 | MSRvid | 736 | videos
2017 | SMTeuroparl | 406 | WMT eval.
Table 7: Arabic-English training data.

Year | Data set | Pairs | Source
2017 | MSRpar | 1039 | newswire
2017 | MSRvid | 749 | videos
2017 | SMTeuroparl | 422 | WMT eval.
Table 8: Arabic-English parallel data.

6.2 Evaluation Metric

Systems are evaluated on each track by their Pearson correlation with gold labels. The overall ranking averages the correlations across tracks 1-5 with tracks 4a and 4b individually contributing.

Track | Language(s) | Participants
1 | Arabic | 49
2 | Arabic-English | 45
3 | Spanish | 48
4a | Spanish-English | 53
4b | Spanish-English MT | 53
5 | English | 77
6 | Turkish-English | 48
Primary | All except Turkish | 44
Table 9: Participation by shared task track.

6.3 CodaLab

As directed by the SemEval workshop organizers, the CodaLab research platform hosts the task.[11]

6.4 Baseline

The baseline is the cosine of binary sentence vectors with each dimension representing whether an individual word appears in a sentence.[12] For cross-lingual pairs, non-English sentences are translated into English using state-of-the-art machine translation.[13] The baseline achieves an average correlation of 53.7 with human judgment on tracks 1-5 and would rank 23rd overall out of the 44 system submissions that participated in all tracks.

[11] https://competitions.codalab.org/competitions/16051
[12] Words obtained using Arabic (ar), Spanish (es) and English (en) Treebank tokenizers.
[13] http://translate.google.com
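
The baseline above can be sketched in a few lines. For binary word-presence vectors, the dot product is the number of shared word types and each norm is the square root of the number of distinct word types, so no explicit vectors are needed. The whitespace tokenization and lowercasing below are simplifying assumptions; the task baseline uses Treebank tokenizers.

```python
import math


def binary_cosine(sentence_a, sentence_b):
    """Cosine similarity of binary bag-of-words vectors.

    Assumption: lowercased whitespace tokens stand in for the
    Treebank tokenization used by the task baseline.
    """
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    shared = len(tokens_a & tokens_b)
    return shared / math.sqrt(len(tokens_a) * len(tokens_b))


print(binary_cosine("Two boys on a couch are playing video games",
                    "Two boys are playing a video game"))
```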

6.5 Rankings

Participant performance is provided in Table 10. ECNU is best overall (avg r: 0.7316) and achieves the highest participant evaluation score on: track 2, Arabic-English (r: 0.7493); track 3, Spanish (r: 0.8559); and track 6, Turkish-English (r: 0.7706). BIT attains the best performance on track 1, Arabic (r: 0.7543). CompiLIG places first on track 4a, SNLI Spanish-English (r: 0.8302). SEF@UHH exhibits the best correlation on the difficult track 4b WMT quality estimation pairs (r: 0.3407). RTV has the best system for the track 5 English data (r: 0.8547), followed closely by DT Team (r: 0.8536).

Especially challenging tracks with SNLI data are: track 1, Arabic; track 2, Arabic-English; and track 6, English-Turkish. Spanish-English performance is much higher on track 4a’s SNLI data than track 4b’s MT quality estimation data. This highlights the difficulty and importance of making fine grained distinctions for certain downstream applications. Assessing STS methods for quality estimation may benefit from using alternatives to Pearson correlation for evaluation.[14]

Results tend to decrease on cross-lingual tracks. The baseline drops >10% relative on Arabic-English and Spanish-English (SNLI) vs. monolingual Arabic and Spanish. Many participant systems show smaller decreases. ECNU’s top ranking entry performs slightly better on Arabic-English than Arabic, with a slight drop from Spanish to Spanish-English (SNLI).

6.6 Methods

Participating teams explore techniques ranging from state-of-the-art deep learning models to elaborate feature engineered systems. Prediction signals include surface similarity scores such as edit distance and matching n-grams, scores derived from word alignments across pairs, assessment by MT evaluation metrics, estimates of conceptual similarity as well as the similarity between word and sentence level embeddings. For cross-lingual and non-English tracks, MT was widely used to convert the two sentences being compared into the same language.[15] Select methods are highlighted below.

[14] e.g., Reimers et al. (2016) report success using STS labels with alternative metrics such as normalized Cumulative Gain (nCG), normalized Discounted Cumulative Gain (nDCG) and F1 to more accurately predict performance on the downstream tasks: text reuse detection, binary classification of document relatedness and document relatedness within a corpus.
[15] Within the highlighted submissions, the following use a monolingual English system fed by MT: ECNU, BIT, HCTI

Team | Primary | Track 1 AR-AR | Track 2 AR-EN | Track 3 SP-SP | Track 4a SP-EN | Track 4b SP-EN-WMT | Track 5 EN-EN | Track 6 EN-TR
ECNU (Tian et al., 2017) | 73.16 | 74.40 | 74.93• | 85.59• | 81.31 | 33.63 | 85.18 | 77.06•
ECNU (Tian et al., 2017) | 70.44 | 73.80 | 71.26 | 84.56 | 74.95 | 33.11 | 81.81 | 73.62
ECNU (Tian et al., 2017) | 69.40 | 72.71 | 69.75 | 82.47 | 76.49 | 26.33 | 83.87 | 74.20
BIT (Wu et al., 2017)* | 67.89 | 74.17 | 69.65 | 84.99 | 78.28 | 11.07 | 84.00 | 73.05
BIT (Wu et al., 2017)* | 67.03 | 75.35 | 70.07 | 83.23 | 78.13 | 7.58 | 81.61 | 73.27
BIT (Wu et al., 2017) | 66.62 | 75.43• | 69.53 | 82.89 | 77.61 | 5.84 | 82.22 | 72.80
HCTI (Shao, 2017) | 65.98 | 71.30 | 68.36 | 82.63 | 76.21 | 14.83 | 81.13 | 67.41
MITRE (Henderson et al., 2017) | 65.90 | 72.94 | 67.53 | 82.02 | 78.02 | 15.98 | 80.53 | 64.30
MITRE (Henderson et al., 2017) | 65.87 | 73.04 | 67.40 | 82.01 | 77.99 | 15.74 | 80.48 | 64.41
FCICU (Hassan et al., 2017) | 61.90 | 71.58 | 67.82 | 84.84 | 69.26 | 2.54 | 82.72 | 54.52
neobility (Zhuang and Chang, 2017) | 61.71 | 68.21 | 64.59 | 79.28 | 71.69 | 2.00 | 79.27 | 66.96
FCICU (Hassan et al., 2017) | 61.66 | 71.58 | 67.81 | 84.89 | 68.54 | 2.14 | 82.80 | 53.90
STS-UHH (Kohail et al., 2017) | 60.58 | 67.81 | 63.07 | 77.13 | 72.01 | 4.81 | 79.89 | 59.37
RTV | 60.50 | 67.13 | 55.95 | 74.85 | 70.50 | 7.61 | 85.41 | 62.04
HCTI (Shao, 2017) | 59.88 | 43.73 | 68.36 | 67.09 | 76.21 | 14.83 | 81.56 | 67.41
RTV | 59.80 | 66.89 | 54.82 | 74.24 | 69.99 | 7.34 | 85.41 | 59.89
MatrusriIndia | 59.60 | 68.60 | 54.64 | 76.14 | 71.18 | 5.72 | 77.44 | 63.49
STS-UHH (Kohail et al., 2017) | 57.25 | 61.04 | 59.10 | 72.04 | 63.38 | 12.05 | 73.39 | 59.72
SEF@UHH (Duma and Menzel, 2017) | 56.76 | 57.90 | 53.84 | 74.23 | 58.66 | 18.02 | 72.56 | 62.11
SEF@UHH (Duma and Menzel, 2017) | 56.44 | 55.88 | 47.89 | 74.56 | 57.39 | 30.69 | 78.80 | 49.90
RTV | 56.33 | 61.43 | 48.32 | 68.63 | 61.40 | 8.29 | 85.47• | 60.79
SEF@UHH (Duma and Menzel, 2017) | 55.28 | 57.74 | 48.13 | 69.79 | 56.60 | 34.07• | 71.86 | 48.78
neobility (Zhuang and Chang, 2017) | 51.95 | 13.69 | 62.59 | 77.92 | 69.30 | 0.44 | 75.56 | 64.18
neobility (Zhuang and Chang, 2017) | 50.25 | 3.69 | 62.07 | 76.90 | 69.47 | 1.47 | 75.35 | 62.79
MatrusriIndia | 49.75 | 57.03 | 43.40 | 67.86 | 55.63 | 8.57 | 65.79 | 49.94
NLPProxem | 49.02 | 51.93 | 53.13 | 66.42 | 51.44 | 9.96 | 62.56 | 47.67
UMDeep (Barrow and Peskov, 2017) | 47.92 | 47.53 | 49.39 | 51.65 | 56.15 | 16.09 | 61.74 | 52.93
NLPProxem | 47.90 | 55.06 | 43.69 | 63.81 | 50.79 | 14.14 | 64.63 | 43.20
UMDeep (Barrow and Peskov, 2017) | 47.73 | 45.87 | 51.99 | 51.48 | 52.32 | 13.00 | 62.22 | 57.25
Lump (España Bonet and Barrón-Cedeño, 2017)* | 47.25 | 60.52 | 18.29 | 75.74 | 43.27 | 1.16 | 73.76 | 58.00
Lump (España Bonet and Barrón-Cedeño, 2017)* | 47.04 | 55.08 | 13.57 | 76.76 | 48.25 | 11.12 | 72.69 | 51.79
Lump (España Bonet and Barrón-Cedeño, 2017)* | 44.38 | 62.87 | 18.05 | 73.80 | 44.47 | 1.51 | 73.47 | 36.52
NLPProxem | 40.70 | 53.27 | 47.73 | 0.16 | 55.06 | 14.40 | 66.81 | 47.46
RTM (Biçici, 2017)* | 36.69 | 33.65 | 17.11 | 69.90 | 60.04 | 14.55 | 54.68 | 6.87
UMDeep (Barrow and Peskov, 2017) | 35.21 | 39.05 | 37.13 | 45.88 | 34.82 | 5.86 | 47.27 | 36.44
RTM (Biçici, 2017)* | 32.91 | 33.65 | 0.25 | 56.82 | 50.54 | 13.68 | 64.05 | 11.36
RTM (Biçici, 2017)* | 32.78 | 41.56 | 13.32 | 48.41 | 45.83 | 23.47 | 56.32 | 0.55
ResSim (Bjerva and Östling, 2017) | 31.48 | 28.92 | 10.45 | 66.13 | 23.89 | 3.05 | 69.06 | 18.84
ResSim (Bjerva and Östling, 2017) | 29.38 | 31.20 | 12.88 | 69.20 | 10.02 | 1.62 | 68.77 | 11.95
ResSim (Bjerva and Östling, 2017) | 21.45 | 0.33 | 10.98 | 54.65 | 22.62 | 1.99 | 50.57 | 9.02
LIPN-IIMAS (Arroyo-Fernández and Meza Ruiz, 2017) | 10.67 | 4.71 | 7.69 | 15.27 | 17.19 | 14.46 | 7.38 | 8.00
LIPN-IIMAS (Arroyo-Fernández and Meza Ruiz, 2017) | 9.26 | 2.14 | 12.92 | 4.58 | 1.20 | 1.91 | 20.38 | 21.68
hjpwhu | 4.80 | 4.12 | 6.39 | 6.17 | 2.04 | 6.24 | 1.14 | 7.53
hjpwhu | 2.94 | 4.77 | 2.04 | 7.63 | 0.46 | 2.57 | 0.69 | 2.46
compiLIG (Ferrero et al., 2017) | - | - | - | - | 83.02• | 15.50 | - | -
compiLIG (Ferrero et al., 2017) | - | - | - | - | 76.84 | 14.64 | - | -
compiLIG (Ferrero et al., 2017) | - | - | - | - | 79.10 | 14.94 | - | -
DT TEAM (Maharjan et al., 2017) | - | - | - | - | - | - | 85.36 | -
DT TEAM (Maharjan et al., 2017) | - | - | - | - | - | - | 83.60 | -
DT TEAM (Maharjan et al., 2017) | - | - | - | - | - | - | 83.29 | -
FCICU (Hassan et al., 2017) | - | - | - | - | - | - | 82.17 | -
ITNLPAiKF (Liu et al., 2017) | - | - | - | - | - | - | 82.31 | -
ITNLPAiKF (Liu et al., 2017) | - | - | - | - | - | - | 82.31 | -
ITNLPAiKF (Liu et al., 2017) | - | - | - | - | - | - | 81.59 | -
L2F/INESC-ID (Fialho et al., 2017)* | - | - | - | 76.16 | 1.91 | 5.44 | 78.11 | 2.93
L2F/INESC-ID (Fialho et al., 2017) | - | - | - | - | - | - | 69.52 | -
L2F/INESC-ID (Fialho et al., 2017)* | - | - | - | 63.85 | 15.61 | 5.24 | 66.61 | 3.56
LIM-LIG (Nagoudi et al., 2017) | - | 74.63 | - | - | - | - | - | -
LIM-LIG (Nagoudi et al., 2017) | - | 73.09 | - | - | - | - | - | -
LIM-LIG (Nagoudi et al., 2017) | - | 59.57 | - | - | - | - | - | -
MatrusriIndia | - | 68.60 | - | 76.14 | 71.18 | 5.72 | 77.44 | 63.49
NRC* | - | - | - | - | 42.25 | 0.23 | - | -
NRC | - | - | - | - | 28.08 | 11.33 | - | -
OkadaNaoya | - | - | - | - | - | - | 77.04 | -
OPI-JSA (Śpiewak et al., 2017) | - | - | - | - | - | - | 78.50 | -
OPI-JSA (Śpiewak et al., 2017) | - | - | - | - | - | - | 73.42 | -
OPI-JSA (Śpiewak et al., 2017) | - | - | - | - | - | - | 67.96 | -
PurdueNLP (Lee et al., 2017) | - | - | - | - | - | - | 79.28 | -
PurdueNLP (Lee et al., 2017) | - | - | - | - | - | - | 55.35 | -
PurdueNLP (Lee et al., 2017) | - | - | - | - | - | - | 53.11 | -
QLUT (Meng et al., 2017)* | - | - | - | - | - | - | 64.33 | -
QLUT (Meng et al., 2017) | - | - | - | - | - | - | 61.55 | -
QLUT (Meng et al., 2017)* | - | - | - | - | - | - | 49.24 | -
SIGMA | - | - | - | - | - | - | 80.47 | -
SIGMA | - | - | - | - | - | - | 80.08 | -
SIGMA | - | - | - | - | - | - | 79.12 | -
SIGMA PKU 2 | - | - | - | - | - | - | 81.34 | -
SIGMA PKU 2 | - | - | - | - | - | - | 81.27 | -
SIGMA PKU 2 | - | - | - | - | - | - | 80.61 | -
STS-UHH (Kohail et al., 2017) | - | - | - | - | - | - | 80.93 | -
UCSC-NLP | - | - | - | - | - | - | 77.29 | -
UdL (Al-Natsheh et al., 2017) | - | - | - | - | - | - | 80.04 | -
UdL (Al-Natsheh et al., 2017)* | - | - | - | - | - | - | 79.01 | -
UdL (Al-Natsheh et al., 2017) | - | - | - | - | - | - | 78.05 | -
cosine baseline | 53.70 | 60.45 | 51.55 | 71.17 | 62.20 | 3.20 | 72.78 | 54.56
* Corrected or late submission
Table 10: STS 2017 rankings ordered by average correlation across tracks 1-5. Performance is reported by convention as Pearson’s r × 100. For tracks 1-6, the top ranking result is marked with a • symbol and results in bold have no statistically significant difference with the best result on a track, p > 0.05 Williams’ t-test (Diedenhofen and Musch, 2015).

ECNU (Tian et al., 2017) The best overall system is from ECNU and ensembles well performing feature engineered models with deep learning methods. Three feature engineered models use Random Forest (RF), Gradient Boosting (GB) and XGBoost (XGB) regression methods with features based on: n-gram overlap; edit distance; longest common prefix/suffix/substring; tree kernels (Moschitti, 2006); word alignments (Sultan et al., 2015); summarization and MT evaluation metrics (BLEU, GTM-3, NIST, WER, METEOR, ROUGE); and kernel similarity of bags-of-words, bags-of-dependencies and pooled word-embeddings. ECNU’s deep learning models are differentiated by their approach to sentence embeddings using either: averaged word embeddings, projected word embeddings, a deep averaging network (DAN) (Iyyer et al., 2015) or LSTM (Hochreiter and Schmidhuber, 1997). Each network feeds the element-wise multiplication, subtraction and concatenation of paired sentence embeddings to additional layers to predict similarity scores. The ensemble averages scores from the four deep learning and three feature engineered models.[16]
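
One plausible reading of the pair representation described above is sketched below: the element-wise product and difference of the two sentence embeddings, followed by the embeddings themselves. The ordering of the parts, the use of a signed rather than absolute difference, and the function name `pair_features` are assumptions made for illustration and are not confirmed details of ECNU's system.

```python
def pair_features(u, v):
    """Combine two sentence embeddings into one input vector.

    Assumed layout: [u * v, u - v, u, v], where the first two parts
    are element-wise and the last two are plain concatenation.
    """
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")
    product = [a * b for a, b in zip(u, v)]
    difference = [a - b for a, b in zip(u, v)]
    return product + difference + list(u) + list(v)


# Two hypothetical 2-dimensional sentence embeddings.
features = pair_features([1.0, 2.0], [3.0, 4.0])
print(len(features))  # four parts of dimension 2 each
```

In a full model, this vector would be passed to further layers that regress the similarity score; those layers are omitted here.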

BIT (Wu et al., 2017) Second place overall is achieved by BIT primarily using sentence information content (IC) informed by WordNet and BNC word frequencies. One submission uses sentence IC exclusively. Another ensembles IC with Sultan et al. (2015)’s alignment method, while a third ensembles IC with cosine similarity of summed word embeddings with an IDF weighting scheme. Sentence IC in isolation outperforms all systems except those from ECNU. Combining sentence IC with word embedding similarity performs best.

HCTI (Shao, 2017) Third place overall is obtained by HCTI with a model similar to a convolutional Deep Structured Semantic Model (CDSSM) (Chen et al., 2015; Huang et al., 2013). Sentence embeddings are generated with twin convolutional neural networks (CNNs). The embeddings are then compared using cosine similarity and element wise difference with the resulting values fed to additional layers to predict similarity labels. The architecture is abstractly similar to ECNU’s deep learning models. UMDeep (Barrow and Peskov, 2017) took a similar approach using LSTMs rather than CNNs for the sentence embeddings.

[15, continued] and MITRE. HCTI submitted a separate run using ar, es and en trained models that underperformed using their en model with MT for ar and es. CompiLIG’s model is cross-lingual but includes a word alignment feature that depends on MT. SEF@UHH built ar, es, and en models and use bi-directional MT for cross-lingual pairs. LIM-LIG and DT Team only participate in monolingual tracks.
[16] The two remaining ECNU runs only use either RF or GB and exclude the deep learning models.

MITRE (Henderson et al., 2017) Fourth place overall is MITRE that, like ECNU, takes an ambitious feature engineering approach complemented by deep learning. Ensembled components include: alignment similarity; TakeLab STS (Šarić et al., 2012b); string similarity measures such as matching n-grams, summarization and MT metrics (BLEU, WER, PER, ROUGE); a RNN and recurrent convolutional neural networks (RCNN) over word alignments; and a BiLSTM that is state-of-the-art for textual entailment (Chen et al., 2016).

FCICU (Hassan et al., 2017) Fifth place overall is FCICU that computes a sense-based alignment using BabelNet (Navigli and Ponzetto, 2010). BabelNet synsets are multilingual, allowing non-English and cross-lingual pairs to be processed similarly to English pairs. Alignment similarity scores are used with two runs: one that combines the scores within a string kernel and another that uses them with a weighted variant of Sultan et al. (2015)’s method. Both runs average the BabelNet based scores with soft-cardinality (Jimenez et al., 2012b).

CompiLIG (Ferrero et al., 2017) The best Spanish-English performance on SNLI sentences was achieved by CompiLIG using features including: cross-lingual conceptual similarity using DBNary (Serasset, 2015), cross-language MultiVec word embeddings (Berard et al., 2016), and Brychcin and Svoboda (2016)’s improvements to Sultan et al. (2015)’s method.

LIM-LIG (Nagoudi et al., 2017) Using only weighted word embeddings, LIM-LIG took second place on Arabic.[17] Arabic word embeddings are summed into sentence embeddings using uniform, POS and IDF weighting schemes. Sentence similarity is computed by cosine similarity. POS and IDF outperform uniform weighting. Combining the IDF and POS weights by multiplication is reported by LIM-LIG to achieve r: 0.7667, higher than all submitted Arabic (track 1) systems.
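
The weighted-sum idea can be illustrated with the sketch below, which compares uniform and IDF weighting on a toy example. The IDF formula log(N / df), the toy corpus and vectors, and all function names are assumptions for illustration; LIM-LIG's exact weighting formulas and their POS weights are not reproduced here.

```python
import math


def idf_weights(corpus):
    """idf[w] = log(N / df(w)) over a list of tokenized documents (assumed form)."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def weighted_sentence_embedding(tokens, embeddings, weights):
    """Weighted sum of word vectors; unknown words are skipped."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        if tok not in embeddings:
            continue
        weight = weights.get(tok, 1.0)
        total = [t + weight * x for t, x in zip(total, embeddings[tok])]
    return total


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


# Toy data: "the" occurs in every document, so its IDF weight is 0.
corpus = [["the", "dog"], ["the", "cat"]]
embeddings = {"the": [1.0, 1.0], "dog": [1.0, 0.0], "cat": [0.0, 1.0]}
idf = idf_weights(corpus)
uniform = {word: 1.0 for word in embeddings}

for name, weights in (("uniform", uniform), ("idf", idf)):
    u = weighted_sentence_embedding(corpus[0], embeddings, weights)
    v = weighted_sentence_embedding(corpus[1], embeddings, weights)
    print(name, cosine(u, v))
```

With uniform weights the shared function word inflates the similarity of the two toy sentences, while IDF weighting removes its contribution.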

DT Team (Maharjan et al., 2017) Second place on English (track 5)[18] is DT Team using feature engineering combined with the following deep learning models: DSSM (Huang et al., 2013), CDSSM (Shen et al., 2014) and skip-thoughts (Kiros et al., 2015). Engineered features include: unigram overlap, summed word alignments scores, fraction of unaligned words, difference in word counts by type (all, adj, adverbs, nouns, verbs), and min to max ratios of words by type. Select features have a multiplicative penalty for unaligned words.

[17] The approach is similar to SIF (Arora et al., 2017) but without removal of the common principal component.
[18] RTV took first place on track 5, English, but submitted no system description paper.

Genre | Train | Dev | Test | Total
news | 3299 | 500 | 500 | 4299
caption | 2000 | 625 | 625 | 3250
forum | 450 | 375 | 254 | 1079
total | 5749 | 1500 | 1379 | 8628
Table 11: STS Benchmark annotated examples by genres (rows) and by train, dev and test splits (columns).

SEF@UHH (Duma and Menzel, 2017) First place on the challenging Spanish-English MT pairs (Track 4b) is SEF@UHH. Unsupervised similarity scores are computed from paragraph vectors (Le and Mikolov, 2014) using cosine, negation of Bray-Curtis dissimilarity and vector correlation. MT converts cross-lingual pairs, L1-L2, into two monolingual pairs, L1-L1 and L2-L2, with averaging used to combine the monolingual similarity scores. Bray-Curtis performs well overall, while cosine does best on the Spanish-English MT pairs.
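
The three vector comparisons named above can be sketched as follows, together with the averaging of the two monolingual scores. Interpreting "vector correlation" as the Pearson correlation between vector components is an assumption, as are the function names; the paragraph vectors themselves would come from a trained model and are replaced here by toy values.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def neg_bray_curtis(u, v):
    """Negated Bray-Curtis dissimilarity: -sum|u_i - v_i| / sum|u_i + v_i|."""
    denom = sum(abs(a + b) for a, b in zip(u, v))
    if denom == 0:
        return 0.0
    return -sum(abs(a - b) for a, b in zip(u, v)) / denom


def vector_correlation(u, v):
    """Pearson correlation between components (assumed interpretation)."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return cosine_similarity(centered_u, centered_v)


def averaged_score(sim_fn, l1_pair, l2_pair):
    """Average the similarity of the L1-L1 and L2-L2 monolingual pairs."""
    return (sim_fn(*l1_pair) + sim_fn(*l2_pair)) / 2


# Toy paragraph vectors for the two monolingual views of one pair.
l1_pair = ([0.2, 0.7, 0.1], [0.3, 0.6, 0.1])
l2_pair = ([0.5, 0.4, 0.1], [0.4, 0.4, 0.2])
for fn in (cosine_similarity, neg_bray_curtis, vector_correlation):
    print(fn.__name__, averaged_score(fn, l1_pair, l2_pair))
```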
7 Analysis
Figure 1 plots model similarity scores against human STS labels for the top 5 systems from tracks 5 (English), 1 (Arabic) and 4b (English-Spanish MT). While many systems return scores on the same scale as the gold labels, 0-5, others return scores between approximately 0 and 1. Lines on the graphs illustrate perfect performance for both a 0-5 and a 0-1 scale. After mapping the 0 to 1 scores to the range 0-5,²⁰ approximately 80% of the scores from top performing English systems are within 1.0 pt of the gold label. Errors for Arabic are more broadly distributed, particularly for model scores between 1 and 4. The English-Spanish MT plot shows the weak relationship between the predicted and gold scores.
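The rescaling step and the "within 1.0 pt" check described above can be sketched as follows. The min-max formula follows footnote 20; treating "within 1.0 pt" as an absolute error of at most 1.0 is an assumption, and the scores and gold labels below are made up for illustration.

```python
def rescale_to_0_5(scores):
    """Min-max rescale scores to [0, 5]: 5 * (s - min) / (max - min)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("cannot rescale constant scores")
    return [5.0 * (s - lo) / (hi - lo) for s in scores]


def fraction_within(predicted, gold, tolerance=1.0):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for p, g in zip(predicted, gold) if abs(p - g) <= tolerance)
    return hits / len(gold)


# Hypothetical system output on a 0-1 scale and made-up gold labels.
raw = [0.0, 0.25, 0.5, 1.0]
gold = [0.0, 3.0, 2.0, 5.0]
scaled = rescale_to_0_5(raw)
print(scaled, fraction_within(scaled, gold))
```

Because Pearson correlation is invariant to this kind of linear rescaling, the mapping changes only how the scores line up against the gold scale, not the correlation itself.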
¹⁹ ECNU, BIT and LIM-LIG are scaled to the range 0-5.
²⁰ $s_{new} = 5 \times \frac{s - \min(s)}{\max(s) - \min(s)}$ is used to rescale scores.

Table 12 provides examples of difficult sentence pairs for participant systems and illustrates common sources of error for even well-ranking systems including: (i) word sense disambiguation: “making” and “preparing” are very similar in the context of “food”, while “picture” and “movie” are not similar when “picture” is followed by “day”; (ii) attribute importance: “outside” vs. “deserted” are smaller details when contrasting “The man is in a deserted field” with “The man is outside in the field”; (iii) compositional meaning: “A man is carrying a canoe with a dog” has the same content words as “A dog is carrying a man in a canoe” but carries a different meaning; (iv) negation: systems score “... with goggles and a swimming cap” as nearly equivalent to “... without goggles or a swimming cap”. Inflated similarity scores for examples like “There is a young girl” vs. “There is a young boy with the woman” demonstrate (v) semantic blending, whereby appending “with a woman” to “boy” brings its representation closer to that of “girl”.

For multilingual and cross-lingual pairs, these issues are magnified by translation errors for systems that use MT followed by the application of a monolingual similarity model. For track 4b Spanish-English MT pairs, some of the poor performance can in part be attributed to many systems using MT to re-translate the output of another MT system, obscuring errors in the original translation.
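The compositional meaning error in item (iii) above can be made concrete with a small sketch: once function words are dropped and word order is ignored, the two canoe sentences become indistinguishable. The stop list is an illustrative assumption and is not taken from any participating system.

```python
# Illustrative stop list (an assumption, not from any participating system).
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "with", "of"}


def content_words(sentence):
    """Return the sorted content words of a sentence, ignoring word order."""
    tokens = [tok.strip(".,;:!?\"'") for tok in sentence.lower().split()]
    return sorted(tok for tok in tokens if tok and tok not in STOP_WORDS)


first = content_words("A man is carrying a canoe with a dog.")
second = content_words("A dog is carrying a man in a canoe.")
print(first == second)
```

Any model whose sentence representation depends only on such an unordered bag of content words would have to assign this pair the maximum similarity, whereas the human label in Table 12 is 1.8.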
7.1 Contrasting Cross-lingual STS with MT Quality Estimation
Since MT quality estimation pairs are translations of the same sentence, they are expected to be minimally on the same topic and have an STS score ≥ 1.²¹ The actual distribution of STS scores is such that only 13% of the test instances score below 3, 22% of the instances score 3, 12% score 4 and 53% score 5. The high STS scores indicate that MT systems are surprisingly good at preserving meaning. However, even for a human, interpreting changes caused by translation errors can be difficult due both to disfluencies and subtle errors with important changes in meaning.
²¹ The evaluation data for track 4b does in fact have STS scores that are ≥ 1 for all pairs. In the 1,000 sentence training set for this track, one sentence received a score of zero.

Figure 1: Model vs. human similarity scores for top systems. Panels: (a) Track 5: English (DT_Team, ECNU, BIT, FCICU, ITNLP-AiKF); (b) Track 1: Arabic (MITRE, BIT, FCICU, LIM_LIG, ECNU); (c) Track 4b: English-Spanish MT (SEF@UHH, RTM, UMDeep, MITRE, ECNU). Axes: Model Similarity Score and Human Similarity Label, both 0-5.

Pairs                                               Human  DT Team  ECNU  BIT  FCICU  ITNLP-AiKF
There is a cook preparing food.                     5.0    4.1      4.1   3.7  3.9    4.5
A cook is making food.
The man is in a deserted field.                     4.0    3.0      3.1   3.6  3.1    2.8
The man is outside in the field.
A girl in water without goggles or a swimming cap.  3.0    4.8      4.6   4.0  4.7    0.1
A girl in water, with goggles and swimming cap.
A man is carrying a canoe with a dog.               1.8    3.2      4.7   4.9  5.0    4.6
A dog is carrying a man in a canoe.
There is a young girl.                              1.0    2.6      3.3   3.9  1.9    3.1
There is a young boy with the woman.
The kids are at the theater watching a movie.       0.2    1.0      2.3   2.0  0.8    1.7
it is picture day for the boys
Table 12: Difficult English sentence pairs (Track 5) and scores assigned by top performing systems.¹⁹

Genre     File          Yr.   Train  Dev  Test
news      MSRpar        12     1000  250   250
news      headlines     13/6   1999  250   250
news      deft-news     14      300    0     0
captions  MSRvid        12     1000  250   250
captions  images        14/5   1000  250   250
captions  track5.en-en  17        0  125   125
forum     deft-forum    14      450    0     0
forum     ans-forums    15        0  375     0
forum     ans-ans       16        0    0   254
Table 13: STS Benchmark detailed break-down by files and years.

The Pearson correlation between the gold MT quality scores and the gold STS scores is 0.41, which shows that translation quality measures and STS are only moderately correlated. Differences are in part explained by translation quality scores penalizing all mismatches between the source segment and its translation, whereas STS focuses on differences in meaning. However, the difficult interpretation work required for STS annotation may increase the risk of inconsistent and subjective labels. The annotations for MT quality estimation are produced as a by-product of post-editing. Humans fix MT output and the edit distance between the output and its post-edited correction provides the quality score. This post-editing based procedure is known to produce relatively consistent estimates across annotators.
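A post-editing based quality score of the kind described above can be sketched with a word-level Levenshtein distance between the MT output and its post-edited correction. Normalizing by the length of the post-edited sentence is an assumption made here for illustration, as are the function names and the example sentences; the text specifies only that the edit distance provides the score.

```python
def word_edit_distance(hypothesis, reference):
    """Levenshtein distance over whitespace-separated tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, hyp_tok in enumerate(hyp, start=1):
        current = [i]
        for j, ref_tok in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_tok != ref_tok)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def post_edit_score(mt_output, post_edited):
    """Edits per post-edited word; 0.0 means no correction was needed."""
    length = len(post_edited.split())
    if length == 0:
        return 0.0
    return word_edit_distance(mt_output, post_edited) / length


# Made-up MT output and its hypothetical post-edited correction.
score = post_edit_score("the man is on field", "the man is in the field")
print(round(score, 3))
```

A score computed this way penalizes every surface mismatch equally, which is consistent with the observation that translation quality scores and STS labels are only moderately correlated.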
8 STS Benchmark
The STS Benchmark is a careful selection of the English data sets used in SemEval and *SEM STS shared tasks between 2012 and 2017. Tables 11 and 13 provide details on the composition of the benchmark. The data is partitioned into training, development and test sets.²² The development set can be used to design new models and tune hyperparameters. The test set should be used sparingly and only after a model design and hyperparameters have been locked against further changes. Using the STS Benchmark enables comparable assessments across different research efforts and improved tracking of the state-of-the-art.
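As a quick sanity check on the partition sizes, the per-file counts in Table 13 can be summed by genre and by split for comparison with the per-genre totals in Table 11. The rows below are copied from Table 13; the variable and function names are illustrative.

```python
from collections import defaultdict

# (genre, file, train, dev, test) rows copied from Table 13.
TABLE_13 = [
    ("news", "MSRpar", 1000, 250, 250),
    ("news", "headlines", 1999, 250, 250),
    ("news", "deft-news", 300, 0, 0),
    ("captions", "MSRvid", 1000, 250, 250),
    ("captions", "images", 1000, 250, 250),
    ("captions", "track5.en-en", 0, 125, 125),
    ("forum", "deft-forum", 450, 0, 0),
    ("forum", "ans-forums", 0, 375, 0),
    ("forum", "ans-ans", 0, 0, 254),
]


def totals_by_genre(rows):
    """Sum train, dev and test counts for each genre."""
    totals = defaultdict(lambda: [0, 0, 0])
    for genre, _file, train, dev, test in rows:
        for i, count in enumerate((train, dev, test)):
            totals[genre][i] += count
    return {genre: tuple(counts) for genre, counts in totals.items()}


genre_totals = totals_by_genre(TABLE_13)
# Column-wise sums over genres give the overall train, dev and test sizes.
split_totals = tuple(sum(col) for col in zip(*genre_totals.values()))
print(genre_totals, split_totals, sum(split_totals))
```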
Table 14 shows the STS Benchmark results for some of the best systems from Track 5 (EN-EN)²³ and compares their performance to competitive baselines from the literature. All baselines were run by the organizers using canonical pre-trained models made available by the originator of each method,²⁴ with the exception of PV-DBOW that
²² Similar to the STS shared task, while the training set is provided as a convenience, researchers are encouraged to incorporate other supervised and unsupervised data as long as no supervised annotations of the test partitions are used.
²³ Each participant submitted the run which did best in the development set of the STS Benchmark, which happened to be the same as their best run in Track 5 in all cases.
²⁴ sent2vec: https://github.com/epfml/sent2vec, trained model sent2vec_twitter_unigrams; SIF: https://github.com/epfml/sent2vec Wikipedia trained word frequencies enwiki_vocab_min200.txt, https://github.com/alexandres/lexvec embeddings from lexvec.commoncrawl.300d.W+C.pos.vectors, first 15 principal components removed, α = 0.001, dev