Translation Artifacts in Cross-lingual Transfer Learning

Author: Artetxe Zurutuza, Mikel; Labaka Intxauspe, Gorka; Agirre Bengoa, Eneko
Publisher: ACL
Year: 2020
DOI: 10.18653/v1/2020.emnlp-main.618
Source: https://addi.ehu.eus/bitstream/10810/69978/1/2020.emnlp-main.618.pdf
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 7674–7684, November 16–20, 2020. ©2020 Association for Computational Linguistics
Translation Artifacts in Cross-lingual Transfer Learning
Mikel Artetxe, Gorka Labaka, Eneko Agirre
HiTZ Center
University of the Basque Country (UPV/EHU)
{mikel.artetxe,gorka.labaka,e.agirre}@ehu.eus
Abstract
Both human and machine translation play a central role in cross-lingual transfer learning: many multilingual datasets have been created through professional translation services, and using machine translation to translate either the test set or the training set is a widely used transfer technique. In this paper, we show that this translation process can introduce subtle artifacts that have a notable impact on existing cross-lingual models. For instance, in natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them, which current models are highly sensitive to. We show that some previous findings in cross-lingual transfer learning need to be reconsidered in the light of this phenomenon. Based on the gained insights, we also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
1 Introduction
While most NLP resources are English-specific, there have been several recent efforts to build multilingual benchmarks. One possibility is to collect and annotate data in multiple languages separately (Clark et al., 2020), but most existing datasets have been created through translation (Conneau et al., 2018; Artetxe et al., 2020). This approach has two desirable properties: it relies on existing professional translation services rather than requiring expertise in multiple languages, and it results in parallel evaluation sets that offer a meaningful measure of the cross-lingual transfer gap of different models. The resulting multilingual datasets are generally used for evaluation only, relying on existing English datasets for training.
Closely related to that, cross-lingual transfer learning aims to leverage large datasets available in one language (typically English) to build multilingual models that can generalize to other languages. Previous work has explored 3 main approaches to that end: machine translating the test set into English and using a monolingual English model (TRANSLATE-TEST), machine translating the training set into each target language and training the models on their respective languages (TRANSLATE-TRAIN), or using English data to fine-tune a multilingual model that is then transferred to the rest of languages (ZERO-SHOT).
The dataset creation and transfer procedures described above result in a mixture of original,¹ human translated and machine translated data when dealing with cross-lingual models. In fact, the type of text a system is trained on does not typically match the type of text it is exposed to at test time: TRANSLATE-TEST systems are trained on original data and evaluated on machine translated test sets, ZERO-SHOT systems are trained on original data and evaluated on human translated test sets, and TRANSLATE-TRAIN systems are trained on machine translated data and evaluated on human translated test sets.
Although it has been overlooked to date, we show that this mismatch has a notable impact on the performance of existing cross-lingual models. By using back-translation (Sennrich et al., 2016) to paraphrase each training instance, we obtain another English version of the training set that better resembles the test set, obtaining substantial improvements for the TRANSLATE-TEST and ZERO-SHOT approaches in cross-lingual Natural Language Inference (NLI). While improvements brought by machine translation have previously been attributed to data augmentation (Singh et al., 2019), we reject this hypothesis and show that the phenomenon is only present in translated test sets, but not in original ones. Instead, our analysis reveals that this behavior is caused by subtle artifacts arising from the translation process itself. In particular, we show that translating different parts of each instance separately (e.g., the premise and the hypothesis in NLI) can alter superficial patterns in the data (e.g., the degree of lexical overlap between them), which severely affects the generalization ability of current models. Based on the gained insights, we improve the state-of-the-art in XNLI, and show that some previous findings need to be reconsidered in the light of this phenomenon.
¹ We use the term original to refer to non-translated text.
2 Related work
Cross-lingual transfer learning. Current cross-lingual models work by pre-training multilingual representations using some form of language modeling, which are then fine-tuned on the relevant task and transferred to different languages. Some authors leverage parallel data to that end (Conneau and Lample, 2019; Huang et al., 2019), but training a model akin to BERT (Devlin et al., 2019) on the combination of monolingual corpora in multiple languages is also effective (Conneau et al., 2020). Closely related to our work, Singh et al. (2019) showed that replacing segments of the training data with their translation during fine-tuning is helpful. However, they attribute this behavior to a data augmentation effect, which we believe should be reconsidered given the new evidence we provide.
Multilingual benchmarks. Most benchmarks covering a wide set of languages have been created through translation, as is the case for XNLI (Conneau et al., 2018) for NLI, PAWS-X (Yang et al., 2019) for adversarial paraphrase identification, and XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020) for Question Answering (QA). A notable exception is TyDi QA (Clark et al., 2020), a contemporaneous QA dataset that was separately annotated in 11 languages. Other cross-lingual datasets leverage existing multilingual resources, as is the case for MLDoc (Schwenk and Li, 2018) for document classification and Wikiann (Pan et al., 2017) for named entity recognition. Concurrent to our work, Hu et al. (2020) combine some of these datasets into a single multilingual benchmark, and evaluate some well-known methods on it.
Annotation artifacts. Several studies have shown that NLI datasets like SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) contain spurious patterns that can be exploited to obtain strong results without making real inferential decisions. For instance, Gururangan et al. (2018) and Poliak et al. (2018) showed that a hypothesis-only baseline performs better than chance due to cues on their lexical choice and sentence length. Similarly, McCoy et al. (2019) showed that NLI models tend to predict entailment for sentence pairs with a high lexical overlap. Several authors have worked on adversarial datasets to diagnose these issues and provide a more challenging benchmark (Naik et al., 2018; Glockner et al., 2018; Nie et al., 2020). Besides NLI, other tasks like QA have also been found to be susceptible to annotation artifacts (Jia and Liang, 2017; Kaushik and Lipton, 2018). While previous work has focused on the monolingual scenario, we show that translation can interfere with these artifacts in multilingual settings.
Translationese. Translated texts are known to have unique features like simplification, explicitation, normalization and interference, which are referred to as translationese (Volansky et al., 2013). This phenomenon has been reported to have a notable impact in machine translation evaluation (Zhang and Toral, 2019; Graham et al., 2019). For instance, back-translation brings large BLEU gains for reversed test sets (i.e., when translationese is on the source side and original text is used as reference), but its effect diminishes in the natural direction (Edunov et al., 2020). While connected, the phenomenon we analyze is different in that it arises from translation inconsistencies due to the lack of context, and affects cross-lingual transfer learning rather than machine translation.
3 Experimental design
Our goal is to analyze the effect of both human and machine translation in cross-lingual models. For that purpose, the core idea of our work is to (i) use machine translation to either translate the training set into other languages, or generate English paraphrases of it through back-translation, and (ii) evaluate the resulting systems on original, human translated and machine translated test sets in comparison with systems trained on original data. We next describe the models used in our experiments (§3.1), the specific training variants explored (§3.2), and the evaluation procedure followed (§3.3).
3.1 Models and transfer methods
We experiment with two models that are representative of the state-of-the-art in monolingual and cross-lingual pre-training: (i) ROBERTA (Liu et al., 2019), which is an improved version of BERT that uses masked language modeling to pre-train an English Transformer model, and (ii) XLM-R (Conneau et al., 2020), which is a multilingual extension of the former pre-trained on 100 languages. In both cases, we use the large models released by the authors under the fairseq repository.² As discussed next, we explore different variants of the training set to fine-tune each model on different tasks. At test time, we try both machine translating the test set into English (TRANSLATE-TEST) and, in the case of XLM-R, using the actual test set in the target language (ZERO-SHOT).
3.2 Training variants
We try 3 variants of each training set to fine-tune our models: (i) the original one in English (ORIG), (ii) an English paraphrase of it generated through back-translation using Spanish or Finnish as pivot (BT-ES and BT-FI), and (iii) a machine translated version in Spanish or Finnish (MT-ES and MT-FI). For sentences occurring multiple times in the training set (e.g., premises repeated for multiple hypotheses), we use the exact same translation for all occurrences, as our goal is to understand the inherent effect of translation rather than its potential application as a data augmentation method.
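The BT-XX construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: `to_pivot` and `from_pivot` are hypothetical callables standing in for the English-to-pivot and pivot-to-English translation systems, and the cache is one simple way to guarantee that a repeated sentence always receives the same translation even when decoding is stochastic.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def back_translate_pairs(
    pairs: Iterable[Tuple[str, str]],
    to_pivot: Callable[[str], str],    # hypothetical: English -> pivot MT system
    from_pivot: Callable[[str], str],  # hypothetical: pivot -> English MT system
) -> List[Tuple[str, str]]:
    """Paraphrase (premise, hypothesis) pairs through a pivot language.

    Each distinct sentence is translated exactly once, so a premise shared by
    several hypotheses keeps a single paraphrase across all of them.
    """
    cache: Dict[str, str] = {}

    def paraphrase(sentence: str) -> str:
        if sentence not in cache:
            cache[sentence] = from_pivot(to_pivot(sentence))
        return cache[sentence]

    # Premise and hypothesis are translated independently of each other.
    return [(paraphrase(p), paraphrase(h)) for p, h in pairs]
```

Dropping the `from_pivot` step would give the corresponding MT-XX variant under the same assumptions.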
In order to train the machine translation systems for MT-XX and BT-XX, we use the big Transformer model (Vaswani et al., 2017) with the same settings as Ott et al. (2018) and SentencePiece tokenization (Kudo and Richardson, 2018) with a joint vocabulary of 32k subwords. For English-Spanish, we train for 10 epochs on all parallel data from WMT 2013 (Bojar et al., 2013) and ParaCrawl v5.0 (Esplà et al., 2019). For English-Finnish, we train for 40 epochs on Europarl and Wiki Titles from WMT 2019 (Barrault et al., 2019), ParaCrawl v5.0, and DGT, EUbookshop and TildeMODEL from OPUS (Tiedemann, 2012). In both cases, we remove sentences longer than 250 tokens, with a source/target ratio exceeding 1.5, or for which langid.py (Lui and Baldwin, 2012) predicts a different language, resulting in a final corpus size of 48M and 7M sentence pairs, respectively. We use sampling decoding with a temperature of 0.5 for inference, which produces more diverse translations than beam search (Edunov et al., 2018) and performed better in our preliminary experiments.
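The three corpus filters above can be sketched as a single predicate. This is an illustrative sketch under stated assumptions: tokens are approximated by whitespace splitting, the length ratio is applied symmetrically (longer side over shorter side), and `identify_language` is a hypothetical callable that returns a language code for a text (it could wrap a language identifier such as langid.py). The paper does not specify these details.

```python
from typing import Callable


def keep_pair(
    src: str,
    tgt: str,
    src_lang: str,
    tgt_lang: str,
    identify_language: Callable[[str], str],  # hypothetical language identifier
    max_len: int = 250,
    max_ratio: float = 1.5,
) -> bool:
    """Return True if a sentence pair passes the length, ratio and language filters."""
    src_tokens, tgt_tokens = src.split(), tgt.split()
    if not src_tokens or not tgt_tokens:
        return False
    # Filter 1: drop pairs where either side is longer than max_len tokens.
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False
    # Filter 2: drop pairs whose length ratio exceeds max_ratio (assumed symmetric).
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    if longer / shorter > max_ratio:
        return False
    # Filter 3: drop pairs where the predicted language differs from the expected one.
    return identify_language(src) == src_lang and identify_language(tgt) == tgt_lang
```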
² https://github.com/pytorch/fairseq
3.3 Tasks and evaluation procedure
We use the following tasks for our experiments:
Natural Language Inference (NLI). Given a premise and a hypothesis, the task is to determine whether there is an entailment, neutral or contradiction relation between them. We fine-tune our models on MultiNLI (Williams et al., 2018) for 10 epochs using the same settings as Liu et al. (2019). In most of our experiments, we evaluate on XNLI (Conneau et al., 2018), which comprises 2490 development and 5010 test instances in 15 languages. These were originally annotated in English, and the resulting premises and hypotheses were independently translated into the rest of the languages by professional translators. For the TRANSLATE-TEST approach, we use the machine translated versions from the authors. Following Conneau et al. (2020), we select the best epoch checkpoint according to the average accuracy in the development set.
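The checkpoint selection rule above amounts to an argmax over epochs of the accuracy averaged across languages. A minimal sketch, assuming per-epoch development accuracies are already available in a dictionary (the data layout and the function name are illustrative, not taken from the paper):

```python
from typing import Dict


def select_best_epoch(dev_acc_by_epoch: Dict[int, Dict[str, float]]) -> int:
    """Pick the epoch whose accuracy, averaged over all languages, is highest."""

    def average(per_language: Dict[str, float]) -> float:
        return sum(per_language.values()) / len(per_language)

    return max(dev_acc_by_epoch, key=lambda epoch: average(dev_acc_by_epoch[epoch]))
```

For example, an epoch scoring 89.0 on English and 83.0 on Spanish (average 86.0) would be preferred over one scoring 90.0 and 80.0 (average 85.0).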
Question Answering (QA). Given a context paragraph and a question, the task is to identify the span answering the question in the context. We fine-tune our models on SQuAD v1.1 (Rajpurkar et al., 2016) for 2 epochs using the same settings as Liu et al. (2019), and report test results for the last epoch. We use two datasets for evaluation: XQuAD (Artetxe et al., 2020), a subset of the SQuAD development set translated into 10 other languages, and MLQA (Lewis et al., 2020), a dataset consisting of parallel context paragraphs plus the corresponding questions annotated in English and translated into 6 other languages. In both cases, the translation was done by professional translators at the document level (i.e., when translating a question, the text answering it was also shown). For our BT-XX and MT-XX variants, we translate the context paragraph and the questions independently, and map the answer spans using the same procedure as Carrino et al. (2020).³ For the TRANSLATE-TEST approach, we use the official machine translated versions of MLQA, run inference over them, and map the predicted answer spans back to the target language.⁴
³ We use FastAlign (Dyer et al., 2013) for word alignment, and discard the few questions for which the mapping method fails (when none of the tokens in the answer span are aligned).
⁴ We use the same procedure as for the training set except that (i) given the small size of the test set, we combine it with WikiMatrix (Schwenk et al., 2019) to aid word alignment, (ii) we use Jieba for Chinese segmentation instead of the Moses tokenizer, and (iii) for the few unaligned spans, we return the English answer.
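The answer span mapping described in the footnotes can be illustrated with a small sketch. It assumes word alignments in the `i-j` (Pharaoh-style) format with 0-based token indices, and it projects a span to the smallest target span covering all aligned tokens; the exact projection rule of Carrino et al. (2020) may differ, so this shows the general idea rather than their procedure.

```python
from typing import List, Optional, Tuple


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse an alignment line such as '0-0 1-2 2-1' into (source, target) index pairs."""
    return [(int(s), int(t)) for s, t in (link.split("-") for link in line.split())]


def project_span(
    alignment: List[Tuple[int, int]], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Map an inclusive source token span onto the target side.

    Returns None when no token in the span is aligned, which corresponds to
    the failure case where the example is discarded (training) or the
    English answer is returned instead (test).
    """
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None
    return min(targets), max(targets)
```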

Model    Train  en    fr    es    de    el    bg    ru    tr    ar    vi    th    zh    hi    sw    ur    avg
Test set machine translated into English (TRANSLATE-TEST)
ROBERTA  ORIG   91.2  82.2  84.6  82.4  82.1  82.1  79.2  76.5  77.4  73.8  73.4  76.7  70.5  67.2  66.8  77.7 ±0.6
         BT-ES  91.6  85.7  87.4  85.4  85.1  85.1  83.6  81.3  81.5  78.7  78.2  81.1  76.3  72.7  71.5  81.7 ±0.2
         BT-FI  91.4  86.0  87.4  85.7  85.7  85.4  84.4  82.3  82.1  79.0  79.3  81.8  77.6  73.5  73.6  82.3 ±0.2
XLM-R    ORIG   90.3  82.2  84.2  82.6  81.9  82.0  79.3  76.7  77.5  75.0  73.7  77.5  70.9  67.8  67.2  77.9 ±0.3
         BT-ES  90.2  84.1  86.3  84.5  84.5  84.1  82.2  79.6  80.7  78.5  77.3  80.8  75.2  72.5  71.2  80.8 ±0.3
         BT-FI  89.5  84.9  85.5  84.5  84.5  84.6  82.9  80.6  81.4  78.9  78.1  81.5  76.3  73.3  72.5  81.3 ±0.2
         MT-ES  89.8  83.2  85.6  84.2  84.0  83.6  81.6  78.4  79.3  77.6  76.7  80.0  74.3  71.3  70.1  80.0 ±0.6
         MT-FI  89.8  84.4  85.3  84.7  84.1  84.0  82.0  79.8  80.3  77.4  77.7  80.6  74.7  71.8  71.3  80.5 ±0.3
Test set in target language (ZERO-SHOT)
XLM-R    ORIG   90.4  84.4  85.5  84.3  81.9  83.6  80.1  80.1  79.8  81.8  78.3  80.3  77.7  72.8  74.5  81.0 ±0.2
         BT-ES  90.2  86.0  86.9  86.5  84.0  85.3  83.2  82.5  82.7  83.7  80.7  83.0  79.7  75.6  77.1  83.1 ±0.2
         BT-FI  89.5  86.0  86.2  86.2  83.9  85.1  83.4  82.2  83.0  83.9  81.2  83.9  80.1  75.2  78.1  83.2 ±0.1
         MT-ES  89.9  85.7  87.3  85.6  83.9  85.4  82.9  82.0  82.3  83.6  80.0  82.6  79.9  75.5  76.8  82.9 ±0.4
         MT-FI  90.2  85.9  86.9  86.5  84.4  85.5  83.4  83.0  82.4  83.6  80.5  83.6  80.4  76.5  77.9  83.4 ±0.2
Table 1: XNLI dev results (acc). BT-XX and MT-XX consistently outperform ORIG in all cases.

Both for NLI and QA, we run each system 5 times with different random seeds and report the average results. Space permitting, we also report the standard deviation across the 5 runs. In our result tables, we use an underline to highlight the best result within each block, and boldface to highlight the best overall result.
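The reporting convention above (mean over 5 seeds, optionally followed by the standard deviation) can be reproduced with the standard library. Whether the paper uses the sample or the population standard deviation is not stated; this sketch assumes the sample version (`statistics.stdev`), and the run values in the usage note are made up for illustration.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and the sample standard deviation across runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def format_result(scores: Sequence[float]) -> str:
    """Format runs in the 'mean ±std' style used in the result tables."""
    mean, std = summarize_runs(scores)
    return f"{mean:.1f} ±{std:.1f}"
```

With hypothetical run accuracies `[77.1, 77.9, 78.3, 77.5, 77.7]`, `format_result` yields `77.7 ±0.4`.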
4 NLI experiments
We next discuss our main results in the XNLI development set (§4.1, §4.2), run additional experiments to better understand the behavior of our different variants (§4.3, §4.4, §4.5), and compare our results to previous work in the XNLI test set (§4.6).
4.1 TRANSLATE-TEST results
We start by analyzing XNLI development results for TRANSLATE-TEST. Recall that, in this approach, the test set is machine translated into English, but training is typically done on original English data. Our BT-ES and BT-FI variants close this gap by training on a machine translated English version of the training set generated through back-translation. As shown in Table 1, this brings substantial gains for both ROBERTA and XLM-R, with an average improvement of 4.6 points in the best case. Quite remarkably, MT-ES and MT-FI also outperform ORIG by a substantial margin, and are only 0.8 points below their BT-ES and BT-FI counterparts. Recall that, for these two systems, training is done in machine translated Spanish or Finnish, while inference is done in machine translated English. This shows that the loss of performance when generalizing from original data to machine translated data is substantially larger than the loss of performance when generalizing from one language to another.
4.2 ZERO-SHOT results
We next analyze the results for the ZERO-SHOT approach. In this case, inference is done in the test set in each target language which, in the case of XNLI, was human translated from English. As such, different from the TRANSLATE-TEST approach, neither training on original data (ORIG) nor training on machine translated data (BT-XX and MT-XX) makes use of the exact same type of text that the system is exposed to at test time. However, as shown in Table 1, both BT-XX and MT-XX outperform ORIG by approximately 2 points, which suggests that our (back-)translated versions of the training set are more similar to the human translated test sets than the original one. This also provides a new perspective on the TRANSLATE-TRAIN approach, which was reported to outperform ORIG in previous work (Conneau and Lample, 2019): while the original motivation was to train the model on the same language that it is tested on, our results show that machine translating the training set is beneficial even when the target language is different.
4.3 Original vs. translated test sets
So as to understand whether the improvements observed so far are limited to translated test sets or apply more generally, we conduct additional experiments comparing translated test sets to original ones. However, to the best of our knowledge, all existing non-English NLI benchmarks were created through translation. For that reason, we build a new test set that mimics XNLI, but is annotated in Spanish rather than English. We first collect the premises from a filtered version of CommonCrawl (Buck et al., 2014), taking a subset of 5 websites that represent a diverse set of genres: a newspaper, an economy forum, a celebrity magazine, a literature blog, and a consumer magazine. We then ask native Spanish annotators to generate an entailment, a neutral and a contradiction hypothesis for each premise.⁵ We collect a total of 2490 examples using this procedure, which is the same size as the XNLI development set. Finally, we create a human translated and a machine translated English version of the dataset using professional translators from Gengo and our machine translation system described in §3.2,⁶ respectively. We report results for the best epoch checkpoint on each set.
As shown in Table 2, both BT-XX and MT-XX clearly outperform ORIG in all test sets created through translation, which is consistent with our previous results. In contrast, the best results on the original English set are obtained by ORIG, and neither BT-XX nor MT-XX obtain any clear improvement on the one in Spanish either.⁷ This confirms that the underlying phenomenon is limited to translated test sets. In addition, it is worth mentioning that the results for the machine translated test set in English are slightly better than those of the human translated one, which suggests that the difficulty of the task does not only depend on the translation quality. Finally, it is also interesting that MT-ES is only marginally better than MT-FI in both Spanish test sets, even if it corresponds to the TRANSLATE-TRAIN approach, whereas MT-FI needs to ZERO-SHOT transfer from Finnish into Spanish. This reinforces the idea that it is training on translated data rather than training on the target language that is key in TRANSLATE-TRAIN.
⁵ Unlike XNLI, we do not collect 4 additional labels for each example. Note, however, that XNLI kept the original label as the gold standard, so the additional labels are irrelevant for the actual evaluation. This is not entirely clear in Conneau et al. (2018), but can be verified by inspecting the dataset.
⁶ We use beam search instead of sampling decoding.
⁷ Note that the standard deviations are around 0.3.

                XNLI dev      Our dataset
                OR     HT     OR     HT     MT
Model    Train  (en)   (es)   (es)   (en)   (en)
ROBERTA  ORIG   92.1   -      -      78.7   79.0
         BT-ES  91.9   -      -      80.3   80.5
         BT-FI  91.4   -      -      80.5   80.5
XLM-R    ORIG   90.5   85.5   81.0   77.5   78.5
         BT-ES  90.3   87.1   81.4   78.6   79.4
         BT-FI  89.7   86.5   80.8   78.8   79.2
         MT-ES  90.2   87.5   81.3   78.4   78.9
         MT-FI  90.4   87.1   81.1   78.3   78.9
Table 2: NLI results on original (OR), human translated (HT) and machine translated (MT) sets (acc). BT-XX and MT-XX outperform ORIG in translated sets, but do not get any clear improvement in original ones.

                Competence    Distraction          Noise
Model    Train  AT     NR     WO     NG     LN     SE
ROBERTA  ORIG   72.9   65.7   64.9   59.1   88.4   86.5
         BT-FI  56.6   57.2   80.6   67.8   87.7   86.6
XLM-R    ORIG   78.4   56.8   67.3   61.2   86.8   85.3
         BT-FI  60.6   51.7   76.7   64.6   86.2   85.4
         MT-FI  64.3   50.3   77.8   68.5   86.4   85.3
Table 3: NLI Stress Test results (combined matched & mismatched acc). AT = antonymy, NR = numerical reasoning, WO = word overlap, NG = negation, LN = length mismatch, SE = spelling error. BT-FI and MT-FI are considerably weaker than ORIG in the competence test, but substantially stronger in the distraction test.

4.4 Stress tests
In order to better understand how systems trained on original and translated data differ, we run additional experiments on the NLI Stress Tests (Naik et al., 2018), which were designed to test the robustness of NLI models to specific linguistic phenomena in English. The benchmark consists of a competence test, which evaluates the ability to understand antonymy relation and perform numerical reasoning, a distraction test, which evaluates the robustness to shallow patterns like lexical overlap and the presence of negation words, and a noise test, which evaluates robustness to spelling errors. Just as with previous experiments, we report results for the best epoch checkpoint in each test set.
As shown in Table 3, ORIG outperforms BT-FI and MT-FI on the competence test by a large margin, but the opposite is true on the distraction test.⁸ In particular, our results show that BT-FI and MT-FI are less reliant on lexical overlap and the presence of negative words. This feels intuitive, as translating the premise and hypothesis independently (as BT-FI and MT-FI do) is likely to reduce the lexical overlap between them. More generally, the translation process can alter similar superficial patterns in the data, which NLI models are sensitive to (§2). This would explain why the resulting models have a different behavior on different stress tests.
⁸ We observe similar trends for BT-ES and MT-ES, but omit these results for conciseness.
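To make the notion of lexical overlap concrete, the following sketch measures the fraction of hypothesis tokens that also occur in the premise. The measure itself (lowercased whitespace tokens, hypothesis-side ratio) is an illustrative assumption and not necessarily the one used in the works cited above; the example sentences in the usage note are invented.

```python
def lexical_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    shared = sum(token in premise_tokens for token in hypothesis_tokens)
    return shared / len(hypothesis_tokens)
```

For an original pair such as "The man is sleeping" / "The man is awake" the overlap is 0.75. If the two sides were translated independently and came back as "The man sleeps" / "The guy is awake", the overlap would drop to 0.25 even though the meaning of each sentence is preserved.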
4.5 Output class distribution
With the aim to understand the effect of the previous phenomenon in cross-lingual settings, we look at the output class distribution of our different models in the XNLI development set. As shown in Table 4, the predictions of all systems are close to the true class distribution in the case of English. Nevertheless, ORIG is strongly biased for the rest of languages, and tends to underpredict entailment and overpredict neutral. This can again be attributed to the fact that the English test set is original, whereas the rest are human translated. In particular, it is well-known that NLI models tend to predict entailment when there is a high lexical overlap between the premise and the hypothesis (§2). However, the degree of overlap will be smaller in the human translated test sets given that the premise and the hypothesis were translated independently, which explains why entailment is underpredicted. In contrast, BT-FI and MT-FI are exposed to the exact same phenomenon during training, which explains why they are not that heavily affected.
So as to measure the impact of this phenomenon, we explore a simple approach to correct this bias: having fine-tuned each model, we adjust the bias term added to the logit of each class so the model predictions match the true class distribution for each language.⁹ As shown in Table 5, this brings large improvements for ORIG, but is less effective for BT-FI and MT-FI.¹⁰ This shows that the performance of ORIG was considerably hindered by this bias, which BT-FI and MT-FI effectively mitigate.
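The bias adjustment can be sketched as a coordinate-wise procedure: visit one class at a time and set its bias so that the class is predicted for the desired fraction of examples, holding the other biases fixed. The paper only describes the procedure at this level, so the details below (midpoint thresholds, number of rounds, plain Python lists for the logits) are assumptions made for illustration.

```python
from typing import List, Sequence


def predict(logits: Sequence[Sequence[float]], bias: Sequence[float]) -> List[int]:
    """Argmax over bias-shifted logits for every example."""
    return [max(range(len(bias)), key=lambda c: row[c] + bias[c]) for row in logits]


def fit_class_bias(
    logits: Sequence[Sequence[float]],
    target_dist: Sequence[float],
    rounds: int = 3,  # assumed number of passes over the classes
) -> List[float]:
    """Fit per-class biases so predictions approximately match target_dist."""
    n, k = len(logits), len(target_dist)
    bias = [0.0] * k
    for _ in range(rounds):
        for c in range(k):
            # Margin of class c over its best competitor, ignoring c's own bias.
            margins = sorted(
                (row[c] - max(row[j] + bias[j] for j in range(k) if j != c)
                 for row in logits),
                reverse=True,
            )
            want = round(target_dist[c] * n)  # examples that should get class c
            if want <= 0:
                bias[c] = -margins[0] - 1.0
            elif want >= n:
                bias[c] = -margins[-1] + 1.0
            else:
                # Threshold halfway between the want-th and (want+1)-th margins.
                bias[c] = -(margins[want - 1] + margins[want]) / 2.0
    return bias
```

As footnote 10 of the paper points out, fitting the biases on the evaluation set itself requires knowing its class distribution, so a fair comparison fits them on a separate validation set.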
⁹ We achieve this using an iterative procedure where, at each step, we select one class and set its bias term so the class is selected for the right percentage of examples.
¹⁰ Note that we are adjusting the bias term in the evaluation set itself, which requires knowing its class distribution and is thus a form of cheating. While useful for analysis, a fair comparison would require adjusting the bias term in a separate validation set. This is what we do for our final results in §4.6, where we adjust the bias term in the XNLI development set and report results on the XNLI test set.

                                  EN                  EN→XX (avg)
Model                     Train   ent   neu   con     ent   neu   con
ROBERTA (translate-test)  ORIG    33.4  32.8  33.8    23.2  40.7  36.1
                          BT-FI   34.5  31.9  33.6    30.2  35.7  34.1
XLM-R (zero-shot)         ORIG    32.4  33.2  34.4    27.0  37.8  35.2
                          BT-FI   34.3  31.6  34.1    33.1  32.9  34.0
                          MT-FI   33.6  32.6  33.9    30.8  35.3  33.9
Gold Standard                     33.3  33.3  33.3    33.3  33.3  33.3
Table 4: Output class distribution on XNLI dev. All systems are close to the true distribution in English, but ORIG is biased toward neu and con in the transfer languages. BT-FI and MT-FI alleviate this issue.

Model                     Train   Base        Unbias      +∆
ROBERTA (translate-test)  ORIG    77.7 ±0.6   80.6 ±0.2   2.9 ±0.5
                          BT-FI   82.3 ±0.2   82.8 ±0.1   0.4 ±0.2
XLM-R (zero-shot)         ORIG    81.0 ±0.2   82.4 ±0.2   1.4 ±0.3
                          BT-FI   83.2 ±0.1   83.3 ±0.1   0.1 ±0.1
                          MT-FI   83.4 ±0.2   83.8 ±0.1   0.4 ±0.2
Table 5: XNLI dev results with class distribution unbiasing (average acc across all languages). Adjusting the bias term of the classifier to match the true class distribution brings large improvements for ORIG, but is less effective for BT-FI and MT-FI.

4.6 Comparison with the state-of-the-art
So as to put our results into perspective, we compare our best variant to previous work on the XNLI test set. As shown in Table 6, our method improves the state-of-the-art for both the TRANSLATE-TEST and the ZERO-SHOT approaches by 4.3 and 2.8 points, respectively. It also obtains the best overall results published to date, with the additional advantage that the previous state-of-the-art required a machine translation system between English and each of the 14 target languages, whereas our method uses a single machine translation system between English and Finnish (which is not one of the target languages). While the main goal of our work is not to design better cross-lingual models, but to analyze their behavior in connection to translation, this shows that the phenomenon under study is highly relevant, to the extent that it can be exploited to improve the state-of-the-art.

Model                             en    fr    es    de    el    bg    ru    tr    ar    vi    th    zh    hi    sw    ur    avg
Fine-tune an English model and machine translate the test set into English (TRANSLATE-TEST)
BERT (Devlin et al., 2019)        88.8  81.4  82.3  80.1  80.3  80.9  76.2  76.0  75.4  72.0  71.9  75.6  70.0  65.8  65.8  76.2
Roberta (Liu et al., 2019)        91.3  82.9  84.3  81.2  81.7  83.1  78.3  76.8  76.6  74.2  74.1  77.5  70.9  66.7  66.8  77.8
Proposed (ROBERTA – BT-FI)        90.6  85.4  86.3  84.3  85.2  85.7  82.3  80.6  81.5  77.8  78.6  81.2  77.1  73.5  72.3  81.5
+ Unbiasing (tuned in dev)        90.5  85.8  86.6  84.6  85.5  85.8  82.9  81.2  82.3  78.7  79.7  82.3  77.6  74.4  72.9  82.1
Fine-tune a multilingual model on all machine translated training sets (TRANSLATE-TRAIN-ALL)
Unicoder (Huang et al., 2019)     85.6  81.1  82.3  80.9  79.5  81.4  79.7  76.8  78.2  77.9  77.1  80.5  73.4  73.8  69.6  78.5
XLM-R (Conneau et al., 2020)      88.7  85.2  85.6  84.6  83.6  85.5  82.4  81.6  80.9  83.4  80.9  83.3  79.8  75.9  74.3  82.4
Fine-tune a multilingual model on the English training set (ZERO-SHOT)
mBERT (Devlin et al., 2019)       82.1  73.8  74.3  71.1  66.4  68.9  69.0  61.6  64.9  69.5  55.8  69.3  60.0  50.4  58.0  66.3
XLM (Conneau and Lample, 2019)    85.0  78.7  78.9  77.8  76.6  77.4  75.3  72.5  73.1  76.1  73.2  76.5  69.6  68.4  67.3  75.1
Unicoder (Huang et al., 2019)     85.1  79.0  79.4  77.8  77.2  77.2  76.3  72.8  73.5  76.4  73.6  76.2  69.4  69.7  66.7  75.4
XLM-R (Conneau et al., 2020)      88.8  83.6  84.2  82.7  82.3  83.1  80.1  79.0  78.8  79.7  78.6  80.2  75.8  72.0  71.7  80.1
Proposed (XLM-R – MT-FI)          88.8  84.8  85.7  84.6  84.2  85.7  82.9  81.8  82.0  82.1  79.9  81.8  79.8  75.9  76.7  82.4
+ Unbiasing (tuned in dev)        88.7  85.0  86.1  84.8  84.8  86.1  83.5  82.2  82.4  83.0  80.8  82.6  80.3  76.0  77.3  82.9
Table 6: XNLI test results (acc). Results for other methods are taken from their respective papers or, if not provided, from Conneau et al. (2020). For those with multiple variants, we select the one with the best results.

5 QA experiments
So as to understand whether our previous findings apply to other tasks besides NLI, we run additional experiments on QA. As shown in Table 7, BT-FI and BT-ES do indeed outperform ORIG for the TRANSLATE-TEST approach on MLQA. The improvement is modest, but very consistent across different languages, models and runs. The results for MT-ES and MT-FI are less conclusive, presumably because mapping the answer spans across languages might introduce some noise. In contrast, we do not observe any clear improvement for the ZERO-SHOT approach on this dataset. Our XQuAD results in Table 8 are more positive, but still inconclusive.
These results can partly be explained by the translation procedure used to create the different benchmarks: the premises and hypotheses of XNLI were translated independently, whereas the questions and context paragraphs of XQuAD were translated together. Similarly, MLQA made use of parallel contexts, and translators were shown the sentence containing each answer when translating the corresponding question. As a result, one can expect both QA benchmarks to have more consistent translations than XNLI, which would in turn diminish this phenomenon. In contrast, the questions and context paragraphs are independently translated when using machine translation, which explains why BT-ES and BT-FI outperform ORIG for the TRANSLATE-TEST approach. We conclude that the translation artifacts revealed by our analysis are not exclusive to NLI, as they also show up on QA for the TRANSLATE-TEST approach, but their actual impact can be highly dependent on the translation procedure used and the nature of the task.
6 Discussion
Our analysis prompts us to reconsider previous findings in cross-lingual transfer learning as follows:
The cross-lingual transfer gap on XNLI was overestimated. Given the parallel nature of XNLI, accuracy differences across languages are commonly interpreted as the loss of performance when generalizing from English to the rest of languages. However, our work shows that there is another factor that can have a much larger impact: the loss of performance when generalizing from original to translated data. Our results suggest that the real cross-lingual generalization ability of XLM-R is considerably better than what the accuracy numbers in XNLI reflect.
Overcoming the cross-lingual gap is not what makes TRANSLATE-TRAIN work. The original motivation for TRANSLATE-TRAIN was to train the model on the same language it is tested on. However, we show that it is training on translated data, rather than training on the target language, that is key for this approach to outperform ZERO-SHOT as reported by previous authors.
Improvements previously attributed to data augmentation should be reconsidered. The method by Singh et al. (2019) combines machine translated premises and hypotheses in different languages (§2), resulting in an effect similar to BT-XX and MT-XX. As such, we believe that this method should be analyzed from the point of view of dataset artifacts rather than data augmentation, as the authors do.¹¹ From this perspective, having the premise and the hypotheses in different languages can reduce the superficial patterns between them, which would explain why this approach is better than using examples in a single language.
¹¹ Recall that our experimental design prevents a data augmentation effect, in that the number of unique sentences and examples used for training is always the same (§3.2).

Model    Train  en         es         de         ar         vi         zh         hi         avg
Test set machine translated into English (TRANSLATE-TEST)
ROBERTA  ORIG   84.7/71.4  70.1/49.7  60.5/41.2  55.7/32.5  65.6/40.8  53.5/26.0  42.7/20.7  61.8 ±0.1 / 40.3 ±0.2
         BT-ES  84.4/71.2  70.9/50.7  61.0/41.6  56.5/33.3  66.7/41.8  54.4/27.1  43.0/21.1  62.4 ±0.1 / 41.0 ±0.2
         BT-FI  83.8/70.4  70.3/50.1  61.1/41.9  56.5/33.4  66.8/42.1  54.9/27.5  42.8/21.3  62.3 ±0.1 / 40.9 ±0.2
XLM-R    ORIG   84.1/71.0  69.9/49.2  60.8/42.5  55.2/31.8  65.4/40.6  54.3/27.8  43.6/21.3  61.9 ±0.1 / 40.6 ±0.1
         BT-ES  83.8/70.8  70.5/50.0  61.4/43.5  56.1/33.1  66.5/41.6  55.4/29.0  44.0/22.2  62.5 ±0.2 / 41.5 ±0.2
         BT-FI  82.7/69.6  70.0/49.7  61.1/43.3  56.0/33.1  66.2/41.5  55.6/29.2  43.7/22.0  62.2 ±0.1 / 41.2 ±0.2
         MT-ES  83.4/69.7  70.0/49.1  61.0/42.7  55.6/32.2  65.9/40.9  54.9/28.1  43.9/21.6  62.1 ±0.3 / 40.6 ±0.2
         MT-FI  82.6/69.0  69.7/48.6  61.0/42.8  55.7/32.3  65.8/40.9  54.8/27.9  43.9/21.6  61.9 ±0.3 / 40.4 ±0.2
Test set in target language (ZERO-SHOT)
XLM-R    ORIG   84.1/71.0  74.5/56.3  70.3/55.1  66.5/45.9  74.3/53.1  67.8/43.4  71.6/53.4  72.7 ±0.1 / 54.0 ±0.1
         BT-ES  83.8/70.8  74.7/56.8  70.3/55.2  66.9/46.5  74.3/53.0  68.2/43.8  71.4/53.6  72.8 ±0.2 / 54.3 ±0.2
         BT-FI  82.7/69.6  74.1/56.3  69.8/54.5  66.6/46.0  73.3/52.3  67.9/43.4  71.0/53.2  72.2 ±0.2 / 53.6 ±0.2
         MT-ES  83.4/69.7  75.2/57.3  70.5/55.1  67.5/46.5  74.5/53.2  67.5/42.5  71.7/52.7  72.9 ±0.3 / 53.9 ±0.4
         MT-FI  82.6/69.0  74.1/56.0  70.2/54.6  66.9/46.0  73.7/52.6  67.2/41.5  71.9/53.4  72.4 ±0.2 / 53.3 ±0.4
Table 7: MLQA test results (F1 / exact match).

Model              Train  en    es    de    el    ru    tr    ar    vi    th    zh    hi    avg
XLM-R (zero-shot)  ORIG   88.2  82.7  80.8  80.9  80.1  76.1  76.0  80.1  75.4  71.9  76.4  79.0 ±0.2
                   BT-ES  87.9  83.5  80.5  81.2  80.7  76.8  77.4  80.2  76.4  73.0  76.9  79.5 ±0.3
                   BT-FI  87.1  82.5  80.2  80.7  79.8  75.7  76.6  79.4  75.7  71.5  76.8  78.7 ±0.3
                   MT-ES  87.1  84.1  80.3  81.2  80.1  76.0  77.4  80.9  76.7  72.7  77.1  79.4 ±0.3
                   MT-FI  86.3  81.4  80.2  80.5  80.2  76.6  77.0  80.3  77.6  74.5  77.8  79.3 ±0.2

Table 8: XQuAD results (F1). Results for the exact match metric are similar.

The potential of TRANSLATE-TEST was underestimated.
The previous best results for TRANSLATE-TEST on XNLI lagged behind the
state-of-the-art by 4.6 points. Our work reduces this gap to only 0.8 points
by addressing the underlying translation artifacts. The reason why
TRANSLATE-TEST is more severely affected by this phenomenon is twofold:
(i) the effect is doubled by first using human translation to create the test
set and then machine translation to translate it back to English, and
(ii) TRANSLATE-TRAIN was inadvertently mitigating this issue (see above), but
equivalent techniques were never applied to TRANSLATE-TEST.
Future evaluation should better account for translation artifacts.
The evaluation issues raised by our analysis do not have a simple solution.
In fact, while we use the term translation artifacts to highlight that they
are an unintended effect of translation that impacts final evaluation, one
could also argue that it is the original datasets that contain the artifacts,
which translation simply alters or even mitigates.¹² In any case, this is a
more general issue that falls beyond the scope of cross-lingual transfer
learning, so we argue that it should be carefully controlled when evaluating
cross-lingual models. In the absence of more robust datasets, we recommend
that future multilingual benchmarks should at least provide consistent test
sets for English and the rest of the languages. This can be achieved by
(i) using original annotations in all languages, (ii) using original
annotations in a non-English language and translating them into English and
other languages, or (iii) if translating from English, doing so at the
document level to minimize translation inconsistencies.

¹² For instance, the high lexical overlap observed for the entailment class is
usually regarded as a spurious pattern, so reducing it could be considered a
positive effect of translation.

7 Conclusions
In this paper, we have shown that both human and machine translation can alter
superficial patterns in data, which requires reconsidering previous findings
in cross-lingual transfer learning. Based on the gained insights, we have
improved the state-of-the-art in XNLI for the TRANSLATE-TEST and ZERO-SHOT
approaches by a substantial margin. Finally, we have shown that the phenomenon
is not specific to NLI but also affects QA, although it is less pronounced
there thanks to the translation procedure used in the corresponding
benchmarks. So as to facilitate similar studies in the future, we release our
NLI dataset,¹³ which, unlike previous benchmarks, was annotated in a
non-English language and human translated into English.

Acknowledgments
We thank Nora Aranberri and Uxoa Iñurrieta for helpful discussion during the
development of this work, as well as the rest of our colleagues from the IXA
group that worked as annotators for our NLI dataset.
This research was partially funded by a Facebook Fellowship, the Basque
Government excellence research group (IT1343-19), the Spanish MINECO (UnsupMT
TIN2017-91692-EXP MCIU/AEI/FEDER, UE), Project BigKnowledge (Ayudas Fundación
BBVA a equipos de investigación científica 2018), and the NVIDIA GPU grant
program.
This research is supported via the BETTER Program contract #2019-19051600006
(ODNI, IARPA activity). The views and conclusions contained herein are those
of the authors and should not be interpreted as necessarily representing the
official policies, either expressed or implied, of ODNI, IARPA, or the U.S.
Government. The U.S. Government is authorized to reproduce and distribute
reprints for governmental purposes notwithstanding any copyright annotation
therein.

References
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual
transferability of monolingual representations. In Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, pages
4623–4637. Association for Computational Linguistics.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark
Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin
Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos
Zampieri. 2019. Findings of the 2019 Conference on Machine Translation
(WMT19). In Proceedings of the Fourth Conference on Machine Translation
(Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy.
Association for Computational Linguistics.
Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry
Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia
Specia. 2013. Findings of the 2013 Workshop on Statistical Machine
Translation. In Proceedings of the Eighth Workshop on Statistical Machine
Translation, pages 1–44, Sofia, Bulgaria. Association for Computational
Linguistics.

¹³ https://github.com/artetxem/esxnli

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning.
2015. A large annotated corpus for learning natural language inference. In
Proceedings of the 2015 Conference on Empirical Methods in Natural Language
Processing, pages 632–642, Lisbon, Portugal. Association for Computational
Linguistics.
Christian Buck, Kenneth Heafield, and Bas van Ooyen. 2014. N-gram counts and
language models from the Common Crawl. In Proceedings of the Ninth
International Conference on Language Resources and Evaluation (LREC'14), pages
3579–3584, Reykjavik, Iceland. European Language Resources Association (ELRA).
Casimiro Pio Carrino, Marta R. Costa-jussà, and José A. R. Fonollosa. 2020.
Automatic Spanish translation of the SQuAD dataset for multi-lingual question
answering. In Proceedings of the 12th Language Resources and Evaluation
Conference, pages 5515–5523, Marseille, France. European Language Resources
Association.
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom
Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A
benchmark for information-seeking question answering in typologically diverse
languages. Transactions of the Association for Computational Linguistics,
8:454–470.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume
Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and
Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at
scale. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 8440–8451. Association for Computational
Linguistics.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model
pretraining. In Advances in Neural Information Processing Systems 32, pages
7059–7069.
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman,
Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual
sentence representations. In Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium.
Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
Pre-training of deep bidirectional transformers for language understanding. In
Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers),