[Table columns shown as graphics on the poster: allele frequency heatmap per superpopulation (AFR, AMR, EAS, EUR, SAS); representation of the closest allele (peptide count).]
Allele | Notes | Polymorphic distance / polymorphisms | Closest allele
HLA-A*33:03 | Present in 2 Multiple Allele cell lines | 2 / Q62R, E63N | HLA-A*31:01
HLA-B*53:01 | Not present in any part of the training data | 3 / S77N, N80I, L81A | HLA-B*35:01
HLA-C*08:01 | Present in 2 Multiple Allele cell lines | 2 / E152T, R156L | HLA-C*08:02
HLA-C*03:02 | Not present in any part of the training data | 2 / I95L, Y116S | HLA-C*03:03
HLA-B*40:06 | Not present in any part of the training data | 2 / L95W, S97T | HLA-B*40:02
HLA-A*02:11 | Not present in any part of the training data | 2 / T73I, H74D | HLA-A*02:01
HLA-C*07:18 | Not present in any part of the training data | 2 / K66N, S99Y | HLA-C*07:02
HLA-A*74:01 | Not present in any part of the training data | 2 / T9F, I73T | HLA-A*31:01
HLA-C*07:06 | Present in 1 Multiple Allele cell line | 2 / K66N, S99Y | HLA-C*07:02
HLA-A*02:02 | Not present in any part of the training data | 2 / V95L, L156W | HLA-A*02:01
HLA-C*15:05 | Present in 1 Multiple Allele cell line | 1 / L116F | HLA-C*15:02
HLA-A*34:02 | Not present in any part of the training data | 3 / F9Y, Q62R, E63N | HLA-A*03:01
HLA-A*36:01 | Not present in any part of the training data | 2 / R163T, G167W | HLA-A*01:01
HLA-A*33:01 | Present in 3 Multiple Allele cell lines | 3 / Q62R, E63N, Y171H | HLA-A*31:01
HLA-B*38:02 | Not present in any part of the training data | 10 / D9Y, F67C, D74Y, S77N, N80T, L81A, S97R, Y116F, D156L, A158T | HLA-B*08:01
Conclusions
Visualising the considerable bias within various immunological datasets suggests that targeted data gathering is required to redress the balance and improve the generalisability of methods trained on them. Whilst the majority of human alleles are within 3 amino acid changes of the NetMHCPan pseudosequence, many non-human alleles are at greater distances, and a significant number of HLA-A and HLA-B alleles are at higher distances.
One strategy to prioritise data gathering to extend knowledge would be to view it through the lens of polymorphic distance rather than the addition of alleles, which may be more significant in terms of frequency in underrepresented ancestries and may not add data relevant to enhancing the generalisation of the algorithms.
[Figure panel: individuals on the trial (Year 1, Year 2), by locus (HLA-A, HLA-B, HLA-C); axis: peptide count, 0 to 250k.]
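[Table columns shown as graphics on the poster: allele frequency heatmap per superpopulation (AFR, AMR, EAS, EUR, SAS); representation of the closest allele (peptide count).]
Allele | Notes | Polymorphic distance / polymorphisms | Closest allele
HLA-B*40:10 | Present in 5 Multiple Allele cell lines | 2 / H9Y, T24A | HLA-B*40:01
HLA-C*04:03 | Not present in any part of the training data | 1 / S9Y | HLA-C*04:01
HLA-C*18:01 | Not present in any part of the training data | 2 / S9D, A24S | HLA-C*04:01
HLA-B*07:05 | Not present in any part of the training data | 1 / D114N | HLA-B*07:02
HLA-B*57:04 | Not present in any part of the training data | 2 / S116D, L156R | HLA-B*57:01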
Figure 3a. Alleles present in LEGACY trialists, but absent from the NetMHCPan 4.1 SA EL training dataset.
Figure 3b. NetMHCPan 4.1 SA EL representation of the individuals in the trial. Only one individual in Year 1 had all six alleles represented in the training data. Three individuals had only two alleles represented.
Figure 3c. Loadable tetramer availability of alleles in the trial.
Understanding ancestry biases in HLA-related datasets.
Implications for machine learning predictors and healthcare equity.
Christopher Thorpe (1) and Ellen McDonagh (1)
1. European Bioinformatics Institute, Hinxton.
[Figure panel columns: Superpopulations (AFR, AMR, EAS, EUR, SAS); Allele; Count of peptides in the NetMHCPan Single Allele Elution Training Dataset. Scale ticks: 0 to 300 and 0.5k to 250k.]
Acknowledgments
We would like to acknowledge the LEGACY Network for providing tissue typing data (Figure 3b) and the use case for this analysis. We also acknowledge Tim Elliott, Malcolm Sim, Benny Chain, Andreas Tiffeau-Mayer, Hashem Koohy and his group, and the Qimmuno London community for valuable feedback and support on earlier versions of the work.
Introduction
Machine learning algorithms are mirrors of the data they are trained on. They need abundant, well-labelled, highly diverse training data, ideally with both positive and negative data. Even many state-of-the-art machine learning models generalise poorly away from their training data.
Whilst some datasets, such as those for the peptide-binding predictor NetMHCPan, are collected expressly for training the algorithm, many others consist of “exhaust fume” data collected over time in scientific studies. For a variety of reasons, none of them the fault of the data collectors and curators, the training data is often significantly biased towards alleles present in White western populations or those related to specific disease contexts, e.g., Abacavir sensitivity in HLA-B*57:01 carriers.
The bias in immunological training datasets can affect the equity of therapies based on them. Whilst predictive methods cover a wide range of alleles and specificities due to their similarity to alleles in the training data, many alleles with high polymorphic distances in the residues comprising the peptide-binding site remain unstudied.
Methods
Openly available data on peptides bound to HLA Class I molecules was downloaded from the websites of NetMHCPan, MHC Motif Atlas and the Immune Epitope Database (IEDB). Structural information was obtained from the Protein Data Bank in Europe and processed using existing pipelines. HLA Class I sequences were retrieved from the Immuno Polymorphism Database (IPD) and the HLA typing data were retrieved from the One Thousand Genomes project. Information on tetramer availability was gathered from the websites of the suppliers.
For each dataset, the number of items relating to a particular allele was counted using a Python script. The frequency of alleles in different superpopulations was estimated using the HLA typing data from the 1000 Genomes Project. All graphs were created using Matplotlib and Seaborn.
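The counting and frequency steps described above can be sketched as follows. This is a minimal illustration, not the script used for the poster: the function names, the `(allele, peptide)` record layout, the `(superpopulation, allele_1, allele_2)` typing rows and the toy values are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def count_items_per_allele(records):
    """Count how many records (e.g. peptides) are attached to each allele.

    `records` is an iterable of (allele, item) pairs (hypothetical layout).
    """
    return Counter(allele for allele, _item in records)


def allele_frequencies(typing_rows):
    """Estimate allele frequencies per superpopulation for one locus.

    Each row is (superpopulation, allele_1, allele_2) for one individual,
    so every individual contributes two allele copies (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for superpop, allele_1, allele_2 in typing_rows:
        counts[superpop].update([allele_1, allele_2])
    return {
        superpop: {a: n / sum(c.values()) for a, n in c.items()}
        for superpop, c in counts.items()
    }


# Toy inputs, invented purely for illustration.
records = [
    ("HLA-A*02:01", "peptide-1"),
    ("HLA-A*02:01", "peptide-2"),
    ("HLA-A*03:01", "peptide-3"),
]
typing_rows = [
    ("EUR", "HLA-A*02:01", "HLA-A*03:01"),
    ("EUR", "HLA-A*02:01", "HLA-A*02:01"),
    ("EAS", "HLA-A*33:03", "HLA-A*02:01"),
]

peptide_counts = count_items_per_allele(records)
frequencies = allele_frequencies(typing_rows)
```

Counting allele copies rather than individuals means a homozygous individual contributes twice to the same allele, which is the usual convention for allele frequency.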
Results
All studied datasets exhibit a classical “long-tail” distribution relating to the number of peptides per allele (Figure 1). The MHC Motif Atlas dataset has a shorter tail and fewer alleles as it is a distinct dataset comprised of individual immunopeptidomics experiments. The NetMHCPan dataset contains most of the IEDB data (before the NetMHCPan dataset generation cutoff). All of the datasets have HLA-A*02:01 as the predominant allele, apart from the MHC Motif Atlas data, which has HLA-A*03:01. The structure dataset has the smallest number of records, due to the complexity of data generation, and also the largest disparity/bias, with HLA-A*02:01 making up over 40% of the data. This has significant implications for structural prediction methods.
For a deeper analysis, we have focused on NetMHCPan, not to criticise that algorithm but to highlight bias in the training data, given its prevalence and deep integration into the fabric of modern-day immunological research. Out of 130 alleles in the NetMHCPan 4.1 Single Allele Elution training dataset, over half have fewer than 10k peptides. Fifty have fewer than 1000 peptides, and nearly thirty have fewer than 250 peptides. In contrast, HLA-A*02:01 has 265,252.
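The threshold counts and the top-N crossover can be illustrated with a short sketch. It assumes the crossover is the smallest N at which the N best-represented alleles hold at least as many peptides as all remaining alleles combined; the poster does not spell out the definition, so treat that as one plausible reading. The function names and toy counts are hypothetical.

```python
def alleles_below(peptide_counts, threshold):
    """Number of alleles with fewer than `threshold` peptides."""
    return sum(1 for count in peptide_counts.values() if count < threshold)


def crossover_point(peptide_counts):
    """Smallest N where the top-N alleles hold at least as many peptides
    as the remaining alleles combined (assumed definition)."""
    ranked = sorted(peptide_counts.values(), reverse=True)
    total = sum(ranked)
    running = 0
    for n, count in enumerate(ranked, start=1):
        running += count
        if running >= total - running:
            return n
    return 0  # empty input


# Toy counts, invented for illustration (not the NetMHCPan data).
toy_counts = {"A": 400, "B": 250, "C": 150, "D": 120, "E": 80}

n_small = alleles_below(toy_counts, 200)
n_cross = crossover_point(toy_counts)
```

With the toy counts, the two largest alleles hold 650 of 1000 peptides, so the crossover falls at N = 2, and three alleles sit below the 200-peptide threshold.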
This ARISE project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement number 945405.
Case study: The LEGACY Network.
Flu vaccination of individuals of diverse ancestry.
As part of the development of this analysis, we have been fortunate to collaborate with the LEGACY Network, which has tested a pentavalent flu vaccine with individuals of diverse ancestry. This experience has illustrated some potential shortcomings of current immunological workflows when understanding immunological responses in these individuals.
These include the lack of previously determined epitopes in the IEDB; less confidence in NetMHCPan predictions due to low representation (Figure 3b), where non-conservative or many substitutions occur; and less readily available reagents, such as loadable tetramers (Figure 3c).
Alleles present in these individuals are indicated with a small dot at the side of Figures 2a, 2d and 3a.
Figure 2a. The 130 alleles present in the NetMHCPan 4.1 Single Allele Elution (SA EL) training dataset in the context of allele frequency information per superpopulation in the 1000 Genomes Project HLA typing.
A white dot within the tile of the heatmap indicates that the allele is the predominant one in the superpopulation for that locus.
The grey box indicates the top 14 alleles in the training set.
Figure 1. Representation of peptides bound to a specific allele across four datasets.
Figure 2d. Alleles common in 1000 Genomes, missing in the NetMHCPan 4.1 SA EL training dataset. Of particular note is HLA-A*33:03, which is present at high frequencies in East Indian (25.7%), Indonesian (29.7%) and Malaysia Patani (36%) populations, a similar level to HLA-A*02:01 in Europe and North America.
Figure 2c. Number of IPD alleles at specific polymorphic distances (Hamming distance in the NetMHCPan pseudosequence) from a closest allele in the NetMHCPan 4.1 SA EL training dataset.
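A minimal sketch of the polymorphic distance calculation follows. It assumes equal-length pseudosequences given as strings, with `positions` listing the residue number for each pseudosequence column. The function names, the four-position toy pseudosequences and the allele names are hypothetical, and the `<closest residue><position><query residue>` labelling is an assumed reading of the polymorphism notation in the tables.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("pseudosequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def closest_training_allele(query_seq, training_seqs):
    """Return (allele_name, distance) for the nearest training allele.

    Ties are resolved in favour of the first allele encountered.
    """
    return min(
        ((name, hamming_distance(query_seq, seq))
         for name, seq in training_seqs.items()),
        key=lambda pair: pair[1],
    )


def list_polymorphisms(closest_seq, query_seq, positions):
    """Label each difference as <closest residue><position><query residue>."""
    return [
        f"{c}{pos}{q}"
        for c, q, pos in zip(closest_seq, query_seq, positions)
        if c != q
    ]


# Toy four-residue pseudosequences, invented for illustration only.
positions = [9, 62, 63, 66]
training_seqs = {"toy-allele-1": "YQEK", "toy-allele-2": "FGEK"}
query_seq = "YRNK"

closest_name, distance = closest_training_allele(query_seq, training_seqs)
polymorphisms = list_polymorphisms(training_seqs[closest_name], query_seq, positions)
```

Here the query differs from `toy-allele-1` at two positions and from `toy-allele-2` at three, so the first is reported as closest with distance 2.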
Figure 2b. The crossover between cumulative top-N alleles and the remainder of the training dataset.
[Legend: Top-N alleles (cumulative); Remaining alleles (cumulative); Crossover point (14). Axes: Number of top alleles; Cumulative peptide count.]
[Axis label: Number of providers]