Survival Stacking Ensemble Model for Lung Cancer Risk Prediction

Author: Alonso, Eduardo,Calle, Xabier,Gurrutxaga Goikoetxea, Ibai,Beristain Iraola, Andoni
Publisher: IOS Press
Year: 2024
DOI: 10.3233/SHTI241083
Source: https://addi.ehu.eus/bitstream/10810/72315/1/SHTI-321-SHTI241083.pdf
Survival Stacking Ensemble Model for Lung Cancer Risk Prediction
Eduardo ALONSO a,b,1, Xabier CALLE a, Ibai GURRUTXAGA b, Andoni BERISTAIN a
a Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia - San Sebastián, Spain
b Department of Computer Architecture and Technology, University of the Basque Country (UPV/EHU), Donostia - San Sebastián, Spain
ORCiD ID: Eduardo Alonso https://orcid.org/0009-0003-3984-549X, Xabier Calle https://orcid.org/0000-0001-5689-7433, Andoni Beristain https://orcid.org/0000-0002-5452-2141, Ibai Gurrutxaga https://orcid.org/0000-0003-1830-1058
Abstract. The most well-established risk factor for lung cancer (LC) is smoking, responsible for approximately 85% of cases. The Lung Cancer Risk Assessment Tool (LCRAT) is a key advancement in this field: it predicts individual risk based on factors such as smoking habits, demographic details, personal and family medical history, and environmental exposures. This paper proposes a model with fewer features that improves on state-of-the-art performance using a simplified stacking ensemble, making it more accessible and easier to implement in routine healthcare practice. The data used in this work were derived from two cohorts in the United States: the National Lung Screening Trial (NLST) and the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial. Our model and LCRAT achieve an AUC of 0.799 and 0.782 on the test set, respectively. In terms of the percentage of positives detected, at 50% of the population the two models detect 0.766 and 0.754 of the cases, respectively. The ensemble of different survival models enhances robustness by mitigating the weaknesses of individual models, which directly improves the efficiency and generalizability of the model.
Keywords. Cancer, risk factors, machine learning, ensemble models
1. Introduction
Lung cancer (LC), a predominant cause of cancer-related mortality globally, presents a significant public health challenge due to its high incidence and poor prognosis. The most well-established risk factor for LC is tobacco smoking, responsible for approximately 85% of cases [1].
In recent decades, extensive research efforts have been dedicated to combating LC. Among the various strategies developed, screening programs have emerged as a crucial tool in reducing LC mortality by enabling early detection of the disease. These programs are designed to identify individuals at high risk based on associated risk factors (family history, smoking, age…), allowing for timely intervention and treatment. A notable example is the Lung Cancer Risk Assessment Tool (LCRAT) [2], which employs a Cox model to provide individual-level risk assessments on smokers over 50 years or patients with previous respiratory conditions, called the high-risk population. This approach has significantly enhanced the early detection of LC cases, thereby improving patient outcomes and highlighting the importance of targeted screening in the fight against LC.

1 Corresponding Author: Eduardo Alonso; E-mail: ealonso@vicomtech.org.
Collaboration across Disciplines for the Health of People, Animals and Ecosystems
L. Stoicu-Tivadar et al. (Eds.)
© 2024 The Authors.
This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/SHTI241083

While the LCRAT has significantly advanced the early detection of LC, it is not without limitations. One major drawback is its limited predictive accuracy: it may fail to identify many high-risk individuals, leading to false positives or false negatives. Additionally, LCRAT's effectiveness is hindered by its reliance on input data quality; inaccurate or incomplete data can compromise its risk assessments. To address these limitations, recent advancements in survival analysis and machine learning [5] offer an opportunity to leverage more sophisticated and modern survival models and to select the most important known risk factors. This work aims to develop a stacking survival ensemble approach for the high-risk population that enhances predictive accuracy, better handles diverse and intricate risk factors, and improves the overall quality and reliability of LC risk assessments using a reduced number of predictive features.
2. Methods
2.1. Data Sources
The experiments used data from two large-scale cohorts in the United States: the National Lung Screening Trial (NLST) [3] and the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial [4]. Participants were recruited from multiple centers, and data were collected through structured questionnaires and follow-ups, following the appropriate clinical trial protocols validated by the corresponding ethics committee. The NLST was a randomized trial that, from 2002 to 2004, involved over 53,000 smokers aged 55 to 74 years with at least 30 pack-years smoked. It aimed to assess whether low-dose computed tomography could reduce LC mortality compared to standard chest X-rays. The PLCO trial, running from 1993 to 2001, was another randomized study with about 155,000 smoker participants aged 55 to 74 years. It evaluated the impact of specific cancer screening tests on cancer-related mortality. The PLCO data provided additional insights into LC screening efficacy and patient demographics. Table 1 reports the statistical characteristics of the cohorts.
Table 1. Statistical characteristics of the cohort arms. 'cig_years' represents the years smoking, 'cigpd_f' the cigarettes per day, 'cig_stop' the years since stopped smoking, 'lung_fh_cnt' the number of first-degree relatives with history of LC.

Feature/Arm    NLST CT          NLST X-Ray       PLCO Control     PLCO Radio
N              26627            26621            40064            40590
age            61.42 ± 5.02     61.41 ± 5.01     62.45 ± 5.31     62.38 ± 5.28
cig_years      39.83 ± 7.34     39.86 ± 7.33     27.76 ± 13.81    27.59 ± 13.85
cigpd_f        28.47 ± 11.44    28.42 ± 11.51    19.5 ± 13.69     19.26 ± 13.52
cig_stop       3.75 ± 5.0       3.74 ± 5.0       16.1 ± 13.47     16.2 ± 13.46
lung_fh_cnt    0.24 ± 0.52      0.24 ± 0.51      0.12 ± 0.37      0.13 ± 0.37
bmi            27.89 ± 5.03     27.9 ± 5.07      27.35 ± 4.83     27.39 ± 4.88
sex (male)     15725 (59.1%)    15698 (58.9%)    23210 (57.9%)    23701 (58.4%)
Positives      1079 (4.0%)      964 (3.6%)       1604 (4.0%)      1705 (4.2%)

2.2. Model Architecture
To predict time-dependent LC risk, we designed a stacked ensemble model tailored to include LC-related risk factors as input data. The ensemble employs a dual-phase strategy. In the first phase, multiple individual survival analysis models, including Cox Proportional Hazards (CoxPH), CoxNet, Extra Survival Trees, Gradient Boosting Survival Analysis, and Survival Support Vector Machine (SVM), independently produce predictions based on the input data. In the second phase, these individual predictions are fed as input variables to a final CoxPH model to predict the time-dependent risk of developing LC. In this stacked ensemble approach, each base model first predicts the risk for each sample independently. These individual risk predictions are then used as input features for training the meta-model. Specifically, the meta-model learns to combine these risk predictions to produce a final, refined risk prediction. This method allows the meta-model to leverage the strengths and unique insights of each base model.
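
The two-phase idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the base learners are simple stand-ins (Cox models fitted on hypothetical feature subsets rather than CoxNet, tree ensembles or an SVM), the Cox fit is a plain gradient-ascent routine written for this example, and the base-model scores are computed in-sample, whereas out-of-fold scores are a common choice to limit leakage into the meta-model.

```python
import numpy as np

def cox_nll(beta, X, time, event):
    """Negative Cox partial log-likelihood (assumes no tied event times)."""
    order = np.argsort(-time)                   # latest times first
    lp = X[order] @ beta                        # linear predictor
    log_risk_set = np.logaddexp.accumulate(lp)  # log-sum-exp over everyone still at risk
    return -np.sum((lp - log_risk_set)[event[order] == 1])

def fit_cox(X, time, event, lr=0.1, n_iter=500):
    """Fit Cox coefficients by plain gradient ascent on the partial likelihood."""
    order = np.argsort(-time)
    X, event = X[order], event[order]
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = np.exp(X @ beta)
        cum_w = np.cumsum(w)
        cum_wx = np.cumsum(w[:, None] * X, axis=0)
        # each event contributes x_i minus the risk-set weighted mean of x
        grad = ((X - cum_wx / cum_w[:, None]) * event[:, None]).sum(axis=0)
        beta += lr * grad / len(event)
    return beta

# Synthetic cohort (assumption: three standardized features, censoring at t = 2).
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
true_lp = 0.8 * X[:, 0] + 0.5 * X[:, 1]
t_event = rng.exponential(1.0 / np.exp(true_lp))
time = np.minimum(t_event, 2.0)
event = (t_event <= 2.0).astype(float)

# Phase 1: each base model produces one risk score per sample.
feature_subsets = [[0], [1], [0, 1, 2]]         # hypothetical stand-ins for diverse learners
base_betas = [fit_cox(X[:, cols], time, event) for cols in feature_subsets]
base_risk = np.column_stack(
    [X[:, cols] @ b for cols, b in zip(feature_subsets, base_betas)]
)

# Phase 2: a CoxPH meta-model combines the base risk scores.
meta_beta = fit_cox(base_risk, time, event)
final_risk = base_risk @ meta_beta
print("meta-model weights:", np.round(meta_beta, 3))
```

The meta-model weights indicate how much each base score contributes to the final linear predictor; turning that predictor into an absolute risk at a given horizon would additionally require a baseline hazard estimate, which this sketch omits.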
2.3. Training Workflow
The training workflow for our models involves a detailed and systematic process to ensure robust and reliable performance. The models are trained using the PLCO dataset, while validation is performed with the NLST dataset. This approach leverages the strengths of both datasets and ensures that our models generalize well across different cohorts. Within the PLCO dataset, individuals from the control arm are used for training the models, while individuals who underwent radiographic screening are used for testing.
During the training workflow, a preprocessing pipeline is built with several data transformation steps and a final estimator. This pipeline is used during the training and validation steps, ensuring that every transformation is consistently applied to both the validation and prediction data. All transformations follow the same procedures as LCRAT so that the two models remain comparable. Missing numerical values are imputed with the mean value of the training set, while missing categorical values are imputed with the most frequent value. The next steps are the standardization of numerical features and the encoding of categorical features; the final step is the AI model itself.
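
The key property of such a pipeline is that all statistics are learned from the training rows only and then reused unchanged on other data. The sketch below illustrates this with the standard library alone; the function names, the one-hot encoding choice and the tiny records are assumptions made for the example and are not taken from the paper.

```python
from collections import Counter
from statistics import mean, pstdev

def fit_preprocessor(rows, numeric_cols, categorical_cols):
    """Learn imputation, scaling and encoding statistics from training rows only."""
    stats = {"num": {}, "cat": {}}
    for col in numeric_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        mu = mean(observed)                               # imputation value
        imputed = [mu if r[col] is None else r[col] for r in rows]
        sd = pstdev(imputed) or 1.0                       # guard against constant columns
        stats["num"][col] = (mu, sd)
    for col in categorical_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        mode = Counter(observed).most_common(1)[0][0]     # most frequent category
        stats["cat"][col] = (mode, sorted(set(observed)))
    return stats

def transform(rows, stats):
    """Apply the stored statistics: impute, standardize, one-hot encode."""
    out = []
    for r in rows:
        vec = []
        for col, (mu, sd) in stats["num"].items():
            value = mu if r[col] is None else r[col]
            vec.append((value - mu) / sd)
        for col, (mode, levels) in stats["cat"].items():
            value = mode if r[col] is None else r[col]
            vec.extend(1.0 if value == level else 0.0 for level in levels)
        out.append(vec)
    return out

# Toy training records (illustrative values, not trial data).
train = [
    {"age": 60.0, "sex": "M"},
    {"age": 70.0, "sex": "F"},
    {"age": None, "sex": "M"},
    {"age": 65.0, "sex": None},
]
stats = fit_preprocessor(train, numeric_cols=["age"], categorical_cols=["sex"])

# A new record with everything missing falls back to the training mean and mode.
new_rows = [{"age": None, "sex": None}]
print(transform(new_rows, stats))
```

Because `transform` never recomputes statistics, the same imputation values and scaling constants reach the test and validation cohorts, which is the consistency the paragraph above describes.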
The training process was executed through a 5-fold cross-validation (CV) strategy. Throughout each iteration of CV, rigorous hyperparameter optimization was performed using evolutionary algorithms. These algorithms explore the hyperparameter space based on heuristic scores, seeking optimal configurations that minimize bias and variance in the model. This optimization process helps fine-tune the models to the dataset's characteristics, enhancing their adaptability and performance across different subsets.
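
As a rough illustration of how cross-validation and an evolutionary search fit together, the sketch below pairs a 5-fold index split with a simple (1+λ) evolution strategy. Everything here is hypothetical: the paper does not specify which evolutionary algorithm or hyperparameters were used, `log_alpha` is an invented hyperparameter, and `fold_score` is a placeholder for fitting a survival pipeline on the training fold and scoring it on the held-out fold.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train_idx, val_idx) pairs covering every sample exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]

def fold_score(params, train_idx, val_idx):
    # Hypothetical placeholder: a real workflow would fit the model on train_idx
    # and return a validation metric (for example concordance) on val_idx.
    return -(params["log_alpha"] + 2.0) ** 2

SPLITS = k_fold_indices(100, k=5)

def cv_score(params):
    """Average the per-fold scores of one hyperparameter configuration."""
    return sum(fold_score(params, tr, va) for tr, va in SPLITS) / len(SPLITS)

def evolve(score_fn, init, n_generations=20, pop_size=8, sigma=0.3, seed=0):
    """(1+lambda) evolution strategy: mutate the best candidate, keep improvements."""
    rng = random.Random(seed)
    best, best_score = init, score_fn(init)
    history = [best_score]
    for _ in range(n_generations):
        children = [
            {name: value + rng.gauss(0, sigma) for name, value in best.items()}
            for _ in range(pop_size)
        ]
        for child in children:
            score = score_fn(child)
            if score > best_score:          # elitism: never accept a worse candidate
                best, best_score = child, score
        history.append(best_score)
    return best, best_score, history

best, best_score, history = evolve(cv_score, {"log_alpha": 0.0})
print("best configuration:", best, "CV score:", best_score)
```

Since the best candidate is only replaced by a strictly better one, the recorded CV score never decreases from one generation to the next.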
2.4. Feature selection
Feature selection (FS) is a crucial technique for improving model performance, reducing overfitting, and enhancing interpretability. In this work we have executed various feature selection techniques, including Boruta, Lasso, and XGB. The main idea is to remove those features that do not contribute much information to the models, thereby achieving higher performance as a result and making the models easier to implement in clinical practice.
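
To show how an L1 penalty removes uninformative features, the sketch below runs Lasso by cyclic coordinate descent on a toy linear-regression problem. This is a simplified stand-in: the paper does not state how Lasso was applied to the survival outcome, and the data, the penalty `alpha` and the feature names are invented for illustration.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Shrink z toward zero by alpha, returning exactly 0 when |z| <= alpha."""
    return np.sign(z) * max(abs(z) - alpha, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/2n)||y - Xb||^2 + alpha * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            residual = y - X @ beta + X[:, j] * beta[j]   # residual without feature j
            rho = X[:, j] @ residual / n
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
    return beta

# Toy design: the second column is orthogonal to both the first column and y.
names = ["informative", "uninformative"]                  # hypothetical feature names
x1 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
x2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
X = np.column_stack([x1, x2])
y = 3.0 * x1

beta = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [name for name, b in zip(names, beta) if b != 0.0]
print(dict(zip(names, beta)), "->", selected)
```

The informative coefficient is shrunk from 3.0 to 2.5 by the penalty, while the uninformative one is set exactly to zero and therefore dropped from the selected set.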
3. Results
Our objective in this work is to match or exceed the performance of the LCRAT reference model while reducing the number of variables required for prediction. The trained models focus on high-risk populations and can estimate risk at different time horizons. We present the results obtained for the 3-year forecast, which helps to prioritize those patients most likely to develop lung cancer in the near future, ensuring timely detection and potential early treatment. The FS strategy described in Section 2.4 indicated the removal of the variables race, education, number of packs smoked per year, and history of emphysema.
Figure 1. (a) Percentage of positive cases detected among population and (b) SVM SHAP values
To develop our LC screening model, we trained a stacking ensemble algorithm using prioritized risk factors identified through the aforementioned feature selection strategy. We compared its performance against LCRAT for 3-year predictions using the same datasets (train: PLCO control, test: PLCO radiography, and validation: NLST). Our model consistently outperformed LCRAT in ROC-AUC scores: 0.789 vs. 0.781 (Train), 0.799 vs. 0.782 (Test), and 0.698 vs. 0.697 (Validation). Additionally, we evaluated both models by analyzing the percentage of positive cases within the at-risk population in the NLST validation cohort (Figure 1a). Although initially similar, our model showed slightly higher detection rates across different risk strata: 0.339 vs. 0.338 (first 15%), 0.766 vs. 0.754 (50%), and 0.92 vs. 0.90 (75%). Overall, these results underscore our model's superior predictive performance in various evaluation metrics.
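
The two kinds of metric reported here can be computed as in the following sketch. It rests on assumptions: the outcome is treated as a binary label (LC within the horizon or not), the detection metric is read as the share of all positives found among the highest-risk fraction of the population, and the labels and scores are invented toy values rather than study data.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)  # ties count half
    return wins / (len(pos) * len(neg))

def fraction_detected(y_true, scores, top_fraction):
    """Share of all positives found among the top_fraction highest-risk individuals."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_top = int(round(top_fraction * len(ranked)))
    found = sum(y for _, y in ranked[:n_top])
    return found / sum(y_true)

# Toy labels (1 = developed LC) and predicted risk scores.
y_true = [1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1]

print("AUC:", roc_auc(y_true, scores))
print("detected in top 25%:", fraction_detected(y_true, scores, 0.25))
print("detected in top 50%:", fraction_detected(y_true, scores, 0.50))
```

In this toy example 13 of the 15 positive-negative pairs are ordered correctly, and screening the top half of the ranked population would capture all three positives.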
Finally, we applied a feature importance algorithm to one of the base models to evaluate the impact of input variables on predictions. Our analysis revealed that variables such as the number of years smoking, age, number of cigarettes smoked, and number of first-degree relatives with cancer increase the risk, whereas years of smoking cessation decrease it. Sex showed minimal impact, with values centered around zero indicating low influence (Figure 1b).
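
For intuition on how such attributions are read, the sketch below computes SHAP values for a linear risk score, where (assuming independent features) the contribution of feature j is its coefficient times the deviation of the feature from its background mean. This is only an illustration: the paper does not state that the explained SVM is linear, and the coefficients, means and patient values are invented, not fitted values from the study.

```python
def linear_shap(coefs, x, background_means):
    """SHAP values of a linear score: coefficient times deviation from the mean."""
    return [b * (xi - m) for b, xi, m in zip(coefs, x, background_means)]

def linear_score(coefs, x):
    return sum(b * xi for b, xi in zip(coefs, x))

features = ["cig_years", "cig_stop", "sex"]
coefs = [0.05, -0.03, 0.0]       # illustrative coefficients, not from the paper
means = [30.0, 10.0, 0.5]        # illustrative background means
patient = [40.0, 5.0, 1.0]       # hypothetical individual

phi = linear_shap(coefs, patient, means)
for name, value in zip(features, phi):
    print(f"{name}: {value:+.3f}")

# Additivity: contributions sum to the score difference from the average input.
gap = linear_score(coefs, patient) - linear_score(coefs, means)
print("sum of contributions:", round(sum(phi), 6), "score gap:", round(gap, 6))
```

A positive value pushes the individual's risk score above the cohort average, a negative value pushes it below, and a feature whose values cluster around zero, as reported for sex, has little influence on the prediction.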
4. Discussion
The burden of LC on healthcare systems and individuals is undeniable. Early screening protocols, including contributions from models like the LCRAT [2], have eased this burden to some extent. Our work represents a step forward in optimizing such models. We propose that by reducing the number of features and integrating modern ensemble stacking techniques, we can enhance their performance and extend their applicability and utility in clinical settings.
The rationale for excluding certain risk factors, indicated by the FS technique, may be explained by confounding or spurious correlations. In the case of education, there is a recognized correlation between lower education levels and higher LC incidence. However, the true causal factor is likely lower socioeconomic status, which is often associated with higher pollution exposure and poorer diets, both known risk factors for LC [6]. Including education level in the risk model might not effectively capture these underlying environmental factors and could confound the results. Therefore, excluding education may improve the model's accuracy by avoiding misleading correlations.
Our proposed stacking ensemble architecture demonstrates efficacy comparable to or surpassing that of the LCRAT, underscoring the capability of ensemble approaches to effectively understand relationships between variables and optimize predictive performance. While the increase in predicting positive cases may appear marginal, even a modest improvement, such as 1%, can yield significant clinical benefits and economic savings by reducing the need for additional costly tests.
5. Conclusions
We have developed a stacked survival ensemble LC screening model that improves upon the widely used LCRAT model in two key aspects. Firstly, our model enhances the detection of positive cases, leading to earlier identification and enabling prompt intervention and treatment. Early detection is crucial because lung cancer has a better prognosis when caught early, allowing for prompt treatment interventions. These treatments are more effective in the early stages, which can significantly reduce mortality rates. Secondly, the model streamlines patient data collection by minimizing required variables, addressing potential uncertainties in patient reporting.
Acknowledgements
This work has been funded by the European Union's Horizon Europe Research and Innovation Programme under Grant Agreement no 101096473. The authors also acknowledge the National Cancer Institute for granting access to the data from the National Lung Screening Trial (NLST) and the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO).
References
[1] Chang JT et al. Cigarette smoking reduction and health risks: A systematic review and meta-analysis. Nicotine Tob Res. 2021;23(4):635–42. doi: 10.1093/ntr/ntaa156
[2] Katki HA et al. Development and validation of risk models to select ever-smokers for CT lung cancer screening. JAMA. 2016;315(21):2300. doi: 10.1001/jama.2016.6255
[3] National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med. 2011;365(5):395–409. doi: 10.1056/NEJMoa1102873
[4] Zhu CS et al. The prostate, lung, colorectal, and ovarian cancer screening trial and its associated research resource. J Natl Cancer Inst. 2013;105(22):1684–93. doi: 10.1093/jnci/djt281
[5] Stepanek L et al. A machine-learning approach to survival time-event predicting: Initial analyses using stomach cancer data. 2020 EHB. IEEE; 2020. p. 1–4.
[6] Pampel FC, Krueger PM, Denney JT. Socioeconomic disparities in health behaviors. Annu Rev Sociol. 2010;36:349–70. doi: 10.1146/annurev.soc.012809.102529