Dropout Prediction Using Advanced Machine Learning
Models in a School and Community-Based Intervention to
Promote Healthy Lifestyle and Prevent Type 2 Diabetes:
Feel4Diabetes
Christos Tzias as1, Andreas Triantafyllidis1, Anastasios Alexiadis1, Konstantinos Votis1,
Greet Cardon2, Jaana Lindström3, Violeta Iotova4, Imre Rurik5, Luis A.
Moreno6,7,8,9, Eva Karaglani10, Christina Mavrogianni10, Yannis Manios10,11
1Information Technologies Institute, Centre for Research and Technology Hellas,
Thessaloniki, Greece
2Department of Movement and Sports Sciences, Ghent University, Ghent, Belgium
3Population Health Unit, Finnish Institute for Health and Welfare, Helsinki, Finland.
4Department of Social Medicine and Health Care Organization, Medical University
of Varna, Varna, Bulgaria
5Semmelweis University, Department of Family Medicine, Budapest
6Growth, Exercise, Nutrition and Development (GENUD) Research Group, University
of Zaragoza, Zaragoza, Spain
7Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y
Nutrición (CIBERObn), Instituto de Salud Carlos III, Madrid, Spain
8Instituto Agroalimentario de Aragón (IA2), Zaragoza, Spain
9Instituto de Investigación Sanitaria de Aragón (IIS Aragón), Zaragoza, Spain
10Department of Nutrition and Dietetics, School of Health Science & Education,
Harokopio University, Athens, Greece
11Institute of Agri-food and Life Sciences, Hellenic Mediterranean University
Research Centre, Heraklion, Greece
Abstract. Participant dropout from interventional studies targeting healthy
lifestyles can significantly undermine the validity of study outcomes. Accurate dropout
prediction can help mitigate this issue by enabling proactive participant engagement
strategies. This study aims to develop a robust Machine Learning (ML) model to predict
dropout from a school and community-based interventional study to promote a healthy
lifestyle and prevent type 2 diabetes: The Feel4Diabetes study. Using data from 3274
participants across 790 variables, we aim to identify key dropout determinants and
enhance ML predictive accuracy. We evaluated three individual machine learning
models, Random Forest, XGBoost, and Support Vector Machine (SVM), based on
performance metrics including accuracy, precision, recall, and F1-score. Among these,
the Random Forest model emerged as the most effective, achieving an accuracy of 0.80
on the test set, with balanced precision and recall scores. Our study highlights the
effectiveness of machine learning methods in predicting dropout in interventional
studies promoting healthy lifestyles and preventing type 2 diabetes. Future research will
concentrate on refining these models further and exploring additional data sources to
enhance their generalizability.
Keywords: Dropout Prediction, Machine Learning, Feature Selection, Type 2
Diabetes Prevention
1 Introduction
1.1 Background
Chronic illnesses such as diabetes, heart disease, and cancer are primary causes of
premature mortality, accounting for over two-thirds of all deaths and a substantial
portion of healthcare expenditures. In Europe, these conditions consume an estimated
75% of healthcare budgets [1]. According to the World Health Organization (WHO),
the global proportion of deaths due to chronic illnesses is projected to rise from 57% to
65% by 2030 [1], underscoring the urgent need for effective interventions.
However, high dropout rates pose a significant challenge to the efficacy of health
interventions. It is crucial to highlight and analyze usage metrics and determinants of
attrition to understand how these applications perform among users who continue to
engage with them [2]. Adherence is a major issue in health promotion programs, with
many participants discontinuing use before completing the intervention. This
phenomenon, known as non-usage attrition or dropout, can drastically affect the
program's effectiveness [3]. High dropout rates not only prevent participants from
receiving the full benefits of the intervention but also lead to inefficient use of resources,
increased costs and introduction of bias in the results [4].
Identifying predictors of dropout has been the focus of several studies, though no
consistent set of predictors has been established [5,6,7,8]. High dropout rates from
health promotion interventions highlight the need for effective predictive models to
identify at-risk participants and enable timely interventions to improve retention [8].
Predictive modeling techniques like survival analysis, logistic regression, and
random forests have been applied to predict dropout in educational settings, which face
similar issues [9,10]. However, the application of these techniques in health
interventional study settings is limited, suggesting a gap in integrating predictive
analytics into healthcare interventions.
This study aims to develop an advanced machine learning (ML) pipeline to predict
dropout from the Feel4Diabetes study, a school and community-based intervention
designed to promote healthy lifestyles and prevent type 2 diabetes. By analyzing data
from 3274 participants across 790 variables, we aim to identify key determinants of
dropout and enhance the predictive accuracy of ML models. Our approach includes
rigorous data preprocessing, feature selection using Sequential Backward Floating
Selection (SBFS), addressing class imbalance with the Synthetic Minority
Oversampling Technique (SMOTE), and hyperparameter tuning via GridSearchCV for
Random Forest, Support Vector Machine and XGBoost models (Figure 1). We then
select the best performing model to achieve superior predictive performance.
1.2 Materials and Methods
The Feel4Diabetes dataset is a comprehensive repository containing data from 3,274
participants across 790 variables. The dataset comprises extensive participant
information, including demographic, clinical, and behavioral attributes, which are
essential for developing accurate predictive models. Through rigorous preprocessing,
feature selection, and model tuning processes, we aim to identify key determinants of
participant dropout and enhance the predictive accuracy of our models.
Figure 1: Block diagram of the dropout prediction pipeline
1.3. Preprocessing
Effective machine learning models are built on well-prepared datasets. In this section,
we focus on the preprocessing of the feel4diabetes.csv dataset, which is crucial for
accurately predicting participant dropout in diabetes follow-up studies. In the proposed
framework, the preprocessing steps included removing rows with more than 12%
missing values. We set the threshold at 12% because we wanted to keep some features
with slightly more than 10% missing values, such as encouragement to walk/bicycle and
fruit/vegetable intake, as they showed correlation with dropout. This ensures that our
machine learning models are based on the most reliable information.
In preparing the dataset for predicting participant dropout, we first excluded columns
irrelevant to our analysis, such as 'center', 'schoolcode', 'classcode', and various
demographic details, so as to focus on the most influential variables.
Simultaneously, we assessed the dataset for missing data, sorting the columns by
missing data percentages in ascending order. This step was crucial for identifying and
prioritizing features for our analysis. We then embarked on an iterative process to refine
our feature set. Our approach was to maintain a balance between data completeness and
feature richness, selecting features with minimal missing values and conditionally
adding those with less than 12% missing data. This strategy yielded a dataset which
consisted of 15 features from the initial 790. We also constructed a correlation heatmap
of the features to display the correlation between multiple variables (Figure 2) and
Violin Plots for comparing the probability distributions across different predictive
features (Figure 3).
Figure 2: Correlation heatmap of selected features
Figure 3: Violin Plots of predictive features
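The column screening described in this section (exclude irrelevant columns, sort the rest by missing share, keep those under the 12% cap) can be illustrated with a minimal sketch. The toy table, the column names other than 'center', and the helper names are assumptions made for illustration and are not taken from the Feel4Diabetes data.

```python
def missing_fraction(values):
    """Share of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def select_columns(table, excluded, max_missing=0.12):
    """Drop excluded columns, sort the rest by missing share (ascending),
    and keep only those whose missing share is below max_missing."""
    candidates = {
        name: missing_fraction(values)
        for name, values in table.items()
        if name not in excluded
    }
    ranked = sorted(candidates.items(), key=lambda item: item[1])
    return [name for name, fraction in ranked if fraction < max_missing]


# Hypothetical toy table with 10 participants (column -> list of values).
table = {
    "center": [1] * 10,                                        # excluded by name
    "age": [40, 35, 52, 47, 38, 44, 50, 36, 41, 39],           # 0% missing
    "bmi": [27.1, None, 31.0, 24.5, 29.9, 26.3, 33.2, 22.8, 28.4, 25.0],  # 10%
    "rare_item": [None, None, None, 1, 0, 1, None, 0, 1, 1],   # 40% missing
}

kept = select_columns(table, excluded={"center"})
print(kept)  # ['age', 'bmi']
```

Here a column with 10% missing values survives the 12% cap, which mirrors the reason the text gives for choosing 12% rather than 10%.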
1.4. Sequential Backward Floating Selection
After the preprocessing stage, we employed the Sequential Backward Floating
Selection (SBFS) algorithm to extract meaningful features for training the model.
Feature selection is crucial for enhancing the robustness and efficiency of machine
learning models, particularly in datasets with a large number of features like
Feel4Diabetes. Among various heuristic approaches, we chose the SBFS algorithm for
its effectiveness in identifying the most predictive features.
The SBFS algorithm begins with the complete set of features and iteratively removes
the feature whose exclusion maximizes classifier performance. Following each
removal, it conditionally reintroduces features that could enhance the classifier's
performance. This alternating process of exclusion and conditional inclusion continues
until the desired number of features is achieved, resulting in a subset of highly predictive
features [11]. In our dataset, the SBFS algorithm reduced the dimension of the feature
space to 10 features (from 15 that had been selected in the preprocessing step). The
features that were deemed the most influential for predicting the dropout were: Smoking
status, Body-mass index (BMI), Waist circumference, FindRisk score, Physical
Activity, Fruit/vegetable intake, Blood Glucose, Diabetes History, Age and
Encouragement to walk/Bicycle.
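The exclusion and conditional-inclusion loop described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in the study: the score function is a hypothetical stand-in (in practice it would be a cross-validated classifier score), and the feature weights are invented. Libraries such as mlxtend expose a SequentialFeatureSelector with backward and floating options that serves the same purpose.

```python
def sbfs(features, score, k_target):
    """Sequential Backward Floating Selection over a user-supplied score.

    features: iterable of feature names
    score:    function mapping a frozenset of features to a number
    k_target: number of features to keep
    """
    all_features = frozenset(features)
    current = all_features
    # Best (score, subset) found so far for each subset size.
    best = {len(current): (score(current), current)}

    while len(current) > k_target:
        # Exclusion: remove the feature whose removal leaves the best subset.
        current = max((current - {f} for f in current), key=score)
        s = score(current)
        if len(current) not in best or s > best[len(current)][0]:
            best[len(current)] = (s, current)

        # Conditional inclusion: re-add a removed feature only if the result
        # beats the best subset of that larger size seen so far.
        while True:
            removed = all_features - current
            if not removed:
                break
            candidate = max((current | {f} for f in removed), key=score)
            s = score(candidate)
            if s > best[len(candidate)][0]:
                current = candidate
                best[len(candidate)] = (s, candidate)
            else:
                break

    return best[k_target][1]


# Hypothetical additive score: each feature contributes a fixed weight.
weights = {"bmi": 50, "age": 30, "noise_a": 1, "noise_b": 2}


def toy_score(subset):
    return sum(weights[f] for f in subset)


selected = sbfs(weights, toy_score, k_target=2)
print(sorted(selected))  # ['age', 'bmi']
```

With an additive score the floating step never fires. It matters when features interact, so that a feature removed early becomes useful again after others are dropped.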
2. SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is an over-sampling method
that generates synthetic examples to balance class distribution. The process involves
creating synthetic samples along the line segments connecting each minority class
sample to its k-nearest minority neighbors. For instance, with five nearest neighbors
and a 200% over-sampling requirement, two neighbors are chosen, and synthetic
samples are generated between each pair. This is done by adding a random fraction of
the difference between a sample and its neighbor to the sample. This method effectively
generalizes the decision region of the minority class, improving model performance.
We applied SMOTE to the training set only, after splitting, so that the model is
evaluated on original data, which supports robustness (Figure 4).
Figure 4: Class distribution before and after SMOTE
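The interpolation step at the heart of SMOTE can be shown with a short standard-library sketch. It follows the description above (k nearest minority neighbors, two of them chosen per sample for 200% over-sampling, one synthetic point per chosen neighbor). The function name, parameter names, and toy points are assumptions for illustration, and this is not the implementation used in the study.

```python
import math
import random


def smote_sketch(minority, n_per_sample=2, k=5, seed=0):
    """Generate synthetic minority points by interpolating towards neighbours.

    For each minority sample, pick n_per_sample of its k nearest minority
    neighbours and create one point on the segment to each of them.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for i, x in enumerate(minority):
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for nb in rng.sample(neighbours, min(n_per_sample, k)):
            gap = rng.random()  # random fraction in [0, 1)
            # x + gap * (neighbour - x), coordinate by coordinate
            synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic


# Hypothetical minority-class points in a 2-D feature space.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
synthetic = smote_sketch(minority)
print(len(synthetic))  # 12, i.e. 200% of the 6 original minority samples
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class.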
1.2 Hyperparameter tuning – K-fold cross validation
Following the balancing of the dataset using SMOTE, we performed hyperparameter
tuning through GridSearchCV. This approach exhaustively searches a predefined
hyperparameter grid, aiming to find the optimal parameters. We implemented 5-fold
cross-validation to ensure robustness and reliability of the model evaluation. The data
was split into training and testing sets with an 80-20 ratio, and performance on the
held-out test set demonstrated that the chosen hyperparameters were effective in
enhancing the model's performance.
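The logic that GridSearchCV automates (score every grid value with k-fold cross-validation on the training portion, pick the best, then check it once on the held-out 20%) can be sketched with the standard library alone. The sketch swaps in a tiny k-nearest-neighbour classifier and synthetic two-cluster data purely for illustration. The study tuned Random Forest, SVM and XGBoost, and none of the names or values below come from it.

```python
import math
import random
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def grid_search_cv(data, grid, n_folds=5):
    """Return the grid value with the best mean accuracy across n_folds folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    mean_scores = {}
    for k in grid:
        scores = []
        for i, validation in enumerate(folds):
            train = [item for j, fold in enumerate(folds) if j != i for item in fold]
            scores.append(accuracy(train, validation, k))
        mean_scores[k] = sum(scores) / n_folds
    return max(mean_scores, key=mean_scores.get), mean_scores


# Synthetic data: two Gaussian clusters labelled 0 and 1 (illustrative only).
rng = random.Random(0)
data = [((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), label)
        for label, c in ((0, 0.0), (1, 3.0)) for _ in range(50)]
rng.shuffle(data)

split = int(0.8 * len(data))                      # 80-20 train/test split
train_set, test_set = data[:split], data[split:]
best_k, cv_scores = grid_search_cv(train_set, grid=[1, 3, 5])
test_accuracy = accuracy(train_set, test_set, best_k)
print(best_k, cv_scores, test_accuracy)
```

The test set is touched only once, after the grid value has been chosen, so the reported test accuracy is not influenced by the tuning.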
3. ML Models training
We explored three machine learning models, Random Forest, Support Vector
Machines (SVM) and XGBoost, to predict dropout from the Feel4Diabetes
interventional study. Each model underwent hyperparameter tuning using the
previously described GridSearchCV method combined with 5-fold cross-validation.
Definition 1.1. A random forest is a classifier consisting of a collection of tree-structured
classifiers {h(x,Θk ), k = 1, . . .} where the {Θk} are independent identically
distributed random vectors and each tree casts a unit vote for the most popular class at
input x.[12]
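The unit-vote rule in Definition 1.1 amounts to a majority vote over the trees. A minimal sketch follows, in which each "tree" is a hypothetical one-line rule on generic features rather than a trained decision tree:

```python
from collections import Counter


def forest_predict(trees, x):
    """Each tree casts one vote; the forest returns the most common class."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical "trees" acting on a generic two-feature input.
trees = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.2),
]

prediction = forest_predict(trees, (0.9, 0.4))
print(prediction)  # 1, because two of the three trees vote for class 1
```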
Definition 1.2. Support vector machines (SVMs) are particular linear classifiers which
are based on the margin maximization principle. They perform structural risk
minimization, which improves the complexity of the classifier with the aim of achieving
excellent generalization performance. The SVM accomplishes the classification task by
constructing, in a higher dimensional space, the hyperplane that optimally separates the
data into two categories.[13]
Definition 1.3. XGBoost stands for Extreme Gradient Boosting, which applies a
Gradient Boosting technique based on decision trees. It constructs short, basic decision
trees iteratively. Each tree is termed as a “weak learner” because of its high bias.
XGBoost begins by building the first basic tree that has a poor performance. Then it
builds another tree, trained to predict what the first tree, which is a weak learner, cannot
do. The technique sequentially produces weaker learners, each correcting the previous
tree before the stopping condition is met, such as the number of trees (estimators) to be
created [14].
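The idea in Definition 1.3 of each tree correcting what the previous trees could not predict can be illustrated with plain gradient boosting on squared error, where every new one-split tree (stump) is fitted to the current residuals. This is a simplified sketch and not XGBoost itself, which adds regularization and second-order gradient information. The data, learning rate, and function names are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature (least squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:            # thresholds leaving both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Sequentially fit stumps to the residuals left by the earlier stumps."""
    trees = []
    preds = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(learning_rate * tree(x) for tree in trees)


# Toy 1-D data with a noisy step from 0 to 1 (illustrative only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]


def mse(model):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


one_tree = boost(xs, ys, n_trees=1)
many_trees = boost(xs, ys, n_trees=20)
print(mse(one_tree), mse(many_trees))  # training error shrinks as trees are added
```

The number of trees acts as the stopping condition mentioned in the definition. In practice it is tuned, since adding trees indefinitely lowers training error but can overfit.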
To enhance the performance of our machine learning models, we focused on selecting
the most effective classifier among the three. In this regard, we concentrated on tuning
and evaluating individual models to identify the one with the best predictive
performance.
After tuning each model, namely Random Forest, XGBoost, and Support Vector Machine
(SVM), we thoroughly assessed their performance metrics, including accuracy,
precision, recall, and F1-score, on the test set. This rigorous evaluation process was
aimed at determining which model provided the highest accuracy and most reliable
predictions.
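For reference, the four metrics named above can be computed directly from the counts of true and false positives and negatives. The labels in this sketch are invented toy values, not results from the study, and the positive class is assumed to be coded as 1.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (1 = positive class); here TP=3, FN=1, FP=2, TN=4.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision and recall are reported alongside accuracy because, with imbalanced classes, a model can reach high accuracy while missing most members of the minority class.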
Among the evaluated models, the Random Forest model emerged as the best performer,
demonstrating superior accuracy, precision, recall, and F1-score compared to XGBoost
and SVM. This approach of selecting the highest-performing individual model ensured
robust and accurate predictions, making it the preferred choice for our study.
The results validated the effectiveness of this strategy, confirming that carefully
choosing and optimizing the best-performing single model can achieve high predictive
performance. These findings are critical for enhancing dropout prediction in
interventional studies, such as the Feel4Diabetes study, aimed at promoting healthy
lifestyles and preventing type 2 diabetes. Future efforts will focus on further refining
the Random Forest model and incorporating additional data sources to improve its
generalizability, providing valuable insights for researchers and practitioners working
to enhance participant retention in similar studies.
5. Results
The integration of SMOTE for data balancing, k-fold cross-validation with
hyperparameter tuning, and Sequential Backward Floating Selection (SBFS), followed
by a selection of the best individual model, resulted in improved accuracy for predicting
dropout from the Feel4Diabetes study in comparison to each individual model. The
accuracies were 0.80 for the Random Forest, 0.77 for XGBoost, and 0.78 for the SVM.
This underscores the Random Forest’s accuracy and overall performance compared to
the SVM and XGBoost. To highlight this, we constructed the ROC-AUC curve of the
Random Forest (Figure 6) and the performance metrics of each respective model
(Figure 5).
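As a reminder of what the area under the ROC curve measures, it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it from that definition on invented toy scores. It does not reproduce Figure 6.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities (illustrative only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```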